mozilla :: #datapipeline

15 Mar 2017
01:24frankrvitillo|workweek: native table partitioning will be available in postgres 10, would have been useful for the aggregates data :) - https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=f0e44751d7175fa3394da2c8f85e3ceb3cdbfe63
07:47rvitillo|workweekfrank: cool
08:48rvitillo|workweekwhd: EMR 4 is dead
08:48whd!dance
11:46gfritzschemlopatka: mdoglio: http://georgf.github.io/fx-data-explorer/
11:51gfritzschemlopatka: mdoglio: dexter: https://drive.google.com/file/d/0BygBwOoX9dclRlVXcms0UFpjRlk/view?usp=sharing
12:10gfritzscheDexter: mdoglio: https://docs.google.com/document/d/1fhcd3z6YZj_nBe5KY0sfw1PqZyjS0-yWUgRBUV3Dq0c/edit#heading=h.edwdqq8ovo7i
12:34mreidTIL bugzilla emails you about overdue review requests on closed bugs. every day.
14:47frankrobotblake: where is the core ping direct-to-parquet in the Athena schema listing?
14:55gregglindhow is the workweek going?
14:56gregglindgfritzsche: that histogram dashboard is great
14:57gregglindgfritzsche, RaFromBRC does that answer the viz questions we have for quantum experiments?
15:11gfritzscheOk, this looks helpful: https://twitter.com/mollyclare/status/841798594965458948
15:19frankgfritzsche: hah! I do sometimes use words to mean what they mean
15:19frankbut not all the time :)
15:31mreidgfritzsche: you comin' to the events mtg?
17:07RaFromBRCgregglind: are you talking about this page? http://georgf.github.io/fx-data-explorer/stats.html
17:10gregglindyes
17:10gregglindsome of that has the right smell as well
17:10gregglindI think people are pecking around edges of it.
17:11RaFromBRCindeed... i like the dropdown on each graph, handles the faceting problem we were talking about well
17:13gregglindIt's certainly more well-thought out than anything I had
17:13gregglindI have been doing simulations in R of mixture models to think about how to talk about differences
17:16robotblakefrank: it's not in Athena because the table couldn't be created
17:16robotblakeDue to the binary fields
17:16frankoooh, right
17:16frankdang
17:16franksorry, totally forgot
17:16robotblake:)
17:17* frank 's brain is all over the place
18:51robotblakefrank: Does https://bugzilla.mozilla.org/show_bug.cgi?id=1347609 cover the UTF8 thing?
18:51firebotBug 1347609 NEW, nobody@mozilla.org Update core ping parquet schema
18:52frankrobotblake: no, it doesn't
18:52frankit's a new field someone has added
18:53frankthe conversion to UTF8 will need to be done within hindsight, since it doesn't allow conversions to be defined in the schema
18:53frankor I guess we'd have to change the core ping itself, which seems a bit much :)
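A minimal sketch of the idea frank describes above: coercing binary fields to UTF-8 text in the processing step, because the schema cannot express the conversion. This is Python for illustration only and is not hindsight's API. The helper names and the field name in the usage note are hypothetical, and replacing invalid bytes rather than rejecting the record is an assumption.

```python
from typing import Any, Dict, Iterable


def to_utf8_text(value: Any) -> Any:
    """Return bytes decoded as UTF-8 text; pass other values through unchanged.

    Invalid byte sequences are replaced with U+FFFD instead of raising,
    which is an assumption about the desired behaviour.
    """
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


def sanitize_record(record: Dict[str, Any], text_fields: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of record with the named fields coerced to text."""
    cleaned = dict(record)
    for name in text_fields:
        if name in cleaned:
            cleaned[name] = to_utf8_text(cleaned[name])
    return cleaned
```

For example, `sanitize_record({"displayName": b"caf\xc3\xa9", "seq": 1}, ["displayName"])` would leave `seq` untouched and turn the hypothetical `displayName` field into a text string.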
19:14rraybornrvitillo|workweek: sorry to interrupt, but what is a reasonable upper bound of "small sample" WRT HBase? I'm running into issues that sunahsuh helped me trace to HBase. I plan on calling HBase iteratively on segments for this one time analysis.
19:25mreidrrayborn: cargo-cult says ~500 ids per request is reasonable
19:25rraybornThanks mreid
19:25mreidI can confirm that 100 at a time works great
19:26mreidfor my particular use case :)
19:28mlopatkarrayborn, mreid: we simulated exactly this problem to see what was a reasonable bound on the number of requests we could handle for a service using HBase. In the context of our service it basically topped out at 20 requests per second
19:31mreidmlopatka: interesting
19:31mreidis that a limitation based on the size of the hbase cluster?
19:31rraybornmlopatka: What does 20 requests per second translate to for max number of IDs with a 1 month lookback? Sorry, I'm limited in my knowledge here
19:32rraybornAlso, will I annoy anyone for running HBase iteratively for batches of IDs?
19:32mlopatkarrayborn: actually I have no idea, to be clear we were evaluating it as a candidate for a new service
19:33rrayborn(We're trying to get ping histories for a lot of Heartbeat respondents)
19:33mreidrrayborn: what's the approx total # of clientids you're looking at?
19:35rraybornWe have ~6,000 that completed a survey, but we also have ~40,000 non-completions. I can pare down the number of non-completions because they're less important (but we're interested in including some of them to potentially evaluate response bias).
19:37mreidI would be interested to see how long it takes to filter one day of main_summary for 50k ids
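The batching approach discussed above (roughly 100 to 500 IDs per request, called iteratively) can be sketched as follows. This is a rough illustration under stated assumptions, not the team's actual HBase client: `fetch_batch` is a hypothetical callable standing in for whatever performs one lookup, and the default batch size of 100 and the optional pause between requests are placeholders to tune.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List


def batched(ids: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for client_id in ids:
        batch.append(client_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def fetch_histories(
    client_ids: Iterable[str],
    fetch_batch: Callable[[List[str]], Dict[str, list]],  # hypothetical lookup
    batch_size: int = 100,
    pause_seconds: float = 0.0,
) -> Dict[str, list]:
    """Call fetch_batch once per batch and merge the per-ID results."""
    results: Dict[str, list] = {}
    for batch in batched(client_ids, batch_size):
        results.update(fetch_batch(batch))
        if pause_seconds > 0:
            time.sleep(pause_seconds)  # crude throttle between requests
    return results
```

With about 46,000 IDs and a batch size of 100, this would issue 460 requests; raising `batch_size` toward 500 cuts that to 92.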