mozilla :: #datapipeline

8 Sep 2017
00:00jgaunt^ amiyaguchi
00:03amiyaguchijgaunt: is there a reason for aggregating usage per client?
00:04amiyaguchiare you trying to identify clients that crash the most?
00:04jgauntI intend to determine each client's crash/usage rate and submit it to a t-test by experiment branch
00:04jgauntfor inference
00:05jgaunttrying to identify whether a significant difference exists between branches
00:06amiyaguchicould you aggregate across clients to determine if there is a significant difference?
00:07amiyaguchiaggregating over the population is cheaper than aggregating per user
00:08jgauntthat defies the logic of the test
00:08jgauntthe sample needs to be clients/users, not pings
00:09amiyaguchiI see, you need to have information at the client granularity
00:09jgauntI affirm
00:09jgauntand realize how non-trivial that's becoming ;)
00:10jgauntneed to call it a day for now, thx for responding
00:10jgauntping me if you have any ideas
00:11amiyaguchisure thing, I guess I should brush up on experimental statistics
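
The per-client approach jgaunt describes above can be sketched roughly as follows. This is a minimal illustration, not the analysis that was actually run: the record layout, the branch names, and the numbers are all hypothetical, and Welch's unequal-variance t-test (`equal_var=False`) is an assumption, since the log does not say which variant was used. The point is that pings are first collapsed to one crash/usage rate per client, so the sample unit in the test is the client, not the ping.

```python
from collections import defaultdict

from scipy import stats

# Hypothetical ping-level records: (client_id, branch, crashes, usage_hours).
pings = [
    ("a", "control", 1, 10.0),
    ("a", "control", 0, 5.0),
    ("b", "control", 2, 20.0),
    ("c", "control", 0, 8.0),
    ("d", "treatment", 3, 10.0),
    ("e", "treatment", 1, 4.0),
    ("e", "treatment", 1, 6.0),
    ("f", "treatment", 2, 16.0),
]


def per_client_rates(records):
    """Collapse pings to one crash/usage rate per client, grouped by branch."""
    totals = defaultdict(lambda: [0, 0.0])  # (client, branch) -> [crashes, hours]
    for client_id, branch, crashes, hours in records:
        entry = totals[(client_id, branch)]
        entry[0] += crashes
        entry[1] += hours

    rates = defaultdict(list)
    for (client_id, branch), (crashes, hours) in totals.items():
        if hours > 0:  # skip clients with no recorded usage
            rates[branch].append(crashes / hours)
    return dict(rates)


rates = per_client_rates(pings)

# Each observation is a client, so the test compares clients across branches.
result = stats.ttest_ind(rates["control"], rates["treatment"], equal_var=False)
print(result.statistic, result.pvalue)
```

At real data volumes the per-client aggregation would be done in Spark before collecting, which is the expensive step amiyaguchi points out.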
12:25Dextermh, both the HBase jobs are failing :S
14:18franksunahsuh: robhudson put in!
14:22sunahsuh!! that's awesome :D
14:28trinkshould probably add a continuity correction option too
14:33trinkmmm if it was actually returning the probability
14:34franktrink: right, that comes next
14:34frankimo our results should parallel scipy every time, so if they implement it so should we
15:01frankrobhudson: trink implemented p-values here (with continuity correction), you can just parallel what he did:
18:00suhere: is there any way to look up the values of each bucket for our histograms?
18:01sui'm looking at some histogram fields in longitudinal, and it seems that each histogram (for a subsession) is represented as an array
18:01subut there's no info about what the value of each bucket is
18:02sui.e. [0,0,0,1,0,1,0], so 2 observations, one in the 4th bucket and one in the 6th bucket, but no info on what value those buckets represent
18:03mreidsu: you need to consult the histogram definition
18:09sumreid: here?
18:11mreidsu: there is code to compute the bucket bounds inside "histogram_tools" in python_moztelemetry
18:11mreidfrank: is there an easier way I'm not remembering?
18:12suoh i c
18:12susweet, that'll work!
18:12frankmreid: no easier way :(
18:12mreidk thx
18:12* frank put the bucket labels in main_summary for this very reason
18:14suthanks guys :)
18:15gfritzschesu: there is the histogram simulator:
18:18susweet! thanks georg
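
Once the bucket bounds are known, pairing them with the count array from su's example is straightforward. The sketch below assumes the lower bounds have already been obtained from the histogram definition (for instance via the `histogram_tools` code mreid mentions); the bounds used here are made up for illustration, and `label_buckets` is a hypothetical helper, not part of any Mozilla library.

```python
def label_buckets(counts, lower_bounds):
    """Map each non-empty bucket's lower bound to its observation count."""
    if len(counts) != len(lower_bounds):
        raise ValueError("counts and lower_bounds must have the same length")
    return {bound: count for bound, count in zip(lower_bounds, counts) if count}


# The array from the discussion: two observations, in the 4th and 6th buckets.
counts = [0, 0, 0, 1, 0, 1, 0]

# Hypothetical lower bounds; the real ones depend on the histogram definition.
hypothetical_bounds = [0, 1, 2, 4, 8, 16, 32]

labelled = label_buckets(counts, hypothetical_bounds)
print(labelled)  # {4: 1, 16: 1}
```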
18:22amiyaguchitrink: I was looking at the cep dashboard for duplicates and noticed that only duplicates of the `Other` type exist as of Sept 5th. Are we deduplicating main pings and marking the duplicates as Other pings?
18:23mreidamiyaguchi: that is a consequence of the "generic" refactor - details in
18:23firebotBug 1397286 NEW, Move the full telemetry URI parsing into the moz_ingest
18:25trink^ the changes will go out next sprint, it is low priority since it is not actually used
18:27amiyaguchiSo it's caused by the duplicate message not containing normalizedChannel then getting bucketed as Other?
18:28jgauntfrank, sunahsuh, amiyaguchi - what's the trick to making .collect()-ed dataFrame data usable to numpy/scipy? I've tried passing my lists to np.array() w/o success
18:29trinkso appUpdateChannel will be added back to the message and the analysis plugin will be updated to normalize it during classification
18:30amiyaguchijgaunt: I think there's a DataFrame.toPandas() method, you should be able to convert into the datastructure of your choosing after
18:31trinkamiyaguchi: if you are interested in getting some exposure to the ingestion pipeline the bug doesn't have an owner yet :)
18:57jgauntamiyaguchi: thx for the reply and a follow-up from yesterday - decided to run with only the most recent day of data and that seems as valid. After controlling for usage there's no significant difference in crash rates between gecko and stylo
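
One possible reason `np.array()` on collected rows is awkward: `.collect()` returns a list of `Row` objects, which are tuple subclasses, so rows that mix strings and numbers get coerced to a single string dtype. The sketch below uses a `namedtuple` as a stand-in for `pyspark.sql.Row` so that it runs without a Spark session; the column names and values are hypothetical, and this is a guess at the problem, not a diagnosis confirmed in the log.

```python
from collections import namedtuple

import numpy as np

# Stand-in for pyspark.sql.Row (also a tuple subclass); columns are hypothetical.
Row = namedtuple("Row", ["client_id", "branch", "crash_rate"])

# What df.collect() might hand back: a plain Python list of rows.
rows = [
    Row("a", "control", 0.07),
    Row("b", "control", 0.10),
    Row("d", "treatment", 0.30),
    Row("e", "treatment", 0.20),
]

# Passing the rows directly mixes str and float, so NumPy falls back to strings.
mixed = np.array(rows)
print(mixed.dtype.kind)  # 'U' (unicode strings), not usable for arithmetic

# Pulling out one numeric column at a time yields a proper float array.
control = np.array([r.crash_rate for r in rows if r.branch == "control"], dtype=float)
treatment = np.array([r.crash_rate for r in rows if r.branch == "treatment"], dtype=float)
print(control.mean(), treatment.mean())
```

With a real Spark DataFrame, the `toPandas()` route amiyaguchi suggests avoids the manual extraction: the resulting pandas columns can be turned into NumPy arrays with `to_numpy()`.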
20:12sunahsuhin the absence of blake, anyone know why presto would be way behind re: new partitions added to hive?
20:14joygfritzsche: i thought this had scalars?
20:14joyseems it doesnt
20:15sunahsuhjoy: it does -- browser.engagement.max_concurrent_tab_count is in there, for example
20:15joybut then active_ticks isn't but that is found in
20:16sunahsuhthat's not in release yet
20:53franksunahsuh: how far behind
20:55sunahsuhfrank: a couple hours -- i re-ran p2h on the experiments dataset to refresh it with the new experiment they launched yesterday
20:55sunahsuhi see in the logs that the partition made it into hive
20:55franksunahsuh: but presto isn't seeing it?
20:55sunahsuhbut `show partitions FROM experiments` still doesn't see it
20:55sunahsuhvia re:dash
20:56sunahsuhprobably 2-3 hours at this point
20:56franksunahsuh: you are using Presto datasource, right?
20:57frankyeah, hive has the partitions
21:01franksunahsuh: probably make a bug, the default metastore cache on Presto is 2 mins, so it's definitely not that
21:01frankunless we changed it... to hours... in which case we should not have done that :)
9 Sep 2017
No messages