mozilla :: #datapipeline

20 Apr 2017
00:13frankharter: that was it :)
08:54gfritzschehm, who owns addon_aggregates?
08:55gfritzscheit looks like we might want to have columns like "is system addon" and "is web extension" in there
12:53mreidgfritzsche: that's bmiroglio
12:56gfritzscheaaand i see we have a component for that, let me just file something :)
13:32bmirogliogfritzsche: I agree a web extension field would be useful. Is there a particular reason you'd want to see aggregates of system add-ons? Right now the dataset includes only non-system add-ons
13:34gfritzschebmiroglio: this came up through mjaritz
13:34gfritzschei guess there are also potential questions on uptake, rollout etc.
14:57mreidAnyone had an issue using Jupyter "nbviewer" site with bugzilla attachments?
15:01frankmreid: I just did
15:01frankI stuck your nb in a gist so I could view it
15:38frankrobotblake: newest EMR release has presto 0.166 with lambda functions!!!!!! When can we upgrade?
15:47mreidfrank: I wonder if it just needs a "json" content type
15:48frankmreid: I put a bug into nbviewer
15:48frankbecause, well, the 400 page asked :)
15:49mreidfrank++
15:49mreidI added a content type of application/json and it still doesn't work
15:49mreidalso +1 on presto w/ lambda functions
15:49mreidfrank: can you link to the gist in the bug?
15:49frankI am ridiculously excited for lambda functions
15:49frankyeah
15:50mreidme too
15:50mreidthough it's now gonna be even more tempting to do overly complicated sql :)
15:50frankI love myself some overly complicated sql
15:50mreidon that note
15:50mreidwould you mind reviewing that notebook for "does what it's supposed to"?
15:51frankmreid: can do
15:51mreidThanks!
15:51frankI kind of already did, but I'll make it official
15:51mreidya
15:51mreidI added the r?
15:51mreidhaha "something.ipynb" ftw
15:52frankhehe they make you add a name
15:52mreidbtw, if you want to extend the analysis you can use the same parquet dataset and it should be relatively fast
15:53frankooh, yeah, that sounds good
15:53frankI can make that plot I mentioned
15:53frankis it in your efs dir?
15:53mreidyup
15:53mreidor you can just straight up load your gist
15:53frankk, easier to cp from there
15:54frankmreid: remember I don't have the jupyter extensions for some reason
15:54* frank needs to fix that
15:54mreidya
15:54mreidfrank: does that sawtooth pattern seem suspicious to you?
15:54frankI think because I created my home dir wayyy before I had the correct bootstrap script
15:54frankmreid: not really, it fits with the daily subsession split
15:55mreidthis is time-between-dupes
15:55mreidso sending the same record after 24, 48, 72hrs etc
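
A minimal sketch of the "time-between-dupes" measurement described above, not the notebook's actual code. It assumes each record is a hypothetical (doc_id, submission_time) pair and measures every duplicate against the first time that doc_id was seen; measuring against the previous sighting would be an equally plausible reading. The function names are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hours_between_duplicates(records):
    """Return the gap in hours between the first sighting of each doc_id
    and every later record carrying the same doc_id.

    `records` is an iterable of (doc_id, submission_time) pairs; this
    shape is an assumption, not the real dataset schema.
    """
    first_seen = {}
    gaps = []
    # Process in submission order so "first seen" really is the earliest.
    for doc_id, ts in sorted(records, key=lambda r: r[1]):
        if doc_id in first_seen:
            gaps.append((ts - first_seen[doc_id]).total_seconds() / 3600.0)
        else:
            first_seen[doc_id] = ts
    return gaps


def histogram_by_hour(gaps):
    """Count gaps per whole-hour bucket (floor), e.g. 24.7h lands in 24."""
    counts = defaultdict(int)
    for gap in gaps:
        counts[int(gap)] += 1
    return dict(counts)


# Tiny made-up example: "a" is resent after 24h and 48h, "b" after 1.5h.
t0 = datetime(2017, 4, 1)
example = [
    ("a", t0),
    ("b", t0),
    ("a", t0 + timedelta(hours=24)),
    ("a", t0 + timedelta(hours=48)),
    ("b", t0 + timedelta(hours=1, minutes=30)),
]
gaps = hours_between_duplicates(example)
hist = histogram_by_hour(gaps)
print(hist)
```

A sawtooth like the one discussed would show up here as spikes in the buckets at multiples of 24.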
15:56frankyeah, but I guess what I'm saying is we already do things at 24 hour increments
15:56frankit's not crazy to think that missed bugs/other *things* (??) in that code may cause behavior that we're seeing
15:57mreidyeah, based on what I know about how submissions are sent, it didn't seem expected
15:57frankhmm, you know more than I
15:57frankcan you explain in more detail?
15:58mreidgfritzsche: Dexter: I did an analysis of the distribution of "time between duplicate submissions" @ the final cell in https://gist.github.com/fbertsch/3d388761e383e52e826e4979dab0f81a
15:58frankcalling in the cavalry I see :)
15:59mreidindeed
15:59mreid"what I know about how submissions are sent" doesn't go very far ;)
15:59Dextermreid, I'm in 1:1, will check later :)
15:59mreidthanks Dexter
16:03Dextermreid, what's their doc type?
16:03Dexter(also wow, seems like a lot of dupes)
16:04mreidDexter: main pings
16:04mreidDexter: for full filter criteria, see https://bugzilla.mozilla.org/show_bug.cgi?id=1348008#c2
16:04firebotBug 1348008 NEW, mreid@mozilla.com Verify that duplicates are being flagged properly
16:06Dextermreid, given the docs -> https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/concepts/submission.html
16:06Dexterthe 300 as the upper bound for the "final plot" in hrs looks comparable to the "max ping age"
16:06Dexterwhich is 14 days * 24 hours -> 336hrs
16:06Dexter(oh, I'll bbl)
16:07frankoh, that is very interesting^
16:09mreidDexter: the upper bound is just because I limited the dataset to 11 days
16:10mreidDexter: I looked at a longer time window a few weeks ago and found dupes coming in after 361 days o_O
16:12frankis there *any chance* that we have duplicate docIds without duplicate payloads?
16:16mreidfrank: I ran that check way back before UT launched, but not since then
16:17franksuppose it would be simple enough to check now
16:17mreidsimple, but fairly expensive
16:17frankwell, I mean test it out on a sample :)
16:17mreidsure
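
One way the "duplicate docIds without duplicate payloads" check could be tried on a sample, sketched under assumptions: pings arrive as hypothetical (doc_id, payload) pairs with JSON-serializable payloads, and the sample is picked by hashing the docId so every copy of a given docId lands in the same sample. None of the names below come from the real pipeline.

```python
import hashlib
import json
from collections import defaultdict


def in_sample(doc_id, percent):
    """Deterministically keep roughly `percent`% of docIds.

    Hashing the docId (rather than sampling rows at random) keeps all
    copies of a docId together, which the comparison below needs.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def conflicting_doc_ids(pings, percent=100):
    """Return sorted docIds that appear with more than one distinct payload."""
    payload_hashes = defaultdict(set)
    for doc_id, payload in pings:
        if not in_sample(doc_id, percent):
            continue
        # Canonical JSON so key order does not make equal payloads differ.
        canonical = json.dumps(payload, sort_keys=True)
        payload_hashes[doc_id].add(
            hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        )
    return sorted(d for d, hashes in payload_hashes.items() if len(hashes) > 1)


# Made-up data: "a" is a true duplicate, "b" reuses a docId with new content.
sample_pings = [
    ("a", {"x": 1, "y": 2}),
    ("a", {"y": 2, "x": 1}),
    ("b", {"x": 1}),
    ("b", {"x": 2}),
]
print(conflicting_doc_ids(sample_pings))
```

Storing only a hash per payload keeps memory small, which matters given the "simple, but fairly expensive" concern; lowering `percent` trades coverage for cost.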
16:31gfritzschemreid: so, the N*24h pattern "should" be explainable
16:31gfritzschemreid: how is that for ratio to overall pings and clients?
16:39robotblakehttps://aws.amazon.com/blogs/aws/amazon-redshift-spectrum-exabyte-scale-in-place-queries-of-s3-data/
16:40robotblakeAppears to be redshift backed by Athena o.O
17:55frankgfritzsche: thanks for moving. Wasn't sure what the right component was since technically the docs are going to be in toolkit/components/telemetry
17:56frankre: bug 1358206
17:56firebothttps://bugzil.la/1358206 NEW, fbertsch@mozilla.com Add in-tree Docs for focus-event ping
18:00mreidrobotblake: saw that this morning, looks interesting
18:02gfritzschefrank: sure! i think those kind of bugs fit better into the product/... component behind them, where people care about it. similar with adding new probes etc.
18:02frankmakes sense. gfritzsche can I ping you with an r then?
18:03mreidgfritzsche: do you mean to check the distribution for time-between-pings and see if it is the same as for the duplicates?
18:03mreidie. it might be a stable fraction of total docs
18:03gfritzschefrank: i don't really know about the contents now - would it make sense for me to f? and for an involved client engineer to r? or so?
18:04frankgfritzsche: makes sense. Policy was going to r it also so I'm not too concerned
18:04frankI'll have sebastian do it, thanks!
18:19gfritzschemreid: added a comment to the bug... i mean "how many pings, proportionally are actually duped?" etc.
18:19mreidyeah, makes sense. Thanks!
18:20mreid#dupes / #total could flatten it out.
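
A rough sketch of the "#dupes / #total" normalization, assuming two hypothetical histograms keyed by the same hour buckets: one counting duplicates by time-between-dupes, the other counting all pings by time-between-pings. If duplicates are a stable fraction of total docs, the ratio should come out roughly flat even when both raw histograms spike at multiples of 24h.

```python
def dupe_ratio_by_bucket(dupe_counts, total_counts):
    """Divide duplicate counts by total counts for each hour bucket.

    Buckets with no total pings are skipped to avoid dividing by zero;
    buckets with pings but no duplicates get a ratio of 0.0.
    """
    ratios = {}
    for bucket, total in total_counts.items():
        if total > 0:
            ratios[bucket] = dupe_counts.get(bucket, 0) / total
    return ratios


# Made-up numbers: the raw dupe counts spike at 24h, the ratio does not.
dupes = {24: 50, 36: 1, 48: 10}
totals = {24: 1000, 36: 20, 48: 200, 60: 0, 72: 40}
ratios = dupe_ratio_by_bucket(dupes, totals)
print(ratios)
```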
19:38trinkgfritzsche: were there any changes to the crash ping that would be hitting release? (the avg ping size has been climbing all day)
20:12RaFromBRChttps://moz-experiments-viewer.herokuapp.com/