mozilla :: #datapipeline

17 May 2017
08:36gfritzschethanks mreid
09:24Dexterfrank, I could easily see the hw report job exploiting your "general longitudinal" code ;-)
12:45sunahsuhpresto appears to be totally borked (connection timeouts)
12:48sunahsuhboth from stmo and trying to hit https://telemetry-presto.dev.mozaws.net/
12:51mreidsunahsuh: does Athena work?
12:52sunahsuhit's a new dataset that i just loaded in the hive metastore
12:52sunahsuhi guess i could load it in athena too
12:53sunahsuhit's not pressing, i can wait :)
13:32Dextermreid, thanks for the review on the schema.. looks like you're correct, the property should be "title". It's "name" in almost all the schemas :'(
13:36mreidyeah... hopefully we can fix it everywhere when we figure out #46/47
13:37Dexter+ sqrt(-1)
13:39mreidheh
13:39mreidDexter: one more comment re: the schema version
13:39Dextermreid, sure, should I change that?
13:40mreidyes please
13:40Dextermh, I was misled
13:41Dexterthe version number in the schema is the version number of the common ping format
13:41Dexterbut that's the first version of the "new-profile" ping
13:41Dexterit's like.. new-profile v1 using common ping format v4 :-S
13:42Dexterrenaming :)
13:42* Dexter claims the version number in the name should match the "payload" version, not the common ping format version
13:44Dextermreid, done
13:46mreidDexter: it needs to match the version we extract from the document
13:46Dexterah, I see
13:46mreidfor UT-style pings, that's the top-level "version" field
13:46mreider, desktop-ut-pings
13:46mreidcore pings use a "v" field
13:47mreidold pre-UT pings use a "ver" field
13:47mreidit's messy :(
13:47Dexterthanks for the clarification
13:47Dexterit makes sense now :)
13:47mreidnp
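[Editor's sketch] The version lookup mreid describes could look roughly like the following. Only the three field names ("version", "v", "ver") come from the discussion above; the function name, the probing order, and the sample documents are assumptions for illustration, and this is not the actual ingestion code.

```python
# Field names taken from the discussion: UT-style desktop pings use
# "version", core pings use "v", old pre-UT pings use "ver".
# The probing order below is an assumption, not the pipeline's real logic.
VERSION_FIELDS = ("version", "v", "ver")


def extract_ping_version(doc):
    """Return the top-level version of a ping document, or None if absent."""
    for field in VERSION_FIELDS:
        if field in doc:
            return doc[field]
    return None


# Hypothetical sample documents, one per ping family.
ut_ping = {"type": "new-profile", "version": 4, "payload": {}}
core_ping = {"v": 7, "clientId": "hypothetical-id"}
print(extract_ping_version(ut_ping), extract_ping_version(core_ping))
```

A schema file named after the ping would then need to use whatever value this kind of lookup returns, which is why the common ping format version (4) ends up in the name rather than a payload version.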
13:47mreidwe could change the ingestion code to use a "payload" version (which I agree would make more sense)
13:48mreidwe'd need to establish a standard for that
13:48trinkDexter: I agree also but that is not how payload is spec'ed http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/common-ping.html
13:48Dextermreid, yes "eventually" :)
13:49Dextertrink, well, it says "// the version of the ping format, currently 4"
13:49Dexterwhich refers to the version of the common ping format AFAICT
13:49Dexterthat page doesn't mention the *payload version*
13:50Dexterwhich should be in the payload (and could or could not match with what we put in the filename)
13:50mreidright
13:50Dexterfor example, for the main ping, it happens to match
13:50Dexterhttp://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html <-
13:50mreidsome pings don't have a version in the payload AIUI
13:50mreid(like deletion ping)
13:50trinkI meant the actual payload is not spec'ed as requiring a version... it is just an opaque blob
13:50Dexterwe're at 'main' ping v4
13:50Dexterexactly
13:50Dexteryeah, yes to both of you :)
13:50mreid:)
13:51mreidthere was a bug a while back about adding the "ping version" to the submission URI
13:51mreidthat seems like the right way to go
13:51mreidthen you don't need to look inside the doc to determine its version
13:51Dexteryup, I agree
13:52Dexterit would make ingestion faster too... I guess (or not influence it at all :D)
13:53mreidat least it would make it easier to decide what to do during ingestion
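[Editor's sketch] To illustrate why a version in the submission URI helps, here is a minimal parser that routes on the URI alone, without opening the document. The path layout `/submit/<namespace>/<doc_type>/<version>/<doc_id>`, the host, and the function name are all hypothetical; the chat does not say what layout the bug proposed.

```python
from urllib.parse import urlsplit


def version_from_submission_uri(uri):
    """Return (doc_type, version) from a submission URI, or None.

    Assumes the hypothetical path layout
    /submit/<namespace>/<doc_type>/<version>/<doc_id>.
    """
    parts = urlsplit(uri).path.strip("/").split("/")
    if len(parts) != 5 or parts[0] != "submit":
        return None
    _namespace, doc_type, version, _doc_id = parts[1:]
    return doc_type, version


# Hypothetical URI: "new-profile" ping, version 1.
print(version_from_submission_uri(
    "https://example.invalid/submit/telemetry/new-profile/1/some-doc-id"))
```

With something like this, the decision about which schema to validate against can be made before the JSON body is parsed.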
13:54mreidthe problem with adding another way to identify versions: https://xkcd.com/927/
13:55Dexterlol
13:57trinkonly in the document is easily doable with standardized naming and keeps it in one place... as for making ingestion faster don't get me started on that enumeration
13:57Dexter:-)
13:58Dexter(that would probably involve dropping JSON :D)
13:59trinkif it were only that simple :)
14:02* trink was waiting for mreid to link another xkcd
14:04Dexterhttps://xkcd.com/1349/
14:04Dexter:D
14:04* Dexter is not Mark, but tries to learn
14:07mreidDexter: is this dataset dead? https://metrics.services.mozilla.com/diagnostic-data-viewer/?dataset=fennec-v4-monthly
14:11Dextermh, interesting, looks like it is :S
14:13Dextergfritzsche, ^
14:13Dexterlooks like Georg scheduled this (https://bugzilla.mozilla.org/show_bug.cgi?id=1260715#c12)
14:13firebotBug 1260715 FIXED, gfritzsche@mozilla.com Review and schedule CSV summary export for the fennec-dashboard for Fennec 46
14:13DexterAnd that the notebook lives here: https://github.com/mozilla-services/data-pipeline/blob/master/reports/fennec_dashboard/summarize_csv.ipynb
14:18gfritzschemreid: this is effectively unused, things happen in re:dash now
14:18mreidgfritzsche: ok
14:18gfritzschewe raised at some point whether we should kill it, not sure why it's still around
14:19mreidthat's just a diagnostic data viewer
14:19mreidI'll remove it
14:19gfritzschefrank should have more context
14:19mreidis there a bug for stopping the job?
14:19* frank is looking
14:32spenrosefrank it turns out when you update a gist, GH keeps the URI but changes the RAW URI. which makes sense I suppose.
14:33frankspenrose: heh, sort of makes sense
14:37* mreid _ -> github
14:37spenrosefrank per your reply, i'll know more after my 1:1 this morning PDT about the best path forward. in the meantime.
14:38frankspenrose: sounds good. If it's easier I can check it in to mozilla-reports. I know the cli is a pain
14:39spenrosegreat, may take you up on that. isn't there now because ... actually i forget why. reasons :-)
14:40frankspenrose: that would be fine, all I need is a link to the gist
14:40spenrosecool. the URI in my last comment should work permanently ... and I promise to tell you if that changes
14:41mreidfrank: may not be necessary: https://github.com/mozilla/python_mozetl/pull/35
14:41mreidspenrose: ^
14:42spenroseholy cow, amiyaguchi outta nowhere with the clutch save ....
14:42mreidI think the failing test is unrelated to this
14:42frankgood, I meant to put it in moz-reports as a band-aid until that lands
14:43mreidI think that change, in itself, is a bandaid
14:44frankmreid: agreed. Then let's try and land that. I'll go ahead and r
14:44mreidthanks
14:44mreidthere was another change to fix the tests
14:44spenroserecursion: not just a good idea
14:44frankthat PR still has failing tests
14:45frankmreid: https://github.com/mozilla/python_mozetl/pull/36 is still failing :/
14:46mreidfrank: I think it needs to be rebased on https://github.com/mozilla/python_mozetl/pull/32
14:46mreid:(
16:09robotblakesunahsuh: I'm looking into the Presto issues
16:09sunahsuhthanks robotblake
16:41robotblakesunahsuh: Looks like the master node got into a bad state and EMR killed the entire cluster :\
16:41robotblakeShould be back up now
16:44mreidwhd: do you know if there's a bug on file to split "new-profile" pings out of other (see https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/48 )
16:45harterhey, spenrose - is there a bug open for getting tests for the search rollups?
16:46whdmreid: not that I'm aware
16:47mreidwhd: thx, I'll file
16:47whdmreid: that being if a bug was filed it would be done already :/
16:48whdI do remember a parquet bug now that I think about it
16:50spenroseharter kinda-sorta
16:51spenroseharter it looks like a n t h o n y may have stepped up (see scrollback)
16:51spenrosemeanwhile i have a refactor that provides tests for all the code that doesn't require a spark context with access to e.g. aws
16:52spenroseanyway, gonna talk to m r e i d about this in a few minutes and can report back
16:52hartergreat, thanks for the context
17:09jezdezwendy: mdoglio: https://github.com/mozilla/telemetry-analysis-service/pull/467
17:28trinkgregglind: woo finally seeing experiments data message_matcher = 'Fields[environment.experiments] != NIL'
17:33trink{"pref-flip-test-nightly-1":{"branch":"control"}}
17:52amiyaguchilanded changes for the search rollups in mozetl and airflow
17:55frankwhd^ we need to redeploy
18:01mreidamiyaguchi++
18:10amiyaguchimreid: also, any suggestions for intermittent failures because of toLocalIterator?
18:14mreidamiyaguchi: did you try shaking your fist at the intermittent failure?
18:15amiyaguchiI did, and it went away. But then it came back.
18:15mreiddang
18:16whdfrank: roger
18:17mreidamiyaguchi: can we pin the spark version to a newer release that doesn't have that problem?
18:17amiyaguchimreid: 2.0.3 or 2.1.1, I think we're stuck with whatever's default on EMR
18:18amiyaguchiI'm thinking of maybe just using collect and warning that we can't write csv files larger than whatever fits on driver memory
18:18mreidyeah, that seems like a reasonable limitation
18:18amiyaguchiwe probably shouldnt be writing csv files that big anyways
18:18mreidhonestly if it's larger than driver memory, you shouldn't be using csv
18:18mreidexactly
18:19mreidmaybe catch an OOM exception and print a message saying "use parquet!"
18:19mreid:)
18:20amiyaguchiIt'd be a wonder if python could catch java exceptions
18:20mreidadding a docstring to the "write_csv" function should suffice
18:31spenroseamiyaguchi you can also ask m p r e s s m a n if there's an alternate ingestion format. I picked CSV for no reason other than convenience
18:31spenroses/format/approach-that-would-work
18:35amiyaguchispenrose: csv gets used all over the place, this is a utility function I wrote for writing csv to local disk because spark on emr has some permission issues. It gets used for the topline dashboard.
18:36spenroseah, nevermind :-)
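[Editor's sketch] The collect-based approach amiyaguchi and mreid settle on, with the limitation stated in the docstring, might look like this. The name `write_csv` comes from the chat; the signature and body are assumptions and not the mozetl implementation. It relies only on the `collect()` method and `columns` attribute that Spark DataFrames provide, so the usage example below substitutes a small hypothetical stand-in object to stay runnable without a Spark cluster.

```python
import csv


def write_csv(dataframe, path):
    """Write a DataFrame to a CSV file on the driver's local disk.

    This uses collect(), so every row is pulled into driver memory first.
    If the data does not fit in driver memory, do not use CSV here;
    write parquet instead.
    """
    rows = dataframe.collect()  # materializes all rows on the driver
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(dataframe.columns)  # header row
        writer.writerows(rows)              # each row is a tuple-like object


class FakeDataFrame:
    """Hypothetical stand-in exposing the two members the sketch needs."""

    columns = ["country", "search_count"]

    def collect(self):
        return [("US", 10), ("DE", 3)]


write_csv(FakeDataFrame(), "rollup_example.csv")
```

Putting the warning in the docstring, as mreid suggests, sidesteps the question of whether a driver-side out-of-memory error in the JVM could be caught cleanly from Python.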
18 May 2017
No messages
   