mozilla :: #datapipeline

10 Aug 2017
14:01frankmreid, sunahsuh: I'll have to catch up with you later, my eye appt is taking longer than expected :(
14:01mreidfrank: ack
15:05mdogliofrank: feel free to create a new release on github when you want to push your changes to pypi
15:05frankmdoglio: already did :)
15:05mdogliocoolio
15:53amiyaguchimdoglio: would you have any clues on why I can't install a python package via `python setup.py install`?
15:53amiyaguchihttps://bugzilla.mozilla.org/show_bug.cgi?id=1388869
15:53firebotBug 1388869 NEW, nobody@mozilla.org TaskInstance: churn.churn 2017-08-02 00:00:00 due to setup.py installation failure
15:54mdoglioamiyaguchi: let me have a look
15:55amiyaguchimdoglio: it's pretty easy to reproduce locally with a fresh virtualenv
15:56amiyaguchithe workaround I have is to separate the build and install steps (pip install . && python setup.py bdist_egg)
15:59mdogliothe error seems to raise when installing click-datetime
16:01amiyaguchimdoglio: I looked at the package though, they seem to package everything together
16:02amiyaguchiis the MANIFEST.in the only way to specify files to add to the package?
16:04mdogliothis is the issue https://github.com/click-contrib/click-datetime/issues/1
16:04mdoglioand this is the fix for it https://github.com/click-contrib/click-datetime/pull/5/files
16:08franklooks like that package is still new development
16:08frankmaybe we shouldn't use it as a dep just yet
16:08amiyaguchimdoglio: thanks, I thought it might have been related (the logs ended at click-datetime). It's unintuitive that a dependent package on pypi can break other things
16:08mdoglioyeah I was thinking the same thing
16:09mdoglioespecially if it's 30 lines of code with comments and everything
16:09amiyaguchisure, it's easy enough to reimplement
16:12mdoglioit was added here https://github.com/mozilla/python_mozetl/commit/0539c1ad3e855a5156c39c869ace7f5a5f38897d, do we already have code that depends on that?
16:14mdogliooh I see it has been used for the hardware report
16:14amiyaguchimdoglio: hardware report, but it would have been used by most scripts
16:18frankhardware report could surely just use the string
16:18frankYYYYMMDD
16:23amiyaguchiI think validating input types is good regardless since the command line is the API between airflow and mozetl
16:24frankamiyaguchi: true, but in the meantime we are virtually guaranteed that there won't be any weird inputs, since they come from airflow
16:24frankjust saying - if you don't want to re implement it right away, that is a fine temporary workaround :)
16:27amiyaguchiI have a limitless supply of bandaids, I already fixed the install issue with another workaround
16:29amiyaguchiBut maybe I'll get around to the date times soon, it was on my refactoring list anyway
16:47frankamiyaguchi: plz share bandaids, I need plenty
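[Editor's sketch] The exchange above weighs dropping click-datetime against still validating the YYYYMMDD string that airflow passes on the command line. A minimal standard-library version of that check might look like the following. The function name `parse_yyyymmdd` is hypothetical and is not taken from python_mozetl or click-datetime; this is only one way the validation could be done.

```python
from datetime import date, datetime


def parse_yyyymmdd(value: str) -> date:
    """Parse a strict YYYYMMDD string into a date; raise ValueError otherwise.

    Hypothetical helper for illustration, not the mozetl implementation.
    """
    # Check the shape first so that separators or short inputs are rejected
    # before the value reaches strptime.
    if len(value) != 8 or not (value.isascii() and value.isdigit()):
        raise ValueError(f"expected YYYYMMDD, got {value!r}")
    # strptime rejects impossible dates such as month 13 or day 32.
    return datetime.strptime(value, "%Y%m%d").date()
```

Because it raises ValueError on bad input, a function like this could be passed as the `type=` argument of an `argparse` option, which would turn a malformed date into a usage error instead of a failure deep inside the job.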
17:27mdogliorobotblake: yo, I found out that athena doesn't know much about the crash_summary dataset, while presto does
17:28mdogliocould you please check that the partition discovery for that dataset is enabled?
23:08jgauntfrank: how difficult would it be to adapt your SHIELD+Data+Join nb to pull for a pref-flip experiment?
23:08jgauntI wanted to partial usage out from the crash rates I found for Screenshots
23:09frankjgaunt: there is a *much* easier way to get a Longitudinal version of the experiment main pings
23:09frankif that is what you are looking for?
23:10jgauntthey're already there, huh
23:10jgauntI just filtered to crashes only
23:10jgauntfrank: ^
23:11frankjgaunt: "they're" already "there"? Not sure what "they" and "there" are in this case :)
23:11jgauntoops, let me back up
23:12jgaunthttps://irccloud.mozilla.com/pastebin/PNK7bUyd/
23:12jgauntwell that looks awful
23:13frankjgaunt: I think I got what you mean
23:13frankthe data is there, but even better, it is already in parquet
23:13jgauntthat initial RDD returned by cohorts.where(
23:13jgaunthas all the docTypes
23:13frankjgaunt: ah right, yeah that has crash pings
23:13jgauntit's in parquet, huh
23:13franknot crash pings yet :(
23:14jgauntI just need a few averages to factor out of numerators in a ChiSq
23:14jgaunthas a dash already been made for this?
23:14jgauntor just theoretically I can query it on re:dash :)
23:15frankjgaunt: as long as you're not asking for main crashes
23:15frankpretty much everything else can be done in re:dash
23:18jgauntfrank, right on - I'm not as familiar w/it.. is there docs on how these different data sources/table schemas are organized? looks overwhelming
23:18jgaunt*are there docs
23:18frankjgaunt: hmmm.. https://docs.telemetry.mozilla.org/tools/experiments.html
23:18frankunfortunately not really yet
23:19frankjgaunt: basically, query 'experiments' in re:dash
23:19frankand be sure to filter on a single experiment_id
23:19frankit has the same schema as main_summary
23:19jgauntwith Presto as data source?
23:19frankjgaunt: yup, also in Athena
23:30jgauntfrank: any tips for using contents of a JSON column in a WHERE clause?
23:30frankjgaunt: this should help - https://prestodb.io/docs/current/functions/json.html
23:30jgauntit doesn't like it
23:31jgauntin the WHERE
23:31jgauntI think it's late for you, I'll buzz off :)
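[Editor's sketch] For the question above about filtering on the contents of a JSON column, the linked Presto page documents `json_extract_scalar(json, json_path)`, which returns a scalar as varchar and can be compared in a WHERE clause. The snippet below only builds such a query as a string. The `experiments` table and the `experiment_id` filter come from the conversation; the column name, JSON path, and values passed in the usage line are placeholders, not real schema details.

```python
def sql_quote(text: str) -> str:
    """Quote a string literal for SQL by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_experiment_query(experiment_id: str, json_column: str,
                           json_path: str, expected: str) -> str:
    """Build a Presto query filtering `experiments` on a value inside a JSON column.

    json_column is interpolated as an identifier, so it must come from
    trusted code, never from user input.
    """
    return (
        "SELECT *\n"
        "FROM experiments\n"
        f"WHERE experiment_id = {sql_quote(experiment_id)}\n"
        f"  AND json_extract_scalar({json_column}, {sql_quote(json_path)})"
        f" = {sql_quote(expected)}"
    )


# All four arguments here are placeholders for illustration.
print(build_experiment_query("some-experiment-id", "some_json_column",
                             "$.some.key", "some-value"))
```

Since `json_extract_scalar` returns varchar, comparing against a number would need a CAST around the call. If the column is a Presto map or row type rather than a JSON string, the JSON functions would not apply directly, which may be one reason a query is rejected.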
11 Aug 2017
No messages
   