mozilla :: #datapipeline

11 Aug 2017
15:02 <harter> Can someone give Parquet2Hive a kick - I added some new data to the privacy_prefs_shield table and I want to get to work on some queries
15:05 <frank> harter: can do
15:06 <frank> harter: I'm not seeing privacy_pref_shield in the parquet2hive jobs
15:06 <harter> s/pref/prefs/ - my bad
15:07 <harter> frank: ^
15:07 <frank> harter: it's not setup in the crontab, but I can run this oneoff
15:07 <harter> really? hmm
15:08 <frank> harter: added it, seems it may be missing some partitions?
15:08 <frank> https://irccloud.mozilla.com/pastebin/vCxtOvCE/
15:08 <frank> definitely some interim dates missing there
15:09 <frank> harter ^
15:09 <harter> ah, WTH
15:09 <frank> yeah the others are missing from s3
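[editor's note: a minimal sketch of how one might check which date partitions actually exist in S3, in the spirit of frank's pastebin above. The bucket, prefix, and submission_date= partition layout are assumptions for illustration.]

    import boto3

    # List the top-level "partitions" under the dataset prefix and pull out
    # the dates, to compare against the dates the backfill should have written.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket="telemetry-parquet",
        Prefix="harter/privacy_prefs_shield/v1/",
        Delimiter="/",
    )
    dates = sorted(
        p["Prefix"].rstrip("/").split("=")[-1]
        for p in resp.get("CommonPrefixes", [])
    )
    print(dates)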
15:10 <frank> harter: also open a bug for blake to add this to the cron
15:12 <harter> reopening: https://bugzilla.mozilla.org/show_bug.cgi?id=1386430
15:12 <firebot> Bug 1386430 FIXED, bimsland@mozilla.com Add dataset to Presto
15:12 <harter> thanks, frank!
15:13 <frank> harter: you should actually have blake add autodiscovery to your telemetry-parquet/harter directory
15:13 <frank> then you don't have to ping him for your new datasets
15:15 <harter> frank, you are a fountain of wisdom
15:32 <sunahsuh> sbt 1.0 is out: https://twitter.com/scala_sbt/status/896000631990767619
15:35 <frank> sunahsuh: faster incremental compiler!
15:35 * frank does a little dance
15:37 <harter> Airflow problems: running a local test causes: IOError: [Errno 13] Permission denied: '/app/unittests.cfg'
15:38 <harter> command `docker-compose run web airflow test main_summary t12 20170809`
15:38 <sunahsuh> using docker?
15:38 <harter> yep
15:38 <harter> https://irccloud.mozilla.com/pastebin/N1YVE9xi/airflow_docker_error
15:39 <harter> full error ^
15:47 <sunahsuh> huh. do you just want to run the dag?
15:47 <sunahsuh> er, task
15:51 <harter> yes
15:51 <harter> sunahsuh: ^
15:58 <sunahsuh> usually i use the web ui for doing that locally
16:00 <sunahsuh> harter:
16:00 <harter> I'm trying to test a task that hasn't been submitted yet
16:00 <harter> Is that still doable?
16:01 <harter> If so how do I start the local server?
16:01 <harter> sunahsuh: ^
16:02 <sunahsuh> yeah, that should be fine -- sorry, this reminds me there's a fix that we need in the dev docker file to actually run tasks that i have sitting locally
16:02 <sunahsuh> harter: give me 2 mins to submit this 1-line PR
16:03 <harter> awesome - thank you
16:06 <sunahsuh> i tagged you on that review
16:06 <sunahsuh> harter:
16:13 <sunahsuh> harter: so, it looks like the makefile takes care of creating unittests.cfg for you
16:14 <harter> I think the user doesn't have write permissions for that file
16:14 <sunahsuh> i'd use the make commands instead of directly invoking `docker-compose` (`make build` and `make up`)
16:14 <harter> Maybe related to https://github.com/mozilla/telemetry-airflow/commit/d9fe940a9b6da24ebf19aaacc90cb20c4828658b#diff-3254677a7917c6c01f55212f86c57fbfL64
16:14 <harter> That command is issued by the make file
16:14 <sunahsuh> hrm. are you on a mac?
16:14 <harter> sunahsuh: ^
16:14 <harter> sunahsuh: Nope -
16:15 <harter> where's the fun in that :)
16:18 <sunahsuh> harter: does it exist now?
16:18 <sunahsuh> the file
16:20 <sunahsuh> looks like it's probably generated for the first time after the chown command in the dockerfile
16:21 <sunahsuh> also, remember how docker is supposed to solve these cross-platform dev problems? :P
16:22 <harter> lol
16:22 <harter> sunahsuh: where are you seeing the command to create this file?
16:23 <harter> sunahsuh: also, I can't log into the web container - it keeps restarting
16:23 <sunahsuh> https://github.com/apache/incubator-airflow/blob/3547cbffdbffac2f98a8aa05526e8c9671221025/airflow/config_templates/default_test.cfg#L14-L17
16:23 <sunahsuh> lol
16:24 <sunahsuh> are you getting an exception at least before it restarts?
16:27 <amiyaguchi> harter: if you're on linux, I remember something about having to change the Dockerfile.dev user id (10001) to the local user id
16:30 <harter> e.g. harterrt?
16:31 <amiyaguchi> `id --user`
16:33 <harter> note to self: I need to take a week and figure out docker
16:34 * frank is still trying to figure out computers
16:38 <sunahsuh> 1. put transistors on a chip 2. ??? 3. profit!
16:39 <harter> hey frank - any chance you can boop parquet2hive again? I fixed the backfill
16:39 <frank> harter: can do
16:40 <harter> amiyaguchi: that seems to have done it
16:40 <harter> amiyaguchi: at least I'm getting a new set of errors now
16:40 <amiyaguchi> yay
16:41 <amiyaguchi> the other set of problems that I had with airflow on linux was that I had to run `make up` before running `make run`
16:41 <sunahsuh> huh
16:42 <frank> harter: done, looks like all dates are now present
16:42 <frank> sunahsuh: directions unclear, do barbecue chips suffice?
16:43 <sunahsuh> no, because they're gross
16:44 <harter> salt and vinegar or bust
16:44 * sunahsuh is a purist
16:44 <frank> I can't believe I work with such heathens
16:47 <harter> diversity ftw
16:47 <sunahsuh> lol
16:48 <sunahsuh> harter: is your comment a suggestion that i should ask someone else for a review? :P
16:48 <sunahsuh> (in the pr)
16:51 <harter> sunahsuh: kind of, I don't think I can give a meaningful review for that file. Though, I am testing the change locally
16:51 <harter> If you want to wait until this runs successfully I can note that in the PR
16:52 <sunahsuh> haha okay :)
17:02 <harter> frank: I still only see a few dates in the dataset https://sql.telemetry.mozilla.org/queries/16042/source
17:03 <frank> harter: might take a bit for the change to propagate
17:03 <harter> cool
17:03 <frank> I'm not sure of the details but Presto probably does some caching
17:06 <jgaunt> frank iirc you told me not to exceed 20 gig on atmo storage
17:07 <jgaunt> when I df -h it says I'm using 984G and deleting stored files hasn't brought that number down
17:07 <frank> jgaunt: !!!
17:07 <frank> jgaunt: omg you are
17:08 <jgaunt> I deleted every .csv I'd saved
17:08 <jgaunt> S3 is independent of that, right?
17:08 <frank> jgaunt: yes it is
17:08 <frank> jgaunt: this really isn't a huge deal, didn't mean to get you worried
17:08 <jgaunt> okay I was gonna ask if it's worth filing an issue
17:09 <jgaunt> I'm just perplexed
17:09 <frank> jgaunt: it takes time for the filesystem to make those changes
17:09 <frank> deleting a TB of data will take a little bit :)
17:09 <jgaunt> right on, ty
17:30 <RaFromBRC> frank: ping
17:30 <frank> RaFromBRC: pong
17:31 <RaFromBRC> am i right in thinking that you were at least a second-hand participant in the convo between openjck and wendy re: the data format for the fx hardware report?
17:33 <frank> RaFromBRC: I was not there for the convo, but openjck and I chatted about it briefly during today's hw report meeting
17:33 <frank> I am largely ambivalent, the data can always be munged to the desired format :)
17:33 <RaFromBRC> ah, right... i got there at 10:15, but the mtg was already over by then
17:34 <frank> RaFromBRC: we were fast!
17:34 <RaFromBRC> that's what i want to run by you... i have a vision that i'm hoping you'll find interesting 0:)
17:34 * frank listens
17:35 <RaFromBRC> my idea is that i'd like to put together a spark lib that assists w the conversion from the format that makes sense for the FHR to the one that makes sense for ensemble
17:36 <RaFromBRC> w the goal of having the lib be flexible enough that it can help w the conversion of other data sets to the same format
17:36 <RaFromBRC> basically i want a toolkit that makes it easy to generate the format that ensemble needs, that we can reuse to ease the pain of creating ensemble-consumable data again and again
17:36 <frank> RaFromBRC: yes, makes sense. I question the need for a spark lib though
17:37 <frank> this is "small data", e.g. just a few KB (*maybe* MB) JSON
17:37 <frank> so should really just be done locally
17:37 <RaFromBRC> so just do it in the javascript?
17:38 <frank> RaFromBRC: I would say Python, since that's what most of our users will write their spark scripts in
17:38 <frank> pull the result in locally, use this package to format it correctly
17:38 <frank> and write it out to s3 for use by ensemble
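[editor's note: a rough sketch of the local format-and-upload flow frank describes. The target JSON shape, bucket, and key are hypothetical; ensemble's real input format isn't pinned down in this conversation.]

    import json
    import boto3

    def to_ensemble_format(rows):
        """Pivot [(date, metric, value), ...] rows into one record per date."""
        by_date = {}
        for date, metric, value in rows:
            by_date.setdefault(date, {"date": date})[metric] = value
        return sorted(by_date.values(), key=lambda rec: rec["date"])

    # `rows` would be the (small) aggregate pulled down from the Spark job.
    rows = [("2017-08-11", "has_gpu", 0.42), ("2017-08-11", "cpu_cores_4", 0.61)]

    boto3.client("s3").put_object(
        Bucket="telemetry-public-analysis-2",  # hypothetical bucket
        Key="ensemble/hardware_report.json",   # hypothetical key
        Body=json.dumps(to_ensemble_format(rows)).encode("utf-8"),
    )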
17:39 <frank> RaFromBRC: or, ensemble figures out which format it's in, and the munging lives there
17:39 <RaFromBRC> sure, as long as we can drive it w airflow so it all happens automatically and we get notices when it doesn't work etc etc etc
17:40 <frank> RaFromBRC: yup, that makes sense
17:40 <RaFromBRC> i'm not attached to spark, but i am attached to keeping it in our job management infrastructure
17:41 <frank> RaFromBRC: airflow can pretty much run anything
17:42 <RaFromBRC> great
17:42 <frank> RaFromBRC: we could have a separate DAG that is just "import into ensemble", which runs this script
17:42 <frank> s/DAG/node
17:45 <amiyaguchi> it could be interesting to have it as an airflow operator if it's going to be written purely in python?
17:46 <RaFromBRC> is an operator a different primitive than a node on the dag?
17:46 <frank> amiyaguchi: yup! That's what I meant
17:46 * frank is confusing terminology atm
17:49 <amiyaguchi> an operator ends up being a node in the dag (https://airflow.incubator.apache.org/code.html#airflow.models.BaseOperator)
17:50 <RaFromBRC> man, look at all those arguments...
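[editor's note: for reference, a minimal sketch of a custom operator against the BaseOperator API linked above, roughly as it looked in 2017-era Airflow. The class name, arguments, and body are hypothetical.]

    import logging

    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class EnsembleImportOperator(BaseOperator):
        """Reformat an upstream job's output and publish it for ensemble."""

        @apply_defaults
        def __init__(self, source_prefix, dest_prefix, *args, **kwargs):
            super(EnsembleImportOperator, self).__init__(*args, **kwargs)
            self.source_prefix = source_prefix
            self.dest_prefix = dest_prefix

        def execute(self, context):
            # Read the upstream output, convert it, and write it to the
            # auto-discovered S3 location ("just put your data here").
            logging.info("formatting %s -> %s", self.source_prefix, self.dest_prefix)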
17:53 <frank> RaFromBRC: it depends a bit on how ensemble is setup
17:53 <RaFromBRC> anyway, that could work, although i do wonder if it doesn't make more sense to just do it in the same job that's already generating the data set that is currently being used
17:53 <frank> RaFromBRC: right, exactly
17:54 <frank> RaFromBRC: if we have an endpoint that the data needs to be sent to, then an operator makes sense
17:54 <frank> RaFromBRC: but if ensemble is just going to auto-discover data from a certain location in s3, then we don't really need to worry about it
17:54 <RaFromBRC> if the data is already being loaded and manipulated, why render it in one format only to immediately re-parse it and convert it to another?
17:54 <frank> "just put your data here"
17:54 <RaFromBRC> exactly
17:55 <RaFromBRC> that's how i'm thinking... the processing job just calls the library to help generate ensemble-compatible format, writes the output to s3, the ensemble instance is already set up to know where to find the data
17:55 <RaFromBRC> if the other (non-ensemble) format is being used elsewhere, we can still generate that, too
17:59 <frank> RaFromBRC: agreed. No reason to complicate it
19:04 <joy> what is ensemble?
19:05 <frank> joy: openjck is working on it, see https://github.com/openjck/ensemble
19:06 <frank> joy: started with thinking about how we could recreate the hardware report, just for other data
19:06 <joy> i see
19:06 <joy> ty
19:07 <joy> nice idea. "Our goal is not to support every feature imaginable, but instead to build a minimalist platform for publishing useful visualizations quickly and easily." - if this minimal subset is done carefully, i think it would be great
19:10 <frank> joy: different topic - is it odd that a user can be a heavy_user for a specific day (the last day of the 28 day period) without having any activity on that day
19:10 <joy> so as of day D, this user is classified as heavy? that would mean their 28 day usage is > C (cutoff)
19:10 <joy> frank: right?
19:10 <frank> right
19:11 <joy> why odd?
19:11 <frank> joy ^
19:11 <frank> just seems... counter intuitive
19:11 <joy> i think i misunderstood then
19:11 <joy> could you elaborate?
19:11 <frank> "I want to see all the users who were heavy_users on day X"
19:11 <joy> frank: ^^
19:11 <frank> well, this user is included... but wasn't active on day X
19:11 <joy> aaah
19:11 <joy> but usually the question is rephrased as
19:12 <joy> i want to see all users who are classified as heavy_users "as of day X"
19:12 <frank> right, good point
19:12 <joy> user browsed a ton and took a day off on X
19:13 <frank> joy: right. That new formulation of the question makes more sense
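[editor's note: a toy pandas illustration of the "as of day X" semantics discussed here. Column names and the cutoff are made up; the real definition belongs to the heavy users dataset work referenced below.]

    import pandas as pd

    CUTOFF = 100  # hypothetical 28-day usage cutoff C

    # One row per day for a single client: heavy usage, then a day off.
    daily = pd.DataFrame({
        "date": pd.date_range("2017-07-14", periods=29),
        "active_ticks": [10] * 28 + [0],
    })

    # Trailing 28-day usage gives the "classified as heavy as of day X" flag.
    daily["usage_28d"] = daily["active_ticks"].rolling(28, min_periods=1).sum()
    daily["heavy_as_of"] = daily["usage_28d"] > CUTOFF

    last = daily.iloc[-1]
    print(last["heavy_as_of"], last["active_ticks"])  # True, 0: heavy with no activity that day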
19:13 <frank> but this could be a possible source of confusion for people, so we'll just need to be careful
19:13 <joy> yes, we ought to settle on wording
19:13 <joy> and maybe have a FAQ waiting
19:13 <joy> on hand
19:13 <joy> frank: ^^
19:14 <frank> joy: agreed. FAQ sounds like a job for a data scientist ;)
19:14 <joy> haha, will do
19:14 <frank> joy: you can make a bug for it and block it on bug 1388732
19:14 <firebot> https://bugzil.la/1388732 NEW, nobody@mozilla.org [Meta] Implement Heavy Users Dataset
19:15 <joy> frank: doing
19:22 <frank> joy: awesome, thanks!
20:18 <harter> hey frank or robotblake - I'm still only seeing a few days in the privacy_prefs_shield table
20:18 <harter> https://sql.telemetry.mozilla.org/queries/16117/source
20:19 <harter> If possible, I'd like to confirm everything's working before I go on PTO next week
20:23 <robotblake> Hrm, I'll take a look
20:27 <ashort> I'm trying to mess with spark configuration options. is there a convenient way to restart stuff on an instance started from ATMO, after changing its config files?
20:27 <ashort> i don't see an init script or such
20:45 <amiyaguchi> ashort: this is the bootstrap script (https://github.com/mozilla/emr-bootstrap-spark/blob/master/ansible/files/bootstrap/telemetry.sh)
20:45 <amiyaguchi> spark on ATMO uses yarn as the resource manager
20:46 <amiyaguchi> you should be able to kill the spark context, and run it again via pyspark while passing in the configs. I've actually never tried it myself
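[editor's note: roughly what amiyaguchi's suggestion might look like from a notebook on the cluster, equally untested; it assumes the existing `spark` session that the ATMO notebook setup provides.]

    from pyspark.sql import SparkSession

    spark.stop()  # kill the context the bootstrap script started

    # Rebuild the session with the desired setting; everything else falls
    # back to the cluster defaults from spark-defaults / configuration.json.
    spark = (
        SparkSession.builder
        .config("spark.ui.killEnabled", "true")
        .getOrCreate()
    )
    print(spark.conf.get("spark.ui.killEnabled"))  # verify the setting took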
20:48 <amiyaguchi> I'd look at lines 217 (pyspark arguments including the spark configs) and 242 (setting the python interpreter for pyspark)
20:49 <ashort> amiyaguchi: yeah I found this and the spark-defaults stuff in configuration.json
20:50 <ashort> I just know nothing about this stack and am trying to find the easiest way to try out a config change
20:50 <ashort> worst case I'll just open a PR and let someone else figure it out ;-)
20:50 <ashort> (I want to turn spark.ui.killEnabled on, so I can add a cancel button in jupyter for spark jobs.)
20:53 <amiyaguchi> ah, is that why you can't kill anything in the spark ui?
20:53 <ashort> Right.
20:53 <amiyaguchi> emr really has an odd set of defaults
20:55 <robotblake> harter: Mind checking now?
20:57 <harter> robotblake++ looks great, thanks!
20:58 <robotblake> I'm not entirely sure what was happening with it :\
20:59 <harter> Oh well - probably me fooling around with it too much :)
21:26 <ashort> amiyaguchi: yeah, looking at the docs this is supposed to be enabled by default, I don't see that anything has disabled it...
21:27 <amiyaguchi> ashort: maybe jobs are supposed to be killed via yarn
21:27 <ashort> now i gotta find out what yarn is
21:27 <amiyaguchi> yet another resource negotiator
21:28 <ashort> also now wondering if the jupyter web frontend is rejecting DELETEs before they get forwarded to the spark web stuff
21:29 <amiyaguchi> you can query the state of the config somewhere, it might actually be disabled
21:29 <amiyaguchi> because killing via the spark ui doesn't work at all
12 Aug 2017
No messages