mozilla :: #datapipeline

9 Aug 2017
[17:50] frank: spenrose: is clients-daily running right now?
[17:51] frank: spenrose: and will it be 100% of clients on all channels?
[17:51] frank: spenrose: and when I say "right now" I mean in the general sense of "are we creating this dataset on a schedule"
[17:54] spenrose: frank https://docs.google.com/document/d/18UX37HaPXIRxSy7rAOc0z9X5MnJKnc_m-GUOjCvj0QE/edit
[17:54] spenrose: I am trying to get to beta
[17:55] frank: spenrose: trying to get to 100% of beta?
[17:55] spenrose: I have narrowed down a blocker to a handful of possible causes and should resolve that today. Then I need ~24 hours to recreate at 1% for beta
[17:55] spenrose: 100% is TBD
[17:56] spenrose: I really want to do it. B r e n d a n has emphasized the value of getting 1% in the hands of people running analyses right now
[17:57] frank: spenrose: gotcha. Okay well to throw a wrench in this, I've been tasked with creating the heavy users dataset, see bug 1388732
[17:57] firebot: https://bugzil.la/1388732 NEW, nobody@mozilla.org [Meta] Implement Heavy Users Dataset
[17:57] spenrose: another issue came up yesterday: importance of integrating w/ heartbeat and other experiment data for 57 runup. meeting tomorrow to plan that out
[17:57] frank: spenrose: which is...wait for it... one row per client-day
[17:57] frank: probably this will start separate but eventually these will need to be merged
[17:57] spenrose: got it
[17:57] spenrose: 10-4
[17:57] frank: k, just something to keep in mind
[17:58] spenrose: so right now 30 node ATMO clusters take about 80 minutes to do a month at 1%
[17:58] spenrose: are you going to use main_summary as your source?
[18:00] frank: spenrose: yes definitely
[18:01] frank: spenrose: why a month? Shouldn't it just be one day at a time
[18:01] spenrose: here's my etl: https://github.com/mozilla/python_mozetl/pull/91/files
[18:01] frank: I looked through it on the first PR
[18:01] frank: spenrose: I'm confused why you're running over a month of main_summary
[18:01] spenrose: see the first point under "Open tasks / questions"
[18:02] spenrose: clients-daily is activity-oriented, which means it needs to incorporate the usual 10-day lag until 55+ dominate
[18:02] spenrose: there are several thorny issues
[18:03] frank: spenrose: I see, I thought you were aggregating by submission_date
[18:04] frank: https://github.com/SamPenrose/python_etl/blob/3980c36193af4dd3c05c63536c67696cd6dc4b9e/mozetl/clientsdaily/rollup.py#L86
[18:04] frank: but obviously not
[18:04] frank: well then until this switches to submission_date aggregating I suppose we don't worry about it
[18:04] spenrose: no, and while I have long been pro-submission date, activity date is clearly the right way to go for this project
[18:04] spenrose: 10-4
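Editor's note: the exchange above turns on why an activity-date rollup has to read more than one day of input. A minimal sketch of the idea follows, in plain Python rather than Spark. It is not the code from the linked PR. The field names (`client_id`, `activity_date`, `submission_date`, `active_seconds`), the function name, and the way the 10-day lag is applied are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

LAG_DAYS = 10  # assumption: the "usual 10-day lag" mentioned in the chat


def rollup_by_activity_date(pings, activity_date, lag_days=LAG_DAYS):
    """Build one row per client for a single activity date.

    A ping is counted when its activity date matches and it was submitted
    no later than `lag_days` days after that date, so the input must span
    several submission dates, not just one.
    """
    last_submission = activity_date + timedelta(days=lag_days)
    rows = defaultdict(lambda: {"pings": 0, "active_seconds": 0})
    for ping in pings:
        if ping["activity_date"] != activity_date:
            continue  # activity on a different day
        if not (activity_date <= ping["submission_date"] <= last_submission):
            continue  # submitted outside the lag window
        row = rows[ping["client_id"]]
        row["pings"] += 1
        row["active_seconds"] += ping["active_seconds"]
    return dict(rows)


# Hypothetical sample data: client "a" reports the same day twice (once late),
# client "b" reports far too late for 2017-07-04 and on time for 2017-07-05.
pings = [
    {"client_id": "a", "activity_date": date(2017, 7, 4),
     "submission_date": date(2017, 7, 4), "active_seconds": 60},
    {"client_id": "a", "activity_date": date(2017, 7, 4),
     "submission_date": date(2017, 7, 9), "active_seconds": 30},
    {"client_id": "b", "activity_date": date(2017, 7, 4),
     "submission_date": date(2017, 7, 20), "active_seconds": 10},
    {"client_id": "b", "activity_date": date(2017, 7, 5),
     "submission_date": date(2017, 7, 5), "active_seconds": 5},
]
result = rollup_by_activity_date(pings, date(2017, 7, 4))
```

Aggregating by submission date would instead need only the pings submitted on the day in question, which is the one-day-at-a-time job frank had in mind.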
[21:41] spenrose: has anyone ever seen instability in "aws s3 ls"? https://pastebin.mozilla.org/9029294
[21:41] spenrose: i have been repeatedly overwriting a keyspace, working day by day
[21:42] spenrose: currently I am trying to get 2017-07-04 to write
[21:42] spenrose: 2017-07-04 (previously written) keeps popping in and out
[21:42] spenrose: it should not exist on this run through the keyspace until the earlier days have been written
[21:55] frank: spenrose: it could be eventual consistency
[21:58] spenrose: broadly, sure. the pop-in, pop-out is fascinating. and this is happening ~ 30 minutes after an "rm --recursive"
[21:58] spenrose: btw, "2017-07-04 (previously written) keeps ..." was a typo. meant "2017-07-06 (previously written) keeps ..."
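Editor's note: if the flapping listing is eventual consistency, as frank suggests, one defensive pattern is to proceed only after the listing has matched the expected state several times in a row. The sketch below is a generic illustration, not something used in the chat. `wait_for_stable_listing` and its parameters are hypothetical, and `list_keys` stands in for whatever call actually lists the keyspace.

```python
import time


def wait_for_stable_listing(list_keys, expected, attempts=5,
                            required_stable=3, delay_seconds=0.0):
    """Poll list_keys() until it matches `expected` several times in a row.

    Returns True once `required_stable` consecutive calls agree with
    `expected`, or False if `attempts` calls are used up first.
    """
    expected = set(expected)
    stable = 0
    for _ in range(attempts):
        if set(list_keys()) == expected:
            stable += 1
            if stable >= required_stable:
                return True
        else:
            stable = 0  # a stale entry reappeared, so start counting again
        time.sleep(delay_seconds)
    return False


# Simulated listing that flaps the way the chat describes: a deleted day
# pops in and out before the listing finally stays empty.
responses = iter([["2017-07-06"], [], ["2017-07-06"], [], [], []])
settled = wait_for_stable_listing(lambda: next(responses), expected=[], attempts=6)
```

Requiring consecutive matches, rather than a single clean listing, is what guards against the pop-in, pop-out behaviour: one empty response says little if the next one can show the old key again.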
10 Aug 2017
No messages
   