mozilla :: #datapipeline

20 Mar 2017
12:06 <gfritzsche> who is the cluster address service today?
12:06 <gfritzsche> looking for gfritzsche-telemetry-analysis
12:06 <gfritzsche> jezdez: ?
12:06 <gfritzsche> mdoglio|lunch: ?
12:06 <jezdez> gfritzsche: hey, so atmo is in an unwell state right now
12:07 <mreid> uh oh
12:07 <jezdez> we have a fix in stage
12:07 <jezdez> but I'm waiting for a prod deploy to happen by cloudops
12:07 <gfritzsche> hm, does this mean we can't get the address anymore?
12:08 <gfritzsche> or can it still be looked up manually?
12:08 <jezdez> oh it can, just saying that looking up the address isn't really scalable if everyone just keeps launching clusters :)
12:09 <gfritzsche> ok, cool
12:09 <jezdez> gfritzsche: ec2-54-187-123-45.us-west-2.compute.amazonaws.com
12:09 <gfritzsche> thanks
12:10 <gfritzsche> did we start to persist data across cluster runs?
12:11 <jezdez> gfritzsche: we did, last week
12:11 <gfritzsche> oh!
12:11 <gfritzsche> nice!
12:11 <jezdez> see what's new popup in the footer of atmo
12:11 <gfritzsche> that was the "limit to 20gb" news?
12:12 <jezdez> gfritzsche: yes
12:12 <jezdez> "Persistent cluster storage"
12:12 <gfritzsche> right, good stuff :)
12:14 <jezdez> gfritzsche: hat tip to f r a n k
14:14 <mreid> atmo scheduled jobs seem unhappy...
14:15 <jezdez> mreid: we just did a prod deploy
14:15 <jezdez> and there is a big backlog of jobs to process through
14:15 <mreid> ok
14:15 <jezdez> jobs as in atmo queue jobs, not spark jobs
14:15 <mreid> gotcha
14:26 <sunahsuh> "ImportError: No module named moztelemetry"
14:26 <sunahsuh> is what mine failed with..
14:29 <frank> sounds like an issue with the bootstrap script?
14:30 <mreid> I had a bootstrap fail too
14:34 <jezdez> that should be unrelated to the atmo backend issues
14:34 <jezdez> maybe a package mirror fell over? /cc jason
14:35 <sunahsuh> strange -- i can't find any logs for this run
14:36 <sunahsuh> either from the machines or for the step
14:37 <jezdez> sunahsuh: what do you mean?
14:37 <sunahsuh> i'm looking at the cluster on the aws console
14:38 <jezdez> which cluster is that?
14:38 <sunahsuh> j-1W6B220IA62XY
14:40 <jezdez> sunahsuh: weeeird
14:40 <sunahsuh> yeah, all the clusters look like that
14:40 <jezdez> uuuh
14:41 <sunahsuh> all the ones launched at 13:59
14:41 <sunahsuh> er, and failed
14:42 * jezdez scratches head
14:42 <sunahsuh> oh, wait, no, all the running and terminated ones too
14:42 <sunahsuh> no logs
14:44 <jezdez> so 13:59 is close to when prod got deployed
14:44 <jezdez> which fixed the issue that made the queue fill up
14:44 <jezdez> that didn't change any of the log config though
14:44 <jezdez> or parameters or any of that
14:45 <jezdez> https://us-west-2.console.aws.amazon.com/cloudtrail/home?region=us-west-2#/events?EventName=RunJobFlow&StartTime=2017-03-20T12:45:00.000Z
14:45 <sunahsuh> i'm looking and all the jobs have logs up to 3/17, and then none after that
14:47 <sunahsuh> frank: ^ does this coincide with the home dir change?
14:47 <jezdez> sunahsuh: he wasn't around, atmo deploy window is thursday usually
14:47 <sunahsuh> oh that's right
14:48 <sunahsuh> okay, so it seems to correlate with the last deploy maybe..
14:49 <jezdez> so that deploy was on thursday 8:45 PM approx
14:49 <jezdez> that was 16th
14:50 <jezdez> then another fix release on friday 17th 1 pm
14:52 <jezdez> https://github.com/mozilla/telemetry-analysis-service/compare/2017.3.2...2017.3.3 was the diff in the release on friday
14:53 <jezdez> https://github.com/mozilla/telemetry-analysis-service/compare/2017.3.1...2017.3.2 the one on thursday
14:58 <sunahsuh> hmm, well, j-39M6AQWF75KXM is the last one i see that has logs
14:58 <sunahsuh> and j-17996006CNCIN is the first one without
14:58 <jezdez> that's on friday late
14:58 <jezdez> weird
14:59 <sunahsuh> so it looks like maybe some time between 3/17 20:00 and 3/17 22:00
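The narrowing done by hand above (find the last cluster that still has logs and the first one that does not) can be sketched in a few lines of Python. This is only an illustration: `log_cutoff_window` is a hypothetical helper, and the two launch times are placeholders taken from the rough 20:00 to 22:00 window mentioned in the channel, not the clusters' real launch times.

```python
from datetime import datetime


def log_cutoff_window(clusters):
    """Return (last cluster with logs, first cluster without logs).

    `clusters` is an iterable of (cluster_id, launched_at, has_logs) tuples.
    Clusters are examined in launch order; the first one lacking logs marks
    the end of the window in which logging broke.
    """
    last_with_logs = None
    for cluster in sorted(clusters, key=lambda c: c[1]):
        if cluster[2]:
            last_with_logs = cluster
        else:
            return last_with_logs, cluster
    # Every cluster had logs (or the input was empty): no cutoff found.
    return last_with_logs, None


# Cluster IDs come from the chat; the timestamps are illustrative assumptions.
observed = [
    ("j-17996006CNCIN", datetime(2017, 3, 17, 22, 0), False),
    ("j-39M6AQWF75KXM", datetime(2017, 3, 17, 20, 0), True),
]

before, after = log_cutoff_window(observed)
print(before[0], "->", after[0])
```

With real launch times (for example from the CloudTrail `RunJobFlow` events linked earlier), the two returned entries bound the period in which the change must have happened.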
15:00 <jezdez> ok, since we've had something similar recently, how do we check if some config changed via the testing setup?
15:00 <jezdez> did a bucket policy change or something?
15:00 <jezdez> jason: hallp
15:00 <frank> oh I know I know
15:00 <frank> robotblake asked about log bucket policy
15:00 <jezdez> O_O
15:01 <frank> hopefully the info is still there, we just can't see it
15:01 <frank> jezdez https://irccloud.mozilla.com/pastebin/eH5hK8gm/
15:02 <jezdez> frank: uh, when was that?
15:02 <frank> 3/14
15:02 <frank> but probably didn't make those changes until later
15:02 <jezdez> wtf
15:05 <jezdez> hm, so cloudservices-aws-dev has read/write perms
15:06 <trink> fyi: the real-time data platform packages for the Mar06 sprint have been released https://people-mozilla.org/~mtrinkala/packages/
15:07 <jezdez> frank: sunahsuh: do you have any other leads? will have to let jason and robotblake look at this
15:08 <frank> I think that's the issue, right?
15:08 <frank> does seem odd that I can't even see it in the aws cli
15:08 <jezdez> frank: I don't know how our aws permissions are structured
15:09 <frank> well first we should probably make sure the logs are still there
15:09 <frank> if they are not, that really bumps up priority
15:10 <sunahsuh> yeah, not really sure what else might be going on -- and it's hard to debug the failing jobs without logs :/
15:11 <frank> hmm, except, if it were a perms issue, wouldn't NONE of the logs be available?
15:12 <sunahsuh> you mean like the previous runs?
15:12 <frank> yeah
15:13 <frank> assuming it is a bucket policy
15:13 * frank goes to actually look
15:15 <jezdez> if the atmo user lost its permissions to access the logs it would explain that it started at the time when it lost them
15:15 <frank> yeah that's my worry
15:15 <frank> My theory was we lost read permissions, but if we lost write permissions, that is no good
15:15 <jezdez> yeah :-/
15:16 <frank> I think we need to escalate
15:20 <jezdez> opening a bug
15:22 <frank> jezdez: can also email ops: https://mana.mozilla.org/wiki/display/SVCOPS/Contacting+Cloud+Operations
15:22 <frank> whd, robotblake, you guys around? ^
15:24 <jezdez> https://bugzilla.mozilla.org/show_bug.cgi?id=1348862
15:24 <firebot> Bug 1348862 NEW, nobody@mozilla.org AWS log files not accessible for ATMO clusters
15:24 <jezdez> jason: ^
15:24 <jason> what's happening?
15:25 <jason> which bucket is this?
15:25 <jezdez> telemetry-analysis-logs-2
15:25 <jason> telemetry-analysis-logs-2
15:25 <jason> https://bugzilla.mozilla.org/show_bug.cgi?id=1347086
15:25 <firebot> Bug 1347086 is not accessible
15:25 <jezdez> can't see the bug
15:25 <jason> cc'd you
15:26 <jezdez> jason: hm, try jezdez@ please
15:27 <jason> ah I was midaired
15:27 <jezdez> jason: gotcha
15:28 <jason> so we modified the bucket policy because of 1347086
15:28 <jason> but let me just check it
15:28 <jezdez> makes sense to me
15:28 <robotblake> I'm on now too
15:29 <jason> robotblake: can you help with it ^
15:29 <jason> I need to head to another meeting
15:33 <robotblake> Yeah
15:33 <robotblake> Looking
16:32 <sunahsuh> jezdez: i don't have a master address on the cluster i spun up this morning :(
16:32 <sunahsuh> Start date
16:32 <sunahsuh> 2017-03-20 14:36 (so after the deploy)
16:32 <frank> sunahsuh: afaik they are all failing
16:32 <sunahsuh> doh
16:33 <jezdez> sunahsuh: yeah, in addition to the log issues, a regression from friday had atmo spin like crazy trying to run job queue
16:33 <jezdez> err queued jobs
16:33 <jezdez> so this isn't surprising
16:33 <sunahsuh> ahh so we're still working through the backlog?
16:33 <jezdez> I've had jason purge the queue since it should just pick up the tasks again after a while
16:33 <sunahsuh> got it
16:33 <jezdez> that was a bit ago and I've been in meetings/fire fighting with other stuff
16:33 <jezdez> this monday is really a monday folks
16:34 <sunahsuh> :/
16:35 <mdoglio> it is one of those indeed
16:47 <mdoglio> on a side note: both taar and the demo site (taarweb) run on py3 :)
16:49 <frank> mdoglio: wooooh!
16:49 <frank> mdoglio: Where is the code for the demo frontend?
16:50 <mdoglio> https://github.com/maurodoglio/taarweb
16:50 <mdoglio> and here is the library https://github.com/maurodoglio/taar
16:51 <frank> mdoglio: .ebextensions is for elastic beanstalk then?
16:51 <mdoglio> correct
16:52 <frank> that's what I was looking for
16:52 <mdoglio> it's a way to customize the environment/deployment
16:52 <frank> did you manually deploy
16:52 <mdoglio> I'm gonna create a cookiecutter from taarweb to use for demos
16:52 <mdoglio> it's very convenient for demos that require services on aws
16:53 <frank> yeah it seems like it
16:53 <frank> mdoglio: even better to have the ansible config as well :)
16:53 <frank> then just ansible-playbook and boom, a demo
16:53 <mdoglio> well, you don't need that really
16:53 <frank> no?
16:53 <mdoglio> there is a command line tool to do the deploy
16:54 <frank> oh perfect
16:54 <mdoglio> https://pypi.python.org/pypi/awsebcli/3.0.3
16:54 <mdoglio> I'm gonna write something about that
16:55 <frank> nice :)
17:04 <mreid> mdoglio: for the record, I am interested in your experience w/ elasticbeanstal
17:04 <mreid> k
17:04 * mdoglio adds mreid to his contact list
17:07 <jezdez> mdoglio: huzzah! (py3k)
17:08 <mdoglio> jezdez: you traced the path man
17:10 <jezdez> mdoglio: ;)
17:25 <jezdez> robotblake: *highfives*
17:25 <mlopatka> Oh, that's a new one. I get an error when trying to terminate my cluster currently
17:25 <mlopatka> "Forbidden (403)
17:25 <mlopatka> CSRF verification failed. Request aborted."
17:26 <jezdez> mlopatka: a classic, a deploy just went out and the csrf token got invalidated for the termination form
17:26 <jezdez> reload and it should work
17:26 <jezdez> sunahsuh: frank: robotblake says the logs perms are now fixed
17:26 <sunahsuh> \o/
17:26 <frank> yay
17:27 <mlopatka> jezdez: thanks! that worked
17:36 <jezdez> sunahsuh: chutten: atmo has caught up and should again update the cluster master addresses
17:37 <sunahsuh>
17:38 <sunahsuh> whew, thanks jezdez -- hope you reward yourself with a beer tonight :)
17:38 <jezdez> sunahsuh: thanks to jason and robotblake <3
17:44 <jezdez> sunahsuh: frank: eh, and thank you for debugging this earlier as well
17:45 <chutten> jezdez: thanks!
18:33 <frank> sunahsuh: ping
18:33 <sunahsuh> pong
18:33 <frank> sunahsuh: I heard you've been writing event ETL scripts
18:34 <frank> er, jobs. Whatevs
18:34 <sunahsuh> indeed
18:34 <sunahsuh> they are currently the bane of my existence
18:34 <frank> sunahsuh: so we are planning a new ping for Focus with the event format
18:34 <frank> is there any way we can plug-n-play :)
18:36 <sunahsuh> haha well, i have a PR i'm going to send out that at least creates an Events utility class
18:36 <sunahsuh> so there's a decent chance that'll be useful for you
18:36 <frank> ooh, yes that sounds dandy
18:38 <sunahsuh> when is the new ping going out?
18:38 <frank> April 6 is the target for Android Focus release
18:39 <sunahsuh> okay, awesome
18:39 <frank> yeah, and not like we need those ETL jobs setup immediately
20:56 <robhudson> ooh, we're releasing Focus for android?
20:58 <ddurst> Not sure who to ping about this, but I fixed a failing job (and backfilled the data). If it's merged today, it will be back to running smoothly upon tomorrow's regularly-scheduled run.
20:58 <ddurst> https://github.com/mozilla-services/data-pipeline/pull/245
21:00 <frank> robhudson: yup!