mozilla :: #datapipeline

17 Mar 2017
13:41 <Dexter> jezdez, heya!
13:41 <jezdez> hey Dexter
13:41 <Dexter> jezdez, Houston, we have a problem :(
13:41 <Dexter> I just spawned a cluster on ATMO
13:41 <Dexter> but I'm unable to access it
13:42 <Dexter> the line it suggests to access the cluster is
13:42 <Dexter> "ssh -L 8888:localhost:8888 hadoop@"
13:42 <Dexter> there's nothing after the @
13:42 <jezdez> what's the cluster id?
13:42 <jezdez> the number in the url
13:43 <jezdez> 1217?
13:43 <Dexter> jezdez, jobflow ids -> j-P52Y3MOX3FZ
13:43 <jezdez> gotcha
13:43 <jezdez> seems like the master address job wasn't executed due to api throttling from aws
13:43 <jezdez> in other words, the cluster is there, we just haven't synced the address yet
13:43 <jezdez> lemme look it up real quick
13:44 <jezdez> Dexter: ec2-54-187-17-110.us-west-2.compute.amazonaws.com
13:44 <jezdez> it's just a UI issue basically
13:45 <jezdez> once the job queue has caught up the address should show up
13:45 <Dexter> thanks jezdez
13:45 <Dexter> it's not in the UI though
13:45 <jezdez> yeah, we have to fetch the address from aws
13:45 <jezdez> and do that in an async task on the server
13:46 <jezdez> that task is using the aws api to get the address
13:46 <jezdez> since the aws api is throttling at the moment, it didn't succeed in getting the address
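For context, the lookup that async task performs boils down to a single DescribeCluster call. Below is a minimal sketch using boto3; the helper name is hypothetical and ATMO's real task lives in the telemetry-analysis-service repo.

    import boto3

    def fetch_master_address(jobflow_id):
        # DescribeCluster is the call that gets throttled when many
        # clusters are polled at once.
        emr = boto3.client("emr", region_name="us-west-2")
        cluster = emr.describe_cluster(ClusterId=jobflow_id)["Cluster"]
        # Empty until the master node is up and the call succeeds, which
        # is why the UI can show "hadoop@" with nothing after the @.
        return cluster.get("MasterPublicDnsName", "")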
13:46 <jezdez> gonna check if there are more clusters missing the info
13:47 <mreid> this "DescribeClusters" throttling is a huge pain
13:47 <Dexter> thanks jezdez
13:47 <Dexter> :)
13:48 <jezdez> mreid: I've tried to work around it by making the updating smarter and have a plan in motion to further improve the situation
13:49 <jezdez> this just happened today again since yesterday's deploy had a regression that prevented jobs from running at all (that I fixed earlier today) and we're catching up right now with the updating tasks
13:49 <jezdez> the task system that we use isn't smart enough to drop the tasks that are currently scheduled
13:50 <jezdez> in other words, 1) we'll have the queue flushed at some point and updates will work as normal 2) I have a mid-term plan in motion (https://github.com/mozilla/telemetry-analysis-service/pull/318) to make the queuing smarter
13:50 <mreid> jezdez: can we just ask Amazon to increase the throttling?
13:51 <jezdez> mreid: not that I'm aware of
13:51 <mreid> seems like it should be a trivial operation on their end
13:51 <jezdez> no idea
13:52 <mreid> jezdez: for the record, I'm not blaming any shortcomings on you. This is one of the few Amazon APIs where they rate-limit on something you're likely to want to poll (at least in such a way that it actually affects us)
13:52 <jezdez> yep
13:52 <jezdez> the weird part is that I'm often seeing throttling in the aws console as well
13:52 <jezdez> no worries, not feeling blamed
13:52 <mreid> ok good
13:52 <mreid> :)
13:52 <jezdez> just that.. it's not ideal
13:52 <mreid> agreed
13:52 <mreid> I was thinking the other day "we should write a proxy service that checks once a minute and everything can talk to that"
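The proxy idea mreid floats here would amount to polling AWS once per interval and serving everyone else from a cache. A rough sketch of that pattern, with invented names and a 60-second TTL:

    import time

    import boto3

    _CACHE = {"fetched_at": 0.0, "clusters": []}
    _TTL_SECONDS = 60

    def list_clusters_cached():
        # Refresh from the EMR API at most once per minute; every other
        # caller reads the cached copy instead of hitting ListClusters.
        now = time.time()
        if now - _CACHE["fetched_at"] > _TTL_SECONDS:
            emr = boto3.client("emr", region_name="us-west-2")
            resp = emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])
            _CACHE["clusters"] = resp["Clusters"]
            _CACHE["fetched_at"] = now
        return _CACHE["clusters"]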
13:52 <jezdez> distributed systems are.. ahrd
13:53 <jezdez> hard even
13:53 <mreid> but then... wtf why should we have to bother with that? :)
13:53 <jezdez> mreid: exactly
13:53 <jezdez> if there is a place to increase the api limits I haven't found them
13:53 <mreid> I think you have to actually contact amazon support and ask
14:01 <jezdez> mreid: looking at the support center case history, we've increased the limits for various services many times
14:02 <jezdez> whd: you seem to have processed some of the limit increases
14:02 <jezdez> do you happen to know if it's possible to increase the api access limits for EMR? we're being throttled on atmo for a while now, the more jobs/clusters are being spawned through it
14:06 <mreid> jezdez: another possibility would be to have the bootstrap script write out an "I'm ready" blob to S3 when it's finished and we could poll for that. But again... that's working around the API that does what we want.
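The polling half of that workaround could be as small as checking for a marker object in S3; the bucket name and key layout below are invented for illustration:

    import boto3
    from botocore.exceptions import ClientError

    def cluster_is_ready(bucket, jobflow_id):
        # The bootstrap script would end by uploading ready/<jobflow_id>;
        # the server then polls S3 instead of calling DescribeCluster.
        s3 = boto3.client("s3")
        try:
            s3.head_object(Bucket=bucket, Key="ready/%s" % jobflow_id)
            return True
        except ClientError:
            return False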
14:07 <jezdez> mreid: yeah there are many ways to cross that bridge
14:07 <mreid> none of them very appealing. I think increasing the rate limit is the best option if we can do it
14:08 <jezdez> I'm hesitant to use any non-official ones to make sure this will work for a while
14:08 <jezdez> fwiw, we do a few things right now where we query the api more than we need
14:08 <mreid> yeah, the exponential backoff should help
14:08 <jezdez> that helped a bit until we hit the regression that made the queue fill up again
14:09 <mreid> heh
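The backoff mreid refers to is the usual retry-with-growing-sleep around the throttled call; a sketch of the idea (not ATMO's actual code) might look like:

    import random
    import time

    import boto3
    from botocore.exceptions import ClientError

    def describe_with_backoff(jobflow_id, max_attempts=5):
        emr = boto3.client("emr", region_name="us-west-2")
        for attempt in range(max_attempts):
            try:
                return emr.describe_cluster(ClusterId=jobflow_id)["Cluster"]
            except ClientError as exc:
                if exc.response["Error"]["Code"] != "ThrottlingException":
                    raise
                # Sleep 1s, 2s, 4s, ... plus jitter before retrying.
                time.sleep((2 ** attempt) + random.random())
        raise RuntimeError("still throttled after %d attempts" % max_attempts)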
14:09 <jezdez> right now atmo checks the status of the last run of a scheduled job (describe_cluster) to determine if the job should run
14:10 <jezdez> that's fine in general, but since we currently use a stupid pattern of checking every minute if a job should run, this is kind of backfiring
14:10 <jezdez> the goal of the celery adoption is to have guarantees from its own scheduler so we can scrap the whole checking-every-minute pattern
14:10 <jezdez> which was naive by design, for a simpler implementation
14:12 <jezdez> I think I can have the better solution in place that actually uses the celery scheduler to plan job runs for next week's release
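The Celery-scheduler approach jezdez describes replaces the per-minute poll with tasks that know when their next run is due. A toy sketch of that pattern, not the actual PR 318 code; the broker URL and the 24-hour interval are placeholders:

    from datetime import datetime, timedelta

    from celery import Celery

    app = Celery("atmo_sketch", broker="redis://localhost:6379/0")

    @app.task
    def run_spark_job(job_id):
        # ... launch the EMR cluster for this scheduled job here ...
        # Instead of a cron-style "check every minute" loop, the task
        # schedules its own next run, so nothing has to poll.
        run_spark_job.apply_async(args=[job_id], eta=datetime.utcnow() + timedelta(hours=24))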
14:19 <mreid> jezdez: cool
14:29 <gfritzsche> mlopatka: https://wiki.mozilla.org/Areweyet
14:31 <gfritzsche> favourite: http://arewemetayet.com/
14:32 <mreid> we need a "are we data-driven yet" :)
14:33 <spenrose> NO
14:35 <mlopatka> gfritzsche: This should be the absolute first thing they show during onboarding
14:38 <mreid> spenrose: implementation is easy, at least
14:39 * arewedatadrivenyet NO
14:39 <arewedatadrivenyet> quarterly OKR: achieved
14:39 <mreid> hehe
14:53 <harter> Hey frank, is s3://telemetry-spark-emr-2/bootstrap/telemetry.sh under version control somewhere?
14:57 <rvitillo> harter: https://github.com/mozilla/emr-bootstrap-spark/blob/master/ansible/files/bootstrap/telemetry.sh
16:28 <sunahsuh> jezdez: adding another report of missing master url :/
16:28 <jezdez> sunahsuh: :(
16:29 <jezdez> sentry says the throttling exceptions *are* falling
16:29 <jezdez> so we'll soon be good
16:29 <jezdez> sorry for the issues, people
16:30 <jezdez> sunahsuh: ec2-54-186-154-205.us-west-2.compute.amazonaws.com
16:30 <sunahsuh> oh, sorry, i should have mentioned i have the address :)
16:30 <jezdez> ah, ok :)
16:30 <jezdez> still I'm sorry for the annoyance
16:39 <mlopatka> jezdez can i get an ip address too?
16:40 <jezdez> mlopatka: ec2-54-218-24-131.us-west-2.compute.amazonaws.com
16:40 <mlopatka> thank you!
16:41 <mreid> jezdez > aws console
16:41 <jezdez> heh
16:42 <jezdez> I'm about to leave the office, so please let mreid, sunahsuh or others with aws console access know if this comes up again
16:42 <jezdez> I hope that the throttle is quiting down
16:42 <jezdez> err quieting
16:45 <bmiroglio> with persistent cluster storage, should i now be mindful about everything im dumping to disk, or is it just dot files/dirs?
16:47 <jezdez> bmiroglio: it's everything in $HOME
16:48 <jezdez> so be mindful about it, yeah
16:48 <jezdez> IIRC f r a n k is monitoring the size of the file system behind this
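Since everything under $HOME persists, a quick way to see how much you are carrying around is to tally it up yourself; a small, generic Python snippet:

    import os

    def home_dir_usage_bytes():
        # Sum the size of every regular file under $HOME, i.e. everything
        # that ends up in the persistent cluster storage.
        total = 0
        for root, _dirs, files in os.walk(os.path.expanduser("~")):
            for name in files:
                path = os.path.join(root, name)
                if os.path.isfile(path):
                    total += os.path.getsize(path)
        return total

    print("persisted home dir: %.1f MB" % (home_dir_usage_bytes() / 1e6))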
16:49 <bmiroglio> jezdez: got it, thanks!
17:27 <jezdez> bmiroglio: np! thanks for asking
17:43 <gfritzsche> jezdez: i heard you are the revamped host info service?
17:43 <gfritzsche> jezdez: have one for gfritzsche-telemetry-analysis?
17:44 <mreid> gfritzsche: ec2-54-190-10-87.us-west-2.compute.amazonaws.com
17:45 * mreid is the new jezdez service
17:45 <gfritzsche> cheers jezdez^Wmreid
17:50 <mreid> hehe
17:50 <mreid> gregglind: ping?
17:51 <mreid> gregglind: https://bugzilla.mozilla.org/show_bug.cgi?id=1337927#c2
17:51 <firebot> Bug 1337927 ASSIGNED, mtrinkala@mozilla.com Update schema validation code to handle new doctypes without a code change
17:51 <mreid> ^^ has that "couple of weeks" elapsed? Is it now safe to turn on the shield schema validation and throw away old pings?
17:51 <mreid> old-style shield v2 pings
17:58 <mreid> gregglind: I needinfo'd you on the bug
18:23 <gregglind> cool
18:23 <gregglind> thanks mreid
18:23 <gregglind> I think not. Tab Center is the holdout :(
18:24 <mreid> dang
18:24 <gregglind> that okay?
18:24 <gregglind> Sorry :(
18:24 <gregglind> We can use a 'temporarily stupider' one for shield-studies :(
18:24 <mreid> trink: ^^
18:24 <mreid> we need the multi-version support :(
18:24 <gregglind> well, OR
18:24 <mreid> cuz we need the smart one for outputting to parquet
18:24 <gregglind> that part is just ragged for now.
18:24 <mreid> though I guess we could use a simple json schema and an advanced parquet one
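A "simple json schema" in mreid's sense could just be permissive about the payload while version handling gets sorted out; the field names below are invented for illustration, not the real shield-study schema:

    import jsonschema

    # Loose ingestion-side schema: accept any object-shaped payload so
    # old- and new-style pings both pass, while a stricter schema would
    # drive the parquet output.
    SHIELD_STUDY_LOOSE = {
        "type": "object",
        "properties": {
            "version": {"type": "integer"},
            "payload": {"type": "object"},
        },
        "required": ["payload"],
    }

    def is_valid_ping(ping):
        try:
            jsonschema.validate(ping, SHIELD_STUDY_LOOSE)
            return True
        except jsonschema.ValidationError:
            return False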
18:24 <gregglind> I see I see.
18:24 <gregglind> yes
18:25 <gregglind> This isn't going to be the last time this happens :)
18:25 <mreid> there's a workweek coming up where we get to push on the schema service stuff
18:25 <gregglind> On monday or tuesday, you can feel free to tell me to CRACK THE WHIP AND GET EVERYONE IN LINE OR ELSE
18:25 <gregglind> yes
18:25 <gregglind> I will accept that whip role :)
18:25 <mreid> sounds good
18:26 <mreid> s/good/not terrible/ :)
18:51 <gfritzsche> mdoglio: https://github.com/georgf/probe-scraper
19:32 <trink> mreid: so we will sort it out at the work week I take it
19:48 <mreid> yep
20:11 <ddurst> so.... atmo questions belong in this channel?
20:13 <ddurst> or somewhere else?
20:15 <jezdez> ddurst: either here or on the bugtracker yep
20:16 <jezdez> see link in the footer of atmo
20:17 <ddurst> hmm. I think the issue is I don't know if there's a bug. One of my jobs failed, but I haven't changed anything since it was created, and the logs tell me nothing.
20:17 <ddurst> wasn't sure if there was more that can be looked at than the logs I can see
20:21 <sunahsuh> ddurst: what's the job's name?
20:22 <ddurst> sunahsuh: daily-export-of-previous-day-oom-crash-data-v1-to-s3
20:23 <sunahsuh> okay, i found the cluster for the last run: j-1MTTQNSWFSHCX
20:24 <sunahsuh> `HTTPError: HTTP Error 429: TOO MANY REQUESTS`
20:25 <sunahsuh> that's in the EMR cluster logs
20:26 <sunahsuh> under "Log Files" here: https://pageshot.net/G5CtwFP1bZX2rcrS/us-west-2.console.aws.amazon.com
20:30 <sunahsuh> or, if you go to the https://analysis.telemetry.mozilla.org/jobs/<your job id>/#results tab and download the notebook it should have the output from the failing cell
20:31 <ddurst> ah
20:31 <ddurst> I did not realize that
20:32 <ddurst> so it exceeded the limit for crash-stats.
20:32 <ddurst> huh
20:33 <ddurst> that's a .... first
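A common fix for a notebook cell that hits crash-stats too fast is to retry on 429 and honour Retry-After; a generic sketch, since the actual endpoint and parameters depend on the job:

    import time

    import requests

    def get_with_retry(url, params=None, max_attempts=5):
        for attempt in range(max_attempts):
            resp = requests.get(url, params=params)
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp.json()
            # Honour Retry-After if crash-stats sends it, otherwise
            # back off exponentially: 1s, 2s, 4s, ...
            wait = int(resp.headers.get("Retry-After", 2 ** attempt))
            time.sleep(wait)
        raise RuntimeError("crash-stats kept throttling after %d attempts" % max_attempts)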
20:34 <sunahsuh> look at the job history, it started failing 3/9
20:36 <ddurst> wha?
20:36 <ddurst> where do you see that?
20:37 <sunahsuh> in the admin view for atmo :/
20:37 <ddurst> :\
20:37 <ddurst> this is the first notification I've gotten for it
20:37 <sunahsuh> yeah, notifications have just been turned on :(
20:37 * ddurst lol
18 Mar 2017
No messages