mozilla :: #balrog

17 May 2017
10:31dilekbhearsum: I sent PR for "Schedule for deletion" button
10:31dilekbhearsum: No longer returns "null" :)
12:48dilek bhearsum: I sent PR for "Schedule for deletion" button
12:48dilekbhearsum: No longer returns "null" :)
12:49bhearsumhey dilek, that's great! i've got a couple of meetings this morning i'll check it out in a bit
12:53dilekAwesome ! :))
12:54dilekBut I will say something
12:54dilekbhearsum : I can not use diff for the "update" button
12:55dilekto GetDiff function
12:55dilekWhat can I do for it
14:25bhearsumaksareen: just one minor thing needs a fix, but i think we can get this merged today
14:25bhearsumand in production!
14:32aksareenbhearsum: updated the PR
14:41bhearsumaksareen: looks good now, thanks
14:49cloudops-ansiblebalrog-admin #114: building master-d6dc856d68919c155316ef132a73b4e57389a12c
14:57bhearsumdilek: okay! the updated version looks much better
14:58bhearsumthere's a test failure to fix, but that's the only thing blocking me from merging it
14:58bhearsumi also left a comment about the diff, let me know if you have any questions about that
14:59cloudops-ansiblebalrog-admin #114: master-d6dc856d68919c155316ef132a73b4e57389a12c deployed to stage /cc relud bhearsum
15:00cloudops-ansiblebalrog-web #83: mozilla/balrog:master-d6dc856d68919c155316ef132a73b4e57389a12c deployed to stage /cc relud bhearsum
15:02dilekbhearsum: I see your e-mail
15:03dilekbhearsum: just now I need , it fixed
15:03dilek"===" instead of "=="
15:03bhearsumthat's right, you need to fix that test failure
15:04dilekCan I send it back to PR?
15:04bhearsumwhat do you mean?
15:06bhearsumcan you rephrase? i'm not sure i understand the question
15:06dilekokey, one minute
15:07dilekI just need to do now:
15:07dilekExpected '===' and instead of '=='.
15:14dilekbhearsum: I sent PR
15:15dilekbhearsum: Is there something else I should do?
15:15bhearsumokay, let's see what Taskcluster says! if it passes i'll get it merged
15:15bhearsumdilek: do you want to work on the follow-up of making diffs for change_type="update" work?
15:17dilekbhearsum: backend or frontend ?
15:17bhearsumit will end up touching both
15:17dilekbecause there's no backend endpoint that will allow I to do this
15:17bhearsumyeah, you'll need to create that, and update the frontend to use it
15:18dilekI can not do the backend. But I can do it frontend. But if I do not do both, will not you?
15:19dilek*will not merge you ?
15:19bhearsumthe frontend requires the backend changes, so if you're unable to do the backend you won't be able to do the frontend
15:20bhearsumthere's other frontend work you could do instead though
15:21dilekYes, you are right.
15:27bhearsumdilek: feel free to look over to find another bug to work on
15:28bhearsumi'll be back shortly, time for lunch
15:30dilekbhearsum :) taskcluster is completed
16:32bhearsumand it's merged!
16:35bhearsumrelud: ^ is something up with ansible?
16:38reludbhearsum: the taskcluster build has to push the docker image before the jenkins build starts, and then ansible doesn't notify until it's done, which usually takes about 10 minutes, so probably not
16:38bhearsumah, ok
16:39bhearsumit was just weird to see those joins and parts without messages
16:39reludah i can't see those rn, one moment
16:40bhearsumno worries
16:40dilekbhearsum : thank you ! :) see you again, cheers :)
16:40reludnope, those nicks aren't related to ansible
16:40bhearsumrelud: ah, ok
16:45cloudops-ansiblebalrog-web #84: mozilla/balrog:master-f6e7c578b348c5a0d01f8b9f561603a8b3e8f9c1 deployed to stage /cc relud bhearsum
16:45bhearsumrelud: do you think we can do a reset of the stage db today, and import the latest prod dump?
16:46bhearsumcould be later this week even, come to think of it
16:47cloudops-ansiblebalrog-admin #115: master-f6e7c578b348c5a0d01f8b9f561603a8b3e8f9c1 deployed to stage /cc relud bhearsum
18:23bhearsumrelud, miles: i don't have anything today
18:26reludbhearsum: cool. you ready for a deploy?
18:35reludbhearsum: any migrations?
18:41* bhearsum triple checks
18:45cloudops-ansiblebalrog-web #83: web in prod failed /cc relud
18:45relud^ that's fine
18:46reludi accidentally started a prod deploy on a stack i was aborting
18:46reludso i had to kill it
19:17bhearsumrelud: ah, ok
19:17bhearsumand sorry, no migrations!
19:17bhearsumforgot to say so
19:18reludcool, deploying
19:26cloudops-ansiblebalrog-web #84: mozilla/balrog:master-f6e7c578b348c5a0d01f8b9f561603a8b3e8f9c1 canary deployed to prod /cc relud bhearsum
19:38cloudops-ansiblebalrog-admin #115: master-f6e7c578b348c5a0d01f8b9f561603a8b3e8f9c1 deployed to prod /cc relud bhearsum
19:42bhearsumso far so good
20:26cloudops-ansiblebalrog-web #84: please check balrog canary and promote to full deploy /cc relud bhearsum
20:28* bhearsum pokes web
20:29bhearsumseems ok
20:37cloudops-ansiblebalrog-web #84: mozilla/balrog:master-f6e7c578b348c5a0d01f8b9f561603a8b3e8f9c1 deployed to prod /cc relud bhearsum
20:38bhearsumdatadog looks happy so far, too
20:38bhearsumhm, maybe some increased 500s...
20:38bhearsumneed more data
20:41bhearsummight just be a temporary spike during the swap
20:42bhearsumi think we had one during the last deploy, too
20:42bhearsumrelud: ^ fyi, waiting for a bit more to roll in first, i think things are ok though
20:44bhearsumdatadog seems to be a bit lagged at the moment
20:47bhearsumrelud, miles: something is wrong with web it turns out, big spike in 500s
20:47bhearsumand ec2 throughput dropped
20:48bhearsumvery strange, we didn't change any code for that app AFAICT
20:48bhearsumalso curious is that i don't see an abnormal change in reqs/sec
20:48bhearsumand manual requests to it work fine
20:49bhearsumvery strange that reqs/sec is still so high, but throughput is so low
20:50bhearsumand no exceptions on sentry
20:52milesbhearsum: i'm around, has relud ponged?
20:53milesi'm in a meeting
20:53milesi'll look
20:53reludmiles: i'm out
20:53reludi missed the first ping
20:53reludbhearsum: rolling back?
20:53bhearsumcan i see some logs first?
20:53milesbhearsum: yes, getting them now
20:54reludi don't know why i said 'im out', i'm here!
20:54reludi'm just out of a meeting
20:55bhearsumwishful thinking?
20:55miles"later nerds" -relud, mid 2017
20:56reludbhearsum: what kind of logs are you hoping for?
20:56bhearsumjust some recent web app logs
20:56milesi'm pulling them but slow brt
20:58milesoh wow terrible copypaste
20:58milessorry getting a better one
20:58bhearsumweird, i don't see any 500s there
20:59milesw/ grep don't see them either, could be limited to specific hosts
21:00bhearsumis it possible to find out if we just have one or two bad hosts or something? that would explain the normal req/sec but lowered throughput, i think
21:01bhearsumyou would think that __heartbeat__ would catch that though...
21:01milesthere were some recently terminated hosts
21:02mileswe're still seeing normal amounts of 2xxs
21:02milesadded 2xxs graph in DD
21:03reludi'm using athena to find elb logs on the 500s
21:03bhearsumcpu looks normal too
21:03milesnice relud, sounds good
21:03reludbhearsum: i'd like to roll back
21:03bhearsumok, go ahead
21:03reludbhearsum: if it doesn't fix the problem, then we can roll forward
21:08bhearsumi found some URLs in the s3 elb logs that got 500s, and i can't reproduce the 500s by hand
21:10reludlooks like it's about 4%
21:10reludthe ones i saw had v long user agents
21:10bhearsumsome of the ones i see look pretty legit, eg: firefox 30.0 in the update url and user agent
21:10reludrollback complete, seeing 500s disappear
21:11bhearsumdoes that rule out bad nodes?
21:11reludthe ones i saw are a v small sample, and the user agent is almost legit, except a repeating string at the end, like Facicons or Rpidity
21:16bhearsumi'm looking over this code again, and we didn't make any changes to the web code at all, this is very strange
21:16reludit definitely cleared with rollback
21:16bhearsumi think throughput is going back up now too, which makes sense if we got rid of 500s
21:17bhearsumi'm baffled though
21:17reludbhearsum: what throughput metric are you seeing?
21:17bhearsumavg ec2 network throughput on
21:17reludah, that
21:18bhearsumeverything i can see makes me think we had a few bad nodes in the web pool, is that still a possibility?
21:19reludbhearsum: sure, lemme check
21:19relud<3 athena
21:19bhearsumif there's any python output (like an exception) in app logs that would be insightful too
21:20milesafaik nothing is logged by the app
21:20reludbhearsum: we have a winner!
21:20bhearsumrelud: oh good!
21:20reludbhearsum: all 500 codes served by a single host
21:20reludattempting to locate it
21:20milesooh cool
21:20bhearsumi'm relieved! i would feel awful if i broke a second deploy in a row!
21:23reludbhearsum: no application logs
21:23bhearsumok, if they all came from one host it's probably not going to yield anything anyways....
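The diagnosis above (about 4% of requests failing, all 500s coming from one host) comes down to grouping load balancer log entries by backend host. A minimal sketch of that grouping follows. The `(backend_host, status_code)` record shape, the function names, and the sample data are all assumptions made for illustration; this is not the real ELB log layout or the Athena query relud ran.

```python
from collections import Counter


def is_server_error(status):
    """Return True for HTTP 5xx status codes."""
    return 500 <= status <= 599


def count_5xx_by_host(records):
    """Count 5xx responses per backend host.

    `records` is an iterable of (backend_host, status_code) pairs. This
    shape is an assumption for the sketch, not the real ELB log format.
    """
    return Counter(host for host, status in records if is_server_error(status))


def error_rate(records):
    """Fraction of all records that are 5xx responses."""
    records = list(records)
    if not records:
        return 0.0
    return sum(is_server_error(status) for _, status in records) / len(records)


# Made-up sample: 25 requests, one of them a 500 from a single host.
sample = (
    [("10.0.0.1", 200)] * 12
    + [("10.0.0.2", 200)] * 12
    + [("10.0.0.3", 500)]
)

by_host = count_5xx_by_host(sample)
print(dict(by_host))                 # hosts that served any 5xx
print(f"{error_rate(sample):.0%}")   # overall share of 5xx responses
```

If every 5xx lands on one key of the counter, a single bad node is a more likely cause than the deployed code, which matches what the rollback and roll-forward showed here.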
21:24bhearsumi need to head to the train in ~10min. do you want to roll forward today, or leave it for another day?
21:24bhearsumi can be back online when i get to the train
21:27reludi'll pull the host and roll forward
21:28bhearsumok, thanks
21:28bhearsumi'll pop back in in ~20min to make sure things are okay still
21:54bhearsumrelud, miles: looks like things are pretty happy now. i'm still surprised the average throughput dropped though
21:56reludyep, looks good to me too. average throughput: well, the requests throughput didn't drop, so that's cool
21:57bhearsumyeah, but it suggests we're returning less content to a bunch of requests...
21:57* bhearsum compares vs 1 week ago
21:57reludi think the throughput change might have been having one host do very little, plus having fewer hosts in the ASG
21:58bhearsumoh, is that average per host?
21:58reludwe have a base amount of network io per host, which affects the total
21:59reludand we had 5 fewer hosts post-deploy, because it coincided with traffic falling for the day
21:59reludso i think it's fine
21:59bhearsumso it sounds like you're saying that our total in/out is actually no different than normal
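relud's explanation of the throughput graph can be sketched with a little arithmetic. The model below is an assumption made for illustration only: each host contributes a fixed base amount of network I/O on top of the traffic it serves, and all numbers and function names are hypothetical rather than taken from the Datadog graphs.

```python
def total_throughput(num_hosts, base_io_per_host, request_traffic):
    """Total network I/O: per-host baseline plus traffic from requests.

    Assumed model: every host adds `base_io_per_host` regardless of load.
    """
    return num_hosts * base_io_per_host + request_traffic


def average_throughput(per_host):
    """Mean throughput across hosts, given one value per host."""
    return sum(per_host) / len(per_host)


# Same request traffic, 5 fewer hosts: the total drops by 5 baselines.
before = total_throughput(num_hosts=20, base_io_per_host=2, request_traffic=100)
after = total_throughput(num_hosts=15, base_io_per_host=2, request_traffic=100)
print(before, after)

# One host doing very little pulls the per-host average down.
healthy = average_throughput([10, 10, 10, 10])
one_bad = average_throughput([10, 10, 10, 2])
print(healthy, one_bad)
```

Under these assumptions, both effects lower the graphed numbers without any change in request volume, which is consistent with reqs/sec staying flat during the incident.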
21:59bhearsumprobably i should ignore that graph and look at Total EC2 in/out?
22:05* bhearsum does that
22:06bhearsumgonna drop again now, feel free to call me if anything goes haywire - thanks again for the deploy today