mozilla :: #taskcluster

20 Mar 2017
12:42 <wcosta> standups: what a nightmare to update toolchain
12:42 <standups> Ok, submitted #43917 for https://www.standu.ps/user/wcosta/
14:05 <jhford> jonasfj: hey, please take another look at https://github.com/taskcluster/taskcluster-queue/pull/155 . it should address most of the feedback
14:05 <jhford> and also https://github.com/taskcluster/remote-signed-s3/pull/3
14:09 <jhford> i also commented on https://github.com/taskcluster/taskcluster-queue/pull/157
14:17 <wcosta> standups: build cctools for darwin 11.2.0, take 1.236453e35 https://treeherder.mozilla.org/#/jobs?repo=try&revision=be958cd4726f
14:17 <standups> Ok, submitted #43926 for https://www.standu.ps/user/wcosta/
15:46 <franziskus> hm, do you guys know what's going on here? https://treeherder.mozilla.org/#/jobs?repo=nss&revision=545e059dbb174a644b4f5e457a7ad2c816b49357
15:47 <franziskus> jhford: ^^^
15:48 <franziskus> nwm
15:49 <franziskus> my fault
16:50 <wlach> holy guacamole
16:51 <wlach> something is causing os x to retrigger like a madman
16:51 <wlach> https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=e73241bcb49a399d1de1e512d0334eeece0dcffd&group_state=expanded
16:51 <wlach> hmm, actually more than just osx
16:51 <dustin> just on that push?
16:52 <dustin> I see nothing at that link
16:52 <dustin> oh
16:52 <dustin> now I see it :)
16:52 <dustin> those are buildbot jobs
16:53 <dustin> 14:56:33 INFO - WARNING: Not a supported OS_TARGET for NSPR in moz.build: "". Use --with-system-nspr
16:53 <dustin> not sure why they're being retried though
16:54 <wlach> right
16:56 <wlach> jmaher thinks it's pulse_actions
16:56 <garndt> does a make check failure automatically trigger a retry in buildbot?
16:57 <wlach> I suspect that error message is a red herring
16:57 <wlach> (like most error messages)
17:01 <dustin> haha
17:05 <bstack> wlach: is something wrong with pulse_actions? I'm technically trying to take care of that for the time being
17:06 <bstack> I definitely saw an uptick in the standard error logs from papertrail in the last day or so
17:16 <wlach> bstack: I suspect so, if you could look at some of the jobs in the treeherder link just above it might prove pulse_actions as the culprit
17:16 <bstack> I'm not quite sure what to do given the fact that it is broken though
17:17 <bstack> wlach|mtg: I don't see any jobs in that link above?
17:17 * bstack is doing something wrong I'm sure
17:17 <bstack> oh... it just took forever to load them
17:18 <bstack> wow, that is fun to look at
17:18 <bstack> jmaher: what makes you think this is pulse_actions? I can try to help figure out what's wrong.
17:19 <jmaher> bstack: well, something keeps retriggering all the osx talos jobs on a specific revision almost daily, 6 times/job
17:19 <bstack> although I'm pretty confused by pulse_actions too. I'm just trying to take care of it in a.rmenzg's absence
17:19 <jmaher> that looks to me like some form of trigger all talos jobs
17:19 <bstack> are those buildbot jobs at the moment?
17:19 <garndt> jmaher: it looks like a lot of the builds were retried, which then probably recreates all those test jobs?
17:21 <jmaher> garndt: oh, builds are retried as well?
17:21 <jmaher> bstack: yes, all buildbot jobs
17:21 <garndt> from the one TH link I looked at, lots of buildbot builds were retried
17:21 <jmaher> just osx ones, right?
17:22 <garndt> no
17:22 <garndt> windows too
17:24 <jmaher> looking at https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=e73241bcb49a399d1de1e512d0334eeece0dcffd&group_state=expanded&selectedJob=84364794, I see we keep failing unittests in the build and retriggering
17:25 <bstack> hmm, pulse_actions is running out of memory on heroku
17:25 <bstack> probably unrelated, but that seems like an issue too
17:26 <jmaher> bstack: garndt: this is the nasty one: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=13b48d5e00f4b98718f2a16cac1b2ae2bc7c00c1&selectedJob=72282176&group_state=expanded
17:29 <dustin> boy I'm glad I have a hidpi screen to see that view
17:29 <bstack> oh yeah, this seems related to pulse_actions. https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=13b48d5e00f4b98718f2a16cac1b2ae2bc7c00c1&selectedJob=74574362&group_state=expanded has been running since the 4th of Feb?
17:30 <bstack> I wonder if that's just been retrying every time pulse_actions restarts for a month
17:30 <jmaher> bstack: :(
17:31 <bstack> I seem to remember something about jobs getting stuck in a queue and that they can be cleared out
17:31 <bstack> let me check my bugs/email/etc
17:33 <bstack> while we're looking at pulse_actions, I'm going to do bug 1325657 today. I've been watching the logs for a week and there's something trying to create jobs very occasionally, but it always fails. I'm not sure what process is doing it.
17:33 <firebot> https://bugzil.la/1325657 ASSIGNED, bstack@mozilla.com pulse_actions - Remove powerful scopes once treeherder handles tc creds
17:33 <jmaher> bstack: a bunch of ghosts!
17:33 <bstack> this retry-every-night thing looks like it's related to bug 1324432
17:33 <firebot> https://bugzil.la/1324432 NEW, bstack@mozilla.com pulse_actions - We should acknowledge messages that fail due to an error
17:34 <bstack> perhaps it triggers some jobs and then fails on a later one?
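(For reference, bug 1324432 above describes exactly this failure mode: a message whose handler throws is never acked, so the broker requeues it and redelivers it on every reconnect, replaying the same retriggers for weeks. Below is a minimal sketch of the ack-on-error pattern being discussed, using pika; the queue and handler names are illustrative, and pulse_actions consumes Pulse through other libraries, so this only shows the shape of the fix.)

```python
import logging

import pika

log = logging.getLogger("pulse-consumer-sketch")


def handle_action(body):
    """Placeholder for the real action handler; may raise."""
    raise NotImplementedError


def on_message(channel, method, properties, body):
    try:
        handle_action(body)
    except Exception:
        # Log the failure, but still ack below: an unacked message is
        # requeued when the connection drops and replayed on every restart.
        log.exception("handling failed; acking anyway to avoid redelivery")
    finally:
        channel.basic_ack(delivery_tag=method.delivery_tag)


if __name__ == "__main__":
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="example-actions", durable=True)
    channel.basic_consume(queue="example-actions", on_message_callback=on_message)
    channel.start_consuming()
```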
17:34 <bstack> yeah, the ghosts are definitely haunting us today
17:35 <jmaher> our osx load will drop by 10% when this is fixed
17:36 <bstack> oh wow, ok
17:36 <jmaher> bstack: I am just making that number up; it will probably be noticeable though
17:38 <bstack> I'm going to restart the service and we can see if it tries to trigger more jobs
17:40 <jmaher> bstack: good idea
17:40 <jmaher> we could maybe flush the queue/database if so
17:41 <bstack> aha, both times it is trying to add jobs to resultset 161444
17:41 <jmaher> oh, good find
17:44 * bstack looks through the code. one sec
17:45 <bstack> jmaher: do you have access to the New Relic thing pulse_actions is hooked up to?
17:47 <jmaher> bstack: I do not
17:47 <bstack> ah, ok
17:47 <bstack> I don't think it will be that useful for this anyway, beyond figuring out what's causing this to fail
17:47 <jmaher> or at least I am not aware I have access
17:48 <bstack> it seems to silently fail somewhere in mozilla_ci_tools' trigger_range function
17:48 <bstack> we see this log line https://github.com/mozilla/mozilla_ci_tools/blob/master/mozci/mozci.py#L508
17:49 <bstack> and https://github.com/mozilla/mozilla_ci_tools/blob/master/mozci/mozci.py#L512
17:49 <bstack> but not anything after that
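(A rough way to surface whatever is being swallowed between those two log lines: turn mozci's logging up to DEBUG and wrap the call site so nothing can fail silently. This assumes trigger_range is importable as mozci.mozci.trigger_range, per the file linked above; the real signature is deliberately left as *args/**kwargs rather than guessed.)

```python
import logging

from mozci import mozci  # the mozilla_ci_tools package

logging.basicConfig(level=logging.DEBUG)  # show mozci's own debug logging
log = logging.getLogger(__name__)


def noisy_trigger_range(*args, **kwargs):
    """Wrap trigger_range so any swallowed exception is logged and re-raised."""
    try:
        return mozci.trigger_range(*args, **kwargs)
    except Exception:
        log.exception("trigger_range raised; args=%r kwargs=%r", args, kwargs)
        raise
```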
17:49 <jmaher> bstack: so then how do the jobs get scheduled?
17:50 <bstack> I'm not actually sure. still trying to see how
17:50 <bstack> it's possible this is all a red herring
17:50 <jmaher> oh true
17:53 <bstack> oh, resultset 161444 is another push? https://treeherder.mozilla.org/api/project/mozilla-inbound/resultset/?id=161444
17:54 <bstack> womp
17:54 <jmaher> that is jan 18th
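(The resultset endpoint pasted above can be checked from a shell in a few lines. The "results", "revision" and "push_timestamp" field names are assumptions based on the Treeherder API of that era.)

```python
from datetime import datetime, timezone

import requests

URL = "https://treeherder.mozilla.org/api/project/mozilla-inbound/resultset/"

resp = requests.get(URL, params={"id": 161444}, timeout=30)
resp.raise_for_status()
for push in resp.json().get("results", []):
    # Print each push's revision and when it landed.
    when = datetime.fromtimestamp(push["push_timestamp"], tz=timezone.utc)
    print(push["revision"], when.isoformat())
```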
17:54 <jmaher> wow: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=98ee7f4c3b1acb2ec084e5f4fb20a0a493783aae
17:54 <jmaher> look at the 'sch' at the bottom
17:55 <bstack> oh no
17:56 <bstack> "21198 mins overdue, typically takes ~3 mins"
17:56 <bstack> you don't say
17:56 <jmaher> heh
17:58 <bstack> I remember the queue can be cleared somehow
17:58 <bstack> I just don't recall how somebody would do it
17:58 <bstack> I wonder if there's a runbook somewhere
18:00 <bstack> https://wiki.mozilla.org/Auto-tools/Projects/Pulse_actions doesn't really say what to do
18:00 <jmaher> we could email arm.en
18:03 <dustin> maybe just drop the queue in pulse?
18:03 <bstack> can I do that if I don't own it?
18:05 <bstack> I guess that one also isn't really causing an issue beyond making a lot of Sch jobs on that one push
18:05 <bstack> I suspect there's an infinite timeout somewhere in there and that's why the job is never finishing
18:05 <bstack> or infitite retries rather
18:06 <jmaher> the gift that keeps on giving
18:06 <bstack> I really need to get spell check turned on in this irc thing
18:06 <bstack> I guess m.cote could drop a queue for us
18:07 <jmaher> yeah, true
18:09 <bstack> that seems like a good next step: drop the queues this is using and see if the retriggers stop
18:10 <bstack> fallout will just be that whoever was doing a legitimate retrigger will need to do it again
18:11 <jmaher> it might be an ok price to pay
18:11 <bstack> ReadTimeout: HTTPSConnectionPool(host='secure.pub.build.mozilla.org', port=443): Read timed out. (read timeout=10)
18:11 <bstack> is secure.pub.build.m.o still a thing? I see that in the logs here and there while looking around
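(That ReadTimeout is the requests library giving up after 10 seconds. A small sketch of how such a call could be made more forgiving, with a longer read timeout and a few retries; the buildapi path is only illustrative.)

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures a few times with exponential backoff.
session.mount(
    "https://",
    HTTPAdapter(max_retries=Retry(total=3, backoff_factor=2,
                                  status_forcelist=[502, 503, 504])),
)

resp = session.get(
    "https://secure.pub.build.mozilla.org/buildapi/self-serve",  # illustrative path
    timeout=(10, 60),  # 10s to connect, 60s to read instead of the default 10s
)
resp.raise_for_status()
```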
18:12 <bstack> mcote: do you have a minute to drop a couple queues for us?
18:13 <mcote> sure
18:14 <bstack> oh wait. sorry, spoke too soon. I need to figure out which ones should be dropped
18:15 <bstack> the exchanges are exchange/treeherder/v1/resultset-actions, exchange/treeherder/v1/job-actions, exchange/treeherder/v1/resultset-runnable-job-actions, and exchange/build/normalized
18:15 <bstack> but that's not what we want here, right
18:15 <bstack> aha, the pulse user is pulse_actions
18:15 <bstack> if it has any durable queues, can we drain them?
18:16 <jmaher> I don't see why not
18:19 <bstack> mcote: ^ whenever you get a chance
18:23 <mcote> sorry, what exactly am I doing? :)
18:24 <bstack> haha sorry. dropping the pulse_actions queue
18:24 <bstack> if that's a thing we can do. a.rmen is out still, so we can't do it ourselves
18:25 <mcote> what's the full queue name?
18:25 <mcote> I don't see any queues under queue/pulse-actions
18:26 <mcote> oh, pulse_actions
18:26 <mcote> btw you can share pulse users now
18:26 <mcote> though armen would have to add you, since he's the only owner of pulse_actions right now
18:26 <mcote> bstack: anyway plz confirm: I am deleting queue/pulse_actions/pulse_actions correct?
18:27 <bstack> yeah. I think he's been out since before we could share
18:27 <mcote> aha
18:27 <bstack> mcote: that sounds good to me. ty
18:27 <mcote> done
18:27 <mcote> oh, it's apparently automatically recreated heh
18:28 <bstack> nice. it didn't try to create that job again
18:28 <mcote> although it doesn't seem to be filling up again
18:28 <bstack> ty mcote. we should be good here now
18:28 <bstack> at least w.r.t. pulse
18:28 <mcote> cool
18:28 <mcote> haha
18:28 <mcote> that's as far as I can help you ;)
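(For anyone holding the pulse_actions credentials, "dropping the queue" is just an AMQP purge or delete against the Pulse broker. A sketch using pika follows; the host, port and vhost are the usual Pulse defaults, the password is a placeholder, and, as seen above, a running consumer that declares the queue will simply recreate it empty.)

```python
import ssl

import pika

QUEUE = "queue/pulse_actions/pulse_actions"

params = pika.ConnectionParameters(
    host="pulse.mozilla.org",
    port=5671,                      # Pulse requires TLS
    virtual_host="/",
    ssl_options=pika.SSLOptions(ssl.create_default_context()),
    credentials=pika.PlainCredentials("pulse_actions", "<password>"),
)

connection = pika.BlockingConnection(params)
channel = connection.channel()
channel.queue_purge(QUEUE)      # drain the backlog but keep the queue
# channel.queue_delete(QUEUE)   # or drop the queue entirely, as mcote did
connection.close()
```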
18:30 <bstack> jmaher: maybe now we wait and see if the mac talos jobs get triggered again tonight, and if they do it's probably not pulse_actions?
18:36 <jmaher> bstack: yeah, if they do, I will really call this a ghost
18:37 <bstack> hehe, perfect
19:51 <Callek> garndt: what's the chance I can get https://tools.taskcluster.net/aws-provisioner/#gecko-1-b-macosx64/ bumped in capacity slightly so my *explicitly* low-work manual test can run?
19:51 <Callek> like 1 or 2 tasks more capacity
19:52 <Callek> as in it says 10 running and 0 capacity for anything more..
19:52 <Callek> (despite 1 actually pending)
19:53 <dustin> it says there aren't any tasks pending
19:57 <dustin> huh, why are there 68 gecko-misc tasks pending??
19:59 <dustin> oh, failing deadline-exceeded :(
20:01 <dustin> hmm
20:02 <dustin> "[alert-operator] diskspace threshold reached"
20:04 <dustin> now we get to see if https://docs.taskcluster.net/reference/integrations/aws-provisioner/api-docs#terminateAllInstancesOfWorkerType works...
20:15 <dustin> it doesn't :)
20:15 <dustin> anyway, manually terminated - hopefully that will work now (I added a custom diskSpaceThreshold)
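(The endpoint dustin tried is named in the docs link above. A sketch of the same call from the Python taskcluster client of that era, assuming it exposes an AwsProvisioner class with generated methods and that the caller holds the relevant aws-provisioner scopes.)

```python
import taskcluster

provisioner = taskcluster.AwsProvisioner({
    "credentials": {
        "clientId": "...",       # placeholder; needs aws-provisioner scopes
        "accessToken": "...",
    },
})

# Terminate every instance of the worker type so replacements come up with
# the updated definition (e.g. the custom diskSpaceThreshold mentioned above).
provisioner.terminateAllInstancesOfWorkerType("gecko-1-b-macosx64")
```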
20:29 <dustin> bstack: is there any reason not to deploy tc-gh to production now?
20:30 <bstack> I don't think so
20:30 <bstack> what is changing?
20:32 <bstack> oh the docs change
20:32 <bstack> ++
20:34 <dustin> just renaming github -> taskcluster-github yeah
20:34 <dustin> thanks