mozilla :: #taskcluster

 
27 Jun 2017
17:18 <jonasfj> pmoore|away: https://github.com/taskcluster/taskcluster-worker/pull/292
17:42 <dustin> https://crates.io/crates/hawk <-- we've released a Hawk implementation in rust
17:42 <dustin> next step: rust TC client
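
Hawk, for reference, is an HMAC-based HTTP request-signing scheme, and it is what TaskCluster clients authenticate with. Below is a minimal sketch of the core MAC computation, in Python rather than the Rust crate above, assuming the Hawk 1.1 normalized-string layout; the credentials and request values are made up:

    import base64
    import hashlib
    import hmac

    def hawk_mac(key, ts, nonce, method, resource, host, port, payload_hash="", ext=""):
        """Compute a Hawk 1.1 request MAC: HMAC-SHA256 over the normalized string."""
        normalized = "\n".join([
            "hawk.1.header",
            str(ts),
            nonce,
            method.upper(),
            resource,
            host.lower(),
            str(port),
            payload_hash,
            ext,
        ]) + "\n"
        digest = hmac.new(key.encode(), normalized.encode(), hashlib.sha256).digest()
        return base64.b64encode(digest).decode()

    # Hypothetical credentials -- a real TC client would use its clientId/accessToken.
    mac = hawk_mac("secret-access-token", 1498584000, "Ygvqdz", "GET",
                   "/v1/pending/releng-hardware/gecko-t-osx-1010",
                   "queue.taskcluster.net", 443)
    print(mac)
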
17:48 <aki> i would consider porting scriptworker to rust
18:17 <garndt> wooooo
18:17 <garndt> \o/ dustin
18:51 <Aryx> jhford: hi, can you check if the scheduler is okay? the gecko-t-* is not at max capacity and has >1k pending tasks
18:58 <dustin> garndt: ^^
19:10 <Alex_Gaynor> Hi all, what's the right place to see the depth of the queue for macOS builds?
19:39 <garndt> Alex_Gaynor: if you look at gecko-{1,2,3}-b-macosx64 here: https://tools.taskcluster.net/aws-provisioner/
19:39 <Alex_Gaynor> garndt: perfect, thanks!
19:40 <garndt> np
19:48 <Alex_Gaynor> garndt: Hmm, on closer examination I'm not sure these are the right counts, the sum of the "Pending Tasks" for "macos" worker types is 15, but I've got like 50 jobs in taskcluster by myself :-)
19:58 <garndt> Alex_Gaynor: could you give me some task IDs for ones you see as still pending? The link I gave you, as well as a separate DB I maintain, are showing only 3 pending right now for those worker types
19:58 <Alex_Gaynor> garndt: hmm, I'm not sure how to go from a treeherder build to a taskcluster task ID. https://treeherder.mozilla.org/#/jobs?repo=try&revision=d1354b8b5c298da8d832c6601d737a84831a2e53&group_state=expanded is the treeherder build
19:59 <garndt> oh, not a build, but tests
19:59 <garndt> the build is done there
20:00 <garndt> so for the worker types (gecko-t-osx-1010) for those test jobs, it appears there is quite the backlog
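
As an aside on the treeherder-to-task-ID question above: one way to go from a try revision to its TaskCluster tasks is to look up the push's decision task in the index and then list its task group. This is a hedged sketch only; the index namespace gecko.v2.try.revision.<rev>.firefox.decision is assumed from the gecko conventions of the time and may differ:

    import requests

    REV = "d1354b8b5c298da8d832c6601d737a84831a2e53"  # the try revision from the Treeherder link above

    # Assumed index route for the push's decision task (gecko.v2 namespace conventions).
    idx = requests.get(
        "https://index.taskcluster.net/v1/task/"
        f"gecko.v2.try.revision.{REV}.firefox.decision"
    )
    idx.raise_for_status()
    decision_task_id = idx.json()["taskId"]

    # The decision task's taskGroupId is the group all jobs of the push belong to.
    task = requests.get(f"https://queue.taskcluster.net/v1/task/{decision_task_id}").json()
    group = requests.get(
        f"https://queue.taskcluster.net/v1/task-group/{task['taskGroupId']}/list"
    ).json()

    # Pagination (continuationToken) is ignored here for brevity.
    for t in group["tasks"]:
        print(t["status"]["taskId"], t["status"]["state"], t["task"]["metadata"]["name"])
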
20:00 <garndt> wcosta: ^^ are those your jobs you said you retriggered?
20:00 <Alex_Gaynor> That's what I was afraid of :-) Where would I go to see that backlog?
20:00 <garndt> https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010
20:01 <garndt> unfortunately the pending counts for physical hardware machines do not have a public UI yet
20:01 <Alex_Gaynor> JSON is a wonderful UI
20:01 <garndt> :)
20:02 <Alex_Gaynor> thanks much!
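
Since those hardware pending counts are only exposed as JSON for now, a small polling script against the queue endpoint linked above works as a stopgap UI (the worker type and interval here are just examples):

    import time
    import requests

    URL = "https://queue.taskcluster.net/v1/pending/releng-hardware/gecko-t-osx-1010"

    # Print the backlog every minute; the response carries provisionerId,
    # workerType, and pendingTasks.
    while True:
        data = requests.get(URL).json()
        print(f"{data['workerType']}: {data['pendingTasks']} pending")
        time.sleep(60)
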
20:02 <garndt> Alex_Gaynor: np, I'm sorry for the backlog, I'm going to try to pinpoint where they're coming from (I think I know though)
20:03 <garndt> since this is a fixed hardware pool, it's rather difficult to churn through lots of jobs and so if someone pushes a lot of jobs it can starve out other pushes until they can complete
20:04 <Alex_Gaynor> garndt: yeah, totally understand *shakes fist at apple, let us have cloud!*
20:04 <garndt> they certainly do make scaling difficult
20:49 <Aryx> dustin: hi, do you know more about the 'cratertest' job type? there are 14k pending
20:54 <garndt> Aryx: I'm looking into an issue where we weren't getting instances, which caused backlogs on most of our worker types. Workers are now being spawned and that pool of jobs should clear out soon enough
20:54 <jhford> Aryx: https://github.com/brson/taskcluster-crater
20:54 <Aryx> garndt: thanks, the other instance types look back to normal, or on the way there
20:55 <Aryx> jhford: thanks
20:55 <jhford> np
20:55 <garndt> jhford: I just filed a bug, something weird was going on.
20:56 <jhford> garndt: i don't see it, could you link me?
20:57 <garndt> https://bugzilla.mozilla.org/show_bug.cgi?id=1376564
20:57 <firebot> Bug 1376564 NEW, nobody@mozilla.org Provisioning iteration failed - instances not spawning
20:58 <garndt> basically a lot of worker types were waiting for instances, had a lot of tasks in the backlog, but no instances were being spawned, and not many spot requests opened at any given time. rebooted the provisioner, and within 30 minutes alone we got 700 instances in us-west-2 (things are still spawning)
20:58 <garndt> link to the signalfx graph is in the bug where you can see things dramatically drop once we got instances again
20:58 <jhford> that's strange
20:59 <garndt> this is also related to the deadmansnitch email we all got too
21:00 <jhford> yes, i saw that, which is why i started looking into this as well
21:01 <jhford> so the deadman's snitch thing is not a clear indication that the provisioner isn't working, just that it hasn't finished an iteration in a long time. for instance, I just got an email and it's still running an iteration
21:02 <garndt> right, i was looking into this before the email, just another datapoint that something was going weird
21:06 <jhford> so one thing to note is that the provisioner ui is still not updated until after a successful iteration
21:07 <jhford> i was thinking that since we now have very quick iterations, we should limit ourselves to something like 200 spot requests per iteration at most
21:08 <jhford> the EC2-Manager has *much* better state endpoints
21:10 <garndt> are the endpoints documented on our docs site?
21:13 <jhford> garndt: any objections to this landing? https://irccloud.mozilla.com/pastebin/cROiSWku/
21:13 <jhford> garndt: not yet, i haven't added the api to manifest.json yet
21:13 <garndt> I don't have any immediate objections to that change; I'm just not exactly sure what the outcome of something like this would be
21:18 <jhford> the change would be that when we have >200 instances to request per region, we know that we're going to submit the first 200 and start over. this has two main effects: first, we get more frequent change calculations; second, more frequent state updates
21:18 <garndt> ok
21:18 <garndt> can you please make an action item to add this to the manifest.json too, so we can get the documentation published and get a new tc-client that we can use if needed
21:19 <jhford> when we have huge clumps of instances, we could get into a state where instances from earlier in the iteration have already started working through the queue but we still need to finish the iteration
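
A rough illustration of the proposed cap: the real provisioner is a Node service, so this Python sketch only shows the shape of the idea, and the helper names (pending_capacity_requests, submit_spot_request) are hypothetical:

    MAX_SPOT_REQUESTS_PER_ITERATION = 200  # the "something like 200" suggested above

    def provision_iteration(worker_types, submit_spot_request):
        """Submit at most 200 spot requests, then end the iteration.

        Leftover demand is picked up on the next (now much more frequent)
        iteration, after the change calculation has been redone against fresh
        state -- so capacity that already came online is not double-requested.
        """
        submitted = 0
        for wt in worker_types:
            for request in wt.pending_capacity_requests():
                if submitted >= MAX_SPOT_REQUESTS_PER_ITERATION:
                    return submitted  # stop early; recalculate next iteration
                submit_spot_request(request)
                submitted += 1
        return submitted
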
21:19 <garndt> ah ok
21:20 <garndt> yea, more frequent state changes would be awesome
21:20 <jhford> so we don't *have* to use the ec2-manager endpoints. one of the tidying-up things I'd planned was to file an issue against tools.tc.net about using the new endpoints, but I hadn't had a chance to yet
21:21 <garndt> ok, well if the provisioner is updating the UI more frequently, that's ideal, but if there is a better way to report this state that's cool
21:22 <garndt> you just said it has much better state endpoints, so I figured I should be using that
21:22 <jhford> it does, and we should!
21:23 <jhford> well, not more accurate, they're from the same datasource; it's just that the ec2-manager endpoints give a *this second* view of our idea of state, whereas the aws-provisioner only gets the last iteration's view
21:23 <jhford> i could change the provisioner's /state/ endpoints to call ec2-manager, instead of storing the state from the ec2-manager at the end of the iteration
21:24 <garndt> k
21:24 <garndt> well I think as long as iterations complete much quicker, it's not an urgent thing to do right this minute, but something to do sometime soon
21:24 <jhford> yep
21:24 <jhford> agreed
21:25 <dustin> Aryx: don't worry about cratertest
21:26 <dustin> they dump 10k's of tasks all at once, so pending is expected
21:26 <Aryx> ok
21:26 <garndt> it's going down somewhat quickly
21:26 <garndt> down to 5k now
21:26 <Aryx> thank you
21:30 <jhford> fwiw, if you'd like to use the ec2-manager endpoints just to check on a worker quickly, you can use https://ec2-manager.taskcluster.net/v1/worker-types/gecko-t-linux-medium/stats and https://ec2-manager.taskcluster.net/v1/worker-types/gecko-t-linux-medium/state
21:31 <jhford> (replacing the gecko-t-linux-medium with whichever worker type is more appropriate)
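
A quick sketch of pulling both of those ec2-manager endpoints for a given worker type; the response shapes are not described in this channel, so the script just pretty-prints whatever JSON comes back:

    import json
    import sys
    import requests

    # Worker type can be passed on the command line; default matches the example above.
    worker_type = sys.argv[1] if len(sys.argv) > 1 else "gecko-t-linux-medium"
    base = f"https://ec2-manager.taskcluster.net/v1/worker-types/{worker_type}"

    for endpoint in ("stats", "state"):
        resp = requests.get(f"{base}/{endpoint}")
        resp.raise_for_status()
        print(f"--- {endpoint} ---")
        print(json.dumps(resp.json(), indent=2))
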
21:38 <garndt> ah ok
21:58 <maja_zf> dustin: hi! does anyone have taskcluster stickers? I want!
21:59 <garndt> Yup. We're in the market street room in parc 55 and have a bunch
22:01 <maja_zf> yay! I will visit at some point from Hilton.
 