mozilla :: #taskcluster

12 Jul 2017
00:50sfinker, how do you get a one-click loaner these days? I did create interactive task, it gave me a spinner but also said it was running and ready, but it never gave me an SSH window or any other options, until the job finished and ended.
00:50sfinkand it no longer has live.html or whatever it was called under run artifacts
00:53sfinkmy latest is though it's still pending atm
00:54sfinkand again, it's now running but not giving me a way to shell into it
00:59sfinkooh! It popped up an option to log in!
13:21gbrowndustin: I didn't mean to ni you on that bug without explanation. sorry! (you correctly interpreted my intent though: was wondering if you thought putting tooltool on the docker image was a good idea)
13:21dustinhaha, I often snark when I'm flagged ni? on a bug without a "?" in it, but it was clear enough this time
13:23gbrowncool. thanks.
13:54Callekdustin: why is failing? a link to the log actually seems to link me to the live log and has no error output at the end...
13:54Callek(in fact live log seems still going on...)
13:58Callekahhh damn
13:58* Callek curses at slow l10n stuff
13:58CallekI'm surprised we're not killing the task though
13:58dustinit only gets killed on the next worker reclaim
13:58Callekahhh ok
14:03Callekdustin: grenade: so looks like our windows worker is *not* killing properly
14:03Callekfigured it was worth sharing :-)
14:03dustinkilling what?
14:04Callekeven better
14:04Callekdustin: after deadline exceeded
14:04dustinit looks like the task finished..
14:04grenadepmoore: ^
14:04dustinso there was no reclaim
14:05dustinin the second pastebin
14:05dustinin the first, yes, it looks like "killing task" didn't work
14:05dustinor maybe took a while?
14:07Callekdustin: those are both the same log, I just didn't push it all to one pastebin due to length limits
14:08Callekmy point is that we tried to kill the task, it didn't work, then we happily let the tc worker itself do uploading of artifacts, as if nothing was wrong
14:08dustinyeah :/
14:08dustinI think that might be a pmoore question then
14:08Callekfull log is
14:09Callekand if I tried to work out the reason for the exception based on that log and the uploaded artifacts, I'd most certainly not have seen the deadline-exceeded message and the attempt at killing
14:13Callekwith 6 chunks, total runtime was ~1 hour 20 min
14:14Callek(per chunk)
14:50* pmoore spots his name
14:52pmooreso if i understand correctly, the question is, why is the task not terminating, when it expires?
14:52pmooreor rather, when the deadline is exceeded
14:53pmooreis it possible that deadline is being used to implement a maxRunTime behaviour?
14:53pmoorein other words, deadline is typically like a day away
14:54pmooreit is a point in time when it no longer makes sense to even run the task
14:54pmooremaxRunTime is to limit the time of an individual task, once it has started
14:54pmoorequeue handles deadline, worker handles maxRunTime
14:55pmoorethe worker can only discover that a task's deadline was exceeded when it either goes to reclaim the task, or resolve it
14:55pmooreCallek: ^
14:55pmooreso if we want tight controls around how long the task can run for, it will be safer/better to implement that via maxRunTime
14:55pmooretrue, worker handling of deadline-exceeded isn't very optimal
14:55pmoorebut it should be a rare thing that the deadline is exceeded while a task is running, as the deadline is typically on the order of days away
14:56pmooreand task execution is typically < an hour
14:56pmooreso it is unlikely for the two to coincide under normal conditions
14:56pmooreso the worker hasn't been optimised to handle this use case, and possibly still continues even when the task has exceeded its deadline
14:57pmooreuploading artifacts, imho, is not necessarily a bad thing, even if the deadline was exceeded, as it shares state about the task execution
14:57pmoorethe task status is exception, so downstream tasks that would try to use the task artifacts should not get triggered
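pmoore's split above (the queue enforces deadline, the worker enforces maxRunTime) can be sketched roughly like this; the field names follow the Taskcluster task schema, but the values and the `should_kill` helper are purely illustrative:

```python
from datetime import datetime, timedelta, timezone

now = datetime(2017, 7, 12, 14, 0, tzinfo=timezone.utc)

task = {
    "created": now.isoformat(),
    # queue-side: an absolute point in time after which it no longer
    # makes sense to even run the task (typically ~a day away)
    "deadline": (now + timedelta(days=1)).isoformat(),
    "payload": {
        # worker-side: cap on a single run once it has started, in seconds
        "maxRunTime": 3600,
    },
}

def should_kill(started_at, current, task):
    """Worker-side check: a run is killed once it exceeds maxRunTime,
    independently of the (usually much later) deadline."""
    return (current - started_at).total_seconds() > task["payload"]["maxRunTime"]

# a 50-minute run is fine; a 2-hour run would be killed by the worker
assert not should_kill(now, now + timedelta(minutes=50), task)
assert should_kill(now, now + timedelta(hours=2), task)
```

This is why setting maxRunTime is the safer way to bound a run: the worker checks it continuously, whereas deadline-exceeded only surfaces at reclaim or resolution time.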
14:57dustinpmoore: I think the question is, when the reclaim failed, why did generic-worker say it was killing the process but the process didn't die?
14:57pmoorebecause it was an evil process?
14:57dustinthat&#39;s one possibility
14:58pmooreno idea
14:58pmoorei think it is something we can look into
14:58pmoorebut i suspect it is not a common use case, so might not mandate a large investment of effort (as probably maxRunTime should have been used for this effect)
14:58pmoorebut agreed, the behaviour is not optimal
14:58dustinwell, not a use case, but definitely a bug :)
14:59pmooretask is resolved as exception, so i see this more as an inefficiency
14:59dustinit's a waste of resources to keep running a task - especially one that might be in an infinite loop
14:59pmooreagreed about the infinite loop
14:59pmooreit can also be a waste of resources to spend X man hours implementing something which saves Y dollars, when X man hours cost more than Y dollars :)
14:59dustinCallek: btw, status = exception should usually lead you to "run status" to find the reason, rather than the log
15:00pmoorebut yes, i agree that it looks like this task was not successfully killed
15:00pmooreand we might want to look into that
15:00pmoorebut setting maxRunTime correctly should solve it for 99% of the time
15:01dustindoes the kill after maxRunTime work?
15:02dustinis it a different code-path?
15:02pmoorejust to be clear, i don't want it to appear like i don't care - i do care, i'm just trying to highlight that i think this "bug" is more of an inefficiency that probably isn't highly impactful
15:02pmoorei do care a lot about our services
15:02pmooreand want them to work smoothly :)
15:03dustinme too :)
15:03pmoore(i realise it might not have looked like that from my comments)
15:04dustinI'll file a bug.. I think we're reasonably confident this is low-impact (so we let a few tasks continue executing until they die naturally)
15:04pmoore(and the other thing i haven't stated is that since we intend to migrate from generic-worker to taskcluster-worker, there is less focus on fixing minor issues that slow the path to upgrade)
15:04* Callek reads scrollback
15:04dustinand we can increase priority if it turns out to be a big hit
15:04dustingood point
15:06firebotBug 1380342 NEW, [generic-worker] tasks aren't killed after deadline-exceeded
15:06pmoorethanks dustin!
15:06pmooreand Callek
15:06Callekpmoore: On windows I've seen infinite loops and deadlocks MANY times before, so anything that expects to kill stuff should actually kill stuff, with escalating levels of enforcement. Additionally, if we don't properly kill but the worker tries to take another job, that's even worse, as we get resource contention and file system locking issues
15:07pmooreyeah that would suck
15:07pmoorepoint taken
15:07dustinit didn't mark the task as completed, so the second worry isn't too bad
15:07dustinand let's see if we lose workers to infinite loops..
15:08pmooreCallek: can you see from the log if it has attempted to take another task?
15:08pmoorei'm not sure where the log is, and i'm afraid i have to go pick up my car from the garage now
15:08pmooreand pay 500 euros or so for an oil change, or whatever the going price is these days
15:09Callekpmoore: I don't think I can, at least not easily, but I think it's low on the list of concerns
15:09pmooreCallek: if you can dump your findings in the bug, i&#39;ll take a look in the morning when i&#39;m back
15:09Callekthe largest issue is resource waste imo
15:09Callekthe others are mere theory-crafting and issues I've seen with buildbot
15:10dustinyep, add some data, and we can make more decisions
16:55bhearsumif i want to subscribe to artifact creation events, is pulse the right way to do it? someone was telling me that it's possible to miss some events that way, and that i may want to look directly at azure(?) instead
16:57dustinpulse is the right choice
16:58bhearsummight i miss events with pulse?
16:59dustinnot normally, assuming you use a persistent queue and don't get killed by pulseguardian
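Pulse is AMQP underneath, so a binding for artifact-created events is just a topic pattern matched against each message's routing key. The exchange name below is the taskcluster-queue exchange for these events; the routing-key field layout is assumed from the queue's exchange reference, so treat it as illustrative. This minimal sketch shows AMQP topic semantics (`*` = one word, `#` = zero or more words):

```python
# Assumed real exchange name for artifact-created events; routing keys
# below are made-up examples of the taskId-first field layout.
EXCHANGE = "exchange/taskcluster-queue/v1/artifact-created"

def topic_matches(pattern, routing_key):
    """AMQP topic matching: '*' matches exactly one dot-separated word,
    '#' matches zero or more words."""
    def match(pat, key):
        if not pat:
            return not key
        if pat[0] == "#":
            return any(match(pat[1:], key[i:]) for i in range(len(key) + 1))
        if not key:
            return False
        if pat[0] in ("*", key[0]):
            return match(pat[1:], key[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))

key = "primary.abc123.0.us-east-1.i-0f00.aws.b-linux.-.grp0._"
# bind with a wildcard to receive every artifact-created event:
assert topic_matches("primary.#", key)
# or narrow the binding to a single taskId (second routing-key field):
assert topic_matches("primary.abc123.#", key)
assert not topic_matches("primary.def456.#", key)
```

The "persistent queue" dustin mentions is a named, durable, non-auto-delete AMQP queue: events accumulate on the broker while the consumer is offline, which is why you don't normally miss any (as long as PulseGuardian doesn't delete an overgrown queue).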
16:59bhearsumahhhh, okay
17:46ahaldustin: this try push modifies taskcluster/
17:46ahalbut the spidermonkey tasks didn&#39;t run despite:
17:46ahalis that a bug or intentional?
17:47dustinwere they optimized away, or just not targeted?
17:47ahalgood question
17:48mshalwcosta: have you had any luck tracking down the valgrind errors?
17:48wcostamshal: not yet, I was looking at other things this morning, going to ptrace valgrind in a loaner machine to see where it is trying to load symbols from
17:49ahaldustin: ah, it's because of
17:49ahalso I guess intentional
17:49dustin'intentional' is not usually a word I use in association with try syntax :)
17:49dustinbut, yeah
17:50wcostabut one "interesting" thing is that I found that the centos base image is updated and overwritten in docker hub, so each time we build a new image, we may pull a different centos docker image
17:51dustinyeah, nondeterministic image builds are teh awful
17:51mshalwcosta: that&#39;s a good idea. Might also be worth reaching out to sewardj or njn for some valgrind expertise
17:52wcostamshal: yep, sure, let me grab some context and reach out to them
18:01Callekdustin: so to follow up, the deadline being 1hr was due to "edit" on a real task and then "update timestamps", afaict
18:01Callekthe deadline was ~ a day out, but didn't actually update in the edit view to more than an hour
18:02* Callek is not certain if it immediately went to ~1hr, or only did once the initial deadline passed
18:02tomprincedustin: ?
18:02CallekI just wanted to circle back and say that
18:12dustinCallek: "update timestamps" updates all the timestamps by the same offset
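What dustin describes can be sketched as shifting created/deadline/expires by one common offset, so their relative spacing is preserved; the helper and field values here are illustrative, not the actual tools implementation:

```python
from datetime import datetime, timezone

def update_timestamps(task, now):
    """Shift every timestamp field by the same offset (now - created),
    preserving the gaps between created, deadline, and expires."""
    created = datetime.fromisoformat(task["created"])
    offset = now - created
    return {k: (datetime.fromisoformat(v) + offset).isoformat()
            for k, v in task.items()
            if k in ("created", "deadline", "expires")}

task = {
    "created": "2017-07-12T00:00:00+00:00",
    "deadline": "2017-07-13T00:00:00+00:00",  # one day after created
    "expires": "2018-07-12T00:00:00+00:00",
}
now = datetime(2017, 7, 12, 18, 0, tzinfo=timezone.utc)
updated = update_timestamps(task, now)
# the deadline stays exactly one day after the (new) created time
assert updated["created"] == "2017-07-12T18:00:00+00:00"
assert updated["deadline"] == "2017-07-13T18:00:00+00:00"
```

Under that behaviour, a task created with a one-day deadline should still show a one-day deadline after the update, which is why the ~1hr deadline Callek saw looks like a separate bug in the edit view.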
18:13Callekdustin: yeah, that doesn't seem entirely true: the task I edited had a day for the deadline initially, and I never changed that value
18:13* Callek suspects it is the intent though
18:14dustinwell, it is the case :)
18:14dustinwhat was the original task?
18:36Callekwoa wtf at these measurements for timing
18:36Callekfor the hg checkout
18:37Callekdustin: ...original task was (likely) fwiw it was one of the date l10n tasks for w64...
19:00dustinI wonder if those are happening on the wrong drive?
19:00* dustin recalls things about different drives at different speeds, formatting drives, etc.
19:06Callekdustin: 3:02 PM <gps> Callek: this smells like bug 1305174. might want to ping someone in taskcluster about a potential regression to worker provisioning
19:06Callek3:02 PM <firebot> NEW, EBS initialization makes I/O absurdly slow on freshly provisioned instances
19:06firebot NEW, EBS initialization makes I/O absurdly slow on freshly provisioned instances
19:06Callekdustin: that was re: the 1 hour checkout
19:07dustinyeah, that's what i was thinking
19:08dustinISTR discussion of changing how we initialize EBS volumes to avoid wasting 20 minutes reformatting them on startup recently
19:08dustinwhich I think grenade was working on
19:08dustinwant to open a bug?
19:08gpsthat 20 minutes prevents this very thing
19:09gpsideally the "scratch" EBS volumes are fresh and don't come from AMIs
19:09gpsthen you can just do a quick format and we're good
19:09dustinyeah, and that was discussed in that context, so I doubt we just turned it off
19:09dustinbut, I wasn&#39;t there
19:09dustinand I think nobody currently online was there
19:09dustinso we should probably file a bug and let those folks sort it out :)
19:09dustinI don't see anything incriminating on
19:14dustinI think is about releng hosts, so probably not a good place to tag this on
19:41Aryxdustin: hi, bustage:
19:41dustinhm, that stinks
19:42dustinI guess I should have done my own try push instead of trusting the author :/
19:42dustinanyway, that&#39;ll be a backout
19:42Aryxthe next push (1 minute after yours) only has this failure so far:
19:44Aryx"ok", no tc jobs scheduled after that
20:16dustinit's definitely broken
20:17dustinmaking un-executable images
21:18tomprincegarndt: Are you available to meet sometime about taskcluster and thunderbird?
21:21garndttomprince: sorry I&#39;m on pto for the next couple of weeks but I&#39;ll try to reply to your email in the next day or so.
21:22tomprincegarndt: Okay, no problem. Thanks.
21:25Silne30dustin: Hey, man. I have been documenting the process to add a test suite to taskcluster.
21:25Silne30I was going to put it up on MDN, but I was wondering if it would be good to put it in the gecko readthedocs or the taskcluster documentation.
21:33jonasfjSilne30: if it's docs related to how you add test suites for gecko in-tree, then the obvious location is the readthedocs section in /taskcluster/... on mozilla-central
21:34Silne30jonasfj: Yes. That's what I am looking to do.
21:34jonasfjI figured... check out:
21:35jonasfjthat way the docs are close to the config files you'd be tweaking
21:36jonasfjSilne30: also the docs are probably specific to the YAML format and transforms we use in-tree, hence, the docs can be updated when the in-tree YAML format changes. And these changes can ride the trains to different branches..
21:38dustinI think there's a move to get developer docs out of MDN too
21:38dustinso that makes sense
21:38dustinnote there already is some content there
21:39Silne30dustin: Yep. I see that. Anything that mentions the mozharness script and how to set that up for taskcluster? Or the whole checkouts flow rather than test-archives?
21:40dustinthe latter may be worth mentioning
21:40dustinthe former is more mozharness-specific so it should probably be a pointer to some more mozharnessy documentation
21:42Silne30Makes sense.
21:43Silne30I definitely think it should be pointed to since taskcluster expects a mozharness script (or so I was told)
21:51dustinfor tests, yes
21:52dustinthere's a big old stack of things, and they're all used for other stuff too, so documenting it all in one place feels weird
21:52dustinlike including in your car manual how to cast an engine block
21:55Silne30dustin: I am open to suggestions as far as where to put this stuff.
21:55Silne30I was thinking of putting that stuff under "adding a new test suite"
22:06dustinI'm not sure where to put the mozharness bits though
22:06dustinI don't see a mozharness section on the readthedocs stuff
22:08dustinjlund|pto: may have some ideas about where that should go
22:08dustinok, gotta run
22:09Silne30dustin: Later
13 Jul 2017
No messages