mozilla :: #taskcluster

16 Mar 2017
03:04dustinhe's generally not around after midnight :)
11:18Tomcat|sheriffdutypmoore: ping
11:19pmooreTomcat|sheriffduty: hey
11:19Tomcat|sheriffdutyhey pmoore we have a problem with nightlys
11:19Tomcat|sheriffdutyfor the tc ones based
11:19Tomcat|sheriffdutyyesterday bug 1347569 broke them
11:19firebot FIXED, Decision Task for Nightly Desktop + Android failed with KeyError: u'NS7jKig_R8-1F_7DWSTQ-Q'
11:19Tomcat|sheriffdutyand today we have bug 1347889
11:20firebot NEW, Nightly Decision Task for Android + Desktop fail with 400 Client Error: Bad Request for url: http://
11:20pmooreOk I'll take a look
11:20Tomcat|sheriffdutydo you know who could look at this or do we need to wait till Callek and dustin wake up
11:20Tomcat|sheriffdutythe problem is that since yesterday no linux nightlys were generated
11:20Tomcat|sheriffdutyso already got pinged for this :)
11:25jhfordwhimboo: heh, yah, i'm not around after midnight :) what's up?
11:26pmooreTomcat|sheriffduty: i wonder if this might be an in-tree problem this time, it looks like we are getting a 400 back from the queue, indicating an invalid request
11:26pmooreTomcat|sheriffduty: also it says "data.created should be string"
11:27pmooreso i wonder if something has changed in-tree wrt nightly task generation ... :/
11:28pmooremy guess is, there is a decision task generated for the nightly builds, and that only runs when the nightlies run, so maybe it landed and didn't cause any bustage
11:28pmooresince you would only see the bustage the next time the nightlies run
11:29pmooreiirc we now have an in-tree mechanism for scheduling tasks on a cron, i can take a look in-tree to see if i can find it
11:29Tomcat|sheriffduty maybe ?
11:30pmoorehmmm - could be, i'm not sure - let me see if i can find the generated task to find the problem
11:32pmoorehmmm in we have `pushdate: 0` - i wonder if that is normal (could be unrelated, just looks strange)
11:33Tomcat|sheriffdutyi could retrigger them, but not sure if this would help
11:36pmooreTomcat|sheriffduty: so i suspect that there is a mach command that will take the parameters file from and produce out a list of tasks definitions, and those definitions will have a badly formatted "created" parameter
11:37Tomcat|sheriffdutypmoore: there was a mach change in tree
11:38pmooreTomcat|sheriffduty: the people that probably know how this task generation stuff work (as I think they have worked on it) are dustin, jlund|away, ahal, gps, Callek
11:39pmooreTomcat|sheriffduty: that change looks ok to me
11:39pmooreTomcat|sheriffduty: i think dustin will be around soon - he will know
11:40pmooreyou could play with backing out any changes under /taskcluster top level directory
11:41pmooreTomcat|sheriffduty: if there is only a single change in there, it is likely to be the one
11:41pmooreTomcat|sheriffduty: it certainly looks like an issue with task generation to me, since it looks like a malformed created parameter in the task definition
11:41Tomcat|sheriffdutypmoore: ok will sylvestre
11:41whimboojhford: hah. irc said you are not away :)
11:42Tomcat|sheriffdutypmoore: will ping sylvestre
11:42whimboojhford: meanwhile figured out how to get a url for an artifact via the tc client
11:42jhfordhehe, no worries, that field is generally not to be trusted
11:42jhfordyou're not generating it by doing URL generating yourself are you?
11:43whimboojhford: is that not what I should do? 8D
11:43whimboono, i use queue.buildUrl()
11:43jhfordhehe, cool
11:43whimboobuildsignedUrl doesnt seem to work
11:43jhfordare you using the python client? are you >= 1.0.0?
11:43whimboomaybe its for something else
11:43whimboojust upgraded from 0.3.3 or so
11:43whimboonow that you all got fixed
11:44whimboothanks for that!
11:54pmooreTomcat|sheriffduty: i've created bug 1347896 to see if we can find a way to test task generation for cron tasks, so we'll get test failures in future at the time the change lands, rather than bustage when the scheduled task runs (e.g. nightlies)
11:54firebot NEW, Test that performs task generation dry run and validation for cron tasks
11:54Tomcat|sheriffdutyyeah that would be great
11:55pmooreTomcat|sheriffduty: feel free to comment in there if this becomes higher priorities, e.g. if bust scheduled tasks become more common and start adding more risk to projects
11:55* pmoore can't type today
12:27CallekTomcat|lunch: pmoore: hihi, where do we sit with regard to bustage?
12:28jlastHi jmaher
12:28jmaher|afkI want to know how we invoke docker to invoke public/image.tar.zst ? is there a decompression step or special docker cli options?
12:28jmaher|afkjlast: hi, asking here ^
12:28jmaher|afkwcosta: jonasfj ^^ would either of you guys know?
12:30pmooreCallek: i haven't fixed anything, i just had my lunch :-)
12:34jdescottesjmaher|afk: looking at taskcluster/taskgraph/ looks like there is some plumbing to read zst images
12:35jmaher|afkjdescottes: that is sounding familiar, good find
12:35jmaher|afkjlast: does that work for you?
12:38wcostajmaher|afk: iirc, garndt implemented the image compression, he might know better
12:40jmaher|afkwcosta: got it; jlast garndt is in a GMT-5 timezone, might be another bit before he is online; maybe what jdescottes pulled up will get that image loaded for you?
12:42Callekpmoore: soooo, heres what the task that failed to submit looks like:
12:42Callekjhford: ^
12:42Callekthis is from
12:44Tomcat|sheriffdutywell Callek we still have the nightly bustage
12:44pmooreCallek: looks like something should convert those relative timestamps to actual timestamps, and didn't run
12:45pmooreCallek: are you sure that is the final version, or could that be pre-conversion?
12:45Callekthats the ver in the task-graph.json
12:45Tomcat|sheriffdutyCallek: also informed pascalc so that he is informed of the problem
12:45Callekthe linux-nightly also has:
12:45pmooreCallek: yeah, i don't know the in-tree stuff well enough to know if that is the final version or not
12:47pmooreCallek: also "attributes" has no place in a task definition
12:47Callekpmoore: yea, this is task graph (attributes is there, but isn't submitted)
12:47Callekpmoore: note the task def is in the "task" {} there, I pasted more than that :-)
12:48pmooreah yes
12:48pmooreso looks like the conversion of the relative-timestamp didn't run
12:48pmooresounds like a solvable problem
12:48jlastHmm, the image.tar.zst is a bit strange. When I load it in I get an image name of desktop-test
12:48jlastand then this command: `docker run -it desktop-test:b8ff6bbb2b00b0163f84217450bca815c91eb623840b727fea8829d839f9a9e4 /bin/bash -c "export SHELL=/bin/bash; ls"`
12:49CallekTomcat|sheriffduty: sooo, fwiw it looks like this set of nightlies *did* go through and worked
12:49jlastresults in this output: `artifacts bin buildprops.json checkouts Documents Music Pictures scripts Videos workspace`
12:49CallekTomcat|sheriffduty: its just this set of tasks didn't submit right, (which is tasks to index l10n properly)
12:50Tomcat|sheriffdutyoh in ideed
12:50Callekpmoore: Tomcat|sheriffduty ooo I bet the issue is related to scopes...... `index:insert-task:....`
12:53jmaher|afkjlast: so desktop-test is the image name that we have called this; are those the directories you see from the root dir?
12:54jmaher|afkjlast: ok, so that is the base image, then we usually run a script upon loading to download the build/test package and do the unpacking/setup/running
12:54jlasti see
12:54Callekdustin: jhford garndt: ooook, so heres the issue --- we created a set of tasks to help us throw index's up properly, in doing so we are left with needing to post index's from cron jobs, to do so we need to grant scopes to those cron jobs, to do so we're needing to be granting scopes to all people who can push to the repo, the scope we need to grant for is
12:54Callek`index:insert-task:gecko.v2.*` can you forsee any actual issue here?
12:54jmaher|afkjlast: [task 2017-03-15T22:03:46.161261Z] executing ['/home/worker/bin/', '--no-read-buildbot-config', '--installer-url=', '--test-packages-url=', '--mochitest-suite=mochitest-devtools-chrome-chunked', '--total-chunk=10', '--thi
12:54Callek(this does mean that basically everyone will have that scope!)
12:55jmaher|afkjlast: a lot of magic there (that is a most recent finished m-c dt2 run)
12:55jlastoh i see
12:55jlasti see the script
12:55jmaher|afkso you can give it similar commands and it will work
12:56jlasthmm - where would we see the checkout of m-c?
12:57jlasthere is our script for running - mochitest. it relies on us being able to link assets from the patch:
12:57Callekcatlee: offhand, do we blindly use index.gecko.v2.* routes for anything shipping, without any secondary CoT checks?
12:57jmaher|afkjlast: in this case we download the build and test packages, so we are not running from source
12:57jmaher|afkall the tests are added to
12:58jlastdoes the build include the patch?
12:58Callekcatlee: because fixing nightlies with this set of patches that just (recently) landed, means we grant that scope to pretty much everyone... (including try the way this is set up)
12:58pmooreCallek: the errors i saw were 400 from the queue, suggesting a bad payload, rather than a 403
12:58Callekpmoore: o hrm
12:58pmooreCallek: and the logs also mentioned the badly formed "created" parameter in the task definition
12:58Callekpmoore: though, noteworthy afaict we need the scope anyway
12:58jmaher|afkjlast: so if you pushed to try/etc. then the resulting build generates a firefox package and a series of test packages
12:58Callekpmoore: what logs said the created param?
12:58pmoorethe log of the decision task that failed
12:59Callekpmoore: oooo I see that now!
12:59jmaher|afkjlast: as a note, I am going to be afk (as my nick says) in ~5 minutes
13:02Callekpmoore: I was looking at the last line (with traceback) which didn't have that message but was also a 404...
13:02Callekso yea, looking closer
13:04Callekpmoore: `s/relative-timestamp/relative-datestamp/`
13:04CallekI suspect the scopes issue will be another bustage, but lets get this code change in....
13:04* Callek writes the patch
13:05Callekjonasfj: are you around at this time and free for review, or should I await dustin's arrival and grab him
13:06pmooreCallek: i can also review, especially if you test the change first and demonstrate it results in a valid task definition
13:07Callekpmoore: it will still be relative-datestamp in the resulting def, the because *that* gets resolved at actual task creation time
13:07pmooreCallek: can we not generate tasks without submitting them?
13:08Callekpmoore: well, the relative-datestamp stuff was explicitly made to be there until the *very* end and inspectable this way, (especially for subsequent runs) to make diffing better.... so not that I know of
13:08Callekand given that I'm only changing 2 lines of code (well 1 line of code and 1 line of docs) I'm not sure you have the overall context necessary to confidently grant review :-)
13:08* Callek could be wrong of course, and would happily take the review+
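A minimal sketch of the late resolution Callek describes above: the graph keeps `{"relative-datestamp": ...}` placeholders until task creation, and only then are they turned into the string timestamps the queue expects (the 400 was "data.created should be string"). The helper names `parse_offset` and `resolve_datestamps`, the supported units, and the timestamp format are assumptions for illustration, not the in-tree implementation.

```python
import re
from datetime import datetime, timedelta, timezone

# Hypothetical unit table; the real in-tree helper may accept other units.
_UNITS = {"second": "seconds", "minute": "minutes", "hour": "hours",
          "day": "days", "week": "weeks"}


def parse_offset(text):
    """Parse an offset such as '0 seconds' or '7 days' into a timedelta."""
    match = re.fullmatch(r"\s*(\d+)\s+(second|minute|hour|day|week)s?\s*", text)
    if match is None:
        raise ValueError(f"unsupported offset: {text!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(**{_UNITS[unit]: amount})


def resolve_datestamps(value, now):
    """Recursively replace {'relative-datestamp': ...} with an ISO 8601 string."""
    if isinstance(value, dict):
        if set(value) == {"relative-datestamp"}:
            stamp = now + parse_offset(value["relative-datestamp"])
            # Millisecond precision with a trailing Z.
            return stamp.strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
        return {key: resolve_datestamps(item, now) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_datestamps(item, now) for item in value]
    return value


now = datetime(2017, 3, 16, 11, 0, 0, tzinfo=timezone.utc)
task = {
    "created": {"relative-datestamp": "0 seconds"},
    "deadline": {"relative-datestamp": "1 day"},
    "payload": {"maxRunTime": 3600},
}
resolved = resolve_datestamps(task, now)
```

A misspelled key such as `relative-timestamp` is left untouched by a resolver like this, so `created` stays an object and the queue rejects it, which matches the failure discussed here.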
13:14pmooreCallek: regardless of the artifacts that get published in the decision task, it should be possible to perform a "dry run" and generate real task definitions with the current timestamp, and if it isn't that sounds like a missing feature which we should address (albeit separately from this bug)
13:15pmooreCallek: indeed this failure could have been caught ahead of time, if real task definitions could have been created and validated
13:16pmoore(using the task definition json schema)
13:16pmoorenote i've created bug 1347896 for this
13:16firebot NEW, Test that performs task generation dry run and validation for cron tasks
13:19Callekpmoore: this *specific* issue is about being able to create too, which is caught on actual submission though
13:20CallekI think we do need a way to "dry run" the actual submission based on this... which means we should be able to submit to tc and get a check against scopes and such, without actually creating the task
13:20Callekwhich may be harder
13:21Callekgetting the graph gen done and some in-tree validation is good too, but not really enough (if scopes will fail due to an in tree change that affects scopes, then we're in a similar boat without that test failing)
13:21Callekjonasfj: pmoore dustin Tomcat|sheriffduty: OK, I have a patch, but need a reviewer, I'll be away for 5-10 min
13:22pmooreCallek: if there is a mechanism to generate actual task definitions, independently of submitting them, then there is a defined way to validate that they are well-formed client-side via the published json schema - agreed scope validity is harder to validate at the moment - we might want to implement a queue feature that allows submitting a task to see if it
13:22pmoorewould be valid, without persisting it nor scheduling it
13:23garndtjmaher|afk: so there is a mach command that helps (which I have never used) called `taskcluster-load-image` which has some help text...there is also instructions in the "run locally" section of any task in the task inspector that describes how to load the image. In the case of tasks using a compressed image it should have some text there that explains that
13:23garndtit's uncompressed before loading into docker
13:24pmooreCallek: please put your patch in bug 1347889
13:24firebot NEW, Nightly Decision Task for Android + Desktop fail with 400 Client Error: Bad Request for url: http://
13:31Callekpmoore: am I flagging you?
13:31CallekTomcat|sheriffduty: agree with "use autoland on r+, and re-ping me tomorrow if its busted again"?
13:32CallekTomcat|sheriffduty: since we actually did get the nightlies out today, just not those indexing-tasks
13:32Tomcat|sheriffdutyCallek: sure :)
13:35catleeCallek: I don't know
13:36catleesounds like a problem if try can get nightly scopes though
13:36pmooreCallek: if you'd like me to review it, i am happy to do so - note garndt is also around now if you prefer him to review it
13:37Callekcatlee: its more a factor of how the index is designed and we need to grant gecko.v2.* of course we can grant that for L2/L3 and explicitly say "try" for try I guess
13:37Callekgarndt: do you know the taskgraph in tree code, at all?
13:42garndtnot as much anymore, so if you're not completely blocked this moment, it might be best to put up a review flagging dustin (and feel free to tag me as well so I can learn what's being done), and we'll get to it.
13:42garndtare you going to post the patch on bug 1347889?
13:42firebot NEW, Nightly Decision Task for Android + Desktop fail with 400 Client Error: Bad Request for url: http://
13:43Callekgarndt: yea
13:46Callekpatch posted, flagged dustin
13:46garndtthank you
13:48garndtI really think that patch is appropriate given what I find elsewhere, like
13:59dustinCallek: I landed the patch
14:01Callekdustin: thanks
16:37Callekdustin: garndt: can someone please add `index:insert-task:gecko.v2.*` to* I don't have the scope to do so
16:37Callekand jlorenzo is blocked on that
16:37dustinI can
16:37Callek(breaks date pushes) and is needed for the nightly code now
16:37akishouldn't you have the scope? you're in releng
16:38CallekI don't have the scope to add the index:insert-task:* stuff
16:38Callek(you can't grant scopes you don't have, or some such)
16:39akiadding it to the role means we can't edit it anymore
16:39dustinhm, these don't look right
16:39dustinaki: no, it's just the scopes you're adding
16:39dustin* doesn't look like enough to run an m-c nightly
16:40dustinoh, it inherits * too
16:42dustinI'll add the right routes everywhere queue:route:index. appears
16:57dustinCallek: updated
17:07jlorenzodustin: hey.thanks for the update! We're moving forward, one more scope is needed :
17:07dustinthat one you should be able to add to the project:releng:nightly:* roles yourself
17:08* jlorenzo does so
17:12* jlorenzo added index:insert-task:project.releng.funsize.level-3.* to*
17:13dustinwhat did we figure out with the heroku caching issues?
17:13dustinError: Cannot find module './build/Debug/buffertools.node'
17:13dustinfrom auth-staging
17:29Callekjlorenzo: dustin: ahhhhhh I'm not sure we *want* the funsize route this way (not sure how funsize will react) -- sfraser fyi ^
17:29Calleksfraser: this is a case of "a task after the fact is adding the funsize route to the l10n task that it relates to..."
17:30sfraserthis seems bad, at least in the sense it will result in duplicate partials being generated, if the other job has something funsize can understand
17:30dustinI'm confused
17:30dustinhow does it matter when the index is added?
17:33sfraserneed a few minutes, sorry
17:39Calleksfraser: dustin: so, in this case Its a "Task X had routes at `started`" and now "Task Y sets routes to Task X when Task Y runs" -- in the funsize routes in particular its the exact same route regardless of what task runs it (duplicates) and is used explicitly for notification to funsize
17:40CallekI'm not certain how funsize ingests that stuff, or what TC does in terms of pulse messages (differences?) here?
17:47dustinCallek: to be clear, routes != index paths
17:47dustinsuddenly I'm worried that funsize is listening for pulse messages intended for the index - those won't be sent at all anymore
17:48sfraserback, dinner won't burn now
17:48dustinmaybe we need to disable this morph until we can move funsize in-tree
17:49sfraserfunsize is currently listening for pulse messages for u'route.index.project.releng.funsize.level-3.mozilla-central', and u'route.index.project.releng.funsize.level-3.mozilla-aurora'
17:49dustinok, darn
17:49dustinso those won't be sent by index tasks
17:50sfraserhow do you mean 'sent by index tasks'?
17:50dustinthere are no such messages
17:51sfraserthere are at the moment?
17:51dustinfor tasks that have had this morph applied
17:51sfraserfunsize doesn't have to use an index route. It does at the moment because the namespace documentation and talking to people both ended up with the agreement that it was the right place for it
17:52sfrasermorph is patch in this context? or task transform?
17:53sfraserI'm also not sure I'm following the description of "a task after the fact is adding the funsize route to the l10n task that it relates to..."
17:53dustinbasically what we do is, when a task has too many routes, we pull out the routes pointing to the index
17:53dustinand just call index.insertTask() directly
17:53sfraserHow can a task that happens afterwards add a message to a task that's already done?
17:53dustinright, it can't :)
17:54dustinso there's no message with either of the pulse route keys you mentioned above, for tasks that have had this "pull out the routes" morph applied
17:54dustinwhich is to say, for tasks with a lot of routes
17:54sfraser(bear in mind I don't know what index.insertTask() does. I am guessing that morph is patch)
17:55dustinshould we talk by vidyo?
17:55dustinmorph is not pach
17:55dustinor patch
17:55sfraserso what's a morph?
17:56sfraserwould it be simplest to stop funsize using route.index.something, since it was never an index message?
17:57dustinbut if we need to add a route for each l10n, we'll be right back where we started with too many routes per task
17:58sfraserAre we likely to need to add a route for each l10n?
17:58dustinCallek: ^^?
17:58dustinI'd really like for funsize to just be a worker -- isn't that the plan anyway?
17:58dustinrather than trying to attach it to the task-graph by listening to routes
17:58akiplan is to move it in tree and have it be generate partial -> signing scriptworker -> balrog scriptworker
17:58sfraserit is, but there's no timescale attached at the moment
17:58akiyeah, that
17:59Callekdustin: let me circle back after this mtg :-)
17:59sfraserI'm on PTO tomorrow, so if you need me for things it'll have to be Monday
18:00sfraserthat said, the change to funsize is simple
18:00akibalrog scriptworker doesn't support release bits yet, and may not support partials yet, so the time scale is longer than we want it to be even if we start now
18:01sfraserI've looked at the balrog scriptworker, and I think it will support partials.
18:01akinot for releases though aiui
18:01sfraserIt does require the public balrog api to be done, though, so that the decision task can work out which previous releases to build partials from
18:01sfraser(I'm told the balrogvpnproxy scope is on the wishlist for things to remove, so I'm reluctant to start using that)
18:02akiideally balrog scriptworker skips that
18:05sfraserif I'm not around, then add the new routes to and as long as it's the same task sending those pulse messages, things will carry on working
18:05dustinprobably we should just remove the morph temporarily and increase the number of l10n chunks again
18:05dustinuntil this gets figured out
18:05dustinbut I'll let Callek chime
18:05akii would just increase the number of l10n chunks on m-c as well as m-a and let it ride the trains
18:05akirather than worry about it on merge day
18:06akiunless i'm missing something
18:06dustinsounds good to me :)
18:06akior error if there's > max number of locales
18:06akiper task
18:06akino, just increasing chunks on m-c is better
18:06Callekdustin: I'm ok with that, this only landed on central --- and the chunking on central alone isn't an issue (it needs to be bigger for aurora, but we can do additional chunking on central to make aurora be fine)... I'm also explicitly going to say leave the morph support in but rip out index morphs for now
18:07dustinyeah that's what I'm thinking
18:11catleedustin: just to follow up on a comment you made on the thread - I've heard from a few folks that 'diff -u' isn't great for diffing task graphs
18:11catleee.g. taskIds are random, and the order isn't deterministic?
18:13sfraserThere's , there should be a cmdline or python variant
18:14dustinthe graphs have labels, not taskids
18:14dustinso that's not an issue
18:23catleeI'll get a specific example
18:23aki|mach taskgraph optimized| has taskIds
18:24akii don't know if there's a way to get it pre-taskId but optimized
18:25akior even post-target_task_method but pre-taskId
18:28akilooks like i should be using target?
18:36akiyeah, target-graph seems to do it
18:42akilooks like optimized and morphed have taskIds, so they're diff -u resistant. target-graph should cover most of our needs i think
18:43dustinyeah, if you need to compare those you can swap task['label'] back in using something like
18:43dustinbut usually you want to compare the target-graph
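A rough sketch of the label swap dustin mentions, for making taskId-keyed graphs comparable with `diff -u`. It assumes the graph JSON is a mapping of taskId to an object with a `label` and a `dependencies` mapping of name to taskId; that layout and the helper names `relabel` and `dump_for_diff` are assumptions for illustration.

```python
import json


def relabel(graph):
    """Rekey a taskId-keyed graph by label and rewrite dependency references."""
    id_to_label = {task_id: task["label"] for task_id, task in graph.items()}
    relabeled = {}
    for task in graph.values():
        copy = dict(task)
        # Dependencies point at taskIds; map them back to labels where known.
        copy["dependencies"] = {
            name: id_to_label.get(dep, dep)
            for name, dep in task.get("dependencies", {}).items()
        }
        relabeled[task["label"]] = copy
    return relabeled


def dump_for_diff(graph):
    """Serialize with sorted keys so the output order is deterministic."""
    return json.dumps(relabel(graph), indent=2, sort_keys=True)


run_a = {
    "abc": {"label": "build-linux", "dependencies": {}},
    "def": {"label": "test-linux", "dependencies": {"build": "abc"}},
}
run_b = {
    "xyz": {"label": "build-linux", "dependencies": {}},
    "uvw": {"label": "test-linux", "dependencies": {"build": "xyz"}},
}
same = dump_for_diff(run_a) == dump_for_diff(run_b)
```

Sorting keys on output addresses the ordering concern raised earlier; the random taskIds disappear once every reference is expressed as a label.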
18:45dustinwhat you miss with these is, you really need to try it with a variety of parameters.yml's
18:45akiyeah, i had a number for nightly tests
18:45dustinyeah I have a few too
18:46akimaybe we should land those for tests?
18:47dustinI worry there will be a lot of false-negatives there
18:47dustine.g., around include_nightlies
18:47dustinbut if someone managed to write some tasks like that which had a good success rate, that'd be cool
18:54akiin-tree graph tests that run periodically is on our trello board :) in the brainstorm/to triage column
19:01akiit might be nice to at least test the nightly and release graphs
20:31nanliu_is taskcluster deployable for external users or is it only for Mozilla projects?
20:35catleewhat happens when you try and reclaim a finished task?
20:38catleeyeah...with any other message?
20:38catleeare the error codes documented anywhere?
20:39aki ?
20:40catleeis that the only time you'd get a 409 from reclaimTask?
20:40akinot sure if you can tell what the 409 is for, or if you then have to run a task status
20:40akii think expired / exception / completed all 409
20:40akiso i was going to special case user-cancel to kill the task
20:41akii haven't figured out if i have to run a 2nd query to figure if it's user-cancel
20:41jonasfjnanliu_: <- bstack is playing with redeployability, but afaik we're not there yet
20:41catleeI mean, the only time reclaimTask returns a 409 is when the task has already completed?
20:41akiexpired != completed, so it depends what you mean
20:41bstackalthough it can be used by outside users via the github integration right now. what do you want to use it for, nanliu_ ?
20:41bstackI wouldn't necessarily just use it for anything, but if it's vaguely moz related, that seems fine
20:42akimaybe that would give you a 410
20:42nanliu_jonasfj, is the deployment something we can work on? our internal projects are not moz related.
20:43catleeaki: so what I'm seeing is that BBB tries to reclaim a task that's completed, gets a 409, and assumes it means deadline-exceeded, and then tries to cancel the BB job
20:43nanliu_we most likely would want to deploy it on kubernetes, and would like to write plugins like gce/openstack/k8s.
20:43jonasfjwe want to be redeployable, but have lots of other priorities..
20:44jonasfjbstack is working on it, afaik he plans to do k8b...
20:44catleeaki: I think I can just ignore a 409 in that case
20:44jonasfjhe knows more... if you're interested in taking on some of the tasks, wrt. redeployability bstack is the driving force :)
20:44bstackalways looking for help
20:44akicatlee: possible
20:45akinanliu_: i would really love to see tc redeployable :D
20:45catleejonasfj: where can I look to see in which cases reclaimTask returns 409?
20:45bstackI have some components running already and a script to set everything up for you
20:45bstackthe script should probably be ported to terraform
20:45bstackbut it Works
20:45nanliu_well, we are very comfortable with bleeding edge :D
20:46akicatlee: jonasfj: i would love to know the answer to that too :)
20:46jonasfjcatlee: I guess I should document rather than "reclaim a task more to be added later..."
20:46nanliu_helm is our preferred k8s deployment tool
20:47jonasfjcatlee, aki: short answer is that reclaim returns 409 if you don't have the claim anymore. claim-expired, deadline-exceeded, task-canceled, etc..
20:47nanliu_bstack, if you have anything as a starting point, I'm happy to be the guinea pig
20:47jonasfjnote: a claim is for a specific run
20:48bstacknanliu_: conveniently, it is a helm thing
20:48bstackthere's still a ton of work to make it actually run tasks
20:49bstackbut the auth service runs
20:49aki"Wow, it runs" :)
20:49nanliu_sold, "Wow it runs" :P
20:50bstackalso until we move to taskcluster-pulse, it relies on using mozilla's rabbitmq
20:50bstackbasically, is a large list and not even complete
20:55bstacknanliu_: what sort of timeline are you thinking of?
20:55catleejonasfj: including a run being finished?
20:56jonasfjcatlee: yeah, if the task is resolved you can't reclaim it
20:56nanliu_bstack, we are evaluating options, taskcluster maps to our requirement better than most products out there.
20:57nanliu_secret management, service pool, workflow, etc
20:58jonasfjcatlee: the reclaimTask end-point returns temporary credentials, so it would be bad to allow reclaims after task resolution.
20:59jonasfjcatlee, aki: have you guys seen:
20:59akithat's been updated with claimWork!
20:59bstackBrb, water main is broken and I'm thirsty.
21:00bstacknanliu_: want to send an email to me with some more thoughts on what you want to be able to do and I can try to help out more?
21:00bstackI'll also be back on irc in like 15 minutes.
21:01nanliu_sounds great, is your email bstack @ mozilla?
21:02akiso for scriptworker, i ignore 409 on reclaimTask because I assume the async reclaimTask request is happening after we finished the task. I should check to see if i'm still running that same task, and then kill it
21:02nanliu_found it, I'll send an email shortly
21:02akiand then reclaimTask more often to catch user-cancel
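A small sketch of the decision aki outlines for a worker: a 409 from reclaimTask means the claim is gone, which is harmless if that run has already finished locally but means the work should be stopped if the same run is still executing. The `Action` enum, the function name, and the `(taskId, runId)` tuples are hypothetical; no Taskcluster client calls are made here.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"    # claim extended, keep working
    IGNORE = "ignore"        # stale reclaim for a run we already finished
    KILL_TASK = "kill_task"  # claim lost while the run is still executing
    RETRY = "retry"          # transient failure, try the reclaim again later


def on_reclaim_response(status_code, claimed_run, running_run):
    """Decide what to do after a reclaimTask call.

    claimed_run is the (taskId, runId) the reclaim was sent for;
    running_run is the (taskId, runId) currently executing, or None if idle.
    """
    if 200 <= status_code < 300:
        return Action.CONTINUE
    if status_code == 409:
        # A claim is for a specific run, so compare taskId and runId together.
        if running_run == claimed_run:
            return Action.KILL_TASK
        return Action.IGNORE
    # Anything else is treated as intermittent and retried.
    return Action.RETRY


finished = on_reclaim_response(409, ("task-A", 0), None)
cancelled = on_reclaim_response(409, ("task-A", 0), ("task-A", 0))
```

The 409 alone does not say whether the cause was a user cancel, an expired claim, or an exceeded deadline, so this sketch treats all of them the same way.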
21:04jonasfjaki: yeah... also the credentials you use to do the next reclaimTask, reportCompleted/Failed should be the temporary credentials returned from the most recent claimWork/reclaimTask call for the given taskId/runId
21:06akiadded to
21:15jonasfjstandups: PR documenting reclaimTask ->
21:15standupsOk, submitted #43819 for
21:17garndtnanliu: if you don't mind me asking, what project are you looking to use this for?
21:20jonasfjaki: fyi, before you reclaim every 30s seriously consider if it's 3 workers or 3k workers that does this... It sort of affects the req/s our services needs to process :)
21:21jonasfjyou can lower reclaim interval, but probably make it configurable if you do
21:21akijonasfj: yup, it'll be configurable. and right now it's <20 workers
21:21jonasfjie. if you have 5-30 workers then who cares, if you have 5k we care :)
21:22akithe recommended way is pulse?
21:22nanliugarndt, for a variety of internal projects:
21:23jonasfjaki: the recommended thing is to not be hyper responsive :)
21:23jonasfjaki: but certainly using pulse probably carries less overhead on your workers too
21:24jonasfjand pulse is super fast
21:24akii was thinking leaving the default at 5m, setting the puppet conf to 30s for the pool of ~20prod workers, but configurable
21:24jonasfjjust not scalable to 10k connections :)
21:24jonasfjmakes sense..
21:24akii don't think scriptworker is scalable to 10k til we have a provisioner
21:25jonasfjaki: but increasing the request interval also makes it more susceptible to crashes..
21:25jonasfjie. network issues, temporary instabilities, etc.
21:25jonasfjless talk => more robustness
21:26jonasfjie. the less our services chat to each other the more reliable they are...
21:26akii can have it ignore intermittent network issues and only act on 409
21:26jonasfjjust saying the value to increased responsiveness isn't high
21:27jonasfjwhether it takes 30s, 2min, 5min or 15min to stop processing a cancelled tasks rarely matters much
21:27akiwell, with a pool of 4 scriptworkers of type X, having one stuck finishing a cancelled job while dozens are queued can be costly
21:27akiesp since they're gatekeepers to releases
21:27akiall 4 stuck on cancelled jobs is possible
21:28jonasfjprobably the tasks would be either (A) short, or (B) you add more workers
21:28garndtnanliu: awesome to hear! welcome
21:28akii'll add the support and we can tune the reclaim interval
21:28jonasfjanyways, just saying there is pros and cons...
21:29jonasfjaki: cool, just remember that takenUntil - 2min (or so) is the maximum (let this be dynamic)
21:29akii don't think we'll be going above 5min for reclaim interval
21:29jonasfjmakes sense..
21:29jonasfjjust please don't hardcode it
21:30jonasfjuse something like: min(takenUntil - 2min, hardcoded_config_value)
21:30jonasfjaki: ^ if we code that in all our workers we can increase responsiveness serverside, if in some future we decide to make reclaims super fast using a caching layer or something...
21:31akiit&#39;s in the scriptworker.yaml config
21:31jonasfjtakenUntil is the value from the task.status structure
21:31jhfordjonasfj: why don't we use redis to store cancellations with a boolean TTL'd key and expose a /is-cancelled/:slugid endpoint. since the slugid is a guid, we dont really need to auth the endpoint
21:31jonasfjright now it's always: now() + 20min
21:31jhfordb/c we've hit this problem elsewhere
21:32jhfordlike, when we killed pulse
21:32akii think that's a fine default, 20min, since we don't report workers going offline
21:32jhfordthen you could poll that endpoint in these workers with minimal overhead, i suspect
21:32akii would like that
21:32jonasfjjhford: redis probably would run into connections issues too, with number of connections..
21:32jonasfjjust thinking
21:32akithen reclaimTask could be set at an interval independent of cancellation polling
21:33jhfordjonasfj: no, don't expose redis protocol, wrap it with a tiny webservice
21:33jonasfjjhford: but yeah, we could speed up things and reduce reclaim interval
21:33jonasfjjhford: yes
21:33jhfordlike, maybe not even taskcluster-lib-api for the polling webservice
21:33jhforduse it for inserting into redis, but write a raw http server for good measure!!!
21:33jonasfjjhford: we could do that, that's why I'm telling worker developers to not assume that takenUntil won't be less than 20min in the future
21:33jonasfjprobably overkill
21:34jonasfjbut yeah,
21:34jhfordwell, time spent on a cancelled job is wasted money and op.cost
21:35akishould we assume takenUntil will always be at least 3min in the future? otherwise the -2m function above might go negative :)
21:37jonasfjaki: that's a good point :)
21:37jonasfjit's probably optimistic to assume that workers would work well if I reduced takenUntil to +30s
21:38jonasfjideal would probably be: min((takenUntil - now()) * 70%, CONFIG_VALUE)
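A minimal sketch of the reclaim-delay rule proposed above, reading the formula as 70% of the time remaining until `takenUntil`, capped by a configured value. The function name, the `fraction` parameter, and the small `floor` that keeps the delay positive are assumptions added for illustration; they are not from scriptworker or any other worker.

```python
from datetime import datetime, timedelta, timezone


def reclaim_delay(taken_until, now, configured,
                  fraction=0.7, floor=timedelta(seconds=5)):
    """Return how long to wait before the next reclaimTask call.

    Uses a fraction of the remaining claim time rather than a fixed
    'takenUntil - 2min', so a short claim never yields a negative delay.
    """
    remaining = taken_until - now
    delay = min(remaining * fraction, configured)
    # Hypothetical floor: never schedule a reclaim in the past.
    return max(delay, floor)


now = datetime(2017, 3, 16, 21, 38, tzinfo=timezone.utc)
config_value = timedelta(minutes=5)

# Today's behaviour per the discussion: takenUntil is now() + 20min.
long_claim = reclaim_delay(now + timedelta(minutes=20), now, config_value)

# If the server ever shortened claims to +30s, the delay shrinks with it.
short_claim = reclaim_delay(now + timedelta(seconds=30), now, config_value)
```

With a 20 minute claim the configured 5 minutes wins; with a 30 second claim the delay drops to roughly 21 seconds without any worker-side change, which is the property jonasfj asks for.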
21:41catleeBBB reclaims if the task is within 10 minutes of expiring
21:41catleebut it uses fixed credentials I think
21:43akii think hardcoding it in the config, and adjusting when the defaults change, is a fine option
21:43akirather than assuming changes won't break the algorithm baked in
21:44akiscriptworkers have to deal with a lot less variance in the types of tasks they have to deal with
21:47catleejonasfj: so what's happening is that we have two things looking at the same BB job. (1 - the reflector) is basically just in charge of reclaiming tasks. (2 - bb listener) is watching for events that a BB job has finished, and goes to resolve the task
21:48catlee(1) looks at the build, sees that it's finished in BB, and tries to reclaim the task in TC because it's the responsibility of (2) to resolve the task
21:48catlee(2) ends up resolving it first
21:48catleeso I'd like to be able to say that we can simply ignore a 409 from reclaimTask in this case
21:49akithat means we can't cancel bbb jobs, no?
21:49catleeno, that's a different code path