mozilla :: #taskcluster

18 Apr 2017
09:46wcostapmoore: wb
09:51pmoorewcosta: thanks!
09:52wcostapmoore: did you configure gw to reboot after every task
09:52pmoorewcosta: yes
09:53pmoorefor OS X and win7
09:53pmoorenot for win10
09:53wcostayes, I am interested in OSX
10:03pmoorewcosta: what am i missing here? See from today - but already has auth:azure-table:read-write:fakeaccount/DuMmYtAbLe (and hasn't been edited for 2 months)
10:05pmooreahhh i bet in the test it limits the scopes
10:05* pmoore checks
10:20pmoorejonasfj: dustin: did we have a backwardly-incompatible change to auth scopes? (no longer supporting auth:azure-table-access:* scopes)?
10:22pmoore(i couldn't find a reference to "auth:azure-table-access" in auth api docs any more) - but i see some references in scope inspector still, to such scopes, but don't want to remove them in case they are used but not documented
12:33Tomcat|sheriffdutypmoore: garndt happy merge day!
12:34Tomcat|sheriffdutygarndt: pmoore do you know if needs some action
12:34pmooreTomcat|sheriffduty: hey Tomcat!
12:34Tomcat|sheriffdutypmoore: need this some more software
12:37pmooreTomcat|sheriffduty: this looks like the issue I saw just before I went on PTO, let me see if I can find a bug number
12:40pmooreTomcat|sheriffduty: related to bug 1354216 ?
12:40firebot FIXED, cctools job permafail with Artifact "public/" not found f
12:42garndtTomcat|sheriffduty: that taskcluster-vcs error is recoverable as a few lines below it it says it's cloning the repo. That bug probably has some information on what might need to be uplifted to beta
12:43pmooreTomcat|sheriffduty: has bug 1354216 reached mozilla-beta yet? does it just need landing there? ted might know
12:48Tomcat|sheriffdutypmoore: garndt ok its now merge day i guess thats why we see this
12:53Tomcat|sheriffdutyted: ^ do we need to uplift bug 1354216 ?
12:53firebot FIXED, cctools job permafail with Artifact "public/" not found f
12:53tedTomcat|sheriffduty: it won't hurt anything, certainly
12:53tedand it will stop us wasting resources building a broken job
12:54tedit's also not harming anything, aside from having a broken job that runs on every push
12:54tedin that the output of that job isn't actually used for anything
12:56Tomcat|sheriffdutyted: yeah would be nice because its tier 1 and so would be bad to have it perma orange
12:56Tomcat|sheriffdutyted: is this NPOTB or do we need approval ?
12:57tedTomcat|sheriffduty: it's NPOTB
12:58tedthe change only affects those cctools jobs
12:58Tomcat|sheriffdutyted: ok cool thank you sir
13:12dustinwb pmoore :)
13:19Tomcat|sheriffdutyhi dustin
13:19Tomcat|sheriffdutydustin: does means we need a package for something ?
13:20dustinI think it means the send-build-stats step couldn't figure out the firefox package name
13:20dustinthat's about all I got :)
13:21dustinmaybe that mozharness step shouldn't run for addon builds?
13:21* dustin really doesn't know
13:23pmooredustin: thanks! nice to be back :)
14:31garndt!t-rex moved gecko-t-linux-large and xlarge to new docker-worker deployment which uses node6
14:52pmooregarndt: thanks for the generic-worker commit
14:54garndtdon't thank me yet, you haven't seen the trouble I caused in my other PR :)
15:02pmoore|mtgcome on vidyo
15:21jonasfjstandups: started the morning by landing 4 PRs on tc-worker :)
15:21standupsOk, submitted #44904 for
15:27camdEli: thanks for the review! :)
15:28Elicamd: np
15:30bastienHello guys
15:31bastienI cannot trigger a TC hook i just created :
15:31bastieni get a scope error saying i'm missing queue:create-task:aws-provisioner-v1/releng-task
15:31bastienbut i have this scope in my credentials
15:31bastienclientId : email/
15:32bastieni also created a client to do the same, giving him this scope and still get the same error message
15:37dustinwhat is the error message/
15:39bastienhi dustin , here it is
15:40dustinlooks like you're not using the client you think you are?
15:40dustinoh, hang on
15:40dustinthe role hook-id:project-shipit/dev-babadie needs to have that scope
15:41dustinI suspect you can grant it?
15:41bastienoh right, i'm trying that
15:41dustin create a role with that name (no assume: )
15:42bastienyeap, that worked
15:42bastienthanks a lot, i totally forgot to create the hook role
15:47dustinsure, no worries
15:47dustinI forgot too :)
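(The scope error above comes down to Taskcluster's scope-satisfaction rules: the hook's `hook-id:…` role, not the caller's client, must carry `queue:create-task:…`. A minimal sketch of the satisfaction check — a trailing `*` is the only wildcard, matching by prefix — written from the documented rules, not the auth service's actual code:)

```python
def satisfies(have: str, required: str) -> bool:
    """True if the scope `have` covers `required`.
    A scope ending in '*' covers any scope with that prefix."""
    if have == required:
        return True
    return have.endswith("*") and required.startswith(have[:-1])

def scopes_satisfy(scopeset, required):
    """A scope set satisfies a required scope if any member covers it."""
    return any(satisfies(s, required) for s in scopeset)

# The role attached to the hook must have the scope the task needs:
role_scopes = ["queue:create-task:aws-provisioner-v1/releng-task"]
assert scopes_satisfy(role_scopes, "queue:create-task:aws-provisioner-v1/releng-task")
# A star scope covers everything under its prefix:
assert scopes_satisfy(["queue:create-task:*"], "queue:create-task:aws-provisioner-v1/releng-task")
# An empty role (the situation before the fix) satisfies nothing:
assert not scopes_satisfy([], "queue:create-task:aws-provisioner-v1/releng-task")
```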
15:50bastienIs there a way to give some extra payload to a task through triggerHook ?
15:50dustinnot yet :)
15:50dustinthere's a GSoC proposal for that
15:52bastienok, so right now the only way to customize a task is to create it directly with the final parameters
15:52dustinyes :(
15:55bastienok, that does not change a lot of things for my needs
15:56bastienDo you have release notes for taskcluster , so i can track future changes ..?
16:01dustinbut we are maintaining an RFC repo now that you can follow to see what's coming up
16:21bstackwe should probably end up doing something similar to release notes eventually
16:21bstackalthough like we've discussed, I don't think tc as a whole will be versioned
16:22dustinrust might provide a good model, with "This Week In Rust"
16:22dustinmaybe once the RFCs get going, we can add a bit where when we close one, that goes in the TWITC(h)?
16:23dustin"This Week In Taskcluster, Hurrah!"
16:23dustinalong with any retros from the week, and any other breaking changes not part of an RFC
16:24bstackI think we can safely just call it TWIT
16:24bstackmuch like linus named git after himself
16:24bstackwe can name our changelog after ourselves
17:06bstackhaha, try stuff is hard afterall
17:06firebotBug 591688 ASSIGNED, Push to try should validate |try:| parameters
17:07bstacktime to go change my commit message
17:08bstackTIL don't write "try:" in your commit messages
19:11pmooreseems to be quite a backlog on gecko-t-win7-32 :/
19:12dustinIs that still bearing a full load and just relying on maxCapacity to cap its cost?
19:12pmooregood question
19:15pmoorehow do i graph pending count of a worker type over a period of time?
19:15pmoorewe store that data in statsum, right?
19:16pmoorebstack: do you know?
19:16bstackbstack knows all
19:16bstackbstack does not know that though
19:16bstackbut bstack can look real quick. one sec
19:17bstackevery time I go to they have like 15 "NEW! COOL!" features
19:18bstackwhich makes me think that they're generally not that new or cool
19:20pmoorebstack: How likely are you to recommend SignalFx to a friend?
19:20bstackbut everything else out there is like 4/10
19:21pmoorebstack: unfortunately it has to be a whole number....
19:21pmooreshall i go with 6 or 7 ?
19:22bstackgo with 7 I guess
19:22bstackthat was supposed to be "hehe"
19:22bstack"ehe" just makes it sound like I'm clearing my throat
19:23bstackfor a great proclamation that signalfx is meh at best
19:23dustinI'll go with 6/10 to round out to an average of 6.5
19:23bstackpmoore: is what you want?
19:23dustintheir API is pretty crappy
19:23bstackthese are all d.ustin-graphs
19:24dustin describes all of the stats the tool creates
19:25pmoorei couldn't see an easy way to search for graphs that other people have set up - should we create a group or something that lets us share graphs? is that something signalfx allows?
19:25pmoorethanks dustin for the link
19:25dustinwhich doesn't include a pending count
19:26dustinthere are shared dashboards I think
19:26dustinthe queue might be measuring pending?
19:26* dustin only measures what he manages
19:26dustinlet me walk that back
19:26dustinI don't manage anything
19:27dustinI'll just see myself out
19:27bstackpmoore: doesn't have a big list?
19:27* dustin backs toward door
19:27bstackmanager! dustin is now a manager!
19:27bstackI think dustin and garndt now have to fight until one backs away from the watering hole
19:28bstackat least that's how I understand this works
19:28dustincareful, if I'm your boss you'll be sorry
19:28dustinthat's not a threat, it's a prediction :)
19:28pmoorebstack: ah, d'oh - i first saw all the AWS icons and at a glance assumed these were generic - but now i see they are exactly what i was after
19:28bstackyeah, I would expect "custom" to be higher than "built in"
19:29bstackbut who am I to judge signalfx's ui decisions
19:29dustinor "demo"
19:29bstackthey're just so good at them
19:29dustinsure you don't want to join me at 6/10?
19:29* bstack lowers to 6/10
19:30bstackdustin: since we already have it open, let's fix the daily Major: pending for test > 30m
19:31bstackthey have a new detector that is standard deviation based
19:31dustinfix how?
19:31bstackor we can just bump it to 45 minutes
19:31bstackhaha, not fix the actual problem :p
19:31bstackthat's a bit harder
19:31dustinhow would stddev help?
19:31dustinbumping to 45 would be more effective
19:32bstackok, let's do that
19:32* bstack clicks buttons
19:32bstackI dunno, stddev sounded cool
19:32bstackit was one of their New! Cool! things
19:32bstackand I got hooked on the marketing
19:33dustinwoah, this looks way better on a hidpi screen
19:33dustinfundamentally I think John hit the nail on the head with this stuff: pending time is important to our customers' customers, but it's the wrong thing for us to make promises about
19:34dustin?? it's a configuration change in taskcluster-stats-collector...
19:35dustinand if you look at the 14-day view, those windows workerTypes are killing us :(
19:35bstackbut there's an alert that says things like "if X > 30"
19:35dustinalthough happily I get a chance to use the term "kilominute"
19:35bstackshould I change that back?
19:35dustinok, yah, that's fair
19:35bstackit is 45 now
19:35dustinwe should probably adjust the SLO too
19:35* bstack coughs
19:36dustinstill, looking back over two weeks, we've not been between 30 and 45 minutes for any but an hour or two
19:36bstacklink to the view you're looking at now?
19:36bstackoh... the 45 minute detector has gone off already
19:37dustinbstack: want to do a tech topic on that paper?
19:37* dustin adds to pocket, wishes pocket supported pdfs
19:37bstackyeah, happy to talk about it at some point. I'm not sure it's suited for that in a lot of ways
19:37bstackmaybe an rfc?
19:37bstackor a tech talk
19:37bstackanything is fine with me really
19:37dustinI would like tech talks to get back to a more research-y focus
19:38dustinmore like "paper group" and less like "product talk"
19:38bstackyeah. I like that too
19:38bstackwow, that one windows worker is killing us
19:38dustinso basically we would all promise to read the paper, then you would summarize it and we'd discuss :)
19:39dustinwhat if we made all windows tasks that are capacity-constrained like this have a super-short deadline, like 2h longer than the expected build time?
19:39dustinpmoore: ^^?
19:39dustinthe other option is to exclude them from the SLI in tc-stats-collector
19:39bstackthe one nice thing about non-remote work is you can have these graphs up on tvs around the room and sorta end up knowing these things without needing to go look
19:39bstackmaybe I should wall mount a tv in my kitchen for 24x7 graphs
19:40dustinI think that's what interior designers mean when they talk about ahm-bey-once
19:40bstackthere's no remote work equivalent for somebody looking up at a graph and going "WTF!"
19:41bstackI want my kitchen's ambiance to be sad and terrifying graphs.
19:41dustinwell, sometimes I look at my cat and do that
19:41dustinsorta equivalent
19:41bstackok, so... what do we do with this knowledge
19:41pmooredustin: i'll need to check in with grenade to find out if this is a general problem or not - i tried to create a graph, but it isn't showing anything :/
19:41bstackwhen signalfx just starts showing you cat butts in place of graphs, you know you've messed up
19:41dustinanyway, I'll be happy to r+ config changes to tc-stats-collector, and I'm also happy to schedule a tech topic whenever you'd like
19:41pmooretbh i don't really understand the interface
19:42pmoorei'm guessing the queue submits data points for tc-stats-collector.tasks.gecko-t-win7-32.pending
19:43pmoorebstack: i would kill for cat butts at this moment
19:43bstackyeah, schedule for whenever
19:43dustinpmoore: no, tc-stats-collector submits them
19:43pmooredustin: ah right, that is the aggregation service thingy
19:43garndtI like it
19:44dustinpmoore: but I have no idea why they're not showing up
19:45dustinI'm guessing we haven't logged to that metric in a long time
19:45pmoorethis is a queue data point, right?
19:45dustinwell, it's *related* to the queue
19:45dustinif that's what you mean
19:45dustinbut it's tracked in tc-stats-collector
19:45pmoorei.e. the queue submits the data to the stats collector
19:45pmooreah, the stats collector routinely queries the queue?
19:45dustinstats collector listens to the normal task pulse messages
19:45garndtI came into this late, what are we trying to answer about that win worker type?
19:45dustinI don't think it actually calls the queue
19:46pmooreah ok
19:46dustingarndt: why is its p95 pending time so high?
19:46dustinfor so long?
19:47dustinI suspect we're pushing more production (but T3) jobs at it than maxCapacity allows
19:47dustinand relying on maxCapacity to limit our cost
19:47pmooredustin: does it count messages published on task defined exchange and then somehow remove them when it smells a task on task claimed exchange?
19:47dustinpmoore: basically
19:47dustinthere's a lot of corner cases :)
19:48pmooreno worries
19:48dustinbut the idea is that at any given time, it is giving an accurate pending time for each pending task
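(The mechanism described here — counting tasks in on the pending exchange and out again when they're claimed, rather than polling the queue — can be sketched roughly as below. Names and structure are illustrative; this is not tc-stats-collector's actual implementation, which handles many more corner cases:)

```python
class PendingTracker:
    """Rough sketch: derive per-workerType pending counts and pending
    times from pulse-style task events instead of polling the queue."""

    def __init__(self):
        # (workerType, taskId) -> timestamp the task became pending
        self.pending = {}

    def on_task_pending(self, worker_type, task_id, ts):
        self.pending[(worker_type, task_id)] = ts

    def on_task_running(self, worker_type, task_id):
        # Task was claimed by a worker; it is no longer pending.
        self.pending.pop((worker_type, task_id), None)

    def pending_count(self, worker_type):
        return sum(1 for (wt, _) in self.pending if wt == worker_type)

    def max_pending_time(self, worker_type, now):
        ages = [now - ts for (wt, _), ts in self.pending.items()
                if wt == worker_type]
        return max(ages, default=0.0)

t = PendingTracker()
t.on_task_pending("gecko-t-win7-32", "task-a", 0.0)
t.on_task_pending("gecko-t-win7-32", "task-b", 60.0)
t.on_task_running("gecko-t-win7-32", "task-a")  # claimed, drops out
assert t.pending_count("gecko-t-win7-32") == 1
assert t.max_pending_time("gecko-t-win7-32", now=120.0) == 60.0
```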
19:48garndtWhy are we not increasing the maxCapacity? we have had the budget for it, it was just never higher because up until recently we did not need to increase it. Is there a risk of increasing it because it'll compete with releng?
19:48pmoorebut to answer garndt i was just curious if there is a constant backlog, or just a spike now
19:48dustinso you can actually look at the minimum of that value to find out the best current case (e.g., if we had high priority tasks they would keep the minimum low)
19:48dustinpmoore: if you look at that 14d view, it's over several days
19:49dustingarndt: I think that risk has been part of the reason
19:49dustinand also "it's T3 why spend money on it"
19:49dustinbut if that's the case we should be doing something to limit the number of tasks to match, imho
19:50garndtI think that last reason is going to become less of a reason soon, we need to be spending more time getting windows moved over and standing up win10
19:51pmooreah nice
19:51* dustin so maybe we should just crank maxCapacity way up and see how many instances we need
19:51garndtso seems like waiting for results is not the greatest
19:51pmooreso since 15 april i don't see any heavy load:
19:51* dustin huh, ctrl-enter adds "/me" in irccloud
19:51garndtI can tell you how many jobs are scheduled per hour if that helps determine max capacity
19:52dustinpmoore: depends on your definition of "heavy"
19:53dustinpmoore: under "Axes" set the Y axis maximum to 60
19:53pmooregarndt: dustin: so to circle back to the start on this one, i thought since we have stats, it would be useful to see what the history is like of the pending count on gecko-t-win7-32, to understand if we need to bump capacity
19:53dustinso a spike and the current overage today
19:53* bstack starts futzing with auth-staging
19:54garndtpmoore: yea, I'm not sure if we are capturing a snapshot of pending numbers per worker type
19:54* garndt hasn't looked at it in a while
19:54bstackdustin: are you just wandering around hitting ctrl-enter in various form fields around the web?
19:55dustinhaha, it's somehow gotten into my muscle memory
19:55dustinprobably from typing up all those RFCs :)
19:55* dustin goes back to reading about OIDC
19:56bstackaws provisioner just hiccuped in sentry. but it's an azure thing and provisioner doesn't really care about azure, right?
19:56bstackit's just a convenience for viewing
19:59dustinand workerTypes
19:59dustinblob storage is just for display
19:59pmooredustin: garndt: bstack: thanks for all your help - i think i got what i wanted, which is
20:00pmooreoh not quite
20:00pmoore(the 12h interval is wrong)
20:00pmoorebut anyway, i think i've worked my way around it, looks like pending time is ~ 1h atm
20:00bstackAlways happy to help with the graphs
20:00pmooreso i agree let's bump the maxCapacity
20:02pmoorei've bumped to 512
20:04garndtyea, I think anywhere from 400-500 is good. Looking at our average duration for that worker type, and the per-hour breakdown of scheduled tasks, it seems that peak times of the day are when we start to struggle. Also a few hours ago we had double the usual number of jobs scheduled for that worker type
20:04garndtand then probably never recovered since
20:04dustinI was just going to say, 513 would be too much
20:05garndt!t-rex on this day, let it be know, gecko-t-win7-32 maxCapacity was bumped from 200 to 512
20:05garndtdamn it! known!
20:06garndtthis way I can search for it in the logs later if I need to
20:06dustinlet it be dun knowed
20:10pmoorelet it also be known, on this day, garndt successfully won a game of software-kerplunk by successfully removing hundreds of lines of code from generic-worker without causing any test failures
20:10catleewe dun knowed it good
20:11* pmoore hopes kerplunk wasn't just a british thing of the 80's
20:17dustinok grampa
20:17dustinwhat else did you do in the olden days?
20:18dustin(I vaguely remember kerplunk)
20:18dustin is a nice similar game
20:20jlund|mtghm, weird...
20:20jlund|mtgthis task is green yet the result, pushing artifacts to releases/* dir, didn't happen:
20:20jlund|mtgalso, there are no task artifacts (including logs)
20:22dustinthe more I think about our frontend broadly, the more I think we should have a dedicated frontend app for task/taskgroup/index exploration
20:22dustinthat's nice and tightly integrated
20:22dustinand separate from all the roles, clients, hooks, etc. administration
20:22garndtjlund|mtg: that task confuses's using a worker type that we provision, but the worker id is "taskcluster-cli"....that's not one of our worker ids (we use the ec2 instance id as the worker id)
20:23jlund|mtghm, interesting
20:23jlund|mtgthanks, that might be the smoking gun
20:23dustingood eye garndt
20:23jlund|mtgI'll investigate
20:24dustinworkerGroup opt-linux64 is weird, too
20:25garndtyea, our worker group for provisioned instances I believe is the region the instance is in
20:27pmooregarndt: i'm rolling out your claimWork fix to 8.2.0 of generic-worker, and putting that on our beta worker types in
20:28garndtheh, you put more faith in that change than I do
20:28garndtbut I guess it's just the beta worker type :)
20:28garndtwhat uses that?
20:28jlund|mtglatest theory is someone in releaseduty accidentally used tc-cli to force completed on the wrong task
20:28pmooregarndt: if you (or anyone else) wants to do a try push, first wait for release to appear on generic worker releases tab, then merge the OCC PR, then wait for that to update the worker types, and then wait 30m for old instances to die, then modify worker type name in try push to have -beta suffix
20:29garndtjlund|mtg: yea, if the task was resolved as completed as the task was running elsewhere, it probably hit issues trying to create artifacts
20:29jlund|mtgwe use opt-linux64 in a few places in release automation fwiw:
20:29garndtpmoore: I can assure you that I won't be the one doing that today :)
20:29pmooregarndt: oh yes, i wouldn't roll out just as I'm about to head to bed! so, only beta worker types :) nothing using them - we use it for testing rollout
20:30pmooregarndt: so releases just completed: - so that above PR should be safe to merge if you can find someone in your timezone to review it, otherwise grenade will likely do it when he's back :)
20:31dustinjlund|mtg: do you actually use the workerType?
20:32dustinit's old and will be deleted eventually
20:32dustinas in, is it just a dummy for tc-cli, or are you depending on actual hosts to run the tasks
20:32akiwe do, but we could switch to something else
20:32dustinwhere should I file a bug for that?
20:32dustinit's not a rush, but should get cleaned up
20:33dustinrelease automation I suppose?
20:33akii think so. releasetasks appears to have issues disabled
20:34jlund|mtgha, disabled? interesting
20:34jlund|mtgI guess that forces bz ;)
20:35dustin :)
20:35firebotBug 1357548 NEW, Use a different workerType for release tasks than "opt-linux64"
20:37* aki adds to releasewarrior
20:37garndtall of these warriors
20:41grenadeJust a heads up to anyone watching pending counts for win7 GPU. Termination counts are very high on g2.2xlarge. bb cloud-tools is regularly alerting on high spot bid prices for the same instance type.
20:43catleewe've had alerts for not being able to get g2.2xlarge a few times in the past few days
20:45grenadeThe TC instances i monitor are being terminated regularly.
20:45catlee[sns alert] Apr 18 13:20:19 2017-04-18 13:20:19,849 - g-w732: market price too expensive in all available regions; spot instances needed: 146
20:47dustinlooks like we have just over 200 instances
20:48dustinmost in use1, followed by eu1 and usw2
20:48dustin "maxPrice": 0.9,
20:49dustin I don't have info readily available on what we're paying for them
20:49grenadeWe had 363 termination notices yesterday.
20:49dustinI think we typically bid 2x the going rate
20:49grenade922 on the 14th
20:51garndtseems like quite the volatile market
20:54garndtlooks like in March we averaged $.2/hr
20:55garndtfor the g2's
20:56garndtalthough last week, we were more around .4/hr
20:58garndtwhy can't all days be cheap like the weekends??
21:00grenadeWhen ec2 regions include other planets, it will always be the weekend somewhere...
21:04grenadeAh science.