mozilla :: #taskcluster

11 Oct 2017
01:21dustinnew tc-login is deployed -- should be more robust to failures in mozillians and LDAP
01:30dustingarndt: I'm guessing you have a few dozen tools tabs open now?
01:31garndta few
01:32garndt6
01:33garndtdo I win a prize?
01:33dustinhttps://irccloud.mozilla.com/pastebin/RYv9loDE/
01:33dustintrying to figure out why it's hitting login so much
01:36dustinand that's more than 6 :(
01:37dustinin theory it should add 5 minutes of jitter, and once one tab renews, the others reset their timers
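
The intended behavior described above, as a minimal Python sketch for illustration only (the real code runs in the tools frontend; the class, method names, and base interval are assumptions):

    import random
    import threading

    RENEW_INTERVAL = 15 * 60  # assumed base renewal interval in seconds, not from the log
    JITTER = 5 * 60           # the "5 minutes of jitter" mentioned above

    class CredentialRenewer:
        def __init__(self, renew):
            self._renew = renew
            self._timer = None

        def schedule(self):
            # each tab picks its own randomized deadline
            delay = RENEW_INTERVAL + random.uniform(0, JITTER)
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(delay, self._renew)
            self._timer.start()

        def on_other_tab_renewed(self):
            # some other tab already renewed: restart our timer instead of
            # hitting the login service ourselves
            self.schedule()
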
01:37garndtok, I have one open now viewing a task group
01:37garndtin case you want to see what happens with that
01:37dustinyeah, I saw that renew
01:37dustinhuh
01:38dustinand that was only 7 minutes later
01:38dustinhmm
01:38dustinanyway, login should be able to handle it, so I don't think this is critical
01:45dustinsigh, I can reproduce
01:48dustinI just can't win :(
07:17ttaubertpmoore|away: heyyy, awake yet? :)
08:24pmoore|awayttaubert: I'm in Düsseldorf airport, about to board a plane to visit you!
08:25pmoore|awayWhat's up?
08:49pmoore|transitttaubert: boarding now, see you later!
09:07ttaubertpmoore|transit: ok!
13:28dustinso mozillians just barfed again
13:28dustinbut the tc-client-web retried the resulting 500
13:28dustinand everything seems OKis
13:28dustin*OKish
13:29dustinlooks like it was for me, at that :)
13:35Elidustin:
13:36garndtstop breaking things, sheesh
13:40dustinlooks like I *fixed* things :)
13:40dustinalso re-opened the mozillians bug
16:32akihas anything changed with gecko-3-images ? https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-3-images seems to show them idle. i have a pending task, and the latest run task timed out after 2 hours
16:34dustinI don't see any pending tasks for gecko-3-images
16:35dustinoh, sorry, now I see 1
16:35dustinyeah, and now the provisioner is starting instances
16:35dustindid that task just get created?
16:36dustinyep, instance starting up in us-east-1a
16:37dustini-023fdd07f878b2ba9 started and is running the task
16:37akiyes, just got created
16:37akiawesome
16:37dustinI don't see any issue here :)
16:38akii guess i assumed the others were idle
16:39dustinno, they were dead :)
16:39akiok
16:40akithank you!
17:00garndtaki: yea that list is not a list of active machines. It's a list of machines seen in the last 24 hours.
17:01akii may not be the only one confused by that, but if i am, now i know :)
17:37bhearsumhi folks, i'm trying to figure out how to get our staging releases reporting to the staging balrog instance. that instance is behind the vpn, so i imagine that if i made a copy of the existing balrog vpn proxy (or made it configurable in https://github.com/taskcluster/docker-worker/blob/master/src/lib/features.js) that would probably work. is that a decent approach, or is there a better way?
17:40garndthonestly, this whole vpn proxy thing is hugely fragile and needs to be replaced entirely. I'm not a fan of doing anything more to extend what this thing is doing.
17:41dustinseconded
17:41dustinthat was 22% nicer than how I was going to say the same thing :)
17:41bhearsumyup, i agree too - and it's going to get phased out when everything shifts to balrogworker
17:41bhearsumi'm just looking for something short term to make sure we can do effective staging releases ahead of Firefox 57
17:41dustincan you make the staging instance not require VPN?
17:42bhearsumunfortunately not, FoxSec made a decree a few months back about that
17:42dustineven staging?
17:42bhearsumyep
17:42dustinsux for you :(
17:43dustinso the existing VPN config doesn't have staging access?
17:43dustin& if not, could it?
17:43bhearsumi think it might be able to route to it already (if not, that's easy to fix), but AFAICT it's hardcoded to production, per https://github.com/taskcluster/docker-worker/blob/master/src/lib/features/balrog_vpn_proxy.js
17:44garndtprobably because of this? https://github.com/taskcluster/docker-worker/blob/master/src/lib/features/balrog_vpn_proxy.js#L80
17:44bhearsumgarndt: that was my first thought - but i think PROXY_ADDR is also a problem
17:44bhearsumthat needs to be different for stage
17:45dustinhm
17:45garndt3 years ago.... amazing how much I forgot
17:45bhearsumi was trying to see if we could configure that address in the task, but it looks like there's no way (currently) to forward args if i'm reading https://github.com/taskcluster/docker-worker/blob/master/src/lib/task.js#L80 correctly
17:45dustindeploying a new balrog VPN proxy has a pretty big risk of closing trees (given what we saw with relengapi proxy)
17:46bhearsumeven if it's never enabled for mozilla-central and other important trees?
17:47garndtthis will use a staging balrog worker type too?
17:47dustinI don't remember the details of how that went wrong
17:47bhearsumgarndt: i think we'd still use funsize-balrog, but with a "balrogStageVPNProxy" feature instead of "balrogVPNProxy"
17:47dustinhow hard is it to run docker-worker on hardware? Could we set up an instance or two on releng hardware/VMs?
17:48dustinor we could add a flow to our new VPC :)
17:48garndtyea, at the time this was all written we didn't have the awesome new vpc
17:48garndtso hackity hack hacks
17:48dustinI'm not sure adding the flow to the VPC is great, either -- it means we need an isolated subnet for these workers, for example
17:48garndtyea, I was just about to say these would need to be zoned off
17:48dustinunless we are OK giving all gecko workers access to balrog
17:49bhearsumif we're just talking balrog stage, that's probably fine
17:50bhearsumi think i'm going to re-poke about making stage public again too...that's clearly the easiest path forward if i can manage it
17:50dustinat which point we might as well just build balrogworker
17:51garndtthis wouldn't be as bad if we had a dedicated worker type for staging, then it would be a worker type configuration about which balrog service we're proxying for
17:55catleewe have balrogworker for nightly
17:56bhearsumyeah, that doesn't help for staging releases for 57 though -- it still uses funsize + the vpn proxy
17:56catleeyeah
17:56catleeI wonder if we could set up a tunnel inside our VPC to stage
17:57bhearsumdo you mean an ec2 machine that would listen on a public addr, and proxy to balrog stage?
17:58bhearsumor maybe i should just ask: what do you mean exactly? :)
17:58dustinhaah
17:58dustinssssh, don't tell foxsec
18:04catleeI'm not really sure what I mean
18:04catleetunnels!
18:04catleepipes!
18:04catleeseries of tubes!
18:04garndttubes
18:04garndtmany many tubes supported by turtles
18:06catleeall the way down
18:07akian api-forwarder in the dmz?
18:08akii suppose that's a proxy =\
18:08bhearsum:)
18:09bhearsumwhatever we do, it only exists until we switch to balrogworker, i think
18:09bhearsumwhich i think we're trying to do for 59? unless i'm confusing projects again
18:10akiyeah, 58b1 is good, 59bX is a need
18:11akican we set up a standalone long-lived docker-worker and switch workerTypes?
18:11dustinso it sounds like the leading option is hacking balrogVpnProxy to support this somehow
18:11bhearsumok, so that means that when 59.0 hits release we could kill whatever special thing we do for balrog stage + taskcluster
18:11dustinyeah
18:11bhearsumdustin: that or public proxy seem like the easiest from where i'm sitting, assuming i can't convince cloudops to open up stage publicly
18:11dustinpublic proxy is basically opening it publicly :)
18:12bhearsumyeah
18:12dustinwhether that proxy is an EC2 instance you run quietly in a corner, or a config on the ZLB
18:12dustinsame difference
18:12dustinI like that solution but it seems unlikely :)
18:13bhearsumyeah, fair point
18:13bhearsumi'm not sure i want to risk my neck going behind foxsec like that, come to think of it
18:13dustinI think you're right that the risk of deploying a bogus balrogVpnProxy is just that it hurts things that use it
18:13dustinit happens that most stuff uses relengApiProxy, which is why that closed the trees
18:14bhearsumahhhh
18:14dustinand I suspect even if that goes sideways and requires some rollbacks and whatnot, it's probably still easier than balrogworker
18:15bhearsumagreed
18:15bhearsumso, assuming public stage is a non-option, should i throw together a patch for docker-worker that copy/pastes all the balrogvpnproxy stuff to a stage version?
18:16garndthrm, so how will tasks define which feature to use...the stage version or the real one
18:16bhearsumgarndt: i was going to define a new feature called "balrogStageVPNProxy"
18:17bhearsumand then we'll adjust our graph generation code to set the right one
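
A minimal sketch of the graph-generation side, assuming docker-worker features are booleans in the task payload; the helper and the is_staging flag are hypothetical, while the feature names are the ones from the discussion:

    def balrog_proxy_feature(is_staging):
        # pick the docker-worker feature by environment: "balrogStageVPNProxy"
        # is the new feature proposed above, "balrogVPNProxy" the existing one
        return "balrogStageVPNProxy" if is_staging else "balrogVPNProxy"

    # hypothetical fragment of a payload produced by the graph generation code
    # for a staging release
    payload = {
        "features": {
            balrog_proxy_feature(is_staging=True): True,
        },
    }
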
18:18garndtah ok
18:20dustinsounds good
18:20bhearsumok, cool
18:20dustinI look forward to deleting that, then deleting docker-worker for good measure :)
18:20bhearsumhaha
18:20bhearsumhopefully both versions of the proxy can die after 52 ESR dies
18:20dustinbut for now it seems the best course
18:20dustinyep
18:20bhearsumalright, i'll send a PR if the public stage doesn't pan out
18:21bhearsumthanks everyone - i appreciate your time
18:25garndtDocker-worker will never die, it will always live on in my heart
18:25dustinas long as the bugs are filed in your heart too, we're good
18:26dustin"I lov[an error occurred, 500 which means a server error linking the erotic-einstien container]"
18:26dustin>>puzzled look from family<<
18:28dustinit's funny that we basically talk to the same people all day in like 20 different venues
18:28dustinirc, email, bugzilla, github, sentry, slack, ..
18:30* garndt sends dustin a sms
18:30dustinsms, vidyo, ...
18:45jonasfjdustin: I noticed that err.retries is missing in sentry, from -> https://github.com/jonasfj/node-mozillians-client/blob/master/src/mozillians.js#L137
18:46jonasfjdustin: I'm like 99% sure that we're missing a critical "await" at --> https://github.com/jonasfj/node-mozillians-client/blob/master/src/mozillians.js#L134
18:46* jonasfj bang head, as bstack walks in...
18:53jonasfjdustin: -> https://github.com/taskcluster/taskcluster-login/pull/66
18:56jonasfjI'm still surprised how long and how well tc-login worked with mozillians considering that we didn't have retries..
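
The mozillians client itself is JavaScript; purely as an illustration of the class of bug being pointed at here, the same pattern in Python asyncio (all names hypothetical) shows why a missing await makes the retry loop dead code:

    import asyncio

    async def fetch(url):
        ...  # pretend this sometimes raises on a 5xx response

    async def fetch_with_retries(url, retries=5):
        for attempt in range(retries):
            try:
                # BUG: missing "await" -- fetch(url) only creates a coroutine
                # object, so no exception is ever raised here and the retry
                # branch below can never run
                return fetch(url)
                # FIX: return await fetch(url)
            except Exception:
                await asyncio.sleep(2 ** attempt)
        raise RuntimeError("out of retries")
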
18:58dustinjonasfj: want to merge and then verify it still works?
18:58dustinjust watch papertrail for non-200 responses :)
18:59jonasfjwill do...
19:02jonasfjyay, login still works (at least for me)
19:14dustinme too :)
19:14dustinplusses and minuses -- if login is down everyone with a tools tab open will know it soon
20:24jonasfjyeah...
20:24jonasfjdustin: I was looking at the create tasks stuff in decision logic... why not do:
20:24jonasfjhttps://irccloud.mozilla.com/pastebin/qAvhIHhx/
20:25jonasfjso you have a map task_id_to_future from: task_id -> future
20:25jonasfjwhen scheduling a task, you WAIT for task_id_to_future[dep] for all dependencies
20:25nalexanderCan I get some more guidance on interactive tasks? Can I get a shell "inside" a task? Right now I "edit as interactive" and replace the last command with bash. But I can't actually get to that shell (with the environment from the task!) via the interactive shell.html link; I get a new shell (that does not have the environment from the task).
20:26nalexanderFor example, https://tools.taskcluster.net/groups/CRxz3MDVQhmoZa8g9LbDrQ/tasks/CRxz3MDVQhmoZa8g9LbDrQ/details
20:26jonasfjyou visit in post-order, ensuring that task_id_to_future[dep] is populated before tasks depending on dep start waiting on it..
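
A minimal sketch of that scheme using concurrent.futures; graph.dependencies, graph.visit_postorder, and create_task are hypothetical stand-ins for the real decision-task code:

    from concurrent.futures import ThreadPoolExecutor

    def create_tasks(graph, create_task):
        # one future per task; each task blocks on its dependencies' futures
        # before it is created
        task_id_to_future = {}

        def schedule(task_id):
            for dep in graph.dependencies(task_id):
                task_id_to_future[dep].result()   # wait for the dependency
            create_task(task_id)

        with ThreadPoolExecutor(max_workers=10) as executor:
            # visiting in post-order guarantees task_id_to_future[dep] exists
            # before anything that depends on dep starts waiting on it
            for task_id in graph.visit_postorder():
                task_id_to_future[task_id] = executor.submit(schedule, task_id)
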
20:26nalexanderI want to connect _into_ that task.
20:27jonasfjnalexander: shell.html works fine in browser
20:28jonasfjso..
20:28nalexanderjonasfj: but it's not in the context of the task.
20:28jonasfjoh..
20:28nalexanderThe environment variables must be different.
20:28jonasfjthat's possible...
20:28nalexanderWhich makes it really hard to debug hard-to-debug problems :)
20:28jonasfjI'm not sure...
20:28jonasfjnalexander: you could run screen as your command maybe... and then attach to it?
20:29nalexanderjonasfj: do you think I could start `screen` or something like that?
20:29* jonasfj grasping at straws
20:29nalexanderGreat minds :)
20:29jonasfjlol
20:29jonasfjwell, you'd need to install screen for sure :)
20:30jonasfjnalexander: my suggestion would be to check out what env vars are different.. perhaps we should file a bug and fix it so the two commands are more alike..
20:32nalexanderjonasfj: you know, I think the issue might be that I'm not invoking `run-task` correctly. Let me experiment.
20:32nalexanderjonasfj: 'cuz the env vars actually do seem correct; maybe this is my own error.
20:32jonasfjit could be you don't have task.payload.env in the interactive shell, I can't remember if those get set or not
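
For what it's worth, the screen idea might look roughly like this as a task payload fragment; everything here is hypothetical (command, script name, timeout), and screen would have to be installed in the image first, as noted above:

    # start the real command inside a named screen session, keep the task
    # alive, then attach from the interactive shell with `screen -r task`
    payload = {
        "command": [
            "/bin/bash", "-c",
            "screen -dmS task ./run-the-real-thing.sh && sleep 7200",
        ],
    }
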
20:42glandiumdustin: can you elaborate on the r- in 1405570?
20:43dustinit was a follow-on from the previous comment
20:45jonasfjalso maybe "no un-optimized dependencies" == "fully optimized dependencies"
20:45jonasfjso we always never use double negatives :)
20:46dustinglandium: although if it's in the target set it will always run regardless of the hash :/
20:49dustinotoh, if nothing depends on these tasks and they are *not* in the target set, then they'll not be in the target task graph and thus not even considered in the optimization phase
20:50dustinoh, I'm thinking optimize_target_tasks is false
20:50dustinhttps://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/decision.py#81
20:50dustinit's True
20:50dustinso, yes, they should be in the target task set
20:53dustinand I think we should have a new optimization strategy similar to IndexSearch with should_remove_task -> False
20:54dustin(or maybe reuse the same class with IndexSearch(allow_replace={False,True}))
20:55dustinI suppose the ideal thing here would be to have should_remove_task return False iff the task label is in the target task set
20:56dustinbut that would be hard to configure
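
Roughly what that strategy could look like; the class name, allow_replace flag, and method names come from the discussion above, while the signatures and the index lookup itself are assumptions, not copied from the tree:

    class IndexSearch:
        def __init__(self, allow_replace=True):
            self.allow_replace = allow_replace

        def should_remove_task(self, task, params, index_paths):
            # a targeted task should never be dropped just because its
            # dependents were optimized away
            return False

        def should_replace_task(self, task, params, index_paths):
            if not self.allow_replace:
                return False
            # check each index path and return the existing taskId if an
            # artifact for this hash is already in the index (lookup elided)
            return False
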
21:04glandiumdustin: how would they get in the target task set?
21:06dustinhttps://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/target_tasks.py#121
21:06dustinso they might already be
21:15glandiumah, yes, they are https://bugzilla.mozilla.org/show_bug.cgi?id=1405570#c4
21:15firebotBug 1405570 NEW, mh+mozilla@glandium.org Download of the clang tidy artifact is failing
21:15glandiumjonasfj: fwiw, I find fully optimized dependencies less explicit
21:16dustinso I think the core of my disagreement with your patch is that it should not apply to docker images or other index-hashed things
21:16dustinI do too -- maybe "no dependencies remaining after optimization"
21:16dustinok, irccloud is having a very bad day.. /me takes off
21:17glandiumdustin: well, I think anything that is defined in the tree should be running if there has never been an artifact produced for it
21:17glandiumdocker images should not be an exception
21:17dustinwhy?
21:17glandiumif you don't want something to run, don't define it
21:18dustinwell the entire release process is defined, but we don't want to run that on every push :)
21:19dustinhm, but "if you don't want something to run, don't include it in the target task set" makes sense
21:19dustinin which case, your change would be correct
21:19dustinok, I think I'm coming around to see it your way :)
21:19akiwe add stuff to the target task set all the time that we need to optimize out. that may be a separate thing
21:20dustinyeah, I'm just saying we have lots of stuff that's defined and we don't want to run :)
21:20dustin*always want to run
21:21akii have an action task that specifies a do_not_optimize list -- should we do the same here? for docker images that you want to run all the time, explicitly list it in do_not_optimize?
21:22glandiumaki: we still want them to be optimized
21:22dustinnot with glandium's patch :)
21:23akimaybe i don't understand the problem. my only concern is that we don't change the behavior of optimization if it'll break the incoming releasetasks graphs
21:23dustinwe&#39;re not changing the behavior of optimization
21:23dustinin general :)
21:24akiok. i read too much into "if you don't want something to run, don't define it" then
21:27dustin-> r+
21:27dustin*now* I&#39;m leaving :)
21:28glandiumdustin: thanks
21:31jonasfjmaybe "with all dependencies optimized ..." or something... the "no un-optimized" is hard to unwrap, anyways it's just nits :)
21:34dustinyeah, I kinda don't like the comment either but life will go on and I can't think of a better way to phrase the whole thing :)
22:55bhearsumis the empty row on https://tools.taskcluster.net/index something that should be filed?