mozilla :: #rust-infra

8 Sep 2017
05:16acrichtosimulacrum: perf found a bug in cargo --
05:16acrichtothat being said thousands of open file descriptors sounds like a bug?
05:16acrichtoespecially if they're being inherited by children?
05:19acrichtothis is probably libgit2 being super bad
05:24sfackleracrichto: you should be able to use /proc/<pid>/fd to check out what the file descriptors correspond to
05:24acrichtolooking yeah
05:24acrichtogotta wait for it again
05:24acrichtonothing looks awry yet
05:24acrichtoshoulda known that earlier...
05:36acrichtosimulacrum: bug in cargo --
13:34simulacrumGood that we found it!
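For reference, the /proc check suggested above can be scripted. This is a minimal sketch, assuming a Linux host where each process exposes its open descriptors as symlinks under /proc/<pid>/fd. The helper names `list_open_fds` and `summarize` are hypothetical and are not part of cargo or libgit2.

```python
import os
from collections import Counter


def list_open_fds(pid):
    """Return {fd_number: target} read from /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to the file, socket, or pipe behind the fd.
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result


def summarize(fds, top=5):
    """Return the most common targets, to spot one file opened many times."""
    return Counter(fds.values()).most_common(top)
```

Running `summarize(list_open_fds(child_pid))` against a child process would show whether it inherited a large number of descriptors pointing at the same kind of target.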
17:07acrichtolarsberg: if you'd like to keep following along, osx is still super dead wrt travis
17:07acrichtoit's also super amplified right now coincidentally
17:07acrichtocurrently, as of a few hours ago, travis has just flat out halted all OSX builds
17:07acrichtoand due to the way their queueing works
17:07acrichtoour 60 parallelism slots were just sitting there saying "waiting on an osx worker"
17:07acrichtowhen of course that's not gonna happen for a few hours
17:07acrichtoso despite having hundreds of queued linux builds
17:08acrichtoosx being frozen freezes the entire world
17:08acrichtoand of course the incident is just called "increased error rates"
17:08acrichtofor "we shut down everything"
18:04larsbergacrichto: Ugh, that totally sucks. From #embassy, it looks like they've been basically down for the last ~9 hours doing various things to try to reboot, etc.?
18:05larsberglol and I just got a macstadium warning "atlanta = gonna be funtimes on Monday"
18:05larsberg(due to Hurricane Irma)
18:06larsbergI'm not sure what to say/do about the current situation. I mean, it appears that either we wait for things to "get better" or accelerate migration along the lines Servo is doing (to Task Cluster), though that does require us to manage our own macs again
18:06larsbergWhich is a total nightmare for Servo right now - I think we lose ~1 per month and have to randomly reboot them all the time. Or just ssh in and shut them down "until we have more time to think about it"
18:07acrichtoI pinged ted about talking to someone at circleci
18:07acrichtoI think that's our most viable route forward, if any
18:07acrichtolots of unknowns there, of course
18:07larsbergI suspect that getting cross-compile is most viable
18:08acrichtohm really?
18:08acrichtolike we still need macs to run tests
18:08larsbergI'm not sure that anybody is really going to have a magical bullet for running well on macos, though certainly some folks might have more ops than others
18:08acrichtoso someone's gotta manage those macs
18:08tedyeah, but they don't need to be as fast, which is nice
18:08acrichtoat this point I sort of just don't trust travis at all
18:08acrichtolike sure downtime is fine
18:08acrichtobut they should talk to us
18:08larsbergRight, I'm mainly talking about reducing the risk associated with them. But for sure not being 'lol ded' once a month is good
18:08acrichtoor respond when we bring issues to them
18:08tedalso presumably your test runs take a lot less time than the actual build
18:08acrichto100% of the time travis has said "meh we're working on it"
18:09larsbergYeah, point taken
18:09acrichtothis is something I'd want to talk to circleci about though
18:09acrichtolike just being different definitely isn't enough
18:09acrichtoted: indeed! (tests take almost no time compared to the builds themselves)
18:09acrichtowe have a setup to run test binaries in some capacity remotely
18:09acrichtoin that we emulate ARM tests right now
18:10acrichtobut we don't have support just yet for full compiler test suites
18:10acrichtoI don't think it'd be the hardest thing in the world though
18:10tedour build times are *way* better for our mac builds now that we can run them on beefy ec2 instances
18:10tedrather than crappy mac minis
18:10tedof course now our problem is that we constantly have a test backlog because we can generate builds faster than we can run all the tests on them on our pool of available mac hardware
18:11* ted bbl
18:11acrichtohm that's an excellent point yeah
20:01ericktaturon / acrichto / aidanhs: who is running the meeting today? I can't make it
20:01aturonerickt: i am
20:01ericktaturon: you're bad at pto :p
20:02ericktso anyway, if you got a moment, I did want to talk through while I got some free time now
20:03aturonerickt: i don't have a moment right now, since i *am* in fact on PTO until the meeting :)
20:03ericktbiggest thing was just to call attention to it. anyone else want to help review?
20:04ericktthis isn't the simplification that I promised, but it is the whole "got user management working" and etc
20:05ericktnext would be to try to slim it down to help teach all these tools
20:55aidanhsim happy to take a look
21:20ericktaidanhs: <3
21:31TimNNo/ (from mobile)
21:31aidanhs(I am here, but on mobile so messages may be slow)
21:32acrichtoaturon: simulacrum: ping mtg
21:32aturonhere now, sorry i'm late
21:33aturonso i know carols10cents and shep are away
21:33aturonand i believe that erickt can't make it today
21:33aturonso i guess we should go ahead and start, simulacrum will maybe join in a bit
21:34aturonso PR status is: sadness
21:34aidanhsmy inbox feels the same way
21:34acrichto"lol no"
21:34aturonafaik there's not really anything to discuss reactively here
21:34acrichto"At the moment we do not have an ETA for when we will resume builds"
21:34aturonthat is, there's essentially nothing to do
21:34acrichtoyup :(
21:35aturonhm we should probably update @ruststatus at least!
21:35aturonseem to've gotten a bit out of the habit
21:35* aturon doesn't remember the syntax...
21:35aidanhsuhh, it doesn't seem to be running, my server rebooted
21:36aturonah, that's why i couldn't find the name :)
21:36aidanhsI'll get it back up when i get home
21:36aturonnoted as action item
21:36aturonthe other PR counts look mostly within range
21:36aturon(non-bors, i mean)
21:37aturonlonger-term of course we need to get our head around the ongoing Travis situation, but at the moment there are no obviously-good options
21:37aturonso i think for now we just move on?
21:37aidanhsone thing about triage i'd like to bring up while we're here
21:37aturongo for it
21:38aidanhsissue triage specifically
21:38aidanhswe kinda agreed to do issue triage on a trial basis and i'm not sure it's working
21:38aidanhsin the sense
21:39aidanhsit's a lot of extra overhead on top of cargobomb and pr triage on the same day
21:39acrichtoit does seem that way yeah
21:39aidanhsi would like to propose either rotating days, cutting down the process to just c flags or something else
21:39aidanhsmaybe both!
21:40simulI'm wondering if it's a matter of assigning more than one person per day or alternating pr vs issue triage day
21:40aturonso one thing is, we can also for sure expand the set of triagers
21:40aturonno need for this to be limited to folks otherwise doing infra work
21:40aidanhsyeah that's what i mean by rotation - pr days stay, issues day swaps
21:40simulCould also be time to automate
21:41aidanhssimul: where would you start?
21:41simulIf we automate PR triage that'd free time
21:41aidanhsah ok
21:41simulBors tags seem the easiest
21:41aturonyeah that'd be a good impl period item
21:41aidanhsi'm +1
21:42aturoni also want to make sure that we're deriving value from issue triage
21:42aturondo we have steady consumers of this info?
21:42simulValue was intended to be libs and lang team triage
21:42simulBut not sure that happened
21:42aidanhsi feel the c tags are pretty basic stuff, just figuring if something is a bug or feature request
21:43aturoni think in both cases the 1 hour triage meeting is already pretty full
21:43aturonok, so here are some proposals:
21:43aidanhsback in 5
21:43simulIt could be something to outsource more
21:44aturon- I will email all@rlo to get more feedback on what people want out of the issue tracker, just to check we're well-aligned
21:44aturon- we'll put on impl period plan to start working toward tag automation with bors, which should be really easy and hopefully pave the way for other things
21:44aturon- i'll put on my medium-term list thinking about a distinct triage group -- there are other aspects of triage we need to get more systematic about anyway
21:45simulIt would be great to get a consistent story as to what issues are
21:45aturonand as we discussed before, to be able to close more of them :)
21:45aturonhm so all my items are aimed at solving the problem longer term, what should we do meantime
21:46simulI think alternating is a good solution
21:46aturonsimul: ok so i'm a little unclear on the exact proposal there
21:47simulA/B days basically and a=issue, b is pr
21:48aturonhm so to be clear, i think it's important we do PR daily -- are you imagining having a separate roster for issues?
21:48* aturon feels like he's being dense
21:48simulAh, I didn't see it as too important
21:48simulBut sure, separate roster is good
21:48aturonso my worry is that if we switch to every other day it'll pile up, end up not actually reducing load
21:49aturon(and of course we'll be less on top of state changes etc)
21:49aidanhsbtw, should i get all@rlo emails? wasn't clear
21:49aturonok, i'll make a separate roster on the PR tracking thing
21:49aturonaidanhs: you should
21:49aturoneveryone on any subteam (including shepherds) should
21:49aturonaidanhs: you should have a couple from me from yesterday
21:49simulI haven't received anything but rust-ops email I think
21:49aturonuhhh ok hold on
21:49aidanhsok, i don't think i got one
21:50simulI definitely don't recall seeing mail yesterday
21:51aturonok, the mailmap for all@ was missing infra
21:51aturoni will send the emails i sent out to infra@
21:51aturonactually lemme do that now to test
21:52aturonsimul: aidanhs: either of you able to check?
21:53simulI can try
21:53aturonok anyway, let's keep going
21:53simulYes received 2
21:53aturonso lemme add a column on the roster
21:54aturoncolumn added
21:55aturoni'll do issues on friday :)
21:55aturoni'll send out an email to infra@ to get more signups
21:55aturonand longer-term, we'll look to expanding the group of triagers,
21:55aturonso that we can focus more on infra construction and maintenance
21:56aturonalright anything else on triage?
21:56aidanhswell next item is about a different triage
21:56aturonhah indeed
21:56aturonlet's talk cargobomb!
21:57aidanhsi have written instructions in painful detail to make it a very mechanical process
21:57aturonthis is in the cargobomb README right?
21:58simulI looked at instructions a week ago. Looked a little confusing, but I think part of it was that I couldn't try the steps as both machines were busy
21:58aidanhsit's fairly dull and very amenable to automation
21:59aidanhssimul: you could pick up the beta since it's finished running
21:59aturonIIRC there are some specific perms needed for this that we don't all yet have?
21:59simulI'll try to after meeting
22:00aidanhsok, give me a ping if you have issues so we can improve it
22:00acrichtodo we need more people doing triage here?
22:00acrichtospecifically for cargobomb?
22:00acrichtoor more people just "doing all the steps"?
22:00aturonacrichto: afaik we don't have a formal cargobomb triage rotation yet
22:00aidanhsmore people capable i think, especially if we want regular nightly runs
22:01aidanhsme and tomprince were away for a weekend and a few prs just sat with nothing happening
22:01aturonpart of the problem is that cargobomb also takes much longer to run than crater did
22:02aturonit would be useful, ultimately, to have a mode that doesn't actually run tests and can go as fast as crater used to (not sure how much work is needed)
22:02acrichtoI think the timing is less so tests and moreso one machine vs thousands
22:02aidanhsi suspect we can get it down from four days to two with some work
22:02acrichtoin that this runs all on one machine iirc as opposed to taskcluster
22:03aturonthat's unfortunate
22:03aidanhsand yes it is an embarrassingly parallel problem
22:03aidanhsit just needs some love really
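The "embarrassingly parallel" point can be illustrated with a minimal sketch. This is not how cargobomb is implemented: `shard`, `run_shard`, `run_all`, and the `check` callback are hypothetical names, and threads on one machine stand in for the many machines mentioned above. The idea is only that each crate can be checked independently, so the crate list can be split and the results merged afterwards.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(items, n):
    """Split items into n round-robin shards."""
    return [items[i::n] for i in range(n)]


def run_shard(crates, check):
    """Run the (hypothetical) per-crate check on one shard."""
    return {crate: check(crate) for crate in crates}


def run_all(crates, check, workers=4):
    """Check every crate, one shard per worker, and merge the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        shards = shard(crates, workers)
        for partial in pool.map(lambda s: run_shard(s, check), shards):
            # Shards never share a crate, so merging cannot overwrite a result.
            results.update(partial)
    return results
```

Because no shard depends on another, the same split would work across separate machines, with only the final merge needing a shared location.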
22:03aturonhm ok, sorry to distract us on the perf issue, that's separable
22:03aturonaidanhs: did you have specific thoughts re: establishing a rotation?
22:04* aturon is worried about stretching thin
22:04aidanhspart of my plan with indoctrinating more people is to make more people able to give it love
22:05aidanhsit probably doesn't need a formal rotation - just checking in every couple of days and following relevant steps would suffice
22:05aidanhsjust to build familiarity
22:05aidanhs(that's my process)
22:05aturonhm, i'm just thinking if we don't assign days, it won't happen
22:06aturonbut maybe we're not quite ready yet?
22:06simulacrumI don't think until we get cargobomb to run faster it'll happen
22:06aturonso far i believe that tomprince and aidanhs are the only ones who've been able to operate it?
22:06simulacrumI should be able to (but have never done it)
22:06simulacrumi.e. I think I have necessary permissions
22:06aturonok, so maybe let's aim again for simulacrum to get through a run
22:07aturonget a bit more polished on the process
22:07aturonand then we can reconvene and figure out what's next
22:07simulacrumI'm going to deal with whatever is left on the beta run after meeting and we can go from there
22:07aturonit's definitely something that the other teams rely increasingly heavily on
22:07aidanhsmaybe a tally per triager of checking in twice a week and it doesn't matter which days
22:08aturoni've added an action item for simulacrum for now
22:08aidanhsthe process is so ad hoc right now, it wouldn't make much of a difference
22:08aidanhsbut we can revisit
22:08aturonalso this one is *very* ripe for automation
22:08aturonmaybe we can attract enough interest with the impl period to build a bit more of a group working on it
22:08aidanhsanyone following the steps will realise this agonisingly quickly
22:09aturonso, while we're on the topic of cargobomb -- i wanted to ask whether we should consider a rename
22:09aidanhsi am +1
22:09aturonwe've run into issues with names like this in the past -- even the Libz Blitz, it turns out, has strongly negative connotations in Europe
22:09aidanhsit's a fun play on words
22:09aturon(which should've been obvious...)
22:10aidanhs(crater is)
22:10simulacrumWe could go back to calling it crater, I guess.
22:10aturonyeah, that&#39;s kinda what i was thinking
22:10aturoncrater 2.0
22:10aturonsince crater is decommissioned at this point
22:10aturonany objections/concerns?
22:11aturonoh there was a nominated issue, one sec
22:11aturon(also simulacrum, did you remove your agenda item about cargo?)
22:11simulacrumyeah, acrichto figured it out
22:11aturonnominated issue:
22:12simulacrumlibgit2 wasn't being cooperative or something like that
22:12simulacrumi.e. 1000s of file descriptors were getting left
22:12aturonsimulacrum: you nominated ^, did you have specific thoughts?
22:12acrichtoaturon: oh that nominated issue is a TODO item for me
22:12acrichtoto dig up the steps to get it back and post it somewhere
22:12aturonok cool
22:12simulacrumno specific thoughts, just wanted to discuss at some point
22:12simulacrumI guess we can assign to acrichto for now?
22:13aturondone, and un-nominated
22:13aturonaction item also recorded
22:13aidanhsthis was already with acrichto
22:13aturonaidanhs: you had an item on google docs perms?
22:13aidanhsah you've said that
22:14aidanhssorry, catching up
22:14aidanhsok, docs
22:14aidanhsall of them are writable if you have the link
22:14aidanhswhich makes them open for vandalism
22:15aidanhsthe infra folder is linked in this channel title
22:15aturonheh, such security
22:15aidanhscargobomb sheet in readme, triage sheet on forge
22:16aidanhsanyway, you get the idea - these links are going far and wide and it's going to be a hassle when someone inevitably vandalises
22:16simulacrumTransitioning to view-only and providing edit-only permissions to infra team seems fairly straightforward, I think.
22:17aidanhsi have no good suggestions that don't require everyone to get a google account though
22:17aturonso i'm happy to make the link read-only and send edit invites to the infra members
22:17simulacrumYeah, I guess it'd require google accounts? Not sure if drive works with non-google... probably not
22:17aidanhsif we're happy with that requirement then i have no objections
22:17simulacrumWe can also transition to notes elsewhere per-meeting (i.e. dropbox)
22:18aturonaidanhs: mind sending mail to infra@rlo on this topic?
22:18aturonthe other consideration is that most other subteams are now using dropbox paper for collaborative docs
22:18aturonwhich is great, BUT doesn't have a spreadsheet replacement
22:18aturon(which is why i haven't pushed on it for us)
22:19aidanhsdoes that require a dropbox acct?
22:19aidanhswell, i'll look into it
22:19aturonyou need an account for editing, though it can go through other existing accounts
22:20aturonaidanhs: next up, "toggle trick for PR triage"?
22:20aidanhsone min, sorry
22:22aidanhsjust a tip if a pr author is on holiday, you can save time for other people by bumping it or by applying and unapplying a label
22:22simulacrumhm, interesting thought
22:22aturonyeah, good trick!
22:23aidanhsmeans we don't have a growing stack of waiting on author prs when we know we're going to be waiting a while
22:23aidanhsi've been doing it, but thought i'd put it out there
22:23aidanhswill add to forge if people like
22:23aturonaidanhs: please do!
22:23aturonnext up -- something that came up recently with RLS rollout
22:24aturonnamely that we don't have anything like a beta-dev channel for testing out artifacts/rustup etc
22:24aturoni'm not sure whether nightly should already be sufficient, or whether there's an actual need here
22:24aturonbut as we start adding more tools this could be an issue
22:24aturonacrichto: do you have thoughts?
22:24simulacrumI feel like we almost need a nightly-dev or something
22:24acrichtothis is basically already implemented on the technical side
22:25acrichtoit's how we do stable releases (dev first then prod)
22:25aidanhssorry, can you clarify
22:25aidanhswhat happened?
22:25aturonoh sure, sorry
22:25aturonso the RLS is going to be shipped as a first-class artifact with rustup
22:25aturonit is meant to go out with the current beta
22:26aturonnrc was also working on the rustup side, for other reasons,
22:26aturonand didn't have any way to test without just pushing a new beta
22:26aturonat least, that's my understanding
22:26aturonit wasn't totally clear to me why nightly wouldn't be enough for this
22:26acrichtoI don't really see this happening that often though, so I don't think there's much need to change this
22:26acrichtomost of the time this at worst prevents a nightly for a few nights
22:27simulacrumThe primary problem I see today is that there is no easy way for me to revert `rustup update`
22:27acrichtowe're sort of trying to rush the rls now into beta which is causing nonstandard problems
22:27aturonacrichto: so usually iterating on nightly suffices?
22:27acrichtofrom what I've seen yeah
22:27aturonalright, cool
22:27aidanhslol, impl period deadline problems?
22:27acrichtolike the rls is totally broken on beta right now, but no one's noticed (it's not causing problems)
22:28aturonjust wanted to make sure we're prepped for more tools coming online
22:28acrichtoshould be, yeah
22:28aturonok great
22:28aturonhm so we're basically out of time
22:28aturonyou should all have the impl period planning email now
22:28aturonwith document here:
22:29simulacrumYes, I have two emails (one referencing planning and one about mailing list troubles): is that correct?
22:29aturonaidanhs: can i confirm that you're up for leading the charge on config management work during impl period?
22:29aidanhsparameter store item was just: this seems reasonable for storing secrets
22:29aturonsimulacrum: can i confirm that you're up for leading on perf and rustbuild?
22:29simulacrumI think so. I am somewhat worried about time on my part but I think as long as <48 hours for responses is enough I should be fine
22:30aturonoh totally, and you won't be the *only* one, just a good point of contact/coordination
22:30simulacrum(realistically <24 hrs is 98% reality probably)
22:30aturonthat's plenty!!
22:30aturoni've got woboats lined up for taking on rustup
22:31aturon(which is listed under dev-tools; it's always been a bit ambiguous)
22:31aturonnot sure yet about cargobomb
22:31aturonanyway, thanks for confirming, i'll be sending mail to working group leads on ~monday with more details
22:31aturonand we're over time
22:31aturonhave a great weekend everybody!
22:31simulacrumSounds good. Thanks!
22:33simulacrumacrichto: Current status with travis is that we aren't running builds, correct?
22:33acrichtoosx is just flat out turned off
22:33acrichtoand has been all day
22:34aidanhsaturon: I'm happy to help out with cargobomb, also cc tomprince
22:35aturonaidanhs: <3, i'll follow up separately
22:36aidanhsRustStatus: incident start
22:36aidanhsthat sorts out alarmed ferris
22:37aidanhsanyone want to handle the tweet? "RustStatus: tweet Blah blah"
22:39simulacrumaidanhs: About travis?
22:39simulacrumI can
22:40aidanhssomething more eloquent than "OSX is wedged on travis, so are our builds"
22:41simulacrumRustStatus: tweet Due to the ongoing Travis incident, the macOS builders are currently not running and as such the approved PR queue is not progressing.
22:41simulacrumaidanhs: Started upload of beta
22:42simulacrumI'll check back in in ~30 minutes
22:43simulacrumaidanhs: One thing to add to the docs: If I accidentally kill the upload, what do I do?
22:43simulacrum(or any step)
22:45aidanhsfor upload, you just run it again
22:45aidanhsI kinda clarify that a little in
22:45aidanhssince the upload occasionally fails
22:46aidanhswith prepare and run, I tend to delete-ex-target and delete-ex and start from scratch. that may be excessively paranoid
22:48aidanhssimulacrum: by the way, I've dug up another PR that wasn't on the spreadsheet so once the beta is all done and you've notified the requester (typically acrichto for betas, though there should really be a note in the first column), when you get 5min you can also take a look at that
22:49aidanhsit can be tomorrow, there's no rush. cargobomb shepherd is a relaxed role for now
22:51simulacrumhm, I'll probably do it now
22:51simulacrumaidanhs: Are lines like "Sep 08 22:44:45.946 ERRO Could not read log for glium_macros-0.0.1 1.20.0: Couldn't open result file." fine?
22:53aidanhswell they've never caused a failed upload for me, so I've never looked into them. I suspect they're not actually fine, but you can ignore them for now
22:54aidanhssimulacrum: good point though. could you raise an issue?
23:03simulacrumacrichto: FYI, beta run completed
23:03simulacrumNot sure if I&#39;ll have time to triage (maybe est31 wants to?)
23:04acrichtosimulacrum: thanks!
23:16simulacrumaidanhs: How long should preparation take? (Should I wait?)
23:34aidanhsdon't wait
23:34aidanhsabout 6 hours
23:35aidanhshence it being a step in its own right
23:36simulacrumOkay, I&#39;m going to triage the beta run