mozilla :: #perf

13 Sep 2017
19:55 <digitarald> chutten, mystor: looking into a hang analysis for 52 vs 57 for marketing; how much effort would it be to have a comparison of the impact of hangs before/after?
19:55 <chutten> What is your definition of "hang"?
19:56 <mystor> digitarald: bhr won't be much help here, the entire data collection work has been rewritten since 52 and the data we collect now is not comparable to the data we used to collect :-/
19:56 <digitarald> chutten: BHR data
19:56 <digitarald> mystor: what is the last comparable version, 57 vs ?
19:56 <mystor> digitarald: 57
19:57 <mystor> digitarald: This stuff is changing very fast
19:57 <mystor> digitarald: :S
19:57 <digitarald> mystor: when we look at aggregates, like MTBF of hangs over 2s; would that be comparable?
19:57 <mystor> digitarald: We could perhaps count the number of hangs on 52 nightly over DAUs then and the number of hangs on 57 nightly over DAUs today?
19:58 <mystor> digitarald: The overall count per process is probably a comparable number?
19:58 <chutten> Probably over usage hours would be a "better" metric
19:58 <mystor> chutten: sure - over usage hours
19:58 <mystor> on windows 64-bit or something to control for other factors
19:59 <digitarald> right, filtered down to e10s/win10
20:00 <mystor> digitarald: Only main process hangs or all processes? I think we also switched to e10s multi since then which might throw a wrench in the mix
20:00 <digitarald> mystor: that is something I would explore; but main hangs are a meaningful metric otherwise
20:01 <digitarald> dcamp: ^
20:02 <dcamp> chutten/mystor: so, this is somewhat open-ended. The real question is "Can we show a meaningful improvement in 'hangs', where hangs are probably pauses over a certain threshold"
20:02 <digitarald> the threshold especially makes sense, as we lowered the collection threshold
20:03 <mystor> digitarald: We didn't exactly lower the collection threshold - we lowered the native stack collection threshold
20:04 <mystor> digitarald: The overall "is it a hang" threshold has stayed the same
20:04 <digitarald> dthayer: ^
20:04 <mystor> (the hangs which we didn't collect native stacks for before were just largely unactionable)
20:11 <mystor> digitarald: Should I look into how hard numbers like that would be to get and whether or not they make sense?
20:39 <dthayer> mystor, digitarald: want me to take a look? I have the code lying around that should get this - just need to frankenstein it together
20:40 <mystor> dthayer: That would be fantastic!
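[The normalization settled on above - hang counts over usage hours rather than over DAUs - can be sketched roughly as below. The ping fields (`version`, `hang_count`, `usage_hours`) are illustrative placeholders, not the actual BHR/telemetry ping schema.]

```python
from collections import defaultdict

def hangs_per_usage_hour(pings):
    """Aggregate hypothetical per-client pings into a per-version hang rate.

    Each ping is assumed to look like:
        {"version": "52", "hang_count": 3, "usage_hours": 1.5}
    Normalizing by usage hours (instead of DAUs) controls for how long
    each client actually ran the browser during the period.
    """
    hangs = defaultdict(int)
    hours = defaultdict(float)
    for p in pings:
        hangs[p["version"]] += p["hang_count"]
        hours[p["version"]] += p["usage_hours"]
    # Hangs per usage hour, per version; skip versions with no usage.
    return {v: hangs[v] / hours[v] for v in hangs if hours[v] > 0}

# Made-up example data, not real measurements:
pings = [
    {"version": "52", "hang_count": 4, "usage_hours": 2.0},
    {"version": "52", "hang_count": 2, "usage_hours": 1.0},
    {"version": "57", "hang_count": 1, "usage_hours": 3.0},
]
rates = hangs_per_usage_hour(pings)
# 52: (4 + 2) / (2.0 + 1.0) = 2.0 hangs/hour; 57: 1 / 3.0 hangs/hour
```

[In a real analysis the pings would also be filtered to a fixed population (e.g. e10s on win10, 64-bit) before aggregating, as discussed above.]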
21:06 <smaug> jmaher: How does one run tp5o_scroll?
21:07 <RyanVM> ...dare I ask why you're asking?
21:07 * RyanVM is in the middle of bisecting some 57-as-Beta bustage in that
21:07 <smaug> RyanVM: because I'd like to run it :)
21:08 <smaug> RyanVM: ah
21:08 <smaug> so I saw some failures on try
21:08 <RyanVM> akin to ?
21:08 * RyanVM has it down to an inbound merge now, working on bisecting within it
21:08 <RyanVM> yeah, those are the ones
21:08 <smaug> RyanVM: that looks familiar
21:09 <smaug> RyanVM: thanks!
21:09 <RyanVM> well, that's interesting
21:09 <RyanVM> wondering what you're doing to get those on vanilla trunk
21:12 <RyanVM> smaug: anyway, I think you want |mach talos-test --suite g1-e10s|
21:12 <RyanVM> got that by running the |mach talos-test --print-suites| command
21:13 <smaug> RyanVM: Talos wiki page has a dead link to or some such
21:13 <smaug> which apparently should be downloaded
21:13 <smaug> but maybe mach can download it
21:14 <jmaher> smaug: oh, |./mach talos-test -a tp5o_scroll| should work; and it will auto download from tooltool
21:14 <RyanVM> jmaher: so FWIW, I intend to back out with great prejudice whatever ended up causing ^ given that the merge to Beta is tomorrow
21:16 <jmaher> RyanVM: I think keeping the trees green is good; I am not aware of what specific problems are going on
21:16 <jmaher> I am in some meetings
21:19 <digitarald> dthayer: yes please. we need numbers soon for messaging
21:19 <digitarald> a qf:p1 if anybody asks ;)
22:33 <dthayer> mystor: is there anything you can think of that would justify an order of magnitude reduction in hangs > 128 ms between 52 and 57? feels too good to be true, but it's there pre and post BHR format change
22:36 <dthayer> (these are preliminary numbers from the 1% sample, and I'm running a 100% sample right now to verify - so it might be an artifact of that, but I doubt the margin of error is even close to large enough)
22:39 <digitarald> dthayer: we fixed a ton? :)
22:40 <digitarald> you could double-check with a different threshold; did we collect 128ms hangs on 52?
23:04 <mystor> dthayer: yeah, we just didn't fetch stacks for them
23:04 <mystor> Err, that was re digitarald
23:05 <mystor> dthayer: it would be cool to get 1% intermediate numbers for each version in between, perhaps?
23:05 <mystor> dthayer: then we could see where it changes...
23:05 <dthayer> mystor: good call
23:06 <smaug> dthayer: there have been several GC/CC handling changes
23:06 <smaug> as an example
23:07 <smaug> some of them landed ...hmm, two days ago
23:07 <smaug> but one major one is to use idle time for GC/CC when possible
23:07 <smaug> it means that we get more 50ms slices, but then hopefully fewer of the others
23:08 <smaug> also, we do fewer full GCs, and more zone GCs
23:08 <smaug> etc. etc.
23:11 <mystor> I'm optimistic that the number is real ^.^
23:17 <dthayer> at a 10% sample, I'm seeing an 89% reduction in 2s hangs, and at a 100% sample, I'm seeing a 95% reduction in 128ms hangs
23:17 <dthayer> looking into other versions now
23:17 <RyanVM> smaug: fun story
23:17 <RyanVM> looks like it was bug 864845
23:17 <RyanVM> which actually had talos issues the first time it tried to land
23:17 <smaug> RyanVM: interesting
23:18 <RyanVM> guess Boris won't mind it getting backed out again then :P
23:37 <dthayer> digitarald, mystor: the picture looks pretty conclusive (units are hangs > 128ms per second - lower is better)
23:38 <mystor> dthayer: That's some beautiful data
23:38 <dthayer> the 52 and 57 numbers are using 100% samples, the rest are 1%
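[The reduction percentages quoted above are just relative changes between two hang rates (hangs over a threshold, per unit of usage, lower is better). A minimal sketch of that arithmetic, using made-up rates rather than the measured ones:]

```python
def percent_reduction(old_rate, new_rate):
    """Relative drop between two hang rates; positive means fewer hangs."""
    return (old_rate - new_rate) / old_rate * 100.0

# Illustrative rates only (hangs per usage hour), not the real data:
# dropping from 20 hangs/hour to 1 hang/hour is a 95% reduction.
rate_52, rate_57 = 20.0, 1.0
print(f"{percent_reduction(rate_52, rate_57):.0f}% reduction")
```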