mozilla :: #perf

20 Mar 2017
10:03jmaher|afkigoldan: hi
11:43gfritzschegsvelto: hey, was there any fundamental problem with this?
11:44gsveltogfritzsche, no, that's the approach I'm taking in the patch
11:45gfritzschegsvelto: But this suggests changes?
11:46gfritzscheI would like the high-level logic to stay as simple as possible
11:46gsveltoit should, let me describe the whole process as I envision it:
11:46gsvelto- first when the crashreporter runs it creates the crash ping and puts it in a file in a separate directory as we discussed
11:47gsveltoit also flags the event file with ping UUID (as it already does, so we don't even need to change this), this will later signal the crash manager that we tried to send a ping for this crash
11:47gsveltothen the ping is handed over to the pingsender which tries to send it, if it succeeds it removes the crash ping file, otherwise the crash ping file stays where it is
11:47gsveltowhen we restart firefox the crash manager scans the event files for crashes
11:48gsveltowhen reading the event files it will check if a ping UUID is present in the event file then it will look for the ping file
11:48gsveltoif the file is not there then it means it's already been sent, there's nothing to be done left
11:49gsveltoif the file is there then it means that the pingsender could not send it, the crash manager will then use the contents of the crash ping file to populate the crash ping it normally sends for each crash
11:49gsveltoand then delete the crash ping file
11:49gfritzscheSo, why do you need to generate a ping again in the crash manager?
11:49gfritzscheThis seems like confusing logic
11:50gsveltowell, there's three scenarios
11:50gfritzscheAs i see it, you handed data over to telemetry/ping land
11:50gfritzscheAnd you should forget about it
11:50gsveltoso it goes along with the flow of normal telemetry
11:50gfritzscheWell, but the crash manager generating it again goes against that
11:50gsveltolet me show you the code
11:50gfritzscheWe should only ever generate a ping once for the same data
11:50gsveltothis is where we send the crash ping in the crash manager
11:51gsveltothe only change required is to populate the ping with the contents taken from the file instead of assembled in Firefox
11:51gfritzscheYes, but this is for crashes that have not been processed before
11:51gfritzscheThis is an easy change locally, but a confusing one logically
11:52gfritzschegenerate ping, pass to ping sender, restart, maybe pass new ping over again?
11:52gfritzscheThis is how this looks for me for ping data flow
11:52gsveltoto the pingsender you mean?
11:52gfritzscheTo Telemetry, from the crash manager
11:52igoldan|afkjmaher|afk: why don't we do PGO builds on mozilla-inbound? my own answer: that would quickly overrun the test machines, right? plus, by the time the build finishes, a new changeset will appear
11:52gsveltoBTW there's a third scenario we need to cover
11:53gsveltothe one in which the event file does not contain a crash ping UUID because for some reason we couldn't generate the ping at all in the crashreporter client
11:53gsveltoin that case the old crash manager logic generates a brand new crash ping instead and sends that
11:53gfritzschewhich is expected, sure
11:54gsveltoI think the changes will be rather small
11:54gsveltolimited to the crashreporter client, and crash manager
11:54gfritzscheI would like to stay us with the original plan for those crashsender pings
11:55gfritzscheThe logic gets hard to follow otherwise
11:55gsveltook, maybe I've missed something then, how should I do it?
11:55gfritzscheAs i wrote originally
11:56gfritzscheA few times :)
11:56gfritzscheSave ping to disk, pass to pingsender, on restart telemetry checks if ping is still present, sends it
11:57gsveltoBut then how does the crashmanager know it doesn't have to send a ping of its own?
11:57gfritzscheBecause of the client uuid annotation you mentioned above
11:57gsveltoAlright, so the crash manager would send a ping only if the UUID is not there at all, all other scenarios would be covered by telemetry
11:58gsveltoAlright, got it
11:58gsveltoOK, I'll do it that way then, the ping would still be stored outside of the user profile, right?
11:59gfritzscheRight, next to the crash report dir
12:00gfritzsche$DIR/Pending Pings/ ?
12:01gfritzscheGeneric in case we'd drop in other pings later
12:01gsveltoOK, makes sense
12:01gsveltoThis keeps everything within the telemetry toolkit, makes it fully self-contained
12:04gfritzscheYeah, data flow logic between telemetry and crash reporter is what i was worried about
12:05gsveltogfritzsche, yeah that's the part I had missed, since I'm not 100% familiar with the telemetry I assumed you wanted me to use the public methods to handle it (i.e. submitExternalPing()), not hook it up directly with the core logic, but now I understand this can be used for other stuff too
12:05gsveltoBTW I'm pushing the updated patch for bug 1341349 with your comments addressed
12:19igoldanrwood: hi, Robert
12:20igoldanrwood: went throught this doc page:
12:21rwoodigoldan: good'ay
12:21igoldanrwood: what should I understand by "coalesced job"? a job which got canceled to let other ones more recent run?
12:22igoldanrwood: want to make sure I get the meaning
12:22igoldanrwood: as it something very common to perf sheriffing
12:28rwoodigoldan: yeah, if the load is high so some jobs are skipped to run the jobs on newer builds instead, so if that happens then we may need to backfill the missing jobs if trying to pin down a talos regression
12:29igoldanrwood: good that I know this now; I was wondering: who's responsible for my missing test data points?
12:29igoldanrwood: :))
12:30igoldanrwood: this is something that really helps developers check their actual work faster
12:30rwoodigoldan: try jobs won't be coalesced
12:30igoldanrwood: nor PGO builds, from what I've read
12:30igoldanrwood: right?
12:31igoldanrwood: as they are run less frequently
12:31rwoodigoldan: correct re: PGO
12:31igoldanrwood: and take longer; would waste consumed CPU for nothing
12:32igoldanrwood: actually, now that I think about it... the coalescing only affects jobs which have not started yet
12:33rwoodigoldan: also on inbound and autoland we have a tool called SETA running, it determines what jobs are low priority and can be skipped based on historical data, but that doesn't apply to talos
12:36igoldanrwood: you mean that a Talos job will never be canceled?
12:37rwoodigoldan: not be SETA that is
12:38rwoodthere could be missing talos jobs by other reasons so backfilling of talos data might be needed sometimes
12:39igoldanrwood: you meant "not by SETA", right?
12:40igoldanrwood: what kind of reasons?
12:40igoldanrwood: besides non-functional builds
12:43rwoodigoldan: coalescing or sometimes builds are broken or the infra goes down etc
12:44rwoodnot that often though
12:44igoldanrwood: I see
14:27igoldanjmaher|afk: I know about the tp5n pack, with web pages
14:28igoldanjmaher|afk: but what's the deal with tp5o?
14:28jmaher|afkigoldan: tp5o is just using a smaller set of the tp5n pages
14:30jmaher|afkigoldan: does that make sense?
14:30igoldanjmaher|afk: I guess these are more special web pages?
14:32jmaher|afkigoldan: we have the 100 pages in the file, then any other tests that need stuff have their own pages- many tests use some or all of the pages
14:33igoldanjmaher|afk: got it
14:33igoldanjmaher|afk: tp5 is just another subset, right?
14:33igoldanjmaher|afk: from
14:34jmaher|afkyes, but tp5 as a job name in treeherder refers to tp5o
15:45igoldanjmaher: if I repeat an old Talos job from treeherder, it will run using the build it belongs to?
15:46jmaherigoldan: yes, if you retrigger the job via treeherder
15:46igoldanjmaher: so, technically, you could backfill data points which are pretty old
15:46igoldanjmaher: of course, using treeherder
15:46jmaherigoldan: yes, only up to 4 months old, then the builds are gone
15:46jmahermaybe it is 90 days, not sure the exact time
15:47jmaherigoldan: but the key is keeping it as close to real time as possible, as in <1 week from original landing
15:47igoldanjmaher: yes
16:33igoldanjmaher: by, guys
16:33jmaherbye igoldan
16:33jmaherhave a good evening
16:33igoldanrwood: I wish you a great day
16:33igoldanjmaher: thanks :)
16:33igoldanjmaher: see you tomorrow
22:16njnmstange: ping
22:17mstangenjn: pong
22:17njnmstange: hi. I want to talk about bug 1348402
22:17mstangeoh cool!
22:17mstangelet&#39;s talk
22:19njnmstange: PROFILER_LABEL currently only touches the PseudoStack, doesn&#39;t need to lock gPSMutex
22:19njnit&#39;s already quite hot; here&#39;s a profile I did a while back using printf:
22:21mstangethere&#39;s &quot;hot&quot; and there&#39;s &quot;extremely hot + especially in certain benchmarks&quot;
22:22njnI&#39;m having trouble seeing how we can opt out of PseudoStack manipulations when the profiler is inactive; seems like a recipe for unbalanced-ness
22:23sewardjmstange: do you have any order-of-magnitude feel for how often you want to call PROFILER_LABEL ?
22:23mstangenjn: the RAII class already takes care of unbalanced-ness by checking the handle in the destructor
22:24mstangesewardj: I&#39;d like to call it for every DOM property access
22:24mstangenjn: these are the results that I&#39;m concerned about:
22:24mstangein the table that ehsan links to, the total score goes from 1763.25 to 1598.88
22:25* njn is freshly happy he removed all those mozilla_sampler_* functions
22:25mstangesewardj: I can&#39;t tell you a number, I&#39;d have to run dromaeo myself and add some logging
22:25mstangesewardj: are you interested in the number of calls per second?
22:26njnmstange: so there will be many additional short-lived entries on teh PseudoStack?
22:26mstangenjn: not at the same time, but they will be very short-lived, yes
22:26mstangeif we&#39;re talking about fixing bug 785440, that is.
22:27sewardjmstange: well, I see it&#39;s difficult to come up with a number, yes
22:27njngreat thread synchronization there, as usual
22:28sewardjI&#39;m sure it works fine .. on x86
22:28njnit&#39;s funny, but it&#39;s also the crux of the matter
22:28njncurrently, IsActive() requires locking gPSMutex
22:29njnI fiddled with avoiding that in bug 1347274
22:29sewardjIf isProfiling is very hot, at least put it in its own cache line
22:29sewardjso we don&#39;t wind up dragging adjacent data between cores
22:30njnin that one there were three fields that needed to be accessible without locking gPSMutex
22:30njnfor this bug we only need IsActive() to be accessible without locking gPSMutex, which is better
22:30mstangenjn: unless pushing a pseudo stack frame is cheap enough that we don&#39;t need to check isActive
22:31njnmstange: TLS access plus a function call plus a bunch of memory writes; seems slow
22:32njnprofiler off: 1571.41 -> 1569.03
22:32njnprofiler on: 1480.88 -> 1429.88
22:32njnhow bad is a reduction of ~50 units
22:33mstangenjn: with the plan I proposed I think it would be three or four pointer-sized values pushed on the stack, + TLS access + one memory write
22:33mstangeI don&#39;t understand how the impact with the profiler on can be so low
22:33mstangewhen you compare it to the numbers ehsan was getting without the isActive checks
22:34njnmstange: someone needs to resurrect ehsan&#39;s patch and re-measure
22:35mstangegood point
22:35mstangeI have that resurrected patch
22:35mstangelet me attach it
22:36njnI have no idea how expensive TLS accesses are
22:39mstangejrmuizel told me on Friday that on Mac+Linux they are now almost free
22:39sewardjnjn: at least on Linux, a register is reserved for pointing at the thread&#39;s local state
22:39mstangebecause you have a thread index register and so it&#39;s only one pointer addition
22:39sewardjnjn: and when the kernel schedules the thread it makes it point at the right thing (for that thread)
22:39mstangefor Windows we need to fix
22:40njnyay for XP removal
23:02sewardjmstange: you mentioned some time about changing symbolisation in the plugin on Linux
23:03mstangesewardj: to use nm instead of dump_syms?
23:03mstangesewardj: that part is done
23:03sewardjmstange: (yes) so that it doesn&#39;t even try to run dump_syms
23:03mstangethe order
23:03sewardjmstange: right, right
23:03mstangeyes, that part is also done, since about one or two weeks ago
23:03sewardjmstange: ok .. so it goes directly to nm, yes?
23:04mstangesewardj: yes, it should
23:04mstange16 days ago actually,
23:04sewardjmstange: ok, good, thanks
21 Mar 2017
No messages
Last message: 5 days and 22 hours ago