mozilla :: #missioncontrol

7 Sep 2017
10:43poonamhello
10:43mdogliohi poonam
10:44poonamhi
10:44poonamhello
13:05mdogliofrank: you there?
13:10mdoglioI think I found out what's wrong with the job. it's producing gigantic parquet files. Like 250MB for 5 minutes of data
13:12mdoglioI guess parquet compression doesn't work well with binary data :p
13:14mdoglioI'm trying to work out which commit makes the job explode
13:37frankmdoglio: oooh, now that is interesting
13:38frankmdoglio: are you sure that it's a single column? I'd also consider the possibility that we have *way* more rows now
13:39mdoglioit's 2 columns actually, client_id and client_id + session
13:39mdogliothey are in the same commit
13:40frankmdoglio: hmm, the hll columns
13:40mdogliosorry I'm wrong
13:40mdoglioit's just the client_id one
13:40mdogliothe session one should be in the next commit
13:40frankyes, but they presumably both suffer from that problem?
13:41mdoglioI think so
13:41mdoglioanyway, adding hll(client_id) + a bunch of stats based on that increases the parquet file size by ~10X
13:41frankmdoglio: sounds like we just need some more machines and more partitions
13:42frankif we end up splitting the data too much, we can always make a daily job that merges rows together
13:42frankin fact I think I already made a bug for that
13:45frankmdoglio: thoughts?^
13:46mdoglioso the contract with the stakeholder was to deliver data with sub-hour latency
13:46frankmdoglio: yes that would still happen
13:47frankbut for example, we might have 10 partitions with a single set of dimensions
13:47frankthen at day's end we can merge them for better performance down the line
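A minimal sketch of the end-of-day merge frank describes, not the actual job. It assumes each row is a tuple of (dimension values, a plain count, an HLL sketch stored as a list of registers); that row layout, the example dimensions, and the name `merge_rows` are hypothetical. The only HLL property relied on is that two sketches of the same size are unioned by taking the register-wise maximum.

```python
def merge_rows(rows):
    """Merge rows that share the same dimensions (hypothetical row layout).

    Counts are summed; HLL registers are unioned with an element-wise max.
    """
    merged = {}
    for dims, count, registers in rows:
        if dims not in merged:
            merged[dims] = (count, list(registers))
            continue
        prev_count, prev_registers = merged[dims]
        if len(prev_registers) != len(registers):
            raise ValueError("HLL sketches must have the same number of registers")
        unioned = [max(a, b) for a, b in zip(prev_registers, registers)]
        merged[dims] = (prev_count + count, unioned)
    return merged


# Illustrative data only: two partitions hold rows for the same dimensions.
rows = [
    (("release", "Windows"), 10, [0, 3, 1, 2]),
    (("release", "Windows"), 5, [2, 1, 1, 4]),
    (("beta", "Linux"), 7, [1, 0, 0, 2]),
]
merged = merge_rows(rows)
```

Three input rows collapse to two, one per distinct set of dimensions, which is the "better performance down the line" part of the idea.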
13:48mdogliowe can try that, but I think we will need to reduce the number of aggregates at some point
13:49mdogliofor example, move from 5m to 15m or 30m
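To put rough numbers on that suggestion: a sketch of how many aggregation windows a day contains at each window length. It assumes, as a simplification, that the row count scales with the number of windows (every dimension combination showing up in every window); `windows_per_day` is a name made up for this example.

```python
MINUTES_PER_DAY = 24 * 60


def windows_per_day(window_minutes):
    """Number of whole aggregation windows in one day."""
    return MINUTES_PER_DAY // window_minutes


for minutes in (5, 15, 30):
    print(f"{minutes}m windows: {windows_per_day(minutes)} per day")
```

Under that assumption, going from 5m to 15m windows cuts the number of windows by a factor of 3, and going to 30m cuts it by a factor of 6.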
13:54frankmdoglio: we should probably figure out how much we need to decrease by
13:54frankand look at the size of the set of values for each dimension
13:55frankother possibilities include limiting those sets to known values
13:55frankmdoglio: when I made core_client_count I kept a tally of the number of distinct values for each dimension
13:56frankyou can see that I labelled some of them: https://github.com/mozilla/telemetry-airflow/blob/master/jobs/core_client_count_view.sh
13:56mdogliooh to avoid creating dimensions containing junk?
13:56frankyup, that's just unnecessary extra rows
13:58mdoglioyup
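A minimal sketch of the two ideas above: tallying distinct values per dimension, and limiting a dimension to known values so junk does not create extra rows. The record layout, the dimension names, the `"Other"` bucket, and both function names are assumptions made for illustration; they are not taken from core_client_count.

```python
from collections import defaultdict


def distinct_counts(records):
    """Count the distinct values seen for each dimension."""
    seen = defaultdict(set)
    for record in records:
        for dim, value in record.items():
            seen[dim].add(value)
    return {dim: len(values) for dim, values in seen.items()}


def limit_to_known(record, known_values, other="Other"):
    """Replace values outside the known set with a catch-all bucket.

    Dimensions without an entry in known_values are left untouched.
    """
    return {
        dim: value if dim not in known_values or value in known_values[dim] else other
        for dim, value in record.items()
    }


# Illustrative data only.
records = [
    {"os": "Windows", "channel": "release"},
    {"os": "Linux", "channel": "release"},
    {"os": "x9$#", "channel": "beta"},
    {"os": "??", "channel": "release"},
]
known = {"os": {"Windows", "Linux", "Darwin"}}

before = distinct_counts(records)
cleaned = [limit_to_known(r, known) for r in records]
after = distinct_counts(cleaned)
```

Here the two junk `os` values fold into a single bucket, so the `os` dimension drops from 4 distinct values to 3 while `channel` is unchanged.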
13:58mdoglioI want to see if the size is still acceptable when we leave out the profile age dimension
13:59frankmdoglio: sounds good, that's what, 30 some odd buckets?
17:20frankyou can't run anything after a timeout in travis :( https://github.com/travis-ci/travis-ci/issues/4221
22:17digitaraldfrank: INPUT_EVENT_RESPONSE_POST_STARTUP_MS landed; can we switch over input latency from INPUT_EVENT_RESPONSE_COALESCED_MS?
22:17digitaraldhttps://bugzilla.mozilla.org/show_bug.cgi?id=1373814 for more background
22:18digitaraldit is basically measuring input latency after the browser is interactive
22:19digitaraldStartup https://mzl.la/2j9w7zZ vs Post-Startup https://mzl.la/2j90sPq , notice the buckets at the end, which are captured by our 2.5+s MTBF metric
22:19digitaraldsphilp: ^
22:21frankdigitarald: yes, I'll put a bug in
22:22frankdigitarald: we won't be able to do a total "switch-over", it will just be two separate metrics
22:22frankis that okay?
22:26digitaraldfrank: ok. thank you!
8 Sep 2017
No messages
   