mozilla :: #datapipeline

11 Jul 2017
11:01Dextermreid, fwiw I'm currently running a latency analysis on all the main-pings, without restricting the thing to shutdown pings
11:59mreidDexter: great, can't wait to see :)
13:37Dextermreid, https://gist.github.com/Dexterp37/f42814bd388635b6a4a9b9f2d1df2013 (I'm getting this properly reviewed on mozilla-reports)
13:38Dextersummary: 80% of main-pings arrive within the first hour with the pingsender vs 20% within the first hour with no pingsender
13:40Dexterthis is slightly lower than the percentage we have with shutdown main-pings only
13:40Dexterbut this is expected, as other pings are not being sent with the pingsender :)
13:58mreidDexter++
14:08frankDexter: is it in the plans to send all pings with pingsender?
14:08frankbecause this is such a great improvement it seems sad not to share :)
14:13Dexterfrank, heh, "not yet" :)
14:13Dexterwell, no plans at all to be honest
14:13Dexterwe're using it for the new-profile ping too already
14:13frankDexter: ooh, nice! That makes sense
14:14frankis it a difficult ask?
14:14Dexterand we'll probably use it for the "update" ping too
14:14Dexternope, not really difficult, but it has an impact on shutdown (even if it's minimal)
14:14Dexterso we should be careful about that
14:16Dexterfrank, ^
14:16frankDexter: what effect does it have on shutdown?
14:17Dexterfrank, there's a little regression in the shutdown duration -> https://github.com/mozilla/mozilla-reports/blob/master/projects/main_ping_delays_pingsender_beta.kp/knowledge.md#did-we-regress-shutdownduration
14:17Dexter2 to 8 ms
14:18Dexternot something to be scared about given the high win
14:18Dexterbut something to consider and closely monitor if we keep adding pings at shutdown :)
14:18frankDexter: shutdown isn't something quantum is concerned about, right?
14:19Dexterfrank, heh, right, but still something we don't want to regress. See https://bugzilla.mozilla.org/show_bug.cgi?id=1365978#c3
14:19firebotBug 1365978 FIXED, alessio.placitelli@gmail.com Validate sending shutdown pings using the PingSender [Beta]
14:20frankvery cool. Great job, Dexter!
14:21Dexterthanks frank ! Team effort FTW! (Georg, Chris, Gabriele, etc.!)
15:35sunahsuhfrank: do you know where we're pulling the telemetry tutorials from for new EFS home directories?
15:37franksunahsuh: moz-reports
15:37sunahsuhhmmm
15:57gfritzschefrank: for most pings it may not be a big deal... currently it's a specific fix for latency for pings that are submitted on shutdown
15:57gfritzscheeverything submitted early doesn't have that problem
15:58gfritzscheso no need for the ping sender there
16:05mreidright, generally pings are sent out as they are generated, right?
16:06frankgfritzsche: that makes sense
16:06* frank wonders what the latency is of other types of pings
16:13gfritzscheright, most pings should be sent out "right away"
16:14gfritzscheonly if our send queue is spammed, connection quality is bad or we are shutting down, pings are delayed
16:28mreidPSA: this is KDD which rweiss mentioned: http://www.kdd.org/kdd2017/
17:13gfritzschethat's a great concept
18:17jgauntfrank: did you get my email?
18:18frankjgaunt: I did, haven't had a chance to look yet :(
18:18jgauntokay
18:19frankjgaunt: I'll take a quick peek right now :)
18:21joyi would like a special dataset request like experiments have: could all funnelcake pings be stored somewhere?
18:22joyi think that would be equivalent to saying if distribution_id == 'mozilla[0-9]+' (some regex like that) then save those pings somewhere else
18:24frankjoy: raw pings, or a parquet dataset?
18:24joyfrank: parquet with everything? for example i wanted fx_migration_logins_jank_ms but that isn't in main_summary
18:24joyso if parquet with everything that would be nice
18:25joybut failing that raw is okay
18:25joythere are only a few hundred k in any funnelcake
18:25frankjoy: my best guess is by end of q3 we will have all measures in main_summary
18:25frankso depends on your timeline
18:25joyaah cool
18:26frankjoy: also, if you want new histograms in main_summary, it's easy to do. We just add them to the whitelist
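A rough sketch of the funnelcake filter joy describes above. The flat `distribution_id` key and the `mozilla[0-9]+` pattern are assumptions taken from the chat; real pings may nest the field elsewhere and real funnelcake ids may follow another scheme.

```python
import re

# Hypothetical pattern, copied from the "mozilla[0-9]+" idea in the chat.
FUNNELCAKE_RE = re.compile(r"mozilla[0-9]+")


def is_funnelcake(ping):
    """Return True if the ping's distribution_id fully matches the pattern.

    `ping` is assumed to be a dict with a top-level "distribution_id" key;
    that layout is illustrative, not the actual ping schema.
    """
    dist_id = ping.get("distribution_id")
    return isinstance(dist_id, str) and FUNNELCAKE_RE.fullmatch(dist_id) is not None


def funnelcake_pings(pings):
    """Keep only the pings that look like funnelcake pings."""
    return [p for p in pings if is_funnelcake(p)]
```

`fullmatch` is used so that ids such as `xmozilla12` or `mozilla12-beta` are not picked up by accident.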
19:09robhudsondid longitudinal change at all? I'm seeing "cannot resolve '`gc_ms`' given input columns: ..."
19:16sunahsuhrobhudson: in stmo? jupyter?
19:16robhudsonjupyter. Using the "Longitudinal Dataset Tutorial"
19:25frankwow, it is definitely not there
19:25frankrobhudson: can you make a bug in Datasets: Longitudinal?
19:25robhudsonsure
19:27frankrobhudson: wait, scratch that
19:27frankit's a pre-release measure
19:27frankwe don't have prerelease (aka opt-in) measures in longitudinal anymore
20:46trinkfinally, the third party links come from the wiki https://mozilla-services.github.io/lua_sandbox_extensions/
20:48trink^ mreid
20:55mreidWoot
21:20sunahsuhre-running the experiments aggregate + import for yesterday now
21:39trinkfyi there will be no deployment this week but the June 26 real-time data platform packages are available https://hsadmin.trink.com/packages/
21:51joymreid: will respond to the testing document
21:51joyquick question, the parquet datasets
21:51joywhat is the format? is it protobuf encoded?
21:52joywhere is the encoding? github repo?
21:52mreidParquet is the data format
21:53mreidhttps://parquet.apache.org/
21:53joyim thinking (drumroll), when i do a sql query e.g. select client_id, sum(active_hrs) from main_summary where sample_id=42 group by client_id
21:53joyhow does spark handle the indexing?
21:54joyi mean where does spark step in and where does parquet step in?
21:54joycould i use alternatives to spark and just include ways to take advantage of the indexing?
21:58frankjoy: parquet is independent of spark
21:58joyfrank: right. So how does spark handle the 'indices'
21:58mreidThere are other implementations of parquet readers
21:58frankjoy: for example, Presto reads and writes parquet as well
21:59joywhen told to, say, filter on submission_date_s3
21:59frankjoy: which indices? there is no traditional RDBMS index support there
21:59frankah, it's partitions
21:59joyright right
21:59joysorry wrong jargon
21:59frankjoy: for example, the filename is submission_date_s3=2017001/somethingsomething.parquet
21:59joyaah, i see
21:59frankand spark/presto/all readers know to add a column called `submission_date_s3`
21:59frankand thus they can filter out certain files before even reading the contents
22:00joyso spark can quickly see which files it needs to read
22:00joyup front
22:00frankright
22:00joyfrank: exactly
22:00joyso what partitions do we have?: submission_date_s3, sample_id and ?
22:00frankthat's it, just those two
22:01frank(( for main_summary, at least ))
22:01joyfrank: thanks!
22:01frankjoy: no problem!
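The partition pruning frank describes above can be illustrated without Spark. This is only a sketch of the idea: the example paths are made up, and real readers such as Spark or Presto do this internally with far more care.

```python
def parse_partitions(path):
    """Extract key=value directory segments from a partitioned file path.

    Only directory segments are inspected, so an "=" in the file name
    itself is ignored.
    """
    partitions = {}
    for segment in path.split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = value
    return partitions


def prune(paths, **wanted):
    """Keep only the paths whose partition values match every filter.

    No file is opened: the decision is made from the path alone, which is
    what lets a reader skip files before reading their contents.
    """
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == str(v) for k, v in wanted.items())
    ]


# Hypothetical layout using the two main_summary partitions named above.
paths = [
    "main_summary/submission_date_s3=20170710/sample_id=42/part-0.parquet",
    "main_summary/submission_date_s3=20170711/sample_id=41/part-0.parquet",
    "main_summary/submission_date_s3=20170711/sample_id=42/part-0.parquet",
]
selected = prune(paths, submission_date_s3="20170711", sample_id=42)
```

Here `selected` holds only the last path, so a query filtered on both partition columns would touch a single file.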
22:01joyalso, if i wanted to read a sample parquet file
22:02joysay in C
22:02joywhat would i need? apart from the parquet libraries
22:02joywould i need some protobuf? or is the parquet file self documenting?
22:05mreidYou'd just need the parquet lib
22:06mreidThough there are some other special bits to read it straight from S3
22:28frankjoy: have you been doing analysis in c lately
22:28joyno, but suppose i wanted to integrate with other languages
22:28joyfrank: ^^
22:28joymreid: what are those special bits? github link if you have?
22:30frankjoy: you would need something like what botocore is for python
22:30frankwhich can read from s3
22:30frankthen another library to decode the parquet
22:30joyi see, yes
22:30mreidJoy wants to read from R :)
22:30franknoooooooooooo
22:30joymreid: hehee, curious is all ...
22:31franks/c/r, eh?
22:31joyoh no, write in C
22:31joyand then call from R
22:31mreidDoes Zeppelin have R support?
22:31joyso Scala when it reads the S3 files
22:31mreidIt lets you do things in multiple langs
22:31joydoes it stream the data from S3? or copy?
22:32joyright, but i wanted to see if i can access the parquet files directly
22:32mreidjoy: it uses partial reads from S3
22:32joySparkR is a bit messy
22:32joymreid: where can i read more on that?
22:32joypartial reads from S3
22:32mreidI bet you $5 it's less messy than implementing it yourself :)
22:32joyyes, indeed, i dont want to
22:33frankjoy: consider this: https://aws.amazon.com/blogs/big-data/running-sparklyr-rstudios-r-interface-to-spark-on-amazon-emr/
22:33joyyes, will look into it
22:33frankjoy: if you give it a shot and it's nice, we could consider supporting it
22:33mreidJoy there's range support when reading a file from S3. You can request a specific byte range
22:34joylet me try sparklyr
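A sketch of why the byte-range support mreid mentions is useful for parquet. A parquet file ends with a 4-byte little-endian footer length followed by the magic bytes `PAR1`, so a reader can fetch the last 8 bytes, then fetch only the footer metadata. The helper names below are hypothetical, and the code only builds HTTP `Range` header values; actually sending them to S3 is left to whatever client is in use.

```python
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte footer length + 4-byte magic


def suffix_range_header(num_bytes):
    """Range header value asking for the last `num_bytes` bytes of an object."""
    return "bytes=-%d" % num_bytes


def footer_length(tail):
    """Decode the footer metadata length from the last 8 bytes of a file."""
    if len(tail) != TAIL_SIZE or tail[4:] != PARQUET_MAGIC:
        raise ValueError("not the tail of a parquet file")
    return struct.unpack("<I", tail[:4])[0]


def footer_range_header(file_size, tail):
    """Range header value covering exactly the footer metadata bytes."""
    length = footer_length(tail)
    start = file_size - TAIL_SIZE - length
    end = file_size - TAIL_SIZE - 1  # HTTP byte ranges are inclusive
    return "bytes=%d-%d" % (start, end)


# Stand-in for a remote object: magic, fake data, fake metadata, length, magic.
fake_file = b"PAR1" + b"x" * 100 + b"m" * 20 + struct.pack("<I", 20) + b"PAR1"
first_request = suffix_range_header(TAIL_SIZE)
second_request = footer_range_header(len(fake_file), fake_file[-TAIL_SIZE:])
```

Two small requests are enough to learn the schema and the row-group offsets, after which a reader can request only the column chunks it needs.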
22:40jgauntfrank, mreid - get_pings_properties() was deprecated in favor of using the Dataset API, correct?
22:41jgauntor is there still a way to pull pings with a subset of properties from a bigger RDD?
22:43frankjgaunt: get_pings_properties has not been deprecated yet
22:43frankwe're still supporting it
22:43jgauntmy mistake, th
22:43jgaunt*ty
12 Jul 2017
No messages
   