mozilla :: #datapipeline

13 Jul 2017
00:36 <wlach|afk> kitcambridge|sf: there's some examples of just running a few tests in the readme
12:00 <mreid> kitcambridge: the t-b-v tests need a lot of memory, so make sure you have a docker machine with at least 4GB of RAM.
12:13 <gfritzsche> mreid: hey, in client_count, is activity_date == submission_date?
12:13 <gfritzsche> i.e. received on edge?
12:15 <mreid> client_count breaks down by subsession start date
12:16 <mreid> gfritzsche: ^
12:27 <gfritzsche> mreid: oh, ok, no way to break down by submission date then
12:27 <mreid> no, not yet
12:28 <gfritzsche> any other aggregate datasets that can give me DAU by submission date per country?
12:28 <gfritzsche> or just go for main_summary et al?
12:29 <mreid> DAU by submission date?
12:29 <mreid> let me check
12:30 <mreid> I think we only have global DAU by submission date, or MAU by country.
12:40 <gfritzsche> alright, sounds like longitudinal or main_summary
12:47 <mreid> yup. for main_summary, it's best to start up a cluster and use Spark SQL (not sure what the state of main_summary in Hive is at the moment)
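The DAU-by-submission-date-per-country query mreid points toward could be sketched in Spark SQL roughly like this. This is a hedged illustration only; the column names (`submission_date_s3`, `country`, `client_id`) are assumptions about the main_summary schema at the time and should be checked against the actual table:

```sql
-- Sketch: daily active users per country, keyed on submission date.
-- Column names are assumptions; verify against the current main_summary schema.
SELECT submission_date_s3,
       country,
       COUNT(DISTINCT client_id) AS dau
FROM main_summary
GROUP BY submission_date_s3, country
ORDER BY submission_date_s3, country
```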
13:33 <frank> mreid: do you know why we use subsession_start_date in client_count
13:33 <frank> I did the same for core_client_count, but sanitized it
13:34 <mreid> frank: because in theory it's more correct (the clients were definitely active on that day)
13:35 <mreid> in practice, client-side dates are messy
13:35 <mreid> It matters less with pingsender, I think
13:35 <frank> hmm, we should change it now
13:36 * frank makes a bug
13:36 <mreid> frank: the example case is people who don't use the browser over the weekend.
13:36 <mreid> pre-pingsender
13:37 <mreid> they'd have activity on Friday, but would not appear until Monday
13:37 <frank> mreid: right, right. But this is nice because it will make client_count iterative by its very nature
13:38 <mreid> yep, making things trivially iterative is the main argument for using submission date
13:38 <mreid> well, that and the fact that it's a reliable timestamp
13:38 <frank> I like both those reasons :)
13:38 <mreid> me too :)
13:39 <mreid> once pingsender hits release, we should [re]do the analysis to look at the difference between using submission date / activity date
13:39 <frank> mreid: when is it hitting release?
13:40 <mreid> not sure... it's on beta now
13:40 <frank> so on 55?
13:46 <gfritzsche> 55, yes
13:50 <frank> mreid: should we cancel this morning?
13:50 <mreid> up to you
13:50 <mreid> frank: let's check in for 5min
13:51 <frank> mreid: sounds good
14:35 <mreid> frank: since experiments aggregates regenerates data for the entire experiment each time, can we just mark the airflow failures as success? or do we need to re-run them?
14:35 <frank> mreid: sunahsuh was rerunning them
14:35 <frank> I think I'm okay just marking them as complete
14:36 <frank> there's no benefit that I know of to creating that old data that won't get used
14:36 <mreid> I guess it doesn't hurt to rerun them
14:36 <mreid> I just wanna get rid of the red boxes in airflow :)
14:43 <frank> +1 to that
15:07 <frank> harter: doc ping
15:09 <harter> frank: shoot
15:09 <frank> harter: how do I make an internal link to another wiki page?
15:10 <harter> Use an absolute path from the root of the gitbook
15:10 <frank> harter: got an example? I couldn't find one
15:10 <harter> E.g. /tools/
15:11 <frank> ah, got it, same as any other link then?
15:12 <harter> Yep, should have a lot of examples
15:12 <frank> perfect, thanks!
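harter's suggestion written out as regular markdown, assuming the gitbook has a `tools/` section at its root (the link text here is illustrative):

```md
See the [tools](/tools/) page for details.
```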
16:23 <frank> robotblake: was there any word on whether Athena will support views?
16:23 <frank> just ran into a use case for one
16:26 <mreid> frank: "yes, eventually"
16:27 <frank> mreid: that word, "eventually"
16:27 <mreid> it's like eventual consistency
16:27 <mreid> just keep checking until you get the answer you want
16:27 <frank> "eventually, we'll be everything you could want in a SQL system"
16:27 <robotblake> It sounded like it'll come with the next big update, end of July is sorta what it sounded like
16:28 <frank> oh dang, that is much sooner than I anticipated
16:28 * frank does a little dance
16:28 <robotblake> Don't quote me on that though
16:28 <frank> robotblake: too late
16:28 <frank> kidding :)
18:08 <wlach> does anyone have good advice on synthesizing a bunch of parquet data with approximately this schema (ideally with python, but I'll take what I can get)? stackoverflow, etc is not particularly helpful
18:08 <wlach> basically I just want to be able to create 100-1000 rows of data for testing purposes
18:10 <frank> wlach: I'm not sure exactly what you're asking
18:10 <frank> do you mean doing some data munging and resulting in that schema?
18:10 <frank> wlach: you can run the error_aggregator in batch mode, which will create the data you're looking for
18:11 <wlach> frank: I just want to create a parquet file whose data matches that schema
18:11 <frank> wlach: any reason you can't just cp one of the current error-aggregator files?
18:12 <frank> wlach: e.g. s3://telemetry-parquet/error_aggregates/v1/submission_date=2017-06-09/part-00000-fef10751-e300-4958-8246-6c2c9d1e6abf.snappy.parquet
18:12 <wlach> frank: that's what mdoglio did last month, but ideally I would like to be able to control what's in it for the purposes of creating a reasonable dev-only experience (where we don't have access to such things)
18:13 <frank> wlach: ahh. Can it include using spark?
18:14 <frank> wlach: if so, something like what we do here could work:
18:14 <wlach> frank: I would like to be able to do this just by running a script, but creating a docker container with spark or whatever would be fine
18:15 <frank> wlach: more similar to the error_aggregates output would be this one:
18:15 <frank> e.g. for each set of dimensions, creates a row
18:16 <wlach> frank: looking
18:20 <wlach> frank: hmm, you mean this part?
18:20 <frank> wlach: yup, then
18:21 <frank> that takes the randomList and makes a DataFrame, which you can easily save as parquet
18:24 <wlach> frank: ah, that makes sense. where are we getting the values to create that randomList? My scala is still pretty weak but I'm not really seeing how that works
18:25 <frank> wlach: right above there
18:25 <frank> each value gets a list of possible values
18:26 <wlach> frank: ah! I see
18:26 <gfritzsche> hm, bug 1380737 should probably start out in Dataplatform::General?
18:26 <firebot> NEW, Write an airflow job to populate a derived database on hbase for TAAR
18:27 <frank> gfritzsche: yes
18:29 <frank> wlach: no problem! glad that solves it
18:31 <wlach> frank: yeah I was hoping for a solution that was pure-python (or just using the jvm+tools) but I think this will work :)
18:31 <frank> wlach: you could try and replicate it in python, using RDD.toDF
18:31 <frank> it should be able to infer the schema
18:31 <frank> and using some itertools magic
18:31 <wlach> frank: yeah, but I would still need to pull in spark/etc so I don't know if I'd be much further ahead
18:32 <frank> right, good point
18:32 <wlach> I really need to improve my scala skills anyway
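The "itertools magic" frank mentions could look like the sketch below: one row per combination of dimension values, which Spark could then turn into a DataFrame and write as parquet. The dimension names and values here are made up for illustration, not the real error_aggregates schema:

```python
import itertools
import random

# Hypothetical dimensions -- NOT the real error_aggregates schema,
# just an illustration of "for each set of dimensions, create a row".
dimensions = {
    "channel": ["release", "beta", "nightly"],
    "os": ["Windows_NT", "Darwin", "Linux"],
    "country": ["US", "DE", "FR"],
}

# One row per combination of dimension values, plus a random measure column.
rows = [
    dict(zip(dimensions, combo), crash_count=random.randint(0, 100))
    for combo in itertools.product(*dimensions.values())
]

print(len(rows))  # 3 channels * 3 OSes * 3 countries = 27 rows

# With PySpark on hand, the rows could then be written out as parquet,
# letting Spark infer the schema -- e.g. something like:
#   spark.createDataFrame(rows).write.parquet("synthetic_error_aggregates")
```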
19:19 <harter> robotblake: any recent changes to the docs.tmo DNS record?
19:23 <robotblake> Shouldn't be
19:32 <joy> is it true that approxQuantile can only be used as a method on a data frame
19:32 <joy> and not inside a SQL query
19:32 <joy> in pySpark?
19:32 <joy> i assumed SQL and data frames were equivalent
19:37 <frank> joy: I thought you could use them either way as well
19:37 <frank> are you getting an error?
19:37 <joy> one sec
19:38 <joy> u"Undefined function: 'approxQuantile'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 8"
19:38 <joy> frank: ^^
19:38 <joy> this is how i called it: select approxQuantile(usg, 0.9, 0.001) as q from table0
19:40 <frank> joy: try `approx_percentile`
19:40 <frank> joy: scratch that, `percentile_approx`
20:16 <harter> weird. It appears to be resolved now. github couldn't find the DNS record for a while, then the custom URL got flicked back and forth a few times.
20:48 <robotblake> harter: I re-added it
20:48 <robotblake> It was missing from the repo
20:50 <harter> Ah, awesome, thanks!
21:27 <digitarald> robotblake: (continuing from #telemetry) tried Presto: Error running query: line 9:8: Schema telemetry does not exist
21:28 <robotblake> Try removing the "telemetry." at the beginning of the table name
22:50 <su> hi guys! question about boolean histograms: I'm looking at histogram_parent_ssl_handshake_version (it's measuring whether a page load in a subsession used ssl or not... i think) and it seems to have 3 possible keys: 0, 1, or 2
22:51 <su> for example: Row(histogram_parent_http_pageload_is_ssl={0: 1, 1: 50, 2: 0}),
22:51 <su> none of the records have a non-zero value for the '2' bucket
22:52 <su> it's a boolean so it should have just 2 buckets (T/F), right?
22:52 <su> is the '2' bucket just a placeholder?
22:52 <su> and '1' = T and '0' = F?
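su's reading matches how Telemetry boolean histograms are laid out: bucket 0 counts false, bucket 1 counts true, and the trailing bucket 2 is an artifact of the bucket layout that should always be zero. If that holds, a small hypothetical helper for the example Row above would be:

```python
def ssl_ratio(hist):
    """Fraction of true (bucket 1) out of true+false, given a {bucket: count}
    dict from a Telemetry boolean histogram. Bucket 2 is ignored on the
    assumption that it is always empty."""
    false_count = hist.get(0, 0)
    true_count = hist.get(1, 0)
    total = true_count + false_count
    return true_count / total if total else None

# The example from the Row above: 50 SSL page loads out of 51 total.
print(round(ssl_ratio({0: 1, 1: 50, 2: 0}), 3))  # → 0.98
```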
23:54 <digitarald> robotblake: what's the ETA for fixing the issue that hits?
23:55 <robotblake> digitarald: Before tomorrow; the _v4 version of the table is fixed already, so if you change your query to hit "main_summary_v4" it should work
23:56 <digitarald> robotblake: ok, so to rephrase, I should try again tomorrow with _v4 appended to the table
23:56 <robotblake> It'll work right now with the _v4 table
23:56 <robotblake> Tomorrow it will work with the non-suffixed table
23:56 <robotblake> They've got the same data
14 Jul 2017
No messages