mozilla :: #datapipeline

16 Mar 2017
07:42rvitillo|workweekrrayborn: have a look at how harter paginates the request to HBase; note that the problem he encountered was that collect()ing large amounts of data was hitting some Spark limitations (in other words HBase was working just fine).
07:44rvitillo|workweeksunahsuh: can you confirm that rrayborn was hitting the same issue?
07:45rraybornrvitillo|workweek
07:46rvitillo|workweekrrayborn: ignore the 20 request per second bit
07:47rvitillo|workweekwe are running some benchmarks and that number applies to a very specific case
07:47rraybornThanks, appreciate the guidance (and existence of HBase :) )
07:47rvitillo|workweekhttps://gist.github.com/harterrt/6932fd145b91be031c381cdb9934fe77
07:48rvitillo|workweek(harter's notebook with pagination)
07:49rvitillo|workweekI think we are going to change the API to either support pagination directly or something as this seems to be a recurring problem
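A minimal sketch of the pagination idea discussed above, assuming the caller already has a list of row keys and some function that fetches one page of results. `paginate`, `process_in_pages`, and `fetch_page` are hypothetical names for illustration and are not part of the HBase API or of harter's notebook. The point is only that each page is fetched and reduced on its own, so no single call has to collect everything at once.

```python
from typing import Callable, Iterator, List, TypeVar

K = TypeVar("K")
R = TypeVar("R")


def paginate(keys: List[K], page_size: int) -> Iterator[List[K]]:
    """Yield consecutive slices of `keys`, each at most `page_size` long."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, len(keys), page_size):
        yield keys[start:start + page_size]


def process_in_pages(
    keys: List[K],
    fetch_page: Callable[[List[K]], List[R]],  # hypothetical: one request per page
    page_size: int = 1000,
) -> Iterator[List[R]]:
    """Fetch one page at a time so the caller can aggregate incrementally."""
    for page in paginate(keys, page_size):
        yield fetch_page(page)


if __name__ == "__main__":
    # Stand-in for a real lookup: echo each key back with a fake payload size.
    fake_fetch = lambda page: [(key, len(str(key))) for key in page]
    total = 0
    for results in process_in_pages(list(range(10)), fake_fetch, page_size=4):
        total += len(results)
    print(total)
```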
11:52mreidrvitillo|workweek: the hbase api made it infinitely easier to figure out that "main ping spike" problem btw
12:41rvitillo|workweekmreid: nice
12:43mreidI'm seeing an airflow jupyter job failing with "jupyter_client.kernelspec.NoSuchKernel: No such kernel named conda-root-py"
12:48mreidanyone seen that one before?
13:39rvitillo|workweekmreid: https://github.com/mozilla/telemetry-analysis-service/issues/267
13:40rvitillo|workweeklet me file a patch to fix this on Airflow
13:43mreidrvitillo|workweek: thanks!
13:44rvitillo|workweekmreid: https://github.com/mozilla/telemetry-airflow/pull/98
13:45mreidgreat
13:45mreidattempt #3 for this job is just about to fail :)
13:45mreidso it should be safe to deploy that fix in a few more mins
14:40mreidrvitillo|workweek: mdoglio: want me to review airflow #98?
14:45mdogliomreid: I got it thanks
14:47mreidcheers
14:54mreidnot sure how I missed this last month: https://aws.amazon.com/blogs/big-data/create-tables-in-amazon-athena-from-nested-json-and-mappings-using-jsonserde/
15:29gregglindgfritzsche: mreid others. cohort/population/group/experiment for that "tag the animals that are being watched" part of gfritzsche telemetry annotation feature.
15:29gregglindcan we Just Decide, since it seems to be a bikeshed :)
15:30mreidindeed
15:30gregglindand it's the only thing anyone has an opinion about :)
15:30mreidI defer to you guys, I've just seen "cohort" used elsewhere for this.
15:34gfritzschegregglind: mreid: i was waiting for some more input there
15:35gfritzscheboth cohort and population are overloaded terms i think
15:35mreidso is experiment :)
15:35gfritzscheexperiment seems less conflicting here?
15:36gregglindmy hunch is that people will want to tag for reasons beyond experiment.
15:36gfritzschei don't know
15:36gregglindso I tend to prefer either of the others.
15:36gregglindonce you are tagged, you get analysis for free :)
15:36gfritzschenaming is hard...
15:36gregglindwhich is a very strong carrot.
15:36mreidsunahsuh: do you have an opinion on what we call these "named groups"?
15:36gregglindthen mreid, want to flip for it? or arm-wrestle?
15:37gregglindMatt_g called them 'wildlife tags' at one point :)
15:37Matt_Gdeer tags :)
15:37gregglindI am -1 on experiment, +0 at cohort or population
15:37gfritzschecohort seems overloaded with "retention cohorts" etc.
15:37sunahsuhi kinda like cohort, but i think anything other than experiment will be confusing to folks for a while..
15:37gregglindI think experiment is confusing :)
15:38gregglindbecause this will probably mark other things too.
15:38sunahsuhlike, the people that'll want to call it cohorts will know what's up if they're called experiments
15:38gfritzscheexisting tooling & analysis uses cohorts for onboarding groups
15:38gregglindI liked "interesting" for a while.
15:38gregglindso cohorts -1000000 then
15:38sunahsuhthe people that'll want to call it experiments won't know what's up if they're called cohorts/populations
15:38gregglind"people that'll want to call it experiments"
15:39gregglindwho are those people?
15:39mreidI like this reasoning
15:39sunahsuhengineers that will want to register experiments
15:39gregglindthey don't need to see this part of the sausage
15:39gregglindthey see the launcher interface
15:39gfritzscheso, what's the problem with calling it experiments?
15:39gregglindnot all the tagged things will actually be experiments
15:39gfritzscheimplementing engineers don't care about the launcher
15:39mreidI have come back around to +1 on experiments :)
15:39gregglindI am strongly -1
15:39gregglindon experiments
15:40gregglindbecause these aren't.
15:40gregglindand it's a more general tagging method.
15:40gfritzschethey are now, we just expect this to be used more generally
15:40gregglindI am thinking ahead to rollouts
15:40gregglindand other reasons to tag
15:41gfritzscheany significant problem with having rollouts marked up as experiments, internally?
15:41gregglindI find it pollutive to call things what they aren't
15:41gregglindbecause it affects how people think about things
15:41sunahsuhfor rollouts wouldn't we use the buildid?
15:41mreidconfusion seems to be a problem, but it sounds like we're stuck with some level of confusion no matter what name we apply
15:42gregglindI want confusion to be mind-expanding, not limiting
15:42gregglindsetX here is a tag, that's it.
15:42gfritzschedo we have an issue-free naming?
15:42gregglindand has no specific semantic purpose.
15:42gfritzscheotherwise we need to pick one that we can use
15:42gregglindMy vote is +1 populations.
15:43mreidgregglind: the api as described is inherently experiment-y
15:43mreidie. setX(tag, branch)
15:43gregglindNo, it's inherently nested-groupy :)
15:43gregglindI am fine with group also.
15:44mreidwithout context, group is vague
15:44gregglindYes.
15:44gfritzscheso, given limited time here
15:44gregglindThat is a benefit I claim.
15:44gregglindAnd exactly my argument.
15:44mreid"conditions"?
15:44gregglindI am better with all of those (conditions, group, population) than experiment
15:44sunahsuhi think the principle of least surprise points to experiments still
15:45gfritzschei'm leaning towards just going with "experiments"
15:45gregglindIt's only least surprise *today* and in the future, it's infinite surprise
15:45gregglindthat's my argument.
15:45gfritzschehappy to hear better options with good arguments for, otherwise using that
15:45gfritzsche(back to workweek)
15:45gregglindMy argument is that it will immediately become inaccurate.
15:45gregglindand we have to go back and add more tags.
15:46gregglindRealizing that we will get some stats for free, *just by tagging using this mechanism* will be very attractive for non a/b purposes
15:47gregglindsetMark('shortname-for-group','all')
15:48sunahsuhfeels scopecreep-y to me ;)
15:49gregglindI don't know how to argue with that. It's a natural consequence of putting the code in.
15:49gregglindjust like people immediately realize that experiments launcher is a pref-rollout service.
15:50gregglindif you allow tagging, and give dashboards on those tags free, and that is the really cheap way to get a dashboard, people will use it.
15:50gregglindand that is something that seems... good to encourage :)
15:50mreidgregglind: I think encouraging that behaviour by calling it "experimenting" might be a good thing
15:51gregglindI find that frustrating, because then all the things in that list aren't actual experiments.
15:51gregglindor aren't controlled or interventional experiments, which I think is misleadingly scienc-y
15:52gregglindokay, so I have this assumption that this is the one place I will get to mark tags :)
15:53gregglindif that's not true, and we want to spin a clone of this for every new way we have to tag groups, then I will back off.
15:53gregglindI somehow suspect that isn't true.
15:53sunahsuhbut then they shouldn't be using a dashboard that's designed for a/b tests -- most of it would be useless unless you have a hypothesis
15:53mreidthe "generalized pref rollout" thing would be good to revisit in any case
15:53gregglindI am not sure it's useless.
15:53gregglindsometimes count is plenty.
15:54gregglindthis feels to me like an *extremely* powerful tool.
15:54sunahsuhin that case we'd be doing a lot of computational work that wouldn't be useful though -- it sounds like if all you want are counts, that should be another mechanism
15:55gregglindMaybe, but we don't have that tool
15:55gregglindand this is easy from a client.
15:56gregglindJump with me over the cliff here that if this goes out, we are giving people a way of saying "these packets are interesting, count and display them"
15:57gregglindif you want to give a 2nd api to do that more formally (setTag)
15:57gregglindthat is also fine.
15:57gregglindmaybe in that case, setTag is always against the 'main' overall population in the dashboard
15:57gregglindI could be convinced of that
15:58gregglindsetTag(aTag)
15:58gregglindand that has a different special case etl and dashboard
15:59gregglinddoes that seem plausible?
15:59mreidyes, making the generalized solution a problem for future-us seems wise
15:59mreidgiven the time-crunch we're likely to be under
15:59gregglindMy question is: is people using setExperiment to get that behaviour a perversion or not?
15:59mreidnot that we won't do a good job, just that we don't have to solve the general case yet
16:00gregglindwell, the timeCrunch here is literally do we call this in a way that encourages that perversion or frowns at it.
16:00gregglindand I am hearing "its a perversion we don't want to encourage yet"
16:01gregglindI am tempted to suggest "setExperimentBranch" and accept that people can pervert it, but you Officially view it as a Perversion
16:02gregglind(for comparison, you can use pref-flip experiment as a pref rollout service, but we *also* view that as a perversion, and are telling people to wait a month)
16:04gregglindI will accept experiments IF we commit to thinking a little about how to support cheap tags
16:04mreidI suspect we will find ourselves wanting a different solution for the pref rollout thing
16:06gregglindand we will discuss how to mark users more generally then?
16:07mreid+1 for that
16:07gregglindOkay, so, I will mark it in the rfp and move on
16:08mreidyay
16:08mreid</bikeshed>
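For readers following the naming debate, here is a rough sketch of the shape of API being argued over: a client records a (tag, branch) pair, and a downstream job counts clients per pair. Everything here is an assumption for illustration. `ClientAnnotations`, `set_experiment`, and `count_by_tag_and_branch` are hypothetical names, not the Telemetry implementation, and the tag strings are made up apart from the `'shortname-for-group'` example from the chat.

```python
from collections import Counter
from typing import Dict, Iterable


class ClientAnnotations:
    """Hypothetical per-client store of tag -> branch annotations."""

    def __init__(self) -> None:
        self._tags: Dict[str, str] = {}

    def set_experiment(self, tag: str, branch: str) -> None:
        # A tag holds exactly one branch; setting it again overwrites.
        self._tags[tag] = branch

    def payload(self) -> Dict[str, str]:
        # Return a copy so callers cannot mutate internal state.
        return dict(self._tags)


def count_by_tag_and_branch(payloads: Iterable[Dict[str, str]]) -> Counter:
    """Count how many clients reported each (tag, branch) pair."""
    counts: Counter = Counter()
    for payload in payloads:
        for tag, branch in payload.items():
            counts[(tag, branch)] += 1
    return counts


if __name__ == "__main__":
    a = ClientAnnotations()
    a.set_experiment("shortname-for-group", "all")   # plain tag, single branch
    a.set_experiment("example-ab-test", "control")   # made-up A/B experiment
    b = ClientAnnotations()
    b.set_experiment("example-ab-test", "treatment")
    print(count_by_tag_and_branch([a.payload(), b.payload()]))
```

A plain tag in this sketch is just an experiment with a single branch, which is the usage gregglind expects people to reach for and the others would rather treat as future work.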
16:11sunahsuh
16:12mreidmdoglio: rvitillo|workweek you guys gonna do an airflow deploy?
16:12gregglindI like that in irssi I can't see emoji, so I never know if sunahsuh approves
16:13mreidgregglind:
16:13gregglindexactly
16:13mreiddid that not show either?
16:13gregglindSo, I added a future work describing that
16:13gregglindit did not in irssi
16:14mreidlame
16:14gregglindI suspect anything in private unicode space doesn't.
16:14mreidgregglind: U+1F3E0
16:14mreid:)
16:14gregglindokay, I claim that spec is done enough, and needs build :)
16:16mreidsunahsuh: when we added "addons" to the main_summary table, we converted from map{addon_id: {addon_details}} to array[Row(addon_id, other, addon, fields)]. Do you remember the specifics of why?
16:17mreidthe same logic may apply here
16:17gregglindbtw, sunahsuh and others, Matt_G is presenting the automagic dashboard claims at Product Club *RIGHT NOW*
16:17gregglindNO PRESSURE
16:18sunahsuh:D
16:19sunahsuhmreid: i don't, but perhaps it was easier to explode out into the separate dataset?
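The map-to-array conversion mreid describes can be sketched in plain Python as below. This only illustrates the reshaping, under the assumption that each addon's details are a flat dict; the field names (`name`, `version`) and the helper names are hypothetical and do not reflect the actual main_summary schema or job.

```python
from typing import Any, Dict, List


def addons_map_to_rows(addons: Dict[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn {addon_id: {details}} into a list of flat records.

    The map key is moved into an explicit `addon_id` field so every
    record has the same, fixed set of column names.
    """
    return [
        {**details, "addon_id": addon_id}
        for addon_id, details in sorted(addons.items())
    ]


def explode(clients: Dict[str, Dict[str, Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """One output record per (client, addon), as a separate dataset would hold."""
    return [
        {"client_id": client_id, **row}
        for client_id, addons in sorted(clients.items())
        for row in addons_map_to_rows(addons)
    ]


if __name__ == "__main__":
    clients = {
        "client-1": {
            "addon-a": {"name": "A", "version": "1.0"},
            "addon-b": {"name": "B", "version": "2.1"},
        },
        "client-2": {},
    }
    for record in explode(clients):
        print(record)
```

With an array of records, exploding to one row per addon is a simple flatten, which fits sunahsuh's guess about why the array form was chosen.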
16:19gregglindsomeone mentioned a histograms.json file that listed probes? is that real, or a fever dream I had?
16:21mreidgregglind: this one? http://hg.mozilla.org/mozilla-central/raw-file/tip/toolkit/components/telemetry/Histograms.json
16:24gregglindthat seems useful!
16:26mreidit's super useful
16:26sunahsuhare opt-in vs opt-out marked in the file?
16:27mreidthey are
16:27mreidsee also http://georgf.github.io/fx-data-explorer/index.html
16:27sunahsuhwhat's the field for that?
16:30mreidsunahsuh: releaseChannelCollection
16:30sunahsuhoh, awesome, thanks
16:30mreid(with a value of "opt-out")
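As a small illustration of the field mreid names, the sketch below filters a Histograms.json-style mapping for probes whose `releaseChannelCollection` is `"opt-out"`. The probe names and the other fields in `SAMPLE` are invented for the example, and treating a probe with no such field as not opt-out is an assumption of this sketch.

```python
import json
from typing import Any, Dict, List

# Made-up sample in the general shape of Histograms.json (name -> definition).
SAMPLE = """
{
  "EXAMPLE_OPT_OUT_PROBE": {"kind": "boolean", "releaseChannelCollection": "opt-out"},
  "EXAMPLE_OPT_IN_PROBE": {"kind": "count", "releaseChannelCollection": "opt-in"},
  "EXAMPLE_UNMARKED_PROBE": {"kind": "flag"}
}
"""


def opt_out_probes(histograms: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the sorted names of probes explicitly marked as opt-out."""
    return sorted(
        name
        for name, definition in histograms.items()
        # Assumption: a missing field means the probe is not opt-out.
        if definition.get("releaseChannelCollection") == "opt-out"
    )


if __name__ == "__main__":
    print(opt_out_probes(json.loads(SAMPLE)))
```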
16:50rvitillo|workweekmreid: done
16:50mreidrvitillo|workweek: thanks
19:07ddurstmreid: ok, one last question, which may be a loaded one
19:07mreiduh oh
19:07ddurstis there anywhere I can refer to a redash datasource and see a) what the format of the date-type fields are, b) what formatting functions work, or c) a best practice?
19:08ddurstBecause every time I go to a new datasource, I have no idea what I'm doing with dates and I feel like I know nothing about databases.
19:08* ddurst ^ some of those things are accurate
19:08mreidddurst: this is a common pain point for everyone, for every data source.
19:08ddurstsomething in docs, somewhere?
19:09mreiddates are stored in at least half a dozen different ways in incoming data, and they are mostly stored as-is in those various and sundry forms
19:09ddurstI feel like I should've been writing this down for every one I've asked about.
19:09mreidmany of the telemetry fields are documented with the ping structure itself
19:10mreidthe "best practice" now is pretty bad, so I can't recommend anything too enthusiastically
19:10ddurstok. Then I will just start writing them down as I ask them, for everyone's sake.
19:10mreidbut what I tend to do is select a few tens of values from a column and then export as CSV from re:dash
19:11mreidthat gives you the true underlying representation. If you just select them and view in the re:dash tables, sometimes it tries to be smart and reformat them for you.
19:11mreidthe "proper" solution is to change all these columns to use actual date types
19:11mreidso that the corresponding SQL can be sane
19:11ddurstsure
19:12ddurstbut that's... not happening soon
19:12mreidit might
19:12ddurstI'm cool with converting
19:12ddurstWhat about sources that are not reflected in ping docs, like DSMO-RS?
19:12mreidit's blocked right now on validating that parquet can actually handle date formats nicely (which it couldn't when we first started using parquet)
19:13mreidI'm not sure about other datasets, though the same general approach should work - select some values, look at CSV
19:13mreidpainful, but it works :(
19:14ddurstfair enough
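The "select some values, look at CSV" approach can be semi-automated along these lines: take the raw strings from the export and see which candidate format parses all of them. This is a sketch only. The list of candidate formats is an assumption and would need extending for each data source, and `guess_format` is a hypothetical helper, not a re:dash feature.

```python
from datetime import datetime
from typing import List, Optional

# Assumed candidates; real data sources may use formats not listed here.
CANDIDATE_FORMATS = [
    "%Y-%m-%d",
    "%Y%m%d",
    "%Y-%m-%dT%H:%M:%S",
]


def guess_format(values: List[str]) -> Optional[str]:
    """Return the first candidate format that parses every value, else None."""
    if not values:
        return None
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value did not match this format
        return fmt
    return None


if __name__ == "__main__":
    # Raw strings as they might appear in a CSV export of one column.
    print(guess_format(["20170316", "20170317"]))
    print(guess_format(["2017-03-16"]))
    print(guess_format(["16/03/2017"]))
```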
19:19robotblakemreid: parquet should be able to store dates, times, timestamps, and intervals
19:19robotblakehttps://github.com/apache/parquet-format/blob/master/LogicalTypes.md#datetime-types
19:19robotblakeAnd Hive (though maybe not Athena) supposedly supports all of those
19:20mreidcool
19:20mreidrobotblake: thanks
19:21mreidI'm not sure exactly where the incompatibility was when we started w/ parquet, but if we can confirm that everything works fine, it would be worth the pain to start using real date types
19:27robotblakeLooks like Athena only supports "timestamp" :(
19:30robotblakeThough what exactly constitutes a "timestamp" is anyone's guess
19:42mreidif it's nanos since epoch, we're in luck ;)
19:44robotblakeLooks like it might be?
19:48mreidthat would be cool, as all our "timestamp" fields would just work o_O
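For reference, converting a nanoseconds-since-epoch value to a readable UTC datetime looks roughly like this in Python. Whether Athena's timestamp type actually uses this representation is left open in the chat, so the sketch only covers the conversion itself. `nanos_to_datetime` is a hypothetical helper, and Python's `datetime` keeps microsecond precision, so the last three digits are dropped.

```python
from datetime import datetime, timedelta, timezone

NANOS_PER_SECOND = 1_000_000_000


def nanos_to_datetime(nanos: int) -> datetime:
    """Convert nanoseconds since the Unix epoch to a UTC datetime.

    Sub-microsecond precision is truncated because `datetime` cannot hold it.
    """
    seconds, remainder_nanos = divmod(nanos, NANOS_PER_SECOND)
    whole = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return whole + timedelta(microseconds=remainder_nanos // 1000)


if __name__ == "__main__":
    # Midnight UTC on the date of this log, expressed in nanoseconds.
    print(nanos_to_datetime(1489622400 * NANOS_PER_SECOND).isoformat())
```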
22:13jezdezheads up all. atmo was updated and clusters are now persisting home dirs. more in the what's new section in the footer
22:16sunahsuh0_0