mozilla :: #datapipeline

14 Mar 2017
00:35 <amiyaguchi> sunahsuh: thanks!
15:20 <frank> mdoglio: does spark-shell work as expected on an ATMO cluster
15:20 <mdoglio> it should
15:20 <frank> It keeps hanging on me. Hmm.
15:21 <mdoglio> you must have a noteobok open
15:21 <mdoglio> notebook, even
15:21 <frank> ooh, that may have been it
15:21 <frank> I may not have closed them correctly
15:21 <frank> I'll try again
16:03 <sunahsuh> what's the vidyo room again for data club?
16:03 <frank> sunahsuh: MetricsProgram
16:03 <sunahsuh> oh, nm
16:03 <sunahsuh> thanks frank :)
16:07 <jezdez> robotblake: eeenteresting
16:07 <rvitillo|workweek> RaFromBRC you look sort of frozen
16:08 <RaFromBRC> heh... it's early here and i'm underslept
17:02 <trink> if we start flagging anomalies on a field by field basis the flag list could equal the size of the ping
17:03 <trink> I would opt for throwing it in the error stream after x anomalies
17:04 <trink> spenrose: ^
17:05 <frank> trink: why not just use a single field for flagging. 2 flag options * 2000 fields = 4000, so just an int
17:06 <frank> hmm, never mind, would need a bool for each flag...
17:08 <robotblake> frank: Like a bitmask?
17:09 <frank> yeah, but that's 2^2000
17:13 <trink> that would only work if the field set was fixed and stable anyway
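A minimal sketch of the bitmask idea from the exchange above, assuming a fixed, ordered field list. The names FIELDS, INDEX, set_flag, and is_flagged are hypothetical and not from any Mozilla code. It illustrates that one flag per field over 2000 fields needs 2000 bits (250 bytes); 2^2000 is the number of distinct values such a mask can take, not its size. As trink notes, the approach only holds while the field set is fixed and stable.

```python
# Hypothetical fixed field list; the real ping schema is not shown in this log.
FIELDS = ["field_%d" % i for i in range(2000)]
INDEX = {name: i for i, name in enumerate(FIELDS)}


def set_flag(mask, field):
    """Return a copy of mask with the bit for `field` set."""
    return mask | (1 << INDEX[field])


def is_flagged(mask, field):
    """Report whether the bit for `field` is set in mask."""
    return bool((mask >> INDEX[field]) & 1)


mask = 0
mask = set_flag(mask, "field_0")
mask = set_flag(mask, "field_1999")

# Bytes needed to store the mask once the highest field is flagged.
size_bytes = (mask.bit_length() + 7) // 8
```

Python integers are arbitrary precision, so a single int can carry all 2000 bits; a second flag type per field would need a second mask or two bits per field.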
17:24 <spenrose> trink i think we want to work backwards from "what should we expose to which analysts?" once we know what we're trying to accomplish (and we do a round of discovery on how often fields/records will be flagged under policies X, Y, Z) it will be much easier to weigh implementation decisions
17:31 <trink> there are really three types of problems: data corruption, bad/buggy instrumentation, malicious data. 1 & 3 should be discarded, and for 2 those fields should be blacklisted and never trusted even if the data sometimes looks right
17:35 <bytesized> Would anyone be able to tell me how much RAM is available to me when I create a size 1 Spark cluster? Additionally, does anyone know if the version of Python available is 32 or 64 bit?
17:37 <spenrose> mostly agree on corruption. malicious data is often of intense interest to e.g. search, addon teams. agree on bad instrumentation
17:38 <spenrose> but per presentation i don't think we can get the big decisions right on our first try. i think we need a tech and org structure that can support iteration
17:38 <ashort> bytesized: for the latter question, 'import platform; print platform.architecture()' will tell you
17:39 <bytesized> ashort: Alright, I will try that once I have started one up
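A short sketch expanding on ashort's one-liner, written so it runs under both Python 2 and Python 3 (the one-liner as typed uses the Python 2 print statement). The variable names are illustrative. It checks the interpreter's bitness three ways, since platform.architecture() inspects the executable while the other two reflect the running interpreter's pointer size.

```python
import platform
import struct
import sys

# Reported architecture of the interpreter binary, e.g. "64bit".
bits_from_platform = platform.architecture()[0]

# Size of a C pointer in the running interpreter, in bits.
pointer_bits = struct.calcsize("P") * 8

# sys.maxsize exceeds 2**32 only on a 64-bit build.
is_64bit = sys.maxsize > 2**32

print("%s %d %s" % (bits_from_platform, pointer_bits, is_64bit))
```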
17:41 <frank> bytesized: it is a c3.4xlarge instance
17:41 <frank> bytesized: 16 vcpu, 30 GiB
17:42 <bytesized> frank: Great, thanks
17:42 <frank> Spark worker utilizes only a subset of that though
17:42 <bytesized> frank: hmm. Any idea how much ram would be available to the worker?
17:43 <trink> spenrose: what I meant by malicious is data not produced by our client
17:43 <frank> about half - 15 GiB
17:43 <spenrose> trink ah, got it
17:44 <bytesized> frank: Awesome. Thanks for your help
17:44 <frank> np :)
18:03 <amiyaguchi> is removing a table from hive as simple as `drop table if exists my_table;`?
18:05 <frank> amiyaguchi: correctomundo
18:08 <amiyaguchi> frank: cool :) Also, do you know if there are any settings we add to presto? I saw the schema evolution notebook that suggests adding a config hive.parquet.use_column_names=true
18:08 <amiyaguchi> I'm having issues with presto throwing nullpointerexceptions
18:09 <frank> amiyaguchi: yes we set that
18:09 <frank> that's the only one we set, afaik
18:09 <amiyaguchi> frank: and everything else is default emr?
18:09 <frank> robotblake, is that correct?
18:10 <frank> blake will have more info, he maintains our Presto instance
18:10 <robotblake> That should be correct
18:10 <amiyaguchi> our emr-bootstrap-presto repo doesn't seem to document the config
18:12 <frank> amiyaguchi: might be worth putting a bug out for it then
18:13 <amiyaguchi> frank: is it a bugzilla or github issue type of bug?
18:26 <robotblake> Athena doesn't support BINARY fields :|
18:29 <frank> robotblake: the only option for those is to convert them them
18:29 <frank> them then*
18:29 * frank could have left off the "then"
18:30 <robotblake> I think I can store them as STRING
18:31 <robotblake> To be fair, in the particular case I noticed the issue, it's that the fields literally are strings, they just don't have the UTF8 converted_type set in the schema
18:31 <frank> robotblake: which dataset?
18:32 <robotblake> It's one of the direct-to-parquet... just a sec
18:33 <frank> please don't be core. please don't be core.
18:33 <robotblake> It's totally core
18:33 <frank> well ain't that just the thing I needed
18:45 <robotblake> With that said, all other datasets besides it and client_count are in Athena now
18:46 <frank> awesome awesome!
18:48 <mreid> robotblake: does that mean I can query main_summary with impunity via Athena?
18:49 <robotblake> There are some limitations, I think only 5 queries simultaneously until we get the limit raised
18:49 <mreid> that seems alright
18:49 <robotblake> Ah shit, I didn't load main_summary yet, just a sec
18:49 <robotblake> Sorry mreid :(
18:50 <mreid> robotblake: I don't actually need it yet, was just curious
18:51 <mreid> I just get nervous when querying it via our presto cluster since there's some risk of impacting everyone else :-/
18:51 <frank> robotblake: for a select * from mobile_clients
18:52 <robotblake> Also there's that
18:52 <frank> is that related to what you mentioned
18:52 <robotblake> That's the case sensitive stupidity
18:52 <frank> mreid: this is presto :)
18:52 <mreid> ah ha
18:53 <frank> robotblake: ah crap, I see
18:53 <frank> camel cased and not
18:54 <gregglind> sunahsuh, I have some deep in the weeds questions for you, for our quantum stuff.
18:54 <sunahsuh> haha okay
18:55 <gregglind> mostly about how unthrottle semantics and sampling worked. osmose has been trying to implement your recalled ideas.
18:56 <gregglind> and we had a few edge cases.
20:28 <rvitillo|workweek> robotblake: "With that said, all other datasets besides it and client_count are in Athena now" and you say that just like that!? \o/ Nice job!
20:29 <robotblake> Well, there're a couple little bugs to work out, but yeah :)
21:34 <sunahsuh> hmm, we don't have data for 3/13 in main_summary loaded in presto
21:35 <sunahsuh> but the parquet output seems fine
21:35 <sunahsuh> and p2h looks like it ran at 8:00
21:35 <sunahsuh> robotblake: ^
21:36 <robotblake> I'll kick it off
21:37 <sunahsuh> any ideas why it's not there to begin with?
21:40 <robotblake> All I can figure is I messed something up when I was setting up Athena and it got skipped
21:40 <sunahsuh> *shrug* okay
21:44 <frank> robotblake: did you check out that code I sent you about replacing MSCK REPAIR TABLE
21:44 <frank> that is the last step and parquet2hive will be incredible :)
21:45 <robotblake> frank: Yeah, glanced at it, was pretty similar to the code I had written before, hoping to take a stab at it after I wrap up some other work
21:45 <frank> do we have a bug out for that?
21:45 <frank> I think so...
21:46 <robotblake> sunahsuh: Looks like it OOM'd when it ran before
21:46 <robotblake> Yeah, it's assigned to me
21:46 <sunahsuh> ahh, that's.. worrisome
21:47 <frank> yeah, bug 1306323
21:47 <firebot> NEW, p2h: msck repair table takes a long time
21:48 <frank> robotblake: and we get to close some bugs!
21:48 <frank> bug 1330101 and bug 1333066
21:48 <firebot> FIXED, bimsland Make p2h spit out raw SQL
21:48 <firebot> NEW, bimsland P2H should use 'schema' field of parquet metadata to get schema
21:50 <robotblake> frank: Did you look at
21:50 <robotblake> I added the deploy stuff, just need to get added to pypi
21:50 <frank> Didn't get the update email for some reason
21:50 <frank> you need added?
21:52 <robotblake> I'm not a maintainer on the parquet2hive package on pypi
21:53 <frank> ugh the pypi website is awful
21:56 <frank> robotblake: I can't login to the roles portion
21:56 <frank> probably roberto will have to do it
21:56 <robotblake> Yeah :\
21:56 <frank> rvitillo|workweek: when you get a chance, add robotblake to the parquet2hive package on pypi
21:57 <rvitillo|workweek> whats the user name?
21:58 <rvitillo|workweek> frank: robotblake done
21:59 <frank> nice, let
21:59 <frank> let's get this change out
22:01 <robotblake> frank: Should we do 0.3.0 since it's technically backwards compatible? I feel a little weird about making it 1.0.0 :\
22:02 <frank> Sure, that's fine
22:06 <frank> robotblake: auto deploy worked!
22:10 <frank> robotblake: maybe when we stabilize on 1.0.0, we can have the default be it spits out sql
22:10 <robotblake> Yeah, that's my hope
22:10 <frank> because really that makes a lot more sense than passing --sql
22:54 <rrayborn> Does anyone have an example of saving a large RDD to s3? I'm getting permissions errors and am quite unfamiliar with writing to s3 except through S3Transfer
22:55 <frank> rrayborn: this job does:
22:55 <frank> it transforms it to a Dataset though
22:56 <frank> there's a few other etl jobs in RTMO
23:00 <rrayborn> Thanks frank, I'll take a look
23:02 <amiyaguchi> grr presto, why do you have to fail on such a simple query
23:30 <robotblake> amiyaguchi: Is that consistent?
23:32 <amiyaguchi> robotblake: I can run queries in spark that I can't run in presto, at least for the one above. I'm in the process of digging deeper, since I found a single column that I can't read.
23:32 <robotblake> I just ran that exact spark sql query in the presto-cli and got the same results back
23:33 <amiyaguchi> also the behavior on my own presto cluster (using presto-cli) seems consistent with the one powering redash
23:34 <robotblake> Just ran it on redash too
23:34 <amiyaguchi> it fails on redash, doesn't it?
23:35 <robotblake> Can you see that?
23:36 <amiyaguchi> yeah, that's a surprising result. I wonder what the difference between my setup and the dev one is
23:38 <amiyaguchi> vanilla emr presto instance aside from using parquet2hive from pip
23:38 <robotblake> What version are you running?
23:39 <amiyaguchi> The one on emr 5.4.0
23:39 <amiyaguchi> so 0.1666
23:39 <robotblake> Did you mean "hive.parquet.use-column-names = true"?
23:39 <robotblake> use vs read
23:40 <amiyaguchi> yes, I just mistyped it here
23:41 <robotblake> I wonder if something changed from the version we're running to your version
23:41 <amiyaguchi> what version of emr are we running? I might as well just completely mirror the setup.
23:42 <amiyaguchi> Or maybe it might be worthwhile to set up athena instead? I
23:43 <robotblake> I have Athena hooked up on the prod redash
23:43 <robotblake> If you switch the source to "Athena (Testing)"
23:44 <robotblake> Note: The tables are prefixed with "telemetry." so churn is "SELECT COUNT(*) FROM telemetry.churn ..."
23:44 <robotblake> I need to figure out how to change redash to use the telemetry db by default
23:45 <amiyaguchi> Is it possible for me to just stand up a superset instance and add it to the athena/presto security group?
23:46 <amiyaguchi> all I really want is to have the data to play with, though I do plan on doing some bigger queries
23:46 <robotblake> Does superset support Athena yet? It's not the same as the Presto API
23:50 <amiyaguchi> superset uses sqlalchemy for the db connectors, so maybe it isn't supported yet
23:51 <robotblake> Ah, yeah, so Athena is only "officially" available via a JDBC driver
23:52 <robotblake> In practice I reverse engineered the service definitions for it so it can be used directly from boto / python but that doesn't get you the sqlalchemy layer
23:55 <amiyaguchi> maybe I should reevaluate my approach if this isn't going to be future proof
23:55 <robotblake> I'm working on it as a side project so it'd be interesting to make a sqlalchemy layer, I seem to remember the presto sqlalchemy adapter is relatively simple
23:56 <amiyaguchi> hmm, but I think for now I'll go with dumping the data into redshift or something of the sort. It would be nice to have less moving parts
23:57 <amiyaguchi> it might turn out that superset doesn't do what I want :/
23:57 <robotblake> Makes sense :\
23:58 <amiyaguchi> robotblake: thanks for the discussion, i think it's saved me some time :)