mozilla :: #datapipeline

7 Sep 2017
11:08Dextergah, it's always a great day when your job fails
12:05mdoglioDexter: it's also a great day when you have to git bisect a dozen commits and each iteration takes 10 to 15 minutes
12:47DexterLooks like OH today is way ahead of my timezone :D I'll ask anyway, in case any gentle soul shows some pity :-P
12:47DexterMy HBase job is once again failing. Good news: I got the logs, and it looks like it's once again an OOM... but a different OOM
12:48DexterBad news: running the very same code, on a cluster with the same configuration works.
12:52DexterSo, did something happen on the 2nd, 3rd and 5th of September to Airflow/configuration?
12:54mdoglioDexter: same emr version?
12:55Dextermdoglio, yes, 5.8.0
13:02sunahsuhDexter: we had a main summary outage over the weekend
13:08Dextermh, thanks sunahsuh ! So 2nd and 3rd could somehow be correlated (even if I don't have an explanation). Did that happen in the past? Dependent jobs failing due to a main_summary outage?
13:10sunahsuhhmm, it could -- maybe hbase wasn't happy with simultaneous runs?
15:43franksunahsuh: re: repartitioning before collecting (due to the 2G limit in Spark), if you use coalesce it won't shuffle
15:46sunahsuhfrank: but if we target one partition per core in a 10-node, we'd want a lower limit of ~160 partitions, which doesn't look like it buys us much
15:47sunahsuhthe intermediate collect tasks had 200 tasks each when i was running tests yesterday
15:47franksunahsuh: hmm, yeah that's not much different
15:47frankI mean you *could* go lower
15:47frankbut obviously we'd lose processing power for anything that happens after the coalesce
15:48frankand I'm not even sure if that would fix the problem anyways
15:48frank200 tasks doesn't sound like something that would hit the 2G limit
15:48sunahsuhso we have 2 different limits we're hitting: 2G addressability limit re: persisting the dataset
15:48sunahsuh4G driver memory limit when collecting task results
15:49sunahsuhwe're hitting the 4G driver memory limit when we don't collect out each metric's results
15:51frankokay, that clears it up. Then this would not fix it
15:51sunahsuhmeaning, it appears when we don't do that, spark is running all the metric aggregations at once and sending all the task results back (all 22k of them)
15:51sunahsuhwhich is what we'd want if we weren't hitting the driver memory limit
15:52frankpresumably we'd hit the driver memory limit no matter how we sliced the incoming data
15:52sunahsuhwell, unless we batch up the metrics, which i ran a test of
15:52sunahsuhand that worked -- but i think we could have much larger batch sizes
15:53frankright, what I meant was "if we include all of it it's going to hit the limit"
15:53sunahsuhthe log messages we were getting seem to suggest that we're *just* hitting the limit
15:54frankif that's the case, then we could maybe just do two batches?
15:56sunahsuhwell, like i said, when we 1. change the cluster size and 2. add more metrics, we'll keep having to tune this
15:57sunahsuhor we can figure out the upper limit
15:57sunahsuhlook at the partition number, dynamically group the metrics and not worry about this particular issue again
15:57frankI like the sound of that second solution :)
15:59sunahsuh:) but, that'll probably take like half a day of engineer + cluster time so probably not worth it for now, as fun of a problem as this is
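
A minimal sketch of the dynamic grouping idea from 15:57, assuming each metric aggregation sends one task result per partition back to the driver, and that the driver can safely hold some maximum number of task results at once. The function name `batch_metrics`, the 10,000-result budget, and the metric names are hypothetical; the real upper limit would have to be measured on the cluster.

```python
from typing import List, Sequence


def batch_metrics(metrics: Sequence[str], num_partitions: int,
                  max_task_results: int) -> List[List[str]]:
    """Group metrics so each batch yields at most max_task_results task results.

    Assumption (not from the job itself): one task result per partition per metric.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Number of metrics that fit in one batch; at least one so we always make progress.
    per_batch = max(1, max_task_results // num_partitions)
    return [list(metrics[i:i + per_batch])
            for i in range(0, len(metrics), per_batch)]


# Hypothetical numbers: 110 metrics x 200 partitions = 22,000 task results in total.
metrics = [f"metric_{i}" for i in range(110)]
batches = batch_metrics(metrics, num_partitions=200, max_task_results=10_000)
print(len(batches), [len(b) for b in batches])
```

Because the batch size is derived from the partition count, changing the cluster size or adding metrics changes the number of batches without anyone retuning a constant by hand.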
16:56mreidsunahsuh: want me to file a bug with a summary of this convo?
16:57sunahsuhsure, that'd be great, thanks mreid!
17:08mreidsunahsuh: frank:
17:08firebotBug 1397827 NEW, Optimize number of partitions for experiment aggregation job
17:08frankgreat, thanks mreid!
19:07frankI'm getting "Permission Denied" errors trying to do anything with AWSScala on an ATMO machine:
19:07frankthings that work in the CLI. Also, they work locally
19:08frankany suggestions^?
19:09robotblakefrank: In regards to, what exactly is missing?
19:09firebotBug 1383827 NEW, Make Separate re:dash metadata with Admin access to users table
19:10robotblakeThere is a users table that's visible
19:10robotblakeDoesn't have all the fields visible but
19:11frankrobotblake: I don't have access
19:11frankremember we discussed this :) it needs to be another data source with access to the users table
19:11frankrobotblake: "Error running query: permission denied for relation users "
19:12robotblakeWhat fields are you trying to pull?
19:12frankrobotblake: I need ids and names
19:12frankrobotblake: so that I can figure out who queries belong to
19:13robotblakeEveryone should have access to that table
19:14frankrobotblake: I can run that one you sent, and a copy of it
19:14frankbut I can't run this:
19:15robotblakeYeah, that table contains password hashes and such which we don't want to expose
19:15robotblakeIt's got column level permissions set up
19:16frankrobotblake: ahh, okay that explains it
19:16frankin that case you can close the bug -- thanks!
19:22robotblakeSure thing
20:51frankjgaunt: are you trying to fix that error still
20:52frankjgaunt: plz use approxCountDistinct :)
20:54sunahsuhfrank: he's in office hours if you want to jump in :)
20:54* frank goes back to debugging :)
21:04jgauntfrank - instead of .count()? the dataFrame .countDistinct() wasn't giving me issues
21:07frankjgaunt: the nodup.count() is just giving you the number of distinct document ids
21:07jgauntah, now I see what you mean
21:08frankjgaunt: right, just get an approximate count
21:08frankmap to documentId, get an approximate count distinct of those
21:09frankmight need to transform to a Dataset first to use approxCountDistinct, not sure if there is a comparable RDD function
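
A rough illustration of why an approximate distinct count is cheaper than an exact one: the sketch below keeps only a fixed number of hash values no matter how many document ids it sees. This is not Spark's implementation (Spark's approximate distinct count is based on HyperLogLog++); it swaps in the simpler k-minimum-values estimator, and `kmv_count_distinct`, `k`, and the document ids are all hypothetical.

```python
import hashlib
import heapq
from typing import Iterable


def kmv_count_distinct(values: Iterable[str], k: int = 1024) -> int:
    """Estimate the number of distinct values with a k-minimum-values sketch."""
    heap = []     # negated hashes, so the largest kept hash sits at heap[0]
    kept = set()  # hashes currently in the heap, to skip duplicates
    for v in values:
        # 64-bit hash taken from the first 8 bytes of SHA-1.
        h = int.from_bytes(hashlib.sha1(v.encode("utf-8")).digest()[:8], "big")
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes seen: exact, barring hash collisions.
        return len(heap)
    kth_smallest = -heap[0] / 2.0 ** 64  # normalise to [0, 1)
    return int((k - 1) / kth_smallest)


# Hypothetical data: 120,000 pings carrying 50,000 distinct document ids.
doc_ids = [f"doc-{i % 50_000}" for i in range(120_000)]
estimate = kmv_count_distinct(doc_ids)
print(estimate)
```

With k = 1024 the estimate is typically within a few percent of the true count while holding at most 1024 hashes in memory.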
23:26jgauntamiyaguchi, frank I filtered those main pings down to only clients with crashes and found the .countApprox() method for RDDs as you two suggested, respectively
23:27jgauntand I still can't get the count - gonna have to let this one go
23:27jgauntthx again for the suggestions
23:27amiyaguchijgaunt: as in letting go of document deduplication?
23:28jgauntI think just letting go of the duplication count - they should still be removed
23:28jgaunt...assuming the rest of the code runs now :)
23:48jgauntit doesn't; boo hoo
23:56amiyaguchijgaunt: the goal is to find the crash rate (crashes normalized by some usage metric) across two sub-populations of clients?
23:57jgauntamiyaguchi: yeah crashes per activeTicks, upTime, and totalTime were the 3 I was going to look at
23:59amiyaguchijgaunt: curious where it crashes now?
23:59jgauntnow it dies when I try to .createDataFrame() from the main pings
8 Sep 2017
No messages