mozilla :: #datapipeline

17 Apr 2017
13:12 <gregglind> good morning peeps
13:20 <mreid> hiya gregglind
16:31 <trink> fyi, the Apr03 sprint data platform real-time packages are available https://people-mozilla.org/~mtrinkala/packages/
17:02 <harter> kparlante, amiyaguchi, mreid: anyone want to chat quick about the specs for this dataset?
17:02 <kparlante> sure
17:02 <harter> my room?
17:02 <mreid> harter: yes
17:10 <trink> mreid: whd: https://bugzilla.mozilla.org/show_bug.cgi?id=1337927 are we unblocked for this sprint?
17:10 <firebot> Bug 1337927 ASSIGNED, mtrinkala@mozilla.com Update schema validation code to handle new doctypes without a code change
17:30 <mreid> trink: I believe we need to continue accepting old shield pings for the foreseeable future :(
17:34 <trink> that should block everything else?
17:35 <trink> why don't we just back the shield schemas out?
17:44 <amiyaguchi> I've been getting a lot of node failures on Spark: 3 physical nodes @ 30 GB memory each, working on a dataset around 300 GB in size
17:45 <amiyaguchi> I'm wondering if partition size would make any difference during shuffles
17:46 <amiyaguchi> I think executors might go OOM? It's really hard to debug the reason Spark comes to a crawl near the end of the stage
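A minimal PySpark sketch of the kind of job being described, with an explicit partition count on the reduceByKey so each shuffle task holds less data in memory; the paths, field layout, and numbers are all assumptions for illustration, not the actual job:

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="shuffle-partition-sketch")

    # Hypothetical input; the real dataset and key are not shown in the channel.
    records = sc.textFile("s3://example-bucket/input/")
    pairs = records.map(lambda line: (line.split("\t")[0], 1))  # key on the first field

    # More (smaller) shuffle partitions mean each reduce task buffers less data,
    # which is one common way to relieve executor OOMs during the shuffle.
    counts = pairs.reduceByKey(add, numPartitions=2048)
    counts.saveAsTextFile("s3://example-bucket/output/")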
17:51 <mreid> trink: yeah, backing out the shield schemas until we have version-awareness is probably a good idea
19:02 <robotblake> Heads up that it looks like bintray / sbt may be having issues and preventing ATMO clusters from starting up
19:21 <frank> amiyaguchi: the Spark log isn't giving any hints? But also with Spark it's almost *always* memory pressure. Why not just use a few more machines?
19:23 <amiyaguchi> frank: the logs don't give any really useful hints, but it's definitely memory. I'm going to try bumping the machines up from 5 to 10 so everything fits into memory with less spilling, and see if that helps any
19:23 <amiyaguchi> the way it's failing is just kind of mysterious to me though
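One hedged way to express that kind of scaling in Spark configuration; the values below are illustrative guesses, not settings anyone quoted in the channel:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("bigger-cluster-sketch")
            .set("spark.executor.instances", "10")      # more executors across the cluster
            .set("spark.executor.memory", "20g")        # leave headroom for YARN overhead
            .set("spark.default.parallelism", "2048"))  # more, smaller shuffle partitions
    sc = SparkContext(conf=conf)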
19:24 <frank> hmm, that's too bad. Is this an RDD or Spark SQL?
19:26 <amiyaguchi> it's mostly RDD for this
19:27 <frank> yeah, probably just some sort of data skew then
19:27 <amiyaguchi> I haven't narrowed it down to either the huge map function or the reduceByKey shuffle
19:28 <frank> almost definitely the latter
19:29 <amiyaguchi> I figured, although I don't know why the executors are dying
19:30 <amiyaguchi> I guess it doesn't need the parallelization during the shuffle stage and kills them off?
19:31 <frank> "doesn't need the parallelization"? I'm not sure what you mean
19:33 <amiyaguchi> I think during the few runs that I had last week, each node would get split into 4 logical executors
19:33 <amiyaguchi> for a total of 12, and only 3 would be alive by the end of the reduceByKey stage
19:34 <amiyaguchi> probably because it's an I/O-bound thing, instead of a CPU-bound thing
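A small sketch of one way to check frank's data-skew guess, by counting records per key; a handful of keys holding most of the records would also explain why only a few executors stay busy (and then die) at the end of the reduceByKey stage. The path and key extraction are assumptions carried over from the sketch above:

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="key-skew-check")
    records = sc.textFile("s3://example-bucket/input/")  # same hypothetical input as above
    key_counts = records.map(lambda line: (line.split("\t")[0], 1)).reduceByKey(add)

    # If the top few keys dwarf the rest, the tasks that own them are the ones that OOM.
    for key, count in key_counts.takeOrdered(20, key=lambda kv: -kv[1]):
        print(key, count)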
19:35 <frank> are you repartitioning/coalescing after the reduceByKey?
19:35 <amiyaguchi> at least I think that makes sense, I would have to try it with a different number of physical nodes to verify that explanation
19:35 <amiyaguchi> I only repartition once at the end, since the result is pretty small
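A sketch of that "repartition once at the end" shape, assuming the same hypothetical job as above: the wide shuffle keeps many partitions, and only the small reduced result gets collapsed before writing:

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="coalesce-at-end")
    pairs = sc.textFile("s3://example-bucket/input/").map(lambda l: (l.split("\t")[0], 1))

    counts = pairs.reduceByKey(add, numPartitions=2048)  # shuffle stays wide and parallel
    # coalesce() narrows the small result without another full shuffle, unlike repartition().
    counts.coalesce(8).saveAsTextFile("s3://example-bucket/output/")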
19:38 <amiyaguchi> I read that repartitioning the RDDs into smaller blocks of 128 MB is better for the shuffle stage, because it uses a mechanism similar to BitTorrent to redistribute data
19:41 <frank> equal sizes for all executors is what you want
19:44 <amiyaguchi> I wonder if the problem that I have goes away if the size of the RDD is less than the total executor memory
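Back-of-the-envelope numbers for the sizes quoted in this conversation (nothing here is measured, it just restates the figures above): 3 nodes at 30 GB is roughly 90 GB of executor memory against a ~300 GB dataset, so some spilling is unavoidable, and a 128 MB target block size implies on the order of 2,400 partitions:

    dataset_gb = 300            # approximate input size mentioned above
    nodes = 3                   # physical nodes
    memory_per_node_gb = 30     # memory per node

    total_executor_memory_gb = nodes * memory_per_node_gb    # 90 GB, well under 300 GB
    partition_mb = 128                                       # target block size from above
    partitions_at_128mb = dataset_gb * 1024 // partition_mb  # ~2400 partitions

    print(total_executor_memory_gb, partitions_at_128mb)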
21:31 <robotblake> The sbt issues that were affecting ATMO / Airflow appear to be resolved, and we're working on something to make sure this doesn't happen again
21:31 <robotblake> ^ amiyaguchi
21:33 <amiyaguchi> robotblake: awesome :)
18 Apr 2017
No messages