Aaron Davidson
October 28, 2015
About Databricks
Founded by the creators of Spark; remains the largest contributor
What have we learned?
Hosted service + focus on Spark = lots of user feedback
Community!
Outline: What are the problems?
● Moving beyond Python performance
● Using Spark with new languages (R)
● Network and CPU-bound workloads
● Miscellaneous common pitfalls
Python: Who uses it, anyway?
PySpark Architecture
[Diagram: driver reading /data]
sc.textFile("/data") \
  .filter(lambda s: "foobar" in s) \
  .count()
Java-to-Python communication is expensive!
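The cost called out above can be sketched in plain Python (illustrative only, not PySpark internals; pickle stands in for the real wire format, which in PySpark is also pickle-based): every record crossing the Java-to-Python boundary is serialized and deserialized before the lambda can run.

```python
# Illustrative sketch (not PySpark internals): each record crossing the
# JVM-to-Python boundary pays a serialize/deserialize cost before the
# Python lambda sees it.
import pickle

records = [f"row {i} foobar" for i in range(1000)]
wire = [pickle.dumps(r) for r in records]          # JVM side -> Python worker
matched = sum(1 for b in wire if "foobar" in pickle.loads(b))
```

Per-record, that overhead dwarfs the `in` check itself, which is why keeping execution inside the JVM pays off.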
Moving beyond Python performance
Using RDDs
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()
Using DataFrames
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()
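The RDD pipeline above computes a per-key average by carrying a [sum, count] pair through the reduce. A plain-Python sketch of the same logic (illustrative only, not PySpark; the sample rows are made up):

```python
# Plain-Python sketch of what the RDD pipeline computes: a per-key
# average carried as a [sum, count] pair.
rows = ["a\t1", "a\t3", "b\t10"]               # stands in for the input file
pairs = [(k, [int(v), 1]) for k, v in (line.split("\t") for line in rows)]
acc = {}
for k, (s, c) in pairs:                        # the reduceByKey step
    total = acc.get(k, [0, 0])
    acc[k] = [total[0] + s, total[1] + c]
result = {k: s / c for k, (s, c) in acc.items()}   # the final map step
# result == {"a": 2.0, "b": 10.0}
```

The DataFrame version expresses the same aggregation declaratively, so Spark can run it entirely in the JVM instead of shipping every record through Python lambdas.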
Using Spark with other languages (R)
- As adoption rises, new groups of people are trying Spark:
  - People who have never used Hadoop or distributed computing
  - People who are familiar with statistical languages
Spark R docs
See talk: Enabling exploratory data science with Spark and R
Network and CPU-bound workloads
- Databricks uses S3 heavily, instead of HDFS
- S3 is a key-value based blob store “in the cloud”
- Accessed over the network
- Intended for large object storage
- ~10-200 ms latency for reads and writes
- Adapters for HDFS-like access (s3n/s3a) through Spark
- Strong consistency with some caveats (updates and us-east-1)
S3 as data storage
[Diagram: "Traditional" data warehouse instance vs. Databricks storing data in Amazon S3]
Answer: buffering!
S3 Performance Problem #2
sc.textFile("/data").filter(s => doCompute(s)).count()
[Chart: network and CPU utilization over time — while the network is busy the CPU idles, and vice versa]
S3: Pipelining to the rescue
[Diagram: a dedicated reading thread streams data from S3 through a pipe/buffer to the user program, overlapping network and CPU work over time]
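The pipelining idea above can be sketched as a standard producer-consumer pattern (illustrative only, not Spark's actual implementation; `fetch_chunks` is a made-up stand-in for S3 reads): a reader thread keeps the network busy while the main thread computes.

```python
# Producer-consumer sketch of the pipe/buffer in the diagram: a reader
# thread fills a bounded queue while the consumer processes chunks, so
# network and CPU work overlap instead of alternating.
import queue
import threading

def fetch_chunks(n):
    # stand-in for reading chunks from S3 over the network
    for i in range(n):
        yield f"chunk-{i}"

def pipelined_count(n):
    buf = queue.Queue(maxsize=4)   # bounded buffer between reader and consumer
    DONE = object()                # sentinel marking end of stream

    def reader():
        for chunk in fetch_chunks(n):
            buf.put(chunk)         # blocks when the buffer is full
        buf.put(DONE)

    threading.Thread(target=reader, daemon=True).start()
    count = 0
    while (chunk := buf.get()) is not DONE:
        count += 1                 # stand-in for doCompute(chunk)
    return count
```

The bounded queue is the key design choice: it lets the reader run ahead of the consumer, but only by a fixed amount, so memory stays bounded.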
S3: Results
● Max network throughput (1 Gb/s on our NICs)
● Use 100% of a core across 8 threads (largely SSL)
● With this optimization, S3 has worked well:
○ Spark hides latency via its inherent batching (except for driver metadata lookups)
○ Network is pretty fast
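A back-of-envelope calculation shows why batching hides S3 latency (the 1 Gb/s and ~100 ms figures come from the slides; the 128 MB object size is an assumption for illustration):

```python
# Back-of-envelope: first-byte latency amortized over one large read
# at full line rate.
object_mb = 128                    # assumed size of a large input object
throughput_gbps = 1.0              # NIC line rate from the slide
latency_s = 0.1                    # ~100 ms first-byte latency, per the slide
transfer_s = object_mb * 8 / (throughput_gbps * 1000)  # 1.024 s on the wire
overhead = latency_s / (latency_s + transfer_s)        # latency share of total
```

At this size the latency is under 10% of total read time; the smaller the reads, the larger that share grows, which is why per-object driver metadata lookups remain the painful case.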
Why is the network "pretty fast"?
r3.2xlarge: