Debugging / Tuning in Spark
Shiao-An Yuan
@sayuan
2016-08-11
Spark Overview
● Cluster Manager (aka Master)
● Worker (aka Slave)
● Driver
● Executor
http://spark.apache.org/docs/latest/cluster-overview.html
RDD (Resilient Distributed Dataset)
A fault-tolerant collection of elements that can be operated
on in parallel
Word Count
val sc: SparkContext = ...
val result = sc.textFile(file) // RDD[String]
.flatMap(_.split(" ")) // RDD[String]
.map(_ -> 1) // RDD[(String, Int)]
.groupByKey() // RDD[(String, Iterable[Int])]
.map(x => (x._1, x._2.sum)) // RDD[(String, Int)]
  .collect() // Array[(String, Int)]
Lazy, Transformation, Action, Job
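As a sketch, the same word count is cheaper with reduceByKey, which combines counts on each partition before the shuffle instead of shipping every occurrence of every word to one place (sc and file as in the code above):

```scala
// Sketch: reduceByKey applies the combine function map-side first,
// so only one (word, partialCount) pair per word per partition crosses the network.
val result = sc.textFile(file)   // RDD[String]
  .flatMap(_.split(" "))         // RDD[String]
  .map(_ -> 1)                   // RDD[(String, Int)]
  .reduceByKey(_ + _)            // RDD[(String, Int)] -- no Iterable materialized
  .collect()                     // Array[(String, Int)]
```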
● Kryo serialization
○ Much faster
○ Registering classes needed for best results
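A minimal sketch of enabling Kryo; the record classes registered here are illustrative placeholders, not part of the original example:

```scala
// Sketch: switch the serializer to Kryo and register the classes
// that will actually be serialized (placeholders below).
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))
// Optionally fail fast when an unregistered class is serialized:
// .set("spark.kryo.registrationRequired", "true")
```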
http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
Common Failures
● Large shuffle blocks
○ java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
■ Increase partition count
○ MetadataFetchFailedException, FetchFailedException
■ Increase partition count
■ Increase `spark.executor.memory`
■ …
○ java.lang.OutOfMemoryError: GC overhead limit exceeded
■ May be caused by shuffle spill
java.lang.OutOfMemoryError: Java heap space
● Driver
○ Increase `spark.driver.memory`
○ Avoid collect(); use instead:
■ take()
■ saveAsTextFile()
● Executor
○ Increase `spark.executor.memory`
○ More nodes
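Two sketched alternatives to collect() when the full result does not need to reach the driver (rdd and the output path are illustrative):

```scala
// take(n) pulls only a bounded sample to the driver
val preview = rdd.take(100)
// saveAsTextFile writes directly from the executors;
// nothing large ever lands on the driver
rdd.saveAsTextFile("hdfs:///tmp/word-count-output")
```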
java.io.IOException: No space left on device
● SPARK_WORKER_DIR
● SPARK_LOCAL_DIRS, spark.local.dir
● Shuffle files
○ Only deleted after the corresponding RDD object has been GC'd
Other Tips
● Event logs
○ spark.eventLog.enabled=true
○ ${SPARK_HOME}/sbin/start-history-server.sh
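A sketch of the matching settings in conf/spark-defaults.conf (the log directory is an illustrative path; the same directory must be visible to the history server):

```
spark.eventLog.enabled         true
spark.eventLog.dir             hdfs:///spark-events
spark.history.fs.logDirectory  hdfs:///spark-events
```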
Partitions
● Rule of thumb: ~128 MB per partition
● If #partitions <= 2000 but close, bump to just over 2000 (Spark compresses shuffle map statuses only above 2000 partitions)
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
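A sketch of applying the rule of thumb; the input size and RDD name are illustrative assumptions:

```scala
// Sketch: aim for roughly 128 MB per partition.
val inputBytes = 100L * 1024 * 1024 * 1024                     // assume ~100 GB of input
val numPartitions = (inputBytes / (128L * 1024 * 1024)).toInt  // -> 800 partitions
val resized = rdd.repartition(numPartitions)  // full shuffle
// When only *reducing* the count, coalesce avoids the full shuffle:
val fewer = resized.coalesce(400)
```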
Executors, Cores, Memory!?
● 32 nodes
● 16 cores each
● 64 GB of RAM each
● If you have an application that needs 32 cores, what is the
correct setting?
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications
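One common sizing recipe from the "Top 5 Mistakes" talk linked above, worked through as a sketch; the node specs are the slide's, while the 5-cores-per-executor heuristic and the ~7% memory overhead figure are assumptions taken from that talk:

```scala
// Sketch: the common "~5 cores per executor" sizing heuristic.
val coresPerNode = 16
val ramPerNodeGb = 64

val usableCores = coresPerNode - 1                      // leave 1 core for OS/daemons -> 15
val coresPerExecutor = 5                                // heuristic: good HDFS throughput
val executorsPerNode = usableCores / coresPerExecutor   // -> 3 executors per node
val usableRamGb = ramPerNodeGb - 1                      // leave ~1 GB for the OS
val rawMemPerExecutor = usableRamGb / executorsPerNode  // -> 21 GB each
val executorMemGb = (rawMemPerExecutor * 0.93).toInt    // ~7% lost to overhead -> 19 GB

// An application needing 32 cores would then request ceil(32 / 5) executors:
val executorsNeeded = math.ceil(32.0 / coresPerExecutor).toInt  // -> 7
```

With these assumptions the answer is 7 executors, 5 cores and roughly 19 GB each, rather than one huge executor per node.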
Why is Spark Debugging / Tuning Hard?
● Distributed
● Lazy
● Hard to benchmark
● Spark is sensitive to configuration changes
Conclusion
● When in doubt, repartition!
● Avoid shuffle if you can
● Choose a reasonable partition count
● Premature optimization is the root of all evil -- Donald Knuth
Reference
● Tuning and Debugging in Apache Spark
● Top 5 Mistakes to Avoid When Writing Apache Spark Applications
● How-to: Tune Your Apache Spark Jobs (Part 1)
● How-to: Tune Your Apache Spark Jobs (Part 2)