

Comprehensive Guide for Tuning Spark Big Data Applications and Infrastructure

KANISKA MANDAL

INTRODUCTION

Spark Job Tuning and Troubleshooting - Key Points:

Parallelism

Parallelism is an important concept in Spark: without sufficient parallelism the cluster cannot be fully utilized. By default, Spark sets the number of partitions of an input file according to its configuration and the distributed shuffles it performs.

As a rule of thumb, parallelism should be at least 2-4 times spark.executor.instances * spark.executor.cores.

spark.executor.memory + spark.yarn.executor.memoryOverhead = memory per node / number of executors per node
spark.yarn.executor.memoryOverhead ≈ spark.executor.memory * 0.10
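
As an illustration, the sizing below assumes a hypothetical cluster of 10 worker nodes with 16 usable cores and 64 GB of usable memory each, running 4 executors per node; the numbers are only a starting point derived from the formulas above.

import org.apache.spark.sql.SparkSession

// 10 nodes * 4 executors = 40 executors with 4 cores each.
// Memory per executor (heap + overhead) = 64 GB / 4 = 16 GB => ~14g heap + ~1.5 GB overhead.
val spark = SparkSession.builder()
  .appName("parallelism-sizing-sketch")
  .config("spark.executor.instances", "40")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "14g")
  .config("spark.yarn.executor.memoryOverhead", "1536") // MB, roughly 10% of the heap
  // parallelism = 2-4 x (instances * cores) = 2-4 x 160; ~3x chosen here
  .config("spark.default.parallelism", "480")
  .config("spark.sql.shuffle.partitions", "480")
  .getOrCreate()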

Size of Result

Note that executor memory must be larger than your largest partition. Calling collect or reduce pulls data back to the driver, so if you try to pull more data back to the driver than fits in the driver's memory, the job will fail.

Grouping Query

reduceByKey and aggregateByKey offer much better performance than groupByKey.

Note that reduceByKey internally merges the values for each key using an associative and commutative reduce function, performing local merges on each mapper before sending results to the reducer.
The output is hash-partitioned. Always prefer applying reduceByKey first to reduce the data size and only then perform the required join operation. reduceByKey has disk I/O overhead, so set spark.local.dir to allocate a large temp space for long-running jobs.
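
A minimal word-count sketch of the difference, assuming an existing SparkContext sc and a hypothetical input path:

// reduceByKey combines values map-side before the shuffle, while groupByKey
// ships every raw (word, 1) pair across the network.
val words = sc.textFile("hdfs:///data/corpus.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

val slow = words.groupByKey().mapValues(_.sum)   // shuffles all raw pairs
val fast = words.reduceByKey(_ + _)              // local combine, then shuffle partial sums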

Reading Cassandra Data

While reading a huge amount of data from Cassandra, ensure the data is partitioned with a proper partition key.

val largeSetOfData = spark.read
  .cassandraFormat("cassandraSourceTable", "test_keyspace")
  .options(ReadConf.SplitSizeInMBParam.option(32))
  .load()

We can specify splitCount (the number of Spark partitions to read the Cassandra table into) and increase input.split.size_in_mb if we want to pull more data into each Spark partition.
We can also consider a large raw denormalized dataset in Parquet format as the source data.

The idea is to apply a parallelized function on the parent data frame which uses the broadcasted smaller dataset:

resultDF = parentDF.transformFn(broadcast(filteredDF))

we can also try tuning spark.sql.autoBroadcastJoinThreshold

If we use sqlContext.read.format("org.apache.spark.sql.cassandra"), we can cache the filteredDF, provided the data size is not very big and the ser/deser cost is acceptable:

filteredDF.registerTempTable("tempdf")
sqlContext.cacheTable("tempdf")

In either case, if we are holding lots of data in memory, we can repartition the existing dataset to re-align it with the Cassandra dataset, e.g. repartitionByCassandraReplica("test", "parentTable", partitionCount).

If the big data operation (e.g. subtractByKey) still introduces big shuffles even after a broadcast hash join, we can use mapPartitions(), programmatically enforce hash partitioning if needed, and filter data from the cache within each partition.
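
A rough sketch of that pattern, where filteredDF is the small dataset mentioned above and largeRdd is a placeholder pair RDD keyed the same way:

import spark.implicits._

// Broadcast the small set of keys once, then filter each partition locally
// instead of shuffling the large RDD for a join.
val smallKeys = spark.sparkContext.broadcast(
  filteredDF.select("id").as[Long].collect().toSet)

val resultRdd = largeRdd.mapPartitions { iter =>
  val keys = smallKeys.value                        // read-only copy, fetched once per executor
  iter.filter { case (id, _) => keys.contains(id) }
}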

Joining Data

While querying such a large dataset together with smaller datasets, if we properly use DISTRIBUTE BY / CLUSTER BY / repartition depending on the use case, Spark will join multiple dataframes efficiently and even cope well with skewed data frames. Internally, Spark will use a SortMergeJoin and avoid hash-partition shuffles, as in the sketch below.
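
A minimal sketch of pre-distributing both sides on the join key; table and column names are placeholders, and repartitioning on a column is the DataFrame equivalent of DISTRIBUTE BY:

import spark.implicits._

// Repartition both sides on the join key so the subsequent sort-merge join
// does not need an extra hash-partition shuffle.
val ordersByCustomer    = ordersDF.repartition($"customer_id")
val customersByCustomer = customersDF.repartition($"customer_id")

val joined = ordersByCustomer.join(customersByCustomer, Seq("customer_id"))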

Find wrong joins by looking for an abnormal number of stages. A suspicious plan is one requiring 10 stages instead of 2-3 for a basic join between two DataFrames. Be cognizant of the specific optimizations available for left joins, Cartesian joins and one-to-many joins.

Try to leverage Spark SQL analytic queries as much as possible. Always perform local data processing or business-logic computation inside mapPartitions.

Broadcast

We can then perform the costly transformation on the larger dataframe using the broadcasted smaller dataframe.
Enabling BroadcastHashJoin (spark.sql.autoBroadcastJoinThreshold) optimizes joining a large dataset with a smaller table. Broadcasting sends a read-only variable, cached on each node, exactly once rather than shipping a copy with every task.
While querying multiple dimensions in an ETL job, broadcast the common dimensions before the job starts, i.e. send lookup tables and dictionaries to the worker nodes.

Example of joining billions of user_impressions with a few thousand websites:

impressions.join(broadcast(website).as("w"), $"w.id" === $"website_id")

Partitioning and Re-Partitioning

Note that too few partitions leads to:

Less concurrency
More susceptibility to data skew
Increased memory pressure for groupBy, reduceByKey and sortByKey

Increase the number of partitions for shuffles if needed (a single shuffle partition/block cannot exceed 2 GB).

If we point Spark to a correctly partitioned source database, Spark SQL can apply aggressive optimizations such as column pruning and filter pushdown.

So we should first fix partition hotspotting in Cassandra and not allow any partition to grow unbounded.

If a wide transformation introduces big shuffles even after a broadcast hash join, we can use mapPartitions(), programmatically enforce hash partitioning if needed, and filter data from the cache within each partition.
In any case, it is a good idea to perform contextual business logic on a set of keys inside mapPartitions(..).

SET spark.sql.shuffle.partitions=5

If we partition the tables properly, Spark can effectively execute joinWithCassandraTable, which pulls from Cassandra only the partition keys that match our RDD entries, so that it operates on partition keys alone.

This avoids pulling the complete dataset down to Spark.

More details: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable

If we are holding lots of data in memory, we can repartition the existing dataset to re-align it with the Cassandra dataset, e.g. repartitionByCassandraReplica("keyspace_name", "table_with_large_data", partitionCount).

The repartitionByCassandraReplica method can be used prior to calling joinWithCassandraTable to obtain data locality, such that each Spark partition only requires queries to its local node. This method can also be used with two Cassandra tables that are partitioned with different partition keys.

Aggressively repartition before processing the data.

For example, repartition by customerId, zoneId and siteId and then apply the processing logic in parallel on the partitions, as in the sketch below.
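
A rough sketch, where eventsDF and processRecords are placeholders for the source dataframe and the per-partition business logic:

import spark.implicits._

// All rows for a given (customerId, zoneId, siteId) combination land in the same partition,
// so the business logic can run on each partition independently and in parallel.
val partitioned = eventsDF.repartition($"customerId", $"zoneId", $"siteId")

val processed = partitioned.rdd.mapPartitions { rows =>
  processRecords(rows)   // hypothetical per-partition processing returning an iterator
}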

Caching
It is a good idea to cache an intermediate calculated result if we need to use it in subsequent transformations.
For example, if we use sqlContext.read.format("org.apache.spark.sql.cassandra"), then we can cache a smaller filtered dataframe, say filteredDF, assuming the data size is not very big and the ser/deser cost is acceptable:

filteredDF.registerTempTable("tempdf")
sqlContext.cacheTable("tempdf")
At the same time, we should be aware of the overhead of caching and of inappropriate use of caching.

Due to Spark's caching strategy (in-memory first, then swap to disk), the cached data can end up in slightly slower storage. Also, using that storage space for caching means it is not available for processing.
In the end, caching might cost more than simply re-reading the DataFrame.

Data Locality

Data locality can have a major impact on the performance of Spark jobs, especially long-running jobs. If data and the code that operates on it are together, computation tends to be fast.

Typically it is faster to ship serialized code from place to place than a chunk of data, because code size is much smaller than data size. Spark builds its scheduling around this general principle of data locality.

https://spark.apache.org/docs/latest/configuration.html#scheduling

spark.locality.wait (default 3s): How long to wait to launch a data-local task before giving up and launching it on a less-local node. The same wait is used to step through multiple locality levels (process-local, node-local, rack-local and then any). It is also possible to customize the waiting time for each level by setting spark.locality.wait.node, etc. You should increase this setting if your tasks are long and see poor locality, but the default usually works well.

spark.locality.wait.node (default: spark.locality.wait): Customize the locality wait for node locality. For example, you can set this to 0 to skip node locality and search immediately for rack locality (if your cluster has rack information).

spark.locality.wait.process (default: spark.locality.wait): Customize the locality wait for process locality. This affects tasks that attempt to access cached data in a particular executor process.

spark.locality.wait.rack (default: spark.locality.wait): Customize the locality wait for rack locality.

Intermediate Data Persistence

For some heavy shuffles, it is faster to persist the intermediate results to disk to prevent re-calculation.
This is especially true if you are re-using the same variables further down the chain.
Obviously, you will need to look at the total calculation time and average network throughput to see whether persisting is worthwhile.
EMRFS makes it effortless to do a DISK_ONLY persistence, as in the sketch below.
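
A minimal sketch of a DISK_ONLY persistence of an expensive shuffle result; largeDF is a placeholder:

import org.apache.spark.sql.functions.sum
import org.apache.spark.storage.StorageLevel

// Persist the shuffled aggregate to disk so downstream actions reuse it
// instead of recomputing the whole lineage.
val aggregated = largeDF.groupBy("key").agg(sum("value").as("total"))
aggregated.persist(StorageLevel.DISK_ONLY)

aggregated.count()        // materializes the partitions and writes them to disk
// ... several downstream jobs reuse `aggregated` ...
aggregated.unpersist()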

Additional Notes

Handling Large Amounts of Data

For exceptions such as org.apache.spark.broadcast.TorrentBroadcast timeout issues, tune the following:

spark.network.timeout
spark.kryoserializer.buffer.max
spark.rpc.askTimeout
spark.sql.broadcastTimeout (default 300; timeout in seconds for the broadcast wait time in broadcast joins)
spark.sql.autoBroadcastJoinThreshold (default 10485760, i.e. 10 MB)
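
A minimal sketch of applying these settings at session build time; the values are illustrative starting points, not recommendations:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("broadcast-timeout-tuning")
  .config("spark.network.timeout", "600s")
  .config("spark.rpc.askTimeout", "600s")
  .config("spark.kryoserializer.buffer.max", "512m")
  .config("spark.sql.broadcastTimeout", "600")                  // seconds
  .config("spark.sql.autoBroadcastJoinThreshold", "52428800")   // bytes (~50 MB)
  .getOrCreate()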

Ensure all objects passed to a closure are serializable. One common mistake is using a sparkConfig object (which may internally refer to a non-serializable config object) inside a closure.

Prefer Dataset structures rather than DataFrames to leverage Tungsten optimization and avoid slow queries.

Avoid Scala-based User-Defined Functions (UDFs) with Spark SQL as much as possible.

Check whether you have highly imbalanced datasets by reviewing the execution duration of each task.
Though a bit outdated, the following cheat sheet gives a very good idea of Apache Spark settings:
http://c2fo.io/c2fo/spark/aws/emr/2016/07/06/apache-spark-config-cheatsheet/

In order to avoid memory issues, set these configuration parameters: limit the maximum size of map output fetched from each reduce task in flight (spark.reducer.maxSizeInFlight) and enable compression for map output (spark.shuffle.compress).

In order to leverage whole-stage code generation for faster computation on a massive number of rows, set spark.sql.codegen.wholeStage = true.

Check whether you need to increase the buffer size for the serialization framework (especially Kryo).
Configure spark.dynamicAllocation.sustainedSchedulerBacklogTimeout and check whether there is a backlog of tasks pending in the driver after the configured timeout.

Watch out for shuffle problems. Check the Spark UI for:

— Tasks taking a long time
— Stages with unusually large shuffle inputs
— Speculative tasks being launched

Best Practices for Spark Streaming Jobs

If you are still using Kafka direct streams, then you have to manage the offsets yourself for data consistency.

Legacy DStream Example

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala
https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html

Offset Management
https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/

There are two approaches to getting exactly-once semantics during data production:

Use a single writer per partition and, every time you get a network error, check the last message in that partition to see whether your last write succeeded.
Include a primary key (UUID or similar) in the message and deduplicate on the consumer.

Issues with Checkpoint

If you enable Spark checkpointing, offsets will be stored in the checkpoint. This is easy to enable, but there are drawbacks. Your output operation must be idempotent, since you will get repeated outputs; transactions are not an option. Furthermore, you cannot recover from a checkpoint if your application code has changed. For planned upgrades, you can mitigate this by running the new code at the same time as the old code (since outputs need to be idempotent anyway, they should not clash). But for unplanned failures that require code changes, you will lose data unless you have another way to identify known good starting offsets.

https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md

For Exactly-once using transactional writes

For data stores that support transactions, saving offsets in the same transaction as the results can keep the two in sync, even in failure situations. If you are careful about detecting repeated or skipped offset ranges, rolling back the transaction prevents duplicated or lost messages from affecting results. This gives the equivalent of exactly-once semantics, and is straightforward to use even for aggregations.
The first important point is that the stream is started using the last successfully committed offsets as the beginning point, which allows for failure recovery:
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerPartition.scala

For the very first time the job is run, the table can be pre-loaded with appropriate starting
offsets.

https://github.com/koeninger/kafka-exactly-once
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md

https://www.youtube.com/watch?v=fXnNEq1v3VA

https://blog.cloudera.com/blog/2017/06/offset-management-for-apache-kafka-with-apache-spark-streaming/

Recovering from Failures with Checkpointing

In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. the range of offsets processed in each trigger) and the running aggregates to the checkpoint location, as in the sketch below.
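
A minimal sketch, where aggregatedDF is a placeholder streaming dataframe and the S3 path is hypothetical:

// The checkpoint location stores processed offset ranges and running aggregates,
// so a restarted query continues where it left off.
val query = aggregatedDF.writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "s3://my-bucket/checkpoints/agg-query")
  .start()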

Check Data Lag

https://github.com/ksimar/Sample-Scala-Programs-to-use-Kafka-using-its-JavaAPI/blob/master/src/main/scala/ConsumerOffsetsDemo.scala

https://stackoverflow.com/questions/39982946/recover-lost-message-in-kafka-using-offset

Achieving Zero Data Loss

http://aseigneurin.github.io/2016/05/07/spark-kafka-achieving-zero-data-loss.html

https://stackoverflow.com/questions/39982946/recover-lost-message-in-kafka-using-offset

Legacy Code

https://www.infoq.com/articles/traffic-data-monitoring-iot-kafka-and-spark-streaming

https://github.com/databricks/reference-apps/blob/master/timeseries/scala/timeseries-weather/src/
main/scala/com/databricks/apps/WeatherApp.scala

https://databricks.gitbooks.io/databricks-spark-reference-applications/twitter_classifier/examine.html

Structured Streaming API

Simple Example

log.warn("Reading from Kafka”)

spark. readStream.format(“kafka")
.option("kafka.bootstrap.servers", “localhost:9092”)
.option(“subscribePattern","topic.*")
.option(“startingOffsets","earliest")
.option(“endingOffsets","latest")
.option("enable.auto.commit", false)
// Cannot be set to true in Spark Strucutured Streaming
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#kafka-specific-
configurations

.option("group.id", “Structured-Streaming-Examples")
.option("failOnDataLoss", false)
// when starting a fresh kafka (default location is temporary (/tmp) and cassandra is not (var/lib)), we
have saved different offsets in Cassandra than real offsets in kafka (that contains nothing)

.option(startingOption, partitionsAndOffsets)
//this only applies when a new query is started and that resuming will always pick up from where the
query left off

.load().withColumn(KafkaService.radioStructureName, from_json($"value".cast(StringType),
KafkaService.schemaOutput)
// nested structure with our json , // From binary to JSON object
).as[SimpleSongAggregationKafka]
.filter(_.radioCount !=null) //TODO find a better way to filter bad json

Spark Reference Code

https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html

https://github.com/polomarcus/Spark-Structured-Streaming-Examples/blob/master/src/main/scala/kafka/KafkaSource.scala

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries

Note

The number of tasks required by the query depends on how many partitions the query can read from the sources in parallel. Therefore, before starting a continuous processing query, you must ensure there are enough cores in the cluster to run all the tasks in parallel. For example, if you are reading from a Kafka topic that has 10 partitions, the cluster must have at least 10 cores for the query to make progress.

Stopping a continuous processing stream may produce spurious task-termination warnings. These can be safely ignored.

Beware of Using Foreach

The foreach operation allows arbitrary operations to be computed on the output data. As of Spark 2.1, this is available only for Scala and Java. To use it, you have to implement the ForeachWriter interface (Scala/Java docs), whose methods get called whenever there is a sequence of rows generated as output after a trigger. Note the following important points.

The writer must be serializable, as it will be serialized and sent to the executors for execution.

All three methods (open, process and close) will be called on the executors.

The writer must do all initialization (e.g. opening connections, starting a transaction, etc.) only when the open method is called. Be aware that if there is any initialization in the class body that runs as soon as the object is created, then that initialization happens in the driver (because that is where the instance is created), which may not be what you intend.

version and partition are two parameters in open that uniquely represent a set of rows that needs to be
pushed out. version is a monotonically increasing id that increases with every trigger. partition is an
id that represents a partition of the output, since the output is distributed and will be processed on
multiple executors.

open can use the version and partition to choose whether it needs to write the sequence of rows.
Accordingly, it can return true (proceed with writing), or false (no need to write). If false is returned,
then process will not be called on any row. For example, after a partial failure, some of the output
partitions of the failed trigger may have already been committed to a database. Based on metadata
stored in the database, the writer can identify partitions that have already been committed and
accordingly return false to skip committing them again.

Whenever open is called, close will also be called (unless the JVM exits due to some error). This is true even if open returns false. If there is any error in processing and writing the data, close will be called with that error. It is your responsibility to clean up state (e.g. connections, transactions) created in open so that there are no resource leaks. A minimal sketch of such a writer follows.
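
A minimal sketch of such a writer backed by plain JDBC; the connection URL and target table are hypothetical, and counts stands for the query's output dataset:

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{ForeachWriter, Row}

// The connection is created in open() so it is instantiated on the executor, not the driver.
class JdbcRowSink extends ForeachWriter[Row] {
  @transient private var conn: Connection = _
  @transient private var stmt: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    // Could consult sink-side metadata here and return false if this
    // (version, partitionId) pair has already been committed.
    conn = DriverManager.getConnection("jdbc:postgresql://localhost/metrics") // hypothetical URL
    stmt = conn.prepareStatement("INSERT INTO word_counts(word, cnt) VALUES (?, ?)")
    true
  }

  override def process(row: Row): Unit = {
    stmt.setString(1, row.getString(0))
    stmt.setLong(2, row.getLong(1))
    stmt.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    // Always called, even when open() returned false or an error occurred.
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}

// counts.writeStream.foreach(new JdbcRowSink).outputMode("update").start()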

Sliding Window

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time

Handling Late Data

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking
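
A short sketch combining the two sections above (sliding windows and watermarking); events is a placeholder streaming dataframe with an eventTime timestamp column and a word column:

import org.apache.spark.sql.functions.window
import spark.implicits._

// 10-minute windows sliding every 5 minutes, keyed by word; events arriving
// more than 20 minutes late relative to the watermark are dropped.
val windowedCounts = events
  .withWatermark("eventTime", "20 minutes")
  .groupBy(window($"eventTime", "10 minutes", "5 minutes"), $"word")
  .count()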

Streaming DeDuplication

Powerful Range Join

https://www.slideshare.net/databricks/deep-dive-into-stateful-stream-processing-in-structured-streaming-by-tathagata-das

Powerful MapGroupsWithState

Stream-stream Joins
Any row received from one input stream can match any future, yet-to-be-received row from the other input stream. Hence, for both input streams, we buffer past input as streaming state so that we can match every future input against past input and generate joined results accordingly. Furthermore, as with streaming aggregations, late and out-of-order data is handled automatically and the state can be limited using watermarks. A sketch based on the programming guide follows.
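
A sketch following the pattern in the programming guide, with watermarks on both sides and a time-range join condition; the impressions and clicks streams and their columns are placeholders:

import org.apache.spark.sql.functions.expr

val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark      = clicks.withWatermark("clickTime", "3 hours")

// The time-range condition bounds how long each side's state must be buffered.
val joined = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))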

Reporting Metrics

Continuous Processing

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries

The trigger settings of a streaming query define the timing of streaming data processing: whether the query is executed as a micro-batch query with a fixed batch interval or as a continuous processing query. The supported trigger types are described below.

Trigger types and their behavior:

Unspecified (default): If no trigger setting is explicitly specified, the query is executed in micro-batch mode, where micro-batches are generated as soon as the previous micro-batch has completed processing.

Fixed-interval micro-batches: The query is executed in micro-batch mode, where micro-batches are kicked off at the user-specified intervals. If the previous micro-batch completes within the interval, the engine waits until the interval is over before kicking off the next micro-batch. If the previous micro-batch takes longer than the interval to complete (i.e. an interval boundary is missed), the next micro-batch starts as soon as the previous one completes (it does not wait for the next interval boundary). If no new data is available, no micro-batch is kicked off.

One-time micro-batch: The query executes *only one* micro-batch to process all the available data and then stops on its own. This is useful in scenarios where you want to periodically spin up a cluster, process everything that has arrived since the last period, and then shut the cluster down. In some cases this can lead to significant cost savings.

Continuous with fixed checkpoint interval: The query is executed in the new low-latency continuous processing mode. Read more about this in the Continuous Processing section of the programming guide.
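
The corresponding trigger calls, one per mode described above; streamDF is a placeholder streaming dataframe and the console sink is used only for illustration:

import org.apache.spark.sql.streaming.Trigger

streamDF.writeStream.format("console")
  .start()                                           // unspecified: default micro-batch mode

streamDF.writeStream.format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))     // fixed-interval micro-batches
  .start()

streamDF.writeStream.format("console")
  .trigger(Trigger.Once())                           // one-time micro-batch
  .start()

streamDF.writeStream.format("console")
  .trigger(Trigger.Continuous("1 second"))           // continuous, 1-second checkpoint interval
  .start()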

Comparison With Kafka Streams

https://www.slideshare.net/gschmutz/spark-structured-streaming-vs-kafka-streams-two-stream-processing-platforms-compared-95628104

https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/KStreamAggregationIntegrationTest.java

Tune the Big Data Infrastructure (Elastic Map Reduce Computes)


Getting the Right Partition Size and Instance Type.

The general principles to be followed when tuning partition for Spark application are as follows:
Too few partitions – Cannot utilize all cores available in the cluster.
Too many partitions – Excessive overhead in managing many small tasks.
Reasonable partitions – Helps us to utilize the cores available in the cluster and avoids
excessive overhead in managing small tasks.

For example, with about 2000 partitions and 250 GB of data, each partition or task works out to only about 125 MB, which is close to the 128 MB recommended in the official docs.

At that partition size, it is more efficient to run c3.8xlarge instances with a lower memory to core
ratio.

# c3.8xlarge: 32 vCPUs, 60 GB memory

CORES_PER_EXECUTOR=4
MEMORY_PER_EXECUTOR=6.5G

# Available shuffle memory per task, more than enough for a ~125 MB partition:
# 6.5 / 4 * 0.2 * 0.8 = 0.26 GB

Another option is to use i2.2xlarge memory instances to eliminate any possibility of a memory
constraint issue.

Let's consider another example where the input dataset size is ~1.5 GB (1500 MB). Going with 128 MB per partition, the number of partitions will be:

Total input dataset size / partition size => 1500 / 128 = 11.71 ≈ 12 partitions.

Now, let us run a test by reducing the partition size and increasing the number of partitions. Consider a partition size of 64 MB.
Number of partitions = total input dataset size / partition size => 1500 / 64 = 23.43 ≈ 23 partitions.

Sample test script:

./bin/spark-submit --name AnalysisDataFramePartitionTest --master yarn --deploy-mode cluster \
  --executor-memory 2g --executor-cores 2 --num-executors 2 \
  --conf spark.sql.shuffle.partitions=23 --conf spark.default.parallelism=23 \
  --class com.test.AnalysisDataFramePartitionTest /data/AnalysisDataFramePartitionTest.jar /user/inputData.csv

Testing a few such combinations of spark.default.parallelism and spark.sql.shuffle.partitions will help us figure out the optimal configuration that provides the best performance (lowest execution time).

Other things to consider for partitions


Avoid Key Names in Lexicographic Order
Use hashing / random prefixes or reverse date-times.

Compression

If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers
be four bytes instead of eight.
You can add these options in spark-env.sh.
Compress data sets to minimize bandwidth from S3 to EC2.
Ensure a splittable compression format is used, or have each file be the optimal size for parallelization on the cluster.

Saving Files in proper formats

Prefer columnar file formats like Parquet for better read performance. While saving a DataFrame to the file system, set the number of records per file: df.write.option("maxRecordsPerFile", 10000).save(...)

Ref: http://www.gatorsmile.io/anticipated-feature-in-spark-2-2-max-records-written-per-file/

Right Cluster Types


Persistent clusters remain alive all the time, even when a job has completed. These clusters are
preferred when continuous jobs have to run on the data.

Since clusters take a few minutes to boot up and spin up, using persistent clusters saves a significant amount of time that would otherwise be lost during the initialization process.

Generally, persistent clusters are preferred in testing environments. Errors in such an environment
running transient clusters would close the job and might shut down the cluster, but with persistent
clusters jobs can continue to be submitted and modified.

Right number of Executors

The ideal number of executors depends on various factors:

— Incoming events per second, especially during peaks

— Buffering capabilities of the streaming source

— Maximum allowed lag, i.e. is it tolerable if the Streaming application lags behind by 3 minutes
during a very high peak
— It can be tweaked by running the streaming application in a preproduction environment and
monitoring the streaming statistics in the Spark UI.
— As a general guideline:
Processing Time + Reserved Capacity <= Batch Duration

The reserved capacity depends on the aforementioned factors. The tradeoff lies between idling cluster
resources versus maximum allowed lag during peaks.

Shuffle Memory Usage, Executor Memory-to-CPU ratio

"Stragglers" are tasks within a stage that take much longer to execute than other tasks.

In order to avoid stragglers we need to remember:

Shuffle less often – To minimize the number of shuffles in a computation requiring several transformations, preserve partitioning across narrow transformations to avoid reshuffling data.
Shuffle better – Sometimes the computation cannot be completed without a shuffle, but not all wide transformations and shuffles are equally expensive or prone to failure.

Each core in an executor runs a single task at any one time. Hence, with 26 GB per executor and 4 cores per executor, the heap available to each task is roughly 26G/4, or about 6.5G.

However, not all the memory allocated to the executor is used for shuffle operations.

The memory available for shuffle can be calculated as such:


// Per task
24/4 * 0.2 * 0.8=0.96GB

// 0.2 -> spark.shuffle.memoryFraction


// 0.8 -> spark.shuffle.safetyFraction
If your task is already spilling to disk, try using this formula to find out how much space it actually needs. This might help you better fine-tune the RAM-to-CPU ratio for your executor tasks:

    shuffle_write * shuffle_spill_mem * executor_cores (4)
    -----------------------------------------------------------------------------------------
    shuffle_spill_disk * executor_mem (24) * shuffle_mem_fraction (0.2) * shuffle_safety_fraction (0.8)
Example

Let's assume multiple smaller executors are launched inside each node, say spark.executor.cores = 2, with a maximum of 4 executors in each r4.2xlarge instance and 3 executors in the instance where the Spark driver runs.

To make sure YARN launches as many executor instances as possible, we can set spark.executor.instances = 7 * 4 + 3 = 31. A sketch of the corresponding configuration follows.
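
A sketch of the corresponding settings; the memory figures are illustrative and assume roughly 13-14 GB of YARN container memory per executor on an r4.2xlarge:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.instances", "31")             // 7 * 4 + 3, as computed above
  .set("spark.executor.cores", "2")
  .set("spark.executor.memory", "12g")               // leaves headroom for the YARN overhead
  .set("spark.yarn.executor.memoryOverhead", "1536") // MB, roughly 10% of the heap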

Be aware that sometimes having many tiny executors also has drawbacks: it ends up running more JVMs (more overhead) and keeping more copies of the data if broadcast variables are used.

Yarn resource allocation

Spark asks Yarn for the memory and cores for executing a job, so we need to ensure that Yarn itself
has sufficient resources to provide to Spark.

In Yarn, memory in a single executor container is divided into Spark executor memory plus overhead
memory (spark.yarn.executor.memoryOverhead).

This memory is the off-heap memory which is used for VM overheads and other native overheads.

If we need data locality, don't use task nodes: because task nodes do not have local HDFS storage, they are effectively useless for this purpose.

Any gains from leveraging spot prices are likely wasted by the time lost due to poor data locality. Only use instance-backed core nodes.

Auto scaling for EMR

— Understand Amazon EMR Scale-Down Behavior

— Configuring YARN Capacity Scheduler Queues in AWS EMR

— Submitting a Spark Job to Different Queues help scale the Job independent of other jobs

Ref: https://mitylytics.com/2017/11/configuring-multiple-queues-aws-emr-yarn/

https://aws.amazon.com/blogs/big-data/best-practices-for-resizing-and-automatic-scaling-in-amazon-emr/

Key useful metrics are YARNMemoryAvailablePercentage and ContainerPendingRatio.

The auto-scaling policy should also specify the MaxCapacity and MinCapacity of instances.
EMR auto-scaling and Spark dynamic allocation may not mix well for certain use cases, e.g. transient instances with pre-defined loads.

Scale out rules:

Core – we auto scale on HDFS utilization. If HDFSUtilization >= 80 for 5 minutes add nodes.
Task – We have 3 rules related to yarn, apps pending and containers pending.
If YARNMemoryAvailablePercentage <= 15 add nodes
If AppsPending-Out >= 2 add nodes.
If ContainerPending-Out >= 75 add nodes.

Auto-Scaling and Capacity Scheduler Gotchas

By default, EMR sets user-limit-factor in capacity-scheduler.xml to 1. Because of this, jobs cannot run concurrently.

If you change this, even with a single queue, you can run concurrent jobs. Check this value in your EMR configuration.

It turns out that a Spark task stores its shuffle output on the local disks of the node, and that output is served by the external shuffle service running inside the node manager on every node.

The idea is that when an executor has completed, you can still get its output files from the node manager's external shuffle service. But with auto-scaling these nodes get decommissioned, and that output is no longer available.

So one then needs to switch to a model of manually scaling up and down as needed. Not as convenient, but it still gets the job done. You will also need to set up some cron jobs on the master node to scale the cluster down (during a quiet period) to avoid cost run-ups.

Special considerations for long-running jobs

A dedicated queue plays a very crucial role for long-running streaming jobs.

Because the Spark driver and the Application Master share a single JVM, any error in the Spark driver stops a long-running job. Fortunately it is possible to configure the maximum number of attempts that will be made to re-run the application. It is reasonable to set a higher value than the default of 2 (derived from the YARN cluster property yarn.resourcemanager.am.max-attempts).

Generally 4 works quite well; a higher value may cause unnecessary restarts even if the reason for the failure is permanent.

spark-submit --master yarn --deploy-mode cluster --conf spark.yarn.maxAppAttempts=4

Check whether the 4 attempts get exhausted within a few hours for a long-running job.


To avoid this situation, the attempt counter should be reset once every hour or so:

--conf spark.yarn.am.attemptFailuresValidityInterval=1h
Set maximum number of executor failures before the application fails.

By default it is max(2 * num executors, 3), well suited for batch jobs but not for long-running jobs.

So specify the following configuration parameters


--conf spark.yarn.executor.failuresValidityInterval=1h
--conf spark.task.maxFailures=8

Note that without a separate YARN queue, your long-running job will sooner or later be preempted by a massive Hive query.

Important points regarding scaling spark-streaming jobs in EMR

Enabling the spark.dynamicAllocation property allows Spark to add and remove executors dynamically based on the workload.
When using Spark Streaming, ensure that the executor idle timeout is greater than the batch interval, so that only genuinely unused executors are removed from the cluster. A minimal sketch follows.
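
A minimal sketch, assuming a 60-second batch interval; the min/max executor counts are illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")               // required by dynamic allocation
  .config("spark.dynamicAllocation.executorIdleTimeout", "120s") // longer than the 60s batch interval
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .getOrCreate()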

If the entire job is taking longer than expected, increase parallelism by increasing the number of cores per executor. However, more than 5 cores per executor can lead to poor performance due to increased HDFS I/O.

Spark configuration reference: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html


For example, for a 6-node r3.4xlarge cluster (5 executors per node):

spark.executor.instances: "30"
spark.yarn.executor.memoryOverhead: "3072"
spark.executor.memory: "21G"
spark.yarn.driver.memoryOverhead: "1034"
spark.driver.memory: "6G"
spark.executor.cores: "3"
spark.driver.cores: "1"
spark.default.parallelism: "180"
spark.dynamicAllocation.enabled: "false"

For the above YARN configuration, users can cut the workload time by 50% simply by switching off spark.dynamicAllocation.enabled, which basically allows the reuse of Spark executors across a multi-step workload.

Amazon S3 Troubleshooting

Retry internal errors. Internal errors are errors that occur within the Amazon S3 environment. A request that receives an InternalError response may not have been processed. For example, if a PUT request returns an InternalError, a subsequent GET operation may retrieve the old value or the updated value. If Amazon S3 returns an InternalError response, resubmit the request.

Adjust the application for repeated SlowDown errors

Like other distributed systems, S3's protection mechanism detects inadvertent or unintentional
resource over-consumption and reacts accordingly.

A SlowDown error will occur after one of the protection mechanisms is triggered by a higher request
rate. Decreasing your request rate will reduce or eliminate this type of error.

In general, most users won't encounter these errors frequently; however, if you want to learn more, or are seeing serious or unexpected SlowDown errors, post them to the Amazon S3 developer forum: https://forums.aws.csdn.net/

Proper Usage of S3

https://medium.com/@subhojit20_27731/apache-spark-and-amazon-s3-gotchas-and-best-practices-a767242f3d98

Remember that S3 is an object store and not a file system, hence the issues arising from eventual consistency and non-atomic rename operations: every time the executors write the result of a job, each of them writes to a temporary directory outside the main directory (on S3) where the files have to be written, and once all executors are done, a rename is performed to get atomic exclusivity.

This is all fine in a standard filesystem like HDFS, where renames are instantaneous, but on an object store like S3 it is not conducive to performance, as renames on S3 are done at roughly 6 MB/s.

To overcome the above problem, set the following two configuration parameters:

1) spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version = 2

With the default value of this parameter (1), commitTask moves data generated by a task from the task temporary directory to the job temporary directory, and when all tasks complete, commitJob moves the data from the job temporary directory to the final destination.

Because the driver does the work of commitJob, for S3 this operation can take a long time, and a user may often think that his or her cell is "hanging". However, when the value of mapreduce.fileoutputcommitter.algorithm.version is 2, commitTask moves data generated by a task directly to the final destination and commitJob is basically a no-op.

2) spark.speculation = false

If this parameter is set to true, then when one or more tasks are running slowly in a stage they will be re-launched. As mentioned above, the write operation on S3 through a Spark job is very slow, so we can see a lot of tasks getting re-launched as the output data size increases.

This, along with eventual consistency (while moving files from the temporary directory to the main data directory), may cause FileOutputCommitter to go into deadlock and hence the job could fail. Both settings are sketched below.
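
A minimal sketch of both settings applied at session build time; the spark.hadoop. prefix passes the committer property down to the underlying Hadoop configuration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-output-tuning")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .config("spark.speculation", "false")
  .getOrCreate()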
Alternatively, one can write the output first to the local HDFS on EMR and then move the data to S3
using the hadoop distcp command.

This improves the overall output speed drastically. However, you will need enough EBS storage on
your EMR nodes to ensure all your output data fits in.

Further, you can write the output data in Parquet format which will compress the output size
considerably.

https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/#5

EMR File System (EMRFS) is best suited for transient clusters as the data resides in S3 irrespective
of the lifetime of the cluster.

Ref: https://docs.aws.amazon.com/AmazonS3/latest/dev/troubleshooting.html

EMR Troubleshooting Reference


http://docs.amazonaws.cn/en_us/emr/latest/ManagementGuide/emr-mgmt.pdf

Troubleshoot a Failed Cluster (p. 258)
  Step 1: Gather Data About the Issue (p. 258)
  Step 2: Check the Environment (p. 259)
  Step 3: Look at the Last State Change (p. 260)
  Step 4: Examine the Log Files (p. 260)
  Step 5: Test the Cluster Step by Step (p. 261)

Troubleshoot a Slow Cluster (p. 261)
  Step 1: Gather Data About the Issue (p. 262)
  Step 2: Check the Environment (p. 262)
  Step 3: Examine the Log Files (p. 263)
  Step 4: Check Cluster and Instance Health (p. 264)
  Step 5: Check for Arrested Groups (p. 265)
  Step 6: Review Configuration Settings (p. 266)
  Step 7: Examine Input Data (p. 267)

Common Errors in Amazon EMR
  Input and Output Errors (p. 268)
  Permissions Errors (p. 270)
  Resource Errors (p. 270)
  Streaming Cluster Errors (p. 275)
  Custom JAR Cluster Errors (p. 276)
  Hive Cluster Errors (p. 277)
  VPC Errors (p. 278)
  AWS GovCloud (US) Errors (p. 280)
  Other Issues (p. 281)

EMR Logs

A cluster generates several types of log files, including:

Step logs — These logs are generated by the Amazon EMR service and contain information about the
cluster and the results of each step.

The log files are stored in the /mnt/var/log/hadoop/steps/ directory on the master node.

Each step logs its results in a separate numbered subdirectory: /mnt/var/log/hadoop/steps/s-stepId1/ for the first step, /mnt/var/log/hadoop/steps/s-stepId2/ for the second step, and so on.

The 13-character step identifiers (e.g. stepId1, stepId2) are unique to a cluster.

Hadoop and YARN component logs — The logs for components associated with both Apache YARN and MapReduce, for example, are contained in separate folders in /mnt/var/log.

The log file locations for the Hadoop components under /mnt/var/log are: hadoop-hdfs, hadoop-mapreduce, hadoop-httpfs and hadoop-yarn.

The hadoop-state-pusher directory is for the output of the Hadoop state pusher process.

Bootstrap action logs — If your job uses bootstrap actions, the results of those actions are logged. The log files are stored in /mnt/var/log/bootstrap-actions/ on the master node.

Each bootstrap action logs its results in a separate numbered subdirectory: /mnt/var/log/bootstrap-actions/1/ for the first bootstrap action, /mnt/var/log/bootstrap-actions/2/ for the second bootstrap action, and so on.

Instance state logs — These logs provide information about the CPU, memory state, and garbage collector threads of the node. The log files are stored in /mnt/var/log/instance-state/ on the master node.

Spark on Yarn Reference

Reference

(slides 59-88) https://www.slideshare.net/AmazonWebServices/amazon-emr-deep-dive-best-practices-67651043
https://dzone.com/articles/apache-spark-performance-tuning-degree-of-parallel
https://dzone.com/articles/apache-spark-on-yarn-resource-planning
https://github.com/treselle-systems/sfo_fire_service_call_analysis_using_spark
