
High Level Spark Optimization Methods

To understand the optimization methods in Spark, we must first understand how Spark operates.
Spark is a plug-and-play, in-memory parallel processing (compute) engine that can run on a cluster
or on a local machine, needing only a few supporting libraries to be present on the system.

How Spark Works

Spark is an in-memory processing engine, i.e. it processes data in memory (RAM) and minimizes
disk reads and writes, which were a major performance constraint in the MapReduce engine.

Components in Spark:

1. Driver – A program that can be thought of as the manager of a team, responsible for
carrying out the entire process successfully, i.e. it executes the application’s main
method, accepts and responds to user requests, and distributes and schedules work among
the executors (discussed next). It resides on one of the cluster nodes/machines.
2. Executor – A program that can be thought of as a developer on the team; it actually
executes the tasks assigned by the driver and responds back to the driver with the
results. Executors reside on the cluster nodes, which are called worker nodes.

Spark works on the principle of lazy execution: when we write a Spark application, we give Spark data
processing instructions in the form of transformations and actions. Internally, Spark builds a
lineage called a DAG (Directed Acyclic Graph) that describes the execution steps, and it executes
the DAG only when it encounters an action.

Spark – Code / DAG: [the original post shows a short code snippet here, along with the DAG Spark builds for it, as rendered in the Spark UI]
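As a minimal sketch of such code (assuming a local PySpark session and made-up sample data), the transformation lines below only add steps to the DAG; nothing executes until the action at the end:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dag-demo").getOrCreate()

    # Made-up sample data, just for illustration
    orders = spark.createDataFrame(
        [("c1", "COMPLETE"), ("c2", "PENDING"), ("c1", "COMPLETE")],
        ["customer_id", "status"],
    )

    # Transformations: Spark only records them in the DAG, nothing runs yet
    completed = orders.filter(F.col("status") == "COMPLETE")   # narrow
    per_customer = completed.groupBy("customer_id").count()    # wide (needs a shuffle)

    # Action: triggers execution of the whole DAG; the shuffle splits it into stages
    per_customer.show()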

As we can see in the DAG above, there is something called Stage 1: Spark divides the complete
processing into its smallest execution units, i.e. tasks, and these tasks are grouped into different
stages based on the transformations used:

1. Narrow Transformations – Where a data shuffle (data movement between different executors)
is not required, i.e. the work can be carried out on a single partition independently, e.g.
map, flatMap, filter, etc.
2. Wide Transformations – Where a data shuffle is required, i.e. data from different partitions
is needed to carry out the transformation, e.g. reduceByKey, groupByKey, repartition, join,
aggregate, etc. (a short sketch follows this list).
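As a small illustrative sketch at the RDD level (reusing the spark session assumed above), narrow transformations run partition by partition, while reduceByKey forces a shuffle and therefore a new stage:

    sc = spark.sparkContext

    rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)

    # Narrow: each partition is processed independently, no data movement
    doubled = rdd.mapValues(lambda v: v * 2)

    # Wide: all values of the same key must come together, so Spark shuffles
    totals = doubled.reduceByKey(lambda x, y: x + y)

    print(totals.collect())   # action; the shuffle splits the job into two stages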

So far I have given a high-level introduction to how Spark works. Now we will discuss the
optimization methods in Spark:

Optimization in Spark can be done at three levels:

1. Configuration Level Optimization
2. Data Level Optimization
3. Code Level Optimization

Configuration Level Optimization –

This is the easiest way to optimize, but it does not work every time and may also incur heavy costs.
While doing configuration level optimizations we mostly focus on scaling up resources.

1. Increasing the number of executors – when there are not enough executors to support parallel
execution of tasks at its best, i.e. tasks are getting queued up and running sequentially.

2. Increasing executor memory – when there is not enough memory in the executor to process the
data and the job is failing with OOM (out of memory) errors (example settings follow this list).
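As an illustrative sketch (the values below are placeholders, not recommendations), these two knobs correspond to the spark.executor.instances and spark.executor.memory settings, usually passed via spark-submit or set on the session:

    from pyspark.sql import SparkSession

    # Placeholder values; the right numbers depend on the cluster and the workload
    spark = (
        SparkSession.builder
        .appName("config-level-tuning")
        .config("spark.executor.instances", "10")  # more executors: more tasks in parallel
        .config("spark.executor.cores", "4")       # task slots per executor
        .config("spark.executor.memory", "8g")     # more memory per executor: fewer OOM failures
        .getOrCreate()
    )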

There are a few ways to implement the above techniques. One is to create a single executor with a
very high configuration, called a fat executor. This is not helpful most of the time, since it
reduces throughput and the garbage collector takes a long time to clean up such a huge memory.

The second way is to create many executors with very few resources per executor (thin executors).
This is also not a good practice, as it restricts the parallelism within each executor since it has
very few cores.

Therefore, the best way is to opt for a balanced approach that satisfies our requirement.
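For instance, going by a commonly cited sizing heuristic (an illustration, not a rule from this post): on worker nodes with 16 cores and 64 GB of RAM each, reserve about 1 core and 1 GB for the OS and daemons, then run 3 executors per node with 5 cores and roughly 63 / 3 ≈ 21 GB each (a little less once Spark's memory overhead is subtracted), rather than 1 fat executor or 15 single-core thin executors.
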
Data Level –

Data level optimization is about tweaking the data itself, i.e. filtering and cleaning the data
before doing heavy transformations, and bucketing and partitioning the data to make scans
efficient.
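A minimal sketch of these ideas (the paths and column names here are hypothetical, and spark is the session from the earlier examples): filter and deduplicate early, then write the data partitioned by a column that later queries filter on, so scans read only the partitions they need:

    from pyspark.sql import functions as F

    # Filter and clean before the heavy work, so later shuffles touch less data
    events = spark.read.parquet("/data/events")
    clean = (
        events
        .filter(F.col("event_type").isNotNull())
        .dropDuplicates(["event_id"])
    )

    # Partition the output by a column that downstream queries filter on
    clean.write.mode("overwrite").partitionBy("event_date").parquet("/data/events_clean")

    # A later job that filters on event_date scans only the matching partitions
    spark.read.parquet("/data/events_clean").filter(F.col("event_date") == "2024-01-01").count()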

Code Level –

In code level optimization we write our code/query in such a way that Spark can internally choose
the optimized way to perform the task.
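For example (a hedged sketch, with made-up data): sticking to built-in DataFrame functions instead of Python UDFs keeps the logic visible to Spark's Catalyst optimizer, and the plan Spark actually chose can be checked with explain():

    from pyspark.sql import functions as F

    df = spark.createDataFrame([("  Alice ",), ("BOB",)], ["name"])

    # Built-in functions can be optimized by Catalyst;
    # an equivalent Python UDF would be a black box to the optimizer
    cleaned = df.withColumn("name", F.lower(F.trim(F.col("name"))))

    cleaned.explain()   # prints the physical plan Spark chose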

Also, if we observe that the data is skewed and only a few executors are doing the work while the
rest sit idle, we can apply a technique called salting to increase parallelism: we increase the
cardinality of the key to create more partitions so that they can be processed individually, in
parallel.
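A minimal salting sketch (skewed_df, its key column, and NUM_SALTS are all hypothetical here), using a two-phase aggregation: add a random salt to the key, aggregate on the salted key to spread the hot key across partitions, then aggregate again on the original key:

    from pyspark.sql import functions as F

    NUM_SALTS = 10   # assumption: choose based on how skewed the hot keys are

    # `skewed_df` is a hypothetical DataFrame whose `key` column is heavily skewed
    salted = skewed_df.withColumn(
        "salted_key",
        F.concat(F.col("key"), F.lit("_"),
                 (F.rand() * NUM_SALTS).cast("int").cast("string")),
    )

    # Phase 1: aggregate on the salted key, spreading the hot key over many partitions
    partial = salted.groupBy("key", "salted_key").count()

    # Phase 2: aggregate again on the original key to get the final counts
    result = partial.groupBy("key").agg(F.sum("count").alias("count"))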

If, while joining data, we see that one of the tables is small enough to fit in executor memory,
we can opt for a broadcast join, which reduces the data shuffle and hence increases application
performance.
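A minimal sketch (large_df and small_df are hypothetical DataFrames sharing a key column) using the explicit broadcast hint, so Spark ships the small table to every executor instead of shuffling both sides:

    from pyspark.sql.functions import broadcast

    joined = large_df.join(broadcast(small_df), on="key", how="inner")

    joined.explain()   # the plan should show a BroadcastHashJoin instead of a SortMergeJoin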
