
Module 5:

Apache Spark Architecture


Apache Spark Architectural Concepts, Key Terms, and Keywords
Spark Cluster
Spark Master
Spark Worker
Spark Executor
Spark Driver
SparkSession and SparkContext
Spark Deployment Modes Cheat Sheet
Spark Apps, Jobs, Stages and Tasks
Working of Spark Architecture

Apache Spark Architectural Concepts, Key Terms, and Keywords


Now that we have a fair understanding of Spark and its main features, let us dive deeper into
the architecture of Spark and understand the anatomy of a Spark application. We know that Spark
is a distributed, cluster-computing framework and that it works in a master-slave fashion.
Whenever we need to execute a Spark program we perform an operation called
"spark-submit". We will go over the details of what this means in later sections, but put simply,
spark-submit is like calling the main program as we do in Java. On performing a
"spark-submit" on a cluster, a master and one or more slaves are launched to accomplish the
task written in the Spark program. There are different modes of launching a Spark program, such as
standalone, client, and cluster mode. We will see these options in detail later.

Spark Cluster
To visualize the architecture of a Spark cluster, let us look at the diagram below and understand
each component and its functions.

Whenever we want to run an application, we need to perform a spark-submit with some
parameters. Say we submit an Application A: this leads to the creation of one Driver process for
A, which usually runs on the Master, and one or more Executors on the Worker nodes. This entire set
of a Driver and Executors is exclusive to Application A. Now say we want to run another
application B and perform a spark-submit: another set of one Driver and a few Executors is
started, totally independent of the Driver and Executors for Application A. Even though
both Drivers might run on the same machine in the cluster, they are mutually exclusive, and the
same applies to the Executors. So, a Spark cluster consists of a Master node and Worker nodes
which can be shared across multiple applications, but each application runs mutually exclusive
of the others.

When we launch a Spark application using a resource manager such as YARN, there are two ways
to do it: cluster mode and client mode. In cluster mode, YARN creates and manages an
Application Master in which the Driver runs, and the client can go away once the application has
started. In client mode, the Driver keeps running on the client and the Application Master only
requests resources from YARN.

To launch a Spark application in cluster mode:


$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]

To launch in client mode:


$ ./bin/spark-shell --master yarn --deploy-mode client

Spark Master
When we run Spark in standalone mode, a Master first needs to be started, which can
be done by executing:
./sbin/start-master.sh
This creates a Master on the machine where the command is executed. Once the Master
starts, it prints a Spark URL of the form spark://HOST:PORT, which can be used to start the
Worker nodes.
Spark Worker
Several worker nodes can be started on different machines on the cluster using the command:
./sbin/start-slave.sh <master-spark-URL>
The master’s web UI can be accessed at http://localhost:8080 or http://server:8080
We will see these scripts in detail in the Spark Installation section.
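
For reference, an application connects to this standalone cluster by passing the same spark://HOST:PORT value to setMaster. A minimal sketch, assuming a placeholder host name and the default master port 7077:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder URL: substitute the spark://HOST:PORT printed by start-master.sh
val conf = new SparkConf()
  .setAppName("StandaloneConnectExample")
  .setMaster("spark://my-master-host:7077")
val sc = new SparkContext(conf)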
Spark Driver: The Driver is the process which runs the main() function of the application and also
creates the SparkContext. The Driver is a separate JVM process and it is the responsibility of the
Driver to analyze, distribute, schedule and monitor work across the worker nodes. Each
application launched or submitted on a cluster will have its own separate Driver running, and
even if there are multiple applications running simultaneously on a cluster, these Drivers will not
talk to each other in any way. The Driver program also plays host to a number of processes which
are part of the application, like:
- SparkEnv
- DAGScheduler
- TaskScheduler
- SparkUI
The Spark application which we want to run is instantiated within the Spark Driver.
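
To make "the process that runs main()" concrete, below is a minimal sketch of a self-contained driver program; the object name and the small computation are illustrative, not taken from this module. Everything defined in main() runs in the Driver JVM, while the actual work is scheduled onto the executors.

import org.apache.spark.{SparkConf, SparkContext}

object FirstSparkApp {
  def main(args: Array[String]): Unit = {
    // The master URL is normally supplied via spark-submit (--master),
    // so only the application name is set here.
    val conf = new SparkConf().setAppName("FirstSparkApp")
    val sc = new SparkContext(conf)

    // Work declared here is analyzed by the Driver (DAGScheduler / TaskScheduler)
    // and executed as tasks on the worker nodes.
    val evenCount = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
    println(s"Even numbers: $evenCount")

    sc.stop()
  }
}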

Spark Executor: The Driver program launches the tasks which run on individual worker nodes.
These tasks operate on the subsets of the RDDs that are present on that node. The processes
running on the worker nodes that execute these tasks are called executors. The actual program written in your
application gets executed by these executors. After starting, the Driver program interacts
with the cluster manager (YARN, Mesos, or the default standalone manager) to obtain resources on the Worker nodes
and then assigns the tasks to the executors. Tasks are the basic units of execution.
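
A hypothetical fragment to illustrate this split (the input path is a placeholder, not from this module): the function passed to map is shipped to the executors and runs as one task per partition, while the final result of the action comes back to the Driver.

// Assumes an existing SparkContext named sc.
val lines = sc.textFile("hdfs:///path/to/input.txt")   // placeholder path
val lengths = lines.map(line => line.length)           // executed by executors, one task per partition
val total = lengths.reduce(_ + _)                      // partial sums combined; final value arrives at the Driver
println(total)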

SparkSession and SparkContext: SparkContext is the heart of any Spark application. The
SparkContext can be thought of as a bridge between your program and the Spark environment
and all that it has to offer. SparkContext is used as the entry point to kick-start the application.
SparkContext can be used to create RDDs as below:
➢ val conf = new SparkConf().setAppName("FirstSpark").setMaster(master)
➢ val sc = new SparkContext(conf)
➢ val data = Array(1, 2, 3, 4, 5)
➢ val distData = sc.parallelize(data)

distData is the RDD which gets created using the SparkContext.
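
To see this RDD in action, a transformation and an action can be chained on distData; the computation runs on the executors and the final value is returned to the driver. For example:
➢ val doubled = distData.map(_ * 2)
➢ val sum = doubled.reduce(_ + _)   // returns 30 for the data above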

SparkSession is a simplified entry point into a Spark application and it also encapsulates the
SparkContext. SparkSession was introduced in Spark 2.x. Prior to this, Spark had different
contexts for different use cases: SQLContext for SQL queries, HiveContext for
running Spark on Hive, StreamingContext for streaming, and so on. SparkSession makes this simple so there is no
confusion about which context to use; it subsumes SQLContext and HiveContext. SparkSession is
instantiated using a builder and it is an important component of Spark 2.0.

➢ val spark = SparkSession.builder()
      .master("local")
      .appName("SparkSessionExample")
      .getOrCreate()

In the interactive Spark Scala shell the SparkSession/SparkContext is automatically provided by the
environment and it is not required to create it manually. But in standalone applications we need
to create it explicitly.
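
Continuing with the spark value built above, a short sketch of how the session is then used; it exposes the underlying SparkContext and serves as the entry point for DataFrame work (the column name below is just illustrative):

// The SparkContext is encapsulated inside the SparkSession.
val sc = spark.sparkContext

// The same session is the entry point for DataFrames / Spark SQL.
val df = spark.range(1, 6).toDF("value")
df.show()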

Spark Deployment Modes Cheat Sheet

Client Mode
- Driver: runs on the machine from where the Spark job is submitted.
- When to use: when the job-submitting machine is very near to the cluster, so there is no network latency. Chances of failure due to network issues are higher.

Cluster Mode
- Driver: launched on one of the machines in the cluster, not on the client machine from where the job is submitted.
- When to use: when the job-submitting machine is far from the cluster. Chances of failure due to network issues are lower.

Standalone Mode
- Driver: launched on the machine where the master script is started.
- When to use: useful for development and testing purposes; not recommended for production-grade applications.
