Spark Cluster
To visualize the architecture of a Spark cluster, let us look at the diagram below and understand
each component and its functions.
When we launch a Spark application on a cluster using a resource manager such as YARN, there are
two ways to do it: cluster mode and client mode. In cluster mode, YARN creates and manages an
Application Master in which the Driver runs, and the client can go away once the application has
started. In client mode, the Driver keeps running on the client, and the Application Master only
requests resources from YARN.
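The two deploy modes are selected with the --deploy-mode flag of spark-submit. As a sketch (the class name and JAR name below are placeholders, not part of any real application):

```shell
# Cluster mode: the Driver runs inside the YARN Application Master,
# so the client can disconnect once the application is started.
./bin/spark-submit --master yarn --deploy-mode cluster \
  --class com.example.FirstSpark myapp.jar

# Client mode: the Driver keeps running on the submitting machine;
# the Application Master only negotiates resources from YARN.
./bin/spark-submit --master yarn --deploy-mode client \
  --class com.example.FirstSpark myapp.jar
```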
Spark Master
When we run Spark in standalone mode, a Master node first needs to be started, which can
be done by executing:
./sbin/start-master.sh
This creates a Master node on the machine where the command is executed. Once the Master
node starts, it prints a Spark URL of the form spark://HOST:PORT, which can be used to start the
Worker nodes.
Spark Worker
Several Worker nodes can be started on different machines in the cluster using the command:
./sbin/start-slave.sh <master-spark-URL>
The Master's web UI can be accessed at http://localhost:8080 or http://server:8080.
We will see these scripts in detail in the Spark Installation section.
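Putting these scripts together, a minimal standalone cluster on a single machine could be brought up and used roughly like this (host, port, and JAR name are placeholders):

```shell
# 1. Start the Master; it prints a URL of the form spark://HOST:7077
./sbin/start-master.sh

# 2. Start a Worker and register it with that Master
./sbin/start-slave.sh spark://localhost:7077

# 3. Submit an application against the standalone Master
./bin/spark-submit --master spark://localhost:7077 myapp.jar
```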
Spark Driver: The Driver is the process which runs the main() function of the application and also
creates the SparkContext. The Driver is a separate JVM process, and it is the responsibility of the
Driver to analyze, distribute, schedule, and monitor work across the worker nodes. Each
application launched or submitted on a cluster has its own separate Driver, and even if there are
multiple applications running simultaneously on a cluster, these Drivers do not talk to each other
in any way. The Driver program also hosts a number of processes which are part of the
application, such as:
- SparkEnv
- DAGScheduler
- TaskScheduler
- SparkUI
The Spark application which we want to run is instantiated within the Spark Driver.
Spark Executor: The Driver program launches tasks which run on individual worker nodes.
These tasks operate on the subset of RDD partitions which are present on that node. The
programs running on the worker nodes are called Executors, and it is these Executors which
execute the actual code written in your application. After getting started, the Driver program
interacts with the cluster manager (YARN, Mesos, or Spark's standalone manager) to acquire
resources on the Worker nodes and then assigns tasks to the Executors. Tasks are the basic
units of execution.
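The resources given to an application's Executors are requested from the cluster manager at submission time; for example, with YARN (the numbers below are purely illustrative):

```shell
# Ask YARN for 4 executors, each with 2 cores and 2 GiB of memory
./bin/spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 2g \
  myapp.jar
```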
SparkSession and SparkContext: The SparkContext is the heart of any Spark application. It can
be thought of as a bridge between your program and the Spark environment and all that it has
to offer, and it is used as the entry point to kickstart the application. The SparkContext can be
used to create RDDs as below:
val conf = new SparkConf().setAppName("FirstSpark").setMaster(master)
val sc = new SparkContext(conf)
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
SparkSession is a simplified entry point into a Spark application, and it also encapsulates the
SparkContext. SparkSession was introduced in Spark 2.x. Prior to this, Spark had different
contexts for different use cases, such as SQLContext for SQL queries, HiveContext for running
Spark on Hive, and StreamingContext for streaming. SparkSession makes this simple so there is
no confusion about which context to use: it subsumes SQLContext and HiveContext.
SparkSession is instantiated using a builder and is an important component of Spark 2.0.
In the Spark interactive Scala shell, the SparkSession/SparkContext is automatically provided by
the environment, and it is not required to create it manually. But in standalone applications we
need to create it explicitly.
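As a sketch, a standalone application built around SparkSession might look like the following; the application name and the local[*] master are example values (on a real cluster the master would normally be supplied via spark-submit):

```scala
import org.apache.spark.sql.SparkSession

object FirstSparkApp {
  def main(args: Array[String]): Unit = {
    // Builder-based entry point introduced in Spark 2.x
    val spark = SparkSession.builder()
      .appName("FirstSpark")   // example application name
      .master("local[*]")      // run locally for testing
      .getOrCreate()

    // The underlying SparkContext is encapsulated by the session
    val sc = spark.sparkContext
    val distData = sc.parallelize(Array(1, 2, 3, 4, 5))
    println(distData.reduce(_ + _)) // sums the elements

    spark.stop()
  }
}
```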
Client Mode: The Driver runs on the machine from where the Spark job is submitted. Suitable
when the job-submitting machine is very near the cluster, so there is no network latency;
otherwise the chances of failure due to network issues are high.
Cluster Mode: The Driver is launched on one of the machines in the cluster, not on the client
machine where the job is submitted. Suitable when the job-submitting machine is far from the
cluster; the chances of failure due to network issues are lower.
Standalone Mode: The Driver is launched on the machine where the master script is started.
Useful for development and testing purposes; not recommended for production-grade
applications.