
Apache Spark architecture

Apache Spark is a fast, general-purpose engine for large-scale data processing.
It is composed of a computing engine and a set of APIs and libraries.
It runs on top of a cluster resource manager and a distributed storage system.

Computing engine
Provides basic functionalities such as memory management, task scheduling, fault recovery, and
interaction with the cluster manager and the storage system.

Spark Core APIs


We have two sets of APIs: the structured APIs, which consist of DataFrames and Datasets and are designed and
optimized to work with structured data, and the unstructured APIs, which are the lower-level APIs, including RDDs.
These APIs are available in Scala, Python, Java, and R.

Libraries
Outside the Spark Core, we have four sets of libraries: Spark SQL (use SQL queries for structured data
processing), Spark Streaming (process continuous data streams), MLlib (a machine learning library that
delivers high-quality algorithms), and GraphX (comes with typical graph algorithms).
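As a small illustration of the Spark SQL library, here is a sketch in Scala; it assumes an existing SparkSession named spark and a hypothetical employees.json file:

    // Load structured data into a DataFrame (the file name is a placeholder).
    val employees = spark.read.json("employees.json")
    // Register it as a temporary view so it can be queried with SQL.
    employees.createOrReplaceTempView("employees")
    val seniors = spark.sql("SELECT name, age FROM employees WHERE age > 40")
    seniors.show()   // prints the result on the driver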
Data structures
RDD
Collection of data (data structure)
Resilient: they can recover from a failure (fault tolerant)
Partitioned: data is broken into smaller chunks so it can be processed on a distributed system
Distributed: Spark spreads the partitions across the cluster
Immutable: once defined, we can't change them (read-only data structure)
RDDs do not have a schema, which means they do not have a columnar structure. Records are simply stored
row by row and are displayed similarly to a list. DataFrames and Datasets are built on top of RDDs.
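A minimal sketch of these properties in Scala, assuming an existing SparkContext named sc:

    // A resilient, partitioned, distributed collection of integers.
    val numbers = sc.parallelize(1 to 100, numSlices = 4)   // 4 partitions spread across the cluster
    println(numbers.getNumPartitions)                       // 4
    // No schema: records are plain values (here Ints), stored row by row.
    // Immutable: a transformation returns a new RDD instead of modifying `numbers`.
    val doubled = numbers.map(_ * 2)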

Data Frame
Data structures that have all the features of RDDs but also have a schema: data is organized into named
columns, like a table in a relational database.
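A short sketch in Scala, assuming an existing SparkSession named spark:

    import spark.implicits._
    // A DataFrame has a schema: named columns, like a relational table.
    val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
    people.printSchema()               // shows the named columns and their types
    people.filter($"age" > 40).show()  // column-based operations use the schema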

Dataset
Similar to DataFrames but strongly typed, meaning that the type is specified upon the creation of the
Dataset and is not deduced from the type of the records stored in it.
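A short sketch in Scala, as it would be typed in the Spark shell, again assuming a SparkSession named spark:

    import org.apache.spark.sql.Dataset
    import spark.implicits._

    // The element type (Person) is fixed when the Dataset is created.
    case class Person(name: String, age: Int)
    val people: Dataset[Person] = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()
    // Fields are accessed in a type-safe way; a typo such as p.agee would not compile.
    val seniors = people.filter(p => p.age > 40)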

Operations
Transformations
Operations that take an RDD as input, perform some function on it, and return one or more RDDs.
Transformations are lazy: they are not executed immediately, but only when we call an action.
Spark doesn't actually build any new RDDs right away; rather, it constructs a chain of hypothetical RDDs that would
result from those transformations, and they are only evaluated once an action is called. This chain of
hypothetical (child) RDDs, all connected logically back to the original (parent) RDD, is called the lineage graph
or logical execution plan; it is a directed acyclic graph. This feature is primarily what drives Spark's fault
tolerance: if a node fails for some reason, all the information about what that node was supposed to be doing
is stored in the lineage graph, which can be replicated elsewhere. There are two types of transformations:
Narrow transformations: all the elements required to compute the records in a single partition live in a single
partition of the parent RDD
Wide transformations: the elements required to compute the records in a single partition may
live in many partitions of the parent RDD
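A sketch in Scala of lazy, narrow, and wide transformations, assuming an existing SparkContext named sc:

    val words  = sc.parallelize(Seq("a", "b", "a", "c"), 2)
    // Narrow: each output partition depends on exactly one partition of the parent RDD.
    val pairs  = words.map(w => (w, 1))
    // Wide: records with the same key may live in many parent partitions, so a shuffle is needed.
    val counts = pairs.reduceByKey(_ + _)
    // Nothing has been computed yet; Spark has only recorded the lineage graph.
    // Work starts when an action such as collect() is called.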

Actions
They are RDD operations that give non-RDD values.
The values of action are stored to drivers or to the external storage system.
It brings laziness of RDD into motion, by the lineage graph evaluation.
An action is one of the ways of sending data from executor to the driver.
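Continuing the previous sketch, a few actions in Scala (the output path is a placeholder):

    // Actions return non-RDD values and trigger evaluation of the lineage graph.
    val total  = counts.count()     // a Long computed on the executors, returned to the driver
    val result = counts.collect()   // an Array[(String, Int)] brought back to the driver
    counts.saveAsTextFile("hdfs:///tmp/word-counts")  // or the values are written to external storage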

Execution
Terminology
Task: a single operation (map, filter, ...) running on a specific RDD partition, executed by a single executor.
Stage: a sequence of tasks that run in parallel without a shuffle.
Job: a sequence of stages, triggered by an action.
Application: a user-built program that consists of a driver and that driver's associated executors.
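These pieces can be observed on the earlier word-count sketch; a sketch, assuming the counts RDD from the transformations example:

    // The lineage printed by toDebugString is indented at each shuffle boundary,
    // which is exactly where Spark will cut the job into stages.
    println(counts.toDebugString)
    // The single collect() action triggers one job, split into two stages at the reduceByKey shuffle;
    // each stage runs one task per partition.
    counts.collect()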

Entry point
An entry point is where control is transferred from the operating system to the provided program. Prior to Spark
2.0, the SparkContext was the entry point of any Spark application and was used to access all Spark features;
it needed a SparkConf, which held all the cluster configurations, to create a SparkContext object. With the
SparkContext we could primarily create just RDDs, and we had to create specific contexts for any other
Spark interactions (SQLContext for SQL, StreamingContext for streaming, and so on). The SparkSession is a
combination of all these different contexts. Internally, the SparkSession creates a new SparkContext for all the
operations, and all the above-mentioned contexts can be accessed through the SparkSession object.
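A minimal sketch in Scala of creating the unified entry point (the application name and master are placeholder values):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("my-app")    // placeholder application name
      .master("yarn")       // or any other supported cluster manager / local mode
      .getOrCreate()

    // The older contexts remain reachable through the session:
    val sc = spark.sparkContext      // the underlying SparkContext (for RDDs)
    spark.sql("SELECT 1").show()     // SQL without creating a separate SQLContext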

Execution modes
Spark follows a master-slave architecture: for every Spark application, it creates one master process (the
driver) and multiple slave processes (the executors). The driver is the master; it is responsible for analyzing,
distributing, scheduling, and monitoring the work across the executors, and for maintaining all the necessary
information during the lifetime of the application. The executors are only responsible for executing the code
assigned to them by the driver and reporting their status back to the driver.

Assume we have a cluster plus a local client machine, and we start the Spark application from the client machine.
The executors always run on the cluster machines, with no exception, but we have the flexibility to start the
driver either on the local machine (client mode) or on the cluster itself (cluster mode).
If we use an interactive client, the client tool itself is the driver, and we will have some executors on the
cluster. If we use spark-submit in cluster mode, Spark starts both the driver and the executors on the cluster.
Cluster mode makes perfect sense for production deployment, because after spark-submit you can switch
off your local computer and the application keeps executing independently within the cluster.
When you are exploring things or debugging an application, you want the driver to run locally so that
you can easily debug it, or at least get the output back on your terminal.

Resources allocation
How does Spark get the resources for the driver and the executors?
That is where a cluster manager comes in (Apache YARN, Apache Mesos, Kubernetes, or the Spark Standalone
manager). No matter which cluster manager we use, they all serve the same primary purpose.

Client mode
1) Assume we start an application with the Spark shell.
2) As soon as the driver creates the Spark session, a request goes to the YARN resource manager to create a
YARN application.
3) The YARN resource manager starts an Application Master (AM).
4) The AM reaches out to the YARN resource manager and requests further containers.
5) The resource manager allocates new containers.
6) The Application Master starts an executor in each container.
7) The executors communicate directly with the driver.
Cluster mode
1) We submit the application using the spark-submit tool, which sends a request to the YARN resource manager.
2) The resource manager starts an Application Master.
3) The AM starts the driver inside its container, so there is no longer any dependency on the local machine.
4) Once started, the driver reaches out to the resource manager with a request for more containers.
5) The rest of the process is the same as in client mode.
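The deploy mode is chosen when submitting the application; a sketch of the two spark-submit invocations on YARN (the class and jar names are placeholders):

    # Cluster mode: the driver runs inside a container on the cluster.
    spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar

    # Client mode: the driver runs on the machine that submits the application.
    spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar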

Local mode
We just need a local machine and the Spark binaries.
The application starts in a JVM, and everything else, including the driver and the executor, runs in the same JVM.
Spark jobs are executed in parallel on a single machine using multi-threading,
which restricts parallelism to the number of available cores on the machine.
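A sketch of a local-mode session in Scala (the application name is a placeholder):

    import org.apache.spark.sql.SparkSession

    // local[*] uses one worker thread per available core; local[2] would cap parallelism at two.
    val spark = SparkSession.builder()
      .appName("local-experiment")
      .master("local[*]")
      .getOrCreate()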
Internal mechanics
Example: count the number of files in each directory, using a six-node cluster

1) Load a text file into an RDD with 5 partitions.
2) First map operation, which splits each line into an array of words. Hence, the new RDD is a collection
of arrays.
3) Second map, which generates key-value pairs by taking the directory name as the key and 1 as the value.
4) Count the number of files by grouping all the values by key and summing up the ones.
5) Collect all the data back from the executors to the driver.
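A sketch in Scala of this pipeline (the input file name is a placeholder, and extracting the directory name assumes paths of the form /data/dir/file):

    val lines  = sc.textFile("file-list.txt", minPartitions = 5)  // step 1: load with 5 partitions
    val parts  = lines.map(line => line.split("/"))               // step 2: split each path into its components
    val pairs  = parts.map(p => (p(p.length - 2), 1))             // step 3: (directory name, 1)
    val counts = pairs.reduceByKey(_ + _)                         // step 4: sum the ones per directory (needs a shuffle)
    val result = counts.collect()                                 // step 5: bring the counts back to the driver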
Spark gives one partition to one executor at a time. If we have 10 partitions, we can achieve at most ten parallel
processes. However, if we have just two executors, all those 10 partitions will be queued to those
two executors.

We created 5 partitions and Spark assigned them to 5 executors. Each executor performs the map operations
on the partition that it holds and keeps the transformed data with it; there is no need to send data from
one executor to another or to the driver. Since no data movement is needed yet, Spark
performs all of this in a single stage.
We then want to group the whole dataset by key and count the number of ones for each key, but the
keys are spread across the partitions. To accomplish this, we need to repartition the data so that all the
records for a single key end up in one partition (a shuffle and sort activity), which is why a new stage is needed.
So whenever we need to move data across the executors, we need a new stage. Spark is able to identify such
needs and breaks the job into stages. Once we have these key-based partitions, each executor performs the
count. Finally, the results are sent back to the driver when we call the collect action.

_______________________________________________________________________________________
Rassem HACHOUD Master 1 MIAGE Paris Descartes 14 March 2020
