“Now, we can do about four models a day,” said Rajiv Bhat, senior vice president of data sciences and
marketplace at InMobi.
1
Outline
◼ Introduction to Spark
◼ Brief History of Spark
◼ Spark Features
◼ Spark and Hadoop
◼ Spark Framework
◼ Spark Core
◼ Spark RDD
◼ Spark Architecture
◼ Spark Application Components
◼ Spark Eco-system
◼ Some Spark Applications
2
What is Spark
◼ Spark is an open source unified computing engine and a set of libraries for parallel
data processing on computer clusters.
◼ It is designed to
◼ query, analyze, and transform big data
◼ run and manage huge clusters of computers
◼ deliver the computational speed, scalability, and programmability required for big data, especially for
◼ streaming data
◼ graph data
◼ machine learning
◼ artificial intelligence applications
3
Brief History of Spark
4
Brief History of Spark
5
Spark Features
6
Spark Features
7
Spark and Hadoop
◼ Spark was built on top of Hadoop MapReduce, and it extends the MapReduce model to
efficiently support more types of computations, including interactive queries and
stream processing.
◼ Spark was built to speed up Hadoop's computational processing.
◼ It is not a modified version of Hadoop.
◼ It does not depend on Hadoop, because it has its own cluster management.
◼ Spark uses Hadoop for storage purposes only, as sketched below.
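◼ To make the last point concrete, here is a minimal PySpark sketch in which Hadoop (HDFS) serves only as the storage layer while Spark performs the computation; the HDFS host and paths are hypothetical, and sc is an existing SparkContext.
# Read from HDFS, compute in Spark, write the result back to HDFS.
logs = sc.textFile("hdfs://namenode:9000/data/access.log")
errors = logs.filter(lambda line: "ERROR" in line)
errors.saveAsTextFile("hdfs://namenode:9000/data/errors-out")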
8
Spark vs. MapReduce
9
How far away is your data?
13
Spark Framework
14
Spark Core
15
RDD
16
Features of RDD
17
RDD Storage
18
RDD Partitions
19
Three Ways to Create RDD
# 1) Parallelize an existing collection in the driver program
data = [1, 2, 3, 4, 5]
rdd1 = sc.parallelize(data)
# 2) Reference a dataset in external storage (local file, HDFS, S3, ...)
rdd2 = sc.textFile("data.txt")
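◼ The third way, presumably, is deriving a new RDD from an existing one through a transformation; a minimal sketch (the doubling logic is purely illustrative):
# 3) Create a new RDD by transforming an existing RDD
rdd3 = rdd1.map(lambda x: x * 2)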
20
RDD Operations
◼ After the data in question has been placed into its RDD, Spark permits two primary
types of operation:
◼ Transformations
◼ Actions
21
RDD Transformations
◼ Transformations such as map, filter, etc. are applied to an RDD to generate
another RDD.
◼ The new RDD holds a pointer back to its parent RDD.
◼ The parent RDD remains unchanged, no matter what happens to the child RDD.
◼ This immutability lets Spark keep track of the alterations that were carried out
during the transformation, and thereby makes it possible to recover the data should a
failure or other problem crop up somewhere along the line.
◼ Transformations aren’t executed until the moment they’re needed. This behavior is
known as lazy evaluation (see the sketch below).
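◼ A minimal PySpark sketch of lazy evaluation (the file name is hypothetical): the map and filter calls return new RDDs immediately without reading any data; nothing runs until the count() action.
lines = sc.textFile("data.txt")                 # no data is read yet
lengths = lines.map(lambda s: len(s))           # transformation: recorded, not executed
long_lines = lengths.filter(lambda n: n > 80)   # still nothing executed
print(long_lines.count())                       # action: triggers the whole computation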
22
RDD Transformations
23
Narrow Transformations
24
Wide Transformations
25
RDD Actions
◼ Actions are behaviors that interact with the data but don’t change it.
◼ Examples (illustrated after this list):
◼ count()
◼ collect()
◼ first()
◼ take()
◼ After an application requests an action, Spark evaluates the initial RDD and then
creates subsequent RDD instances based on what it’s being asked to do.
◼ When spawning these next-generation RDDs, Spark examines the cluster to find the
most efficient place to put the new RDD.
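◼ A small sketch of the actions listed above (the data is illustrative):
nums = sc.parallelize([5, 3, 8, 1])
nums.count()     # 4 - number of elements
nums.collect()   # [5, 3, 8, 1] - brings the whole RDD back to the driver
nums.first()     # 5 - first element
nums.take(2)     # [5, 3] - first two elements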
26
Transformation and Action Functions
# Build an RDD from a text file, map each line to its length,
# then reduce the lengths to their total (map is a transformation, reduce is an action).
rdd1 = sc.textFile("data.txt")
rdd2 = rdd1.map(lambda s: len(s))
total = rdd2.reduce(lambda a, b: a + b)
28
RDD Dependencies
29
Lineage Graph
◼ An RDD DAG (Directed Acyclic Graph) is a dependency lineage graph of all the parent RDDs
of an RDD.
◼ The DAG is a sequence of computations performed on the data, where each node is an RDD
partition and each edge is a transformation applied on top of the data.
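◼ One way to inspect an RDD's lineage in PySpark is toDebugString(), sketched below (the file name is hypothetical and the exact output format depends on the Spark version):
words = sc.textFile("data.txt").flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(counts.toDebugString().decode())  # prints the chain of parent RDDs (the lineage graph)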
30
Stages DAG
32
Spark Architecture
33
Spark Application Main Components
34
Spark Application Main Components
35
Driver Program
◼ The life of a Spark application starts (and finishes) with the Spark driver.
◼ The driver is also responsible for planning and coordinating the execution of the
Spark program and for returning status and/or results (data) to the client.
◼ When application code is submitted, the driver implicitly converts the user code that
contains transformations and actions into a logical DAG.
◼ It creates the Spark Context, which is the gateway to all Spark functionality.
◼ It talks to the cluster manager and negotiates the resources.
◼ It sends tasks to the executors based on data placement.
◼ When executors start, they register themselves with the driver, so the driver has a
complete view of the executors that are executing the tasks.
◼ During the execution of tasks, the driver monitors the set of executors that are running.
◼ The driver schedules future tasks based on data placement.
36
Driver & Cluster Manager
37
Spark Context
38
How to Create Spark Context Class
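◼ A minimal sketch of creating a Spark Context in PySpark (the application name and master URL are placeholders):
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[*]")  # placeholder settings
sc = SparkContext(conf=conf)  # the gateway to all Spark functionality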
39
Worker Nodes
◼ Worker nodes are the slave nodes whose job is to execute the tasks.
◼ The Spark Context takes the job, breaks it into tasks, and distributes them to the
worker nodes.
◼ Worker nodes operate on the partitioned RDDs, perform the operations, collect the results,
and return them to the main Spark Context.
◼ As the number of workers increases, the available memory also increases, and you can
cache data so that jobs execute faster (see the sketch below).
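◼ A minimal sketch of caching (the file name is hypothetical): cache() keeps the RDD partitions in the workers' memory after the first action, so later actions reuse them instead of recomputing.
lines = sc.textFile("data.txt").filter(lambda s: len(s) > 0)
lines.cache()          # mark the RDD to be kept in executor memory
print(lines.count())   # first action: computes and caches the partitions
print(lines.count())   # second action: served from the cache, no recomputation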
40
Spark Architecture Workflow
◼ STEP 1: The client submits the Spark user application code (a driver-program sketch
follows this list). When the application code is submitted, the driver implicitly converts
the user code that contains transformations and actions into a logical directed acyclic
graph (DAG).
◼ STEP 2: The driver converts the DAG into a physical execution plan with many stages.
After converting it into a physical execution plan, it creates physical execution units
called tasks under each stage. The tasks are then bundled and sent to the cluster.
◼ STEP 3: The driver talks to the cluster manager and negotiates the resources. The
cluster manager launches executors on worker nodes on behalf of the driver. At this
point, the driver sends the tasks to the executors based on data placement. When
executors start, they register themselves with the driver, so the driver has a complete
view of the executors that are executing the tasks.
◼ STEP 4: During the execution of tasks, the driver program monitors the set of executors
that are running. The driver also schedules future tasks based on data placement.
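◼ A minimal driver program matching this workflow (a sketch; the application name and input path are hypothetical): the transformations only build the logical DAG, and the final action triggers stage and task creation and execution on the executors.
from pyspark import SparkContext

sc = SparkContext(appName="WorkflowSketch")
# Transformations only build the logical DAG (STEP 1).
counts = (sc.textFile("hdfs://namenode:9000/data/input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))
# The action triggers the physical plan, stages, and tasks (STEPS 2-4).
print(counts.take(5))
sc.stop()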
41
Spark Eco-System
42
Spark Supported Languages
43
Spark Cluster Managers
44
Spark Eco-System
45
Spark SQL
46
Spark Streaming
47
Spark GraphX
49
Companies Who Use Spark Streaming
50
Some Spark Applications
51
Some Spark Applications
52
Some Companies Who Use Spark
53
Refs
◼ https://spark.apache.org/docs/latest/rdd-programming-guide.html
54
END