
"Spark is beautiful. With Hadoop, it would take us six-seven months to develop a machine learning model.

Now, we can do about four models a day.” - said Rajiv Bhat, senior vice president of data sciences and
marketplace at InMobi.

Outline

◼ Introduction to Spark
◼ Brief History of Spark
◼ Spark Features
◼ Spark and Hadoop
◼ Spark Framework
◼ Spark Core
◼ Spark RDD
◼ Spark Architecture
◼ Spark Application Components
◼ Spark Eco-System
◼ Some Spark Applications

What is Spark

◼ Spark is an open source unified computing engine and a set of libraries for parallel
data processing on computer clusters.
◼ It is designed to
◼ query, analyze, and transform big data
◼ run and manage huge clusters of computers
◼ deliver the computational speed, scalability, and programmability required for big data, especially
◼ streaming data
◼ graph data
◼ machine learning
◼ artificial intelligence applications

Brief History of Spark

◼ 2009: Started as a research project at UC Berkeley by Matei Zaharia.
◼ 2010: Open sourced under the Berkeley Software Distribution (BSD) license.
◼ 2013: Became an Apache top-level project.
◼ 2014: Databricks used Spark to sort large-scale data in record time.
◼ 2020: One of the most in-demand data processing frameworks.

Research paper: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Spark Features


◼ Many times faster than Hadoop
◼ In-memory computing
◼ Powerful caching
◼ Lazy evaluation
◼ Can be deployed through Spark's standalone cluster manager, Mesos, or YARN
◼ Real-time data processing
◼ Flexible: polyglot (Scala, R, Python, Java)
◼ Provides shells in Scala and Python
◼ Fault tolerant
◼ Efficient pipelining (avoids writing intermediate results to disk)

Spark and Hadoop

◼ Spark was built on top of Hadoop MapReduce and extends the MapReduce model to
efficiently support more types of computation, including interactive queries and
stream processing.
◼ Spark was built to speed up the computational processing of Hadoop software.
◼ It is not a modified version of Hadoop.
◼ It does not depend on Hadoop, because it has its own cluster management.
◼ Spark uses Hadoop for storage purposes only.

Spark vs. MapReduce

◼ In comparison with MapReduce, Spark offers four primary advantages for
developing Big Data solutions:
◼ Performance
◼ Simplicity
◼ Ease of administration
◼ Faster application development

How Far Away Is Your Data?

Location         Cycles   Analogy         Time
Registers        1        In your head    1 min
On-chip cache    2        This Room       2 min
On-board cache   10       KFUPM Campus    10 min
Memory           100      Jubail          1.5 hours
Disk             10^6     Pluto           2 years
Tape             10^9     Andromeda       2,000 years

Spark Framework

Spark Core

◼ Spark Core is responsible for
◼ Memory management
◼ Fault recovery
◼ Scheduling, distributing, and monitoring jobs in a cluster
◼ Interacting with storage systems
◼ Spark uses a specialized fundamental data structure known as the RDD (Resilient
Distributed Dataset), which is a logical collection of data partitioned across machines.

RDD

◼ An RDD is a read-only, partitioned collection of records.
◼ RDD stands for Resilient Distributed Dataset
◼ Resilient: RDDs can reconstruct themselves if there are any failures
◼ RDDs are collections of objects
◼ Embedded in Spark Core
◼ Immutable (read-only)
◼ In-memory
◼ Fault-tolerant
◼ Distributed
◼ Partitioned

Features of RDD

RDD Storage

RDD Partitions

◼ Each partition holds a subset of the RDD's data
◼ Each partition can be assigned to a node in a cluster
◼ The RDD holds references to its partitions
◼ Users can indicate how RDDs are to be partitioned across machines based on a key
in each record, as illustrated in the sketch below.
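
A minimal sketch of this (assuming a running SparkContext named sc, as in the other examples) showing how the number of partitions can be controlled and how a pair RDD can be partitioned by key:

# Assumes an existing SparkContext `sc` (e.g., from the PySpark shell).

# Create an RDD with an explicit number of partitions.
rdd = sc.parallelize(range(100), 4)
print(rdd.getNumPartitions())        # 4

# Pair RDDs of (key, value) records can be partitioned by key, so that
# records sharing a key land in the same partition.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
by_key = pairs.partitionBy(2)        # hash-partition by key into 2 partitions

# glom() shows the records grouped per partition.
print(by_key.glom().collect())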

Three Ways to Create RDD

# 1. Parallelize an existing collection in the driver program
data = [1, 2, 3, 4, 5]
rdd1 = sc.parallelize(data)

# 2. Load an external dataset (e.g., a text file from local storage or HDFS)
rdd2 = sc.textFile("data.txt")

# 3. Transform an existing RDD into a new RDD
rdd3 = rdd2.map(lambda s: len(s))

RDD Operations

◼ After the data in question has been placed into its RDD, Spark permits two primary
types of operation:
◼ Transformations
◼ Actions

RDD Transformations

◼ Transformations such as map, filter, etc. are applied to an RDD to generate
another RDD.
◼ The new RDD holds a pointer to its parent RDD.
◼ The parent RDD remains unchanged, no matter what happens to the child RDD.
◼ This immutability lets Spark keep track of the alterations that were carried out
during the transformation, and thereby makes it possible to restore matters back to
normal should a failure or other problem crop up somewhere along the line.
◼ Transformations aren't executed until the moment they're needed. This behavior is
known as lazy evaluation.

RDD Transformations

◼ There are two types of transformations, illustrated in the sketch below:
◼ Narrow transformations: each partition of the parent RDD is used by at most
one partition of the child RDD.
◼ Wide transformations: multiple child partitions may depend on a single parent
partition.
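
As a rough illustration of the two types (a sketch assuming the same SparkContext sc and data.txt file used elsewhere in these slides):

words = sc.textFile("data.txt").flatMap(lambda line: line.split())

# Narrow: each child partition depends on at most one parent partition,
# so map and filter can run entirely within a partition.
lengths = words.map(lambda w: len(w))
short = words.filter(lambda w: len(w) < 5)

# Wide: reduceByKey gathers all values for a key from every parent
# partition, so Spark must shuffle data across the cluster.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)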

Narrow Transformations

Wide Transformations

RDD Actions

◼ Actions are behaviors that interact with the data but don't change it.
◼ Examples (see the sketch below):
◼ count()
◼ collect()
◼ first()
◼ take()
◼ After an application requests an action, Spark evaluates the initial RDD and then
creates subsequent RDD instances based on what it's being asked to do.
◼ When spawning these next-generation RDDs, Spark examines the cluster to find the
most efficient place to put the new RDD.

Note: For Spark function list visit https://training.databricks.com/visualapi.pdf
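
A minimal sketch of these actions (reusing the line-length RDD built from data.txt earlier; the printed values depend, of course, on the file's contents):

rdd2 = sc.textFile("data.txt").map(lambda s: len(s))

print(rdd2.count())    # number of lines in the file
print(rdd2.first())    # length of the first line
print(rdd2.take(3))    # lengths of the first three lines
print(rdd2.collect())  # all line lengths returned to the driver (use with care on large data)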

Transformation and Action Functions

For a list of Spark operations: https://training.databricks.com/visualapi.pdf


Lazy Evaluation

◼ RDDs do not always need to be materialized.
◼ An RDD has enough information about how it was derived from other RDDs (its lineage).

rdd1 = sc.textFile("data.txt")
rdd2 = rdd1.map(lambda s: len(s))
total = rdd2.reduce(lambda a, b: a + b)

◼ The first line defines a base RDD from an external file.
◼ This dataset is not loaded into memory or otherwise acted on: rdd1 is merely a pointer to the
file.
◼ The second line defines rdd2 as the result of a map transformation. Again, rdd2 is not
immediately computed, due to laziness.
◼ Finally, we run reduce, which is an action. At this point Spark breaks the computation into
tasks to run on separate machines, and each machine runs both its part of the map and a
local reduction, returning only its answer to the driver program.

RDD Dependencies

Lineage Graph

◼ An RDD's DAG (Directed Acyclic Graph) is the dependency lineage graph of all parent RDDs
of that RDD.
◼ The DAG is a sequence of computations performed on data, where each node is an RDD
partition and each edge is a transformation on top of the data. The recorded lineage can be
inspected as shown below.
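
As a small sketch (assuming the SparkContext sc as before), the lineage Spark records for an RDD can be printed with toDebugString(); the exact output format varies across Spark versions:

counts = (sc.textFile("data.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))

# Print the recorded lineage: parent RDDs and the transformations between them.
print(counts.toDebugString().decode())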

Stages DAG

◼ The DAG consists of tasks and stages.
◼ Tasks are the smallest unit of schedulable work in a Spark program.
◼ The DAG scheduler splits each job into multiple stages.
◼ Stages are sets of tasks that can be run together.
◼ Stages are dependent upon one another.
◼ Stages are created based on the transformations:
◼ Narrow transformations are grouped together into a single stage.
◼ A shuffle operation (wide transformation) defines the boundary between two stages.
◼ Stages are therefore separated by shuffle operations. The shuffle is Spark's mechanism
for re-distributing data so that it's grouped differently across partitions. This
typically involves copying data across executors and machines, making the shuffle a
complex and costly operation. A small example of where a stage boundary arises is
sketched below.
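
A rough sketch of where a stage boundary arises (again assuming the SparkContext sc; groupByKey is used here purely to force a shuffle):

# Stage 1: narrow transformations, pipelined within each partition.
events = sc.textFile("data.txt").filter(lambda line: line != "")
pairs = events.map(lambda line: (line.split()[0], 1))

# groupByKey is a wide transformation: it shuffles data by key,
# closing the first stage and opening a second one.
grouped = pairs.groupByKey()
totals = grouped.mapValues(lambda values: sum(values))  # narrow: stays in stage 2

totals.count()  # the action submits the job, which runs as two stages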
Stages

Spark Architecture

Spark Application Main Components

◼ The main Spark application components are:
◼ Driver program
◼ Spark context
◼ Cluster manager
◼ Executors
◼ Executors run on the worker nodes
◼ Spark is written in Scala but compiles to JVM bytecode and runs on JVMs

Driver Program

◼ The life of a Spark application starts (and finishes) with the Spark driver.
◼ The driver is also responsible for planning and coordinating the execution of the
Spark program and returning status and/or results (data) to the client.
◼ When application code is submitted, the driver implicitly converts the user code that
contains transformations and actions into a logical DAG.
◼ It creates the Spark context, which is the gateway to all Spark functionalities.
◼ It talks to the cluster manager and negotiates the resources.
◼ It sends tasks to the executors based on data placement.
◼ When executors start, they register themselves with the driver, so the driver has a
complete view of the executors that are executing the tasks.
◼ During execution of the tasks, the driver monitors the set of executors that run them.
◼ The driver schedules future tasks based on data placement.
Driver & Cluster Manager

Spark Context

◼ Is the gateway to all Spark functionalities.
◼ Anything you do on Spark goes through the Spark context.
◼ Together with the driver, it takes care of job execution within the cluster.
◼ It breaks each job into tasks and distributes them to the worker nodes.
◼ It works with the cluster manager to manage the various jobs.

How to Create Spark Context Class

◼ To create a SparkContext, a SparkConf object should be created first.
◼ The SparkConf holds the configuration parameters that our Spark driver application will
pass to the SparkContext.
◼ Some of these parameters define properties of the Spark driver application, while others
are used by Spark to allocate resources on the cluster, such as the number of executors
and the memory size and cores used by the executors running on the worker nodes.
◼ In short, it specifies how to access the Spark cluster.
◼ After the creation of a SparkContext object, we can invoke functions such as
textFile, sequenceFile, parallelize, etc., as in the sketch below.
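
A minimal sketch of this for a standalone PySpark script; the application name, master URL, and memory setting are illustrative placeholders:

from pyspark import SparkConf, SparkContext

# Configuration that the driver application passes to the SparkContext.
conf = (SparkConf()
        .setAppName("MyFirstSparkApp")        # illustrative application name
        .setMaster("local[*]")                # placeholder: use the real cluster URL in practice
        .set("spark.executor.memory", "2g"))  # example resource setting

sc = SparkContext(conf=conf)

# With the context in place, the functions mentioned above become available.
rdd = sc.parallelize([1, 2, 3, 4, 5])
lines = sc.textFile("data.txt")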

Worker Nodes

◼ Worker nodes are the slave nodes whose job is basically to execute the tasks.
◼ The Spark context takes a job, breaks it into tasks, and distributes them to the
worker nodes.
◼ They work on the partitioned RDDs, perform operations, collect the results, and return
them to the main Spark context.
◼ As the number of workers increases, the total memory also increases, and more data
can be cached so that jobs execute faster.

Spark Architecture Workflow

◼ STEP 1: The client submits the Spark user application code. When the application code is
submitted, the driver implicitly converts user code that contains transformations
and actions into a logical directed acyclic graph (DAG).
◼ STEP 2: The driver converts the DAG into a physical execution plan with many stages. After
converting it into a physical execution plan, it creates physical execution units called
tasks under each stage. The tasks are then bundled and sent to the cluster.
◼ STEP 3: The driver talks to the cluster manager and negotiates the resources. The
cluster manager launches executors on the worker nodes on behalf of the driver. At this
point, the driver sends the tasks to the executors based on data placement.
When executors start, they register themselves with the driver, so the driver has
a complete view of the executors that are executing the tasks.
◼ STEP 4: During execution of the tasks, the driver program monitors the set of executors
that run them. The driver also schedules future tasks based on data placement.

Spark Eco-System

Spark Supported Languages

Spark Cluster Managers

Spark Eco-System

Spark SQL

◼ Used for processing structured and semi-structured data.
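
A minimal sketch, assuming Spark 2.x or later where a SparkSession (here named spark) is the entry point for Spark SQL; people.json is a hypothetical input file whose records are assumed to contain name and age fields:

from pyspark.sql import SparkSession

# Entry point for Spark SQL; builds on top of Spark Core.
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Load semi-structured JSON data (hypothetical file) into a DataFrame.
people = spark.read.json("people.json")

# Register the DataFrame as a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()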

Spark Streaming

◼ Processes streaming data (see the sketch below).
◼ Examples: Twitter tweets, real-time video camera input, credit-card transaction
evaluation, and product or service recommendations.
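
A minimal sketch using the classic DStream word count (assuming the SparkContext sc; the host and port are illustrative placeholders for a socket text source):

from pyspark.streaming import StreamingContext

# Micro-batch streaming context with a 1-second batch interval.
ssc = StreamingContext(sc, 1)

# Hypothetical source: lines of text arriving on a TCP socket.
lines = ssc.socketTextStream("localhost", 9999)

# Count the words in each 1-second batch and print the results.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()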

Spark Graphx

◼ For processing and analyzing associations, connections, networks, and so on.
◼ Examples: social networks, computer networks

Companies Who Use Spark Streaming

Some Spark Applications

Some Companies Who Use Spark

Refs

◼ Learning Spark: http://index-of.co.uk/Big-Data-Technologies/Learning%20Spark%20%20Lightning-Fast%20Big%20Data%20Analysis%20.pdf

◼ Spark for Dummies: https://www.ibm.com/downloads/cas/WEB4XBOR

◼ https://spark.apache.org/docs/latest/rdd-programming-guide.html

END
