
"Spark is beautiful. With Hadoop, it would take us six-seven months to develop a machine learning model.

Now, we can do about four models a day.” - said Rajiv Bhat, senior vice president of data sciences and
marketplace at InMobi.

Outline

◼ Introduction to Spark
◼ Brief History of Spark
◼ Spark Features
◼ Spark and Hadoop
◼ Spark Framework
◼ Spark Core
◼ Spark RDD
◼ Spark Architecture
◼ Spark Application Components
◼ Spark Eco-System
◼ Some Spark Applications

What is Spark

◼ Spark is an open source unified computing engine and a set of libraries for parallel
data processing on computer clusters.
◼ It is designed to
◼ query, analyze, and transform big data
◼ run and manage huge clusters of computers
◼ deliver the computational speed, scalability, and programmability required for big data, especially
◼ streaming data
◼ graph data
◼ machine learning
◼ artificial intelligence applications

Brief History of Spark

◼ 2009: Started as a research project at UC Berkeley by Matei Zaharia.
◼ 2010: Open sourced under the Berkeley Software Distribution (BSD) license.
◼ 2013: Became an Apache top-level project.
◼ 2014: Databricks used Spark to sort large-scale data in record time.
◼ 2020: One of the most in-demand data processing frameworks.

Research paper: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf

Spark Features


◼ Many times faster than Hadoop
◼ In-memory computing
◼ Powerful caching
◼ Lazy evaluation
◼ Can be deployed through Spark's standalone cluster manager, Mesos, or YARN
◼ Real-time data processing
◼ Flexible: polyglot (Scala, R, Python, Java)
◼ Provides shells in Scala and Python
◼ Fault tolerant
◼ Efficient pipelining (avoids writing intermediate results to disk)

Spark and Hadoop

◼ Spark was built on top of Hadoop MapReduce and extends the MapReduce model to
efficiently support more types of computation, including interactive queries and
stream processing.
◼ Spark was built to speed up the computational processing of Hadoop software.
◼ It is not a modified version of Hadoop.
◼ It does not depend on Hadoop, because it has its own cluster management.
◼ Spark uses Hadoop for storage purposes only.

Spark vs. MapReduce

◼ In comparison with MapReduce, Spark offers four primary advantages for
developing Big Data solutions:
◼ Performance
◼ Simplicity
◼ Ease of administration
◼ Faster application development

How Far Away Is Your Data?

Location         Cycles   Analogy         Time
Registers        1        In your head    1 min
On-chip cache    2        This Room       2 min
On-board cache   10       KFUPM Campus    10 min
Memory           100      Jubail          1.5 hours
Disk             10^6     Pluto           2 years
Tape             10^9     Andromeda       2,000 years

Spark Framework

Spark Core

◼ Spark Core is responsible for
◼ Memory management
◼ Fault recovery
◼ Scheduling, distributing, and monitoring jobs in a cluster
◼ Interacting with storage systems
◼ Spark uses a specialized fundamental data structure known as the RDD (Resilient
Distributed Dataset), which is a logical collection of data partitioned across machines.

RDD

◼ An RDD is a read-only, partitioned collection of records.
◼ RDD stands for Resilient Distributed Dataset
◼ Resilient: RDDs can reconstruct themselves if there are any failures
◼ RDDs are collections of objects
◼ Embedded in Spark Core
◼ Immutable (read-only)
◼ In-memory
◼ Fault-tolerant
◼ Distributed
◼ Partitioned

Features of RDD

RDD Storage

RDD Partitions

◼ Each partition holds a subset of the RDD's data
◼ Each partition can be assigned to a node in a cluster
◼ The RDD holds references to its partitions
◼ Users can indicate how RDDs are to be partitioned across machines based on a key
in each record, as illustrated in the sketch below.
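
A minimal sketch of this (assuming a running SparkContext named sc, as in the other examples) showing how the number of partitions can be controlled and how a pair RDD can be partitioned by key:

# Assumes an existing SparkContext `sc` (e.g., from the PySpark shell).

# Create an RDD with an explicit number of partitions.
rdd = sc.parallelize(range(100), 4)
print(rdd.getNumPartitions())        # 4

# Pair RDDs of (key, value) records can be partitioned by key, so that
# records sharing a key land in the same partition.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
by_key = pairs.partitionBy(2)        # hash-partition by key into 2 partitions

# glom() shows the records grouped per partition.
print(by_key.glom().collect())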

Three Ways to Create RDD

# 1. Parallelize an existing collection in the driver program
data = [1, 2, 3, 4, 5]
rdd1 = sc.parallelize(data)

# 2. Load an external dataset (e.g., a text file from local storage or HDFS)
rdd2 = sc.textFile("data.txt")

# 3. Transform an existing RDD into a new RDD
rdd3 = rdd2.map(lambda s: len(s))

RDD Operations

◼ After the data in question has been placed into its RDD, Spark permits two primary
types of operation:
◼ Transformations
◼ Actions

RDD Transformations

◼ Transformations such as map, filter, etc. are applied to an RDD to generate
another RDD.
◼ The new RDD holds a pointer to its parent RDD.
◼ The parent RDD remains unchanged, no matter what happens to the child RDD.
◼ This immutability lets Spark keep track of the alterations that were carried out
during the transformation, and thereby makes it possible to restore matters back to
normal should a failure or other problem crop up somewhere along the line.
◼ Transformations aren't executed until the moment they're needed. This behavior is
known as lazy evaluation.

RDD Transformations

◼ There are two types of transformations, illustrated in the sketch below:
◼ Narrow transformations: each partition of the parent RDD is used by at most
one partition of the child RDD.
◼ Wide transformations: multiple child partitions may depend on a single parent
partition.
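
As a rough illustration of the two types (a sketch assuming the same SparkContext sc and data.txt file used elsewhere in these slides):

words = sc.textFile("data.txt").flatMap(lambda line: line.split())

# Narrow: each child partition depends on at most one parent partition,
# so map and filter can run entirely within a partition.
lengths = words.map(lambda w: len(w))
short = words.filter(lambda w: len(w) < 5)

# Wide: reduceByKey gathers all values for a key from every parent
# partition, so Spark must shuffle data across the cluster.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)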

Narrow Transformations

Wide Transformations

RDD Actions

◼ Actions are behaviors that interact with the data but don't change it.
◼ Examples (see the sketch below):
◼ count()
◼ collect()
◼ first()
◼ take()
◼ After an application requests an action, Spark evaluates the initial RDD and then
creates subsequent RDD instances based on what it's being asked to do.
◼ When spawning these next-generation RDDs, Spark examines the cluster to find the
most efficient place to put the new RDD.

Note: For Spark function list visit https://training.databricks.com/visualapi.pdf
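
A minimal sketch of these actions (reusing the line-length RDD built from data.txt earlier; the printed values depend, of course, on the file's contents):

rdd2 = sc.textFile("data.txt").map(lambda s: len(s))

print(rdd2.count())    # number of lines in the file
print(rdd2.first())    # length of the first line
print(rdd2.take(3))    # lengths of the first three lines
print(rdd2.collect())  # all line lengths returned to the driver (use with care on large data)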

Transformation and Action Functions

For a list of Spark operations: https://training.databricks.com/visualapi.pdf


Lazy Evaluation

◼ RDDs do not always need to be materialized.
◼ An RDD has enough information about how it was derived from other RDDs (its lineage).

rdd1 = sc.textFile("data.txt")
rdd2 = rdd1.map(lambda s: len(s))
total = rdd2.reduce(lambda a, b: a + b)

◼ The first line defines a base RDD from an external file.
◼ This dataset is not loaded into memory or otherwise acted on: rdd1 is merely a pointer to the
file.
◼ The second line defines rdd2 as the result of a map transformation. Again, rdd2 is not
immediately computed, due to laziness.
◼ Finally, we run reduce, which is an action. At this point Spark breaks the computation into
tasks to run on separate machines, and each machine runs both its part of the map and a
local reduction, returning only its answer to the driver program.

RDD Dependencies

Lineage Graph

◼ An RDD's DAG (Directed Acyclic Graph) is the dependency lineage graph of all parent RDDs
of that RDD.
◼ The DAG is a sequence of computations performed on data, where each node is an RDD
partition and each edge is a transformation on top of the data. The recorded lineage can be
inspected as shown below.
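
As a small sketch (assuming the SparkContext sc as before), the lineage Spark records for an RDD can be printed with toDebugString(); the exact output format varies across Spark versions:

counts = (sc.textFile("data.txt")
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))

# Print the recorded lineage: parent RDDs and the transformations between them.
print(counts.toDebugString().decode())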

Stages DAG

◼ The DAG consists of tasks and stages.
◼ Tasks are the smallest unit of schedulable work in a Spark program.
◼ The DAG scheduler splits each job into multiple stages.
◼ Stages are sets of tasks that can be run together.
◼ Stages are dependent upon one another.
◼ Stages are created based on the transformations:
◼ Narrow transformations are grouped together into a single stage.
◼ A shuffle operation (wide transformation) defines the boundary between two stages.
◼ Stages are therefore separated by shuffle operations. The shuffle is Spark's mechanism
for re-distributing data so that it's grouped differently across partitions. This
typically involves copying data across executors and machines, making the shuffle a
complex and costly operation. A small example of where a stage boundary arises is
sketched below.
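
A rough sketch of where a stage boundary arises (again assuming the SparkContext sc; groupByKey is used here purely to force a shuffle):

# Stage 1: narrow transformations, pipelined within each partition.
events = sc.textFile("data.txt").filter(lambda line: line != "")
pairs = events.map(lambda line: (line.split()[0], 1))

# groupByKey is a wide transformation: it shuffles data by key,
# closing the first stage and opening a second one.
grouped = pairs.groupByKey()
totals = grouped.mapValues(lambda values: sum(values))  # narrow: stays in stage 2

totals.count()  # the action submits the job, which runs as two stages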
Stages

Spark Architecture

Spark Application Main Components

◼ The main Spark application components are:
◼ Driver program
◼ Spark context
◼ Cluster manager
◼ Executors
◼ Executors run on the worker nodes
◼ Spark is written in Scala but compiles to JVM bytecode and runs on JVMs

Driver Program

◼ The life of a Spark application starts (and finishes) with the Spark driver.
◼ The driver is also responsible for planning and coordinating the execution of the
Spark program and returning status and/or results (data) to the client.
◼ When application code is submitted, the driver implicitly converts the user code that
contains transformations and actions into a logical DAG.
◼ It creates the Spark context, which is the gateway to all Spark functionalities.
◼ It talks to the cluster manager and negotiates the resources.
◼ It sends tasks to the executors based on data placement.
◼ When executors start, they register themselves with the driver, so the driver has a
complete view of the executors that are executing the tasks.
◼ During execution of the tasks, the driver monitors the set of executors that run them.
◼ The driver schedules future tasks based on data placement.
Driver & Cluster Manager

Spark Context

◼ Is the gateway to all Spark functionalities.
◼ Anything you do on Spark goes through the Spark context.
◼ Together with the driver, it takes care of job execution within the cluster.
◼ It breaks each job into tasks and distributes them to the worker nodes.
◼ It works with the cluster manager to manage the various jobs.

How to Create Spark Context Class

◼ To create a SparkContext, a SparkConf object should be created first.
◼ The SparkConf holds the configuration parameters that our Spark driver application will
pass to the SparkContext.
◼ Some of these parameters define properties of the Spark driver application, while others
are used by Spark to allocate resources on the cluster, such as the number of executors
and the memory size and cores used by the executors running on the worker nodes.
◼ In short, it specifies how to access the Spark cluster.
◼ After the creation of a SparkContext object, we can invoke functions such as
textFile, sequenceFile, parallelize, etc., as in the sketch below.
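
A minimal sketch of this for a standalone PySpark script; the application name, master URL, and memory setting are illustrative placeholders:

from pyspark import SparkConf, SparkContext

# Configuration that the driver application passes to the SparkContext.
conf = (SparkConf()
        .setAppName("MyFirstSparkApp")        # illustrative application name
        .setMaster("local[*]")                # placeholder: use the real cluster URL in practice
        .set("spark.executor.memory", "2g"))  # example resource setting

sc = SparkContext(conf=conf)

# With the context in place, the functions mentioned above become available.
rdd = sc.parallelize([1, 2, 3, 4, 5])
lines = sc.textFile("data.txt")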

Worker Nodes

◼ Worker nodes are the slave nodes whose job is basically to execute the tasks.
◼ The Spark context takes a job, breaks it into tasks, and distributes them to the
worker nodes.
◼ They work on the partitioned RDDs, perform operations, collect the results, and return
them to the main Spark context.
◼ As the number of workers increases, the total memory also increases, and more data
can be cached so that jobs execute faster.

Spark Architecture Workflow

◼ STEP 1: The client submits the Spark user application code. When the application code is
submitted, the driver implicitly converts user code that contains transformations
and actions into a logical directed acyclic graph (DAG).
◼ STEP 2: The driver converts the DAG into a physical execution plan with many stages. After
converting it into a physical execution plan, it creates physical execution units called
tasks under each stage. The tasks are then bundled and sent to the cluster.
◼ STEP 3: The driver talks to the cluster manager and negotiates the resources. The
cluster manager launches executors on the worker nodes on behalf of the driver. At this
point, the driver sends the tasks to the executors based on data placement.
When executors start, they register themselves with the driver, so the driver has
a complete view of the executors that are executing the tasks.
◼ STEP 4: During execution of the tasks, the driver program monitors the set of executors
that run them. The driver also schedules future tasks based on data placement.

Spark Eco-System

Spark Supported Languages

Spark Cluster Managers

Spark Eco-System

Spark SQL

◼ Used for processing structured and semi-structured data.
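
A minimal sketch, assuming Spark 2.x or later where a SparkSession (here named spark) is the entry point for Spark SQL; people.json is a hypothetical input file whose records are assumed to contain name and age fields:

from pyspark.sql import SparkSession

# Entry point for Spark SQL; builds on top of Spark Core.
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Load semi-structured JSON data (hypothetical file) into a DataFrame.
people = spark.read.json("people.json")

# Register the DataFrame as a temporary view and query it with SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()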

Spark Streaming

◼ Processes streaming data (see the sketch below).
◼ Examples: Twitter tweets, real-time video camera input, credit-card transaction
evaluation, and product or service recommendations.
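
A minimal sketch using the classic DStream word count (assuming the SparkContext sc; the host and port are illustrative placeholders for a socket text source):

from pyspark.streaming import StreamingContext

# Micro-batch streaming context with a 1-second batch interval.
ssc = StreamingContext(sc, 1)

# Hypothetical source: lines of text arriving on a TCP socket.
lines = ssc.socketTextStream("localhost", 9999)

# Count the words in each 1-second batch and print the results.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()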

Spark Graphx

◼ For processing and analyzing associations, connections, networks, and so on.
◼ Examples: social networks, computer networks

Companies Who Use Spark Streaming

Some Spark Applications

Some Companies Who Use Spark

Refs

◼ Learning Spark: http://index-of.co.uk/Big-Data-Technologies/Learning%20Spark%20%20Lightning-Fast%20Big%20Data%20Analysis%20.pdf

◼ Spark for Dummies: https://www.ibm.com/downloads/cas/WEB4XBOR

◼ https://spark.apache.org/docs/latest/rdd-programming-guide.html

END
