
Spark

Afef Bahri
Course: ESC Tunis

1 23/12/2023
Plan
 Motivation
 From MapReduce to Spark

 Spark use cases

 Spark unified stack


 Spark core, Spark API

 Spark vs Hadoop ecosystem

 RDD

 Actions execution

 Spark Cluster
2 23/12/2023
Brief history of Spark
 Timeline
 2002: MapReduce @ Google
 2004: MapReduce paper
 2006: Hadoop @ Yahoo
 2008: Hadoop Summit
 2010: Spark paper
 2014: Apache Spark top-level
 MapReduce established a general-purpose batch processing paradigm,
but it had its limitations:
 programming directly in MapReduce is difficult
 batch processing did not fit many use cases

3 23/12/2023
Motivations
 Explosion of data: social media such as Twitter feeds,
Facebook posts, SMS

 The need to be able to process these data:

 Ex. How can you find out what your customers want and be
able to offer it to them right away?

 Response time: a batch job may take too long to complete

4 23/12/2023
MapReduce to Spark
 Time: job run times are not acceptable in many situations

 Writing a MapReduce job is difficult, as it
takes specific programming knowledge and expertise

 MapReduce fits only a specific set of use cases

Apache Spark: fast, general-purpose, and easy to use

5 23/12/2023
Spark
 Apache Spark [1] is an in-memory distributed computing
platform designed for large-scale data processing

 Spark was originally developed at UC Berkeley in 2009 and


currently is one of the most active big-data Apache projects

 It can be considered as a main-memory extension of the


MapReduce model

 Enables parallel computations on commodity machines with locality-
aware scheduling, fault tolerance, and load balancing

6 23/12/2023
Spark
 Speed
 Because of Spark's in-memory implementation, it can be up to 100 times
faster than Hadoop [3]

 Generality: covers a wide range of workloads on one system


 Batch applications (e.g. MapReduce)
 Iterative algorithms
 Interactive queries and streaming

 Ease of use (simple APIs)


 APIs for Scala, Python, Java, R
 Libraries for SQL, machine learning, streaming, and graph processing
 Runs on Hadoop clusters (Hadoop YARN), on Apache Mesos, or as a
standalone deployment
7 23/12/2023
Ease of use
 To implement the classic wordcount in Java MapReduce,
you need three classes: the main class that sets up the job,
a Mapper, and a Reducer, each about 10 lines long
 For the same wordcount program, written in Scala for
Spark:
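A sketch of what that Scala program typically looks like (illustrative only; the input and output paths are placeholders and an existing SparkContext sc is assumed):

  // read the file, split each line into words, count each word
  val counts = sc.textFile("hdfs://.../input.txt")
                 .flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs://.../output")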

8 23/12/2023
Spark
 Expands on Hadoop's capabilities

 Like MapReduce: Parallel distributed processing, fault


tolerance on commodity hardware, scalability

 Spark adds: in-memory distributed computing, low
latency, and high-level APIs

9 23/12/2023
Spark
 Data scientists
 Analyze and model the data to obtain insight using ad-hoc analysis
 Transform the data into a usable format
 SQL, statistics, machine learning, using Python, MATLAB, or R
 Data engineers
 Develop a data processing system or application
 Inspect, monitor, and tune their applications
 Program with Spark's API
 Everyone else
 Ease of use
 Wide variety of functionality
 Mature and reliable
10 23/12/2023
Spark use cases
 Data Streaming

 Machine Learning

 Collaborative Filtering

 Interactive Analysis

11 23/12/2023
Spark unified stack
 The Spark core is at the center of the Spark Unified Stack

All the higher layer components will inherit the improvements


made at the lower layers

12 23/12/2023
Spark Unified Stack
 Various add-in components can run on top of the core

 The benefit of the Spark Unified Stack is that all the


higher layer components will inherit the improvements
made at the lower layers

 Example: an optimization to the Spark core will speed up
the SQL, streaming, machine learning, and graph
processing components

13 23/12/2023
SPARK Core
 General-purpose system for scheduling, distributing, and
monitoring applications across a cluster

 Scalability: the Spark core is designed to scale up from one


to thousands of nodes

 Cluster managers: Hadoop YARN, Apache Mesos, or
standalone mode with its own built-in scheduler

 Spark Core contains basic Spark functionalities required for


running jobs (and needed by other components): RDD and
DataFrames
14 23/12/2023
Spark core: RDD
 Spark is more efficient than Hadoop for applications
requiring frequent reuse of working data sets across
multiple parallel operations
 This level of efficiency is due to the main data abstraction
that Spark provides:

 RDD (Resilient Distributed Dataset): an immutable, distributed,
fault-tolerant in-memory data abstraction that enables one to
perform in-memory computations (transformations and actions)
 whereas Hadoop is mainly disk-based

15 23/12/2023
Spark Unified Stack: High level
 On top of RDD and DataFrames, Spark proposes two
higher-level data access models for processing semi-
structured data in general

 Spark SQL is designed to work with Spark via SQL and
HiveQL (Hive's variant of SQL)

 SQL can be combined with Spark's supported programming languages: Python,
Scala, Java, and R

 Spark Streaming provides processing of live streams of


data
16 23/12/2023
Spark Unified Stack: High level
 MLlib is the machine learning library that provides
multiple types of machine learning algorithms designed to scale out
 Supported algorithms:
 logistic regression
 naive Bayes classification,
 SVM
 decision trees
 random forests
 linear regression
 k-means clustering
 and others

17 23/12/2023
Spark Unified Stack: High level
 GraphX is a graph processing library with APIs to manipulate
graphs and perform graph-parallel computations

 GraphX provides functions for building graphs

 It implements the most important algorithms of
graph theory, such as:

 page rank
 connected components
 shortest paths, and others

18 23/12/2023
Spark VS Hadoop ecosystem
Hadoop Ecosystem                 Spark
Apache Storm                     Spark Streaming
Apache Giraph                    Spark GraphX
Apache Mahout                    Spark MLlib
Apache Pig, Apache Sqoop         Spark Core and Spark SQL (Spork)

Apache Pig and Apache Sqoop are not really
needed anymore, as the same functionalities are
covered by Spark Core and Spark SQL

“The Spork project enables you to run Pig on Spark”

19 23/12/2023
RDD

20 23/12/2023
Need for RDD
 Data reuse is common in many iterative machine learning
and graph algorithms

 including PageRank, K-means clustering, and logistic


regression

 Interactive data mining, where a user runs multiple ad-hoc


queries on the same subset of the data

21 23/12/2023
Need for RDD
 In most frameworks, the only way to reuse data between
computations (e.g., between two MapReduce jobs) is to:

 Write it to an external stable storage system (e.g.,


HDFS)
 This incurs substantial overheads due to
 data replication
 disk I/O
 and serialization

which can dominate application execution times

22 23/12/2023
Need for RDD
 In a distributed computing system, data is stored in an
intermediate stable distributed store such as HDFS
 Job computation is slower since it involves many I/O
operations, replications, and serializations in the process

23 23/12/2023
Need for RDD
 RDDs solve these problems by enabling fault-tolerant,
distributed, in-memory computations

24 23/12/2023
SPARK RDD
 RDD stands for “Resilient Distributed Dataset”. It is the
fundamental data structure of Apache Spark

 Resilient, i.e. fault-tolerant with the help of the RDD lineage
graph (DAG), and so able to recompute missing or damaged
partitions after node failures

 Distributed, since data resides on multiple nodes

 Dataset represents the records of the data you work with. The user
can load the dataset from an external source, such as a JSON file,
CSV file, text file, or a database via JDBC, with no specific data
structure required
25 23/12/2023
SPARK Core: RDD
 Spark's primary abstraction: Distributed collection of
elements, parallelized across the cluster

 RDD is an abstraction of a distributed collection of items with


operations and transformations applicable to the dataset

 It is resilient because it is capable of rebuilding datasets in case


of node failures

 Two types of RDD operations: Transformations and actions

26 23/12/2023
RDD
 RDD: Resilient Distributed Dataset
 Operations (tasks; a program is a set of tasks):
 Transformations
 Lazy evaluation: create a new RDD and execute nothing
 Input: an RDD
 Output: an RDD that points to the previous RDD
 Example: map
 Actions:
 Trigger the execution of all the preceding operations
(the transformations defined in the DAG) and of the action itself: lineage
 Input: an RDD
 Output: not an RDD, but a value
 Storage: external storage
 Example: reduce
27 23/12/2023
DAG
 A program: a set of tasks/operations
(transformations, actions)
 Transformations: map, filter, join
 Actions: reduce, count
 Executing a program:
 Spark creates a DAG: a graph whose nodes are RDDs and whose
edges are transformations
 Execution is triggered by an action

28 23/12/2023
Transformation RDD
 A transformation is a function that produces a new RDD from existing RDDs

[Diagram: a transformation (map, filter, …) takes an immutable RDD and
produces a new RDD that points to its parent]

 It takes an RDD as input and produces one or more RDDs as output

 A new RDD is created each time we apply a transformation

 Thus the input RDDs cannot be changed, since RDDs are immutable in
nature
29 23/12/2023
Transformation RDD
 Example: applying a map transformation (a sketch follows below)

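A minimal illustration, assuming an existing SparkContext sc (the path and RDD names are hypothetical):

  // textFile is itself a transformation: it just defines an RDD
  val lines = sc.textFile("hdfs://.../data.txt")
  // map defines a new RDD that points to its parent; nothing runs yet
  val upper = lines.map(line => line.toUpperCase)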
30 23/12/2023
RDD transformations
 Transformations do not return a computed value

 In fact, nothing is evaluated during the definition of


transformation statements

 Spark just creates the definition of a transformation that will


be evaluated later at runtime

 This is called lazy evaluation

 The transformations are stored as a directed acyclic graph
(DAG)
31 23/12/2023
RDD transformations

 Nothing is executed until an action is called

 Each transformation function basically updates the graph


and when an action is called, the graph is executed

 Transformation returns a pointer to the new RDD

Lazy evaluation
32 23/12/2023
Action RDD
 Transformations create RDDs from each other

 Actions are Spark RDD operations that give non-RDD


values

 The values produced by actions are returned to the driver or stored in an external
storage system

[Diagram: an action (count, reduce, …) turns an RDD into a value]

33 23/12/2023
RDD lineage
 Applying transformations builds an RDD lineage that records all the
parent RDDs of the final RDD(s)

 The RDD lineage is also known as the RDD operator graph or RDD
dependency graph. It is a logical execution plan, i.e., a directed
acyclic graph (DAG) of all the parent RDDs of an RDD
34 23/12/2023
Example
 Dataset: transformed into an RDD

 map is a transformation, so it is not executed; we just
create another RDD

[Diagram: map takes an RDD and creates a new RDD that points to its parent]

35 23/12/2023
RDD operations: Transformations
 These are some of the transformations available
 Transformations are lazily evaluated
 Each returns a pointer to the transformed RDD

36 23/12/2023
flatMap
 The flatMap function is similar to map, but each input
can be mapped to 0 or more output items

 It returns a sequence of objects rather than a single item

 Given a text file:

 Each time a line is read in, you split that line up by spaces to
get individual keywords

 Each of those lines is ultimately flattened so that you can perform
the map operation on it, mapping each keyword to the value of
one
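A small sketch of that idea (illustrative; assumes an existing SparkContext sc and a hypothetical file path):

  val lines = sc.textFile("hdfs://.../input.txt")
  // flatMap: one line of text becomes zero or more words
  val words = lines.flatMap(line => line.split(" "))
  // map: each word becomes a (word, 1) pair
  val pairs = words.map(word => (word, 1))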
37 23/12/2023
Join and reduceByKey
 The join function combines two sets of key-value pairs
and returns, for each key, a pair of values taken from the two
initial sets
 Example:
 you have (K, V) pairs and (K, W) pairs
 When you join them together, you get a set of (K, (V, W)) pairs

 The reduceByKey function aggregates on each key by


using the given reduce function
 Example: use in a WordCount to sum up the values for each
word to count its occurrences
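A hedged illustration of both functions (the RDD contents are made up; an existing SparkContext sc is assumed):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))
  // join: ("a", (1, "x")), ("a", (3, "x")), ("b", (2, "y"))
  val joined = pairs.join(other)
  // reduceByKey: ("a", 4), ("b", 2)
  val sums = pairs.reduceByKey((v1, v2) => v1 + v2)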

38 23/12/2023
RDD operations: Actions
 Actions return values

39 23/12/2023
RDD operations: Actions
 collect: returns all the elements of the dataset as an array to
the driver program

 count: returns the number of elements in a dataset and can


also be used to check and test transformations

 take(n): returns an array with the first n elements. Note that


this is currently not executed in parallel. The driver
computes all the elements

 foreach(func): runs a function func on each element
of the dataset
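A brief sketch of these actions (assuming an RDD named nums built from a small in-memory collection):

  val nums = sc.parallelize(1 to 5)
  val all = nums.collect()        // Array(1, 2, 3, 4, 5), returned to the driver
  val n = nums.count()            // 5
  val firstTwo = nums.take(2)     // Array(1, 2)
  nums.foreach(x => println(x))   // runs on the executors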
40 23/12/2023
Transformation and action
execution: in memory

41 23/12/2023
RDD basic operations
 Loading a file

 val lines = sc.textFile("hdfs://data.txt")

 Loading the file creates an RDD, which is only a pointer to
the file

 The dataset is not loaded into memory yet

 Nothing will happen until some action is called

42 23/12/2023
RDD basic operations
 Applying transformation
 val lineLengths = lines.map(s => s.length)

 The transformation: map each line - ‘s’ - to the length of


that line

 The transformation basically updates the directed acyclic
graph (DAG)

43 23/12/2023
RDD basic operations
 Invoking action
 val totalLengths = lineLengths.reduce((a,b) => a + b)

 The action operation is reducing it to get the total length of


all the lines

 When the action is called, Spark goes through the DAG
and applies all the transformations up until that point,
followed by the action, and then a value is returned back to
the caller

44 23/12/2023
RDD basic operations
 View the DAG: display the series of transformations that produced an RDD with
 lineLengths.toDebugString
 Read the DAG from the bottom up
 Example: the following DAG starts as a textFile and goes
through a series of transformations such as map and filter,
followed by more map operations
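A quick sketch of inspecting a lineage this way (the chain of transformations is illustrative; the exact output format of toDebugString varies across Spark versions):

  val lines = sc.textFile("hdfs://.../data.txt")
  val lineLengths = lines.map(s => s.length).filter(len => len > 0)
  // prints the chain of RDDs, from the final RDD back down to the textFile
  println(lineLengths.toDebugString)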

45 23/12/2023
What happens when an action is executed? (1/8)
 Example: code to analyze
some log files
 In the first line you load the log
from the Hadoop file system
 In the next two lines you filter
the error messages out of the log
 You then tell Spark to cache the
filtered dataset
 More filters follow to get the specific
error messages relating to
mysql and php, followed by the
count action (the number of errors);
a sketch of such a program is given below
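A minimal sketch of the log-mining program described above (illustrative; the path, the "ERROR" prefix, and the field layout are assumptions):

  val lines = sc.textFile("hdfs://.../log.txt")
  val errors = lines.filter(line => line.startsWith("ERROR"))
  val messages = errors.map(line => line.split("\t")(1))
  messages.cache()                                               // keep the filtered data in memory
  val mysqlCount = messages.filter(_.contains("mysql")).count()  // action 1
  val phpCount   = messages.filter(_.contains("php")).count()    // action 2, served from the cache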
46 23/12/2023
 RDD log (1)
 Apply transformations: filter -> RDD messages (2)

 Load the RDD messages into the in-memory cache

 Action 1: count() of the "mysql" messages

 When an action is executed, the whole DAG is executed and the final
output is stored on disk (external storage)

 Action 2: count() of the "php" messages


47 23/12/2023
What happens when an action is executed? (2/8)

Load the text file: the data is partitioned into


different blocks across the cluster

48 23/12/2023
What happens when an action is executed? (3/8)

The driver sends the code to be


executed on each block

49 23/12/2023
What happens when an action is executed? (4/8)
 The executors read the HDFS blocks to prepare the data
for the operations in parallel

50 23/12/2023
What happens when an action is executed? (5/8)
 After a series of transformations, you can cache the results
up until that point into memory
 A cache is created

51 23/12/2023
What happens when an action is executed? (6/8)
 After the first action completes, the results are sent back to
the driver

52 23/12/2023
What happens when an action is executed? (7/8)
 To process the second action, Spark will use the data on
the cache; it does not need to go to the HDFS data again

53 23/12/2023
What happens when an action is executed? (8/8)
 Finally the results are sent back to the driver and you have
completed a full cycle

54 23/12/2023
Spark Cluster overview

55 23/12/2023
Spark cluster overview
 Components:
 Driver
 Cluster Manager
 Executors
[Figure: example of a cluster configuration]

56 23/12/2023
Spark cluster overview
 There are three main components of a Spark cluster

 The driver, where the SparkContext is located within the main


program

 Cluster manager: this could be Spark's standalone
cluster manager, Mesos, or YARN

 Worker nodes where the executors reside

 The executors are the processes that run computations and store
the data for the application
57 23/12/2023
Spark context
 SparkContext is the entry point of Spark functionality.

 The most important step of any Spark driver application is


to generate SparkContext

 It allows the Spark application to access the Spark cluster with the
help of the resource manager

 The SparkContext sends the application, defined as JAR or


Python files to each executor

 Finally, it sends the tasks for each executor to run


58 23/12/2023
Cluster manager
 There are currently three supported cluster managers

 The Spark standalone manager, which we can use to get up and
running quickly

 We can use Apache Mesos, a general cluster manager that can


run and service Hadoop jobs

 We can also use Hadoop YARN, the resource manager in


Hadoop

59 23/12/2023
Driver
 Where the SparkContext is located within the main
program

 The driver program schedules tasks on the cluster - it


should run close to the worker nodes on the same local
network

 If you would like to send remote requests to the cluster, it is
better to use an RPC and have it submit operations from
nearby

60 23/12/2023
Executor
 The executors are the processes that run computations and
store the data for the application

 The SparkContext sends the application, defined as JAR


or Python files to each executor

 Finally, it sends the tasks for each executor to run

61 23/12/2023
Executor
 Each application gets its own executor processes
 The executor stays up for the entire duration that the application
is running

 The benefit of this is that the applications are isolated from each
other, both on the scheduling side and by running in different
JVMs

 However, this means that we cannot share data across
applications

 You would need to externalize the data if you wish to share data
between different applications (i.e., different instances of SparkContext)
62 23/12/2023
Programming with Spark

63 23/12/2023
Programming with Spark
 Spark with interactive shells: (TP3, Shell)

 spark-shell (for Scala)


 pyspark (for Python)

 Programming with Spark with:


 Scala
 Python
 Java

64 23/12/2023
Initializing spark: scala/java/python
 Build a SparkConf object that contains information about
your application

 Scala

 Java

 Python

 The appName parameter: a name for your application, shown in the
cluster UI
 The master parameter: a Spark, Mesos, or YARN cluster URL (or a
special "local" string to run in local mode)
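A minimal Scala sketch of such a configuration (the application name and master URL are placeholders):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("MyApp")       // shown in the cluster UI
    .setMaster("local[*]")     // or a Spark/Mesos/YARN cluster URL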
65 23/12/2023
Initializing spark: scala/java/python
 Then, you need to create the SparkContext object

Java

Scala

Python
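For example, in Scala (a sketch that reuses the conf object from the previous slide):

  import org.apache.spark.SparkContext

  val sc = new SparkContext(conf)
  // in Python: sc = SparkContext(conf=conf); in Java: new JavaSparkContext(conf)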

66 23/12/2023
Review

67 23/12/2023
 Data.csv is on disk
 Loading Data.csv creates an RDD corresponding to
Data.csv, identified as RDD1
 Executing the process means running a set of tasks:
 map
 filter
 count
 Two types of tasks: transformations and actions
 Transformation: transforms RDD1 into RDD2 (lazy
evaluation)

RDD1 --map--> RDD2 --filter--> RDD3 --count--> result

 A DAG is created, then the DAG is executed
68 23/12/2023
 Data.csv is on disk
 Loading Data.csv creates an RDD corresponding to
Data.csv, identified as RDD1
 Executing the process means running a set of tasks:
 map
 filter
 count
 Transformation: transforms RDD1 into RDD2
 map(RDD1) -> RDD2
 filter(RDD2) -> RDD3
 count(RDD3) -> output: the action triggers the execution of
all the tasks (map, filter, count)
 Lazy evaluation (transformations)
69 23/12/2023
 DAG: execution graph, a set of RDDs
 The map and filter tasks are not executed (lazy evaluation)
 count() triggers the execution of the whole graph

[Diagram: Dataset -> RDD --map--> RDD --filter--> RDD --count--> external storage]

(Slide from the online course)
70 23/12/2023
Example: MapReduce Job With Hadoop
 MapReduce job: map, reduce, map, reduce (with disk between stages)

[Diagram: Dataset -> map1 -> reduce1 -> map2 -> reduce2, where Hadoop reads and
writes intermediate results to local files / HDFS (replicated, partitioned)
between each stage]

(Slide from the online course)

71 23/12/2023
Example: mapReduce job with Spark
 MapReduce job with Spark: map, reduce, map, reduce (in memory)

[Diagram: Dataset -> RDD --map--> new RDD (pointing to its parent) --reduce-->
output, with intermediate RDDs kept in memory rather than written to disk]

(Slide from the online course)

72 23/12/2023
Appendix

73 23/12/2023
Compatible versions of software
 Scala:
 Spark 1.6.3 uses Scala 2.10
 Spark 2.1.1 is built and distributed to work with Scala 2.11 by default
 To write applications in Scala, you will need to use a compatible Scala
version (e.g. 2.10.X)

 Python:
 Spark 1.x works with Python 2.6 or higher (but not yet with Python 3)

 Java:

 Spark 1.x works with Java 6 and higher - and Java 8 supports lambda
expressions
74 23/12/2023
Linking with Spark
 Scala

 Java

 Python

75 23/12/2023
Initializing spark: Spark Properties
 SparkConf: allows you to configure some of the common properties
 master URL
 application name

 Example (in Scala): initialize an application with two
threads as follows:

 Run with local[2], meaning two threads - which represents


“minimal” parallelism

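A hedged sketch of that configuration (the application name is a placeholder):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setMaster("local[2]")          // two local threads: "minimal" parallelism
    .setAppName("CountingSheep")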
76 23/12/2023
Initializing spark: scala/java/python
 Build a SparkConf object that contains information about
your application

 Scala

 Java

 Python

 The appName parameter: a name for your application, shown in the
cluster UI
 The master parameter: a Spark, Mesos, or YARN cluster URL (or a
special "local" string to run in local mode)
77 23/12/2023
Initializing spark: scala/java/python
 Then, you need to create the SparkContext object

Java

Scala

Python

78 23/12/2023
Resilient Distributed Datasets (RDDs)
 There are two ways to create RDDs:

1. Parallelizing an existing collection in your driver program

 Once created, the distributed dataset (distData) can be operated
on in parallel

distData.reduce((a, b) => a + b)
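The parallelize call that produces distData looks roughly like this (a sketch; the collection is illustrative):

  val data = Array(1, 2, 3, 4, 5)
  val distData = sc.parallelize(data)    // distribute the local collection across the cluster
  distData.reduce((a, b) => a + b)        // 15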

79 23/12/2023
Resilient Distributed Datasets (RDDs)
 There are two ways to create RDDs:

2. Referencing a dataset in an external storage system, such as a


shared filesystem, HDFS, HBase, or any data source offering
a Hadoop InputFormat

 Once created, distFile can be acted on by dataset operations

distFile.map(s => s.length).reduce((a, b) => a + b)
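The corresponding creation of distFile might look like this (the file name follows the usual documentation example; any HDFS or local path works):

  val distFile = sc.textFile("data.txt")                 // one record per line
  distFile.map(s => s.length).reduce((a, b) => a + b)    // sum of the line lengths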

80 23/12/2023
RDD operations with scala
 Define RDD

 Defines lineLengths as the result of a map transformation

lineLengths is not immediately computed, due to laziness

 Run reduce, which is an action

 At this point Spark breaks the computation into tasks to run on separate machines

 Each machine runs both its part of the map and a local reduction, returning only
its answer to the driver program

81 23/12/2023
Run spark application: dependencies
1. Define the dependencies using any build system
(Ant, SBT, Maven, Gradle)

 Example of Build files:


 Scala: simple.sbt
 Java: pom.xml

 Create the typical directory structure with the files

82 23/12/2023
Run spark application: dependencies
 pom.xml

83 23/12/2023
Run spark application: dependencies
 Create a zip file (e.g., abc.zip) containing all your
dependencies, and mention the zip file name when creating the
Spark context

 Python: if you need to ship dependencies, use the --py-files
argument

84 23/12/2023
Run spark application: dependencies
2. Create a JAR package containing the application's code

3. Use spark-submit to run the program

 Scala: sbt
 Java (Maven): mvn
 Python: spark-submit

85 23/12/2023
RDD Features

86 23/12/2023
Spark RDD features
 In-memory Computation
 Spark RDDs have a provision of in-memory computation

 It stores intermediate results in memory (RAM) instead of stable
storage (disk)

 Lazy Evaluations
 All transformations in Apache Spark are lazy, in that they do
not compute their results right away

 Instead, they just remember the transformations applied to


some base data set
87 23/12/2023
Spark RDD features
 Fault Tolerance
 Spark RDDs are fault tolerant as they track data lineage
information to rebuild lost data automatically on failure

 They rebuild lost data on failure using lineage

 Each RDD remembers how it was created from other datasets


(by transformations like a map, join or groupBy) to recreate
itself

88 23/12/2023
Spark RDD features
 Immutability
 Data is safe to share across processes
 It can also be created or retrieved anytime which makes
caching, sharing & replication easy
 Thus, it is a way to reach consistency in computations
 Partitioning
 Partitioning is the fundamental unit of parallelism in Spark
RDD
 Each partition is one logical division of data which is mutable
 One can create a partition through some transformations on
existing partitions

89 23/12/2023
Spark RDD features
 Persistence
 Users can state which RDDs they will reuse and choose a
storage strategy for them (e.g., in-memory storage or on Disk)

 Coarse-grained Operations
 Operations apply to all elements in the dataset, for example through
 map
 filter
 groupBy

90 23/12/2023
Spark RDD features
 Location-Stickiness
 RDDs are capable of defining placement preferences for computing
their partitions

 A placement preference is information about the location of an
RDD

 The DAGScheduler places the partitions in such a way that each task
is as close to its data as possible

 This speeds up computation

91 23/12/2023
How RDDs are represented
 RDDs are made up of 4 parts:
 Partitions: Atomic pieces of the dataset. One or many per
compute node

 Dependencies: model the relationship between this RDD and its
partitions and the RDD(s) it was derived from

 A function for computing the dataset based on its parent RDDs

 Metadata about its partitioning scheme and data placement

92 23/12/2023
RDD: logical vs physical partition
 Every dataset in RDD is logically partitioned across many
servers so that they can be computed on different nodes of
the cluster

 RDDs are fault tolerant, i.e., they possess self-recovery in the
case of failure

93 23/12/2023
RDD persistence

94 23/12/2023
RDD persistence: caching
 One of the key capabilities of Spark is the speed it gains through
persisting or caching

 Each node stores the partitions it caches and computes them in
memory

 When a subsequent action is called on the same dataset, or a
derived dataset, it is served from memory instead of having to be
retrieved again

 Future actions in such cases are often 10 times faster

 The first time an RDD is persisted, it is kept in memory on the node

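A tiny sketch of caching in action (the dataset and path are made up):

  val words = sc.textFile("hdfs://.../book.txt").flatMap(_.split(" "))
  words.cache()            // mark the RDD as cached; partitions are stored on the first action
  words.count()            // first action: reads from HDFS and fills the cache
  words.count()            // second action: served from memory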

95 23/12/2023
RDD persistence: In memory RDD
 Spark keeps persistent RDDs in memory by default

 Caching is fault tolerant

 But it can spill them to disk if there is not enough RAM

 Users can also request other persistence strategies

96 23/12/2023
RDD persistence:
 Two methods for RDD persistence: persist(), cache()

 The cache() method uses the default storage level (memory only)

 The persist() method allows you to specify a different
storage level for caching

 For example, persist the data set on disk, persist it in memory


but as serialized objects to save space, etc.
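For instance, a hedged sketch of both calls (rddA, rddB, and rddC are hypothetical RDDs):

  import org.apache.spark.storage.StorageLevel

  // choose one storage level per RDD; a level cannot be changed once assigned
  rddA.cache()                                   // same as persist(StorageLevel.MEMORY_ONLY)
  rddB.persist(StorageLevel.MEMORY_ONLY_SER)     // in memory, as serialized objects, to save space
  rddC.persist(StorageLevel.DISK_ONLY)           // on disk only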

97 23/12/2023
RDD persistence strategies

98 23/12/2023
Best practices for which storage level to
choose
 MEMORY_ONLY: the default storage level, and usually the best choice
 The most CPU-efficient option
 Allows operations on the RDDs to run as fast as possible
 It is the fastest option and fully takes advantage of Spark's design

 MEMORY_ONLY_SER:
 Use it with a fast serialization library to make objects more
space-efficient
 Is reasonably fast to access

99 23/12/2023
Best practices for which storage level to
choose
 DISK_ONLY ?
 Use disk-only storage only when
 the functions that computed your datasets are expensive, or
 they filter out a large amount of the data
 Otherwise, recomputing a partition may be as fast as reading it from disk

100 23/12/2023
Best practices for which storage level to
choose
 The experimental OFF_HEAP mode has several
advantages:

 In environments with high amounts of memory or multiple


applications

 Allows multiple executors to share the same pool of memory

 Reduces garbage collection costs (automatic memory


management)

 Cached data is not lost if individual executors crash


101 23/12/2023
OFF_HEAP

102 23/12/2023
RDD Dependencies

103 23/12/2023
Dependencies
 Transformations can have two kinds of dependencies:

 Narrow dependencies: Each partition of the parent RDD is used


by at most one partition of the child RDD

 No shuffle necessary

 Optimizations like pipelining possible

 Thus transformations which have narrow dependencies are fast

104 23/12/2023
Dependencies
 Wide dependencies: the elements that are required to
compute the records in a single partition of the child RDD may live in
many partitions of the parent RDD

 Shuffle necessary for all or some data

 Thus transformations which have wide dependencies are slow

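A short contrast of the two cases (a sketch; the pair RDD is made up):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  // narrow dependency: each output partition depends on one input partition, no shuffle
  val doubled = pairs.mapValues(v => v * 2)
  // wide dependency: values for a key may come from many partitions, so a shuffle is needed
  val grouped = pairs.groupByKey()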
105 23/12/2023
Narrow dependencies Vs. Wide
dependencies

106 23/12/2023
Narrow dependencies
 Transformations with (usually) narrow dependencies:
 map
 mapValues
 flatMap
 filter
 mapPartitions
 mapPartitionsWithIndex
 Narrow dependency objects
 OneToOneDependency
 PruneDependency
 RangeDependency

107 23/12/2023
Wide dependencies
 Transformations with (usually) Wide dependencies: (might cause a shuffle):
 cogroup
 groupWith
 join
 leftOuterJoin
 rightOuterJoin
 groupByKey
 reduceByKey
 combineByKey
 distinct
 intersection
 repartition
 coalesce

 Wide dependency objects


 ShuffleDependency
108 23/12/2023
RDD
 RDDs provide a low-level API that gives great control
over the dataset

 It lacks schema control, but gives greater flexibility
when it comes to storage and partitioning, as it gives the
choice of implementing a custom partitioner

109 23/12/2023
Spark functions

110 23/12/2023
Spark libraries
 Extensions of the core Spark API
 Improvements made to the core are passed to these libraries
 Little overhead to use with the Spark core

111 23/12/2023
Spark SQL
 Allows relational queries expressed in
 SQL
 HiveQL
 Scala
 SchemaRDD
 Row objects
 Schema
 Created from:
 Existing RDD
 Parquet file
 JSON dataset
 HiveQL against Apache Hive
 Supports Scala, Java, R, and Python
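A hedged, Spark 1.x-style sketch of a relational query (the JSON file and table name are hypothetical):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  val people = sqlContext.jsonFile("people.json")   // infers the schema from the JSON records
  people.registerTempTable("people")
  val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")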
112 23/12/2023
Spark Streaming
 Scalable, high-throughput, fault-tolerant stream processing of live data
streams
 Receives live input data and divides it into small batches, which are processed
and returned as batches
 DStream: a sequence of RDDs
 Currently supports Scala, Java, and Python

• Receives data from: Kafka, Flume, HDFS / S3, Kinesis, Twitter


• Pushes data out to: HDFS, Databases, Dashboard

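A minimal sketch of a streaming word count (illustrative; assumes a SparkConf named conf and a text source on localhost:9999):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(conf, Seconds(1))          // 1-second batches
  val lines = ssc.socketTextStream("localhost", 9999)       // DStream of lines
  val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.print()
  ssc.start()
  ssc.awaitTermination()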
113 23/12/2023
Spark streaming internals
 The input stream (DStream) goes into Spark Streaming
 The data is broken up into batches that are fed into the Spark
engine for processing
 The final results are generated as a stream of batches

• Sliding window operations


• Windowed computations
• Window length
• Sliding interval
• reduceByKeyAndWindow

114 23/12/2023
MLlib
 MLlib is the machine learning library - under active
development
 It currently provides the following common algorithms and
utilities
 Classification
 Regression
 Clustering
 Collaborative filtering
 Dimensionality reduction

115 23/12/2023
GraphX
 GraphX for graph processing
 Graphs and graph parallel computation
 Social networks and language modeling

116 23/12/2023
Berkeley Data Analytics Stack (BDAS)

117 23/12/2023
Spark Monitoring

118 23/12/2023
Spark monitoring
 Three ways to monitor Spark applications
1. Web UI
 Port 4040 (lab exercise on port 8088)
 Available for the duration of the application

 The Web UI has the following information.


 A list of scheduler stages and tasks
 A summary of RDD sizes and memory usage
 Environmental information and information about the running
executors

119 23/12/2023
Spark monitoring
2. Metrics
 Based on the Coda Hale Metrics Library
 Report to a variety of sinks (HTTP, JMX, and CSV)
/conf/metrics.properties

3. External instrumentations
 Cluster-wide monitoring tool (Ganglia)
 OS profiling tools (dstat, iostat, iotop)
 JVM utilities (jstack, jmap, jstat, jconsole)

120 23/12/2023
Spark Monitoring Web UI

121 23/12/2023
Spark Monitoring web UI

122 23/12/2023
Spark Monitoring: Metrics

123 23/12/2023
Spark monitoring: Ganglia

124 23/12/2023
Conclusion
 Purpose of Apache Spark in the Hadoop ecosystem

 Architecture and components of the Spark unified stack

 Role of a Resilient Distributed Dataset (RDD)

 Principles of Spark programming

 List and describe the Spark libraries

 Launch and use Spark's Scala and Python shells


125 23/12/2023
