
Spark

Afef Bahri
Course: ESC Tunis

1 23/12/2023
Plan
 Motivation
 From MapReduce to Spark

 Spark use cases

 Spark unified stack


 Spark core, Spark API

 Spark vs Hadoop ecosystem

 RDD

 Actions execution

 Spark Cluster
2 23/12/2023
Brief history of Spark
 Timeline
 2002: MapReduce @ Google
 2004: MapReduce paper
 2006: Hadoop @ Yahoo
 2008: Hadoop Summit
 2010: Spark paper
 2014: Apache Spark top-level
 MapReduce established a general-purpose batch processing paradigm,
but it had its limitations:
 programming directly in MapReduce is difficult
 batch processing did not fit many use cases

3 23/12/2023
Motivations
 Explosion of data: social media such as Twitter feeds,
Facebook posts, SMS

 The need to be able to process these data:

 Ex. How can you find out what your customers want and be
able to offer it to them right away?

 Response time: a batch job may take too long to complete

4 23/12/2023
MapReduce to Spark
 Time: job run times are not acceptable in many situations

 Writing a MapReduce job is difficult, as it
takes specific programming knowledge and expertise

 MapReduce fits only a specific set of use cases

Apache Spark: fast, general-purpose, and easy to use

5 23/12/2023
Spark
 Apache Spark [1] is an in-memory distributed computing
platform designed for large-scale data processing

 Spark was originally developed at UC Berkeley in 2009 and


currently is one of the most active big-data Apache projects

 It can be considered as a main-memory extension of the


MapReduce model

 Enables parallel computations on commodity machines with locality-
aware scheduling, fault tolerance, and load balancing

6 23/12/2023
Spark
 Speed
 Because of Spark's in-memory implementation, it can be up to 100 times
faster than Hadoop [3]

 Generality: covers a wide range of workloads on one system


 Batch applications (e.g. MapReduce)
 Iterative algorithms
 Interactive queries and streaming

 Ease of use (simple APIs)


 APIs for Scala, Python, Java, R
 Libraries for SQL, machine learning, streaming, and graph processing
 Runs on Hadoop clusters (Hadoop YARN), on Apache Mesos, or as a
standalone deployment
7 23/12/2023
Ease of use
 To implement the classic wordcount in Java MapReduce,
you need three classes: the main class that sets up the job,
a Mapper, and a Reducer, each about 10 lines long
 For the same wordcount program, written in Scala for
Spark:
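A sketch of what that Scala program typically looks like (illustrative only; the input and output paths are placeholders and an existing SparkContext sc is assumed):

  // read the file, split each line into words, count each word
  val counts = sc.textFile("hdfs://.../input.txt")
                 .flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
  counts.saveAsTextFile("hdfs://.../output")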

8 23/12/2023
Spark
 Expands on Hadoop's capabilities

 Like MapReduce: Parallel distributed processing, fault


tolerance on commodity hardware, scalability

 Spark adds: in-memory distributed computing, low
latency, and high-level APIs

9 23/12/2023
Spark
 Data scientists
 Analyze and model the data to obtain insight using ad-hoc analysis
 Transform the data into a usable format
 SQL, statistics, machine learning, using Python, MATLAB, or R
 Data engineers
 Develop a data processing system or application
 Inspect, monitor, and tune their applications
 Program with Spark's API
 Everyone else
 Ease of use
 Wide variety of functionality
 Mature and reliable
10 23/12/2023
Spark use cases
 Data Streaming

 Machine Learning

 Collaborative Filtering

 Interactive Analysis

11 23/12/2023
Spark unified stack
 The Spark core is at the center of the Spark Unified Stack

All the higher layer components will inherit the improvements


made at the lower layers

12 23/12/2023
Spark Unified Stack
 Various add-in components can run on top of the core

 The benefit of the Spark Unified Stack is that all the


higher layer components will inherit the improvements
made at the lower layers

 Example: an optimization to the Spark core will speed up
the SQL, streaming, machine learning, and graph
processing components

13 23/12/2023
SPARK Core
 General-purpose system for scheduling, distributing, and
monitoring applications across a cluster

 Scalability: the Spark core is designed to scale up from one


to thousands of nodes

 Cluster managers: Hadoop YARN, Apache Mesos, or
standalone mode with its own built-in scheduler

 Spark Core contains basic Spark functionalities required for


running jobs (and needed by other components): RDD and
DataFrames
14 23/12/2023
Spark core: RDD
 Spark is more efficient than Hadoop for applications
requiring frequent reuse of working data sets across
multiple parallel operations
 This level of efficiency is due to the main data abstraction
that Spark provides:

 RDD (Resilient Distributed Dataset): an immutable, distributed,
fault-tolerant in-memory data abstraction that enables one to
perform in-memory computations (transformations and actions)
 whereas Hadoop is mainly disk-based

15 23/12/2023
Spark Unified Stack: High level
 On top of RDD and DataFrames, Spark proposes two
higher-level data access models for processing semi-
structured data in general

 Spark SQL is designed to work with Spark via SQL and
HiveQL (Hive's variant of SQL)

 SQL can be combined with Spark's supported programming languages: Python,
Scala, Java, and R

 Spark Streaming provides processing of live streams of


data
16 23/12/2023
Spark Unified Stack: High level
 MLlib is the machine learning library that provides
multiple types of machine learning algorithms designed to scale out
 Supported algorithms:
 logistic regression
 naive Bayes classification,
 SVM
 decision trees
 random forests
 linear regression
 k-means clustering
 and others

17 23/12/2023
Spark Unified Stack: High level
 GraphX is a graph processing library with APIs to manipulate
graphs and perform graph-parallel computations

 GraphX provides functions for building graphs

 It implements the most important algorithms of
graph theory, such as:

 page rank
 connected components
 shortest paths, and others

18 23/12/2023
Spark VS Hadoop ecosystem
Hadoop Ecosystem                 Spark
Apache Storm                     Spark Streaming
Apache Giraph                    Spark GraphX
Apache Mahout                    Spark MLlib
Apache Pig, Apache Sqoop         Spark Core and Spark SQL (Spork)

Apache Pig and Apache Sqoop are not really
needed anymore, as the same functionalities are
covered by Spark Core and Spark SQL

“The Spork project enables you to run Pig on Spark”

19 23/12/2023
RDD

20 23/12/2023
Need for RDD
 Data reuse is common in many iterative machine learning
and graph algorithms

 including PageRank, K-means clustering, and logistic


regression

 Interactive data mining, where a user runs multiple ad-hoc


queries on the same subset of the data

21 23/12/2023
Need for RDD
 In most frameworks, the only way to reuse data between
computations (e.g., between two MapReduce jobs) is to:

 Write it to an external stable storage system (e.g.,


HDFS)
 This incurs substantial overheads due to
 data replication
 disk I/O
 and serialization

which can dominate application execution times

22 23/12/2023
Need for RDD
 In a distributed computing system, data is stored in an
intermediate stable distributed store such as HDFS
 Job computation is slower since it involves many I/O
operations, replications, and serializations in the process

23 23/12/2023
Need for RDD
 RDDs solve these problems by enabling fault-tolerant,
distributed, in-memory computations

24 23/12/2023
SPARK RDD
 RDD stands for “Resilient Distributed Dataset”. It is the
fundamental data structure of Apache Spark

 Resilient, i.e. fault-tolerant with the help of the RDD lineage
graph (DAG), and so able to recompute missing or damaged
partitions after node failures

 Distributed, since data resides on multiple nodes

 Dataset represents the records of the data you work with. The user
can load the dataset from an external source, such as a JSON file,
CSV file, text file, or a database via JDBC, with no specific data
structure required
25 23/12/2023
SPARK Core: RDD
 Spark's primary abstraction: Distributed collection of
elements, parallelized across the cluster

 RDD is an abstraction of a distributed collection of items with


operations and transformations applicable to the dataset

 It is resilient because it is capable of rebuilding datasets in case


of node failures

 Two types of RDD operations: Transformations and actions

26 23/12/2023
RDD
 RDD: Resilient Distributed Dataset
 Operations (tasks; a program is a set of tasks):
 Transformations
 Lazy evaluation: create a new RDD and execute nothing
 Input: an RDD
 Output: an RDD that points to the previous RDD
 Example: map
 Actions:
 Trigger the execution of all the preceding operations
(the transformations defined in the DAG) and of the action itself: lineage
 Input: an RDD
 Output: not an RDD, but a value
 Storage: external storage
 Example: reduce
27 23/12/2023
DAG
 A program: a set of tasks/operations
(transformations, actions)
 Transformations: map, filter, join
 Actions: reduce, count
 Executing a program:
 Spark creates a DAG: a graph whose nodes are RDDs and whose
edges are transformations
 Execution is triggered by an action

28 23/12/2023
Transformation RDD
 A transformation is a function that produces a new RDD from existing RDDs

[Diagram: a transformation (map, filter, …) takes an immutable RDD and
produces a new RDD that points to its parent]

 It takes an RDD as input and produces one or more RDDs as output

 A new RDD is created each time we apply a transformation

 Thus the input RDDs cannot be changed, since RDDs are immutable in
nature
29 23/12/2023
Transformation RDD
 Example: applying a map transformation (a sketch follows below)

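A minimal illustration, assuming an existing SparkContext sc (the path and RDD names are hypothetical):

  // textFile is itself a transformation: it just defines an RDD
  val lines = sc.textFile("hdfs://.../data.txt")
  // map defines a new RDD that points to its parent; nothing runs yet
  val upper = lines.map(line => line.toUpperCase)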
30 23/12/2023
RDD transformations
 Transformations do not return a computed value

 In fact, nothing is evaluated during the definition of


transformation statements

 Spark just creates the definition of a transformation that will


be evaluated later at runtime

 This is called lazy evaluation

 The transformations are stored as a directed acyclic graph
(DAG)
31 23/12/2023
RDD transformations

 Nothing is executed until an action is called

 Each transformation function basically updates the graph


and when an action is called, the graph is executed

 Transformation returns a pointer to the new RDD

Lazy evaluation
32 23/12/2023
Action RDD
 Transformations create RDDs from each other

 Actions are Spark RDD operations that give non-RDD


values

 The values produced by actions are returned to the driver or stored in an external
storage system

[Diagram: an action (count, reduce, …) turns an RDD into a value]

33 23/12/2023
RDD lineage
 Applying transformations builds an RDD lineage that records all the
parent RDDs of the final RDD(s)

 The RDD lineage is also known as the RDD operator graph or RDD
dependency graph. It is a logical execution plan, i.e., a directed
acyclic graph (DAG) of all the parent RDDs of an RDD
34 23/12/2023
Example
 Dataset: transformed into an RDD

 map is a transformation, so it is not executed; we just
create another RDD

[Diagram: map takes an RDD and creates a new RDD that points to its parent]

35 23/12/2023
RDD operations: Transformations
 These are some of the transformations available
 Transformations are lazily evaluated
 Each returns a pointer to the transformed RDD

36 23/12/2023
flatMap
 The flatMap function is similar to map, but each input
can be mapped to 0 or more output items

 It returns a sequence of objects rather than a single item

 Given a text file:

 Each time a line is read in, you split that line up by spaces to
get individual keywords

 Each of those lines is ultimately flattened so that you can perform
the map operation on it, mapping each keyword to the value of
one
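A small sketch of that idea (illustrative; assumes an existing SparkContext sc and a hypothetical file path):

  val lines = sc.textFile("hdfs://.../input.txt")
  // flatMap: one line of text becomes zero or more words
  val words = lines.flatMap(line => line.split(" "))
  // map: each word becomes a (word, 1) pair
  val pairs = words.map(word => (word, 1))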
37 23/12/2023
Join and reduceByKey
 The join function combines two sets of key-value pairs
and returns, for each key, a pair of values taken from the two
initial sets
 Example:
 you have (K, V) pairs and (K, W) pairs
 When you join them together, you get a set of (K, (V, W)) pairs

 The reduceByKey function aggregates on each key by


using the given reduce function
 Example: use in a WordCount to sum up the values for each
word to count its occurrences
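A hedged illustration of both functions (the RDD contents are made up; an existing SparkContext sc is assumed):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))
  // join: ("a", (1, "x")), ("a", (3, "x")), ("b", (2, "y"))
  val joined = pairs.join(other)
  // reduceByKey: ("a", 4), ("b", 2)
  val sums = pairs.reduceByKey((v1, v2) => v1 + v2)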

38 23/12/2023
RDD operations: Actions
 Actions return values

39 23/12/2023
RDD operations: Actions
 collect: returns all the elements of the dataset as an array to
the driver program

 count: returns the number of elements in a dataset and can


also be used to check and test transformations

 take(n): returns an array with the first n elements. Note that


this is currently not executed in parallel. The driver
computes all the elements

 foreach(func): runs a function func on each element
of the dataset
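A brief sketch of these actions (assuming an RDD named nums built from a small in-memory collection):

  val nums = sc.parallelize(1 to 5)
  val all = nums.collect()        // Array(1, 2, 3, 4, 5), returned to the driver
  val n = nums.count()            // 5
  val firstTwo = nums.take(2)     // Array(1, 2)
  nums.foreach(x => println(x))   // runs on the executors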
40 23/12/2023
Transformation and action
execution: in memory

41 23/12/2023
RDD basic operations
 Loading a file

 val lines = sc.textFile("hdfs://data.txt")

 Loading the file creates an RDD, which is only a pointer to
the file

 The dataset is not loaded into memory yet

 Nothing will happen until some action is called

42 23/12/2023
RDD basic operations
 Applying transformation
 val lineLengths = lines.map(s => s.length)

 The transformation: map each line - ‘s’ - to the length of


that line

 The transformation basically updates the directed acyclic
graph (DAG)

43 23/12/2023
RDD basic operations
 Invoking action
 val totalLengths = lineLengths.reduce((a,b) => a + b)

 The action operation is reducing it to get the total length of


all the lines

 When the action is called, Spark goes through the DAG
and applies all the transformations up until that point,
followed by the action, and then a value is returned back to
the caller

44 23/12/2023
RDD basic operations
 View the DAG: display the series of transformations that produced an RDD with
 lineLengths.toDebugString
 Read the DAG from the bottom up
 Example: the following DAG starts as a textFile and goes
through a series of transformations such as map and filter,
followed by more map operations
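A quick sketch of inspecting a lineage this way (the chain of transformations is illustrative; the exact output format of toDebugString varies across Spark versions):

  val lines = sc.textFile("hdfs://.../data.txt")
  val lineLengths = lines.map(s => s.length).filter(len => len > 0)
  // prints the chain of RDDs, from the final RDD back down to the textFile
  println(lineLengths.toDebugString)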

45 23/12/2023
What happens when an action is executed? (1/8)
 Example: code to analyze
some log files
 In the first line you load the log
from the Hadoop file system
 In the next two lines you filter
the error messages out of the log
 You then tell Spark to cache the
filtered dataset
 More filters follow to get the specific
error messages relating to
mysql and php, followed by the
count action (the number of errors);
a sketch of such a program is given below
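A minimal sketch of the log-mining program described above (illustrative; the path, the "ERROR" prefix, and the field layout are assumptions):

  val lines = sc.textFile("hdfs://.../log.txt")
  val errors = lines.filter(line => line.startsWith("ERROR"))
  val messages = errors.map(line => line.split("\t")(1))
  messages.cache()                                               // keep the filtered data in memory
  val mysqlCount = messages.filter(_.contains("mysql")).count()  // action 1
  val phpCount   = messages.filter(_.contains("php")).count()    // action 2, served from the cache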
46 23/12/2023
 RDD log (1)
 Apply transformations: filter -> RDD messages (2)

 Load the RDD messages into the in-memory cache

 Action 1: count() of the "mysql" messages

 When an action is executed, the whole DAG is executed and the final
output is stored on disk (external storage)

 Action 2: count() of the "php" messages


47 23/12/2023
What happens when an action is executed? (2/8)

Load the text file: the data is partitioned into


different blocks across the cluster

48 23/12/2023
What happens when an action is executed? (3/8)

The driver sends the code to be


executed on each block

49 23/12/2023
What happens when an action is executed? (4/8)
 The executors read the HDFS blocks to prepare the data
for the operations in parallel

50 23/12/2023
What happens when an action is executed? (5/8)
 After a series of transformations, you can cache the results
up until that point into memory
 A cache is created

51 23/12/2023
What happens when an action is executed? (6/8)
 After the first action completes, the results are sent back to
the driver

52 23/12/2023
What happens when an action is executed? (7/8)
 To process the second action, Spark will use the data on
the cache; it does not need to go to the HDFS data again

53 23/12/2023
What happens when an action is executed? (8/8)
 Finally the results are sent back to the driver and you have
completed a full cycle

54 23/12/2023
Spark Cluster overview

55 23/12/2023
Spark cluster overview
 Components:
 Driver
 Cluster Manager
 Executors
[Figure: example of a cluster configuration]

56 23/12/2023
Spark cluster overview
 There are three main components of a Spark cluster

 The driver, where the SparkContext is located within the main


program

 Cluster manager: this could be Spark's standalone
cluster manager, Mesos, or YARN

 Worker nodes where the executors reside

 The executors are the processes that run computations and store
the data for the application
57 23/12/2023
Spark context
 SparkContext is the entry point of Spark functionality.

 The most important step of any Spark driver application is


to generate SparkContext

 It allows the Spark application to access the Spark cluster with the
help of the resource manager

 The SparkContext sends the application, defined as JAR or


Python files to each executor

 Finally, it sends the tasks for each executor to run


58 23/12/2023
Cluster manager
 There are currently three supported cluster managers

 The Spark standalone manager, which we can use to get up and
running quickly

 We can use Apache Mesos, a general cluster manager that can


run and service Hadoop jobs

 We can also use Hadoop YARN, the resource manager in


Hadoop

59 23/12/2023
Driver
 Where the SparkContext is located within the main
program

 The driver program schedules tasks on the cluster - it


should run close to the worker nodes on the same local
network

 If you would like to send remote requests to the cluster, it is
better to use an RPC and have it submit operations from
nearby

60 23/12/2023
Executor
 The executors are the processes that run computations and
store the data for the application

 The SparkContext sends the application, defined as JAR


or Python files to each executor

 Finally, it sends the tasks for each executor to run

61 23/12/2023
Executor
 Each application gets its own executor processes
 The executor stays up for the entire duration that the application
is running

 The benefit of this is that the applications are isolated from each
other, both on the scheduling side and by running in different
JVMs

 However, this means that we cannot share data across
applications

 You would need to externalize the data if you wish to share data
between different applications (i.e., different instances of SparkContext)
62 23/12/2023
Programming with Spark

63 23/12/2023
Programming with Spark
 Spark with interactive shells: (TP3, Shell)

 spark-shell (for Scala)


 pyspark (for Python)

 Programming with Spark with:


 Scala
 Python
 Java

64 23/12/2023
Initializing spark: scala/java/python
 Build a SparkConf object that contains information about
your application

 Scala

 Java

 Python

 The appName parameter: a name for your application, shown in the
cluster UI
 The master parameter: a Spark, Mesos, or YARN cluster URL (or a
special "local" string to run in local mode)
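A minimal Scala sketch of such a configuration (the application name and master URL are placeholders):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("MyApp")       // shown in the cluster UI
    .setMaster("local[*]")     // or a Spark/Mesos/YARN cluster URL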
65 23/12/2023
Initializing spark: scala/java/python
 Then, you need to create the SparkContext object

Java

Scala

Python
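For example, in Scala (a sketch that reuses the conf object from the previous slide):

  import org.apache.spark.SparkContext

  val sc = new SparkContext(conf)
  // in Python: sc = SparkContext(conf=conf); in Java: new JavaSparkContext(conf)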

66 23/12/2023
Review

67 23/12/2023
 Data.csv is on disk
 Loading Data.csv creates an RDD corresponding to
Data.csv, identified as RDD1
 Executing the process means running a set of tasks:
 map
 filter
 count
 Two types of tasks: transformations and actions
 Transformation: transforms RDD1 into RDD2 (lazy
evaluation)

RDD1 --map--> RDD2 --filter--> RDD3 --count--> result

 A DAG is created, then the DAG is executed
68 23/12/2023
 Data.csv is on disk
 Loading Data.csv creates an RDD corresponding to
Data.csv, identified as RDD1
 Executing the process means running a set of tasks:
 map
 filter
 count
 Transformation: transforms RDD1 into RDD2
 map(RDD1) -> RDD2
 filter(RDD2) -> RDD3
 count(RDD3) -> output: the action triggers the execution of
all the tasks (map, filter, count)
 Lazy evaluation (transformations)
69 23/12/2023
 DAG: execution graph, a set of RDDs
 The map and filter tasks are not executed (lazy evaluation)
 count() triggers the execution of the whole graph

[Diagram: Dataset -> RDD --map--> RDD --filter--> RDD --count--> external storage]

(Slide from the online course)
70 23/12/2023
Example: MapReduce Job With Hadoop
 MapReduce job: map, reduce, map, reduce (with disk between stages)

[Diagram: Dataset -> map1 -> reduce1 -> map2 -> reduce2, where Hadoop reads and
writes intermediate results to local files / HDFS (replicated, partitioned)
between each stage]

(Slide from the online course)

71 23/12/2023
Example: mapReduce job with Spark
 MapReduce job with Spark: map, reduce, map, reduce (in memory)

[Diagram: Dataset -> RDD --map--> new RDD (pointing to its parent) --reduce-->
output, with intermediate RDDs kept in memory rather than written to disk]

(Slide from the online course)

72 23/12/2023
Appendix

73 23/12/2023
Compatible versions of software
 Scala:
 Spark 1.6.3 uses Scala 2.10
 Spark 2.1.1 is built and distributed to work with Scala 2.11 by default
 To write applications in Scala, you will need to use a compatible Scala
version (e.g. 2.10.X)

 Python:
 Spark 1.x works with Python 2.6 or higher (but not yet with Python 3)

 Java:

 Spark 1.x works with Java 6 and higher - and Java 8 supports lambda
expressions
74 23/12/2023
Linking with Spark
 Scala

 Java

 Python

75 23/12/2023
Initializing spark: Spark Properties
 SparkConf: allows you to configure some of the common properties
 master URL
 application name

 Example (in Scala): initialize an application with two
threads as follows:

 Run with local[2], meaning two threads - which represents


“minimal” parallelism

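A hedged sketch of that configuration (the application name is a placeholder):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setMaster("local[2]")          // two local threads: "minimal" parallelism
    .setAppName("CountingSheep")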
76 23/12/2023
Initializing spark: scala/java/python
 Build a SparkConf object that contains information about
your application

 Scala

 Java

 Python

 The appName parameter: a name for your application, shown in the
cluster UI
 The master parameter: a Spark, Mesos, or YARN cluster URL (or a
special "local" string to run in local mode)
77 23/12/2023
Initializing spark: scala/java/python
 Then, you need to create the SparkContext object

Java

Scala

Python

78 23/12/2023
Resilient Distributed Datasets (RDDs)
 There are two ways to create RDDs:

1. Parallelizing an existing collection in your driver program

 Once created, the distributed dataset (distData) can be operated
on in parallel

distData.reduce((a, b) => a + b)
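The parallelize call that produces distData looks roughly like this (a sketch; the collection is illustrative):

  val data = Array(1, 2, 3, 4, 5)
  val distData = sc.parallelize(data)    // distribute the local collection across the cluster
  distData.reduce((a, b) => a + b)        // 15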

79 23/12/2023
Resilient Distributed Datasets (RDDs)
 There are two ways to create RDDs:

2. Referencing a dataset in an external storage system, such as a


shared filesystem, HDFS, HBase, or any data source offering
a Hadoop InputFormat

 Once created, distFile can be acted on by dataset operations

distFile.map(s => s.length).reduce((a, b) => a + b)
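The corresponding creation of distFile might look like this (the file name follows the usual documentation example; any HDFS or local path works):

  val distFile = sc.textFile("data.txt")                 // one record per line
  distFile.map(s => s.length).reduce((a, b) => a + b)    // sum of the line lengths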

80 23/12/2023
RDD operations with scala
 Define RDD

 Defines lineLengths as the result of a map transformation

lineLengths is not immediately computed, due to laziness

 Run reduce, which is an action

 At this point Spark breaks the computation into tasks to run on separate machines

 Each machine runs both its part of the map and a local reduction, returning only
its answer to the driver program

81 23/12/2023
Run spark application: dependencies
1. Define the dependencies using any build system
(Ant, SBT, Maven, Gradle)

 Example of Build files:


 Scala: simple.sbt
 Java: pom.xml

 Create the typical directory structure with the files

82 23/12/2023
Run spark application: dependencies
 pom.xml

83 23/12/2023
Run spark application: dependencies
 Create a zip file (e.g., abc.zip) containing all your
dependencies, and mention the zip file name when creating the
Spark context

 Python: if you need to ship dependencies, use the --py-files
argument

84 23/12/2023
Run spark application: dependencies
2. Create a JAR package containing the application's code

3. Use spark-submit to run the program

 Scala: sbt
 Java (Maven): mvn
 Python: spark-submit

85 23/12/2023
RDD Features

86 23/12/2023
Spark RDD features
 In-memory Computation
 Spark RDDs have a provision of in-memory computation

 It stores intermediate results in memory (RAM) instead of stable
storage (disk)

 Lazy Evaluations
 All transformations in Apache Spark are lazy, in that they do
not compute their results right away

 Instead, they just remember the transformations applied to


some base data set
87 23/12/2023
Spark RDD features
 Fault Tolerance
 Spark RDDs are fault tolerant as they track data lineage
information to rebuild lost data automatically on failure

 They rebuild lost data on failure using lineage

 Each RDD remembers how it was created from other datasets


(by transformations like a map, join or groupBy) to recreate
itself

88 23/12/2023
Spark RDD features
 Immutability
 Data is safe to share across processes
 It can also be created or retrieved anytime which makes
caching, sharing & replication easy
 Thus, it is a way to reach consistency in computations
 Partitioning
 Partitioning is the fundamental unit of parallelism in Spark
RDD
 Each partition is one logical division of data which is mutable
 One can create a partition through some transformations on
existing partitions

89 23/12/2023
Spark RDD features
 Persistence
 Users can state which RDDs they will reuse and choose a
storage strategy for them (e.g., in-memory storage or on Disk)

 Coarse-grained Operations
 Operations apply to all elements in the dataset, for example through
 map
 filter
 groupBy

90 23/12/2023
Spark RDD features
 Location-Stickiness
 RDDs are capable of defining placement preferences for computing
their partitions

 A placement preference is information about the location of an
RDD

 The DAGScheduler places the partitions in such a way that each task
is as close to its data as possible

 This speeds up computation

91 23/12/2023
How RDDs are represented
 RDDs are made up of 4 parts:
 Partitions: Atomic pieces of the dataset. One or many per
compute node

 Dependencies: model the relationship between this RDD and its
partitions and the RDD(s) it was derived from

 A function for computing the dataset based on its parent RDDs

 Metadata about its partitioning scheme and data placement

92 23/12/2023
RDD: logical vs physical partition
 Every dataset in RDD is logically partitioned across many
servers so that they can be computed on different nodes of
the cluster

 RDDs are fault tolerant, i.e., they possess self-recovery in the
case of failure

93 23/12/2023
RDD persistence

94 23/12/2023
RDD persistence: caching
 One of the key capabilities of Spark is the speed it gains through
persisting or caching

 Each node stores the partitions it caches and computes them in
memory

 When a subsequent action is called on the same dataset, or a
derived dataset, it is served from memory instead of having to be
retrieved again

 Future actions in such cases are often 10 times faster

 The first time an RDD is persisted, it is kept in memory on the node

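A tiny sketch of caching in action (the dataset and path are made up):

  val words = sc.textFile("hdfs://.../book.txt").flatMap(_.split(" "))
  words.cache()            // mark the RDD as cached; partitions are stored on the first action
  words.count()            // first action: reads from HDFS and fills the cache
  words.count()            // second action: served from memory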

95 23/12/2023
RDD persistence: In memory RDD
 Spark keeps persistent RDDs in memory by default

 Caching is fault tolerant

 But it can spill them to disk if there is not enough RAM

 Users can also request other persistence strategies

96 23/12/2023
RDD persistence:
 Two methods for RDD persistence: persist(), cache()

 The cache() method uses the default storage level (memory only)

 The persist() method allows you to specify a different
storage level for caching

 For example, persist the data set on disk, persist it in memory


but as serialized objects to save space, etc.
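For instance, a hedged sketch of both calls (rddA, rddB, and rddC are hypothetical RDDs):

  import org.apache.spark.storage.StorageLevel

  // choose one storage level per RDD; a level cannot be changed once assigned
  rddA.cache()                                   // same as persist(StorageLevel.MEMORY_ONLY)
  rddB.persist(StorageLevel.MEMORY_ONLY_SER)     // in memory, as serialized objects, to save space
  rddC.persist(StorageLevel.DISK_ONLY)           // on disk only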

97 23/12/2023
RDD persistence strategies

98 23/12/2023
Best practices for which storage level to
choose
 MEMORY_ONLY: the default storage level, and usually the best choice
 The most CPU-efficient option
 Allows operations on the RDDs to run as fast as possible
 It is the fastest option and fully takes advantage of Spark's design

 MEMORY_ONLY_SER:
 Use it with a fast serialization library to make objects more
space-efficient
 Is reasonably fast to access

99 23/12/2023
Best practices for which storage level to
choose
 DISK_ONLY ?
 Use disk-only storage only when
 the functions that computed your datasets are expensive, or
 they filter out a large amount of the data
 Otherwise, recomputing a partition may be as fast as reading it from disk

100 23/12/2023
Best practices for which storage level to
choose
 The experimental OFF_HEAP mode has several
advantages:

 In environments with high amounts of memory or multiple


applications

 Allows multiple executors to share the same pool of memory

 Reduces garbage collection costs (automatic memory


management)

 Cached data is not lost if individual executors crash


101 23/12/2023
OFF_HEAP

102 23/12/2023
RDD Dependencies

103 23/12/2023
Dependencies
 Transformations can have two kinds of dependencies:

 Narrow dependencies: Each partition of the parent RDD is used


by at most one partition of the child RDD

 No shuffle necessary

 Optimizations like pipelining possible

 Thus transformations which have narrow dependencies are fast

104 23/12/2023
Dependencies
 Wide dependencies: the elements that are required to
compute the records in a single partition of the child RDD may live in
many partitions of the parent RDD

 Shuffle necessary for all or some data

 Thus transformations which have wide dependencies are slow

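A short contrast of the two cases (a sketch; the pair RDD is made up):

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  // narrow dependency: each output partition depends on one input partition, no shuffle
  val doubled = pairs.mapValues(v => v * 2)
  // wide dependency: values for a key may come from many partitions, so a shuffle is needed
  val grouped = pairs.groupByKey()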
105 23/12/2023
Narrow dependencies Vs. Wide
dependencies

106 23/12/2023
Narrow dependencies
 Transformations with (usually) narrow dependencies:
 map
 mapValues
 flatMap
 filter
 mapPartitions
 mapPartitionsWithIndex
 Narrow dependency objects
 OneToOneDependency
 PruneDependency
 RangeDependency

107 23/12/2023
Wide dependencies
 Transformations with (usually) Wide dependencies: (might cause a shuffle):
 cogroup
 groupWith
 join
 leftOuterJoin
 rightOuterJoin
 groupByKey
 reduceByKey
 combineByKey
 distinct
 intersection
 repartition
 coalesce

 Wide dependency objects


 ShuffleDependency
108 23/12/2023
RDD
 RDDs provide a low-level API that gives great control
over the dataset

 It lacks schema control, but gives greater flexibility
when it comes to storage and partitioning, as it gives the
choice of implementing a custom partitioner

109 23/12/2023
Spark functions

110 23/12/2023
Spark libraries
 Extensions of the core Spark API
 Improvements made to the core are passed to these libraries
 Little overhead to use with the Spark core

111 23/12/2023
Spark SQL
 Allows relational queries expressed in
 SQL
 HiveQL
 Scala
 SchemaRDD
 Row objects
 Schema
 Created from:
 Existing RDD
 Parquet file
 JSON dataset
 HiveQL against Apache Hive
 Supports Scala, Java, R, and Python
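A hedged, Spark 1.x-style sketch of a relational query (the JSON file and table name are hypothetical):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  val people = sqlContext.jsonFile("people.json")   // infers the schema from the JSON records
  people.registerTempTable("people")
  val teens = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")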
112 23/12/2023
Spark Streaming
 Scalable, high-throughput, fault-tolerant stream processing of live data
streams
 Receives live input data and divides it into small batches, which are processed
and returned as batches
 DStream: a sequence of RDDs
 Currently supports Scala, Java, and Python

• Receives data from: Kafka, Flume, HDFS / S3, Kinesis, Twitter


• Pushes data out to: HDFS, Databases, Dashboard

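A minimal sketch of a streaming word count (illustrative; assumes a SparkConf named conf and a text source on localhost:9999):

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc = new StreamingContext(conf, Seconds(1))          // 1-second batches
  val lines = ssc.socketTextStream("localhost", 9999)       // DStream of lines
  val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.print()
  ssc.start()
  ssc.awaitTermination()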
113 23/12/2023
Spark streaming internals
 The input stream (DStream) goes into Spark Streaming
 The data is broken up into batches that are fed into the Spark
engine for processing
 The final results are generated as a stream of batches

• Sliding window operations


• Windowed computations
• Window length
• Sliding interval
• reduceByKeyAndWindow

114 23/12/2023
MLlib
 MLlib is the machine learning library - under active
development
 It currently provides the following common algorithms and
utilities
 Classification
 Regression
 Clustering
 Collaborative filtering
 Dimensionality reduction

115 23/12/2023
GraphX
 GraphX for graph processing
 Graphs and graph parallel computation
 Social networks and language modeling

116 23/12/2023
Berkeley Data Analytics Stack (BDAS)

117 23/12/2023
Spark Monitoring

118 23/12/2023
Spark monitoring
 Three ways to monitor Spark applications
1. Web UI
 Port 4040 (lab exercise on port 8088)
 Available for the duration of the application

 The Web UI has the following information.


 A list of scheduler stages and tasks
 A summary of RDD sizes and memory usage
 Environmental information and information about the running
executors

119 23/12/2023
Spark monitoring
2. Metrics
 Based on the Coda Hale Metrics Library
 Report to a variety of sinks (HTTP, JMX, and CSV)
/conf/metrics.properties

3. External instrumentations
 Cluster-wide monitoring tool (Ganglia)
 OS profiling tools (dstat, iostat, iotop)
 JVM utilities (jstack, jmap, jstat, jconsole)

120 23/12/2023
Spark Monitoring Web UI

121 23/12/2023
Spark Monitoring web UI

122 23/12/2023
Spark Monitoring: Metrics

123 23/12/2023
Spark monitoring: Ganglia

124 23/12/2023
Conclusion
 Purpose of Apache Spark in the Hadoop ecosystem

 Architecture and components of the Spark unified stack

 Role of a Resilient Distributed Dataset (RDD)

 Principles of Spark programming

 List and describe the Spark libraries

 Launch and use Spark's Scala and Python shells


125 23/12/2023
