
Big data
Lecture 1 – MapReduce and Spark

Gianluca Quercini, Stéphane Vialle
gianluca.quercini@centralesupelec.fr
stephane.vialle@centralesupelec.fr

Polytech Paris-Saclay, 2021


Course overview: General information

Organization of the course

Lecture 1. MapReduce and Spark (08/09/2021).
Tutorial 1. Spark programming (15/09/2021). https://tinyurl.com/p7jb5wra
Lecture 2. MapReduce and Spark (15/09/2021).
Lab assignment 1. Spark programming (22/09/2021).
Lecture 3. SQL and NoSQL (29/09/2021).
Lecture 4. Hadoop technologies (06/10/2021).
Lecture 5. Scaling (06/10/2021).
Lab assignment 2. MongoDB programming (13/10/2021).
Tutorial 2. Scaling (20/10/2021).

Class material

Available online. Click here
Slides of the lectures.
Tutorials and lab assignments.
References (books and articles).

Course overview: General information

Evaluation

Lab assignments. Lab assignments 1 and 2 will be graded.
Submission: source code + written report.

Written exam. 1 hour.
Spark programming.
Data modeling in MongoDB.
Querying in MongoDB.

Contact

Email: gianluca.quercini@centralesupelec.fr
Email: stephane.vialle@centralesupelec.fr

Introduction: Objectives

What you will learn

In this lecture you will learn:

Big data notions, motivations and challenges.
Basic notions of Hadoop and its ecosystem.
The MapReduce programming paradigm.
What Spark is and its main features.
The components of the Spark stack.
The high-level Spark architecture.
The notion of Resilient Distributed Dataset (RDD).

Introduction: The Big Data era

What is data?

Definition (Data)
Data are raw symbols that represent the properties of objects and events
and in this sense data has no meaning of itself, it simply exists (Russell L.
Ackoff, 1989). Source

Information. Data + meaning.

Definition (Dataset)
A dataset is a collection of data.

Data can be categorized based on their structure.

Structured data

Definition (Structured data)
Structured data describe the properties (e.g., the name, address, credit
card number and phone number) of entities (e.g., customers, products)
following a fixed template or model.

Example. {“John”, “Smith”, 30000}
{(first name, “John”), (last name, “Smith”), (salary, 30000)}

Data stored in spreadsheets (e.g., Excel files).
Records stored in the tables of a relational database.
Each property is easily distinguishable from the others.
It fits one unit of the structure (e.g., a column of a table).

Unstructured data

Definition (Unstructured data)
Unstructured data describe entities that lack a clear structure because
their properties are not immediately distinguishable.

Text is unstructured.
Description of entity properties drowned in a rich context.
No direct access to these properties.


Introduction: The Big Data era

Semi-structured data

Definition (Semi-structured data)
Semi-structured data have a structure in which the entities and their
properties are easily distinguishable, but the organization of the structure
is not as rigorous as in a table of a relational database.

Examples: XML, JSON, HTML documents.

Example (XML document)
<book id="bk101">
  <author>Gambardella, Matthew</author>
  <title>XML Developer's Guide</title>
  <genre>Computer</genre>
  <price>44.95</price>
  <publish_date>2000-10-01</publish_date>
</book>

What is Big Data?

Definition (Big Data)
The term Big Data refers to an accumulation of data that is too large and
complex for processing by traditional database management tools. Source

The term Big Data also refers to:
The solutions developed to manage large volumes of data.
The branch of computing studying solutions to manage large volumes of data.

Other sources define Big Data in terms of its characteristics.

The 3 Vs of Big Data

Definition (3V)
Big data is high-Volume, high-Velocity and high-Variety information
assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making (Gartner).

Volume, the size of a dataset.
Velocity, the necessity of processing data as they arrive.
Variety, the heterogeneous nature of data (structured, unstructured, semi-structured).

Introduction: The Big Data era

The 4 Vs of Big Data

Definition (4V)
Big Data consists of extensive datasets primarily in the characteristics of
Volume, Variety, Velocity, and/or Variability that require a scalable
architecture for efficient storage, manipulation, and analysis (NIST).

Subtle difference between variety and variability.
Variety: a bakery that sells ten types of bread.
Variability: a bakery that sells only one type of bread that tastes
differently every day.

Scalability: the ability of a system architecture to manage growing
amounts of data, without a significant decrease of its performance.

The 4 Vs of Big Data: example

Example (4V)
Sentiment analysis system that processes tweets to derive the general mood about
a political candidate.
Language analysis: positive/negative/neutral sentiment?

Volume. Millions of tweets.
Velocity. Constant stream of data (7,500 tweets/second).
Variety. Text, images and links to Web pages.
Variability. The meaning of each word changes depending on the context.
I'm deeply satisfied about the candidate.
I'm deeply offended by the candidate.

More Vs

Veracity. Data might not correspond to the truth.
Fake news retweeted multiple times.
Uncertainty. The example of Google Flu Trends.

Value. Separating the wheat from the chaff.
Lots of data available.
Identify the data that can have some value.
Discard the other data.


Introduction: The Big Data era

Scalability

Scalability: the ability of a system to handle growing amounts of
data, without a significant decrease of its performance.

Two techniques:
Vertical scaling (scale-up).
Upgrade the existing infrastructure (more memory, computing power...).
Horizontal scaling (scale-out).
Add machines to the existing infrastructure.
Distribute the data and the workload across several machines.

Advantages of vertical scaling:
Easier to maintain a single machine than many.
Centralized control over the data and the computations.

Advantages of horizontal scaling:
Limitless upgrade of the computing power of a system.
Fault tolerance.

Where does Big Data come from?

The Web: social networks, blogs, wikis.
Sensors: surveillance cameras, medical devices, cellphones.
Companies (e.g., Amazon, UPS, Spotify, Netflix).
Source: IDC, 2014

Big Data applications

Communications, media and entertainment.
Recommendation systems, social network analysis...
Web search engines.
Banking industry.
Fraud detection, anti-money laundering...
Healthcare industry.
Diagnostics, medical research...
Government agencies.
Processing of unemployment claims, homeland security...
Introduction: The Big Data era

Big Data: search engines

Return a list of Web pages related to a search query.
Need to index all Web pages.
Inverted index: for each word, list the Web pages containing that word.
Need to rank all Web pages.
wikipedia.org (supposedly) more important than myblog.com.

Big Data challenges

In this course, we study two main challenges: processing and storage.

Processing.
Parallelize the computation across machines.
Distributed processing frameworks (e.g., Hadoop MapReduce/Spark).

Storage.
Distributed file systems.
Distributed (relational/NoSQL) databases.

Hadoop: Origins

Processing big data

Why is processing Big Data challenging?

Disk storage capacities have increased rapidly over the years.
A typical disk from 1990 could store 1,370 MB of data (cf. Seagate ST-41600n).
A typical disk (SSD) today can store 2 TB of data (cf. Seagate Barracuda 120).

Disk access speed has increased much more slowly.
A typical disk from 1990 had a transfer speed of 4.4 MB/s.
A typical disk (SSD) today has a transfer speed of 500 MB/s.

In 1990 it could take 5 minutes to read all the data from a disk.
In 2020 it takes more than one hour to read all the data from a disk.

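A quick back-of-the-envelope check of those last two claims, using the figures quoted above (a rough sketch; the capacities and speeds are the approximate values from the slide):

# Approximate full-disk read times, using the figures quoted above.
capacity_1990_mb = 1_370            # Seagate ST-41600n: about 1,370 MB
speed_1990_mb_s = 4.4               # about 4.4 MB/s
capacity_today_mb = 2 * 1024**2     # 2 TB SSD, expressed in MB
speed_today_mb_s = 500              # about 500 MB/s

print(capacity_1990_mb / speed_1990_mb_s / 60)      # ~5.2 minutes in 1990
print(capacity_today_mb / speed_today_mb_s / 3600)  # ~1.2 hours today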

Hadoop: Origins

Processing big data: parallelization

Split the data into many smaller chunks stored on separate disks.
Read in parallel from each disk.

[Figure: a 2 TB dataset is split into 100 chunks of 20,000 MB each (Chunk 1, Chunk 2, ..., Chunk 99, Chunk 100), stored on Disk 1 through Disk 100, each with a transfer rate of 500 MB/s.]

+ Each disk may contain chunks from different datasets.

Parallelization: challenges

Hardware failure. Lots of disks → higher chances of failures.
Use redundancy. Replicate data across several disks.
+ This is where HDFS (Hadoop Distributed File System) comes into play.

Combine data from different sources.
+ This is where MapReduce comes into play.

Apache Hadoop is a system that provides:
A reliable shared storage: HDFS.
A processing system: MapReduce.

Apache Hadoop

Key people: Doug Cutting, Mike Cafarella.
Original project: Apache Nutch, open-source web search engine.

Timeline:
2002. Apache Nutch was started (crawler and search system).
2003. Google published the Google File System (GFS).
2004. Nutch implemented GFS as the Nutch Distributed File System (NDFS).
2004. Google published MapReduce.
2005. Nutch integrated its own implementation of MapReduce.
2006. NDFS and MapReduce moved out to another project: Hadoop.
2008. Hadoop used in the Yahoo! search engine.
2008. Hadoop made top-level project at Apache.
2008. Hadoop sorted 1 TB of data in 209 seconds.
Hadoop: Ecosystem

Hadoop ecosystem

[Figure: overview of the Hadoop ecosystem. Source]

Hadoop: Motivation

Why do we need Hadoop?

Example (Mining weather data. Source)
We want to analyze a dataset with weather data.
The dataset contains a file for each year (between 1901 and 2001).
Each file contains temperature readings from different weather stations.
Objective. Compute the maximum temperature for each year.

First approach:
Create a Linux script or a Python program.
Run it on a single machine.
+ The computation may take a long time depending on the dataset size.

Why do we need Hadoop?

Second approach: run parts of the script in parallel.
Using all available threads on a single machine.
Problem: how do we split the input data?

[Figure: the weather dataset consists of File 1 (1901), File 2 (1902), ..., File 100 (2000), File 101 (2001). Different file sizes imply different running times.]

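To make the first approach concrete, here is a minimal single-machine sketch in Python. The directory layout (one file per year under weather/) and the line format (one "station temperature" reading per line) are assumptions made for illustration; the real weather records are more complex.

import glob
import os

max_temp = {}
for path in glob.glob("weather/*.txt"):              # assumed: one file per year
    year = os.path.splitext(os.path.basename(path))[0]
    with open(path) as f:
        for line in f:
            _, temperature = line.split()            # assumed record format
            temperature = float(temperature)
            if year not in max_temp or temperature > max_temp[year]:
                max_temp[year] = temperature

for year in sorted(max_temp):
    print(year, max_temp[year])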

Hadoop: Motivation

Why do we need Hadoop?

Second approach: run parts of the program in parallel.
Using all available threads on a single machine.
Problem: how do we split the input data?

[Figure: the weather dataset is split into Chunk 1 (1901), Chunk 2 (1901), ..., Chunk n (2001), Chunk n+m (2001); each chunk is processed separately and the results are combined.]

Why do we need Hadoop?

Many datasets cannot be handled on a single machine.
Third approach: run parts of the program in parallel on a cluster of machines.

Challenges
Coordination.
On which machine does each process run?
How are the results combined?
Reliability.
What happens if some processes fail?

Hadoop takes care of all these challenges.
Hadoop offers programmers an abstraction to run parallel programs: MapReduce.

MapReduce: Definitions and principles

MapReduce

Definition (MapReduce)
MapReduce is a programming model for data processing that abstracts
the problem from disk reads and writes, transforming it into a computation
over sets of keys and values. Source

The computation consists of two parts: map and reduce.
The same program can run on one or multiple machines.

What can you do with MapReduce?
Search indexes, image analysis, graph-based problems, machine learning...
+ MapReduce cannot solve every problem!

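As an illustration of "a computation over sets of keys and values", the weather example from the previous slides could be expressed roughly as follows (plain Python, not tied to any framework; the record format "year station temperature" is an assumption made for illustration):

def map_record(line):
    # One record in, (key, value) pairs out: here, (year, temperature).
    year, _, temperature = line.split()
    yield (year, float(temperature))

def reduce_year(year, temperatures):
    # One key and all its values in, the result for that key out.
    yield (year, max(temperatures))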
MapReduce: Definitions and principles

MapReduce: principles

[Figure: an input file is divided into splits; each split is a sequence of records. Map task 1 processes records 1-3, Map task 2 processes records 4-6, Map task 3 processes records 7-9, each by calling map() on every record.]

+ Input splits are independent. They can be processed in parallel.

Function map()
Input. One record of an input split.
Output. Key-value pairs extracted from the input record.

Map task
Input. One input split.
For each record r in the input split, call map(r).

[Figure: the map tasks emit key-value pairs. Map task 1 emits (k1, x1), (k2, x2), (k3, x3); Map task 2 emits (k1, y1), (k1, y2); Map task 3 emits (k2, w1), (k3, w2), (k4, w3). The shuffle then groups the pairs by key: Reduce task 1 receives (k1, [x1, y1, y2]) and (k2, [x2, w1]); Reduce task 2 receives (k3, [x3, w2]) and (k4, [w3]). Each reduce task calls reduce() on its groups and writes its output.]

MapReduce: Definitions and principles

MapReduce: principles

Function reduce()
Input. A pair (k, L), where k is a key and L is the list of values associated to k.
Output. Obtained from a computation on k and/or the values in L.

Reduce task
Input. A sequence of pairs (k, L).
For each pair (k, L), call reduce(k, L).

[Figure: Reduce task 1 applies reduce() to (k1, [x1, y1, y2]) and to (k2, [x2, w1]) and writes the corresponding output.]

Example: word count

Input. A text file F (text over several lines).
Output. {(w, o_w) ∀ w ∈ F}, where w is a word and o_w is the number of occurrences of w in F.

[Figure: the input text "quick brown fox / jump lazy dog / jump vow quick / lazy strength lazy" goes through MapReduce and produces the output (quick, 2), (brown, 1), (fox, 1), (jump, 2), (lazy, 3), (dog, 1), (vow, 1), (strength, 1).]

Questions
What is an input split here?
What is a record here?

Example: word count

Answers
An input split is a set of lines of the input text.
A record is a line of text.
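The map task, shuffle and reduce task just described can be simulated locally in a few lines of Python. This is only a sketch to make the data flow concrete (sequential, no distribution, no fault tolerance); the function run_mapreduce and its arguments are ours, not part of any framework.

from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn):
    # Map phase: each map task applies map_fn to every record of its split.
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(map_fn(record))
    # Shuffle phase: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: apply reduce_fn to every (key, list of values) pair.
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output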
MapReduce: Definitions and principles

Example: word count

Functions map() and reduce()
map: line → sequence of (k, v)
reduce: (k, L) → output value

We already know the output value (what we want to obtain).
Output value: sequence of pairs (w, o_w).

Questions
What does k represent?
Which value v must be associated to a key k?
What should L contain in order to get the output value?

Example: word count

Answers
A key k is a word w.
v is 1.
The result is a pair (w, o_w). o_w must be computed from the values in L.
L might contain as many 1s as w occurs in the text.
o_w is simply computed as sum(L).
L results from shuffling the result of the map tasks.

+ We have all we need to define both map and reduce.

Example: word count

[Figure: the full word-count pipeline: input, input splits, map tasks, shuffle, reduce tasks, output. The map tasks emit pairs such as (quick, 1), (brown, 1), (fox, 1), (jump, 1), (lazy, 1), (dog, 1), (vow, 1), (strength, 1). The shuffle groups them into (quick, [1, 1]), (brown, [1]), (fox, [1]), (jump, [1, 1]) in partition 1 and (lazy, [1, 1, 1]), (dog, [1]), (vow, [1]), (strength, [1]) in partition 2. The reduce tasks produce (quick, 2), (brown, 1), (fox, 1), (jump, 2) and (lazy, 3), (dog, 1), (vow, 1), (strength, 1).]


MapReduce: Definitions and principles

Example: word count

Function map
def map(line):
    for word in line.split():
        yield(word, 1)

Function reduce
def reduce(w, L):
    yield(w, sum(L))

MapReduce: Implementation

MapReduce implementation: data storage

For large inputs, we use a cluster of machines (scale out).
Machines in a cluster are referred to as nodes.
Data is stored in a distributed file system (e.g., HDFS).
Each file is split into a set of fixed-size blocks (64 MB or 128 MB).
Each block is replicated across machines (for reliability).

[Figure: a file with 4 blocks B1, B2, B3, B4; each block is replicated on several nodes spread over racks R1, R2 and R3, connected by a switch.]

MapReduce implementation: data flow

[Figure: each input split is processed by a map task; the map output is sorted locally, copied to the reduce tasks and merged there; the reduce tasks write the output partitions part 0 and part 1. Source]

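Here is a small local sketch of how the map and reduce functions above fit together, with the shuffle written out by hand (note that these definitions shadow Python's built-in map and reduce; on a real cluster Hadoop performs the shuffle for us):

from collections import defaultdict

lines = ["quick brown fox", "jump lazy dog", "jump vow quick", "lazy strength lazy"]

groups = defaultdict(list)
for line in lines:                         # map phase
    for word, one in map(line):            # the map() defined above
        groups[word].append(one)

for word, counts in groups.items():        # reduce phase
    for w, total in reduce(word, counts):  # the reduce() defined above
        print(w, total)                    # e.g. quick 2, lazy 3, ...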
MapReduce: Implementation

MapReduce implementation: data locality

A map task is run on the node holding the input split, if possible.
If not, the map task is run on another node in the same rack.
At worst, the map task is run on a node in another rack.
Data locality doesn't necessarily apply to reduce tasks.

[Figure: the same data-flow diagram as before, highlighting on which nodes the map and reduce tasks run.]

MapReduce: fault tolerance

The output of a map task is stored on the local disk.
It is not written to HDFS (no replication).
In case of problems on the reduce side, the map task doesn't need to be re-executed: its output is still available on the local disk.
The output of a reduce task is written to HDFS (replication).

Combiner functions

Problem. Lots of data is transferred over the network between map and reduce tasks.
Solution. Use a combiner function.

[Figure: in the word count example, the map tasks processing "quick brown fox / jump lazy dog" and "jump vow quick / lazy strength lazy" emit several (lazy, 1) pairs (one from the first split, two from the second), and all of them are sent over the network to the reduce task.]

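A minimal sketch of what a combiner does for word count: it pre-aggregates the output of a single map task locally, so that one (word, partial count) pair per word crosses the network instead of many (word, 1) pairs. The function name combine and the list-based representation are ours, for illustration only.

from collections import defaultdict

def combine(map_output):
    partial = defaultdict(int)
    for word, one in map_output:        # e.g. [("jump", 1), ("lazy", 1), ("lazy", 1)]
        partial[word] += one
    return list(partial.items())        # e.g. [("jump", 1), ("lazy", 2)]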

MapReduce: Implementation

Combiner functions

Word count
map: line → {(w, 1) ∀ w ∈ line}, where w is a word.
combine: (w, c_w = [1, ..., 1]) → (w, sum(c_w)).
reduce: (w, c_w = {o_1,w, ..., o_m,w}) → (w, sum(c_w)), where o_i,w is the number of occurrences of w counted by map task i.

+ We can use a combiner only if the function that we want to implement is
commutative and associative.

[Figure: the word-count data flow with a combine step after each map task: the (lazy, 1) pairs emitted by a map task are combined locally, e.g. into (lazy, 2), before being sent to the reduce task.]

Apache Spark: Main notions

Apache Spark

Definition (Apache Spark)
Apache Spark is a distributed computing framework designed to be fast
and general-purpose. Source

Main features
Speed. Run computations in memory, as opposed to Hadoop, which heavily relies on disks and HDFS.
General-purpose. It integrates a wide range of workloads that previously required separate distributed systems:
batch applications, iterative algorithms, interactive queries, streaming applications.
Accessibility. It offers APIs in Python, Scala, Java and SQL, and rich built-in libraries.
Integration. It integrates with other Big Data tools, such as Hadoop.

Spark stack

Unified stack of components that address diverse workloads under a single distributed engine.

[Figure: the Spark stack. On top: Spark SQL (DataFrames, Datasets), Spark Streaming (Structured Streaming), Machine Learning (MLlib) and Graph processing (GraphX). Below them: the Spark Core and Spark SQL engine, with APIs in Scala, Python, Java, SQL and R. At the bottom: the cluster managers (Standalone Scheduler, Mesos, YARN, Kubernetes). Image source]

Apache Spark: Main notions

Spark stack

Spark Core. Computational engine responsible for scheduling, distributing, and monitoring applications.
Spark SQL. SQL interface to Spark for structured data. Structured data are accessed through DataFrames/Datasets.
Structured Streaming. Processing of streaming sources. Built on top of the Spark SQL engine.
MLlib. Machine learning algorithms built on top of the DataFrame API.
GraphX. Graph processing and graph-parallel computations (e.g., PageRank).

Spark stack: benefits

Improvements in the bottom layers are automatically reflected in the high-level libraries.
Optimizations in the Spark core result in better performance in Spark SQL and MLlib.

Removes the costs of using different independent systems:
deployment, maintenance, testing, support of different systems (streaming, SQL, machine learning...).

Different programming models in the same application:
an application that reads a stream of data,
applies machine learning algorithms,
and uses SQL to analyze the results.

Who uses Spark

Amazon.
eBay. Log transaction aggregation and analytics.
Groupon.
Stanford DAWN. Research project aiming at democratizing AI.
TripAdvisor.
Yahoo!

+ Full list available at http://spark.apache.org/powered-by.html


Apache Spark: Spark architecture

Spark distributed execution

A Spark application consists of a Spark driver.
The Spark driver orchestrates parallel operations on the cluster.
The driver accesses the cluster through a Spark session.

[Figure: a Spark application. The Spark driver, holding a Spark session, talks to the cluster manager, which allocates Spark executors on the cluster nodes; each executor uses one or more CPU cores. Image source]

Spark components

Spark Driver
Requests resources (e.g., CPU, memory) for the Spark executors (Java processes) from the cluster manager.
Transforms all the Spark operations into DAG computations.
Distributes these computations as tasks across the Spark executors.
Communicates directly with the executors, once the resources are allocated.

Spark components

SparkSession
Single entry point to all Spark operations.
Define DataFrames and Datasets.
Read data from sources (e.g., files, databases).
Issue Spark SQL queries.
First object created in a Spark application.

Cluster manager
Responsible for managing and allocating resources in the cluster.

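A minimal sketch of creating and using a SparkSession from PySpark, along the lines described above (the application name and file path are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("my_app")                      # placeholder application name
         .getOrCreate())                         # single entry point to Spark

df = spark.read.text("path/to/some_file.txt")    # placeholder path
df.createOrReplaceTempView("lines")
spark.sql("SELECT COUNT(*) AS n FROM lines").show()  # a Spark SQL query

spark.stop()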
Apache Spark: Spark application concepts

Launching a Spark application

1. spark-submit
2. Launch the driver
3. Ask for resources
4. Schedule executors
5. Launch executors
6. Register with the driver
7. Assign tasks

[Figure: the cluster manager (master process) sits between the driver and the worker nodes; the numbered steps above show how the driver is launched, how executors are scheduled, launched and registered with the driver, and how tasks are assigned. Each executor uses one or more CPU cores.]

Distributed data and partitions

Data is distributed across several machines.
Data is split across different partitions (e.g., HDFS blocks).
Each Spark core is assigned a partition to work on.
Data locality principle. Partitions are assigned to the closest core.

[Figure: the Spark driver and several Spark executors, each working on its own data partitions; the files live on distributed storage such as HDFS, Amazon S3 or Azure Blob storage.]

Apache Spark: Low-level Spark programming

Writing a low-level Spark program

The program accesses Spark through an object called SparkContext.

Initializing the SparkContext
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster(<cluster URL>).setAppName(<app_name>)
sc = SparkContext(conf = conf)

A Spark program is a sequence of operations invoked on the SparkContext.
These operations manipulate a special type of data structure, called Resilient Distributed Dataset (RDD).

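For reference, a minimal complete program built around a SparkContext might look as follows (a sketch: the master URL, application name, file path and script name are placeholders, not part of the course material):

# line_count_app.py (hypothetical file name)
# Could be run with, for example: spark-submit line_count_app.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("line_count")
sc = SparkContext(conf=conf)

lines = sc.textFile("path/to/some_file.txt")   # placeholder path
print(lines.count())                           # number of lines in the file

sc.stop()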

Apache Spark: Low-level Spark programming

Resilient Distributed Dataset (RDD)

Definition (Resilient Distributed Dataset)
A Resilient Distributed Dataset, or simply RDD, is an immutable,
distributed collection of objects. Source

Spark splits each RDD into multiple partitions.
Partitions are distributed transparently across the nodes of the cluster.
Spark parallelizes the operations invoked on each RDD.

+ A Spark program is a sequence of operations invoked on RDDs.
An operation can be either a transformation or an action.

Creating RDDs

Parallelize a collection
The SparkContext is used to parallelize an existing collection.

wordList = ["This", "is", "my", "first", "RDD"]
words = sc.parallelize(wordList)

It assumes that the collection fits entirely in memory.
Used for prototyping and testing.

Load data from external storage
The SparkContext offers numerous functions to load data from external sources (e.g., a text file).

lines = sc.textFile(<path_to_file>)

RDD operations

Transformations take in one or more RDDs and return a new RDD.
Actions take in one or more RDDs and return a result to the driver or write it to storage.

[Figure: an RDD is created, transformations produce new RDDs (forming the RDD lineage), and an action finally produces a result.]

+ A transformation never changes the input RDD.
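A small sketch of the transformation/action distinction, assuming an existing SparkContext sc: the transformation only extends the lineage, and nothing is computed until an action is invoked.

nums = sc.parallelize([1, 2, 3, 4, 5, 6])

evens = nums.filter(lambda x: x % 2 == 0)   # transformation: lazy, returns a new RDD
print(evens.count())                        # action: triggers the computation, returns 3
print(evens.collect())                      # action: returns [2, 4, 6] to the driver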
Apache Spark: Low-level Spark programming

Common transformations

Element-wise transformations
map(). Takes in a function f and an RDD < x_i | 0 ≤ i ≤ n >;
returns a new RDD < f(x_i) | 0 ≤ i ≤ n >.
filter(). Takes in a predicate p and an RDD < x_i | 0 ≤ i ≤ n >;
returns a new RDD < x_i | 0 ≤ i ≤ n, p(x_i) is true >.

[Figure: the input RDD <1, 2, 3, 4> is mapped with lambda x: x*x to the RDD <1, 4, 9, 16>, and filtered with lambda x: x <= 3 to the RDD <1, 2, 3>.]

Map
nums = sc.parallelize([1, 2, 3, 4])
mapped_rdd = nums.map(lambda x: x*x)

Map (alternative)
def power2(x):
    return x*x

nums = sc.parallelize([1, 2, 3, 4])
mapped_rdd = nums.map(power2)

Filter
nums = sc.parallelize([1, 2, 3, 4])
filtered_rdd = nums.filter(lambda x: x <= 3)


Apache Spark: Low-level Spark programming

Common transformations

flatMap
Similarly to map, flatMap takes in a function f and an RDD and applies the
function element-wise. f must return a list of values.

[Figure: the input RDD < "The quick brown fox", "jumps over", "the lazy dog" >. With map(lambda x: x.split(" ")) it becomes an RDD of lists, < ["The", "quick", "brown", "fox"], ["jumps", "over"], ["the", "lazy", "dog"] >; with flatMap(lambda x: x.split(" ")) it becomes the flattened RDD < "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog" >.]

phrase = sc.parallelize(
    ["The quick brown fox", "jumps over", "the lazy dog"])
flat_mapped_rdd = phrase.flatMap(lambda x: x.split(" "))

Common transformations and actions

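To tie the transformations above back to the word count example from the MapReduce part of the lecture, here is a minimal sketch using flatMap, map and reduceByKey (reduceByKey is a pair-RDD transformation not detailed on these slides, and collect is an action; sc is an existing SparkContext):

lines = sc.parallelize(
    ["quick brown fox", "jump lazy dog", "jump vow quick", "lazy strength lazy"])

counts = (lines
          .flatMap(lambda line: line.split(" "))   # one word per RDD element
          .map(lambda word: (word, 1))             # (word, 1) pairs
          .reduceByKey(lambda a, b: a + b))        # sum the 1s for each word

print(counts.collect())   # e.g. [('quick', 2), ('lazy', 3), ...]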
Apache Spark: Low-level Spark programming

Spark programming demo

Notebook available on Google Colab. Click here

+ Select File → Save a copy in Drive to create a copy of the
notebook in your Drive and play with it.

References

White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012. Click here
Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015. Click here

