Course overview: General information

Instructors:
- gianluca.quercini@centralesupelec.fr
- stephane.vialle@centralesupelec.fr

Lectures:
- Lecture 3. SQL and NoSQL (29/09/2021).
- Lecture 4. Hadoop technologies (06/10/2021).

Lab assignments:
- Lab assignments 1 and 2 will be graded.
- Submission: source code + written report, by email to gianluca.quercini@centralesupelec.fr.

Introduction: Objectives
- Big Data notions, motivations and challenges.
- Basic notions of Hadoop and its ecosystem.
Introduction: The Big Data era

Definition (Data)
Data are raw symbols that represent the properties of objects and events; in this sense, data has no meaning of itself, it simply exists (Russell L. Ackoff, 1989).

Information = data + meaning.
- {"John", "Smith", 30000} is data.
- {(first name, "John"), (last name, "Smith"), (salary, 30000)} is information.

Definition (Dataset)
A dataset is a collection of data.

Definition (Structured data)
Structured data describe the properties (e.g., the name, address, credit card number and phone number) of entities (e.g., customers, products) following a fixed template or model.
- Examples: data stored in spreadsheets (e.g., Excel files); records stored in the tables of a relational database.
- Each property is easily distinguishable from the others: it fits one unit of the structure (e.g., a column of a table).

Definition (Unstructured data)
Unstructured data describe entities that lack a clear structure, because their properties are not immediately distinguishable.
- Text is unstructured: the description of entity properties is drowned in a rich context, with no direct access to these properties.
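A small sketch of the contrast (the record and the sentence are invented for the example): the structured record gives direct access to each property, while the same facts in free text must be extracted from their context:

    # Structured: each property sits in its own labeled slot.
    customer = {"first_name": "John", "last_name": "Smith", "salary": 30000}
    print(customer["salary"])  # direct access: 30000

    # Unstructured: the same facts drowned in free text, no direct access.
    note = "John Smith, a long-time customer, earns 30000 a year."
    # Recovering the salary requires extraction, e.g., a crude heuristic:
    numbers = [t for t in note.replace(",", " ").split() if t.isdigit()]
    print(numbers[0] if numbers else None)  # fragile extraction: '30000'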
Definition (4V)
Big Data consists of extensive datasets, primarily in the characteristics of Volume, Variety, Velocity, and/or Variability, that require a scalable architecture for efficient storage, manipulation, and analysis (NIST).

Subtle difference between variety and variability:
- Variety: a bakery that sells ten types of bread.
- Variability: a bakery that sells only one type of bread that tastes different every day.

Example (4V)
A sentiment analysis system that processes tweets to derive the general mood about a political candidate. Language analysis: is a tweet positive, negative or neutral?
- Volume. Millions of tweets.
- Velocity. Constant stream of data (7,500 tweets/second).
- Variety. Text, images and links to Web pages.
- Variability. The meaning of each word changes depending on the context: "I'm deeply satisfied about the candidate" (positive) vs. "I'm deeply offended by the candidate" (negative).

Two further Vs:
- Veracity. Data might not correspond to the truth: fake news retweeted multiple times; uncertainty (the example of Google Flu Trends).
- Value. Separating the wheat from the chaff: a lot of data is available; identify the data that can have some value, and discard the other data.
Scalability
Scalability: the ability of a system to handle growing amounts of data without a significant decrease of its performance. Two techniques:
- Vertical scaling (scale-up): upgrade the existing infrastructure (more memory, computing power...).
- Horizontal scaling (scale-out): add machines to the existing infrastructure; distribute the data and the workload across several machines.

Advantages of vertical scaling:
- Easier to maintain a single machine than many.
- Centralized control over the data and the computations.

Advantages of horizontal scaling:
- Limitless upgrade of the computing power of a system.
- Fault tolerance.

Where does Big Data come from?
- The Web: social networks, blogs, wikis.
- Sensors: surveillance cameras, medical devices, cellphones.
- Companies (e.g., Amazon, UPS, Spotify, Netflix).
(Source: IDC, 2014)

Big Data applications
- Communications, media, entertainment: recommendation systems, social network analysis...
- Web search engines.
- Banking industry: fraud detection, anti-money laundering...
- Healthcare industry: diagnostics, medical research...
- Government agencies: processing of unemployment claims, homeland security...
Big Data: search engines
- Return a list of Web pages related to a search query.
- Need to index all Web pages. Inverted index: for each word, list the Web pages containing that word (see the sketch below).
- Need to rank all Web pages: wikipedia.org is (supposedly) more important than myblog.com.
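A minimal in-memory sketch of the inverted index just mentioned (the page contents are invented for the example):

    # Inverted index: for each word, the set of pages containing it.
    pages = {
        "wikipedia.org": "the free encyclopedia that anyone can edit",
        "myblog.com": "my personal blog about free software",
    }

    index = {}
    for url, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)

    # Query: which pages contain the word "free"?
    print(index["free"])  # {'myblog.com', 'wikipedia.org'} (set order may vary)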
Big Data challenges
In this course, we study two main challenges: processing and storage.
- Processing: parallelize the computation across machines; distributed processing frameworks (e.g., Hadoop MapReduce/Spark).
- Storage: distributed file systems; distributed (relational/NoSQL) databases.

Hadoop: Origins

Processing big data
Why is processing Big Data challenging?
- Disk storage capacities have increased rapidly over the years: a typical disk from 1990 could store 1,370 MB of data (cf. Seagate ST-41600n); a typical disk (SSD) today can store 2 TB (cf. Seagate Barracuda 120).
- Disk access speed has increased much more slowly: a typical disk from 1990 had a transfer speed of 4.4 MB/s; a typical SSD today has a transfer speed of 500 MB/s.
- As a result, in 1990 it could take about 5 minutes to read all the data from a disk (1,370 MB / 4.4 MB/s ≈ 311 s), while in 2020 it takes more than one hour (2 TB / 500 MB/s = 4,000 s ≈ 67 min).
MapReduce: Definitions and principles
Functions map() and reduce()
- map : line → sequence of (k, v)
- reduce : (k, L) → output value

Which value v must be associated to a key k? For word count, a key k is a word w, and the result is a pair (w, o_w), where o_w is the number of occurrences of w. + With this choice, we have all we need to define both map and reduce.

[Figure: word-count data flow: input, input splits, map tasks, shuffle (producing, e.g., the partitions (quick, [1, 1]) and (strength, [1])), reduce tasks, output.]
Example: word count

Function map:

    def map(line):
        # Emit a (word, 1) pair for every word in the line.
        for word in line.split():
            yield (word, 1)

Function reduce:

    def reduce(w, L):
        # L is the list of counts collected for word w during the shuffle.
        yield (w, sum(L))

MapReduce implementation: data storage
- For large inputs, we use a cluster of machines (scale out).
- Machines in a cluster are referred to as nodes.
- Data is stored in a distributed file system (e.g., HDFS).
- Each file is split into a set of fixed-size blocks (64 MB or 128 MB).
- Each block is replicated across machines (for reliability).

[Figure: a file with 4 blocks B1-B4; each block is replicated across the racks R1, R2 and R3.]

MapReduce implementation: data flow
[Figure: input splits 0-2 feed map tasks; map outputs are sorted, copied and merged, then reduce tasks write the output partitions part 0 and part 1.]
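To see how map, shuffle and reduce fit together, here is a single-machine simulation of the word-count job; this is a sketch only, with the shuffle emulated by an in-memory dictionary (a real MapReduce run sorts and transfers the map outputs across nodes):

    from collections import defaultdict

    def map_func(line):
        # Same logic as the map function above.
        for word in line.split():
            yield (word, 1)

    def reduce_func(w, L):
        # Same logic as the reduce function above.
        yield (w, sum(L))

    # Two input splits (taken from the example in the figures below).
    splits = ["quick brown fox jump lazy dog",
              "jump vow quick lazy strength lazy"]

    # Shuffle: group all map outputs by key, e.g., lazy -> [1, 1, 1].
    groups = defaultdict(list)
    for split in splits:
        for k, v in map_func(split):
            groups[k].append(v)

    # Reduce: one call per key; (lazy, [1, 1, 1]) yields (lazy, 3).
    for w, L in sorted(groups.items()):
        print(next(reduce_func(w, L)))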
MapReduce: Implementation
[Figure: word-count data flow. The input splits "quick brown fox jump lazy dog" and "jump vow quick lazy strength lazy" are processed by map tasks, which emit pairs such as (lazy, 1); the pairs are sorted, copied and merged, and the reduce tasks write the output partitions part 0 and part 1.]
The combiner
A combiner aggregates the outputs of each map task locally, before they travel over the network: in the word-count example, a map task that emits (lazy, 1) twice sends the single pair (lazy, 2) instead, as the sketch below shows.
+ We can use a combiner only if the function that we want to implement is commutative and associative.

[Figure: the word-count data flow with a combine step between each map task and the merge: one map task's single (lazy, 1) stays (lazy, 1); the other map task's two (lazy, 1) pairs become (lazy, 2).]
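A minimal sketch of the combine step applied to the output of one map task (the input pairs are taken from the figure; summation is commutative and associative, so combining is safe):

    from collections import defaultdict

    # Output of one map task before combining.
    map_output = [("jump", 1), ("vow", 1), ("quick", 1),
                  ("lazy", 1), ("strength", 1), ("lazy", 1)]

    # Combine locally: sum the counts per word before the shuffle, so the
    # two (lazy, 1) pairs travel over the network as a single (lazy, 2).
    combined = defaultdict(int)
    for word, count in map_output:
        combined[word] += count

    print(sorted(combined.items()))
    # [('jump', 1), ('lazy', 2), ('quick', 1), ('strength', 1), ('vow', 1)]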
Apache Spark: Main notions

Main features
- Speed. Runs computations in memory, as opposed to Hadoop, which heavily relies on disks and HDFS.
- General-purpose. Integrates a wide range of workloads that previously required separate distributed systems: batch applications, iterative algorithms, interactive queries, streaming applications.
- Accessibility. Offers APIs in Python, Scala, Java and SQL, as well as rich built-in libraries.
- Integration. Integrates with other Big Data tools, such as Hadoop.

[Figure: the Spark stack: DataFrames/Datasets, Structured Streaming, MLlib and GraphX on top of the Spark Core and Spark SQL engine; APIs in Scala, Python, Java, SQL and R; scheduler and cluster managers: Standalone, Mesos, YARN, Kubernetes.]
Components:
- Spark Core. Computational engine responsible for scheduling, distributing, and monitoring applications.
- Spark SQL. SQL interface to Spark for structured data. Structured data are accessed through DataFrames/Datasets.
- Structured Streaming. Processing of streaming sources; built on top of the Spark SQL engine.

A unified stack:
- Improvements in the bottom layers are automatically reflected in the high-level libraries: optimizations in the Spark Core result in better performance in Spark SQL and MLlib.
- It removes the costs of using different independent systems: deployment, maintenance, testing and support of separate systems (streaming, SQL, machine learning...).
- Stanford DAWN: a research project aiming at democratizing AI.

Who uses Spark? Amazon; eBay (log transaction aggregation and analytics); Groupon; TripAdvisor; Yahoo!
Apache Spark: Spark architecture

Cluster manager
- Responsible for managing and allocating resources in the cluster.

[Figure: a Spark cluster: a driver, the cluster manager, and one Spark executor per worker node, each executor using several CPU cores.]
Apache Spark: Spark application concepts

Launching a Spark application
1. spark-submit launches the application.
2. The driver is launched.
3. The driver asks the cluster manager for resources.
4. The cluster manager schedules the executors.
5. The executors are launched.
6. The executors register with the driver.
7. The driver assigns tasks to the executors.

[Figure: the launch steps on a cluster: a master process running the cluster manager and the driver, and worker processes each hosting a Spark executor with its CPU cores.]

Distributed data and partitions
- Data is distributed across several machines.
- Data is split across different partitions (e.g., HDFS blocks).
- Each Spark core is assigned a partition to work on.
- Data locality principle: partitions are assigned to the closest core.

[Figure: Spark executors, each holding data partitions; files are stored across HDFS, Amazon S3, Azure Blob…]

Apache Spark: Low-level Spark programming

Writing a low-level Spark program
The program accesses Spark through an object called SparkContext.

Initializing the SparkContext:

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(<cluster URL>).setAppName(<app_name>)
    sc = SparkContext(conf=conf)

- A Spark program is a sequence of operations invoked on the SparkContext.
- These operations manipulate a special type of data structure, called a Resilient Distributed Dataset (RDD).
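For instance, a local test configuration (the values are illustrative, not part of the original slides; the master URL local[2] runs Spark on the local machine with two worker threads, and the application name is arbitrary):

    from pyspark import SparkConf, SparkContext

    # Illustrative values: run locally with 2 threads, arbitrary app name.
    conf = SparkConf().setMaster("local[2]").setAppName("TestApp")
    sc = SparkContext(conf=conf)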
Operations on RDDs
- Spark parallelizes the operations invoked on each RDD.
+ A Spark program is a sequence of operations invoked on RDDs.
- An operation can be either a transformation or an action.
+ A transformation never changes the input RDD: it returns a new RDD with the result.

Load data from external storage
The SparkContext offers numerous functions to load data from external sources (e.g., a text file):

    lines = sc.textFile(<path_to_file>)
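A minimal sketch combining loading, a transformation and an action, assuming the SparkContext sc from before (the file path is illustrative, and filter() is presented just below); transformations are lazy, so nothing is computed until the action is invoked:

    # Transformation: lazily defines a new RDD; no data is read yet.
    lines = sc.textFile("data.txt")  # illustrative path
    long_lines = lines.filter(lambda line: len(line) > 80)

    # Action: triggers the computation and returns a value to the driver.
    print(long_lines.count())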
Element-wise transformations
- map(). Takes in a function f and an RDD < x_i | 0 ≤ i ≤ n >; returns a new RDD < f(x_i) | 0 ≤ i ≤ n >.
- filter(). Takes in a predicate p and an RDD < x_i | 0 ≤ i ≤ n >; returns a new RDD < x_i | 0 ≤ i ≤ n, p(x_i) is true >.

Map. Input RDD <1, 2, 3, 4>; map(lambda x: x*x); mapped RDD <1, 4, 9, 16>:

    nums = sc.parallelize([1, 2, 3, 4])
    mapped_rdd = nums.map(lambda x: x*x)

Map (alternative). The same transformation, passing a named function instead of a lambda:

    def power2(x):
        return x*x

    nums = sc.parallelize([1, 2, 3, 4])
    mapped_rdd = nums.map(power2)

Filter. Input RDD <1, 2, 3, 4>; filter(lambda x: x <= 3); filtered RDD <1, 2, 3>:

    nums = sc.parallelize([1, 2, 3, 4])
    filtered_rdd = nums.filter(lambda x: x <= 3)
flatMap(). Applies to each element a function that returns a sequence, and flattens all the resulting sequences into a single RDD. Example (the input lines are reconstructed for the illustration):

    lines = sc.parallelize(["The quick brown fox", "jumps over the lazy dog"])
    words = lines.flatMap(lambda x: x.split(" "))
    # flatMapped RDD:
    # < "The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog" >
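Putting flatMap(), map() and reduceByKey() together gives the classic low-level Spark word count; a sketch assuming the SparkContext sc from before (the input path is illustrative, and reduceByKey(), not covered above, is the pair-RDD transformation that merges the values of each key with the given function):

    lines = sc.textFile("input.txt")  # illustrative path
    counts = (lines.flatMap(lambda line: line.split(" "))  # lines -> words
                   .map(lambda word: (word, 1))            # word -> (word, 1)
                   .reduceByKey(lambda a, b: a + b))       # sum counts per word
    print(counts.collect())  # action: runs the job and returns the pairs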
Notebook
A notebook is available on Google Colab.
+ Select File → Save a copy in Drive to create a copy of the notebook in your Drive and play with it.

References
- White, Tom. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.
- Karau, Holden, et al. Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media, Inc., 2015.