Big Data Overview

ESC Tunis (Afef Bahri)

Plan
• Big data
• The Vs of Big Data
• Types of data
• Facets of data
• Use cases
• Industry 4.0
• Terminology: data in motion, data at rest, and data lake
• Big data platform
• Hadoop ecosystem
23/12/2023
Big Data
• Big Data refers to the non-conventional strategies and innovative technologies that businesses and organizations use to capture, manage, process, and make sense of large volumes of data
• Huge volumes of data: a tsunami of data
• Data of different types and formats
• Data that impacts the business at new and ever-increasing speeds
Big Data
• The challenges:
  • Capturing, transporting, and moving the data
  • Managing - the data, the hardware involved, and the software
  • Processing - to provide insight
  • Storing - safeguarding and securing
Big Data
• Some Big Data terminology:
  • Oceans of data (data at rest) vs. streams of data (data in motion)
  • Data lake (a large storage repository and processing engine)
  • NoSQL (“not only SQL”)
  • Horizontal scalability
Cluster
Scalability: horizontal vs vertical
Enterprise architecture: traditional approach
Data and applications are on different machines
Big Data Architecture
Data and applications are on the same machines (data nodes)
Cluster
A computer cluster
• A group of linked computers working together so closely that in many respects they form a single computer
• The components of a cluster are commonly, but not always, connected to each other through fast local area networks
• Clusters are usually deployed to improve performance and/or availability over that provided by a single computer
Cluster
Advantages
• High availability
• Load balancing: work is redistributed to another computer in the cluster
• Scaling up under increasing load
• Flexibility
CFS
[Figure: a network core connecting Rack 1, Rack 2, and Rack 3]
• Nodes are grouped into racks
• Links:
  • intra-rack link
  • cross-rack link
Clustered file systems
• Clustered file systems (CFSes), e.g., GFS, HDFS, Azure, are widely adopted by enterprises
• A CFS comprises nodes connected via a network
  • Nodes are prone to failure, so data availability is crucial
• CFSes store data with redundancy
  • New data is stored with replication
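The redundancy idea can be sketched in a few lines of Python. This is only an illustration: it assumes a replication factor of 3 and a rack-aware rule modeled on the default placement policy usually described for HDFS (first copy on the writer's node, the other two on a single remote rack). The names `place_replicas`, `surviving_replicas`, and the rack layout are hypothetical.

```python
def place_replicas(racks, writer_node):
    """Return three nodes for one block: the writer's node plus two
    distinct nodes on a single remote rack (illustrative policy only)."""
    # Find the rack that hosts the writer.
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    # Deterministically pick the first remote rack that has two nodes.
    remote_rack = next(
        r for r in sorted(racks) if r != local_rack and len(racks[r]) >= 2
    )
    second, third = sorted(racks[remote_rack])[:2]
    return [writer_node, second, third]


def surviving_replicas(replicas, racks, failed_rack):
    """Replicas still reachable after every node in failed_rack is lost."""
    lost = set(racks[failed_rack])
    return [node for node in replicas if node not in lost]


# Hypothetical three-rack cluster, two nodes per rack.
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
replicas = place_replicas(racks, "n1")
print(replicas)                                      # ['n1', 'n3', 'n4']
print(surviving_replicas(replicas, racks, "rack2"))  # ['n1']
```

Spreading the copies over two racks means that losing any single rack (a cross-rack link or a rack switch, for instance) still leaves at least one readable replica.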
Packaging/Physical Design
• Google data center (circa 2012)
Example of cluster
The Vs of Big Data (3Vs, 4Vs, 5Vs, …)
• Volume
  • Traditional database technology cannot meet the need for efficient data management, i.e. storage and analysis
  • Hence the large-scale adoption of newer tools such as Hadoop and MongoDB
  • Big data volume is measured in zettabytes: 1 zettabyte is equivalent to a trillion gigabytes, and is said to be equivalent to approximately 3 million galaxies of stars
Data has an intrinsic property…it grows and grows
• 90% of the world’s data was created in the last two years
• 80% of the world’s data today is unstructured
• 20% of available data can be processed by traditional systems
• 1 in 2 business leaders don’t have access to data they need
• 83% of CIOs cited BI and analytics as part of their visionary plan
• 5.4X more likely that top performers use business analytics
Growing interconnected & instrumented world
System of Units / Binary System of Units
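As a small worked example of the two unit systems, the following Python sketch compares decimal (SI) prefixes with binary prefixes and checks the "trillion gigabytes" figure for a zettabyte. The dictionary names are illustrative, not from the slides.

```python
# Decimal (SI) units: powers of 10.
SI = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
      "PB": 10**15, "EB": 10**18, "ZB": 10**21}

# Binary units: powers of 2.
BINARY = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40,
          "PiB": 2**50, "EiB": 2**60, "ZiB": 2**70}

# One zettabyte expressed in gigabytes.
gigabytes_per_zettabyte = SI["ZB"] // SI["GB"]
print(gigabytes_per_zettabyte)   # 1000000000000, i.e. one trillion

# The gap between the two systems grows with the prefix.
ratio = BINARY["ZiB"] / SI["ZB"]
print(round(ratio, 3))           # 1.181
```

At the zettabyte scale the binary unit is about 18% larger than the decimal one, so it matters which system a capacity figure uses.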
• Variety
  • Variety refers to the different types of data we can now use. New and innovative big data technology now allows structured and unstructured data to be stored and analyzed together
  • Structured data: bank records, demographic data, inventory databases, etc.
  • Unstructured data: captured content such as images, tweets, or Facebook status updates
• Velocity
  • Velocity refers to the speed at which vast amounts of data are generated, collected, and analyzed
  • The frequency at which specific data is collected from various sources defines the velocity of that data
  • Every day the number of emails, Twitter messages, photos, video clips, etc. increases at lightning speed around the world
  • Big data technology now allows us to analyze the data while it is being generated, without ever putting it into databases
• Veracity
  • Veracity is the quality or trustworthiness of the data
  • Quality, accuracy, and precision are needed
  • Typical problems: data entry errors, wrong references or links
  • It requires determining the right source and form of the data and the right method of analysis
5th V - Value
• The 5th V, Value, is the real purpose of working with Big Data: to obtain business insight
The …Vs
 Volume - how much data is there?
 Velocity - how quickly is the data being created, moved, or
accessed?
 Variety - how many different types of sources are there?
 Veracity - can we trust the data?
 Validity - is the data accurate and correct?
 Viability - is the data relevant to the use case at hand?
 Volatility - how often does the data change?
 Vulnerability - can we keep the data secure?
 Visualization - how can the data be presented to the user?
 Value - can this data produce a meaningful return on
investment?
Types of Big Data
• Structured
  • Data that can be stored and processed in a fixed format, i.e. a schema
• Semi-structured
  • Data that has no formal data-model structure (such as a table definition in a relational DBMS) but nevertheless has organizational properties, like tags and other markers that separate semantic elements, which make it easier to analyze, e.g. XML or JSON
• Unstructured
  • Data that has an unknown form, cannot be stored in an RDBMS, and cannot be analyzed unless it is transformed into a structured format
  • Text files and multimedia content such as images, audio, and video are examples of unstructured data. Unstructured data is growing faster than the other types; experts say that 80 percent of the data in an organization is unstructured
Facets of Data
• In data science and big data, you will come across many different types of data, each of which requires different tools and techniques. The main categories of data are:
  • Structured
  • Unstructured
  • Natural language
  • Machine-generated
  • Graph-based
  • Audio, video, and image
  • Streaming
The 5 Vs and Data
Data at rest vs. data in motion
• Data at rest (data in storage)
  • Refers to data stored in stable destination systems. Data at rest is frequently defined as data that is not in use
    • Data stored in an online database
    • Data stored on disk
    • Database extracts stored online or offline
    • Backups transferred to disk
    • Archives
  • Data at rest is a snapshot of the information that has been collected and stored, ready to be analyzed for decision-making
Data at rest vs. data in motion
• Data in motion
  • A stream of data moving through any kind of network
  • Data actively moving from one location to another, such as across the internet
  • Processing data in motion means analyzing it on the fly, without storing it
  • Systems to analyze this data include IBM Streams
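The difference between the two modes can be sketched in Python. This is an illustration under simple assumptions, not how IBM Streams or any specific product works; `sensor_feed` is a hypothetical stand-in for a network stream.

```python
def batch_average(stored_readings):
    """Data at rest: the full data set is already stored, so read it all at once."""
    return sum(stored_readings) / len(stored_readings)


def streaming_averages(readings):
    """Data in motion: update the result per event, keeping only a count and a sum."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


def sensor_feed():
    """Hypothetical stand-in for a network stream of heart-rate readings."""
    yield from (72, 76, 71, 89)


stored = list(sensor_feed())
print(batch_average(stored))                     # 77.0
print(list(streaming_averages(sensor_feed())))   # [72.0, 74.0, 73.0, 77.0]
```

The streaming version produces an up-to-date answer after every event and never holds the readings themselves, which is the sense in which data is analyzed "without storing it".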
Use cases for a Big Data platform: Healthcare and Life Sciences
• Problem:
  • Vast quantities of real-time information are starting to come from wireless monitoring devices that postoperative patients and those with chronic diseases wear at home and in their daily lives
• How big data analytics can help:
  • Epidemic early warning
  • Intensive care unit and remote monitoring
Use cases for a Big Data platform: Financial Services
• Problem:
  • Managing several petabytes of data, growing at 40-100% per year, under increasing pressure to prevent fraud and complaints to regulators
• How big data analytics can help:
  • Fraud detection
  • Credit issuance
  • Risk management
  • 360° view of the customer
Graph analytics
• Path analysis
• Connectivity analysis
• Community analysis
• Centrality analysis
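Three of these analyses can be illustrated on a toy graph with the Python standard library. The graph and function names are hypothetical; community analysis is left out because it needs a more involved algorithm than fits a short sketch.

```python
from collections import deque

# Hypothetical undirected graph of accounts linked by transactions.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
    "e": {"f"},
    "f": {"e"},
}


def shortest_path(graph, start, goal):
    """Path analysis: breadth-first search for a shortest path."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def components(graph):
    """Connectivity analysis: group nodes into connected components."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        result.append(sorted(group))
    return result


def degree_centrality(graph):
    """Centrality analysis: fraction of the other nodes each node touches."""
    others = len(graph) - 1
    return {node: len(neighbors) / others for node, neighbors in graph.items()}


centrality = degree_centrality(graph)
print(shortest_path(graph, "a", "d"))       # ['a', 'b', 'd']
print(components(graph))                    # [['a', 'b', 'c', 'd'], ['e', 'f']]
print(max(centrality, key=centrality.get))  # b
```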
Industry 4.0
• “Industry 4.0” was the brainchild of the German government, and describes the next phase in manufacturing: a so-called fourth industrial revolution
  • Industry 1.0: Water/steam power
  • Industry 2.0: Electric power
  • Industry 3.0: Computing power
  • Industry 4.0: Internet of Things (IoT) power
• Industrial production in an Industry 4.0 environment is characterized by strong customization of products under conditions of highly flexible (mass) production. The required automation technology is improved by introducing methods of self-optimization, self-configuration, self-diagnosis, cognition, and intelligent support of workers in their increasingly complex work.
IoT and the connected world
 International Data Corporation (IDC) has estimated that
there will be 26 times more connected things than people
in 2020 - perhaps 200 billion
Sensors & software for the automobile
Big Data scenarios span many industries
• Multi-channel customer sentiment and experience analysis
• Detect life-threatening conditions at hospitals in time to intervene
• Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement
• Make risk decisions based on real-time transactional data
• Identify criminals and threats from disparate video, audio, and data feeds
Hadoop
Introduction to Hadoop & the Hadoop
Ecosystem
 Why? When? Where?
 Origins / History
 The Why of Hadoop
 The When of Hadoop
 The Where of Hadoop
 Hadoop Basics
 Comparison with RDBMS
 Hadoop architecture
 MapReduce
 HDFS
 Hadoop Common
What is Hadoop?
• Apache open source software framework for reliable, scalable, distributed computing on massive amounts of data
  • Hides underlying system details and complexities from the user
  • Developed in Java
• Built on the ideas of Google’s MapReduce and Google File System (GFS) technologies
What is Hadoop?
• Consists of 3 sub-projects:
  • MapReduce
  • Hadoop Distributed File System (HDFS)
  • Hadoop Common
• Has a large ecosystem of both open-source and proprietary Hadoop-related projects
  • HBase / ZooKeeper / Avro / etc.
• Meant for heterogeneous commodity hardware
Why & where Hadoop is used / not used
• What Hadoop is good for:
  • Processing massive amounts of data through parallelism
  • A variety of data (structured, unstructured, semi-structured)
  • Running on inexpensive commodity hardware
• What Hadoop is not good for:
  • Processing transactions (random access)
  • Work that cannot be parallelized
  • Low-latency data access
  • Processing lots of small files
  • Intensive calculations with little data
Hadoop / MapReduce timeline
The two key components of Hadoop
 Hadoop Distributed File System = HDFS
 Where Hadoop stores data
 A file system that spans all the nodes in a Hadoop cluster
 It links together the file systems on many local nodes to make
them into one big file system
 MapReduce framework
 How Hadoop understands and assigns work to the nodes
(machines)
Partitioning: sharding
• Data is divided into partitions that can be managed and accessed separately
• Partitioning can
  • Improve scalability
  • Optimize performance
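One common way to split data into partitions is to hash each record's key. The sketch below assumes simple hash-modulo sharding, which is only one of several partitioning schemes; `shard_for`, `partition`, and the sample records are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash (md5 used only for spreading keys)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(records, num_shards):
    """Split (key, value) records into num_shards independent lists."""
    shards = [[] for _ in range(num_shards)]
    for key, value in records:
        shards[shard_for(key, num_shards)].append((key, value))
    return shards


# Hypothetical records keyed by user id.
records = [(f"user-{i}", i * 10) for i in range(12)]
shards = partition(records, num_shards=4)

for index, shard in enumerate(shards):
    print(index, [key for key, _ in shard])
```

Because the same key always maps to the same shard, a lookup only has to touch one partition, and each partition can live on a different node.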
Scaling – Master/Slave Hadoop
Horizontal scalability Hadoop
Requirements for this new approach
• Partial failure support
• Data recoverability
• Component recovery
• Consistency
• Scalability
• Hadoop is based on work done by Google in the late 1990s/early 2000s, specifically on papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004
Core Hadoop concepts
• Applications are written in high-level language code
• Work is performed in a cluster of commodity machines
  • Nodes talk to each other as little as possible
• Data is distributed in advance
  • Bring the computation to the data
• Data is replicated for increased availability and reliability
• Hadoop is fully scalable and fault-tolerant
A large (and growing) Ecosystem
Who uses Hadoop?
An example of a big data platform in practice (IBM)
[Figure: platform zones and their components]
• Ingestion and Real-time Analytic Zone: streaming data, connectors
• Landing and Analytics Sandbox Zone: Hadoop, MapReduce, Hive/HBase, col stores, documents in variety of formats
• Warehousing Zone: enterprise warehouse, data marts
• Analytics and Reporting Zone: BI & reporting, predictive analytics, visualization & discovery
• Metadata and Governance Zone: ETL, MDM, data governance
Appendix
Big Data centers for Massive Parallelism
[Figure: timeline from 2005 to 2015 of systems including MapReduce, Hadoop, Dryad, Hive, DryadLINQ, Spark, Dremel, Pregel, GraphLab, GraphX, Storm, Spark-Streaming, and BlinkDB]
History: the birth of MapReduce and Hadoop
• 2003/2004: Google publishes two white papers, the first on GFS (a distributed file system) and the second on the Map/Reduce paradigm for distributed computing
• 2004: Doug Cutting (archive.org) develops the first version of the framework that will become Hadoop
• 2006: Doug Cutting (now at Yahoo) develops a first usable version of Apache Hadoop to improve the indexing of the search engine
History
• 2008: development is now very mature, and Hadoop is used in several departments at Yahoo
• 2011: Hadoop is now used by many other companies and universities, and the Yahoo cluster comprises 42,000 machines and hundreds of petabytes of storage space
• The name itself is not an acronym: it is the name of a stuffed toy elephant belonging to the son of Hadoop's original author (Doug Cutting)
Map-Reduce
• Parallel programming model (distributed computing framework) for processing large data sets
• Developed by Google:
  • distributes the load across a large number of servers (cluster)
  • almost completely abstracts away the hardware infrastructure
• MapReduce libraries exist in several languages (C++, C#, Erlang, Java, Python, Ruby…)
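The programming model can be illustrated with the classic word count, written here as a single-process Python sketch. It imitates the map, shuffle, and reduce phases in memory and is not the Hadoop API; the function names and input splits are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key."""
    return key, sum(values)


# Hypothetical input splits; in a cluster each would be mapped on a different node.
splits = ["big data big cluster", "data lake data"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 2, 'data': 3, 'cluster': 1, 'lake': 1}
```

Each call to `map_phase` depends only on its own split, and each call to `reduce_phase` only on its own key, which is what lets a framework spread both phases across many servers.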
Map-Reduce
Hadoop
 Hadoop Distributed File System (HDFS)
 where Hadoop stores data
 Nodes: a Hadoop cluster
 MapReduce framework
 Evolving: MR v1, MR v2, etc.
Hadoop
• HDFS: the Hadoop Distributed File System
• HBase: the Hadoop database
• YARN: the architectural center of Hadoop that allows multiple data processing engines such as interactive SQL, real-time streaming, data science and batch ...
• Oozie: a workflow scheduler system to manage Apache Hadoop jobs
• Parquet: a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework
• Pig: a platform for analyzing large data sets
Hadoop
• Snappy: compression for Hadoop
• Solr: search over data stored in HDFS in Hadoop
• Sqoop: transfers bulk data between Apache Hadoop and structured datastores such as relational databases
• ZooKeeper: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
• Knox Gateway (“Knox”): a system to extend the reach of Apache Hadoop services to users outside of a Hadoop cluster without reducing Hadoop security
Hadoop
• Slider: lets you deploy distributed applications across a Hadoop cluster. Slider leverages the YARN ResourceManager to allocate and distribute components of an application across a cluster
• Ambari: the Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters
• Mahout: for creating scalable, performant machine learning applications
Differences between RDBMS and Hadoop/HDFS
Some terminology…to get you started
• 75 Big Data Terms Everyone Should Know (July 2017)
  http://dataconomy.com/2017/07/75-big-data-terms-everyone-know
• But these are just the beginning of a terminological dictionary that you should develop for yourself
Data lake
• A data lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data
  • Also called an “Enterprise Data Hub” or “Data Platform”
• It is a place to store every type of data in its native format, with no fixed limits on account size or file size. ... A data lake is like a large container, very similar to a real lake fed by rivers
Data lake vs. data warehouse
• Data
  • A data warehouse only stores data that has been modeled/structured
  • A data lake stores structured, semi-structured, and unstructured data
• Processing
  • Before we can load data into a data warehouse, we first need to model it. That’s called schema-on-write
  • With a data lake, you just load in the raw data, as-is. That’s called schema-on-read
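The two loading styles can be contrasted in a short Python sketch. The `Warehouse` and `Lake` classes, the `SCHEMA`, and the sample records are hypothetical and held in memory; real systems apply the same idea to tables and files.

```python
import json

# Illustrative schema: field name -> type used to validate and convert.
SCHEMA = {"user_id": int, "amount": float}


def validate(record, schema=SCHEMA):
    """Coerce a record to the schema; raise if a field is missing or malformed."""
    return {field: cast(record[field]) for field, cast in schema.items()}


class Warehouse:
    """Schema-on-write: data is modeled before it is stored."""

    def __init__(self):
        self.rows = []

    def load(self, record):
        self.rows.append(validate(record))  # rejects bad data at load time


class Lake:
    """Schema-on-read: raw data is stored as-is and interpreted at query time."""

    def __init__(self):
        self.blobs = []

    def load(self, raw_text):
        self.blobs.append(raw_text)  # no checks at load time

    def read(self, schema=SCHEMA):
        for blob in self.blobs:
            try:
                yield validate(json.loads(blob), schema)
            except (KeyError, ValueError, TypeError):
                continue  # skip records that do not fit this schema


good = '{"user_id": "7", "amount": "19.5"}'
bad = '{"user_id": "7"}'

lake = Lake()
lake.load(good)
lake.load(bad)
lake_rows = list(lake.read())
print(lake_rows)  # [{'user_id': 7, 'amount': 19.5}]

warehouse = Warehouse()
warehouse.load(json.loads(good))
try:
    warehouse.load(json.loads(bad))
except KeyError:
    print("rejected at write time")
```

The lake accepts both records and pays the cost of interpretation on every read; the warehouse pays it once, at load time, and refuses what does not fit.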
• Storage
  • With big data technologies like Hadoop, the cost of storing data is relatively low compared to a data warehouse
  • There are two key reasons for this:
    • First, Hadoop is open source software, so licensing and community support are free
    • Second, Hadoop is designed to be installed on low-cost commodity hardware
• Agility
  • A data warehouse is a highly structured repository, by definition. It can be very time-consuming to change the structure
  • A data lake lacks the structure of a data warehouse
  • This gives developers and data scientists the ability to easily configure and reconfigure their models, queries, and apps on the fly
  • A data lake is best suited to data scientists
Data lake for Data scientists
Data munging = data wrangling
• Data munging (data wrangling) is the process of transforming and mapping data from one "raw" form into another format, with the intent of making it more appropriate and valuable for analytics
• Example: Python and the pandas library
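A small wrangling pass might look like the sketch below. It uses only the Python standard library rather than pandas to stay dependency-free, and the raw export and the cleaning rules (trim whitespace, normalize case, drop rows with a missing amount, drop duplicates) are assumptions chosen for illustration.

```python
import csv
import io

# Hypothetical raw export: stray whitespace, inconsistent case,
# a missing value, and a duplicate row.
RAW = """name, city ,amount
 Amira,TUNIS, 12.5
karim,Sfax,
 Amira,TUNIS, 12.5
Leila,sousse,7
"""


def wrangle(raw_text):
    """Clean the raw CSV text into typed, de-duplicated records."""
    reader = csv.DictReader(io.StringIO(raw_text))
    seen, cleaned = set(), []
    for row in reader:
        # Normalize header and cell whitespace.
        row = {key.strip(): (value or "").strip() for key, value in row.items()}
        if not row["amount"]:
            continue  # drop rows with a missing amount
        record = (row["name"].title(), row["city"].title(), float(row["amount"]))
        if record not in seen:  # drop exact duplicates
            seen.add(record)
            cleaned.append(record)
    return cleaned


cleaned = wrangle(RAW)
print(cleaned)
# [('Amira', 'Tunis', 12.5), ('Leila', 'Sousse', 7.0)]
```

With pandas the same steps would typically be expressed as column-wise operations on a DataFrame instead of a row loop.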
Data ingestion
• Data ingestion is the transportation of data from its sources to a storage medium
• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
Vocabulary to remember
• Data at rest
• Data in motion
• The Vs of Big Data
• Data lake
• Data ingestion
• Data munging/data wrangling
• Horizontal scalability
• Data node
• Master node
• Rack
• Cluster
• Clustered file system
• GFS, HDFS, Azure: examples of CFSes
