Big Data Overview
2 23/12/2023
Big Data
Big Data refers to non-conventional strategies and
innovative technologies used by businesses and
organizations to capture, manage, process, and make sense
of a large volume of data
3 23/12/2023
Big Data
The challenges:
4 23/12/2023
Big Data
Some terminology of Big Data :
• Horizontal Scalability
5 23/12/2023
Cluster
Scalability: horizontal vs vertical
7 23/12/2023
Enterprise architecture: traditional approach
8 23/12/2023
Big Data Architecture
9 23/12/2023
Cluster
10 23/12/2023
A computer cluster
A group of linked computers working closely together
so that in many respects they form a single computer
11 23/12/2023
Cluster
12 23/12/2023
Advantages
Scalability (handling load increases)
Flexibility
13 23/12/2023
CFS
Network core
14 23/12/2023
Clustered file systems
15 23/12/2023
Packaging/Physical Design
16 23/12/2023
Packaging/Physical Design
18 23/12/2023
The Vs of Big Data
The Vs of Big Data : (3Vs, 4Vs, 5Vs, …)
Volume
20 23/12/2023
Data has an intrinsic property…it grows
and grows
90% of the world’s data was created in the last two years
80% of the world’s data today is unstructured
20% of available data can be processed by traditional systems
1 in 2 business leaders don’t have access to the data they need
83% of CIOs cited BI and analytics as part of their visionary plan
5.4X more likely that top performers use business analytics
21 23/12/2023
Growing interconnected & instrumented
world
22 23/12/2023
System of Units / Binary System of Units
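As a rough illustration of the two systems named in this slide title, the sketch below formats a byte count with decimal (SI) prefixes, where each step is a factor of 1000, and with binary (IEC) prefixes, where each step is a factor of 1024. The function name `format_bytes` is hypothetical and is not taken from the slides.

```python
# Decimal (SI) prefixes step by 1000; binary (IEC) prefixes step by 1024.
SI_PREFIXES = ["B", "kB", "MB", "GB", "TB", "PB", "EB"]
IEC_PREFIXES = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def format_bytes(n_bytes, binary=False):
    """Return a human-readable size (hypothetical helper for illustration)."""
    base = 1024 if binary else 1000
    prefixes = IEC_PREFIXES if binary else SI_PREFIXES
    value = float(n_bytes)
    index = 0
    # Divide by the base until the value fits under the next prefix.
    while value >= base and index < len(prefixes) - 1:
        value /= base
        index += 1
    return f"{value:.2f} {prefixes[index]}"


one_petabyte = 10 ** 15  # 1 PB in the decimal system
print(format_bytes(one_petabyte))               # 1.00 PB
print(format_bytes(one_petabyte, binary=True))  # 909.49 TiB
```

The same number of bytes therefore reads differently depending on the system used, and the gap widens with each prefix.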
23 23/12/2023
The Vs of Big Data : (3Vs, 4Vs, 5Vs, …)
Variety
Variety is defined as the different types of data we can
now use. New and innovative big data technology now
allows structured and unstructured data to be stored and
analyzed together
24 23/12/2023
The Vs of Big Data : (3Vs, 4Vs, 5Vs, …)
Velocity
Velocity refers to the speed at which vast amounts of data are
being generated, collected and analyzed
25 23/12/2023
The Vs of Big Data : (3Vs, 4Vs, 5Vs, …)
Veracity
Veracity is the quality or trustworthiness of the data
26 23/12/2023
5th V - Value
Value is the real purpose of working with Big Data:
obtaining business insight.
27 23/12/2023
The …Vs
Volume - how much data is there?
Velocity - how quickly is the data being created, moved, or
accessed?
Variety - how many different types of sources are there?
Veracity - can we trust the data?
Validity - is the data accurate and correct?
Viability - is the data relevant to the use case at hand?
Volatility - how often does the data change?
Vulnerability - can we keep the data secure?
Visualization - how can the data be presented to the user?
Value - can this data produce a meaningful return on
investment?
28 23/12/2023
Types of Big Data
Structured
Data that can be stored
and processed in a
fixed format, i.e. with a schema
Semi-structured
Data that lacks the formal structure of a data model (such as a table
definition in a relational DBMS) but still has organizational
properties, like tags and other markers that separate semantic elements,
which make it easier to analyze, e.g. XML or JSON
Unstructured
Data that has an unknown form, cannot be stored in an RDBMS, and cannot be
analyzed unless it is transformed into a structured format
Text files and multimedia content such as images, audio, and video are examples of
unstructured data. Unstructured data is growing faster than the other types; experts
say that 80 percent of the data in an organization is unstructured
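The three types above can be sketched with a few lines of Python. This is only an illustration under simple assumptions: the sample records, field names, and sentence are invented, and real unstructured data (images, audio, video) needs far heavier processing than the word count shown here.

```python
import csv
import io
import json

# Structured: every row follows the same fixed schema (id, name, age).
structured_csv = "id,name,age\n1,Alice,30\n2,Bob,25\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: JSON keys act as tags, but records may differ in shape.
semi_structured = (
    '[{"id": 1, "name": "Alice", "tags": ["vip"]},'
    ' {"id": 2, "name": "Bob", "address": {"city": "Paris"}}]'
)
records = json.loads(semi_structured)
all_keys = sorted({key for record in records for key in record})

# Unstructured: free text has no schema; it must be transformed
# (here, into word counts) before it can be analyzed.
unstructured = "Alice called on Monday to complain about a late delivery."
word_counts = {}
for word in unstructured.lower().replace(".", "").split():
    word_counts[word] = word_counts.get(word, 0) + 1

print(rows[0]["name"])   # Alice
print(all_keys)          # ['address', 'id', 'name', 'tags']
print(len(word_counts))  # 10
```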
29 23/12/2023
Facets of Data
In data science and big data, you will come across many
different types of data, and each of them requires different
tools and techniques. The main categories of data are:
Structured
Unstructured
Natural language
Machine-generated
Graph-based
Audio, video, and image
Streaming
30 23/12/2023
5 Vs and Data
31 23/12/2023
Data at rest vs. data in motion
Data at Rest (data in storage)
Refers to data that is being stored in stable destination
systems. Data at rest is frequently defined as data that is
not in use
Data stored in an online database
Data stored on disk
Data stored in online or offline database extracts
Backups transferred to disk
Archives
Data at rest is a snapshot of the information that is
collected and stored, ready to be analyzed for decision-
making
32 23/12/2023
Data at rest vs. data in motion
Data in motion
Stream of data moving through any kind of network
Data actively moving from one location to another such as
across the internet
Data in motion is analyzed on the fly, without being
stored first
Systems to analyze this data include IBM Streams
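To make "on the fly" concrete, here is a minimal sketch in plain Python, not IBM Streams. It assumes a hypothetical generator `sensor_stream` that yields invented readings one at a time, and it keeps only a small sliding window in memory instead of storing the whole stream.

```python
from collections import deque


def sensor_stream():
    """Hypothetical source: yields made-up readings one at a time."""
    for reading in [72, 75, 71, 140, 145, 150, 74]:
        yield reading


def moving_average_alerts(stream, window_size=3, threshold=100):
    """Yield (value, average) whenever the sliding-window mean exceeds threshold."""
    window = deque(maxlen=window_size)  # only the last few values are kept
    for value in stream:
        window.append(value)
        average = sum(window) / len(window)
        if average > threshold:
            yield (value, average)


alerts = list(moving_average_alerts(sensor_stream()))
print(len(alerts))  # 3
print(alerts[1])    # (150, 145.0)
```

Each reading is examined once as it arrives and then discarded when it leaves the window, which is the key difference from analyzing data at rest.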
33 23/12/2023
Use cases for a Big Data platform:
Healthcare and Life Sciences
Problem:
Vast quantities of real-time information are starting to come
from wireless monitoring devices that postoperative patients
and those with chronic diseases are wearing at home and in
their daily lives.
How big data analytics can help:
Epidemic early warning
Intensive Care Unit and remote monitoring
34 23/12/2023
Use cases for a Big Data
platform: Financial Services
Problem:
Manage several petabytes of data, growing at
40-100% per year, under increasing pressure
to prevent fraud and complaints to
regulators
How big data analytics can help:
Fraud detection
Credit issuance
Risk management
360° view of the Customer
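The growth rates quoted in the problem statement above compound quickly. The sketch below works through the arithmetic; the starting volume of 5 PB is an assumption chosen for illustration (the slide only says "several"), and `projected_volume` is a hypothetical helper.

```python
def projected_volume(initial_pb, annual_growth_rate, years):
    """Compound growth: volume after `years` at a fixed yearly rate."""
    return initial_pb * (1 + annual_growth_rate) ** years


initial_pb = 5  # assumed starting volume in petabytes (not from the slides)
for rate in (0.40, 1.00):
    volume = projected_volume(initial_pb, rate, years=3)
    print(f"{rate:.0%} per year -> {volume:.2f} PB after 3 years")
# 40% per year -> 13.72 PB after 3 years
# 100% per year -> 40.00 PB after 3 years
```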
35 23/12/2023
Graph analytics
• Path analysis
• Connectivity analysis
• Community analysis
• Centrality analysis
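Three of these analyses can be sketched on a toy graph in plain Python. The graph below is invented, the function names are hypothetical, and the techniques are deliberately the simplest ones: breadth-first search for path analysis, connected components for connectivity analysis, and degree centrality for centrality analysis. Community detection is left out because it needs more machinery than fits in a short example.

```python
from collections import deque

# Hypothetical undirected graph stored as an adjacency map.
graph = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
    "E": {"F"},
    "F": {"E"},
}


def shortest_path(graph, start, goal):
    """Path analysis: breadth-first search returns a fewest-hops path."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # goal is not reachable from start


def connected_components(graph):
    """Connectivity analysis: group nodes that can reach each other."""
    seen = set()
    components = []
    for node in sorted(graph):
        if node in seen:
            continue
        component = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        components.append(component)
    return components


def degree_centrality(graph):
    """Centrality analysis: fraction of the other nodes each node touches."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


print(shortest_path(graph, "A", "D"))  # ['A', 'C', 'D']
print(len(connected_components(graph)))  # 2
print(degree_centrality(graph)["C"])   # 0.6
```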
36 23/12/2023
Industry 4.0
“Industry 4.0” was the brainchild of the German
government, and describes the next phase in
manufacturing - a so-called fourth industrial revolution
Industry 1.0: Water/steam power
Industry 2.0: Electric power
Industry 3.0: Computing power
Industry 4.0: Internet of Things (IoT) power
Industrial production in an Industry 4.0 environment is characterized
by strong customization of products under the conditions of
highly flexible (mass) production. The required automation
technology is improved by the introduction of methods of
self-optimization, self-configuration, self-diagnosis, cognition and
intelligent support of workers in their increasingly complex work.
37 23/12/2023
IoT and the connected world
International Data Corporation (IDC) has estimated that
there will be 26 times more connected things than people
in 2020 - perhaps 200 billion
38 23/12/2023
Sensors & software for the automobile
39 23/12/2023
Big Data scenarios span many industries
Multi-channel customer
sentiment and
experience analysis
Detect life-threatening
conditions at hospitals
in time to intervene
42 23/12/2023
What is Hadoop?
An Apache open-source software framework for reliable,
scalable, distributed computing over massive amounts of data
Written in Java
43 23/12/2023
What is Hadoop?
Consists of 3 sub projects:
MapReduce
Hadoop Distributed File System (HDFS)
Hadoop Common
44 23/12/2023
Why & where Hadoop is used / not used
What Hadoop is good for:
Massive amounts of data through
parallelism
A variety of data (structured, unstructured,
semi-structured)
Inexpensive commodity hardware
Hadoop is not good for:
Processing transactions (random access)
Work that cannot be parallelized
Low-latency data access
Processing lots of small files
Intensive calculations on little data
45 23/12/2023
Hadoop / MapReduce timeline
46 23/12/2023
The two key components of Hadoop
Hadoop Distributed File System = HDFS
Where Hadoop stores data
A file system that spans all the nodes in a Hadoop cluster
It links together the file systems on many local nodes to make
them into one big file system
MapReduce framework
How Hadoop understands and assigns work to the nodes
(machines)
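The MapReduce idea can be sketched in a few lines of Python. This is an in-memory simulation on a single machine, not the Hadoop API: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical, and in a real cluster the map and reduce calls would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in document.lower().split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


# Each string stands in for a block of a file stored on a different node.
documents = ["big data is big", "data is everywhere"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle_phase(mapped)
word_counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each map call looks at only one split and each reduce call at only one key, the work can be spread across many machines without coordination between tasks.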
47 23/12/2023
Partitioning: sharding
Data is divided into partitions that can be managed and
accessed separately
Partitioning can
Improve scalability
Optimize performance.
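One common way to partition data is by hashing a key, sketched below. This is an assumption for illustration, since the slide does not say which scheme is meant; the field name `customer_id` and the helper names are hypothetical. `zlib.crc32` is used rather than the built-in `hash` so that the shard assignment is stable between runs.

```python
import zlib


def shard_for(key, num_shards):
    """Map a key to a shard index with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def partition(records, num_shards):
    """Split records into independent shards by their customer_id."""
    shards = [[] for _ in range(num_shards)]
    for record in records:
        shards[shard_for(record["customer_id"], num_shards)].append(record)
    return shards


# Invented sample data: several orders per customer.
records = [
    {"customer_id": f"cust-{i % 5}", "order": i} for i in range(20)
]
shards = partition(records, num_shards=3)

# Every record with the same key lands in the same shard, so a lookup
# for one customer only needs to touch one partition.
print([len(shard) for shard in shards])
```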
48 23/12/2023
Scaling – Master/Slave Hadoop
49 23/12/2023
Horizontal scalability Hadoop
50 23/12/2023
Requirements for this new approach
Partial Failure Support
Data Recoverability
Component Recovery
Consistency
Scalability
51 23/12/2023
Core Hadoop concepts
Applications are written in high-level language code
53 23/12/2023
Who uses Hadoop?
54 23/12/2023
An example of a big data platform in
practice (IBM)
[Figure: IBM big data platform. Zones: Ingestion and Real-time Analytic Zone; Warehousing Zone; Analytics and Reporting Zone. Components shown: Streaming Data, Connectors, Hadoop, Documents in a variety of formats, Enterprise Warehouse, BI & Reporting, Predictive Analytics, ETL, MDM, Data Governance]
57 23/12/2023
History: the birth of MapReduce
and Hadoop
2003/2004: Google publishes two white papers,
the first on GFS (a distributed file system) and the
second on the Map/Reduce paradigm for distributed
computing
2004: Doug Cutting (archive.org) develops the first
version of the framework that will become Hadoop
2006: Doug Cutting (now at Yahoo) develops
a first usable version of Apache Hadoop to
improve the indexing of the search engine
58 23/12/2023
History
2008: development is now very mature; Hadoop is
used at Yahoo in several departments
59 23/12/2023
Map-Reduce
A parallel programming model (distributed computing
framework) for processing large data sets
Developed by Google:
61 23/12/2023
Hadoop
Hadoop Distributed File System (HDFS)
where Hadoop stores data
Nodes: a Hadoop cluster
MapReduce framework
62 23/12/2023
Hadoop
HDFS: the Hadoop Distributed File System
HBase: the Hadoop database
YARN: the architectural center of Hadoop that allows multiple
data processing engines such as interactive SQL, real-time
streaming, data science and batch ...
Oozie: a workflow scheduler system to manage Apache
Hadoop jobs
Parquet: a columnar storage format available to any project in
the Hadoop ecosystem, regardless of the choice of data
processing framework
Pig: a platform for analyzing large data sets
63 23/12/2023
Hadoop
Snappy: compression for Hadoop
64 23/12/2023
Hadoop
Slider lets you deploy distributed applications across a Hadoop cluster.
Slider leverages the YARN ResourceManager to allocate and distribute
components of an application across a cluster
65 23/12/2023
Differences between RDBMS and
Hadoop/HDFS
66 23/12/2023
Some terminology…to get you started
75 Big Data Terms Everyone Should Know (July 2017)
http://dataconomy.com/2017/07/75-big-data-terms-everyone-know
67 23/12/2023
Data lake
A data lake is a storage repository that can store a large
amount of structured, semi-structured, and
unstructured data
68 23/12/2023
Data lake Vs data warehouse
Data.
A data warehouse only stores data that has been
modeled/structured
A data lake stores structured, semi-structured, and unstructured data
Processing
Before data can be loaded into a data warehouse, it must first be
modeled. This is called schema-on-write
With a data lake, you just load in the raw data as-is and apply a
structure only when the data is read. This is called schema-on-read
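The difference between the two approaches can be sketched as follows. This is a toy model under stated assumptions: the "warehouse" and "lake" are plain Python lists, and the schema, field names, and helper functions are all hypothetical.

```python
import json

# Assumed schema for the toy warehouse table.
SCHEMA = {"id": int, "amount": float}


def write_to_warehouse(table, record):
    """Schema-on-write: reject anything that does not match the model."""
    if set(record) != set(SCHEMA):
        raise ValueError(f"unexpected fields: {sorted(record)}")
    for field, expected_type in SCHEMA.items():
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    table.append(record)


def write_to_lake(lake, raw_line):
    """Data lake: store the raw input as-is, with no validation."""
    lake.append(raw_line)


def read_from_lake(lake):
    """Schema-on-read: impose a structure only when the data is consumed."""
    for raw_line in lake:
        try:
            data = json.loads(raw_line)
            yield {"id": int(data["id"]), "amount": float(data["amount"])}
        except (ValueError, KeyError, TypeError):
            continue  # skip lines that cannot be interpreted with this schema


warehouse = []
write_to_warehouse(warehouse, {"id": 1, "amount": 9.5})

lake = []
write_to_lake(lake, '{"id": "2", "amount": "3.5", "note": "raw"}')
write_to_lake(lake, "not json at all")

print(list(read_from_lake(lake)))  # [{'id': 2, 'amount': 3.5}]
```

Loading into the lake never fails, but the cost of interpreting the data moves to every reader; the warehouse pays that cost once, at write time.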
69 23/12/2023
Data lake Vs data warehouse
Storage. One of the primary features of big data
technologies like Hadoop is that
the cost of storing data is relatively low as compared to the data
warehouse.
There are two key reasons for this:
70 23/12/2023
Data lake Vs data warehouse
Agility. A data warehouse is a highly-structured
repository, by definition. It can be very time-consuming to
change the structure
71 23/12/2023
Data lake for Data scientists
72 23/12/2023
Data munging (also called data wrangling)
Data munging/Data wrangling is the process of
transforming and mapping data from one
"raw" data form into another format with the intent of
making it more appropriate and valuable for analytics
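A small wrangling step might look like the sketch below. The raw rows, the `;` separator, and the `wrangle` helper are invented for illustration; real pipelines typically deal with many more kinds of inconsistency.

```python
# Invented raw input: inconsistent spacing, casing, and missing values.
raw_rows = [
    "  Alice ; 30 ; paris ",
    "BOB;;London",
    "carol;41;",
]


def wrangle(raw_row):
    """Turn one raw 'name;age;city' line into a clean, typed record."""
    name, age, city = [field.strip() for field in raw_row.split(";")]
    return {
        "name": name.title(),
        "age": int(age) if age else None,      # empty string becomes None
        "city": city.title() if city else None,
    }


clean_rows = [wrangle(row) for row in raw_rows]
for row in clean_rows:
    print(row)
# {'name': 'Alice', 'age': 30, 'city': 'Paris'}
# {'name': 'Bob', 'age': None, 'city': 'London'}
# {'name': 'Carol', 'age': 41, 'city': None}
```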
73 23/12/2023
Data ingestion
Is the transportation of data from sources to a storage