
Big data overview

ESC Tunis (Afef Bahri)


Plan
 Big data
 The V’s of Big Data
 Types of data
 Facets of data
 Use cases
 Industry 4.0
 Terminology: data in motion, data at rest, and data lake
 Big data Platform
 Hadoop ecosystem

2 23/12/2023
Big Data
 Big Data refers to non-conventional strategies and
innovative technologies used by businesses and
organizations to capture, manage, process, and make sense
of a large volume of data

 Huge volumes of data


 A tsunami of data
 Data of different types and formats
 Impacting the business at new and ever increasing speeds

Big Data
 The challenges:

 Capturing, transporting, and moving the data

 Managing - the data, the hardware involved, and the software

 Processing - to provide insight

 Storing - safeguarding and securing

Big Data
 Some terminology of Big Data :

• Oceans of data (data at rest) vs. streams of data (data in motion)

• Data Lake (a large storage repository and processing engine)

• NoSQL (“not only SQL”)

• Horizontal Scalability

Cluster
Scalability: horizontal vs vertical

Enterprise architecture: traditional approach

Data and applications are on different machines

Big Data Architecture

Data and applications are on the same machines (data nodes)

Cluster

A computer Cluster
 A group of linked computers working together closely, so that in many respects they form a single computer

 The components of a cluster are commonly, but not always, connected to each other through fast local area networks

 Clusters are usually deployed to improve performance and/or availability over that provided by a single computer

Cluster

Advantages

 High availability

 Load balancing: work is redistributed to another computer in the cluster

 Scalability (handling increased load)

 Flexibility

CFS

Network core

Rack 1 Rack 2 Rack 3

 Nodes are grouped into racks


 Links:
 intra-rack link
 cross-rack link

Clustered file systems

 Clustered file systems (CFSes) (e.g., GFS, HDFS, Azure) are widely adopted by enterprises

 A CFS comprises nodes connected via a network

 Nodes are prone to failures, so data availability is crucial

 CFSes store data with redundancy

 New data is stored with replication
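The idea of storing replicas on nodes grouped into racks can be sketched as follows. This is a minimal illustration, not the placement policy of any particular CFS: the cluster layout, the node names, and the rule "one replica on one rack, the rest on a second rack" are all assumptions made for the example.

```python
import random

# Hypothetical cluster layout (rack name -> node names); not taken from the slides.
CLUSTER = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
    "rack3": ["n7", "n8", "n9"],
}


def place_replicas(cluster, replication=3, seed=None):
    """Pick `replication` distinct nodes for one block, spread over two racks."""
    rng = random.Random(seed)
    first_rack, second_rack = rng.sample(list(cluster), 2)
    # One replica on the first rack, the remaining ones on a different rack,
    # so that losing any single rack never loses every copy.
    first = rng.choice(cluster[first_rack])
    others = rng.sample(cluster[second_rack], replication - 1)
    return [first] + others


def surviving_replicas(replicas, failed_nodes):
    """Return the replicas that are still readable after some nodes fail."""
    return [node for node in replicas if node not in failed_nodes]


replicas = place_replicas(CLUSTER, replication=3, seed=42)
```

With this placement, a whole-rack failure (for example a failed cross-rack link) still leaves at least one readable copy of the block.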

Packaging/Physical Design

Packaging/Physical Design

 Google data center (circa 2012)


Example of cluster

The Vs of Big Data
The Vs of Big Data : (3Vs, 4Vs, 5Vs, …)
 Volume

 Traditional database technology cannot meet the need for efficient data management, i.e., storage and analysis.

 This calls for large-scale adoption of newer tools such as Hadoop and MongoDB.

 Big data volume is measured in zettabytes; 1 zettabyte is equivalent to a trillion gigabytes. 1 zettabyte is equivalent to approximately 3 million galaxies of stars
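The "trillion gigabytes" figure can be checked with a few lines of arithmetic. The sketch below also contrasts decimal (SI) prefixes with binary (IEC) prefixes; the `convert` helper and the dictionary names are illustrative choices, not part of the slides.

```python
# Decimal (SI) prefixes use powers of 1000; binary (IEC) prefixes use powers of 1024.
SI = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
      "PB": 10**15, "EB": 10**18, "ZB": 10**21}
IEC = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40,
       "PiB": 2**50, "EiB": 2**60, "ZiB": 2**70}


def convert(value, from_unit, to_unit):
    """Convert a quantity between any two of the units defined above."""
    units = {**SI, **IEC}
    return value * units[from_unit] / units[to_unit]


# 1 ZB = 10^21 bytes and 1 GB = 10^9 bytes, so 1 ZB = 10^12 GB (a trillion).
gigabytes_per_zettabyte = SI["ZB"] // SI["GB"]
```

Note that a binary zebibyte (2^70 bytes) is about 18% larger than a decimal zettabyte, which is why the two systems of units should not be mixed.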

Data has an intrinsic property…it grows
and grows
 90% of the world’s data was created in the last two years
 80% of the world’s data today is unstructured
 20% of available data can be processed by traditional systems
 1 in 2 business leaders don’t have access to data they need
 83% of CIOs cited BI and analytics as part of their visionary plan
 5.4X more likely that top performers use business analytics
Growing interconnected & instrumented
world

System of Units / Binary System of Units

The Vs of Big Data : (3Vs, 4Vs, 5Vs, …)
 Variety
 Variety is defined as the different types of data we can now use. New and innovative big data technology now allows structured and unstructured data to be stored and used together

 Structured data includes bank records, demographic data, and inventory databases

 Unstructured data includes captured content such as images, tweets, or Facebook status updates

The Vs of Big Data : (3Vs, 4Vs, 5Vs, …)
 Velocity
 Velocity refers to the speed at which vast amounts of data are
being generated, collected and analyzed

 The frequency at which data is collected from various sources defines the velocity of that data
 Every day the number of emails, Twitter messages, photos, video clips, etc. increases at lightning speed around the world

 Big data technology now allows us to analyze the data while it is being generated, without ever putting it into databases

The Vs of Big Data : (3Vs, 4Vs, 5Vs, …)
 Veracity
 Veracity is the quality or trustworthiness of the data

 Quality, accuracy, and precision are needed

 Threats include data entry errors and wrong references or links

 It requires determining the right source and form of the data and the right method of analysis

5th V - Value
 5th V - Value: the real purpose of working with Big Data is to obtain business insight.

The …Vs
 Volume - how much data is there?
 Velocity - how quickly is the data being created, moved, or
accessed?
 Variety - how many different types of sources are there?
 Veracity - can we trust the data?
 Validity - is the data accurate and correct?
 Viability - is the data relevant to the use case at hand?
 Volatility - how often does the data change?
 Vulnerability - can we keep the data secure?
 Visualization - how can the data be presented to the user?
 Value - can this data produce a meaningful return on
investment?
Types of Big Data
 Structured
 Data that can be stored and processed in a fixed format, aka schema
 Semi-structured
 Data that does not follow the formal structure of a data model (i.e., a table definition in a relational DBMS) but still has some organizational properties, such as tags and other markers that separate semantic elements and make it easier to analyze, e.g., XML or JSON
 Unstructured
 Data that has an unknown form, cannot be stored in an RDBMS, and cannot be analyzed unless it is transformed into a structured format
 Text files and multimedia content such as images, audio, and video are examples of unstructured data. Unstructured data is growing faster than the other types; experts say that 80 percent of the data in an organization is unstructured
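The three types can be contrasted with a small sketch. All sample values below (account numbers, user name, tweet text) are made up for illustration; only the Python standard library is used.

```python
import csv
import io
import json
import re

# Structured: every row follows a fixed schema (column names known in advance).
structured_source = io.StringIO("account_id,balance\n1001,250.75\n1002,13.00\n")
rows = list(csv.DictReader(structured_source))

# Semi-structured: no table definition, but keys and nesting describe the data.
semi_structured = json.loads(
    '{"user": "amal", "tags": ["big data", "hadoop"], "profile": {"city": "Tunis"}}'
)

# Unstructured: free text; some structure must be extracted before analysis.
text = "Loved the new phone! Battery lasts 2 days. #happy #phone"
hashtags = re.findall(r"#(\w+)", text)
```

The structured rows can be queried by column name immediately, the JSON document by key, while the free text only becomes analyzable after an extraction step such as the hashtag pattern above.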
Facets of Data
 In data science and big data, you will come across many different types of data, each of which requires different tools and techniques. The main categories of data are:
 Structured
 Unstructured
 Natural language
 Machine-generated
 Graph-based
 Audio, video, and image
 Streaming

The 5 Vs and data

Data at rest vs. data in motion
 Data at Rest (data in storage)
 Refers to data that is being stored in stable destination
systems. Data at rest is frequently defined as data that is
not in use
 Data stored in an online database
 Data stored on disk
 Data stored in online or offline database extracts
 Backups transferred to disk
 Archives
 Data at rest is a snapshot of the information that is collected and stored, ready to be analyzed for decision-making
Data at rest vs. data in motion
 Data in motion
 Stream of data moving through any kind of network
 Data actively moving from one location to another such as
across the internet
 Processing data in motion means analyzing it on the fly, without storing it first
 Systems to analyze this data include IBM Streams
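Analyzing data on the fly can be sketched with a Python generator that keeps only a small window of recent values instead of storing the whole stream. This is a generic illustration and does not use the API of any streaming product; the readings and the window size are hypothetical.

```python
from collections import deque


def sliding_average(stream, window=3):
    """Yield the average of the last `window` readings as each reading arrives."""
    recent = deque(maxlen=window)  # older readings are discarded automatically
    for reading in stream:
        recent.append(reading)
        yield sum(recent) / len(recent)


def sensor_readings():
    """Stand-in for a live feed (hypothetical heart-rate values)."""
    yield from [72, 75, 71, 110, 115]


averages = list(sliding_average(sensor_readings(), window=3))
```

Each result is available as soon as its reading arrives, and memory use stays constant no matter how long the stream runs.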

Use cases for a Big Data platform:
Healthcare and Life Sciences
 Problem:
 Vast quantities of real-time information are starting to come
from wireless monitoring devices that postoperative patients
and those with chronic diseases are wearing at home and in
their daily lives.
 How big data analytics can help:
 Epidemic early warning
 Intensive Care Unit and remote monitoring

Use cases for a Big Data
platform: Financial Services
 Problem:
 Managing several petabytes of data, growing at 40-100% per year, under increasing pressure to prevent fraud and complaints to regulators
 How big data analytics can help:
 Fraud detection
 Credit issuance
 Risk management
 360° view of the Customer

Graph analytics
 Path analysis
 Connectivity analysis
 Community analysis
 Centrality analysis
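Three of these analyses can be sketched on a toy graph in plain Python. The graph is hypothetical, degree centrality is just one of several centrality measures, and community analysis is left out because it needs more machinery than fits in a short example.

```python
from collections import deque

# Hypothetical undirected graph stored as an adjacency list.
GRAPH = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
    "E": {"F"},
    "F": {"E"},
}


def shortest_path(graph, start, goal):
    """Path analysis: breadth-first search for a path with the fewest hops."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # goal is not reachable from start


def connected_components(graph):
    """Connectivity analysis: group nodes that can reach each other."""
    remaining = set(graph)
    components = []
    while remaining:
        start = remaining.pop()
        component = {start}
        frontier = [start]
        while frontier:
            node = frontier.pop()
            for neighbor in graph[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    frontier.append(neighbor)
        remaining -= component
        components.append(component)
    return components


def degree_centrality(graph):
    """Centrality analysis: share of the other nodes each node is linked to."""
    others = len(graph) - 1
    return {node: len(neighbors) / others for node, neighbors in graph.items()}
```

On this graph, `shortest_path(GRAPH, "A", "D")` goes through `C`, which is also the node with the highest degree centrality.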

Industry 4.0
 “Industry 4.0” was the brainchild of the German
government, and describes the next phase in
manufacturing - a so-called fourth industrial revolution
 Industry 1.0: Water/steam power
 Industry 2.0: Electric power
 Industry 3.0: Computing power
 Industry 4.0: Internet of Things (IoT) power
 Industrial production in an Industry 4.0 environment is characterized by strong customization of products under the conditions of highly flexible (mass) production. The required automation technology is improved by introducing methods of self-optimization, self-configuration, self-diagnosis, cognition, and intelligent support of workers in their increasingly complex work.

IoT and the connected world
 International Data Corporation (IDC) estimated that there would be 26 times more connected things than people in 2020 - perhaps 200 billion

Sensors & software for the automobile

Big Data scenarios span many industries

Multi-channel customer sentiment and experience analysis

Detect life-threatening conditions at hospitals in time to intervene

Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement

Make risk decisions based on real-time transactional data

Identify criminals and threats from disparate video, audio, and data feeds
Hadoop
Introduction to Hadoop & the Hadoop
Ecosystem
 Why? When? Where?
 Origins / History
 The Why of Hadoop
 The When of Hadoop
 The Where of Hadoop
 Hadoop Basics
 Comparison with RDBMS
 Hadoop architecture
 MapReduce
 HDFS
 Hadoop Common

What is Hadoop?
 Apache open source software framework for reliable, scalable, distributed computing on massive amounts of data

 Hides underlying system details and complexities from the user

 Developed in Java

 Uses Google’s MapReduce and Google File System (GFS) technologies as its foundation

What is Hadoop?
 Consists of 3 subprojects:
 MapReduce
 Hadoop Distributed File System (aka HDFS)
 Hadoop Common

 Has a large ecosystem with both open-source & proprietary Hadoop-related projects
 HBase / ZooKeeper / Avro / etc.

 Meant for heterogeneous commodity hardware

Why & where Hadoop is used / not used
 What Hadoop is good for:
 Processing massive amounts of data through parallelism
 A variety of data (structured, unstructured, semi-structured)
 Inexpensive commodity hardware
 What Hadoop is not good for:
 Processing transactions (random access)
 Work that cannot be parallelized
 Low-latency data access
 Processing lots of small files
 Intensive calculations with little data
Hadoop / MapReduce timeline

The two key components of Hadoop
 Hadoop Distributed File System = HDFS
 Where Hadoop stores data
 A file system that spans all the nodes in a Hadoop cluster
 It links together the file systems on many local nodes to make
them into one big file system

 MapReduce framework
 How Hadoop understands and assigns work to the nodes
(machines)
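How one large file can span many nodes may be sketched as follows. The 128 MB block size, the node names, and the round-robin placement are assumptions chosen for illustration; real HDFS placement also handles replication and rack awareness.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that together cover the whole file."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]


def assign_blocks(blocks, nodes):
    """Place blocks on nodes in round-robin order (a deliberate simplification)."""
    placement = {node: [] for node in nodes}
    for index, block in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(block)
    return placement


# A hypothetical 300 MB file cut into 128 MB blocks and spread over three nodes.
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = assign_blocks(blocks, ["node1", "node2", "node3"])
```

Because each node holds a different block, a processing framework can assign one task per block and run those tasks in parallel on the nodes that store them.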

Partitioning: sharding
 Data is divided into partitions that can be managed and
accessed separately

 Partitioning can
 Improve scalability
 Optimize performance.
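One common way to divide data into partitions is to hash a record key. The sketch below is a minimal example of that idea; the `customer_id` field, the record contents, and the choice of four shards are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a record key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash(), whose result for strings
    changes between Python processes.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(records, num_shards, key_field="customer_id"):
    """Distribute records into num_shards lists according to their key."""
    shards = [[] for _ in range(num_shards)]
    for record in records:
        shards[shard_for(record[key_field], num_shards)].append(record)
    return shards


records = [{"customer_id": f"c{i}", "amount": i * 10} for i in range(100)]
shards = partition(records, num_shards=4)
```

A lookup for one customer only has to touch the shard returned by `shard_for`, which is what lets each partition be managed and accessed separately.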

Scaling – Master/Slave Hadoop

Horizontal scalability Hadoop

Requirements for this new approach
 Partial Failure Support
 Data Recoverability
 Component Recovery
 Consistency
 Scalability

 Hadoop is based on work done by Google in the late 1990s/early 2000s, specifically on papers describing the Google File System (GFS) (published in 2003) and MapReduce (published in 2004)

Core Hadoop concepts
 Applications are written in high-level language code

 Work is performed in a cluster of commodity machines


 Nodes talk to each other as little as possible

 Data is distributed in advance


 Bring the computation to the data

 Data is replicated for increased availability and reliability

 Hadoop is fully scalable and fault-tolerant


A large (and growing) Ecosystem

Who uses Hadoop?

An example of a big data platform in
practice (IBM)
[Figure: IBM big data platform architecture. Zones: Ingestion and Real-time Analytic Zone; Warehousing Zone; Analytics and Reporting Zone; Landing and Analytics Sandbox Zone; Metadata and Governance Zone. Components shown: Streaming Data, Connectors, Enterprise Warehouse, Data Marts, BI & Reporting, Predictive Analytics, Visualization & Discovery, Hadoop, MapReduce, Hive/HBase, Col Stores, Documents in variety of formats, ETL, MDM, Data Governance.]

Appendix
Big Data centers for Massive Parallelism
[Figure: timeline from 2005 to 2015 of parallel data processing systems: MapReduce, Hadoop, Dryad, Hive, DryadLINQ, Spark, Dremel, Pregel, GraphLab, GraphX, Storm, Spark-Streaming, BlinkDB]

History: the birth of MapReduce
and Hadoop
 2003/2004: Google publishes two white papers, the first on GFS (a distributed file system) and the second on the Map/Reduce paradigm for distributed computing
 2004: Doug Cutting (archive.org) develops the first version of the framework that would become Hadoop
 2006: Doug Cutting (now at Yahoo) develops a first usable version of Apache Hadoop to improve search engine indexing

History
 2008: development is now very mature, and Hadoop is used in several departments at Yahoo

 2011: Hadoop is now used by many other companies and universities, and the Yahoo cluster comprises 42,000 machines and hundreds of petabytes of storage

 The name itself is not an acronym: it is the name of a stuffed toy elephant belonging to the son of Hadoop's original author (Doug Cutting)

Map-Reduce
 Parallel programming model (distributed computing framework) for processing large data sets
 Developed by Google:

 distributes the load across a large number of servers (cluster)

 almost completely abstracts away the hardware infrastructure

 MapReduce libraries exist in several languages (C++, C#, Erlang, Java, Python, Ruby…)
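The programming model can be illustrated with the classic word count, written here as a single-process Python sketch. It imitates the map, shuffle, and reduce phases in memory and does not use Hadoop or any MapReduce library; the function names and sample documents are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


documents = ["big data is big", "hadoop stores big data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster each `map_phase` call would run on the node holding that piece of input and the reducers would run in parallel per key, but the programmer still only writes the two functions.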
Map-Reduce

Hadoop
 Hadoop Distributed File System (HDFS)
 where Hadoop stores data
 Nodes: a Hadoop cluster

 MapReduce framework

 Evolving: MR v1, MR v2, etc.

Hadoop
 HDFS: the Hadoop Distributed File System
 HBase: the Hadoop database
 YARN: the architectural center of Hadoop, which allows multiple data processing engines such as interactive SQL, real-time streaming, data science, and batch ...
 Oozie: a workflow scheduler system to manage Apache Hadoop jobs
 Parquet: a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework
 Pig: a platform for analyzing large data sets

Hadoop
 Snappy: compression for Hadoop

 Solr: search over data stored in HDFS in Hadoop

 Sqoop: transfers bulk data between Apache Hadoop and structured datastores such as relational databases

 ZooKeeper: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

 Knox Gateway (“Knox”): a system that extends the reach of Apache™ Hadoop services to users outside of a Hadoop cluster without reducing Hadoop security

Hadoop
 Slider lets you deploy distributed applications across a Hadoop cluster.
Slider leverages the YARN ResourceManager to allocate and distribute
components of an application across a cluster

 Ambari: the Apache Ambari project aims to make Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters

 Mahout: for creating scalable, performant machine learning applications

Differences between RDBMS and
Hadoop/HDFS

Some terminology…to get you started
 75 Big Data Terms Everyone Should Know (July 2017)
http://dataconomy.com/2017/07/75-big-data-terms-everyone-know

 But these are just the beginning of a terminological dictionary that you should develop for yourself

Data lake
 A data lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data

 Also called “Enterprise Data Hub” or “Data Platform”

 It is a place to store every type of data in its native format, with no fixed limits on account size or file size. ... A data lake is like a large container, very similar to real lakes and rivers

Data lake vs. data warehouse
 Data
 A data warehouse only stores data that has been modeled/structured
 A data lake stores structured, semi-structured, and unstructured data

 Processing
 Before we can load data into a data warehouse, we first need to model it. That’s called schema-on-write

 With a data lake, you just load in the raw data, as-is. That’s called schema-on-read
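The difference between the two loading styles can be sketched in a few lines. The schema, the field names, and the sample records below are hypothetical, and the lists standing in for a warehouse table and a lake are simplifications.

```python
import json

# Hypothetical warehouse table definition: column name -> required type.
SCHEMA = {"id": int, "amount": float}


def load_schema_on_write(raw_records, schema):
    """Warehouse style: validate and convert before storing; reject bad rows."""
    table, rejected = [], []
    for raw in raw_records:
        try:
            record = json.loads(raw)
            table.append({col: cast(record[col]) for col, cast in schema.items()})
        except (KeyError, TypeError, ValueError):
            rejected.append(raw)
    return table, rejected


def load_schema_on_read(raw_records):
    """Lake style: store every record untouched, whatever its shape."""
    return list(raw_records)


def query_amounts(lake):
    """Apply structure only at read time, skipping records that do not fit."""
    for raw in lake:
        try:
            yield float(json.loads(raw)["amount"])
        except (KeyError, TypeError, ValueError):
            continue


raw = ['{"id": 1, "amount": "9.5"}', '{"id": 2}', "not json at all"]
table, rejected = load_schema_on_write(raw, SCHEMA)
lake = load_schema_on_read(raw)
```

The warehouse load refuses two of the three records up front, while the lake keeps all three and leaves it to each query to decide what it can use.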

Data lake vs. data warehouse
 Storage. One of the primary features of big data technologies like Hadoop is that the cost of storing data is relatively low compared to a data warehouse.
 There are two key reasons for this:

 First, Hadoop is open source software, so licensing and community support are free.

 Second, Hadoop is designed to be installed on low-cost commodity hardware

Data lake vs. data warehouse
 Agility. A data warehouse is a highly structured repository, by definition. It can be very time-consuming to change the structure

 A data lake lacks the structure of a data warehouse

 This gives developers and data scientists the ability to easily configure and reconfigure their models, queries, and apps on the fly
 A data lake is best suited to data scientists.

Data lake for Data scientists

Munging = wrangling
 Data munging (or data wrangling) is the process of transforming and mapping data from one "raw" data form into another format, with the intent of making it more appropriate and valuable for analytics

 Example: Python and the pandas library
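A small wrangling pass might look like the sketch below. To stay self-contained it uses only the Python standard library rather than pandas; the raw data, the column names, and the cleaning rules (trim whitespace, normalize names and dates, treat a missing spend as 0.0, drop duplicates) are assumptions made for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export with stray spaces, mixed date formats, a gap, and a duplicate.
RAW = """name, signup_date ,spend
 Alice ,2023-01-05,120.50
BOB,05/02/2023,
alice,2023-01-05,120.50
"""


def parse_date(text):
    """Accept either ISO dates or day/month/year dates; return None otherwise."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            pass
    return None


def wrangle(raw_csv):
    """Turn the raw text into clean, de-duplicated records."""
    reader = csv.reader(io.StringIO(raw_csv))
    header = [column.strip() for column in next(reader)]
    cleaned, seen = [], set()
    for row in reader:
        record = dict(zip(header, (cell.strip() for cell in row)))
        record["name"] = record["name"].title()
        record["signup_date"] = parse_date(record["signup_date"])
        # Assumption for this example: a missing spend counts as 0.0.
        record["spend"] = float(record["spend"]) if record["spend"] else 0.0
        key = (record["name"], record["signup_date"])
        if key not in seen:  # drop rows that repeat the same person and date
            seen.add(key)
            cleaned.append(record)
    return cleaned


clean = wrangle(RAW)
```

Libraries such as pandas offer higher-level operations for the same kinds of steps, which matters once the data no longer fits comfortably in hand-written loops.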

Data ingestion
 The transportation of data from sources to a storage system

 Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

[Figure label: open source analytics & monitoring solution]
Vocabulary to remember
 Data at rest
 Data in motion
 The V’s of Big data
 Data Lake
 Data ingestion
 Data munging/Data wrangling
 Horizontal scalability
 Data node
 Master node
 Rack
 Cluster
 Cluster file system
 GFS, HDFS, Azure: examples of CFSes