Big Data
[Figure: where big data comes from — 100s of millions of GPS-enabled devices sold annually; ? TBs of data generated every day; 25+ TBs of log data every day; 2+ billion people on the Web by end 2011; 76 million smart meters in 2009, 200M by 2014.]
The Five Vs of Big Data
• Big data is often defined according to the five Vs as shown:
volume, variety, velocity, veracity, and value.
• Volume refers to the significantly larger amounts of data that are
being collected due to an increased number of data sources as
well as the increased number of sensors that are generating data.
For example, 6 billion people have cell phones, which are used to
place calls and to post to social media sites such as Facebook.
• Variety refers to the different forms of data that are generated.
Big data is highly heterogeneous in nature:
• data may be textual, video, XML, or sensor data.
• Velocity refers to the fact that big data is often data “in motion”
rather than data “at rest” as found in traditional databases. Data
is constantly being transmitted through computer networks,
sensors, mobile devices, and satellites. Processing big data involves
the analysis of streaming data that arrives at a fast rate.
• Veracity refers to the accuracy of the data being analyzed. Data
can be corrupted in transmission, or can potentially be lost. Before
analyzing any large data set, steps must be taken to improve the
trustworthiness of the data.
• Value refers to the advantages that the successful analysis of big
data can bring to businesses, organizations, and research labs.
Hadoop, a distributed framework for Big Data
• Hadoop is a system for the efficient storage and retrieval of
large data files.
• Hadoop is based on a simple data model into which any data will fit.
• The two main components of Hadoop are:
• The Hadoop storage framework:
• Hadoop distributed file system (HDFS) and
• The parallel programming paradigm:
• MapReduce
Hadoop Components
• Distributed file system (HDFS)
• The Hadoop storage framework
• Single namespace for entire cluster
• Replicates data 3x for fault-tolerance
• MapReduce implementation
• Executes user jobs specified as “map” and “reduce” functions
• Manages work distribution & fault-tolerance
What is Hadoop?
• Here's what makes it especially useful:
• Scalable: It can reliably store and process petabytes.
• Economical: It distributes the data and processing across
clusters of commonly available computers (in the thousands).
• Efficient: By distributing the data, it can process it in parallel
on the nodes where the data is located.
• Reliable: It automatically maintains multiple copies of data
and automatically redeploys computing tasks based on
failures.
Hadoop Architecture
[Figure: the MapReduce execution engine runs on top of the Hadoop distributed file system (HDFS), whose data is stored across datanodes.]
Hadoop Distributed File System
• Single namenode (master node) stores
metadata (file names, block locations, ….)
• The namenode itself may also be replicated
• A client library handles file access for applications
[Figure: the namenode maps File1 to its blocks (1, 2, 3, 4), which are stored on the datanodes.]
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
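The placement policy described on this slide (two replicas in the local rack, one elsewhere) can be sketched in a few lines of Python. This is a toy model of the slide's description, not Hadoop's actual BlockPlacementPolicy, and the rack and node names are invented:

```python
import random

def place_replicas(local_rack, racks):
    # Toy sketch of the slide's policy: two replicas on distinct nodes
    # in the writer's rack, one replica on a node in another rack.
    local_nodes = random.sample(racks[local_rack], 2)
    remote_rack = random.choice([r for r in racks if r != local_rack])
    remote_node = random.choice(racks[remote_rack])
    return local_nodes + [remote_node]

racks = {"rack1": ["node1", "node2", "node3"],
         "rack2": ["node4", "node5"]}
replicas = place_replicas("rack1", racks)
# Three replicas: two from rack1, one from rack2.
```

Spreading replicas across racks means a block survives both single-node and single-rack failures.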
Goals of HDFS
• Very Large Distributed File System
• 10K nodes, 100 million files, 10 PB
• Convenient Cluster Management
• Load balancing
• Node failures
• Cluster expansion
• Optimized for Batch Processing
• Allow moving computation to the data
• Maximize throughput
Typical Hadoop Cluster
[Figure: a typical Hadoop cluster — racks of nodes connected by rack switches and an aggregation switch.]
• When you put data into HDFS, Hadoop automatically splits it
into blocks and replicates each block.
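The block splitting above can be sketched as follows (a simplification; 128 MB is the default HDFS block size in recent Hadoop releases, and older releases used 64 MB):

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    # HDFS stores a file as a sequence of fixed-size blocks; only the
    # last block may be smaller. Returns (offset, length) pairs.
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
# A 300 MB file -> two full 128 MB blocks plus a 44 MB tail block.
```

Each of these blocks is then replicated (3x by default) across datanodes.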
Input reader
• The input reader reads a block and divides it into splits.
• Each split is sent to a map function; e.g., a single line of text is
one input to a map function.
• The key could be some internal number, and the value is the
content of the textual line.
[Figure: each input record, such as “NewYork, US, 10”, is mapped to a (country, value) pair — (US, 10), (US, 40), (GB, 20), (DE, 60), (GB, 10), (DE, 30), … — and the reduce step then produces one aggregated pair per country: (US, 25), (GB, 15), (DE, 45).]
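This example can be sketched in Python. The sketch assumes the reduce step averages the values per country, which matches the output pairs in the figure; the function names are mine, not Hadoop's:

```python
from collections import defaultdict

def map_record(offset, line):
    # The input key (e.g. a byte offset) is ignored; the line
    # "NewYork, US, 10" becomes the intermediate pair ("US", 10).
    city, country, value = [field.strip() for field in line.split(",")]
    return country, int(value)

lines = ["NewYork, US, 10", "LosAngeles, US, 40", "London, GB, 20",
         "Berlin, DE, 60", "Glasgow, GB, 10", "Munich, DE, 30"]

# Shuffle: group the mapped values by country.
groups = defaultdict(list)
for country, value in (map_record(i, line) for i, line in enumerate(lines)):
    groups[country].append(value)

# Reduce: average the values per country.
averages = {c: sum(vs) / len(vs) for c, vs in groups.items()}
# averages == {"US": 25.0, "GB": 15.0, "DE": 45.0}
```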
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
• Receives the user’s job
• Decides on how many tasks will run (number of mappers)
• Decides on where to run each mapper (concept of locality)
[Figure: in this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks spread over 3 nodes; a parse/hash step routes each map output to the right reducer.]
MapReduce Motivation
[Figure: word count as a MapReduce job. Mappers turn the input lines “Hello Cloud”, “TA cool”, and “Hello TA cool” into (word, 1) pairs; the sort/copy/merge (shuffle) step groups the pairs by word — Hello [1 1], TA [1 1], cool [1 1], Cloud [1] — and the reducers sum each list, producing the output Hello 2, TA 2, cool 2, Cloud 1.]
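The word-count pipeline above can be simulated end-to-end in a few lines of Python (an in-memory sketch of the three phases, not a distributed implementation):

```python
from collections import defaultdict

def map_phase(lines):
    # Each mapper emits (word, 1) for every word in its input split.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    # Sort/copy/merge groups the values by key: ("Hello", [1, 1]), ...
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Each reducer sums the list of 1s for its key.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["Hello Cloud", "TA cool", "Hello TA cool"])))
# counts == {"Hello": 2, "Cloud": 1, "TA": 2, "cool": 2}
```

In real Hadoop the mappers and reducers run on different nodes, and the shuffle moves data across the network between them.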
Example: Color Count
[Figure: map tasks read input blocks from HDFS and produce (color, 1) pairs; shuffle & sorting groups the pairs by color into (k, [1, 1, 1, 1, 1, 1, …]); each reduce task consumes (color, [1, 1, …]) and produces (color, count), e.g. (color, 100). That’s the output file; it has 3 parts — Part0001, Part0002, Part0003 — probably on 3 different machines.]
Example: Color Filter
[Figure: a map-only job — each map task filters its input block and writes its output part directly to HDFS. That’s the output file; it has 4 parts — Part0001 through Part0004 — probably on 4 different machines.]
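A map-only filter like this can be sketched as follows (the color values and block size are invented for illustration; in Hadoop, "no reducers" is configured by setting the number of reduce tasks to zero):

```python
def filter_map(record, wanted="green"):
    # A map-only job: emit the record only if it passes the filter;
    # there is no shuffle or reduce phase.
    return [record] if record == wanted else []

records = ["green", "blue", "green", "red", "green", "blue", "red", "green"]

# Simulate 4 mappers, one per input block of two records; each mapper
# writes its own part file (Part0001..Part0004) directly to HDFS.
parts = [
    [out for rec in records[i:i + 2] for out in filter_map(rec)]
    for i in range(0, len(records), 2)
]
# parts == [["green"], ["green"], ["green"], ["green"]]
```

Because there is no shuffle, the output order simply follows the input blocks, and the job finishes as soon as the mappers do.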
MapReduce Example:
Finding the Most Common
Dog Names
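The dog-names job can be sketched as a word-count-style pipeline: map each registration record to (name, 1), reduce by summing, then take the maximum. Python's Counter collapses the map and reduce steps into one call; the sample records are invented for illustration:

```python
from collections import Counter

def most_common_dog_name(records):
    # "Map": emit (name, 1) per record; "reduce": sum per name.
    # Counter performs both steps in memory.
    counts = Counter(records)
    return counts.most_common(1)[0]

# Hypothetical registration records (invented for illustration).
records = ["Max", "Bella", "Max", "Charlie", "Max", "Bella"]
result = most_common_dog_name(records)
# result == ("Max", 3)
```

At scale, finding the global maximum takes a second pass (or a single reducer) over the per-name counts produced by the first MapReduce job.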
Hadoop Software Products
• The Apache Hadoop open source software project provides free
downloads of modules including Hadoop Common, HDFS,
Hadoop YARN (for job scheduling), and Hadoop MapReduce.
• The download framework includes other Hadoop-related tools such
as Hive, HBase, Pig, and the Mahout data mining library.
• Database vendors have also begun to provide software products
based on Apache Hadoop. Many of these vendors also provide
free virtual machines (VMs) for learning the basics of Hadoop.
• IBM, for example, provides the InfoSphere BigInsights Quick Start.
• Other virtual machines for the Hadoop framework are provided
by vendors such as Cloudera, Hortonworks, Oracle, and Microsoft.
Summary
• The amount of data that is now being collected by social media
sites and other sources has grown at an unprecedented rate.
• The term big data refers to massively large data sets that cannot
be handled with traditional database technology.
• Hadoop is the framework that gave birth to the era of big data
storage and analytics. Hadoop is composed of the Hadoop
distributed file system (HDFS) and the MapReduce model.
• HDFS provides fault tolerance by dividing a large file into blocks
and then replicating the blocks within clusters.
• MapReduce is used as the programming model for parallel
computation on Hadoop.
Reference
• Catherine M Ricardo, Susan D Urban, Databases
Illuminated (3rd edition), Jones & Bartlett Learning, LLC,
an Ascend Learning Company, 2017, Chapter 8.
The End of the Lecture