Big Data
[Figure: where big data comes from — 100s of millions of GPS-enabled devices sold annually; ? TBs of data generated every day; 25+ TBs of log data every day; 2+ billion people on the Web by end 2011; 76 million smart meters in 2009, 200M by 2014.]
The Five Vs of Big Data
• Big data is often defined according to the five Vs as shown:
volume, variety, velocity, veracity, and value.
• Volume refers to the significantly larger amounts of data that are
being collected due to an increased number of data sources as
well as the increased number of sensors that are generating data.
For example, 6 billion people have cell phones, which are used to
place calls and to post to social media sites such as Facebook.
• Variety refers to the different forms of data that are generated.
Big data is highly heterogeneous in nature:
• data may be textual, video, XML, or sensor data.
• Velocity refers to the fact that big data is often data “in motion”
rather than data “at rest” as found in traditional databases. Data
is constantly being transmitted through computer networks,
sensors, mobile devices, and satellites. Processing big data involves
the analysis of streaming data that arrives at a fast rate.
• Veracity refers to the accuracy of the data being analyzed. Data
can be corrupted in transmission, or can potentially be lost. Before
analyzing any large data set, steps must be taken to improve the
trustworthiness of the data.
• Value refers to the advantages that the successful analysis of big
data can bring to businesses, organizations, and research labs.
Hadoop, a distributed framework for Big Data
• Hadoop is a system for the efficient storage and retrieval of
large data files.
• Hadoop is based on a simple data model into which any data will fit.
• The two main components of Hadoop are:
• The Hadoop storage framework:
• Hadoop distributed file system (HDFS) and
• The parallel programming paradigm:
• MapReduce
Hadoop Components
• Distributed file system (HDFS)
• The Hadoop storage framework
• Single namespace for entire cluster
• Replicates data 3x for fault-tolerance
• MapReduce implementation
• Executes user jobs specified as “map” and “reduce” functions
• Manages work distribution & fault-tolerance
What is Hadoop?
• Here's what makes it especially useful:
• Scalable: It can reliably store and process petabytes.
• Economical: It distributes the data and processing across
clusters of commonly available computers (in the thousands).
• Efficient: By distributing the data, it can process it in parallel
on the nodes where the data is located.
• Reliable: It automatically maintains multiple copies of data
and automatically redeploys computing tasks based on
failures.
Hadoop Architecture
[Figure: the MapReduce execution engine runs on top of the Hadoop distributed file system (HDFS), whose data is stored across datanodes.]
Hadoop Distributed File System
• Single namenode (master node) stores
metadata (file names, block locations, ….)
• The namenode itself may also be replicated
• A client library handles file access for applications
[Figure: the namenode maps File1 to its blocks (1, 2, 3, 4), which are stored on the datanodes.]
DataNode:
• Stores the actual data in HDFS
• Can run on any underlying filesystem (ext3/4, NTFS, etc)
• Notifies NameNode of what blocks it has
• NameNode replicates blocks 2x in local rack, 1x elsewhere
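The placement policy described on this slide (two replicas in the local rack, one elsewhere) can be sketched in a few lines of Python. This is a toy model of the slide's description, not Hadoop's actual BlockPlacementPolicy, and the rack and node names are invented:

```python
import random

def place_replicas(local_rack, racks):
    # Toy sketch of the slide's policy: two replicas on distinct nodes
    # in the writer's rack, one replica on a node in another rack.
    local_nodes = random.sample(racks[local_rack], 2)
    remote_rack = random.choice([r for r in racks if r != local_rack])
    remote_node = random.choice(racks[remote_rack])
    return local_nodes + [remote_node]

racks = {"rack1": ["node1", "node2", "node3"],
         "rack2": ["node4", "node5"]}
replicas = place_replicas("rack1", racks)
# Three replicas: two from rack1, one from rack2.
```

Spreading replicas across racks means a block survives both single-node and single-rack failures.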
Goals of HDFS
• Very Large Distributed File System
• 10K nodes, 100 million files, 10 PB
• Convenient Cluster Management
• Load balancing
• Node failures
• Cluster expansion
• Optimized for Batch Processing
• Allow moving computation to the data
• Maximize throughput
Typical Hadoop Cluster
[Figure: a typical Hadoop cluster — racks of nodes connected by rack switches and an aggregation switch.]
• When you put data into HDFS, Hadoop automatically splits it
into blocks and replicates each block.
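The block splitting above can be sketched as follows (a simplification; 128 MB is the default HDFS block size in recent Hadoop releases, and older releases used 64 MB):

```python
def split_into_blocks(file_size, block_size=128 * 1024 * 1024):
    # HDFS stores a file as a sequence of fixed-size blocks; only the
    # last block may be smaller. Returns (offset, length) pairs.
    return [(off, min(block_size, file_size - off))
            for off in range(0, file_size, block_size)]

MB = 1024 * 1024
blocks = split_into_blocks(300 * MB)
# A 300 MB file -> two full 128 MB blocks plus a 44 MB tail block.
```

Each of these blocks is then replicated (3x by default) across datanodes.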
Input reader
• The input reader reads a block and divides it into splits.
• Each split is sent to a map function; e.g., a single line of text is
one input to a map function.
• The key could be some internal number, and the value is the
content of the textual line.
[Figure: each input record, such as “NewYork, US, 10”, is mapped to a (country, value) pair — (US, 10), (US, 40), (GB, 20), (DE, 60), (GB, 10), (DE, 30), … — and the reduce step then produces one aggregated pair per country: (US, 25), (GB, 15), (DE, 45).]
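This example can be sketched in Python. The sketch assumes the reduce step averages the values per country, which matches the output pairs in the figure; the function names are mine, not Hadoop's:

```python
from collections import defaultdict

def map_record(offset, line):
    # The input key (e.g. a byte offset) is ignored; the line
    # "NewYork, US, 10" becomes the intermediate pair ("US", 10).
    city, country, value = [field.strip() for field in line.split(",")]
    return country, int(value)

lines = ["NewYork, US, 10", "LosAngeles, US, 40", "London, GB, 20",
         "Berlin, DE, 60", "Glasgow, GB, 10", "Munich, DE, 30"]

# Shuffle: group the mapped values by country.
groups = defaultdict(list)
for country, value in (map_record(i, line) for i, line in enumerate(lines)):
    groups[country].append(value)

# Reduce: average the values per country.
averages = {c: sum(vs) / len(vs) for c, vs in groups.items()}
# averages == {"US": 25.0, "GB": 15.0, "DE": 45.0}
```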
Properties of MapReduce Engine
• Job Tracker is the master node (runs with the namenode)
• Receives the user’s job
• Decides on how many tasks will run (number of mappers)
• Decides on where to run each mapper (concept of locality)
[Figure: in this example, one map-reduce job consists of 4 map tasks and 3 reduce tasks spread over 3 nodes; a parse/hash step routes each map output to the right reducer.]
MapReduce Motivation
[Figure: word count as a MapReduce job. Mappers turn the input lines “Hello Cloud”, “TA cool”, and “Hello TA cool” into (word, 1) pairs; the sort/copy/merge (shuffle) step groups the pairs by word — Hello [1 1], TA [1 1], cool [1 1], Cloud [1] — and the reducers sum each list, producing the output Hello 2, TA 2, cool 2, Cloud 1.]
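The word-count pipeline above can be simulated end-to-end in a few lines of Python (an in-memory sketch of the three phases, not a distributed implementation):

```python
from collections import defaultdict

def map_phase(lines):
    # Each mapper emits (word, 1) for every word in its input split.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    # Sort/copy/merge groups the values by key: ("Hello", [1, 1]), ...
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Each reducer sums the list of 1s for its key.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle(map_phase(["Hello Cloud", "TA cool", "Hello TA cool"])))
# counts == {"Hello": 2, "Cloud": 1, "TA": 2, "cool": 2}
```

In real Hadoop the mappers and reducers run on different nodes, and the shuffle moves data across the network between them.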
Example: Color Count
[Figure: map tasks read input blocks from HDFS and produce (color, 1) pairs; shuffle & sorting groups the pairs by color into (k, [1, 1, 1, 1, 1, 1, …]); each reduce task consumes (color, [1, 1, …]) and produces (color, count), e.g. (color, 100). That’s the output file; it has 3 parts — Part0001, Part0002, Part0003 — probably on 3 different machines.]
Example: Color Filter
[Figure: a map-only job — each map task filters its input block and writes its output part directly to HDFS. That’s the output file; it has 4 parts — Part0001 through Part0004 — probably on 4 different machines.]
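A map-only filter like this can be sketched as follows (the color values and block size are invented for illustration; in Hadoop, "no reducers" is configured by setting the number of reduce tasks to zero):

```python
def filter_map(record, wanted="green"):
    # A map-only job: emit the record only if it passes the filter;
    # there is no shuffle or reduce phase.
    return [record] if record == wanted else []

records = ["green", "blue", "green", "red", "green", "blue", "red", "green"]

# Simulate 4 mappers, one per input block of two records; each mapper
# writes its own part file (Part0001..Part0004) directly to HDFS.
parts = [
    [out for rec in records[i:i + 2] for out in filter_map(rec)]
    for i in range(0, len(records), 2)
]
# parts == [["green"], ["green"], ["green"], ["green"]]
```

Because there is no shuffle, the output order simply follows the input blocks, and the job finishes as soon as the mappers do.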
MapReduce Example:
Finding the Most Common
Dog Names
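The dog-names job can be sketched as a word-count-style pipeline: map each registration record to (name, 1), reduce by summing, then take the maximum. Python's Counter collapses the map and reduce steps into one call; the sample records are invented for illustration:

```python
from collections import Counter

def most_common_dog_name(records):
    # "Map": emit (name, 1) per record; "reduce": sum per name.
    # Counter performs both steps in memory.
    counts = Counter(records)
    return counts.most_common(1)[0]

# Hypothetical registration records (invented for illustration).
records = ["Max", "Bella", "Max", "Charlie", "Max", "Bella"]
result = most_common_dog_name(records)
# result == ("Max", 3)
```

At scale, finding the global maximum takes a second pass (or a single reducer) over the per-name counts produced by the first MapReduce job.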
Hadoop Software Products
• The Apache Hadoop open source software project provides free
downloads of modules including Hadoop Common, HDFS,
Hadoop YARN (for job scheduling), and Hadoop MapReduce.
• The download framework includes other Hadoop-related tools such
as Hive, HBase, Pig, and the Mahout data mining library.
• Database vendors have also begun to provide software products
based on Apache Hadoop. Many of these vendors also provide
free virtual machines (VMs) for learning the basics of Hadoop.
• IBM, for example, provides the InfoSphere BigInsights Quick Start.
• Other virtual machines for the Hadoop framework are provided
by vendors such as Cloudera, Hortonworks, Oracle, and Microsoft.
Summary
• The amount of data that is now being collected by social media
sites and other sources has grown at an unprecedented rate.
• The term big data refers to massively large data sets that cannot
be handled with traditional database technology.
• Hadoop is the framework that gave birth to the era of big data
storage and analytics. Hadoop is composed of the Hadoop
distributed file system (HDFS) and the MapReduce model.
• HDFS provides fault tolerance by dividing a large file into blocks
and then replicating the blocks within clusters.
• MapReduce is used as the programming model for parallel
computation on Hadoop.
Reference
• Catherine M Ricardo, Susan D Urban, Databases
Illuminated (3rd edition), Jones & Bartlett Learning, LLC,
an Ascend Learning Company, 2017, Chapter 8.
The End of the Lecture