
Hadoop and MR programming

Dr G Sudha Sadasivam
Professor
CSE, PSGCT
Outline
• Big Data
• Hadoop and MR programming
• MR model
• Example
• Kmeans
• SVM
Trend: New "Big Data" becoming commonplace

Data volumes generated by various sources:
• 10 Terabytes per day
• 7 Terabytes per day
• Transactions: 46 Terabytes per year
• Genomes: Petabytes per year
• 20 Petabytes per day
• Call records: 5 Terabytes per day
• New video uploads: 4.5 Terabytes per day
• 100 Terabytes per year
• LHC: 40 Terabytes per second

Massive volumes of data ...
The Big Data Challenge
• Challenge: Big Data's characteristics are straining conventional information
  management architectures
  – Massive, increasingly diverse, and growing amounts of information residing
    both inside and outside the organization
  – Unconventional semi-structured or unstructured data: web pages, log files,
    social media, click-streams, instant messages, text messages, emails, sensor
    data from active and passive systems, etc.
  – Rapidly changing information

Application areas: sentiment analytics, transaction analytics, claim fraud
analytics, warranty claim analytics, surveillance analytics, CDR analytics,
multi-channel analytics
The Big Data Opportunity

Extracting insight from an immense volume, variety and velocity of data, in
context, beyond what was previously possible.

• Variety: manage the complexity of multiple relational and non-relational
  data types and schemas
• Velocity: streaming data and large-volume data movement
• Volume: scale from terabytes to zettabytes
Big Data vis-à-vis Existing Communities

Big Data sits at the intersection of existing communities – Machine Learning and
NLP (Variety), Databases (Volume), and Complex Event Processing (Velocity) –
characterised by these V's (often extended to 5 V's).
NoSQL

• "Not only SQL" – beyond structured, relational (SQL) data
• Handles unstructured data:
  – Key-value stores
  – Column-family stores
  – Document databases
  – Graph databases
Hadoop and MR Programming

• A framework for running applications on large clusters of commodity hardware
  (~1000 nodes) that store and process huge volumes of data (petabytes to
  zettabytes)
• Open-source Apache Software Foundation project
• Hadoop includes:
  – HDFS: a distributed file system that distributes the data
  – Map/Reduce: a programming model implemented on top of HDFS; an offline
    (batch) computing engine that handles distributed applications

CONCEPT
Moving computation is more efficient than moving large data
Hadoop evolution
2011 – Cloudera, MapR
2012 – Hortonworks
2013 – Hadoop 2.0 (YARN)
2014 – Spark
Hadoop – Who's Using It? (company logos omitted)

• Uses Hadoop and HBase for social services and structured data storage
• Uses Hadoop for processing for internal use and for Amazon's product search
  indices; millions of sessions are processed daily for analytics
• Uses Hadoop for search optimization and research
• Uses Hadoop for internal log reporting/parsing systems designed to scale to
  infinity and beyond
• Uses Hadoop as a source for reporting/analytics and machine learning, and as
  a web-wide analytics platform
• Uses Hadoop for databasing and analyzing Next Generation Sequencing (NGS)
  data produced for the Cancer Genome Atlas (TCGA) project and other groups

And many more ...
2. Hadoop Components

• Distributed file system (HDFS)
  – NameNode: single namespace for the entire cluster (file names, block
    locations, etc.)
  – Worker nodes / DataNodes store the blocks
  – Each file is chopped up into a number of blocks (128 MB)
  – Fault tolerance is achieved by replicating blocks
  – Optimized for large files and sequential reads
• MapReduce framework
  – Simple data-parallel programming model
  – Executes user jobs specified as "map" and "reduce" functions
  – JobTracker (master) and TaskTrackers (workers)
• A cluster node runs both DFS and MR daemons

(Figure: the NameNode maps File1 to blocks 1–4, which are replicated across the
DataNodes.)
Dataflow in Hadoop

(Figure sequence summarised)
1. The client submits the job; the scheduler assigns map and reduce tasks.
2. Each map task reads its input block from HDFS.
3. Map output is written to the local file system; the "finished" status and
   output locations are reported.
4. Reduce tasks fetch the map output over HTTP GET.
5. Reduce tasks write the final answer back to HDFS.
HDFS Architecture

• NameNode: maps (filename, offset) -> block id, and block -> DataNode
• DataNode: maps block -> local disk
• Secondary NameNode: periodically merges edit logs
• A block is also called a chunk

Pipeline Details (figure omitted)
MR programming

1) Calculating Pi (Monte Carlo)
• The area of the square is As = (2r)^2 = 4r^2; the area of the circle is
  Ac = pi * r^2, so Ac/As = pi/4.
• pi ≈ 4 * (number of random points falling inside the circle) /
  (number of points generated in the square)
• MAP: generate random points in the square and count how many fall inside the
  circle
• REDUCE: sum the counts and compute pi = 4 * (points in circle) / (total
  points); a code sketch follows
• MapReduce is a restricted parallel programming model meant for large clusters
  – User implements Map() and Reduce()
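Below is a minimal sketch (not from the slides) of how this could look in the
old org.apache.hadoop.mapred API used later in the deck. It assumes each input
line holds the number of points a mapper should generate; the driver then
divides the two reducer outputs to obtain pi. Class and key names are
illustrative.

// Sketch only; needs the usual imports (org.apache.hadoop.io.*,
// org.apache.hadoop.mapred.*, java.io.IOException, java.util.Iterator).
public static class PiMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    long n = Long.parseLong(line.toString().trim());  // number of points to generate
    long inCircle = 0;
    java.util.Random rnd = new java.util.Random();
    for (long i = 0; i < n; i++) {
      double x = 2 * rnd.nextDouble() - 1;            // random point in the 2r x 2r square (r = 1)
      double y = 2 * rnd.nextDouble() - 1;
      if (x * x + y * y <= 1.0) inCircle++;           // point also falls inside the circle
    }
    out.collect(new Text("circle"), new LongWritable(inCircle));
    out.collect(new Text("square"), new LongWritable(n));
  }
}

// The reducer just sums the counts per key (like WordCount); the driver reads
// the two totals and reports pi ~= 4 * circle / square.
public static class SumReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) sum += values.next().get();
    out.collect(key, new LongWritable(sum));
  }
}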
Ex 2: WORDCOUNT EXAMPLE

• Divide the document and analyse one line per mapper
• Each mapper counts the words in its line
• The reducer sums up all the counts
• File
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• Map
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
• The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
The output of the first combine:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second combine:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
Thus the output of the job (reduce) is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
• Map()
– Input <filename, file text>
– Parses file and emits <word, count> pairs
• eg. <”hello”, 1>
• Reduce()
– Sums all values for the same key and emits <word,
TotalCount>
• eg. <"hello", (1, 1)> => <"hello", 2>
Parallelisation - Hadoop MapReduce

File:
Hello World Bye World
Hello Hadoop Goodbye Hadoop

Map 1 output:     <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2 output:     <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Combine 1 output: <Bye, 1> <Hello, 1> <World, 2>
Combine 2 output: <Goodbye, 1> <Hadoop, 2> <Hello, 1>

Reduce output:    <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
MR model
• Map()
– Process a key/value pair to generate intermediate
key/value pairs
• Reduce()
– Merge all intermediate values associated with the same
key
• Users implement interface of two primary methods:
1. Map: (key1, val1) → (key2, val2)
2. Reduce: (key2, [val2]) → [val3]
• Map corresponds to the group-by clause (on the key) of an aggregate SQL query
• Reduce corresponds to the aggregate function (e.g., average) that is computed
  over all the rows with the same group-by attribute (key)
Example Word Count (1)
• Map
public static class MapClass extends MapReduceBase implements Mapper {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    String line = ((Text) value).toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
Example Word Count (2)
• Reduce
public static class Reduce extends MapReduceBase implements Reducer {
  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += ((IntWritable) values.next()).get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Example Word Count (3)
• Main
public static void main(String[] args) throws IOException {
  // checking goes here
  JobConf conf = new JobConf();

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));

  JobClient.runJob(conf);
}
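The listing above uses the old org.apache.hadoop.mapred API (JobConf,
MapReduceBase). For reference, a sketch of the same job written against the
newer org.apache.hadoop.mapreduce API, with illustrative class names, looks
roughly like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);                    // emit <word, 1>
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));      // emit <word, total count>
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}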
• Job – the whole MapReduce program
• Task – a single map or reduce
• Task attempt – an individual execution of a task; re-run on crash, also used
  for speculative execution

(Figure: The client builds a JobConf and calls JobClient.runJob(); the job jar
and configuration are placed in HDFS and the MR job is submitted to the
JobTracker, which tracks the job-in-progress. The JobTracker assigns tasks to
TaskTrackers on the slave nodes; each TaskTracker launches a TaskRunner that
executes the mapper/reducer task instance and reports the task-in-progress.)
• MapReduce programs are packaged as a Java "jar" file plus an XML file
  containing the serialized program configuration options
• These are stored in HDFS and the task trackers are notified
• Map tasks are scheduled on nodes local to their data; if those nodes are
  overloaded, the tasks are assigned to other nodes
3. Genome clustering
• A genome is a DNA sequence over the alphabet {A, T, G, C}
• Features characterise a genome – motifs (overlapping 3-mers) are extracted in
  MR job 1; a mapper sketch follows the example below
• Example sequence: AAAGGGTTTCCCAAAG
  Mapper: AAA-1, AAG-1, AGG-1, GGG-1, GGT-1, GTT-1, TTT-1, TTC-1, TCC-1,
  CCC-1, CCA-1, CAA-1, AAA-1, AAG-1
  Reducer: AAA-2, AAG-2, AGG-1, GGG-1, GGT-1, GTT-1, TTT-1, TTC-1, TCC-1,
  CCC-1, CCA-1, CAA-1
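A minimal mapper sketch (not from the slides), written in the same old mapred
API style as the WordCount example; a WordCount-style reducer then sums the
counts to give the [1 x 64] motif feature vector.

// Sketch only; needs the usual imports (org.apache.hadoop.io.*, org.apache.hadoop.mapred.*).
public static class KmerMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int K = 3;                       // motif length: 4^3 = 64 possible motifs
  private final static IntWritable one = new IntWritable(1);
  private Text kmer = new Text();
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    String seq = line.toString().replaceAll("\\s+", "");  // drop whitespace in the sequence line
    for (int i = 0; i + K <= seq.length(); i++) {
      kmer.set(seq.substring(i, i + K));                // emit each overlapping 3-mer with count 1
      out.collect(kmer, one);
    }
  }
}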
MR pipeline for species identification:
Read the input sequence
  -> Obtain the [1 x 64] feature descriptor vector (MR job 1)
  -> Compare to the existing species by clustering (MR job 2)
  -> Identify the species
K-Means
• Choose initial cluster centres
• Compute the distance between each sample and the cluster centres
• Find which cluster centre is closest to a given sample
• Assign the sample to that cluster
• Re-evaluate (recompute) the cluster centres
MapReduce k-means
• Key – cluster id, value – the coordinates of a row (sample)
• Map – calculates the distance of each point from the centres
  – key out: cluster id, value out: distance to each cluster centre
• Reduce – finds the closest cluster centre and recalculates the new centres
  for the next iteration
  – output: new centres
MR K-Means
• Let k index the features of a sample; V1(k) is the value of the kth feature
  of the sample and V2(k) is the value of the kth feature of a cluster centre
• Mapper: the distance to the ith cluster centre is the Euclidean distance
  Si = sqrt( Σk (V1(k) − V2(k))^2 )
• Reducer: min[Si] is evaluated for a sample (species) over all cluster centres
• The centroid is then re-evaluated over all features k of the samples assigned
  to the cluster; this updates the centroid for the next iteration (a MapReduce
  sketch follows)
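A minimal sketch (not from the slides) of one k-means iteration in the old
mapred API. The slides describe the mapper emitting distances and the reducer
choosing the minimum; the more common arrangement sketched below instead
assigns each point to its nearest centre in the mapper and recomputes centroids
in the reducer. The kmeans.centres property and the parsePoint/parseCentres
helpers are illustrative assumptions, not part of any Hadoop API.

// Sketch only: assumes the current centres are passed via the JobConf and each
// input line is a comma-separated feature vector. parseCentres/parsePoint are
// hypothetical helper methods.
public static class KMeansMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {
  private double[][] centres;                               // [cluster][feature]
  public void configure(JobConf job) {
    centres = parseCentres(job.get("kmeans.centres"));      // illustrative property name
  }
  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    double[] p = parsePoint(line.toString());
    int best = 0; double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centres.length; c++) {              // squared Euclidean distance to each centre
      double d = 0;
      for (int k = 0; k < p.length; k++) d += (p[k] - centres[c][k]) * (p[k] - centres[c][k]);
      if (d < bestDist) { bestDist = d; best = c; }
    }
    out.collect(new IntWritable(best), line);               // key = nearest cluster id, value = the point
  }
}

public static class KMeansReducer extends MapReduceBase
    implements Reducer<IntWritable, Text, IntWritable, Text> {
  public void reduce(IntWritable clusterId, Iterator<Text> points,
                     OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    double[] sum = null; long n = 0;
    while (points.hasNext()) {                              // accumulate the assigned points
      double[] p = parsePoint(points.next().toString());
      if (sum == null) sum = new double[p.length];
      for (int k = 0; k < p.length; k++) sum[k] += p[k];
      n++;
    }
    StringBuilder centre = new StringBuilder();             // new centre = mean of assigned points
    for (int k = 0; k < sum.length; k++) {
      if (k > 0) centre.append(',');
      centre.append(sum[k] / n);
    }
    out.collect(clusterId, new Text(centre.toString()));    // used as a centre in the next iteration
  }
}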
4. SVM as a classifier
• A Support Vector Machine (SVM) is a supervised learning method that analyses
  data for classification and regression
• Learning in an SVM amounts to finding a hyperplane that separates the
  training set in a feature space, using a kernel function which is an inner
  product of the input-space features
• Binary classification can be viewed as the task of separating two classes in
  feature space
f(x) = sign(wT x + b)

wT x + b = 0 defines the hyperplane; points with wT x + b > 0 lie on the
positive side and points with wT x + b < 0 on the negative side, with
w = Σi αi si,
where
  w represents the normal vector to the hyperplane
  x represents the training patterns
  b represents the threshold
  α represents the coefficients
  s represents the support vectors
Methodology
Steps
1. Data collection – download the data sets for each month/year/day
2. Input feature selection – MA and RSI; the volume average labels each month
   as a + or − point
3. Finding the support vectors S1, S2 and the coefficients α1, α2
4. Construction of the hyperplane y = wT x + b
5. Classification of test data into +ive / −ive points using the hyperplane
   and the X values
2. a) Input features
• Moving Average
  MA = (1/n) Σ(i=1..n) CPi
  where CPi = closing price on the ith day

• Relative Strength Index
  RSI = 100 * RS / (1 + RS), where RS = AU / AD
  AU = total upward price change during the past n days
  AD = total downward price change during the past n days
Input data & feature selection example

DATE      HIGH   LOW   OPEN  CLOSE  VOLUME  OPEN-CLOSE
26.8.11   8.87   7.04  8.19  7.53   4.3     0.66
19.8.11   9.48   7.70  8.81  8.08   2.5     0.73
12.8.11   9.27   8.32  9.20  8.42   1.6     0.78
5.8.11    10.83  7.74  8.60  9.15   2.8     -0.55

X = MA = 8.3                  Vol avg = 2.8
RS = (0.66 + 0.73 + 0.78) / 0.55 = 3.94
Y = RSI = 100 * RS / (1 + RS) = 79.87

Vol avg (2.8) vs. total data set average (1.54): an 81.61% increase, so the
month is labelled a + point. (A code sketch of the feature computation follows.)
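A small sketch (not from the slides) that reproduces the August feature
computation above; the numbers are taken directly from the table.

// Sketch only: MA = mean closing price, RS = sum of up-moves / sum of down-moves,
// RSI = 100 * RS / (1 + RS).
public class FeatureExample {
  public static void main(String[] args) {
    double[] close = {7.53, 8.08, 8.42, 9.15};
    double[] volume = {4.3, 2.5, 1.6, 2.8};
    double[] openMinusClose = {0.66, 0.73, 0.78, -0.55};

    double ma = 0, vol = 0, up = 0, down = 0;
    for (double c : close) ma += c;
    ma /= close.length;                         // ~8.30
    for (double v : volume) vol += v;
    vol /= volume.length;                       // ~2.80
    for (double d : openMinusClose) {
      if (d > 0) up += d; else down -= d;       // accumulate up- and down-moves
    }
    double rs = up / down;                      // ~3.94
    double rsi = 100 * rs / (1 + rs);           // ~79.8 (the slides report 79.87)
    System.out.printf("MA=%.2f VolAvg=%.2f RS=%.2f RSI=%.2f%n", ma, vol, rs, rsi);
  }
}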
2.b)Repeat for other months also
Month MA RSI +/-
August 8.30 79.87 +
June 8.97 85.81 +
May 10.73 74.09 +
April 14.54 55.35 +
March 13.74 45.94 +
November 11.79 24.24 +
July 10.61 42.53 -
February 12.32 31.50 -
January 10.44 83.97 -
December 12.78 28.05 -
October 8.81 31.03 -
3. Support Vectors

• The support vectors are selected using the Euclidean distance formula
• In the Euclidean plane, if X = (x1, x2) and Y = (y1, y2), the distance is
  D(X, Y) = sqrt( (x1 − y1)^2 + (x2 − y2)^2 )
  where X and Y are the positive and negative points of Moving Average and
  Relative Strength Index
+ive point       -ive point       distance
8.30, 79.87      10.61, 42.53     37.44
                 12.32, 31.50     48.53
                 10.44, 83.97     4.62
                 12.78, 28.05     52.01
                 8.81, 31.03      48.84
8.97, 85.81      10.61, 42.53     43.31
                 12.32, 31.50     54.41
                 10.44, 83.97     2.35 (min)
                 12.78, 28.05     57.88
                 8.81, 31.03      54.77

The other pairs are also calculated and the minimum distance is found, giving
S1 = (8.97, 85.81) and S2 = (10.44, 83.97).
Augmented with a bias component of 1, the support vectors are
S1 = (8.97, 85.81, 1) and S2 = (10.44, 83.97, 1).
(A distance code sketch follows.)
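A small sketch (not from the slides) of the Euclidean distance computation used
to pick the closest positive/negative pair as support vectors.

// Sketch only: distance between two (MA, RSI) points.
public class DistanceExample {
  static double dist(double[] a, double[] b) {
    return Math.sqrt((a[0] - b[0]) * (a[0] - b[0]) + (a[1] - b[1]) * (a[1] - b[1]));
  }
  public static void main(String[] args) {
    double[] plus  = {8.97, 85.81};             // a +ive point (June)
    double[] minus = {10.44, 83.97};            // a -ive point (January)
    // ~2.355; the slides round this to 2.35, the minimum in the table
    System.out.printf("%.3f%n", dist(plus, minus));
  }
}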
3.b) Finding α1 and α2

• α1 (S1 · S1) + α2 (S1 · S2) = +1
• α1 (S1 · S2) + α2 (S2 · S2) = −1

  7444.81 α1 + 7300.11 α2 = +1
  7300.11 α1 + 7160.95 α2 = −1

  α1 = −0.71;  α2 = 0.73
4. Construction of the Hyperplane

• The hyperplane is constructed using the equation y = wT x + b, where
  w = Σ αi si
    w represents the normal vector to the hyperplane
    x represents the training patterns
    b represents the threshold
    α represents the coefficients
    s represents the support vectors
• Using the α values calculated above:
  wT = Σ αi Si = α1 s1 + α2 s2 = (1.25, 0.37), with b = 0.02 (the bias
  component of the augmented vectors)
• Hyperplane: y = 1.25 x1 + 0.37 x2 + 0.02
5. Testing set
• Test point: MA = x1 = 12.44, RSI = x2 = 50; predict the volume average
• y = wT x + b = 1.25(12.44) + 0.37(50) + 0.02 = 34.05 + 0.02 = 34.07
  → predicted value (vol avg)
• Overall average = 18
• Increased % = (34.07 − 18) / 18 ≈ 89%
(A verification sketch follows.)
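A small sketch (not from the slides) that reproduces the worked example
end-to-end: w = α1·s1 + α2·s2 over the augmented support vectors, then
y = wT x + b for the test point.

// Sketch only: the support vectors and alpha values are the ones given in the slides.
public class SvmExample {
  public static void main(String[] args) {
    double[] s1 = {8.97, 85.81, 1};        // positive support vector (MA, RSI, bias)
    double[] s2 = {10.44, 83.97, 1};       // negative support vector (MA, RSI, bias)
    double a1 = -0.71, a2 = 0.73;          // alpha values from the slides

    double[] w = new double[3];            // w[0], w[1] = weights; w[2] = bias b
    for (int k = 0; k < 3; k++) w[k] = a1 * s1[k] + a2 * s2[k];

    double[] x = {12.44, 50};              // test point: MA = 12.44, RSI = 50
    double y = w[0] * x[0] + w[1] * x[1] + w[2];
    // Prints roughly w = (1.25, 0.37), b = 0.02 and y ~ 34.25; the slides report
    // y = 34.07 because they round w to two decimals before evaluating.
    System.out.printf("w=(%.2f, %.2f) b=%.2f y=%.2f%n", w[0], w[1], w[2], y);
  }
}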
Multi SVM

• The support vector machine is a powerful tool for binary classification,
  capable of generating very fast classifier functions following a training
  period
• To extend the binary-class scenario to a multi-class scenario, an M-class
  problem is decomposed into a series of two-class problems
Approaches for Multi class SVM
• Multiclass ranking SVMs, in which one SVM
decision function attempts to classify all classes
• One-against-all classification, in which there is one binary SVM for each
  class to separate members of that class from members of the other classes
  (see the sketch after this list)
• Pairwise classification, in which there is one binary
SVM for each pair of classes to separate members of
one class from members of the other
• Binary decision tree classification
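A minimal sketch (not from the slides) of one-against-all prediction with
linear SVMs: one (w, b) pair per class, and the class with the largest decision
value wins.

// Sketch only: w[c] and b[c] would come from training one binary SVM per class.
public class OneVsAll {
  static int predict(double[][] w, double[] b, double[] x) {
    int best = 0;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int c = 0; c < w.length; c++) {
      double score = b[c];
      for (int k = 0; k < x.length; k++) score += w[c][k] * x[k];   // decision value w_c . x + b_c
      if (score > bestScore) { bestScore = score; best = c; }
    }
    return best;    // index of the predicted class, e.g. one of the four ranges below
  }
}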
Classes: −100 to −50, −50 to 0, 0 to 50, 50 to 100

• One-against-all / pairwise: one binary SVM per class (or per pair of classes)
  over the four ranges −100 to −50, −50 to 0, 0 to 50, 50 to 100
• Binary decision tree (BDT): the root node covers −100 to 100 and splits into
  −100 to 0 and 0 to 100, which in turn split into the four leaf ranges
  −100 to −50, −50 to 0, 0 to 50, 50 to 100

Hadoop – Components

HDFS – Distributed file system
MapReduce – Distributed processing of large data sets
ZooKeeper – Coordination service for distributed applications
HBase – Scalable distributed database; supports structured data
Avro – Data serialization system
SQOOP – Connector to structured databases
Mahout – Machine learning library
Chukwa – Monitors large distributed systems
Flume – Moves large data efficiently for post-processing
Hue – GUI to operate and develop Hadoop applications
Hive – Data warehousing framework
Pig – Framework for parallel computation
Oozie – Workflow service to manage data processing jobs
Many more ...
Hadoop – The Various Forms Today

Apache Hadoop – Native Hadoop distribution from the Apache Foundation
Yahoo! Hadoop – Hadoop distribution of Yahoo
CDH – Hadoop distribution from Cloudera
GreenPlum Hadoop – Hadoop distribution from EMC
HDP – Hadoop platform from Hortonworks
M3 / M5 / M7 – Hadoop distributions from MapR
Project Serengeti – VMware's implementation of Hadoop on vCenter
And more ...
Hadoop vs Spark
• Spark runs on top of Hadoop
• Hadoop provides distributed storage; Spark is a tool for distributed
  processing on top of that storage
• Batch vs real-time processing
• Written in Java vs written in Scala
• They are complementary
• Spark stores data in memory, whereas Hadoop MapReduce stores intermediate
  data on disk
  – uses more RAM instead of network and disk I/O
  – relatively fast compared to Hadoop MapReduce
• Spark needs large RAM → dedicated high-end physical machines
• Spark suits
  – iterative jobs / machine learning
  – interactive analysis
  – stream and sensor data processing
• Hadoop uses replication to achieve fault tolerance, whereas Spark uses
  resilient distributed datasets (RDDs) to reconstruct a lost partition (from
  lineage / stable storage); it backtracks and completes a job rather than
  starting again from the beginning
• Shared in-memory data vs distributed on-disk data

Hadoop MapReduce vs Spark
Conclusion
• Hadoop
• MapReduce
• Examples
• Case studies
• Hadoop vs Spark
Thank You
