
Hadoop and MR programming

Dr G Sudha Sadasivam
Professor
CSE, PSGCT
Outline
• Big Data
• Hadoop and MR programming
• MR model
• Example
• Kmeans
• SVM
Trend: New "Big Data" becoming commonplace

Data volumes generated by various sources:
• 10 Terabytes per day
• 7 Terabytes per day
• Transactions: 46 Terabytes per year
• Genomes: Petabytes per year
• 20 Petabytes per day
• Call records: 5 Terabytes per day
• New video uploads: 4.5 Terabytes per day
• 100 Terabytes per year
• LHC: 40 Terabytes per second

Massive volumes of data ...
The Big Data Challenge
• Challenge: Big Data's characteristics are straining conventional information
  management architectures
  – Massive, increasingly diverse, and growing amounts of information residing
    both inside and outside the organization
  – Unconventional semi-structured or unstructured data: web pages, log files,
    social media, click-streams, instant messages, text messages, emails, sensor
    data from active and passive systems, etc.
  – Rapidly changing information

Application areas: sentiment analytics, transaction analytics, claim fraud
analytics, warranty claim analytics, surveillance analytics, CDR analytics,
multi-channel analytics
The Big Data Opportunity

Extracting insight from an immense volume, variety and velocity of data, in
context, beyond what was previously possible.

• Variety: manage the complexity of multiple relational and non-relational
  data types and schemas
• Velocity: streaming data and large-volume data movement
• Volume: scale from terabytes to zettabytes
Big Data vis-à-vis Existing Communities

Big Data sits at the intersection of existing communities – Machine Learning and
NLP (Variety), Databases (Volume), and Complex Event Processing (Velocity) –
characterised by these V's (often extended to 5 V's).
NoSQL

• "Not only SQL" – beyond structured, relational (SQL) data
• Handles unstructured data:
  – Key-value stores
  – Column-family stores
  – Document databases
  – Graph databases
Hadoop and MR Programming

• A framework for running applications on large clusters of commodity hardware
  (~1000 nodes) that store and process huge volumes of data (petabytes to
  zettabytes)
• Open-source Apache Software Foundation project
• Hadoop includes:
  – HDFS: a distributed file system that distributes the data
  – Map/Reduce: a programming model implemented on top of HDFS; an offline
    (batch) computing engine that handles distributed applications

CONCEPT
Moving computation is more efficient than moving large data
Hadoop evolution
2011 – Cloudera, MapR
2012 – Hortonworks
2013 – Hadoop 2.0 (YARN)
2014 – Spark
Hadoop – Who's Using It? (company logos omitted)

• Uses Hadoop and HBase for social services and structured data storage
• Uses Hadoop for processing for internal use and for Amazon's product search
  indices; millions of sessions are processed daily for analytics
• Uses Hadoop for search optimization and research
• Uses Hadoop for internal log reporting/parsing systems designed to scale to
  infinity and beyond
• Uses Hadoop as a source for reporting/analytics and machine learning, and as
  a web-wide analytics platform
• Uses Hadoop for databasing and analyzing Next Generation Sequencing (NGS)
  data produced for the Cancer Genome Atlas (TCGA) project and other groups

And many more ...
2. Hadoop Components

• Distributed file system (HDFS)
  – NameNode: single namespace for the entire cluster (file names, block
    locations, etc.)
  – Worker nodes / DataNodes store the blocks
  – Each file is chopped up into a number of blocks (128 MB)
  – Fault tolerance is achieved by replicating blocks
  – Optimized for large files and sequential reads
• MapReduce framework
  – Simple data-parallel programming model
  – Executes user jobs specified as "map" and "reduce" functions
  – JobTracker (master) and TaskTrackers (workers)
• A cluster node runs both DFS and MR daemons

(Figure: the NameNode maps File1 to blocks 1–4, which are replicated across the
DataNodes.)
Dataflow in Hadoop

(Figure sequence summarised)
1. The client submits the job; the scheduler assigns map and reduce tasks.
2. Each map task reads its input block from HDFS.
3. Map output is written to the local file system; the "finished" status and
   output locations are reported.
4. Reduce tasks fetch the map output over HTTP GET.
5. Reduce tasks write the final answer back to HDFS.
HDFS Architecture

• NameNode: maps (filename, offset) -> block id, and block -> DataNode
• DataNode: maps block -> local disk
• Secondary NameNode: periodically merges edit logs
• A block is also called a chunk

Pipeline Details (figure omitted)
MR programming

1) Calculating Pi (Monte Carlo)
• The area of the square is As = (2r)^2 = 4r^2; the area of the circle is
  Ac = pi * r^2, so Ac/As = pi/4.
• pi ≈ 4 * (number of random points falling inside the circle) /
  (number of points generated in the square)
• MAP: generate random points in the square and count how many fall inside the
  circle
• REDUCE: sum the counts and compute pi = 4 * (points in circle) / (total
  points); a code sketch follows
• MapReduce is a restricted parallel programming model meant for large clusters
  – User implements Map() and Reduce()
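Below is a minimal sketch (not from the slides) of how this could look in the
old org.apache.hadoop.mapred API used later in the deck. It assumes each input
line holds the number of points a mapper should generate; the driver then
divides the two reducer outputs to obtain pi. Class and key names are
illustrative.

// Sketch only; needs the usual imports (org.apache.hadoop.io.*,
// org.apache.hadoop.mapred.*, java.io.IOException, java.util.Iterator).
public static class PiMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    long n = Long.parseLong(line.toString().trim());  // number of points to generate
    long inCircle = 0;
    java.util.Random rnd = new java.util.Random();
    for (long i = 0; i < n; i++) {
      double x = 2 * rnd.nextDouble() - 1;            // random point in the 2r x 2r square (r = 1)
      double y = 2 * rnd.nextDouble() - 1;
      if (x * x + y * y <= 1.0) inCircle++;           // point also falls inside the circle
    }
    out.collect(new Text("circle"), new LongWritable(inCircle));
    out.collect(new Text("square"), new LongWritable(n));
  }
}

// The reducer just sums the counts per key (like WordCount); the driver reads
// the two totals and reports pi ~= 4 * circle / square.
public static class SumReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) sum += values.next().get();
    out.collect(key, new LongWritable(sum));
  }
}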
Ex 2: WORDCOUNT EXAMPLE

• Divide the document and analyse one line per mapper
• Each mapper counts the words in its line
• The reducer sums up all the counts
• File
Hello World Bye World
Hello Hadoop Goodbye Hadoop
• Map
For the given sample input the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
• The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
The output of the first combine:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second combine:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
Thus the output of the job (reduce) is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
• Map()
– Input <filename, file text>
– Parses file and emits <word, count> pairs
• eg. <”hello”, 1>
• Reduce()
– Sums all values for the same key and emits <word,
TotalCount>
• eg. <"hello", (1, 1)> => <"hello", 2>
Parallelisation - Hadoop MapReduce

File:
Hello World Bye World
Hello Hadoop Goodbye Hadoop

Map 1 output:     <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2 output:     <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>

Combine 1 output: <Bye, 1> <Hello, 1> <World, 2>
Combine 2 output: <Goodbye, 1> <Hadoop, 2> <Hello, 1>

Reduce output:    <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
MR model
• Map()
– Process a key/value pair to generate intermediate
key/value pairs
• Reduce()
– Merge all intermediate values associated with the same
key
• Users implement interface of two primary methods:
1. Map: (key1, val1) → (key2, val2)
2. Reduce: (key2, [val2]) → [val3]
• Map corresponds to the group-by clause (on the key) of an aggregate SQL query
• Reduce corresponds to the aggregate function (e.g., average) that is computed
  over all the rows with the same group-by attribute (key)
Example Word Count (1)
• Map
public static class MapClass extends MapReduceBase implements Mapper {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) throws IOException {
    String line = ((Text) value).toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
Example Word Count (2)
• Reduce
public static class Reduce extends MapReduceBase implements Reducer {
  public void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += ((IntWritable) values.next()).get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Example Word Count (3)
• Main
public static void main(String[] args) throws IOException {
  // checking goes here
  JobConf conf = new JobConf();

  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);

  conf.setInputPath(new Path(args[0]));
  conf.setOutputPath(new Path(args[1]));

  JobClient.runJob(conf);
}
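The listing above uses the old org.apache.hadoop.mapred API (JobConf,
MapReduceBase). For reference, a sketch of the same job written against the
newer org.apache.hadoop.mapreduce API, with illustrative class names, looks
roughly like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);                    // emit <word, 1>
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));      // emit <word, total count>
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}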
• Job – the whole MapReduce program
• Task – a single map or reduce
• Task attempt – an individual execution of a task; re-run on crash, also used
  for speculative execution

(Figure: The client builds a JobConf and calls JobClient.runJob(); the job jar
and configuration are placed in HDFS and the MR job is submitted to the
JobTracker, which tracks the job-in-progress. The JobTracker assigns tasks to
TaskTrackers on the slave nodes; each TaskTracker launches a TaskRunner that
executes the mapper/reducer task instance and reports the task-in-progress.)
• MapReduce programs are packaged as a Java "jar" file plus an XML file
  containing the serialized program configuration options
• These are stored in HDFS and the task trackers are notified
• Map tasks are scheduled on nodes local to their data; if those nodes are
  overloaded, the tasks are assigned to other nodes
3. Genome clustering
• A genome is a DNA sequence over the alphabet {A, T, G, C}
• Features characterise a genome – motifs (overlapping 3-mers) are extracted in
  MR job 1; a mapper sketch follows the example below
• Example sequence: AAAGGGTTTCCCAAAG
  Mapper: AAA-1, AAG-1, AGG-1, GGG-1, GGT-1, GTT-1, TTT-1, TTC-1, TCC-1,
  CCC-1, CCA-1, CAA-1, AAA-1, AAG-1
  Reducer: AAA-2, AAG-2, AGG-1, GGG-1, GGT-1, GTT-1, TTT-1, TTC-1, TCC-1,
  CCC-1, CCA-1, CAA-1
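A minimal mapper sketch (not from the slides), written in the same old mapred
API style as the WordCount example; a WordCount-style reducer then sums the
counts to give the [1 x 64] motif feature vector.

// Sketch only; needs the usual imports (org.apache.hadoop.io.*, org.apache.hadoop.mapred.*).
public static class KmerMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final int K = 3;                       // motif length: 4^3 = 64 possible motifs
  private final static IntWritable one = new IntWritable(1);
  private Text kmer = new Text();
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    String seq = line.toString().replaceAll("\\s+", "");  // drop whitespace in the sequence line
    for (int i = 0; i + K <= seq.length(); i++) {
      kmer.set(seq.substring(i, i + K));                // emit each overlapping 3-mer with count 1
      out.collect(kmer, one);
    }
  }
}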
MR pipeline for species identification:
Read the input sequence
  -> Obtain the [1 x 64] feature descriptor vector (MR job 1)
  -> Compare to the existing species by clustering (MR job 2)
  -> Identify the species
K-Means
• Choose initial cluster centres
• Compute the distance between each sample and the cluster centres
• Find which cluster centre is closest to a given sample
• Assign the sample to that cluster
• Re-evaluate (recompute) the cluster centres
MapReduce k-means
• Key – cluster id, value – the coordinates of a row (sample)
• Map – calculates the distance of each point from the centres
  – key out: cluster id, value out: distance to each cluster centre
• Reduce – finds the closest cluster centre and recalculates the new centres
  for the next iteration
  – output: new centres
MR K-Means
• Let k index the features of a sample; V1(k) is the value of the kth feature
  of the sample and V2(k) is the value of the kth feature of a cluster centre
• Mapper: the distance to the ith cluster centre is the Euclidean distance
  Si = sqrt( Σk (V1(k) − V2(k))^2 )
• Reducer: min[Si] is evaluated for a sample (species) over all cluster centres
• The centroid is then re-evaluated over all features k of the samples assigned
  to the cluster; this updates the centroid for the next iteration (a MapReduce
  sketch follows)
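A minimal sketch (not from the slides) of one k-means iteration in the old
mapred API. The slides describe the mapper emitting distances and the reducer
choosing the minimum; the more common arrangement sketched below instead
assigns each point to its nearest centre in the mapper and recomputes centroids
in the reducer. The kmeans.centres property and the parsePoint/parseCentres
helpers are illustrative assumptions, not part of any Hadoop API.

// Sketch only: assumes the current centres are passed via the JobConf and each
// input line is a comma-separated feature vector. parseCentres/parsePoint are
// hypothetical helper methods.
public static class KMeansMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {
  private double[][] centres;                               // [cluster][feature]
  public void configure(JobConf job) {
    centres = parseCentres(job.get("kmeans.centres"));      // illustrative property name
  }
  public void map(LongWritable offset, Text line,
                  OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    double[] p = parsePoint(line.toString());
    int best = 0; double bestDist = Double.MAX_VALUE;
    for (int c = 0; c < centres.length; c++) {              // squared Euclidean distance to each centre
      double d = 0;
      for (int k = 0; k < p.length; k++) d += (p[k] - centres[c][k]) * (p[k] - centres[c][k]);
      if (d < bestDist) { bestDist = d; best = c; }
    }
    out.collect(new IntWritable(best), line);               // key = nearest cluster id, value = the point
  }
}

public static class KMeansReducer extends MapReduceBase
    implements Reducer<IntWritable, Text, IntWritable, Text> {
  public void reduce(IntWritable clusterId, Iterator<Text> points,
                     OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    double[] sum = null; long n = 0;
    while (points.hasNext()) {                              // accumulate the assigned points
      double[] p = parsePoint(points.next().toString());
      if (sum == null) sum = new double[p.length];
      for (int k = 0; k < p.length; k++) sum[k] += p[k];
      n++;
    }
    StringBuilder centre = new StringBuilder();             // new centre = mean of assigned points
    for (int k = 0; k < sum.length; k++) {
      if (k > 0) centre.append(',');
      centre.append(sum[k] / n);
    }
    out.collect(clusterId, new Text(centre.toString()));    // used as a centre in the next iteration
  }
}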
4. SVM as a classifier
• A Support Vector Machine (SVM) is a supervised learning method that analyses
  data for classification and regression
• Learning in an SVM amounts to finding a hyperplane that separates the
  training set in a feature space, using a kernel function which is an inner
  product of the input-space features
• Binary classification can be viewed as the task of separating two classes in
  feature space
f(x) = sign(wT x + b)

wT x + b = 0 defines the hyperplane; points with wT x + b > 0 lie on the
positive side and points with wT x + b < 0 on the negative side, with
w = Σi αi si,
where
  w represents the normal vector to the hyperplane
  x represents the training patterns
  b represents the threshold
  α represents the coefficients
  s represents the support vectors
Methodology
Steps
1. Data collection – download the data sets for each month/year/day
2. Input feature selection – MA and RSI; the volume average labels each month
   as a + or − point
3. Finding the support vectors S1, S2 and the coefficients α1, α2
4. Construction of the hyperplane y = wT x + b
5. Classification of test data into +ive / −ive points using the hyperplane
   and the X values
2. a) Input features
• Moving Average
  MA = (1/n) Σ(i=1..n) CPi
  where CPi = closing price on the ith day

• Relative Strength Index
  RSI = 100 * RS / (1 + RS), where RS = AU / AD
  AU = total upward price change during the past n days
  AD = total downward price change during the past n days
Input data & feature selection example

DATE      HIGH   LOW   OPEN  CLOSE  VOLUME  OPEN-CLOSE
26.8.11   8.87   7.04  8.19  7.53   4.3     0.66
19.8.11   9.48   7.70  8.81  8.08   2.5     0.73
12.8.11   9.27   8.32  9.20  8.42   1.6     0.78
5.8.11    10.83  7.74  8.60  9.15   2.8     -0.55

X = MA = 8.3                  Vol avg = 2.8
RS = (0.66 + 0.73 + 0.78) / 0.55 = 3.94
Y = RSI = 100 * RS / (1 + RS) = 79.87

Vol avg (2.8) vs. total data set average (1.54): an 81.61% increase, so the
month is labelled a + point. (A code sketch of the feature computation follows.)
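A small sketch (not from the slides) that reproduces the August feature
computation above; the numbers are taken directly from the table.

// Sketch only: MA = mean closing price, RS = sum of up-moves / sum of down-moves,
// RSI = 100 * RS / (1 + RS).
public class FeatureExample {
  public static void main(String[] args) {
    double[] close = {7.53, 8.08, 8.42, 9.15};
    double[] volume = {4.3, 2.5, 1.6, 2.8};
    double[] openMinusClose = {0.66, 0.73, 0.78, -0.55};

    double ma = 0, vol = 0, up = 0, down = 0;
    for (double c : close) ma += c;
    ma /= close.length;                         // ~8.30
    for (double v : volume) vol += v;
    vol /= volume.length;                       // ~2.80
    for (double d : openMinusClose) {
      if (d > 0) up += d; else down -= d;       // accumulate up- and down-moves
    }
    double rs = up / down;                      // ~3.94
    double rsi = 100 * rs / (1 + rs);           // ~79.8 (the slides report 79.87)
    System.out.printf("MA=%.2f VolAvg=%.2f RS=%.2f RSI=%.2f%n", ma, vol, rs, rsi);
  }
}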
2.b)Repeat for other months also
Month MA RSI +/-
August 8.30 79.87 +
June 8.97 85.81 +
May 10.73 74.09 +
April 14.54 55.35 +
March 13.74 45.94 +
November 11.79 24.24 +
July 10.61 42.53 -
February 12.32 31.50 -
January 10.44 83.97 -
December 12.78 28.05 -
October 8.81 31.03 -
3. Support Vectors

• The support vectors are selected using the Euclidean distance formula
• In the Euclidean plane, if X = (x1, x2) and Y = (y1, y2), the distance is
  D(X, Y) = sqrt( (x1 − y1)^2 + (x2 − y2)^2 )
  where X and Y are the positive and negative points of Moving Average and
  Relative Strength Index
+ive point       -ive point       distance
8.30, 79.87      10.61, 42.53     37.44
                 12.32, 31.50     48.53
                 10.44, 83.97     4.62
                 12.78, 28.05     52.01
                 8.81, 31.03      48.84
8.97, 85.81      10.61, 42.53     43.31
                 12.32, 31.50     54.41
                 10.44, 83.97     2.35 (min)
                 12.78, 28.05     57.88
                 8.81, 31.03      54.77

The other pairs are also calculated and the minimum distance is found, giving
S1 = (8.97, 85.81) and S2 = (10.44, 83.97).
Augmented with a bias component of 1, the support vectors are
S1 = (8.97, 85.81, 1) and S2 = (10.44, 83.97, 1).
(A distance code sketch follows.)
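A small sketch (not from the slides) of the Euclidean distance computation used
to pick the closest positive/negative pair as support vectors.

// Sketch only: distance between two (MA, RSI) points.
public class DistanceExample {
  static double dist(double[] a, double[] b) {
    return Math.sqrt((a[0] - b[0]) * (a[0] - b[0]) + (a[1] - b[1]) * (a[1] - b[1]));
  }
  public static void main(String[] args) {
    double[] plus  = {8.97, 85.81};             // a +ive point (June)
    double[] minus = {10.44, 83.97};            // a -ive point (January)
    // ~2.355; the slides round this to 2.35, the minimum in the table
    System.out.printf("%.3f%n", dist(plus, minus));
  }
}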
3.b) Finding α1 and α2

• α1 (S1 · S1) + α2 (S1 · S2) = +1
• α1 (S1 · S2) + α2 (S2 · S2) = −1

  7444.81 α1 + 7300.11 α2 = +1
  7300.11 α1 + 7160.95 α2 = −1

  α1 = −0.71;  α2 = 0.73
4. Construction of the Hyperplane

• The hyperplane is constructed using the equation y = wT x + b, where
  w = Σ αi si
    w represents the normal vector to the hyperplane
    x represents the training patterns
    b represents the threshold
    α represents the coefficients
    s represents the support vectors
• Using the α values calculated above:
  wT = Σ αi Si = α1 s1 + α2 s2 = (1.25, 0.37), with b = 0.02 (the bias
  component of the augmented vectors)
• Hyperplane: y = 1.25 x1 + 0.37 x2 + 0.02
5. Testing set
• Test point: MA = x1 = 12.44, RSI = x2 = 50; predict the volume average
• y = wT x + b = 1.25(12.44) + 0.37(50) + 0.02 = 34.05 + 0.02 = 34.07
  → predicted value (vol avg)
• Overall average = 18
• Increased % = (34.07 − 18) / 18 ≈ 89%
(A verification sketch follows.)
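A small sketch (not from the slides) that reproduces the worked example
end-to-end: w = α1·s1 + α2·s2 over the augmented support vectors, then
y = wT x + b for the test point.

// Sketch only: the support vectors and alpha values are the ones given in the slides.
public class SvmExample {
  public static void main(String[] args) {
    double[] s1 = {8.97, 85.81, 1};        // positive support vector (MA, RSI, bias)
    double[] s2 = {10.44, 83.97, 1};       // negative support vector (MA, RSI, bias)
    double a1 = -0.71, a2 = 0.73;          // alpha values from the slides

    double[] w = new double[3];            // w[0], w[1] = weights; w[2] = bias b
    for (int k = 0; k < 3; k++) w[k] = a1 * s1[k] + a2 * s2[k];

    double[] x = {12.44, 50};              // test point: MA = 12.44, RSI = 50
    double y = w[0] * x[0] + w[1] * x[1] + w[2];
    // Prints roughly w = (1.25, 0.37), b = 0.02 and y ~ 34.25; the slides report
    // y = 34.07 because they round w to two decimals before evaluating.
    System.out.printf("w=(%.2f, %.2f) b=%.2f y=%.2f%n", w[0], w[1], w[2], y);
  }
}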
Multi SVM

• The support vector machine is a powerful tool for binary classification,
  capable of generating very fast classifier functions following a training
  period
• To extend the binary-class scenario to a multi-class scenario, an M-class
  problem is decomposed into a series of two-class problems
Approaches for Multi class SVM
• Multiclass ranking SVMs, in which one SVM
decision function attempts to classify all classes
• One-against-all classification, in which there is one binary SVM for each
  class to separate members of that class from members of the other classes
  (see the sketch after this list)
• Pairwise classification, in which there is one binary
SVM for each pair of classes to separate members of
one class from members of the other
• Binary decision tree classification
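A minimal sketch (not from the slides) of one-against-all prediction with
linear SVMs: one (w, b) pair per class, and the class with the largest decision
value wins.

// Sketch only: w[c] and b[c] would come from training one binary SVM per class.
public class OneVsAll {
  static int predict(double[][] w, double[] b, double[] x) {
    int best = 0;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int c = 0; c < w.length; c++) {
      double score = b[c];
      for (int k = 0; k < x.length; k++) score += w[c][k] * x[k];   // decision value w_c . x + b_c
      if (score > bestScore) { bestScore = score; best = c; }
    }
    return best;    // index of the predicted class, e.g. one of the four ranges below
  }
}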
Classes: −100 to −50, −50 to 0, 0 to 50, 50 to 100

• One-against-all / pairwise: one binary SVM per class (or per pair of classes)
  over the four ranges −100 to −50, −50 to 0, 0 to 50, 50 to 100
• Binary decision tree (BDT): the root node covers −100 to 100 and splits into
  −100 to 0 and 0 to 100, which in turn split into the four leaf ranges
  −100 to −50, −50 to 0, 0 to 50, 50 to 100

Hadoop – Components

HDFS – Distributed file system
MapReduce – Distributed processing of large data sets
ZooKeeper – Coordination service for distributed applications
HBase – Scalable distributed database; supports structured data
Avro – Data serialization system
SQOOP – Connector to structured databases
Mahout – Machine learning library
Chukwa – Monitors large distributed systems
Flume – Moves large data efficiently for post-processing
Hue – GUI to operate and develop Hadoop applications
Hive – Data warehousing framework
Pig – Framework for parallel computation
Oozie – Workflow service to manage data processing jobs
Many more ...
Hadoop – The Various Forms Today

Apache Hadoop – Native Hadoop distribution from the Apache Foundation
Yahoo! Hadoop – Hadoop distribution of Yahoo
CDH – Hadoop distribution from Cloudera
GreenPlum Hadoop – Hadoop distribution from EMC
HDP – Hadoop platform from Hortonworks
M3 / M5 / M7 – Hadoop distributions from MapR
Project Serengeti – VMware's implementation of Hadoop on vCenter
And more ...
Hadoop vs Spark
• Spark runs on top of Hadoop
• Hadoop provides distributed storage; Spark is a tool for distributed
  processing on top of that storage
• Batch vs real-time processing
• Written in Java vs written in Scala
• They are complementary
• Spark stores data in memory, whereas Hadoop MapReduce stores intermediate
  data on disk
  – uses more RAM instead of network and disk I/O
  – relatively fast compared to Hadoop MapReduce
• Spark needs large RAM → dedicated high-end physical machines
• Spark suits
  – iterative jobs / machine learning
  – interactive analysis
  – stream and sensor data processing
• Hadoop uses replication to achieve fault tolerance, whereas Spark uses
  resilient distributed datasets (RDDs) to reconstruct a lost partition (from
  lineage / stable storage); it backtracks and completes a job rather than
  starting again from the beginning
• Shared in-memory data vs distributed on-disk data

Hadoop MapReduce vs Spark
Conclusion
• Hadoop
• MapReduce
• Examples
• Case studies
• Hadoop vs Spark
Thank You
