Dr G Sudha Sadasivam
Professor
CSE, PSGCT
Outline
• Big Data
• Hadoop and MR programming
• MR model
• Example
• Kmeans
• SVM
Trend: New “Big Data” becoming commonplace
• Sentiment analytics
• Transaction analytics
• Claim fraud analytics
• Warranty claim analytics
• Surveillance analytics
• CDR analytics
• Multi-channel analytics
The Big Data Opportunity
[Figure: Big Data described by Variety, Volume and Velocity, with related technologies: machine learning, NLP, databases, complex event processing]
5 V’s
NoSQL
• SQL
• Unstructured
– Key value stores
– Column family
– Document databases
– Graph databases
Hadoop and MR Programming
A framework for running applications that produce and process huge
volumes of data (petabytes to zettabytes) on large clusters of
commodity hardware (1000 nodes)
Open source Apache Software Foundation Project
Hadoop includes
HDFS – a distributed filesystem that distributes data across the cluster
Map/Reduce – Hadoop implements this programming model. It is an
offline computing engine that handles distributed applications
CONCEPT
Moving computation is more efficient than moving
large data
Hadoop evolution
• 2011 – Cloudera, MapR
• 2012 – Hortonworks
• 2013 – Hadoop 2.0 (YARN)
• 2014 – Spark
Hadoop – Who’s Using It ?
Uses Hadoop and HBase for :
• Social services
• Structured data storage
Uses Hadoop for :
• Processing for internal use
• Building Amazon's product search indices
• Processing millions of sessions daily for analytics
Dataflow in Hadoop
• Submit job: the job runs as map tasks followed by reduce tasks
• Read: each map task reads one block of the input file (Block 1, Block 2, ...) from HDFS
• Each map task writes its output to the local file system (Local FS) of its node
• Each reduce task fetches the map outputs over HTTP GET
• Write: the reduce tasks write the final answer to HDFS
HDFS Architecture
• NameNode: maps (filename, offset) → block ID and block → DataNode
• DataNode: maps block → local disk
• Secondary NameNode: periodically merges the edit logs
Block is also called chunk
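The two NameNode lookups listed above can be pictured as two maps. The following is a toy sketch in plain Java, not the real HDFS classes: the class name, method names, file path, block IDs, DataNode names and the 128 MB block size are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the NameNode metadata; every name here is hypothetical.
final class NameNodeSketch {
    // Assumed block size for the example (128 MB); real clusters configure this.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // filename -> ordered list of block IDs
    private final Map<String, List<String>> blocksOfFile = new HashMap<>();
    // block ID -> DataNodes holding a replica of that block
    private final Map<String, List<String>> nodesOfBlock = new HashMap<>();

    void addFile(String filename, List<String> blockIds) {
        blocksOfFile.put(filename, blockIds);
    }

    void addReplicas(String blockId, List<String> dataNodes) {
        nodesOfBlock.put(blockId, dataNodes);
    }

    // First lookup: (filename, offset) -> block ID
    String blockAt(String filename, long offset) {
        List<String> blocks = blocksOfFile.get(filename);
        return blocks.get((int) (offset / BLOCK_SIZE));
    }

    // Second lookup: block ID -> DataNodes
    List<String> dataNodesOf(String blockId) {
        return nodesOfBlock.get(blockId);
    }

    public static void main(String[] args) {
        NameNodeSketch nn = new NameNodeSketch();
        nn.addFile("/logs/day1", Arrays.asList("blk_1", "blk_2"));
        nn.addReplicas("blk_1", Arrays.asList("dn-a", "dn-b", "dn-c"));
        nn.addReplicas("blk_2", Arrays.asList("dn-b", "dn-c", "dn-d"));
        String block = nn.blockAt("/logs/day1", 200L * 1024 * 1024);
        System.out.println(block + " is stored on " + nn.dataNodesOf(block));
    }
}
```

The third mapping, block → local disk, would live on each DataNode and is left out of the sketch.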
Pipeline Details
MR programming
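1) Calculating PI
Random points are generated inside a square of side 2r that encloses a circle of radius r.
The area of the square is $A_s = (2r)^2 = 4r^2$.
The area of the circle is $A_c = \pi r^2$.
• Since $A_c / A_s = \pi / 4$: PI = 4 * (number of points in the circle) /
(number of points in the square)
• MAP: count the number of generated
points that are both in the circle
and in the square
• REDUCE: sum the counts and compute PI = 4 * ratio, where ratio = points in the circle / points in the square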
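The MAP and REDUCE steps above can be tried out without a cluster. The following is a minimal sketch in plain Java, with no Hadoop classes. The class and method names, the seeds and the number of simulated map tasks are hypothetical. It samples the unit square against a quarter circle of radius 1, which gives the same ratio pi/4 as a full circle inside its bounding square.

```java
import java.util.Random;

// Illustrative sketch only: simulates the map and reduce steps in one JVM.
final class PiSketch {

    // "Map" step: generate `points` random points in the unit square and
    // count how many fall inside the quarter circle of radius 1.
    static long countInside(long points, long seed) {
        Random rng = new Random(seed);
        long inside = 0;
        for (long i = 0; i < points; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y <= 1.0) {
                inside++;
            }
        }
        return inside;
    }

    // "Reduce" step: sum the per-task counts and apply pi = 4 * inside / total.
    static double estimatePi(long[] insideCounts, long pointsPerTask) {
        long inside = 0;
        for (long c : insideCounts) {
            inside += c;
        }
        long total = pointsPerTask * insideCounts.length;
        return 4.0 * inside / total;
    }

    public static void main(String[] args) {
        long pointsPerTask = 250_000;
        long[] counts = new long[4]; // four simulated map tasks
        for (int t = 0; t < counts.length; t++) {
            counts[t] = countInside(pointsPerTask, 1000L + t);
        }
        System.out.println("pi is roughly " + estimatePi(counts, pointsPerTask));
    }
}
```

Each simulated map task only returns a single count, so the reduce step has very little data to combine. This is what makes the problem a good fit for the model.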
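// Driver fragment (old org.apache.hadoop.mapred API); the start of the method is not shown.
conf.setOutputKeyClass(Text.class);          // key type of the job output
conf.setOutputValueClass(IntWritable.class); // value type of the job output
conf.setMapperClass(MapClass.class);
conf.setCombinerClass(Reduce.class);         // the reducer doubles as a local combiner
conf.setReducerClass(Reduce.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));  // input directory (newer form of conf.setInputPath)
FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output directory (newer form of conf.setOutputPath)
JobClient.runJob(conf);                      // submit the job and wait for it to finish
}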
• Job
• Task
• Task attempt – crash, speculative execution
[Figure: job execution: MR job → Job Tracker → Task Tracker (launch task, task in progress) → TaskRunner → mapper/reducer]
• MapReduce programs are contained in a Java
“jar” file + an XML file containing serialized
program configuration options
• These are stored in HDFS and task trackers are
notified
• Map tasks run on the node that holds their data (local); if that node
is overloaded, they are scheduled on other nodes
3. Genome clustering
• Genome is a DNA sequence (ATGC)
• Features characterize a genome: motifs are
extracted in the first MapReduce job (MR1)
Sequence: AAAGGGTTTCCCAAAG
Mapper: AAA-1, AAG-1, AGG-1, GGG-1, GGT-1,
GTT-1, TTT-1, TTC-1, TCC-1, CCC-1, CCA-1, CAA-1,
AAA-1, AAG-1
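The mapper output listed above is a sliding window of length 3 over the sequence. The following is a minimal sketch in plain Java, with no Hadoop classes. The class and method names are hypothetical, and the reduce step (summing the 1s per motif) is an assumption about what follows the mapper, since the slide shows only the mapper output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: simulates motif extraction in one JVM.
final class MotifCountSketch {

    // "Map" step: emit every overlapping substring of length k (a k-mer).
    static List<String> mapKmers(String sequence, int k) {
        List<String> kmers = new ArrayList<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            kmers.add(sequence.substring(i, i + k));
        }
        return kmers;
    }

    // Assumed "reduce" step: sum the 1s emitted for each distinct k-mer.
    static Map<String, Integer> reduceCounts(List<String> kmers) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String kmer : kmers) {
            counts.merge(kmer, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> emitted = mapKmers("AAAGGGTTTCCCAAAG", 3);
        System.out.println(emitted);               // the keys the mapper emits, in order
        System.out.println(reduceCounts(emitted)); // per-motif totals
    }
}
```

The per-motif totals form a feature vector for the genome, which a later clustering step could consume.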
Read input
$w^T x + b = 0$, with $w = \sum_i \alpha_i s_i$
$w^T x + b > 0$
$w^T x + b < 0$
where
w represents normal vector to hyperplane
x represents training patterns
b represents threshold
α represents coefficients
s represents support vectors
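The definitions above translate directly into a few lines of arithmetic. The following is a minimal sketch in plain Java, assuming the simplified form used here, where the sign of each coefficient carries the class of its support vector. The class name, method names and the numbers in main are hypothetical and are not taken from the case study.

```java
// Illustrative sketch only: evaluates w = sum_i alpha_i * s_i and sign(w^T x + b).
final class SvmDecisionSketch {

    // Builds the normal vector w from the coefficients and support vectors.
    static double[] weightVector(double[] alphas, double[][] supportVectors) {
        double[] w = new double[supportVectors[0].length];
        for (int i = 0; i < alphas.length; i++) {
            for (int d = 0; d < w.length; d++) {
                w[d] += alphas[i] * supportVectors[i][d];
            }
        }
        return w;
    }

    // Computes w^T x + b.
    static double decisionValue(double[] w, double[] x, double b) {
        double sum = b;
        for (int d = 0; d < w.length; d++) {
            sum += w[d] * x[d];
        }
        return sum;
    }

    // Returns +1 or -1 for the two sides of the hyperplane, 0 if x lies on it.
    static int classify(double[] w, double[] x, double b) {
        return (int) Math.signum(decisionValue(w, x, b));
    }

    public static void main(String[] args) {
        double[] alphas = {0.5, -0.5};        // hypothetical coefficients
        double[][] s = {{2.0, 2.0}, {0.0, 0.0}}; // hypothetical support vectors
        double[] w = weightVector(alphas, s);
        double b = -2.0;                      // hypothetical threshold
        System.out.println(classify(w, new double[] {3.0, 3.0}, b));
        System.out.println(classify(w, new double[] {0.0, 0.0}, b));
    }
}
```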
Methodology
Steps
1. Data collection: download data sets by month,
year, or day
2. Input feature selection: MA (moving average) and RSI (relative strength index)
Vol avg → +, - points
3. Finding support vectors: S1, S2
with coefficients α1, α2
4. Construction of the hyperplane
$y = w^T x + b$
5. Classification of test data into +ve / -ve points
using the hyperplane and the x values
2. a) Input feature
• Moving Average
$MA = \frac{1}{n} \sum_{i=1}^{n} CP_i$
where $CP_i$ = closing price on the $i$th day and $n$ = number of days
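The moving average is the plain mean of the closing prices in the window. The following is a minimal sketch in Java. The class name, method name and the prices in main are hypothetical and are not taken from the data set used in the case study.

```java
// Illustrative sketch only: MA = (CP_1 + ... + CP_n) / n.
final class MovingAverageSketch {

    // Returns the mean of the closing prices in the window.
    static double movingAverage(double[] closingPrices) {
        if (closingPrices.length == 0) {
            throw new IllegalArgumentException("need at least one closing price");
        }
        double sum = 0.0;
        for (double price : closingPrices) {
            sum += price;
        }
        return sum / closingPrices.length;
    }

    public static void main(String[] args) {
        double[] prices = {10.0, 12.0, 14.0}; // hypothetical closing prices
        System.out.println("MA = " + movingAverage(prices));
    }
}
```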
+ point: 81.61% increase
2. b) Repeat for the other months
Month      MA     RSI    +/-
August     8.30   79.87  +
June       8.97   85.81  +
May        10.73  74.09  +
April      14.54  55.35  +
March      13.74  45.94  +
November   11.79  24.24  +
July       10.61  42.53  -
February   12.32  31.50  -
January    10.44  83.97  -
December   12.78  28.05  -
October    8.81   31.03  -
3. Support Vectors
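• $\alpha_1 (S_1 \cdot S_1) + \alpha_2 (S_1 \cdot S_2) = +1$
• $\alpha_1 (S_1 \cdot S_2) + \alpha_2 (S_2 \cdot S_2) = -1$
$7444.81\,\alpha_1 + 7300.11\,\alpha_2 = +1$
$7300.11\,\alpha_1 + 7160.95\,\alpha_2 = -1$
Solving gives $\alpha_1 \approx 0.71$, $\alpha_2 \approx -0.73$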
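The two equations form a 2×2 linear system that can be solved by hand or with a few lines of code. The following is a minimal sketch in Java that uses Cramer's rule on the numeric coefficients exactly as written above. The class and method names are hypothetical. With right-hand sides +1 and -1, the solution comes out at roughly α1 = 0.71 and α2 = -0.73.

```java
// Illustrative sketch only: solves the 2x2 system for the two coefficients.
final class SupportVectorCoefficients {

    // Solves a11*x + a12*y = b1 and a21*x + a22*y = b2 by Cramer's rule.
    static double[] solve2x2(double a11, double a12, double a21, double a22,
                             double b1, double b2) {
        double det = a11 * a22 - a12 * a21;
        if (det == 0.0) {
            throw new ArithmeticException("singular system");
        }
        double x = (b1 * a22 - a12 * b2) / det;
        double y = (a11 * b2 - a21 * b1) / det;
        return new double[] {x, y};
    }

    public static void main(String[] args) {
        // Dot products S1.S1, S1.S2 and S2.S2 as given in the equations.
        double[] alpha = solve2x2(7444.81, 7300.11, 7300.11, 7160.95, 1.0, -1.0);
        System.out.println(String.format("alpha1 = %.2f, alpha2 = %.2f", alpha[0], alpha[1]));
    }
}
```

The system is close to singular because the two support vectors are nearly parallel, so the result is sensitive to rounding in the dot products.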
4. Construction of Hyperplane
[Figure: multi-class classification, pairwise and BDT: the range -100 to 100 is split into -100 to 0 and 0 to 100, and 0 to 100 into 0 to 50 and 50 to 100]
• MapReduce – distributed processing of large data sets
• Hue – GUI to operate and develop Hadoop applications
• ZooKeeper – coordination service for distributed applications
• Hive – data warehousing framework
• Apache ZooKeeper, Apache Tomcat
• And more …
Hadoop vs Spark
• Spark runs on top of Hadoop
• Hadoop provides distributed storage; Spark is a tool for distributed processing on that storage
• Batch (Hadoop) vs real-time (Spark) processing
• Hadoop is written in Java, Spark in Scala
• They are complementary
• Spark stores data in memory whereas Hadoop stores data on disk.
– more RAM instead of network and disk I/O
• Spark is relatively fast compared to Hadoop.
• Spark needs dedicated high-end physical machines with large RAM
• Spark
– Iterative jobs / machine learning
– Interactive
– Stream and sensor data processing
• Hadoop uses replication to achieve fault tolerance, whereas Spark uses resilient
distributed datasets (RDDs) to reconstruct a lost block from stable storage.
Spark backtracks and completes the job rather than restarting from the beginning
• Shared memory vs distributed
Hadoop MapReduce vs Spark
Conclusion
• Hadoop
• MapReduce
• Examples
• Case studies
• Hadoop vs Spark
Thank You