
CYS14011 - Rithu P Ravi


CYS14012 - Saumya K


Big Data Hadoop... HDFS Map Reduce

Why Hadoop, and What Is It?

Apache Hadoop is an open-source software framework


A tool to process big data

Rithu P Ravi,SaumyaK HADOOP


Outline

1 Big Data
2 Hadoop...
3 HDFS
4 Map Reduce


Big Data

Data whose volume exceeds the storage and processing capacity of conventional systems


3 Vs
Volume
Velocity
Variety


Big Data

Exponential growth of data


A challenge for Google, Yahoo, Microsoft, and Amazon
They need to go through TBs and PBs of data
Existing tools became inadequate for processing such large data sets


One big elephant?
Or numerous small chickens?


How to handle data this big?

Issues
How to handle systems going up and down?
How to combine the data from all the systems?


Problem 1: Systems going up and down

Commodity hardware is used for data storage and analysis
Chances of failure are very high
Data is replicated across several machines
GFS (Google File System)
Divides data into chunks and stores them across the file system
Can store data in the petabyte range
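
The chunk-and-replicate idea can be sketched in a few lines. This is only an illustration of the principle, not GFS itself: the function names, the round-robin placement, the 4-byte chunk size, and the machine names m1 to m4 are all assumptions made for the example.

```python
def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Split data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, machines: list, replicas: int) -> dict:
    """Assign each chunk to `replicas` distinct machines (simple round-robin)."""
    if replicas > len(machines):
        raise ValueError("need at least as many machines as replicas")
    return {
        chunk_id: [machines[(chunk_id + r) % len(machines)] for r in range(replicas)]
        for chunk_id in range(num_chunks)
    }


def readable_after_failure(placement: dict, failed: set) -> bool:
    """True if every chunk still has at least one replica on a live machine."""
    return all(
        any(machine not in failed for machine in holders)
        for holders in placement.values()
    )


# Hypothetical cluster: 10 bytes of data, 4-byte chunks, 3 replicas per chunk.
chunks = split_into_chunks(b"x" * 10, chunk_size=4)
placement = place_replicas(len(chunks), ["m1", "m2", "m3", "m4"], replicas=3)
```

With three replicas per chunk, losing any two machines still leaves every chunk readable; only when all three holders of some chunk fail is data lost.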


Problem 2: How to combine the data?

Analyze data spread across different machines
Merging the results means data has to travel across the network
Doing this is notoriously challenging
Google again: MapReduce


Map Reduce

Provides a programming model


Abstracts disk reads and writes
Represents data as (key, value) pairs
Two Phases
Map
Reduce



HADOOP

A reliable shared storage system


Analysis system


History

Google was the first to build GFS and MapReduce
Published papers describing them (GFS in 2003, MapReduce in 2004)
A brand new technology at the time
Already well proven inside Google by 2004


History

Doug Cutting
Created an open-source version of the MapReduce system, called Hadoop
Yahoo and others rallied around to support this effort
Hadoop is now a core part of the infrastructure at Facebook, Yahoo, LinkedIn, and Twitter


Core Concepts

HDFS
Map Reduce



HDFS...
Hadoop Distributed File System

Designed for streaming very large files on a commodity cluster

Very large files
From MBs up to PBs

Streaming
Write-once, read-many approach
No modification of files once written
Time to read the whole data set matters more than latency to the first record

Commodity cluster
No high-end servers
High chance of failure, but HDFS is designed to tolerate it
Data is replicated across machines
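
As a rough worked example of what replication costs in storage, the arithmetic below estimates how many blocks a file occupies and how much raw disk it consumes. The slides do not give numbers, so the 128 MB block size and replication factor of 3 are assumptions (they match the defaults in recent Hadoop releases), and `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def hdfs_footprint(file_size: int, block_size: int, replication: int) -> tuple:
    """Return (number of blocks, raw bytes stored across the cluster).

    The final block holds only the remaining bytes, so raw usage is
    simply file size times the replication factor.
    """
    num_blocks = -(-file_size // block_size)  # integer ceiling division
    raw_bytes = file_size * replication
    return num_blocks, raw_bytes


# Assumed values: a 1 GB file, 128 MB blocks, 3 replicas.
num_blocks, raw_bytes = hdfs_footprint(1 * GB, 128 * MB, 3)
```

Under these assumptions a 1 GB file is stored as 8 blocks and takes 3 GB of raw disk across the cluster.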


HDFS
Hadoop Distributed File System...

Services
Masters: Name Node, Secondary Name Node, Job Tracker
Slaves: Data Node, Task Tracker


HDFS
Hadoop Distributed File System...

Name Node
Master node
Maintains the file system namespace and its metadata
Secondary Name Node
Periodically updates the fsimage file by merging the edit log into it
Data Node
Slave nodes
Hold the actual data
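
The Secondary Name Node's periodic update of the fsimage file can be pictured as a checkpoint: the logged namespace changes are replayed onto the last saved image, and the log starts over. The sketch below is a toy model only. The dictionary image, the tuple-based edit records, and the names `apply_edit` and `checkpoint` are assumptions for illustration and do not reflect HDFS's actual file formats or API.

```python
def apply_edit(namespace: dict, edit: tuple) -> None:
    """Apply one logged operation to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = list(edit[2])  # block IDs that make up the file
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage: dict, edit_log: list) -> tuple:
    """Return a new image with every logged edit applied, plus an empty log."""
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


# Hypothetical starting state: one file, then two logged changes.
fsimage = {"/logs/a.txt": ["blk_1", "blk_2"]}
edit_log = [
    ("create", "/logs/b.txt", ["blk_3"]),
    ("delete", "/logs/a.txt"),
]
new_fsimage, new_edit_log = checkpoint(fsimage, edit_log)
```

Note that the image holds only metadata (which blocks make up which file); the block contents themselves live on the Data Nodes.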


[Figure: HDFS Architecture]



Map Reduce

Large scale data processing in parallel.


It provides
Automatic parallelization and distribution
Fault-tolerance

Two Phases in Map Reduce


Map
Reduce


Map Reduce

Job Tracker
Master
Manages the jobs in the cluster
Task Tracker
Slaves
Run the individual map and reduce tasks
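
The master/slave split, together with the fault tolerance mentioned earlier, can be sketched as a master loop that hands tasks to workers and retries a task elsewhere when a worker fails. This is a simplified, sequential illustration; `run_job`, `flaky_run_task`, the tracker names, and the retry policy are all assumptions and not the real Job Tracker's scheduling logic.

```python
def run_job(tasks: list, trackers: list, run_task, max_attempts: int = 3) -> dict:
    """Assign each task to a tracker, moving to the next tracker on failure."""
    results = {}
    for index, task in enumerate(tasks):
        for attempt in range(max_attempts):
            tracker = trackers[(index + attempt) % len(trackers)]
            try:
                results[task] = run_task(tracker, task)
                break  # task succeeded, go on to the next one
            except RuntimeError:
                continue  # this tracker failed, try another
        else:
            raise RuntimeError(f"task {task!r} failed after {max_attempts} attempts")
    return results


def flaky_run_task(tracker: str, task: str) -> str:
    """Hypothetical worker call: tracker-2 is down, the others succeed."""
    if tracker == "tracker-2":
        raise RuntimeError("tracker-2 is down")
    return f"{task} done on {tracker}"


results = run_job(
    ["split-0", "split-1", "split-2"],
    ["tracker-1", "tracker-2", "tracker-3"],
    flaky_run_task,
)
```

Here the task first sent to the failed tracker is simply rerun on the next one, so the job as a whole still completes.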


Map Reduce

Map Phase
map(inKey, inValue) -> list(outKey, intermediateValue)
Processes an input key/value pair
Produces a set of intermediate pairs
Reduce Phase
reduce(outKey, list(intermediateValue)) -> list(outValue)
Combines all intermediate values for a particular key
Produces a set of merged output values (usually just one)
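
The two signatures above can be tried out on the classic word-count problem. The sketch below runs everything in a single process, so it shows only the programming model, not Hadoop's distributed execution. The `run_mapreduce` driver and its grouping step (which stands in for the framework collecting intermediate values by key between the two phases) are assumptions written for this example, as is the sample input.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map(inKey, inValue) -> list of (outKey, intermediateValue)."""
    return [(word.lower(), 1) for word in in_value.split()]


def reduce_fn(out_key, intermediate_values):
    """reduce(outKey, list(intermediateValue)) -> list of outValue."""
    return [sum(intermediate_values)]


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map on every record, group values by key, then reduce each group."""
    grouped = defaultdict(list)
    for in_key, in_value in records:
        for out_key, value in map_fn(in_key, in_value):
            grouped[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


# Input records are (line number, line text) pairs.
records = [(0, "big data needs big tools"), (1, "hadoop handles big data")]
counts = run_mapreduce(records, map_fn, reduce_fn)
```

Each reduce call returns a one-element list, matching the note that the merged output is usually just one value per key.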




Happy Hadooping.... :)
