What is Hadoop?
The Hadoop framework consists of two main layers:
Distributed file system (HDFS)
Execution engine (MapReduce)
YARN – (Yet Another Resource Negotiator)
Satishkumar Varma, PCE
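The two layers above can be illustrated with a minimal, pure-Python sketch of the MapReduce programming model. This is not the Hadoop API; the function names are illustrative, and the shuffle step that Hadoop performs between phases is simulated in memory:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input split
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key,
    # as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts emitted for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big cluster", "big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1}
```

In real Hadoop the map and reduce functions run in parallel on the nodes holding the data blocks, and the shuffle moves intermediate pairs over the network.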
Hadoop Components
Hadoop Distributed File System:
HDFS is designed to run on commodity machines with low-cost hardware
Data is distributed across and stored in the HDFS file system
HDFS is highly fault tolerant
HDFS provides high-throughput access to applications that require big data
It is a Java-based, scalable system that stores data across multiple machines without prior organization
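How HDFS divides a file across machines can be sketched with a small calculation. This is an illustrative sketch, not HDFS code; the 128 MB block size is a common default (64 MB in older versions):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default

def num_blocks(file_size_bytes):
    # A file is stored as ceil(size / block_size) blocks;
    # the final block may be smaller than the block size.
    return math.ceil(file_size_bytes / BLOCK_SIZE)

one_gb = 1024 * 1024 * 1024
print(num_blocks(one_gb))  # a 1 GB file becomes 8 blocks of 128 MB
```

Each of those blocks can then live on a different datanode, which is what lets map tasks run on many machines at once.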
TaskTracker
The TaskTracker runs on the datanodes
Its responsibility is to run the map or reduce tasks assigned by the JobTracker and to report the status of those tasks back to the JobTracker
How does it work: Hadoop Architecture?
Hadoop distributed file system (HDFS)
MapReduce execution engine
Centralized namenode
- Maintains metadata info about files
File F is split into blocks (64/128 MB each)
Replication
Each data block is replicated many times (default is 3)
Failure
Failure is the norm rather than exception
Fault Tolerance
Detection of faults and quick, automatic recovery from them is a
core architectural goal of HDFS
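The replication and fault-tolerance points above can be sketched with a small simulation. This is illustrative only, assuming the default replication factor of 3 and ignoring HDFS's rack-aware placement:

```python
import random

def place_replicas(block_ids, nodes, replication=3):
    # Place each block on `replication` distinct datanodes
    # (rack awareness is omitted in this sketch)
    return {b: random.sample(nodes, replication) for b in block_ids}

def survives(placement, failed_nodes):
    # Data stays available iff every block keeps at least one live replica
    return all(any(n not in failed_nodes for n in replicas)
               for replicas in placement.values())

random.seed(0)
nodes = [f"dn{i}" for i in range(10)]
placement = place_replicas(range(100), nodes)

# With 3 replicas on distinct nodes, any 2 simultaneous node
# failures still leave every block readable
assert all(survives(placement, {a, b})
           for a in nodes for b in nodes if a != b)
```

In real HDFS the namenode detects dead datanodes via missed heartbeats and re-replicates under-replicated blocks automatically, which is the "quick, automatic recovery" the slide refers to.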
Flume
It is a distributed, real-time data collection service
It efficiently collects, aggregates, and moves large amounts of data
The New York Times used Hadoop to convert 4 TB of TIFFs into 11 million PDF articles in 24 hrs
Hadoop Applications
Facebook’s Hadoop cluster hosts 100+ PB of data (July 2012)