
Big Data: Hadoop 2.0
Map Reduce / HDFS 2.0
@diego_pacheco
Software Architect | Agile Coach
Big Data
Hadoop - Cases
Hadoop
HDFS: Hadoop Distributed File System
4000 nodes: 14PB storage
HDFS Assumptions and Goals
Hardware Failure: Hundreds or thousands of machines; expect failures.
Streaming Data Access: Batch processing; high throughput rather than low latency.
Large Data Sets: Terabytes per file; runs on a cluster, scales to millions of files in a single instance.
Simple Coherency Model: Write-once-read-many (create, read, close, no changes) simplifies coherency and maximizes throughput; a perfect fit for Map/Reduce.
Moving Computation instead of Moving Data: Far cheaper with huge data; minimizes network traffic. HDFS moves the computation close to the data.
Software and Hardware Portability: Easily portable.
HDFS
Very large distributed FS
10k nodes, 100M files, 10PB
Works with commodity hardware
File replication
Detect and recover from failures
Optimized for batch processing
Files are split into 128 MB blocks
Blocks are replicated across N DataNodes
Data Coherency
Write Once, Read Many
Only Append to existent files
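The write-once-read-many model above can be sketched in plain Java using an append-only local file (this is an illustration of the pattern, not the actual HDFS `FileSystem` API; the class and file names are hypothetical):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustration of HDFS's coherency model on the local filesystem:
// records are only ever appended, never rewritten in place.
public class AppendOnlyLog {
    public static void append(Path log, String record) throws IOException {
        // CREATE + APPEND mirrors "write once, then only append":
        // existing bytes are never modified.
        Files.write(log, (record + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("hdfs-style", ".log");
        append(log, "event-1"); // create + first write
        append(log, "event-2"); // later writers may only append
        System.out.println(Files.readAllLines(log));
    }
}
```

Because no reader can ever observe a partially overwritten block, replicas stay trivially consistent, which is what makes the high-throughput reads possible.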

HDFS - Architecture
HDFS 2.0 - Federation
Hadoop
Map Reduce
Today: parallelism is per file
Single LARGE file, single thread: no parallelism
Map/Reduce: Unit of data
Task 0: 0..64 MB | Task 1: 64..128 MB | Task 2: 128..192 MB | Task 3: 192..256 MB
Each task processes one unit of data
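The split of one large file into per-task units can be sketched as follows (a minimal sketch using the slide's example numbers, 256 MB file and 64 MB blocks; the `InputSplits` class name is hypothetical, not a Hadoop API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a file into fixed-size input units, one map task per unit.
public class InputSplits {
    static final long MB = 1024L * 1024L;

    // Returns "start..end" ranges in MB, one entry per map task.
    public static List<String> splits(long fileBytes, long blockBytes) {
        List<String> out = new ArrayList<>();
        for (long off = 0; off < fileBytes; off += blockBytes) {
            long end = Math.min(off + blockBytes, fileBytes);
            out.add((off / MB) + ".." + (end / MB) + "mb");
        }
        return out;
    }

    public static void main(String[] args) {
        // 256 MB file, 64 MB blocks -> 4 tasks
        System.out.println(splits(256 * MB, 64 * MB));
        // [0..64mb, 64..128mb, 128..192mb, 192..256mb]
    }
}
```

Each range corresponds to one HDFS block, so each map task can be scheduled on a node that already holds that block.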
Today: the network issue
Tasks 0..3 run on Nodes 0..3, each assigned one 64 MB unit (0..64 MB, 64..128 MB, 128..192 MB, 192..256 MB); when a task's unit lives on a different node, the data must be copied over the network.
Map/Reduce: Local Read
Local Read, no need for network copy
Data is read from many disks in parallel
Map/Reduce: The Magic!
Single hard drive: reads 75 MB/second
12 hard drives per machine
12 * 75 MB/s * 4,000 nodes ≈ 3.4 TB/second
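A back-of-envelope check of that aggregate read bandwidth (the class and method names are illustrative):

```java
// Verify the slide's arithmetic: disks/node * MB/s per disk * node count.
public class ClusterThroughput {
    public static double aggregateTBps(int disksPerNode, double mbPerSecPerDisk, int nodes) {
        double mbPerSec = disksPerNode * mbPerSecPerDisk * nodes;
        return mbPerSec / 1024.0 / 1024.0; // MB/s -> TB/s (binary units)
    }

    public static void main(String[] args) {
        // 12 * 75 MB/s * 4,000 nodes = 3,600,000 MB/s
        System.out.printf("%.1f TB/s%n", aggregateTBps(12, 75.0, 4000)); // 3.4 TB/s
    }
}
```

The point of the "magic" is that this bandwidth is only reachable because every disk in the cluster reads its local blocks in parallel; a single machine could never stream data this fast.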
Map
Reduce
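The map and reduce phases can be illustrated with a word count in plain Java (a conceptual sketch, not the Hadoop `Mapper`/`Reducer` API): the map phase emits one record per word, the shuffle groups records by key, and the reduce phase sums the counts per key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Conceptual word count showing the map -> shuffle -> reduce pipeline.
public class WordCount {
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // map phase: split each line and emit individual words
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(w -> !w.isEmpty())
                // shuffle + reduce: group by word, sum occurrences
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("big data", "big cluster")));
        // big -> 2, data -> 1, cluster -> 1
    }
}
```

In real Hadoop the same three phases run distributed: map tasks execute next to the HDFS blocks, the framework shuffles intermediate pairs across the network, and reduce tasks aggregate per key.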
Obrigado!
Thank You!
