Module 2
Agenda
• Understanding Map Reduce
• MapReduce – Word Count example
• MapReduce Overview
• Data Flow of Map Reduce
• YARN Map Reduce detail flow
• Concept of Mapper & Reducer
• Speculative Execution
• Hadoop Fault Tolerance
• Submission & Initialization of Map Reduce Job
• Monitoring & Progress of Map Reduce Job
Understanding Map Reduce
MAP REDUCE
• Map Phase: take a big dataset and divide it into smaller datasets.
• Reduce Phase: combine the output from all the smaller sub-datasets, i.e. the output of the mappers.
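The word count example from the agenda illustrates both phases. Below is a minimal sketch using the Hadoop 2.x org.apache.hadoop.mapreduce API; the class names (TokenizerMapper, IntSumReducer) follow the standard Hadoop example and are assumptions here, not part of the original slides.

// Word Count: a minimal sketch of the classic example (Hadoop 2.x mapreduce API).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map Phase: each input line is split into words; emit (word, 1) for every word.
public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);          // intermediate <key, value> pair
    }
  }
}

// Reduce Phase: all counts for the same word arrive together; sum them up.
class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);           // final <word, total count> pair
  }
}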
Key-Value pair generation
The MapReduce framework operates exclusively on <key, value> pairs, i.e. the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

[Diagram: Input Data -> Map() -> Reduce() -> Output Data]

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
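As a hedged sketch of how this type chain is declared in the Hadoop 2.x Java API (reusing the word-count classes assumed above, and assuming a Job object named job): the map output types are <k2, v2>, the job output types are <k3, v3>, and a combiner must map <k2, v2> to <k2, v2>, which is why the reducer class can often double as the combiner.

// Declaring the <k2, v2> and <k3, v3> types on an org.apache.hadoop.mapreduce.Job
// (sketch only; TokenizerMapper and IntSumReducer are the classes assumed above).
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);        // combiner: <k2, v2> -> <k2, v2>
job.setReducerClass(IntSumReducer.class);
job.setMapOutputKeyClass(Text.class);             // k2
job.setMapOutputValueClass(IntWritable.class);    // v2
job.setOutputKeyClass(Text.class);                // k3
job.setOutputValueClass(IntWritable.class);       // v3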
YARN Map Reduce detail flow

[Diagram: YARN job flow. 1. Run job; 3. Copy job resources to HDFS; 5a. Start container on a Node Manager; 8. Allocate resources; 9a. Start container on a Node Manager; 9b. Run the Map Task or Reduce Task (the map output is gathered by the Collector).]
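The agenda's job submission and progress monitoring steps can be sketched from the client side as follows. The input/output paths and the 5-second poll interval are illustrative assumptions; Job.submit(), mapProgress(), reduceProgress() and isComplete() are standard methods of org.apache.hadoop.mapreduce.Job.

// Submitting a MapReduce job to YARN and polling its progress (client-side sketch).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);   // the job jar is copied to HDFS as a job resource
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // assumed path
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // assumed path

    job.submit();                               // run job: submit to the cluster
    while (!job.isComplete()) {                 // monitoring & progress of the job
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
  }
}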
Number of mappers = (total data size) / (input split size)
For example: with a block size of 128 MB and an input split size of 128 MB, a 100 MB file occupies only 1 block in HDFS, so there is 1 input split and therefore 1 Map Task.
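A minimal sketch of that arithmetic, using the sizes from the example above; the helper class is hypothetical and not part of any Hadoop API.

// Hypothetical helper illustrating mappers = ceil(total data size / input split size).
public final class MapperCount {
  static long numMappers(long fileSizeBytes, long splitSizeBytes) {
    return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;  // ceiling division
  }

  public static void main(String[] args) {
    long splitSize = 128L * 1024 * 1024;   // 128 MB input split (= HDFS block size)
    long fileSize  = 100L * 1024 * 1024;   // 100 MB file
    System.out.println(numMappers(fileSize, splitSize));  // prints 1 -> 1 Map Task
  }
}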
[Diagram: the map outputs (Map output 1 … Map output m) are partitioned by key (K1 … Km), sorted and merged, and fed to Reduce Task 1 … Reduce Task m; each reduce task writes one output file in HDFS (part-r-00000, part-r-00001, …, part-r-0000m).]

The right number of reducers is 0.95 or 1.75 * (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
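A sketch of applying that guideline when configuring the job; the node count and per-node slot value are illustrative assumptions, the job object is the one from the driver sketch above, and Job.setNumReduceTasks() is the standard Hadoop 2.x call.

// Applying the 0.95 * (nodes * max reduce tasks per node) guideline (values assumed).
int nodes = 10;                          // assumed cluster size
int maxReducePerNode = 2;                // assumed mapred.tasktracker.reduce.tasks.maximum
int reducers = (int) (0.95 * nodes * maxReducePerNode);
job.setNumReduceTasks(reducers);         // here: 19 reduce tasks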
Speculative Execution

[Diagram: 3. The scheduler submits a speculative (duplicate) task, for example a duplicate Map Task, on Node B. 4. If Node B completes its task first, it sends the final output.]
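Speculative execution can be switched on or off per job. The sketch below uses the Hadoop 2.x property names; the particular true/false values chosen are only an illustration.

// Enabling/disabling speculative execution via Hadoop 2.x configuration properties.
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.speculative", true);     // allow duplicate map attempts
conf.setBoolean("mapreduce.reduce.speculative", false); // keep reduces non-speculative
Job job = Job.getInstance(conf, "word count");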
Hadoop Fault Tolerance
Intermediate data between the mappers and reducers is materialized to disk, which keeps fault tolerance simple and straightforward: if a task fails, only that task is re-executed rather than the whole job.
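Task-level retries are governed by configuration. A sketch with the Hadoop 2.x property names; the values shown are the commonly cited defaults, stated here as an assumption.

// Failed map/reduce attempts are retried up to these limits before the job fails.
Configuration conf = new Configuration();
conf.setInt("mapreduce.map.maxattempts", 4);
conf.setInt("mapreduce.reduce.maxattempts", 4);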