
MAP REDUCE

Module 2
Agenda
• Understanding Map Reduce
• MapReduce – Word Count example
• MapReduce Overview
• Data Flow of Map Reduce
• YARN Map Reduce detail flow
• Concept of Mapper & Reducer
• Speculative Execution
• Hadoop Fault Tolerance
• Submission & Initialization of Map Reduce Job
• Monitoring & Progress of Map Reduce Job
Understanding Map Reduce
• Take a big dataset and divide it into smaller datasets.
• Perform the same function (map) on all the smaller sub-datasets.
• Combine the output from all the smaller sub-datasets, which are the output of the mappers.
MapReduce – Word Count
Before we jump into detail, let's walk through an example MapReduce application to get a flavour of how it works. WordCount is a simple application that counts the number of occurrences of each word in a given input dataset.
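
Below is a minimal sketch of WordCount against the classic org.apache.hadoop.mapreduce Java API (this is the standard tutorial-style driver; the class name and input/output arguments are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in its input split
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts emitted for each word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }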
MapReduce Overview
• JobTracker knows everything about submitted jobs
• Divides a job into tasks and decides where to run each task
• Continuously communicates with TaskTrackers
• TaskTrackers execute tasks (multiple per node)
• Monitor the execution of each task
• Continuously send feedback to the JobTracker

Data Flow of Map Reduce
The two main phases of the job are:

• Map Phase
• Reduce Phase
Key-Value pair generation
The MapReduce framework operates exclusively on <key, value> pairs, i.e. the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

[Figure: input data is split into <k1,v1> pairs, processed by Map() tasks, sorted by key and merged into <k1,[v1,v2,v3,…]>, and passed to Reduce() tasks, which produce the output data.]
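
As a concrete instance of this flow, take WordCount on two input lines (the byte offsets and words are illustrative):

(input) <0, "the cat">, <8, "the dog"> -> map -> <"the",1>, <"cat",1>, <"the",1>, <"dog",1> -> sort/merge -> <"cat",[1]>, <"dog",[1]>, <"the",[1,1]> -> reduce -> <"cat",1>, <"dog",1>, <"the",2> (output)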
YARN Map Reduce detail flow
1. The client (MR Job Program on the Client Node) runs the job.
2. The client gets a new application ID from the Resource Manager.
3. The client copies the job resources to HDFS.
4. The client submits the application to the Resource Manager.
5. The Resource Manager asks a Node Manager to start a container (5a) and launch the Application Master in it (5b).
6. The Application Master starts the job.
7. The Application Master gets the input splits from HDFS.
8. The Application Master asks the Resource Manager to allocate resources (containers).
9. Node Managers start the allocated containers (9a) and run a Map Task or Reduce Task in each (9b).
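
From the client's point of view, steps 1-4 are what happens when a job is submitted, e.g. (jar and class names taken from the WordCount sketch above; paths are illustrative):

    [cloudera@quickstart ~]$ hadoop jar wordcount.jar WordCount /user/cloudera/data/file.txt /user/cloudera/out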
Concept of Mapper
[Figure: the mapper-side data flow]
• Input path in HDFS, e.g. /user/cloudera/data/file.txt
• InputFormat divides the input into Input Split 1, Input Split 2, … Input Split m
• RecordReader turns each split into <InputKey, InputValue> records (keys K1 … Km, each with values V1 … Vn)
• Map Task 1, Map Task 2, … Map Task m process the records
• The Collector gathers Map output 1, Map output 2, … Map output m
• A Partitioner assigns each map output record to a reducer
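
In the Java API, the input side of this pipeline is configured on the Job; a minimal sketch, assuming the `job` object from the WordCount driver above (TextInputFormat is already the default and is shown explicitly only for illustration):

    // TextInputFormat splits files into lines: the key is the byte offset
    // of the line (LongWritable) and the value is the line itself (Text).
    job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
    org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(
        job, new org.apache.hadoop.fs.Path("/user/cloudera/data/file.txt"));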


How Many Map Tasks?
The number of maps is usually driven by the total size of the input, that is, the total number of blocks of the input files. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute. Thus, if you expect 10 TB of input data and have a block size of 128 MB, you'll end up with roughly 82,000 maps.

For example:
Mappers = (total data size) / (input split size)
• Block size = 128 MB, input split size = 128 MB, file size = 100 MB: the file takes only 1 block in HDFS for storage, and thus 1 input split and 1 map task.
• Data size = 1 TB, input split size = 100 MB: Mappers = (1,000 * 1,000 MB) / 100 MB = 10,000.
Concept of Reducer
[Figure: the reducer-side data flow]
• Partitioner 1, Partitioner 2, … Partitioner m emit <key, value> groups (keys K1 … Km, each with values V1 … Vn)
• Sort/Merge groups all values by key
• Reduce Task 1, Reduce Task 2, … Reduce Task m process the groups
• Output files are written to HDFS as part-r-00001, part-r-00002, … part-r-0000m
• Output path in HDFS, e.g. /user/cloudera/data/file1.txt

The right number of reducers is 0.95 or 1.75 * (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
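
In the Java API the reducer count is set explicitly; a minimal sketch applying the rule of thumb above (the node count and per-node slot value are assumptions for illustration):

    // 0.95 * (nodes * slots) lets all reduces launch in a single wave;
    // 1.75 * (nodes * slots) gives faster nodes a second wave for load balancing.
    int nodes = 10;                 // assumed cluster size
    int reduceSlotsPerNode = 2;     // assumed mapred.tasktracker.reduce.tasks.maximum
    job.setNumReduceTasks((int) (0.95 * nodes * reduceSlotsPerNode));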
Speculative Execution
1. A task is submitted on Node A to process (the original map task).
2. The job scheduler finds out from the task progress that Node A is running slow.
3. The scheduler submits a speculative (duplicate) task on Node B.
4. If Node B completes the task first, it sends the final output.
5. If Node A completes the task first, it sends the final output.
The final output is taken from whichever node finishes first (say Node B), and all the remaining speculative tasks are then abandoned.
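
Speculative execution can be toggled per job via configuration; a minimal sketch using the Hadoop 2.x property names (older releases used mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution):

    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", true);     // duplicate slow map tasks
    conf.setBoolean("mapreduce.reduce.speculative", true);  // duplicate slow reduce tasks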
Hadoop Fault Tolerance
Intermediate data between mappers and reducers is materialized (written to disk) to enable simple & straightforward fault tolerance.

What if a task fails (map or reduce)?


• TaskTracker detects the failure
• Sends message to the JobTracker
• JobTracker re-schedules the task

What if a DataNode fails?


• Both NameNode and JobTracker detect the failure
• All tasks on the failed node are re-scheduled
• NameNode replicates the user’s data to another node

What if a NameNode or JobTracker fails?


• The entire cluster is down
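
As a sketch of how task re-scheduling is bounded (Hadoop 2.x property names; 4 is the default value):

    // A failed map or reduce task is retried up to this many attempts
    // before the whole job is marked as failed.
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);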
Submission & Initialization of Map Reduce Job
Submit a Pig job as –
[cloudera@quickstart ~]$ pig word_count_pig.pig

Track the progress of the Pig job on YARN as –

[cloudera@quickstart ~]$ yarn application -list
17/02/25 06:38:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id:   application_1488033196887_0001
Application-Name: PigLatin:word_count_pig.pig
Application-Type: MAPREDUCE
User:             cloudera
Queue:            root.cloudera
State:            ACCEPTED
Final-State:      UNDEFINED
Progress:         0%
Tracking-URL:     N/A
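
A single application can also be queried directly with the standard yarn CLI:

[cloudera@quickstart ~]$ yarn application -status application_1488033196887_0001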

The URL to track the job progress (served by the MapReduce Job History server) looks like –

http://quickstart.cloudera:19888/jobhistory/job/job_1488033196887_0001/
Monitoring & Progress of Map Reduce Job
THANK YOU
