
MAP REDUCE

Module 2
Agenda
• Understanding Map Reduce
• MapReduce – Word Count example
• MapReduce Overview
• Data Flow of Map Reduce
• YARN Map Reduce detail flow
• Concept of Mapper & Reducer
• Speculative Execution
• Hadoop Fault Tolerance
• Submission & Initialization of Map Reduce Job
• Monitoring & Progress of Map Reduce Job
Understanding Map Reduce
• Take a big dataset and divide it into smaller datasets.
• Perform the same function (map) on all the smaller sub-datasets.
• Combine the output from all the smaller sub-datasets, which are the output of the mappers.
MapReduce – Word Count
Before we jump into detail, let's walk through an example MapReduce application to get a flavour of how it works. WordCount is a simple application that counts the number of occurrences of each word in a given input dataset.
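
Below is a minimal sketch of WordCount against the classic org.apache.hadoop.mapreduce Java API (this is the standard tutorial-style driver; the class name and input/output arguments are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every word in its input split
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts emitted for each word
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }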
MapReduce Overview
• JobTracker knows everything about submitted jobs
• Divides a job into tasks and decides where to run each task
• Continuously communicates with TaskTrackers
• TaskTrackers execute tasks (multiple per node)
• Monitor the execution of each task
• Continuously send feedback to the JobTracker

Data Flow of Map Reduce
The two main phases of the job are:

• Map Phase
• Reduce Phase
Key-Value pair generation
The MapReduce framework operates exclusively on <key, value> pairs, i.e. the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

[Figure: input data is split into <k1,v1> pairs, processed by Map() tasks, sorted by key and merged into <k1,[v1,v2,v3,…]>, and passed to Reduce() tasks, which produce the output data.]
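
As a concrete instance of this flow, take WordCount on two input lines (the byte offsets and words are illustrative):

(input) <0, "the cat">, <8, "the dog"> -> map -> <"the",1>, <"cat",1>, <"the",1>, <"dog",1> -> sort/merge -> <"cat",[1]>, <"dog",[1]>, <"the",[1,1]> -> reduce -> <"cat",1>, <"dog",1>, <"the",2> (output)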
YARN Map Reduce detail flow
1. The client (MR Job Program on the Client Node) runs the job.
2. The client gets a new application ID from the Resource Manager.
3. The client copies the job resources to HDFS.
4. The client submits the application to the Resource Manager.
5. The Resource Manager asks a Node Manager to start a container (5a) and launch the Application Master in it (5b).
6. The Application Master starts the job.
7. The Application Master gets the input splits from HDFS.
8. The Application Master asks the Resource Manager to allocate resources (containers).
9. Node Managers start the allocated containers (9a) and run a Map Task or Reduce Task in each (9b).
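
From the client's point of view, steps 1-4 are what happens when a job is submitted, e.g. (jar and class names taken from the WordCount sketch above; paths are illustrative):

    [cloudera@quickstart ~]$ hadoop jar wordcount.jar WordCount /user/cloudera/data/file.txt /user/cloudera/out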
Concept of Mapper
[Figure: the mapper-side data flow]
• Input path in HDFS, e.g. /user/cloudera/data/file.txt
• InputFormat divides the input into Input Split 1, Input Split 2, … Input Split m
• RecordReader turns each split into <InputKey, InputValue> records (keys K1 … Km, each with values V1 … Vn)
• Map Task 1, Map Task 2, … Map Task m process the records
• The Collector gathers Map output 1, Map output 2, … Map output m
• A Partitioner assigns each map output record to a reducer
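
In the Java API, the input side of this pipeline is configured on the Job; a minimal sketch, assuming the `job` object from the WordCount driver above (TextInputFormat is already the default and is shown explicitly only for illustration):

    // TextInputFormat splits files into lines: the key is the byte offset
    // of the line (LongWritable) and the value is the line itself (Text).
    job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
    org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(
        job, new org.apache.hadoop.fs.Path("/user/cloudera/data/file.txt"));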


How Many Map Tasks?
The number of maps is usually driven by the total size of the input, that is, the total number of blocks of the input files. The right level of parallelism for maps seems to be around 10-100 maps per node, although it has been set up to 300 maps for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute. Thus, if you expect 10 TB of input data and have a block size of 128 MB, you'll end up with roughly 82,000 maps.

For example:
Mappers = (total data size) / (input split size)
• Block size = 128 MB, input split size = 128 MB, file size = 100 MB: the file takes only 1 block in HDFS for storage, and thus 1 input split and 1 map task.
• Data size = 1 TB, input split size = 100 MB: Mappers = (1,000 * 1,000 MB) / 100 MB = 10,000.
Concept of Reducer
[Figure: the reducer-side data flow]
• Partitioner 1, Partitioner 2, … Partitioner m emit <key, value> groups (keys K1 … Km, each with values V1 … Vn)
• Sort/Merge groups all values by key
• Reduce Task 1, Reduce Task 2, … Reduce Task m process the groups
• Output files are written to HDFS as part-r-00001, part-r-00002, … part-r-0000m
• Output path in HDFS, e.g. /user/cloudera/data/file1.txt

The right number of reducers is 0.95 or 1.75 * (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
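
In the Java API the reducer count is set explicitly; a minimal sketch applying the rule of thumb above (the node count and per-node slot value are assumptions for illustration):

    // 0.95 * (nodes * slots) lets all reduces launch in a single wave;
    // 1.75 * (nodes * slots) gives faster nodes a second wave for load balancing.
    int nodes = 10;                 // assumed cluster size
    int reduceSlotsPerNode = 2;     // assumed mapred.tasktracker.reduce.tasks.maximum
    job.setNumReduceTasks((int) (0.95 * nodes * reduceSlotsPerNode));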
Speculative Execution
1. A task is submitted on Node A to process (the original map task).
2. The job scheduler finds out from the task progress that Node A is running slow.
3. The scheduler submits a speculative (duplicate) task on Node B.
4. If Node B completes the task first, it sends the final output.
5. If Node A completes the task first, it sends the final output.
The final output is taken from whichever node finishes first (say Node B), and all the remaining speculative tasks are then abandoned.
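
Speculative execution can be toggled per job via configuration; a minimal sketch using the Hadoop 2.x property names (older releases used mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution):

    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", true);     // duplicate slow map tasks
    conf.setBoolean("mapreduce.reduce.speculative", true);  // duplicate slow reduce tasks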
Hadoop Fault Tolerance
Intermediate data between mappers and reducers is materialized (written to disk) to enable simple & straightforward fault tolerance.

What if a task fails (map or reduce)?


• TaskTracker detects the failure
• Sends message to the JobTracker
• JobTracker re-schedules the task

What if a DataNode fails?


• Both NameNode and JobTracker detect the failure
• All tasks on the failed node are re-scheduled
• NameNode replicates the user’s data to another node

What if a NameNode or JobTracker fails?


• The entire cluster is down
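
As a sketch of how task re-scheduling is bounded (Hadoop 2.x property names; 4 is the default value):

    // A failed map or reduce task is retried up to this many attempts
    // before the whole job is marked as failed.
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);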
Submission & Initialization of Map Reduce Job
Submit a Pig job as –
[cloudera@quickstart ~]$ pig word_count_pig.pig

Track the progress of the Pig job on YARN as –

[cloudera@quickstart ~]$ yarn application -list
17/02/25 06:38:43 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id:   application_1488033196887_0001
Application-Name: PigLatin:word_count_pig.pig
Application-Type: MAPREDUCE
User:             cloudera
Queue:            root.cloudera
State:            ACCEPTED
Final-State:      UNDEFINED
Progress:         0%
Tracking-URL:     N/A
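
A single application can also be queried directly with the standard yarn CLI:

[cloudera@quickstart ~]$ yarn application -status application_1488033196887_0001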

The URL to track the job progress (served by the MapReduce Job History server) looks like –

http://quickstart.cloudera:19888/jobhistory/job/job_1488033196887_0001/
Monitoring & Progress of Map Reduce Job
THANK YOU
