Agenda
MapReduce Introduction
MapReduce Tasks
WordCount Example
Splits
Execution
Scheduling
[Figure: a logical file is divided into numbered blocks, which are distributed and replicated across the nodes of a cluster]
MapReduce Explained
Hadoop computation model
– Data stored in a distributed file system spanning many inexpensive computers
– Bring function to the data
– Distribute application to the compute resources where the data is stored
Scalable to thousands of nodes and petabytes of data
MapReduce Application
1. Map Phase (break job into small parts; distribute map tasks to the cluster)
2. Shuffle (transfer interim output for final processing)
3. Reduce Phase (boil all output down to a single result set)

Reducer from the WordCount example (the mapper emits each word with a count of one via context.write(word, one);):

public static class IntSumReducer
    extends Reducer<Text,IntWritable,Text,IntWritable> {
  private IntWritable result = new IntWritable();
  public void reduce(Text key, Iterable<IntWritable> val, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
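The three phases above can be simulated in plain Java without a cluster. The sketch below (class and method names WordCountLocal/wordCount are illustrative, not part of Hadoop) groups the mapper's (word, 1) pairs by key and then sums each group, mirroring what the shuffle and the IntSumReducer do:

```java
import java.util.*;

public class WordCountLocal {
    // Local simulation of the three phases on a tiny in-memory input.
    public static Map<String, Integer> wordCount(List<String> lines) {
        // 1. Map: emit (word, 1). 2. Shuffle: group the pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        // 3. Reduce: sum the values collected for each key.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("to be or not", "to be")));
    }
}
```

On a real cluster the grouping step is the distributed shuffle; here it is just a map keyed by word.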
MapReduce Engine
Master / Slave architecture
– Single master (JobTracker) controls job execution on multiple slaves
(TaskTrackers).
JobTracker
– Accepts MapReduce jobs submitted by clients
– Pushes map and reduce tasks out to TaskTracker nodes
– Keeps the work as physically close to data as possible
– Monitors tasks and TaskTracker status
TaskTracker
– Runs map and reduce tasks
– Reports status to JobTracker
– Manages storage and transmission of intermediate output
"Map" step:
– Input is split into pieces
– Each worker node stores its result in its local file system, where a reducer is able to access it
"Reduce" step:
– Data is aggregated ("reduced" from the map steps) by worker nodes (under the control of the JobTracker)
– Results can be written to HDFS or a database
Agenda
MapReduce Introduction
MapReduce Tasks
– Map
– Shuffle
– Reduce
– Combiner
WordCount Example
Splits
Execution
Scheduling
Map Phase
[Figure: a logical input file is split into four parts; each split is processed by a map task, each map output is sorted, then copied to the reducers and merged; each reduce task writes its logical output file to the DFS]
MapReduce – The Shuffle
The output of each mapper is locally grouped together by key
One node is chosen to process data for each unique key
All of this movement (shuffle) of data is transparently orchestrated by MapReduce
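The node chosen for each unique key is, by default, determined by hashing the key. A minimal sketch of that default partitioning logic (the class name ShuffleDemo is illustrative; Hadoop's stock HashPartitioner uses the same non-negative-hash-modulo-reducers rule):

```java
public class ShuffleDemo {
    // Pick a reducer for a key: mask the sign bit so the hash is
    // non-negative, then take it modulo the number of reduce tasks.
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every mapper computes the same partition for the same key,
        // so all values for one key land on one reducer.
        System.out.println("the -> reducer " + partitionFor("the", 4));
    }
}
```

Because the function depends only on the key and the reducer count, every mapper routes a given key to the same reducer without any coordination.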
MapReduce – Reduce Phase
Reducers
– Small programs (typically) that aggregate all of the values for the key that they are responsible for
– Each reducer writes output to its own file
Agenda
MapReduce Introduction
MapReduce Tasks
WordCount Example
Splits
Execution
Scheduling
– Client: submits MapReduce jobs
– JobTracker: coordinates the job run; breaks the job down into map and reduce tasks for each node to work on in the cluster
– TaskTracker: executes the map and reduce functions
[Figure: on the tasktracker node, the TaskTracker retrieves job resources from the distributed file system, e.g. HDFS (step 8), launches a child JVM (step 9), and the Child runs the MapTask or ReduceTask (step 10)]
Fault tolerance
[Figure: a task fails (1) inside the child JVM (MapTask or ReduceTask) on the TaskTracker node; the TaskTracker reports to the JobTracker node via heartbeat]
Fault tolerance
• Task Failure
• If a child task fails, the child JVM reports to the TaskTracker before it exits. The attempt is marked as failed, freeing up the slot for another task.
• If the child task hangs, it is killed. The JobTracker reschedules the task on another machine.
• If a task continues to fail, the job is failed.
Fault tolerance
• TaskTracker Failure
• The JobTracker receives no heartbeat
• It removes the TaskTracker from the pool of TaskTrackers to schedule tasks on
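The detection logic amounts to recording the last heartbeat time per TaskTracker and expiring any tracker that has been silent for too long. A sketch under assumed names (TrackerPool, heartbeat, and expire are illustrative, not Hadoop APIs):

```java
import java.util.*;

public class TrackerPool {
    // Last heartbeat timestamp (ms) per TaskTracker name.
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final long timeoutMs;

    public TrackerPool(long timeoutMs) { this.timeoutMs = timeoutMs; }

    public void heartbeat(String tracker, long nowMs) {
        lastHeartbeat.put(tracker, nowMs);
    }

    // Drop trackers that missed the heartbeat window; return the expired ones
    // so their tasks can be rescheduled elsewhere.
    public List<String> expire(long nowMs) {
        List<String> dead = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() > timeoutMs) {
                dead.add(e.getKey());
                it.remove();
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        TrackerPool pool = new TrackerPool(600_000);   // 10-minute window
        pool.heartbeat("tt1", 0);
        pool.heartbeat("tt2", 0);
        pool.heartbeat("tt1", 500_000);                // tt2 goes silent
        System.out.println(pool.expire(700_000));
    }
}
```

Tasks that were running on an expired tracker are treated like failed tasks and rerun on healthy nodes.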
Fault tolerance
• JobTracker Failure
• Single point of failure: the job fails
Configuration
mapred-site.xml
– Configuration settings for MapReduce (prefixed with mapred.)
– These settings are global; some of the settings can be changed per map task

Parameter: jobtracker.taskScheduler
Description: Scheduler used by the JobTracker. In BigInsights this is changed to: com.ibm.biginsights.scheduler.WorkflowScheduler
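A mapred-site.xml fragment with this setting might look as follows (a sketch only; the exact property name and value depend on the Hadoop version and the BigInsights install):

```xml
<configuration>
  <property>
    <!-- Assumed full name with the mapred. prefix described above -->
    <name>mapred.jobtracker.taskScheduler</name>
    <value>com.ibm.biginsights.scheduler.WorkflowScheduler</value>
  </property>
</configuration>
```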
Agenda
MapReduce Introduction
MapReduce Tasks
WordCount Example
Splits
Execution
Scheduling
Scheduling
– FIFO scheduler (with priorities)
– Fair scheduler
[Figure: jobs J1–J7 waiting in a priority-ordered FIFO queue]
Scheduling – Fair scheduler
• Jobs are assigned to pools (1 pool per user by default)
• A user job pool is a number of slots assigned for tasks for that user
• Each pool gets the same number of task slots by default
[Figure: Mary submits many jobs; each square is a slot for a task, and her jobs queue within her own pool's slots]
Scheduling – Fair scheduler
• Multiple users can run jobs on the cluster at the same time
• Example:
• Mary, John, and Peter submit jobs that demand 80, 80, and 120 tasks respectively
• Say the cluster can allocate 60 tasks at most
• Default behavior: distribute tasks fairly among the 3 users (each gets 20)
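The 20/20/20 split in the example falls out of progressive filling: spare capacity is repeatedly divided equally among the pools that still want more slots. A sketch of that allocation idea (FairShare and fairShares are illustrative names, not the Hadoop Fair Scheduler's actual code):

```java
import java.util.Arrays;

public class FairShare {
    // Progressive filling: split remaining capacity equally among pools
    // whose demand is not yet satisfied, until capacity or demand runs out.
    public static int[] fairShares(int[] demands, int capacity) {
        int n = demands.length;
        int[] alloc = new int[n];
        int remaining = capacity;
        boolean progress = true;
        while (remaining > 0 && progress) {
            int unsatisfied = 0;
            for (int i = 0; i < n; i++)
                if (alloc[i] < demands[i]) unsatisfied++;
            if (unsatisfied == 0) break;
            int share = Math.max(1, remaining / unsatisfied);
            progress = false;
            for (int i = 0; i < n && remaining > 0; i++) {
                if (alloc[i] < demands[i]) {
                    int give = Math.min(share, Math.min(demands[i] - alloc[i], remaining));
                    alloc[i] += give;
                    remaining -= give;
                    if (give > 0) progress = true;
                }
            }
        }
        return alloc;
    }

    public static void main(String[] args) {
        // Mary, John, Peter demand 80, 80, 120 tasks; the cluster caps at 60.
        System.out.println(Arrays.toString(fairShares(new int[]{80, 80, 120}, 60)));
    }
}
```

Note that a pool demanding fewer slots than its equal share keeps only what it needs, and the surplus is redistributed to the others.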
Questions?