What is MapReduce
How MapReduce Works
• MapReduce is defined over key-value pairs
• map
  • Input: input key/value
  • Output: intermediate key/value
• reduce
  • Input: intermediate key/{values} (one key with all of its intermediate values)
  • Output: output key/value
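The input and output shapes of map and reduce can be written down as type signatures. The following is a minimal sketch in Python, not the API of any particular framework: the type variable names and the two example functions (`parse_reading`, `max_reading`) are illustrative assumptions, as is the "city,temperature" record format.

```python
from typing import Callable, Iterable, Tuple, TypeVar

K1 = TypeVar("K1")  # input key
V1 = TypeVar("V1")  # input value
K2 = TypeVar("K2")  # intermediate (and output) key
V2 = TypeVar("V2")  # intermediate value
V3 = TypeVar("V3")  # output value

# map: one input key/value -> zero or more intermediate key/value pairs
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce: one intermediate key with all of its values -> one output key/value
Reducer = Callable[[K2, Iterable[V2]], Tuple[K2, V3]]


def parse_reading(offset: int, record: str) -> Iterable[Tuple[str, int]]:
    """Hypothetical mapper: turns a 'city,temperature' record into (city, temperature)."""
    city, temperature = record.split(",")
    yield city, int(temperature)


def max_reading(city: str, temperatures: Iterable[int]) -> Tuple[str, int]:
    """Hypothetical reducer: keeps the highest temperature seen for a city."""
    return city, max(temperatures)
```

The mapper never sees more than one record, and the reducer never sees more than one key, which is what allows both to run in parallel.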
• Input splits -> the input is divided into fixed-size pieces (splits), each read as key-value pairs and handled by one map task
• Mapping -> each split is passed to the mapping function
• Shuffling -> consolidates the relevant records from the mapping output by grouping values under their key
• Reducing -> the values for each key are aggregated and combined into a single output value
Example: Word Count Problem
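A minimal single-process sketch of word count, following the four phases (input splits, mapping, shuffling, reducing). It only simulates the data flow in memory; the function names, the one-line-per-split choice, and the sample text are assumptions made for illustration.

```python
from collections import defaultdict


def split_input(text, lines_per_split=1):
    """Input splits: cut the input into fixed-size pieces (here, N lines each)."""
    lines = text.splitlines()
    return [lines[i:i + lines_per_split] for i in range(0, len(lines), lines_per_split)]


def map_words(split):
    """Mapping: emit an intermediate (word, 1) pair for every word in the split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Shuffling: gather all values that share a key into one list."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)


def reduce_counts(key, values):
    """Reducing: collapse the values for one key into a single output value."""
    return key, sum(values)


def word_count(text):
    splits = split_input(text)
    mapped = [map_words(s) for s in splits]  # each call is independent of the others
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Made-up sample input, one split per line.
    print(word_count("deer bear river\ncar car river\ndeer car bear"))
    # prints {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

On a real cluster the map calls would run on different nodes and the shuffle would move data across the network; the logic of each phase stays the same.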
MapReduce Extensions
Multiple stages approach
• Advantages:
  • Easier to write and maintain
  • Reusability
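The multiple-stages idea can be sketched by feeding the output of one map-reduce stage into the next. In this illustrative Python sketch, `run_stage` is a hypothetical in-memory helper (not a framework call): stage 1 counts words, and stage 2 groups the words by how often they occur.

```python
from collections import defaultdict


def run_stage(records, mapper, reducer):
    """Run one map-reduce stage in memory over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)  # shuffle: group values by key
    return [reducer(k, vs) for k, vs in groups.items()]


# Stage 1: word count
def emit_words(_line_no, line):
    for word in line.split():
        yield word, 1


def sum_values(key, values):
    return key, sum(values)


# Stage 2: group words by their count
def emit_by_count(word, count):
    yield count, word


def collect_sorted(count, words):
    return count, sorted(words)


def words_by_frequency(lines):
    counts = run_stage(enumerate(lines), emit_words, sum_values)
    # The (word, count) output of stage 1 is the input of stage 2.
    return dict(run_stage(counts, emit_by_count, collect_sorted))
```

Each stage stays small and testable on its own, and a generic reducer such as `sum_values` can be reused by other pipelines.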
Incremental MapReduce approach
Conclusion: MapReduce
• Allows computations to be parallelized over a cluster, at the cost of high latency.
• The map task reads data from an aggregate and boils it down to relevant key-value pairs. Each map call reads only a single record at a time, so map tasks can be parallelized.
• Reduce tasks take the many values that map tasks output for a single key and summarize them into a single output. Reduce tasks are parallelized by key.
• Reducers whose input and output have the same form can be combined into pipelines, which improves parallelism and reduces the amount of data to be transferred.
• Map-reduce operations can be composed into pipelines of multiple stages, where the output of one reduce becomes the input of the next map (map -> reduce -> map -> reduce ...).
• The result of a map-reduce computation can be stored as a materialized view, which can then be updated through incremental map-reduce operations that recompute only the parts affected by changed data.
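The materialized-view point can be illustrated with a small sketch. It assumes a word-count view whose reduce step is a sum, so partial results per input split can be kept and merged. The class `WordCountView`, its method names, and the `map_runs` counter are hypothetical and exist only for this illustration.

```python
from collections import Counter


class WordCountView:
    """Stored word-count result that is updated split by split."""

    def __init__(self):
        self.partials = {}     # split id -> partial counts for that split
        self.view = Counter()  # the materialized result over all splits
        self.map_runs = 0      # how many splits have been (re)processed

    def _map_and_combine(self, text):
        # Map one split and pre-aggregate its (word, 1) pairs locally.
        self.map_runs += 1
        return Counter(text.split())

    def upsert_split(self, split_id, text):
        """Add or replace one split; only this split is recomputed."""
        old = self.partials.get(split_id, Counter())
        new = self._map_and_combine(text)
        self.partials[split_id] = new
        self.view.subtract(old)  # remove the split's previous contribution
        self.view.update(new)    # add its new contribution
        self.view = +self.view   # drop words whose count fell to zero
```

A short usage example:

```python
view = WordCountView()
view.upsert_split("a", "x y x")
view.upsert_split("b", "y z")
view.upsert_split("b", "z z")  # split "a" is not read again
```

This works because summing is a reducer with the same form for input and output; a reducer without that property would need its affected keys recomputed from the stored intermediate values instead.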