DIVYA PANTA
21109
MapReduce Theory
Map and Reduce functions consume input and produce output
– Input and output can range from text to complex data
structures
– Specified via the Job's configuration
– Relatively easy to implement your own
• Generally we can treat the flow as
map: (K1, V1) → list(K2, V2); reduce: (K2, list(V2)) → list(K3, V3)
– Reduce input types are the same as map output types
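The type flow above can be simulated in a single process without Hadoop. The sketch below is pure Python, not Hadoop's API; `run_mapreduce`, `word_map`, and `word_reduce` are illustrative names:

```python
from itertools import groupby
from operator import itemgetter

# Single-process sketch of the MapReduce type flow:
#   map:    (K1, V1)       -> list(K2, V2)
#   reduce: (K2, list(V2)) -> list(K3, V3)

def run_mapreduce(records, map_fn, reduce_fn):
    # Apply the map function to every (K1, V1) input record.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))
    # Sort by key so equal keys become adjacent; reduce input types
    # are therefore the same as map output types.
    intermediate.sort(key=itemgetter(0))
    # Feed each (K2, list(V2)) group to the reduce function.
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k2, [v for _, v in group]))
    return output

def word_map(offset, line):
    # (K1 = byte offset, V1 = line) -> list of (word, 1)
    return [(word, 1) for word in line.split()]

def word_reduce(word, counts):
    # (K2 = word, list(V2) = counts) -> [(word, total)]
    return [(word, sum(counts))]

result = run_mapreduce([(0, "Dear Bear River"), (16, "Car Car River")],
                       word_map, word_reduce)
# -> [('Bear', 1), ('Car', 2), ('Dear', 1), ('River', 2)]
```

The sort step stands in for the framework's shuffle; it is what makes each key's values arrive at the reduce function as one list.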
Map Reduce Flow of Data
[Figure: flow of data on Node #1: the input data split feeds the Mapper (Map Task), its map output feeds the Reduce Task, which produces the reduce output]
Key and Value Types
Keys must implement the WritableComparable interface
– Extends Writable and java.lang.Comparable
– Required because keys are sorted prior to the reduce phase
• Hadoop ships with many default implementations
of WritableComparable
– Wrappers for Java primitives and String (e.g. IntWritable, Text)
– Or you can implement your own
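The requirement is that keys define a total ordering so the framework can sort them before the reduce phase. A small pure-Python stand-in illustrates the idea; the `TextKey` class is hypothetical and mimics only the Comparable half of WritableComparable (real Hadoop keys also implement Writable serialization):

```python
from functools import total_ordering

# Hypothetical stand-in for a sortable key: keys must define a total
# ordering because the framework sorts them before the reduce phase.
@total_ordering
class TextKey:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        # Plays the role of compareTo() in the Java interface.
        return self.value < other.value

keys = [TextKey("River"), TextKey("Bear"), TextKey("Car")]
ordered = [k.value for k in sorted(keys)]
# -> ['Bear', 'Car', 'River']
```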
Inputs and Outputs
The MapReduce framework operates on <key, value> pairs,
that is, the framework views the input to the job as a set of
<key, value> pairs and produces a set of <key, value> pairs
as the output of the job, conceivably of different types.
The key and value classes must be serializable by the
framework and hence need to implement the
Writable interface.
Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job − (Input) <k1,
v1> → map → <k2, v2> → reduce → <k3, v3>(Output).
Example
Cont...
First, we divide the input into three splits, as shown in the figure. This
distributes the work among all the map nodes.
Then, we tokenize the words in each mapper and give a
hardcoded value (1) to each token or word. The rationale
behind the hardcoded value of 1 is that every word, in itself,
occurs once.
Now, a list of key-value pairs is created where the key is the
individual word and the value is one. So, for the first line (Dear
Bear River) we have 3 key-value pairs: (Dear, 1), (Bear, 1), (River, 1).
The mapping process is the same on all the nodes.
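The tokenize-and-emit step described above can be sketched as a plain function (`map_line` is an illustrative helper, not Hadoop's Mapper class):

```python
# Tokenize a line and emit (word, 1) for every token, since each
# word occurrence counts as one.
def map_line(line):
    return [(word, 1) for word in line.split()]

pairs = map_line("Dear Bear River")
# -> [('Dear', 1), ('Bear', 1), ('River', 1)]
```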
Cont....
After the map phase, a partitioning process takes place, along with
shuffling and sorting, so that all the tuples with the same key are
sent to the same reducer.
So, after the sorting and shuffling phase, each reducer receives a
unique key and the list of values corresponding to that key, for
example (Bear, [1,1]), (Car, [1,1,1]), etc.
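The shuffle-and-sort step can be simulated as a grouping function (an illustrative sketch, not Hadoop's partitioner):

```python
from collections import defaultdict

# Group every (word, 1) pair by key, so each reducer would receive
# one key together with the full list of its values.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Return the groups with sorted keys, as before the reduce phase.
    return dict(sorted(groups.items()))

grouped = shuffle([("Bear", 1), ("River", 1), ("Bear", 1), ("Car", 1)])
# -> {'Bear': [1, 1], 'Car': [1], 'River': [1]}
```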
Cont....
Now, each reducer counts the values present in its list of
values. As shown in the figure, the reducer gets the
list [1,1] for the key Bear; it counts the number of ones in
that list and gives the final output (Bear, 2).
Finally, all the output key/value pairs are collected
and written to the output file.
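The reduce step above amounts to summing each key's list of ones (an illustrative sketch of the counting logic, not Hadoop's Reducer class):

```python
# For each key, count the ones in its value list and emit (word, total)
# as the final output pair.
def reduce_counts(grouped):
    return [(word, sum(values)) for word, values in grouped.items()]

output = reduce_counts({"Bear": [1, 1], "Car": [1, 1, 1]})
# -> [('Bear', 2), ('Car', 3)]
```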
Thank you