MAP REDUCE
MapReduce is a framework for writing applications that process huge amounts of data
in parallel, on large clusters of commodity hardware, in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing; its
best-known open-source implementation, Hadoop MapReduce, is written in Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. The map task takes a
set of data and converts it into another set of data, where individual elements are broken down into
tuples (key/value pairs). The reduce task takes the output from a map as its input and combines
those data tuples into a smaller set of tuples. As the order of the words in the name MapReduce
implies, the reduce task is always performed after the map task.
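To make the two tasks concrete, here is a minimal single-process sketch of a word count. It is only an illustration, not a cluster implementation: the names map_words, reduce_counts and run_mapreduce are hypothetical, and the grouping step between map and reduce is done in memory with a dictionary.

```python
from collections import defaultdict

def map_words(doc_id, text):
    """Map: break one input record into (key, value) tuples."""
    return [(word.lower(), 1) for word in text.split()]

def reduce_counts(word, counts):
    """Reduce: combine all values for one key into a smaller result."""
    return (word, sum(counts))

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Group the intermediate values by key (done in memory here).
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: starts only after all map output is available.
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

# Hypothetical input: (document id, document text) pairs.
records = [("doc1", "the quick brown fox"),
           ("doc2", "the lazy dog and the fox")]
word_counts = run_mapreduce(records, map_words, reduce_counts)
print(word_counts)
```

The map step emits one (word, 1) tuple per word, and the reduce step collapses each group of ones into a single count, so the output has one tuple per distinct word.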
MAP FUNCTION IS GIVEN AS:
map(k1, v1) -> list(k2, v2)
REDUCE FUNCTION IS GIVEN AS:
reduce(k2, list(v2)) -> list(v2)
Key features
•Supports parallel programming
•Fast
•Can handle a large amount of data
CGL MAP REDUCE:
CGL-MapReduce is another MapReduce runtime, developed by Ekanayake et al.
Unlike MapReduce, CGL-MapReduce uses streaming instead of a file system for communication. This
eliminates the overheads associated with file-based communication, because the intermediate results
from the map functions are sent directly to the reduce workers. The runtime was created primarily
for large amounts of scientific data, was tested on such data, and was compared with MapReduce. Its
architecture, shown in the following figure, consists of four components.
Fig: CGL-MapReduce
1. Map worker: A map worker is responsible for performing the map operation.
2. Reduce worker: A reduce worker is responsible for performing the reduce operation.
3. Content dissemination network: The content dissemination network handles all the communication
between the components.
4. MRDriver: The MRDriver is the master worker and controls the other workers based on the
instructions from the user program.
CGL-MapReduce differs from MapReduce mainly in avoiding the file system and using streaming
instead. A CGL-MapReduce computation proceeds through the following stages:
a. Initializing stage: The first step starts the MapReduce worker nodes and configures the
MapReduce tasks. The workers keep the configured tasks and reuse them, which is one of the
improvements of CGL-MapReduce and makes iterative MapReduce computations efficient.
b. Map stage: After the initialization step, the MRDriver starts the map computation on the
instruction of the user program, which passes the variable data to the map tasks. The MRDriver
relays this data to the workers, which invoke the configured map tasks. This also allows the results
of one iteration to be passed to the next. Finally, the outputs of the map tasks are transferred
directly to the reduce workers over the content dissemination network.
c. Reduce stage: The reduce workers are initialized by the MRDriver and receive the outputs of the
map tasks. As soon as all the map tasks are completed, the reduce workers start executing the reduce
tasks. The output of the reduce function is sent directly to the user application.
d. Combine stage: In this stage, all the results obtained in the reduce stage are combined. In a
single-pass MapReduce computation, the results are combined directly into the final output. In an
iterative computation, the combined result is used to decide whether the iteration continues.
e. Termination stage: This is the final stage. The user program gives the command for termination,
and all the workers are terminated.
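The five stages above can be sketched as a small single-process simulation. This is not the CGL-MapReduce API: the names MapWorker, ReduceWorker, combine and mr_driver are hypothetical, a queue.Queue stands in for the content dissemination network, and a one-dimensional k-means computation is assumed purely as an example of an iterative job.

```python
import queue

class MapWorker:
    def configure(self, points):
        # Initializing stage: static data is loaded once and reused.
        self.points = points

    def run(self, centroids, network):
        # Map stage: the variable data (current centroids) arrives each iteration.
        partial = {}
        for p in self.points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            total, count = partial.get(nearest, (0.0, 0))
            partial[nearest] = (total + p, count + 1)
        # Map outputs are streamed to the reduce side, not written to files.
        for cluster, value in partial.items():
            network.put((cluster, value))

class ReduceWorker:
    def run(self, network):
        # Reduce stage: called only after every map task has finished.
        sums = {}
        while not network.empty():
            cluster, (total, count) = network.get()
            t, c = sums.get(cluster, (0.0, 0))
            sums[cluster] = (t + total, c + count)
        return {cluster: t / c for cluster, (t, c) in sums.items()}

def combine(old, reduced, tolerance):
    # Combine stage: merge reduce outputs and decide whether to iterate again.
    new = [reduced.get(i, old[i]) for i in range(len(old))]
    moved = max(abs(a - b) for a, b in zip(old, new))
    return new, moved > tolerance

def mr_driver(partitions, centroids, tolerance=1e-6, max_iterations=50):
    network = queue.Queue()  # in-process stand-in for the streaming network
    map_workers = [MapWorker() for _ in partitions]
    for worker, part in zip(map_workers, partitions):
        worker.configure(part)
    reduce_worker = ReduceWorker()

    for _ in range(max_iterations):
        for worker in map_workers:
            worker.run(centroids, network)
        reduced = reduce_worker.run(network)
        centroids, keep_going = combine(centroids, reduced, tolerance)
        if not keep_going:
            break

    # Termination stage: release the workers.
    map_workers.clear()
    return centroids

# Hypothetical data: two partitions of points, two starting centroids.
partitions = [[1.0, 2.0, 9.0], [3.0, 10.0, 11.0]]
final_centroids = mr_driver(partitions, centroids=[0.0, 5.0])
print(final_centroids)
```

Each map worker is configured once with its partition of points, while the centroids are passed in again on every iteration. This mirrors the split between the configuration done in the initializing stage and the variable data passed in the map stage.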
Key features
•Uses streaming for communication
•Supports parallelization
•Iterative in nature
•Can handle a large amount of data