
Name: Utsav Gangwani
Div: D17A
Roll No: 16

Name: Mohit Vijay Gavli
Div: D17A
Roll No: 17

Aim: To demonstrate sorting using MapReduce in Hadoop

Theory:

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. The map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output of the map as its input and combines those data tuples into a smaller set of tuples. For example, mapping the line "a b a" might emit (a, 1), (b, 1), (a, 1), which the reducer combines into (a, 2) and (b, 1). As the name MapReduce implies, the reduce task is always performed after the map task.

The major advantage of MapReduce is that it is easy to scale data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. However, once we write an application in the MapReduce form, scaling it to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.

A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.

Map stage − The map or mapper's job is to process the input data. Generally, the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data in the form of key-value pairs.

Reduce stage − This stage is the combination of the shuffle stage and the reduce stage. The shuffle transfers the mapper output to the reducers, grouping all values that share a key and sorting the keys in the process. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.

Sorting using Hadoop MapReduce

Sorting is easy in sequential programming. Sorting in MapReduce, or more generally in parallel, is not easy, because the typical divide-and-conquer approach is harder to apply here. Each individual reducer will sort its data by key, but unfortunately, this sorting is not global across all of the data. What we want is a total order sort, where, if you concatenate the output files, the records are sorted. If we just concatenate the output of a simple MapReduce job, each segment of the data will be sorted, but the whole set will not be.

This pattern has two phases: an analyze phase that determines the key ranges, and an order phase that actually sorts the data. The analyze phase is optional in some ways. You need to run it only once if the distribution of your data does not change quickly over time, because the value ranges it produces will continue to perform well. Also, in some cases, you may be able to guess the partitions yourself, especially if the data is evenly distributed.
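
Hadoop ships with a TotalOrderPartitioner and an InputSampler (in org.apache.hadoop.mapreduce.lib.partition) that implement exactly this two-phase pattern. The program described below does not use them, but a rough configuration fragment is sketched here for illustration; the sampling parameters and partition-file path are placeholders, and it assumes an input format whose keys match the map output keys (for example, a SequenceFile keyed by the sort key).

    // Order phase: partition map output by sampled key ranges so that the
    // concatenated reducer outputs are globally sorted.
    Job job = Job.getInstance(new Configuration(), "total order sort");
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
            new Path("/tmp/partitions.lst"));   // placeholder path

    // Analyze phase: sample input keys to pick the boundary values that
    // separate the reducers' key ranges.
    InputSampler.Sampler<IntWritable, IntWritable> sampler =
            new InputSampler.RandomSampler<>(0.1, 10000, 10);
    InputSampler.writePartitionFile(job, sampler);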

The program begins by importing necessary libraries and classes from Hadoop for various functionalities like
input/output handling, configuration settings, and MapReduce components.
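
A plausible set of imports for such a program is shown below; the exact list in the original code may differ slightly.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;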

The mapper is defined as an inner class named MapTask. It reads a line from the input CSV file, splits it into individual
fields using commas, and then extracts the units sold and order ID from specific positions in the array. It emits a
key-value pair where the key is the extracted units sold and the value is the order ID. This data structuring is crucial for
the upcoming sorting process.
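
A minimal sketch of such a mapper is given below. The CSV column indexes used for units sold and order ID are assumptions, since the original listing is not reproduced here.

    // Mapper: emit (units sold, order ID) so the shuffle phase sorts by units sold.
    public static class MapTask
            extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length <= 8) {
                return;  // skip blank or short lines
            }
            try {
                int unitsSold = Integer.parseInt(fields[8].trim());  // assumed column index
                int orderId   = Integer.parseInt(fields[6].trim());  // assumed column index
                context.write(new IntWritable(unitsSold), new IntWritable(orderId));
            } catch (NumberFormatException e) {
                // Skip the header row or any malformed record.
            }
        }
    }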

The reducer, defined as an inner class named ReduceTask, takes the units sold as the key and a list of corresponding
order IDs as values. It iterates through the list and emits key-value pairs where the key is the order ID and the value is
the units sold. This arrangement results in sorting the data based on units sold during the shuffle and sort phase.
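
A corresponding sketch of the reducer, again approximate rather than the original code:

    // Reducer: keys (units sold) arrive in sorted order; write (order ID, units sold).
    public static class ReduceTask
            extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
        @Override
        protected void reduce(IntWritable unitsSold, Iterable<IntWritable> orderIds,
                              Context context) throws IOException, InterruptedException {
            for (IntWritable orderId : orderIds) {
                context.write(orderId, unitsSold);
            }
        }
    }

Note that with more than one reducer, each output file is sorted only within itself; a single reducer (or a total order partitioner as described earlier) is needed for a globally sorted result.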

The main function sets the input and output paths for the MapReduce job. It initializes a Hadoop configuration and
creates a new job instance named "Sort the numbers," specifying the main class.

The program sets the output key and value data types to IntWritable. This signifies that the order IDs and units sold will
be written as integers in the output.
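
A driver along these lines would match the description; the enclosing class name SortUnitsSold and the use of command-line arguments for the input and output paths are assumptions.

    // Driver: configure and submit the job named "Sort the numbers".
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Sort the numbers");
        job.setJarByClass(SortUnitsSold.class);   // assumed enclosing class name
        job.setMapperClass(MapTask.class);
        job.setReducerClass(ReduceTask.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);  // not in the description; one reducer gives one sorted file
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }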

Commands and Output:

Create directories in HDFS

Explore and put dataset to be sorted in HDFS


Modify program to handle the custom Housing dataset

Compile and run the Java program


Execute the jar of the program on HDFS

Observe output file
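
The screenshots for these steps are not reproduced here. A representative command sequence, with placeholder directory, file, and class names, could look like the following:

    # Create directories in HDFS and upload the dataset to be sorted
    hdfs dfs -mkdir -p /user/hadoop/sort/input
    hdfs dfs -put housing.csv /user/hadoop/sort/input
    hdfs dfs -ls /user/hadoop/sort/input

    # Compile the Java program and package it into a jar
    javac -classpath "$(hadoop classpath)" -d classes SortUnitsSold.java
    jar -cvf sort.jar -C classes .

    # Run the jar against the dataset in HDFS and observe the output file
    hadoop jar sort.jar SortUnitsSold /user/hadoop/sort/input /user/hadoop/sort/output
    hdfs dfs -cat /user/hadoop/sort/output/part-r-00000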


Conclusion:
Thus, we studied the implementation and theory of sorting using Hadoop MapReduce.
