
EXPERIMENT NO 3

AIM:

Create MapReduce code for word count and execute it on a multi-node Hadoop cluster.

THEORY:

MapReduce:
MapReduce, a programming paradigm originally introduced by Google and popularized by open-source
implementations like Apache Hadoop, stands as a fundamental model for efficiently processing and generating large
datasets. This paradigm is particularly well-suited for distributed computing environments, where the workload is
distributed across a cluster of interconnected computers. The essence of MapReduce lies in its ability to harness the
power of parallel processing to handle extensive datasets, making it a cornerstone in the realm of big data.

The MapReduce model consists of two primary phases: the Map phase and the Reduce phase, each playing a distinct
yet interconnected role in the overall data processing workflow.

Introduction:
MapReduce, a powerful paradigm for distributed computing, revolutionized the field of big data processing by introducing
a scalable and efficient framework. In this exploration, we delve into the MapReduce process, focusing on its three
fundamental phases: Map, Shuffle and Sort, and Reduce. We examine the workings of each phase and illustrate their
collective role in addressing the challenges posed by large-scale data processing.

Map Phase:
The Map Phase serves as the initial stage in the MapReduce process, where the input dataset undergoes a crucial
transformation. This phase comprises two primary components: Input Data Splitting and Mapper Function.
1.1 Input Data Splitting:
To manage vast datasets efficiently, the input dataset is divided into smaller, manageable chunks known as input splits.
The size of these splits is determined by the underlying distributed file system, such as the Hadoop Distributed File
System (HDFS), and typically matches the HDFS block size (128 MB by default in recent Hadoop versions), so a 1 GB
file yields eight splits. This division enables parallel processing, as each input split can be processed independently by a
separate mapper.

1.2 Mapper Function:
The heart of the Map Phase lies in the execution of the map function on each input split. The objective is to extract
relevant information from the data and emit key-value pairs based on the processed information. Taking the classic
example of word count, the map function tokenizes the input text into words and emits key-value pairs with the word as
the key and a count of 1 as the value.
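
As a concrete reference, below is a minimal sketch of such a mapper written against the Hadoop MapReduce Java API.
It follows the standard word-count pattern from the Hadoop documentation; the class name WordCountMapper is
illustrative.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every token in each input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line on whitespace and emit each word with a count of 1.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}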

1.3 Intermediate Key-Value Pairs:
The output of the map phase consists of a large number of intermediate key-value pairs. These pairs are not the final
results; rather, they represent an intermediate processing stage, acting as a bridge between the initial input data and the
final output, and they play a crucial role in the subsequent phases of the MapReduce process.

Shuffle and Sort:
Following the Map Phase, the Shuffle and Sort phase comes into play, focusing on optimizing data movement and
preparing the ground for efficient reduction.
2.1 Shuffling:
Shuffling involves the transfer of intermediate key-value pairs generated by the map phase across the network to the
reducers. The primary goal is to bring together all values associated with the same key, irrespective of the mapper that
produced them. This phase optimizes data movement and sets the stage for streamlined reduction.

2.2 Sorting:
Once the intermediate key-value pairs are shuffled, they undergo a sorting process based on their keys. Sorting is critical
as it allows the reducers to easily identify groups of key-value pairs with the same key. This sorted order ensures that all
values for a particular key are contiguous, simplifying the subsequent reduction phase.
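
To make this concrete, suppose one input line is "the cat sat on the mat". The mappers emit the intermediate pairs
(the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1); after shuffling and sorting, the reducers receive the grouped,
key-ordered pairs (cat, [1]), (mat, [1]), (on, [1]), (sat, [1]), (the, [1, 1]).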

Reduce Phase:
The Reduce Phase is the final stage of the MapReduce process, where the grouped and sorted key-value pairs are
processed by individual reducers to generate the ultimate output.
3.1 Reducer Function:
In the Reduce Phase, each group of key-value pairs with the same key is processed by a dedicated reducer. The primary
objective of the reduce function is to aggregate or combine the values associated with a specific key. Taking the word
count example, the reduce function sums up the counts for each word, yielding the final word frequency.
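
A matching sketch of the reduce side, again using the Hadoop Java API (the class name WordCountReducer is
illustrative):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each word and emits (word, total).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}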

3.2 Output:
The output of the reduce phase is the culmination of the MapReduce job. It typically consists of key-value pairs, where
the key serves as a unique identifier, and the value represents the aggregated or processed information associated with
that key. This final output often finds its place in a distributed file system or other storage mediums for further analysis or
use.

Overall Flow:
The seamless integration of the Map Phase, Shuffle and Sort Phase, and Reduce Phase defines the overall flow of the
MapReduce framework. This flow is designed to handle large-scale data processing by distributing the workload across a
cluster of machines, allowing for parallelization and efficient computation. The orchestrated execution of these phases
ensures the optimal utilization of resources and the timely processing of massive datasets.
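
Tying the phases together, a minimal driver sketch for the word-count job might look like the following. It assumes the
WordCountMapper and WordCountReducer classes sketched above and takes the HDFS input and output paths as
command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Wires the mapper and reducer into a single MapReduce job.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        // The reducer can also serve as a combiner because summing counts is associative.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}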

Strengths of MapReduce:
MapReduce's strength lies in its ability to address the challenges posed by large-scale data processing. By distributing
the workload across a cluster of machines, it leverages parallelization to enhance computational efficiency. This
scalability is particularly advantageous in the context of big data, where traditional processing methods may falter. The
framework's compatibility with distributed file systems, such as HDFS, further enhances its utility, enabling seamless
handling of extensive datasets.

Word Count in MapReduce:
To illustrate the principles of MapReduce, the classic word count problem serves as an exemplary case. This problem
involves determining the frequency of each word in a given dataset. The map function, in this context, plays a pivotal role
by tokenizing the input text into words and emitting key-value pairs with the word as the key and a count of 1 as the
value. The subsequent reduce function consolidates these counts, providing the final word frequency.
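
Continuing the earlier example, running the job on a file containing the single line "the cat sat on the mat" would
produce an output file (typically named part-r-00000) along the lines of:

cat	1
mat	1
on	1
sat	1
the	2

Each line holds a word and its tab-separated total count, sorted by key.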

Overall, the MapReduce framework stands as a cornerstone in the realm of large-scale data processing. Its three crucial
phases – Map, Shuffle and Sort, and Reduce – work in tandem to efficiently process vast datasets, offering scalability
and parallelization. The detailed exploration of each phase, coupled with an emphasis on the word count example,
provides a comprehensive understanding of MapReduce's functioning and its significance in the world of big data. As
technology continues to advance, MapReduce remains a key player, contributing to the evolution of data processing and
analysis methodologies.

OUTPUT:
The experiment proceeds through the following steps (a representative command sketch follows the list):
1. Start the Hadoop multi-node cluster.
2. Save the MapReduce code in a file.
3. Create an input directory for the Hadoop MapReduce program and the input file.
4. Set the Hadoop classpath.
5. Make a folder on HDFS to store the input and output.
6. Put the input file on HDFS.
7. Compile the code using the Hadoop classpath.
8. Create a jar of the generated class files.
9. Run the wc jar.
10. Check the output.
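
A hedged sketch of these steps as shell commands. The file names (WordCount.java, input.txt), HDFS paths
(/wordcount/...), and jar name (wc.jar) are illustrative assumptions, and the commands assume a working Hadoop
2.x/3.x installation:

# 1. Start HDFS and YARN across the cluster (run on the master node)
start-dfs.sh
start-yarn.sh

# 2-3. Save the MapReduce code and prepare a local input file
mkdir -p ~/wordcount && cd ~/wordcount
# (place WordCount.java and input.txt here)

# 4. Set the Hadoop classpath for compilation
export HADOOP_CLASSPATH=$(hadoop classpath)

# 5. Make a folder on HDFS for the input (the job creates the output folder itself)
hdfs dfs -mkdir -p /wordcount/input

# 6. Put the input file on HDFS
hdfs dfs -put input.txt /wordcount/input

# 7. Compile the code against the Hadoop classpath
mkdir -p classes
javac -classpath "$HADOOP_CLASSPATH" -d classes WordCount.java

# 8. Create a jar of the generated class files
jar -cvf wc.jar -C classes/ .

# 9. Run the wc jar with the input and output paths
hadoop jar wc.jar WordCount /wordcount/input /wordcount/output

# 10. Check the output
hdfs dfs -cat /wordcount/output/part-r-00000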

CONCLUSION:

In conclusion, implementing the Word Count experiment with the MapReduce paradigm on a multi-node cluster
demonstrated the power and scalability of distributed computing for substantial data processing challenges. Through
the Map and Reduce phases, the experiment efficiently derived word frequency counts from a given text file. By
distributing the workload across a cluster of interconnected machines, the MapReduce framework handled the
intricacies of parallel processing, improving computational efficiency and reducing processing times. The experiment
underscores the versatility of MapReduce and highlights its role in harnessing the collective computational power of
multiple nodes for large-scale data analytics. The success of the Word Count experiment attests to the effectiveness
of MapReduce in addressing the complexities of big data, making it a cornerstone of distributed computing and data
processing.
