
Practical No. 11: Text Analytics

Guru Gobind Singh Foundation’s


Guru Gobind Singh College of Engineering and
Research Centre, Nashik

Experiment No: 11

Title of Experiment: Word Count application


Write a program in Java for a simple Word Count application that counts the number of
occurrences of each word in a given input set, using the Hadoop MapReduce framework on a
local standalone setup.

Student Name:

Class: SE (Computer)

Div: Batch:

Roll No.:

Date of Attendance
(Performance):

Date of Evaluation:

Marks (Grade)
Attainment of CO
Marks out of 10

CO Mapped CO6: Use cutting edge tools and technologies to analyze Big Data

Signature of
Subject Teacher

TITLE: Word Count application

AIM: Write a program in Java for a simple Word Count application that counts the number of
occurrences of each word in a given input set, using the Hadoop MapReduce framework on a
local standalone setup.

OBJECTIVES: Based on the above aim, the objectives are as follows:

1. To help individuals and organizations make sense of data.

2. To develop an understanding of analyzing raw data for insights and trends.

THEORY
Important Links:
https://www.javatpoint.com/mapreduce-word-count-example

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

MapReduce Word Count Example

In the MapReduce word count example, we find the frequency of each word. Here, the role
of the Mapper is to turn the input text into intermediate key-value pairs (one pair per word),
and the role of the Reducer is to aggregate the values that share a common key. So,
everything is represented in the form of a key-value pair.
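
The mapper, reducer, and driver for this application can be written along the lines of the
canonical WordCount program from the Apache MapReduce tutorial linked above. The
following listing is a sketch following that tutorial (class and variable names are taken from
it; any plain text file can serve as input):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for every word in an input line, emit the intermediate pair (word, 1).
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the 1s collected for each word to get its total count.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Driver: configure the job and submit it to the framework.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input file or directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}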

What is MapReduce?
MapReduce is a processing technique and a programming model for distributed
computing based on Java. The MapReduce algorithm contains two important tasks,
namely Map and Reduce. Map takes a set of data and converts it into another set of
data, where individual elements are broken down into tuples (key-value pairs).
The Reduce task then takes the output from a map as input and combines those data
tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task
is always performed after the map job.

Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte data-sets) in parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts
the outputs of the maps, which are then input to the reduce tasks. Typically both the
input and the output of the job are stored in a file-system. The framework takes care
of scheduling tasks, monitoring them, and re-executing the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System (see HDFS
Architecture Guide) are running on the same set of nodes. This configuration allows
the framework to effectively schedule tasks on the nodes where data is already
present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master ResourceManager, one
worker NodeManager per cluster-node, and an MRAppMaster per application.

Minimally, applications specify the input/output locations and supply map and reduce
functions via implementations of appropriate interfaces and/or abstract-classes.
These, and other job parameters, comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to
the ResourceManager which then assumes the responsibility of distributing the
software/configuration to the workers, scheduling tasks and monitoring them,
providing status and diagnostic information to the job-client.
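
On a local standalone setup, the tutorial linked above compiles and submits the
WordCount job from the command line roughly as follows (the jar name and the
input/output paths here are illustrative, and HADOOP_CLASSPATH is assumed to
include the JDK's tools jar, as the tutorial describes):

  bin/hadoop com.sun.tools.javac.Main WordCount.java
  jar cf wc.jar WordCount*.class
  bin/hadoop jar wc.jar WordCount input output

The output directory must not exist before the run; the word counts appear in part
files such as output/part-r-00000.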

Although the Hadoop framework is implemented in Java™, MapReduce applications
need not be written in Java.

The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing
primitives are called mappers and reducers. Decomposing a data processing
application into mappers and reducers is sometimes nontrivial. But, once we write
an application in the MapReduce form, scaling the application to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.

The Algorithm
● Generally, the MapReduce paradigm is based on sending the computation to
where the data resides.

● A MapReduce program executes in three stages, namely map stage, shuffle
stage, and reduce stage (traced in the worked example after this list).
○ Map stage − The map or mapper’s job is to process the input data.
Generally the input data is in the form of a file or directory and is stored
in the Hadoop file system (HDFS). The input file is passed to the
mapper function line by line. The mapper processes the data and
creates several small chunks of data.
○ Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes
from the mapper. After processing, it produces a new set of output,
which will be stored in the HDFS.
● During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
● The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
● Most of the computing takes place on nodes with data on local disks, which
reduces the network traffic.
● After completion of the given tasks, the cluster collects and reduces the data
to form an appropriate result, and sends it back to the Hadoop server.
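
As a quick trace of these stages, consider a hypothetical two-line input (the lines
and words below are purely illustrative):

Input:          Deer Bear River
                Car Car River
Map output:     (Deer, 1) (Bear, 1) (River, 1) (Car, 1) (Car, 1) (River, 1)
After shuffle:  Bear → [1]   Car → [1, 1]   Deer → [1]   River → [1, 1]
Reduce output:  (Bear, 1) (Car, 2) (Deer, 1) (River, 2)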

Inputs and Outputs (Java Perspective)


The MapReduce framework operates on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.

The key and value classes have to be serializable by the framework and hence need
to implement the Writable interface. Additionally, the key classes have to implement
the WritableComparable interface to facilitate sorting by the framework.

Input and Output types of a MapReduce job − (Input) <k1, v1> → map → <k2, v2>
→ reduce → <k3, v3> (Output).
         Input             Output
Map      <k1, v1>          list(<k2, v2>)
Reduce   <k2, list(v2)>    list(<k3, v3>)
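
Tying this table to the WordCount listing earlier (on the assumption that the default
text input format is used): k1 is the position of a line in the input file (declared simply
as Object in the mapper), v1 is the line of text (Text), k2 is a single word (Text), v2 is
the count 1 (IntWritable), and k3/v3 are the word and its summed count (Text and
IntWritable).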

Terminology
● PayLoad − Applications implement the Map and the Reduce functions, and
form the core of the job.
● Mapper − Mapper maps the input key/value pairs to a set of intermediate
key/value pairs.
● NamedNode − Node that manages the Hadoop Distributed File System
(HDFS).
● DataNode − Node where data is presented in advance before any processing
takes place.
● MasterNode − Node where JobTracker runs and which accepts job requests
from clients.
● SlaveNode − Node where the Map and Reduce programs run.
● JobTracker − Schedules jobs and tracks the assigned jobs to the Task Tracker.
● Task Tracker − Tracks the task and reports status to the JobTracker.
● Job − An execution of a Mapper and a Reducer across a dataset.
● Task − An execution of a Mapper or a Reducer on a slice of data.
● Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.

Note that JobTracker and TaskTracker belong to the classic (Hadoop 1.x) MapReduce
architecture; in Hadoop 2.x and later their roles are taken over by the
ResourceManager, NodeManager, and MRAppMaster described earlier.

CONCLUSIONS
The program implements a Word Count application that counts the number of occurrences
of each word in a given input set, using the Hadoop MapReduce framework on a local
standalone setup.
