
A

Colloquium Report on

Map Reduce Workflows

Submitted as partial fulfillment for the award of

MASTER OF COMPUTER APPLICATIONS DEGREE



Session 2018-2019
Submitted By
JATIN PARASHAR
Roll Number
1703914906

Under the guidance of

Ms. Jyoti Chaudhary

INSTITUTE OF MANAGEMENT & RESEARCH GHAZIABAD

AFFILIATED TO
Dr. A.P.J. Abdul Kalam Technical University (APJAKTU), LUCKNOW
STUDENT'S DECLARATION

I hereby declare that the study done by me on the colloquium topic presented
in this report, entitled Map Reduce Workflows, is an authentic record of work
carried out under the supervision of Ms. JYOTI CHAUDHARY.

The matter embodied in this report has not been submitted by me for the award
of any other degree or diploma.

Date: …………….. Signature of student

Name: Jatin Parashar


Department: MCA

This is to certify that the above statement made by the candidate is correct to
the best of my knowledge.

Signature of HOD Signature of Supervisor

Name: Name:
Department: Department:
Designation:

Date:……………. Date:…………….
ACKNOWLEDGEMENT
I am heartily grateful to the members of IMR college for cooperating with and
guiding me at each and every step throughout the making of this Colloquium. I
take it as a great privilege to avail this opportunity to express my deep
gratitude to all those who helped and guided me during my training period.

I would especially like to thank and express my profound gratitude to my
Internal Guide Ms. JYOTI CHAUDHARY for providing me an opportunity to
work under her guidance and for constantly motivating me through the course
of this Colloquium.

Student Name: Jatin Parashar


Roll No. 1703914906 Signature
Table of Contents:

1. MapReduce Overview
2. What is BigData?
3. MapReduce Working
4. MapReduce-Example
5. Significance
6. MapReduce algorithm
7. Sorting
8. Searching
9. Indexing
ABSTRACT

MapReduce is a programming paradigm that runs in the background of Hadoop
to provide scalability and easy data-processing solutions. This report explains
the features of MapReduce and how it works to analyze Big Data. MapReduce
is a programming model for writing applications that can process Big Data in
parallel on multiple nodes, and it provides analytical capabilities for analyzing
huge volumes of complex data. MapReduce is designed to solve the problem of
processing large data sets on a fleet of commodity hardware. The user of
MapReduce is responsible for writing the map and reduce functions, while the
MapReduce library is responsible for executing that program in a distributed
environment.
1. Overview:

Traditional Enterprise Systems normally have a centralized server to store and
process data. The following illustration depicts a schematic view of a traditional
enterprise system. The traditional model is certainly not suitable for processing
huge volumes of scalable data, which cannot be accommodated by standard
database servers. Moreover, the centralized system creates too much of a
bottleneck while processing multiple files simultaneously.

Fig.1
Google solved this bottleneck issue using an algorithm called MapReduce.
MapReduce divides a task into small parts and assigns them to many computers.
Later, the results are collected at one place and integrated to form the result
dataset.

2. What is Big Data?

Big Data is a collection of large datasets that cannot be processed using
traditional computing techniques. For example, the volume of data that
Facebook or YouTube needs to collect and manage on a daily basis falls under
the category of Big Data. However, Big Data is not only about scale and
volume; it also involves one or more of the following aspects − Velocity,
Variety, Volume, and Complexity.
Fig.2

3. MapReduce Working

The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
 The Map task takes a set of data and converts it into another set of data,
where individual elements are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines
those data tuples (key-value pairs) into a smaller set of tuples.
The reduce task is always performed after the map job.
Let us now take a close look at each of the phases and try to understand their
significance.
Fig. 3
 Input Phase − Here we have a Record Reader that translates each record
in an input file and sends the parsed data to the mapper in the form of key-
value pairs.
 Map − Map is a user-defined function, which takes a series of key-value
pairs and processes each one of them to generate zero or more key-value
pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are
known as intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups similar
data from the map phase into identifiable sets. It takes the intermediate
keys from the mapper as input and applies a user-defined code to aggregate
the values in a small scope of one mapper. It is not a part of the main
MapReduce algorithm; it is optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step.
It downloads the grouped key-value pairs onto the local machine, where
the Reducer is running. The individual key-value pairs are sorted by key
into a larger data list. The data list groups the equivalent keys together so
that their values can be iterated easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value paired data as input
and runs a Reducer function on each one of them. Here, the data can be
aggregated, filtered, and combined in a number of ways, and it requires a
wide range of processing. Once the execution is over, it gives zero or more
key-value pairs to the final step.
 Output Phase − In the output phase, we have an output formatter that
translates the final key-value pairs from the Reducer function and writes them
onto a file using a record writer.
Fig. 4
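
The phases above are exactly what a Hadoop programmer implements as a Mapper and
a Reducer class. The following is a minimal word-count style sketch (not code from
the cited sources) using the org.apache.hadoop.mapreduce API; the class names
TokenMapper and SumReducer are illustrative assumptions.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: the Record Reader supplies (byte offset, line) pairs;
// the mapper emits zero or more intermediate (word, 1) pairs.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);               // intermediate key-value pair
        }
    }
}

// Reduce phase: after Shuffle and Sort, each key arrives with all of its values;
// the reducer aggregates them and emits the final (word, count) pair.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));   // written by the Output Phase
    }
}
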
4. MapReduce-Example

Let us take a real-world example to comprehend the power of MapReduce.


Twitter receives around 500 million tweets per day, which is nearly 3000 tweets
per second. The following illustration shows how Twitter manages its tweets
with the help of MapReduce.

Fig.5
As shown in the illustration, the MapReduce algorithm performs the
following actions −
 Tokenize − Tokenizes the tweets into maps of tokens and writes them as
key-value pairs.
 Filter − Filters unwanted words from the maps of tokens and writes the
filtered maps as key-value pairs.
 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter values
into small manageable units.
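
As an illustration of the Tokenize and Filter steps above, the following sketch shows
how a single map function might tokenize tweets, drop unwanted words, and emit
key-value pairs for counting. This is not Twitter's actual code; the class name
TweetTokenMapper and the stop-word list are assumptions made for the example.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenize + Filter: split each tweet into tokens, drop unwanted words,
// and emit (token, 1) so a combiner/reducer can do the Count and Aggregate steps.
class TweetTokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    // Illustrative stop-word list; a real job would load a fuller list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "to", "of"));

    @Override
    protected void map(LongWritable key, Text tweet, Context context)
            throws IOException, InterruptedException {
        for (String token : tweet.toString().toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                context.write(new Text(token), ONE);
            }
        }
    }
}

The Count and Aggregate Counters steps would then be performed by a combiner and
reducer that sum the emitted counts, like the SumReducer sketched earlier.
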
5. Significance

 MapReduce provides a simple way to scale your application


 Scales out to more machines, rather than scaling up
 Effortlessly scale from a single machine to thousands
 Fault tolerant & High performance
 If you can fit your use case to its paradigm, scaling is handled by the
framework
MapReduce is a processing layer that takes large data and divides it into
independent tasks. Assume we did not have the MapReduce concept in Big
Data. What would happen? It would be very difficult to complete the work,
because MapReduce is the heart of Big Data processing. In MapReduce, we
only supply the business logic; the rest is taken care of by the framework. The
user submits the work to the master, and the work is divided into parts and
assigned to the slaves.

Here in MapReduce, we take inputs as a list and the output it produces is again
a list. Because of MapReduce, Hadoop is more powerful and efficient. This was
a small overview of MapReduce; now let us look at how work is divided into
sub-work and how MapReduce operates. In this process, the total work is
divided into small divisions. Each division is processed in parallel on a cluster
of servers, giving an individual output, and finally these individual outputs are
combined to give the final output. It is scalable and can be used across many
computers.
************************************************************
6. MapReduce Algorithm
The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
The map task is done by means of Mapper Class

The reduce task is done by means of Reducer Class.

The Mapper class takes the input, tokenizes it, maps and sorts it. The output of
the Mapper class is used as input by the Reducer class, which in turn searches
for matching pairs and reduces them.

MapReduce implements various mathematical algorithms to divide a task into


small parts and assign them to multiple systems. In technical terms, MapReduce
algorithm helps in sending the Map & Reduce tasks to appropriate servers in a
cluster.
These mathematical algorithms may include the following −
Sorting

Searching

Indexing

TF-IDF

7. Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data.
MapReduce implements a sorting algorithm to automatically sort the output key-
value pairs from the mapper by their keys.
Sorting methods are implemented in the mapper class itself.

In the Shuffle and Sort phase, after tokenizing the values in the mapper class,
the Context class (a user-defined class) collects the matching valued keys as a
collection.

To collect similar key-value pairs (intermediate keys), the Mapper class takes
the help of RawComparator class to sort the key-value pairs.

The set of intermediate key-value pairs for a given Reducer is automatically


sorted by Hadoop to form key-values (K2, {V2, V2…}) before they are
presented to the Reducer.
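
Hadoop's default key ordering can also be replaced by registering a custom
comparator on the job. The snippet below is only an illustrative sketch (not from
the cited sources) of a WritableComparator that sorts Text keys in descending
order; the class name and the commented driver lines are assumptions.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// A comparator that reverses the natural ordering of Text keys.
// Hadoop applies it during the Shuffle and Sort phase, before the Reducer runs.
class ReverseTextComparator extends WritableComparator {
    protected ReverseTextComparator() {
        super(Text.class, true);   // 'true' asks Hadoop to create key instances
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return -((Text) a).compareTo((Text) b);   // negate for descending order
    }
}

// In the (hypothetical) driver:
// Job job = Job.getInstance(conf, "sorted job");
// job.setSortComparatorClass(ReverseTextComparator.class);
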

8. Searching

Searching plays an important role in MapReduce algorithm. It helps in the


combiner phase (optional) and in the Reducer phase. Let us try to understand
how Searching works with the help of an example.

Example

The following example shows how MapReduce employs Searching algorithm to


find out the details of the employee who draws the highest salary in a given
employee dataset.
Let us assume we have employee data in four different files − A, B, C, and D.
Let us also assume there are duplicate employee records in all four files because
of importing the employee data from all database tables repeatedly. See the
following illustration.

The Map phase processes each input file and provides the employee data in
key-value pairs (<k, v> : <emp name, salary>). See the following illustration.

The combiner phase (searching technique) will accept the input from the
Map phase as key-value pairs with employee name and salary. Using the searching
technique, the combiner will check all the employee salaries to find the highest-
salaried employee in each file. See the following snippet.

<k: employee name, v: salary>

Max = salary of the first employee   // treated as the max salary so far
if (v(next employee).salary > Max) {
    Max = v(next employee).salary;
} else {
    continue checking;
}
The expected result is as follows –
Reducer phase − From each file, you will find the highest-salaried
employee. To avoid redundancy, check all the <k, v> pairs and eliminate
duplicate entries, if any. The same algorithm is used between the four <k, v>
pairs, which are coming from the four input files. The final output should be as
follows −

<gopal, 50000>
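
The snippet above can be expressed as a Hadoop reducer. The following is a
minimal sketch, not the cited tutorial's code, and it assumes the mapper emits a
single constant key (for example "max") with values of the form
"employeeName,salary" so that every record reaches the same reduce call.
Registered as the combiner, the same class pre-selects the highest-paid employee
of each mapper's file before the shuffle; registered as the reducer, it yields the
global maximum such as <gopal, 50000>.

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String bestName = null;
        long bestSalary = Long.MIN_VALUE;

        for (Text value : values) {
            // each value is assumed to look like "employeeName,salary"
            String[] parts = value.toString().split(",");
            long salary = Long.parseLong(parts[1].trim());
            if (salary > bestSalary) {           // keep the running maximum
                bestSalary = salary;
                bestName = parts[0].trim();
            }
        }
        // Re-emit under the same constant key so the class also works as a combiner;
        // the final reducer output is, e.g.,  max   gopal,50000
        context.write(key, new Text(bestName + "," + bestSalary));
    }
}
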

9. Indexing
Normally indexing is used to point to a particular data and its address. It
performs batch indexing on the input files for a particular Mapper.
The indexing technique that is normally used in MapReduce is known as
inverted index. Search engines like Google and Bing use inverted indexing
technique. Let us try to understand how Indexing works with the help of a
simple example.

Example
The following text is the input for inverted indexing. Here T[0], T[1], and T[2]
are the file names and their contents are in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
After applying the Indexing algorithm, we get the following output −
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1,
2} implies the term "is" appears in the files T[0], T[1], and T[2].
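
A simple inverted index can be produced with one MapReduce job. The sketch below
is illustrative (not from the cited sources); it assumes each input document such as
T[0] is a separate HDFS file, so the document name can be read from the input split.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit (term, documentName) for every term in a document.
// Assumes TextInputFormat, so the input split can be cast to FileSplit.
class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String doc = ((FileSplit) context.getInputSplit()).getPath().getName();
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            context.write(new Text(tokens.nextToken().toLowerCase()), new Text(doc));
        }
    }
}

// Reducer: collect the distinct documents for each term,
// producing lines analogous to  "is": {T[0], T[1], T[2]}.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docs, Context context)
            throws IOException, InterruptedException {
        Set<String> unique = new HashSet<>();
        for (Text d : docs) {
            unique.add(d.toString());
        }
        context.write(term, new Text(unique.toString()));
    }
}
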

TF-IDF
TF-IDF is a text processing algorithm which is short for Term Frequency −
Inverse Document Frequency. It is one of the common web analysis algorithms.
Here, the term 'frequency' refers to the number of times a term appears in a
document.

Term Frequency (TF)

It measures how frequently a particular term occurs in a document. It is
calculated as the number of times a word appears in a document divided by the
total number of words in that document:

TF('the') = (Number of times the term 'the' appears in a document) / (Total
number of terms in the document)

Inverse Document Frequency (IDF)

It measures the importance of a term. It is calculated as the logarithm of the
number of documents in the text database divided by the number of documents
where a specific term appears.

While computing TF, all the terms are considered equally important. That
means TF counts the term frequency for normal words like “is”, “a”, “what”,
etc. Thus we need to weigh down the frequent terms while scaling up the rare
ones, by computing the following −

IDF('the') = log(Total number of documents / Number of documents with the
term 'the' in it)
The algorithm is explained below with the help of a small example.

Example
Consider a document containing 1000 words, wherein the word hive appears 50
times. The TF for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1000
of these. Then, the IDF is calculated as log (10,000,000 / 1,000) = 4.
The TF-IDF weight is the product of these quantities − 0.05 × 4 = 0.20.
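
The arithmetic above can be written directly in code. The sketch below is a
plain-Java illustration with the example's numbers hard-coded; note that the
worked example uses a base-10 logarithm, so the sketch does too.

// Plain-Java sketch of the TF-IDF calculation from the example above.
public class TfIdfExample {
    // tf = occurrences of the term in a document / total terms in that document
    static double tf(long termCount, long totalTerms) {
        return (double) termCount / totalTerms;
    }

    // idf = log10(total documents / documents containing the term)
    static double idf(long totalDocs, long docsWithTerm) {
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    public static void main(String[] args) {
        double termFrequency = tf(50, 1000);                   // 0.05
        double inverseDocFrequency = idf(10_000_000L, 1_000L); // 4.0
        System.out.println("TF-IDF(hive) = " + termFrequency * inverseDocFrequency); // 0.2
    }
}
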

MapReduce programs work in two phases:


1. Map phase
2. Reduce phase.

An input to each phase is key-value pairs. In addition, every programmer needs


to specify two functions: map function and reduce function.

How MapReduce Works? Complete Process


The whole process goes through four phases of execution namely, splitting,
mapping, shuffling, and reducing.

Let's understand this with an example –

Consider you have the following input data for your MapReduce program

Welcome to Hadoop Class


Hadoop is good
Hadoop is bad

MapReduce Architecture
The final output of the MapReduce task is

bad 1

Class 1
good 1

Hadoop 3

is 2

to 1

Welcome 1

The data goes through the following phases

Input Splits:

An input to a MapReduce job is divided into fixed-size pieces called input
splits. An input split is a chunk of the input that is consumed by a single map
task.

Mapping

This is the very first phase in the execution of a map-reduce program. In this
phase, data in each split is passed to a mapping function to produce output
values. In our example, the job of the mapping phase is to count the number of
occurrences of each word from the input splits (input splits are described
above) and prepare a list in the form of <word, frequency>.

Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate
the relevant records from the Mapping phase output. In our example, the same
words are clubbed together along with their respective frequency.

Reducing

In this phase, output values from the Shuffling phase are aggregated. This phase
combines values from the Shuffling phase and returns a single output value. In
short, this phase summarizes the complete dataset.

In our example, this phase aggregates the values from the Shuffling phase, i.e.,
calculates the total occurrences of each word.
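
Tying the four phases together, a driver class configures the job: which classes
run in the mapping and reducing phases, the output types, and the input and output
paths. The following is an illustrative sketch using the org.apache.hadoop.mapreduce
API; it assumes mapper and reducer classes like the TokenMapper and SumReducer
sketched earlier, with paths taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenMapper.class);     // Mapping phase
        job.setCombinerClass(SumReducer.class);    // optional local aggregation
        job.setReducerClass(SumReducer.class);     // Reducing phase

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // splitting happens here
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
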
MapReduce Architecture explained in detail

 One map task is created for each split which then executes map function
for each record in the split.
 It is always beneficial to have multiple splits because the time taken to
process a split is small as compared to the time taken for processing the
whole input. When the splits are smaller, the processing is better load-
balanced since we are processing the splits in parallel.
 However, it is also not desirable to have splits too small in size. When
splits are too small, the overhead of managing the splits and of map task
creation begins to dominate the total job execution time.
 For most jobs, it is better to make the split size equal to the size of an
HDFS block (which is 128 MB by default in recent Hadoop versions, 64 MB
in older ones).
 Execution of map tasks results in writing output to a local disk on the
respective node and not to HDFS.
 The reason for choosing local disk over HDFS is to avoid the replication
which takes place in case of an HDFS store operation.
 Map output is intermediate output which is processed by reduce tasks to
produce the final output.
 Once the job is complete, the map output can be thrown away. So, storing
it in HDFS with replication becomes overkill.
 In the event of node failure, before the map output is consumed by the
reduce task, Hadoop reruns the map task on another node and re-creates
the map output.
 Reduce task doesn't work on the concept of data locality. An output of
every map task is fed to the reduce task. Map output is transferred to the
machine where reduce task is running.
 On this machine, the output is merged and then passed to the user-defined
reduce function.
 Unlike the map output, reduce output is stored in HDFS (the first replica
is stored on the local node and other replicas are stored on off-rack
nodes). So, writing the reduce output does consume network bandwidth,
but only as much as a normal HDFS write pipeline.
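
As a concrete illustration of the split-size discussion above, the bounds that
FileInputFormat uses when computing splits can be set from the driver. The values
below are examples only, not recommendations.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Illustrative only: bound the input split size from the driver.
// By default the split size tracks the HDFS block size.
public class SplitTuning {
    static void configureSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB lower bound
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB upper bound
    }
}
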

How MapReduce Organizes Work?


Hadoop divides the job into tasks. There are two types of tasks:

1. Map tasks (Splits & Mapping)


2. Reduce tasks (Shuffling, Reducing)

as mentioned above.
The complete execution process (execution of Map and Reduce tasks, both) is
controlled by two types of entities called a

1. Jobtracker: Acts like a master (responsible for complete execution of


submitted job)
2. Multiple Task Trackers: Act like slaves, each of them performing the
job

For every job submitted for execution in the system, there is


one Jobtracker that resides on Namenode and there are multiple
tasktrackers which reside on Datanode.

 A job is divided into multiple tasks which are then run onto multiple data
nodes in a cluster.
 It is the responsibility of job tracker to coordinate the activity by
scheduling tasks to run on different data nodes.
 Execution of an individual task is then looked after by the task tracker,
which resides on every data node executing part of the job.
 Task tracker's responsibility is to send the progress report to the job
tracker.
 In addition, the task tracker periodically sends a 'heartbeat' signal to the
Jobtracker so as to notify it of the current state of the system.
 Thus job tracker keeps track of the overall progress of each job. In the
event of task failure, the job tracker can reschedule it on a different task
tracker.

What is a Join in MapReduce?

A join operation is used to combine two large datasets in MapReduce. However,


this process involves writing lots of code to perform the actual join operation.

Joining of two datasets begins by comparing the size of each dataset. If one
dataset is smaller as compared to the other dataset then smaller dataset is
distributed to every data node in the cluster. Once it is distributed, either
Mapper or Reducer uses the smaller dataset to perform a lookup for matching
records from the large dataset and then combine those records to form output
records.


Types of Join
Depending upon the place where the actual join is performed, this join is
classified into-

1. Map-side join - When the join is performed by the mapper, it is called a
map-side join. In this type, the join is performed before the data is actually
consumed by the map function. It is mandatory that the input to each map be in
the form of a partition and be in sorted order. Also, there must be an equal
number of partitions, and they must be sorted by the join key.

2. Reduce-side join - When the join is performed by the reducer, it is called a
reduce-side join. There is no necessity in this join to have a dataset in a
structured form (or partitioned).
Here, map-side processing emits the join key and corresponding tuples of both
the tables. As an effect of this processing, all the tuples with the same join key
fall into the same reducer, which then joins the records with the same join key.

An overall process flow is depicted in the diagram below.
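
Before walking through the packaged example below, the reduce-side join idea can
be sketched in a few lines of Java. This is not the code inside MapReduceJoin.jar;
it assumes both input files are comma-separated with Dept_ID as the first field,
and that each mapper tags its records so the reducer can tell the two tables apart.
In the driver, MultipleInputs.addInputPath could register each mapper against its
own file, with JoinReducer as the reduce class.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper for the department-name file: assumes lines like "dept_id,dept_name".
class DeptNameMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        context.write(new Text(parts[0].trim()), new Text("NAME\t" + parts[1].trim()));
    }
}

// Mapper for the department-strength file: assumes lines like "dept_id,strength".
class DeptStrengthMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] parts = value.toString().split(",");
        context.write(new Text(parts[0].trim()), new Text("STRENGTH\t" + parts[1].trim()));
    }
}

// Reducer: all tuples with the same Dept_ID arrive together, so the two
// tagged record types can be matched and written out as one joined record.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text deptId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String name = "";
        String strength = "";
        for (Text v : values) {
            String[] tagged = v.toString().split("\t", 2);
            if ("NAME".equals(tagged[0])) {
                name = tagged[1];
            } else {
                strength = tagged[1];
            }
        }
        context.write(deptId, new Text(name + "\t" + strength));
    }
}
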

How to Join two DataSets: MapReduce Example


There are two sets of data in two different files (shown below). The key
Dept_ID is common in both files. The goal is to use a MapReduce join to
combine these files.
File 1

File 2

Input: The input data set is a txt file, DeptName.txt & DeptStrength.txt


Ensure you have Hadoop installed. Before you start with the actual process,
change user to 'hduser' (the user id used during your Hadoop configuration).

su - hduser_
Step 1) Copy the zip file to the location of your choice

Step 2) Uncompress the Zip File

sudo tar -xvf MapReduceJoin.tar.gz

Step 3) Go to directory MapReduceJoin/

cd MapReduceJoin/
Step 4) Start Hadoop

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

Step 5) DeptStrength.txt and DeptName.txt are the input files used for this
program.

These files need to be copied to HDFS using the below command-

$HADOOP_HOME/bin/hdfs dfs -copyFromLocal DeptStrength.txt DeptName.txt /

Step 6) Run the program using the below command-

$HADOOP_HOME/bin/hadoop jar MapReduceJoin.jar MapReduceJoin/JoinDriver/DeptStrength.txt /DeptName.txt /output_mapreducejoin
Step 7) After execution, the output file (named 'part-00000') will be stored in
the directory /output_mapreducejoin on HDFS.

Results can be seen using the command line interface-

$HADOOP_HOME/bin/hdfs dfs -cat /output_mapreducejoin/part-00000

Results can also be seen via a web interface-

Select 'Browse the filesystem', navigate up to /output_mapreducejoin, and
open part-r-00000. The results are shown there.


NOTE: Before running this program again, you will need to delete the output
directory /output_mapreducejoin-

$HADOOP_HOME/bin/hdfs dfs -rm -r /output_mapreducejoin

An alternative is to use a different name for the output directory.

What is Counter in MapReduce?

A counter in MapReduce is a mechanism used for collecting statistical


information about the MapReduce job. This information could be useful for
diagnosis of a problem in MapReduce job processing. Counters are similar to
putting a log message in the code for a map or reduce.
Typically, these counters are defined in a program (map or reduce) and are
incremented during execution when a particular event or condition (specific to
that counter) occurs. A very good application of counters is to track valid and
invalid records from an input dataset.

Types of MapReduce Counters


There are basically 2 types of MapReduce Counters

1. Hadoop Built-In counters: There are some built-in counters which
exist per job. Below are built-in counter groups-
 MapReduce Task Counters - Collects task specific
information (e.g., number of input records) during its
execution time.
 FileSystem Counters - Collects information like number of
bytes read or written by a task
 FileInputFormat Counters - Collects information of a
number of bytes read through FileInputFormat
 FileOutputFormat Counters - Collects information of a
number of bytes written through FileOutputFormat
 Job Counters - These counters are used by JobTracker. Statistics
collected by them include, for example, the number of tasks
launched for a job.
2. User Defined Counters

In addition to the built-in counters, a user can define his own counters using
similar functionality provided by programming languages. For example,
in Java, 'enum's are used to define user-defined counters.

Counters Example

An example MapClass with Counters to count the number of missing and
invalid values. The input data set used in this example is a CSV file,
SalesJan2009.csv.

// This class is nested inside the job's driver class; the imports below
// belong at the top of that source file.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public static class MapClass
        extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    static enum SalesCounters { MISSING, INVALID };

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {

        // Input string is split using ',' and stored in the 'fields' array
        String fields[] = value.toString().split(",", -20);

        // Value at 4th index is country; it is stored in the 'country' variable
        String country = fields[4];

        // Value at 8th index is sales data; it is stored in the 'sales' variable
        String sales = fields[8];

        if (country.length() == 0) {
            reporter.incrCounter(SalesCounters.MISSING, 1);
        } else if (sales.startsWith("\"")) {
            reporter.incrCounter(SalesCounters.INVALID, 1);
        } else {
            output.collect(new Text(country), new Text(sales + ",1"));
        }
    }
}

The above code snippet shows an example implementation of counters in
MapReduce.

Here, SalesCounters is a counter defined using 'enum'. It is used to
count MISSING and INVALID input records.

In the code snippet, if the 'country' field has zero length, then its value is
missing and hence the corresponding counter SalesCounters.MISSING is
incremented.

Next, if the 'sales' field starts with a ", then the record is considered INVALID.
This is indicated by incrementing the counter SalesCounters.INVALID.
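
After the job completes, the counter values can be read back in the driver and
reported. The following is a minimal sketch using the same old
org.apache.hadoop.mapred API as the MapClass above; the surrounding JobConf
setup is assumed to exist elsewhere.

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Assumes a JobConf has already been configured with MapClass, input/output paths, etc.
public class CounterReport {
    static void printSalesCounters(JobConf conf) throws Exception {
        RunningJob job = JobClient.runJob(conf);   // blocks until the job completes
        Counters counters = job.getCounters();
        // SalesCounters is the enum declared inside MapClass above
        long missing = counters.getCounter(MapClass.SalesCounters.MISSING);
        long invalid = counters.getCounter(MapClass.SalesCounters.INVALID);
        System.out.println("Missing records: " + missing);
        System.out.println("Invalid records: " + invalid);
    }
}
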
LIST OF REFERENCES

1. https://onlineitguru.com/blog/importance-of-map-reduce-in-hadoop
2. https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
3. https://www.dummies.com/programming/big-data/hadoop/the-importance-of-mapreduce-in-hadoop/
4. https://data-flair.training/blogs/hadoop-mapreduce-tutorial/
