
Unit-II

Introduction to MAPREDUCE Programming: Processing Data with Hadoop, Managing Resources and Applications with Hadoop YARN (Yet Another Resource Negotiator), Mapper, Reducer, Combiner, Partitioner, Searching, Sorting, Compression, Hadoop EcoSystem.

Introduction to MAPREDUCE Programming

MapReduce is a software framework and programming model used for processing huge amounts
of data. MapReduce programs work in two phases, namely, Map and Reduce. Map tasks deal with
splitting and mapping of data, while Reduce tasks shuffle and reduce the data. Hadoop is capable
of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
MapReduce programs are parallel in nature and are therefore very useful for performing
large-scale data analysis using multiple machines in a cluster. The input to each phase is a set of
key-value pairs. In addition, every programmer needs to specify two functions: a map function
and a reduce function.

Processing data with Hadoop using MapReduce

MapReduce is a programming framework that allows us to perform distributed and parallel processing
on large data sets across a cluster of machines.

 MapReduce consists of two distinct tasks – Map and Reduce.
 As the name MapReduce suggests, the reducer phase takes place after the mapper phase
has been completed.
 So, the first is the map job, where a block of data is read and processed to produce
key-value pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is the input to the Reducer.
 The reducer receives key-value pairs from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs)
into a smaller set of tuples or key-value pairs, which is the final output.
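
For example, in the classic word-count job (used throughout this unit), the key-value pairs flow roughly as follows (a simplified illustration):

Input line to a mapper:  "Hadoop is fast and Hadoop is open"
Mapper output:           (Hadoop,1) (is,1) (fast,1) (and,1) (Hadoop,1) (is,1) (open,1)
After shuffle and sort:  (Hadoop,[1,1]) (is,[1,1]) (and,[1]) (fast,[1]) (open,[1])
Reducer output:          (Hadoop,2) (is,2) (and,1) (fast,1) (open,1)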

Let us understand more about MapReduce and its components. MapReduce has the following
three major classes:

Mapper Class
The first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader
processes each input record and generates the corresponding key-value pair. Hadoop stores this
intermediate mapper output on the local disk.

 Input Split

It is the logical representation of data. It represents a block of work that contains a single
map task in the MapReduce program.

 RecordReader

It interacts with the Input Split and converts the obtained data into key-value pairs.
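
A minimal word-count Mapper class might look like the sketch below. This is an illustrative sketch, not one of the listings from this unit; the class name WordCountMapper is assumed.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The key is the byte offset supplied by the RecordReader; the value is one line of input.
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // emit an intermediate (word, 1) pair
        }
    }
}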

Reducer Class
The intermediate output generated by the mapper is fed to the reducer, which processes it and
generates the final output; this output is then saved in HDFS.
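
A matching word-count Reducer, again as an illustrative sketch (the class name WordCountReducer is assumed), could be:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get(); // add up all the 1s emitted for this word
        }
        context.write(key, new IntWritable(sum)); // final (word, total) pair, written to HDFS
    }
}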

Driver Class

The major component of a MapReduce job is the Driver class. It is responsible for
setting up a MapReduce job to run in Hadoop. In the driver, we specify the names of the Mapper
and Reducer classes, along with the data types and their respective job names.
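
A driver for the word-count sketches above could be set up as follows. This is a hedged sketch: the class names and the use of command-line arguments for the input and output paths are assumptions, not part of the original material.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));  // input file or directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}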

Introduction to YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond Java MapReduce and lets other
applications such as HBase and Spark work on the cluster. Different YARN applications can
co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time,
bringing great benefits for manageability and cluster utilization.

Components Of YARN

 Client: For submitting MapReduce jobs.
 Resource Manager: To manage the use of resources across the cluster.
 Node Manager: For launching and monitoring the compute containers on machines
in the cluster.

 MapReduce Application Master: Coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by
the resource manager and managed by the node managers.

JobTracker and TaskTracker were used in previous versions of Hadoop and were responsible
for handling resources and tracking job progress. Hadoop 2.0, however, has the ResourceManager
and NodeManager to overcome the shortfalls of the JobTracker and TaskTracker.

The Apache YARN framework consists of a master daemon known as the "Resource Manager", a slave
daemon called the Node Manager (one per slave node), and an Application Master (one per application).

1. Resource Manager(RM)

It is the master daemon of YARN. The RM manages the global assignment of resources (CPU and
memory) among all the applications and arbitrates system resources between competing
applications.

Resource Manager has two main components:

 Scheduler
 Application Manager

a) Scheduler

The scheduler is responsible for allocating resources to the running applications. It is a pure
scheduler, meaning that it performs no monitoring or tracking of the applications, and it offers
no guarantees about restarting failed tasks, whether they fail due to application failure or
hardware failure.

b) Application Manager

It manages running Application Masters in the cluster, i.e., it is responsible for starting
application masters and for monitoring and restarting them on different nodes in case of failures.

2. Node Manager(NM)

It is the slave daemon of YARN. The NM is responsible for containers, monitoring their resource
usage, and reporting the same to the ResourceManager. It also manages the user processes on that
machine. The YARN NodeManager also tracks the health of the node on which it is running. The
design allows long-running auxiliary services to be plugged into the NM; these are
application-specific services, specified as part of the configuration and loaded by the NM during
startup. Shuffle is a typical auxiliary service loaded by the NMs for MapReduce applications on YARN.

3. Application Master(AM)

One Application Master runs per application. It negotiates resources from the Resource Manager
and works with the Node Manager. It manages the application life cycle.

The AM acquires containers from the RM’s Scheduler before contacting the corresponding NMs
to start the application’s individual tasks.

Mapper
A mapper maps the input key-value pairs into a set of intermediate key-value pairs. Maps are
individual tasks that have the responsibility of transforming input records into intermediate
key-value pairs.

1. RecordReader: The RecordReader converts a byte-oriented view of the input (as generated by the
InputSplit) into a record-oriented view and presents it to the Mapper tasks. It presents the tasks
with keys and values. Generally, the key is the positional information and the value is the chunk
of data that constitutes the record.

2. Map: The map function works on the key-value pair produced by the RecordReader and generates
zero or more intermediate key-value pairs. The MapReduce framework decides the key-value pair
based on the context.

3. Combiner: It is an optional function, but it provides high performance in terms of network
bandwidth and disk space. It takes the intermediate key-value pairs provided by a mapper and
applies a user-specified aggregate function to the output of only that mapper. It is also known
as a local reducer (see the example after this list).

4. Partitioner: The partitioner takes the intermediate key-value pairs produced by the mapper,
splits them into shards, and sends each shard to a particular reducer as per the user-specified
code. Usually, records with the same key go to the same reducer. The partitioned data of each map
task is written to the local disk of that machine and pulled from there by the respective reducer.
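
For example, if one map task emits (Hadoop,1) three times, a summing combiner collapses this into a single (Hadoop,3) pair on that mapper's machine before anything is sent over the network; the partitioner then decides which reducer receives the key Hadoop.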

Reducer
The primary chore of the Reducer is to reduce a set of intermediate values (the ones that share a
common key) to a smaller set of values. The Reducer has three primary phases: Shuffle and Sort,
Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and downloads it to the
local machine where the reducer is running. These individual data pipes are then sorted by key,
producing a larger data list. The main purpose of this sort is to group records with the same key
so that their values can be easily iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase, applies the
reduce function, and processes one group at a time. The reduce function iterates over all the
values associated with a key and can perform operations such as aggregation, filtering, and
combining of data. Once it is done, the output of the reducer (zero or more key-value pairs) is
sent to the output format.

3. Output Format: The output format separates the key and value with a tab (by default) and
writes the record out to a file using a RecordWriter.
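
For example, with the default TextOutputFormat, a word-count result for the key Hadoop with the value 2 is written as a single line consisting of the key, a tab character, and the value:

Hadoop	2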

Figure 8.1 describes the chores of Mapper, Combiner, Partitioner, and Reducer for the word
count problem. The Word Count problem has been discussed under "Combiner" and
"Partitioner".

Combiner
It is an optimization technique for a MapReduce job. Generally, the reducer class is set as the
combiner class. The difference between the combiner class and the reducer class is as follows:

1. Output generated by the combiner is intermediate data, and it is passed to the reducer.
2. Output of the reducer is passed to the output file on disk.

The sections have been designed as follows:
Objective: What is it that we are trying to achieve here?
Input Data: What is the input that has been given to us to act upon?
Act: The actual statement/command to accomplish the task at hand.
Output: The result/output as a consequence of executing the statement.

Objective: Write a MapReduce program to count the occurrence of similar words in a file. Use
combiner for optimization.
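
The original listing for this program is not reproduced here. Assuming a word-count Mapper and Reducer like the ones sketched earlier in this unit, the only addition the combiner requires is one line in the driver:

job.setCombinerClass(WordCountReducer.class); // reuse the reducer as a local combiner on each mapper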

Partitioner
The partitioning phase happens after the map phase and before the reduce phase. Usually, the
number of partitions is equal to the number of reducers. The default partitioner is the hash
partitioner.

Objective: Write a MapReduce program to count the occurrence of similar words in a file. Use a
partitioner to partition the keys based on their first alphabet.

Input Data:

Welcome to Hadoop Session
Introduction to Hadoop
Introducing Hive
Hive Session
Pig Session

Act:

WordCountPartitioner.java

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        char alphabet = word.toUpperCase().charAt(0);
        int partitionNumber = 0;
        switch (alphabet) {
            case 'A': partitionNumber = 1; break;
            case 'B': partitionNumber = 2; break;
            case 'C': partitionNumber = 3; break;
            case 'D': partitionNumber = 4; break;
            case 'E': partitionNumber = 5; break;
            case 'F': partitionNumber = 6; break;
            case 'G': partitionNumber = 7; break;
            case 'H': partitionNumber = 8; break;
            case 'I': partitionNumber = 9; break;
            case 'J': partitionNumber = 10; break;
            case 'K': partitionNumber = 11; break;
            case 'L': partitionNumber = 12; break;
            case 'M': partitionNumber = 13; break;
            case 'N': partitionNumber = 14; break;
            case 'O': partitionNumber = 15; break;
            case 'P': partitionNumber = 16; break;
            case 'Q': partitionNumber = 17; break;
            case 'R': partitionNumber = 18; break;
            case 'S': partitionNumber = 19; break;
            case 'T': partitionNumber = 20; break;
            case 'U': partitionNumber = 21; break;
            case 'V': partitionNumber = 22; break;
            case 'W': partitionNumber = 23; break;
            case 'X': partitionNumber = 24; break;
            case 'Y': partitionNumber = 25; break;
            case 'Z': partitionNumber = 26; break;
            default:  partitionNumber = 0; break;
        }
        return partitionNumber;
    }
}
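
With the sample input above, this partitioner sends, for example, Hadoop and Hive to partition 8, Introduction and Introducing to partition 9, Pig to partition 16, Session to partition 19, to to partition 20, and Welcome to partition 23. Note that the driver must register the partitioner and run one reducer per partition, for example (an assumed driver snippet, not part of the original listing):

job.setPartitionerClass(WordCountPartitioner.class);
job.setNumReduceTasks(27); // partitions 0 to 26, one reducer for each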

Searching
Objective: To write a MapReduce program to search for a specific keyword in a file.
Input Data:
1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33
Act:
WordSearcher.java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordSearcher {
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(WordSearcher.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(WordSearchMapper.class);
        job.setReducerClass(WordSearchReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setNumReduceTasks(1);
        job.getConfiguration().set("keyword", "Jack");
        FileInputFormat.setInputPaths(job, new Path("/mapreduce/student.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/mapreduce/output/search"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

WordSearchMapper.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public class WordSearchMapper extends Mapper<LongWritable, Text, Text, Text> {
    static String keyword;
    static int pos = 0;

    protected void setup(Context context) throws IOException,
            InterruptedException {
        Configuration configuration = context.getConfiguration();
        keyword = configuration.get("keyword");
    }

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        InputSplit i = context.getInputSplit(); // Get the input split for this map.
        FileSplit f = (FileSplit) i;
        String fileName = f.getPath().getName();
        Integer wordPos;
        pos++;
        if (value.toString().contains(keyword)) {
            wordPos = value.find(keyword);
            context.write(value, new Text(fileName + ", " + new IntWritable(pos).toString()
                    + ", " + wordPos.toString()));
        }
    }
}
WordSearchReducer.java

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordSearchReducer extends Reducer<Text, Text, Text, Text> {

    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Simply pass each matching record through to the output.
        for (Text value : values) {
            context.write(key, value);
        }
    }
}
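
Assuming the three classes above are packaged into a jar (the jar name below is only illustrative), the job can be run and its single output file inspected with the standard Hadoop commands:

hadoop jar wordsearch.jar WordSearcher
hdfs dfs -cat /mapreduce/output/search/part-r-00000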

Sorting

Objective: Write a MapReduce program to sort the student records in student.csv (the input used in the Searching section) by student name.

Act:

SortStudNames.java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortStudNames {

    public static class SortMapper extends
            Mapper<LongWritable, Text, Text, Text> {
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit the student name (token[1]) as the key so that the shuffle
            // and sort phase sorts the records by name.
            String[] token = value.toString().split(",");
            context.write(new Text(token[1]), new Text(token[0] + " - " + token[1]));
        }
    }

    // By the time the reducer receives the data, the keys (names) are already sorted.
    public static class SortReducer extends
            Reducer<Text, Text, NullWritable, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text details : values) {
                context.write(NullWritable.get(), details);
            }
        }
    }

    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(SortStudNames.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path("/mapreduce/student.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/mapreduce/output/sorted/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Compression
In MapReduce programming, you can compress the MapReduce output file. Compression provides two
benefits:

1. It reduces the space needed to store files.

2. It speeds up data transfer across the network.

You can specify compression format in the Driver Program as shown below:

conf.setBoolean("mapred.output.compress", true);

conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

Here, a codec is the implementation of a compression-decompression algorithm. GzipCodec is the
codec for the gzip compression algorithm. These settings compress the job's output file.
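
The property names above come from the older mapred API. With the newer org.apache.hadoop.mapreduce API used in the drivers of this unit, the same effect can be achieved through FileOutputFormat (a sketch, assuming a Job object named job and an import of org.apache.hadoop.io.compress.GzipCodec):

FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);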

Hadoop EcoSystem.
Hadoop has an ecosystem that has evolved from its three core components: processing, resource
management, and storage. In this topic, you will learn the components of the Hadoop ecosystem
and how they perform their roles during Big Data processing. The Hadoop ecosystem is continuously
growing to meet the needs of Big Data. It comprises the following twelve components:

 HDFS (Hadoop Distributed File System)
 HBase
 Sqoop
 Flume
 Spark
 Hadoop MapReduce
 Pig
 Impala
 Hive
 Cloudera Search
 Oozie
 Hue

