MapReduce is a software framework and programming model used for processing huge amounts
of data. MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with
splitting and mapping the data, while Reduce tasks shuffle and reduce it. Hadoop is capable
of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
MapReduce programs in cloud computing are parallel in nature, and are therefore very useful for
performing large-scale data analysis across multiple machines in a cluster. The input to each
phase is a set of key-value pairs. In addition, every programmer needs to specify two functions: the map
function and the reduce function.
MapReduce is a programming framework that allows us to perform distributed and parallel processing
on large data sets in a distributed environment.
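The map-shuffle-reduce flow described above can be sketched in plain Java with no Hadoop dependencies. This is an illustrative simulation only; the class and method names below are invented for the sketch and are not Hadoop's API.

```java
import java.util.*;

public class WordCountSketch {
    // "Map" step: emit a (word, 1) pair for every word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // "Reduce" step: sum all the values that share one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        String[] lines = {"big data", "big cluster"};
        // "Shuffle" step: group the intermediate pairs by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines)
            for (Map.Entry<String, Integer> p : map(line))
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
    }
}
```

Note how the programmer supplies only the map and reduce functions; the grouping in between is what the framework performs automatically.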
Let us understand more about MapReduce and its components. MapReduce has the
following three main classes:
Mapper Class
The first stage in data processing using MapReduce is the Mapper class. Here, the RecordReader
processes each input record and generates the corresponding key-value pair. Hadoop stores
this intermediate mapper output on the local disk.
Input Split
It is the logical representation of data. It represents a block of work that is processed by a single
map task in the MapReduce program.
RecordReader
It interacts with the input split and converts the obtained data into key-value
pairs.
Reducer Class
The intermediate output generated from the mapper is fed to the reducer, which processes it and
generates the final output, which is then saved in HDFS.
Driver Class
The Driver class configures the MapReduce job (the input and output paths, the mapper and
reducer classes, and the output key-value types) and submits it to the cluster. It contains the
main() method that serves as the entry point of the program.
Introduction to YARN
YARN (Yet Another Resource Negotiator) takes Hadoop beyond MapReduce-only processing and
lets other applications such as HBase and Spark work on the cluster. Different YARN applications
can co-exist on the same cluster, so MapReduce, HBase, and Spark can all run at the same time,
bringing great benefits for manageability and cluster utilization.
Components Of YARN
MapReduce Application Master: Coordinates the tasks running the MapReduce job. The
application master and the MapReduce tasks run in containers that are scheduled by
the resource manager and managed by the node managers.
JobTracker and TaskTracker were used in previous versions of Hadoop; they were responsible
for handling resources and tracking job progress. Hadoop 2.0 introduced the ResourceManager
and NodeManager to overcome the shortcomings of JobTracker and TaskTracker.
The Apache YARN framework consists of a master daemon known as the ResourceManager, a slave
daemon called the NodeManager (one per slave node), and an ApplicationMaster (one per application).
1. Resource Manager(RM)
It is the master daemon of YARN. The RM manages the global assignment of resources (CPU and
memory) among all the applications and arbitrates system resources between competing
applications. It has two main components:
Scheduler
Application manager
a) Scheduler
The scheduler is responsible for allocating resources to the running applications. It is a
pure scheduler: it performs no monitoring or tracking of the applications, and it offers no
guarantees about restarting tasks that fail, whether due to application
failure or hardware failure.
b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible for starting
application masters and for monitoring and restarting them on different nodes in case of failures.
2. Node Manager(NM)
It is the slave daemon of YARN. The NM is responsible for monitoring containers and their
resource usage and reporting this to the ResourceManager. It also manages the user processes on
that machine. The YARN NodeManager additionally tracks the health of the node on which it is
running. The design allows plugging long-running auxiliary services into the NM; these are
application-specific services, specified as part of the configuration and loaded by the NM during
startup. Shuffle is a typical auxiliary service provided by the NMs for MapReduce applications on YARN.
3. Application Master(AM)
One application master runs per application. It negotiates resources from the resource manager,
works with the node managers, and manages the application life cycle.
The AM acquires containers from the RM’s Scheduler before contacting the corresponding NMs
to start the application’s individual tasks.
Mapper
A mapper maps the input key-value pairs into a set of intermediate key-value pairs. Maps are
individual tasks that have the responsibility of transforming input records into intermediate
key-value pairs.
1. Record Reader: The RecordReader converts a byte-oriented view of the input (as produced by the
InputSplit) into a record-oriented view and presents it to the Mapper tasks. It presents the tasks
with keys and values. Generally, the key is the positional information and the value is the chunk of
data that constitutes the record.
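For line-oriented text input, the key presented to the mapper is the byte offset of each line in the split. The following plain-Java sketch (no Hadoop dependencies; the class name is invented for illustration) mimics how such (offset, line) records would be produced:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

public class LineRecordReaderSketch {
    // Produce (byteOffset, line) pairs, the record-oriented view a
    // text-input record reader would hand to the Mapper.
    static List<Map.Entry<Long, String>> read(String split) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : split.split("\n", -1)) {
            if (!line.isEmpty())
                records.add(new AbstractMap.SimpleEntry<>(offset, line));
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : read("hello world\nbig data"))
            System.out.println(r.getKey() + "\t" + r.getValue());
    }
}
```

Here "hello world" is presented with key 0 and "big data" with key 12, since the first line plus its newline occupies 12 bytes.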
2. Map: The map function works on the key-value pairs produced by the RecordReader and generates
zero or more intermediate key-value pairs. Which key-value pairs are emitted is decided by the
application logic.
3. Combiner: The combiner is an optional local reducer that aggregates the intermediate map
output on the map side before it is sent over the network (it is discussed in detail later in
this chapter).
4. Partitioner: The partitioner takes the intermediate key-value pairs produced by the mapper,
splits them into shards, and sends each shard to a particular reducer as per the user-specified
code. Usually, all records with the same key go to the same reducer. The partitioned data of each
map task is written to the local disk of that machine and pulled by the respective reducer.
Reducer
The primary chore of the Reducer is to reduce a set of intermediate values (the ones that share a
common key) to a smaller set of values. The Reducer has three primary phases: Shuffle and Sort,
Reduce, and Output Format.
1. Shuffle and Sort: This phase takes the output of all the partitioners and downloads it to
the local machine where the reducer is running. These individual data pieces are then sorted by
key, producing one larger data list. The main purpose of this sort is to group equal keys together so
that their values can be easily iterated over by the reduce task.
2. Reduce: The reducer takes the grouped data produced by the shuffle and sort phase, applies the
reduce function, and processes one group at a time. The reduce function iterates over all the values
associated with each key. The reducer function provides various operations such as aggregation,
filtering, and combining of data. Once it is done, the output (zero or more key-value pairs) of the
reducer is sent to the output format.
3. Output Format: The output format separates each key-value pair with a tab (by default) and writes it
out to a file using a record writer.
Figure 8.1 describes the chores of Mapper, Combiner, Partitioner, and Reducer for the word
count problem. The Word Count problem has been discussed under "Combiner" and
"Partitioner".
Combiner
It is an optimization technique for a MapReduce job. Generally, the reducer class is also set as the
combiner class. The difference between the combiner class and the reducer class is that the combiner
runs locally on each map task's output before it is shuffled across the network, so its output is
still intermediate data, whereas the reducer runs on the grouped output of all the mappers and
produces the final result.
Objective: Write a MapReduce program to count the occurrence of similar words in a file. Use
combiner for optimization.
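The effect of the combiner can be illustrated in plain Java without Hadoop. The sketch below (the class and method names are invented for illustration) pre-aggregates a map task's output locally, so that fewer (word, count) pairs would need to be shuffled to the reducer:

```java
import java.util.*;

public class CombinerSketch {
    // Local pre-aggregation: sum the counts per word on the map side,
    // so fewer (word, count) pairs travel across the network.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String w : words) partial.merge(w, 1, Integer::sum);
        return partial;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("big", "data", "big", "big");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("pairs before combine: " + mapOutput.size());
        System.out.println("pairs after combine:  " + combined.size());
        System.out.println(combined);
    }
}
```

Four intermediate pairs shrink to two, which is exactly the data-volume saving a combiner provides; this is why summing-style reducers can usually double as combiners.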
Partitioner
The partitioning phase happens after the map phase and before the reduce phase. Usually, the number of
partitions is equal to the number of reducers. The default partitioner is the hash partitioner.
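The default hash partitioner simply takes the key's hash, masks off the sign bit, and takes the remainder by the reducer count. A self-contained sketch of that formula (the class name here is illustrative, not Hadoop's own class):

```java
public class HashPartitionerSketch {
    // Same formula as a default hash partitioner: mask off the sign bit
    // so the result is non-negative, then take the remainder by the
    // number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"apple", "banana", "cherry"})
            System.out.println(key + " -> partition " + getPartition(key, reducers));
    }
}
```

Because the same key always hashes to the same partition, all records sharing a key end up at the same reducer, which is the property the shuffle relies on.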
Objective: Write a MapReduce program to count the occurrence of similar words in a file. Use
partitioner to partition key based on alphabets.
Input Data:
Act:
WordCountPartitioner.java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        char alphabet = word.toUpperCase().charAt(0);
        int partitionNumber = 0;
        switch (alphabet) {
            case 'A': partitionNumber = 1; break;
            case 'B': partitionNumber = 2; break;
            case 'C': partitionNumber = 3; break;
            case 'D': partitionNumber = 4; break;
            case 'E': partitionNumber = 5; break;
            case 'F': partitionNumber = 6; break;
            case 'G': partitionNumber = 7; break;
            case 'H': partitionNumber = 8; break;
            case 'I': partitionNumber = 9; break;
            case 'J': partitionNumber = 10; break;
            case 'K': partitionNumber = 11; break;
            case 'L': partitionNumber = 12; break;
            case 'M': partitionNumber = 13; break;
            case 'N': partitionNumber = 14; break;
            case 'O': partitionNumber = 15; break;
            case 'P': partitionNumber = 16; break;
            case 'Q': partitionNumber = 17; break;
            case 'R': partitionNumber = 18; break;
            case 'S': partitionNumber = 19; break;
            case 'T': partitionNumber = 20; break;
            case 'U': partitionNumber = 21; break;
            case 'V': partitionNumber = 22; break;
            case 'W': partitionNumber = 23; break;
            case 'X': partitionNumber = 24; break;
            case 'Y': partitionNumber = 25; break;
            case 'Z': partitionNumber = 26; break;
            default: partitionNumber = 0; break;
        }
        return partitionNumber;
    }
}
Searching
Objective: To write a MapReduce program to search for a specific keyword in a file.
Input Data:
1001,John,45
1002,Jack,39
1003,Alex,44
1004,Smith,38
1005,Bob,33
Act:
WordSearcher.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordSearcher {
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf);
        job.setJarByClass(WordSearcher.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(WordSearchMapper.class);
        job.setReducerClass(WordSearchReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setNumReduceTasks(1);
        job.getConfiguration().set("keyword", "Jack");
        FileInputFormat.setInputPaths(job, new Path("/mapreduce/student.csv"));
        FileOutputFormat.setOutputPath(job, new Path("/mapreduce/output/search"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
WordSearchMapper.java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WordSearchMapper extends Mapper<LongWritable, Text, Text, Text> {
    static String keyword;
    static int pos = 0;

    protected void setup(Context context) throws IOException,
            InterruptedException {
        Configuration configuration = context.getConfiguration();
        keyword = configuration.get("keyword");
    }

    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        InputSplit i = context.getInputSplit(); // Get the input split for this map.
        FileSplit f = (FileSplit) i;
        String fileName = f.getPath().getName();
        Integer wordPos;
        pos++;
        if (value.toString().contains(keyword)) {
            wordPos = value.find(keyword);
            context.write(value, new Text(fileName + "," + new IntWritable(pos).toString()
                    + "," + wordPos.toString()));
        }
    }
}
WordSearchReducer.java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordSearchReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Emit each matched line along with its location details unchanged.
        for (Text value : values)
            context.write(key, value);
    }
}
Sorting
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
Compression
In MapReduce programming, you can compress the MapReduce output file. Compression
provides two benefits: it reduces the disk space needed to store the output, and it speeds up
data transfer across the network.
You can specify the compression format in the driver program as shown below:
conf.setBoolean("mapred.output.compress", true);
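As a sketch, a codec can be specified alongside the compress flag. The snippet below uses the older `mapred.*` property names that match the line above (newer Hadoop releases also accept `mapreduce.output.fileoutputformat.compress*` names); the choice of GzipCodec here is illustrative.

```java
// Driver-side sketch: enable output compression and pick a codec.
conf.setBoolean("mapred.output.compress", true);
conf.set("mapred.output.compression.codec",
        "org.apache.hadoop.io.compress.GzipCodec");
```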
Hadoop Ecosystem
Hadoop has an ecosystem that has evolved from its three core components: processing, resource
management, and storage. In this topic, you will learn the components of the Hadoop ecosystem
and how they perform their roles during Big Data processing. The Hadoop ecosystem is
continuously growing to meet the needs of Big Data. It comprises the following twelve components: