
MapReduce Exam 2019 - Solved Paper

Vinod Patne
Table of Contents

Hadoop MapReduce Programming
  1. Write MapReduce program to read log file and count number of times word Exception occurs in the file.
  2. Explain the role of Partitioner in MapReduce program with example. Write sample code snippet.
  3. Write custom input Formatter to extract SQL Query from the input file.
  4. What is the custom Key class (implements WritableComparable) and Value class (implements Writable)?

Hadoop MapReduce Execution Flow
  5. What is the difference between Partitioner, Combiner, Shuffle and sort phase in Map Reduce. What is the order of execution?

Hadoop Configuration
  6. Explain Hadoop important Configuration parameters.

Hadoop Configuration
  7. Role of YARN, Hue, Application Manager, Node Manager in MapReduce.


Hadoop MapReduce Programming

1. Write MapReduce program to read log file and count number of times word
Exception occurs in the file.

ExceptionCountMapper.java
public class ExceptionCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text wordException = new Text("Exception");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Find occurrence
        int index = line.indexOf("Exception");
        if (index != -1) {
            // If found, write to context to send it to the Reducer
            context.write(wordException, one);
        }
    }
}

ExceptionCountReducer.java
public class ExceptionCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // Sum occurrences
        for (IntWritable val : values) {
            sum = sum + val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

ExceptionCountDriver.java
public class ExceptionCountDriver extends Configured implements Tool {

    // The run() method performs the job set-up; main() is the entry point for the driver.
    public int run(String[] args) throws Exception {
        // Input and output paths passed from the command line
        String inputHDFSPath = args[0];
        String outputHDFSPath = args[1];

        // Optional: a Configuration object allows overriding Hadoop configuration parameters.
        // Configurations are specified by resources.
        // Unless explicitly turned off, Hadoop by default loads two resources, in order, from the classpath:
        //   1. core-default.xml: read-only defaults for Hadoop.
        //   2. core-site.xml: site-specific configuration for a given Hadoop installation.
        Configuration jobConf = new Configuration();
        jobConf.setBoolean("mapreduce.map.output.compress", true);
        jobConf.set("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        jobConf.setBoolean("mapreduce.output.fileoutputformat.compress", true);

        // Set log levels - OFF, FATAL, ERROR, WARN, INFO, DEBUG, TRACE or ALL.
        jobConf.set("mapreduce.map.log.level", "DEBUG");
        jobConf.set("mapreduce.reduce.log.level", "TRACE");

        // 1. Create a new Job
        String jobName = "ExceptionCount";
        Job job = Job.getInstance(jobConf, jobName);

        // Specify various job-specific parameters

        // 2. Configure Mapper, Reducer and Driver class names
        job.setJarByClass(ExceptionCountDriver.class);
        job.setMapperClass(ExceptionCountMapper.class);
        job.setReducerClass(ExceptionCountReducer.class);
        // Optional: set the Reducer class as combiner to reduce the volume of data
        // transferred between Map and Reduce. The Combiner is also known as a Mini-Reducer.
        job.setCombinerClass(ExceptionCountReducer.class);

        // 3. Set the key & value classes for the job's final output data
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 4. Set the Input & Output Format classes for the job
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // 5. Set the number of reducers
        job.setNumReduceTasks(1);

        // 6. Configure the input/output paths from the filesystem into the job
        FileInputFormat.setInputPaths(job, new Path(inputHDFSPath));
        FileOutputFormat.setOutputPath(job, new Path(outputHDFSPath));
        // OR
        // jobConf.set("mapred.input.dir", inputHDFSPath);
        // jobConf.set("mapred.output.dir", outputHDFSPath);

        // 7. Launch/submit the job synchronously
        // The true argument tells the framework to write verbose progress output
        boolean success = job.waitForCompletion(true);

        System.out.println("job completed successfully=" + job.isSuccessful());
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        ExceptionCountDriver driver = new ExceptionCountDriver();
        // ToolRunner can be used to run classes implementing the Tool interface.
        // It works in conjunction with GenericOptionsParser to parse the generic Hadoop
        // command-line arguments (like -Dmapreduce.reduce.tasks=3) and modifies the
        // Configuration of the Tool. The application-specific options are passed along unmodified.
        int res = ToolRunner.run(new Configuration(), driver, args);
        System.exit(res);
    }
}

Reference for further reading:


https://ercoppa.github.io/HadoopInternals/AnatomyMapReduceJob.html
https://techvidvan.com/tutorials/category/mapreduce/

2. Explain the role of Partitioner in MapReduce program with example. Write sample code snippet.

A Partitioner works like a condition applied to the intermediate dataset. The partition phase takes place after the Map phase and before the Reduce phase.
The number of partitions is equal to the number of reducers; in other words, the partitioner divides the data according to the number of reducers. It partitions the data using a user-defined condition, which works like a hash function, so all records assigned to a single partition are processed by a single Reducer.
The default partitioner (HashPartitioner) computes a hash value for the key and assigns the partition based on that result. If hashCode() does not distribute keys uniformly over the partition range, the data will not be sent evenly to the reducers. To overcome a poorly performing default partitioner in Hadoop MapReduce, we can create a custom Partitioner, which allows us to share the workload uniformly across the reducers.

Driver.java
job.setPartitionerClass(MyPartitioner.class);
// Default
// job.setPartitionerClass(HashPartitioner.class);

MyPartitioner.java
public class MyPartitioner extends Partitioner<Text, Text> {

    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        if (numReduceTasks == 0) {
            return 0;
        } else {
            // The third tab-separated field of the value is expected to be the age
            String[] str = value.toString().split("\t");
            int age = Integer.parseInt(str[2]);
            if (age <= 20) {
                return 0;
            } else if (age > 20 && age <= 30) {
                return 1 % numReduceTasks;
            } else {
                return 2 % numReduceTasks;
            }
        }
    }
}
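
For the three age buckets above, the driver would typically request three reducers so that each partition lands in its own output file; a minimal driver sketch (the job name is illustrative):

// Driver sketch: wire the custom partitioner and match the reducer count to the partitions.
Job job = Job.getInstance(new Configuration(), "age-partitioning");
job.setPartitionerClass(MyPartitioner.class);
job.setNumReduceTasks(3);   // one reducer per age bucket (<=20, 21-30, >30)
// With fewer reducers, the "% numReduceTasks" in getPartition() keeps partition numbers in range.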

Reference for further reading: https://data-flair.training/blogs/hadoop-partitioner-tutorial/

3. Write custom input Formatter to extract SQL Query from the input file.

How the input files are split up and read in Hadoop is defined by the InputFormat. It splits the input file into InputSplits and assigns each split to an individual Mapper.
• The files or other objects that should be used for input are selected by the InputFormat.
• The InputFormat defines the data splits, which define both the size of individual map tasks and their potential execution server.
• The InputFormat defines the RecordReader, which is responsible for reading actual records from the input files.

Driver.java


job.setInputFormatClass(SqlInputFormat.class);
// Default
// job.setInputFormatClass(TextInputFormat.class);

InputFormat exposes two methods that get data to the mapper: getSplits() and createRecordReader().

SqlInputFormat.java
public class SqlInputFormat extends FileInputFormat<LongWritable, Text> {

    // No need to override the getSplits() implementation provided by the super class.
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit inputSplit,
            TaskAttemptContext context) throws IOException, InterruptedException {
        SqlRecordReader srr = new SqlRecordReader();
        srr.initialize(inputSplit, context);
        return srr;
    }
}

SqlRecordReader.java
public class SqlRecordReader extends RecordReader<LongWritable, Text> {

    private LineRecordReader lrr;
    private LongWritable key;
    private Text value;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lrr = new LineRecordReader();
        lrr.initialize(inputSplit, context);
    }

    // Reads lines until a full "SELECT ... ;" statement has been assembled.
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        StringBuilder query = new StringBuilder();
        boolean qStarted = false;

        while (lrr.nextKeyValue()) {
            String line = lrr.getCurrentValue().toString();

            if (qStarted) {
                int index = line.indexOf(";");
                if (index != -1) {
                    query.append(line.substring(0, index + 1));
                    break;
                } else {
                    query.append(line);
                }
            } else {
                int index = line.toUpperCase().indexOf("SELECT");
                if (index != -1) {
                    qStarted = true;
                    int endIndex = line.indexOf(";");
                    if (endIndex != -1) {
                        query.append(line.substring(index, endIndex + 1));
                        break;
                    } else {
                        query.append(line.substring(index));
                    }
                }
            }
        }

        if (qStarted) {
            key = new LongWritable(1);
            value = new Text(query.toString());
        }
        return qStarted;
    }

    // Remaining RecordReader callbacks expose the assembled key/value and delegate to the line reader.
    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException { return lrr.getProgress(); }

    @Override
    public void close() throws IOException { lrr.close(); }
}
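
A mapper consuming this input sees one complete SQL statement per call; the sketch below simply counts the extracted queries (the class name is illustrative):

SqlCountMapper.java (sketch)
public class SqlCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text query, Context context)
            throws IOException, InterruptedException {
        // The value is one full "SELECT ... ;" statement assembled by SqlRecordReader.
        context.write(new Text("SQL_QUERY"), one);
    }
}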

• Types of InputFormat in MapReduce

1. FileInputFormat - It is the base class for all file-based InputFormats. FileInputFormat also specifies the input directory where the data files are located.
2. TextInputFormat (default) - This InputFormat treats each line of each input file as a separate record. TextInputFormat is useful for unformatted data or line-based records like log files.
3. KeyValueTextInputFormat - It is similar to TextInputFormat. The difference is that TextInputFormat treats the entire line as the value, while KeyValueTextInputFormat breaks the line itself into key and value at a tab character ('\t').
4. SequenceFileInputFormat - It is an InputFormat which reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs.
5. SequenceFileAsTextInputFormat - This format converts the sequence file keys and values to Text objects by calling toString() on them.
6. SequenceFileAsBinaryInputFormat - With this format we can retrieve the sequence file's keys and values as opaque binary objects.
7. NLineInputFormat - It is another form of TextInputFormat where the keys are the byte offsets of the lines and the values are the line contents, but each split (and hence each mapper) receives a fixed number of input lines (N).
8. DBInputFormat - This InputFormat reads data from a relational database, using JDBC.

• Types of OutputFormat in MapReduce

An OutputFormat describes how the RecordWriter implementation is used to write output to output files. The RecordWriter in a MapReduce job writes the output key-value pairs from the Reducer phase to the output files.
1. FileOutputFormat (implements interface OutputFormat) - base class for all file-based OutputFormats.
2. TextOutputFormat (default) - It writes (key, value) pairs on individual lines of text files, turning them into strings by calling toString() on them. It separates key and value with a tab character; this can be changed with the property mapreduce.output.textoutputformat.separator.
3. SequenceFileOutputFormat - This OutputFormat writes sequence files as its output. Sequence files are a compact intermediate format for passing data between chained MapReduce jobs.
4. SequenceFileAsBinaryOutputFormat - It is a variant of SequenceFileOutputFormat that writes keys and values to the sequence file in raw binary format.
5. MapFileOutputFormat - It is another form of FileOutputFormat that writes the output as map files. Because the framework adds keys to a MapFile in order, the reducer must emit keys in sorted order.
6. MultipleOutputs - This format allows writing data to files whose names are derived from the output keys and values (a short usage sketch follows after this list).
7. LazyOutputFormat - A wrapper OutputFormat that creates an output file only when the first record is actually emitted, since FileOutputFormat otherwise creates output files even if they are empty.
8. DBOutputFormat - It is the OutputFormat for writing to relational databases and HBase. This format sends the reduce output to a SQL table.
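
As an illustration of MultipleOutputs, the reducer-side sketch below writes each aggregated record to a named output; the class, output name and types are illustrative, and the driver must declare the named output as shown in the trailing comment:

SumToNamedOutputReducer.java (sketch)
public class SumToNamedOutputReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        // Written to files named errors-r-00000 etc. instead of the default part-r-*.
        mos.write("errors", key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

// Driver: declare the named output (name and types are illustrative).
// MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class, Text.class, IntWritable.class);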

Reference for further reading: https://data-flair.training/blogs/hadoop-inputformat/

4. What is the custom Key class (implements WritableComparable) and Value class (implements Writable)?

CityTemperature.java
// Each custom value class should implement Writable
public class CityTemperature implements Writable {

    // Fields, defined public to make them accessible without getters
    public Text city;
    public IntWritable temperature;

    // A no-argument constructor must be present,
    // else Hadoop will throw an error during deserialization
    public CityTemperature() {
        city = new Text();
        temperature = new IntWritable();
    }

    public CityTemperature(String city, String degree) {
        this.city = new Text(city);
        this.temperature = new IntWritable(Integer.parseInt(degree));
    }

    // This method is used when deserializing data
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        city.readFields(dataInput);
        temperature.readFields(dataInput);
    }

    // This method is used when serializing data
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        city.write(dataOutput);
        temperature.write(dataOutput);
    }

    public IntWritable getTemperature() {
        return temperature;
    }

    // hashCode, toString & equals methods
    // remaining getter & setter methods
}

CustomDate.java
public class CustomDate implements WritableComparable<CustomDate> {

    IntWritable year;
    IntWritable month;
    IntWritable day;

    public CustomDate() {
        year = new IntWritable();
        month = new IntWritable();
        day = new IntWritable();
    }

    public CustomDate(String year, String month, String day) {
        this.year = new IntWritable(Integer.parseInt(year));
        this.month = new IntWritable(Integer.parseInt(month));
        this.day = new IntWritable(Integer.parseInt(day));
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        day.readFields(dataInput);
        month.readFields(dataInput);
        year.readFields(dataInput);
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        day.write(dataOutput);
        month.write(dataOutput);
        year.write(dataOutput);
    }

    // Used for key comparison while sorting & shuffling
    @Override
    public int compareTo(CustomDate dateWritable) {
        int yearCmp = year.compareTo(dateWritable.getYear());
        if (yearCmp != 0) {
            return yearCmp;
        }
        int monthCmp = month.compareTo(dateWritable.getMonth());
        if (monthCmp != 0) {
            return monthCmp;
        }
        return day.compareTo(dateWritable.getDay());
    }

    // Used for partitioning
    @Override
    public int hashCode() {
        return year.hashCode() + month.hashCode() + day.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof CustomDate) {
            CustomDate dateWritable = (CustomDate) obj;
            return this.getYear().equals(dateWritable.getYear()) &&
                    this.getMonth().equals(dateWritable.getMonth()) &&
                    this.getDay().equals(dateWritable.getDay());
        }
        return false;
    }

    // Used while writing reducer output to a text file
    @Override
    public String toString() {
        return day.toString() + "/" + month.toString() + "/" + year.toString();
    }

    // getter & setter methods (getYear(), getMonth(), getDay(), ...)
}

TemperatureMapper.java
public class TemperatureMapper extends Mapper<LongWritable, Text, CustomDate, CityTemperature> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] keyValue = line.split("\t");
        String[] cityFields = keyValue[0].split(" ");
        String[] dateFields = keyValue[1].split(" ");

        CustomDate outKey = new CustomDate(dateFields[0], dateFields[1], dateFields[2]);
        CityTemperature outValue = new CityTemperature(cityFields[0], cityFields[1]);
        context.write(outKey, outValue);
    }
}

TemperatureReducer.java
public class TemperatureReducer extends Reducer<CustomDate, CityTemperature, CustomDate, IntWritable> {

    @Override
    protected void reduce(CustomDate key, Iterable<CityTemperature> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        int nbr = 0;

        // Calculate the mean of the temperatures recorded for this date
        for (CityTemperature value : values) {
            nbr++;
            sum = sum + value.getTemperature().get();
        }
        context.write(key, new IntWritable(sum / nbr));
    }
}
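
Because the intermediate (map output) types differ from the defaults, the driver must declare them explicitly; a minimal sketch (the job name is illustrative):

Driver.java (sketch)
Job job = Job.getInstance(new Configuration(), "city-temperature");
job.setMapperClass(TemperatureMapper.class);
job.setReducerClass(TemperatureReducer.class);
// Map output types differ from the final output types, so both pairs must be declared.
job.setMapOutputKeyClass(CustomDate.class);
job.setMapOutputValueClass(CityTemperature.class);
job.setOutputKeyClass(CustomDate.class);
job.setOutputValueClass(IntWritable.class);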


Hadoop MapReduce Execution Flow

5. What is the difference between Partitioner, Combiner, Shuffle and sort phase in
Map Reduce. What is the order of execution?

Combiner - The Combiner is a mini-reducer which performs local aggregation on the mapper's output. It minimizes the data transferred between mapper and reducer. When the combiner finishes, the framework passes its output to the partitioner for further processing.

Partitioner - The Partitioner comes into existence if we are working with more than one reducer. It takes the output of the combiner and performs partitioning. Partitioning is done on the basis of the key, so records with the same key go into the same partition, and each partition is then sent to one reducer. Partitioning in MapReduce execution allows even distribution of the map output over the reducers.

Shuffling and Sorting - After partitioning, the process of transferring data from the mappers to the reducers is called shuffling; it moves the map output to the reducers as their input. Each reducer gets one or more keys and their associated values. Shuffling is the physical movement of data over the network, and it can start even before the map phase has finished, which saves time and completes the job sooner. Once all mappers finish and their output has been shuffled to the reducer nodes, the framework merges this intermediate output and sorts it by key; the sorted data is then provided as input to the reduce phase. Sorting helps the reducer easily distinguish when a new reduce call should start. Shuffling and sorting occur simultaneously while the mapper intermediate output is fetched and summarized.

Order of execution: Map → Combiner → Partitioner → Shuffle & Sort → Reduce.

Skip shuffling and sorting - Shuffling and sorting in Hadoop MapReduce will not take place at all if you specify zero reducers (setNumReduceTasks(0)). If the number of reducers is zero, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so the map phase is faster).
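
A minimal map-only configuration sketch (the output path is illustrative):

// Map-only job: with zero reducers, map output is written straight to HDFS,
// so no combiner, partitioner, shuffle or sort runs.
job.setMapperClass(ExceptionCountMapper.class);
job.setNumReduceTasks(0);
FileOutputFormat.setOutputPath(job, new Path("/output/map-only"));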


Secondary Sorting in MapReduce - If we want to sort the reducer values, we use the secondary sorting technique. This technique enables us to sort the values (in ascending or descending order) passed to each reducer.
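
Driver-side wiring for secondary sort typically looks like the sketch below; CompositeKey, NaturalKeyPartitioner, CompositeKeyComparator and NaturalKeyGroupingComparator are hypothetical user classes, not part of Hadoop:

// Secondary sort sketch (all four referenced classes are hypothetical user classes).
job.setMapOutputKeyClass(CompositeKey.class);                        // natural key + value part
job.setPartitionerClass(NaturalKeyPartitioner.class);                // partition on the natural key only
job.setSortComparatorClass(CompositeKeyComparator.class);            // sort by natural key, then value part
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);  // group reducer input on the natural key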

Reference for further reading: https://techvidvan.com/tutorials/mapreduce-job-execution-flow/


Hadoop Configuration

6. Explain Hadoop important Configuration parameters.

For a lookup of deprecated properties, visit https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html

1. Customizing Configuration Files

core-site.xml configuration

Property: fs.defaultFS
Recommended value: hdfs://namenode-host.company.com:8020
Description: Specifies the NameNode and the default file system. The default file system is used to resolve relative paths.

hdfs-site.xml configuration

Property: dfs.permissions.superusergroup
Recommended value: hadoop
Description: Specifies the UNIX group containing users that will be treated as superusers by HDFS.
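
Client code picks these resource files up from the classpath automatically; a minimal sketch:

// Minimal sketch: core-site.xml / hdfs-site.xml on the classpath are loaded automatically.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);                 // resolved via fs.defaultFS
System.out.println("Default FS: " + fs.getUri());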

2. Configuring Local Storage Directories

Property: dfs.name.dir or dfs.namenode.name.dir
Configuration file location: hdfs-site.xml on the NameNode
Description: This property specifies the URIs of the directories where the NameNode stores its metadata and edit logs. Cloudera recommends that you specify at least two directories. One of these should be located on an NFS mount point, unless you will be using an HDFS HA configuration.

Property: dfs.data.dir or dfs.datanode.data.dir
Configuration file location: hdfs-site.xml on each DataNode
Description: This property specifies the URIs of the directories where the DataNode stores blocks.

Note: dfs.data.dir and dfs.name.dir are deprecated; use the alternatives mentioned above.


3. Best Practices for MapReduce Configuration


The configuration settings described below can reduce inherent latencies in MapReduce
execution. You set these values in mapred-site.xml.
mapred-site.xml
<!-- Send a heartbeat as soon as a task finishes -->
<property>
<name>mapreduce.tasktracker.outofband.heartbeat</name>
<value>true</value> <!-- default value is false -->
</property>

<!-- Tune the JobTracker heartbeat interval -->


<!-- The interval in ms at which the MR AppMaster should send heartbeats to
the ResourceManager -->
<property>
<name>mapreduce.jobtracker.heartbeat.interval.min</name>
<value>10</value>
</property>

<!-- Reduce the interval for JobClient status reports on single node systems
-->
<property>
<name>mapreduce.client.progressmonitor.pollinterval</name>
<value>10</value> <!-- default value is 1000 milliseconds -->
</property>

<!-- Start MapReduce JVMs immediately -->


<!-- Property specifies the proportion of Map tasks in a job that must be
completed before any Reduce tasks are scheduled. -->

<property>
<name>mapreduce.job.reduce.slowstart.completedmaps</name>
<value>0</value>
</property>

<!-- Enable Snappy for MapReduce intermediate compression for the whole
cluster -->
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>


<!-- heap memory configuration -->


<property>
<name>mapreduce.map.memory.mb</name>
<value>1536</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1024M</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>3072</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2560M</value>
</property>
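
The same settings can also be overridden per job from the driver; a minimal sketch (the values are illustrative, not recommendations):

// Per-job overrides of the mapred-site.xml tuning shown above.
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.output.compress", true);
conf.set("mapreduce.map.output.compress.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.0f);
Job job = Job.getInstance(conf, "tuned-job");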

4. Enabling WebHDFS
Set the following property in hdfs-site.xml:
hdfs-site.xml
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

<!-- To enable numeric usernames in WebHDFS -->


<property>
<name>dfs.webhdfs.user.provider.user.pattern</name>
<value>^[A-Za-z0-9_][A-Za-z0-9._-]*[$]?$</value>
</property>

Note: Try this - hdfs dfs -ls webhdfs://nameservice1:20101/
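
WebHDFS can also be used from Java through the ordinary FileSystem API; a minimal sketch (hostname and port are placeholders; 9870 is the default NameNode HTTP port in Hadoop 3, 50070 in Hadoop 2):

// Listing the HDFS root over WebHDFS (hostname and port are placeholders).
Configuration conf = new Configuration();
FileSystem webFs = FileSystem.get(
        URI.create("webhdfs://namenode-host.company.com:9870/"), conf);
for (FileStatus status : webFs.listStatus(new Path("/"))) {
    System.out.println(status.getPath());
}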

5. Configuring Compression Codec


Set the following property in core-site.xml:
core-site.xml
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.BZip2Codec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec,


org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

6. Enabling Trash

Parameter: fs.trash.interval
Value: minutes or 0
Description: The number of minutes after which a trash checkpoint directory is deleted. This option can be configured both on the server and the client. If trash is enabled in the server configuration, the value configured on the server is used and the client configuration is ignored. If trash is disabled in the server configuration, the client-side configuration is checked. If the value of this property is zero (the default), the trash feature is disabled.

Parameter: fs.trash.checkpoint.interval
Value: minutes or 0
Description: The number of minutes between trash checkpoints. Every time the checkpointer runs on the NameNode, it creates a new checkpoint of the "Current" directory and removes checkpoints older than fs.trash.interval minutes. This value should be smaller than or equal to fs.trash.interval. This option is configured on the server. If configured to zero (the default), the value is set to the value of fs.trash.interval.
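
Trash can also be exercised from client code; a rough sketch (the interval and path are illustrative, and it assumes the org.apache.hadoop.fs.Trash helper):

// Move a path to trash instead of deleting it permanently.
Configuration conf = new Configuration();
conf.setLong("fs.trash.interval", 1440);                     // keep trashed files for 24 hours
Trash trash = new Trash(FileSystem.get(conf), conf);
boolean moved = trash.moveToTrash(new Path("/user/vinod/old-data"));
System.out.println("moved to trash: " + moved);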

7. Configure Properties for YARN

7.1. Configure Properties for YARN Clusters

mapred-site.xml configuration

Property: mapreduce.framework.name
Recommended value: yarn
Description: If you plan on running YARN, you must set this property to the value yarn.

7.2. Configure YARN daemons

yarn-site.xml - The following are the most important properties that you must configure for your cluster in yarn-site.xml.

Property: yarn.nodemanager.aux-services
Recommended value: mapreduce_shuffle
Description: Shuffle service that needs to be set for MapReduce applications.

Property: yarn.resourcemanager.hostname
Recommended value: resourcemanager.company.com
Description: The following properties will be set to their default ports on this host: yarn.resourcemanager.address, yarn.resourcemanager.admin.address, yarn.resourcemanager.scheduler.address, yarn.resourcemanager.resource-tracker.address, yarn.resourcemanager.webapp.address.

Property: yarn.application.classpath
Recommended value: $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/*, $HADOOP_COMMON_HOME/lib/*, $HADOOP_HDFS_HOME/*, $HADOOP_HDFS_HOME/lib/*, $HADOOP_MAPRED_HOME/*, $HADOOP_MAPRED_HOME/lib/*, $HADOOP_YARN_HOME/*, $HADOOP_YARN_HOME/lib/*
Description: Classpath for typical applications.

Property: yarn.log-aggregation-enable
Recommended value: true
Description: Enables log aggregation.

Property: yarn.nodemanager.local-dirs
Recommended value: file:///data/1/yarn/local, file:///data/2/yarn/local, file:///data/3/yarn/local
Description: Specifies the URIs of the directories where the NodeManager stores its localized files. All of the files required for running a particular YARN application will be put here for the duration of the application run. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, file:///data/1/yarn/local through file:///data/N/yarn/local.

Property: yarn.nodemanager.log-dirs
Recommended value: file:///data/1/yarn/logs, file:///data/2/yarn/logs, file:///data/3/yarn/logs
Description: Specifies the URIs of the directories where the NodeManager stores container log files. Cloudera recommends that this property specify a directory on each of the JBOD mount points; for example, file:///data/1/yarn/logs through file:///data/N/yarn/logs.

Property: yarn.nodemanager.remote-app-log-dir
Recommended value: hdfs://namenode-host.company.com:8020/var/log/hadoop-yarn/apps
Description: Specifies the URI of the directory where logs are aggregated. Set the value to either hdfs://namenode-host.company.com:8020/var/log/hadoop-yarn/apps, using the fully qualified domain name of your NameNode host, or hdfs:/var/log/hadoop-yarn/apps.

7.3. Configure the JobHistory Server

mapred-site.xml configuration

Property: mapreduce.jobhistory.address
Recommended value: historyserver.company.com:10020
Description: The address (host:port) of the JobHistory Server.

Property: mapreduce.jobhistory.webapp.address
Recommended value: historyserver.company.com:19888
Description: The address (host:port) of the JobHistory Server web application.

Property: hadoop.proxyuser.mapred.groups
Recommended value: *
Description: Allows the mapred user to move files belonging to users in these groups.

Property: hadoop.proxyuser.mapred.hosts
Recommended value: *
Description: Allows the mapred user to move files belonging to users on these hosts.

7.4. Configure the Staging Directory

mapred-site.xml configuration

Property: yarn.app.mapreduce.am.staging-dir
Recommended value: /user
Description: YARN requires a staging directory for temporary files created by running jobs. By default it creates /tmp/hadoop-yarn/staging with restrictive permissions that may prevent your users from running jobs. To forestall this, you should configure and create the staging directory yourself.

Property: mapreduce.jobhistory.intermediate-done-dir
Recommended value: /user/tmp
Description: Set permissions on the intermediate-done-dir to 777.

Property: mapreduce.jobhistory.done-dir
Recommended value: /user/done
Description: Set permissions on the done-dir to 750.

Default Configuration
• https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
• https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
• https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
• https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml


Hadoop Configuration

7. Role of YARN, Hue, Application Manager, Node Manager in MapReduce

A. YARN
YARN is the resource management and job scheduling technology in the open source Hadoop
distributed processing framework. YARN is responsible for allocating system resources to the
various applications running in a Hadoop cluster and scheduling tasks to be executed on
different cluster nodes.

o YARN vs MapReduce v1 Comparison

- Type of processing: YARN (formerly known as MapReduce 2) supports real-time, batch and interactive processing with multiple engines; MapReduce v1 supports silo and batch processing with a single engine.
- Cluster resource optimization: excellent in YARN due to central resource management; average in MapReduce v1 due to fixed Map and Reduce slots.
- Suitable for: YARN runs MapReduce and non-MapReduce applications (multi-tenancy); MapReduce v1 runs only MapReduce applications.
- Managing cluster resources: done by YARN itself; in MapReduce v1 it is done by the JobTracker.
- Namespace: with YARN, Hadoop can support multiple namespaces; MapReduce v1 supports only one namespace, that is, HDFS.

MapReduce Exam 2019 - Solved Paper 21


MapReduce Exam 2019 - Solved Paper - Hadoop Configuration

B. HUE
Hue is a web-based interactive query editor in the Hadoop stack that lets you visualize and
share data. Hue brings the power of business intelligence (BI) and analytics to SQL developers.
It’s built to bridge the gap between IT and the business for trusted self-service analytics.
It allows us to:
- BI - Query Hive, Impala or HBase database tables
- View HDFS/Amazon S3 file system
- View Job Status
- Add/View Oozie Workflows
- Configure hive security (Database and Table privileges)

C. Application Manager, Application Master, Node Manager

The Resource Manager is a single point of failure in YARN. Using Application Masters, YARN spreads the metadata related to running applications over the cluster. This reduces the load on the Resource Manager and makes it quickly recoverable.

The Application Manager is responsible for maintaining a list of submitted applications. After an application is submitted by the client, the Application Manager first validates whether the resource requirement for the application's Application Master can be satisfied. If enough resources are available, it forwards the application to the scheduler; otherwise the application is rejected. It also makes sure that no other application is submitted with the same application ID.

Note: The Application Manager keeps a cache of completed applications, so that if a user requests application data via the web UI or command line at a later point in time, it can fulfil the request.


The Application Master is responsible for the execution of a single application. It asks for containers from the Resource Scheduler (Resource Manager) and executes specific programs (e.g., the main of a Java class) on the obtained containers. The Application Master knows the application logic and is thus framework-specific; the MapReduce framework provides its own implementation of an Application Master.

The Hadoop YARN Node Manager is the per-machine/per-node framework agent that is responsible for containers: it monitors their resource usage (memory, CPU), tracks node health, handles log management and auxiliary services, and reports all of this to the Resource Manager.

Reference for further reading: https://hortonworks.com/blog/apache-hadoop-yarn-concepts-and-applications/

Take the quiz at - https://data-flair.training/blogs/hadoop-mapreduce-quiz/

