copyFromLocal (or put): Copies files/folders from the local file system to the HDFS store. This is one of the
most important commands. "Local file system" means the files present on the machine's OS file system.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example:
Let's suppose we have a file AI.txt on the Desktop which we want to copy to the folder geeks present on
HDFS.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
cp: This command copies files within HDFS. Let's copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
mv: This command moves files within HDFS. Let's cut-paste a file myfile.txt from the
geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
rmr: This command deletes a file from HDFS recursively. It is very useful when you want to
delete a non-empty directory. (Note: -rmr is deprecated in newer Hadoop versions in favour of -rm -r.)
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It deletes all the content inside the directory, then the
directory itself.
stat: It gives the last modified time of a directory or path. In short, it gives the stats of the directory or
file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geek
Anatomy of Hadoop Datatypes-
Data types-
Hadoop provides classes that wrap the Java primitive types and implement the WritableComparable
and Writable Interfaces. They are provided in the org.apache.hadoop.io package.
All the Writable wrapper classes have a get() and a set() method for retrieving and storing the wrapped
value.
Primitive Writable Classes
These are Writable Wrappers for Java primitive data types and they hold a single primitive value that
can be set either at construction or via a setter method.
All these primitive writable wrappers have get() and set() methods to read or write the wrapped
value. Below is the list of primitive writable data types available in Hadoop.
BooleanWritable
ByteWritable
IntWritable
VIntWritable
FloatWritable
LongWritable
VLongWritable
DoubleWritable
VIntWritable and VLongWritable are used for variable-length integer types and variable-length long
types respectively.
The serialized sizes of the above primitive writable data types are the same as the sizes of the actual Java data types.
So, the size of IntWritable is 4 bytes and that of LongWritable is 8 bytes.
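To make the size claim concrete, here is a small Python sketch (illustration only, not Hadoop code): fixed-width packing mirrors the IntWritable/LongWritable sizes, and toy_vint is a hypothetical varint encoder that only shows why small values take fewer bytes; it is not Hadoop's actual VInt wire format.

```python
import struct

# Fixed-width serialization mirrors IntWritable (4 bytes) and LongWritable (8 bytes).
int_bytes = struct.pack('>i', 42)   # big-endian 4-byte int
long_bytes = struct.pack('>q', 42)  # big-endian 8-byte long

# A toy variable-length encoding (NOT Hadoop's actual VIntWritable format):
# small values occupy fewer bytes, which is the point of VInt/VLongWritable.
def toy_vint(n: int) -> bytes:
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if n == 0:
            return bytes(out)

print(len(int_bytes), len(long_bytes))           # 4 8
print(len(toy_vint(42)), len(toy_vint(100000)))  # 1 3
```

The fixed encodings always cost 4 or 8 bytes regardless of the value; the variable-length one pays only for the bytes a value actually needs.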
Array Writable Classes:
Hadoop provides two types of array writable classes, one for single-dimensional and another for two-
dimensional arrays. The elements of these arrays must be other Writable objects like IntWritable or
LongWritable, not Java native data types like int or float.
ArrayWritable
TwoDArrayWritable
NullWritable
NullWritable is a special type of Writable representing a null value. No bytes are read or written
when a data type is specified as NullWritable. So, in MapReduce, a key or a value can be declared as a
NullWritable when we don't need to use that field.
ObjectWritable
This is a general-purpose generic object wrapper which can store any objects like Java primitives,
String, Enum, Writable, null, or arrays.
Text
Text can be used as the Writable equivalent of java.lang.String, and its maximum size is 2 GB. Unlike
Java's String data type, Text is mutable in Hadoop.
BytesWritable
BytesWritable is a Writable wrapper for an array of binary data (a byte array).
GenericWritable
It is similar to ObjectWritable but supports only a few types. The user needs to subclass
GenericWritable and specify the types to support.
Mapper:--
The Mapper task is the first phase of processing; it processes each input record (from the RecordReader)
and generates an intermediate key-value pair. The Hadoop Mapper stores intermediate output on the
local disk.
The Hadoop Mapper task processes each input record and generates new <key, value> pairs. The <key,
value> pairs can be completely different from the input pair. The output of the mapper task is the full collection
of all these <key, value> pairs.
Before the output of each mapper task is written, it is partitioned on the basis of the key and
then sorted. This partitioning ensures that all the values for each key are grouped together.
The MapReduce framework generates one map task for each InputSplit generated by the InputFormat for the
job. The Mapper only understands <key, value> pairs of data, so before being passed to the mapper, data must
first be converted into <key, value> pairs.
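The record-to-pairs idea can be sketched in a few lines of Python (an illustrative stand-in for a Mapper, not Hadoop API code), using the classic WordCount mapping:

```python
def word_count_mapper(record):
    # one input record -> zero or more (key, value) pairs;
    # here, the classic WordCount mapping of (word, 1) per word
    offset, line = record
    for word in line.split():
        yield (word, 1)

# a record as a (byte offset, line) pair, as TextInputFormat would supply it
pairs = list(word_count_mapper((0, "deer bear river")))
print(pairs)  # [('deer', 1), ('bear', 1), ('river', 1)]
```

Note how the output pairs (word, count) have nothing to do with the input pair's (offset, line) types.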
Key value pair generation in Hadoop:
InputSplit – It is the logical representation of data. It describes a unit of work that contains a
single map task in a MapReduce program.
RecordReader – It communicates with the InputSplit and converts the data into key-value pairs
suitable for reading by the Mapper. By default, it uses TextInputFormat for converting data into
key-value pairs. The RecordReader communicates with the InputSplit until the file reading is
completed.
For example, if we have a block size of 128 MB and we expect 10 TB of input data, we will have roughly
82,000 maps (10 TB / 128 MB = 81,920). Thus, the InputFormat determines the number of maps.
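The arithmetic behind that figure can be checked directly (a sketch assuming binary units, not anything Hadoop computes for you):

```python
# Sanity-check the map-count arithmetic: one map task per 128 MB split of 10 TB.
TB = 1024 ** 4
MB = 1024 ** 2
input_size = 10 * TB
split_size = 128 * MB
num_maps = input_size // split_size
print(num_maps)  # 81920, i.e. roughly the "82,000 maps" quoted above
```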
Reducer:-
In Hadoop, the Reducer takes the output of the Mapper (intermediate key-value pairs) and processes each of
them to generate its output. The output of the reducer is the final output, which is stored in HDFS.
Usually, in the Hadoop Reducer, we do aggregation or summation-style computation.
The Reducer processes the output of the mapper and produces a new set of output, which HDFS then
stores. The Hadoop Reducer takes the set of intermediate key-value pairs produced by the
mapper as input and runs a reducer function on each of them.
One can aggregate, filter, and combine this data (key, value) in a number of ways for a wide range
of processing. The Reducer first processes the intermediate values for a particular key generated by the
map function and then generates the output (zero or more key-value pairs). Each key is sent to
exactly one reducer. Reducers run in parallel since they are independent of one
another. The user decides the number of reducers; by default, the number of reducers is 1.
Phases of MapReduce Reducer:
There are 3 phases of Reducer in Hadoop MapReduce: shuffle, sort, and reduce.
Shuffle and Sort Phase
In this phase, the sorted output from each mapper is the input to the Reducer. In the shuffle step, with
the help of HTTP, the framework fetches the relevant partition of the output of all the mappers; in the
sort step, these inputs are merged by key.
Reduce Phase
In this phase, after shuffling and sorting, the reduce task aggregates the key-value pairs.
The OutputCollector.collect() method writes the output of the reduce task to the filesystem.
Reducer output is not sorted.
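The three phases can be simulated in plain Python (an illustration, not the Hadoop API): an in-memory sort stands in for shuffle-and-sort, and the reducer sums the values grouped under each key.

```python
from itertools import groupby
from operator import itemgetter

def word_count_reducer(key, values):
    # reduce(): aggregate all values seen for one key
    yield (key, sum(values))

# "shuffle & sort": the framework groups mapper output by key before reducing
mapper_output = [("bear", 1), ("deer", 1), ("bear", 1), ("river", 1)]
mapper_output.sort(key=itemgetter(0))

result = []
for key, group in groupby(mapper_output, key=itemgetter(0)):
    result.extend(word_count_reducer(key, (v for _, v in group)))

print(result)  # [('bear', 2), ('deer', 1), ('river', 1)]
```

Each distinct key triggers exactly one reduce call, mirroring the "each key is sent to exactly one reducer" rule above.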
Partitioner:-
1. Hadoop Partitioner / MapReduce Partitioner
The Partitioner in MapReduce controls the partitioning of the keys of the intermediate mapper
output. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The
total number of partitions is the same as the number of reduce tasks for the job.
The output of each mapper is partitioned by key, and records having the same key value go into the
same partition (within each mapper); each partition is then sent to a reducer. The Partitioner class
determines which partition a given (key, value) pair will go to. The partition phase takes place after the map
phase and before the reduce phase.
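The default behaviour can be sketched in Python. Hadoop's real HashPartitioner uses Java's key.hashCode(); Python's hash() merely stands in for it in this illustration.

```python
def hash_partition(key: str, num_reducers: int) -> int:
    # same shape as Hadoop's default HashPartitioner:
    #   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    # Python's hash() stands in for Java's hashCode() here.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

partition = hash_partition("bear", 3)
assert 0 <= partition < 3
# a given key always lands in the same partition, so all of its
# values reach the same reducer
assert hash_partition("bear", 3) == hash_partition("bear", 3)
```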
Poor Partitioning in Hadoop MapReduce:-
If, in the input data, one key appears far more often than any other key, we can use two mechanisms to
send data to partitions: the frequently appearing key is sent to one partition, and all the other keys are
sent to partitions according to their hashCode().
But if the hashCode() method does not uniformly distribute the other keys' data over the partition range,
data will not be evenly sent to the reducers. Poor partitioning of data means that some reducers will
have more input data than others, i.e., they will have more work to do than other reducers. So, the
entire job will wait for one reducer to finish its extra-large share of the load.
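The effect of a hot key can be simulated with a toy partitioner (Python's hash() standing in for Java's hashCode(); illustrative only):

```python
from collections import Counter

def hash_partition(key: str, num_reducers: int) -> int:
    # stand-in for Hadoop's HashPartitioner (Python hash(), not Java hashCode())
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# 90 of 100 records share one hot key; every "hot" record lands in the
# same partition, so one reducer receives at least 90 records
keys = ["hot"] * 90 + [f"k{i}" for i in range(10)]
loads = Counter(hash_partition(k, 4) for k in keys)
print(max(loads.values()))  # >= 90: the skewed reducer's share of the work
```

However evenly the remaining ten keys hash, the reducer that owns the hot key dominates the job's runtime.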
Combiner:---
The Combiner is also known as a "Mini-Reducer"; it summarizes the Mapper output records with the
same key before they are passed to the Reducer.
When we run a MapReduce job on a large dataset, the Mapper generates large chunks of intermediate
data, which are passed on to the Reducer for further processing; this leads to enormous network
congestion. The MapReduce framework provides a function known as the Hadoop Combiner
that plays a key role in reducing this network congestion. The primary job of the Combiner is to
process the output data from the Mapper before passing it to the Reducer.
In the above diagram, no combiner is used. The input is split across two mappers and 9 keys are generated
by the mappers. With these 9 intermediate key/value pairs, the mappers send the data
directly to the reducer, and sending data to the reducer consumes network
bandwidth (bandwidth here meaning the time taken to transfer data between two machines). It takes more
time to transfer the data to the reducer if the size of the data is big.
Now, if we use a Hadoop combiner between the mapper and the reducer, the combiner aggregates the
intermediate data (9 key/value pairs) before sending it to the reducer and generates 4 key/value pairs as
output.
The Hadoop Combiner reduces the time taken for data transfer between mapper and reducer,
and it decreases the amount of data that needs to be processed by the reducer.
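A toy simulation of this effect (plain Python, not Hadoop code): each mapper's pairs are partially aggregated locally before the shuffle, so fewer pairs cross the network.

```python
from collections import Counter

def combine(pairs):
    # per-mapper partial aggregation, applied before the shuffle
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

mapper1 = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]
mapper2 = [("b", 1), ("c", 1), ("c", 1), ("b", 1)]
combined = combine(mapper1) + combine(mapper2)
# 9 intermediate pairs shrink to 4 before crossing the network
print(len(mapper1) + len(mapper2), len(combined))  # 9 4
```

The reducer still sees every key's total contribution, only bundled into fewer pairs, which is why combining works for associative, commutative operations like summation.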
Disadvantages of Hadoop combiner in MapReduce:
There are also some disadvantages of the Hadoop Combiner:
MapReduce jobs cannot depend on the Hadoop Combiner's execution, because there is no
guarantee that it will run.
Hadoop may store the key-value pairs in the local filesystem and run the Combiner
later, which causes expensive disk I/O.
Hadoop InputFormat:
InputFormat checks the input specification of the job. InputFormat splits the input file into
InputSplits and assigns each split to an individual Mapper.
The Hadoop InputFormat is the first component in MapReduce; it is responsible for creating the
input splits and dividing them into records. If you are not familiar with the MapReduce job flow,
follow our Hadoop MapReduce data flow tutorial for more understanding.
Initially, the data for a MapReduce task is stored in input files, which typically reside
in HDFS. Although the format of these files is arbitrary, line-based log files and binary formats can be
used. Using the InputFormat, we define how these input files are split and read. The InputFormat class
is one of the fundamental classes in the Hadoop MapReduce framework and provides the
following functionality:
The files or other objects that should be used for input are selected by the InputFormat.
InputFormat defines the data splits, which determine both the size of individual map tasks and
their potential execution servers.
InputFormat defines the RecordReader, which is responsible for reading actual records from
the input files.
Types of InputFormat
FileInputFormat in Hadoop:
It is the base class for all file-based InputFormats. Hadoop FileInputFormat specifies the input
directory where the data files are located. When we start a Hadoop job, FileInputFormat is provided
with a path containing the files to read.
TextInputFormat:
It is the default InputFormat of MapReduce. TextInputFormat treats each line of each input file as a
separate record and performs no parsing. This is useful for unformatted data or line-based records
like log files.
Key – It is the byte offset of the beginning of the line within the file (not the whole file, just one split),
so it will be unique if combined with the file name.
Value – It is the contents of the line, excluding line terminators.
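A minimal Python sketch of this behaviour (an illustration of LineRecordReader's contract, not its implementation):

```python
def text_records(data: bytes):
    # yields (byte offset, line) pairs, like TextInputFormat's record reader:
    # the key is the offset of the line's first byte, the value is the line
    # contents with the terminator stripped
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield (offset, raw.rstrip(b"\r\n").decode())
        offset += len(raw)

records = list(text_records(b"hello\nworld\n"))
print(records)  # [(0, 'hello'), (6, 'world')]
```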
KeyValueTextInputFormat:
It is similar to TextInputFormat in that it also treats each line of input as a separate record. But while
TextInputFormat treats the entire line as the value, KeyValueTextInputFormat breaks the line
into a key and a value at a tab character ('\t').
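The split rule can be sketched as (illustrative Python, not the Hadoop implementation):

```python
def key_value_split(line: str):
    # break the line at the FIRST tab, like KeyValueTextInputFormat:
    # everything before it is the key, everything after it is the value
    key, sep, value = line.partition("\t")
    return (key, value) if sep else (line, "")

pair = key_value_split("user42\tclicked\thome")
print(pair)  # ('user42', 'clicked\thome')
```

A line with no tab at all becomes the key, with an empty value.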
SequenceFileInputFormat:
It is an InputFormat for reading sequence files, Hadoop's binary file format for storing sequences of
key-value pairs.
SequenceFileAsBinaryInputFormat:
It is a variant of SequenceFileInputFormat that retrieves the sequence file's keys and values as opaque
binary objects.
NLineInputFormat:
Hadoop NLineInputFormat is another form of TextInputFormat where the keys are the byte offsets of
the lines and the values are the contents of the lines. With TextInputFormat and
KeyValueTextInputFormat, each mapper receives a variable number of lines of input, and the number
depends on the size of the split and the length of the lines. If we want our mapper to receive a fixed
number of lines of input, we use NLineInputFormat.
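A sketch of the splitting rule (illustrative Python; the real class also handles split boundaries within files):

```python
def n_line_splits(lines, n):
    # each split (and therefore each mapper) gets exactly n lines,
    # except possibly the last split
    return [lines[i:i + n] for i in range(0, len(lines), n)]

splits = n_line_splits(["l1", "l2", "l3", "l4", "l5"], 2)
print(splits)  # [['l1', 'l2'], ['l3', 'l4'], ['l5']]
```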
DBInputFormat:
Hadoop DBInputFormat is an InputFormat that reads data from a relational database, using JDBC.
As it doesn't have sharding capabilities, we need to be careful not to swamp the database we are
reading from by running too many mappers.
Record Writer:
RecordWriter writes the output key-value pairs from the Reducer phase to output files.
Hadoop RecordWriter takes the output data from the Reducer and writes it to output files. The way
these output key-value pairs are written to output files by the RecordWriter is determined by the
OutputFormat.
Output Format:
The OutputFormat and InputFormat functions are alike. OutputFormat instances provided by
Hadoop are used to write files to HDFS or the local disk. OutputFormat describes the output
specification for a MapReduce job. On the basis of the output specification:
the MapReduce job checks that the output directory does not already exist;
OutputFormat provides the RecordWriter implementation used to write the output files of the
job. Output files are stored in a FileSystem.
TextOutputFormat
The default Hadoop reducer OutputFormat in MapReduce is TextOutputFormat, which writes (key, value) pairs on
individual lines of text files. Its keys and values can be of any type, since TextOutputFormat turns them into
strings by calling toString() on them. Each key-value pair is separated by a tab character, which can be
changed using the mapreduce.output.textoutputformat.separator property.
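A sketch of the line format (illustrative Python, with str() standing in for Java's toString()):

```python
def text_output_line(key, value, separator="\t"):
    # TextOutputFormat calls toString() on keys and values;
    # Python's str() plays that role in this sketch
    return f"{key}{separator}{value}"

line = text_output_line("bear", 2)
print(repr(line))  # 'bear\t2'
```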
SequenceFileAsBinaryOutputFormat
It is another form of SequenceFileOutputFormat which writes keys and values to a sequence file in
binary format.
SequenceFileOutputFormat
It is an OutputFormat which writes sequence files for its output. It is an intermediate format for use
between MapReduce jobs: it rapidly serializes arbitrary data types to the file, and the
corresponding SequenceFileInputFormat will deserialize the file into the same types and present
the data to the next mapper. Compression is controlled by the static methods on
SequenceFileOutputFormat.
MapFileOutputFormat
It is another form of FileOutputFormat in the Hadoop OutputFormats, which is used to write output as
map files. The keys in a MapFile must be added in order, so we need to ensure that the reducer emits
keys in sorted order.
MultipleOutputs
It allows writing data to files whose names are derived from the output keys and values, or in fact
from an arbitrary string.
LazyOutputFormat
FileOutputFormat will create output files even if they are empty. LazyOutputFormat is a wrapper
OutputFormat which ensures that an output file is created only when a record is emitted for
a given partition.
DBOutputFormat
It is an OutputFormat for writing the job's output to a relational database, using JDBC.
YARN:
Hadoop YARN knits the storage unit of Hadoop, i.e., HDFS (Hadoop Distributed File System), with
the various processing tools. For those of you who are completely new to this topic, YARN stands
for "Yet Another Resource Negotiator".
Why YARN?
In Hadoop 1.0, MapReduce performed both processing and resource-management functions. It consisted
of a Job Tracker, which was the single master. The Job Tracker allocated resources, performed
scheduling, and monitored the processing jobs. It assigned map and reduce tasks to a number of
subordinate processes called Task Trackers, which periodically reported their
progress to the Job Tracker.
This design resulted in a scalability bottleneck due to the single Job Tracker: the practical limits of
such a design are reached with a cluster of about 5,000 nodes and 40,000 tasks running concurrently.
To overcome all these issues, YARN was introduced in Hadoop version 2.0 in 2012 by
Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over the
responsibility of resource management and job scheduling.
Limitations of MapReduce:
Real-time processing.
When your intermediate processes need to talk to each other (jobs run in isolation).
When you can get the desired result with a standalone system; it's obviously less painful to
configure and manage a standalone system than a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short online transactions.
Yarn Architecture
YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove
the bottleneck of the Job Tracker, which was present in Hadoop 1.0. YARN was described as a
"Redesigned Resource Manager" at the time of its launch, but it has since evolved into a
large-scale distributed operating system for big data processing.
YARN also allows different data processing engines like graph processing, interactive processing,
stream processing as well as batch processing to run and process data stored in HDFS (Hadoop
Distributed File System) thus making the system much more efficient. Through its various
components, it can dynamically allocate various resources and schedule the application processing.
For large volume data processing, it is quite necessary to manage the available resources properly
so that every application can leverage them.
Multi-tenancy: It allows multiple engine access thus giving organizations a benefit of multi-
tenancy.
Hadoop YARN architecture
The Apache YARN framework consists of a master daemon known as the "Resource Manager", a slave
daemon called the Node Manager (one per slave node), and an Application Master (one per application).
Resource Manager
It is the master daemon of YARN. The RM manages the global assignment of resources (CPU and
memory) among all the applications and arbitrates system resources between competing
applications. Follow the Resource Manager guide to learn about the YARN Resource Manager in great detail.
The Resource Manager has two main components:
Scheduler
Application Manager
a) Scheduler
The scheduler is responsible for allocating resources to the running applications. The scheduler
is a pure scheduler, meaning it performs no monitoring or tracking of applications and offers no
guarantees about restarting failed tasks, whether they fail due to application failure or hardware
failure.
b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible for starting application
masters and for monitoring and restarting them on different nodes in case of failures.
Node Manager
It is the slave daemon of YARN. The NM is responsible for containers, monitoring their resource usage
and reporting the same to the ResourceManager, and it manages the user processes on that machine. The
YARN NodeManager also tracks the health of the node on which it is running. The design also allows
plugging long-running auxiliary services into the NM; these are application-specific services,
specified as part of the configuration and loaded by the NM during startup. Shuffle is a typical
auxiliary service provided by the NMs for MapReduce applications on YARN.
Application Master
One Application Master runs per application. It negotiates resources from the Resource Manager and
works with the Node Manager. It manages the application life cycle.
The AM acquires containers from the RM's Scheduler before contacting the corresponding NMs to
start the application's individual tasks.
Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single
node. Containers are invoked via a Container Launch Context (CLC), which is a record that
contains information such as environment variables, security tokens, dependencies, etc.
Advantages of YARN:
The YARN framework, introduced in Hadoop 2.0, is meant to share the responsibilities of
MapReduce and take care of the cluster management task. This allows MapReduce to execute data
processing only and hence, streamline the process.
YARN brings in the concept of a central resource management. This allows multiple applications
to run on Hadoop, sharing a common resource management.
ResourceManager – The ResourceManager component is the negotiator in a cluster for all the
resources present in that cluster. It includes an application manager component, which is responsible
for managing user jobs. From Hadoop 2.0, any MapReduce job is considered an application.
ApplicationMaster – This component is the place in which a job or application lives. It manages
the MapReduce job and is terminated after the job processing is complete.
NodeManager – The NodeManager component runs on each worker node. It launches and monitors
the containers in which application tasks run, tracks their resource usage, and reports this
information, along with the health of the node, to the ResourceManager.
Keeping in mind that the YARN framework has different components to manage the different
tasks, let’s see how it counters the limitations of Hadoop 1.0.
Better utilization of resources – The YARN framework does not have any fixed slots for tasks. It
provides a central resource manager which allows you to share multiple applications through a
common resource.
JobTracker no longer exists – The two major roles of the JobTracker, resource management and
job scheduling/monitoring, are now split between the ResourceManager and the per-application
ApplicationMaster.
HIVE:
What is Hive?
Hive is an ETL and data warehousing tool developed on top of the Hadoop Distributed File System
(HDFS). Hive makes it easy to perform operations like
Data encapsulation
Ad-hoc queries
Analysis of huge datasets
Important characteristics of Hive
In Hive, tables and databases are created first, and then data is loaded into these tables.
Hive is a data warehouse designed for managing and querying only structured data that is
stored in tables.
While dealing with structured data, MapReduce doesn't have optimization and usability features
like UDFs, but the Hive framework does. Query optimization refers to an effective way of executing
queries in terms of performance.
Hive's SQL-inspired language separates the user from the complexity of Map Reduce
programming. It reuses familiar concepts from the relational database world, such as tables, rows,
columns and schema, etc. for ease of learning.
Hadoop's programming works on flat files. So, Hive can use directory structures to "partition" data
to improve performance on certain queries.
A new and important component of Hive is the Metastore, which is used for storing schema
information and typically resides in a relational database. We can interact with Hive using methods like:
Web GUI
Java Database Connectivity (JDBC) interface
Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI to
write Hive queries using the Hive Query Language (HQL).
Generally, HQL syntax is similar to the SQL syntax that most data analysts are familiar with. The
sample query below displays all the records present in the mentioned table:
Select * from <TableName>;
Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC, and
RCFILE (Record Columnar File).
For single-user metadata storage Hive uses the Derby database, and for multi-user or shared
metadata Hive uses MySQL.