copyFromLocal (or put): Copies files/folders from the local file system to the HDFS store. This is one of the
most important commands. "Local file system" means the files present on the machine's OS file system.
Syntax:
bin/hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>
Example:
Let's suppose we have a file AI.txt on the Desktop which we want to copy to the folder geeks present on
HDFS.
bin/hdfs dfs -copyFromLocal ../Desktop/AI.txt /geeks
cp: This command copies files within HDFS. Let's copy the folder geeks to geeks_copied.
Syntax:
bin/hdfs dfs -cp <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -cp /geeks /geeks_copied
mv: This command moves files within HDFS. Let's cut-paste a file myfile.txt from the
geeks folder to geeks_copied.
Syntax:
bin/hdfs dfs -mv <src(on hdfs)> <dest(on hdfs)>
Example:
bin/hdfs dfs -mv /geeks/myfile.txt /geeks_copied
rmr: This command deletes a file from HDFS recursively. It is very useful when you want to
delete a non-empty directory. (Note: -rmr is deprecated in newer Hadoop versions in favour of -rm -r.)
Syntax:
bin/hdfs dfs -rmr <filename/directoryName>
Example:
bin/hdfs dfs -rmr /geeks_copied -> It deletes all the content inside the directory, then the
directory itself.
stat: It gives the last modified time of a directory or path. In short, it gives the stats of the directory or
file.
Syntax:
bin/hdfs dfs -stat <hdfs file>
Example:
bin/hdfs dfs -stat /geek
Anatomy of Hadoop Datatypes-
Data types-
Hadoop provides classes that wrap the Java primitive types and implement the WritableComparable
and Writable Interfaces. They are provided in the org.apache.hadoop.io package.
All the Writable wrapper classes have a get() and a set() method for retrieving and storing the wrapped
value.
Primitive Writable Classes
These are Writable Wrappers for Java primitive data types and they hold a single primitive value that
can be set either at construction or via a setter method.
All these primitive writable wrappers have get() and set() methods to read or write the wrapped
value. Below is the list of primitive writable data types available in Hadoop.
BooleanWritable
ByteWritable
IntWritable
VIntWritable
FloatWritable
LongWritable
VLongWritable
DoubleWritable
VIntWritable and VLongWritable are used for variable-length integer types and variable-length long
types respectively.
The serialized sizes of the above primitive writable data types are the same as the sizes of the actual Java data types.
So, the size of IntWritable is 4 bytes and that of LongWritable is 8 bytes.
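To make the size claim concrete, here is a small Python sketch (illustration only, not Hadoop code): fixed-width packing mirrors the IntWritable/LongWritable sizes, and toy_vint is a hypothetical varint encoder that only shows why small values take fewer bytes; it is not Hadoop's actual VInt wire format.

```python
import struct

# Fixed-width serialization mirrors IntWritable (4 bytes) and LongWritable (8 bytes).
int_bytes = struct.pack('>i', 42)   # big-endian 4-byte int
long_bytes = struct.pack('>q', 42)  # big-endian 8-byte long

# A toy variable-length encoding (NOT Hadoop's actual VIntWritable format):
# small values occupy fewer bytes, which is the point of VInt/VLongWritable.
def toy_vint(n: int) -> bytes:
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if n == 0:
            return bytes(out)

print(len(int_bytes), len(long_bytes))           # 4 8
print(len(toy_vint(42)), len(toy_vint(100000)))  # 1 3
```

The fixed encodings always cost 4 or 8 bytes regardless of the value; the variable-length one pays only for the bytes a value actually needs.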
Array Writable Classes:
Hadoop provides two types of array writable classes, one for single-dimensional and another for two-
dimensional arrays. The elements of these arrays must be other Writable objects like IntWritable or
LongWritable, not Java native data types like int or float.
ArrayWritable
TwoDArrayWritable
NullWritable
NullWritable is a special type of Writable representing a null value. No bytes are read or written
when a data type is specified as NullWritable. So, in MapReduce, a key or a value can be declared as a
NullWritable when we don't need to use that field.
ObjectWritable
This is a general-purpose generic object wrapper which can store any objects like Java primitives,
String, Enum, Writable, null, or arrays.
Text
Text can be used as the Writable equivalent of java.lang.String, and its maximum size is 2 GB. Unlike
Java's String data type, Text is mutable in Hadoop.
BytesWritable
BytesWritable is a Writable wrapper for an array of binary data (a byte array).
GenericWritable
It is similar to ObjectWritable but supports only a few types. The user needs to subclass
GenericWritable and specify the types to support.
Mapper:--
The Mapper task is the first phase of processing; it processes each input record (from the RecordReader)
and generates an intermediate key-value pair. The Hadoop Mapper stores intermediate output on the
local disk.
The Hadoop Mapper task processes each input record and generates new <key, value> pairs. The <key,
value> pairs can be completely different from the input pair. The output of the mapper task is the full collection
of all these <key, value> pairs.
Before the output of each mapper task is written, it is partitioned on the basis of the key and
then sorted. This partitioning ensures that all the values for each key are grouped together.
The MapReduce framework generates one map task for each InputSplit generated by the InputFormat for the
job. The Mapper only understands <key, value> pairs of data, so before being passed to the mapper, data must
first be converted into <key, value> pairs.
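The record-to-pairs idea can be sketched in a few lines of Python (an illustrative stand-in for a Mapper, not Hadoop API code), using the classic WordCount mapping:

```python
def word_count_mapper(record):
    # one input record -> zero or more (key, value) pairs;
    # here, the classic WordCount mapping of (word, 1) per word
    offset, line = record
    for word in line.split():
        yield (word, 1)

# a record as a (byte offset, line) pair, as TextInputFormat would supply it
pairs = list(word_count_mapper((0, "deer bear river")))
print(pairs)  # [('deer', 1), ('bear', 1), ('river', 1)]
```

Note how the output pairs (word, count) have nothing to do with the input pair's (offset, line) types.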
Key value pair generation in Hadoop:
InputSplit – It is the logical representation of data. It describes a unit of work that contains a
single map task in a MapReduce program.
RecordReader – It communicates with the InputSplit and converts the data into key-value pairs
suitable for reading by the Mapper. By default, it uses TextInputFormat for converting data into
key-value pairs. The RecordReader communicates with the InputSplit until the file reading is
completed.
For example, if we have a block size of 128 MB and we expect 10 TB of input data, we will have roughly
82,000 maps (10 TB / 128 MB = 81,920). Thus, the InputFormat determines the number of maps.
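The arithmetic behind that figure can be checked directly (a sketch assuming binary units, not anything Hadoop computes for you):

```python
# Sanity-check the map-count arithmetic: one map task per 128 MB split of 10 TB.
TB = 1024 ** 4
MB = 1024 ** 2
input_size = 10 * TB
split_size = 128 * MB
num_maps = input_size // split_size
print(num_maps)  # 81920, i.e. roughly the "82,000 maps" quoted above
```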
Reducer:-
In Hadoop, the Reducer takes the output of the Mapper (intermediate key-value pairs) and processes each of
them to generate its output. The output of the reducer is the final output, which is stored in HDFS.
Usually, in the Hadoop Reducer, we do aggregation or summation-style computation.
The Reducer processes the output of the mapper and produces a new set of output, which HDFS then
stores. The Hadoop Reducer takes the set of intermediate key-value pairs produced by the
mapper as input and runs a reducer function on each of them.
One can aggregate, filter, and combine this data (key, value) in a number of ways for a wide range
of processing. The Reducer first processes the intermediate values for a particular key generated by the
map function and then generates the output (zero or more key-value pairs). Each key is sent to
exactly one reducer. Reducers run in parallel since they are independent of one
another. The user decides the number of reducers; by default, the number of reducers is 1.
Phases of MapReduce Reducer:
There are 3 phases of Reducer in Hadoop MapReduce: shuffle, sort, and reduce.
Shuffle and Sort Phase
In this phase, the sorted output from each mapper is the input to the Reducer. In the shuffle step, with
the help of HTTP, the framework fetches the relevant partition of the output of all the mappers; in the
sort step, these inputs are merged by key.
Reduce Phase
In this phase, after shuffling and sorting, the reduce task aggregates the key-value pairs.
The OutputCollector.collect() method writes the output of the reduce task to the filesystem.
Reducer output is not sorted.
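The three phases can be simulated in plain Python (an illustration, not the Hadoop API): an in-memory sort stands in for shuffle-and-sort, and the reducer sums the values grouped under each key.

```python
from itertools import groupby
from operator import itemgetter

def word_count_reducer(key, values):
    # reduce(): aggregate all values seen for one key
    yield (key, sum(values))

# "shuffle & sort": the framework groups mapper output by key before reducing
mapper_output = [("bear", 1), ("deer", 1), ("bear", 1), ("river", 1)]
mapper_output.sort(key=itemgetter(0))

result = []
for key, group in groupby(mapper_output, key=itemgetter(0)):
    result.extend(word_count_reducer(key, (v for _, v in group)))

print(result)  # [('bear', 2), ('deer', 1), ('river', 1)]
```

Each distinct key triggers exactly one reduce call, mirroring the "each key is sent to exactly one reducer" rule above.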
Partitioner:-
1. Hadoop Partitioner / MapReduce Partitioner
The Partitioner in MapReduce controls the partitioning of the keys of the intermediate mapper
output. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The
total number of partitions is the same as the number of reduce tasks for the job.
The output of each mapper is partitioned by key, and records having the same key value go into the
same partition (within each mapper); each partition is then sent to a reducer. The Partitioner class
determines which partition a given (key, value) pair will go to. The partition phase takes place after the map
phase and before the reduce phase.
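The default behaviour can be sketched in Python. Hadoop's real HashPartitioner uses Java's key.hashCode(); Python's hash() merely stands in for it in this illustration.

```python
def hash_partition(key: str, num_reducers: int) -> int:
    # same shape as Hadoop's default HashPartitioner:
    #   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    # Python's hash() stands in for Java's hashCode() here.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

partition = hash_partition("bear", 3)
assert 0 <= partition < 3
# a given key always lands in the same partition, so all of its
# values reach the same reducer
assert hash_partition("bear", 3) == hash_partition("bear", 3)
```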
Poor Partitioning in Hadoop MapReduce:-
If, in the input data, one key appears far more often than any other key, we can use two mechanisms to
send data to partitions: the frequently appearing key is sent to one partition, and all the other keys are
sent to partitions according to their hashCode().
But if the hashCode() method does not uniformly distribute the other keys' data over the partition range,
data will not be evenly sent to the reducers. Poor partitioning of data means that some reducers will
have more input data than others, i.e., they will have more work to do than other reducers. So, the
entire job will wait for one reducer to finish its extra-large share of the load.
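The effect of a hot key can be simulated with a toy partitioner (Python's hash() standing in for Java's hashCode(); illustrative only):

```python
from collections import Counter

def hash_partition(key: str, num_reducers: int) -> int:
    # stand-in for Hadoop's HashPartitioner (Python hash(), not Java hashCode())
    return (hash(key) & 0x7FFFFFFF) % num_reducers

# 90 of 100 records share one hot key; every "hot" record lands in the
# same partition, so one reducer receives at least 90 records
keys = ["hot"] * 90 + [f"k{i}" for i in range(10)]
loads = Counter(hash_partition(k, 4) for k in keys)
print(max(loads.values()))  # >= 90: the skewed reducer's share of the work
```

However evenly the remaining ten keys hash, the reducer that owns the hot key dominates the job's runtime.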
Combiner:---
The Combiner is also known as a "Mini-Reducer"; it summarizes the Mapper output records with the
same key before they are passed to the Reducer.
When we run a MapReduce job on a large dataset, the Mapper generates large chunks of intermediate
data, which are passed on to the Reducer for further processing; this leads to enormous network
congestion. The MapReduce framework provides a function known as the Hadoop Combiner
that plays a key role in reducing this network congestion. The primary job of the Combiner is to
process the output data from the Mapper before passing it to the Reducer.
In the above diagram, no combiner is used. The input is split across two mappers and 9 keys are generated
by the mappers. With these 9 intermediate key/value pairs, the mappers send the data
directly to the reducer, and sending data to the reducer consumes network
bandwidth (bandwidth here meaning the time taken to transfer data between two machines). It takes more
time to transfer the data to the reducer if the size of the data is big.
Now, if we use a Hadoop combiner between the mapper and the reducer, the combiner aggregates the
intermediate data (9 key/value pairs) before sending it to the reducer and generates 4 key/value pairs as
output.
The Hadoop Combiner reduces the time taken for data transfer between mapper and reducer,
and it decreases the amount of data that needs to be processed by the reducer.
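A toy simulation of this effect (plain Python, not Hadoop code): each mapper's pairs are partially aggregated locally before the shuffle, so fewer pairs cross the network.

```python
from collections import Counter

def combine(pairs):
    # per-mapper partial aggregation, applied before the shuffle
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return sorted(counts.items())

mapper1 = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]
mapper2 = [("b", 1), ("c", 1), ("c", 1), ("b", 1)]
combined = combine(mapper1) + combine(mapper2)
# 9 intermediate pairs shrink to 4 before crossing the network
print(len(mapper1) + len(mapper2), len(combined))  # 9 4
```

The reducer still sees every key's total contribution, only bundled into fewer pairs, which is why combining works for associative, commutative operations like summation.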
Disadvantages of Hadoop combiner in MapReduce:
There are also some disadvantages of the Hadoop Combiner:
MapReduce jobs cannot depend on the Hadoop Combiner's execution, because there is no
guarantee that it will run.
Hadoop may store the key-value pairs in the local filesystem and run the Combiner
later, which causes expensive disk I/O.
Hadoop InputFormat:
InputFormat checks the input specification of the job. InputFormat splits the input file into
InputSplits and assigns each split to an individual Mapper.
The Hadoop InputFormat is the first component in MapReduce; it is responsible for creating the
input splits and dividing them into records. If you are not familiar with the MapReduce job flow,
follow our Hadoop MapReduce data flow tutorial for more understanding.
Initially, the data for a MapReduce task is stored in input files, which typically reside
in HDFS. Although the format of these files is arbitrary, line-based log files and binary formats can be
used. Using the InputFormat, we define how these input files are split and read. The InputFormat class
is one of the fundamental classes in the Hadoop MapReduce framework and provides the
following functionality:
The files or other objects that should be used for input are selected by the InputFormat.
InputFormat defines the data splits, which determine both the size of individual map tasks and
their potential execution servers.
InputFormat defines the RecordReader, which is responsible for reading actual records from
the input files.
Types of InputFormat
FileInputFormat in Hadoop:
It is the base class for all file-based InputFormats. Hadoop FileInputFormat specifies the input
directory where the data files are located. When we start a Hadoop job, FileInputFormat is provided
with a path containing the files to read.
TextInputFormat:
It is the default InputFormat of MapReduce. TextInputFormat treats each line of each input file as a
separate record and performs no parsing. This is useful for unformatted data or line-based records
like log files.
Key – It is the byte offset of the beginning of the line within the file (not the whole file, just one split),
so it will be unique if combined with the file name.
Value – It is the contents of the line, excluding line terminators.
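A minimal Python sketch of this behaviour (an illustration of LineRecordReader's contract, not its implementation):

```python
def text_records(data: bytes):
    # yields (byte offset, line) pairs, like TextInputFormat's record reader:
    # the key is the offset of the line's first byte, the value is the line
    # contents with the terminator stripped
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield (offset, raw.rstrip(b"\r\n").decode())
        offset += len(raw)

records = list(text_records(b"hello\nworld\n"))
print(records)  # [(0, 'hello'), (6, 'world')]
```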
KeyValueTextInputFormat:
It is similar to TextInputFormat in that it also treats each line of input as a separate record. But while
TextInputFormat treats the entire line as the value, KeyValueTextInputFormat breaks the line
into a key and a value at a tab character ('\t').
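The split rule can be sketched as (illustrative Python, not the Hadoop implementation):

```python
def key_value_split(line: str):
    # break the line at the FIRST tab, like KeyValueTextInputFormat:
    # everything before it is the key, everything after it is the value
    key, sep, value = line.partition("\t")
    return (key, value) if sep else (line, "")

pair = key_value_split("user42\tclicked\thome")
print(pair)  # ('user42', 'clicked\thome')
```

A line with no tab at all becomes the key, with an empty value.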
SequenceFileInputFormat:
It is an InputFormat for reading sequence files, Hadoop's binary file format for storing sequences of
key-value pairs.
SequenceFileAsBinaryInputFormat:
It is a variant of SequenceFileInputFormat that retrieves the sequence file's keys and values as opaque
binary objects.
NLineInputFormat:
Hadoop NLineInputFormat is another form of TextInputFormat where the keys are the byte offsets of
the lines and the values are the contents of the lines. With TextInputFormat and
KeyValueTextInputFormat, each mapper receives a variable number of lines of input, and the number
depends on the size of the split and the length of the lines. If we want our mapper to receive a fixed
number of lines of input, we use NLineInputFormat.
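A sketch of the splitting rule (illustrative Python; the real class also handles split boundaries within files):

```python
def n_line_splits(lines, n):
    # each split (and therefore each mapper) gets exactly n lines,
    # except possibly the last split
    return [lines[i:i + n] for i in range(0, len(lines), n)]

splits = n_line_splits(["l1", "l2", "l3", "l4", "l5"], 2)
print(splits)  # [['l1', 'l2'], ['l3', 'l4'], ['l5']]
```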
DBInputFormat:
Hadoop DBInputFormat is an InputFormat that reads data from a relational database, using JDBC.
As it doesn't have sharding capabilities, we need to be careful not to swamp the database we are
reading from by running too many mappers.
Record Writer:
RecordWriter writes the output key-value pairs from the Reducer phase to output files.
Hadoop RecordWriter takes the output data from the Reducer and writes it to output files. The way
these output key-value pairs are written to output files by the RecordWriter is determined by the
OutputFormat.
Output Format:
The OutputFormat and InputFormat functions are alike. OutputFormat instances provided by
Hadoop are used to write files to HDFS or the local disk. OutputFormat describes the output
specification for a MapReduce job. On the basis of the output specification:
the MapReduce job checks that the output directory does not already exist;
OutputFormat provides the RecordWriter implementation used to write the output files of the
job. Output files are stored in a FileSystem.
TextOutputFormat
The default Hadoop reducer OutputFormat in MapReduce is TextOutputFormat, which writes (key, value) pairs on
individual lines of text files. Its keys and values can be of any type, since TextOutputFormat turns them into
strings by calling toString() on them. Each key-value pair is separated by a tab character, which can be
changed using the mapreduce.output.textoutputformat.separator property.
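A sketch of the line format (illustrative Python, with str() standing in for Java's toString()):

```python
def text_output_line(key, value, separator="\t"):
    # TextOutputFormat calls toString() on keys and values;
    # Python's str() plays that role in this sketch
    return f"{key}{separator}{value}"

line = text_output_line("bear", 2)
print(repr(line))  # 'bear\t2'
```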
SequenceFileAsBinaryOutputFormat
It is another form of SequenceFileOutputFormat which writes keys and values to a sequence file in
binary format.
SequenceFileOutputFormat
It is an OutputFormat which writes sequence files for its output. It is an intermediate format for use
between MapReduce jobs: it rapidly serializes arbitrary data types to the file, and the
corresponding SequenceFileInputFormat will deserialize the file into the same types and present
the data to the next mapper. Compression is controlled by the static methods on
SequenceFileOutputFormat.
MapFileOutputFormat
It is another form of FileOutputFormat in the Hadoop OutputFormats, which is used to write output as
map files. The keys in a MapFile must be added in order, so we need to ensure that the reducer emits
keys in sorted order.
MultipleOutputs
It allows writing data to files whose names are derived from the output keys and values, or in fact
from an arbitrary string.
LazyOutputFormat
FileOutputFormat will create output files even if they are empty. LazyOutputFormat is a wrapper
OutputFormat which ensures that an output file is created only when a record is emitted for
a given partition.
DBOutputFormat
It is an OutputFormat for writing the job's output to a relational database, using JDBC.
YARN:
Hadoop YARN knits the storage unit of Hadoop, i.e., HDFS (Hadoop Distributed File System), with
the various processing tools. For those of you who are completely new to this topic, YARN stands
for "Yet Another Resource Negotiator".
Why YARN?
In Hadoop 1.0, MapReduce performed both processing and resource-management functions. It consisted
of a Job Tracker, which was the single master. The Job Tracker allocated resources, performed
scheduling, and monitored the processing jobs. It assigned map and reduce tasks to a number of
subordinate processes called Task Trackers, which periodically reported their
progress to the Job Tracker.
This design resulted in a scalability bottleneck due to the single Job Tracker: the practical limits of
such a design are reached with a cluster of about 5,000 nodes and 40,000 tasks running concurrently.
To overcome all these issues, YARN was introduced in Hadoop version 2.0 in 2012 by
Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over the
responsibility of resource management and job scheduling.
Limitations of MapReduce:
Real-time processing.
When your intermediate processes need to talk to each other (jobs run in isolation).
When you can get the desired result with a standalone system; it's obviously less painful to
configure and manage a standalone system than a distributed system.
When you have OLTP needs. MR is not suitable for a large number of short online transactions.
Yarn Architecture
YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove
the bottleneck of the Job Tracker, which was present in Hadoop 1.0. YARN was described as a
"Redesigned Resource Manager" at the time of its launch, but it has since evolved into a
large-scale distributed operating system for big data processing.
YARN also allows different data processing engines like graph processing, interactive processing,
stream processing as well as batch processing to run and process data stored in HDFS (Hadoop
Distributed File System) thus making the system much more efficient. Through its various
components, it can dynamically allocate various resources and schedule the application processing.
For large volume data processing, it is quite necessary to manage the available resources properly
so that every application can leverage them.
Multi-tenancy: It allows multiple engine access thus giving organizations a benefit of multi-
tenancy.
Hadoop YARN architecture
The Apache YARN framework consists of a master daemon known as the "Resource Manager", a slave
daemon called the Node Manager (one per slave node), and an Application Master (one per application).
Resource Manager
It is the master daemon of YARN. The RM manages the global assignment of resources (CPU and
memory) among all the applications and arbitrates system resources between competing
applications. Follow the Resource Manager guide to learn about the YARN Resource Manager in great detail.
The Resource Manager has two main components:
Scheduler
Application Manager
a) Scheduler
The scheduler is responsible for allocating resources to the running applications. The scheduler
is a pure scheduler, meaning it performs no monitoring or tracking of applications and offers no
guarantees about restarting failed tasks, whether they fail due to application failure or hardware
failure.
b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible for starting application
masters and for monitoring and restarting them on different nodes in case of failures.
Node Manager
It is the slave daemon of YARN. The NM is responsible for containers, monitoring their resource usage
and reporting the same to the ResourceManager, and it manages the user processes on that machine. The
YARN NodeManager also tracks the health of the node on which it is running. The design also allows
plugging long-running auxiliary services into the NM; these are application-specific services,
specified as part of the configuration and loaded by the NM during startup. Shuffle is a typical
auxiliary service provided by the NMs for MapReduce applications on YARN.
Application Master
One Application Master runs per application. It negotiates resources from the Resource Manager and
works with the Node Manager. It manages the application life cycle.
The AM acquires containers from the RM's Scheduler before contacting the corresponding NMs to
start the application's individual tasks.
Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single
node. Containers are invoked via a Container Launch Context (CLC), which is a record that
contains information such as environment variables, security tokens, dependencies, etc.
Advantages of YARN:
The YARN framework, introduced in Hadoop 2.0, is meant to share the responsibilities of
MapReduce and take care of the cluster management task. This allows MapReduce to execute data
processing only and hence, streamline the process.
YARN brings in the concept of a central resource management. This allows multiple applications
to run on Hadoop, sharing a common resource management.
ResourceManager – The ResourceManager component is the negotiator in a cluster for all the
resources present in that cluster. It includes an application manager component, which is responsible
for managing user jobs. From Hadoop 2.0, any MapReduce job is considered an application.
ApplicationMaster – This component is the place in which a job or application lives. It manages
the MapReduce job and is terminated after the job processing is complete.
NodeManager – The NodeManager component runs on each worker node. It launches and monitors
the containers in which application tasks run, tracks their resource usage, and reports this
information, along with the health of the node, to the ResourceManager.
Keeping in mind that the YARN framework has different components to manage the different
tasks, let’s see how it counters the limitations of Hadoop 1.0.
Better utilization of resources – The YARN framework does not have any fixed slots for tasks. It
provides a central resource manager which allows you to share multiple applications through a
common resource.
JobTracker no longer exists – The two major roles of the JobTracker, resource management and
job scheduling/monitoring, are now split between the ResourceManager and the per-application
ApplicationMaster.
HIVE:
What is Hive?
Hive is an ETL and data warehousing tool developed on top of the Hadoop Distributed File System
(HDFS). Hive makes it easy to perform operations like
Data encapsulation
Ad-hoc queries
Analysis of huge datasets
Important characteristics of Hive
In Hive, tables and databases are created first, and then data is loaded into these tables.
Hive is a data warehouse designed for managing and querying only structured data that is
stored in tables.
While dealing with structured data, MapReduce doesn't have optimization and usability features
like UDFs, but the Hive framework does. Query optimization refers to an effective way of executing
queries in terms of performance.
Hive's SQL-inspired language separates the user from the complexity of Map Reduce
programming. It reuses familiar concepts from the relational database world, such as tables, rows,
columns and schema, etc. for ease of learning.
Hadoop's programming works on flat files. So, Hive can use directory structures to "partition" data
to improve performance on certain queries.
A new and important component of Hive is the Metastore, which is used for storing schema
information and typically resides in a relational database. We can interact with Hive using methods like:
Web GUI
Java Database Connectivity (JDBC) interface
Most interactions tend to take place over a command line interface (CLI). Hive provides a CLI to
write Hive queries using the Hive Query Language (HQL).
Generally, HQL syntax is similar to the SQL syntax that most data analysts are familiar with. The
sample query below displays all the records present in the mentioned table:
Select * from <TableName>;
Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC, and
RCFILE (Record Columnar File).
For single-user metadata storage Hive uses the Derby database, and for multi-user or shared
metadata Hive uses MySQL.