HDFS Concepts
1. Data Block
Have you ever wondered how the Hadoop Distributed File System stores
very large files?
Hadoop is known for its reliable storage. Hadoop HDFS can store data of any
size and format.
HDFS divides each file into smaller chunks called data blocks. These
data blocks give Hadoop HDFS several advantages. Let us study
these data blocks in detail.
In this article, we will study data blocks in Hadoop HDFS. The article
discusses:
• What is an HDFS data block, and what is its default size?
• How blocks are created for a file, with an example.
• Why are blocks in HDFS huge?
• Advantages of Hadoop Data Blocks
Let us first begin with an introduction to the data block and its default size.
What is a data block in HDFS?
Files in HDFS are broken into block-sized chunks called data blocks. These
blocks are stored as independent units.
The size of these HDFS data blocks is 128 MB by default. We can configure the
block size as per our requirements by changing the dfs.blocksize property
(called dfs.block.size in older releases) in hdfs-site.xml.
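As a minimal sketch (one of several ways to do this), the block size can also be set programmatically through the client Configuration, or per file at create time. The path and sizes below are made-up values for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same effect as setting dfs.blocksize in hdfs-site.xml:
    // use 256 MB blocks instead of the 128 MB default.
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
    FileSystem fs = FileSystem.get(conf);

    // The block size can also be chosen per file at create time
    // (overwrite = true, buffer = 4 KB, replication = 3, block = 256 MB).
    Path file = new Path("/tmp/example.bin"); // made-up path
    FSDataOutputStream out =
        fs.create(file, true, 4096, (short) 3, 256L * 1024 * 1024);
    out.close();
  }
}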
Hadoop distributes these blocks across different slave machines, and the master
machine stores the metadata about block locations.
All the blocks of a file are the same size except the last one (if the file size is
not a multiple of 128 MB). See the example below to understand this fact.
Example
Suppose we have a file of size 612 MB, and we are using the default block
configuration (128 MB). Five blocks are created: the first four blocks
are 128 MB each, and the fifth block is 100 MB (4 × 128 + 100 = 612).
From the above example, we can conclude that:
1. A file in HDFS smaller than a single block does not occupy a full
block's worth of underlying storage.
2. Each file stored in HDFS doesn’t need to be an exact multiple of the
configured block size.
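To check this split on a real cluster, the FileSystem API can list a file's block offsets, lengths, and host locations. A minimal sketch, assuming a hypothetical 612 MB file already exists at the made-up path shown:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/sample-612mb.bin"); // made-up path
    FileStatus status = fs.getFileStatus(file);
    // Each BlockLocation covers one block: its offset, its length,
    // and the DataNodes holding replicas of it.
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%-10d length=%-10d hosts=%s%n",
          b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
    }
  }
}

For the 612 MB example above, this would print four 128 MB entries followed by one 100 MB entry.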
Now let’s see the reasons behind the large size of the data blocks in HDFS.
Why are blocks in HDFS huge?
The default size of the HDFS data block is 128 MB. The reasons for the large
size of blocks are:
1. To minimize the cost of seeks: if the block is large enough, the time taken to
transfer the data from disk is significantly longer than the time taken to seek
to the start of the block. A large file made of such blocks therefore transfers
at close to the disk transfer rate. For example, with a seek time of about 10 ms
and a transfer rate of 100 MB/s, keeping seek overhead to roughly 1% of transfer
time calls for blocks of around 100 MB (a back-of-envelope calculation follows
this list).
2. If blocks are small, there will be too many blocks in Hadoop HDFS
and thus too much metadata to store. Since the NameNode holds block
metadata in memory, managing such a huge number of blocks creates memory
overhead and extra network traffic.
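As a back-of-envelope illustration of the seek-cost argument (the 10 ms seek time and 100 MB/s transfer rate are assumed round numbers, not measurements):

public class SeekOverhead {
  public static void main(String[] args) {
    double seekMs = 10.0;        // assumed average disk seek time
    double transferMBps = 100.0; // assumed sustained transfer rate
    for (long blockMB : new long[] {1, 64, 128, 256}) {
      double transferMs = blockMB / transferMBps * 1000;
      double overheadPct = seekMs / (seekMs + transferMs) * 100;
      System.out.printf("block = %4d MB -> seek overhead = %5.2f%%%n",
          blockMB, overheadPct);
    }
  }
}

At 128 MB blocks, seek time is under 1% of transfer time; at 1 MB it is 50%.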
Advantages of Hadoop Data Blocks
1. No limitation on the file size
A file can be larger than any single disk in the network.
2. Simplicity of storage subsystem
Since blocks are of fixed size, we can easily calculate the number of blocks that
can be stored on a given disk. This provides simplicity to the storage
subsystem.
3. Fits well with replication for providing Fault
Tolerance and High Availability
Blocks are easy to replicate between DataNodes, thus providing fault
tolerance and high availability.
4. Eliminating metadata concerns
Since blocks are just chunks of data to be stored, we don't need to store file
metadata (such as permission information) with the blocks; another system
can handle metadata separately.
2. NameNode and DataNode
An HDFS cluster has two types of nodes operating in a master−slave pattern:
1. NameNode (the master) and
2. A number of DataNodes (slaves/workers).
HDFS NameNode
1. NameNode is the central component of the HDFS architecture.
2. NameNode is also known as the Master node.
3. The HDFS NameNode stores metadata, i.e., the number of data blocks, file names,
paths, block IDs, block locations, the number of replicas, and slave-related
configuration. This metadata is kept in memory on the master for fast retrieval.
4. Because the NameNode keeps the file system namespace metadata in memory for
quick response times, it needs ample memory and should be deployed on reliable
hardware.
5. NameNode maintains and manages the slave nodes, and assigns tasks to them.
6. NameNode has knowledge of all the DataNodes containing data blocks for a
given file.
7. NameNode coordinates with hundreds or thousands of DataNodes and serves
the requests coming from client applications.
Two files, ‘FsImage’ and the ‘EditLog’, are used to store metadata information.
FsImage: a snapshot of the file system taken when the NameNode starts. It is an
“image file” containing the entire filesystem namespace, stored as a file
in the NameNode’s local file system. It holds a serialized form of all the
directories and file inodes in the filesystem, where each inode is an internal
representation of a file’s or directory’s metadata.
EditLog: records every modification made to the file system since the
most recent FsImage. When the NameNode receives a create/update/delete request
from a client, it first records the change in the EditLog. Periodically, the
EditLog is merged into the FsImage to produce a new checkpoint.
Functions of NameNode in HDFS
1. It is the master daemon that maintains and manages the DataNodes (slave
nodes).
2. It records the metadata of all the files stored in the cluster, e.g., the
location of stored blocks, the size of the files, permissions, hierarchy, etc.
3. It records each change that takes place to the file system metadata. For example,
if a file is deleted in HDFS, the NameNode will immediately record this in the
EditLog.
4. It regularly receives a Heartbeat and a block report from all the DataNodes in the
cluster to ensure that the DataNodes are live.
5. It keeps a record of all the blocks in HDFS and in which nodes these blocks are
located.
6. The NameNode is also responsible for maintaining the replication factor of all
the blocks (see the sketch after this list).
7. In case of DataNode failure, the NameNode chooses new DataNodes for new
replicas, balances disk usage, and manages the communication traffic to the
DataNodes.
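A minimal sketch of how a client sees this NameNode-managed metadata, assuming a hypothetical file path; both getFileStatus and setReplication are answered by the NameNode:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/sample.bin"); // made-up path
    // All of this metadata is served by the NameNode from memory.
    FileStatus st = fs.getFileStatus(file);
    System.out.println("replication = " + st.getReplication()
        + ", block size = " + st.getBlockSize()
        + ", permissions = " + st.getPermission());
    // Ask the NameNode to raise the replication factor; the extra
    // replicas are then created on DataNodes in the background.
    fs.setReplication(file, (short) 5);
  }
}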
HDFS DataNode
1. DataNode is also known as the Slave node.
2. In Hadoop HDFS Architecture, DataNode stores actual data in HDFS.
3. DataNodes are responsible for serving clients' read and write requests (see the
read example after this list).
4. DataNodes can be deployed on commodity hardware.
5. DataNodes send information to the NameNode about the files and blocks stored
on that node and respond to the NameNode for all filesystem operations.
6. When a DataNode starts up, it announces itself to the NameNode along with the
list of blocks it is responsible for.
7. DataNodes are usually configured with plenty of hard disk space, because the
actual data is stored on the DataNodes.
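As a small sketch of DataNodes serving a read: the client asks the NameNode where the blocks live via the FileSystem API, then streams the bytes directly from the DataNodes (the path below is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromDataNodes {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() asks the NameNode for the block locations; the actual
    // bytes are then streamed from the DataNodes that hold them.
    FSDataInputStream in = fs.open(new Path("/tmp/example.bin")); // made-up path
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      in.close();
    }
  }
}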
Functions of DataNode in HDFS
1. These are the slave daemons; one runs on each slave machine.
2. The actual data is stored on DataNodes.
3. The DataNodes perform the low-level read and write requests from the file
system’s clients.
4. Every DataNode sends a heartbeat message to the NameNode every 3 seconds
to convey that it is alive. When the NameNode does not receive a heartbeat
from a DataNode for about 10 minutes, it considers that particular DataNode
dead and starts replicating its blocks on other DataNodes (the timeout derives
from two configuration properties, as shown in the sketch after this list).
5. All DataNodes in the Hadoop cluster are synchronized so that they can
communicate with one another to ensure:
i. Balancing the data in the system
ii. Moving data to maintain high replication
iii. Copying data when required
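A minimal sketch of where the 10-minute figure comes from. The property names and defaults below are the standard HDFS ones (dfs.heartbeat.interval in seconds, dfs.namenode.heartbeat.recheck-interval in milliseconds), and the timeout formula is 2 × recheck-interval + 10 × heartbeat-interval:

import org.apache.hadoop.conf.Configuration;

public class DeadNodeTimeout {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Defaults: 3-second heartbeats, 5-minute (300,000 ms) recheck interval.
    long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3);
    long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval", 300_000);
    // A DataNode is declared dead after
    // 2 * recheck-interval + 10 * heartbeat-interval.
    long timeoutMs = 2 * recheckMs + 10 * heartbeatSec * 1000;
    System.out.println("dead-node timeout = " + timeoutMs / 1000 + " s"); // 630 s
  }
}

With the defaults this yields 2 × 300 s + 10 × 3 s = 630 s, i.e., 10.5 minutes, commonly rounded to "10 minutes".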
3. Hadoop High Availability &
NameNode High Availability
Architecture
High Availability was a new feature added in Hadoop 2.x to solve the single
point of failure problem of older Hadoop versions.
Hadoop HDFS follows a master-slave architecture in which the NameNode is the
master node and maintains the filesystem tree, so HDFS cannot be used without
the NameNode. This makes the NameNode a single point of failure and a potential
bottleneck. The HDFS high availability feature addresses this issue.
In this article, we will discuss the following points of Hadoop High Availability
feature in detail:
• What is high availability?
• Introduction to High availability in Hadoop
• How does Hadoop achieve High Availability?
• Reason for introducing High Availability Architecture
• NameNode High Availability Architecture
• Implementation of NameNode High Availability Architecture
• Fencing of NameNode
High availability refers to the availability of a system or its data despite
component failures in the system.
The high availability feature in Hadoop ensures the availability of the Hadoop
cluster without any downtime, even in unfavorable conditions like NameNode
failure, DataNode failure, machine crash, etc.
It means that if a machine crashes, its data will be accessible from another path.
How does Hadoop HDFS achieve High Availability?
As we know, HDFS (Hadoop Distributed File System) stores users' data in files
and internally splits the files into fixed-size blocks. These blocks are stored
on DataNodes. The NameNode is the master node that stores the metadata about the
file system, i.e., block locations, the blocks belonging to each file, etc.
1. Availability if DataNode fails
• In HDFS, replicas of each block are stored on different nodes.
• DataNodes in HDFS continuously send heartbeat messages to the
NameNode, every 3 seconds by default.
• If the NameNode does not receive a heartbeat from a DataNode within a
specified time (about 10 minutes by default), it considers the
DataNode to be dead.
• The NameNode then initiates re-replication: it instructs DataNodes
holding a copy of the dead node's data to replicate that data onto
other DataNodes.
• Whenever a user requests access to data, the NameNode provides
the IP of the closest DataNode containing it. If that DataNode
fails, the NameNode redirects the user to another DataNode holding
a copy of the same data, so the read proceeds without any downtime.
The cluster thus remains available to the user even if some
DataNodes fail.
2. Availability if NameNode fails
The NameNode is the only node that knows the list of files and directories in a
Hadoop cluster; the filesystem cannot be used without it.
The High Availability feature added in Hadoop 2 provides fast
failover for the Hadoop cluster. A Hadoop HA cluster runs two
NameNodes (or more, since Hadoop 3) in an
active/passive configuration with a hot standby. If the active node fails,
the passive node becomes the active NameNode, takes over its responsibilities,
and serves client requests.
This allows fast failover to a new machine even if the active machine
crashes.
Thus, data is available and accessible to the user even if the NameNode itself
goes down.
Let us now study the NameNode High Availability in detail.
Before going to NameNode High Availability architecture, one should know
the reason for introducing such architecture.
Reason for introducing NameNode High
Availability Architecture
Prior to Hadoop 2.0, the NameNode was a single point of failure in a Hadoop
cluster. This is because:
1. Each cluster had only one NameNode. If the NameNode failed, the whole cluster
went down and became available again only after the NameNode was restarted or
brought up on a separate machine. This limited availability in two ways:
• If the machine crashed, the cluster was unavailable until an
operator restarted the NameNode.
• Planned maintenance events, such as software or hardware upgrades
on the NameNode, resulted in downtime for the Hadoop cluster.
2. The time taken by NameNode to start from cold on large clusters with many
files can be 30 minutes or more. This long recovery time is a problem.
To overcome these problems, the Hadoop High Availability architecture was
introduced in Hadoop 2.
Hadoop NameNode High Availability
Architecture
The HDFS high availability feature introduced in Hadoop 2 addressed this
problem by providing the option for running two NameNodes in the same
cluster in an Active/Passive configuration with a hot standby.
Thus, if the running (active) NameNode goes down, the other (passive)
NameNode takes over serving client requests without interruption.
The passive node is a standby that acts as a slave, holding metadata similar
to the active node's. It maintains enough state to provide a fast failover.
This allows fast failover to a new NameNode in the event of a machine
crash, or an administrator-initiated failover during planned maintenance.
1. Issues in maintaining consistency of the HDFS HA
cluster:
There are two issues in maintaining the consistency of the HDFS high
availability cluster. They are:
• The active node and the passive node should always be in sync with
each other and must have the same metadata. This allows the Hadoop
cluster to be restored to the same namespace state at which it crashed.
• Only one NameNode in a cluster must be active at a time. If
two NameNodes are active simultaneously, the cluster effectively splits
into two smaller clusters, each believing it is the only active one. This
is known as the “split-brain scenario”, which leads to data loss or
other incorrect results. Fencing is the process that ensures that only
one NameNode is active at a time.
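As a sketch of what an HA deployment looks like from the client side, the snippet below wires up a hypothetical nameservice called mycluster with two NameNodes nn1 and nn2 (in practice these properties live in hdfs-site.xml); the failover proxy provider transparently retries against whichever NameNode is currently active:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical nameservice "mycluster" with NameNodes nn1 and nn2.
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");
    // Client-side proxy that fails over to whichever NameNode is active.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
    conf.set("fs.defaultFS", "hdfs://mycluster");
    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    System.out.println("Connected to: " + fs.getUri());
  }
}

Because clients address the logical nameservice hdfs://mycluster rather than a specific host, a failover from nn1 to nn2 is invisible to them.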