
HDFS Federation in Hadoop

By
Dr. Fazeel Abid
Department of Information System
SBE-UMT-LHE
HDFS Federation in Hadoop
• In Hadoop HDFS, Name Node stores the metadata for every file and block in the
file system.
• On very large clusters with many files, Name Node requires a large amount of
memory for storing metadata for each file and block.
• Also, the prior HDFS architecture supported a single Name Node that manages
the file system namespace.
• Thus, memory becomes the limiting factor for scaling, and the single Name Node
becomes a bottleneck.
• The HDFS Federation feature added in Hadoop allows a cluster to scale by adding
more Name Nodes.
Current HDFS Architecture
HDFS architecture has two layers:
1. Namespace
The Namespace layer in the HDFS architecture consists of files, blocks, and directories. This
layer provides support for namespace related file system operations like create, delete, modify,
and list files and directories.
2. Block Storage layer
Block Storage layer has two parts:
• Block Management: The Name Node performs block management. It provides Data Node
cluster membership by handling registrations and periodic heartbeats, processes block
reports, and supports block-related operations such as create, delete, modify, and get
block location. It also maintains the locations of blocks and their replica placement,
replicates under-replicated blocks, and deletes over-replicated blocks.
• Storage: The Data Node manages storage space by storing blocks on the local file system
and providing read/write access. This architecture allows only a single Name Node to
maintain the file system namespace.
Limitations of Current HDFS Architecture
• Due to the tight coupling of the namespace and the storage layer, an alternate
implementation of the Name Node is difficult. This also prevents other services from
using the block storage directly.
• A single Name Node can manage only a limited number of Data Nodes.
• File system operations are limited by the number of tasks the Name Node can handle
at a time, so the performance of the cluster depends on the Name Node throughput.
• Also, because of the single namespace, there is no isolation among the tenant
organizations that share the cluster.
Introduction to HDFS Federation
• The HDFS Federation feature introduced in Hadoop enhances the existing HDFS
architecture.
• It overcomes the limitations of the earlier architecture by adding support for
multiple Name Nodes/namespaces to HDFS.
• This allows the use of more than one Name Node/namespace.
• Therefore, it scales the namespace horizontally by allowing the addition of Name
Nodes to the cluster.
HDFS Federation Architecture
• In HDFS Federation architecture, there are multiple Name Nodes and Data
Nodes.
• Each Name Node has its own namespace and block pool.
• All the Name Nodes use the Data Nodes as common storage.
• Every Name Node is independent of the others and requires no coordination
with them.
• Each Data Node registers with all the Name Nodes in the cluster and stores
blocks for all the block pools in the cluster.
• Also, Data Nodes periodically send heartbeats and block reports to all the Name
Nodes in the cluster and handle the instructions from the Name Nodes.
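The registration and reporting relationships above can be sketched in a short Python model. This is illustrative only, not Hadoop source code; the class and pool names are hypothetical.

```python
# Sketch of HDFS Federation: every Data Node registers with every Name Node
# and stores blocks for every block pool (names are hypothetical).

class NameNode:
    def __init__(self, name):
        self.name = name
        self.block_pool = f"pool-{name}"     # each Name Node manages its own pool
        self.registered_datanodes = []

    def register(self, datanode):
        self.registered_datanodes.append(datanode)

class DataNode:
    def __init__(self, name, namenodes):
        self.name = name
        # blocks are kept per block pool: {pool_id: [block_ids]}
        self.blocks = {nn.block_pool: [] for nn in namenodes}
        for nn in namenodes:                 # register with ALL Name Nodes
            nn.register(self)

    def heartbeat(self):
        # a Data Node reports to every Name Node: one block count per pool
        return {pool: len(blocks) for pool, blocks in self.blocks.items()}

namenodes = [NameNode("NN1"), NameNode("NN2")]
dn1 = DataNode("DataNode1", namenodes)
dn1.blocks["pool-NN1"].append("blk_1001")
dn1.blocks["pool-NN2"].append("blk_2001")

print(dn1.heartbeat())   # one entry per block pool, e.g. {'pool-NN1': 1, 'pool-NN2': 1}
```

Note that the Data Node holds blocks for both pools even though the two Name Nodes never talk to each other, which is exactly the independence the bullets above describe.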
HDFS Federation Architecture

• In the HDFS Federation architecture, there are multiple Name Nodes, represented as
NN1, NN2, ..., NNn.
• NS1, NS2, and so on are the multiple namespaces
managed by their respective Name Node (NS1 by
NN1, NS2 by NN2, and so on).
• Each namespace has its own block pool (NS1 has
Pool1, NS2 has Pool2, and so on).
• Each Data Node stores blocks for all the block pools in the cluster.
• For example, DataNode1 stores blocks from Pool1, Pool2, Pool3, etc.
HDFS Federation
Block pool
• A block pool is the collection of blocks belonging to a single namespace.
• HDFS Federation architecture has a collection of block pools, and each block pool is
managed independently from each other.
• This allows a namespace to generate Block IDs for new blocks without any
coordination with the other namespaces.
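The independent Block ID generation can be illustrated with a small sketch (names hypothetical): because IDs are scoped to a block pool, each Name Node can allocate them from its own counter with no cross-namespace coordination.

```python
# Sketch: per-pool block ID allocation. IDs only need to be unique within
# a pool, so two pools can use identical counters without clashing.
import itertools

class BlockPool:
    def __init__(self, pool_id, start_id=0):
        self.pool_id = pool_id
        self._ids = itertools.count(start_id)

    def allocate_block(self):
        # globally unique once qualified by the pool id
        return f"{self.pool_id}:blk_{next(self._ids)}"

pool1 = BlockPool("BP-1")   # owned by NN1
pool2 = BlockPool("BP-2")   # owned by NN2

print(pool1.allocate_block())  # BP-1:blk_0
print(pool1.allocate_block())  # BP-1:blk_1
print(pool2.allocate_block())  # BP-2:blk_0  -- same counter value, no clash
```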
Namespace Volume
• A namespace together with its block pool is termed a Namespace Volume.
• The HDFS Federation architecture has a collection of Namespace Volumes, each of
which is a self-contained unit of management.
• When a Name Node or namespace is deleted, the corresponding block pool on the
Data Nodes is also deleted.
• When the cluster is upgraded, each Namespace Volume is upgraded as a unit.
HDFS Disk Balancer
Need for HDFS Disk Balancer
• Disk Balancer is a command-line tool introduced in Hadoop 3.0 for balancing the
disks within a Data Node.
• The HDFS Disk Balancer is different from the HDFS Balancer, which balances the
data distribution across the nodes of the cluster.
• In Hadoop HDFS, a Data Node distributes data blocks among its disks. While
writing new blocks, the Data Node uses a volume-choosing policy (round-robin or
available-space) to choose the disk (volume) for a block.
• Round-robin policy: spreads new blocks evenly across the available disks. The
Data Node uses this policy by default.
• Available-space policy: writes data to the disks that have more free space
(by percentage).
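The two volume-choosing policies can be contrasted with a toy simulation. This is a sketch, not the actual HDFS implementation; the real available-space policy works on free-space percentages and configurable thresholds.

```python
# Toy simulation of the two volume-choosing policies (disk names hypothetical).
import itertools

disks_free_gb = {"disk1": 50, "disk2": 400, "disk3": 10}

# Round-robin (the default): cycle through disks regardless of free space.
rr = itertools.cycle(disks_free_gb)

def round_robin_pick():
    return next(rr)

# Available-space: pick the disk with the most free space (simplified here
# to absolute free space; real HDFS compares by percentage).
def available_space_pick():
    return max(disks_free_gb, key=disks_free_gb.get)

print([round_robin_pick() for _ in range(4)])  # disk1, disk2, disk3, disk1
print(available_space_pick())                  # disk2
```

The simulation makes the trade-off visible: round-robin ignores how full each disk is, while available-space keeps routing writes to the emptiest disk, which is precisely the behavior discussed on the next slide.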
HDFS Disk Balancer

Even with the round-robin policy, Data Nodes in a long-running cluster sometimes fill their storage directories (disks/volumes) unevenly,
leading to situations where certain disks are full while others are significantly less used. This happens either because of large
amounts of writes and deletes or because of disk replacement.
Also, if we use the available-space volume-choosing policy, every new write goes to the newly added empty disk while the other
disks sit idle, creating a bottleneck on the new disk.
Thus there arises a need for intra-Data Node balancing (even distribution of data blocks within a Data Node) to address intra-
Data Node skew (uneven distribution of blocks across disks), which occurs due to disk replacement or random writes and deletes.
Therefore, a tool named Disk Balancer was introduced in Hadoop 3.0 that focuses on distributing data within a node.
Introduction to HDFS Disk Balancer
• Disk Balancer is a command-line tool introduced in Hadoop HDFS for Intra-
Data Node balancing.
• The HDFS Disk Balancer spreads data evenly across all disks of a Data Node.
• Unlike the Balancer, which rebalances data across Data Nodes, the Disk Balancer
distributes data within a Data Node.
• The HDFS Disk Balancer operates against a given Data Node and moves blocks from
one disk to another.
How HDFS Disk Balancer Works
• HDFS Disk Balancer operates by creating a plan, which is a set of statements that
describes how much data should move between two disks, and goes on to execute
that set of statements on the Data Node.
• A plan consists of multiple move steps. Each move step specifies a source disk,
a destination disk, and the number of bytes to move.
• This plan is executed against an operational Data Node.
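A plan and its move steps can be modeled as simple data structures. The field names below are hypothetical (the real Disk Balancer serializes plans as JSON); the sketch only mirrors the structure described above.

```python
# Sketch of a disk-balancer plan: a list of move steps, each naming a
# source disk, a destination disk, and a byte count.
from dataclasses import dataclass

@dataclass
class MoveStep:
    source_disk: str
    dest_disk: str
    bytes_to_move: int

@dataclass
class Plan:
    datanode: str
    steps: list

plan = Plan(
    datanode="datanode1.example.com",   # hypothetical host name
    steps=[
        MoveStep("disk4", "disk2", 120 * 1024**3),  # move 120 GiB
        MoveStep("disk3", "disk1", 40 * 1024**3),   # move 40 GiB
    ],
)

def execute(plan):
    # a real Data Node would copy the blocks; here we just log each step
    for s in plan.steps:
        print(f"{plan.datanode}: move {s.bytes_to_move} bytes "
              f"from {s.source_disk} to {s.dest_disk}")

execute(plan)
```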
Functions of HDFS Disk Balancer
HDFS Disk Balancer supports two major functions, i.e., reporting and balancing.
1. Data Spread Report
• In order to measure which machines in the cluster suffer from uneven data
distribution, the HDFS Disk Balancer defines two metrics:
the Volume Data Density metric and
the Node Data Density metric.
• HDFS Volume data density metric allows us to compare how well the data is
spread across different volumes of a given node.
• The Node data density metric allows comparing between nodes.
Functions of HDFS Disk Balancer
Volume Data Density or Intra-Node Data Density
• The Volume Data Density metric computes how much data exists on a node and what
the ideal storage on each volume should be.
• The ideal storage percentage for each device is equal to the total data stored on
that node divided by the total disk capacity on that node for each storage-type.
• Suppose we have a machine with four volumes: Disk1 (capacity 200 GB, used 100 GB),
Disk2 (300 GB, used 76 GB), Disk3 (350 GB, used 300 GB), and Disk4 (500 GB, used 475 GB).
Volume Data Density or Intra-Node Data Density
In this example,
Total capacity= 200 + 300 + 350 + 500 = 1350GB
and
Total Used= 100 + 76 + 300 + 475 = 951 GB
 Therefore, the ideal storage on each volume/disk is:
Ideal storage = total used ÷ total capacity
= 951 ÷ 1350 ≈ 0.70, i.e., 70% of the capacity of each disk.
 Also, volume data density is equal to the difference between
ideal-Storage and current dfsUsedRatio.
Therefore, volume data density for disk1 is:
Volume Data Density = ideal Storage – dfs Used Ratio
= 0.70-0.50 = 0.20
 A positive volume data density indicates that the disk is
under-utilized, and a negative value indicates that the disk is
over-utilized relative to the current ideal storage target.
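The worked example above can be checked in a few lines of Python, using the capacities and usage figures from the slide:

```python
# Reproducing the volume-data-density example above (all figures in GB).
capacity = {"disk1": 200, "disk2": 300, "disk3": 350, "disk4": 500}
used     = {"disk1": 100, "disk2": 76,  "disk3": 300, "disk4": 475}

# ideal storage = total used / total capacity = 951 / 1350 ~= 0.70
ideal = sum(used.values()) / sum(capacity.values())

def volume_data_density(disk):
    dfs_used_ratio = used[disk] / capacity[disk]
    return ideal - dfs_used_ratio

print(round(ideal, 2))                         # 0.7
print(round(volume_data_density("disk1"), 2))  # 0.2   -> under-utilized
print(round(volume_data_density("disk4"), 2))  # -0.25 -> over-utilized
```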
Node Data Density or Inter-Node Data Density
• After calculating volume data density, we can calculate Node Data Density, which
is the sum of all absolute values of volume data density.
• This allows comparing nodes that need our attention in a given cluster.
• Lower node data density values indicate a better spread, and higher values
indicate a more skewed data distribution.
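Continuing the same four-disk example, the node data density comes out to about 1.05, confirming that this node is quite skewed:

```python
# Node data density = sum of the absolute volume data densities; 0 would
# mean a perfectly even spread (same four-disk example as above, in GB).
capacity = {"disk1": 200, "disk2": 300, "disk3": 350, "disk4": 500}
used     = {"disk1": 100, "disk2": 76,  "disk3": 300, "disk4": 475}
ideal = sum(used.values()) / sum(capacity.values())

node_data_density = sum(
    abs(ideal - used[d] / capacity[d]) for d in capacity
)
print(round(node_data_density, 2))   # ~1.05 for this node
```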
2. Reports
Once we have the volume data density and node data density, we can find the top 20 nodes
in the cluster that have skewed data distribution, or get the volume data density for a
given node.
Functions of HDFS Disk Balancer: Disk Balancing
• Once we know that a certain node needs balancing, we compute or read the
current volume Data Density.
• With this information, we can easily decide which volumes are over-provisioned
and which are under-provisioned.
• In order to move data from one volume to another within the Data Node, the Disk
Balancer uses an RPC-based protocol similar to the one used by the Balancer.
• This allows the user to replace disks without worrying about decommissioning
a node.
