
UNIT-II

HADOOP DISTRIBUTED FILE SYSTEM

Richa
Assistant Professor
CSE Dept
Chandigarh University

University Institute of Engineering (UIE)


CONTENTS

Basic Concepts:
• Nodes
• Racks
• Data Center
• Rack Awareness
• HeartBeat Signal
Data Organisation: Blocks and Replication
Anatomy of File Write
Anatomy of File Read



INTRODUCTION
• HDFS is an acronym for Hadoop Distributed File
System.
• It is the fundamental building block of the Hadoop architecture.
• We generate enormous amounts of data nowadays, and the
biggest problem is storing that huge amount of data.
HDFS helps to overcome this problem.
• HDFS is the framework used to store huge
amounts of data.



• The following concepts are essential for
understanding the HDFS architecture properly.
1. Nodes
2. Racks
3. Data Center
4. Rack Awareness
5. HeartBeat Signal



1. NODES
• NODE: simply a COMPUTER.
• A Hadoop cluster is a set of host machines that are referred
to as Nodes.
• Equivalently, a cluster is a collection of servers, called
Nodes, which communicate with each other to keep the
set of services highly available to the clients.
• The cluster service runs on each node of the server cluster and
controls all aspects of server cluster operation.



NODE MANAGER
• The Node Manager runs on each node and maintains a local list of
nodes and network interfaces within the cluster.
• Regular communication between the nodes ensures that every
node has the same view of which nodes in the cluster are
functional.
• The Node Manager uses information in the cluster configuration
database to determine which nodes have been added to the cluster
and which nodes have been evicted from it.
• It also monitors node failure through a signal called the HeartBeat
Signal.



2. RACKS
• A group of nodes stored together is referred to as a Rack.
• A Rack is a collection of 30 to 40 nodes that are physically stored
close together and are all connected to the same switch.
• The network bandwidth between any two nodes within a rack is
always greater than the network bandwidth between two nodes on
different racks.
• A Hadoop cluster is a collection of Racks.



RACKS

              Switch
             /      \
       Switch        Switch
       /    \        /    \
     DN1    DN2    DN3    DN4

     [ Rack 1 ]    [ Rack 2 ]



3. DATA CENTER
• A Data Center is a technical facility in which an
organisation's IT equipment stores, manages and distributes
its data.
• In other words, a data centre is a facility used to house computer
systems and their associated components.
• Today, everyone needs to back up computer data,
including things like documents or financial records, to
keep it safe from theft.
• The data can be stored on servers containing hard disks,
and the data can be sent to other systems over the Internet.



• Today we have a huge amount of data to store.
• More data means more servers are needed.
• More servers mean more server rooms are needed.
• Servers produce a lot of heat, so they must be kept cool,
otherwise they can fail.
• The traditional way of cooling is air conditioning, which is
expensive.
• So a new eco-friendly cooling system is used that is based on
evaporation.



• A Data Center is a secure, computer-friendly environment that is
maintained by IT specialists and many other experts to keep
it at just the right temperature.
• A Data Center is usually equipped with backup power supplies,
usually in the form of large batteries or powerful generators,
so that even if the main power fails, the servers keep
running.
• Because it is shared by many organisations at once, it is also
referred to as Colocation.



• Data Center Components :

1. Physical Space
2. Raised Flooring
3. In-Room Electrical
4. Standby Generators
5. IT Equipment
6. Data Cabling
7. Cooling
8. Fire Alarm
9. Security System



4. RACK AWARENESS
• The process of making Hadoop aware of which machine is
part of which rack, and of how these racks are connected to
each other within the Hadoop cluster, is known as Rack
Awareness.
• We know that Hadoop keeps multiple copies of all the data
present within HDFS.
• If we make Hadoop aware of the rack topology,
it can place each copy on a different rack.
• That way, even if an entire rack fails, we can still
retrieve the data from another rack.



• MapReduce jobs can also benefit from Rack
Awareness.
• By knowing where the data required by a Map task is located,
the scheduler can run the Map task on that particular machine
itself, thereby saving a lot of bandwidth and time. This is
known as a data-local task.
• Sometimes every machine holding a copy of the data is busy
processing other tasks, and none of them has a slot available
to process this task. In that case, the JobTracker runs the task
on any other machine within the same rack where slots are
available.



• The data is then streamed to that machine over the same rack's
switch, so bandwidth consumption stays low. This is known as a
rack-local task.
• In Hadoop, both NameNode and DataNode are rack aware.
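The scheduling preference described above (data-local first, then rack-local, then off-rack) can be sketched as follows. This is an illustrative model only; the function and data-structure names are made up and do not come from the actual JobTracker code:

```python
def choose_task_node(block_replicas, free_slots, rack_of):
    """Pick a node to run a Map task, preferring data locality.

    block_replicas: list of nodes holding a copy of the input block
    free_slots:     set of nodes with an available task slot
    rack_of:        dict mapping node -> rack id
    """
    # 1. Data-local: a node that both holds the data and has a free slot.
    for node in block_replicas:
        if node in free_slots:
            return node, "data-local"
    # 2. Rack-local: any free node on the same rack as some replica.
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_slots:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Otherwise: any free node (off-rack; costs cross-rack bandwidth).
    if free_slots:
        return next(iter(free_slots)), "off-rack"
    return None, None
```

For example, with `rack_of = {"dn1": "r1", "dn2": "r1", "dn3": "r2"}` and the block on `dn1`, a free slot on `dn2` yields a rack-local assignment, while a free slot only on `dn3` forces an off-rack one.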



5. HeartBeat Signal
• It handles coordination.
• It is used by the NameNode to check whether a DataNode is
alive/active or not.
• The HeartBeat signal is sent from the DataNode to the NameNode
(slave node to master node) every 3 seconds.
• If a DataNode is down, the NameNode stops receiving the
HeartBeat signal from that DataNode.
- In this case, the NameNode waits for 5-10 minutes, after which
it marks that DataNode as Out Of Service (OOS).
• The signal carries information such as: Total Disk Space, Used Space,
Free Space and Data Transfer in Progress.
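The liveness check described above can be sketched as a small helper. The constants mirror the slide's figures (3-second heartbeats, a timeout at the upper end of the 5-10 minute window); the function name is illustrative and is not real HDFS code:

```python
import time

HEARTBEAT_INTERVAL = 3   # seconds between heartbeats (per the slide)
DEAD_TIMEOUT = 10 * 60   # slide says 5-10 minutes; assume the upper bound

def is_out_of_service(last_heartbeat, now=None):
    """Return True if a DataNode's last heartbeat timestamp is older
    than the timeout, i.e. the NameNode would mark it Out Of Service."""
    now = time.time() if now is None else now
    return (now - last_heartbeat) > DEAD_TIMEOUT
```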



DATA ORGANISATION:
BLOCKS
• The actual data is stored in the DataNodes.
• Files are divided into blocks, and these blocks are stored
on different DataNodes across the cluster.
• For example, if the file size is 150 MB and the block size is 64 MB
(default), then:



• 64 MB + 64 MB + 22 MB = 150 MB, so the number of blocks is 3.
• After the data is divided into blocks, the blocks are stored one by one
on DataNodes.
• As each block is stored, it is also replicated and stored on different
DataNodes as shown:
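The block arithmetic above can be sketched as a small helper. This illustrates the calculation only and is not the actual HDFS splitting code:

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the list of block sizes (in MB) for a file, using the
    slide's default 64 MB block size; the last block holds the remainder."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])
```

For the slide's example, `split_into_blocks(150)` gives three blocks of 64, 64 and 22 MB.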



DATA ORGANISATION:
REPLICATION POLICY
• The Block Replication Policy balances 3 factors:
1. Reliability
2. Availability
3. Network Bandwidth Utilisation

Case 1: the data is written from the outside world to the HDFS cluster,
e.g. copying an abc.txt file into HDFS.


CONTD…

• A DataNode is chosen randomly to store the first replica; the
second replica is stored on a different DataNode on a different rack;
and the third replica is stored on yet another DataNode, but on the
same rack as the second replica, as shown in the figure:

[Figure: Rack 1 and Rack 2 — each block's first replica on a DataNode
in Rack 1, with its second and third replicas on two DataNodes in Rack 2.]
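The placement rule described above can be sketched as follows. This is an illustrative model of the policy only, not the NameNode's real placement code; the function name and rack layout are hypothetical:

```python
import random

def place_replicas(racks, writer_node=None):
    """Choose 3 DataNodes per the policy above.

    racks:       dict mapping rack id -> list of DataNodes
    writer_node: if the writer is a task inside the cluster (Case 2),
                 the first replica lands on its own node; otherwise
                 (Case 1) a random DataNode is chosen.
    """
    all_nodes = [(rack, node) for rack, nodes in racks.items() for node in nodes]
    if writer_node is None:
        rack1, first = random.choice(all_nodes)       # Case 1: random DataNode
    else:
        rack1 = next(r for r, n in all_nodes if n == writer_node)
        first = writer_node                           # Case 2: the writer's own node
    # Second replica: a DataNode on a *different* rack.
    other_rack_nodes = [(r, n) for r, n in all_nodes if r != rack1]
    rack2, second = random.choice(other_rack_nodes)
    # Third replica: another DataNode on the same rack as the second.
    third = random.choice([n for r, n in other_rack_nodes
                           if r == rack2 and n != second])
    return [first, second, third]
```

With two racks of two DataNodes each, a write from `dn1` always places the first replica on `dn1` and the other two replicas on the two DataNodes of the other rack.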



CONTD…

• If DataNode1 or Rack 1 crashes, block ID 1 can still be
retrieved from DataNode4 or DataNode5 of Rack 2.
• This means data is reliably stored and highly available on HDFS.

Case 2: the data is written by a task running within the cluster.

• Consider a task running on some DataNode that writes
data to HDFS. In this case, the first replica is stored on the
DataNode where the task is running.
• The second and third replicas are stored in the same way as
in the previous case.



CONTD…

• The only difference is that here the DataNode for the first replica
is not chosen randomly; the block is written on the same
DataNode where the task is running.
• Again, the data stored is reliable and highly available.



ANATOMY OF FILE WRITE
• HDFS supports file operations such as read, write and
delete. It also supports directory operations such as create and
delete.
• Consider writing an abc.txt file to HDFS.
• The MetaData records information about the file such as its
name, number of blocks and replication factor.



ANATOMY OF FILE WRITE
• The interaction between the client and the HDFS cluster is
handled by the HDFS Client Library.
• To write the abc.txt file to HDFS, the client must call the create
method in the HDFS Client Library.
• This makes an RPC call to the NameNode.
• The NameNode then does two things:
1. It checks the access rights, i.e. whether the user
has permission to write to HDFS. If yes, the
NameNode will provide the locations to write the
blocks.
2. It also ensures that the file name does not
already exist.
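The two NameNode checks above can be sketched as follows. The class and function names here are hypothetical and are not the real Hadoop API:

```python
class PermissionDenied(Exception):
    """Raised when the user lacks write access (check 1)."""

class FileAlreadyExists(Exception):
    """Raised when the file name already exists in the namespace (check 2)."""

def create_file(namespace, permissions, user, path):
    """Sketch of the NameNode's create-time checks.

    namespace:   dict mapping path -> block list (the HDFS namespace)
    permissions: dict mapping user -> {"write": bool}
    """
    # Check 1: access rights.
    if not permissions.get(user, {}).get("write", False):
        raise PermissionDenied(f"user {user} cannot write to HDFS")
    # Check 2: the file must not already exist.
    if path in namespace:
        raise FileAlreadyExists(path)
    namespace[path] = []  # record the new file (the edit log entry)
    return path
```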
ANATOMY OF FILE WRITE
• If either test fails, an exception is raised for
the user.
• Only if both tests pass is the file created in the
HDFS namespace, i.e. a record is created in the edit log.
• The client asks the NameNode to allocate the blocks, and the
NameNode returns the addresses of as many DataNodes as the
replication factor (depending on the space available on the
DataNodes).
• A pipeline of these DataNodes is formed, and each DataNode
acknowledges every block it receives from the client.



ANATOMY OF FILE WRITE
• Since the client has received the addresses of the DataNodes, it
goes directly to the DataNodes to start writing the data.
• As soon as the first block is written to DataNode1,
DataNode1 starts copying the block to another DataNode,
DataNode4, and DataNode4 starts copying the block to
DataNode5.
• Once the required number of replicas is complete, an
acknowledgement is sent from DataNode5 to DataNode4
to DataNode1 and finally to the client.
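The pipeline behaviour above, with data flowing DN1 → DN4 → DN5 and acknowledgements travelling back to the client, can be sketched as a small simulation (illustrative names; not HDFS code):

```python
def forward(block, pipeline, stores):
    """Each DataNode stores the block, forwards it to the next node in
    the pipeline, and passes the downstream acknowledgements back up.

    Returns the list of acknowledging DataNodes in the order the
    client would see them (last node in the pipeline acks first).
    """
    if not pipeline:
        return []                      # end of pipeline: acks start here
    head, rest = pipeline[0], pipeline[1:]
    stores[head] = block               # this DataNode persists the block
    downstream_acks = forward(block, rest, stores)
    return downstream_acks + [head]    # ack travels back toward the client
```

Writing one block through the pipeline `["DN1", "DN4", "DN5"]` stores a copy on all three DataNodes and yields acknowledgements in the order DN5, DN4, DN1.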



ANATOMY OF FILE WRITE

• The NameNode and DataNodes are in direct contact with each
other, as the DataNodes send block reports to the
NameNode.
• After a block is stored on a DataNode, the DataNode
informs the NameNode that it has received the block.
• The NameNode then updates the block location mapping in
its main memory.
• The same procedure is repeated for Block ID 2 and Block ID 3.
• This is how the data is stored on HDFS, block by block.



ANATOMY OF FILE WRITE
• After writing all the data to HDFS, the client closes the file
by calling the close operation of the HDFS Client Library,
indicating to the NameNode that the file write operation is complete.
• As the NameNode allocates the blocks, it stores the block
location mapping in its main memory.
• The block location mapping records which block is stored on which
DataNode.
• Finally, the MetaData and Block Location of the abc.txt file are
stored on the NameNode, and the actual data of abc.txt is stored on
DataNodes across the cluster.



ANATOMY OF FILE READ
• Consider reading a file abc.txt from HDFS.
• The MetaData records information about the file such as its
name, number of blocks and replication factor.



ANATOMY OF FILE READ

• The number of blocks for abc.txt is three, and each block
has three replicas because the replication factor is three.
• One replica of each of the three blocks is stored on DataNode1,
DataNode2 and DataNode3 respectively.
• The other replicas are stored on DataNode4 and DataNode5.



ANATOMY OF FILE READ
• The block location stores every block ID along with its DataNode
information.
• The interaction between the client and the HDFS cluster is handled
by the HDFS Client Library.
• To read the abc.txt file, the client must call the open method in the
HDFS Client Library.
• This makes one RPC to the NameNode to get the block IDs and
the location of the data.
• The NameNode checks the access rights, and if the user is
allowed to read the file, it gives the addresses of the
DataNodes to the client.



ANATOMY OF FILE READ
• Block ID information is available in the MetaData, which is
stored in main memory.
• The mapping between DataNodes and block IDs is available in the
block location mapping, also stored in main memory.
• Using these two pieces of information, the NameNode returns the
addresses of the blocks as follows:
1. For block ID 1, the NameNode returns the addresses DN1, DN4
and DN5.
2. For block ID 2, the NameNode returns the addresses DN2, DN4
and DN5.
3. For block ID 3, the NameNode returns the addresses DN3, DN4
and DN5.
ANATOMY OF FILE READ
• Each list is sorted by network distance from the client.
• For block ID 1, DataNode1 is nearer to the client than
DataNode4, and DataNode4 is nearer to the client than DataNode5.
• The HDFS client can now contact DataNode1 directly and
request the transfer of block ID 1.
• The data of block ID 1 is transferred to the client; if the read
from DataNode1 fails, the client contacts the next
DataNode in the list, i.e. DataNode4.
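The nearest-first read with failover described above can be sketched as follows. Here `fetch` is a stand-in for the actual block-transfer call, and the names are illustrative, not the real HDFS client API:

```python
def read_block(block_id, replica_nodes, fetch):
    """Try replicas in order (nearest first); on an I/O failure, fall
    back to the next DataNode in the list, exactly as described above.

    replica_nodes: DataNodes sorted by network distance from the client
    fetch:         callable (node, block_id) -> data, raising IOError
                   when the DataNode is unreachable
    """
    last_err = None
    for node in replica_nodes:
        try:
            return fetch(node, block_id)
        except IOError as err:
            last_err = err            # this DataNode is down; try the next
    raise IOError(f"all replicas of {block_id} failed") from last_err
```

If DataNode1 is down, the block is transparently read from DataNode4, the next entry in the distance-sorted list.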



ANATOMY OF FILE READ
• Once block ID 1 has been transferred from DataNode1, the HDFS
client disconnects from DataNode1 and finds the nearest
DataNode for the next block.
• Each list is sorted by network distance from the client.
• The same procedure is repeated for Block ID 2 and Block
ID 3.
• The whole file is read block by block, and the client application
sees it as a continuous stream of data.
• This is how the data is read from the HDFS cluster.
• At any point in time, a DataNode can go down, leaving the client
unable to read the file from it.



ANATOMY OF FILE READ
• In this case, the client reports the failure to the NameNode, and
the NameNode checks whether that particular node has stopped
sending HeartBeat signals.
• The NameNode then provides the address of another
DataNode where a replica of the data is placed.



THANK YOU
