
Hadoop Distributed File System

Why DFS?
When the amount of data surpasses what a single machine can store, we need to partition the
data across a number of separate machines. A file system that manages storage across a
network of machines is called a Distributed File System.

HDFS: Hadoop provides its own distributed file system, which is called the Hadoop Distributed File System (HDFS).

Purpose of HDFS
To store very large files: HDFS is designed to store huge files, from gigabytes up to
petabytes in size. These files are stored as blocks in the file system.
Write once, read many times: a dataset is typically written once and then read by many
analyses, possibly running at the same time.
Commodity hardware: storing the data does not require industry-grade hardware; regular
machines are sufficient. HDFS maintains a default replication factor of 3, so if a machine
fails we need not worry about losing data (a sketch of setting this factor is shown below).
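
As an illustration (a minimal sketch, not from the original notes; the property name
dfs.replication is a real HDFS setting, but the file path is made up), the replication
factor can be set cluster-wide in hdfs-site.xml or adjusted per file through the Hadoop
Java API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // The cluster-wide default; normally configured once in hdfs-site.xml.
            conf.set("dfs.replication", "3");

            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/sample.txt"); // hypothetical path

            // The replication factor can also be changed for an existing file.
            fs.setReplication(file, (short) 3);
            fs.close();
        }
    }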
Drawbacks
HDFS does not handle the following well:
1) Low-latency data access
2) Lots of small files
3) Multiple writers


HDFS Blocks

[Figure: a file's blocks distributed across Node 1, Node 2, and Node 3]
Data in any file system is stored in blocks. A block is the minimum unit of data that can be
read, written, or stored. What differs between an HDFS block and a normal block is the block
size. Ordinary file systems typically use block sizes in the range of 512 KB, whereas the
default block size in HDFS is 64 MB (128 MB in Hadoop 2 and later). There is a significant
reason for this difference.
With a 512 KB block size, a 1 MB file is stored in two blocks, so reading it requires two
block accesses during I/O.
A 10 MB file is stored in 20 blocks, so reading it requires 20 block accesses.
A 10^n MB file is stored in 2 * 10^n blocks, so reading it requires 2 * 10^n block accesses.
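
In general (a back-of-the-envelope addition, not part of the original notes), a file of size
S stored with block size B occupies

    number of blocks = ceil(S / B)

so a 1 GB file needs 2,048 blocks at 512 KB but only 16 blocks at 64 MB. Fewer blocks means
fewer seeks per read and far less block metadata to track.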
Since we are handling data in GBs, TBs, and PBs, reads need to stay fast at that scale. To
satisfy this, we need large block sizes so that a file can be accessed with far fewer block
reads and seeks.

So the default block size in the Hadoop file system is 64 MB.
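
As an illustration (a minimal sketch, not from the original notes; the path and data sizes
are made up), the Hadoop Java API lets a program choose the block size for a file and then
ask where the resulting blocks ended up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/big.dat");   // hypothetical path

            // create(path, overwrite, bufferSize, replication, blockSize)
            long blockSize = 64L * 1024 * 1024;           // 64 MB, the default discussed above
            FSDataOutputStream out = fs.create(file, true, 4096, (short) 3, blockSize);
            byte[] chunk = new byte[1024 * 1024];
            for (int i = 0; i < 128; i++) {               // write 128 MB -> the file spans 2 blocks
                out.write(chunk);
            }
            out.close();

            // Ask HDFS which machines hold each block of the file.
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println(block);                // prints offset, length, and hosts
            }
            fs.close();
        }
    }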

Goutam Tadi goutamhadoop@gmail.com
