Storing a master dataset with a distributed file system
Hello!
I am Minakshi Gogoi
You can find me at
minakshi_cse@gimt-guwahati.ac.in
Contents
● Hadoop Distributed File System (HDFS)
● HDFS is a distributed and scalable file system that manages how data is stored
across the cluster.
● In an HDFS cluster, there are two types of nodes: a single namenode and
multiple datanodes.
● When a file is uploaded to HDFS, the file is first chunked into blocks of a
fixed size, typically between 64 MB and 256 MB.
● Each block is then replicated across multiple datanodes (typically three)
that are chosen at random.
● The namenode keeps track of the file-to-block mapping and where each block
is located.
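The chunk-and-replicate scheme above can be sketched in a few lines of Python. This is a toy simulation of the bookkeeping a namenode performs, not real HDFS code; the block size, file names, and node names are illustrative assumptions.

```python
import random

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, within the typical 64-256 MB range
REPLICATION = 3                 # typical replication factor

def chunk_and_place(file_name, file_size, datanodes, block_size=BLOCK_SIZE):
    """Simulate namenode bookkeeping: chunk a file into fixed-size blocks
    and place each block on REPLICATION randomly chosen datanodes.

    Returns {block_id: [datanode, ...]} -- the file-to-block mapping."""
    num_blocks = -(-file_size // block_size)  # ceiling division
    block_map = {}
    for i in range(num_blocks):
        block_id = f"{file_name}_blk_{i}"
        # random.sample picks distinct nodes, so replicas never collide
        block_map[block_id] = random.sample(datanodes, REPLICATION)
    return block_map

# A 300 MB file on a 5-node cluster -> 3 blocks, each on 3 distinct nodes
nodes = [f"datanode{i}" for i in range(1, 6)]
placement = chunk_and_place("logs.txt", 300 * 1024 * 1024, nodes)
```

Because each block lands on distinct random nodes, a large file ends up spread across most of the cluster, which is what makes parallel processing possible later.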
Figure: Files are chunked into blocks, which are dispersed to datanodes in the cluster.
Figure: Clients communicate with the namenode to determine which datanodes hold the
blocks for the desired file.
● Distributing a file in this way across many nodes allows it to be easily
processed in parallel.
● When a program needs to access a file stored in HDFS, it contacts the
namenode to determine which datanodes host the file contents.
● Additionally, with each block replicated across multiple nodes, data
remains available even when individual nodes are offline.
● Files are spread across multiple machines for scalability and also to
enable parallel processing.
● File blocks are replicated across multiple nodes for fault tolerance.
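The read path and the fault-tolerance claim above can be sketched together: the client asks the namenode for replica locations, then fetches each block from any replica that is still online. The block map and node names below are hypothetical, stand-ins for real namenode metadata.

```python
# Hypothetical namenode metadata: block -> replica locations (illustrative).
block_map = {
    "logs.txt_blk_0": ["datanode1", "datanode2", "datanode3"],
    "logs.txt_blk_1": ["datanode2", "datanode4", "datanode5"],
}

def read_file(file_name, block_map, offline=()):
    """Client read path: look up each block's replicas, then read from
    the first replica that is not offline. Raises only if every replica
    of some block is down."""
    reads = []
    for blk, replicas in sorted(block_map.items()):
        if not blk.startswith(file_name):
            continue
        node = next((n for n in replicas if n not in offline), None)
        if node is None:
            raise IOError(f"all replicas of {blk} are offline")
        reads.append((blk, node))
    return reads

# Even with datanode2 offline, every block is still readable from a replica.
reads = read_file("logs.txt", block_map, offline={"datanode2"})
```

With three replicas per block, any single node failure leaves at least two live copies, so reads succeed without client-visible errors.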
Storing a master dataset with a distributed
file system
● Because files are unmodifiable, we can’t store the entire master dataset
in a single file.
● Instead, the master dataset is spread across many files; each file
contains many serialized data objects.
● New data is appended by uploading a new file containing the new data
objects to the master dataset folder.
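A minimal sketch of this append-by-new-file pattern, using a local folder in place of an HDFS directory and JSON lines in place of a real serialization framework (the file-naming scheme and record format are assumptions for illustration):

```python
import json
import os
import tempfile
import time

def append_to_master(master_dir, records):
    """Append by writing a NEW immutable file of serialized records.
    Existing files in the master dataset folder are never modified."""
    name = f"data-{int(time.time() * 1000)}-{len(os.listdir(master_dir))}.jsonl"
    path = os.path.join(master_dir, name)
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return path

def read_master(master_dir):
    """The master dataset is simply the union of all files in the folder."""
    out = []
    for name in sorted(os.listdir(master_dir)):
        with open(os.path.join(master_dir, name)) as f:
            out.extend(json.loads(line) for line in f)
    return out

master = tempfile.mkdtemp()
append_to_master(master, [{"user": "a"}, {"user": "b"}])
append_to_master(master, [{"user": "c"}])
```

Readers treat the folder as one logical dataset, so appending never requires rewriting or locking existing files.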
Figure: Appending to the master dataset by uploading a new file with new data.
Table: How distributed filesystems meet the storage requirement checklist.
Thanks!