
Storing a master dataset with a distributed file system
Hello!
I am Minakshi Gogoi
You can find me at minakshi_cse@gimt-guwahati.ac.in
Contents
● Hadoop
● Storing a master dataset with a distributed file system
● Distributed file system

Let’s start with the first set of slides
Hadoop
● HDFS and Hadoop MapReduce are the two prongs of the Hadoop project

● A Java framework for distributed storage and distributed processing of large amounts of data.

● Hadoop is deployed across multiple servers, typically called a cluster

● HDFS is a distributed and scalable file system that manages how data is stored across the cluster.
● In an HDFS cluster, there are two types of nodes: a single namenode and multiple datanodes.
● When a file is uploaded to HDFS, the file is first chunked into blocks of a fixed size, typically between 64 MB and 256 MB.
● Each block is then replicated across multiple datanodes (typically three) that are chosen at random.
● The namenode keeps track of the file-to-block mapping and where each block is located.
Figure : Files are chunked into blocks, which are dispersed to datanodes in the cluster
Figure : Clients communicate with the namenode to determine which datanodes hold the blocks for the desired file.
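As a rough sketch of this lookup (the cluster configuration and the file path /data/master/part-0000.dat are illustrative assumptions, not from the slides), the Hadoop Java API exposes the namenode's block map through FileSystem.getFileBlockLocations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockMap {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/master/part-0000.dat"); // hypothetical file
    FileStatus status = fs.getFileStatus(file);
    // Ask the namenode which datanodes hold each block of the file
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + String.join(",", b.getHosts()));
    }
  }
}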
● Distributing a file in this way across many nodes allows it to be easily
processed in parallel.
● When a program needs to access a file stored in HDFS, it contacts the namenode to determine which datanodes host the file contents (a minimal read sketch follows after these bullets).
● Additionally, because each block is replicated across multiple nodes, data remains available even when individual nodes are offline.

● Files are spread across multiple machines for scalability and also to
enable parallel processing.
● File blocks are replicated across multiple nodes for fault tolerance.
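A minimal read sketch under the same assumptions as above: the client only calls FileSystem.open; the namenode lookup and the block reads from datanodes happen inside the HDFS client library.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFile {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() consults the namenode for block locations, then streams
    // each block from one of the datanodes that replicate it
    FSDataInputStream in = fs.open(new Path("/data/master/part-0000.dat"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false); // dump contents to stdout
    } finally {
      in.close();
    }
  }
}

If a datanode holding one replica is offline, the client library falls back to another replica, which is the fault tolerance the bullets above describe.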
Storing a master dataset with a distributed file system
● With unmodifiable files we can't store the entire master dataset in a single file.
● Instead, spread the master dataset among many files, and store all those files in the same folder.
● Each file would contain many serialized data objects.
Figure : Spreading the master dataset throughout many files
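A sketch of how a client sees that folder (the folder name /data/master is an assumed example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListMaster {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // The master dataset is nothing more than the files in one folder
    for (FileStatus part : fs.listStatus(new Path("/data/master"))) {
      System.out.println(part.getPath().getName() + "  " + part.getLen() + " bytes");
    }
  }
}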
● To append to the master dataset, you simply add a new file containing the new data objects to the master dataset folder.
Figure : Appending to the master dataset by uploading a new file with new data
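A sketch of that append, assuming the same illustrative folder; the UUID naming scheme is an assumption chosen so new files never collide with existing ones.

import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendToMaster {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // "Appending" never modifies an existing file: we write a brand-new,
    // uniquely named file into the master dataset folder
    Path newPart = new Path("/data/master/" + UUID.randomUUID() + ".dat");
    FSDataOutputStream out = fs.create(newPart);
    try {
      // placeholder payload; real code would write serialized data objects
      out.write("new data objects\n".getBytes(StandardCharsets.UTF_8));
    } finally {
      out.close();
    }
  }
}

Because existing files are never modified, this scheme keeps the master dataset effectively immutable while still allowing new data to arrive.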
Table : How distributed filesystems meet the storage requirement checklist
Thanks!
