HDFS and MapReduce: Prof. S. Ramachandram
DISTRIBUTED FILE SYSTEMS
Goals
Network transparency: users do not have to be aware of the location of files in order to access them.
Location transparency: the name of a file does not reveal the file's physical storage location.
/server1/dir1/dir2/X
server1 can be moved anywhere (e.g., from CIS to SEAS).
Location independence: the name of a file does not need to be changed when the file's physical storage location changes.
Without location independence, the file X above cannot be moved to server2 even if server1 is full and server2 is not.
Architecture
Computation model
File servers -- machines dedicated to storing files and performing storage and retrieval operations (for high performance).
Clients -- machines used for computational activities; they may have a local disk for caching remote files.
Two most important services
Name server -- maps user-specified names to stored objects, files, and directories.
Cache manager -- caches files to reduce network and disk delays; it introduces the problem of cache inconsistency.
Typical data access actions
open, close, read, write, etc.
Design Issues
What is Hadoop?
Hadoop MapReduce
MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004).
Intended to facilitate and simplify the processing of
vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant
manner
Petabytes of data
Thousands of nodes
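The MapReduce model can be illustrated without Hadoop at all. The sketch below is plain Python, not Hadoop's actual API; it simulates the map, shuffle, and reduce phases for a word count, the canonical MapReduce example.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"], counts["fox"])  # 3 2
```

In real Hadoop, the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; the logical data flow is the same.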
HDFS Architecture
HDFS has a master/slave architecture.
An HDFS cluster consists of a single NameNode, a master server that
manages the file system namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the
cluster, which manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in
files.
Internally, a file is split into one or more blocks and these blocks are stored
in a set of DataNodes.
The NameNode executes file system namespace operations like opening,
closing, and renaming files and directories.
It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system's clients.
The DataNodes also perform block creation, deletion, and replication upon
instruction from the NameNode
HDFS Architecture
The NameNode and DataNode are pieces of software designed
to run on commodity machines. These machines typically run
a GNU/Linux operating system (OS).
HDFS is built using the Java language; any machine that
supports Java can run the NameNode or the DataNode
software.
Usage of the highly portable Java language means that HDFS
can be deployed on a wide range of machines.
A typical deployment has a dedicated machine that runs only
the NameNode software.
Each of the other machines in the cluster runs one instance of
the DataNode software
The NameNode is the arbitrator and repository for all HDFS
metadata.
HDFS Architecture
HDFS supports a traditional hierarchical file
organization.
A user or an application can create directories
and store files inside these directories.
The NameNode maintains the file system
namespace.
An application can specify the number of replicas
of a file that should be maintained by HDFS.
The number of copies of a file is called the
replication factor of that file. This information is
stored by the NameNode.
HDFS Architecture
An HDFS client wanting to read a file first contacts the
NameNode for the locations of data blocks comprising
the file and then reads block contents from the
DataNode closest to the client.
When writing data, the client requests the NameNode
to nominate a suite of three DataNodes to host the
block replicas.
The client then writes data to the DataNodes in a
pipeline fashion. The current design has a single
NameNode for each cluster.
A cluster can have thousands of DataNodes and tens of thousands of HDFS clients.
Each DataNode may execute multiple application
tasks concurrently.
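The read path above can be sketched as a toy model (hypothetical classes, not the real HDFS client library): the client asks a NameNode object for block locations, then reads each block from the first (closest) DataNode in the returned list.

```python
# Toy model of the HDFS read path; class and method names are illustrative.
class NameNode:
    def __init__(self):
        # filename -> list of (block_id, replica locations); replicas are
        # ordered closest-first for the requesting client.
        self.block_map = {}

    def get_block_locations(self, filename):
        return self.block_map[filename]

class DataNode:
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def hdfs_read(namenode, filename):
    """Contact the NameNode once, then read each block from the closest replica."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(filename):
        closest = replicas[0]          # replicas are ordered by proximity
        data += closest.read_block(block_id)
    return data

# Wire up a one-file, two-block, two-DataNode example.
dn1, dn2 = DataNode(), DataNode()
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"world"
nn = NameNode()
nn.block_map["/user/foo/file"] = [("blk_1", [dn1, dn2]), ("blk_2", [dn2, dn1])]
print(hdfs_read(nn, "/user/foo/file"))  # b'hello world'
```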
HDFS Architecture
[Figure: HDFS architecture. A client issues metadata ops (e.g., /home/foo/data, replication factor 6) to the NameNode, and block read/write ops to the DataNodes; blocks are replicated across DataNodes on Rack 1 and Rack 2.]
12/28/16
HDFS Architecture
HDFS keeps the entire namespace in RAM.
The inode data and the list of blocks belonging to each file comprise
the metadata of the name system called the image.
The persistent record of the image stored in the local host's native file system is called a checkpoint.
The NameNode also stores the modification log of the image, called the journal, in the local host's native file system.
For improved durability, redundant copies of the checkpoint and
journal can be made at other servers.
During restarts the NameNode restores the namespace by reading the checkpoint and replaying the journal.
The locations of block replicas may change over time and are not part of the persistent checkpoint.
Architecture -- DataNode
Each block replica on a DataNode is represented by two files in the local host's native file system.
The first file contains the data itself, and the second file is the block's metadata, including checksums for the block data and the block's generation stamp.
The size of the data file equals the actual length of
the block and does not require extra space to round
it up to the nominal block size as in traditional file
systems.
Thus, if a block is half full it needs only half of the
space of the full block on the local drive.
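A minimal sketch of this two-file layout (paths and format are illustrative, not HDFS's actual on-disk format): the data file holds the raw bytes at their true length, and the metadata file holds a checksum and a generation stamp.

```python
import os, tempfile, zlib

def write_replica(directory, block_id, data, generation_stamp):
    """Store a block replica as two files: the data and its metadata."""
    data_path = os.path.join(directory, f"blk_{block_id}")
    meta_path = data_path + ".meta"
    with open(data_path, "wb") as f:
        f.write(data)                      # file size == actual block length
    with open(meta_path, "w") as f:
        f.write(f"{zlib.crc32(data)} {generation_stamp}\n")
    return data_path, meta_path

def verify_replica(data_path, meta_path):
    """Recompute the checksum and compare it with the stored one."""
    with open(data_path, "rb") as f:
        data = f.read()
    stored_crc = int(open(meta_path).read().split()[0])
    return zlib.crc32(data) == stored_crc

d = tempfile.mkdtemp()
data_path, meta_path = write_replica(d, 1, b"x" * 1000, generation_stamp=7)
print(os.path.getsize(data_path))            # 1000: no padding to a nominal block size
print(verify_replica(data_path, meta_path))  # True
```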
HDFS Architecture
The HDFS namespace is a hierarchy of files and
directories.
Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times.
The file content is split into large blocks (typically 128 megabytes, but user-selectable file by file).
Each block of the file is independently replicated at multiple DataNodes (typically three).
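Block splitting itself is simple; the sketch below chops a byte string into fixed-size blocks (a tiny block size is used for illustration, where HDFS would use 64 or 128 MB).

```python
def split_into_blocks(data, block_size):
    """Split file content into blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"a" * 300, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44]
```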
REPLICATION
Replication
Large HDFS instances run on a cluster of computers that is commonly spread across many racks.
Communication between two nodes in different racks has to go through
switches.
In most cases, network bandwidth between machines in the same rack
is greater than network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via
the process outlined in Hadoop Rack Awareness.
A simple but non-optimal policy is to place replicas on unique racks.
This prevents losing data when an entire rack fails and allows use of
bandwidth from multiple racks when reading data.
This policy evenly distributes replicas in the cluster which makes it easy
to balance load on component failure. However, this policy increases
the cost of writes because a write needs to transfer blocks to multiple
racks.
Replication
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack.
This policy cuts the interrack write traffic which generally improves write
performance. The chance of rack failure is far less than that of node
failure; this policy does not impact data reliability and availability
guarantees.
However, it does reduce the aggregate network bandwidth used when
reading data since a block is placed in only two unique racks rather than
three.
With this policy, the replicas of a file do not distribute evenly across the racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are distributed across the remaining racks.
This policy improves write performance without compromising data
reliability or read performance.
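The default placement for replication factor three can be sketched as follows. This is a simplified model; the real placement logic also considers node load, free space, and other constraints.

```python
import random

def place_replicas(nodes_by_rack, writer_rack):
    """Pick 3 targets: local rack, a remote rack, and a 2nd node on that remote rack."""
    first = random.choice(nodes_by_rack[writer_rack])
    remote_racks = [r for r in nodes_by_rack if r != writer_rack]
    remote = random.choice(remote_racks)
    second, third = random.sample(nodes_by_rack[remote], 2)
    return [first, second, third]

nodes_by_rack = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8"],
}
targets = place_replicas(nodes_by_rack, writer_rack="rack1")
print(targets)  # e.g. ['dn2', 'dn6', 'dn4']
```

Note that only two unique racks are used, which is exactly why the policy trades a little read bandwidth for cheaper writes.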
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader.
If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request.
If an HDFS cluster spans multiple data
centers, then a replica that is resident in the
local data center is preferred over any remote
replica.
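Closest-replica selection can be modeled with a simple distance function (illustrative only, not the actual network-topology code):

```python
def distance(reader, replica):
    """0 = same node, 1 = same rack, 2 = same data center, 3 = remote."""
    if replica["node"] == reader["node"]:
        return 0
    if replica["rack"] == reader["rack"]:
        return 1
    if replica["dc"] == reader["dc"]:
        return 2
    return 3

def select_replica(reader, replicas):
    """Pick the replica with the smallest distance to the reader."""
    return min(replicas, key=lambda r: distance(reader, r))

reader = {"node": "n1", "rack": "r1", "dc": "dc1"}
replicas = [
    {"node": "n9", "rack": "r9", "dc": "dc2"},  # remote data center
    {"node": "n2", "rack": "r1", "dc": "dc1"},  # same rack as reader
    {"node": "n5", "rack": "r3", "dc": "dc1"},  # same data center only
]
print(select_replica(reader, replicas)["node"])  # n2
```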
Safemode Startup
On startup, the Namenode enters Safemode.
Replication of data blocks does not occur in Safemode.
Each Datanode checks in with a Heartbeat and a BlockReport.
The Namenode verifies that each block has an acceptable number of replicas.
After a configurable percentage of safely replicated blocks have checked in with the Namenode, the Namenode exits Safemode.
It then makes a list of blocks that need to be replicated.
The Namenode then proceeds to replicate these blocks.
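The Safemode exit condition reduces to a threshold check; a sketch (the function name, the default minimum replica count, and the threshold value are illustrative, not HDFS's actual defaults):

```python
def can_exit_safemode(replica_counts, min_replicas=1, threshold=0.999):
    """Exit Safemode once the fraction of safely replicated blocks meets the threshold.

    replica_counts maps block id -> number of replicas reported so far.
    """
    if not replica_counts:
        return True
    safe = sum(1 for n in replica_counts.values() if n >= min_replicas)
    return safe / len(replica_counts) >= threshold

blocks = {f"blk_{i}": 3 for i in range(1000)}
print(can_exit_safemode(blocks))   # True: every block safely replicated
blocks["blk_0"] = 0
blocks["blk_1"] = 0
print(can_exit_safemode(blocks))   # False: 998/1000 is below the threshold
```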
Filesystem Metadata
The HDFS namespace is stored by the Namenode.
The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata.
For example, creating a new file or changing the replication factor of a file.
The EditLog is stored in the Namenode's local filesystem.
Namenode
Keeps an image of the entire file system namespace and file Blockmap in memory.
4 GB of local RAM is sufficient to support these data structures, which represent a huge number of files and directories.
When the Namenode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and then stores a copy of the FsImage on the filesystem as a checkpoint.
Periodic checkpointing is done so that the system can recover to the last checkpointed state in case of a crash.
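The checkpoint-plus-journal recovery scheme can be sketched as an in-memory toy; the real FsImage and EditLog are binary formats, and the class and operation names below are illustrative.

```python
class ToyNameNode:
    """Toy model of FsImage + EditLog recovery; not HDFS's real on-disk format."""
    def __init__(self, fsimage, editlog):
        # Start from the last checkpoint (FsImage): path -> replication factor...
        self.namespace = dict(fsimage)
        # ...then replay every logged transaction (EditLog) in order.
        for op, path, value in editlog:
            if op == "create":
                self.namespace[path] = value
            elif op == "set_replication":
                self.namespace[path] = value
            elif op == "delete":
                del self.namespace[path]

fsimage = {"/a": 3, "/b": 3}                  # state at the last checkpoint
editlog = [("create", "/c", 3),
           ("set_replication", "/a", 6),
           ("delete", "/b", None)]
nn = ToyNameNode(fsimage, editlog)
print(nn.namespace)  # {'/a': 6, '/c': 3}
```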
Datanode
A Datanode stores data in files in its local file system.
A Datanode has no knowledge of the HDFS filesystem.
It stores each block of HDFS data in a separate file.
A Datanode does not create all files in the same directory.
It uses heuristics to determine the optimal number of files per directory and creates directories appropriately: a research issue?
The Communication Protocol
All HDFS communication protocols are layered on top of the TCP/IP protocol.
A client establishes a connection to a configurable TCP
port on the Namenode machine. It talks ClientProtocol
with the Namenode.
The Datanodes talk to the Namenode using Datanode
protocol.
An RPC abstraction wraps both the ClientProtocol and the Datanode protocol.
Namenode is simply a server and never initiates a
request; it only responds to RPC requests issued by
DataNodes or clients.
ROBUSTNESS
Objectives
The primary objective of HDFS is to store data reliably even in the presence of failures.
Three common failures are: Namenode
failure, Datanode failure and network
partition.
Re-replication
The necessity for re-replication may
arise due to:
A Datanode may become unavailable,
A replica may become corrupted,
A hard disk on a Datanode may fail, or
The replication factor on the block may be
increased.
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes.
A scheme might move data from one Datanode
to another if the free space on a Datanode falls
below a certain threshold.
In the event of a sudden high demand for a
particular file, a scheme might dynamically
create additional replicas and rebalance other
data in the cluster.
These types of data rebalancing are not yet
implemented: research issue.
Data Integrity
Consider a situation: a block of data fetched from
Datanode arrives corrupted.
This corruption may occur because of faults in a
storage device, network faults, or buggy software.
An HDFS client computes the checksum of every block of its file and stores it in hidden files in the HDFS namespace.
When a client retrieves the contents of a file, it verifies that the corresponding checksums match.
If they do not match, the client can retrieve the block from another replica.
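A sketch of this verify-and-fall-back read (illustrative; real HDFS checksums data in small chunks, by default with CRC32):

```python
import zlib

def read_with_verification(replicas, expected_crc):
    """Try replicas in order; return the first whose checksum matches."""
    for data in replicas:
        if zlib.crc32(data) == expected_crc:
            return data
    raise IOError("all replicas corrupted")

good = b"block contents"
corrupted = b"block c0ntents"
crc = zlib.crc32(good)
print(read_with_verification([corrupted, good], crc))  # b'block contents'
```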
DATA ORGANIZATION
Data Blocks
HDFS supports write-once-read-many access, with reads at streaming speeds.
A typical block size is 64 MB (or even 128 MB).
A file is chopped into 64 MB chunks and stored.
Staging
A client request to create a file does not reach the Namenode immediately.
The HDFS client caches the data into a temporary file.
When the data reaches an HDFS block size, the client contacts the Namenode.
The Namenode inserts the filename into its hierarchy and allocates a data block for it.
The Namenode responds to the client with the identity of the Datanode and the destination of the replicas (Datanodes) for the block.
Then the client flushes the block from its local memory.
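Client-side staging amounts to buffering writes locally until a full block accumulates; a sketch (class name and callback are hypothetical):

```python
class StagingClient:
    """Buffer writes locally; contact the 'NameNode' only when a block fills up."""
    def __init__(self, block_size, allocate_block):
        self.block_size = block_size
        self.allocate_block = allocate_block   # callback standing in for the NameNode RPC
        self.buffer = b""
        self.flushed_blocks = []

    def write(self, data):
        self.buffer += data
        while len(self.buffer) >= self.block_size:
            block = self.buffer[:self.block_size]
            self.buffer = self.buffer[self.block_size:]
            target = self.allocate_block()     # NameNode picks the target Datanodes
            self.flushed_blocks.append((target, block))

calls = []
client = StagingClient(block_size=4, allocate_block=lambda: calls.append(1) or "dn1")
client.write(b"ab")        # stays in the local buffer; no NameNode contact yet
print(len(calls))          # 0
client.write(b"cdefgh")    # two full blocks now exist, so two allocations happen
print(len(calls))          # 2
```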
Staging (contd.)
The client sends a message that the file is closed.
The Namenode proceeds to commit the file-creation operation into the persistent store.
If the Namenode dies before the file is closed, the file is lost.
This client-side caching is required to avoid network congestion; it also has precedent in AFS (the Andrew File System).
Replication Pipelining
When the client receives the response from the Namenode, it flushes its block in small pieces (4 KB) to the first replica, which in turn copies each piece to the next replica, and so on.
Thus data is pipelined from one Datanode to the next.
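Pipelined replication forwards each small piece along the chain of replicas; a sequential sketch (in real HDFS each node forwards concurrently while receiving):

```python
def pipeline_write(block, replicas, piece_size=4):
    """Send the block in small pieces; each replica passes the piece onward."""
    for i in range(0, len(block), piece_size):
        piece = block[i:i + piece_size]
        for store in replicas:        # piece flows down the replica chain
            store.append(piece)

r1, r2, r3 = [], [], []
pipeline_write(b"0123456789", [r1, r2, r3], piece_size=4)
print(b"".join(r1) == b"".join(r2) == b"".join(r3) == b"0123456789")  # True
```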
API (ACCESSIBILITY)
Application Programming Interface
HDFS provides a Java API for applications to use.
Python access is also used in many applications.
A C-language wrapper for the Java API is also available.
An HTTP browser can be used to browse the files of an HDFS instance.
Space Reclamation
When a file is deleted by a client, HDFS renames the file into the /trash directory, where it stays for a configurable amount of time.
A client can request an undelete within this allowed time.
After the specified time, the file is deleted and the space is reclaimed.
When the replication factor is reduced, the Namenode selects excess replicas that can be deleted.
The next heartbeat transfers this information to the Datanode, which then clears the blocks for reuse.
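Trash-based deletion can be sketched as a rename plus an expiry timestamp (an in-memory toy; in real Hadoop the retention window is a configurable interval):

```python
class ToyTrash:
    """Toy model of HDFS trash: delete = move into trash with an expiry time."""
    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.files = {}      # path -> data
        self.trash = {}      # original path -> (data, expiry time)

    def delete(self, path, now):
        self.trash[path] = (self.files.pop(path), now + self.retention)

    def undelete(self, path, now):
        data, expiry = self.trash[path]
        if now < expiry:                 # still within the allowed window
            del self.trash[path]
            self.files[path] = data
            return True
        return False

    def reclaim(self, now):
        """Permanently drop trashed files whose retention period has passed."""
        self.trash = {p: v for p, v in self.trash.items() if v[1] > now}

t = ToyTrash(retention_seconds=60)
t.files["/user/foo/x"] = b"data"
t.delete("/user/foo/x", now=0)
print(t.undelete("/user/foo/x", now=30))   # True: still within the window
t.delete("/user/foo/x", now=0)
t.reclaim(now=120)
print(t.trash)                             # {}: space reclaimed
```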
Summary
We discussed the features of the Hadoop Distributed File System, a peta-scale file system for handling big-data sets.
What we discussed: architecture, protocols, API, etc.
Missing elements: the implementation of the Hadoop file system (its internals), and an implementation of an instance of HDFS (for use by applications such as web crawlers).