
HDFS and MAPREDUCE

Prof. S. Ramachandram

DISTRIBUTED FILE SYSTEMS

A Distributed File System (DFS) is simply the classical model of a file system (as discussed before) distributed across multiple machines. The purpose is to promote sharing of dispersed files.

This is an area of active research interest today.

The resources on a particular machine are local to it.
Resources on other machines are remote.

A file system provides a service for clients. The server interface is the normal set of file operations: create, read, etc., on files.

Goals
1 Network transparency: users do not have to be aware of the location of files in order to access them.
Location transparency: the name of a file does not reveal anything about the file's physical storage location.
Example: /server1/dir1/dir2/X
server1 can be moved anywhere (e.g., from CIS to SEAS).
Location independence: the name of a file does not need to be changed when the file's physical storage location changes.
Without location independence, the above file X cannot be moved to server2 even if server1 is full and server2 is not so full.

2 High availability: the system should remain available despite failures or scheduled activities such as backups and the addition of nodes.

Architecture

Computation model
file servers -- machines dedicated to storing files and performing
storage and retrieval operations (for high performance)
clients -- machines used for computational activities; may have a local
disk for caching remote files
Two most important services
name server -- maps user-specified names to stored objects, files and
directories
cache manager -- caches files to reduce network and disk delays; problem:
cache inconsistency
Typical data access actions
open, close, read, write, etc.

Design Issues

Naming and name resolution
Semantics of file sharing
Stateless versus stateful servers
Caching -- where to store files
Cache consistency
Replication

Distributed File System - Present Needs


Need to process huge datasets on large
clusters of computers
Very expensive to build reliability into each
application
Nodes fail every day
Failure is expected, rather than exceptional
The number of nodes in a cluster is not
constant

Need a common infrastructure
Efficient, reliable, easy to use
Open source, Apache License

What is Hadoop

Hadoop MapReduce
MapReduce is a programming model and software
framework first developed by Google (Google's
MapReduce paper, published in 2004)
Intended to facilitate and simplify the processing of
vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant
manner
Petabytes of data
Thousands of nodes

Computational processing occurs on both:
Unstructured data: file system
Structured data: database
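To make the programming model concrete, below is a minimal word-count sketch using the Hadoop MapReduce Java API; class names and the tokenization are illustrative, not taken from the slides. The map function emits (word, 1) pairs and the reduce function sums the counts for each word.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
      // Map: for each input line, emit (word, 1) for every token.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable c : counts) sum += c.get();
          context.write(word, new IntWritable(sum));
        }
      }
    }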

Hadoop Distributed File System (HDFS)

Inspired by the Google File System
Scalable, distributed, portable file system written in Java for the Hadoop framework
Primary distributed storage used by Hadoop applications

HDFS can be part of a Hadoop cluster or can be a stand-alone general-purpose distributed file system
An HDFS cluster primarily consists of
a NameNode that manages file system metadata
DataNodes that store the actual data

Stores very large files in blocks across machines in a large cluster
Reliability and fault tolerance ensured by replicating data across multiple hosts
Has data awareness between nodes
Designed to be deployed on low-cost hardware

Assumptions and Goals


Hardware Failure
Hardware failure is the norm rather than the exception.

Streaming Data Access


Applications that run on HDFS need streaming access to
their data sets. The emphasis is on high throughput of
data access rather than low latency of data access.

Large Data Sets (GB to TB)
Simple Coherency Model (a write-once-read-many access model for files)
Moving Computation is Cheaper than Moving Data
Portability Across Heterogeneous Hardware and
Software Platforms

Hadoop Distributed File System
Expects large file size
Small number of large files
Hundreds of MB to GB each

Expects sequential access
Default block size in HDFS is 64 MB (selectable per file; see the sketch below)
Result:
Reduces amount of metadata storage per file
Supports fast streaming of data (large amounts of contiguous data)
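As a sketch of how an application might request a particular block size when creating a file through the Hadoop FileSystem Java API; the path, buffer size, and replication value are illustrative assumptions, not values from the slides.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();    // picks up hdfs-site.xml etc.
        FileSystem fs = FileSystem.get(conf);
        // Create a file with a 64 MB block size, replication 3, 4 KB write buffer.
        FSDataOutputStream out = fs.create(
            new Path("/data/logs/part-0000"),        // hypothetical path
            true,                                    // overwrite if it exists
            4096,                                    // client-side buffer size
            (short) 3,                               // replication factor
            64L * 1024 * 1024);                      // block size in bytes
        out.writeBytes("streamed, contiguous data...\n");
        out.close();
        fs.close();
      }
    }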

Hadoop Distributed File System
HDFS expects to read a block start-to-finish
Useful for MapReduce
Not good for random access
Not a good general-purpose file system

Hadoop Distributed File System
HDFS files are NOT part of the ordinary file system
HDFS files are in a separate namespace
Not possible to interact with HDFS files using ls, cp, mv, etc.
However, HDFS provides similar utilities

Hadoop Distributed File System
Metadata handled by the NameNode
Deals with synchronization by only allowing one machine to handle it
Stores metadata for the entire file system
Not much data: file names, permissions, and the locations of each block of each file

HDFS Architecture
HDFS has a master/slave architecture.
An HDFS cluster consists of a single NameNode, a master server that
manages the file system namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the
cluster, which manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in
files.
Internally, a file is split into one or more blocks and these blocks are stored
in a set of DataNodes.
The NameNode executes file system namespace operations like opening,
closing, and renaming files and directories.
It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the
file system's clients.
The DataNodes also perform block creation, deletion, and replication upon
instruction from the NameNode.
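To illustrate, a client triggers these namespace operations through the Hadoop FileSystem Java API; behind each call the NameNode performs the corresponding namespace operation. A minimal sketch with hypothetical paths:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NamespaceOpsSketch {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        fs.mkdirs(new Path("/user/demo/raw"));                         // create a directory
        fs.rename(new Path("/user/demo/raw"),
                  new Path("/user/demo/input"));                       // rename it
        for (FileStatus s : fs.listStatus(new Path("/user/demo"))) {   // list the namespace
          System.out.println(s.getPath() + " replication=" + s.getReplication());
        }
        fs.delete(new Path("/user/demo/input"), true);                 // recursive delete
        fs.close();
      }
    }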

HDFS Architecture
The NameNode and DataNode are pieces of software designed
to run on commodity machines. These machines typically run
a GNU/Linux operating system (OS).
HDFS is built using the Java language; any machine that
supports Java can run the NameNode or the DataNode
software.
Usage of the highly portable Java language means that HDFS
can be deployed on a wide range of machines.
A typical deployment has a dedicated machine that runs only
the NameNode software.
Each of the other machines in the cluster runs one instance of
the DataNode software
The NameNode is the arbitrator and repository for all HDFS
metadata.

HDFS Architecture
HDFS supports a traditional hierarchical file
organization.
A user or an application can create directories
and store files inside these directories.
The NameNode maintains the file system
namespace.
An application can specify the number of replicas
of a file that should be maintained by HDFS.
The number of copies of a file is called the
replication factor of that file. This information is
stored by the NameNode.
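For example, an application can change a file's replication factor through the FileSystem Java API; a minimal sketch, with a hypothetical path and factor:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationFactorSketch {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/important.dat");   // hypothetical file
        // Ask HDFS to keep 5 replicas of every block of this file;
        // the NameNode records the factor and schedules the extra copies.
        boolean accepted = fs.setReplication(file, (short) 5);
        System.out.println("replication change accepted: " + accepted);
        fs.close();
      }
    }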

HDFS Architecture
An HDFS client wanting to read a file first contacts the
NameNode for the locations of data blocks comprising
the file and then reads block contents from the
DataNode closest to the client.
When writing data, the client requests the NameNode
to nominate a suite of three DataNodes to host the
block replicas.
The client then writes data to the DataNodes in a
pipeline fashion. The current design has a single
NameNode for each cluster.
The cluster can have thousands of DataNodes and
tens of thousands of HDFS clients.
Each DataNode may execute multiple application
tasks concurrently.
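From the application side, this read path is hidden behind an ordinary stream: the block-location lookup at the NameNode and the choice of the closest DataNode happen inside the client library. A minimal sketch using the FileSystem Java API (the file name is hypothetical):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFileSketch {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode for the block locations of the file;
        // the returned stream then reads block data from nearby DataNodes.
        FSDataInputStream in = fs.open(new Path("/user/demo/input/part-0000"));
        IOUtils.copyBytes(in, System.out, 4096, false);   // stream the contents to stdout
        in.close();
        fs.close();
      }
    }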

HDFS Architecture
[Figure: the Namenode holds the metadata (name, replicas, e.g. /home/foo/data, 6, ...); clients send metadata ops to the Namenode and block read/write ops to Datanodes; Datanodes on Rack 1 and Rack 2 replicate blocks among themselves.]


HDFS Architecture
HDFS keeps the entire namespace in RAM.
The inode data and the list of blocks belonging to each file comprise
the metadata of the name system called the image.
The persistent record of the image stored in the local host's native
file system is called a checkpoint.
The NameNode also stores the modification log of the image, called
the journal, in the local host's native file system.
For improved durability, redundant copies of the checkpoint and
journal can be made at other servers.
During restarts the NameNode restores the namespace by reading
the checkpoint and replaying the journal.
The locations of block replicas may change over time and are not
part of the persistent checkpoint.

Architecture - DataNode
Each block replica on a DataNode is represented by
two files in the local host's native file system.
The first file contains the data itself; the second
file is the block's metadata, including checksums for the
block data and the block's generation stamp.
The size of the data file equals the actual length of
the block and does not require extra space to round
it up to the nominal block size as in traditional file
systems.
Thus, if a block is half full it needs only half of the
space of the full block on the local drive.

HDFS Architecture
The HDFS namespace is a hierarchy of files and
directories.
Files and directories are represented on the
NameNode by inodes, which record attributes
like permissions, modification and access times
The file content is split into large blocks
(typically 128 megabytes, but user-selectable file by file).
Each block of the file is independently
replicated at multiple DataNodes (typically 3).

REPLICATION

Replication
Large HDFS instances run on a cluster of computers that is commonly
spread across many racks.
Communication between two nodes in different racks has to go through
switches.
In most cases, network bandwidth between machines in the same rack
is greater than network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via
the process outlined in Hadoop Rack Awareness.
A simple but non-optimal policy is to place replicas on unique racks.
This prevents losing data when an entire rack fails and allows use of
bandwidth from multiple racks when reading data.
This policy evenly distributes replicas in the cluster which makes it easy
to balance load on component failure. However, this policy increases
the cost of writes because a write needs to transfer blocks to multiple
racks.

Replication
For the common case, when the replication factor is three, HDFS's
placement policy is to put one replica on one node in the local rack,
another on a node in a different (remote) rack, and the last on a different
node in the same remote rack.
This policy cuts the inter-rack write traffic, which generally improves write
performance. The chance of rack failure is far less than that of node
failure; this policy does not impact data reliability and availability
guarantees.
However, it does reduce the aggregate network bandwidth used when
reading data since a block is placed in only two unique racks rather than
three.
With this policy, the replicas of a file do not evenly distribute across the
racks. One third of replicas are on one node, two thirds of replicas are on
one rack, and the other third are evenly distributed across the remaining
racks.
This policy improves write performance without compromising data
reliability or read performance.

Replica Selection
To minimize global bandwidth consumption and read latency,
HDFS tries to satisfy a read request from a replica
that is closest to the reader.
If there exists a replica on the same rack as the reader
node, then that replica is preferred to satisfy the read request.
If an HDFS cluster spans multiple data centers, then a
replica that is resident in the local data center is
preferred over any remote replica.

Safemode Startup
On startup Namenode enters Safemode.
Replication of data blocks does not occur in
Safemode.
Each DataNode checks in with Heartbeat and
BlockReport.
Namenode verifies that each block has an acceptable
number of replicas.
After a configurable percentage of safely replicated
blocks check in with the Namenode, Namenode
exits Safemode.
It then makes a list of the blocks that still need to be
replicated.
Namenode then proceeds to replicate these blocks.

Filesystem Metadata
The HDFS namespace is stored by the Namenode.
Namenode uses a transaction log called the
EditLog to record every change that occurs
to the filesystem meta data.
For example, creating a new file or changing the
replication factor of a file.
The EditLog is stored in the Namenode's local
filesystem.

The entire filesystem namespace, including the mapping of blocks to
files and file system properties, is stored in a file called FsImage.
The FsImage is also stored in the Namenode's local filesystem.

Namenode
Keeps image of entire file system namespace and file
Blockmap in memory.
4GB of local RAM is sufficient to support the above
data structures that represent the huge number of
files and directories.
When the Namenode starts up, it gets the FsImage and
EditLog from its local file system, updates the FsImage with
the EditLog information, and then stores a copy of the
FsImage on the filesystem as a checkpoint.
Periodic checkpointing is done so that the system can
recover to the last checkpointed state in case of a crash.

Datanode
A Datanode stores data in files in its local file system.
The Datanode has no knowledge about the HDFS filesystem.
It stores each block of HDFS data in a separate file.
Datanode does not create all files in the same
directory.
It uses heuristics to determine the optimal number of files
per directory and creates directories appropriately:
Research issue?

When the Datanode starts up, it generates a list of all the
HDFS blocks it stores and sends this report to the Namenode:
the Blockreport.

The Communication Protocol
All HDFS communication protocols are layered on top
of the TCP/IP protocol.
A client establishes a connection to a configurable TCP
port on the Namenode machine. It talks ClientProtocol
with the Namenode.
The Datanodes talk to the Namenode using Datanode
protocol.
RPC abstraction wraps both ClientProtocol and
Datanode protocol.
Namenode is simply a server and never initiates a
request; it only responds to RPC requests issued by
DataNodes or clients.

ROBUSTNESS


Objectives
Primary objective of HDFS is to store
data reliably in the presence of
failures.
Three common failures are: Namenode
failure, Datanode failure and network
partition.


DataNode Failure and Heartbeat
A network partition can cause a subset of
Datanodes to lose connectivity with the Namenode.
Namenode detects this condition by the absence of
a Heartbeat message.
Namenode marks Datanodes without a recent Heartbeat as dead and
does not send any IO requests to them.
Any data registered to the failed Datanode is not
available to the HDFS.
Also the death of a Datanode may cause replication
factor of some of the blocks to fall below their
specified value.

Re-replication
The necessity for re-replication may
arise due to:
A Datanode may become unavailable,
A replica may become corrupted,
A hard disk on a Datanode may fail, or
The replication factor on the block may be
increased.


Cluster Rebalancing
HDFS architecture is compatible with data
rebalancing schemes.
A scheme might move data from one Datanode
to another if the free space on a Datanode falls
below a certain threshold.
In the event of a sudden high demand for a
particular file, a scheme might dynamically
create additional replicas and rebalance other
data in the cluster.
These types of data rebalancing are not yet
implemented: research issue.

Data Integrity
Consider a situation: a block of data fetched from
Datanode arrives corrupted.
This corruption may occur because of faults in a
storage device, network faults, or buggy software.
An HDFS client computes a checksum of every
block of its files and stores the checksums in hidden files in the
HDFS namespace.
When a client retrieves the contents of a file, it
verifies that the data matches the corresponding checksums.
If they do not match, the client can retrieve the
block from another replica.
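A minimal sketch of a read with checksum verification through the FileSystem Java API; the path is hypothetical, verification is normally enabled by default, and a ChecksumException would surface to the application only if no intact replica of a block can be found.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ChecksumException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class VerifiedReadSketch {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.setVerifyChecksum(true);   // explicit here; verification is on by default
        byte[] buf = new byte[4096];
        try (FSDataInputStream in = fs.open(new Path("/data/events.log"))) {
          int n = in.read(buf);       // checksums of the covered block data are verified
          System.out.println("read " + n + " verified bytes");
        } catch (ChecksumException e) {
          // Reached only when every replica of some block fails its checksum.
          System.err.println("corrupt block: " + e.getMessage());
        }
        fs.close();
      }
    }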

Metadata Disk Failure
FsImage and EditLog are central data structures of HDFS.
A corruption of these files can cause an HDFS
instance to be non-functional.
For this reason, a Namenode can be configured to
maintain multiple copies of the FsImage and
EditLog.
Multiple copies of the FsImage and EditLog files are
updated synchronously.
Meta-data is not data-intensive.
The Namenode could be a single point of failure:
automatic failover is NOT supported! Another
research topic.

DATA ORGANIZATION


Data Blocks
HDFS supports a write-once-read-many access model,
with reads at streaming speeds.
A typical block size is 64MB (or even
128 MB).
A file is chopped into 64MB chunks
and stored.


Staging
A client request to create a file does not reach the
Namenode immediately.
The HDFS client caches the file data into a temporary local file.
When the accumulated data reaches the HDFS block size, the
client contacts the Namenode.
Namenode inserts the filename into its hierarchy
and allocates a data block for it.
The Namenode responds to the client with the
identity of the Datanode and the destination of the
replicas (Datanodes) for the block.
The client then flushes the block of data from its local temporary file to the designated Datanodes.

Staging (contd.)
The client sends a message that the file is
closed.
The Namenode then commits the file creation operation
into its persistent store.
If the Namenode dies before file is closed,
the file is lost.
This client-side caching is required to avoid
network congestion; it also has precedent in
AFS (the Andrew File System).

Replication Pipelining
When the client receives the response from the
Namenode, it flushes its block in small pieces (4 KB)
to the first replica, which in turn copies it to the
next replica, and so on.
Thus data is pipelined from one Datanode to the next.
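From the client's point of view the pipeline is hidden behind an ordinary output stream: the application just writes bytes, and the library streams them to the first Datanode, which forwards them down the pipeline. A minimal write sketch (path and records are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteFileSketch {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() stages data on the client; filled blocks are then pushed
        // through the Datanode replication pipeline described above.
        FSDataOutputStream out = fs.create(new Path("/user/demo/output/result.txt"));
        for (int i = 0; i < 1000; i++) {
          out.writeBytes("record " + i + "\n");
        }
        out.hflush();   // push buffered data through the pipeline (newer Hadoop releases)
        out.close();    // close commits the file at the Namenode
        fs.close();
      }
    }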


API (ACCESSIBILITY)


Application Programming Interface
HDFS provides a Java API for applications to use.
Python access is also used in many applications.
A C language wrapper for the Java API is also available.
An HTTP browser can be used to browse the files of an HDFS instance.

FS Shell, Admin and Browser Interface
HDFS organizes its data in files and directories.
It provides a command line interface called the FS
shell that lets the user interact with data in the
HDFS.
The syntax of the commands is similar to bash and
csh.
Example: to create a directory /foodir
bin/hadoop dfs -mkdir /foodir
There is also a DFSAdmin interface available.
A browser interface is also available to view the namespace.

Space Reclamation
When a file is deleted by a client, HDFS renames it to a
file in the /trash directory, where it stays for a configurable
amount of time.
A client can request an undelete within this allowed time.
After the specified time the file is deleted and the
space is reclaimed.
When the replication factor is reduced, the Namenode
selects excess replicas that can be deleted.
The next heartbeat transfers this information to the
Datanode, which then removes the corresponding blocks and frees the space for reuse.


Summary
We discussed the features of the
Hadoop File System, a peta-scale file
system to handle big-data sets.
What we discussed: Architecture, Protocol, API, etc.
Missing element: Implementation
The Hadoop file system (internals)
An implementation of an instance of the
HDFS (for use by applications such as web
crawlers).
