HDFS and MapReduce: Prof. S. Ramachandram
DISTRIBUTED FILE SYSTEMS
Goals
Network transparency: users do not have to be aware of the location of files in order to access them.
Location transparency: the name of a file does not reveal the file's physical storage location.
/server1/dir1/dir2/X
server1 can be moved anywhere (e.g., from CIS to SEAS).
Location independence: the name of a file does not need to be changed when the file's physical storage location changes.
Without location independence, the file X above cannot be moved to server2 even if server1 is full and server2 is not.
Architecture
Computation model
File servers -- machines dedicated to storing files and performing storage and retrieval operations (for high performance).
Clients -- machines used for computational activities; they may have a local disk for caching remote files.
Two most important services
Name server -- maps user-specified names to stored objects, files, and directories.
Cache manager -- caches files to reduce network and disk delays; it introduces the problem of cache inconsistency.
Typical data access actions
open, close, read, write, etc.
Design Issues
What is Hadoop?
Hadoop MapReduce
MapReduce is a programming model and software framework first developed by Google (Google's MapReduce paper was published in 2004).
Intended to facilitate and simplify the processing of
vast amounts of data in parallel on large clusters of
commodity hardware in a reliable, fault-tolerant
manner
Petabytes of data
Thousands of nodes
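The MapReduce model can be illustrated without Hadoop at all. The sketch below is plain Python, not Hadoop's actual API; it simulates the map, shuffle, and reduce phases for a word count, the canonical MapReduce example.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"], counts["fox"])  # 3 2
```

In real Hadoop, the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; the logical data flow is the same.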
HDFS Architecture
HDFS has a master/slave architecture.
An HDFS cluster consists of a single NameNode, a master server that
manages the file system namespace and regulates access to files by clients.
In addition, there are a number of DataNodes, usually one per node in the
cluster, which manage storage attached to the nodes that they run on.
HDFS exposes a file system namespace and allows user data to be stored in
files.
Internally, a file is split into one or more blocks and these blocks are stored
in a set of DataNodes.
The NameNode executes file system namespace operations like opening,
closing, and renaming files and directories.
It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system's clients.
The DataNodes also perform block creation, deletion, and replication upon
instruction from the NameNode
HDFS Architecture
The NameNode and DataNode are pieces of software designed
to run on commodity machines. These machines typically run
a GNU/Linux operating system (OS).
HDFS is built using the Java language; any machine that
supports Java can run the NameNode or the DataNode
software.
Usage of the highly portable Java language means that HDFS
can be deployed on a wide range of machines.
A typical deployment has a dedicated machine that runs only
the NameNode software.
Each of the other machines in the cluster runs one instance of
the DataNode software
The NameNode is the arbitrator and repository for all HDFS
metadata.
HDFS Architecture
HDFS supports a traditional hierarchical file
organization.
A user or an application can create directories
and store files inside these directories.
The NameNode maintains the file system
namespace.
An application can specify the number of replicas
of a file that should be maintained by HDFS.
The number of copies of a file is called the
replication factor of that file. This information is
stored by the NameNode.
HDFS Architecture
An HDFS client wanting to read a file first contacts the
NameNode for the locations of data blocks comprising
the file and then reads block contents from the
DataNode closest to the client.
When writing data, the client requests the NameNode
to nominate a suite of three DataNodes to host the
block replicas.
The client then writes data to the DataNodes in a
pipeline fashion. The current design has a single
NameNode for each cluster.
A cluster can have thousands of DataNodes and tens of thousands of HDFS clients.
Each DataNode may execute multiple application
tasks concurrently.
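The read path above can be sketched as a toy model (hypothetical classes, not the real HDFS client library): the client asks a NameNode object for block locations, then reads each block from the first (closest) DataNode in the returned list.

```python
# Toy model of the HDFS read path; class and method names are illustrative.
class NameNode:
    def __init__(self):
        # filename -> list of (block_id, replica locations); replicas are
        # ordered closest-first for the requesting client.
        self.block_map = {}

    def get_block_locations(self, filename):
        return self.block_map[filename]

class DataNode:
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def hdfs_read(namenode, filename):
    """Contact the NameNode once, then read each block from the closest replica."""
    data = b""
    for block_id, replicas in namenode.get_block_locations(filename):
        closest = replicas[0]          # replicas are ordered by proximity
        data += closest.read_block(block_id)
    return data

# Wire up a one-file, two-block, two-DataNode example.
dn1, dn2 = DataNode(), DataNode()
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"world"
nn = NameNode()
nn.block_map["/user/foo/file"] = [("blk_1", [dn1, dn2]), ("blk_2", [dn2, dn1])]
print(hdfs_read(nn, "/user/foo/file"))  # b'hello world'
```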
HDFS Architecture
[Figure: HDFS architecture. A client issues metadata ops (e.g., /home/foo/data, replication factor 6) to the NameNode, and block read/write ops to the DataNodes; blocks are replicated across DataNodes on Rack 1 and Rack 2.]
12/28/16
HDFS Architecture
HDFS keeps the entire namespace in RAM.
The inode data and the list of blocks belonging to each file comprise
the metadata of the name system called the image.
The persistent record of the image stored in the local host's native file system is called a checkpoint.
The NameNode also stores the modification log of the image, called the journal, in the local host's native file system.
For improved durability, redundant copies of the checkpoint and
journal can be made at other servers.
During restarts the NameNode restores the namespace by reading the checkpoint and replaying the journal.
The locations of block replicas may change over time and are not part of the persistent checkpoint.
Architecture -- DataNode
Each block replica on a DataNode is represented by two files in the local host's native file system.
The first file contains the data itself, and the second file is the block's metadata, including checksums for the block data and the block's generation stamp.
The size of the data file equals the actual length of
the block and does not require extra space to round
it up to the nominal block size as in traditional file
systems.
Thus, if a block is half full it needs only half of the
space of the full block on the local drive.
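A minimal sketch of this two-file layout (paths and format are illustrative, not HDFS's actual on-disk format): the data file holds the raw bytes at their true length, and the metadata file holds a checksum and a generation stamp.

```python
import os, tempfile, zlib

def write_replica(directory, block_id, data, generation_stamp):
    """Store a block replica as two files: the data and its metadata."""
    data_path = os.path.join(directory, f"blk_{block_id}")
    meta_path = data_path + ".meta"
    with open(data_path, "wb") as f:
        f.write(data)                      # file size == actual block length
    with open(meta_path, "w") as f:
        f.write(f"{zlib.crc32(data)} {generation_stamp}\n")
    return data_path, meta_path

def verify_replica(data_path, meta_path):
    """Recompute the checksum and compare it with the stored one."""
    with open(data_path, "rb") as f:
        data = f.read()
    stored_crc = int(open(meta_path).read().split()[0])
    return zlib.crc32(data) == stored_crc

d = tempfile.mkdtemp()
data_path, meta_path = write_replica(d, 1, b"x" * 1000, generation_stamp=7)
print(os.path.getsize(data_path))            # 1000: no padding to a nominal block size
print(verify_replica(data_path, meta_path))  # True
```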
HDFS Architecture
The HDFS namespace is a hierarchy of files and
directories.
Files and directories are represented on the NameNode by inodes, which record attributes like permissions, modification and access times.
The file content is split into large blocks (typically 128 megabytes, but user-selectable file by file).
Each block of the file is independently replicated at multiple DataNodes (typically three).
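Block splitting itself is simple; the sketch below chops a byte string into fixed-size blocks (a tiny block size is used for illustration, where HDFS would use 64 or 128 MB).

```python
def split_into_blocks(data, block_size):
    """Split file content into blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"a" * 300, block_size=128)
print([len(b) for b in blocks])  # [128, 128, 44]
```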
REPLICATION
Replication
Large HDFS instances run on a cluster of computers that is commonly spread across many racks.
Communication between two nodes in different racks has to go through
switches.
In most cases, network bandwidth between machines in the same rack
is greater than network bandwidth between machines in different racks.
The NameNode determines the rack id each DataNode belongs to via
the process outlined in Hadoop Rack Awareness.
A simple but non-optimal policy is to place replicas on unique racks.
This prevents losing data when an entire rack fails and allows use of
bandwidth from multiple racks when reading data.
This policy evenly distributes replicas in the cluster which makes it easy
to balance load on component failure. However, this policy increases
the cost of writes because a write needs to transfer blocks to multiple
racks.
Replication
For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack.
This policy cuts the interrack write traffic which generally improves write
performance. The chance of rack failure is far less than that of node
failure; this policy does not impact data reliability and availability
guarantees.
However, it does reduce the aggregate network bandwidth used when
reading data since a block is placed in only two unique racks rather than
three.
With this policy, the replicas of a file do not distribute evenly across the racks: one third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third are distributed across the remaining racks.
This policy improves write performance without compromising data
reliability or read performance.
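The default placement for replication factor three can be sketched as follows. This is a simplified model; the real placement logic also considers node load, free space, and other constraints.

```python
import random

def place_replicas(nodes_by_rack, writer_rack):
    """Pick 3 targets: local rack, a remote rack, and a 2nd node on that remote rack."""
    first = random.choice(nodes_by_rack[writer_rack])
    remote_racks = [r for r in nodes_by_rack if r != writer_rack]
    remote = random.choice(remote_racks)
    second, third = random.sample(nodes_by_rack[remote], 2)
    return [first, second, third]

nodes_by_rack = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8"],
}
targets = place_replicas(nodes_by_rack, writer_rack="rack1")
print(targets)  # e.g. ['dn2', 'dn6', 'dn4']
```

Note that only two unique racks are used, which is exactly why the policy trades a little read bandwidth for cheaper writes.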
Replica Selection
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from the replica that is closest to the reader.
If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request.
If an HDFS cluster spans multiple data
centers, then a replica that is resident in the
local data center is preferred over any remote
replica.
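Closest-replica selection can be modeled with a simple distance function (illustrative only, not the actual network-topology code):

```python
def distance(reader, replica):
    """0 = same node, 1 = same rack, 2 = same data center, 3 = remote."""
    if replica["node"] == reader["node"]:
        return 0
    if replica["rack"] == reader["rack"]:
        return 1
    if replica["dc"] == reader["dc"]:
        return 2
    return 3

def select_replica(reader, replicas):
    """Pick the replica with the smallest distance to the reader."""
    return min(replicas, key=lambda r: distance(reader, r))

reader = {"node": "n1", "rack": "r1", "dc": "dc1"}
replicas = [
    {"node": "n9", "rack": "r9", "dc": "dc2"},  # remote data center
    {"node": "n2", "rack": "r1", "dc": "dc1"},  # same rack as reader
    {"node": "n5", "rack": "r3", "dc": "dc1"},  # same data center only
]
print(select_replica(reader, replicas)["node"])  # n2
```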
Safemode Startup
On startup, the Namenode enters Safemode.
Replication of data blocks does not occur in Safemode.
Each Datanode checks in with a Heartbeat and a BlockReport.
The Namenode verifies that each block has an acceptable number of replicas.
After a configurable percentage of safely replicated blocks have checked in with the Namenode, the Namenode exits Safemode.
It then makes a list of blocks that need to be replicated.
The Namenode then proceeds to replicate these blocks.
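The Safemode exit condition reduces to a threshold check; a sketch (the function name, the default minimum replica count, and the threshold value are illustrative, not HDFS's actual defaults):

```python
def can_exit_safemode(replica_counts, min_replicas=1, threshold=0.999):
    """Exit Safemode once the fraction of safely replicated blocks meets the threshold.

    replica_counts maps block id -> number of replicas reported so far.
    """
    if not replica_counts:
        return True
    safe = sum(1 for n in replica_counts.values() if n >= min_replicas)
    return safe / len(replica_counts) >= threshold

blocks = {f"blk_{i}": 3 for i in range(1000)}
print(can_exit_safemode(blocks))   # True: every block safely replicated
blocks["blk_0"] = 0
blocks["blk_1"] = 0
print(can_exit_safemode(blocks))   # False: 998/1000 is below the threshold
```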
Filesystem Metadata
The HDFS namespace is stored by the Namenode.
The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata.
For example, creating a new file or changing the replication factor of a file.
The EditLog is stored in the Namenode's local filesystem.
Namenode
Keeps an image of the entire file system namespace and file Blockmap in memory.
4 GB of local RAM is sufficient to support these data structures, which represent a huge number of files and directories.
When the Namenode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and then stores a copy of the FsImage on the filesystem as a checkpoint.
Periodic checkpointing is done so that the system can recover to the last checkpointed state in case of a crash.
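The checkpoint-plus-journal recovery scheme can be sketched as an in-memory toy; the real FsImage and EditLog are binary formats, and the class and operation names below are illustrative.

```python
class ToyNameNode:
    """Toy model of FsImage + EditLog recovery; not HDFS's real on-disk format."""
    def __init__(self, fsimage, editlog):
        # Start from the last checkpoint (FsImage): path -> replication factor...
        self.namespace = dict(fsimage)
        # ...then replay every logged transaction (EditLog) in order.
        for op, path, value in editlog:
            if op == "create":
                self.namespace[path] = value
            elif op == "set_replication":
                self.namespace[path] = value
            elif op == "delete":
                del self.namespace[path]

fsimage = {"/a": 3, "/b": 3}                  # state at the last checkpoint
editlog = [("create", "/c", 3),
           ("set_replication", "/a", 6),
           ("delete", "/b", None)]
nn = ToyNameNode(fsimage, editlog)
print(nn.namespace)  # {'/a': 6, '/c': 3}
```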
Datanode
A Datanode stores data in files in its local file system.
A Datanode has no knowledge of the HDFS filesystem.
It stores each block of HDFS data in a separate file.
A Datanode does not create all files in the same directory.
It uses heuristics to determine the optimal number of files per directory and creates directories appropriately: a research issue?
The Communication Protocol
All HDFS communication protocols are layered on top of the TCP/IP protocol.
A client establishes a connection to a configurable TCP
port on the Namenode machine. It talks ClientProtocol
with the Namenode.
The Datanodes talk to the Namenode using Datanode
protocol.
An RPC abstraction wraps both the ClientProtocol and the Datanode protocol.
Namenode is simply a server and never initiates a
request; it only responds to RPC requests issued by
DataNodes or clients.
ROBUSTNESS
Objectives
The primary objective of HDFS is to store data reliably even in the presence of failures.
Three common failures are: Namenode
failure, Datanode failure and network
partition.
Re-replication
The necessity for re-replication may
arise due to:
A Datanode may become unavailable,
A replica may become corrupted,
A hard disk on a Datanode may fail, or
The replication factor on the block may be
increased.
Cluster Rebalancing
The HDFS architecture is compatible with data rebalancing schemes.
A scheme might move data from one Datanode
to another if the free space on a Datanode falls
below a certain threshold.
In the event of a sudden high demand for a
particular file, a scheme might dynamically
create additional replicas and rebalance other
data in the cluster.
These types of data rebalancing are not yet
implemented: research issue.
Data Integrity
Consider a situation: a block of data fetched from
Datanode arrives corrupted.
This corruption may occur because of faults in a
storage device, network faults, or buggy software.
An HDFS client computes the checksum of every block of its file and stores it in hidden files in the HDFS namespace.
When a client retrieves the contents of a file, it verifies that the corresponding checksums match.
If they do not match, the client can retrieve the block from another replica.
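A sketch of this verify-and-fall-back read (illustrative; real HDFS checksums data in small chunks, by default with CRC32):

```python
import zlib

def read_with_verification(replicas, expected_crc):
    """Try replicas in order; return the first whose checksum matches."""
    for data in replicas:
        if zlib.crc32(data) == expected_crc:
            return data
    raise IOError("all replicas corrupted")

good = b"block contents"
corrupted = b"block c0ntents"
crc = zlib.crc32(good)
print(read_with_verification([corrupted, good], crc))  # b'block contents'
```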
DATA ORGANIZATION
Data Blocks
HDFS supports write-once-read-many access, with reads at streaming speeds.
A typical block size is 64 MB (or even 128 MB).
A file is chopped into 64 MB chunks and stored.
Staging
A client request to create a file does not reach the Namenode immediately.
The HDFS client caches the data into a temporary file.
When the data reaches an HDFS block size, the client contacts the Namenode.
The Namenode inserts the filename into its hierarchy and allocates a data block for it.
The Namenode responds to the client with the identity of the Datanode and the destination of the replicas (Datanodes) for the block.
Then the client flushes the block from its local memory.
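Client-side staging amounts to buffering writes locally until a full block accumulates; a sketch (class name and callback are hypothetical):

```python
class StagingClient:
    """Buffer writes locally; contact the 'NameNode' only when a block fills up."""
    def __init__(self, block_size, allocate_block):
        self.block_size = block_size
        self.allocate_block = allocate_block   # callback standing in for the NameNode RPC
        self.buffer = b""
        self.flushed_blocks = []

    def write(self, data):
        self.buffer += data
        while len(self.buffer) >= self.block_size:
            block = self.buffer[:self.block_size]
            self.buffer = self.buffer[self.block_size:]
            target = self.allocate_block()     # NameNode picks the target Datanodes
            self.flushed_blocks.append((target, block))

calls = []
client = StagingClient(block_size=4, allocate_block=lambda: calls.append(1) or "dn1")
client.write(b"ab")        # stays in the local buffer; no NameNode contact yet
print(len(calls))          # 0
client.write(b"cdefgh")    # two full blocks now exist, so two allocations happen
print(len(calls))          # 2
```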
Staging (contd.)
The client sends a message that the file is closed.
The Namenode proceeds to commit the file-creation operation into the persistent store.
If the Namenode dies before the file is closed, the file is lost.
This client-side caching is required to avoid network congestion; it also has precedent in AFS (the Andrew File System).
Replication Pipelining
When the client receives the response from the Namenode, it flushes its block in small pieces (4 KB) to the first replica, which in turn copies each piece to the next replica, and so on.
Thus data is pipelined from one Datanode to the next.
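Pipelined replication forwards each small piece along the chain of replicas; a sequential sketch (in real HDFS each node forwards concurrently while receiving):

```python
def pipeline_write(block, replicas, piece_size=4):
    """Send the block in small pieces; each replica passes the piece onward."""
    for i in range(0, len(block), piece_size):
        piece = block[i:i + piece_size]
        for store in replicas:        # piece flows down the replica chain
            store.append(piece)

r1, r2, r3 = [], [], []
pipeline_write(b"0123456789", [r1, r2, r3], piece_size=4)
print(b"".join(r1) == b"".join(r2) == b"".join(r3) == b"0123456789")  # True
```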
API (ACCESSIBILITY)
Application Programming Interface
HDFS provides a Java API for applications to use.
Python access is also used in many applications.
A C-language wrapper for the Java API is also available.
An HTTP browser can be used to browse the files of an HDFS instance.
Space Reclamation
When a file is deleted by a client, HDFS renames the file into the /trash directory, where it stays for a configurable amount of time.
A client can request an undelete within this allowed time.
After the specified time, the file is deleted and the space is reclaimed.
When the replication factor is reduced, the Namenode selects excess replicas that can be deleted.
The next heartbeat transfers this information to the Datanode, which then clears the blocks for reuse.
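Trash-based deletion can be sketched as a rename plus an expiry timestamp (an in-memory toy; in real Hadoop the retention window is a configurable interval):

```python
class ToyTrash:
    """Toy model of HDFS trash: delete = move into trash with an expiry time."""
    def __init__(self, retention_seconds):
        self.retention = retention_seconds
        self.files = {}      # path -> data
        self.trash = {}      # original path -> (data, expiry time)

    def delete(self, path, now):
        self.trash[path] = (self.files.pop(path), now + self.retention)

    def undelete(self, path, now):
        data, expiry = self.trash[path]
        if now < expiry:                 # still within the allowed window
            del self.trash[path]
            self.files[path] = data
            return True
        return False

    def reclaim(self, now):
        """Permanently drop trashed files whose retention period has passed."""
        self.trash = {p: v for p, v in self.trash.items() if v[1] > now}

t = ToyTrash(retention_seconds=60)
t.files["/user/foo/x"] = b"data"
t.delete("/user/foo/x", now=0)
print(t.undelete("/user/foo/x", now=30))   # True: still within the window
t.delete("/user/foo/x", now=0)
t.reclaim(now=120)
print(t.trash)                             # {}: space reclaimed
```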
Summary
We discussed the features of the Hadoop Distributed File System, a peta-scale file system for handling big-data sets.
What we discussed: architecture, protocols, API, etc.
Missing elements: the implementation of the Hadoop file system (its internals), and an implementation of an instance of HDFS (for use by applications such as web crawlers).