I hereby declare that this submission is my own work and that, to the best of my knowledge and
belief, it contains no matter previously published or written by another person, nor material
which to a substantial extent has been accepted for the award of any other Degree or Diploma of
the University or other Institute of Higher Learning, except where due acknowledgement has
been made in the text.
Signature
Name: Anushka Bajpai
ACKNOWLEDGEMENT
There are times when silence speaks more loudly than words of praise, for words alone cannot
express, and would only put a veneer over, the true feelings of gratitude I have at this point
of time.
Thank you GOD for giving me strength and ability to understand, learn and complete this report.
I would like to express my sincere gratitude to everyone who contributed towards preparing and
making this project successful.
I am deeply indebted to R.K. Banyal Sir, whose mentoring and encouragement have been
especially valuable, and his early insights launched a greater part of this project.
I express my profound and sincere thanks to Hukum Chand Saini Sir, who acted as a mariner's
compass and steered me throughout my project voyage through his excellent guidance and
constant inspiration.
Above ground, I am indebted to my family, whose value to me only grows with age.
ABSTRACT
The data scheduler is an automated utility that schedules tasks to run on a periodic basis. It
is typically used in processes that run for long periods of time and perform different tasks at
scheduled instants. The scheduler takes as input the processes to be run and their start times,
and then automates the execution of the tasks according to that schedule.
This data scheduler is useful for tasks that have to be run on a daily basis. In most firms,
there are many tasks that must be performed periodically each day, and the scheduler makes such
tasks easy to run.
A pipeline is a logical grouping of processes: it groups processes into a single unit that
performs a task.
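As a rough sketch only (not the project's actual implementation), a pipeline of scheduled tasks can be expressed with Python's standard sched module; the task names and the near-immediate start time here are assumptions for illustration:

```python
import sched
import time

# A pipeline is a logical grouping of processes (tasks) run as one unit.
def make_pipeline(*tasks):
    def run_pipeline():
        return [task() for task in tasks]
    return run_pipeline

# Hypothetical daily tasks, for illustration only.
def extract():
    return "extracted"

def load():
    return "loaded"

pipeline = make_pipeline(extract, load)

# The scheduler takes the task and its start time as input and then
# automates its execution (the delay is nearly zero for this demo).
scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(0.01, 1, pipeline)
scheduler.run()
```

In the real system the start times would come from user configuration and the pipeline would repeat on its configured period.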
The need for a GENERIC DATA-SCHEDULER FOR DATA FLOW PIPELINING for the efficient
execution of tasks is a serious demand. We therefore provide a solution by constructing a
system that is capable of both configuring tasks and scheduling them, so as to maximize CPU
utilization. The project involves the use of:
TABLE OF CONTENTS
Declaration
Acknowledgement
Abstract
Table of Contents
List of Illustrations
Chapter 1: Introduction
Chapter 2: Big Data
CHAPTER 1: INTRODUCTION
We live in the data age. It is not easy to measure the total volume of data stored
electronically, but an IDC estimate put the size of the digital universe at 4.4
zettabytes in 2013, forecasting a tenfold growth by 2020. This flood of data is
coming from many sources. Consider the following:
The New York Stock Exchange generates about 45 terabytes of data per
day.
Facebook hosts more than 240 billion photos, growing at 7 petabytes per
month.
The trend is for every individual's data footprint to grow, but perhaps more
significantly, the amount of data generated by machines as a part of the Internet of
Things will be even greater than that generated by people. The volume of data
being made publicly available increases every year, too.
The problem is simple: although the storage capacities of hard drives have
increased massively over the years, access speeds (the rate at which data can be
read from drives) have not kept up. The obvious way to reduce the time is to read
from multiple disks at once. Imagine if we had 100 drives, each holding one
hundredth of the data. Working in parallel, we could read the data in under two
minutes.
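The arithmetic behind this claim can be made concrete. Assuming, for illustration, a 1 TB dataset and a sustained read speed of 100 MB/s per drive (figures not stated above), a short calculation shows the speed-up:

```python
# Assumed figures for illustration (not stated in the text above):
# a 1 TB dataset and a sustained read speed of 100 MB/s per drive.
DATASET_MB = 1_000_000     # 1 TB expressed in MB
DRIVE_SPEED_MBPS = 100     # sustained MB/s per drive

def read_time_minutes(num_drives):
    # Each drive holds, and reads, an equal share of the data in parallel.
    seconds = (DATASET_MB / num_drives) / DRIVE_SPEED_MBPS
    return seconds / 60

print(round(read_time_minutes(1), 1))    # one drive: about 167 minutes
print(round(read_time_minutes(100), 1))  # 100 drives: under two minutes
```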
There is more to reading and writing data in parallel to or from multiple
disks, though.
The first problem to solve is hardware failure: as soon as you start using many
pieces of hardware, the chance that one will fail is fairly high. A common way of
avoiding data loss is through replication: redundant copies of the data are kept by
the system so that in the event of failure, there is another copy available.
The second problem is that most analysis tasks need to be able to combine the data
in some way, and data read from one disk may need to be combined with data from
any of the other 99 disks. Various distributed systems allow data to be combined
from multiple sources.
Big data involves the data produced by different devices and applications. Given
below are some of the fields that come under the umbrella of Big Data.
Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures
voices of the flight crew, recordings of microphones and earphones, and the
performance information of the aircraft.
Social Media Data: Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
Stock Exchange Data: The stock exchange data holds information about the buy
and sell decisions made by customers on the shares of different companies.
Power Grid Data: The power grid data holds information consumed by a particular
node with respect to a base station.
Transport Data: Transport data includes model, capacity, distance and availability of
a vehicle.
Search Engine Data: Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data.
The data in it will be of three types: structured, semi-structured, and unstructured.
Big data is really critical to our lives and is emerging as one of the most important
technologies in the modern world. Following are just a few benefits that are well
known to all of us:
Using the information kept in social networks like Facebook, marketing
agencies are learning about the response to their campaigns, promotions, and
other advertising media.
Using information from social media, such as the preferences and product
perceptions of their consumers, product companies and retail organizations are
planning their production.
Using data on the previous medical history of patients, hospitals are
providing better and quicker service.
In this traditional approach, an enterprise has a computer to store and process big
data. Here data is stored in an RDBMS like Oracle Database, MS SQL Server, or DB2,
and sophisticated software can be written to interact with the database, process
the required data, and present it to the users for analysis.
Limitation
This approach works well where the volume of data is small enough to be
accommodated by standard database servers, or up to the limit of the processor
that is processing the data. But when it comes to dealing with huge amounts of
data, it is really a tedious task to process such data through a traditional database
server.
GOOGLE'S SOLUTION
Google solved this problem using an algorithm called MapReduce. This algorithm
divides the task into small parts and assigns those parts to many computers
connected over the network, and collects the results to form the final result dataset.
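The idea can be illustrated with a toy, single-machine word count, the canonical MapReduce example; this merely simulates the map, shuffle, and reduce phases in memory, whereas a real cluster distributes them across machines:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    # Shuffle: group all emitted values by key across the map outputs.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's values into the final result.
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle([map_phase(s) for s in splits]))
print(counts["big"])  # → 3
```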
HADOOP
Doug Cutting, Mike Cafarella and team took the solution provided by Google and
started an Open Source Project called HADOOP in 2005 and Doug named it after his
son's toy elephant. Now Apache Hadoop is a registered trademark of the Apache
Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes. In short, the Hadoop framework makes
it possible to develop applications that run on clusters of computers and perform
complete statistical analysis of huge amounts of data.
3.1 A BRIEF HISTORY OF HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene, a widely used
text search library. The Apache Nutch project, an open source web search engine,
had a significant contribution to building Hadoop. Hadoop is not an acronym; it is a
made-up name. The project creator, Doug Cutting, explains how the name came
about: "The name my kid gave a stuffed yellow elephant. Short, relatively easy to
spell and pronounce, meaningless, and not used elsewhere: those are my naming
criteria. Kids are good at generating such." It is ambitious to build a web search
engine from scratch, as it is not only challenging to build the software required to
crawl and index websites, but also challenging to run it without a dedicated
operations team, since there are so many moving parts. It was estimated that a
1-billion-page index would cost around $500,000 to build and $30,000 a month to
maintain. Nevertheless, this goal is worth pursuing, as it opens search engine
algorithms to the world for review and improvement. The Nutch project was started
in 2002, and the crawler and search system were quickly developed. However, the
developers soon realized that their system would not scale to a billion pages. The
timely publication by Google in 2003 of the architecture of the Google File System
(GFS) came in very handy: GFS, or something like it, would solve their storage
needs for the very large files generated as part of the web crawl and indexing
process, and in particular would free up the time spent maintaining the storage
nodes. This effort gave way to the Nutch Distributed File System (NDFS). Google
produced another paper in 2004 that introduced MapReduce to the world. By early
2005 the Nutch developers had a working MapReduce implementation in Nutch, and by
the middle of that year most of the Nutch algorithms had been ported to MapReduce
and NDFS.
The NDFS and MapReduce implementations in Nutch found applications in areas beyond
the scope of Nutch; in February 2006 they were moved out of Nutch to form their own
independent subproject called Hadoop. Around the same time Doug Cutting joined
Yahoo!, which gave him access to a dedicated team and resources to turn Hadoop
into a system that ran at web scale. This was demonstrated when Yahoo! announced
that its production search index was being generated by a 10,000-node Hadoop cluster.
In January 2008, Hadoop was promoted to a top-level project at Apache, confirming
its success and its diverse, active community. By this time, Hadoop was being used
by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New
York Times. The capability of Hadoop was demonstrated, and publicly put at the
forefront of the distributed computing sphere, when The New York Times used
Amazon's EC2 compute cloud to crunch through four terabytes of scanned archives
from the paper, converting them into PDFs for the Web. The project came at the
right time, with great publicity for Hadoop and the cloud, and it would have been
impossible to attempt if not for Amazon's popular pay-by-the-hour cloud model. The
NYT used a large number of machines, about 100, and Hadoop's easy-to-use parallel
programming model to complete the task in 24 hours. Hadoop's successes did not
stop there; in April 2008 it broke a world record to become the fastest system to
sort a terabyte of data. It took 209 seconds to sort a terabyte of data on a
910-node cluster, beating the previous year's winner of 297 seconds. It did not
end there: later that year Google reported that its MapReduce implementation
sorted one terabyte in 68 seconds, and Yahoo! then reported to have broken
Google's record by sorting one terabyte in 62 seconds.
CHAPTER 4: HADOOP ECOSYSTEM
Hadoop is a generic processing framework designed to execute queries and other
batch read operations on massive datasets that can scale from tens of terabytes to
petabytes in size. HDFS and MapReduce together provide a reliable, fault-tolerant
software framework for processing vast amounts of data in parallel on large clusters
of commodity hardware (potentially scaling to thousands of nodes). Hadoop meets
the needs of many organizations for flexible data analysis capabilities with an
unmatched price-performance curve. This flexible data analysis applies to data in
a variety of formats, from unstructured data, such as raw text, to semi-structured
data, such as logs, to structured data with a fixed schema. In environments where
massive server farms are used to collect data from a variety of sources, Hadoop
can run parallel queries as background batch jobs on the same server farm. Thus,
the requirement for additional hardware to process data from a traditional
database system is eliminated (assuming such a system can scale to the required
size). The effort and time required to load data into another system is also
reduced, since the data can be processed directly within Hadoop; for very large
datasets, that loading overhead becomes impractical.
All the components of the Hadoop ecosystem are evident as explicit entities. The
holistic view of the Hadoop architecture gives prominence to Hadoop Common, Hadoop
YARN, the Hadoop Distributed File System (HDFS), and Hadoop MapReduce. Hadoop
Common provides the Java libraries, utilities, OS-level abstractions, and the
necessary Java files and scripts to run Hadoop, while Hadoop YARN is a
framework for job scheduling and cluster resource management. HDFS in Hadoop
architecture provides high throughput access to application data and Hadoop
MapReduce provides YARN based parallel processing of large data sets.
The Hadoop ecosystem includes other tools to address particular needs. Hive is a
SQL dialect and Pig is a dataflow language; both hide the tedium of creating
MapReduce jobs behind higher-level abstractions more appropriate for user goals.
HBase is a column-oriented database management system that runs on top of
HDFS. Avro provides data serialization and data exchange services for Hadoop.
Sqoop (from "SQL-to-Hadoop") moves data between relational databases and Hadoop.
ZooKeeper is used for coordinating distributed services and Oozie is a workflow
scheduling system. In the absence of an ecosystem, developers would have to
integrate separate sets of technologies to create Big Data solutions.
The Hadoop ecosystem includes the following tools to address particular needs:
Common: A set of components and interfaces for distributed file systems and
general I/O (serialization, Java RPC, persistent data structures).
Avro: A serialization system for efficient, cross-language RPC and persistent data
storage.
HDFS: A distributed file system that runs on large clusters of commodity machines.
Pig: A data flow language and execution environment for exploring very large
datasets. Pig runs on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and
provides a query language based on SQL (and which is translated by the runtime
engine to MapReduce jobs) for querying the data.
Sqoop: A tool for efficiently moving data between relational databases and HDFS.
Cascalog: Cascalog is similar to Scalding in the way it hides the limitations of Java
behind a powerful Clojure API for Cascading. Cascalog includes logic programming
constructs inspired by Datalog; the name is derived from Cascading + Datalog.
Apache BigTop: It was originally a part of Cloudera's CDH distribution and is
used to test the Hadoop ecosystem.
Apache Flume: It is responsible for data transfer between a source and a sink,
which can be scheduled or triggered upon an event. It is used to harvest, aggregate,
and move large amounts of data in and out of Hadoop. Flume supports different data
formats for sources (such as Avro and files) and sinks (such as HDFS and HBase),
and it can also transform data before it is moved between sources and sinks.
HDFS Design Goals
Streaming Data Access: Applications that run on the Hadoop HDFS need access to
streaming data. These applications cannot be run on general-purpose file systems.
HDFS is designed for large-scale batch processing, which requires high-throughput
data access. Several POSIX requirements are relaxed to meet this need for high
throughput.
Large Data Sets: The HDFS-based applications feed on large datasets. A typical file
size is in the range of high gigabytes to low terabytes. It should provide high data
bandwidth and support millions of files across hundreds of nodes in a single cluster.
Moving compute instead of data: Any computation is efficient if it executes near the
data because it avoids the network transfer bottleneck. Migrating the computation
closer to the data is a cornerstone of HDFS-based programming. HDFS provides all
the necessary application interfaces to move the computation close to the data
prior to execution.
Low-latency data access: The high-throughput data access comes at the cost of
latency. Latency sensitive applications are not suitable for HDFS. HBase has shown
promise in handling low-latency applications along with large-scale data access.
Lots of small files: The metadata of the file system is stored in the Namenode's
memory (the master node). The limit on the number of files in the file system
therefore depends on the Namenode's memory. Typically, each file, directory, and
block takes about 150 bytes. For example, if you had one million files, each taking
one block, you would need at least 300 MB of memory. Storing millions of files is
feasible, but the hardware cannot accommodate billions of files.
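The arithmetic above can be checked directly, using the approximate 150-bytes-per-object figure quoted in the text and assuming each file contributes one file object plus one block object:

```python
BYTES_PER_OBJECT = 150  # approximate metadata cost per file, directory, or block

def namenode_memory_mb(num_files, blocks_per_file=1):
    # Each file contributes one file object plus its block objects,
    # all of which must fit in the Namenode's memory.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1_000_000

print(namenode_memory_mb(1_000_000))  # one million single-block files → 300.0 (MB)
```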
The HDFS architecture follows the popular master/slave pattern. An HDFS cluster
consists of a master server, called the Namenode, that manages the file-system
namespace and regulates file access. Analogous to slaves, there are a number of
Datanodes, typically one per cluster node, which manage the data stored in the
nodes. The HDFS file-system exists independently of the host file-system and
allows data to be stored in its own namespace. The Namenode allows typical
file-system operations like opening, closing, and renaming files and directories.
It also maintains the mapping between data blocks and Datanodes. The Datanodes
handle the read and write requests. Upon the Namenode's instructions, the
Datanodes perform block creation, deletion, and replication operations. The HDFS
architecture is given in Figure 2.2. The Namenode and Datanodes are
software services provided by HDFS that are designed to run on heterogeneous
commodity machines. These applications typically run on Unix/Linux-based
operating systems. The Java programming language is used to build these services,
and any machine that supports the Java runtime environment can run the Namenode
and Datanode services. Given the highly portable nature of Java, HDFS can be
deployed on a wide range of machines. A typical cluster installation has one
dedicated machine that acts as the master, running the Namenode service. The other
machines in the cluster each run one instance of the Datanode service. Although
you can run multiple Datanodes on one machine, this practice is rare in real-world
deployments. A single Namenode/master instance simplifies the architecture of the
system: it acts as an arbitrator and a repository for all the HDFS metadata. The
system is designed such that user data never flows through the Namenode. Figure 2.3
shows how the Hadoop ecosystem interacts together.
2.3.3 FileSystem
The file organization in HDFS is similar to the traditional hierarchical type. A user or
an application can create and store files in directories. The namespace hierarchy is
similar to other file-systems in the sense that one can create, remove, and move
files from one directory to another, or even rename a file. HDFS does not support
hard links or soft links. Any changes to the file-system namespace are recorded by
the Namenode. An application can specify the number of copies of a file that HDFS
should maintain; this number is called the replication factor.
Figure: Interaction between HDFS and MapReduce
HDFS provides a reliable storage mechanism even though the cluster spans
thousands of machines. Each file is stored as a sequence of blocks, where each
block except the last one is of the same size. Fault tolerance is ensured by
replicating the file blocks. The block size and replication factor are configurable
for each file, either by the user or by an application. The replication factor can
be set at file creation and modified later. Files in HDFS are write-once and
strictly adhere to a single-writer-at-a-time property (Figure 2.4). The Namenode,
acting as the master, takes all the decisions regarding data block replication. It
receives a heartbeat (see Figure 2.5 on how it works) and a block report from each
Datanode in the cluster. A heartbeat implies that the Datanode is functional, and
the block report provides a list of all the blocks on a Datanode. Replication is
vital to HDFS reliability, and performance is improved by optimizing the placement
of these replicas. This optimized placement contributes significantly to
performance and distinguishes HDFS from other file-systems, though the feature
requires much tuning and experience to get right. Using a rack-aware replication
policy improves data availability, reliability, and network bandwidth utilization.
This policy was among the first of its kind and has attracted much attention, with
better and more sophisticated policies to follow. Typically, large HDFS instances
span multiple racks.
Communications between these racks go through switches. Network bandwidth
between machines in different racks is less than machines within the same rack.
Figure: Data Replication in Hadoop
The Namenode is aware of the rack ID each Datanode belongs to. A simple policy
would place one replica in each of several unique racks; that way, even if an
entire rack fails (however unlikely), a replica on another rack is still available.
However, this policy can be costly, as the number of writes between racks
increases. When the replication factor is, say, 3, HDFS places one replica on the
writer's node, a second on a node in a different (remote) rack, and the third on a
different node in that same remote rack. This policy cuts inter-rack writes, which
improves write performance; since the chance of rack failure is far less than that
of node failure, reliability is not compromised. It also reduces the network
bandwidth used when reading data, since a block's replicas are placed in only two
unique racks instead of three. With this policy, one-third of the replicas are on
one node and two-thirds are on one rack, with any further replicas distributed
evenly across the remaining racks. This policy improves write performance along
with data reliability and read performance.
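A minimal sketch of this default placement for a replication factor of 3 follows (a simplification: a real cluster also weighs node load and free space when choosing targets, and the rack and node names here are illustrative):

```python
import random

def place_replicas(racks, writer_node, writer_rack):
    # Default HDFS policy for a replication factor of 3:
    #   replica 1: the writer's own node,
    #   replica 2: a node in a different (remote) rack,
    #   replica 3: a different node in that same remote rack.
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second = random.choice(racks[remote_rack])
    third = random.choice([n for n in racks[remote_rack] if n != second])
    return [(writer_rack, writer_node), (remote_rack, second), (remote_rack, third)]

racks = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
}
replicas = place_replicas(racks, "n1", "rack1")
# The three replicas span exactly two racks.
print(len({rack for rack, _ in replicas}))  # → 2
```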
Safe mode: When the HDFS services start, the Namenode enters safe mode. The
Namenode receives heartbeats and block reports from the Datanodes. Each block has
a specified number of replicas, and the Namenode checks, through these reports,
whether the replicas are present. Once a sufficient percentage of blocks are
reported present, the Namenode exits safe mode (this takes about 30 seconds) and
completes the remaining replication on other Datanodes.
2.3.5 Communication
The TCP/IP protocols are the foundation of all the communication protocols in
HDFS. Remote Procedure Calls (RPCs) are designed around the client protocol and
the Datanode protocol. The Namenode never initiates a procedure call; it only
responds to requests from clients and Datanodes.
Robustness: The fault-tolerant nature of HDFS ensures data reliability. The three
common types of failures are Namenode failures, Datanode failures, and network
partitions.
Cluster Rebalancing: Data rebalancing schemes are compatible with the HDFS
architecture. Data is automatically moved from a Datanode when its free space
falls below a certain threshold. If a file is in heavy demand by an application, a
scheme might create additional replicas of the file to rebalance the demand for
its data across the cluster.
Data Integrity: Data corruption can occur for various reasons: storage device
failure, network faults, buggy software, and so on. To recognize corrupted data
blocks, HDFS implements checksums over the contents of retrieved HDFS files.
Each block of data has an associated checksum that is stored in a separate hidden
file in the same HDFS namespace. When a file is retrieved, its contents are
checked against the stored checksum; if they do not match, the client can ask for
that data block from another Datanode (Figure 2.6).
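This mechanism can be sketched as follows. HDFS actually checksums fixed-size chunks within each block; for brevity this sketch keeps a single CRC32 per block:

```python
import zlib

def store_block(data):
    # On write, a checksum is computed and kept alongside the block
    # (HDFS keeps it in a separate hidden file in the same namespace).
    return {"data": data, "checksum": zlib.crc32(data)}

def read_block(block):
    # On read, the checksum is recomputed and compared with the stored
    # one; on a mismatch the client would ask another Datanode instead.
    if zlib.crc32(block["data"]) != block["checksum"]:
        raise IOError("corrupt block: checksum mismatch")
    return block["data"]

block = store_block(b"replicated payload")
assert read_block(block) == b"replicated payload"

block["data"] = b"bit-rotted payload"  # simulate silent corruption
try:
    read_block(block)
except IOError as err:
    print(err)  # corrupt block: checksum mismatch
```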
Figure: Each cluster contains one Namenode. This design facilitates a simplified
model for managing each Namespace and arbitrating data distribution.
Data Blocks: HDFS by design supports very large files. Applications written to work
on HDFS write their data once and read many times, with reads at streaming
speeds. A typical block size used by HDFS is 64 MB. Each file is chopped into 64 MB
chunks and replicated.
Staging: A request to create a file does not reach the Namenode immediately.
Initially, the HDFS client caches the file data in a temporary local file, and
application writes are redirected to this temporary file. When the local file
accumulates content beyond the block size, the client contacts the Namenode. The
Namenode creates a file in the file-system, allocates a data block for it, and
responds to the client with the Datanode and data block identities. The client
flushes the block of data from the temporary file to this data block. When the
file is closed, the Namenode is informed and commits the file creation operation
to its persistent store. If the Namenode dies before this commit, the file is lost.
The benefit of this approach is that it allows streaming writes to files. If a
client wrote directly to a remote file without buffering, network speed and
congestion would considerably impact throughput. Earlier file systems have
successfully used client-side caching to improve performance, and in the case of
HDFS a few POSIX rules have been relaxed to achieve high-performance data
uploads.
A. NameNode
The HDFS namespace is a hierarchy of files and directories. Files and
directories are represented on the NameNode by nodes, which record
attributes like permissions, modification and access times, namespace and
disk space quotas. The file content is split into large blocks (typically 128
megabytes, but user selectable file-by-file) and each block of the file is
independently replicated at multiple DataNodes (typically three, but user
selectable file-by-file). The NameNode maintains the namespace tree and the
mapping of file blocks to DataNodes (the physical location of file data). An
HDFS client wanting to read a file first contacts the NameNode for the
locations of data blocks comprising the file and then reads block contents
from the DataNode closest to the client. When writing data, the client
requests the NameNode to nominate a suite of three DataNodes to host the
block replicas. The client then writes data to the DataNodes in a pipeline
fashion. The current design has a single NameNode for each cluster. The
cluster can have thousands of DataNodes and tens of thousands of HDFS
clients per cluster, as each DataNode may execute multiple application tasks
concurrently. HDFS keeps the entire namespace in RAM. The inode data and
the list of blocks belonging to each file comprise the metadata of the name
system, called the image. The persistent record of the image, stored in the
local host's native file system, is called a checkpoint. The NameNode also
stores the modification log of the image, called the journal, in the local host's
native file system. For improved durability, redundant copies of the
checkpoint and journal can be made at other servers. During restarts the
NameNode restores the namespace by reading the checkpoint and replaying
the journal. The locations of block replicas may change over time and are not
part of the persistent checkpoint.
B. DataNodes
Each block replica on a DataNode is represented by two files in the local
host's native file system. The first file contains the data itself and the second
file is the block's metadata, including checksums for the block data and the
block's generation stamp. The size of the data file equals the actual length of
the block and does not require extra space to round it up to the nominal block
size as in traditional file systems. Thus, if a block is half full it needs only half
of the space of the full block on the local drive. During startup each DataNode
connects to the NameNode and performs a handshake. The purpose of the
handshake is to verify the namespace ID and the software version of the
DataNode. If either does not match that of the NameNode the DataNode
automatically shuts down. The namespace ID is assigned to the file system
instance when it is formatted. The namespace ID is persistently stored on all
nodes of the cluster. Nodes with a different namespace ID will not be able to
join the cluster, thus preserving the integrity of the file system. The
consistency of software versions is important because an incompatible version
may cause data corruption or loss, and on large clusters of thousands of
machines it is easy to overlook nodes that did not shut down properly prior to
the software upgrade or were not available during the upgrade. A DataNode
that is newly initialized and without any namespace ID is permitted to join the
cluster and receives the cluster's namespace ID. After the handshake the
DataNode registers with the NameNode. DataNodes persistently store their
unique storage IDs. The storage ID is an internal identifier of the DataNode,
which makes it recognizable even if it is restarted with a different IP address
or port. The storage ID is assigned to the DataNode when it registers with
the NameNode for the first time and never changes after that. A DataNode
identifies block replicas in its possession to the NameNode by sending a block
report. A block report contains the block id, the generation stamp and the
length for each block replica the server hosts. The first block report is sent
immediately after the DataNode registration. Subsequent block reports are
sent every hour and provide the NameNode with an up-to-date view of where
block replicas are located on the cluster. During normal operation DataNodes
send heartbeats to the NameNode to confirm that the DataNode is operating
and the block replicas it hosts are available. The default heartbeat interval is
three seconds. If the NameNode does not receive a heartbeat from a
DataNode in ten minutes the NameNode considers the DataNode to be out of
service and the block replicas hosted by that DataNode to be unavailable.
The NameNode then schedules creation of new replicas of those blocks on
other DataNodes. Heartbeats from a DataNode also carry information about
total storage capacity, fraction of storage in use, and the number of data
transfers currently in progress. These statistics are used for the NameNode's
space allocation and load balancing decisions. The NameNode does not
directly call DataNodes; it uses replies to heartbeats to send instructions to
the DataNodes. The instructions include commands to:
replicate blocks to other nodes;
remove local block replicas;
re-register or to shut down the node;
send an immediate block report.
These commands are important for maintaining the overall system integrity and
therefore it is critical to keep heartbeats frequent even on big clusters. The
NameNode can process thousands of heartbeats per second without affecting other
NameNode operations.
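The heartbeat bookkeeping described above can be sketched as follows, using the intervals quoted in the text (3-second heartbeats, a 10-minute timeout); the class and node names are illustrative:

```python
# Sketch of the NameNode's heartbeat bookkeeping. Intervals follow the
# text: heartbeats every 3 s, a node is considered dead after 10 minutes.
HEARTBEAT_INTERVAL = 3   # seconds between heartbeats from a DataNode
DEAD_AFTER = 600         # seconds of silence before a node is out of service

class NameNodeMonitor:
    def __init__(self):
        self.last_heartbeat = {}

    def heartbeat(self, datanode, now):
        # Record the time of the latest heartbeat from this DataNode.
        self.last_heartbeat[datanode] = now

    def dead_nodes(self, now):
        # Any DataNode silent for 10 minutes is out of service; the
        # NameNode would then re-create its replicas on other nodes.
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t >= DEAD_AFTER]

monitor = NameNodeMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=599)   # dn1 keeps reporting; dn2 goes silent
print(monitor.dead_nodes(now=600))  # → ['dn2']
```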
C. HDFS Client
User applications access the file system using the HDFS client, a code library
that exports the HDFS file system interface. Similar to most conventional file
systems, HDFS supports operations to read, write and delete files, and
operations to create and delete directories. The user references files and
directories by paths in the namespace. The user application generally does
not need to know that file system metadata and storage are on different
servers, or that blocks have multiple replicas. When an application reads a
file, the HDFS client first asks the NameNode for the list of DataNodes that
host replicas of the blocks of the file. It then contacts a DataNode directly and
requests the transfer of the desired block. When a client writes, it first asks
the NameNode to choose DataNodes to host replicas of the first block of the
file. The client organizes a pipeline from node-to-node and sends the data.
When the first block is filled, the client requests new DataNodes to be chosen
to host replicas of the next block. A new pipeline is organized, and the client
sends the further bytes of the file.
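The read/write flow above can be sketched as follows; `MiniNameNode`, `write_file`, and the fixed DataNode list are simplifications invented for illustration (real HDFS chooses DataNodes per block and streams data through a pipeline).

```python
class MiniNameNode:
    """Toy namespace: path -> block list; block -> replica locations."""

    def __init__(self):
        self.blocks = {}     # path -> [block ids]
        self.locations = {}  # block id -> [datanode ids]
        self._next = 0

    def get_block_locations(self, path):
        """What a reading client asks for before contacting DataNodes."""
        return [(b, self.locations[b]) for b in self.blocks[path]]

    def add_block(self, path, datanodes):
        """Allocate the next block of a file on the given DataNodes."""
        block = f"blk_{self._next}"
        self._next += 1
        self.blocks.setdefault(path, []).append(block)
        self.locations[block] = list(datanodes)
        return block

def write_file(nn, path, data, block_size, datanodes):
    """Each filled block triggers a request for a fresh block/pipeline."""
    for _ in range(0, len(data), block_size):
        nn.add_block(path, datanodes)
    return nn.get_block_locations(path)

nn = MiniNameNode()
print(write_file(nn, "/logs/a", b"x" * 300, block_size=128,
                 datanodes=["dn1", "dn2", "dn3"]))
```

A 300-byte file with 128-byte blocks needs three blocks, matching the text's description of a new pipeline being organized per block.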
Figure 1. An HDFS client creates a new file by giving its path to the NameNode. For
each block of the file, the NameNode returns a list of DataNodes to host its replicas.
The client then pipelines data to the chosen DataNodes, which eventually confirm
the creation of the block replicas to the NameNode.
Creating periodic checkpoints is one way to protect the file system metadata. The
system can start from the most recent checkpoint if all other persistent copies of
the namespace image or journal are unavailable. Creating a checkpoint lets the
NameNode truncate the tail of the journal when the new checkpoint is uploaded to
the NameNode. HDFS clusters run for prolonged periods of time without restarts
during which the journal constantly grows. If the journal grows very large, the
probability of loss or corruption of the journal file increases. Also, a very large
journal extends the time required to restart the NameNode. For a large cluster, it
takes an hour to process a week-long journal. Good practice is to create a daily
checkpoint.
To create a checkpoint, the node downloads the current checkpoint and journal
files and merges them in memory. Then it writes the new checkpoint
and the empty journal to a new location, so that the old checkpoint and journal
remain unchanged. During handshake the NameNode instructs DataNodes whether
to create a local snapshot. The local snapshot on the DataNode cannot be created
by replicating the data file directories, as this would require doubling the storage
capacity of every DataNode on the cluster. Instead, each DataNode creates a copy of
the storage directory and hard links existing block files into it. When the DataNode
removes a block it removes only the hard link, and block modifications during
appends use the copy-on-write technique. Thus old block replicas remain untouched
in their old directories. The cluster administrator can choose to roll back HDFS to the
snapshot state when restarting the system. The NameNode recovers the checkpoint
saved when the snapshot was created. DataNodes restore the previously renamed
directories and initiate a background process to delete block replicas created after
the snapshot was made. Having chosen to roll back, there is no provision to roll
forward. The cluster administrator can recover the storage occupied by the
snapshot by commanding the system to abandon the snapshot, thus finalizing the
software upgrade. System evolution may lead to a change in the format of the
NameNode's checkpoint and journal files, or in the data representation of block
replica files on DataNodes. The layout version identifies the data representation
formats, and is persistently stored in the NameNode's and the DataNodes' storage
directories. During startup each node compares the layout version of the current
software with the version stored in its storage directories and automatically
converts data from older formats to the newer ones. The conversion requires the
mandatory creation of a snapshot when the system restarts with the new software
layout version. HDFS does not separate layout versions for the NameNode and
DataNodes because snapshot creation must be an all-cluster effort rather than a
node-selective event. If an upgraded NameNode purges its image due to a software
bug, then backing up only the namespace state still results in total data loss, as
the NameNode will not recognize the blocks reported by DataNodes, and will order
their deletion. Rolling back in this case will recover the metadata, but the data itself
will be lost. A coordinated snapshot is required to avoid a cataclysmic destruction.
A. File Read and Write
An application adds data to HDFS by creating a new file and
writing the data to it. After the file is closed, the bytes written cannot be altered or
removed except that new data can be added to the file by reopening the file for
append. HDFS implements a single-writer, multiple-reader model. The HDFS client
that opens a file for writing is granted a lease for the file; no other client can write
to the file. The writing client periodically renews the lease by sending a heartbeat to
the NameNode. When the file is closed, the lease is revoked.
The lease duration is bound by a soft limit and a hard limit. Until the soft limit
expires, the writer is certain of exclusive access to the file. If the soft limit expires
and the client fails to close the file or renew the lease, another client can preempt
the lease. If the hard limit (one hour) expires and the client has failed to renew
the lease, HDFS assumes that the client has quit and will automatically close the file
on behalf of the writer, and recover the lease. The writer's lease does not prevent
other clients from reading the file; a file may have many concurrent readers. An
HDFS file consists of blocks. When there is a need for a new block, the NameNode
allocates a block with a unique block ID and determines a list of DataNodes to host
replicas of the block. The DataNodes form a pipeline, the order of which minimizes
the total network distance from the client to the last DataNode. Bytes are pushed to
the pipeline as a sequence of packets. The bytes that an application writes first
buffer at the client side. After a packet buffer is filled (typically 64 KB), the data are
pushed to the pipeline. The next packet can be pushed to the pipeline before
receiving the acknowledgement for the previous packets. The number of
outstanding packets is limited by the outstanding packets window size of the client.
After data are written to an HDFS file, HDFS does not provide any guarantee that
data are visible to a new reader until the file is closed. If a user application needs
the visibility guarantee, it can explicitly call the hflush operation. Then the current
packet is immediately pushed to the pipeline, and the hflush operation will wait until
all DataNodes in the pipeline acknowledge the successful transmission of the
packet. All data written before the hflush operation are then certain to be visible to
readers.
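The client-side buffering just described (64 KB packets, a bounded window of unacknowledged packets, and hflush forcing the partial packet out) can be sketched as below; `PacketStream` and the window size of 5 are illustrative inventions, and the acknowledgement wait is simulated synchronously.

```python
PACKET_SIZE = 64 * 1024   # typical packet buffer size, per the text
WINDOW = 5                # max outstanding packets (illustrative value)

class PacketStream:
    """Illustrative client-side packet buffering for one HDFS block write."""

    def __init__(self):
        self.buffer = bytearray()
        self.outstanding = []   # packets sent but not yet acknowledged
        self.acked = 0

    def write(self, data):
        self.buffer += data
        while len(self.buffer) >= PACKET_SIZE:   # full packet: push it
            self._send(bytes(self.buffer[:PACKET_SIZE]))
            del self.buffer[:PACKET_SIZE]

    def _send(self, packet):
        while len(self.outstanding) >= WINDOW:   # window full: wait for acks
            self._await_ack()
        self.outstanding.append(packet)

    def _await_ack(self):
        # In reality this waits for the pipeline's acknowledgement.
        self.outstanding.pop(0)
        self.acked += 1

    def hflush(self):
        """Push the partial packet and wait until all packets are acked."""
        if self.buffer:
            self._send(bytes(self.buffer))
            self.buffer.clear()
        while self.outstanding:
            self._await_ack()

s = PacketStream()
s.write(b"a" * (PACKET_SIZE * 2 + 100))  # two full packets plus a partial one
s.hflush()
print(s.acked)                           # all three packets acknowledged
```

Only after hflush returns is the data guaranteed visible to readers, matching the visibility semantics described above.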
B. Block Placement
For a large cluster, it may not be practical to connect all nodes
in a flat topology. A common practice is to spread the nodes across multiple racks.
Nodes of a rack share a switch, and rack switches are connected by one or more
core switches. Communication between two nodes in different racks has to go
through multiple switches. In most cases, network bandwidth between nodes in the
same rack is greater than network bandwidth between nodes in different racks. Fig.
3 describes a cluster with two racks, each of which contains three nodes.
HDFS estimates the network bandwidth between two nodes by their distance. The
distance from a node to its parent node is assumed to be one. A distance between
two nodes can be calculated by summing up their distances to their closest
common ancestor. The shorter the distance between two nodes, the greater the
bandwidth they can utilize to transfer data. HDFS allows an administrator to
configure a script that returns a node's rack identification given a node's address.
The NameNode is the central place that resolves the rack location of each
DataNode. When a DataNode registers with the NameNode, the NameNode runs a
configured script to decide which rack the node belongs to. If no such script is
configured, the NameNode assumes that all the nodes belong to a default single
rack. The placement of replicas is critical to HDFS data reliability and read/write
performance. A good replica placement policy should improve data reliability,
availability, and network bandwidth utilization. Currently HDFS provides a
configurable block placement policy interface so that the users and researchers can
experiment and test any policy that is optimal for their applications. The default
HDFS block placement policy provides a tradeoff between minimizing the write cost,
and maximizing data reliability, availability and aggregate read bandwidth. When a
new block is created, HDFS places the first replica on the node where the writer is
located, the second and the third replicas on two different nodes in a different rack,
and the rest are placed on random nodes with restrictions that no more than one
replica is placed at one node and no more than two replicas are placed in the same
rack when the number of replicas is less than twice the number of racks. The choice
to place the second and third replicas on a different rack better distributes the block
replicas for a single file across the cluster. If the first two replicas were placed on the
same rack, for any file, two-thirds of its block replicas would be on the same rack.
After all target nodes are selected, nodes are organized as a pipeline in the order of
their proximity to the first replica. Data are pushed to nodes in this order. For
reading, the NameNode first checks if the client's host is located in the cluster. If
yes, block locations are returned to the client in the order of its closeness to the
reader. The block is read from DataNodes in this preference order. (It is usual for
MapReduce applications to run on cluster nodes, but as long as a host can connect
to the NameNode and DataNodes, it can execute the HDFS client.) This policy
reduces the inter-rack and inter-node write traffic and generally improves write
performance. Because the chance of a rack failure is far less than that of a node
failure, this policy does not impact data reliability and availability guarantees. In the
usual case of three replicas, it can reduce the aggregate network bandwidth used
when reading data since a block is placed in only two unique racks rather than
three. The default HDFS replica placement policy can be summarized as follows:
1. No DataNode contains more than one replica of any block.
2. No rack contains more than two replicas of the same block, provided there
are sufficient racks on the cluster.
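The distance rule (one hop to a parent, summed to the closest common ancestor) and the two placement constraints above can be sketched as follows, assuming a two-level topology of nodes under racks; the node and rack names are hypothetical.

```python
def distance(node_a, node_b, rack_of):
    """Network distance: 0 for the same node, 2 within a rack, 4 across
    racks (each hop to a parent counts one, summed at the common ancestor)."""
    if node_a == node_b:
        return 0
    return 2 if rack_of[node_a] == rack_of[node_b] else 4

def placement_ok(replica_nodes, rack_of):
    """Default-policy constraints: at most one replica per node and at most
    two replicas per rack (assuming sufficient racks)."""
    if len(set(replica_nodes)) != len(replica_nodes):
        return False                      # a node holds two replicas
    per_rack = {}
    for n in replica_nodes:
        per_rack[rack_of[n]] = per_rack.get(rack_of[n], 0) + 1
    return all(count <= 2 for count in per_rack.values())

rack_of = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2"}
print(distance("dn1", "dn2", rack_of))               # same rack
print(distance("dn1", "dn3", rack_of))               # across racks
print(placement_ok(["dn1", "dn3", "dn4"], rack_of))  # writer's rack + one other
```

The example placement (first replica on the writer's node, second and third in one other rack) satisfies both constraints, which is exactly the default policy's three-replica layout.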
E. Block Scanner
Each DataNode runs a block scanner that periodically scans its
block replicas and verifies that stored checksums match the block data. In each
scan period, the block scanner adjusts the read bandwidth in order to complete the
verification in a configurable period. If a client reads a complete block and
checksum verification succeeds, it informs the DataNode. The DataNode treats it as
a verification of the replica. The verification time of each block is stored in a human
readable log file. At any time there are up to two files in the top-level DataNode
directory, the current and prev logs. New verification times are appended to the
current file.
Correspondingly each DataNode has an in-memory scanning list ordered by the
replica's verification time. Whenever a read client or a block scanner detects a
corrupt block, it notifies the NameNode. The NameNode marks the replica as
corrupt, but does not schedule deletion of the replica immediately. Instead, it starts
to replicate a good copy of the block. Only when the good replica count reaches the
replication factor of the block is the corrupt replica scheduled for removal. This
policy aims to preserve data as long as possible. So even if all replicas of a block are
corrupt, the policy allows the user to retrieve its data from the corrupt replicas.
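This data-preservation policy (replicate first, delete the corrupt copy only once enough good replicas exist) can be sketched as below; the function and its arguments are hypothetical names, not HDFS APIs.

```python
def handle_corruption(replicas, corrupt, replication_factor, spare_nodes):
    """Illustrative NameNode reaction to a corrupt-replica report:
    schedule new good replicas first; drop corrupt ones only when the
    good-replica count reaches the replication factor."""
    good = [r for r in replicas if r not in corrupt]
    while len(good) < replication_factor and spare_nodes:
        good.append(spare_nodes.pop(0))      # schedule a fresh replica
    if len(good) >= replication_factor:
        return good                          # corrupt copies can now go
    # Not enough good copies: keep corrupt replicas as a last resort,
    # so users can still try to retrieve data from them.
    return good + [r for r in replicas if r in corrupt]

print(handle_corruption(["dn1", "dn2", "dn3"], {"dn2"}, 3, ["dn4"]))
print(handle_corruption(["dn1", "dn2"], {"dn1", "dn2"}, 3, []))
```

In the second call every replica is corrupt and no spare node exists, so the corrupt replicas are retained, mirroring the "preserve data as long as possible" policy.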
F. Decommissioning
The cluster administrator specifies which nodes can join the
cluster by listing the host addresses of nodes that are permitted to register and the
host addresses of nodes that are not permitted to register. The administrator can
command the system to re-evaluate these include and exclude lists. A present
member of the cluster that becomes excluded is marked for decommissioning. Once
a DataNode is marked as decommissioning, it will not be selected as the target of
replica placement, but it will continue to serve read requests. The NameNode starts
to schedule replication of its blocks to other DataNodes. Once the NameNode
detects that all blocks on the decommissioning DataNode are replicated, the node
enters the decommissioned state. Then it can be safely removed from the cluster
without jeopardizing any data availability.
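The decommissioning lifecycle above amounts to a small state machine, sketched here with invented state names; real HDFS tracks this per DataNode inside the NameNode.

```python
def decommission_step(state, fully_replicated):
    """One NameNode pass over a node marked for decommissioning:
    it stays 'decommissioning' (serving reads, never a write target)
    until all its blocks are replicated elsewhere."""
    if state == "decommissioning" and fully_replicated:
        return "decommissioned"   # now safe to remove from the cluster
    return state

def can_be_replica_target(state):
    """Only healthy in-service nodes receive new replicas."""
    return state == "in_service"

state = "decommissioning"
state = decommission_step(state, fully_replicated=False)
print(state)                                   # still decommissioning
state = decommission_step(state, fully_replicated=True)
print(state)                                   # decommissioned
print(can_be_replica_target("decommissioning"))  # excluded from placement
```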
G. Inter-Cluster Data Copy
When working with large datasets, copying data into and
out of an HDFS cluster is daunting. HDFS provides a tool called DistCp for large
inter/intra-cluster parallel copying. It is a MapReduce job; each of the map tasks
copies a portion of the source data into the destination file system. The MapReduce
framework automatically handles parallel task scheduling, error detection and
recovery.
Apache Hadoop is the most popular implementation of the MapReduce paradigm for
distributed computing, but its design does not adapt automatically to the computing
nodes' context and capabilities. By introducing context-awareness into Hadoop, we
intend to dynamically adapt its scheduling to the execution environment. This is a
necessary feature in the context of pervasive grids, which are heterogeneous,
dynamic and shared environments.
This is an especial concern when deploying Hadoop over pervasive grids. Pervasive
grids are an interesting alternative to costly dedicated clusters, as the acquisition
and maintenance of a dedicated cluster remain high and dissuasive for many
organizations. According to Parashar and Pierson [2], pervasive grids represent the
extreme generalization of the grid concept, in which the resources are pervasive.
Pervasive grids propose using resources embedded in pervasive environments in
order to perform computing tasks in a distributed way. Concretely, they can be seen
as computing grids formed by existing resources (desktop machines, spare servers,
etc.) that occasionally contribute to the computing grid power. These resources are
inherently heterogeneous and potentially mobile, coming in and out of the grid
dynamically. Since pervasive grids are, in essence, heterogeneous, dynamic, shared
and distributed environments, their efficient management becomes a very complex
task [3]. Task scheduling is thus severely affected by this complexity.
Many works have proposed to improve the adaptability of the Hadoop framework on
environments that diverge from the initial supposition, each having their own
proposals and objectives [4,5,6,7]. The PER-MARE project [8], in which this work was
developed, aims at adapting Hadoop to pervasive environments [9]. Indeed, Hadoop
is based on static configuration files, and the current versions do not adapt well to
resource variations over time. In addition, the installation procedures force the
administrator to
manually define the characteristics of each potential resource, such as the memory
and the number of cores of each machine, which is a hard task in a heterogeneous
environment. All these factors prevent deploying Hadoop on more volatile
environments. The PER-MARE project aims at the improvement of Hadoop so that it
could adapt itself to the execution context and therefore be deployed over
pervasive grids. In order to adapt Hadoop to a pervasive grid environment,
supporting context-awareness is essential. Context awareness is the capacity of an
application or software to detect and respond to environment changes [10]. A
context-aware system is able to adapt its operations to the current state without
human intervention, therefore improving the system's usability and efficiency [11].
In pervasive grids, scheduling is a task that may benefit from context awareness,
collecting data about the grid resources and making decisions based on
the data collected. This work focuses on our developments to introduce context-
awareness capabilities on Hadoop task scheduling mechanisms. Through a context
collection procedure and minimal changes on Hadoop's resource manager, we are
able to update the information about the availability of resources in each node of
the grid and then influence the scheduler's task assignments. The rest of the paper
is organized as follows: Section 2 presents Apache Hadoop architecture and
scheduling mechanisms. Section 3 discusses related work, focusing on context-
awareness and on other works that try to improve Hadoop schedulers. Section 4
presents our proposal of context-aware scheduling, while Section 5 presents the
experiments conducted and the achieved results. We finally conclude this paper in
Section 6.
2. About Hadoop Scheduling
The Apache Hadoop framework is organized in a master and slave architecture, with
two main services: storage (HDFS) and processing (YARN). Both services have their
own master and slave components, as presented on Fig. 1. It is possible to see the
NameNode and ResourceManager services, which are the masters of the HDFS and
YARN respectively, and their slave counterparts, the DataNode and NodeManager. It
is also possible to note the ApplicationMaster, the component responsible for
internal application (job) management, or simply task scheduling, while the
ResourceManager is the component responsible for job scheduling. Each node also
runs a set of
Containers, where the execution of Map and Reduce tasks takes place.
2.1. Hadoop Schedulers
Concerning job scheduling, Hadoop offers several options. The simplest scheduler,
called the Hadoop Internal Scheduler, processes all jobs in arrival order (FIFO). This
scheduler has a good performance in dedicated clusters where the competition for
resources is not a problem. Another scheduler available is the Fair Scheduler,
mainly used to compute batches of small jobs. It uses a two-level scheduling to
fairly distribute the resources [12].
The third scheduler available is the CapacityScheduler. The CapacityScheduler is
designed to run Hadoop MapReduce as a shared, multi-tenant cluster in an
operator-friendly manner, maximizing the throughput and the utilization of the
cluster while running MapReduce applications. The CapacityScheduler is designed to
allow sharing a large cluster while giving each organization a minimum capacity
guarantee. The central idea is that the available resources in the Hadoop
MapReduce cluster are partitioned among multiple organizations that collectively
fund the cluster based on computing needs. There is an added benefit that an
organization can access any excess capacity not being used by the other users.
This provides elasticity for the organizations in a cost-effective manner [12]. The
existence of these schedulers allows a flexible management of the framework.
Despite that, the available schedulers neither detect nor react to the dynamicity
and heterogeneity of the computing environment, a typical concern on pervasive
grids.
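The capacity-sharing idea just described (guaranteed minimum shares, plus elastic use of capacity left idle by others) can be sketched as follows; this is a much-simplified model with invented names, not the CapacityScheduler's actual queue logic.

```python
def allocate(total, guarantees, demands):
    """Give each organization min(demand, guarantee), then hand any spare
    capacity to organizations that still have unmet demand (elasticity)."""
    alloc = {org: min(demands[org], guarantees[org]) for org in guarantees}
    spare = total - sum(alloc.values())
    for org in guarantees:
        extra = min(demands[org] - alloc[org], spare)
        alloc[org] += extra
        spare -= extra
    return alloc

# Two organizations collectively funding a 100-slot cluster (illustrative).
guarantees = {"orgA": 40, "orgB": 60}
print(allocate(100, guarantees, {"orgA": 70, "orgB": 20}))
```

Here orgA's demand exceeds its guaranteed 40 slots, but because orgB only needs 20 of its 60, orgA elastically absorbs the excess, which is exactly the cost-effectiveness benefit the text describes.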
4. Context-Aware Scheduling
The main goal of this work is to improve the scheduling of Hadoop by adding
support to dynamic changes in the availability of resources, like those occurring in a
pervasive environment. Similar to the works in Section 3, we try to improve the
resource distribution, so that faster and more robust nodes have more data to
process. Differently from these works, we opted to modify the Hadoop code through
insertion of dynamic context information using, as far as possible, an existing
scheduler (Capacity Scheduler). In order to detect dynamic changes, the scheduler
must collect context information that, in this case, refers to available resources on
the nodes. Slaves must communicate periodically with the master in order to keep
information updated and let the scheduler adapt to the new context. In the
following section we present a more detailed explanation of the changes
implemented in Apache Hadoop.
By default, Hadoop reads information about the nodes from XML configuration files.
These files contain many Hadoop configuration parameters, including the resource
capacity of each node. Once loaded, the information will not be updated until the
next time the service is started. As pervasive environments may face performance
changes during the execution of an
application, we need a mechanism that updates contextual information during
runtime according to the environmental conditions.
To solve this problem, we integrate a collector module into Hadoop, allowing the
collection of contextual information about the available resources. The collector
was developed for the PER-MARE project [8], and its class diagram is presented in
Fig. 2. The collector module is based on the standard Java monitoring API [18],
which allows easy access to
the real characteristics of a node, with no additional libraries required. It allows
collecting different context information, such as the number of processors (cores)
and the system memory, using a set of interface and abstract/concrete classes that
generalize the collecting process. Due to its design, it is easy to integrate new
collectors and improve the resources available for the scheduling process, providing
data about the CPU load or disk usage, for example.
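The original collector relies on the Java monitoring API; as a rough stand-in, the same kind of node context can be gathered with the Python standard library. The field names here are illustrative, not the PER-MARE collector's actual interface.

```python
import os

def collect_context():
    """Collect simple node context: core count and, where the platform
    supports it, the 1-minute load average (None elsewhere)."""
    ctx = {"cores": os.cpu_count()}
    try:
        ctx["load_1min"] = os.getloadavg()[0]  # unavailable on some platforms
    except (OSError, AttributeError):
        ctx["load_1min"] = None
    return ctx

print(collect_context())
```

As with the collector's class hierarchy, adding further probes (disk usage, memory) would just mean extending the dictionary with more collected fields.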
4.2. Communication
Gathering the context information required to feed the Hadoop scheduler requires
transmitting this information through the network from the slave nodes
(NodeManager) to the master node (ResourceManager), which is responsible for the
scheduling. Instead of relying on a separate service, we chose to use the ZooKeeper
API [19], a tool initially developed inside Hadoop that became a full project as its
usage was extended to other applications. ZooKeeper provides efficient, reliable,
and fault-
tolerant tools for the coordination of distributed systems. In our case, we use
ZooKeeper services to distributed context information. As illustrated in Fig. 3, all
slaves (NodeManager) run an instance of the NodeStatusUpdater thread, that
collect data about the real resource availability of the node every 30 seconds. If the
amount of available resources changes, the DHT on ZooKeeper will be updated.
Similarly, the master (ResourceManager) also creates a thread to watch
ZooKeeper. If the ZooKeeper node detects a DHT change, the master will be notified
and update the scheduler information based on the new information. This solution
extends a previous one we proposed in [20] by offering a real-time observation of
available nodes. Indeed, our previous solution [20] only updated information
regarding the resources on
service initialization, replacing the XML configuration file, while this one updates
resource information whenever the availability changes. As a result, scheduling is
performed based on the current resource state.
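The publish-on-change pattern described above can be sketched without ZooKeeper itself; `FakeCoordinator` below is a hypothetical in-process stand-in for the shared znode, with a watcher callback playing the master's role.

```python
class FakeCoordinator:
    """Stand-in for the shared ZooKeeper store: keeps per-node context and
    notifies registered watchers only when a value actually changes."""

    def __init__(self):
        self.data = {}       # node id -> last published resources
        self.watchers = []   # callbacks registered by the master side

    def publish(self, node, resources):
        if self.data.get(node) != resources:   # update only on a real change
            self.data[node] = resources
            for watch in self.watchers:
                watch(node, resources)

updates = []
zk = FakeCoordinator()
# Master (ResourceManager) side: watch for context changes.
zk.watchers.append(lambda node, res: updates.append((node, res)))

# Slave (NodeStatusUpdater) side: periodic 30-second collections.
zk.publish("slave1", {"cores": 4, "mem_mb": 4096})  # first report: notify
zk.publish("slave1", {"cores": 4, "mem_mb": 4096})  # unchanged: no notify
zk.publish("slave1", {"cores": 4, "mem_mb": 2048})  # changed: notify master
print(updates)
```

Suppressing unchanged reports keeps coordination traffic proportional to actual resource variation, which is the point of updating the DHT only when availability changes.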
Figure: Context collector structure
Figure: Using ZooKeeper to distribute context information
HDFS, the base layer of the Hadoop architecture, contains different classifications
of data and is the most sensitive to security issues. It has no appropriate
role-based access control for addressing security problems. The risks of data
access, theft, and unwanted disclosure also arise when data are embedded in a
single Hadoop environment. The replicated data is likewise not secure and needs
more protection from breaches and vulnerabilities. Government sectors and
organisations mostly avoid using the Hadoop environment for storing valuable data
because of the weak security inside the Hadoop technology; instead, they provide
security outside the Hadoop environment, such as firewalls and intrusion detection
systems. Some authors have represented that HDFS in the Hadoop environment can
be protected against theft and vulnerabilities only by encrypting at the block level
and the individual file system. Other authors encrypted the blocks and nodes using
encryption techniques, but no complete algorithm is specified to maintain the
security in the Hadoop environment. In order to increase the security, some
approaches are mentioned below.
Figure 4: HDFS Architecture
The proposed work represents different Approaches for securing data in Hadoop
distributed file system. The first approach is based on Kerberos in HDFS.
a. Kerberos Mechanism
Kerberos [10] is a network authentication protocol which allows nodes to transfer
files over a non-secure channel, using a tool called a ticket to prove their identity
to each other. This Kerberos mechanism is used to enhance the security in HDFS. In
HDFS the connection between the client and the name node is achieved using
Remote Procedure Call [11], and the connection from the client (using HTTP) to the
data node is achieved using block transfer. A token or Kerberos is used to
authenticate an RPC connection; to obtain a token, the client makes use of a
Kerberos-authenticated connection. A Ticket Granting Ticket (TGT) or a Service
Ticket (ST) is used to authenticate against the name node using Kerberos. Both
TGT and ST can be renewed during long-running jobs; when the Kerberos
credentials are renewed, new TGTs and STs are issued and distributed to all tasks.
The Key Distribution Centre (KDC) issues the Kerberos Service Ticket using the TGT
after getting a request from a task, and network traffic to the KDC is avoided by
using tokens. In the name node, only the time period is extended but the ticket
remains constant. The major advantage is that even if the ticket is stolen by an
attacker, it cannot be renewed.
Another method can also be used to provide security for file access in HDFS. If a
client wants to access a block from a data node, it must first contact the name
node to identify which data nodes hold the blocks of the file, because only the
name node authorizes access according to the file permissions. The name node
issues a token called a Block Token, which the data node verifies. The data node
also issues a token called a Name Token, which allows the name node to enforce
permissions for correct access control on its data blocks. The Block Token allows
the data node to identify whether the client is authorized to access data blocks.
The Block Token and Name Token are sent back to the client together with the
respective data block locations, confirming that the client is authorized to access
those locations. These two methods increase security by preventing unauthorized
clients from reading and writing data blocks. Figure 5 shows the design view of the
Kerberos key distribution centre.
b. Bull Eye Approach
In big data, sensitive data such as credit card numbers, passwords, account
numbers and personal details are stored in a large technology called Hadoop. In
order to increase the security in the Hadoop base layer, a new approach called the
Bull Eye Approach is introduced for securing sensitive information. This approach is
introduced on the Hadoop module to view all sensitive information in 360 degrees,
to find whether all the secured information is stored without any risk, and to allow
the authorized person to preserve the personal information in the right way.
Recently this approach is used in companies like Dataguise's DGsecure [8] and
Amazon Elastic MapReduce [9]. The Dataguise company, which is famous for
providing data-centric security and governance solutions, is also involved in
providing security for Hadoop in the cloud, and has decided to maintain and provide
security for Hadoop wherever it is located in the cloud. Nowadays companies are
storing more sensitive data in the cloud because more breaches are taking place in
traditional on-premise data stores. To increase the security in the Hadoop base
layers, the Bull Eye Approach is also used in HDFS to provide security in 360
degrees from node to node. This approach is implemented in the data node of rack
1, where it checks that the sensitive data are stored properly in the blocks without
any risk and allows only the particular client to store data in the required blocks. It
also bridges the gap between data drawn from the original data node and the
replicated data node. When the client wants to retrieve any data from the
replicated data nodes, this is also managed by the Bull Eye Approach, which checks
whether there is a proper relation between the two racks. This algorithm allows the
data nodes to be more secure: only the authorized person can read or write them.
The algorithm can be implemented below the data node where the client reads or
writes the data to be stored in blocks. It is implemented not only in rack 1 but
similarly in rack 2, in order to increase the security of the blocks inside the data
nodes in 360 degrees. It checks for any attacks, breaches or theft of data taking
place in the blocks of the data node. Sometimes data are encrypted for protection
in the data node; these encrypted data are also protected using this algorithm in
order to maintain security. The algorithm travels from less than a terabyte to
multi-petabytes of semi-structured, structured and unstructured data stored in the
HDFS layer, in all angles. Mostly, encryption and wrapping of data occur at the
block levels of Hadoop rather than at the entire file level. This algorithm scans
before the data is allowed to enter into the blocks and also after it enters both
rack 1 and rack 2. Thus, this algorithm concentrates only on the sensitive data
stored in the data nodes. In our work, we mention this new algorithm to enhance
the security of the data nodes of HDFS.
c. Namenode Approach
In HDFS, if there is any problem with the name node and it becomes unavailable,
the group of system services and the data stored in HDFS become unavailable, so it
is not easy to access the data in a secure way in this critical situation. In order to
increase the security of data availability, two name nodes are used. These two
name node servers are allowed to run simultaneously in the same cluster. The two
redundant name nodes are provided by Name Node Security Enhance (NNSE),
which holds the Bull Eye Algorithm. It allows the Hadoop administrator to run the
two nodes such that one acts as master and the other acts as slave, in order to
reduce unnecessary or unexpected server crashes and to guard against natural
disasters. If the master name node crashes, the administrator needs to ask
permission from Name Node Security Enhance to provide data from the slave node,
in order to cover the time lag and data unavailability in a secure manner. Without
permission from NNSE, the admin never retrieves data from the slave node, which
avoids complex retrieval issues. If both name nodes act as masters, a continuous
risk occurs that reduces secure data availability and creates a performance
bottleneck over a local area network or wide area network. Thus, in future, we can
also increase security by using a vital configuration that ensures data is available
in a secured way to the client, by replicating many name nodes through Name
Node Security Enhance in HDFS blocks between many data centres and clusters.
Energy consumption and cooling are now large components of the operational cost
of datacenters and pose significant limitations in terms of scalability and reliability.
A growing segment of datacenter workloads is managed with MapReduce-style
frameworks, whether by privately managed instances of Yahoo!'s Hadoop, by
Amazon's Elastic MapReduce, or ubiquitously at Google by their archetypal
implementation. Therefore, it is important to understand the energy efficiency of
this emerging workload. The energy efficiency of a cluster can be improved in two
ways: by matching the number of active nodes to the current needs of the
workload, placing the remaining nodes in low-power standby modes; by engineering
the compute and storage features of each node to match its workload and avoid
energy waste on oversized components. Unfortunately, MapReduce frameworks
have many characteristics that complicate both options.
Hadoop has the global knowledge necessary to manage the transition of nodes to
and from low-power modes. Hence, Hadoop should be, or cooperate with, the
energy controller for a cluster. It is possible to recast the data layout and task
distribution of Hadoop to enable significant portions of a cluster to be powered
down while still fully operational.
Hadoop's file system (HDFS) spreads data across the disks of a cluster to take
advantage of the aggregate I/O, and improve the data-locality of computation.
While beneficial in terms of performance, this design principle complicates power-
management. With data distributed across all nodes, any node may be participating
in the reading, writing, or computation of a data-block at any time. This makes it
difficult to determine when it is safe to turn a node or component (e.g. disk) off.
Tangentially, Hadoop must also handle the case of node failures, which can be
frequent in large clusters. To address this problem, it implements a data-block
replication and placement strategy to mitigate the effect of certain classes of
common failures, namely:
single-node failures
whole-rack failures.
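The default placement strategy that mitigates these two failure classes can be sketched as follows. This is a hypothetical simplification for the default replication factor n = 3; the node and rack names are illustrative, and at least two racks with two or more nodes each are assumed:

```python
import random

def place_replicas(writer_node, nodes_by_rack):
    """Sketch of HDFS-style rack-aware placement for n = 3:
    first replica on the writer's node, second on a node in a
    different rack, third on another node in that same remote rack.
    Assumes at least two racks, each with at least two nodes."""
    # Map each node to its rack for lookup.
    rack_of = {node: rack
               for rack, members in nodes_by_rack.items()
               for node in members}
    replicas = [writer_node]
    # Second replica: any node on a rack other than the writer's.
    remote_rack = random.choice(
        [r for r in nodes_by_rack if r != rack_of[writer_node]])
    second = random.choice(nodes_by_rack[remote_rack])
    replicas.append(second)
    # Third replica: a different node on that same remote rack.
    third = random.choice(
        [m for m in nodes_by_rack[remote_rack] if m != second])
    replicas.append(third)
    # A single node failure leaves two copies; losing an entire rack
    # still leaves at least one copy on the other rack.
    return replicas
```

Placing two of the three replicas on one remote rack keeps a write's cross-rack traffic to a single transfer while still surviving a whole-rack failure.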
When data is stored in HDFS, the user specifies a block replication factor. A
replication factor of n instructs HDFS to ensure that n identical copies of any data-
block are stored across a cluster (by default n = 3). Whilst replicating blocks,
Hadoop maintains two invariants: no datanode holds more than one replica of any
block, and no rack holds more than two replicas of the same block (provided there
are sufficient racks).
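Under common HDFS placement policy, no datanode stores more than one replica of a given block and no rack stores more than two (given enough racks). A minimal check of these two conditions, with illustrative node and rack names, might look like:

```python
def satisfies_placement_invariants(replica_nodes, rack_of):
    """Check, for one block, that (1) no datanode holds more than one
    replica and (2) no rack holds more than two replicas."""
    if len(set(replica_nodes)) != len(replica_nodes):
        return False  # some node holds two replicas of the block
    # Count replicas per rack.
    per_rack = {}
    for node in replica_nodes:
        per_rack[rack_of[node]] = per_rack.get(rack_of[node], 0) + 1
    return all(count <= 2 for count in per_rack.values())

rack_of = {"n0": "r0", "n1": "r0", "n2": "r0", "n3": "r1"}
print(satisfies_placement_invariants(["n0", "n1", "n3"], rack_of))  # True
print(satisfies_placement_invariants(["n0", "n1", "n2"], rack_of))  # False: 3 on r0
```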
The fact that Hadoop maintains replicas of all data affords an opportunity to save
energy on inactive nodes. That is, there is an expectation that if an inactive node is
turned off, the data it stores will be found somewhere else on the cluster. However,
this is only true up to a point. Should the n nodes that hold the n replicas of a single
block be selected for deactivation, that piece of data is no longer available to the
cluster. In fact, we have found through examination of our own Hadoop cluster that
when configured as a single rack, removing any n nodes from the cluster (where n is
the replication factor) will render some data unavailable. Thus the largest number of
nodes we could disable without impacting data availability is n − 1, or merely two
nodes when n = 3. While Hadoop's autonomous re-replication feature can, over
time, allow additional nodes to be disabled, this comes with severe storage capacity
and resource penalties (i.e. significant amounts of data must be transferred over the
network and condensed onto the disks of the remaining nodes). Hadoop's rack-
aware replication strategy mitigates this effect only moderately; at best a single
rack can be disabled before data begins to become unavailable. To address this
shortfall in Hadoop's data-layout strategy, we propose a new invariant for use
during block replication: at least one replica of a data-block must be stored in a
subset of nodes we refer to as a covering subset. The premise behind a covering
subset is that it contains a sufficient set of nodes to ensure the immediate
availability of data, even were all nodes not in the covering subset to be disabled.
This invariant leaves the specific designation of the covering subset as a matter of
policy. The purpose in establishing a covering subset and utilizing this storage
invariant is so that large numbers of nodes can be gracefully removed from a
cluster (i.e. turned off) without affecting the availability of data or interrupting the
normal operation of the cluster; thus, it should be a minority portion of the cluster.
On the other hand, it cannot be too small, or else it would limit storage capacity or
even become an I/O bottleneck. As such, a covering subset would best be sized as a
moderate fraction (10% to 30%) of a whole cluster, to balance these concerns. Just
as replication factors can be specified by users on a file-by-file basis, covering
subsets should be established and specified for files by users (or cluster
administrators). In large clusters (thousands of nodes), this allows covering subsets
to be intelligently managed as current activity dictates, rather than as a
compromise between several potentially active users or applications. Thus, if a
particular user or application vacates a cluster for some period of time, the nodes of
its associated covering subset can be turned off without affecting the availability of
resident users and applications.
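The covering-subset invariant can be sketched as a placement rule plus an availability check. This is a simplified illustration, not the actual implementation; the node names and the roughly 30% covering-subset sizing are assumptions for the example:

```python
import random

def place_with_covering(nodes, covering, n=3):
    """Choose n replica nodes for a block such that at least one
    replica lands in the covering subset (the proposed invariant);
    the remaining replicas may land anywhere else."""
    first = random.choice(sorted(covering))  # covering-subset replica
    others = random.sample(sorted(set(nodes) - {first}), n - 1)
    return [first] + others

def all_blocks_available(placements, active_nodes):
    """A block is available if at least one of its replicas sits on
    an active (powered-on) node."""
    return all(any(node in active_nodes for node in reps)
               for reps in placements)

# With the covering invariant in force, powering down every node
# outside the covering subset never makes data unavailable:
nodes = [f"n{i}" for i in range(10)]
covering = {"n0", "n1", "n2"}  # ~30% of this small cluster
placements = [place_with_covering(nodes, covering) for _ in range(100)]
print(all_blocks_available(placements, covering))  # True
```

The check succeeds precisely because every block keeps one replica inside the covering subset; without that invariant, some randomly placed block would eventually have all n replicas among the deactivated nodes.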