I hereby declare that this submission is my own work and that, to the best of my knowledge and
belief, it contains no matter previously published or written by another person, nor material
which to a substantial extent has been accepted for the award of any other Degree or Diploma of
the University or other Institute of Higher Learning, except where due acknowledgement has
been made in the text.
Signature
Name: Anushka Bajpai
ACKNOWLEDGEMENT
There are times when silence speaks more loudly than words of praise, for words alone cannot
express, and would only put a veneer over, the true feelings of gratitude I have at this point
of time.
Thank you GOD for giving me strength and ability to understand, learn and complete this report.
I would like to express my sincere gratitude to everyone who contributed towards preparing and
making this project successful.
I am deeply indebted to R.K. Banyal Sir, whose mentoring and encouragement have been
especially valuable, and his early insights launched a greater part of this project.
I express my profound and sincere thanks to Hukum Chand Saini Sir, who acted as a mariner's
compass and steered me throughout my project voyage through his excellent guidance and
constant inspiration.
Above ground, I am indebted to my family, whose value to me only grows with age.
ABSTRACT
The data scheduler is an automated utility that schedules tasks to run on a periodic basis. It
is typically used in processes that run for long periods of time and perform different tasks at
scheduled instants. The scheduler takes as input the processes to be run and their start times,
and then automates the execution of the tasks according to that schedule.
This data scheduler is useful for tasks that have to be run on a daily basis. In most firms,
there are many tasks that must be performed periodically each day, and the scheduler makes such
tasks easy to run.
A pipeline is a logical grouping of processes: it groups processes into a single unit that
performs a task.
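As a rough sketch only (not the project's actual implementation), a pipeline of scheduled tasks can be expressed with Python's standard sched module; the task names and the near-immediate start time here are assumptions for illustration:

```python
import sched
import time

# A pipeline is a logical grouping of processes (tasks) run as one unit.
def make_pipeline(*tasks):
    def run_pipeline():
        return [task() for task in tasks]
    return run_pipeline

# Hypothetical daily tasks, for illustration only.
def extract():
    return "extracted"

def load():
    return "loaded"

pipeline = make_pipeline(extract, load)

# The scheduler takes the task and its start time as input and then
# automates its execution (the delay is nearly zero for this demo).
scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(0.01, 1, pipeline)
scheduler.run()
```

In the real system the start times would come from user configuration and the pipeline would repeat on its configured period.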
The need for a GENERIC DATA-SCHEDULER FOR DATA FLOW PIPELINING for the efficient
execution of tasks is a serious demand. We therefore provide a solution by constructing a
system that is capable of both configuring tasks and scheduling them, so as to maximize CPU
utilization. The project involves the use of:
TABLE OF CONTENTS
Declaration
Acknowledgement
Abstract
Table of Contents
List of Illustrations
Chapter 1: Introduction
Chapter 2: Big Data
CHAPTER 1: INTRODUCTION
We live in the data age. It is not easy to measure the total volume of data stored
electronically, but an IDC estimate put the size of the digital universe at 4.4
zettabytes in 2013, forecasting a tenfold growth by 2020. This flood of data is
coming from many sources. Consider the following:
The New York Stock Exchange generates about 45 terabytes of data per
day.
Facebook hosts more than 240 billion photos, growing at 7 petabytes per
month.
The trend is for every individual's data footprint to grow, but perhaps more
significantly, the amount of data generated by machines as a part of the Internet of
Things will be even greater than that generated by people. The volume of data
being made publicly available increases every year, too.
The problem is simple: although the storage capacities of hard drives have
increased massively over the years, access speeds (the rate at which data can be
read from drives) have not kept up. The obvious way to reduce the time is to read
from multiple disks at once. Imagine if we had 100 drives, each holding one
hundredth of the data. Working in parallel, we could read the data in under two
minutes.
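The arithmetic behind this claim can be made concrete. Assuming, for illustration, a 1 TB dataset and a sustained read speed of 100 MB/s per drive (figures not stated above), a short calculation shows the speed-up:

```python
# Assumed figures for illustration (not stated in the text above):
# a 1 TB dataset and a sustained read speed of 100 MB/s per drive.
DATASET_MB = 1_000_000     # 1 TB expressed in MB
DRIVE_SPEED_MBPS = 100     # sustained MB/s per drive

def read_time_minutes(num_drives):
    # Each drive holds, and reads, an equal share of the data in parallel.
    seconds = (DATASET_MB / num_drives) / DRIVE_SPEED_MBPS
    return seconds / 60

print(round(read_time_minutes(1), 1))    # one drive: about 167 minutes
print(round(read_time_minutes(100), 1))  # 100 drives: under two minutes
```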
There is more to reading and writing data in parallel to or from multiple
disks, though.
The first problem to solve is hardware failure: as soon as you start using many
pieces of hardware, the chance that one will fail is fairly high. A common way of
avoiding data loss is through replication: redundant copies of the data are kept by
the system so that in the event of failure, there is another copy available.
The second problem is that most analysis tasks need to be able to combine the data
in some way, and data read from one disk may need to be combined with data from
any of the other 99 disks. Various distributed systems allow data to be combined
from multiple sources.
Big data involves the data produced by different devices and applications. Given
below are some of the fields that come under the umbrella of Big Data.
Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures
voices of the flight crew, recordings of microphones and earphones, and the
performance information of the aircraft.
Social Media Data: Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
Stock Exchange Data: The stock exchange data holds information about the buy
and sell decisions made by customers on the shares of different companies.
Power Grid Data: The power grid data holds information consumed by a particular
node with respect to a base station.
Transport Data: Transport data includes model, capacity, distance and availability of
a vehicle.
Search Engine Data: Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data.
The data in it will be of three types: structured, semi-structured, and unstructured.
Big data is really critical to our lives and is emerging as one of the most important
technologies in the modern world. Following are just a few benefits that are well
known to all of us:
Using the information kept in social networks like Facebook, marketing
agencies are learning about the response to their campaigns, promotions, and
other advertising media.
Using information from social media, such as the preferences and product
perceptions of their consumers, product companies and retail organizations are
planning their production.
Using data on the previous medical history of patients, hospitals are
providing better and quicker service.
In this traditional approach, an enterprise has a computer to store and process big
data. Here data is stored in an RDBMS like Oracle Database, MS SQL Server, or DB2,
and sophisticated software can be written to interact with the database, process
the required data, and present it to the users for analysis.
Limitation
This approach works well where the volume of data is small enough to be
accommodated by standard database servers, or up to the limit of the processor
that is processing the data. But when it comes to dealing with huge amounts of
data, it is really a tedious task to process such data through a traditional database
server.
GOOGLE'S SOLUTION
Google solved this problem using an algorithm called MapReduce. This algorithm
divides the task into small parts and assigns those parts to many computers
connected over the network, and collects the results to form the final result dataset.
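The idea can be illustrated with a toy, single-machine word count, the canonical MapReduce example; this merely simulates the map, shuffle, and reduce phases in memory, whereas a real cluster distributes them across machines:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    # Shuffle: group all emitted values by key across the map outputs.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's values into the final result.
    return {key: sum(values) for key, values in groups.items()}

splits = ["big data big ideas", "big clusters"]
counts = reduce_phase(shuffle([map_phase(s) for s in splits]))
print(counts["big"])  # → 3
```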
HADOOP
Doug Cutting, Mike Cafarella and team took the solution provided by Google and
started an Open Source Project called HADOOP in 2005 and Doug named it after his
son's toy elephant. Now Apache Hadoop is a registered trademark of the Apache
Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes. In short, the Hadoop framework makes
it possible to develop applications that run on clusters of computers and perform
complete statistical analysis of huge amounts of data.
3.1 A BRIEF HISTORY OF HADOOP
Hadoop was created by Doug Cutting, the creator of Apache Lucene, a widely used
text search library. The Apache Nutch project, an open source web search engine,
had a significant contribution to building Hadoop. Hadoop is not an acronym; it is a
made-up name. The project creator, Doug Cutting, explains how the name came
about: "The name my kid gave a stuffed yellow elephant. Short, relatively easy to
spell and pronounce, meaningless, and not used elsewhere: those are my naming
criteria. Kids are good at generating such." It is ambitious to build a web search
engine from scratch, as it is not only challenging to build the software required to
crawl and index websites, but also challenging to run it without a dedicated
operations team, since there are so many moving parts. It was estimated that a
1-billion-page index would cost around $500,000 to build and $30,000 a month to
maintain. Nevertheless, this goal is worth pursuing, as it opens search engine
algorithms to the world for review and improvement. The Nutch project was started
in 2002, and the crawler and search system were quickly developed. However, the
developers soon realized that their system would not scale to a billion pages. The
timely publication by Google in 2003 of the architecture of the Google File System
(GFS) came in very handy: GFS, or something like it, would solve their storage
needs for the very large files generated as part of the web crawl and indexing
process, and in particular would free up the time spent maintaining the storage
nodes. This effort gave way to the Nutch Distributed File System (NDFS). Google
produced another paper in 2004 that introduced MapReduce to the world. By early
2005 the Nutch developers had a working MapReduce implementation in Nutch, and by
the middle of that year most of the Nutch algorithms had been ported to MapReduce
and NDFS.
The NDFS and MapReduce implementations in Nutch found applications in areas beyond
the scope of Nutch; in February 2006 they were moved out of Nutch to form their own
independent subproject called Hadoop. Around the same time Doug Cutting joined
Yahoo!, which gave him access to a dedicated team and resources to turn Hadoop
into a system that ran at web scale. This was demonstrated when Yahoo! announced
that its production search index was being generated by a 10,000-node Hadoop cluster.
In January 2008, Hadoop was promoted to a top-level project at Apache, confirming
its success and its diverse, active community. By this time, Hadoop was being used
by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New
York Times. The capability of Hadoop was demonstrated, and publicly put at the
forefront of the distributed computing sphere, when The New York Times used
Amazon's EC2 compute cloud to crunch through four terabytes of scanned archives
from the paper, converting them into PDFs for the Web. The project came at the
right time, with great publicity for Hadoop and the cloud, and it would have been
impossible to attempt if not for Amazon's popular pay-by-the-hour cloud model. The
NYT used a large number of machines, about 100, and Hadoop's easy-to-use parallel
programming model to complete the task in 24 hours. Hadoop's successes did not
stop there; in April 2008 it broke a world record to become the fastest system to
sort a terabyte of data. It took 209 seconds to sort a terabyte of data on a
910-node cluster, beating the previous year's winner of 297 seconds. It did not
end there: later that year Google reported that its MapReduce implementation
sorted one terabyte in 68 seconds, and Yahoo! then reported to have broken
Google's record by sorting one terabyte in 62 seconds.
CHAPTER 4: HADOOP ECOSYSTEM
Hadoop is a generic processing framework designed to execute queries and other
batch read operations on massive datasets that can scale from tens of terabytes to
petabytes in size. HDFS and MapReduce together provide a reliable, fault-tolerant
software framework for processing vast amounts of data in parallel on large clusters
of commodity hardware (potentially scaling to thousands of nodes). Hadoop meets
the needs of many organizations for flexible data analysis capabilities with an
unmatched price-performance curve. This flexible data analysis applies to data in
a variety of formats, from unstructured data, such as raw text, to semi-structured
data, such as logs, to structured data with a fixed schema. In environments where
massive server farms are used to collect data from a variety of sources, Hadoop
can run parallel queries as background batch jobs on the same server farm. Thus,
the requirement for additional hardware to process data from a traditional
database system is eliminated (assuming such a system can scale to the required
size). The effort and time required to load data into another system is also
reduced, since the data can be processed directly within Hadoop; for very large
datasets, that loading overhead becomes impractical.
All the components of the Hadoop ecosystem are evident as explicit entities. The
holistic view of the Hadoop architecture gives prominence to Hadoop Common, Hadoop
YARN, the Hadoop Distributed File System (HDFS), and Hadoop MapReduce. Hadoop
Common provides the Java libraries, utilities, OS-level abstractions, and the
necessary Java files and scripts to run Hadoop, while Hadoop YARN is a
framework for job scheduling and cluster resource management. HDFS in Hadoop
architecture provides high throughput access to application data and Hadoop
MapReduce provides YARN based parallel processing of large data sets.
The Hadoop ecosystem includes other tools to address particular needs. Hive is a
SQL dialect and Pig is a dataflow language; both hide the tedium of creating
MapReduce jobs behind higher-level abstractions more appropriate for user goals.
HBase is a column-oriented database management system that runs on top of
HDFS. Avro provides data serialization and data exchange services for Hadoop.
Sqoop (from "SQL-to-Hadoop") moves data between relational databases and Hadoop.
ZooKeeper is used for coordinating distributed services and Oozie is a workflow
scheduling system. In the absence of an ecosystem, developers would have to
integrate separate sets of technologies to create Big Data solutions.
The Hadoop ecosystem includes the following tools to address particular needs:
Common: A set of components and interfaces for distributed file systems and
general I/O (serialization, Java RPC, persistent data structures).
Avro: A serialization system for efficient, cross-language RPC and persistent data
storage.
HDFS: A distributed file system that runs on large clusters of commodity machines.
Pig: A data flow language and execution environment for exploring very large
datasets. Pig runs on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and
provides a query language based on SQL (and which is translated by the runtime
engine to MapReduce jobs) for querying the data.
Sqoop: A tool for efficiently moving data between relational databases and HDFS.
Cascalog: Cascalog is similar to Scalding in the way it hides the limitations of Java
behind a powerful Clojure API for Cascading. Cascalog includes logic programming
constructs inspired by Datalog; the name is derived from Cascading + Datalog.
Apache BigTop: It was originally a part of Cloudera's CDH distribution and is
used to test the Hadoop ecosystem.
Apache Flume: It is responsible for data transfer between a source and a sink,
which can be scheduled or triggered upon an event. It is used to harvest, aggregate,
and move large amounts of data in and out of Hadoop. Flume supports different data
formats for sources (such as Avro and files) and sinks (such as HDFS and HBase),
and it can also transform data before it is moved between sources and sinks.
HDFS Design Goals
Streaming Data Access: Applications that run on the Hadoop HDFS need access to
streaming data. These applications cannot be run on general-purpose file systems.
HDFS is designed for large-scale batch processing, which requires high-throughput
data access. Several POSIX requirements are relaxed to meet this need for high
throughput.
Large Data Sets: The HDFS-based applications feed on large datasets. A typical file
size is in the range of high gigabytes to low terabytes. It should provide high data
bandwidth and support millions of files across hundreds of nodes in a single cluster.
Moving compute instead of data: Any computation is efficient if it executes near the
data because it avoids the network transfer bottleneck. Migrating the computation
closer to the data is a cornerstone of HDFS-based programming. HDFS provides all
the necessary application interfaces to move the computation close to the data
prior to execution.
Low-latency data access: The high-throughput data access comes at the cost of
latency. Latency sensitive applications are not suitable for HDFS. HBase has shown
promise in handling low-latency applications along with large-scale data access.
Lots of small files: The metadata of the file system is stored in the Namenode's
memory (the master node). The limit on the number of files in the file system
therefore depends on the Namenode's memory. Typically, each file, directory, and
block takes about 150 bytes. For example, if you had one million files, each taking
one block, you would need at least 300 MB of memory. Storing millions of files is
feasible, but the hardware cannot accommodate billions of files.
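The arithmetic above can be checked directly, using the approximate 150-bytes-per-object figure quoted in the text and assuming each file contributes one file object plus one block object:

```python
BYTES_PER_OBJECT = 150  # approximate metadata cost per file, directory, or block

def namenode_memory_mb(num_files, blocks_per_file=1):
    # Each file contributes one file object plus its block objects,
    # all of which must fit in the Namenode's memory.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT / 1_000_000

print(namenode_memory_mb(1_000_000))  # one million single-block files → 300.0 (MB)
```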
The HDFS architecture follows the popular master/slave pattern. An HDFS cluster
consists of a master server, called the Namenode, that manages the file-system
namespace and regulates file access. Analogous to slaves, there are a number of
Datanodes, typically one per cluster node, which manage the data stored in the
nodes. The HDFS file-system exists independently of the host file-system and
allows data to be stored in its own namespace. The Namenode allows typical
file-system operations like opening, closing, and renaming files and directories.
It also maintains the mapping between data blocks and Datanodes. The Datanodes
handle the read and write requests. Upon the Namenode's instructions, the
Datanodes perform block creation, deletion, and replication operations. The HDFS
architecture is given in Figure 2.2. The Namenode and Datanodes are
software services provided by HDFS that are designed to run on heterogeneous
commodity machines. These applications typically run on Unix/Linux-based
operating systems. The Java programming language is used to build these services,
and any machine that supports the Java runtime environment can run the Namenode
and Datanode services. Given the highly portable nature of Java, HDFS can be
deployed on a wide range of machines. A typical cluster installation has one
dedicated machine that acts as the master, running the Namenode service. The other
machines in the cluster each run one instance of the Datanode service. Although
you can run multiple Datanodes on one machine, this practice is rare in real-world
deployments. A single Namenode/master instance simplifies the architecture of the
system: it acts as an arbitrator and a repository for all the HDFS metadata. The
system is designed such that user data never flows through the Namenode. Figure 2.3
shows how the Hadoop ecosystem interacts together.
2.3.3 FileSystem
The file organization in HDFS is similar to the traditional hierarchical type. A user or
an application can create and store files in directories. The namespace hierarchy is
similar to other file-systems in the sense that one can create, remove, and move
files from one directory to another, or even rename a file. HDFS does not support
hard links or soft links. Any changes to the file-system namespace are recorded by
the Namenode. An application can specify the number of copies of a file that HDFS
should maintain; this number is called the replication factor.
Figure: Interaction between HDFS and MapReduce
HDFS provides a reliable storage mechanism even though the cluster spans
thousands of machines. Each file is stored as a sequence of blocks, where each
block except the last one is of the same size. Fault tolerance is ensured by
replicating the file blocks. The block size and replication factor are configurable
for each file, either by the user or by an application. The replication factor can
be set at file creation and modified later. Files in HDFS are write-once and
strictly adhere to a single-writer-at-a-time property (Figure 2.4). The Namenode,
acting as the master, takes all the decisions regarding data block replication. It
receives a heartbeat (see Figure 2.5 on how it works) and a block report from each
Datanode in the cluster. A heartbeat implies that the Datanode is functional, and
the block report provides a list of all the blocks on a Datanode. Replication is
vital to HDFS reliability, and performance is improved by optimizing the placement
of these replicas. This optimized placement contributes significantly to
performance and distinguishes HDFS from other file-systems, though the feature
requires much tuning and experience to get right. Using a rack-aware replication
policy improves data availability, reliability, and network bandwidth utilization.
This policy was among the first of its kind and has attracted much attention, with
better and more sophisticated policies to follow. Typically, large HDFS instances
span multiple racks.
Communications between these racks go through switches. Network bandwidth
between machines in different racks is less than machines within the same rack.
Figure: Data Replication in Hadoop
The Namenode is aware of the rack ID each Datanode belongs to. A simple policy
would place one replica in each of several unique racks; that way, even if an
entire rack fails (however unlikely), a replica on another rack is still available.
However, this policy can be costly, as the number of writes between racks
increases. When the replication factor is, say, 3, HDFS places one replica on the
writer's node, a second on a node in a different (remote) rack, and the third on a
different node in that same remote rack. This policy cuts inter-rack writes, which
improves write performance; since the chance of rack failure is far less than that
of node failure, reliability is not compromised. It also reduces the network
bandwidth used when reading data, since a block's replicas are placed in only two
unique racks instead of three. With this policy, one-third of the replicas are on
one node and two-thirds are on one rack, with any further replicas distributed
evenly across the remaining racks. This policy improves write performance along
with data reliability and read performance.
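A minimal sketch of this default placement for a replication factor of 3 follows (a simplification: a real cluster also weighs node load and free space when choosing targets, and the rack and node names here are illustrative):

```python
import random

def place_replicas(racks, writer_node, writer_rack):
    # Default HDFS policy for a replication factor of 3:
    #   replica 1: the writer's own node,
    #   replica 2: a node in a different (remote) rack,
    #   replica 3: a different node in that same remote rack.
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second = random.choice(racks[remote_rack])
    third = random.choice([n for n in racks[remote_rack] if n != second])
    return [(writer_rack, writer_node), (remote_rack, second), (remote_rack, third)]

racks = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
}
replicas = place_replicas(racks, "n1", "rack1")
# The three replicas span exactly two racks.
print(len({rack for rack, _ in replicas}))  # → 2
```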
Safe mode: When the HDFS services start, the Namenode enters safe mode. The
Namenode receives heartbeats and block reports from the Datanodes. Each block has
a specified number of replicas, and the Namenode checks, through these reports,
whether the replicas are present. Once a sufficient percentage of blocks are
reported present, the Namenode exits safe mode (this takes about 30 seconds) and
completes the remaining replication on other Datanodes.
2.3.5 Communication
The TCP/IP protocols are the foundation of all the communication protocols in
HDFS. Remote Procedure Calls (RPCs) are designed around the client protocol and
the Datanode protocol. The Namenode never initiates a procedure call; it only
responds to requests from clients and Datanodes.
Robustness: The fault-tolerant nature of HDFS ensures data reliability. The three
common types of failures are Namenode failures, Datanode failures, and network
partitions.
Cluster Rebalancing: Data rebalancing schemes are compatible with the HDFS
architecture. Data is automatically moved from a Datanode when its free space
falls below a certain threshold. If a file is in heavy demand by an application, a
scheme might create additional replicas of the file to rebalance the demand for
its data across the cluster.
Data Integrity: Data corruption can occur for various reasons: storage device
failure, network faults, buggy software, and so on. To recognize corrupted data
blocks, HDFS implements checksums over the contents of retrieved HDFS files.
Each block of data has an associated checksum that is stored in a separate hidden
file in the same HDFS namespace. When a file is retrieved, its contents are
checked against the stored checksum; if they do not match, the client can ask for
that data block from another Datanode (Figure 2.6).
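This mechanism can be sketched as follows. HDFS actually checksums fixed-size chunks within each block; for brevity this sketch keeps a single CRC32 per block:

```python
import zlib

def store_block(data):
    # On write, a checksum is computed and kept alongside the block
    # (HDFS keeps it in a separate hidden file in the same namespace).
    return {"data": data, "checksum": zlib.crc32(data)}

def read_block(block):
    # On read, the checksum is recomputed and compared with the stored
    # one; on a mismatch the client would ask another Datanode instead.
    if zlib.crc32(block["data"]) != block["checksum"]:
        raise IOError("corrupt block: checksum mismatch")
    return block["data"]

block = store_block(b"replicated payload")
assert read_block(block) == b"replicated payload"

block["data"] = b"bit-rotted payload"  # simulate silent corruption
try:
    read_block(block)
except IOError as err:
    print(err)  # corrupt block: checksum mismatch
```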
Figure: Each cluster contains one Namenode. This design facilitates a simplified
model for managing each Namespace and arbitrating data distribution.
Data Blocks: HDFS by design supports very large files. Applications written to work
on HDFS write their data once and read many times, with reads at streaming
speeds. A typical block size used by HDFS is 64 MB. Each file is chopped into 64 MB
chunks and replicated.
Staging: A request to create a file does not reach the Namenode immediately.
Initially, the HDFS client caches the file data in a temporary local file, and
application writes are redirected to this temporary file. When the local file
accumulates content beyond the block size, the client contacts the Namenode. The
Namenode creates a file in the file-system, allocates a data block for it, and
responds to the client with the Datanode and data block identities. The client
flushes the block of data from the temporary file to this data block. When the
file is closed, the Namenode is informed and commits the file creation operation
to its persistent store. If the Namenode dies before this commit, the file is lost.
The benefit of this approach is that it allows streaming writes to files. If a
client wrote directly to a remote file without buffering, network speed and
congestion would considerably impact throughput. Earlier file systems have
successfully used client-side caching to improve performance, and in the case of
HDFS a few POSIX rules have been relaxed to achieve high-performance data
uploads.
A. NameNode
The HDFS namespace is a hierarchy of files and directories. Files and
directories are represented on the NameNode by nodes, which record
attributes like permissions, modification and access times, namespace and
disk space quotas. The file content is split into large blocks (typically 128
megabytes, but user selectable file-by-file) and each block of the file is
independently replicated at multiple DataNodes (typically three, but user
selectable file-by-file). The NameNode maintains the namespace tree and the
mapping of file blocks to DataNodes (the physical location of file data). An
HDFS client wanting to read a file first contacts the NameNode for the
locations of data blocks comprising the file and then reads block contents
from the DataNode closest to the client. When writing data, the client
requests the NameNode to nominate a suite of three DataNodes to host the
block replicas. The client then writes data to the DataNodes in a pipeline
fashion. The current design has a single NameNode for each cluster. The
cluster can have thousands of DataNodes and tens of thousands of HDFS
clients per cluster, as each DataNode may execute multiple application tasks
concurrently. HDFS keeps the entire namespace in RAM. The inode data and
the list of blocks belonging to each file comprise the metadata of the name
system, called the image. The persistent record of the image, stored in the
local host's native file system, is called a checkpoint. The NameNode also
stores the modification log of the image, called the journal, in the local host's
native file system. For improved durability, redundant copies of the
checkpoint and journal can be made at other servers. During restarts the
NameNode restores the namespace by reading the checkpoint and replaying
the journal. The locations of block replicas may change over time and are not
part of the persistent checkpoint.
B. DataNodes
Each block replica on a DataNode is represented by two files in the local
host's native file system. The first file contains the data itself and the second
file is the block's metadata, including checksums for the block data and the
block's generation stamp. The size of the data file equals the actual length of
the block and does not require extra space to round it up to the nominal block
size as in traditional file systems. Thus, if a block is half full it needs only half
of the space of the full block on the local drive. During startup each DataNode
connects to the NameNode and performs a handshake. The purpose of the
handshake is to verify the namespace ID and the software version of the
DataNode. If either does not match that of the NameNode the DataNode
automatically shuts down. The namespace ID is assigned to the file system
instance when it is formatted. The namespace ID is persistently stored on all
nodes of the cluster. Nodes with a different namespace ID will not be able to
join the cluster, thus preserving the integrity of the file system. The
consistency of software versions is important because an incompatible version
may cause data corruption or loss, and on large clusters of thousands of
machines it is easy to overlook nodes that did not shut down properly prior to
the software upgrade or were not available during the upgrade. A DataNode
that is newly initialized and without any namespace ID is permitted to join the
cluster and receives the cluster's namespace ID. After the handshake the
DataNode registers with the NameNode. DataNodes persistently store their
unique storage IDs. The storage ID is an internal identifier of the DataNode,
which makes it recognizable even if it is restarted with a different IP address
or port. The storage ID is assigned to the DataNode when it registers with
the NameNode for the first time and never changes after that. A DataNode
identifies block replicas in its possession to the NameNode by sending a block
report. A block report contains the block id, the generation stamp and the
length for each block replica the server hosts. The first block report is sent
immediately after the DataNode registration. Subsequent block reports are
sent every hour and provide the NameNode with an up-to-date view of where
block replicas are located on the cluster. During normal operation DataNodes
send heartbeats to the NameNode to confirm that the DataNode is operating
and the block replicas it hosts are available. The default heartbeat interval is
three seconds. If the NameNode does not receive a heartbeat from a
DataNode in ten minutes the NameNode considers the DataNode to be out of
service and the block replicas hosted by that DataNode to be unavailable.
The NameNode then schedules creation of new replicas of those blocks on
other DataNodes. Heartbeats from a DataNode also carry information about
total storage capacity, fraction of storage in use, and the number of data
transfers currently in progress. These statistics are used for the NameNode's
space allocation and load balancing decisions. The NameNode does not
directly call DataNodes; it uses replies to heartbeats to send instructions to
the DataNodes. The instructions include commands to:
replicate blocks to other nodes;
remove local block replicas;
re-register or to shut down the node;
send an immediate block report.
These commands are important for maintaining the overall system integrity and
therefore it is critical to keep heartbeats frequent even on big clusters. The
NameNode can process thousands of heartbeats per second without affecting other
NameNode operations.
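The heartbeat bookkeeping described above can be sketched as follows, using the intervals quoted in the text (3-second heartbeats, a 10-minute timeout); the class and node names are illustrative:

```python
# Sketch of the NameNode's heartbeat bookkeeping. Intervals follow the
# text: heartbeats every 3 s, a node is considered dead after 10 minutes.
HEARTBEAT_INTERVAL = 3   # seconds between heartbeats from a DataNode
DEAD_AFTER = 600         # seconds of silence before a node is out of service

class NameNodeMonitor:
    def __init__(self):
        self.last_heartbeat = {}

    def heartbeat(self, datanode, now):
        # Record the time of the latest heartbeat from this DataNode.
        self.last_heartbeat[datanode] = now

    def dead_nodes(self, now):
        # Any DataNode silent for 10 minutes is out of service; the
        # NameNode would then re-create its replicas on other nodes.
        return [dn for dn, t in self.last_heartbeat.items()
                if now - t >= DEAD_AFTER]

monitor = NameNodeMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=599)   # dn1 keeps reporting; dn2 goes silent
print(monitor.dead_nodes(now=600))  # → ['dn2']
```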
C. HDFS Client
User applications access the file system using the HDFS client, a code library
that exports the HDFS file system interface. Similar to most conventional file
systems, HDFS supports operations to read, write and delete files, and
operations to create and delete directories. The user references files and
directories by paths in the namespace. The user application generally does
not need to know that file system metadata and storage are on different
servers, or that blocks have multiple replicas. When an application reads a
file, the HDFS client first asks the NameNode for the list of DataNodes that
host replicas of the blocks of the file. It then contacts a DataNode directly and
requests the transfer of the desired block. When a client writes, it first asks
the NameNode to choose DataNodes to host replicas of the first block of the
file. The client organizes a pipeline from node-to-node and sends the data.
When the first block is filled, the client requests new DataNodes to be chosen
to host replicas of the next block. A new pipeline is organized, and the client
sends the further bytes of the file.
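The read/write flow above can be sketched as follows; `MiniNameNode`, `write_file`, and the fixed DataNode list are simplifications invented for illustration (real HDFS chooses DataNodes per block and streams data through a pipeline).

```python
class MiniNameNode:
    """Toy namespace: path -> block list; block -> replica locations."""

    def __init__(self):
        self.blocks = {}     # path -> [block ids]
        self.locations = {}  # block id -> [datanode ids]
        self._next = 0

    def get_block_locations(self, path):
        """What a reading client asks for before contacting DataNodes."""
        return [(b, self.locations[b]) for b in self.blocks[path]]

    def add_block(self, path, datanodes):
        """Allocate the next block of a file on the given DataNodes."""
        block = f"blk_{self._next}"
        self._next += 1
        self.blocks.setdefault(path, []).append(block)
        self.locations[block] = list(datanodes)
        return block

def write_file(nn, path, data, block_size, datanodes):
    """Each filled block triggers a request for a fresh block/pipeline."""
    for _ in range(0, len(data), block_size):
        nn.add_block(path, datanodes)
    return nn.get_block_locations(path)

nn = MiniNameNode()
print(write_file(nn, "/logs/a", b"x" * 300, block_size=128,
                 datanodes=["dn1", "dn2", "dn3"]))
```

A 300-byte file with 128-byte blocks needs three blocks, matching the text's description of a new pipeline being organized per block.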
Figure 1. An HDFS client creates a new file by giving its path to the NameNode. For
each block of the file, the NameNode returns a list of DataNodes to host its replicas.
The client then pipelines data to the chosen DataNodes, which eventually confirm
the creation of the block replicas to the NameNode.
Creating periodic checkpoints is one way to protect the file system metadata. The
system can start from the most recent checkpoint if all other persistent copies of
the namespace image or journal are unavailable. Creating a checkpoint lets the
NameNode truncate the tail of the journal when the new checkpoint is uploaded to
the NameNode. HDFS clusters run for prolonged periods of time without restarts
during which the journal constantly grows. If the journal grows very large, the
probability of loss or corruption of the journal file increases. Also, a very large
journal extends the time required to restart the NameNode. For a large cluster, it
takes an hour to process a week-long journal. Good practice is to create a daily
checkpoint.
To create a checkpoint, the node downloads the current checkpoint and journal
files and merges them in memory. Then it writes the new checkpoint
and the empty journal to a new location, so that the old checkpoint and journal
remain unchanged. During handshake the NameNode instructs DataNodes whether
to create a local snapshot. The local snapshot on the DataNode cannot be created
by replicating the data file directories, as this would require doubling the storage
capacity of every DataNode on the cluster. Instead, each DataNode creates a copy of
the storage directory and hard links existing block files into it. When the DataNode
removes a block it removes only the hard link, and block modifications during
appends use the copy-on-write technique. Thus old block replicas remain untouched
in their old directories. The cluster administrator can choose to roll back HDFS to the
snapshot state when restarting the system. The NameNode recovers the checkpoint
saved when the snapshot was created. DataNodes restore the previously renamed
directories and initiate a background process to delete block replicas created after
the snapshot was made. Having chosen to roll back, there is no provision to roll
forward. The cluster administrator can recover the storage occupied by the
snapshot by commanding the system to abandon the snapshot, thus finalizing the
software upgrade. System evolution may lead to a change in the format of the
NameNode's checkpoint and journal files, or in the data representation of block
replica files on DataNodes. The layout version identifies the data representation
formats, and is persistently stored in the NameNode's and the DataNodes' storage
directories. During startup each node compares the layout version of the current
software with the version stored in its storage directories and automatically
converts data from older formats to the newer ones. The conversion requires the
mandatory creation of a snapshot when the system restarts with the new software
layout version. HDFS does not separate layout versions for the NameNode and
DataNodes because snapshot creation must be an all-cluster effort rather than a
node-selective event. If an upgraded NameNode purges its image due to a software
bug, then backing up only the namespace state still results in total data loss, as
the NameNode will not recognize the blocks reported by DataNodes, and will order
their deletion. Rolling back in this case will recover the metadata, but the data itself
will be lost. A coordinated snapshot is required to avoid a cataclysmic destruction.
A. File Read and Write
An application adds data to HDFS by creating a new file and
writing the data to it. After the file is closed, the bytes written cannot be altered or
removed except that new data can be added to the file by reopening the file for
append. HDFS implements a single-writer, multiple-reader model. The HDFS client
that opens a file for writing is granted a lease for the file; no other client can write
to the file. The writing client periodically renews the lease by sending a heartbeat to
the NameNode. When the file is closed, the lease is revoked.
The lease duration is bound by a soft limit and a hard limit. Until the soft limit
expires, the writer is certain of exclusive access to the file. If the soft limit expires
and the client fails to close the file or renew the lease, another client can preempt
the lease. If the hard limit (one hour) expires and the client has failed to renew
the lease, HDFS assumes that the client has quit and will automatically close the file
on behalf of the writer, and recover the lease. The writer's lease does not prevent
other clients from reading the file; a file may have many concurrent readers. An
HDFS file consists of blocks. When there is a need for a new block, the NameNode
allocates a block with a unique block ID and determines a list of DataNodes to host
replicas of the block. The DataNodes form a pipeline, the order of which minimizes
the total network distance from the client to the last DataNode. Bytes are pushed to
the pipeline as a sequence of packets. The bytes that an application writes first
buffer at the client side. After a packet buffer is filled (typically 64 KB), the data are
pushed to the pipeline. The next packet can be pushed to the pipeline before
receiving the acknowledgement for the previous packets. The number of
outstanding packets is limited by the outstanding packets window size of the client.
After data are written to an HDFS file, HDFS does not provide any guarantee that
data are visible to a new reader until the file is closed. If a user application needs
the visibility guarantee, it can explicitly call the hflush operation. Then the current
packet is immediately pushed to the pipeline, and the hflush operation will wait until
all DataNodes in the pipeline acknowledge the successful transmission of the
packet. All data written before the hflush operation are then certain to be visible to
readers.
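The client-side buffering just described (64 KB packets, a bounded window of unacknowledged packets, and hflush forcing the partial packet out) can be sketched as below; `PacketStream` and the window size of 5 are illustrative inventions, and the acknowledgement wait is simulated synchronously.

```python
PACKET_SIZE = 64 * 1024   # typical packet buffer size, per the text
WINDOW = 5                # max outstanding packets (illustrative value)

class PacketStream:
    """Illustrative client-side packet buffering for one HDFS block write."""

    def __init__(self):
        self.buffer = bytearray()
        self.outstanding = []   # packets sent but not yet acknowledged
        self.acked = 0

    def write(self, data):
        self.buffer += data
        while len(self.buffer) >= PACKET_SIZE:   # full packet: push it
            self._send(bytes(self.buffer[:PACKET_SIZE]))
            del self.buffer[:PACKET_SIZE]

    def _send(self, packet):
        while len(self.outstanding) >= WINDOW:   # window full: wait for acks
            self._await_ack()
        self.outstanding.append(packet)

    def _await_ack(self):
        # In reality this waits for the pipeline's acknowledgement.
        self.outstanding.pop(0)
        self.acked += 1

    def hflush(self):
        """Push the partial packet and wait until all packets are acked."""
        if self.buffer:
            self._send(bytes(self.buffer))
            self.buffer.clear()
        while self.outstanding:
            self._await_ack()

s = PacketStream()
s.write(b"a" * (PACKET_SIZE * 2 + 100))  # two full packets plus a partial one
s.hflush()
print(s.acked)                           # all three packets acknowledged
```

Only after hflush returns is the data guaranteed visible to readers, matching the visibility semantics described above.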
B. Block Placement
For a large cluster, it may not be practical to connect all nodes
in a flat topology. A common practice is to spread the nodes across multiple racks.
Nodes of a rack share a switch, and rack switches are connected by one or more
core switches. Communication between two nodes in different racks has to go
through multiple switches. In most cases, network bandwidth between nodes in the
same rack is greater than network bandwidth between nodes in different racks. Fig.
3 describes a cluster with two racks, each of which contains three nodes.
HDFS estimates the network bandwidth between two nodes by their distance. The
distance from a node to its parent node is assumed to be one. A distance between
two nodes can be calculated by summing up their distances to their closest
common ancestor. The shorter the distance between two nodes, the greater the
bandwidth they can utilize to transfer data. HDFS allows an administrator to
configure a script that returns a node's rack identification given a node's address.
The NameNode is the central place that resolves the rack location of each
DataNode. When a DataNode registers with the NameNode, the NameNode runs a
configured script to decide which rack the node belongs to. If no such script is
configured, the NameNode assumes that all the nodes belong to a default single
rack. The placement of replicas is critical to HDFS data reliability and read/write
performance. A good replica placement policy should improve data reliability,
availability, and network bandwidth utilization. Currently HDFS provides a
configurable block placement policy interface so that the users and researchers can
experiment and test any policy that is optimal for their applications. The default
HDFS block placement policy provides a tradeoff between minimizing the write cost,
and maximizing data reliability, availability and aggregate read bandwidth. When a
new block is created, HDFS places the first replica on the node where the writer is
located, the second and the third replicas on two different nodes in a different rack,
and the rest are placed on random nodes with restrictions that no more than one
replica is placed at one node and no more than two replicas are placed in the same
rack when the number of replicas is less than twice the number of racks. The choice
to place the second and third replicas on a different rack better distributes the block
replicas for a single file across the cluster. If the first two replicas were placed on the
same rack, for any file, two-thirds of its block replicas would be on the same rack.
After all target nodes are selected, nodes are organized as a pipeline in the order of
their proximity to the first replica. Data are pushed to nodes in this order. For
reading, the NameNode first checks if the client's host is located in the cluster. If
yes, block locations are returned to the client in the order of its closeness to the
reader. The block is read from DataNodes in this preference order. (It is usual for
MapReduce applications to run on cluster nodes, but as long as a host can connect
to the NameNode and DataNodes, it can execute the HDFS client.) This policy
reduces the inter-rack and inter-node write traffic and generally improves write
performance. Because the chance of a rack failure is far less than that of a node
failure, this policy does not impact data reliability and availability guarantees. In the
usual case of three replicas, it can reduce the aggregate network bandwidth used
when reading data since a block is placed in only two unique racks rather than
three. The default HDFS replica placement policy can be summarized as follows:
1. No DataNode contains more than one replica of any block.
2. No rack contains more than two replicas of the same block, provided there
are sufficient racks on the cluster.
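The distance rule (one hop to a parent, summed to the closest common ancestor) and the two placement constraints above can be sketched as follows, assuming a two-level topology of nodes under racks; the node and rack names are hypothetical.

```python
def distance(node_a, node_b, rack_of):
    """Network distance: 0 for the same node, 2 within a rack, 4 across
    racks (each hop to a parent counts one, summed at the common ancestor)."""
    if node_a == node_b:
        return 0
    return 2 if rack_of[node_a] == rack_of[node_b] else 4

def placement_ok(replica_nodes, rack_of):
    """Default-policy constraints: at most one replica per node and at most
    two replicas per rack (assuming sufficient racks)."""
    if len(set(replica_nodes)) != len(replica_nodes):
        return False                      # a node holds two replicas
    per_rack = {}
    for n in replica_nodes:
        per_rack[rack_of[n]] = per_rack.get(rack_of[n], 0) + 1
    return all(count <= 2 for count in per_rack.values())

rack_of = {"dn1": "r1", "dn2": "r1", "dn3": "r2", "dn4": "r2"}
print(distance("dn1", "dn2", rack_of))               # same rack
print(distance("dn1", "dn3", rack_of))               # across racks
print(placement_ok(["dn1", "dn3", "dn4"], rack_of))  # writer's rack + one other
```

The example placement (first replica on the writer's node, second and third in one other rack) satisfies both constraints, which is exactly the default policy's three-replica layout.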
E. Block Scanner
Each DataNode runs a block scanner that periodically scans its
block replicas and verifies that stored checksums match the block data. In each
scan period, the block scanner adjusts the read bandwidth in order to complete the
verification in a configurable period. If a client reads a complete block and
checksum verification succeeds, it informs the DataNode. The DataNode treats it as
a verification of the replica. The verification time of each block is stored in a human
readable log file. At any time there are up to two files in the top-level DataNode
directory, the current and prev logs. New verification times are appended to the
current file.
Correspondingly each DataNode has an in-memory scanning list ordered by the
replica's verification time. Whenever a read client or a block scanner detects a
corrupt block, it notifies the NameNode. The NameNode marks the replica as
corrupt, but does not schedule deletion of the replica immediately. Instead, it starts
to replicate a good copy of the block. Only when the good replica count reaches the
replication factor of the block is the corrupt replica scheduled for removal. This
policy aims to preserve data as long as possible. So even if all replicas of a block are
corrupt, the policy allows the user to retrieve its data from the corrupt replicas.
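This data-preservation policy (replicate first, delete the corrupt copy only once enough good replicas exist) can be sketched as below; the function and its arguments are hypothetical names, not HDFS APIs.

```python
def handle_corruption(replicas, corrupt, replication_factor, spare_nodes):
    """Illustrative NameNode reaction to a corrupt-replica report:
    schedule new good replicas first; drop corrupt ones only when the
    good-replica count reaches the replication factor."""
    good = [r for r in replicas if r not in corrupt]
    while len(good) < replication_factor and spare_nodes:
        good.append(spare_nodes.pop(0))      # schedule a fresh replica
    if len(good) >= replication_factor:
        return good                          # corrupt copies can now go
    # Not enough good copies: keep corrupt replicas as a last resort,
    # so users can still try to retrieve data from them.
    return good + [r for r in replicas if r in corrupt]

print(handle_corruption(["dn1", "dn2", "dn3"], {"dn2"}, 3, ["dn4"]))
print(handle_corruption(["dn1", "dn2"], {"dn1", "dn2"}, 3, []))
```

In the second call every replica is corrupt and no spare node exists, so the corrupt replicas are retained, mirroring the "preserve data as long as possible" policy.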
F. Decommissioning
The cluster administrator specifies which nodes can join the
cluster by listing the host addresses of nodes that are permitted to register and the
host addresses of nodes that are not permitted to register. The administrator can
command the system to re-evaluate these include and exclude lists. A present
member of the cluster that becomes excluded is marked for decommissioning. Once
a DataNode is marked as decommissioning, it will not be selected as the target of
replica placement, but it will continue to serve read requests. The NameNode starts
to schedule replication of its blocks to other DataNodes. Once the NameNode
detects that all blocks on the decommissioning DataNode are replicated, the node
enters the decommissioned state. Then it can be safely removed from the cluster
without jeopardizing any data availability.
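The decommissioning lifecycle above amounts to a small state machine, sketched here with invented state names; real HDFS tracks this per DataNode inside the NameNode.

```python
def decommission_step(state, fully_replicated):
    """One NameNode pass over a node marked for decommissioning:
    it stays 'decommissioning' (serving reads, never a write target)
    until all its blocks are replicated elsewhere."""
    if state == "decommissioning" and fully_replicated:
        return "decommissioned"   # now safe to remove from the cluster
    return state

def can_be_replica_target(state):
    """Only healthy in-service nodes receive new replicas."""
    return state == "in_service"

state = "decommissioning"
state = decommission_step(state, fully_replicated=False)
print(state)                                   # still decommissioning
state = decommission_step(state, fully_replicated=True)
print(state)                                   # decommissioned
print(can_be_replica_target("decommissioning"))  # excluded from placement
```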
G. Inter-Cluster Data Copy
When working with large datasets, copying data into and
out of an HDFS cluster is daunting. HDFS provides a tool called DistCp for large
inter/intra-cluster parallel copying. It is a MapReduce job; each of the map tasks
copies a portion of the source data into the destination file system. The MapReduce
framework automatically handles parallel task scheduling, error detection and
recovery.
Apache Hadoop is the most popular implementation of the MapReduce paradigm for
distributed computing, but its design does not adapt automatically to the computing
nodes' context and capabilities. By introducing context-awareness into Hadoop, we
intend to dynamically adapt its scheduling to the execution environment. This is a
necessary feature in the context of pervasive grids, which are heterogeneous,
dynamic and shared environments.
This is an especial concern when deploying Hadoop over pervasive grids. Pervasive
grids are an interesting alternative to costly dedicated clusters, as the acquisition
and maintenance of a dedicated cluster remain high and dissuasive for many
organizations. According to Parashar and Pierson [2], pervasive grids represent the
extreme generalization of the grid concept, in which the resources are pervasive.
Pervasive grids propose using resources embedded in pervasive environments in
order to perform computing tasks in a distributed way. Concretely, they can be seen
as computing grids formed by existing resources (desktop machines, spare servers,
etc.) that occasionally contribute to the computing grid power. These resources are
inherently heterogeneous and potentially mobile, coming in and out of the grid
dynamically. Since pervasive grids are, in essence, heterogeneous, dynamic, shared
and distributed environments, their efficient management becomes a very complex
task [3]. Task scheduling is thus severely affected by this complexity.
Many works have proposed to improve the adaptability of the Hadoop framework on
environments that diverge from the initial supposition, each having their own
proposals and objectives [4,5,6,7]. The PER-MARE project [8], in which this work was
developed, aims at adapting Hadoop to pervasive environments [9]. Indeed, Hadoop
is based on static configuration files, and the current versions do not adapt well to
resource variations over time. In addition, the installation procedures force the
administrator to
manually define the characteristics of each potential resource, such as the memory
and the number of cores of each machine, which is a hard task in a heterogeneous
environment. All these factors prevent deploying Hadoop on more volatile
environments. The PER-MARE project aims at the improvement of Hadoop so that it
could adapt itself to the execution context and therefore be deployed over
pervasive grids. In order to adapt Hadoop to a pervasive grid environment,
supporting context-awareness is essential. Context awareness is the capacity of an
application or software to detect and respond to environment changes [10]. A
context-aware system is able to adapt its operations to the current state without
human intervention, therefore improving the system's usability and efficiency [11].
In pervasive grids, scheduling is a task that may benefit from context awareness,
collecting data about the grid resources and making decisions based on
the data collected. This work focuses on our developments to introduce context-
awareness capabilities on Hadoop task scheduling mechanisms. Through a context
collection procedure and minimal changes on Hadoop's resource manager, we are
able to update the information about the availability of resources in each node of
the grid and then influence the scheduler's task assignments. The rest of the paper
is organized as follows: Section 2 presents Apache Hadoop architecture and
scheduling mechanisms. Section 3 discusses related work, focusing on context-
awareness and on other works that try to improve Hadoop schedulers. Section 4
presents our proposal of context-aware scheduling, while Section 5 presents the
experiments conducted and the achieved results. We finally conclude this paper in
Section 6.
2. About Hadoop Scheduling
The Apache Hadoop framework is organized in a master and slave architecture, with
two main services: storage (HDFS) and processing (YARN). Both services have their
own master and slave components, as presented on Fig. 1. It is possible to see the
NameNode and ResourceManager services, which are the masters of the HDFS and
YARN respectively, and their slave counterparts, the DataNode and NodeManager. It
is also possible to note the ApplicationMaster, the component responsible for
internal application (job) management, or simply task scheduling, while the
ResourceManager is the component responsible for job scheduling. Each node also
runs a set of
Containers, where the execution of Map and Reduce tasks takes place.
2.1. Hadoop Schedulers
Concerning job scheduling, Hadoop offers several options. The simplest scheduler,
called the Hadoop Internal Scheduler, processes all jobs in arrival order (FIFO). This
scheduler has a good performance in dedicated clusters where the competition for
resources is not a problem. Another scheduler available is the Fair Scheduler,
mainly used to compute batches of small jobs. It uses a two-level scheduling to
fairly distribute the resources [12].
The third scheduler available is the CapacityScheduler. The CapacityScheduler is
designed to run Hadoop MapReduce as a shared, multi-tenant cluster in an
operator-friendly manner, maximizing the throughput and the utilization of the
cluster while running MapReduce applications. The CapacityScheduler is designed to
allow sharing a large cluster while giving each organization a minimum capacity
guarantee. The central idea is that the available resources in the Hadoop
MapReduce cluster are partitioned among multiple organizations that collectively
fund the cluster based on computing needs. There is an added benefit that an
organization can access any excess capacity not being used by the other users.
This provides elasticity for the organizations in a cost-effective manner [12]. The
existence of these schedulers allows a flexible management of the framework.
Despite that, the available schedulers neither detect nor react to the dynamicity
and heterogeneity of the computing environment, a typical concern on pervasive
grids.
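The capacity-sharing idea just described (guaranteed minimum shares, plus elastic use of capacity left idle by others) can be sketched as follows; this is a much-simplified model with invented names, not the CapacityScheduler's actual queue logic.

```python
def allocate(total, guarantees, demands):
    """Give each organization min(demand, guarantee), then hand any spare
    capacity to organizations that still have unmet demand (elasticity)."""
    alloc = {org: min(demands[org], guarantees[org]) for org in guarantees}
    spare = total - sum(alloc.values())
    for org in guarantees:
        extra = min(demands[org] - alloc[org], spare)
        alloc[org] += extra
        spare -= extra
    return alloc

# Two organizations collectively funding a 100-slot cluster (illustrative).
guarantees = {"orgA": 40, "orgB": 60}
print(allocate(100, guarantees, {"orgA": 70, "orgB": 20}))
```

Here orgA's demand exceeds its guaranteed 40 slots, but because orgB only needs 20 of its 60, orgA elastically absorbs the excess, which is exactly the cost-effectiveness benefit the text describes.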
4. Context-Aware Scheduling
The main goal of this work is to improve the scheduling of Hadoop by adding
support to dynamic changes in the availability of resources, like those occurring in a
pervasive environment. Similar to the works in Section 3, we try to improve the
resource distribution, so that faster and more robust nodes have more data to
process. Differently from these works, we opted to modify the Hadoop code through
insertion of dynamic context information using, as far as possible, an existing
scheduler (Capacity Scheduler). In order to detect dynamic changes, the scheduler
must collect context information that, in this case, refers to available resources on
the nodes. Slaves must communicate periodically with the master in order to keep
information updated and let the scheduler adapt to the new context. In the
following section we present a more detailed explanation of the changes
implemented in Apache Hadoop.
By default, Hadoop reads information about the nodes from XML configuration files.
These files contain many Hadoop configuration parameters, including the resource
capacity of each node. Once loaded, the information will not be updated until the
next time the service is started. As pervasive environments may face performance
changes during the execution of an
application, we need a mechanism that updates contextual information during
runtime according to the environmental conditions.
To solve this problem, we integrate a collector module into Hadoop, allowing the
collection of contextual information about the available resources. The collector
was developed for the PER-MARE project [8], and its class diagram is presented in
Fig. 2. The collector module is based on the standard Java monitoring API [18],
which allows easy access to
the real characteristics of a node, with no additional libraries required. It allows
collecting different context information, such as the number of processors (cores)
and the system memory, using a set of interface and abstract/concrete classes that
generalize the collecting process. Due to its design, it is easy to integrate new
collectors and improve the resources available for the scheduling process, providing
data about the CPU load or disk usage, for example.
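The original collector relies on the Java monitoring API; as a rough stand-in, the same kind of node context can be gathered with the Python standard library. The field names here are illustrative, not the PER-MARE collector's actual interface.

```python
import os

def collect_context():
    """Collect simple node context: core count and, where the platform
    supports it, the 1-minute load average (None elsewhere)."""
    ctx = {"cores": os.cpu_count()}
    try:
        ctx["load_1min"] = os.getloadavg()[0]  # unavailable on some platforms
    except (OSError, AttributeError):
        ctx["load_1min"] = None
    return ctx

print(collect_context())
```

As with the collector's class hierarchy, adding further probes (disk usage, memory) would just mean extending the dictionary with more collected fields.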
4.2. Communication
Gathering the context information required to feed the Hadoop scheduler requires
transmitting this information through the network from the slave nodes
(NodeManager) to the master node (ResourceManager), which is responsible for the
scheduling. Instead of relying on a separate service, we chose to use the ZooKeeper
API [19], a tool initially developed inside Hadoop that became a full project as its
usage was extended to other applications. ZooKeeper provides efficient, reliable,
and fault-
tolerant tools for the coordination of distributed systems. In our case, we use
ZooKeeper services to distributed context information. As illustrated in Fig. 3, all
slaves (NodeManager) run an instance of the NodeStatusUpdater thread, that
collect data about the real resource availability of the node every 30 seconds. If the
amount of available resources changes, the DHT on ZooKeeper will be updated.
Similarly, the master (ResourceManager) also creates a thread to watch
ZooKeeper. If the ZooKeeper node detects a DHT change, the master will be notified
and update the scheduler information based on the new information. This solution
extends a previous one we proposed in [20] by offering a real-time observation of
available nodes. Indeed, our previous solution [20] only updated information
regarding the resources on
service initialization, replacing the XML configuration file, while this one updates
resource information whenever the availability changes. As a result, scheduling is
performed based on the current resource state.
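The publish-on-change pattern described above can be sketched without ZooKeeper itself; `FakeCoordinator` below is a hypothetical in-process stand-in for the shared znode, with a watcher callback playing the master's role.

```python
class FakeCoordinator:
    """Stand-in for the shared ZooKeeper store: keeps per-node context and
    notifies registered watchers only when a value actually changes."""

    def __init__(self):
        self.data = {}       # node id -> last published resources
        self.watchers = []   # callbacks registered by the master side

    def publish(self, node, resources):
        if self.data.get(node) != resources:   # update only on a real change
            self.data[node] = resources
            for watch in self.watchers:
                watch(node, resources)

updates = []
zk = FakeCoordinator()
# Master (ResourceManager) side: watch for context changes.
zk.watchers.append(lambda node, res: updates.append((node, res)))

# Slave (NodeStatusUpdater) side: periodic 30-second collections.
zk.publish("slave1", {"cores": 4, "mem_mb": 4096})  # first report: notify
zk.publish("slave1", {"cores": 4, "mem_mb": 4096})  # unchanged: no notify
zk.publish("slave1", {"cores": 4, "mem_mb": 2048})  # changed: notify master
print(updates)
```

Suppressing unchanged reports keeps coordination traffic proportional to actual resource variation, which is the point of updating the DHT only when availability changes.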
Figure: Context collector structure
Figure: Using ZooKeeper to distribute context information
HDFS, the base layer of the Hadoop architecture, contains different classifications
of data and is the most sensitive to security issues. It has no appropriate
role-based access control for addressing security problems. The risks of data
access, theft, and unwanted disclosure also arise when data are embedded in a
single Hadoop environment. The replicated data is likewise not secure and needs
more protection from breaches and vulnerabilities. Government sectors and
organisations mostly avoid using the Hadoop environment for storing valuable data
because of the weak security inside the Hadoop technology; instead, they provide
security outside the Hadoop environment, such as firewalls and intrusion detection
systems. Some authors have represented that HDFS in the Hadoop environment can
be protected against theft and vulnerabilities only by encrypting at the block level
and the individual file system. Other authors encrypted the blocks and nodes using
encryption techniques, but no complete algorithm is specified to maintain the
security in the Hadoop environment. In order to increase the security, some
approaches are mentioned below.
Figure 4: HDFS Architecture
The proposed work represents different Approaches for securing data in Hadoop
distributed file system. The first approach is based on Kerberos in HDFS.
a. Kerberos Mechanism
Kerberos [10] is a network authentication protocol which allows nodes to transfer
files over a non-secure channel, using a tool called a ticket to prove their identity
to each other. This Kerberos mechanism is used to enhance the security in HDFS. In
HDFS the connection between the client and the name node is achieved using
Remote Procedure Call [11], and the connection from the client (using HTTP) to the
data node is achieved using block transfer. A token or Kerberos is used to
authenticate an RPC connection; to obtain a token, the client makes use of a
Kerberos-authenticated connection. A Ticket Granting Ticket (TGT) or a Service
Ticket (ST) is used to authenticate against the name node using Kerberos. Both
TGT and ST can be renewed during long-running jobs; when the Kerberos
credentials are renewed, new TGTs and STs are issued and distributed to all tasks.
The Key Distribution Centre (KDC) issues the Kerberos Service Ticket using the TGT
after getting a request from a task, and network traffic to the KDC is avoided by
using tokens. In the name node, only the time period is extended but the ticket
remains constant. The major advantage is that even if the ticket is stolen by an
attacker, it cannot be renewed.
Another method can also be used to provide security for file access in HDFS. If a
client wants to access a block from a data node, it must first contact the name
node to identify which data nodes hold the blocks of the file, because only the
name node authorizes access according to the file permissions. The name node
issues a token called a Block Token, which the data node verifies. The data node
also issues a token called a Name Token, which allows the name node to enforce
permissions for correct access control on its data blocks. The Block Token allows
the data node to identify whether the client is authorized to access data blocks.
The Block Token and Name Token are sent back to the client together with the
respective data block locations, confirming that the client is authorized to access
those locations. These two methods increase security by preventing unauthorized
clients from reading and writing data blocks. Figure 5 shows the design view of the
Kerberos key distribution centre.
b. Bull Eye Approach
In big data, sensitive data such as credit card numbers, passwords, account
numbers and personal details are stored in a large technology called Hadoop. In
order to increase the security in the Hadoop base layer, a new approach called the
Bull Eye Approach is introduced for securing sensitive information. This approach is
introduced on the Hadoop module to view all sensitive information in 360 degrees,
to find whether all the secured information is stored without any risk, and to allow
the authorized person to preserve the personal information in the right way.
Recently this approach is used in companies like Dataguise's DGsecure [8] and
Amazon Elastic MapReduce [9]. The Dataguise company, which is famous for
providing data-centric security and governance solutions, is also involved in
providing security for Hadoop in the cloud, and has decided to maintain and provide
security for Hadoop wherever it is located in the cloud. Nowadays companies are
storing more sensitive data in the cloud because more breaches are taking place in
traditional on-premise data stores. To increase the security in the Hadoop base
layers, the Bull Eye Approach is also used in HDFS to provide security in 360
degrees from node to node. This approach is implemented in the data node of rack
1, where it checks that the sensitive data are stored properly in the blocks without
any risk and allows only the particular client to store data in the required blocks. It
also bridges the gap between data drawn from the original data node and the
replicated data node. When the client wants to retrieve any data from the
replicated data nodes, this is also managed by the Bull Eye Approach, which checks
whether there is a proper relation between the two racks. This algorithm allows the
data nodes to be more secure: only the authorized person can read or write them.
The algorithm can be implemented below the data node where the client reads or
writes the data to be stored in blocks. It is implemented not only in rack 1 but
similarly in rack 2, in order to increase the security of the blocks inside the data
nodes in 360 degrees. It checks for any attacks, breaches or theft of data taking
place in the blocks of the data node. Sometimes data are encrypted for protection
in the data node; these encrypted data are also protected using this algorithm in
order to maintain security. The algorithm travels from less than a terabyte to
multi-petabytes of semi-structured, structured and unstructured data stored in the
HDFS layer, in all angles. Mostly, encryption and wrapping of data occur at the
block levels of Hadoop rather than at the entire file level. This algorithm scans
before the data is allowed to enter into the blocks and also after it enters both
rack 1 and rack 2. Thus, this algorithm concentrates only on the sensitive data
stored in the data nodes. In our work, we mention this new algorithm to enhance
the security of the data nodes of HDFS.
c. Namenode Approach
In HDFS, if there is any problem with the name node and it becomes unavailable,
the group of system services and the data stored in HDFS become unavailable, so it
is not easy to access the data in a secure way in this critical situation. In order to
increase the security of data availability, two name nodes are used. These two
name node servers are allowed to run simultaneously in the same cluster. The two
redundant name nodes are provided by Name Node Security Enhance (NNSE),
which holds the Bull Eye Algorithm. It allows the Hadoop administrator to run the
two nodes such that one acts as master and the other acts as slave, in order to
reduce unnecessary or unexpected server crashes and to guard against natural
disasters. If the master name node crashes, the administrator needs to ask
permission from Name Node Security Enhance to provide data from the slave node,
in order to cover the time lag and data unavailability in a secure manner. Without
permission from NNSE, the admin never retrieves data from the slave node, which
avoids complex retrieval issues. If both name nodes act as masters, a continuous
risk occurs that reduces secure data availability and creates a performance
bottleneck over a local area network or wide area network. Thus, in future, we can
also increase security by using a vital configuration that ensures data is available
in a secured way to the client, by replicating many name nodes through Name
Node Security Enhance in HDFS blocks between many data centres and clusters.
Energy consumption and cooling are now large components of the operational cost
of datacenters and pose significant limitations in terms of scalability and reliability.
A growing segment of datacenter workloads is managed with MapReduce-style
frameworks, whether by privately managed instances of Yahoo!'s Hadoop, by
Amazon's Elastic MapReduce, or ubiquitously at Google by their archetypal
implementation. Therefore, it is important to understand the energy efficiency of
this emerging workload. The energy efficiency of a cluster can be improved in two
ways: by matching the number of active nodes to the current needs of the
workload, placing the remaining nodes in low-power standby modes; by engineering
the compute and storage features of each node to match its workload and avoid
energy waste on oversized components. Unfortunately, MapReduce frameworks
have many characteristics that complicate both options.
Hadoop has the global knowledge necessary to manage the transition of nodes to
and from low-power modes. Hence, Hadoop should be, or cooperate with, the
energy controller for a cluster. It is possible to recast the data layout and task
distribution of Hadoop to enable significant portions of a cluster to be powered
down while still fully operational.
Hadoop's file system (HDFS) spreads data across the disks of a cluster to take
advantage of the aggregate I/O, and improve the data-locality of computation.
While beneficial in terms of performance, this design principle complicates power-
management. With data distributed across all nodes, any node may be participating
in the reading, writing, or computation of a data-block at any time. This makes it
difficult to determine when it is safe to turn a node or component (e.g. disk) off.
Tangentially, Hadoop must also handle the case of node failures, which can be
frequent in large clusters. To address this problem, it implements a data-block
replication and placement strategy to mitigate the effect of certain classes of
common failures, namely:
single-node failures
whole-rack failures.
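The default placement strategy that mitigates these two failure classes can be sketched as follows. This is a hypothetical simplification for the default replication factor n = 3; the node and rack names are illustrative, and at least two racks with two or more nodes each are assumed:

```python
import random

def place_replicas(writer_node, nodes_by_rack):
    """Sketch of HDFS-style rack-aware placement for n = 3:
    first replica on the writer's node, second on a node in a
    different rack, third on another node in that same remote rack.
    Assumes at least two racks, each with at least two nodes."""
    # Map each node to its rack for lookup.
    rack_of = {node: rack
               for rack, members in nodes_by_rack.items()
               for node in members}
    replicas = [writer_node]
    # Second replica: any node on a rack other than the writer's.
    remote_rack = random.choice(
        [r for r in nodes_by_rack if r != rack_of[writer_node]])
    second = random.choice(nodes_by_rack[remote_rack])
    replicas.append(second)
    # Third replica: a different node on that same remote rack.
    third = random.choice(
        [m for m in nodes_by_rack[remote_rack] if m != second])
    replicas.append(third)
    # A single node failure leaves two copies; losing an entire rack
    # still leaves at least one copy on the other rack.
    return replicas
```

Placing two of the three replicas on one remote rack keeps a write's cross-rack traffic to a single transfer while still surviving a whole-rack failure.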
When data is stored in HDFS, the user specifies a block replication factor. A
replication factor of n instructs HDFS to ensure that n identical copies of any data-
block are stored across a cluster (by default n = 3). Whilst replicating blocks,
Hadoop maintains two invariants: no datanode holds more than one replica of any
block, and no rack holds more than two replicas of the same block (provided there
are sufficient racks).
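Under common HDFS placement policy, no datanode stores more than one replica of a given block and no rack stores more than two (given enough racks). A minimal check of these two conditions, with illustrative node and rack names, might look like:

```python
def satisfies_placement_invariants(replica_nodes, rack_of):
    """Check, for one block, that (1) no datanode holds more than one
    replica and (2) no rack holds more than two replicas."""
    if len(set(replica_nodes)) != len(replica_nodes):
        return False  # some node holds two replicas of the block
    # Count replicas per rack.
    per_rack = {}
    for node in replica_nodes:
        per_rack[rack_of[node]] = per_rack.get(rack_of[node], 0) + 1
    return all(count <= 2 for count in per_rack.values())

rack_of = {"n0": "r0", "n1": "r0", "n2": "r0", "n3": "r1"}
print(satisfies_placement_invariants(["n0", "n1", "n3"], rack_of))  # True
print(satisfies_placement_invariants(["n0", "n1", "n2"], rack_of))  # False: 3 on r0
```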
The fact that Hadoop maintains replicas of all data affords an opportunity to save
energy on inactive nodes. That is, there is an expectation that if an inactive node is
turned off, the data it stores will be found somewhere else on the cluster. However,
this is only true up to a point. Should the n nodes that hold the n replicas of a single
block be selected for deactivation, that piece of data is no longer available to the
cluster. In fact, we have found through examination of our own Hadoop cluster that
when configured as a single rack, removing any n nodes from the cluster (where n is
the replication factor) will render some data unavailable. Thus the largest number of
nodes we could disable without impacting data availability is n − 1, or merely two
nodes when n = 3. While Hadoop's autonomous re-replication feature can, over
time, allow additional nodes to be disabled, this comes with severe storage capacity
and resource penalties (i.e. significant amounts of data must be transferred over the
network and condensed onto the disks of the remaining nodes). Hadoop's rack-
aware replication strategy mitigates this effect only moderately; at best a single
rack can be disabled before data begins to become unavailable. To address this
shortfall in Hadoop's data-layout strategy, we propose a new invariant for use
during block replication: at least one replica of a data-block must be stored in a
subset of nodes we refer to as a covering subset. The premise behind a covering
subset is that it contains a sufficient set of nodes to ensure the immediate
availability of data, even were all nodes not in the covering subset to be disabled.
This invariant leaves the specific designation of the covering subset as a matter of
policy. The purpose in establishing a covering subset and utilizing this storage
invariant is so that large numbers of nodes can be gracefully removed from a
cluster (i.e. turned off) without affecting the availability of data or interrupting the
normal operation of the cluster; thus, it should be a minority portion of the cluster.
On the other hand, it cannot be too small, or else it would limit storage capacity or
even become an I/O bottleneck. As such, a covering subset would best be sized as a
moderate fraction (10% to 30%) of a whole cluster, to balance these concerns. Just
as replication factors can be specified by users on a file-by-file basis, covering
subsets should be established and specified for files by users (or cluster
administrators). In large clusters (thousands of nodes), this allows covering subsets
to be intelligently managed as current activity dictates, rather than as a
compromise between several potentially active users or applications. Thus, if a
particular user or application vacates a cluster for some period of time, the nodes of
its associated covering subset can be turned off without affecting the availability of
resident users and applications.
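The covering-subset invariant can be sketched as a placement rule plus an availability check. This is a simplified illustration, not the actual implementation; the node names and the roughly 30% covering-subset sizing are assumptions for the example:

```python
import random

def place_with_covering(nodes, covering, n=3):
    """Choose n replica nodes for a block such that at least one
    replica lands in the covering subset (the proposed invariant);
    the remaining replicas may land anywhere else."""
    first = random.choice(sorted(covering))  # covering-subset replica
    others = random.sample(sorted(set(nodes) - {first}), n - 1)
    return [first] + others

def all_blocks_available(placements, active_nodes):
    """A block is available if at least one of its replicas sits on
    an active (powered-on) node."""
    return all(any(node in active_nodes for node in reps)
               for reps in placements)

# With the covering invariant in force, powering down every node
# outside the covering subset never makes data unavailable:
nodes = [f"n{i}" for i in range(10)]
covering = {"n0", "n1", "n2"}  # ~30% of this small cluster
placements = [place_with_covering(nodes, covering) for _ in range(100)]
print(all_blocks_available(placements, covering))  # True
```

The check succeeds precisely because every block keeps one replica inside the covering subset; without that invariant, some randomly placed block would eventually have all n replicas among the deactivated nodes.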