
4. Basic Modular Decomposition

4.1 Hadoop Ecosystem

Hadoop has established its place in business organizations that require working with big data
sets that are sensitive and need efficient management. The Hadoop framework enables the
processing of large data sets that reside as blocks on clusters built from commodity hardware.
The Hadoop Ecosystem comprises several modules that are supported by a large
ecosystem of technologies. It is a platform, or framework, that addresses challenges related to
big data. In simpler terms, it can be understood as a package that bundles a variety of services
(storage, analysis, management, and maintenance). Figure 1.7 presents the core components of
the Hadoop Ecosystem. To give a better understanding of the Hadoop Ecosystem,
its main components are outlined below:

4.1.1 HDFS
Hadoop has emerged as the most popular cloud computing platform (Apache Hadoop, n.d.),
ensuring high availability by replicating data across a cluster of commodity computers.
It offers scalability and fault tolerance. One of the main components of the Hadoop Ecosystem
is the Hadoop Distributed File System (HDFS). As the name suggests, HDFS handles data storage,
whereas MapReduce handles data processing.

Built on the architectural foundations of the Google File System (GFS), HDFS is an open-source
distributed file system developed by the Apache Software Foundation. In HDFS, data is distributed,
replicated, and stored over multiple nodes to ensure durability against failures and high availability
for parallel applications. It is a cost-effective file system because it runs on commodity hardware.
The key concepts in HDFS are Blocks, the Data Node (DN), and the Name Node (NN). Figure 1
presents the core components of Hadoop.

• Blocks: The actual data in HDFS is stored on DNs in the form of independent units
called Blocks. The default block size is 64 MB in Hadoop 1.0 and 128 MB in
Hadoop 2.0, and it is configurable. Large files are divided into blocks and then stored on
DNs. HDFS promotes optimal use of available storage: if a file is smaller than the
block size, it does not occupy the space of a full block. To ensure high availability of
data, replicated copies of each block are stored across the Hadoop cluster. The default
replication factor is 3 and is also configurable (a short sketch of setting these values follows this list).
Figure 1: Core Components of Hadoop
• Name Node: HDFS works in master-slave mode, where the NN acts as the master
node and manages HDFS. It holds the metadata of all files stored on the
HDFS cluster, including file permissions, names, and the locations of all blocks.
Concurrent requests from multiple clients on the Hadoop cluster are handled by a single
machine, the NN. It performs file-system operations such as opening, renaming, and closing files.
• Data Node: DNs hold the actual data of HDFS. A cluster contains many DNs, typically
built from commodity hardware. They store and retrieve blocks when asked by the
NN and periodically update the NN with the list of blocks stored on them. A DN in the
Hadoop cluster has no information about the state of, or the data stored on, other
DNs and works according to the instructions received from the NN.
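
The following minimal sketch, written against the standard Hadoop Java client, illustrates how the block size and replication factor described in the list above can be set when writing a file and then read back from the NN's metadata. The NameNode URI, the file path, and the chosen property values are assumptions made purely for illustration.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch: writes a file to HDFS with an explicit block size and
 * replication factor, then reads those values back from the NameNode metadata.
 * Assumes a reachable cluster at hdfs://namenode:9000 (placeholder address).
 */
public class BlockConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode URI
        conf.set("dfs.blocksize", "134217728");           // 128 MB blocks (Hadoop 2.x default)
        conf.set("dfs.replication", "3");                 // default replication factor

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/sample.txt");

        // Files smaller than the block size occupy only their actual size on the DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // The NameNode holds this metadata; the DataNodes hold the block contents.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("block size  : " + status.getBlockSize());
        System.out.println("replication : " + status.getReplication());
    }
}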
4.1.2 MapReduce
MapReduce is a programming paradigm used over the Hadoop platform. A job running over
the cluster is divided into multiple Map-Reduce tasks. A typical MapReduce task is further
partitioned into two tasks, namely, the Map task and the Reduce task. The Map task consists of two
phases: Execute Map Function (M1) and Reorder Intermediate Result (M2). The Reduce task
consists of three phases: Copy Data (R1), Order Data (R2), and Merge and Result Generation
(R3). MapReduce is based on the divide-and-conquer technique. Figure 2
presents the steps followed in the MapReduce programming paradigm.
Figure 2: Steps of MapReduce Programming Paradigm
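
As an illustration of the Map and Reduce phases described above, the canonical word-count job is sketched below using the standard Hadoop MapReduce Java API; the input and output paths are assumed to be supplied as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after the shuffle and sort, sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}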
The main focus of this research is on improving the performance of the MapReduce model.
The emphasis is on reducing speculative task detection time to allow early
and efficient launching of backup tasks, and on optimizing job scheduling for MapReduce
frameworks in terms of throughput and resource utilization, thereby improving the QoS over
the Hadoop cluster.

4.2 Introduction to the Framework of the Scheduler

All MapReduce jobs consist of a definite number of Map and Reduce tasks. A job's execution
time depends on the number of resources assigned to it. Jobs keep
arriving in the system with different priorities and deadlines. Thus, optimal resource
assignment and constant monitoring of job progress become important: allocating resources to
a job that is expected to miss its deadline results in wasted resources. In addition, there can be
scenarios where the available resources cannot meet a job's demands; in such cases, the job
cannot be accommodated. Efficient handling and allocation of resources is therefore an
important issue that must be dealt with.

The framework of the scheduler is based on adaptive resource management to meet the
maximum number of higher-priority demands. Figure 3 presents the block diagram of the
framework for the scheduler based on deadlines and priorities. The proposed framework supports
heterogeneous environments and fits seamlessly into the Hadoop architecture. Its main
components are:

• Job Analyzer (JA): It will make use of machine learning techniques such as ANN or
K-means to estimate the resource requirements of a job based on its previous history.
Moreover, it will keep track of running jobs and identify the jobs that are
expected to miss their deadlines using existing speculative task detection
mechanisms. It will then supply the gathered information to the Adaptive Resource
Manager.
• Adaptive Resource Manager (ARM): It will keep a record of the currently available
and consumed resources of the system. Based on the information supplied by
the Job Analyzer, it will decide whether to assign resources to an arriving job, whether
to provide extra resources to a job that is expected to miss its deadline, or whether to
withdraw resources currently allocated to a running job (a hypothetical sketch of this
decision follows Figure 3). If an arriving job cannot be allocated sufficient resources,
it will be sent back to the Job Analyzer.

Figure 3: Framework for Scheduler based on Deadlines and Priorities
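
The following hypothetical sketch illustrates the kind of admission decision the Adaptive Resource Manager could make for an arriving job. All class, field, and method names here are invented for illustration; they are not Hadoop APIs or a finished implementation of the proposed framework.

/**
 * Hypothetical sketch of the Adaptive Resource Manager's admission decision.
 * All types and fields are illustrative; they are not Hadoop APIs.
 */
public class AdaptiveResourceManager {

    /** Resource estimate produced by the Job Analyzer (e.g., via ANN or K-means). */
    public static class JobProfile {
        long estimatedRuntimeMs;   // predicted runtime with the requested containers
        int requestedContainers;   // containers needed to meet the deadline
        long deadlineMs;           // absolute deadline (epoch millis)
        int priority;              // higher value = higher priority
    }

    public enum Decision { ADMIT, RETURN_TO_ANALYZER, REJECT_WILL_MISS_DEADLINE }

    private int availableContainers;

    public AdaptiveResourceManager(int availableContainers) {
        this.availableContainers = availableContainers;
    }

    /** Decide whether an arriving job can be admitted with its current estimate. */
    public Decision onJobArrival(JobProfile job, long nowMs) {
        // Do not waste resources on a job that is already expected to miss its deadline.
        if (nowMs + job.estimatedRuntimeMs > job.deadlineMs) {
            return Decision.REJECT_WILL_MISS_DEADLINE;
        }
        // Not enough free resources: send the job back to the Job Analyzer,
        // which may re-estimate its requirements or re-queue it.
        if (job.requestedContainers > availableContainers) {
            return Decision.RETURN_TO_ANALYZER;
        }
        availableContainers -= job.requestedContainers;
        return Decision.ADMIT;
    }
}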


There will be multiple communications between the Job Analyzer and the Adaptive Resource
Manager, which will help increase the number of successfully processed higher-priority
jobs and improve resource utilization. Our proposed framework supports the concept of multiple
priority queues, each with an assured minimum share, to avoid starvation of lower-priority jobs.
The Adaptive Resource Manager will employ a suitable mechanism so that resources that are
not being consumed are temporarily transferred to other queues for efficient resource utilization.
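
A possible shape of the multiple-priority-queue mechanism with assured minimum shares is sketched below. The types and the lending rule are hypothetical simplifications, intended only to make the idea of temporarily transferring unconsumed shares concrete.

import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of multiple priority queues with assured minimum shares.
 * Idle capacity of one queue is lent temporarily to others; all names are illustrative.
 */
public class PriorityQueuePool {

    public static class ShareQueue {
        final String name;
        final int minimumShare;   // containers guaranteed to this queue
        int demand;               // containers its pending jobs currently need

        ShareQueue(String name, int minimumShare) {
            this.name = name;
            this.minimumShare = minimumShare;
        }

        /** Capacity not needed right now and therefore available for lending. */
        int idleShare() {
            return Math.max(0, minimumShare - demand);
        }
    }

    private final List<ShareQueue> queues = new ArrayList<>();

    public void addQueue(ShareQueue q) {
        queues.add(q);
    }

    /** Total spare capacity that can be borrowed by an overloaded queue. */
    public int lendableContainers() {
        return queues.stream().mapToInt(ShareQueue::idleShare).sum();
    }

    /** Containers an overloaded queue may use: its own share plus borrowed spare capacity. */
    public int effectiveCapacity(ShareQueue q) {
        int borrowed = lendableContainers() - q.idleShare(); // a queue cannot borrow from itself
        return q.minimumShare + Math.max(0, borrowed);
    }
}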

For running jobs, the Job Analyzer will keep constant track of the progress of each job,
task by task, using a speculative task detection mechanism such as the LATE
scheduler. Based on the information obtained on the speculative tasks of a job, the Job Analyzer
will re-analyze the job and supply the updated resource requirement to the Adaptive
Resource Manager. The Adaptive Resource Manager will then analyze the cluster
for the extra resources that the jobs with speculative tasks require and will try to
accommodate them. It will then take the important decision of whether
extra resources should be allocated or previously allocated resources should be
withdrawn, as in the situation of newly arriving jobs. Figure 4 presents the framework for
speculative task detection by the scheduler based on deadlines and priorities; a simplified
sketch of a LATE-style detection heuristic follows the figure.

Figure 4: Framework for Speculative Task Detection by Scheduler based on Deadlines and Priorities
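
The core of the LATE heuristic estimates each task's progress rate and, from it, the time the task still needs; slow tasks with the longest estimated time left become candidates for launching a backup (speculative) copy. The simplified sketch below illustrates this idea; TaskStats and the threshold parameter are illustrative, and the full LATE scheduler additionally caps the number of speculative tasks and prefers launching backups on fast nodes.

import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/**
 * Simplified sketch of the LATE-style heuristic the Job Analyzer could use to
 * flag straggler tasks. TaskStats is illustrative, not a Hadoop type.
 */
public class SpeculativeTaskDetector {

    public static class TaskStats {
        final String taskId;
        final double progressScore;   // 0.0 .. 1.0, as reported by the task
        final long elapsedMs;         // time since the task started

        public TaskStats(String taskId, double progressScore, long elapsedMs) {
            this.taskId = taskId;
            this.progressScore = progressScore;
            this.elapsedMs = elapsedMs;
        }

        /** LATE: progress rate = progress score / elapsed time. */
        double progressRate() {
            return elapsedMs == 0 ? 0.0 : progressScore / elapsedMs;
        }

        /** LATE: estimated time left = (1 - progress score) / progress rate. */
        double estimatedTimeLeftMs() {
            double rate = progressRate();
            return rate == 0.0 ? Double.MAX_VALUE : (1.0 - progressScore) / rate;
        }
    }

    /**
     * Among tasks whose progress rate falls below a slow-task threshold,
     * pick the one with the longest estimated time to finish as the
     * candidate for launching a backup (speculative) copy.
     */
    public Optional<TaskStats> pickSpeculativeCandidate(List<TaskStats> tasks,
                                                        double slowTaskRateThreshold) {
        return tasks.stream()
                .filter(t -> t.progressRate() < slowTaskRateThreshold)
                .max(Comparator.comparingDouble(TaskStats::estimatedTimeLeftMs));
    }
}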
