Ibrahim Abaker Targio Hashem, et al. [full author details at the end of the article]
Abstract
Recent trends in big data have shown that the amount of data continues to increase at
an exponential rate. This trend has inspired many researchers over the past few years
to explore new research directions related to multiple areas of big data. The
widespread popularity of big data processing platforms that use the MapReduce
framework has created a growing demand to further optimize their performance for
various purposes. In particular, enhancing resource and job scheduling is becoming
critical, since scheduling fundamentally determines whether applications can achieve
their performance goals in different use cases. Scheduling plays an important role in
big data, mainly in reducing the execution time and cost of processing. This paper
aims to survey the research undertaken in the field of scheduling in big data
platforms. Moreover, it analyzes scheduling in MapReduce from two aspects: taxonomy
and performance evaluation. The research progress in MapReduce scheduling algorithms
is also discussed. The limitations of existing MapReduce scheduling algorithms and
future research opportunities are pointed out for easy identification by researchers.
Our study can serve as a benchmark for expert researchers proposing novel MapReduce
scheduling algorithms, while for novice researchers it can be used as a starting
point.
1 Introduction
Recent trends in big data have shown that the amount of data continues to increase
at an exponential rate [1]. This trend has inspired many researchers over the past few
years to explore new research directions related to multiple areas of big data
[2–4]. Meanwhile, the widespread popularity of big data processing platforms that use
the MapReduce framework has created a growing demand to further optimize their
performance for various purposes [5]. In particular, enhancing resource and job
scheduling is becoming critical, since scheduling fundamentally determines whether
applications can achieve their performance goals in different use cases [6]. Scheduling
plays an important role in big data, mainly in reducing the execution time and cost of
processing [7, 8]. Moreover, scheduling is by no means a new problem; it has been
widely studied in the distributed computing literature. The main process that the
MapReduce platform depends on is task scheduling [9]. In MapReduce, the scheduler
grants tasks access to resources (e.g., processing time, CPU, communication bandwidth)
in order to execute them and achieve optimal quality of service [10].
Moreover, solving the scheduling problem may require making discrete choices
in order to obtain a desirable solution among different alternatives [7]. For example,
Althebyan et al. [9] proposed a new scheduling algorithm to overcome the challenge
in task scheduling. The algorithm is based on multi-threading principles and considers
time and energy consumption factors. Likewise, Jayasena et al. [11] proposed an
implementation of multimedia big data analytics, including data distribution.
The architecture of the proposed multimedia system contains three different layers:
service, platform, and infrastructure. MapReduce running on the Hadoop distributed
file system, together with the Xuggler media processing library, is used to design and
implement the platform layer of the system; this reduces the transcoding time of
large-scale data into certain formats. In addition, an ant colony optimization (ACO)
algorithm is used to efficiently allocate resources in the infrastructure layer.
The results indicated that ACO can allocate virtual machines with minimal response
time compared to particle swarm optimization and artificial bee colony.
MapReduce scheduling is an emerging technology and has been explored from
various angles by many researchers over the past few years. More specifically,
optimization, the building of new algorithms, and frameworks for heterogeneous
environments have been the major focus of big data research [12]. Besides, Hadoop
clusters have also gained popularity by storing and analyzing large amounts of
unstructured data in distributed cluster environments. However, processing massive
amounts of data in a distributed cluster requires effective scheduling mechanisms.
As such, this work aims to provide a comprehensive performance study of MapReduce
scheduling in large-scale distributed environments, where jobs/tasks are executed in a
distributed fashion but under resource management control. Our focus is motivated,
first, by the current availability of resource scheduling frameworks and, second, by
the proliferation of different MapReduce scheduling algorithms. Moreover, this paper
aims to serve as a useful guide to research progress in MapReduce scheduling, as
well as a point of reference for future work on improving MapReduce scheduling
algorithms or introducing novel scheduling algorithms and frameworks for big data
analytics.
In order to achieve the above aims, we first offer a comparison of the Hadoop, Mesos,
and Corona frameworks. Then, we provide a comprehensive study of the existing
scheduling systems and classify them into different categories based on their
implementation requirements. The categories we identify for MapReduce scheduling
are as follows: scheduling strategy, resource, requirement, optimization approach,
and speculative execution.
MapReduce scheduling algorithms: a review
2 Background
The need for resource scheduling is most evident in big data storage and complex
processing engines. Growing demand over the past decade has led many scheduling
frameworks to emerge and become popular, including Yarn, Mesos, and Corona.
These frameworks are used for parallel data-intensive applications such as Hadoop,
Spark, and Kafka. Table 1 shows a comparison of the MapReduce default, Yarn,
Mesos, and Corona scheduling frameworks.
2.1.1 Yarn
Hadoop Yarn is a framework that provides a management solution for big data
in distributed environments. The main idea is to separate resource management
and job scheduling from data processing, which allows Hadoop to support various
big data computing paradigms such as interactive analysis and stream processing
[22]. Moreover, Yarn gives the Hadoop framework great flexibility in terms of job
completion and offers effective management and monitoring of workloads. Such
features can assist in supporting the maintenance of multi-tenant environments,
cluster utilization, high scalability, and the implementation of security controls for
the Hadoop framework. Yarn consists of two main components, the ResourceManager
and the ApplicationMaster, as shown in Fig. 1. The ResourceManager acts as the
master that manages the available resources in the cluster, whereas the
ApplicationMaster works with the NodeManagers to start containers and to execute
and monitor tasks.
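To make this division of labour concrete, the interaction between the two components can be sketched as a toy model (a hedged illustration only: the class and method names below are our own simplifications, not Yarn's actual Java API):

```python
# Toy model of Yarn-style resource negotiation: an ApplicationMaster asks the
# ResourceManager for containers, which are granted from per-node free capacity.
# All names here are illustrative; they do not mirror Yarn's real interfaces.

class ResourceManager:
    def __init__(self, node_capacity):
        # node_capacity: dict mapping node name -> free container slots
        self.free = dict(node_capacity)

    def allocate(self, n_containers):
        """Grant up to n_containers, draining nodes in name order."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < n_containers:
                self.free[node] -= 1
                granted.append(node)
        return granted

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm

    def run_job(self, n_tasks):
        # One container per task; the AM, not the RM, tracks its own tasks.
        containers = self.rm.allocate(n_tasks)
        return {"launched": len(containers), "nodes": containers}

rm = ResourceManager({"node1": 2, "node2": 1})
am = ApplicationMaster(rm)
result = am.run_job(3)
```

The point of the sketch is the separation of concerns: the ResourceManager only hands out capacity, while per-application logic lives in the ApplicationMaster.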
Fig. 1 Yarn architecture, showing the Client, the ResourceManager with its Scheduler, the ApplicationMaster, and the NodeManagers
2.1.3 Corona
Mesos architecture (figure): a Mesos Master coordinating multiple Mesos Agents
cluster and report their available resources. For each job, a dedicated job tracker is
initialized, either in-process for a small job or as a separate process for a large job.
2.1.4 Summary
This section focused on scheduling frameworks used by big data platforms such as
MapReduce. The scheduling algorithms themselves can be classified into the categories
discussed in the following section.
There have been numerous studies that attempt to solve task scheduling problems
in MapReduce with the aim of using the framework efficiently. MapReduce
scheduling algorithms can be classified into two strategies for managing workload,
according to the way they schedule tasks: (1) adaptive algorithms, which consider
data, physical resources, and workload when making scheduling decisions [14], and
(2) non-adaptive algorithms, where each task is assigned a fixed amount of resources
at runtime, as shown in Fig. 4. Homogeneous virtual machines (VMs) in cloud data
processing platforms can clearly affect the time needed to complete tasks, influencing
the overall performance. The algorithm proposed in [26] was developed to allocate
VMs with minimal data transfer delay and faster completion time, based on VM
location and on triangular equality and inequality; a MapReduce job assigned to
adjacently placed VMs will execute faster. Pulgar-Rubio et al. [27] proposed a novel
algorithm based on evolutionary fuzzy systems for subgroup discovery in big data.
The proposed algorithm was implemented in Apache Spark on top of MapReduce.
To evaluate its effectiveness and efficiency, experiments were conducted with a
high-dimensional dataset. The results show that the algorithm significantly reduces
runtime while maintaining the values of standard quality measures for the subgroups.
Adaptive scheduling is the process of prioritizing and coordinating the time
necessary for tasks to be executed in order to meet their desired objectives [28].
For example, an adaptive scheduler would be one that takes many parameters into
account to allocate a task to a node, based on the availability, capacity, and load of
the nodes in the cluster at runtime [29].
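Such a runtime decision can be illustrated with a small sketch that scores nodes on availability, capacity, and load (the scoring rule is our own simplification, not a published algorithm):

```python
# Adaptive task-placement sketch: pick the available node with the most spare
# capacity at the moment of scheduling. The scoring rule is illustrative only.

def pick_node(nodes):
    """nodes: list of dicts with 'name', 'available', 'capacity', 'load'."""
    candidates = [n for n in nodes if n["available"] and n["load"] < n["capacity"]]
    if not candidates:
        return None  # a non-adaptive scheduler would simply queue the task
    # Highest spare capacity wins; ties broken by name for determinism.
    return max(candidates, key=lambda n: (n["capacity"] - n["load"], n["name"]))["name"]

cluster = [
    {"name": "n1", "available": True,  "capacity": 8,  "load": 7},
    {"name": "n2", "available": True,  "capacity": 4,  "load": 1},
    {"name": "n3", "available": False, "capacity": 16, "load": 0},
]
```

Here n2 wins despite its smaller capacity, because the decision reflects the cluster state at runtime rather than a fixed assignment.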
Fig. 4 Taxonomy of scheduling in MapReduce: scheduling strategy (adaptive, including the fair scheduler, delay scheduling, and others, versus non-adaptive), resources (heterogeneity, sharing), requirements (data locality at the node, rack, and job levels; SLA-based), optimization approach (single objective, multiobjective), and source/job structure
job cannot launch local tasks. Such a solution is suitable when the jobs to be
scheduled are few and a task cannot be scheduled locally; briefly delaying its
scheduling can significantly improve data locality. The results show that delay
scheduling achieves nearly optimal data locality and, despite its simplicity,
outperforms fair sharing under a wide variety of scheduling policies. Also, Tan et
al. [33] suggested an analytically tractable model of the job processing delay
distribution in MapReduce under different schedulers, such as the default FIFO
scheduler and the popular Fair Scheduler.
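The delay-scheduling policy discussed above can be sketched as follows (a simplified model; the job representation and the max_delay parameter are our own assumptions):

```python
# Delay-scheduling sketch: a job waits up to max_delay scheduling misses for a
# node holding its data before accepting a non-local slot.

def delay_schedule(jobs, free_node, max_delay):
    """jobs: ordered list of dicts with 'name', 'local_nodes' (set), 'skipped'.
    Returns (job name, is_local) for the job assigned to the free node."""
    for job in jobs:
        if free_node in job["local_nodes"]:
            job["skipped"] = 0
            return job["name"], True       # local launch: the best case
        if job["skipped"] >= max_delay:
            job["skipped"] = 0
            return job["name"], False      # waited long enough: run non-locally
        job["skipped"] += 1                # skip this job for now
    return None, False

jobs = [
    {"name": "A", "local_nodes": {"n2"}, "skipped": 0},
    {"name": "B", "local_nodes": {"n1"}, "skipped": 0},
]
pick1 = delay_schedule(jobs, "n1", max_delay=1)  # B runs locally; A is skipped once
pick2 = delay_schedule(jobs, "n1", max_delay=1)  # A has waited enough; runs non-locally
```

The first call skips A (its data is elsewhere) and launches B locally; on the second call A has exhausted its delay and accepts the non-local slot.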
The non-adaptive scheduler is one in which the basic control mechanism is not
modified on the basis of system activity. FIFO [34] is the default Hadoop scheduler
and the most popular non-adaptive scheduling algorithm in Hadoop MapReduce.
Possibly the most straightforward approach to scheduling tasks is to maintain a FIFO
run queue, based on policies or on the solution of some optimization problem.
Casas et al. [35] proposed a genetic algorithm to solve the challenges of scheduling
scientific workflows in cloud computing environments. This is because schedulers
that manage both compute- and data-intensive applications are scarce. The genetic
algorithm's crossover and mutation operators were modified to allow the addition
or removal of a virtual machine from a given chromosome. Experiments indicated
that the proposed genetic-algorithm scheduler outperformed state-of-the-art
schedulers in terms of makespan and monetary cost.
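The modified mutation operator, which adds or removes a virtual machine from a chromosome, can be illustrated as below (a sketch under our own encoding assumption that a chromosome is simply a list of VM types, not Casas et al.'s actual operator):

```python
# Sketch of a GA mutation operator that grows or shrinks a VM allocation, in
# the spirit of workflow-scheduling GAs. Encoding and rates are assumptions.
import random

def mutate(chromosome, vm_types, rng):
    """Either append a random VM type or drop a random VM (never below one)."""
    child = list(chromosome)          # never mutate the parent in place
    if rng.random() < 0.5 or len(child) == 1:
        child.append(rng.choice(vm_types))    # add a VM to the schedule
    else:
        child.pop(rng.randrange(len(child)))  # remove a VM from the schedule
    return child

rng = random.Random(0)
parent = ["small", "large"]
child = mutate(parent, ["small", "medium", "large"], rng)
```

A fitness function over makespan and monetary cost would then decide whether such a child survives selection.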
3.1.3 Summary
3.2 Resources
3.2.1 Heterogeneity
Table 2 Summary of MapReduce scheduling algorithms
[21] FCFS. Objective: minimize the completion time. Mechanism: runs a queue; each task's progress is compared with the average progress. Limitations: does not handle multiple jobs at the same time; low performance when running multiple jobs; poor response times for short jobs compared to large jobs.
Fair scheduling. Objective: share resources equally among the assigned jobs. Mechanism: fair sharing. Limitation: an inefficient resource can be selected for a job.
(entry truncated) Predictability and consequently fairness.
[38] MCP. Objective: solve the problem of speculative execution. Mechanism: based on bandwidth and progress rate. Limitation: poor performance due to the static manner of computing task progress.
[39] SHadoop. Objective: improve job completion performance. Mechanism: optimizes the setup and cleanup tasks; messaging communication mechanism. Limitations: the Hadoop cluster can only be statically configured; it does not work well in heterogeneous environments.
[40] MRA++. Objective: provide a set of algorithms (grouping, data distribution, task scheduling) for heavy-workload applications. Limitation: running multiple jobs slows processing performance when dealing with large amounts of data.
[41] Maestro. Objective: improve the overall performance of MapReduce computation. Mechanism: replica-aware execution of map tasks. Limitation: the Maestro algorithm can only be statically configured.
[42] ARIA. Objective: provide a scheduler mechanism for job completion deadlines. Mechanism: uses job profiles and a soft deadline. Limitation: does not consider node failures.
[43] Flex. Objective: optimize any of a variety of standard scheduling-theory metrics. Mechanism: flexible scheduling allocation scheme. Limitation: the schemes are very metric dependent.
Homogeneous clusters consist of nodes that have similar resources, such as CPU,
memory, storage, and networking capabilities, whereas heterogeneous clusters
consist of nodes that differ in CPU, memory, storage, or communication speeds
[45, 36]. Hadoop assumes that resources are homogeneous and that data locality is
often the only scheduling constraint [46, 47]. However, the popularity of virtual
machines in many data centers has brought heterogeneity into Hadoop clusters. As a
result, a virtualized environment like cloud computing can contain more than one
cluster with various characteristics [47, 48]. Many scheduling algorithms are not
aware of the varying characteristics of the virtual machines, which may affect the
selection of data locality in the Hadoop cluster. Krish et al. [47] adopted a novel
method of tackling this issue by studying different applications running on various
hardware configurations and then incorporating this information into a
hardware-aware scheduler in order to improve the resource-application match.
Moreover, Ahmad et al. [46] introduced an optimization approach named Tarazu to
improve the performance of MapReduce. The approach is based on
communication-aware scheduling, communication-aware load balancing of map
computation, and predictive load balancing.
An adaptive task tuning (Ant) approach to improve the performance of MapReduce
clusters in heterogeneous environments was proposed by Cheng et al. [49]. The
proposed technique handles practical issues and challenges related to the automatic
configuration of large-scale MapReduce workloads in heterogeneous environments.
In Ant, jobs are tuned with different settings to match the capacities of the available
heterogeneous nodes, providing suitable settings for each job. In addition, it can be
used across several clusters because of its flexible and adaptive nature.
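The spirit of such per-node tuning can be caricatured as deriving task settings from each node's capacity instead of using one cluster-wide configuration (the sizing rule below is purely illustrative, not Cheng et al.'s actual tuner):

```python
# Heterogeneity-aware tuning sketch: derive per-node task settings from the
# node's memory and cores rather than one cluster-wide configuration. The rule
# of thumb here (one slot per 2 GB, 80% of RAM split across slots) is made up
# for illustration only.

def tune_for_node(mem_gb, cores):
    slots = max(1, min(cores, mem_gb // 2))      # capacity-scaled parallelism
    heap_mb = int(mem_gb * 1024 * 0.8 / slots)   # split 80% of RAM across slots
    return {"map_slots": slots, "task_heap_mb": heap_mb}

small = tune_for_node(mem_gb=4, cores=2)    # weak node: fewer parallel tasks
big = tune_for_node(mem_gb=32, cores=16)    # strong node: more parallel tasks
```

A weak node ends up with two slots and a strong node with sixteen, so both stay busy without thrashing, which is the effect Ant aims for with its self-tuned configurations.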
3.2.2 Sharing
a solution can offer a learning mechanism in which tasks are classified based on
their resources and jobs assigned as appropriate. Furthermore, Guo et al. [57] offered
a new resource-aware method, named resource stealing, which allows tasks running
in the cluster to use some resources from other idle slots until new tasks are assigned
to those slots. Such a method helps to prevent wasting cluster resources. The results
show that resource stealing can enhance the performance of compute-intensive and
network-intensive applications.
3.2.4 Summary
Based on the discussion above of the latest research progress, we summarize the
resource category of MapReduce scheduling into three subcategories, heterogeneity,
sharing, and resource awareness, in Table 3. Different mechanisms are used to
develop the scheduling algorithms; the most common is the queue-based mechanism,
as in FIFO, the Fair scheduler, and Maestro. With regard to heterogeneity, most
MapReduce scheduling algorithms are implemented to work in homogeneous
environments, except for the few that support heterogeneous environments and allow
resources to be shared among the nodes. However, these algorithms can only deal
with resources statically.
3.3 Requirement
In this subsection, we discuss resources and their effect in determining scheduling
requirements such as data locality and SLAs. The scheduling level is also an important
requirement, as it determines the granularity, or level of detail, considered when
making a scheduling decision. We discuss three levels of scheduling decisions: job,
task, and speculative.
Data locality is a major concern of the MapReduce framework during the assignment
of tasks for data processing in data-parallel systems. Data locality means assigning
tasks locally or close to the data, and it has several levels, such as the node and rack
levels. Hadoop MapReduce determines whether a node or rack is scheduled based on
the local availability of the data. It assumes that nodes within the same rack have
higher bandwidth between them than nodes that do not reside in the same rack.
Knowing this, the scheduler can simply increase data locality for tasks. One of the
main objectives of exploiting data locality is to overcome the problem of network
traffic in data-intensive computing [59]. Some scheduling policies in Hadoop consider
the effect of data locality, which can be classified into cluster and rack levels [60, 61].
For instance, to assign map tasks to a node, the Hadoop default FIFO scheduler
chooses a job from the queue and schedules its local map tasks; when the job does not
have any local map task, it is assigned a non-local map task in the cluster [62].
Furthermore, Abad et al. [63] proposed a new algorithm based on distributed adaptive
data replication that helps the scheduler achieve better data locality. The advantages
of this approach are that it allows many replicas to be allocated for each file and that
it makes use of probabilistic sampling.
sampling. Zhang et al. [64] emphasized on data locality problem of MapReduce by
introducing a next-k-node scheduling method, which has implemented in Hadoop
framework. Also, Jin et al. [65] suggested that initial task allocation is produced first
before the job completion time can be reduced gradually. The author introduced a
heuristic task scheduling algorithm called BAlance-Reduce (BAR), which adjust data
locality dynamically according to network state and cluster workload by tuning the
initial task allocation using a global view. The experimental result of the algorithm
shows that the BAR is able to outperform previously related algorithms in terms of the
job completion time and deal with large problem instances in a few seconds. Wang
et al. [66] demonstrated a resilient proof through practical experimentations to illustrate
that the conventional concept of data locality is generally not constantly suitable for
MapReduce in a virtual environment. This shows that there is to differentiate between
the physical local nodes and the virtual nodes. To solve this issue, a task scheduling
algorithm mainly for the purpose of locating VMs and provide priorities between
them was developed by the authors, which lead them to the development of vLocality.
The vLocality is a robust solution for data locality in virtual environments and it is
implemented in Hadoop 1.2.1.
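Hadoop's locality preference, node-local first, then rack-local, then off-rack, can be sketched as follows (the rack map and data layout are illustrative assumptions):

```python
# Locality-level sketch: prefer a task whose input block lives on the free
# node, then one whose block is in the same rack, then any task at all.
# The rack map and replica placement here are illustrative.

RACK = {"n1": "r1", "n2": "r1", "n3": "r2"}

def locality_of(task, node):
    if node in task["replica_nodes"]:
        return "node-local"
    if RACK[node] in {RACK[r] for r in task["replica_nodes"]}:
        return "rack-local"
    return "off-rack"

def pick_task(tasks, node):
    order = {"node-local": 0, "rack-local": 1, "off-rack": 2}
    return min(tasks, key=lambda t: order[locality_of(t, node)])["id"]

tasks = [
    {"id": "t1", "replica_nodes": {"n3"}},   # off-rack from n1
    {"id": "t2", "replica_nodes": {"n2"}},   # same rack as n1
]
```

With a free slot on n1, t2 is chosen because its replica sits in the same rack, reflecting the bandwidth assumption described above.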
Scheduling Hadoop tasks in virtual machines in the cloud demands cloud resources;
typically, users are aware of the deadline by which a job must be completed. However,
in a cloud computing environment, all machines compete for resources to execute
jobs. These resources are controlled by batch queue systems, which may not guarantee
deadlines during task execution unless priority is used for resource reservation, which
is a restricted level of service.
Lim et al. [67] proposed a practical technique to manage cloud resources using SLAs
for processing MapReduce jobs end to end. It is a novel constraint programming (CP)
based ResourceManager (MRCP-RM) approach capable of matching and scheduling
different MapReduce jobs through SLAs on a system subject to an open stream of
arriving jobs. MRCP-RM is implemented in Java, and the CP model is solved using
IBM CPLEX.
3.3.3 Level
In the map and reduce phases, allocation decisions are made at the job and task
levels. For example, many tasks may not be scheduled instantaneously during
execution. Unlike the resource (node) level, the job scheduling level can enable
various machines with different requirements to provide numbers of slots.
Furthermore, the completion time is assumed by the scheduling algorithm to be the
same for each slot [68]. There are two levels in Hadoop scheduling: the job level, at
which the job scheduler has to choose a ready job, and the task level, at which the
task scheduler has to select a ready task.
(1) Job level The scheduler determines which job from the job queue runs next.
Jobs in a cluster are selected to run, may be put in the queue, or may be rejected
and resubmitted [69]. For instance, queued MapReduce jobs that are ready for
execution, and the resources in a distributed environment, are managed and
monitored by the job trackers.
(2) Map/Reduce task level The scheduler decides which task of a given job will run
and on which node. MapReduce creates two types of tasks, map and reduce, that
need to be scheduled in their respective slots. The reason is that the resource
requirements of each task differ during execution, which complicates making
resources available to task-level schedulers so as to reduce job execution time [70].
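The two levels can be sketched together: a job-level pass picks the next ready job, and a task-level pass then picks one of its pending tasks for the free node (an illustrative structure, not Hadoop's actual implementation):

```python
# Two-level scheduling sketch: the job level chooses which job runs next;
# the task level chooses which of its pending tasks goes to the free node.

def next_assignment(job_queue, free_node):
    for job in job_queue:                     # job level: FIFO over ready jobs
        pending = [t for t in job["tasks"] if not t["done"]]
        if not pending:
            continue
        # Task level: prefer a task whose input sits on the free node.
        local = [t for t in pending if t["pref_node"] == free_node]
        task = (local or pending)[0]
        return job["name"], task["tid"]
    return None

queue = [
    {"name": "j1", "tasks": [{"tid": 1, "done": True,  "pref_node": "n1"},
                             {"tid": 2, "done": False, "pref_node": "n2"},
                             {"tid": 3, "done": False, "pref_node": "n1"}]},
    {"name": "j2", "tasks": [{"tid": 4, "done": False, "pref_node": "n1"}]},
]
```

Note how the two decisions compose: j1 always wins at the job level here, while the task chosen within j1 depends on which node freed up.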
In a distributed environment, the optimizer must decide, among other things, where
each node of the plan will be executed [71]. Most scheduling problems fall under two
approaches, single-objective and multi-objective [72], as we discuss in the next
subsection (Fig. 5).
Fig. 5 MapReduce scheduling algorithms grouped by optimization objective: MTSD, Flex, Maestro, MRA++, delay scheduling, MOMTH, FIFO, Quincy, MCP, SHadoop, and ARIA
Table 4 Comparison of scheduling algorithms based on a single objective (columns: algorithm, objectives, constraints, minimum schedule length, global optimality)
from the Hadoop system: a message signaling a new job arrival from a user, which
stores the incoming job in an appropriate queue, and a heartbeat message from a free
resource, which triggers the routing process to assign a job.
MOMTH [75] is a proposed Hadoop scheduling algorithm for many tasks, based on
multi-constrained and multi-objective approaches, to improve big data processing.
Two objective functions are considered, associated with users and resources, under
constraints such as deadline and budget. To evaluate the algorithm in a scheduling
load simulator, a collaboration platform known as MobiWay is used for performance
analysis, providing the ability to deal with a large number of mobile sensors and
various applications. The algorithm was compared with the FIFO and fair schedulers
and obtained similar performance for the same approach.
MORM [76] tries to solve the problem of data management in large-scale distributed
environments. The authors proposed an offline multi-objective optimization approach
called Multi-objective Optimized Replication Management for replication
management. The proposed solution focuses on mean service time, file unavailability,
energy consumption, load variance, and latency. An artificial immune algorithm is
used to improve replication using a set of candidate solutions via mutation, cloning,
and selection processes. Experimental results show the effectiveness of the proposed
solution; moreover, the approach outperforms the existing default replication
management in terms of load balancing. Table 5 shows a comparison of scheduling
algorithms, listing the objectives of the proposed scheduling algorithms with their
different constraints; these algorithms deal with more than one objective.
Furthermore, it outlines the minimum schedule length and the global optimality
supported by the optimization model.
Jiang et al. [77] introduced a makespan minimization technique for MapReduce
systems that process data on servers of different speeds. Since this is the first study
of how to minimize the makespan of MapReduce running on servers of different
speeds, it is difficult to develop a model, and analyzing the algorithm for real
MapReduce deployments is complex. The makespan minimization scheduling in this
technique is classified into two types, namely offline and online scheduling: offline
scheduling is designed for non-preemptive reduce tasks, while online scheduling
works for both preemptive and non-preemptive reduce tasks.
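For intuition, the offline case can be illustrated with the classic longest-processing-time (LPT) greedy baseline on machines of different speeds; note this is a textbook heuristic, not Jiang et al.'s algorithm:

```python
# Offline makespan sketch on machines of different speeds: assign the largest
# remaining task to the machine that would finish it earliest. This is the
# classic LPT greedy baseline, not the algorithm from the paper.

def lpt_makespan(task_sizes, machine_speeds):
    finish = [0.0] * len(machine_speeds)
    for size in sorted(task_sizes, reverse=True):
        # Completion time on machine i would be: current finish + size/speed.
        m = min(range(len(machine_speeds)),
                key=lambda i: finish[i] + size / machine_speeds[i])
        finish[m] += size / machine_speeds[m]
    return max(finish)

# Two machines, one twice as fast: tasks [4, 2, 2] give a makespan of 3.0
# (the fast machine takes 4 and 2, the slow one takes the remaining 2).
```

The greedy rule accounts for speed heterogeneity directly, which is exactly what a homogeneous-machine scheduler would miss.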
In [78], a multi-objective task scheduling method is developed to improve energy
efficiency in a green data center. The purpose is to overcome the energy problem of
the data center through green renewable energy. An improved multi-objective
evolutionary algorithm is used to develop the solution, applying generalized
opposition-based learning to find appropriate computing nodes, an efficient
time-scheduling strategy, and the clock frequency and supply voltage for the assigned
jobs. Moreover, the scheduling time is used to determine when to run a task,
guaranteeing that tasks wait in the queue until enough renewable energy is available
and the task can be completed in the minimum time needed. Experimental results
confirm the superiority and effectiveness of the proposed solution.
3.4.3 Summary
Based on the recent research progress discussed in this subsection, we summarize the
objective functions, constraints, minimum schedule length, and global optimality for
the MapReduce scheduling optimization problem in Table 6.
Jobs are divided across many virtual nodes in the cloud to be executed in parallel.
However, it is possible for a few nodes to slow down the overall execution of a task.
Task execution may be slow for various reasons, such as software misconfiguration
and hardware degradation. When a client submits jobs to the master node, the jobs
are broken down into tasks. These tasks are executed by the DataNodes, and the total
execution time is dominated by the slowest DataNode in the cluster. Thus, to
overcome this challenge, Hadoop is designed to detect slow-running tasks and run
backup copies. For example, if one node has a slow disk controller, it may read its
input at only 10% of the speed of the other nodes.
LATE [36] is a scheduling algorithm, Longest Approximate Time to End, proposed
to improve the performance of the Hadoop scheduler against degradation in
heterogeneous environments. LATE is built upon three important concepts:
prioritizing tasks to speculate, selecting fast nodes to run on, and capping speculative
tasks to prevent thrashing. The idea is based on the observation that Hadoop assumes
a homogeneous cluster in which tasks progress linearly, leading to many speculative
executions; as a result, many nodes may slow down the overall performance of the
Hadoop cluster. The proposed algorithm improves Hadoop performance by reducing
response time and offers high robustness to heterogeneity.
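LATE's core estimate, ranking running tasks by their approximate time left computed from the observed progress rate, can be sketched as follows (simplified; LATE's thresholds for capping speculation are omitted):

```python
# LATE-style sketch: estimate each running task's time to completion from its
# observed progress rate, and speculate on the one expected to finish last.

def time_left(progress, elapsed):
    """progress in [0, 1]; elapsed seconds since the task started."""
    rate = progress / elapsed if elapsed > 0 else 0.0
    if rate == 0.0:
        return float("inf")            # no progress yet: worst straggler
    return (1.0 - progress) / rate

def pick_speculative(tasks):
    """tasks: dicts with 'tid', 'progress', 'elapsed'. Longest time-to-end wins."""
    return max(tasks, key=lambda t: time_left(t["progress"], t["elapsed"]))["tid"]

running = [
    {"tid": "t1", "progress": 0.9, "elapsed": 90},   # roughly 10 s left
    {"tid": "t2", "progress": 0.2, "elapsed": 80},   # roughly 320 s left
]
```

Here t2 is chosen for speculation even though t1 started earlier, because what matters is the estimated time to end, not elapsed time.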
MCP [38] proposes novel speculative execution strategies. The main goal of the
scheduler is to pick straggling tasks accurately and then move them to an available
node. This process guarantees fairness when the submitted jobs are divided into
multiple tasks for particular slots. Unlike LATE, the proposed solution uses
bandwidth and progress rate to choose the slowest task, and the remaining time and
progress speed of tasks are calculated using weighted averages. Moreover, the
scheduler decides on task backups according to the cluster load using methods such
as a cost-benefit model: the backups of map tasks are optimized locally, and slower
worker nodes are identified by the processing speed of the map tasks executed on
them. Experimental results show that MCP runs jobs up to 39% faster and offers up
to 44% better throughput compared to default Hadoop.
Yang and Chen [79] provided a new scheduling solution for speculative execution,
named adaptive task allocation, to improve the Hadoop framework. The algorithm
is designed to select response times and backups accurately, the idea being to achieve
a higher success rate for task backups together with better response time.
Experimental results indicated that the proposed solution improves the average task
throughput by 33% and reduces task latency by 36% compared with traditional
Hadoop speculation.
Xu and Lau [80] considered various loading conditions in their design for speculative
execution. They designed new schemes for parallel processing clusters based on two
algorithms: a straggler detection algorithm, which focuses on minimizing the overall
resource consumption of a job, and a smart cloning algorithm, which is based on
maximizing job utility.
You et al. [81] offered a new scheduling method called the Load-Aware scheduler to
improve the performance of MapReduce. The objective is to reduce the number of
speculative tasks in a heterogeneous cluster. The proposed method can improve the
3.5.1 Summary
In MapReduce, any available node marked as a straggler may perform task execution
poorly, which poses challenges to the overall execution time. To overcome the
challenge of stragglers slowing overall job execution, Hadoop implements a
speculative execution mechanism that runs a copy of the task on a different node;
thus, it prevents misbehaving nodes from slowing down the whole job. Hadoop
assumes that all machines are homogeneous, but in most cases machines are not
homogeneous, especially in cloud computing environments, where the hardware may
span different generations and virtualized data centers expose an uncontrollable
variety of virtualized resources. Table 7 depicts the summary of speculative execution
algorithms.
Since the birth of the Hadoop paradigm, the MapReduce programming model has been one of its main components, as highlighted in Sect. 1. The traditional implementation of MapReduce exhibits high latencies during the execution of Hadoop MapReduce jobs. Submitted jobs follow a fixed sequence of steps in which the data are split, mapped, shuffled, sorted, and then reduced [15]. This problem is exacerbated for more complex processing involving statistical MapReduce jobs, which require time on the order of minutes, hours, or longer, even with fairly small data volumes. Describing and comparing the performance characteristics of MapReduce in a cluster requires defining a series of performance indicators. This section therefore focuses on measuring the working capability of MapReduce jobs, including the throughput and execution time of each MapReduce job, the processing duration, and the CPU utilization of the nodes.
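The indicators named here can be derived from a handful of per-job counters. The sketch below assumes hypothetical counter names; real frameworks expose equivalents through their job history APIs:

```python
from dataclasses import dataclass

@dataclass
class JobCounters:
    """Per-job counters a MapReduce framework typically exposes
    (the field names here are hypothetical)."""
    input_bytes: int    # bytes read by the map tasks
    submit_time: float  # seconds
    finish_time: float
    cpu_time: float     # CPU-seconds summed over all tasks
    slots: int          # concurrent task slots in the cluster

def execution_time(job):
    """Wall-clock duration of the job."""
    return job.finish_time - job.submit_time

def throughput(job):
    """Bytes processed per second of wall-clock time."""
    return job.input_bytes / execution_time(job)

def cpu_utilization(job):
    """Fraction of available slot-seconds actually spent on CPU."""
    return job.cpu_time / (execution_time(job) * job.slots)

job = JobCounters(input_bytes=10 * 2**30, submit_time=0.0,
                  finish_time=200.0, cpu_time=6400.0, slots=64)
print(f"{execution_time(job):.0f} s, "
      f"{throughput(job) / 2**20:.1f} MiB/s, "
      f"{cpu_utilization(job):.0%} CPU")  # -> 200 s, 51.2 MiB/s, 50% CPU
```

Comparing scheduling algorithms then reduces to comparing these numbers for the same workload on the same cluster.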
4.1 Summary
MapReduce scheduling is an emerging technology that has been explored from various angles by many researchers over the past few years. More specifically, optimization and the construction of new algorithms and frameworks for homogeneous environments have been the major focus of big data research. Performance issues such as heterogeneity, hardware and software failures, and diverse datasets have brought new challenges to processing large amounts of data in distributed environments. Tuning MapReduce scheduling is therefore one of the important factors in improving the performance of the framework across different environments. Table 8 compares the different performance measurements. Gouasmi et al. [86] proposed federated and geographically distributed MapReduce frameworks for federated cloud platforms, together with an exact MapReduce scheduling algorithm. The federated distributed MapReduce is responsible for reducing the cost of the job while meeting a deadline. Performance analysis suggests that it can enhance resource utilization while keeping cost and job response time minimal within the deadline; the exact scheduling algorithm can serve as a point of reference for benchmarking. Zhao et al. [87] were motivated by the challenge of scheduling subtasks in a cloud computing environment to minimize transcoding convergence time. The authors parallelize video transcoding on heterogeneous MapReduce clusters using a prediction-based and locality-aware task scheduling technique (PLATT).
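The general idea behind prediction-based, locality-aware assignment, of which PLATT is one instance, can be sketched as follows. The node and task structures, the runtime predictor, and the transfer penalty are illustrative assumptions, not the authors' algorithm:

```python
def assign_task(task, nodes, predicted_runtime, locality_penalty=1.5):
    """Pick the node with the earliest predicted completion time.

    `predicted_runtime(task, node)` estimates execution time from
    history (a stand-in for a learned prediction model); running a
    task away from its data multiplies the estimate by a transfer
    penalty. The penalty factor and field names are illustrative.
    """
    best_node, best_finish = None, float("inf")
    for node in nodes:
        est = predicted_runtime(task, node)
        if task["data_node"] != node["name"]:
            est *= locality_penalty          # remote read costs extra
        finish = node["free_at"] + est       # node is busy until free_at
        if finish < best_finish:
            best_node, best_finish = node, finish
    best_node["free_at"] = best_finish       # reserve the chosen node
    return best_node["name"], best_finish

nodes = [{"name": "n1", "free_at": 0.0}, {"name": "n2", "free_at": 4.0}]
task = {"id": "t0", "data_node": "n2"}
# Suppose the model predicts 10 s on either node: the busier but local
# node n2 still wins once the transfer penalty is applied to n1.
print(assign_task(task, nodes, lambda t, n: 10.0))  # -> ('n2', 14.0)
```

The point of the example is the trade-off: locality is not pursued for its own sake but only when the predicted completion time, penalty included, is actually lower.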
configuration in Hadoop. The idea behind the slot management scheme is to improve resource utilization and reduce the makespan of multiple jobs. However, metrics for scheduling algorithms that consider multi-dimensional resources are still lacking. For future research, the main idea of a scheduling algorithm with multi-dimensional resources such as CPU, memory, and network bandwidth is to obtain a minimal execution schedule through efficient management of the available cloud resources [92, 93]. Moreover, such scheduling algorithms should take the different resource requirements of different tasks into account.
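A minimal sketch of what multi-dimensional scheduling means in practice: a task is admitted to a node only if every resource dimension fits, which a single scalar slot count cannot express. All names and figures below are illustrative:

```python
def fits(demand, free):
    """A task fits only if every resource dimension is satisfied."""
    return all(demand[r] <= free[r] for r in demand)

def schedule(tasks, node_free):
    """Greedy multi-dimensional packing over CPU, memory, and bandwidth.

    Tasks are placed largest-demand-first on the first node with room
    in every dimension. A sketch only; real schedulers also weigh
    fairness, locality, and preemption.
    """
    placement = {}
    for task in sorted(tasks, key=lambda t: -sum(t["demand"].values())):
        for name, free in node_free.items():
            if fits(task["demand"], free):
                for r, v in task["demand"].items():
                    free[r] -= v             # reserve the resources
                placement[task["id"]] = name
                break
    return placement

nodes = {"n1": {"cpu": 4, "mem": 8, "net": 100},
         "n2": {"cpu": 2, "mem": 16, "net": 100}}
tasks = [{"id": "a", "demand": {"cpu": 3, "mem": 4, "net": 50}},
         {"id": "b", "demand": {"cpu": 2, "mem": 10, "net": 20}}]
placement = schedule(tasks, nodes)
print(placement)  # -> {'a': 'n1', 'b': 'n2'}
```

Note that a slot-based scheduler would treat both tasks as interchangeable "one slot" each, even though task `b` could never run on `n1` after `a` is placed there.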
(3) Event-based scheduling Previously, scheduling research focused mainly on the offline problem of minimizing execution time for a single workflow with known task runtimes [94]. Heterogeneity brings new challenges to Hadoop scheduling: current queue-based scheduling algorithms offer little support for complex requirements, and schedule-based solutions run into problems when applied in a dynamic environment with uncertainties. Optimizing these scheduling algorithms for Hadoop is demanded when many jobs must be scheduled in the queue. Thus, it is essential to move beyond the traditional method and develop automatically synchronized activity execution that enables workflows to exchange data with other workflows or other applications [95, 96].
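An event-based scheduler reacts to completion events rather than executing a precomputed plan, so it tolerates unknown task runtimes. A simulated sketch, where the workflow, dependency map, and durations are illustrative:

```python
import heapq

def run_workflow(tasks, deps, duration):
    """Event-driven dispatch: tasks are released when the completion
    events of their dependencies fire, not by a fixed offline schedule.

    `deps[t]` lists the tasks that must finish before `t`; `duration[t]`
    is used here only to simulate when each completion event fires.
    """
    remaining = {t: len(deps[t]) for t in tasks}
    events, done, clock = [], [], 0.0
    for t in tasks:                              # dispatch ready tasks
        if remaining[t] == 0:
            heapq.heappush(events, (duration[t], t))
    while events:
        clock, finished = heapq.heappop(events)  # next completion event
        done.append(finished)
        for t in tasks:                          # handler: release successors
            if finished in deps[t]:
                remaining[t] -= 1
                if remaining[t] == 0:
                    heapq.heappush(events, (clock + duration[t], t))
    return done, clock

tasks = ["extract", "transform", "load"]
deps = {"extract": [], "transform": ["extract"], "load": ["transform"]}
result = run_workflow(tasks, deps,
                      {"extract": 2.0, "transform": 3.0, "load": 1.0})
print(result)  # -> (['extract', 'transform', 'load'], 6.0)
```

Because dispatch decisions are taken inside the event handler, the same loop works whether runtimes are known in advance or only observed when tasks finish, which is exactly the property the offline approaches lack.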
(4) Energy consumption and efficiency models for MapReduce jobs: Although a new power-aware MapReduce application model has been introduced in [97] for power-aware computing that considers users' requirements, a detailed energy efficiency model for MapReduce environments is still needed to predict the energy consumed under mixed-workload scenarios [98]. It should also account for the background HDFS activities carried out for availability checks, and it should incorporate the energy of idle nodes. The performance of the MapReduce framework can possibly be used to predict map and reduce task timings depending on the data volume, its distribution, the underlying hardware, and so on. Energy and performance models can then be combined to evaluate various scheduling algorithms, predicting their energy consumption and thus deciding which one maximizes performance and energy efficiency.
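A first-order version of such an energy model charges each node its busy and idle power over the job's wall-clock time, including idle nodes as recommended above. The power figures and field names are illustrative:

```python
def job_energy(nodes, job_time):
    """Estimate job energy from a simple per-node power model.

    Each node draws `p_idle` watts when idle and `p_busy` watts while
    executing tasks; `busy_time` is the node's task-occupied seconds
    within the job's wall-clock `job_time`. Idle-node energy is
    included deliberately: a node kept powered on but unused still
    contributes to the bill.
    """
    total = 0.0
    for n in nodes:
        busy = min(n["busy_time"], job_time)
        total += n["p_busy"] * busy + n["p_idle"] * (job_time - busy)
    return total  # joules

nodes = [
    {"p_idle": 100.0, "p_busy": 250.0, "busy_time": 180.0},
    {"p_idle": 100.0, "p_busy": 250.0, "busy_time": 120.0},
    {"p_idle": 100.0, "p_busy": 250.0, "busy_time": 0.0},  # idle node
]
energy = job_energy(nodes, job_time=200.0)
print(energy / 1000, "kJ")  # -> 105.0 kJ
```

Feeding predicted task timings from a performance model into `busy_time` is what links the two models: a scheduler that shortens the makespan shrinks the idle-power term for every node at once.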
6 Conclusion
Acknowledgements This work was financially supported by the University of Malaya Research Grant Programme (Equitable Society) under Grant RP032B-16SBS.
Appendix A
See Table 9.
Table 9 Modifications induced by existing scheduling approach to MapReduce
References
1. Chen M et al (2014) Big data: a survey. Mob Netw Appl 19(2):171–209
2. Maass W et al (2017) Big data and theory. In: Schintler LA, McNeely CL (eds) Encyclopedia of big
data, Springer International Publishing, Cham, pp 1–5
3. Wang Y et al (2018) Big data analytics: understanding its capabilities and potential benefits for health-
care organizations. Technol Forecast Soc Change 126:3–13
4. Tahmassebi A et al (2018) Deep learning in medical imaging: fMRI big data analysis via convolutional
neural networks. In: Proceedings of the Practice and Experience on Advanced Research Computing.
ACM
5. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM
51(1):107–113
6. Lee K-H et al (2012) Parallel data processing with MapReduce: a survey. ACM SIGMOD Rec
40(4):11–20
7. Chang H et al (2011) Scheduling in MapReduce-like systems for fast completion time. In: 2011
Proceedings IEEE INFOCOM. IEEE
8. Yoo D, Sim KM (2011) A comparative review of job scheduling for MapReduce. In: 2011 IEEE
International Conference on Cloud Computing and Intelligence Systems (CCIS). Citeseer
9. Althebyan Q et al (2017) A scalable MapReduce tasks scheduling: a threading-based approach. Int J
Comput Sci Eng 14(1):44–54
10. Tang Z et al (2012) MTSD: a task scheduling algorithm for MapReduce base on deadline constraints.
In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & Ph.D.
Forum (IPDPSW). IEEE
11. Jayasena K, Li L, Xie Q (2017) Multi-modal multimedia big data analyzing architecture and resource
allocation on cloud platform. Neurocomputing 253:135
12. Page AJ, Naughton TJ (2005) Framework for task scheduling in heterogeneous distributed computing
using genetic algorithms. Artif Intell Rev 24(3–4):415–429
13. Rao BT, Reddy L (2012) Survey on improved scheduling in Hadoop MapReduce in cloud environments.
arXiv preprint arXiv:1207.0780
14. Tiwari N et al (2015) Classification framework of MapReduce scheduling algorithms. ACM Comput
Surv (CSUR) 47(3):49
15. Doulkeridis C, Nørvåg K (2014) A survey of large-scale analytical query processing in MapReduce.
VLDB J 23(3):355–380
16. Arora S, Goel DM (2014) Survey paper on scheduling in Hadoop. Int J Adv Res Comput Sci Softw
Eng 4(5):4886
17. Chen C-H, Lin J-W, Kuo S-Y (2018) MapReduce scheduling for deadline-constrained jobs in hetero-
geneous cloud computing systems. IEEE Trans Cloud Comput 6(1):127–140
18. Nagarajan V et al. (2018) Malleable scheduling for flows of jobs and applications to MapReduce. J
Sched 752:1–19
19. Duan N et al (2018) Scheduling MapReduce tasks based on estimated workload distribution. Google
Patents
20. Tang Y et al (2018) OEHadoop: accelerate Hadoop applications by co-designing Hadoop with data
center network. IEEE Access 6:25849–25860
21. Hadoop A (2011) Apache Hadoop. https://hadoop.apache.org/. Accessed 3 May 2017
22. Vavilapalli VK et al (2013) Apache Hadoop YARN: yet another resource negotiator. In: Proceedings
of the 4th Annual Symposium on Cloud Computing. ACM
23. Hindman B et al (2011) Mesos: a platform for fine-grained resource sharing in the data center. In:
NSDI
24. Facebook (2012) Facebook engineering. Under the hood: scheduling MapReduce jobs more efficiently
with Corona. 2012 [cited 2015 5 March]. https://www.facebook.com/notes/facebook-engineering/
under-the-hood-scheduling-mapreduce-jobs-more-efficiently-with-corona/10151142560538920
25. Scott J (2015) A tale of two clusters: Mesos and YARN. [cited 2016 1/6/2016]. http://radar.oreilly.
com/2015/02/a-tale-of-two-clusters-mesos-and-yarn.html
26. Shabeera T, Kumar SM, Chandran P (2016) Curtailing job completion time in MapReduce clouds
through improved Virtual Machine allocation. Comput Electr Eng 58:190–202
27. Pulgar-Rubio F et al (2017) MEFASD-BD: multi-objective evolutionary fuzzy algorithm for subgroup
discovery in big data environments-a MapReduce solution. Knowl-Based Syst 117:70–78
28. Casavant TL, Kuhl JG (1988) A taxonomy of scheduling in general-purpose distributed computing
systems. IEEE Trans Softw Eng 14(2):141–154
29. Gao Y, Rong H, Huang JZ (2005) Adaptive grid job scheduling with genetic algorithms. Future Gener
Comput Syst 21(1):151–161
30. Hadoop A (2009) Fair scheduler. https://hadoop.apache.org/docs/stable1/fair_scheduler.html.
Accessed 13 June 2017
31. Hadoop A Capacity scheduler guide. https://hadoop.apache.org/docs/r1.2.1/capacity_scheduler.html.
Accessed 13 June 2017
32. Zaharia M et al (2010) Delay scheduling: a simple technique for achieving locality and fairness in
cluster scheduling. In: Proceedings of the 5th European Conference on Computer Systems. ACM
33. Tan J, Meng X, Zhang L (2012) Delay tails in MapReduce scheduling. ACM SIGMETRICS Perform
Eval Rev 40(1):5–16
34. Hadoop A Apache Hadoop. https://hadoop.apache.org/. Accessed 3 May 2017
35. Casas I et al (2016) GA-ETI: an enhanced genetic algorithm for the scheduling of scientific workflows
in cloud environments. J Comput Sci 26:318–331
36. Zaharia M et al (2008) Improving MapReduce performance in heterogeneous environments. In: OSDI
37. Isard M et al (2009) Quincy: fair scheduling for distributed computing clusters. In: Proceedings of the
ACM SIGOPS 22nd Symposium on Operating Systems Principles. ACM
38. Qi C, Cheng L, Zhen X (2014) Improving MapReduce performance using smart speculative execution
strategy. IEEE Trans Comput 63(4):954–967
39. Gu R et al (2014) SHadoop: improving MapReduce performance by optimizing job execution mech-
anism in Hadoop clusters. J Parallel Distrib Comput 74(3):2166–2179
40. Anjos JC et al (2015) MRA++: scheduling and data placement on MapReduce for heterogeneous
environments. Future Gener Comput Syst 42:22–35
41. Ibrahim S et al (2012) Maestro: Replica-aware map scheduling for MapReduce. In: 2012 12th
IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid). IEEE
42. Verma A, Cherkasova L, Campbell RH (2011) ARIA: automatic resource inference and allocation for
MapReduce environments. In: Proceedings of the 8th ACM International Conference on Autonomic
Computing. ACM
43. Wolf J et al (2010) Flex: a slot allocation scheduling optimizer for MapReduce workloads. In: Mid-
dleware 2010. Springer, pp 1–20
44. Polo J et al (2010) Performance management of accelerated MapReduce workloads in heterogeneous
clusters. In: 2010 39th International Conference on Parallel Processing (ICPP). IEEE
45. Lopes R, Menascé D (2015) A taxonomy of job scheduling on distributed computing systems. http://
cs.gmu.edu. Accessed 3 Sept 2017
46. Ahmad F et al (2012) Tarazu: optimizing MapReduce on heterogeneous clusters. In: ACM SIGARCH
Computer Architecture News. ACM
47. Krish K, Anwar A, Butt AR (2014) [phi] Sched: a heterogeneity-aware Hadoop workflow scheduler.
In: 2014 IEEE 22nd International Symposium on Modelling, Analysis & Simulation of Computer and
Telecommunication Systems (MASCOTS). IEEE
48. Dong F, Akl SG (2007) PFAS: a resource-performance-fluctuation-aware workflow scheduling algo-
rithm for grid computing. In: IEEE International Parallel and Distributed Processing Symposium.
IPDPS 2007. IEEE
49. Cheng D, Rao J, Guo Y, Jiang C, Zhou X (2017) Improving performance of heterogeneous mapreduce
clusters with adaptive task tuning. IEEE Trans Parallel Distrib Syst 28(3):774–786
50. Murthy AC et al (2011) Architecture of next generation Apache Hadoop MapReduce framework.
Technical report, Apache Hadoop
51. Ghit B et al (2014) Balanced resource allocations across multiple dynamic MapReduce clusters. In:
ACM SIGMETRICS
52. Barham P et al (2003) Xen and the art of virtualization. ACM SIGOPS Oper Syst Rev 37(5):164–177
53. Chen F, Kodialam M, Lakshman T (2012) Joint scheduling of processing and shuffle phases in MapRe-
duce systems. In: Proceedings IEEE INFOCOM. IEEE
54. Polo J et al (2011) Resource-aware adaptive scheduling for MapReduce clusters. In: Middleware 2011.
Springer, pp 187–207
55. Sousa E et al (2014) Resource-aware computer vision application on heterogeneous multi-tile archi-
tecture. In: Proceedings of the Hardware and Software Demo at the University Booth at Design,
Automation and Test in Europe (DATE), Dresden
56. Yong M, Garegrat N, Mohan S (2009) Towards a resource aware scheduler in Hadoop. In: Proceedings
of the 2009 IEEE International Conference on Web Services, Los Angeles, CA, USA
57. Guo Z et al (2012) Improving resource utilization in MapReduce. In: 2012 IEEE International Con-
ference on Cluster Computing (CLUSTER). IEEE
58. Rasooli A, Down DG (2014) COSHH: a classification and optimization based scheduler for heteroge-
neous Hadoop systems. Future Gener Comput Syst 36:1–15
59. Guo Z, Fox G, Zhou M (2012) Investigation of data locality in MapReduce. In: Proceedings of the
2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID
2012). IEEE Computer Society
60. Park J et al (2012) Locality-aware dynamic VM reconfiguration on MapReduce clouds. In: Proceedings
of the 21st International Symposium on High-Performance Parallel and Distributed Computing. ACM
61. Li J-J et al (2011) Survey of MapReduce parallel programming model. Dianzi Xuebao (Acta Electron
Sin) 39(11):2635–2642
62. He C, Lu Y, Swanson D (2011) Matchmaking: a new MapReduce scheduling technique. In: 2011 IEEE
Third International Conference on Cloud Computing Technology and Science (CloudCom). IEEE
63. Abad CL, Lu Y, Campbell RH (2011) DARE: adaptive data replication for efficient cluster scheduling.
In: 2011 IEEE International Conference on Cluster Computing (CLUSTER). IEEE
64. Zhang X et al (2011) Improving data locality of MapReduce by scheduling in homogeneous computing
environments. In: 2011 IEEE 9th International Symposium on Parallel and Distributed Processing with
Applications (ISPA). IEEE
65. Jin J et al (2011) Bar: an efficient data locality driven task scheduling algorithm for cloud computing.
In: Proceedings of the 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid
Computing. IEEE Computer Society
66. Wang W, Zhu K, Ying L, Tan J, Zhang L (2016) Maptask scheduling in mapreduce with data locality:
Throughput and heavy-traffic optimality. IEEE/ACM Trans Networking (TON) 24(1):190–203
67. Lim N, Majumdar S, Ashwood-Smith P (2014) Engineering resource management middleware for
optimizing the performance of clouds processing MapReduce jobs with deadlines. In: Proceedings of
the 5th ACM/SPEC International Conference on Performance Engineering. ACM
68. Sandholm T, Lai K (2010) Dynamic proportional share scheduling in hadoop. In: Workshop on Job
Scheduling Strategies for Parallel Processing, Springer, Berlin, Heidelberg, pp 110–131
69. Nanduri R et al (2011) Job aware scheduling algorithm for MapReduce framework. In: 2011 IEEE
Third International Conference on Cloud Computing Technology and Science (CloudCom). IEEE
70. Zhang Q et al (2015) PRISM: fine-grained resource-aware scheduling for MapReduce. IEEE Trans
Cloud Comput 1:1
71. Kllapi H et al (2011) Schedule optimization for data processing flows on the cloud. In: Proceedings of
the 2011 ACM SIGMOD International Conference on Management of data. ACM
72. Ponnambalam S, Jawahar N, Chandrasekaran S (2009) Discrete particle swarm optimization algorithm
for flowshop scheduling. INTECH Open Access Publisher
73. Savic D (2002) Single-objective vs. multiobjective optimisation for integrated decision support. Integr
Assess Decision Support 1:7–12
74. Chen Q, Liu C, Xiao Z (2013) Improving MapReduce performance using smart speculative execution
strategy. Parallel Distrib Syst 24:1107
75. Nita M-C et al (2015) MOMTH: multi-objective scheduling algorithm of many tasks in Hadoop. Clust
Comput 18:1–14
76. Long S-Q, Zhao Y-L, Chen W (2014) MORM: a multi-objective optimized replication management
strategy for cloud storage cluster. J Syst Archit 60(2):234–244
77. Jiang Y et al (2017) Makespan minimization for MapReduce systems with different servers. Future
Gener Comput Syst 67:13–21
78. Lei H et al (2016) A multi-objective co-evolutionary algorithm for energy-efficient scheduling on a
green data center. Comput Oper Res 75:103–117
79. Yang S-J, Chen Y-R (2015) Design adaptive task allocation scheduler to improve MapReduce perfor-
mance in heterogeneous clouds. J Netw Comput Appl 57:61–70
80. Xu H, Lau WC (2014) Optimization for speculative execution of multiple jobs in a MapReduce-like
cluster. arXiv preprint arXiv:1406.0609
81. You H-H, Yang C-C, Huang J-L (2011) A load-aware scheduler for MapReduce framework in hetero-
geneous cloud environments. In: Proceedings of the 2011 ACM Symposium on Applied Computing.
ACM
82. Lei L, Wo T, Hu C (2011) CREST: towards fast speculation of straggler tasks in MapReduce. In: 2011
IEEE 8th International Conference on e-Business Engineering (ICEBE). IEEE
83. Fu H et al (2017) FARMS: efficient MapReduce speculation for failure recovery in short jobs. Parallel
Comput 61:68–82
84. Brahmwar M, Kumar M, Sikka G (2016) Tolhit—a scheduling algorithm for Hadoop cluster. Proc
Comput Sci 89:203–208
85. Memishi B, Pérez MS, Antoniu G (2017) Failure detector abstractions for MapReduce-based systems.
Inf Sci 379:112–127
86. Gouasmi T et al (2018) Exact and heuristic MapReduce scheduling algorithms for cloud federation.
Comput Electr Eng 69:274
87. Zhao H et al (2018) Prediction-based and locality-aware task scheduling for parallelizing video
transcoding over heterogeneous MapReduce cluster. IEEE Trans Circuits Syst Video Technol
28(4):1009–1020
88. Singh S, Chana I (2015) QoS-aware autonomic resource management in cloud computing: a systematic
review. ACM Comput Surv (CSUR) 48(3):42
89. Yu J (2007) QoS-based scheduling of workflows on global grids
90. Sheikhalishahi M et al (2016) A multi-dimensional job scheduling. Future Gener Comput Syst
54:123–131
91. Yao Y et al (2015) Self-adjusting slot configurations for homogeneous and heterogeneous Hadoop
clusters. IEEE Trans Cloud Comput 5:344
92. Khoo BB et al (2007) A multi-dimensional scheduling scheme in a Grid computing environment. J
Parallel Distrib Comput 67(6):659–673
93. Yao Z, Papapanagiotou I, Callaway RD (2015) Multi-dimensional scheduling in cloud storage systems.
In: International Communications Conference (ICC)
94. Dong X, Wang Y, Liao H (2011) Scheduling mixed real-time and non-real-time applications in MapRe-
duce environment. In: 2011 IEEE 17th International Conference on Parallel and Distributed Systems
(ICPADS). IEEE
95. Casati F, Shan M-C (2007) Event-based scheduling method and system for workflow activities. Google
Patents
96. Ilyushkin A, Ghit B, Epema D (2015) Scheduling workloads of workflows with unknown task runtimes.
In: 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid).
IEEE
97. Li Y, Zhang H, Kim KH (2011) A power-aware scheduling of MapReduce applications in the cloud.
In: 2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing
(DASC). IEEE
98. Goiri Í et al (2012) GreenHadoop: leveraging green energy in data-processing frameworks. In: Pro-
ceedings of the 7th ACM European Conference on Computer Systems. ACM
Affiliations
Ahmad Firdaus
firdausza@ump.edu.my
Muhamad Taufik Abdullah
mta@upm.edu.my
Faiz Alotaibi
faiz.eid@hotmail.com
Waleed Kamaleldin Mahmoud Ali
waleed.k.ali@gmail.com
Ibrar Yaqoob
ibraryaqoob@ieee.org
Abdullah Gani
abdullah.gani@taylors.edu.my
1 School of Computing and Information Technology, Taylor’s University, 47500 Subang Jaya,
Malaysia
2 Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur,
Malaysia
3 Centre for Mobile Cloud Computing Research, University of Malaya, Kuala Lumpur, Malaysia
4 Department of Computer Science, Federal College of Education (Technical), Gombe, Nigeria
5 Faculty of Computer Systems and Software Engineering, Universiti Malaysia
Pahang, 26300 Gambang, Kuantan, Pahang, Malaysia
6 Faculty of Computer Science and Information Technology, Universiti Putra Malaysia,
43400 Serdang, Selangor, Malaysia