This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSUSC.2020.3015559, IEEE
Transactions on Sustainable Computing
IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING, VOL., NO., 1
Abstract—Cloud data centers (CDCs) have become increasingly popular and widespread in recent years with the growing popularity
of cloud computing and high-performance computing. Due to the multi-step computation of data streams and heterogeneous task
dependencies, task failure frequently occurs, resulting in poor user experience and additional energy consumption. To reduce task
execution failure as well as energy consumption, we propose a novel AI-driven energy-aware proactive fault-tolerant scheduling
scheme for CDCs in this paper. Firstly, a prediction model based on the machine learning approach is trained to classify the arriving
tasks into “failure-prone tasks” and “non-failure-prone tasks” according to the predicted failure rate. Then, two efficient scheduling
mechanisms are proposed to allocate two types of tasks to the most appropriate hosts in a CDC. The vector reconstruction method is
developed to construct super tasks from failure-prone tasks and separately schedule these super tasks and non-failure-prone tasks to
the most suitable physical host. All the tasks are scheduled in an earliest-deadline-first manner. Our evaluation results show that the
proposed scheme can intelligently predict task failure, achieves better fault tolerance, and consumes less total energy
than the existing schemes.
Index Terms—Cloud computing, cloud data center, scheduling, fault-tolerance, energy-efficiency, task failure, prediction, deep neural
network.
2377-3782 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: UNIVERSITY OF CONNECTICUT. Downloaded on August 15,2020 at 04:38:35 UTC from IEEE Xplore. Restrictions apply.
all the task parameters are inserted into TensorFlow¹ as inputs. Using a deep neural network (DNN) approach, a model is trained to predict the failure rate of each arriving task. In this way, all the arriving tasks can be classified into failure-prone tasks and non-failure-prone tasks based on model outputs. In the second stage, a scheduling algorithm is developed based on vector bin packing to schedule the two types of tasks efficiently. The main difference between these two scheduling schemes is that, for the failure-prone tasks, super tasks are first generated based on an elegant vector reconstruction method for fault-tolerance purposes. A replication strategy is applied to replicate only the failure-prone tasks, whose copies are then arranged into super tasks. The vector reconstruction method constructs the super tasks so that different copies of a task execute on different hosts. The super tasks are uniquely constructed so that the sub-tasks do not overlap and redundant execution is avoided.

The main contributions of this work include:

• A DNN model is proposed for the prediction of the possibility of failure of incoming tasks, so that further scheduling strategy can be developed based on the prediction.
• Different resource allocation and scheduling strategies are developed for failure-prone and non-failure-prone tasks for both reduction of task failure and energy consumption.
• A unique fault-tolerant mechanism is developed to schedule failure-prone tasks by constructing super tasks based on the vector reconstruction method.
• Extensive experiments using the CloudSim² toolkit have been conducted to evaluate our scheme. The results validate that our scheme outperforms the state-of-the-art in terms of failure ratio, resource utilization, and energy consumption.

The remainder of the paper is organized as follows: Related works are introduced in Section 2. The structure of the Prediction based Energy-aware Fault-tolerant Scheduling scheme (PEFS) system, as well as the models for resources, tasks, energy consumption, and fault-tolerance, are described in Section 3. Resource allocation and task scheduling schemes are developed in Section 4. The experimental setting is introduced, and quantitative analysis of the correlations of different parameters is presented, in Section 5. The results of the experiments are presented and analyzed in Section 6. Finally, some concluding remarks are presented in Section 7.

2 RELATED WORK

When multiple task instances from different applications start to execute on numerous hosts, some of the hosts may fail accidentally, resulting in a fault in the system. This phenomenon is usually avoided by a fault-tolerance mechanism [9]. Various factors can lead to a host failure. In addition, a failure event usually stimulates another fault event. These failures may include operating system crashes, network partitions, hardware malfunctions, power outages, abrupt software failures, etc. [16].

The existing fault-tolerant techniques in CDCs include replication, checkpointing, job migration, retry, task resubmission, etc. Some studies [17], [18] introduced methods based on certain principles, such as retry, resubmission, replication, renovation of software, screening, and migration, to harmonize the fault-tolerant mechanism with CDC task scheduling. However, for parallel and distributed computing systems, the most widely adopted and acknowledged method is to replicate data to multiple hosts [9].

A rearrangement-based improved fault-tolerant scheduling algorithm (RFTR) has been presented to deal with the dynamic scheduling issue for tasks in cloud systems [19]. A primary-backup model is adopted to realize fault-tolerance in this method. The corresponding backup copy will be released after the primary replica is completed, in order to release the resource it occupies. In addition, the waiting tasks can be rearranged to utilize the released resources. In contrast, after the task is sent to the waiting queue of the virtual machine, the execution sequence is fixed and cannot be changed.

The backup overlapping mechanism and virtual machine migration strategy are adopted in the cloud to improve resource utilization [20]. It is a primary-backup approach similar to the proposal in [19]. A dynamic integrated task scheduling algorithm is presented in [21] by modifying the breadth-first search algorithm to find the overall optimal virtual machine for each task. The above techniques not only minimize the makespan and response time for tasks but can also be easily integrated with other virtual machine management techniques to reduce energy consumption and improve fault tolerance.

By observing resource behavior during job execution, various statistics and probability-based techniques can be used to identify the failure rate of jobs. The history of resource failure can be leveraged for careful selection of resources and fault-tolerant scheduling [22]. This depends on a scheduling indicator, which is used to generate the scheduling decision whenever a job arrives at a grid scheduler. The scheduling technique selects resources with the lowest failure rate.

A multi-constrained load-balancing fault-tolerant scheduling is proposed to reduce the makespan, cost, and task failure rate while improving resource utilization [23]. Resource selection is made on the basis of initial failure rates, the number of jobs submitted, successfully executed jobs, and the processing capabilities of the resources. A new scheduling approach named PreAntPolicy is proposed [24] that consists of a prediction model based on fractal mathematics and a scheduler on the basis of an improved ant colony algorithm. The prediction model determines whether to trigger the execution of the scheduler by virtue of load trend prediction, and the scheduler is responsible for resource scheduling while minimizing energy consumption under the premise of guaranteeing the Quality-of-Service (QoS). The combination of energy-aware optimal allocation and consolidation algorithm is developed as a bin packing problem with a minimum power consumption objective [25].

1. https://www.tensorflow.org/.
2. http://www.cloudbus.org/cloudsim/.

An adaptive fault-tolerant job scheduling strategy is proposed in [26]. It is a checkpointing method. The proposed strategy dynamically updates the failure index based on the successful completion of the assigned tasks in order to maintain a grid failure index. The resource broker uses a fault index from the scheduler to apply different scheduling intensities for arriving tasks. The success and the fault index value of the resource is decreased if the task is completed within the defined time.

To the best of our knowledge, no work so far has used a DNN to predict task failure and leveraged the prediction information to facilitate energy-aware fault-tolerant scheduling in CDCs as we do in this paper.

3 MODEL DESIGN

The Prediction based Energy-aware Fault-tolerant Scheduling scheme (PEFS) proposed in this paper involves two stages: 1) failure prediction and 2) task scheduling, as illustrated in Fig. 1. The predictor is designed based on a DNN to train and test the task in the HDS. This predictor is used to predict the probability of task failure, and based on this probability, tasks are classified into failure-prone and non-failure-prone tasks. Failure-prone and non-failure-prone tasks are organized in a failure-prone task queue and a non-failure-prone task queue, respectively, and the two types of tasks are then scheduled separately. The power model is adopted to address the energy-saving concern of the CDC. Similarly, the fault model is used to design a fault-tolerant mechanism for the failure-prone tasks.

3.1 Task and Resource Model

In a CDC environment, the service providers receive independent tasks submitted by the end-users. A set of independent tasks is given as $T = \{T_1, T_2, T_3, \cdots, T_k\}$. Each task $T_k$ is associated with task requirements $T_k^{\varphi}$ having a set of parameters $T_k^{\varphi} = \{T_k^p, T_k^m, T_k^s\}$, where $T_k^p$, $T_k^m$ and $T_k^s$ represent the CPU, RAM and disk storage required to execute a given task $T_k$. In addition, $T_k$ can be modeled as $T_k = (t_k^a, t_k^d, t_k^l)$, where $t_k^a$, $t_k^d$ and $t_k^l$ represent the arrival time, the deadline and the work volume, respectively. The tasks are categorized into failure-prone and non-failure-prone tasks in accordance with their proneness to failure. A set of failure-prone tasks is designed as $T = \{T_1, T_2, T_3, \cdots, T_l\}$ for a failure-prone scheduling process based on the fault-tolerant mechanism. Similarly, a set of non-failure-prone tasks $T = \{T_1, T_2, T_3, \cdots, T_m\}$ is scheduled by using the non-failure-prone scheduling process.

A CDC consists of a set of hosts $H = \{H_1, H_2, \cdots, H_i\}$, providing the physical infrastructures for creating virtualized resources to satisfy the end-user's requirements. $V_j^{\varphi}$ is the virtual machine requirement, which is modeled as $V_j^{\varphi} = \{V_j^p, V_j^m, V_j^s\}$, where $V_j^p$, $V_j^m$ and $V_j^s$ represent the parameters of CPU, RAM and disk storage, respectively.

where $P_i^{idle}$ represents the idle power consumption of the host $H_i$. Similarly, $P_i^{max}$ represents the maximum power consumption of the host $H_i$.

Assuming that all the CPU cores are homogeneous, i.e., $\forall c = 1, 2, \cdots, PE_i : MIPS_{i,c} = MIPS_{i,1}$, the CPU utilization of the host $H_i$ at time $t$, indicated as $U_i(t)$, is defined as the average percentage of the total allocated computing powers of $V_i(t)$ which are assigned to the host $H_i$. It can be expressed as follows.

$$U_i(t) = \frac{1}{PE_i \times MIPS_{i,1}} \sum_{c=1}^{PE_i} \sum_{j \in V_i(t)} mips_{j,c} \qquad (2)$$

The total energy consumption of the host in time period $[t_1, t_2]$ is formulated as:

$$E_i = \int_{t_1}^{t_2} P_i(t)\,dt \qquad (3)$$

where,
$U_i(t)$: the utilization of the host $H_i$ at time $t$; $0 \le U_i(t) \le 1$.
$PE_i$: the number of cores of the host $H_i$.
$mips_{j,c}$: the assigned MIPS of the $c$th core to the $V_j$ on the host $H_i$.
$MIPS_{i,c}$: the max computing power capacity of the $c$th core on the host $H_i$.

3.3 Fault-tolerant Model

Task failures may occur due to the unavailability of resources, hardware failures, execution cost and time exceeding a threshold value, the system running out of memory or disk space, over-utilization of resources, improper installation of required libraries, and so on. These faults can be transient or permanent and are assumed to be independent. A fault-tolerant scheduling scheme therefore needs to guarantee that the deadlines of all the tasks in the system are met even under the worst-case fault scenario.

The replication strategy is widely used for fault tolerance. It generally replicates each task into two or more copies and schedules the copies to different hosts. Replicating every task therefore risks wasting resources and increasing energy consumption unnecessarily. Thus, in this paper, only failure-prone tasks are replicated. First, three consecutive tasks are taken from the failure-prone task queue, and each task is replicated into three copies. Then, the vector reconstruction method is used to reconstruct the super tasks from the replicated copies, as shown in Section 4.2 and Fig. 3. Reconstructed super tasks are mapped to the most suitable hosts, allocated resources, and then scheduled on different hosts separately. The sequence of replicated copies within the super tasks is designed as shown in Fig. 3 so that the execution of different copies of the tasks in different hosts will not be overlapped, in order to avoid redundant execution.
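As a rough numerical illustration of equations (2) and (3), the following Python sketch computes host utilization from per-core MIPS allocations and integrates a power curve over a time window. The function names and the trapezoidal integration are hypothetical choices made here. In particular, the linear interpolation between idle and maximum power is an assumption: the surviving text defines $P_i^{idle}$ and $P_i^{max}$ but does not show how they are combined.

```python
from typing import Callable, Sequence


def host_utilization(assigned_mips: Sequence[Sequence[float]],
                     cores: int, mips_per_core: float) -> float:
    """Equation (2): assigned_mips[j][c] is the MIPS of core c given to VM j."""
    total_assigned = sum(sum(per_core) for per_core in assigned_mips)
    return total_assigned / (cores * mips_per_core)


def linear_power(utilization: float, p_idle: float, p_max: float) -> float:
    """ASSUMPTION: power grows linearly from p_idle to p_max with utilization."""
    return p_idle + (p_max - p_idle) * utilization


def energy(power_at: Callable[[float], float],
           t1: float, t2: float, steps: int = 1000) -> float:
    """Equation (3): integrate P_i(t) over [t1, t2] with the trapezoidal rule."""
    h = (t2 - t1) / steps
    total = 0.5 * (power_at(t1) + power_at(t2))
    for k in range(1, steps):
        total += power_at(t1 + k * h)
    return total * h


# Two homogeneous cores of 1000 MIPS each, shared by two VMs.
u = host_utilization([[500, 250], [250, 0]], cores=2, mips_per_core=1000)
watts = linear_power(u, p_idle=100.0, p_max=200.0)
joules = energy(lambda t: watts, 0.0, 3600.0)
print(u, watts, joules)
```

With power in watts and time in seconds, the result is in joules; dividing by 3.6e6 converts it to kWh.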
failure-prone tasks based on a replication oriented elegant vector reconstruction method, which is designed for fault-tolerant mechanism. Third, vector bin packing is adopted to schedule super tasks and non-failure-prone tasks to suitable hosts. PEFS is described in Algorithm 1.

Algorithm 1 Prediction based Energy-aware Fault-tolerant Scheduling
procedure PEFS()
    Task $T_k$, historical data set, VMList - a set of VMs, and HostList - a set of hosts
    Predict task failure: PREDICTION()
    if ($T_k$ prediction status == fail) then
        Keep $T_k$ on failure-prone task queue
        FAILURE-PRONE SCHEDULING()
    else
        Keep $T_k$ on non-failure-prone task queue
        NON-FAILURE-PRONE SCHEDULING()
    end if
end procedure

4.1 Task Failure Prediction

The task failure prediction involves multiple steps, as shown in Fig. 2 [27], which include preprocessing, training the preprocessed data, prediction of task failure as well as checking the accuracy of predicted results.

Let the data set used for normalization be defined by $Z$, and the normalized data set be $\check{Z}$. Equation (4) is used to normalize the data set $Z$ in the range (0, 1) [27].

$$\check{Z}_i = \frac{Z_i - Z_{min}}{Z_{max} - Z_{min}} \qquad (4)$$

where $Z_{max}$ and $Z_{min}$ are the maximum and minimum values, respectively, obtained from data set $Z$. The normalized data $\check{Z}$ is fed into the network as input, which is followed by training and evaluation of the network along with task failure prediction.

The architecture with $p - q - r$ input parameters is defined, where $p$, $q$, and $r$ represent the number of neurons in input, hidden, and output layers, respectively. Actual failures are extracted and analyzed to predict the failure ratio of upcoming new task on the CDC. Input data set ($Z$) and corresponding output ($Y$) are constructed in equation (5), where each $Z_i$ denotes one data point and $Z_j$ represents the number of actual requests received.

$$Z = \begin{bmatrix} Z_{11} & Z_{12} & \cdots & Z_{1l} \\ Z_{21} & Z_{22} & \cdots & Z_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{k1} & Z_{k2} & \cdots & Z_{kl} \end{bmatrix}, \quad Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{bmatrix} \qquad (5)$$

In this paper, the percentage split is used to divide the data set into the training set and testing set. The accuracy of the predicted task failures is measured using root mean squared error ($e_r$) and mean absolute percentage error ($e_a$) given by equation (6) and equation (7), respectively.

$$e_r = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(Y_i - Y_i^{PF}\right)^2} \qquad (6)$$

$$e_a = \frac{1}{n} \sum_{i=1}^{n} \frac{\left|Y_i - Y_i^{PF}\right|}{Y_i} \qquad (7)$$

where $Y_i^{PF}$ and $Y_i$ are the predicted and actual task failure, respectively, and $n$ is the total number of samples. The prediction algorithm is described in Algorithm 2.

4.2 Vector Reconstruction

Failure-prone tasks are arranged in the failure-prone task queue in an earliest-deadline-first manner. First, three con-
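[Figure: inputs $Z_1$, $Z_2$, $Z_3$ are normalized to $\check{Z}_1$, $\check{Z}_2$, $\check{Z}_3$ and fed to a neural network (input I1, hidden neurons N1 and N2, output O1) that produces the prediction $Y_{n+1}^{PF}$.]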
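The preprocessing and accuracy checks of the prediction step, i.e., the min-max normalization of equation (4) and the error measures of equations (6) and (7), can be sketched in plain Python as follows. The function names are hypothetical and the sample values are made up for illustration; this is not the authors' implementation. Note that equation (7) divides by the actual value $Y_i$, so the sketch assumes every actual value is nonzero.

```python
import math
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Equation (4): rescale each value using the data set's min and max."""
    z_min, z_max = min(values), max(values)
    if z_max == z_min:
        raise ValueError("all values are equal; normalization is undefined")
    return [(z - z_min) / (z_max - z_min) for z in values]


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Equation (6): root mean squared error e_r."""
    n = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / n)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Equation (7): mean absolute percentage error e_a (actual values nonzero)."""
    n = len(actual)
    return sum(abs(y - p) / y for y, p in zip(actual, predicted)) / n


print(min_max_normalize([2, 4, 6]))
print(rmse([1, 2, 3], [1, 2, 5]))
print(mape([2, 4], [1, 5]))
```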
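[Figure: super task construction. The failure-prone task queue ($T_1$, $T_2$, $T_3$, ..., $T_l$) yields SuperTask1 = ($T_1$, $T_2$, $T_3$) and SuperTask2 = ($T_2$, $T_3$, $T_1$).]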
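The orderings shown in the figure (T1 T2 T3, then T2 T3 T1) suggest that each super task is a cyclic rotation of the same three tasks, so that copies of one task never occupy the same position on two hosts. A minimal Python sketch under that assumption follows; the function names are hypothetical, and the rotation rule is inferred from the figure rather than stated in the text.

```python
from typing import List, Sequence


def build_super_tasks(tasks: Sequence[str]) -> List[List[str]]:
    """Return one cyclic rotation of `tasks` per replica (ASSUMED rule)."""
    n = len(tasks)
    return [[tasks[(start + k) % n] for k in range(n)] for start in range(n)]


def copies_never_overlap(super_tasks: Sequence[Sequence[str]]) -> bool:
    """True if, at every position, all super tasks run different tasks."""
    return all(len(set(slot)) == len(slot) for slot in zip(*super_tasks))


super_tasks = build_super_tasks(["T1", "T2", "T3"])
for index, order in enumerate(super_tasks, start=1):
    print(f"SuperTask{index}: {order}")
print("overlap-free:", copies_never_overlap(super_tasks))
```

Because every task appears once in each super task and once in each position, a copy of each task is always running first on some host, which is one way to read the claim that redundant execution is avoided.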
chosen for the resource vector value of super tasks. Thus, the final resource vector value <6, 7, 5> is allocated to each of the super tasks, i.e.,
SuperTask1 = <6, 7, 5>,
SuperTask2 = <6, 7, 5>,
SuperTask3 = <6, 7, 5>.

Algorithm 4 Non-failure-prone Scheduling
procedure NON-FAILURE-PRONE-SCHEDULING()
    A set of non-failure-prone tasks {$T_1$, $T_2$, $T_3$, ..., $T_m$}
    Create Vector for Task $T_k$
    VMList = Sort VM list by order(enough resources)
    HostList = Sort host list by order(energy consumption)
    a ← size of HostList, b ← size of VMList
    for i = 1 to a
        for j = 1 to b
            if $V_j$ resources can schedule $T_k$ then
                Schedule $T_k$ to Host$_i$
            else
                break
            end if
        end for
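The host and VM scan of Algorithm 4 can be sketched in Python as below. This is a simplified first-fit reading, not the authors' implementation: the class names, the `power_watts` sort key, and the capacity figures are hypothetical. The sketch also keeps scanning after a VM that does not fit instead of executing the `break` branch, because the surviving pseudocode does not define the VM sort order precisely enough to justify stopping early.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vector = Tuple[float, float, float]  # (CPU, RAM, disk)


@dataclass
class VM:
    name: str
    free: Vector  # remaining capacity of this VM


@dataclass
class Host:
    name: str
    power_watts: float  # hypothetical stand-in for "energy consumption"
    vms: List[VM] = field(default_factory=list)


def fits(demand: Vector, free: Vector) -> bool:
    """A task fits if every resource dimension is within the free capacity."""
    return all(d <= f for d, f in zip(demand, free))


def schedule_task(demand: Vector, hosts: List[Host]) -> Optional[Tuple[str, str]]:
    """Place the task on the first fitting VM of the least power-hungry host."""
    for host in sorted(hosts, key=lambda h: h.power_watts):
        for vm in host.vms:
            if fits(demand, vm.free):
                # Reserve the resources so later tasks see the reduced capacity.
                vm.free = tuple(f - d for f, d in zip(vm.free, demand))
                return host.name, vm.name
    return None  # no VM can host the task


hosts = [
    Host("h1", 200.0, [VM("v1", (2, 2, 2))]),
    Host("h2", 100.0, [VM("v2", (4, 8, 10)), VM("v3", (8, 8, 10))]),
]
print(schedule_task((6, 7, 5), hosts))
print(schedule_task((6, 7, 5), hosts))
```

The first call places the task on the lower-power host; the second finds no VM with enough remaining capacity and returns `None`.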
[Figure: tasks $T_1$, $T_2$, $T_3$ pass through the vector reconstruction method, which outputs three super tasks ordered ($T_1$, $T_2$, $T_3$), ($T_2$, $T_3$, $T_1$), and ($T_3$, $T_1$, $T_2$), each with resource vector CPU 6, RAM 7, DISK 5.]

Fig. 6: System diagram of PEFS∗.

4.5.2 PEFS0

The primary aim of developing this scheme is to observe how the performance of the scheme varies when all the tasks are non-failure-prone. Upon arrival in the system, a task joins the task queue in an earliest-deadline-first manner. We assume all the tasks to be non-failure-prone tasks. Vectors are then constructed for the non-failure-prone tasks before mapping to hosts, and the tasks are organized and scheduled as shown in Algorithm 4 and Fig. 7. Only the non-failure-prone scheduling mechanism is adopted for this scheme.

Fig. 7: System diagram of PEFS0.

4.6 The time complexity of PEFS

The complexity of the PEFS scheme is measured based on its three phases.

5 EXPERIMENTS

Experiments are conducted to validate the proposed scheme. In the experiments, the predicted task failure is compared with actual task failure to validate the failure prediction. In addition, the performance of the proposed scheduling scheme, i.e., PEFS, is compared with some existing techniques: the real-time fault-tolerant scheduling algorithm with rearrangement (RFTR) [19], the dynamic fault tolerant scheduling mechanism (DFTS) [20], and modified breadth first search (MBFS) [21], as all of them are designed for fault-tolerant scheduling. The algorithms for comparison are briefly described as follows.

RFTR: Real-time fault-tolerant scheduling algorithm with rearrangement.
This algorithm adaptively optimizes the execution orders of real-time tasks and fully utilizes the computing resources. The primary-backup (PB) approach is adopted to realize fault tolerance. After a primary copy is completed, the corresponding backup copy will be deallocated in order to release the resource it occupies. Moreover, the waiting tasks can be rearranged to utilize the released resource,
Based on an experimental setting, we summarize the results using a DNN based prediction model. Total 50,000 tasks are taken from IDS, where 40,000 tasks are used as testing, and

taken for a DNN. Similarly, 10, 20, and 10 nodes are taken in the first, second, and third hidden layers, respectively. Errors $e_r$ and $e_a$ based on equation (6) and equation (7) are used to validate the prediction accuracy of failure task, respectively. Fig. 9 compares the actual task failures and predicted task failures of DNN, Naive, and support vector machine (SVM) methods to different task counts. The prediction accuracy always stays above 84%, and the DNN has

[Figure: failure task count ($10^4$) versus task count ($10^3$, from 10 to 40) for Actual Failure, DNN, Naïve, and SVM.]

tire schemes, as shown in Fig. 10, and Fig. 11. The proposed PEFS reduces energy consumption by approximately 12.328%, 14.315%, 28.66%, 21.74%, and 26.10% relative to those of PEFS∗, PEFS0, RFTR, DFTS, and MBFS, respectively on IDS. Similarly, PEFS reduces energy consumption by approximately 2.053%, 3.345%, 26.366%, 17.068%, and 24.062% relative to those of PEFS∗, PEFS0, RFTR, DFTS, and MBFS, respectively on EDS.

Fig. 10: Total energy consumption-IDS. [Total energy consumption (kWh) for PEFS, PEFS∗, PEFS0, RFTR, DFTS, and MBFS.]

Fig. 11: Total energy consumption-EDS. [Total energy consumption (kWh) for PEFS, PEFS∗, PEFS0, RFTR, DFTS, and MBFS.]

The PEFS generates a schedule that uses lower energy consumption than the other approaches. Fig. 12 and Fig. 13 exhibit that the total energy consumption of the six algorithms grows linearly when task count increases. In every task count, the proposed scheme optimized energy outstandingly.

Fig. 12: Total energy consumption with task count-IDS. [Horizontal axis: task count ($10^3$), from 10 to 40.]

6.3 Task Failure ratio

Fig. 14 and Fig. 15 show the task failure ratio of algorithms. It can be seen that the failure ratio of the PEFS is lower than the other approaches. The proposed PEFS reduces the task
failure ratio by approximately 23.529%, 18.75%, 16.13% and

Fig. 13: Total energy consumption with task count-EDS.

Fig. 14: Failure ratio-IDS. [Task rejection ratio (%) for PEFS, PEFS∗, PEFS0, RFTR, DFTS, and MBFS.]

Fig. 15: Failure ratio-EDS. [Task rejection ratio (%) for PEFS, PEFS∗, PEFS0, RFTR, DFTS, and MBFS.]

Fig. 16: Average resource utilization-IDS.

Fig. 18: Average resource utilization with task count-IDS. [Resource utilization ratio (%) versus task count ($10^3$) for PEFS, PEFS∗, PEFS0, RFTR, DFTS, and MBFS.]

[Figure: resource utilization (%) versus task count ($10^5$) for PEFS, PEFS∗, PEFS0, RFTR, DFTS, and MBFS.]

6.4 Resource Utilization
tasks and non-failure-prone tasks based on model outputs. Second, a scheduling algorithm based on vector bin packing is proposed to schedule two types of tasks efficiently. The main difference between these two scheduling processes is that, for the failure-prone tasks, super tasks are first generated based on an elegant vector reconstruction method for fault tolerance. A replication strategy is applied to replicate only the failure-prone tasks, whose copies are then arranged into super tasks by vector reconstruction so that the execution of different copies of the tasks in different hosts will not be overlapped and redundant execution will be avoided. Experiments on Internet Data Set and Eular Data Set are conducted, and the experimental results validate the merits of the proposed scheme in comparison with existing techniques.

Future work includes further augmentation of larger-scale simulations to scrutinize the performance of heuristics. Other improvements will also be considered to address reliability and consistency requirements.

ACKNOWLEDGMENTS

This work is partially supported by the National Natural Science Foundation of China (grant numbers 61520106005, 61761136014), and the National Key Research and Development Program of China (grant number 2017YFB1010001). The first author gratefully acknowledges CAS-TWAS President's Fellowship for funding his Ph.D. at Chinese Academy of Sciences, Beijing, China.

REFERENCES

[1] K. Bilal, S. U. Khan, L. Zhang, H. Li, K. Hayat, S. A. Madani, N. Min-Allah, L. Wang, D. Chen, M. Iqbal, C.-Z. Xu, and A. Y. Zomaya, “Quantitative comparisons of the state-of-the-art data center architectures,” Concurrency Computation: Practice and Experience, vol. 25, no. 12, 2012.
[2] D. Kliazovich, P. Bouvry, and S. U. Khan, “Dens: data center energy-efficient network-aware scheduling,” Cluster Computing, vol. 16, no. 1, pp. 65–75, 2013.
[3] J. Shuja, S. A. Madani, K. Bilal, K. Hayat, S. U. Khan, and S. Sarwar, “Energy-efficient data centers,” Computing, vol. 94, no. 12, pp. 973–994, 2012.
[4] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using proactive fault-tolerance approach to enhance cloud service reliability,” IEEE Transactions on Cloud Computing, no. 99, 2017.
[5] O. Hannache and M. Batouche, “Probabilistic model for evaluating a proactive fault tolerance approach in the cloud,” IEEE Interna-
[11] C. Wang, L. Xing, H. Wang, Z. Zhang, and Y. Dai, “Processing time analysis of cloud services with retrying fault-tolerance technique,” IEEE International Conference on Communications in China, pp. 63–67, 2012.
[12] G. Ramalingam and K. Vaswani, “Fault tolerance via idempotence,” SIGPLAN Not., vol. 48, no. 1, pp. 249–262, 2013.
[13] K. Plankensteiner, R. Prodan, and T. Fahringer, “A new fault tolerance heuristic for scientific workflows in highly distributed environments based on resubmission impact,” IEEE International Conference on e-Science, pp. 313–320, 2009.
[14] M. A. Mukwevho and T. Celik, “Toward a smart cloud: A review of fault-tolerance methods in cloud systems,” IEEE Transactions on Services Computing, 2018.
[15] J. Wu, P. Zhang, and C. Liu, “A novel multiagent reinforcement learning approach for job scheduling in grid computing,” Future Generation Computer Systems, vol. 27, no. 5, pp. 430–439, 2011.
[16] M. A. Shafii, L. M. Shafie Abd, and B. M. Bakri, “On-demand grid provisioning using cloud infrastructures and related virtualization tools: A survey and taxonomy,” International Journal of Advanced Studies in Computer Science and Engineering IJASCSE, vol. 3, no. 1, pp. 49–59, 2014.
[17] V. S. Kushwah, S. K. Goyal, and P. Narwariya, “A survey on various fault tolerant approaches for cloud environment during load balancing,” IJCNWMC, vol. 4, no. 6, pp. 25–34, 2014.
[18] P. Kassian, P. Radu, F. Thomas, K. Attila, and P. Kacsuk, “Fault-tolerant behavior in state-of-the-art gridworkflow management systems,” Institute for Computer Science University of Innsbruck Attila Kert CoreGRID Technical Report Number TR-0091, 2007.
[19] P. Guo and Z. Xue, “Real-time fault-tolerant scheduling algorithm with rearrangement in cloud systems,” 2017 IEEE 2nd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), pp. 399–402, 2017.
[20] J. Soniya, J. Angela, J. Sujana, and T. Revathi, “Dynamic fault-tolerant scheduling mechanism for real time tasks in cloud computing,” International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), 2016.
[21] R. K. Yadav and V. Kushwaha, “An energy preserving and fault tolerant task scheduler in cloud computing,” IEEE ICAETR, 2014.
[22] M. Amoon, “A fault-tolerant scheduling system for computational grids,” Comput. Elect. Eng., vol. 38, no. 2, pp. 399–412, 2012.
[23] P. Keerthika and S. P, “A multiconstrained grid scheduling algorithm with load balancing and fault tolerance,” Sci. World J., 2015.
[24] H. Duan, C. Chen, G. Min, and Y. Wu, “Energy-aware scheduling of virtual machines in heterogeneous cloud computing systems,” Future Generation Computer Systems, vol. 74, pp. 142–150, 2017.
[25] C. Ghribi, M. Hadji, and D. Zeghlache, “Energy efficient vm scheduling for cloud data centers: Exact allocation and migration algorithms,” 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, 2013.
[26] B. Nazir, K. Qureshi, and P. Manuel, “Adaptive check pointing strategy to tolerate faults in economy based grid,” Journal of Supercomputing, vol. 50, no. 1, pp. 1–18, 2009.
[27] A. Marahatta, C. Chi, F. Zhang, and Z. Liu, “Energy-aware fault-tolerant scheduling scheme based on intelligent prediction model for cloud data center,” The 9th International Green and Sustainable Computing Conference, Pittsburgh, USA, 2018.
[28] X. Fan, W.-D. Weber, and L. A. Barroso, “Power provisioning for a
tional Conference on Service Operations And Logistics, And Informatics, warehouse-sized computer,” ISCA, pp. 13–23, 2007.
pp. 94–99, 2015. [29] V. V. Vazirani, “Approximation algorithms,” Springer-Verlag, New
[6] B. Mohammed, M. Kiran, M. Kabiru, and I.-U. Awan, “Failover York, Inc., 2001.
strategy for fault tolerance in cloud computing environment,” [30] R. Panigrahy, K. Talwar, L. Uyeda, and U. Wieder, “Heuristics for
Software: Practice and Experience, vol. 47, no. 9, p. 1243–1274, 2017. vector bin packing,” Microsoft Research, Tech. Rep., 2011.
[7] J. Zhao, Y. Xiang, T. Lan, H. H. Huang, and S. Subramaniam, “Elas- [31] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. De Rose, and
tic reliability optimization through peer-to-peer checkpointing in R. Buyya, “Cloudsim:a toolkit for modeling and simulation of
cloud computing,” IEEE Transactions on Parallel and Distributed cloud computing environments and evaluation of resource,” Soft-
Systems, vol. 28, no. 2, pp. 491–502, 2017. ware: Practice and Experience, vol. 41, no. 1, pp. 23–50, 2011.
[8] M. Randles, D. Lamb, and A. Taleb-Bendiab, “A comparative [32] “Eular data set.”
study into distributed load balancing algorithms for cloud com- [33] “Internet data set.” [Online]. Available: https://github.com/
puting,” IEEE International Conference on Advanced Information Net- somec001/InternetData
working and Applications Workshops, pp. 551–556, 2010. [34] “Google cluster data set.” [Online]. Available: https://github.
[9] A. Marahatta, Y.-S. Wang, F. Zhang, A. K. Sangaiah, S. K. com/google/cluster-data/blob/master/ClusterData2011 2.md
Sah Tyagi, and Z. Liu, “Energy-aware fault-tolerant dynamic task
scheduling scheme for for virtualized cloud data center,” Mobile
Networks and Applications, 2018.
[10] A. Zhou, S. Wang, C.-H. Hsu, M. H. Kim, and K. S. Wong,
“Network failure-aware redundant virtual machine placement in
a cloud data center,” Concurrency and Computation: Practice and
Experience, vol. 29, no. 24, 2017.
Avinab Marahatta received his Bachelor of Computer Engineering from Purbanchal University, Biratnagar, Nepal, in 2010, and Master of Technology in Computer Science from Jawaharlal Nehru Technological University Hyderabad (JNTUH), Kukatpally, Hyderabad, India in 2014. He worked as a Computer Engineer/IT In-charge in Higher Secondary Education Board (HSEB), Ministry of Education, Government of Nepal. Moreover, he worked as an IT Consultant and IT Expert for the National Examination Board (NEB), Ministry of Education, Government of Nepal. He has been awarded the CAS-TWAS President’s Fellowship for pursuing his Ph.D. degree at the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. His research interests include energy-efficient computing, data center networking, computer architecture, distributed and parallel processing, and network securities.

Fa Zhang received his B.S. degree from Hebei University of Science and Technology, Shijiazhuang, China, in 1997, his M.S. degree from Yanshan University, Qinhuangdao, China in 2000, and his Ph.D. degree from the Institute of Computing Technology (ICT), Chinese Academy of Sciences, Beijing, China, in 2005, all in Computer Science. From 2002 to 2005, he worked as a Visiting Scholar at Ohio State University, Columbus, OH, USA. Following this, he worked as an Assistant Professor with the National Research Center for Intelligent Computing Systems (NCIC), ICT from 2005 to 2008; then, he has worked as an Associate Professor with the Advance Computing Research Laboratory, ICT till 2019. Now, he is a full Professor at High Performance Computer Research Center, ICT. In addition, he worked as a Visiting Scientist at the University Rey Juan Carlos, Madrid, Spain, from 2009 to 2010. His research interests include computer algorithms, high performance computing, biomedical image processing, and bioinformatics.
Name: Febryan Adi Pratama
Student ID (NIM): A11.2020.12719