
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSUSC.2020.3015559, IEEE Transactions on Sustainable Computing.
IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING, VOL., NO., 1

PEFS: AI-driven Prediction based Energy-aware Fault-tolerant Scheduling Scheme for Cloud Data Center

Avinab Marahatta^{a,b}, Qin Xin^{c}, Ce Chi^{a,b}, Fa Zhang^{b}, Zhiyong Liu^{b,*}

Abstract—Cloud data centers (CDCs) have become increasingly popular and widespread in recent years with the growing popularity
of cloud computing and high-performance computing. Due to the multi-step computation of data streams and heterogeneous task
dependencies, task failure frequently occurs, resulting in poor user experience and additional energy consumption. To reduce task
execution failure as well as energy consumption, we propose a novel AI-driven energy-aware proactive fault-tolerant scheduling
scheme for CDCs in this paper. Firstly, a prediction model based on the machine learning approach is trained to classify the arriving
tasks into “failure-prone tasks” and “non-failure-prone tasks” according to the predicted failure rate. Then, two efficient scheduling
mechanisms are proposed to allocate two types of tasks to the most appropriate hosts in a CDC. The vector reconstruction method is
developed to construct super tasks from failure-prone tasks and separately schedule these super tasks and non-failure-prone tasks to
the most suitable physical host. All the tasks are scheduled in an earliest-deadline-first manner. Our evaluation results show that the
proposed scheme can intelligently predict task failures, achieves better fault tolerance, and consumes less total energy than the existing schemes.

Index Terms—Cloud computing, cloud data center, scheduling, fault-tolerance, energy-efficiency, task failure, prediction, deep neural
network.

1 INTRODUCTION

Cloud data centers (CDCs) provide a wide range of facilities to end-users in different domains such as health care, scientific computing, smart grid, e-commerce, and nuclear science [1]. Task and resource failures are inevitable due to the growing number of CDC resources, which provide ICT infrastructures. However, reliable on-demand resources are needed for service providers to fulfill the service level agreement (SLA) of customers [2]. Therefore, it is of great significance to ensure the reliability and availability of such systems [3].

Modeling and examining dynamic fault-tolerant techniques for virtualized CDCs is challenging. First, cloud applications are usually large-scale and consist of an enormous number of distributed computing nodes. The complex structure and fault-tolerant behavior of heterogeneous CDC applications are hard to describe. Second, CDC applications dynamically adjust the virtual machine (VM) configurations to meet user requests, which require multi-dimensional resources (processor, memory, disk storage). The concurrency and uncertainty of requests raise the complexity of model validation and verification in scheduling.

Proactive fault-tolerant techniques are widely adopted in CDCs [4], [5]. However, efficient use of proactive fault tolerance relies heavily on prior knowledge of the failed tasks. Generally, a task fails due to over-utilization or lack of availability of resources, hardware failures, required libraries that are not installed properly, execution cost or execution time exceeding a threshold value, the system running out of memory or disk space, and so on.

Previous studies have adopted several fault-tolerance mechanisms, such as load balancing [6], [7], [8], migration [4], replication [9], [10], retry [11], [12], and task resubmission [13]. Recent data have very complex structures and parameters, so simple statistical approaches may not be able to capture the patterns of complex data. Some studies have used machine learning approaches to predict task failure [14], [15], but they have not leveraged the prediction to facilitate task scheduling. Moreover, the additional resources used in proactive fault-tolerant schemes result in additional energy consumption, and very few fault-tolerant studies have taken the optimization of energy consumption into consideration.

In this paper, our aim is to predict a task failure according to the requested resources before the actual failure occurs, and to leverage the prediction to design a task scheduling scheme, thus reducing task execution failures and total energy consumption. To this end, a novel AI-driven Prediction based Energy-aware Fault-tolerant Scheduling scheme (PEFS) is proposed. The scheme involves two stages: 1) prediction of task failure probability, and 2) task scheduling. In the first stage, task parameters (involving the requested resources, the actually allocated resources, and whether a failure occurred) are gathered from a historical data set (HDS).

Authors' Affiliations:
a. University of Chinese Academy of Sciences, China.
b. High Performance Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, China.
c. University of the Faroe Islands, Faroe Islands.
Authors' E-mail: avinab.marahatta@ict.ac.cn, qinx@setur.fo, chice18s@ict.ac.cn, zhangfa@ict.ac.cn, zyliu@ict.ac.cn
* Corresponding author.
A preliminary version appeared in the proceedings of the 9th International Green and Sustainable Computing Conference, Pittsburgh, USA, 2018. https://doi.org/10.1109/IGCC.2018.8752123

2377-3782 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: UNIVERSITY OF CONNECTICUT. Downloaded on August 15,2020 at 04:38:35 UTC from IEEE Xplore. Restrictions apply.

Then, all the task parameters are fed into TensorFlow (https://www.tensorflow.org/) as inputs. Using a deep neural network (DNN) approach, a model is trained to predict the failure rate of each arriving task. In this way, all the arriving tasks can be classified into failure-prone tasks and non-failure-prone tasks based on the model outputs. In the second stage, a scheduling algorithm is developed based on vector bin packing to schedule the two types of tasks efficiently. The main difference between these two scheduling schemes is that, for the failure-prone tasks, super tasks are first generated based on an elegant vector reconstruction method for fault-tolerance purposes. A replication strategy is applied to replicate only the failure-prone tasks, which are then arranged into super tasks. The vector reconstruction method is designed to construct super tasks so that different copies of a task are executed on different hosts. The super tasks are uniquely constructed so that sub-tasks do not overlap and redundant execution is avoided.

The main contributions of this work include:

• A DNN model is proposed to predict the failure probability of incoming tasks, so that a further scheduling strategy can be developed based on the prediction.
• Different resource allocation and scheduling strategies are developed for failure-prone and non-failure-prone tasks, reducing both task failure and energy consumption.
• A unique fault-tolerant mechanism is developed to schedule failure-prone tasks by constructing super tasks based on the vector reconstruction method.
• Extensive experiments using the CloudSim toolkit (http://www.cloudbus.org/cloudsim/) have been conducted to evaluate our scheme. The results validate that our scheme outperforms the state-of-the-art in terms of failure ratio, resource utilization, and energy consumption.

The remainder of the paper is organized as follows: Related work is introduced in Section 2. The structure of the Prediction based Energy-aware Fault-tolerant Scheduling scheme (PEFS) system, as well as the models for resources, tasks, energy consumption, and fault tolerance, are described in Section 3. Resource allocation and task scheduling schemes are developed in Section 4. The experimental setting is introduced, and a quantitative analysis of the correlations of different parameters is presented, in Section 5. The results of the experiments are presented and analyzed in Section 6. Finally, some concluding remarks are presented in Section 7.

2 RELATED WORK

When multiple task instances from different applications start to execute on numerous hosts, some of the hosts may fail accidentally, resulting in a fault in the system. This phenomenon is usually avoided by a fault-tolerance mechanism [9]. Various factors can lead to a host failure, and a failure event usually stimulates another fault event. These failures may include operating system crashes, network partitions, hardware malfunctions, power outages, abrupt software failures, etc. [16].

The existing fault-tolerant techniques in CDCs include replication, check-pointing, job migration, retry, task resubmission, etc. Some studies [17], [18] introduced methods based on certain principles, such as retry, resubmission, replication, renovation of software, screening, and migration, to harmonize the fault-tolerant mechanism with CDC task scheduling. However, for parallel and distributed computing systems, the most widely adopted and acknowledged method is to replicate data to multiple hosts [9].

A rearrangement-based improved fault-tolerant scheduling algorithm (RFTR) has been presented to deal with the dynamic scheduling issue for tasks in cloud systems [19]. A primary-backup model is adopted to realize fault tolerance in this method. The corresponding backup copy is released after the primary replica is completed, in order to free the resources it occupies. In addition, waiting tasks can be rearranged to utilize the released resources. In contrast, once a task is sent to the waiting queue of a virtual machine, its execution sequence is fixed and cannot be changed.

The backup overlapping mechanism and virtual machine migration strategy are adopted in the cloud to improve resource utilization [20]; it is a primary-backup approach similar to the proposal in [19]. A dynamic integrated task scheduling algorithm is presented in [21], which modifies the breadth-first search algorithm to find the overall optimal virtual machine for each task. The above techniques not only minimize the makespan and response time for tasks but can also be easily integrated with other virtual machine management techniques to reduce energy consumption and improve fault tolerance.

By observing resource behavior during job execution, various statistics- and probability-based techniques can be used to identify the failure rate of jobs. The history of resource failures can be leveraged for careful selection of resources and fault-tolerant scheduling [22]. This depends on a scheduling indicator, which is used to generate the scheduling decision whenever a job arrives at a grid scheduler. The scheduling technique selects resources with the lowest failure rate.

A multi-constrained load-balancing fault-tolerant scheduling is proposed to reduce the makespan, cost, and task failure rate while improving resource utilization [23]. Resource selection is made on the basis of initial failure rates, the number of jobs submitted, successfully executed jobs, and the processing capabilities of the resources. A new scheduling approach named PreAntPolicy is proposed in [24]; it consists of a prediction model based on fractal mathematics and a scheduler based on an improved ant colony algorithm. The prediction model determines whether to trigger the execution of the scheduler by virtue of load trend prediction, and the scheduler is responsible for resource scheduling while minimizing energy consumption under the premise of guaranteeing Quality-of-Service (QoS). A combination of energy-aware optimal allocation and consolidation algorithms is developed as a bin packing problem with a minimum power consumption objective [25].


An adaptive fault-tolerant job scheduling strategy is proposed by [26]. It is a checkpointing method. The proposed strategy dynamically updates the failure index based on the successful completion of the assigned tasks in order to maintain a grid failure index. The resource broker uses a fault index from the scheduler to apply different scheduling intensities for arriving tasks. The fault index value of a resource is decreased if the task is completed within the defined time.

To the best of our knowledge, no work has so far used a DNN to predict task failure and leveraged the prediction information to facilitate energy-aware fault-tolerant scheduling in CDCs as we do in this paper.

3 MODEL DESIGN

The Prediction based Energy-aware Fault-tolerant Scheduling scheme (PEFS) proposed in this paper involves two stages: 1) failure prediction and 2) task scheduling, as illustrated in Fig. 1. The predictor is designed based on a DNN trained and tested on the tasks in the HDS. This predictor is used to predict the probability of task failure, and based on this probability, tasks are classified into failure-prone and non-failure-prone tasks. Failure-prone and non-failure-prone tasks are organized in a failure-prone task queue and a non-failure-prone task queue, respectively; the two types of tasks are then scheduled separately. The power model is adopted to address the energy-saving concern of the CDC. Similarly, the fault model is used to design a fault-tolerant mechanism for the failure-prone tasks.

3.1 Task and Resource Model

In a CDC environment, the service providers receive independent tasks submitted by the end-users. A set of independent tasks is given as $T = \{T_1, T_2, T_3, \cdots, T_k\}$. Each task $T_k$ is associated with task requirements $T_k^{\varphi}$ having a set of parameters $T_k^{\varphi} = \{T_k^p, T_k^m, T_k^s\}$, where $T_k^p$, $T_k^m$ and $T_k^s$ represent the CPU, RAM and disk storage required to execute a given task $T_k$. In addition, $T_k$ can be modeled as $T_k = (ta_k, td_k, tl_k)$, where $ta_k$, $td_k$ and $tl_k$ represent the arrival time, the deadline and the work volume, respectively. The tasks are categorized into failure-prone and non-failure-prone tasks in accordance with their proneness to failure. A set of failure-prone tasks is denoted as $T = \{T_1, T_2, T_3, \cdots, T_l\}$ for the failure-prone scheduling process based on the fault-tolerant mechanism. Similarly, a set of non-failure-prone tasks $T = \{T_1, T_2, T_3, \cdots, T_m\}$ is scheduled using the non-failure-prone scheduling process.

A CDC consists of a set of hosts $H = \{H_1, H_2, \cdots, H_i\}$, providing the physical infrastructure for creating virtualized resources to satisfy the end-users' requirements. $V_j^{\varphi}$ is the virtual machine requirement, which is modeled as $V_j^{\varphi} = \{V_j^p, V_j^m, V_j^s\}$, where $V_j^p$, $V_j^m$ and $V_j^s$ represent the parameters of CPU, RAM and disk storage, respectively.

3.2 Power Consumption Model

In this paper, the power consumption of the host $H_i$, indicated as $P_i$ [27], is expressed as follows (based on the energy consumption model proposed in [28]):

$P_i(t) = P_i^{idle} + (P_i^{max} - P_i^{idle}) \times U_i(t)$  (1)

where $P_i^{idle}$ represents the idle power consumption of $H_i$ and $P_i^{max}$ represents the maximum power consumption of $H_i$.

Assuming that all the CPU cores are homogeneous, i.e., $\forall c = 1, 2, \cdots, PE_i: MIPS_{i,c} = MIPS_{i,1}$, the CPU utilization of the host $H_i$ at time $t$, denoted $U_i(t)$, is defined as the fraction of the total computing capacity of $H_i$ that is allocated to the VMs in $V_i(t)$. It can be expressed as follows:

$U_i(t) = \frac{1}{PE_i \times MIPS_{i,1}} \sum_{j \in V_i(t)} \sum_{c=1}^{PE_i} mips_{j,c}$  (2)

The total energy consumption of the host in the time period $[t_1, t_2]$ is formulated as:

$E_i = \int_{t_1}^{t_2} P_i(t)\,dt$  (3)

where
$U_i(t)$: the utilization of the host $H_i$ at time $t$; $0 \le U_i(t) \le 1$.
$PE_i$: the number of cores of the host $H_i$.
$mips_{j,c}$: the MIPS of the $c$th core assigned to $V_j$ on the host $H_i$.
$MIPS_{i,c}$: the maximum computing power capacity of the $c$th core on the host $H_i$.

3.3 Fault-tolerant Model

Task failures may occur due to the unavailability of resources, hardware failures, execution cost and time exceeding a threshold value, the system running out of memory or disk space, over-utilization of resources, improper installation of required libraries, and so on. These faults can be transient or permanent and are assumed to be independent. Hence, a fault-tolerant scheduling scheme needs to guarantee that the deadlines of all the tasks in the system are met even under the worst-case fault scenario.

As we know, the replication strategy is widely used for fault tolerance; it generally replicates tasks into two or more copies, which are then scheduled to different hosts. This creates more possibilities for wastage of resources and an increase in unusual energy consumption. Thus, in this paper, only failure-prone tasks are replicated. First, three consecutive tasks are taken from the failure-prone task queue, and each task is replicated into three copies. Then, the vector reconstruction method is designed to reconstruct super tasks from the replicated copies, as shown in Section 4.2 and Fig. 3. The reconstructed super tasks are mapped to the most suitable hosts, allocated resources, and then scheduled on different hosts separately. The sequence of replicated copies of tasks in the super tasks is designed as shown in Fig. 3 so that the executions of different copies of a task on different hosts do not overlap, in order to avoid redundant execution.
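To make the power and energy model of Section 3.2 concrete, the following minimal Python sketch implements Eqs. (1)-(3), with the integral in Eq. (3) discretized over fixed-width time samples. The host parameters (idle/maximum power, core count, per-core MIPS) and the one-hour utilization trace are illustrative assumptions, not values from the paper.

```python
# Sketch of the power and energy model of Section 3.2 (Eqs. 1-3).
# Host parameters below are illustrative assumptions.

def utilization(assigned_mips, num_cores, mips_per_core):
    """Eq. (2): fraction of the host's total MIPS currently assigned to VMs."""
    return sum(assigned_mips) / (num_cores * mips_per_core)

def power(p_idle, p_max, u):
    """Eq. (1): linear power model between idle and maximum power."""
    return p_idle + (p_max - p_idle) * u

def energy(p_idle, p_max, u_samples, dt):
    """Eq. (3), discretized: integrate power over time samples of width dt."""
    return sum(power(p_idle, p_max, u) * dt for u in u_samples)

# Example: a 4-core host, 2500 MIPS per core, 100 W idle, 250 W at full load.
u = utilization([2000, 1500, 1000, 500], num_cores=4, mips_per_core=2500)
print(u)                                      # 0.5
print(power(100.0, 250.0, u))                 # 175.0 (watts)
print(energy(100.0, 250.0, [u] * 3600, 1.0))  # 630000.0 (joules over 1 hour)
```

The linear model makes the energy cost of a placement decision easy to estimate: consolidating load onto fewer hosts raises their $U_i(t)$ but allows idle hosts to be avoided entirely.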


Fig. 1: System diagram of PEFS.

4 THE SCHEME FOR RESOURCE ALLOCATION AND TASK SCHEDULING

The scheduling scheme is divided into three parts. First, tasks are classified into failure-prone and non-failure-prone tasks based on a DNN. Second, super tasks are created from failure-prone tasks based on a replication-oriented elegant vector reconstruction method, which is designed for the fault-tolerant mechanism. Third, vector bin packing is adopted to schedule super tasks and non-failure-prone tasks to suitable hosts. PEFS is described in Algorithm 1.

4.1 Task Failure Prediction

The task failure prediction involves multiple steps, as shown in Fig. 2 [27], which include preprocessing, training on the preprocessed data, prediction of task failure, and checking the accuracy of the predicted results.

Let the data set used for normalization be defined by $Z$, and the normalized data set be $\check{Z}$. Equation (4) is used to normalize the data set $Z$ into the range (0, 1) [27]:

$\check{Z}_i = \frac{Z_i - Z_{min}}{Z_{max} - Z_{min}}$  (4)

where $Z_{max}$ and $Z_{min}$ are the maximum and minimum values obtained from the data set $Z$. The normalized data $\check{Z}$ is fed into the network as input, followed by training and evaluation of the network along with task failure prediction.

The architecture with $p$-$q$-$r$ parameters is defined, where $p$, $q$, and $r$ represent the number of neurons in the input, hidden, and output layers, respectively. Actual failures are extracted and analyzed to predict the failure ratio of upcoming new tasks on the CDC. The input data set ($Z$) and the corresponding output ($Y$) are constructed as in equation (5), where each $Z_i$ denotes one data point and $Z_j$ represents the number of actual requests received.

$Z = \begin{bmatrix} Z_{11} & Z_{12} & \cdots & Z_{1l} \\ Z_{21} & Z_{22} & \cdots & Z_{2l} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{k1} & Z_{k2} & \cdots & Z_{kl} \end{bmatrix}, \quad Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_k \end{bmatrix}$  (5)

In this paper, a percentage split is used to divide the data set into the training set and the testing set. The accuracy of the predicted task failures is measured using the root mean squared error ($e_r$) and the mean absolute percentage error ($e_a$), given by equations (6) and (7), respectively:

$e_r = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - Y_i^{PF})^2}$  (6)

$e_a = \frac{1}{n} \sum_{i=1}^{n} \frac{|Y_i - Y_i^{PF}|}{Y_i}$  (7)

where $Y_i^{PF}$ and $Y_i$ are the predicted and actual task failure, respectively, and $n$ is the total number of samples. The prediction algorithm is described in Algorithm 2.

Algorithm 1 Prediction based Energy-aware Fault-tolerant Scheduling
1: procedure PEFS()
2:   Task $T_k$, historical data set, $VMList$ - a set of VMs, and $HostList$ - a set of hosts
3:   Predict task failure: PREDICTION()
4:   if ($T_k$ prediction status == fail) then
5:     Keep $T_k$ in the failure-prone task queue
6:     FAILURE-PRONE-SCHEDULING()
7:   else
8:     Keep $T_k$ in the non-failure-prone task queue
9:     NON-FAILURE-PRONE-SCHEDULING()
10:  end if
11: end procedure
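The preprocessing and accuracy metrics of Section 4.1 can be sketched in a few lines of Python implementing Eqs. (4), (6), and (7); the sample values are illustrative, not data from the paper's HDS.

```python
import math

# Sketch of min-max normalization (Eq. 4), RMSE (Eq. 6), and the mean
# absolute percentage error (Eq. 7). Sample values are illustrative.

def normalize(z):
    """Eq. (4): scale each value of the data set Z into the range (0, 1)."""
    z_min, z_max = min(z), max(z)
    return [(zi - z_min) / (z_max - z_min) for zi in z]

def rmse(actual, predicted):
    """Eq. (6): root mean squared error between actual and predicted."""
    n = len(actual)
    return math.sqrt(sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / n)

def mape(actual, predicted):
    """Eq. (7): mean absolute percentage error (actual values must be non-zero)."""
    n = len(actual)
    return sum(abs(y - yp) / y for y, yp in zip(actual, predicted)) / n

print(normalize([2.0, 4.0, 6.0]))    # [0.0, 0.5, 1.0]
print(rmse([1.0, 2.0], [1.0, 4.0]))  # sqrt(2) ~ 1.4142
print(mape([1.0, 2.0], [1.0, 4.0]))  # 0.5
```

Note that Eq. (7) divides by the actual value $Y_i$, so it is only defined when no actual sample is zero.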


Fig. 2: Deep neural network for task failure prediction.
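The per-sample update described in Algorithm 2 (weighted sum, error against the actual label, gradient-descent correction) can be sketched in plain Python. The toy data set, learning rate, and epoch count are illustrative assumptions; the paper's actual predictor is a multi-layer DNN trained in TensorFlow.

```python
# Minimal sketch of the training loop of Algorithm 2: predict Y_i^PF as a
# weighted sum of the input features, measure the error against the actual
# label Y_i, and correct the weights by gradient descent on squared error.
# The data, learning rate, and epoch count below are illustrative.

def train(samples, targets, lr=0.05, epochs=500):
    """Fit weights w so that sum(z * w) approximates the failure label."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for z, y in zip(samples, targets):
            y_pf = sum(zi * wi for zi, wi in zip(z, w))       # predicted output
            err = y - y_pf                                    # prediction error
            w = [wi + lr * err * zi for wi, zi in zip(w, z)]  # descent step
    return w

# Toy data: tasks dominated by the first feature tend to fail (label 1.0).
X = [[1.0, 0.1], [0.9, 0.2], [0.1, 0.9], [0.2, 1.0]]
Y = [1.0, 1.0, 0.0, 0.0]
w = train(X, Y)
preds = [sum(zi * wi for zi, wi in zip(z, w)) for z in X]
print([round(p) for p in preds])  # [1, 1, 0, 0]
```

Thresholding the continuous output in this way is what turns the regression into the failure-prone / non-failure-prone classification used by the scheduler.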

4.2 Vector Reconstruction

Failure-prone tasks are arranged in the failure-prone task queue in an earliest-deadline-first manner. First, three consecutive tasks are taken from the failure-prone task queue, and each task is replicated into three copies. Then, the vector reconstruction method is used to reconstruct super tasks from the replicated copies, as shown in Fig. 3 [27]: $SuperTask_1$, $SuperTask_2$, $SuperTask_3$.

Each requested resource of each of the three consecutive tasks is evaluated, and the highest requested resource values are chosen, i.e., the resource vector $<CPU_{high}.RAM_{high}.Disk_{high}>$ is computed for all three super tasks. These super tasks are then mapped to the most suitable hosts having sufficient resources using Algorithm 3, separately. Similarly, non-failure-prone tasks are mapped to the most suitable hosts using Algorithm 4.

Algorithm 2 Failure Prediction
1: procedure PREDICTION()
2:   Initialize the weight vector $w_j$
3:   for each data point in $Z$ ($i = 1$ to $n$)
4:     for each $j = 1$ to $n$
5:       Evaluate the output $Y_i^{PF} = \sum_{j=1}^{n} Z_j w_j$
6:       Compare the output $Y_i^{PF}$ with the actual output $Y_i$
7:       Error $Y_i - Y_i^{PF}$
8:       Use gradient descent to minimize the error
9:       Update $w_j$ in the current layer
10:      Repeat until $e_r$ and $e_a$ have been minimized
11:    end for
12:  end for
13: end procedure

Algorithm 3 Failure-prone Scheduling
1: procedure FAILURE-PRONE-SCHEDULING()
2:   A set of failure-prone tasks $\{T_1, T_2, T_3, \cdots, T_l\}$
3:   Choose the first three consecutive tasks from the failure-prone task queue, i.e., $T_1, T_2, T_3$ from $\{T_1, T_2, T_3, \cdots, T_l\}$
4:   Create $SuperTask_1$, $SuperTask_2$, $SuperTask_3$ using the vector reconstruction method
5:   $VMList$ = sort VM list by order (enough resources)
6:   $HostList$ = sort host list by order (energy consumption)
7:   $a \leftarrow$ size of $HostList$, $b \leftarrow$ size of $VMList$
8:   for $i = 1$ to $a$
9:     for $j = 1$ to $b$
10:      if $V_j$'s resources can schedule $SuperTask$ then
11:        $m \leftarrow$ order of sub-tasks of $SuperTask$
12:        for $n = 1$ to $m$
13:          if $T_n$ of $SuperTask$ has not already been executed then
14:            Schedule $T_n$ of $SuperTask$ to $V_j$ of $H_i$
15:          else
16:            break
17:          end if
18:        end for
19:      else
20:        break
21:      end if
22:    end for
23:  end for
24: end procedure
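The construction of Section 4.2 can be sketched as follows: the three replicated copies are rotated so that each super task executes the sub-tasks in a different order (Fig. 3), and each super task is sized by the element-wise maximum of the three requested-resource vectors. The tuple representation of tasks is an illustrative choice; the sample vectors are those of the worked example in Section 4.4.

```python
# Sketch of the vector reconstruction method of Section 4.2. Three consecutive
# failure-prone tasks are each replicated three times; the copies are rotated
# so the three super tasks run sub-tasks in the orders [T1,T2,T3], [T2,T3,T1],
# [T3,T1,T2], and each super task is allocated the element-wise maximum
# <CPU_high, RAM_high, Disk_high> of the three requested-resource vectors.

def build_super_tasks(t1, t2, t3):
    """Return the three rotated copies (a Latin square over positions)."""
    tasks = [t1, t2, t3]
    return [tasks[i:] + tasks[:i] for i in range(3)]

def super_task_vector(vectors):
    """Element-wise maximum of the requested-resource vectors."""
    return tuple(max(dim) for dim in zip(*vectors))

supers = build_super_tasks("T1", "T2", "T3")
print(supers)  # [['T1', 'T2', 'T3'], ['T2', 'T3', 'T1'], ['T3', 'T1', 'T2']]

# At every position k, the three super tasks hold three *different* sub-tasks,
# so copies of the same task never run concurrently on the three hosts.
print(all(len({s[k] for s in supers}) == 3 for k in range(3)))  # True

# Using the <CPU.RAM.Disk> vectors of the worked example in Section 4.4:
print(super_task_vector([(4, 2, 5), (6, 4, 3), (5, 7, 4)]))  # (6, 7, 5)
```

The rotation is what lets a copy's execution be skipped once any earlier copy of the same sub-task has already completed, avoiding redundant execution.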
4.3 Vector Bin Packing

Resource allocation on different hosts can be considered as a vector bin packing problem, where VMs are treated as bins. The greedy algorithm is one of the most natural heuristics for one-dimensional bin packing, generally referred to as first fit decreasing (FFD) [29]. All the items are sorted by size in decreasing order (the three dimensions of the requirement, CPU, RAM, and storage, are used in sorting) and then placed sequentially into the first bin that has sufficient capacity. There are generally two natural options for generalizing FFD by scaling and normalizing across the multi-dimensional case to decide how to assign a weight to a vector, i.e., FFD product weight (FFDProd) and FFD sum weight (FFDSum).

In this paper, we present a norm-based greedy heuristic based on FFDSum that looks at the difference between the vector $G_l$ and the residual capacity $c(t)$ under a certain norm [30]. From all unassigned items, the item vector $G_l$ is chosen that minimizes the quantity $\sum_i w_i (G_{l,i} - c(t)_i)^2$ under the $l_2$ norm, such that the assignment does not violate the capacity constraints. FFDSum is more robust if dimensions can be assigned smaller coefficients $w_i$ and thus have a lower impact on the ordering.
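As a sketch of this heuristic, adapted here to choosing a bin for a given demand vector: among the bins whose residual capacity can hold the demand, pick the one minimizing the weighted squared $l_2$ distance $\sum_i w_i (G_{l,i} - c(t)_i)^2$. The residual capacities and unit weights are illustrative assumptions.

```python
# Sketch of the norm-based greedy heuristic of Section 4.3: filter out bins
# without sufficient residual capacity, then minimize the weighted squared
# l2 distance between the demand vector and the residual capacity, which
# favors the tightest feasible fit. Capacities and weights are illustrative.

def pick_bin(demand, residuals, weights=(1.0, 1.0, 1.0)):
    """Return the feasible residual-capacity vector closest to the demand."""
    feasible = [r for r in residuals if all(d <= c for d, c in zip(demand, r))]
    if not feasible:
        return None  # no bin can hold this demand
    return min(feasible, key=lambda r: sum(
        w * (d - c) ** 2 for w, d, c in zip(weights, demand, r)))

demand = (6, 7, 5)                               # <CPU, RAM, Disk>
residuals = [(8, 8, 8), (6, 7, 6), (16, 16, 16)]
print(pick_bin(demand, residuals))               # (6, 7, 6): tightest fit
```

Preferring the tightest feasible fit keeps large residual capacities free for later large items, which is the usual motivation for distance-based variants of FFD.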


Fig. 3: Failure-prone scheduling with preparation of super tasks (fault-tolerant mechanism).

The average demand heuristic, called FFDAvgSum, chooses $w_i = AvgDem_i = \frac{1}{n} \sum_{l=1}^{n} G_{l,i}$ in dimension $i$.

Algorithm 4 Non-failure-prone Scheduling
1: procedure NON-FAILURE-PRONE-SCHEDULING()
2:   A set of non-failure-prone tasks $\{T_1, T_2, T_3, \cdots, T_m\}$
3:   Create a vector for task $T_k$
4:   $VMList$ = sort VM list by order (enough resources)
5:   $HostList$ = sort host list by order (energy consumption)
6:   $a \leftarrow$ size of $HostList$, $b \leftarrow$ size of $VMList$
7:   for $i = 1$ to $a$
8:     for $j = 1$ to $b$
9:       if $V_j$'s resources can schedule $T_k$ then
10:        Schedule $T_k$ to $Host_i$
11:      else
12:        break
13:      end if
14:    end for
15:  end for
16: end procedure

4.4 Example

When tasks arrive in the system, each task joins the queue of the entire system in an earliest-deadline-first manner. Let $T_1, T_2, T_3, T_4, T_5, T_6, T_7, T_8, \ldots$ be the tasks in the system queue. The predictor then classifies the tasks into failure-prone and non-failure-prone tasks in accordance with their predicted failure probability. Based on this classification, the failure-prone task queue and the non-failure-prone task queue are generated. Let $T_1, T_2, T_3, T_6, T_8$ be the failure-prone tasks. First, the three consecutive failure-prone tasks (the first three predicted failure tasks) $T_1, T_2, T_3$ are chosen from the failure-prone task queue. Let us assume that $T_1, T_2, T_3$ have different vectors such as:
$T_1 = <CPU.RAM.Disk> = <4.2.5>$,
$T_2 = <CPU.RAM.Disk> = <6.4.3>$,
$T_3 = <CPU.RAM.Disk> = <5.7.4>$.

First, the requested resources (CPU, RAM, Disk) are compared among these three consecutive tasks. From the comparison, we can see that $T_1$ has the highest Disk: 5, $T_2$ has the highest CPU: 6, and $T_3$ has the highest RAM: 7. Then, the highest resource vector value $<CPU_{high}.RAM_{high}.Disk_{high}>$ is chosen as the resource vector value of the super tasks. Thus, the final resource vector value $<6.7.5>$ is allocated to each of the super tasks, i.e.,
$SuperTask_1 = <6.7.5>$,
$SuperTask_2 = <6.7.5>$,
$SuperTask_3 = <6.7.5>$.

Fig. 4: Scheduling sequence of sub tasks in super tasks.

Fig. 4 indicates the execution order of the scheduling sequence of sub-tasks in the super tasks, and Fig. 5 shows an example of resource selection for a super task.

4.5 Additional Proposed Scheduling Approaches

Two different new scheduling schemes, named PEFS* and PEFS0, are presented. PEFS* is the scheduling scheme obtained when all the tasks are marked as failure-prone tasks. Similarly, PEFS0 is the scheduling scheme obtained when all the tasks are marked as non-failure-prone tasks.

4.5.1 PEFS*

The primary aim of developing PEFS* is to observe the performance variation of the scheme when all the tasks are failure-prone. Upon a task's arrival in the system, the task joins the task queue in an earliest-deadline-first manner. We assume all the tasks to be failure-prone tasks. Vectors are then constructed for the failure-prone tasks before mapping them to hosts.

Fig. 5: Example of resource selection for super tasks.

Fig. 7: System diagram of PEFS0.


Algorithm 3 and Fig. 6. Only the failure-prone scheduling mechanism is adopted. A replication strategy is applied to replicate only the failure-prone tasks, which are then arranged into super tasks by the vector reconstruction method so that the execution of different copies of a task on different hosts does not overlap, and redundant execution is avoided.

Phase I (Task prediction based on DNN)
This phase is applied to the n tasks submitted into the system. In this phase, the task predictor classifies tasks into failure-prone tasks and non-failure-prone tasks. The complexity of Phase I is O(n²).

Phase II (Construction of super tasks)
The vector reconstruction method is used only for failure-prone tasks. Failure-prone tasks are replicated into three copies, and super tasks are created based on their sequencing order and resource selection. So, in the worst-case scenario, the complexity of Phase II is O(n log n + n log n + n³).

Phase III (Scheduling based on a greedy norm-based heuristic)
In this phase, PEFS schedules super tasks and non-failure-prone tasks based on a norm-based greedy heuristic. So, the complexity of this phase is O(n log n + n log n + n³).
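The norm-based greedy placement used in Phase III can be sketched as follows, in the spirit of the vector bin packing heuristics of [30]: a task goes to the feasible host that minimizes the norm of the residual capacity vector after placement, i.e. the tightest fit across CPU, RAM, and disk. The host names, capacities, and the choice of the L2 norm here are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a norm-based greedy placement over (CPU, RAM, disk)
# vectors. Hosts and capacities are hypothetical example values.
import math

def place(task, hosts):
    """task: (cpu, ram, disk) demand; hosts: dict name -> remaining capacity."""
    best, best_norm = None, float("inf")
    for name, cap in hosts.items():
        residual = [c - t for c, t in zip(cap, task)]
        if any(r < 0 for r in residual):   # host cannot fit the task
            continue
        norm = math.sqrt(sum(r * r for r in residual))
        if norm < best_norm:               # tightest feasible fit so far
            best, best_norm = name, norm
    if best is not None:                   # commit the placement
        hosts[best] = tuple(c - t for c, t in zip(hosts[best], task))
    return best

hosts = {"H1": (8, 8, 10), "H2": (4, 4, 6)}
print(place((3, 3, 5), hosts))  # H2 is the tighter fit
```

Minimizing the residual norm packs each host tightly before spilling onto the next, which is what lets fewer hosts stay active and saves energy.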

4.5.2 PEFS0
The primary aim of developing this scheme is to obtain the performance variation of the scheme when all the tasks are non-failure-prone. Upon a task's arrival in the system, the task joins the task queue in an earliest-deadline-first manner. We assume all the tasks to be non-failure-prone tasks. Then vectors are constructed for the non-failure-prone tasks before mapping them to hosts. Then they are organized and scheduled, as shown in Algorithm 4 and Fig. 7. Only the non-failure-prone scheduling mechanism is adopted for this scheme.

Fig. 6: System diagram of PEFS∗.

4.6 The time complexity of PEFS
The complexity of the PEFS scheme is measured based on its three phases.

5 EXPERIMENTS
Experiments are conducted to validate the proposed scheme. In the experiments, the predicted task failures are compared with the actual task failures to validate the failure prediction. In addition, the performance of the proposed scheduling scheme, i.e. PEFS, is compared with some existing techniques: the real-time fault-tolerant scheduling algorithm with rearrangement (RFTR) [19], the dynamic fault-tolerant scheduling mechanism (DFTS) [20], and modified breadth-first search (MBFS) [21], as all of them are designed for fault-tolerant scheduling. The algorithms used for comparison are briefly described as follows.

RFTR: Real-time fault-tolerant scheduling algorithm with rearrangement.
This algorithm adaptively optimizes the execution order of real-time tasks and fully utilizes the computing resources. The primary-backup (PB) approach is adopted to realize fault tolerance. After a primary copy is completed, the corresponding backup copy is deallocated in order to release the resources it occupies. Moreover, the waiting tasks can be rearranged to utilize the released resources,


TABLE 1: Types of VM

VM Type   Cores   MIPS   RAM (MB)   Disk Storage (GB)
VM1       1       1000   1536       5
VM2       1       1500   3840       5
VM3       1       2500   871        5

whereas, in most existing algorithms, the executing sequence is settled after sending tasks to the waiting queue of a VM.

DFTS: Dynamic fault-tolerant scheduling mechanism.
The backup overlapping mechanism and an efficient VM migration strategy are incorporated for real-time tasks in cloud computing. This algorithm aims to achieve both fault tolerance and high resource utilization in the cloud.

MBFS: Modified breadth-first search.
This algorithm finds the optimal VM for each task. It can not only minimize the makespan and response time of tasks but can also be easily integrated with other virtual machine management techniques to reduce energy consumption and improve fault tolerance.

5.1 Environment Setting
The experiments are conducted to validate the energy-aware fault-tolerant scheduling scheme based on an intelligent prediction model by deploying the core methods presented in Section 3 and Section 4. The simulation is performed on a machine equipped with an Intel(R) Core(TM) i5-3230M CPU (2.60 GHz), 8.0 GB of RAM, 750 GB of disk storage, the Windows 10 operating system, NetBeans IDE 8.2, and JDK 8.0. TensorFlow is used to implement the DNN on the data set. The simulation is performed using CloudSim [31] to create a CDC system that has identical hosts and heterogeneous VMs. Table 1 presents the types of VM used in the simulations.

We consider submission time, waiting time, start time, run time, end time, task status (failed/succeeded), resources used (CPU, RAM, disk storage), etc. Similarly, task failure information has been gathered for intelligent failure prediction. The resource utilization parameters are measured based on the utilization thresholds of CPU, RAM, and disk storage. If a resource utilization parameter exceeds its threshold value, the status is classified as failed; otherwise, it is classified as not failed. The performance (accuracy of the prediction model) is measured based on various errors.

To simulate task heterogeneity and the dynamic nature of the CDC environment, two different data sets are used. The Eular Data Set [32] and the Internet Data Set [33] contain large volumes of frequently executed tasks that give good coverage of the applications widely used in data centers, and they include useful information such as arrival time, deadline, resource requirements, and the fail/succeed states. One million tasks from the Eular Data Set and 50,000 tasks from the Internet Data Set are used for the training and testing of the deep learning network for task classification and scheduling.

5.2 Parameter setting
All the hosts in the CDC are identical, and each host has CPU cores (2800 MIPS per core), 8192 MB of RAM, and 1 TB of disk storage. The metrics used for comparison are the failure ratio of tasks in scheduling, resource utilization, and total energy consumption.

Task failure ratio
This ratio represents the tasks that failed because they could not be scheduled.
FT = (Number of Failed Tasks) / (Total Number of Tasks)

Resource utilization
The utilization is defined as the ratio between the time taken to process tasks and the total time.
Utilization = (Time Processing Tasks) / (Total Time) %

Energy consumption
The total energy consumed by the CDC.

5.3 Quantitative Analysis
To demonstrate the relation between the failure ratio of tasks and resources, we quantify the impacts of different parameters and their correlation. We perform correlation and covariance analysis on the data sets. The correlation shows the relation between the failed-task ratio and resource allocation. The analysis shows that task failure occurs when the scheduler allocates minimal resources, i.e., there is a linear relation between the failed-task ratio and resource allocation. Table 2 shows the average allocated resources for failed and successful tasks, with the correlation and covariance analysis results, for the Internet data set [33]. Similarly, Table 3 shows the analysis of the Google cluster data set [34], where the average requested resources for failed and successful tasks are given with their correlation analysis. From the analysis, we can see that tasks fail when the allocated resources are low and the requested resources are high. Fig. 8 represents the relation between tasks and computing resources: red and blue dots represent failed and successful tasks, respectively, and the three dimensions represent the requested and allocated resources (CPU, RAM, and disk storage) for the given task set. Fig. 8(a) shows the tasks of the Internet data set with their resources, and Fig. 8(b) shows the tasks of the Google cluster data set with their resources.

Fig. 8: Relation of failed tasks with resources (CPU, RAM, disk storage). (a) Internet data set. (b) Google cluster data set.
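The two ratio metrics defined in Section 5.2 can be written out directly; the sample numbers below are illustrative only, not results from the paper's experiments.

```python
# Direct implementations of the comparison metrics from Section 5.2.

def failure_ratio(failed_tasks, total_tasks):
    # FT = (number of failed tasks) / (total number of tasks)
    return failed_tasks / total_tasks

def utilization_percent(time_processing, total_time):
    # Utilization = (time processing tasks) / (total time), as a percentage
    return 100.0 * time_processing / total_time

print(failure_ratio(25, 1000))         # 0.025
print(utilization_percent(720, 1000))  # 72.0
```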

2377-3782 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Authorized licensed use limited to: UNIVERSITY OF CONNECTICUT. Downloaded on August 15,2020 at 04:38:35 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TSUSC.2020.3015559, IEEE
Transactions on Sustainable Computing
IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING, VOL., NO., 9

TABLE 2: Allocated resources analysis of Internet data set


Resources Average resources for failed task Average resources for successful task Correlation coefficient Covariance
CPU 0.419 0.585 0.286 0.041
RAM 0.416 0.580 0.287 0.041
Disk storage 0.419 0.577 0.275 0.040

TABLE 3: Requested resources analysis of Google cluster data set


Resources Average resources for failed task Average resources for successful task Correlation coefficient Covariance
CPU 0.044 0.029 -0.135 -0.003
RAM 0.029 0.015 -0.180 -0.003
Disk storage 0.0003 0.0001 -0.1268 -0.00003
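The correlation-coefficient and covariance columns of Tables 2 and 3 are standard Pearson statistics between a task's resource figure and its outcome. A small self-contained sketch on synthetic data (the values below are made up for illustration; they are not taken from either data set):

```python
# Pearson correlation and sample covariance, as used for Tables 2 and 3.
# Synthetic data: allocated CPU fractions for six tasks and a 0/1 outcome
# (1 = succeeded). Failed tasks were allocated less CPU here, so the
# correlation comes out positive, matching the trend of Table 2.
import math
from statistics import mean

def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

alloc_cpu = [0.40, 0.45, 0.42, 0.60, 0.58, 0.57]
succeeded = [0, 0, 0, 1, 1, 1]

print(round(covariance(alloc_cpu, succeeded), 3))   # 0.048
print(round(correlation(alloc_cpu, succeeded), 3))
```

A negative correlation with the *requested* resources, as in Table 3, would arise the same way if failed tasks had requested more than successful ones.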

6 RESULTS AND DISCUSSION

6.1 Accuracy of prediction model
Based on the experimental setting, we summarize the results of the DNN-based prediction model. A total of 50,000 tasks are taken from IDS, where 40,000 tasks are used for testing and 10,000 tasks are used for training. Similarly, a total of 1 million continuous tasks are taken from EDS, where 250,000 tasks are used for training and the remaining tasks are used for testing. The DNN has 1 input layer, 3 hidden layers, and 1 output layer, with 10, 20, and 10 nodes in the first, second, and third hidden layers, respectively. The errors er and ea based on equation (6) and equation (7), respectively, are used to validate the prediction accuracy for failed tasks. Fig. 9 compares the actual task failures with the task failures predicted by the DNN, Naive, and support vector machine (SVM) methods for different task counts. The prediction accuracy always stays above 84%, and the DNN has higher prediction accuracy than the Naive and SVM methods.

Fig. 10: Total energy consumption-IDS.
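The network shape just described (three hidden layers of 10, 20, and 10 nodes with a single failure-probability output) can be sketched as a forward pass. The paper implements its DNN in TensorFlow; this dependency-light NumPy version only illustrates the architecture, and the input width of 8 features, the random weights, and the ReLU/sigmoid activations are assumptions.

```python
# Forward pass of an MLP with the 10-20-10 hidden-layer shape of Section 6.1.
# Weights are random here; in the paper the network is trained on task
# features to output a failure probability per task.
import numpy as np

rng = np.random.default_rng(0)
sizes = [8, 10, 20, 10, 1]            # input width of 8 is an assumption
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

def predict_failure_prob(x):
    h = x
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = h @ w + b
        # ReLU on hidden layers, sigmoid on the output layer
        h = np.maximum(z, 0.0) if i < len(weights) - 1 else 1.0 / (1.0 + np.exp(-z))
    return h

batch = rng.random((4, 8))            # four task feature vectors
probs = predict_failure_prob(batch)
print(probs.shape)                    # (4, 1): one failure probability per task
failure_prone = probs[:, 0] > 0.5     # threshold into the two task classes
```

Thresholding the output probability is what splits arriving tasks into the failure-prone and non-failure-prone queues used by the two scheduling mechanisms.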

Fig. 9: Actual and predicted failure with task count [IDS].
Fig. 11: Total energy consumption-EDS.

6.2 Energy consumption
The total energy consumption is compared among all the schemes, as shown in Fig. 10 and Fig. 11. The proposed PEFS reduces energy consumption by approximately 12.328%, 14.315%, 28.66%, 21.74%, and 26.10% relative to PEFS∗, PEFS0, RFTR, DFTS, and MBFS, respectively, on IDS. Similarly, PEFS reduces energy consumption by approximately 2.053%, 3.345%, 26.366%, 17.068%, and 24.062% relative to PEFS∗, PEFS0, RFTR, DFTS, and MBFS, respectively, on EDS.

PEFS generates a schedule that consumes less energy than the other approaches. Fig. 12 and Fig. 13 show that the total energy consumption of the six algorithms grows linearly as the task count increases. For every task count, the proposed scheme reduces energy outstandingly.

Fig. 12: Total energy consumption with task count-IDS.

6.3 Task failure ratio
Fig. 14 and Fig. 15 show the task failure ratios of the algorithms. It can be seen that the failure ratio of PEFS is lower than those of the other approaches. The proposed PEFS reduces the task


failure ratio by approximately 23.529%, 18.75%, 16.13%, and 29.73% relative to those of PEFS0, RFTR, DFTS, and MBFS, respectively, on IDS, and by approximately 28.125%, 25.806%, 20.689%, and 34.285% relative to those of PEFS0, RFTR, DFTS, and MBFS, respectively, on EDS.

Fig. 13: Total energy consumption with task count-EDS.
Fig. 14: Failure ratio-IDS.
Fig. 15: Failure ratio-EDS.

6.4 Resource Utilization
We observe that the resource utilization of the six algorithms increases with the task count, and PEFS has higher resource utilization on both IDS and EDS. Fig. 16 and Fig. 17 show that PEFS has higher resource utilization than the other five approaches; Fig. 17 indicates that PEFS has higher utilization on EDS. Similarly, Fig. 18 and Fig. 19 show the resource utilization of the six algorithms for different task counts on IDS and EDS, respectively. This is due to the fact that PEFS employs super-task construction strategies to allocate failure-predicted tasks to the most suitable physical hosts and virtual machines, whereas the remaining schemes use the replication mechanism for all tasks.

Fig. 16: Average resource utilization-IDS.
Fig. 17: Average resource utilization-EDS.
Fig. 18: Average resource utilization with task count-IDS.
Fig. 19: Average resource utilization with task count-EDS.

7 CONCLUSION AND FUTURE WORK
This paper presents an AI-driven Prediction based Energy-aware Fault-tolerant Scheduling scheme (PEFS) for the cloud data center. First, task parameters (involving the requested resources, the actually allocated resources, and whether failure occurred) are gathered from the historical data set. Then a DNN-based prediction model is trained to predict the failure rate of each of the arriving tasks. In this way, all the arriving tasks can be classified into failure-prone


tasks and non-failure-prone tasks based on the model outputs. Second, a scheduling algorithm based on vector bin packing is proposed to schedule the two types of tasks efficiently. The main difference between these two scheduling processes is that, for the failure-prone tasks, super tasks are first generated based on an elegant vector reconstruction method for fault tolerance. A replication strategy is applied to replicate only the failure-prone tasks, which are then arranged into super tasks by vector reconstruction so that the execution of different copies of a task on different hosts does not overlap, and redundant execution is avoided. Experiments on the Internet Data Set and the Eular Data Set are conducted, and the experimental results validate the merits of the proposed scheme in comparison with existing techniques.

Future work includes larger-scale simulations to scrutinize the performance of the heuristics. Other improvements will also be considered to address reliability and consistency requirements.

ACKNOWLEDGMENTS
This work is partially supported by the National Natural Science Foundation of China (grant numbers 61520106005, 61761136014) and the National Key Research and Development Program of China (grant number 2017YFB1010001). The first author gratefully acknowledges the CAS-TWAS President's Fellowship for funding his Ph.D. at the Chinese Academy of Sciences, Beijing, China.

REFERENCES
[1] K. Bilal, S. U. Khan, L. Zhang, H. Li, K. Hayat, S. A. Madani, N. Min-Allah, L. Wang, D. Chen, M. Iqbal, C.-Z. Xu, and A. Y. Zomaya, "Quantitative comparisons of the state-of-the-art data center architectures," Concurrency and Computation: Practice and Experience, vol. 25, no. 12, 2012.
[2] D. Kliazovich, P. Bouvry, and S. U. Khan, "DENS: data center energy-efficient network-aware scheduling," Cluster Computing, vol. 16, no. 1, pp. 65–75, 2013.
[3] J. Shuja, S. A. Madani, K. Bilal, K. Hayat, S. U. Khan, and S. Sarwar, "Energy-efficient data centers," Computing, vol. 94, no. 12, pp. 973–994, 2012.
[4] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, "Using proactive fault-tolerance approach to enhance cloud service reliability," IEEE Transactions on Cloud Computing, no. 99, 2017.
[5] O. Hannache and M. Batouche, "Probabilistic model for evaluating a proactive fault tolerance approach in the cloud," IEEE International Conference on Service Operations and Logistics, and Informatics, pp. 94–99, 2015.
[6] B. Mohammed, M. Kiran, M. Kabiru, and I.-U. Awan, "Failover strategy for fault tolerance in cloud computing environment," Software: Practice and Experience, vol. 47, no. 9, pp. 1243–1274, 2017.
[7] J. Zhao, Y. Xiang, T. Lan, H. H. Huang, and S. Subramaniam, "Elastic reliability optimization through peer-to-peer checkpointing in cloud computing," IEEE Transactions on Parallel and Distributed Systems, vol. 28, no. 2, pp. 491–502, 2017.
[8] M. Randles, D. Lamb, and A. Taleb-Bendiab, "A comparative study into distributed load balancing algorithms for cloud computing," IEEE International Conference on Advanced Information Networking and Applications Workshops, pp. 551–556, 2010.
[9] A. Marahatta, Y.-S. Wang, F. Zhang, A. K. Sangaiah, S. K. Sah Tyagi, and Z. Liu, "Energy-aware fault-tolerant dynamic task scheduling scheme for virtualized cloud data center," Mobile Networks and Applications, 2018.
[10] A. Zhou, S. Wang, C.-H. Hsu, M. H. Kim, and K. S. Wong, "Network failure-aware redundant virtual machine placement in a cloud data center," Concurrency and Computation: Practice and Experience, vol. 29, no. 24, 2017.
[11] C. Wang, L. Xing, H. Wang, Z. Zhang, and Y. Dai, "Processing time analysis of cloud services with retrying fault-tolerance technique," IEEE International Conference on Communications in China, pp. 63–67, 2012.
[12] G. Ramalingam and K. Vaswani, "Fault tolerance via idempotence," SIGPLAN Not., vol. 48, no. 1, pp. 249–262, 2013.
[13] K. Plankensteiner, R. Prodan, and T. Fahringer, "A new fault tolerance heuristic for scientific workflows in highly distributed environments based on resubmission impact," IEEE International Conference on e-Science, pp. 313–320, 2009.
[14] M. A. Mukwevho and T. Celik, "Toward a smart cloud: A review of fault-tolerance methods in cloud systems," IEEE Transactions on Services Computing, 2018.
[15] J. Wu, P. Zhang, and C. Liu, "A novel multiagent reinforcement learning approach for job scheduling in grid computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 430–439, 2011.
[16] M. A. Shafii, L. M. Shafie Abd, and B. M. Bakri, "On-demand grid provisioning using cloud infrastructures and related virtualization tools: A survey and taxonomy," International Journal of Advanced Studies in Computer Science and Engineering (IJASCSE), vol. 3, no. 1, pp. 49–59, 2014.
[17] V. S. Kushwah, S. K. Goyal, and P. Narwariya, "A survey on various fault tolerant approaches for cloud environment during load balancing," IJCNWMC, vol. 4, no. 6, pp. 25–34, 2014.
[18] P. Kassian, P. Radu, F. Thomas, K. Attila, and P. Kacsuk, "Fault-tolerant behavior in state-of-the-art grid workflow management systems," Institute for Computer Science, University of Innsbruck, CoreGRID Technical Report TR-0091, 2007.
[19] P. Guo and Z. Xue, "Real-time fault-tolerant scheduling algorithm with rearrangement in cloud systems," 2017 IEEE 2nd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), pp. 399–402, 2017.
[20] J. Soniya, J. Angela, J. Sujana, and T. Revathi, "Dynamic fault-tolerant scheduling mechanism for real time tasks in cloud computing," International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), 2016.
[21] R. K. Yadav and V. Kushwaha, "An energy preserving and fault tolerant task scheduler in cloud computing," IEEE ICAETR, 2014.
[22] M. Amoon, "A fault-tolerant scheduling system for computational grids," Comput. Elect. Eng., vol. 38, no. 2, pp. 399–412, 2012.
[23] P. Keerthika and S. P, "A multiconstrained grid scheduling algorithm with load balancing and fault tolerance," Sci. World J., 2015.
[24] H. Duan, C. Chen, G. Min, and Y. Wu, "Energy-aware scheduling of virtual machines in heterogeneous cloud computing systems," Future Generation Computer Systems, vol. 74, pp. 142–150, 2017.
[25] C. Ghribi, M. Hadji, and D. Zeghlache, "Energy efficient VM scheduling for cloud data centers: Exact allocation and migration algorithms," 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, 2013.
[26] B. Nazir, K. Qureshi, and P. Manuel, "Adaptive check pointing strategy to tolerate faults in economy based grid," Journal of Supercomputing, vol. 50, no. 1, pp. 1–18, 2009.
[27] A. Marahatta, C. Chi, F. Zhang, and Z. Liu, "Energy-aware fault-tolerant scheduling scheme based on intelligent prediction model for cloud data center," The 9th International Green and Sustainable Computing Conference, Pittsburgh, USA, 2018.
[28] X. Fan, W.-D. Weber, and L. A. Barroso, "Power provisioning for a warehouse-sized computer," ISCA, pp. 13–23, 2007.
[29] V. V. Vazirani, Approximation Algorithms. Springer-Verlag, New York, 2001.
[30] R. Panigrahy, K. Talwar, L. Uyeda, and U. Wieder, "Heuristics for vector bin packing," Microsoft Research, Tech. Rep., 2011.
[31] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. De Rose, and R. Buyya, "CloudSim: a toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms," Software: Practice and Experience, vol. 41, no. 1, pp. 23–50, 2011.
[32] "Eular data set."
[33] "Internet data set." [Online]. Available: https://github.com/somec001/InternetData
[34] "Google cluster data set." [Online]. Available: https://github.com/google/cluster-data/blob/master/ClusterData2011_2.md


Avinab Marahatta received his Bachelor of Fa Zhang received his B.S. degree from Hebei
Computer Engineering from Purbanchal Univer- University of Science and Technology, Shiji-
sity, Biratnagar, Nepal, in 2010, and Master of azhuang, China, in 1997, his M.S. degree
Technology in Computer Science from Jawahar- from Yanshan University, Qinhuangdao, China
lal Nehru Technological University Hyderabad in 2000, and his Ph.D. degree from the Insti-
(JNTUH), Kukatpally, Hyderabad, India in 2014. tute of Computing Technology (ICT), Chinese
He worked as a Computer Engineer/IT In-charge Academy of Sciences, Beijing, China, in 2005,
in Higher Secondary Education Board (HSEB), all in Computer Science. From 2002 to 2005,
Ministry of Education, Government of Nepal. he worked as a Visiting Scholar at Ohio State
Moreover, he worked as an IT Consultant and University, Columbus, OH, USA. Following this,
IT Expert for the National Examination Board he worked as an Assistant Professor with the
(NEB), Ministry of Education, Government of Nepal. He has been National Research Center for Intelligent Computing Systems (NCIC),
awarded the CAS-TWAS President's Fellowship for pursuing his Ph.D.
degree at the Institute of Computing Technology, Chinese Academy of with the Advance Computing Research Laboratory, ICT till 2019. Now,
Sciences, Beijing, China. His research interests include energy-efficient he is a full Professor at High Performance Computer Research Center,
computing, data center networking, computer architecture, distributed ICT. In addition, he worked as a Visiting Scientist at the University Rey
and parallel processing, and network security.
include computer algorithms, high performance computing, biomedical
image processing, and bioinformatics.

Qin Xin graduated with his Ph.D (Oct. 2002 –


Oct. 2004) from Department of Computer Sci-
ence at the University of Liverpool, UK in De-
cember 2004. Currently, he is working as a full
Professor of Computer Science in the Faculty
of Science and Technology at the University of
the Faroe Islands (UoFI), Faroe Islands (Den-
mark). Prior to joining UoFI, he had held various
research positions in world-leading universities
and research laboratories including Senior Re-
search Fellowship at Universite Catholique de
Louvain, Belgium, Research Scientist/Postdoctoral Research Fellowship
at Simula Research Laboratory, Norway and Postdoctoral Research
Fellowship at the University of Bergen, Norway. His main research focus Zhiyong Liu received his Ph.D. degree in Com-
is on the design and analysis of sequential, parallel and distributed puter Science from the Institute of Computing
algorithms for various communication and optimization problems in Technology, Chinese Academy of Sciences, Bei-
wireless communication networks, as well as cryptography and digital jing, China, in 1987. He worked as a Professor
currencies including quantum money. Moreover, he also investigates the with the Institute of Computing Technology start-
combinatorial optimization problems with applications in Bioinformatics, ing in 1987. He was Executive Director General
Data Mining and Space Research. Currently, Prof. Dr. Xin is serving of the Directorate for Information Sciences, Na-
on the Management Committee Board of Denmark for several EU ICT tional Natural Science Foundation of China, from
projects. Dr. Xin has produced more than 70 peer-reviewed scientific 1995 to 2006. Currently, he is a Chair Professor
papers. His works have been published in leading international confer- with the Advance Research Laboratory, Institute
ences and journals, such as ICALP, ACM PODC, SWAT, IEEE MASS, of Computing Technology, Chinese Academy of
ISAAC, SIROCCO, IEEE ICC, Algorithmica, Theoretical Computer Sci- Sciences. He has worked on networks, high performance architectures
ence, Distributed Computing, IEEE Transactions on Computers, Journal and algorithms, parallel and distributed processing, and bioinformat-
of Parallel and Distributed Computing, IEEE Transactions on Dielectrics ics. His research interests include computer architecture, algorithms,
and Electrical Insulation, and Advances in Space Research. networks, and parallel and distributed processing. He has served as
a Board Member, a Referee, an Editor, and an Invited Editor for aca-
demic journals, a Program/Organization Committee Member/Chairman,
an Advisory Committee Member/Chairman for national and international
conferences, a Steering Expert Committee Member for research ini-
tiatives, and an Investigator/Principal Investigator for research projects
on computer control systems, computer architectures, and algorithms.
He was the recipient of the China National Science Congress Prize, a
National Prize for Science and Technology in China.

Ce Chi received his B.S. degree of Software


Engineering from Jilin University, Changchun,
China, in 2018. He is currently studying for his
M.S. degree in the Institute of Computing Tech-
nology, Chinese Academy of Sciences, Beijing,
China. His research interests include energy-
efficient computing, distributed and parallel pro-
cessing, mechanism design and deep learning.

Nama : Febryan Adi Pratama
NIM : A11.2020.12719

Judul PEFS: AI-driven Prediction based Energy-aware


Fault-tolerant Scheduling Scheme for Cloud
Data Center
Journal IEEE TRANSACTIONS ON SUSTAINABLE COMPUTING Vol 6 No 4
Tahun 2020
Penulis Avinab Marahatta,, Qin Xin , Ce Chia, Fa Zhang, Zhiyong Liu,
Reviewer Febryan Adi Pratama
Tanggal Review Senin, 12 Juni 2023
Tujuan Penelitian Tujuan dari paper ini adalah untuk mengoptimalkan penggunaan
energi, meningkatkan keandalan dan toleransi terhadap
kesalahan, meningkatkan kinerja sistem, meningkatkan
penggunaan sumber daya, dan mendukung keberlanjutan
lingkungan di pusat data cloud.
Metode Penelitian Metode penelitian yang akan digunakan dalam penelitian PEFS
meliputi meliputi: (1) analisis dan pemodelan data historis untuk
memprediksi pola beban kerja dan permintaan sumber daya di
pusat data, (2) pengembangan algoritma penjadwalan yang
mengoptimalkan penggunaan energi, penempatan beban kerja,
dan pengalokasian sumber daya berdasarkan prediksi AI-driven,
(3) integrasi mekanisme toleransi terhadap kesalahan untuk
mendeteksi dan merespons kegagalan perangkat keras atau
perangkat lunak secara adaptif, dan (4) eksperimen dan evaluasi
kinerja skema penjadwalan menggunakan simulasi atau
pengujian di lingkungan nyata untuk menguji efektivitasnya
dalam meningkatkan efisiensi energi, keandalan, dan kinerja
sistem di pusat data cloud.
Hasil Penelitian Hadil penelitian menunjukan bahwa skema PEFS efektif dalam
mengoptimalkan penggunaan energi, meningkatkan keandalan
sistem, dan memaksimalkan penggunaan sumber daya. Dengan
memanfaatkan prediksi AI-driven untuk mengatur penempatan
beban kerja dan pengalokasian sumber daya, skema ini berhasil
mengurangi konsumsi energi secara signifikan dan
mengoptimalkan penjadwalan untuk menghindari kelebihan
beban atau ketidakseimbangan. Selain itu, integrasi mekanisme
toleransi terhadap kesalahan memungkinkan respons yang cepat
terhadap kegagalan perangkat keras atau perangkat lunak,
menjaga ketersediaan layanan di pusat data. Dengan demikian,
skema penjadwalan PEFS memberikan manfaat yang signifikan
dalam meningkatkan efisiensi, keandalan, dan kinerja sistem di
pusat data cloud.
Research Strengths: The strengths of the PEFS scheduling scheme (PEFS: AI-driven Prediction based Energy-aware Fault-tolerant Scheduling Scheme) for cloud data centers are: (1) it uses AI-driven predictions to optimize energy usage and resource allocation, yielding higher energy efficiency and lower data center operating costs; (2) it integrates a fault-tolerance mechanism that improves system reliability by responding quickly to hardware or software failures, reducing recovery time and maintaining service availability; (3) it improves system performance by optimizing workload placement and resource utilization to achieve higher throughput and better response times; and (4) it addresses sustainability by reducing energy consumption and the negative environmental impact of cloud data centers. With this combination of features, the PEFS scheme delivers significant benefits in efficiency, reliability, performance, and environmental sustainability for cloud data centers.
Research Weaknesses: A weakness of PEFS is the potentially limited accuracy of its AI-driven predictions when anticipating workload patterns and resource demand in dynamic environments. High variability in workload and resource demand can lead to inaccurate predictions, which in turn can result in suboptimal resource allocation and degraded system performance. In addition, implementing the PEFS scheduling scheme may require significant computational resources to run the prediction algorithm and the fault-tolerance mechanism, which can add overhead and require additional investment in IT infrastructure. Addressing these weaknesses calls for continuous monitoring and evaluation, along with improvements in prediction accuracy and in the efficiency of the PEFS implementation.
Conclusion: The research makes a significant contribution to improving energy efficiency, system reliability, performance, resource utilization, and environmental sustainability in cloud data centers. By leveraging AI-driven predictions, the scheme can optimize workload placement and resource allocation, reduce energy consumption, and improve responsiveness to failures. Although there are some shortcomings in prediction accuracy and potential implementation overhead, this research provides a solid foundation for further development and for the practical application of the PEFS scheduling scheme in optimizing cloud data center operations.