Raja Kaif Ali 1

Failure Prediction Based Fault Tolerance Resource
Management Techniques in Cloud
By
RAJA KAIF ALI
CIIT/FA22-BCS-146/ATD
BSC Thesis
In
Computer Science
COMSATS University Islamabad

Abbottabad Campus - Pakistan
FALL 2022
1

A Thesis Presented to
COMSATS University Islamabad, Abbottabad Campus
In partial fulfillment
of the requirement for the degree of
BS (Computer Science)
By
RAJA KAIF ALI

FALL 2022
2
A Post Graduate Thesis submitted to the Department of Computer Science as partial

fulfillment of the requirements for the award of Degree of BS in Computer Science.
Name Registration Number
RAJA KAIF ALI CIIT/FA22-BCS-146/ATD
Supervisor
MAAM FAIZA QAZI

Associate Professor
Department of Computer Sciences
Abbottabad Campus
Octuber,4,2022
3
Final Approval
This thesis titled

By
RAJA KAIF ALI

Has been approved
For the COMSATS University Islamabad, Abbottabad Campus
External Examiner: _______________________________________________________
Dr.
Supervisor: ______________________________________________________________
Dr. Babar Nazir, Associate Professor
Department of Computer Sciences, CUI, Abbottabad
HoD: ___________________________________________________________________
Dr. Imran Ali Khan
Department of Computer Sciences, CUI, Abbottabad
4
Declaration
I, RAJA KAIF ALI CIIT/FA22-BCS-146/ATD hereby declare that I have produced the work
presented in this thesis, during the scheduled period of study. I also declare that I have not taken
any material from any source except referred to wherever due that amount of plagiarism is within
acceptable range. If a violation of HEC rules on research has occurred in this thesis, I shall be
liable to punishable action under the plagiarism rules of the HEC.
Date: __OCT-04,2022___
Signature of the Student: ___________FAHAD HASSAN__________________
5
Certificate
It is certified that RAJA KAIF ALI-CIIT/FA22-BCS-146/ATD has carried out all the work
related to this thesis under my supervision at the Department of Computer Sciences, COMSATS
University Islamabad, Abbottabad Campus and the work fulfills the requirements for award of
BSC degree.
Date: ___________________
Supervisor:
__________________________________
Dr. Babar Nazir, Associate Professor
CUI, Abbottabad Campus
Head of Department:
______________________________
Dr. Imran Ali Khan
CUI, Abbottabad Campus
6
DEDICATION
This work is gratefully dedicated to

my loving parents for their moral
support and continuous
encouragement throughout my life.
7
ACKNOWLEDGMENTS
First of all, I am grateful to The Almighty Allah for showing me the right direction to complete
my thesis. I also pay humble tribute to our beloved Prophet Hazrat Muhammad (‫)ﷺ‬.
RAJA KAIF ALI

CIIT/FA22-BCS-146 /ATD
ABSTRACT
Cloud Computing is an easy access computation model, in which data is store remotely by the
owner of the data to enjoy huge aspect of applications on demand. Cloud system experience huge
amount of failures i.e., hardware failures, software failures, network failure etc., due to the large-
scale heterogeneity and distributed nature. Various fault tolerance techniques are used to resolve
and handle failures. Occasionally for tolerating the failures, more intensive FTT i.e.
CheckPointing is terrific, this technique handle the failure efficiently but cause in increase in the
8
overhead. Moreover, less intensive FTT i.e. retrying, is used for tolerating the failure but this
technique cannot handle the failure gracefully. Furthermore, when we want to execute a task to
achieve high throughput, availability and efficiency in task execution then we can launch the
replica on a certain resource. Replication technique generate overhead for keeping multiple
copies of replicas. Currently there is no prediction mechanism on the basis of which suitable
fault tolerance techniques can applied. In this paper, we have presented failure prediction based
fault tolerance techniques (FPBFTT) that first predict the resource failure, and on the basis of
failure percentage suitable fault tolerance techniques was applied. CheckPointing FTT was
applied on the resources when the failure percentage was greater than the upper threshold.
Retrying FTT was used when the failure percentage was less than lower threshold. Furthermore,
replication technique was used when the failure rate was not too high or too low but it was in
between the upper and lower threshold. Results are evaluated using the parameter of throughput,
response time, makespan, cost, SLA violation, budget and deadline. Simulation results obtain
through experiments and their comparison with existing technique concluded that our proposed
technique yield better results in all conditions than the conventional fault tolerance techniques.
Results reveal that in case of 25 jobs, the response time of FPBFTT is 9.24% better than RBFTT,
15.1% better than CPLCA and 22.5% better than job retry technique. In case of 50 jobs, the
throughput of FPBFTT is 17.4% better than RBFTT, 37.8% better than CPLCA and 64.2% better
than job retry technique. In case of 100 jobs, the makespan of FPBFTT is 16.08% better than
RBFTT, 26.8% better than CPLCA and 38.7% better than job retry technique.
Keywords: (Cloud Computing, Failure Prediction, Fault Tolerance, Retrying, Replication, Check
Pointing)
TABLE OF CONTENTS
Table of Contents
Chapter 1..........................................................................................................................................1
Introduction......................................................................................................................................1
1.1 Outline...................................................................................................................................2
Introduction..............................................................................................................................2
9
1.3 Motivation..............................................................................................................................4
1.4 Problem Statement.................................................................................................................5
1.5 Main Contribution of Thesis..................................................................................................6
1.6 Organization of Thesis...........................................................................................................7
2.1 Outline...................................................................................................................................9
2.2 Fault Tolerance – An Overview............................................................................................9
2.2.1 Proactive fault tolerance policies....................................................................................9
2.2.2 Reactive fault tolerance policies.....................................................................................9
2.3 Fault Tolerance Models:......................................................................................................11
2.4 Literature Review:...............................................................................................................13
Chapter 3........................................................................................................................................15
System Model................................................................................................................................15
3.1 Outline.................................................................................................................................16
3.2 Proposed System Architecture.............................................................................................16
3.2.1 Service Request Examiner and Admission Control (SRE & AC)................................17
3.2.2 Failure Prediction Manager..........................................................................................17
3.2.3 Failure Information Manager........................................................................................21
3.2.4 Resource Information Server........................................................................................21
3.2.5 Failure Advisor.............................................................................................................22
3.2.6 VM Placement Manager...............................................................................................30
Chapter 4........................................................................................................................................31
Simulation Setup and Results........................................................................................................31
4.1 Outline.................................................................................................................................32
4.2 Simulation Setup..................................................................................................................32
4.2.1 Resource Modelling......................................................................................................32
10
4.2.2 Application Modelling..................................................................................................33
4.3 Evaluation Parameters.........................................................................................................34
4.3.1 Response Time..............................................................................................................34
4.3.2 Throughput...................................................................................................................34
4.3.3 Makespan......................................................................................................................34
4.3.4 Cost...............................................................................................................................34
4.3.5 SLA-Violation..............................................................................................................35
4.3.6 Deadline........................................................................................................................35
4.3.7 Budget...........................................................................................................................35
4.4 Results and Discussion........................................................................................................35
4.4.1 Response Time..............................................................................................................35
4.4.2 Throughput...................................................................................................................38
4.4.3 Makespan......................................................................................................................41
4.4.4 Cost...............................................................................................................................43
4.4.5 SLA-Violation............................................................................................................45
4.4.6 Deadline........................................................................................................................46
4.4.7 Budget...........................................................................................................................47
5.1 Outline.................................................................................................................................49
5.2 Conclusion...........................................................................................................................49
5.3 Future Work.........................................................................................................................49
11
LIST OF ABBREVIATIONS
VM Virtual Machine
MIPS Millions of Instruction per second
DC Data Center
QoS Quality of Service
FTT Fault Tolerance Techniques
Max-min Maximum-Minimum
FR Failure Rate
FP Failure Prediction
ET Execution Time
CP Check Pointing
LH Load History
RIS Resource Information Server
IaaS Infrastructure as a Service
PaaS Platform as a Service
SaaS Software as a Service
SLA Service level Agreement
12
TABLE OF FIGURE
Figure 1 Layers of service model in cloud computing....................................................................4

Figure 2 Proposed System Architecture........................................................................................18
Figure 3 Retry Architecture...........................................................................................................25
Figure 4 Replication Architecture.................................................................................................26
Figure 5 CheckPointing Architecture............................................................................................30
Figure 6 Total jobs submitted=25..................................................................................................38
Figure 7 Total jobs submitted=50..................................................................................................38
Figure 8 Total jobs submitted=100...............................................................................................39
Figure 9 Total jobs submitted=1000..............................................................................................39
Figure 10 Total jobs submitted=25................................................................................................41
Figure 13 Total jobs submitted=1000............................................................................................42
Figure 14 . Total jobs submitted=25.............................................................................................44
Figure 16 Total jobs submitted=100..............................................................................................45
Figure 17 Total jobs submitted=1000............................................................................................45
Figure 18 Cost Comparison of Proposed and Existing FTT..........................................................46
Figure 19 SLA violation for Proposed and Existing FTT.............................................................48
Figure 20 Deadline Comparison of Proposed and Existing FTT..................................................49
Figure 21 Deadline Comparison of Proposed and Existing FTT..................................................50
1
Chapter 1
1.1 Outline
This chapter starts with brief introduction of cloud computing. Further it describes about the
motivation of fault prediction and fault tolerance in cloud presented in this thesis. Furthermore,
problem statement of my work is also explained in this chapter. Finally, the last section of this
chapter describes the classification of the thesis.
Introduction
2
According to National Institute of Standards “Cloud Computing is a model for enabling
convenient, on-demand network access to a shared pool of configurable computing resources
(e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and
released with minimal management effort or service provider interaction” [1]. Cloud computing
is an easy access computing model, where the data is store remotely by the owner of the data to
enjoy huge aspect of applications on demand. Cloud computing has very excellent prospect due
to its low investment cost, flexible deployment and ease of maintenance [2]. The aim of cloud
models is to ensure that how the services of the cloud are made available to end users. Cloud
computing is basically categorized into three models [3]. Software as a Service (SaaS), Platform
as a Service (PaaS) and Infrastructure as a Service (Iaas). In SaaS a highly scalable internet
based applications are hosted on the cloud and are provided to the customers on their demand.
Some common examples are GoogleDocs, Acrobat.com and SalesForce.com. In PaaS customers
build their own applications which can be executed on the provider’s infrastructure. Cloud
infrastructure provides the platform to the customers in order to develop, build, design and test
their applications. Some popular Paas examples are Azure Service Platform, Force.com and
Google App Engine. In IaaS Infrastructure components are provided to the customers by this
model. In this pay per use model services like storage, database management and compute
capabilities are offered on demand. Examples of Iaas include Amazon Web Services, Go Grid
and 3 Tera.
3
Various layers of service model are shown in Figure 1
Figure 1 Layers of service model in cloud computing
Cloud Computing support distributed service oriented architecture, multi user and multi domain
administrative infrastructure [3]. Due to the large scale heterogeneity and distributed nature,
cloud system experience huge amount of failures e.g., hardware failure (machine failure, disk
failure, memory failure, CPU failure, device failure) and software failure (OS failures,
application failure, user define exception, unhandled exception and device failures) [4]. The best
cloud provides have many considerable benefits but some serious concerns affect the reliability
and efficiency of this model [5]. Cloud computing face numerous challenging issues but the
most crucial challenge is related to the security. Furthermore, data availability, compatibility,
scalability and interoperability are serious challenges regarding cloud environment [6]. Failure
prediction and failure avoidance is the pivotal challenge of cloud computing.
There always exist greater chances of failures in cloud computing due to its distributed nature
[7]. The probability of the failure is always high and it is very essential to handle these failures.
4
If these failures are not address correctly they cause performance issues, minimum efficiency and
throughput, maximization in the response time and overhead. In order to tolerate the failures,
variety of techniques based on FT policies can be used. The aim of FTT is to minimize the
chance of failures. If any task failure occurs it can only be recovered either by retrying, migrating
the jobs, by replicating the jobs on different nodes or by using different check points [8]. If this
mechanism is not taken into account correctly, the probability of failures will increase and causes
overhead e.g., if any failure occur during the execution of the job, job can be retry many times
until its success, this mechanism cause overhead.
FTT generate a lot of overhead but it doesn’t consider the overhead. There is an intense need of
such failure prediction mechanism that can predict the failures in advance and overcome the
problem of overhead. Failure prediction is the approach to cope up with these problems. The
primary purpose of failure prediction is to predict the failures accurately by covering as many
failures as possible [9]. A perfect failure prediction would achieve a one to one matching
between predicted and actual failures with a low rate of overhead.
In our work, we have first predicted the failure percentage. The failure is predicted on the bases
of host. After predicting host level failure percentage we have applied a best suitable fault
tolerance technique like check pointing, retry, and replication that is suitable for the predicted
host based failure percentage scenario. After applying best tolerance technique for the required
scenario we have evaluated our result on the bases of different parameters.
1.3 Motivation
Cloud system encounter failures due to their large scale heterogeneity and distributed nature.
Fault is actually a bug or an error whereas failure is a divergence of services from their expected
objective. Various fault tolerance techniques are used to resolve and handle failures.
Occasionally for tolerating the failures, more intensive FTT i.e., CheckPointing can be used
which handle the failure efficiently but cause in increase in the overhead. Moreover, less
intensive FTT i.e., retrying can be used that cannot handle the failure efficiently. Furthermore,
when we want to execute a task to achieve high throughput, availability and efficiency in task
execution then we can launch the replica on a certain resource. Currently there is no prediction
mechanism on the basis of which suitable fault tolerance techniques can applied. We have
5
presented failure prediction based fault tolerance techniques (FPBFTT) that first predict the
resource failure, and on the basis of failure percentage suitable fault tolerance techniques was
applied. CheckPointing FTT was applied on the resources when the failure percentage was
greater than the upper threshold. Retrying FTT was used when the failure percentage was less
than lower threshold. Furthermore, replication technique was used when the failure rate was not
too high or too low but it was in between the upper and lower threshold. Failure prediction is the
way to predict or to observe the failures that occur during the processing of specific job. In cloud
computing if failures are to be predicted effectively and efficiently before its actual happening,
this phenomena assist to achieve the reliability, higher performance, robustness and availability
in cloud computing [4]. Moreover, this prediction mechanism overcomes the problem of
overhead that arise due to the fault tolerance technique. If the failure prediction mechanism is not
taken into account, it will results in low reliability and efficiency of the system will decline.
Furthermore, performance will be low this will extend the execution time of the jobs which
finally increase the chance of occurrences of failures. To get rid of these failures it is better to
use the failure prediction mechanism. In our work, we first predict about the failure percentage
based on the host and then apply different fault tolerance technique for tolerating the failures.
In literature review, three main techniques have been used to achieve the fault tolerances, which
are check pointing, replication and job migration. Check pointing is suitable for large-scale
application where work conserving is necessary otherwise latency will be increased and
throughput may be decreased. In some cases, we need high throughput and efficiency of task
execution to get the intended result for that purpose replication is used. Job migration is used
when there is some node /machine failure and we cannot execute the task, due to this failure
response time could be increased.
1.4 Problem Statement

Assuming that there is a DataCenter which consist of M hosts denoted as Host = {H 1, H2, H3 ……
Hn) and each Host has the following properties: Host = {Hosti (PE i, MIPSi, RAMi, BWi) | i=
1…….m} where, PEi to denote the processing element of Host i; MIPSi to show (million
instructions per second); RAMi to represent the size of RAM used in Host i; BWi to demonstrate
the bandwidth available to it. The set of virtual machines represented by VM = {VM 1, VM2,
VM3…….VMn) and each VM has the following properties: VM = {VM j (PEj , MIPSj, RAMj ,
6
BWj) | j= 1…….n} where, PEj to denote the processing element ; MIPS j to show (million
instructions per second); RAMj to represent the size of RAM used in Host i; BWj to demonstrate
the bandwidth available to it . Four virtual machine VM s are hosted by one host. In our case, we
have 12 VMs and 3 hosts. 12 virtual machines can be hosted by 3 hosts. First we get the failure
percentage on each host then apply suitable fault tolerance technique according to the failure
percentage on each host.
As the cloud environment is distributed and dynamic in nature, therefore the failure rate is
always high. By the occurrence of failure, performance of application executing in the cloud
degrades eventually due to which system does not perform robustly. It is very essential to
evaluate the failures efficiently. When failures occur in the cloud environment, for tolerating
these failures various fault tolerance techniques can be used. Existing fault tolerance techniques
solely able to mitigate the failures without considering efficient failure prediction mechanism.
Occasionally for tolerating the failures, more intensive FTT can be used which handle the failure
efficiently but cause in increase in the overhead. Moreover, less intensive FTT can be used that
cannot handle the failure efficiently. Furthermore, these tolerance techniques generate
unnecessary overhead. To address this problem, “Failure prediction based Failure Tolerance
technique (FPBFTT)” is used. In cloud computing if failures are to be predicted effectively and
efficiently before its actual happening, this phenomena assist to achieve the reliability, higher
performance, robustness, availability and overcome the problem of overhead .
In order to make cloud faults tolerant first it is essential to predict the percentage of the
host failures then apply different tolerance techniques according to condition. There are variety
of reactive fault tolerance techniques namely check pointing, replication, retrying and job
migration. We need to see that which technique is applicable under what situation, imagine if
application is very large and data intensive then check pointing strategy is terrific. The very
naïve and simple strategy is to retrying, where we can re-execute the failed task again and again.
On the other hand, in some cases when we want to execute a task to achieve some aspect of QoS
(i.e., high throughput, efficiency in task execution, high availability) then we can launch the
replica on a certain resource to get the intended results. In this case replication strategy is better
as compared to the both of above.
7
1.5 Main Contribution of Thesis
Following are the main contribution of this thesis:
 A failure prediction based fault tolerance technique (FPBFTT) was proposed.
 The proposed technique first predicts the failure on the bases of host.
 Suitable FTT e.g., retrying, replication and check pointing are applied on the hosts based
on the predicted failure scenario.
 To simulate the proposed technique, WORKFLOWSIM-1.0 was used as a simulating
tool.
 We evaluated the performance of our technique under different conditions using different
parameters such as Response Time, Throughput, Makespan, Cost, Budget, Deadline and
SLA violations. .
 We have compared the result of FPBFTT, with well-known existing FTT i.e.
“CheckPointing league Championship Algorithm (CPLCA)”, “Replication Based FT
Algorithm (RBFTA)” and with “Job Retry Technique (JRT)”.
 Simulation results obtain through experiments and their comparison with existing
technique leads us to the condition that our proposed failure prediction based fault
tolerance (FPBFTT) technique yield better results in all conditions.
 Proposed technique achieves better performance than existing fault tolerance technique.
1.6 Organization of Thesis
The rest of the thesis is organized as follows:

Chapter 2: This chapter presents the existing work of different FTT e.g. retrying, job migration,
and replication and failure handling through check pointing is described.
Chapter 3: This chapter describes the proposed system model and detail description about
implementation of retry, replication and check pointing fault tolerance algorithms.
Chapter 4: This chapter gives description about the Simulation Setup. Furthermore, results of
each simulation scenario of proposed work have also been discussed in this chapter.
Chapter 5: This Chapter gives conclusion of our work.
8
Chapter 2
Literature Review
9
2.1 Outline
This chapter takes start from the literature review of fault tolerance techniques implemented in
cloud. After that, a table of different FTT is presented. At the end, the summary of literature
review in the form of table is also listed.
2.2 Fault Tolerance – An Overview

Fault tolerance aim is to achieve robustness and dependability in any system. Fault tolerance
deals with quick repairing and replacement of flawed devices to retain the system. Fault
tolerance policies can be classified into two techniques named as proactive FT and reactive FT
[10].
2.2.1 Proactive fault tolerance policies
The proactive fault tolerance policy tolerates the failure before its actual happening.
Following techniques come under this method [11].
1. Self-Healing
When various instances of the same applications, running on different virtual machine.
Application instances failure can be automatically handle by the multiple virtual machines. In
this policy Improvement is achieved without the replication and redundancy.
2. Preemptive Migration
In this policy, an application is continuously monitored, if any VM found overloaded then some
of its resources are migrated to other VM.
3. Software Rejuvenation
10
In this policy, after a certain period system rebooted and every time the system is started with a
new state.
2.2.2 Reactive fault tolerance policies
When the actual failure completely occurs, reactive fault tolerance policies reduce or minimize
the failure effect on the application.
Following techniques come under this method [12]:
1. Check Pointing
For big application, this technique is very effective .If failure occur, than the task restarted from
the recent check point onwards rather than starting from initial point.
2. Replication
This technique creates a duplicate version of the data. Various tasks are replicated over different
resources, and if due to any failure data crashes, it can be recovered from the duplicated copy.
Replication can be implemented by using tools like AmazonEc2 and Hadoop.
3. Job migration
Sometime a machine get fail and unable to complete the job. At the time of failure the job is
migrated to working machine using HA-proxy.
4. Retry
Retry is the easiest task level technique. This technique is used to retries the failed task on the
same cloud resource.
5. Task Resubmission
The most widely used fault tolerance technique for scientific workflow system is the task
Resubmission. When any failure is detected, it can be resubmitted at a runtime to the same or the
different resources.
Table 1: Comparison of different Fault Tolerance Techniques
Failure Tolerance Policies Tool Implementation Fault Detected Application Type
11
Techniques Environment
Software Rejuvenation Proactive Assure Virtual Machine Host, Network Fault Tolerance
Failure
Self-Healing Proactive HA-Proxy/Assure Virtual Machine Process/Node Failure/ Fault Tolerance/

Host, Network Failure Load Balancing
Preemptive Migration Proactive HA-Proxy Virtual Machine Process/Node Failure Fault Tolerance/
Load Balancing
Check Pointing Reactive SHelp/Assure Virtual Machine Application Failure/ Fault Tolerance
Node Failure/Host,
Network Failure
Replication Reactive HA-proxy/AmazonEc2 Cloud Environment Application Failure/ Fault Tolerance/

Node Failure Load Balancing
Job migration Reactive HA-proxy/Hadoop Virtual Machine/ Application Failure/ Fault Tolerance /
Cloud Environment Node Failure/Process Load Balancing
Failure
Retry Reactive Assure Virtual Machine Host, Network Fault Tolerance

Failure
2.3 Fault Tolerance Models:
On the bases of above types of FTT several models are implemented. Table 2 summarized the
comparison of various fault tolerance models [12].
i. LLFT-Low Latency FT
Low latency FT model has been presented by Zhao Wenbing et al. [13]. Within a cloud
environment this model is installed. Using replication method this model tolerates the fault.
ii. FTM-Fault Tolerance Management

Jhawar Ravi in [14] has given a FT management approach (FTM) in light of virtualized
technology to explore FT which plainly expand the accessibility and reliability of
applications which are utilized as a part of distributed computing.
12
iii. Model for Intelligent Task Failure Detection
Intelligent task failure detection model [15] has proposed by Anju Bala. This model is
based on the proactive FTT. The aim of this model is to predict the task failure for
scientific workflow applications.
iv. FRAS-Fault Tolerance & Recovery Agent System

In [16], Lee Hwamin has given the concept of FT and recovery agent system. This model is
an agent based system and the recovery agent achieve on rollback technique when failures
occurs.
v. VDC-Virtual Data Centers

The concept of Virtual Data Centers has provided by Joshi Sagar which is based on
migration FT technique [17]. According to this model if a Vm is overloaded then it’s some
resources are migrated to other Vm to handle server failure.
vi. FT Techniques & Algorithms in Cloud Computing

FT techniques and algorithm in cloud computing has been proposed in [4]. This model uses
proactive FT technique to predict the faults in advance and take proper actions before or
after failure occurs.
vii. FT-Cloud
Zibin Zheng [18] has proposed the concept of FT-Cloud. In this model significant
component is determined based on the ranking. This Model is used for protection against
reliability, crash and value faults.
Table 2: Comparison of Different FT Models
Name of model Fault tolerance Procedure Policies Challenges
LLFT Using replication technique it runs for distributed Reactive Overhead and cost increase for
Applications maintain replicas.
13
FTM Virtualization technology based on availability, Reactive Lack of in depth analysis of
Reliability & on demand Service. FT services
Model for intelligent Predict the failed task for scientific WF application Proactive A standard technique can be
Task failure detection produced in cloud environment
for estimating the routines of
FT segment in evaluation with
parallel ones.
RAS Agent based system based on recovery of the system Reactive It does not apply on practical
Applications
VDC Based on migration technique Reactive More server utilization can be

there.
Fault tolerance To predict fault in advanced proactive techniques Proactive Need to have a compact model
Techniques & are used. Which will cover most extreme
Algorithm in CC FT aspect.
FT-Cloud Significant component is measured based on the Proactive It doesn’t contain latency
Ranking
2.4 Literature Review:
Vidhi Sutaria et al. described a fault prediction of hardware and software and mitigation
techniques in [19] .They presents, in brief about the failure prediction using fault tree analysis.
After identifying the reason of failure they describe fault tolerance techniques for the mitigation
of faults. Results demonstrated that the proposed technique can efficiently predict the faults and
also affectively mitigate these faults.
Nagios monitoring tool was used by Anu Wadhwa and Anju Bala in [20] to analyze the
faults. For the implementation of proactive FT the proposed monitoring tool gather information
from various remote nodes. Monitoring server is responsible for analyzing these certain metric
values. After analysis, results are generated which determine the status of the resources and
services running on them. Basic state labels for a host include ‘ok or up’, ‘warning’, ‘critical or
down’, ‘unknown’ and ‘unreachable’. The monitored content proactively prevents the system
from failures. Experimental results showed that the monitoring of various faults, improves the
reliability and availability of cloud services.
The author in [21] proposed a Recurrent Neural Network (RNN) and Ensemble method
to predict the application and jobs failure. Application failures are cause due to the increase in
14
the failure rate of hardware and software. Traditional technique e.g. Hidden Markov Model and
distribution based methods consider no dependencies that exists in the domain of time. In
contrast, RNN are revolving around the feed forward network, they can manage changeable
length in the domain of time. Results demonstrated that the proposed method can predict 84% of
detection rate and 20% of false alarm rate.
In [22], a technique named as virtualization and FTT was proposed. This technique was
used to maximize the availability of the system and to minimize the service time. Cloud manager
and decision making module was used for handling the faults and for managing the virtualization
and load balancing. Virtualization and load balancing are done in the first step whereas faults
tolerance achieve in the next step by redundancy and check pointing. Fault handler is a part of
virtualization. It finds and block the unrecoverable defective nodes and confine these virtual
nodes from use. From the recoverable nodes, temporary software faults are also removed by fault
handler
and make these nodes available for future request. Result demonstrated that the proposed
technique gives an excellent performance.
In [23], Jameela Al-Jaroodi et.al proposed an efficient algorithm named as “A Delay

tolerant FT algorithm”. The objective of this algorithm was to decrease the execution time and
the overhead of fault discovery and fault recovery is minimized by this algorithm. This algorithm
divides the whole work among the group of servers in pair in such a way that every pair is
responsible to handle its own work from opposite direction. The phenomena of dividing the work
helps the servers to finished their work at a same time. Result demonstrated that the working of
this algorithm is efficient.
A proactive FT approach [24] to HPC in to cloud was proposed by Egwutuoha et al. The
principle target of this approach was to decrease the wall clock execution time in the presence of
fault. To tolerate faults, an avoidance mechanism was used by the proactive fault tolerance. The
proposed algorithm doesn’t depend on a spare node precedent to prediction of a failure. Result
concluded that the proposed approach reduced the execution time of computational intensive
applications by 30% and frequency of these applications can be reduced by 50%. Furthermore,
reducing the wall execution time results in the reduce energy consumption.
15
Table 3: Comparison of proactive and reactive fault tolerance policies.
Paper No. Methodology Goals Advantages Disadvantages Implementation

Environment
[19] 1.Fault tree analysis Consider both hardware Provide system availability An unnecessary overhead WorkflowSim
2.FT techniques and software failures & robustness has to bear
[20] Nagios Monitoring Proactively determine the Improves the reliability & No preventive measure Cloud
Tool status of the resource availability of cloud services should have been taken
if the prediction is wrong.
[21] Recurrent Neural Detect Failure of TPR is high i.e., 84% Kill those jobs that are Cloud
Network & Ensemble application /jobs predicted as failed.
Methods
[22] Replication Replica will be created on Minimize the service time Generate a lot of overhead Cloud
Fault tolerance those Vm whose success & maximize the system for keeping replicas
Technique rate is higher availability
[23] A delay-tolerance Effectively reduce the Minimize the fault Generate overhead for Cloud
fault tolerance execution time of dist- discovery and recovery keeping multiple copies
algorithm ributed tasks by paralle- overhead. Of replicated servers
lizing the tasks.
[24] A proactive Fault Reduce wall clock Execution Energy Consumption - Cloud
tolerance Technique Time
16
Chapter 3
System Model
3.1 Outline
This chapter, starts with the proposed system architecture. It also defines the various components
of proposed model. Algorithms for the core components of model are also explained in this
chapter.
3.2 Proposed System Architecture

This section explains the proposed model that can predict the resource failure gracefully and
after prediction this model handles the failures by applying various fault tolerance technique. The
detail explanation regarding the key components of the system model of the proposed technique
is discussed in further sections.
17
Figure 2 Proposed System Architecture
3.2.1 Service Request Examiner and Admission Control (SRE & AC)
When a user(s) submit a job(s) it reaches to the SRE & AC. SRE & AC accept those service
requests that meet the SLA and reject those services that put overburden on the limited resource
available.
3.2.2 Failure Prediction Manager
Failure prediction manager is responsible for predicting the host level failure percentage we take
montage workflow [25]. Four sizes (25, 50, 100, and 1000) jobs are chosen for this experiment
[26]. We run different jobs of montage to predict the failure percentage on each host. For this
purpose, we use Waikato Environment for Knowledge Analysis tool (WEKA). This tool is
18
written in java and is a famous suite of machine learning software. WEKA contains collection of
algorithms for analysis of data and predictive modelling. In our case we use WEKA tool for
predicting the failure on each host.
1. Preprocessing 2. Classifier
Dataset C4.5 (J48) Decision
Load the dataset preparation before
Data Set Tree Algorithm
(CSV file in weka applying data
format)
[7]
mining techniques
3. Test Option 4. Training 5. Testing 6. Output
Percentage Split 66 34 Predicted Status of Jobs

[38] Percent Percent (Success/Failure)
On different Host
Figure 3.2: Flow of Host Level Failure Prediction
3.2.2.1 Failure Prediction Using WEKA
Dataset Preparation: We take Montage workflow as an application for the preparation of dataset.
Various attributes containing the failure information of the tasks have been gathered using
different classes of WorkflowSim. Each attribute along with its description is defined in the
below table.
Table 4: Attributes List

S.No Attribute Name Description
1 Cloudlet ID Id of cloudlet
2 Vm ID Id of Vm
3 Host ID Id of host
4 Data Center ID Id of data center
5 Time Total execution time of cloudlet
6 Start Time Start time of cloudlet
7 Finish Time Finish time of cloudlet
8 Depth Level of task/cloudlet
9 Status Success/Failure
19
Cloudlet ID indicate which cloudlet id fails on the resource. Vm id signifies Vm number on
which cloudlet has been failed. A host id shows about the detail of Vm (on which the cloudlet
was failed) is present in which host. DataCenter id indicates about the number datacenters that
are present. The attribute of time is used to know about the total execution time, start and finish
time of the cloudlet. As there are various levels in the workflows therefore, depth specifies the
level of task in workflow. Status indicates about the success and failure of the cloudlet.
Preprocessing: Preprocessing is the first step, in which Dataset is prepare before applying data
mining techniques. The selection of attribute is done in this step.
Classifier: We use J48 Decision tree algorithm because this algorithm achieved highest accuracy
than simple cart, Reptree and NBtree decision trees [27]. From existing literature it is concluded
that j48 outperform than other decision tree when different datasets are given to this algorithm.
Moreover, the execution time of j48 is less than other decision tree algorithms when different
datasets are provided. J48 is the advanced version of C4.5 algorithm. This algorithm uses a tree
like structure for building its model. This algorithm is used to classify an unknown sample.
Following nodes are used for building a tree.
1. Start Node: Start node is the top most node, also called as root node. The attribute having
highest gain ratio is selected as root node.
2. Internal Node: These nodes denote a test on an attribute.
3. Leaf Node: Leaf node is used to represent the labels of class.
4. Branch: Represent an outcome of the test.
In case of our dataset j48 categorized the variables into two categories. First is the dependent
variable that are actually the target variable. Second type of variable is independent variables
which are used to predict the dependent or target variable. C4.5 algorithm is actually named as
j48 in WEKA. This algorithm select the attribute having higher gain ratio. Gain ration can be
calculated by the given equation.
Gain(S ,T )
Gain Ratio (S, T) =
SplitInformation (S , T )
(1)
20
Where, S is the attribute of the training data.
Gain (S, T) is the information gain of attribute S. Gain (S, T) can be computed by the below
equation.
¿ Entropy (Sv )¿ ¿
Gain (S, T) = Entropy (S) – ∑ ¿ T s , v∨
¿ T s∨¿
v ∈value (T s )
(2)
Where, Entropy(S) is the entropy of attribute S and can be computed by the below equation.
c
Entropy(S) = - ∑ p ( S , j ) ×log 2 p (S , j)
j=1
(3)
Where c denotes the number of classes and p(S,j ) is the proportion of instances in S that are
assigned to j–th class.
SplitInformation (S, T) is split information of attribute S. It can be calculated by Equation 4.
¿ ¿¿
SplitInformation (S, T) = - ∑ ¿ T s , v∨
¿ T s∨¿ × log (
v ∈value (T s )
¿ T s , v∨ ¿ ¿
¿ T s∨¿¿ ) (4)
SplitInformation expresses entropy on potential information generated by portioning training

data T, into a number of different variables which are owned by the attribute S.
Below is the pseudo code of the j48 (C4.5) algorithm.
Algorithm: C4.5 (j48) Algorithm
21
Input: T: Training Dataset, S: Attributes
Output: Decision tree Tree
1. if (T = = φ ) then
2. return failure
3. end_if
4. if (S = = φ ) then
5. Return Tree as a single node with most frequent class label in T
6. end_if
7. Set Tree= { }
8. for a ∈ S do
9. Set Information(a,T)=0 and SplitInformation(a,T)=0
10. Compute Entropy(a) using Eq.[3]
11. for v∈values(a,T) do
12. Set Ta,v as the subset of T with attribute a = v
¿
13. Information (a,T )=¿ T a ,v ∨ ¿ T ∨¿ ¿ ¿Entropy(av)
a
¿ T a ,v ∨ ¿ ¿
14. SplitInformation(a,T) = - ¿ T a∨¿ log ¿ T a ,v ∨ ¿ ¿¿¿
¿ T a∨¿
15. end_for
16. Gain(a,T)= Entropy(a) – Information (a,T)
Gain(a , T )
17. GainRatio (a,T) =
SplitInformation (a , T )
18. end_for;
19. Set abest =max { GainRatio(a,T)}
20. attach abest into a Tree
21. for v ∈values (abest , T ) do
22. Call C4.5 (Ta , v)
23. end_for
24. return_Tree
Test Option: Instead of selecting 10 Fold cross validation we select percentage split method.
Percentage split method determines the highest accuracy of the model. In percentage split
22
method, we select 66% of the whole data as a training data [38] whereas remaining 34% data is
used for testing. In case of our data set, j48 algorithm using percentage split of 66% achieved
accuracy of 85.9%.
Output file: In the output file the predicted status of job (success/failure) on different host are
mention.
3.2.3 Failure Information Manager
Failure information manager keep the track of the percentage of the failed resources. Failure
advisor, first contact with the failure information manager about the failed percentage of the
resources then applies suitable fault tolerance technique.
3.2.4 Resource Information Server
Resource information server when receives task completion message from a certain task it
increase the number of success of the resource Ns when it notify by retry manager, replication
manager and checkpoint manager (CPM). Furthermore, in case of resource failure message, the
checkpoint manager (CPM) notifies RIS to increase Nf.
Algorithm: Resource Information Server
Input: Status of task return by Algo. Retry, Replication, CheckPointing
Output: Update FR of resource
1. if (RIS receives task_status → Success)

2. Get (resource on which task_status → Success)
3. Increment Ns
4. else (RIS receives task_status → Failed)
5. Get (resource on which task_status → Failed)
6. Increment Nf
7. end_if
3.2.5 Failure Advisor
23
Failure advisor is responsible for applying different fault tolerance technique which is applicable
for the situation. Failure advisor, applied retry fault tolerant technique on the host having lower
percentage of failures. Similarly, replication FTT is applied on the host having medium
percentage of failures whereas check pointing FTT is applied on the host having higher
percentage of failures. The lower and upper threshold are set to 45% and 81% [39].
Algorithm: Select Best Fault Tolerance Technique
Input: List (Tn , Rn): List of tasks and resources
Output: Best Fault Tolerance Strategy for given scenario
1. Procedure MAX-MIN(List(Tn, Rn)

2. TL ← Tn
3. RL ← Rn
4. for all tasks (TL) do
5. find largest task Ti
6. for all resources (RL ) do
7. find resource Ri with minimum execution time for task Ti
8. end_for
9. Get Ri for Ti
10. if (Ri have FP ≤ Lower_Threshold)
11. Strategy←Retry
12. else if (Ri have FP ¿ Lower_Threshold && FP ¿ Upper_Threshold)
13. Strategy←Replication
14. else (Ri have FP ≥ Upper_Threshold)
15. Strategy←Checkpointing
16. end_for
17. return_Strategy;
24
3.2.5.1 Retry Fault Tolerance Technique
Task retry is the easiest task level technique. This technique is used to retries the failed task from
the scratch on the same resource or different resource. The Architecture of the Retry FTT is
given below in Figure 3.3.
The below algorithm shows the procedure for retrying the failed task. The input for the algorithm
is comprise of the list of failed tasks and resources. This algorithm mapped the failed tasks to
required resources.
Figure 3 Retry Architecture
Algorithm: Task_Retry
Input: List (Ft , Rn): List of failed tasks and resources
Output: M (Ft, Rn): Mapped list of failed tasks to resources
1. Procedure TASK RETRY (List ( Rn , Ft))

2. TL ← F t
3. RL ← Rn
4. M(Ft , Rn )←0
5. for all failed tasks (TL) do
25
6. Call Max-Min() and find task Ti
7. for all resources (RL) do
8. Call Max-Min() and find resource Ri for task Ti
9. end_for
10. M(Ft,Rn) ← M(Ft,Rn) + M(Ti,Ri)
11. Compute FR for resource Ri using Eq. [1]
12. if (FR for resource Ri ≤ Lower_Threshold/3)
13. Set max_retries=3
14. else (FR for resource Ri ¿ Lower_Threshold/3
15. Set max_retries=5
16. end_if
17. if (Retry_Manager receives task_status→ Success)
18. Send message to RIS
19. else (Retry_Manager receives task_status→ Failed)
21. GoTo Step 5
22. end_if
23. end_for
24. end
3.2.5.2. Replication Fault Tolerance Technique

This technique creates a duplicate version of the data. Various tasks are replicated over different
resources, and if due to any failure data crashes, it can be recovered from the duplicated copy.
The Architecture of the Replication FTT is given below in Figure 3.4.
26
Figure 4 Replication Architecture
Replication Information Manager (RIM): RIM computes the number of replicas for the task on
the bases of failure rate fr of the resources assigned to task in the scheduling step. The main
functionality of scheduling monitor is to sort the most suitable and reliable resource for the user
task. Scheduling monitor gets the information regarding the suitable resource for the task, from
resource information server (RIS). Moreover, it decides about the backup resources that will
execute these copies according to the load history of the resources, assigned to the task. RIM gets
the load history from resource information server.
Replication Information Manager Responsibility:
A. Determine no. of job replica.
The below algorithm decides about the number of task replica. The no. of task replica is
not fixed but it is dynamic for all tasks. The failure rate of the resources FR n will increase
in direct proportion with the no. of task replicas. The most extreme no. of copies ought to
be equivalent to the no. of suitable resources. The minimun number of copies is equal to
1.
The below algorithm determine about the number of task replicas. There should be atleast
one replica for the task. The algorithm compares the value of failure rate FR j with the
27
failure rate FRn. Addditional replica is created when the given condition occurs true.
Algorithm terminates when FRj is less than FRn.
The failure rate (FR) of resource j is defined in Equation 5. Whereas, the average failure rate of
the resources is defined in Equation 6.
Nf
Failure Rate ( FRj )= (5)
N s+ Nf
Where, Nf is the number of times resource j has failed to complete the job and NS is the time the
resource j has completed the job successfully.
∑ FR j
FR n= j=1 ( 6)
n
Algorithm: Replication_Manager
Input: Ti :Task, Ri :Resource
Output: Return No. of replicas
1. for each task ‘Ti’ do

2. task_replica ←0;
3. Call Max-Min Algorithm and find resource for task Ti
4. j← first resource for task ‘Ti’
5. do
6. increment task_replica
7. Select next resource from unallocated resources
8. Call Max-Min Algorithm for the selection of next resource for task Ti
9. j← next resource;
10. Compute FRj and FRn using Eq. [5] and [6]
11. while (FRj >= FRn )
12. if (FRj < FRn ) then GoTo Step 14
28
13. end_for
14. Exit
B. Determine the backup resources.

Backup resources will be chosen after deciding about the no. of task replicas. Only those
resources are consider for backup resources whose failure rate are less than the average
failure rate of the unallocated resources. Then, the selected resources are sort out in
ascending order according to the resources load history. Load history of the resources is
stored by RIS (resource information server). Equation 7 is used to find out the load
history of the resource ‘j’.
WC
Load History= (3)
Sj
Where, Wc = Total work done by the resource j
Sj = The speed of the resource j that can measured in MIPS.
Algorithm: Selection of Backup Resources
Input: List (Rn): List of unallocated Resources

Output: Selection of Backup Resources
1. for each resource in the unallocated list
2. if (FRj < FRn)
3. Keep the resource in the unallocated list
4. else if (FRj > FRn)
5. Remove ‘j’ from the list
6. end_if
7. end_for
8. Compute LH using Eq.[7]
9. Sortresources_ascending( LH)
10. j←1
11. while (task_replica > 0)
29
12. select j as a backup resource
13. task_replica - -
14. j ++
15. end_while
3.2.5.3 Check Pointing Fault Tolerance Technique:

For big application, this technique is very effective .If failure occur, than the task restarted from
the recent check point onwards rather than starting from initial point. The Architecture of the
Check Pointing FTT is given below in Figure 3.5.
Figure 5 CheckPointing Architecture

Checkpoint Repository: Checkpoint repository stores the status of the checkpoints. When it
receives a status of a new checkpoint, old one overwrites in checkpoint repository. Checkpoint
repository removes the record of the job from which its gain “job completion message”.
Checkpoint Manager (CPM): The major functionality of the CPM is to determine about the no.
of checkpoints and checkpoints interval for the task. Check pointing intervals and check pointing
number are determined on the bases of resource failure rate instead of resource failure index. The
failure rate of the resources j can be calculated by equation 5 [28], Equation 8 calculates the
number of checkpoints and Equation 9 determine the checkpoint interval time.
30
. Checkpoint Number = Rt * FR (8)
Where, Rt is the response time and is defined as the time period from job submission to
completion.
Rt
Checkpoint Interval= (9)
Rt∗FR
Algorithm: CheckPointing_Manager (CPM)
Input: Ti : Task, Ri : Resource, RT (Ri, Ti)

Output: CP_Number, CP_Interval
1. for (each task Ti assigned to a resource Ri ) do
2. Compute FR for resource Ri using Eq. [1]
3. Compute CP_Number using Eq. [8]
4. Compute CP_Interval using Eq. [9]
5. if (CP_Manager receives task_status → Success)
7. else (CP_Manager receives task_status → Failed)
9. Get Last_CP status from CP_Repository
10. if (CP_Status = Exist)
11. CP_Manager execute the task from last CP
12. else
13. Submit this task from start to the scheduler for re-scheduling
14. end_if
15. end_for
Resource information server increase the number of success of the resource N s when it notify by
the checkpoint manager (CPM) who receives task completion message from a certain task.
Furthermore, in case of task failure message, the checkpoint manager (CPM) notifies RIS to
increase Nf. In this scenario, the checkpoint manager (CPM) connects to the checkpoint
31
repository to get the status of the last checkpoint. If checkpoint exists then the CPM dispatches
not completed part of the task with the last checkpoint. If checkpoint doesn’t exist CPM submit
this task from start to the scheduler for re-scheduling.
3.2.6 VM Placement Manager
This Module is responsible for finding a new placement for the VMs to be migrated. VM
placement manager directly transfer the job to the dispatcher. Starting the execution of accepted
service requests on allocated VMs is the responsibility of dispatcher.
32
Chapter 4
Simulation Setup and Results
33
4.1 Outline
This chapter provides details of simulation environment. In this chapter we have run different
jobs of Montage workflow to predict the failure on host then we run same jobs to check FTT.
This chapter starts with the simulation setup where we discussed resource and application
modeling. In the next section, performance evaluation parameters have been discussed. Finally,
the last section provided the results for each performance metric followed by discussion of each
obtained result.
4.2 Simulation Setup

We simulated and implemented our algorithms in java language based simulator named as
WorkflowSim [29]. WorkflowSim, which broadens the current CloudSim simulator by giving an
elevated layer of workflow management. Workflow provides a distinctive and productive
evaluation platform.
4.2.1 Resource Modelling
For the simulation of the proposed strategy, we consider 12 virtual machines (VM) [30]. The
configuration of the VM is presented in Table 5. Moreover, we use 3 Host and 1 DataCenter
[31]. The parameters that are used for Host and DataCenter are represented in Table 6 and Table
7 respectively [30].
Table 5: VM Configuration
Parameter Value
Architecture X86
OS LINUX
34
Vmm XEN
RAM 512MB
MIPS Rating 250,500,750,1000 MIPS
Table 6: Host Configuration
Parameter Value
RAM 2048MB
Storage 1000000 MB
Bandwidth 100000 Mbps
Table 7: DataCenter Configuration
Parameter Value
Architecture X68
OS LINUX
Vmm XEN
Cost 3.0
Storage Cost 0.001
4.2.2 Application Modelling
For our proposed technique, we take Montage workflow [25]. Four sizes (25, 50, 100, and 1000)
jobs are chosen for this experiment [26]. The Montage application has been created by
35
NASA/IPAC [32]. It was represented as a workflow and can be executed in the environment of
cloud and grid.
To evaluate performance of our proposed technique we run different jobs of Montage to predict
the failure on host then we run same jobs to check FTT. To validate the results we compared the
proposed technique with existing techniques under different circumstances.
4.3 Evaluation Parameters

For performance evaluation of our proposed technique we have consider the following
parameters.
4.3.1 Response Time
Response Time is “the time period from job submission to completion”. We have calculated it by
using following formula.
Response Time = Job completion time – Job submission time (10)
4.3.2 Throughput
Throughput is defined as “no. of task completed per unit time”. OR we can say that “n number of
tasks submitted divided by total amount of time need to compute n number of tasks
successfully”. We have calculated it by using following formula.
n
Throughput (n) = (11)
Tn
Where, n = total no. of cloudlet submitted and Tn = total time needed to complete n cloudlet.
4.3.3 Makespan
Makespan is defined as “total execution time of all N jobs”. We have calculated Makespan by
using following formula.
Makespan ETmax = max {ETi, i=1, 2, 3... N} (12)
Where, ETi represents the completion time of jobi.
4.3.4 Cost
36
Cost is the budget it takes to finish a batch of jobs it can also be considered as measure of
efficiency. For each job of scientific workflow execution can be denoted by C j and can be
calculated with the help of Eq. [13].
Cj = Cost_Processing + Cost_Memory + Cost_Storage + Cost_Bandwidth (13)
4.3.5 SLA-Violation
SLA violation occurs when the user doesn’t get their requested resources. In technical term,
we can say that SLA violations occur when VM cannot acquire the amount of MIPS that
are requested. For the SLA violations calculations we use Eq. [14]
SLA =
∑ ( Requested MIPS )−∑ ( Allocated MIPS )
∑ (requeste d MIPS )
(14)
4.3.6 Deadline
Deadline is predefine finishing time to execute a batch of jobs [40]. We can calculate deadline by
using Eq. [15]
DL= Computation_Time + Communication_Time + Overhead (15)
DL represents the deadline. Where, overhead is the extra time consumed on re execution of
failed job/tasks of scientific workflow.
4.3.7 Budget
Budget is the predefined cost required to execute the entire batch of jobs [40]. Budget can be
computed with the help of Eq. [16]. Budget can be represented by B.
B= Cost_Computation + Cost_Communication + Overhead (16)
Where, overhead is the extra time consumed on re execution of failed job/tasks of scientific
workflow.
37
4.4 Results and Discussion
In this section, results regarding different parameters which we have applied on different FTT
have been explained.
4.4.1 Response Time
Response Time is the time period from job submission to completion. Average response time is
an important parameter that helps in determining the performance of a FPBFTT. If FPBFTT
takes less time to respond then response time decreases due to which performance increase.
Average response time of three existing i.e., Job retry Technique (JRT), Replication based fault
tolerance technique (RBFTT) [34] and CheckPointing League Championship Algorithm CPLCA
[35] is compared with proposed technique. Figures 4.1- 4.4 depict the response time of the
existing fault tolerance with the proposed failure prediction based fault tolerance technique
(FPBFTT), with different percentage of fault injected from 5% to 30% [34] and with various
number of job submitted [25,50,100,1000]. On X-axis, percentage of injected failures are
mentioned. On Y-axis, average response time of existing and proposed fault tolerance technique
is mentioned in second.
Figure 6 Total jobs submitted=25
Figure 4.1 shows that in case of 25 jobs the response time of FPBFTT is 9.24% better than
38
39
Figures 4.1- 4.4 shows the response time of the proposed technique (FPBFTT) that is better than
the response time of the existing FTT. The response time of the check pointing technique is not
as good as compared to the replication fault tolerance [36] and failure prediction based fault
tolerance technique. Figure 6-9 demonstrates that the response time of the job retry fault
tolerance technique is greater than the CPLCA, RBFTA and the FPBFTT. This is because job
retry technique retries the failed task from the scratch. When the resource failure occurs in case
of check pointing, job is migrated to another resource this can increase the response time of the
technique. Moreover, check pointing technique generate a lot of overhead due to periodically
saving the states. Similarly, when resource failure occurs in case of replication, there is no need
to migrate the job between different resources so that is why its response time is downgraded
then check pointing. Furthermore, replication technique bears an extra overhead for keeping the
different copies of replica. Existing FTT doesn’t consider the prediction phenomena whereas our
technique consider the prediction scenario and takes decisions on the basis of these prediction
due to which our proposed technique shows good result than existing FTT. In general, the
response time of the existing and proposed fault tolerance increases with the increase in the
percentage of the fault injected and with the increase in the job submitted as shown in Figures
4.1- 4.4.
40
4.4.2 Throughput
Throughput is number of jobs completed per unit time. A robust and efficient technique would
be that provide high throughput. If FPBFTT takes more time to respond in the presence of failure
then throughput decreases. Throughput of three existing i.e., Job retry Technique (JRT),
Replication based fault tolerance technique (RBFTT) [34] and CheckPointing League
Championship Algorithm CPLCA [35] is compared with proposed technique Figures 4.5-4.8
depict the throughput of the existing fault tolerance techniques with the proposed failure
prediction based fault tolerance technique (FPBFTT), with different percentage of fault injected
from 5% to 30% [34] and with various number of job submitted [25, 50, 100, 1000]. On X-axis,
percentage of injected failures are mentioned. On Y-axis, average throughput of existing and
proposed fault tolerance technique is mention in second.
Figure 4.5 shows that in case of 25 jobs the throughput of FPBFTT is 23.44% better than
RBFTT, 40% better than CPLCA and 60% better than job retry technique.
41
Figure 4.6 shows that in case of 50 jobs the throughput of FPBFTT is 17.4% better than RBFTT,
37.8% better than CPLCA and 64.2% better than job retry technique.
Figure 4.7 shows that in case of 100 jobs the throughput of FPBFTT is 20% better than RBFTT,
39% better than CPLCA and 73.1% better than job retry technique.
42
Figure 4.8 shows that in case of 1000 jobs the throughput of FPBFTT is 22% better than RBFTT,
43% better than CPLCA and 72% better than job retry technique.
Figures 4.5-4.8 shows that the throughput of the proposed technique (FPBFTT) is superior to the
throughput of the existing FTT. Figure 10-13 demonstrates that the throughput of the job retry
technique is smaller than the CPLCA, RBFTA and the FPBFTT. This is because when failure
occur retry FTT retries the failed task from the beginning. Similarly, when the resource failure
occurs in case of check pointing, jobs are migrated to some other resources this can increase the
response time of the technique due to which throughput decreases. Furthermore, CheckPointing
algorithm periodically saving the states of the jobs in the form of checkpoints due to which there
is an increase in the overhead of check points. Similarly, when resource failure occurs in case of
replication, there is no need to migrate the job between different resources so that is why its
response time is smaller than the check pointing and throughput is greater than check pointing.
Existing FTT doesn’t consider the prediction phenomena whereas our technique consider the
prediction scenario and takes decisions on the basis of these prediction due to which our
proposed technique shows good result than existing FTT. In general, the throughput of the
existing and proposed fault tolerance decreases with the increase in the fault percentage injected
and increase in the submission of job as shown in Figures 4.5-4.8.
43
4.4.3 Makespan
Makespan is total execution time of all N jobs. Figures 4.9-4.12 depict the makespan of the
existing FTT i.e., Replication based fault tolerance algorithm (RBFTA) [34] and CheckPointing
League Championship Algorithm CPLCA [35] with the proposed failure prediction based fault
tolerance technique (FPBFTT), with different percentage of fault injected [34] and with various
no. of job submitted. On X-axis, percentage of injected failures are mentioned. On Y-axis,
makespan of existing and proposed fault tolerance technique is mentioned.
Figure 14 . Total jobs submitted=25
Figure 4.9 shows that in case of 25 jobs the makespan of FPBFTT is 17.5% better than RBFTT,
31.15% better than CPLCA and 49.64% better than job retry technique.
44
Figure 4.10 shows that in case of 50 jobs the makespan of FPBFTT is 14.44% better than
45
Figures 4.9-4.12 shows that, the makespan of the proposed technique (FPBFTT) is better than
the makespan of the existing FTT. Figure 14-17 demonstrates that when smaller no of jobs are
sent for execution all the algorithms returns makespan time with the FPBFTT shows only slight
improvement. Similarly, when we continue to increase the number of jobs from 25 to 50, 50 to
100 and at last from 100 to 1000, FPBFTT takes less time to execute the jobs than the remaining
algorithms because the propose algorithm take decision on the basis of failure prediction. Our
proposed technique takes decision on the basis of prediction phenomena due to which it
makespan is better whereas retry, checkpoint and replication technique generate a lot of overhead
and takes more time to complete the jobs. In general, the makespan of the existing and proposed
fault tolerance increases with the increase in the percentage of the fault injected and with the
increase in the job submitted [37] as shown in Figures 4.9-4.12. Moreover, when makespan is
lesser, better will be the efficiency of the algorithm.
4.4.4 Cost
Cost is the budget it takes to finish a batch of jobs it can also be considered as measure of
efficiency. Results in Figure 4.13 shows the execution cost of job retry technique (JRT),
Replication based fault tolerance technique (RBFTT) [34], CheckPointing League Championship
Algorithm CPLCA [35] with the proposed failure prediction based fault tolerance technique
46
(FPBFTT). Montage workflow traces [25, 50,100, and 1000] are represented on X-axis and Y-
axis illustrate the execution cost.
Figure 18 Cost Comparison of Proposed and Existing FTT
Figure 4.13 shows that the cost is minimum for proposed technique. The cost for replication
based FTT is high because job is continuously executing on the parallel resources and we have to
maintain various backup resources for the execution of job. The cost for CPLCA is less because
there is no extra resources that are executing the job parallel. The cost of retry technique is
higher than CPLCA because every time job failed it start from beginning that increase the
execution cost. The cost for proposed technique is less because the proposed technique tasks
decision for applying suitable FTT on the bases of failed scenario. CheckPointing FTT was
applied on the resources when the failure percentage was greater than the upper threshold.
Retrying FTT was used when the failure percentage was less than lower threshold. Furthermore,
replication technique was used when the failure rate was not too high or too low but it was in
between the upper and lower threshold. The result for proposed and existing FTT are plotted in
Figure 18 reveals that the average cost for FPBFTT is 1010.5 dollar for 25 jobs, 1899.90 dollar
for 50 jobs, 3142.21 dollar for 100 jobs and 40886.68 dollar for 1000 jobs of Montage workflow.
Similarly, the average cost of RBFTT is 2055.58 dollar for 25 jobs, 4029.51 dollar for 50 jobs,
6241.22 dollar for 100 jobs and 47171.91 for 1000 jobs of Montage workflow. The average cost
of CPLCA is 1094.47 dollar for 25 jobs, 1977.51 dollar for 50 jobs, 4109.81 dollar for 100 jobs
47
and 42718.54 for 1000 jobs of Montage workflow. And finally, the average cost for JRT is
1508.18 dollar for 25 jobs, 2214.51 dollar for 50 jobs, 4505.92 dollar for 100 jobs and 45210.12
for 1000 jobs of Montage workflow.
4.4.5 SLA-Violation
To compare the efficiency of our proposed technique, we evaluated performance using SLA
violations with other existing FTT. We define that SLA violation occurs when a given VM
cannot get the amount of resources that are requested or the services fails to finish within
deadline [41]. Figure 4.14 demonstrates the SLA violation percentage comparison of various
fault tolerance policies, in this figure y-axis shows the SLA violation percentage while x-axis
shows the number of jobs.
Figure 19 SLA violation for Proposed and Existing FTT
Experimental results show that FPBFTT significantly decreases SLA Violations compared to the
other fault tolerance techniques such as job retry technique (JRT), Replication based fault
tolerance technique (RBFTT) [34], CheckPointing League Championship Algorithm CPLCA.
The results indicate that FBPFTT reduced the percentage SLA violation more efficiently than the
other FTT. SLA violation of RBFTT is higher than other FTT because in case of replication
based FTT extra cost has to bear for the replicated server due to which SLA violates. Results
48
demonstrates that SLA-Violation for FPBFTT is 6.3% for 25 jobs, 12.90% for 50 jobs, 19% for
100 jobs and 25% for 1000 jobs of Montage workflow. Similarly, the SLA violations of RBFTT
is 10% for 25 jobs, 21.50% for 50 jobs, 42% for 100 jobs and 65% for 1000 jobs of Montage
workflow. The SLA violation of CPLCA is 7.20% for 25 jobs, 15.40% for 50 jobs, 25.40% for
100 jobs and 42% for 1000 jobs of Montage workflow. And finally, the SLA violation for JRT is
8.90% for 25 jobs, 18% for 50 jobs, 34.20% for 100 jobs and 56% for 1000 jobs of Montage
workflow. The result for proposed and existing FTT are plotted in Figure 4.14. Results reveals
that the SLA violation is minimum for the proposed technique.
4.4.6 Deadline
Deadline is predefine finishing time to execute a batch of jobs. Results in Figure 4.15 shows the
predefine finish time of existing FTT i.e., job retry technique (JRT), Replication based fault
tolerance technique (RBFTT) [34], CheckPointing League Championship Algorithm CPLCA
[35] with the proposed FPBFTT. Montage workflow traces [25, 50,100, and 1000] are
represented on X-axis and Y-axis illustrate the finishing time to execute a batch of jobs. Deadline
on Y-axis is represented in seconds.
Figure 20 Deadline Comparison of Proposed and Existing FTT
Figure 4.15 shows that the predefine finish time for executing the batch of jobs is minimum for
proposed technique. The deadline for job retry technique is higher because retry technique
49
include more computation time also the overhead increases for retrying the job from scratch. The
deadline for CPLCA is greater than RBFTT because when any failure occur last checkpoint is
taken from the repository that’s increase the time of communication as well as computation. The
predefine finish time for RBFTT is less than job retry and check Pointing FTT because if any
failure occur in case of RBFTT it is not essential for the job to migrated to the another resources
but it start its execution on the replicated server due to which the communication cost decreases.
4.4.7 Budget
Budget is the predefined cost required to execute the entire batch of jobs [40]. Results in Figure
4.16 shows the predefined cost that is requires to execute the traces of montage. Result are
evaluated on the basis of existing FTT i.e,. Job retry technique (JRT), Replication based fault
tolerance technique (RBFTT) [34], CheckPointing League Championship Algorithm CPLCA
[35] with the proposed failure prediction based fault tolerance technique (FPBFTT). Montage
workflow traces [25, 50,100, and 1000] are represented on X-axis and Y-axis illustrate the
execution cost.
Figure 21 Deadline Comparison of Proposed and Existing FTT
Figure 4.16 shows that the budget is minimum for proposed technique. The budget for proposed
technique is less because the proposed technique tasks decision for applying suitable FTT on the
bases of failed scenario. CheckPointing FTT was applied on the resources when the failure
percentage was greater than the upper threshold. Retrying FTT was used when the failure
50
percentage was less than lower threshold. Furthermore, replication technique was used when the
failure rate was not too high or too low but it was in between the upper and lower threshold. The
budget for replication based FTT is high because job is continuously executing on the parallel
resources and we have to maintain various backup resources for the execution of job. The budget
for JRT is also higher than CPLCA and FBPFTT this is because when any failure occur JRT
restart the execution from beginning this mechanism increases the computation cost as well as
the retry overhead. The cost for CPLCA is less because there is no extra resources that are
executing the job parallel.
Chapter 5
Conclusion and Future Work
51
5.1 Outline
This chapter gives conclusion of my research work.
5.2 Conclusion
In this thesis, “Failure Prediction Based FTT” has been proposed. The proposed technique
depends on the prediction based failure scenario of resource failure and takes decision for
applying appropriate fault tolerance technique based on the prediction phenomena. We evaluated
the performance of our technique under different conditions using different parameters such as
Response Time, Throughput, Makespan, Cost, Budget, Deadline and SLA Violation. We have
compared the result of FPBFTT, with well-known existing FTT i.e. “CheckPointing league
Championship Algorithm (CPLCA)”, “Replication Based FT Algorithm (RBFTA)” and with
“Job Retry Technique (JRT)”. Simulation results obtain through experiments and their
comparison with existing technique leads us to the condition that our proposed failure prediction
based fault tolerance (FPBFTT) technique yield better results in all conditions. Results reveal
that in case of 25 jobs, the response time of FPBFTT is 9.24% better than RBFTT, 15.1% better
than CPLCA and 22.5% better than job retry technique. In case of 50 jobs, the throughput of
FPBFTT is 17.4% better than RBFTT, 37.8% better than CPLCA and 64.2% better than job retry
technique. In case of 100 jobs, the makespan of FPBFTT is 16.08% better than RBFTT, 26.8%
better than CPLCA and 38.7% better than job retry technique . Proposed technique achieves
better performance than existing fault tolerance technique.
5.3 Future Work

 As a future work, we will consider the false positive rate.
52
 Reducing the FPR by including more features or attributes in learning, can assist the
proactive failure management, which depend on prediction so that results turn out to be more
powerful and spare more resources.
53

Raja Kaif Ali 1

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Raja Kaif Ali 1

Uploaded by

Copyright:

Available Formats

Failure Prediction Based Fault Tolerance Resource

Management Techniques in Cloud

COMSATS University Islamabad

Failure Prediction Based Fault Tolerance Resource

RAJA KAIF ALI

A Post Graduate Thesis submitted to the Department of Computer Science as partial

Name Registration Number

RAJA KAIF ALI CIIT/FA22-BCS-146/ATD

MAAM FAIZA QAZI

This thesis titled

Failure Prediction Based Fault Tolerance Resource

RAJA KAIF ALI

For the COMSATS University Islamabad, Abbottabad Campus

External Examiner: _______________________________________________________

Dr. Babar Nazir, Associate Professor

Department of Computer Sciences, CUI, Abbottabad

Dr. Imran Ali Khan

Department of Computer Sciences, CUI, Abbottabad

Signature of the Student: ___________FAHAD HASSAN__________________

This work is gratefully dedicated to

RAJA KAIF ALI

1.4 Problem Statement.................................................................................................................5

1.5 Main Contribution of Thesis..................................................................................................6

1.6 Organization of Thesis...........................................................................................................7

2.2 Fault Tolerance – An Overview............................................................................................9

2.2.1 Proactive fault tolerance policies....................................................................................9

2.2.2 Reactive fault tolerance policies.....................................................................................9

2.3 Fault Tolerance Models:......................................................................................................11

2.4 Literature Review:...............................................................................................................13

3.2 Proposed System Architecture.............................................................................................16

3.2.2 Failure Prediction Manager..........................................................................................17

3.2.3 Failure Information Manager........................................................................................21

3.2.4 Resource Information Server........................................................................................21

3.2.5 Failure Advisor.............................................................................................................22

3.2.6 VM Placement Manager...............................................................................................30

Simulation Setup and Results........................................................................................................31

4.2 Simulation Setup..................................................................................................................32

4.2.1 Resource Modelling......................................................................................................32

4.3 Evaluation Parameters.........................................................................................................34

4.3.1 Response Time..............................................................................................................34

4.4 Results and Discussion........................................................................................................35

4.4.1 Response Time..............................................................................................................35

5.3 Future Work.........................................................................................................................49

Figure 1 Layers of service model in cloud computing....................................................................4

Figure 1 Layers of service model in cloud computing

1.4 Problem Statement

1.6 Organization of Thesis

The rest of the thesis is organized as follows:

Chapter 5: This Chapter gives conclusion of our work.

2.2 Fault Tolerance – An Overview

2.2.1 Proactive fault tolerance policies

Following techniques come under this method [11].

2.2.2 Reactive fault tolerance policies

Following techniques come under this method [12]:

Table 1: Comparison of different Fault Tolerance Techniques

Failure Tolerance Policies Tool Implementation Fault Detected Application Type

Self-Healing Proactive HA-Proxy/Assure Virtual Machine Process/Node Failure/ Fault Tolerance/

Replication Reactive HA-proxy/AmazonEc2 Cloud Environment Application Failure/ Fault Tolerance/

Retry Reactive Assure Virtual Machine Host, Network Fault Tolerance

2.3 Fault Tolerance Models:

ii. FTM-Fault Tolerance Management

iv. FRAS-Fault Tolerance & Recovery Agent System

Signature of the Student: _FAHAD HASSAN________