

Multiagent and Grid Systems – An International Journal 15 (2019) 259–287 259
DOI 10.3233/MGS-190312
IOS Press

Towards increasing reliability of Amazon EC2 spot instances with a fault-tolerant multi-agent architecture

José Pergentino Araújo Neto^a,∗, Donald M. Pianto^b and Célia Ghedini Ralha^a

^a Department of Computer Science, University of Brasília, Brazil
^b Department of Statistics, University of Brasília, Brazil

Received 17 May 2019
Accepted 1 August 2019
Abstract. Cloud providers have recently offered their unused resources as transient instances. Amazon sells idle cloud resources as spot instances, priced by an auction-based market mechanism, to reduce cost without any availability guarantee. Thus, dynamically and autonomously managing cloud resources to execute user applications, ensuring greater reliability with cheaper spot instances, is an open problem. In this context, we propose a fault-tolerant multi-agent architecture as middleware between cloud providers and users to mediate access to a wide range of heterogeneous resources, providing a resilient application execution environment with a dynamic, flexible fault-tolerant mechanism based on adaptive checkpointing. Our architecture combines a case-based reasoning model with a survival analysis model to predict failure events and refine fault-tolerant plans with adequate parameters, increasing reliability while optimizing total execution time and costs. We evaluated the proposed architecture with real historical data collected from Amazon EC2 price changes, comprising approximately 21 million records and generating 1,362,816 scenarios stored in our case knowledge database. The results considering the time to revocation achieved high levels of accuracy (98%), with a gain of up to 74.48% in total execution time, reducing total cost when compared to other approaches in the literature.

Keywords: Cloud computing, spot instances, fault tolerance, adaptive checkpointing, case-based reasoning, survival analysis
1. Introduction

Cloud computing has evolved to offer services to execute applications with high flexibility, scalability and availability at affordable prices for cloud users and vendors [7]. It has emerged as a new paradigm that provides various types of computing resources to serve diverse application scenarios, being one of the most popular and important information technology trends [18].
Organizations are shifting their business to cloud providers, which should furnish resources to serve users in a transparent way over the Internet. The cloud architecture consists of distributed resources with a set of services offered under a pay-as-you-go model, being an interesting alternative for exploring resources that often appear to be unlimited [22].


∗Corresponding author: José Pergentino Araújo Neto, Department of Computer Science, University of Brasília, Brazil.
E-mail: paraujo@aluno.unb.br.

ISSN 1574-1702/19/$35.00 © 2019 – IOS Press and the authors. All rights reserved

The cloud Infrastructure as a Service (IaaS) is a base class that offers a set of Virtual Machines (VMs) with different configurations and capacities, allowing users to choose according to their needs. Since users' resource demands change constantly, cloud providers offer VM services at lower prices than dedicated (on-demand) machines to maximize infrastructure utilization. IaaS is a promising instrument to implement and execute distributed bag-of-tasks (BoT) applications [14]. A BoT is formed by a set of consistent applications (independent tasks) that can be executed in parallel using distributed hosts [8]. Computing BoT with memory-intensive applications has become common, being widely used in several domains, such as simulations, bioinformatics, data security, and cloud task scheduling [15,25,34,38].
Cloud providers realized that dedicated hardware is not being fully used, leaving considerable idle resources to be offered as unreliable VMs in the form of transient instances. These instances can be revoked irreversibly according to each cloud provider's rules, but the VMs are offered at considerably lower prices without any availability guarantee [30]. Amazon Elastic Compute Cloud (EC2) and Google Compute Engine (GCE) offer transient instances using different strategies. Amazon adopts an auction-based market mechanism with a dynamic pricing scenario based on users' bids. Google explores idle resources by provisioning VMs at a considerably lower static value without price changes, subject to usage time limits.
Despite cloud computing benefits, the use of transient instances raises many relevant issues that still pose critical challenges. Considering automated management issues we may cite: dynamic resource allocation to compute applications ensuring reliability; efficient resource scheduling to achieve high performance; dynamic flexible infrastructures to guarantee Quality of Service (QoS); accessibility to meet the desired Service Level Agreement (SLA); security aspects (e.g., integrity and confidentiality of data, authentication); and fault-tolerant strategies to ensure accessibility, availability, and reliability of services while achieving cost optimization. In this challenging scenario, the main problem this research addresses is the trade-off between the reliability needed to ensure the effectiveness of user applications and the low cost of spot instance resources without availability guarantees.
To face the research problem of effectively using spot instances, intelligent techniques are desirable. More specifically, intelligent agents with the ability to identify the environment in which they are inserted and act autonomously with learning skills would be useful [27]. As pointed out by [29], it is necessary to address three key factors to effectively use transient instances to fulfill user requests: (i) defining the right resource configuration according to user needs; (ii) creating a fault-tolerant strategy to avoid data loss if a failure occurs; and (iii) recovering from instance revocations, when they occur, to ensure application continuity.


Considering the use of autonomous agents to effectively manage spot instance resources, in this research we investigate how to dynamically and autonomously manage these resources to execute user applications ensuring greater reliability with optimized time and cost. Thus, we propose a fault-tolerant multi-agent architecture as middleware between cloud providers and users to mediate access to a wide range of
heterogeneous resources providing a resilient application execution environment with a dynamic flexible
fault-tolerant mechanism based on adaptive checkpointing.
To predict spot instance failures we use survival analysis and create a fault-tolerant execution plan considering the current availability scenario according to the application's needs. We build upon our previous
work [4], where we demonstrated the viability of the prediction model achieving 92% success rate for
survival prediction. The prediction results demonstrated the model potential as a solution to support
decisions and efficiently use spot instances in cloud computing.
We evaluated the multi-agent architecture with real historical data collected from spot instances price
changes achieving good results. Thus, we argue that the proposed architecture provides a novel approach

that encompasses a reasoning model and a statistical model to predict Time Until Revocation, defining suitable fault-tolerant parameters to avoid extra time in application executions. The agents' reasoning model uses Case-Based Reasoning (CBR) to decide based on previously observed cases, allowing continuous learning and problem-solving features similar to those of human beings [1].
In summary, the key contributions of this paper are: (i) a platform-independent multi-agent architecture that integrates BoT-enabled systems and provides a fault-tolerant spot instance environment; and (ii) increased reliability of users' application executions while optimizing total time and costs. In this paper we propose combining features to predict spot instance failures, as proposed in [4], with a multi-agent architecture that dynamically creates a fault-tolerant execution plan considering the current availability scenario according to users' application needs.
The rest of this paper is organized as follows: the literature review is presented in Section 2; in Section 3 we provide details about the proposed multi-agent architecture; in Section 4 we present experimental results; a comparative study is summarized in Section 5; and in Section 6 we present conclusions and future work.

2. Literature review

As previously explained, unused cloud computing resources are being exploited by cloud providers, and significant work has been done to study the consumption of resources at the cheapest price in an attempt to provide a secure environment and avoid unexpected events. The assignment of unused cloud computational resources in the form of unreliable VMs is commonly known as transient resources. In the following, we briefly review related works focused on cloud resource usage aimed at creating a fault-tolerant environment to run applications, from the agent-based, spot instance and fault-tolerance perspectives.
Considering fault-tolerant approaches to spot resource management, in [36] the authors propose a resource allocation strategy to address the problem of compute-intensive application execution. To maximize the chance of finishing tasks by their deadlines, the first approach uses a bidding mechanism that estimates future spot prices and supports cloud users' bidding decisions. Using price history data, the authors present five bidding strategies: (i) the minimum value; (ii) the mean of all values; (iii) the current on-demand price; (iv) the highest value observed; and (v) the current spot instance price. Efficient bidding strategies for spot instances are crucial to lowering execution costs, but are ineffective for long executions. Even when avoiding out-of-bid failures, if one occurs, all process data will be lost. As a second approach, this work evaluates an existing checkpointing-based approach using the same definitions as [37]. The technique considered is hourly execution-state checkpointing, following the full-hour charge rule applied by Amazon. As long applications imply more checkpoints, this technique can be improved by using a heuristic to define an appropriate checkpointing interval without compromising the total execution time. In our proposed architecture, an adaptive checkpointing is adopted. This flexible fault-tolerant approach adjusts checkpointing intervals according to the application execution time, e.g., intervals longer than 60 minutes for long executions.
In [39,40], the authors simulated an auction with several checkpointing strategies to evaluate out-of-bid situations in spot instances: hourly checkpointing and checkpointing at each rising price edge (when a spot price increase is observed). Results show that the checkpointing technique tolerates instance failures and reduces costs compared to on-demand instances. In [39], a time-window evaluation was used in an adaptive checkpointing strategy based on spot history data compared to the current price. The authors state that different bid values affect spot instance availability. Similarly to [36], the total number of checkpoints will increase. Small checkpointing intervals imply delayed executions and increased total user costs. Our approach adapts to different bid strategies; each strategy affects the time until revocation, which is an important element in defining the checkpointing interval according to current availability.
From a checkpointing perspective, an important element to consider is the interval between saving states, since it affects total execution time. Consider a task process TP that executes in a specific time TPt = 2880 minutes (≈ 48 hours), with a checkpoint interval Tchk = 60 minutes and an overhead Tover = 10 minutes. The resulting additional execution time T+ = ((TPt / Tchk) − 1) · Tover is 470 minutes (≈ 8 hours), giving a total execution time TPT = TPt + T+ of 3350 minutes (≈ 56 hours). Thus, a well-defined checkpointing approach is compulsory to decrease execution time: a larger Tchk implies faster executions, reducing monetary costs. The following works are presented in order of decreasing checkpointing intervals.
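The arithmetic above can be sketched as a small helper function. This is only an illustration of the overhead formula, not part of the proposed architecture:

```python
def total_execution_time(task_min: float, interval_min: float, overhead_min: float) -> float:
    """Total runtime including checkpoint overhead.

    Implements T+ = ((TPt / Tchk) - 1) * Tover and TPT = TPt + T+.
    """
    extra = (task_min / interval_min - 1) * overhead_min
    return task_min + extra

# The example from the text: a 2880-minute task, 60-minute interval,
# 10-minute overhead per checkpoint -> 3350 minutes total.
print(total_execution_time(2880, 60, 10))
```

Doubling the interval to 120 minutes drops the total to 3110 minutes, which illustrates why a larger Tchk reduces both time and cost.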

Y
In [33,41], the authors use a set of smaller checkpointing intervals, between 10 and 60 minutes. A fixed interval of 10 minutes is the worst strategy when a large application is running; in this scenario the checkpointing time can exceed the interval time and increase user costs. In [33], the authors adopt job migration as a fault-tolerant technique that implicitly uses checkpointing during execution to be prepared for migration. Similarly to other related works, the fault-tolerant parameters are static during execution time, ignoring the fact that changes occur over time in an auction environment. A better checkpointing strategy is to predict new failures at run-time and refine parameters according to price changes. In [41], the authors ignore the fact that CPU- and memory-intensive applications need considerable time to save the execution state, which increases total execution time. As observed in Algorithm 4, our proposal considers long executions as a trigger to evaluate and define appropriate checkpointing intervals.
OR

Alternatively, in [20], the authors propose a strategy in which a new checkpoint occurs every time the price changes, i.e. checkpointing is driven by monitoring price changes. This approach is not attractive for high-demand instance types, e.g. r3.8xlarge, a memory-optimized instance type with 1,292,181 price changes in the last two years (2017 and 2018) according to our records. In an auction scenario, high demand means more price change records. Our approach considers price changes when calculating the time until revocation after each checkpoint, and refines the interval parameter according to current availability.
Considering the multi-agent cloud management perspective, research focuses on security and resource provisioning. Using a model that focuses on private cloud providers, [12] proposes a framework that uses three agents to monitor execution, memory and CPU usage in the SaaS and IaaS layers. An important point in this work is the use of agents with a CBR model to predict resource usage and provide vertical elasticity of available resources. However, historical execution data is needed to predict resource usage; while useful, it is not always available. Transient resources are not used in this work; the authors preferred reliable servers with availability guarantees.
Considering that the research focus of this paper is on the use of a multi-agent approach to effectively manage transient instance resources, the systematic literature review found some works using an agent approach with dedicated (on-demand) machines [2,6,12,31]. The core agent behavior in these works uses elastic resource allocation associated with prediction methods for effective resource provisioning in cloud computing. More recent works encompass dynamic resource provisioning approaches [19,26].
Regarding the approaches with fixed checkpointing strategies, no work uses autonomous agents with spot instances. An important point is to consider checkpointing intervals that reduce total execution time and costs. Even in works with an adaptive checkpointing strategy, such as [23], the authors focus on the reliability of services and the availability of checkpointing storage in a cloud environment using on-demand instances, while our approach focuses on the reliability of users' application executions, using spot instances to optimize
Table 1
Literature review summary
Criteria: Multi-agent; Checkpoint and restore; Recent spot instance generations; User transparent; Adaptive checkpointing; CBR; Survival analysis
References compared: [40], [37], [36], [33], [41], [39], [20], [∗]
total execution time and costs. In our prior work [4], we used a heuristic model with a checkpoint/restore technique and a statistical model to predict the time until revocation by analyzing price changes. The results presented a high level of prediction accuracy (92%), but autonomous agents were not used to manage the fault-tolerant technique according to the execution scenario in real time. In this work, an extension of [5] is proposed with additional agent types, including recency effects to refine the time until revocation, and an extended experimental scenario considering the amount of data and instance types, which improved results as presented in Sections 3 and 4.
C
Since this research investigates the efficient use of spot instance resources, our literature review focuses on using checkpoint/restore as the fault-tolerant technique to guarantee execution. Even with its additional overhead, this technique is the most used in cloud computing [13]. What is common among these fault-tolerant related works is the fact that the defined parameters are static and controlled by non-autonomous systems. A compilation of the cited works is presented in Table 1, compared to our proposal (marked with [∗]).
To the best of our knowledge, there is no published research using the elements of this proposal, whose main contribution is a fault-tolerant multi-agent architecture to execute distributed BoT applications in a transparent way using spot instances. The proposed architecture combines a CBR reasoning model (to retrieve similar execution cases) with a survival analysis statistical model (to predict failure events and ignore atypical known cases) to support agents' reasoning in defining adequate fault-tolerant plans with adaptive checkpointing according to the current availability scenario, increasing spot instance reliability while optimizing total execution time and costs.
3. Proposed work

In this section, we present the fault-tolerant multi-agent architecture, which analyzes the current and historical scenario of spot instances (availability, prices and probabilistic indexes) to define an appropriate fault-tolerant execution plan, along with its respective definitions.
A negotiation layer between users and cloud providers enables a new breed of trust services. The management layer needs to intermediate external cloud resources in an autonomous way, choosing a convenient subset of resources to execute user applications, increasing resource usage and decreasing costs.
A resilient environment to run BoT applications is one of the most important and complex issues in cloud scenarios with non-guaranteed transient instances; it can be as complex as a multi-objective optimization problem, since it attempts to reduce resource consumption, minimize user costs and ensure task execution.
Fig. 1. Flowchart of the overall working process of this proposal.

3.1. Proposal overview

Transient instance revocation affects application performance compared to on-demand servers, since users have to deal with unexpected revocations in a non-transparent way, adding overhead to applications. For example, consider a compute-intensive application that runs on a spot instance and periodically saves its execution state to a remote disk. After a revocation, the application can re-execute from its last consistent saved state. But saving state at many intervals incurs an overhead that increases running time and decreases the performance of a spot instance relative to a safe and reliable on-demand server.
Our proposal combines CBR, to retrieve similar execution cases, with a survival analysis statistical model, a non-parametric technique used to ignore atypical known cases, which helps agents reason and define appropriate fault-tolerant parameters, e.g. the checkpointing intervals, to reduce cloud resource usage time and decrease user monetary costs. The overall working process of this proposal is presented in Fig. 1.
Figure 1 represents a flowchart that starts when a new application is submitted by the user and its required files and parameters are validated. Assuming a previously generated knowledge database (Algorithm 1), after validation a new spot instance is requested and the time until revocation is calculated by retrieving similar cases (Algorithm 2) and processing the survival curve (Algorithm 3), populating the survival time matrix. To define the appropriate fault-tolerant parameters (Algorithm 4), a survival rate is calculated after verifying an experimental scenario to observe whether the defined time until revocation is achieved, using historical price change data. In the end, a new remote execution is created with a failure detector monitor, which activates the corresponding behavior when the following events occur (Fig. 6): if it is time to checkpoint, a new execution state is saved in the repository; if an error occurred, the execution recovers from the last consistent saved state, the case is retained as a failure case and a new execution is created; and if the execution finished, the case is retained and the flowchart ends.
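The three monitor events just described can be sketched as a small dispatcher. All names and the state representation here are illustrative assumptions, not the paper's implementation:

```python
from enum import Enum, auto

class Event(Enum):
    CHECKPOINT_DUE = auto()  # the checkpointing timer fired
    FAILURE = auto()         # the spot instance was revoked
    FINISHED = auto()        # the application completed

def handle_event(event: Event, state: dict) -> dict:
    """Dispatch the three failure-detector events from the flowchart."""
    if event is Event.CHECKPOINT_DUE:
        state["repository"].append(state["progress"])  # save execution state
    elif event is Event.FAILURE:
        state["progress"] = state["repository"][-1]    # restore last saved state
        state["failure_cases"] += 1                    # retain the case as a failure
        state["executions"] += 1                       # create a new execution
    elif event is Event.FINISHED:
        state["done"] = True                           # retain the case; flow ends
    return state
```

In the real architecture the "progress" would be an application's full execution state persisted to remote storage; an integer stands in for it here.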

3.2. Using Amazon EC2 spot instances

Considering the significant amount of unused resources, cloud providers have been adopting a model that offers the unused capacity of VM resources in the form of transient instances. Spot instances are VMs available at Amazon, offered in an auction environment.
As opposed to on-demand instances, spot instances can be revoked without user intervention. Users request spot instance VMs using a bid value (Sbid), the maximum they are willing to pay, and receive the VMs if the current spot price (Sp) is below their bid price. As long as Sp remains below Sbid, the spot instances provided to the user remain available and only Sp is charged.
The value of Sp is constantly updated according to user demand. High demand by cloud users increases Sp, which decreases otherwise. The spot instance availability attribute (Son) indicates the state of the VM. When Son is set to true, the VM is up and running; when it is set to false, the VM will be revoked in two minutes. The state of Son changes according to Eq. (1):

Son = { True,  if Sp ≤ Sbid ∧ Sava(. . .)
      { False, if Sp > Sbid ∨ ¬Sava(. . .)          (1)
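Eq. (1) can be expressed as a simple predicate. Sava is treated here as an opaque boolean result, mirroring the black-box availability function described below:

```python
def spot_available(spot_price: float, bid: float, s_ava: bool) -> bool:
    """Eq. (1): the instance (Son) stays up while the bid covers the
    current spot price and the provider-side availability check holds."""
    return spot_price <= bid and s_ava

# The VM remains available while Sp <= Sbid and Sava(...) is true
print(spot_available(0.10, 0.12, True))   # bid covers price -> available
print(spot_available(0.15, 0.12, True))   # out-of-bid -> revocation follows
```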


The VMs are distributed among distinct locations in a set of regions. Each region is divided into a
subset of zones, which offer a set of independent resources and services with different failure probabili-
ties. To launch a transient instance using Amazon EC2, a user submits a spot instance request command
composed of 6-tuple, as presented in Eq. (2):
TH

SpotReq = {Region, Zone, InstanceType, QtdOfInstances, AttributeSet, BidValue} (2)


The Region and Zone are the primary attributes in a new spot request, their availability being defined according to Sava, a black-box cloud function that returns the availability of VMs according to the parameters Region, Zone and InstanceType.
The InstanceType represents the capabilities of the instance's resources, like CPU, memory, and local storage capacity. For example, the p3.8xlarge instance type represents the 3rd generation of GPU-accelerated computing-optimized instances, with 4 high-frequency GPUs, 32 vCPUs, 64 gigabytes of GPU memory and 244 gigabytes of memory, commonly used to process machine or deep learning applications, speech recognition or high-performance computing.
Amazon EC2 currently supports 57 different instance types in the spot tier, although not all types are available in all Regions and Zones. The number of instances (QtdOfInstances) represents how many instances will be delivered to the user. Moreover, a set of attributes (AttributeSet) is needed for the instance availability status to be fulfilled, e.g. the attached storage, security chains, monitoring tasks and fleet strength, a fleet being a collection of unsecured spot instances and, optionally, secured dedicated instances. When all the specifications of the user request are met, the spot instance VM is fulfilled, which can take a few minutes to be ready to use.
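For illustration only, the 6-tuple of Eq. (2) could be modeled as a plain data structure. Field names mirror the paper's attributes, not Amazon's actual API:

```python
from dataclasses import dataclass

@dataclass
class SpotRequest:
    """The 6-tuple of Eq. (2): SpotReq = {Region, Zone, InstanceType,
    QtdOfInstances, AttributeSet, BidValue}."""
    region: str
    zone: str
    instance_type: str
    qtd_of_instances: int
    attribute_set: dict
    bid_value: float  # maximum hourly price the user is willing to pay

# A hypothetical request for two m4.large instances in us-west-2a
req = SpotRequest("us-west-2", "us-west-2a", "m4.large", 2,
                  {"storage": "ebs", "monitoring": True}, 0.05)
```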
Finally, in order to use spot instances, a customer request must include a bid value (BidValue). Users define the maximum hourly value they are willing to pay (a bid), obtaining a VM to
Table 2
Definitions of principal elements used in this proposal related to spot instances
Property  Details
Ccpu      Indicates the processing capacity, including the number of virtual vCPUs and their respective clock speed.
Cmem      Indicates the resource associated with volatile RAM memory size, including access speed and extended capacity, when applicable.
Cvd       Represents the virtual disk storage capacity available to the user, used to store required files and data.
Cnet      Indicates the data transfer speed and capacity over a specific network adapter.
Csts      Indicates the executing state of the VM, which can be initializing, running, revoked or stopped.
TIsec     Represents security attributes that allow external access to the VM, such as open ports and a private security key.
Vcur      Represents the current instance price (at hourly intervals), acquired from the cloud provider.

use while their bid exceeds (or is equal to) the current price of the instance. If the current price of a spot instance is equal to or less than the user's bid, the VM remains available to the user. VMs are revoked when the current price of the spot instance type exceeds the user's bid. A bid fault occurs when the current price of the VM is above the user's bid [28].
The price of a spot instance depends on the type of instance as well as VM demand within each data center. Under the auction model, spot instances are offered at lower prices compared to regular and safe on-demand servers, their values increasing when resource availability and the number of users sending bids and receiving VMs are considerable, and decreasing otherwise.
According to the definition and classification of environments presented in [27], the composition of spot instances in this proposal's scenario is considered distributed, accessible, non-deterministic, dynamic and discrete. A set of definitions about spot instances is presented in Table 2.
In order to use spot instances, applications need to be intentionally fault tolerant, i.e. prepared for unexpected failure events and able to recover from failures, which is not always mastered by software designers.

3.3. Achieving spot instance availability with fault-tolerant approaches



Using spot instances allows reduced costs when running compute- or memory-intensive applications, as long as these applications are prepared for scenarios with performance degradation and server revocations.
Considering the use of cloud computing resources, resiliency means the capacity of a resource to be active, reliable, failure tolerant, recoverable, dependable, and secure in case of unexpected failures that result in a temporary or permanent service disruption [3,9,35]. In the scope of our proposal, resilience is the ability to ensure application execution, achieved through the fault-tolerant approach. Six fully consistent fault-tolerant techniques serve as examples: (i) retry is the simplest fault-tolerant technique, since the recovery process consists of restarting the process from scratch, ignoring all processed data; (ii) task re-submission ignores processed data, resubmitting the failed task to another resource; (iii) checkpoint and restore saves the application's running state at specified time intervals; an overhead is added for each saved execution state, because all process data and volatile memory must be saved to persistent memory; (iv) replication runs a set of parallel executions, increasing the chances of success by increasing the number of parallel executions; (v) software rejuvenation is used when an application needs a clean resource, rebooting the operating system according to schedule rules; and (vi) job migration is a preventive approach that predicts a future failure and migrates the running process to an environment with reliable resources, using a well-defined plan with safe instances.
Fig. 2. Observed patterns in price changes in M4Large instance type.

Using fault-tolerant techniques minimizes the impact of VM unavailability. Results show that under more volatile scenarios, fault-tolerant approaches become considerably more useful and provide significant benefits [39]. Intelligent approaches combined with fault-tolerant techniques increase resilience in scenarios that use spot instances to execute applications, avoiding data loss.

3.4. Analysis of reasoning model



The performance and availability of VMs in a cloud environment vary according to region, zone, instance type and time of day [4], where the authors observe a pattern of price changes according to the day of the week and the hour of the day.
These patterns can be observed in Fig. 2, which shows, for the price changes recovered from April 2017 to December 2018, how many price changes occurred on each day of the week and during each hour of the day, grouped by zones in the US-WEST region for the m4.large instance type. The probability of bid faults increases when there are more price changes; consequently, few price alterations imply fewer failures.
A pattern can be observed in the m4.large instance type: the number of changes peaks during weekdays (a) (as opposed to weekends) and after 12 am (c). The r4.xlarge has considerable price changes on weekends (b), and its number of changes by hour-in-day increases after 7 pm (19 h) (d).
Our heuristic uses a model that analyzes historical and current prices of spot instances and their changes to define appropriate fault-tolerant parameters. Compared to existing similar studies, our proposal uses CBR to classify cases considering the observed price change patterns of spot instances in terms of hour-in-day and day-of-week, as proposed in [4,16,17].
Our knowledge database is composed of case attributes (Table 3) and is organized as a set of tuples (Eq. (3)) to be explored by our schematic CBR circle of the problem-solving process (illustrated in Fig. 3), composed of an initial knowledge generation (detailed in Algorithm 1) followed by a set of eight sequential steps, as follows:
(i) Current Problem – a step that represents a new submission problem containing the requested region, zone, instance, bid-strategy, day-of-week and hour-in-day attributes; (ii) Retrieve – as presented in Algorithm 2, this step uses similarity functions based on Eq. (5) to fetch a set of similar cases according to the requested attributes. An amalgamation function is a weighted sum of all local similarities (attribute similarities) of a concept that constitutes the overall global similarity measure of the concept [32]. At the

Fig. 3. Schematic CBR circle overview.

Algorithm 1 Generating case knowledge database
1: procedure CasesGenerator(reg, zone, instType, initDate, finalDate)
2:   System Initialization
3:   functionParams ← [reg, zone, instType, initDate, finalDate]
4:   priceHistoryList[] ← getPricesFromDB(functionParams)
5:   addictions[] ← new Double(1, 1.1, 1.2, 1.3, currentOnDemandPrice)
6:   caseBasedList[] ← new List()
7:   for addiction in addictions do   ▷ Processing each addiction to increase database
8:     for priceRow in priceHistoryList index i do   ▷ Each price change record
9:       baseTime ← priceRow.time
10:      basePrice ← priceRow.price · addiction
11:      censured ← True
12:      for futurePriceRow in priceHistoryList do   ▷ Future price changes
13:        comparablePrice ← futurePriceRow.price
14:        if comparablePrice > basePrice then   ▷ A higher price was found
15:          censured ← False
16:          min ← calcMinutes(baseTime, priceHistoryList[i + 1].time)
17:          caseBasedList.add(new Case(setOfAttrs))
18:      if censured then   ▷ A higher price record was not found
19:        censured ← False
20:        min ← calcMinutes(baseTime, priceHistoryList[i + 1].time)
21:        caseBasedList.add(new Case(setOfAttrs))
22:   persistOnDatabase(caseBasedList)
23:   return caseBasedList

retrieving step, the similarity function processes case attributes (Eq. (4)) and uses different weights (Table 3) to get the cases with the closest or maximal calculated weights; (iii) Similar Cases – represents a structure with a set of similar cases retrieved to be reused as reference; (iv) Reuse – the adaptation and reuse of the retrieved

Table 3
Case (Θ) attributes with their respective similarity function strategies and weights

Attribute | Detail | Function strategy | Weight
Instance | A subset of attributes indicating the used instance, including the instance name and resource capacities. | Constant | 1 if attr is equal, else 0.
Region | Indicates the region location. | Constant | 1 if attr is equal, else 0.
Zone | Attribute indicating the region zone. | Constant | 1 if attr is equal, else 0.
Day-of-week | Represents the day-of-week. | Euclidean | A value between 0 and 1 according to the proximity of values 0 and 7.
Hour-in-day | Represents the hour-in-day. | Polynomial | A value between 0 and 1 according to the proximity of values 0 and 23.
Bid-strategy | Represents the bid strategy used to define the user bid (median of last n days, actual price or percentage over current price). | Constant | 1 if attr is equal, else 0.
Bid-value | Represents the bid value (spot instance request) used to acquire the VM. | N/A | N/A
Time-init | Attribute indicating the time (timestamp) of VM acquisition. | N/A | N/A
Time-end | Attribute indicating the time (timestamp) of the VM revocation instant. | N/A | N/A
Time-until-revocation | Represents the time until revocation, considering time-init and time-end. | N/A | N/A
Censored-data | Attribute indicating if the Θ represents a censored event, being an important element to calculate the time until revocation. | N/A | N/A

Algorithm 2 Retrieving similar cases
1: procedure RetrieveSimilarData(reg, zone, instType, dayOfWeek, hourInDay)
2:   System Initialization
3:   paramSet ← [reg, zone, instType, dayOfWeek, hourInDay]
4:   model ← loadModule()   ▷ Load the CBR model, including the modeled similarity functions
5:   model.cases ← loadCasesDatabase()   ▷ Load the CBR database, with data filtered according to params
6:   caseAttributes[] ← model.loadAttributes()
7:   similarCasesMap[] ← new List()
8:   for case in model.cases do   ▷ Iterating on each case
9:     case.weight ← 0
10:    for attribute in caseAttributes do   ▷ Iterating on each mapped attribute
11:      functionType ← attribute.getFunctionType()
12:      if functionType eq model.HEAVISIDE then
13:        case.weight += (case["attribute"] equals attribute in paramSet ? attribute.weight : 0)
14:      if functionType eq model.POLYNOMIAL then
15:        case.weight += calculateWeight(case["attribute"], attribute, attribute.weight)
16:      if functionType is null || functionType eq model.UNKNOWN then
17:        RaiseAnError → "Unknown Similarity Function (functionType)"
18:    similarCasesMap.put(case)
19:   filterListByMaxWeight(similarCasesMap) as Set
20:   return similarCasesMap

cases are performed in this step, in which a proposed solution to the new problem is constructed according to previously observed cases; (v) Solved Case – after adaptation, a set of cases is used as input to an algorithm that evaluates each case, ignoring atypical cases and calculating the time until revocation, as presented in Algorithm 3; (vi) Proposed Solution – the result of the solved case step; the proposed solution is composed of the time until revocation considering the current availability scenario and bid strategy; (vii) Revise – the proposed solution is evaluated in this step to confirm its efficiency in a scenario with real test cases, in which the time until revocation is observed; (viii) Retain Case – after evaluation, it is necessary to explore the solved case and transform it into a learned case, adding its solution to the local knowledge database. If the proposed time until revocation was achieved, a success case is added to the knowledge database, followed by the defined fault parameters. If the time until revocation was not achieved, the properties of the case are the same, but the current execution time will be present in new similar cases obtained in the retrieving step, changing new proposed solutions that use the recency factor, as presented in Section 4.3.
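The case generation that seeds this knowledge database (Algorithm 1) hinges on censoring: an observation is censored when the price trace ends before any future price exceeds the simulated bid. A minimal Python sketch of that idea (the class and function names are ours, not the paper's implementation):

```python
from dataclasses import dataclass

@dataclass
class Case:
    bid: float      # simulated user bid (base price x multiplier)
    start: int      # minute of the simulated acquisition
    minutes: int    # observed time until revocation (or until the trace ends)
    censored: bool  # True when no higher price was seen before the trace ended

def generate_cases(trace, multipliers=(1.0, 1.1, 1.2, 1.3)):
    """trace: list of (minute, price) tuples sorted by time."""
    cases = []
    for mult in multipliers:                    # enlarge the database, as in Algorithm 1
        for i, (t0, p0) in enumerate(trace):
            bid = p0 * mult
            revoked = False
            for t1, p1 in trace[i + 1:]:        # scan future price changes
                if p1 > bid:                    # bid fault: spot price exceeded the bid
                    cases.append(Case(bid, t0, t1 - t0, censored=False))
                    revoked = True
                    break
            if not revoked:                     # censored observation
                cases.append(Case(bid, t0, trace[-1][0] - t0, censored=True))
    return cases
```

Censored cases must be kept, not discarded: dropping them would bias the survival estimates of Section 3.5 downward.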

Algorithm 3 Processing TS
1: procedure ProcessSurvivalTime(reg, zone, instType, confLevel, dow, hid)
2:   System Initialization
3:   cases[] ← RetrieveSimilarData(reg, zone, instType, dow, hid)
4:   if processRealCasesLength(cases) ⩽ confidenceSize then
5:     raise Exception with too small times length
6:   timeCases[] ← Integer(sizeof(cases[]))
7:   censored[] ← Boolean(sizeof(cases[]))   ▷ Necessary for the Kaplan-Meier estimator
8:   for case in cases index i do
9:     timeCases[i] ← case
10:    censored[i] ← [case.isCensored() == True]
11:  sortedIntervals ← KaplanMeierEstimator.compute(cases, censored)
12:  firstInterval ← sortedIntervals[0]
13:  for interval in sortedIntervals do   ▷ To find the closest interval
14:    if interval.cumulativeSurvival > confLevel then
15:      firstInterval ← interval
16:  return firstInterval.value
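The estimator step of Algorithm 3 (lines 11-16) can be illustrated with a compact product-limit computation; a sketch under our own simplifying assumptions (one event per timestamp, no tie handling), not the paper's implementation:

```python
def kaplan_meier(times, censored):
    """Kaplan-Meier product-limit curve: SF(t) is the product of (1 - D_i/S_i)
    over event times up to t (Eq. (6))."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk, surv, curve = len(times), 1.0, []
    for i in order:
        if not censored[i]:                 # a revocation ("death") at times[i]
            surv *= 1.0 - 1.0 / at_risk
            curve.append((times[i], surv))
        at_risk -= 1                        # censored cases only leave the risk set
    return curve

def survival_time(curve, conf=0.98):
    """Largest time whose estimated survival probability is still >= conf,
    i.e. the TS returned by Algorithm 3."""
    t_s = 0
    for t, surv in curve:
        if surv >= conf:
            t_s = t
    return t_s
```

For example, with revocations at 10, 20 and 30 minutes and one case censored at 40 minutes, the curve steps through 0.75, 0.5 and 0.25, and a 0.7 confidence level yields a survival time of 10 minutes.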

Considering that the composition of the attributes (region, zone, instance type, day-of-week and hour-in-day) can influence the revocation time, the similarity functions implemented on a CBR model offer an efficient approach to solve the problem posed in this paper. Once a case database has been generated, it is necessary to retrieve similar cases, as presented in Algorithm 2.
To improve the reasoning process, this model uses equivalence and adaptation approaches, which involve the understanding of previous cases to assist with similar decisions and, in the case of failure, allow their adaptation.
Our CBR formalization model includes a case (Θ) element, which corresponds to the context of a real problem situation, represented as a finite set of n key/value pairs composed of attributes (ϑ) and values (ν), as presented in Eq. (3).
Θ = {< ϑ0 , ν0 >, < ϑ1 , ν1 >, < ϑ2 , ν2 >, . . . , < ϑn , νn >} (3)
Each similarity function is used not only in case retrieval but also to adapt cases to augment the knowledge database; it is determined by the proximity of attribute values, whose respective weights are defined according to the context of the problem.
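A hedged sketch of how such weighted similarity functions drive retrieval (as in Algorithm 2 and Table 3); the polynomial decay, the value spans (7 days, 23 hours) and all names are illustrative assumptions, not the paper's code:

```python
def retrieve_similar(cases, query, attrs):
    """cases: list of dicts; query: requested attribute values;
    attrs: {name: (function_type, weight)}, with function_type in
    {"heaviside", "polynomial"} as in Algorithm 2."""
    def poly(a, b, span, degree=2):
        # 1 when equal, decaying polynomially to 0 with the distance
        return max(0.0, 1.0 - (abs(a - b) / span) ** degree)

    scored = []
    for case in cases:
        weight = 0.0
        for name, (ftype, w) in attrs.items():
            if ftype == "heaviside":            # exact-match attributes
                weight += w if case[name] == query[name] else 0.0
            elif ftype == "polynomial":         # ordinal attributes (hour, day)
                span = 23 if name == "hour" else 7
                weight += w * poly(case[name], query[name], span)
            else:
                raise ValueError(f"unknown similarity function: {ftype}")
        scored.append((weight, case))
    best = max(w for w, _ in scored)            # keep only max-weight cases
    return [c for w, c in scored if w == best]
```

An exact zone match and an exact hour both contribute their full weight, so a case matching all requested attributes dominates cases that differ only slightly in hour-in-day.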

Context can be considered as an interpretation of a case, where only a selected subset of attributes is considered relevant compared to the others, having their specifications and weights defined according to the detailed information in Table 3.
Attributes with an N/A function strategy are not used in a similarity function, being important data used by the agent-based process detailed in Section 3. A context Γ is defined as a finite set of j subsets of attributes (Ω) with associated constraints (Cα) on their values:
Γ = {< Ω1 , Cα1 >, < Ω2 , Cα2 >, < Ω3 , Cα3 >, . . . , < Ωj , Cαj >} (4)
Using the history of price traces (Pt ), a set of cases (∆) can be created by using algorithms to simulate an auction scenario with user bids. These cases, generated from the real data (price traces), can be used to calculate the expected survival time. As a specific example, the cases can be generated by using the result of an estimation function ESTbid (Pt , n), which calculates the median instance price over the last n days, as a bid value and simulating the survival times from the price traces, revoking the instance when the price exceeds the bid.
A set of real cases can be defined as ∆ = {Θ1 , Θ2 , . . . , Θn }, with each Θ composed of the set of attributes presented in Table 3 and fully included in the knowledge database.
When matching cases, some attributes are forced to be equal. In the similarity process, we partition θ into θE and θD , representing attributes that must be equal and attributes that can be different, respectively. Using the above partition of the attributes, we then calculate the similarity function between cases as presented in Eq. (5):
$$S_{Cases}\big(\delta^{(1)}, \delta^{(2)}\big) = \prod_{\theta \in \theta_E} AI\big(\gamma_\theta^{(1)} = \gamma_\theta^{(2)}\big) \prod_{\theta \in \theta_D} K_\theta\big(\big|\gamma_\theta^{(1)} - \gamma_\theta^{(2)}\big|\big) \qquad (5)$$
The adjustment indicator function, $AI$, assumes the value 1 if its argument is true and zero otherwise. The kernel functions, $K_\theta$, for each of the attributes in θD all assume the value 1 when $\gamma_\theta^{(1)} = \gamma_\theta^{(2)}$, and decrease polynomially to zero as the values become more distant. The rate of decay varies according to the respective attribute.
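Equation (5) can be read as: any mismatch on a must-match attribute zeroes the similarity, while distance on the remaining attributes is penalized multiplicatively by kernels. A small sketch, where the kernel shapes are illustrative assumptions:

```python
def s_cases(c1, c2, equal_attrs, diff_attrs, kernels):
    """Similarity of Eq. (5): an indicator product over theta_E times
    a kernel product over theta_D."""
    for a in equal_attrs:                  # AI(...) = 0 kills the whole product
        if c1[a] != c2[a]:
            return 0.0
    sim = 1.0
    for a in diff_attrs:                   # kernels decay from 1 to 0 with distance
        sim *= kernels[a](abs(c1[a] - c2[a]))
    return sim

# Illustrative polynomial kernels over the day-of-week and hour-in-day ranges
kernels = {
    "day":  lambda d: max(0.0, 1.0 - (d / 7) ** 2),
    "hour": lambda d: max(0.0, 1.0 - (d / 23) ** 2),
}
```

Because the terms multiply rather than sum, a single forced-equal mismatch (e.g. a different zone) makes two cases entirely dissimilar, which is the intended behavior for attributes such as region and instance type.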

3.5. A time-to-revocation prediction approach using survival analysis

A convenient statistical model is used to support the intelligent agents' decisions and offer a reliable environment to execute applications when spot instances are used in a cloud computing business model where price and availability vary according to supply and demand. Statistical models provide mechanisms for extracting information, sometimes without complete data, providing support to organize, analyze and present processed data that may be useful in decision making processes.
Our model uses survival analysis [24], a class of statistical methods used when the time to a specific event is one of the principal factors the data represents. This approach, widely used in medical research, involves variables associated with time, following observations until the occurrence of an event of interest, frequently a failure [10].
The event under study is a bid fault, and the time to this event, the time until revocation, is treated as the main element. The results using pure CBR were not sufficient, making it necessary to integrate a statistical model to support agent decisions. A non-parametric technique is used, which incorporates incomplete (censored) data to avoid estimation bias. This method requires three elements, as follows:
i. Time indexed case data.
ii. An estimator function $\widehat{SF}(time)$ to be applied to the data.

Fig. 4. Survival curves with respective survival times of some spot instances at Sunday 6am.

iii. The confidence level of the prediction approach (c ∈ (0, 1)).


We would like to estimate the largest Survival Time (TS ) with a low probability of revocation. Algorithm 3 presents the method steps to obtain the TS for a defined set of parameters.


As presented in Algorithm 3, the TS time is extracted from estimated survival curves through a survival function using the Kaplan-Meier estimator [21], in which $t_i$ represents the upper limit of a small time interval, $D_i$ the number of deaths (failures) within that interval, and $S_i$ the number of survivors at the beginning of the interval, as follows:

$$\widehat{SF}(time) = \prod_{i:\, t_i \leqslant t} (1 - D_i/S_i) \qquad (6)$$

If no deaths occur in a given interval then the survival curve does not decrease. A survival curve with respective times and confidence levels can be observed in Fig. 4.
According to Fig. 4, given the 98th confidence level (considering a 2% failure probability), the time until revocation for a general memory instance type (m3.large) is around two hours, while the estimated time until revocation for an accelerated computing instance type (p2.xlarge) is much smaller. As illustrated in Fig. 2, the time until revocation on a weekday is considerably lower compared with a weekend day.
We use similar cases recovered from the case knowledge database to produce a survival curve, which is used by the executor agent to predict the time until revocation according to a confidence level that provides sufficient security. The agent estimates the largest TS for which we have high confidence (98%) that our running instance will not be revoked, considering the relationship between day-of-week and hour-in-day.

This time is defined by $T_S = \arg\max_t \{t \in \mathbb{R} \mid P(T_{UR} > t) \geqslant 0.98\}$ and can be calculated as the 98th percentile of the time until revocation from the estimated survival curve. The survival curve for $\delta_i$ is calculated from the other cases with the following weights:

$$w_{ij} = S_{Cases}(\delta_i, \delta_j) \Big/ \sum_{k \neq i} S_{Cases}(\delta_i, \delta_k) \quad \text{for } j \neq i \qquad (7)$$
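Equation (7) simply normalizes the pairwise similarities of case δi against all other cases; a one-function sketch with an illustrative similarity callback:

```python
def case_weights(cases, i, sim):
    """Normalized weights w_ij = S(d_i, d_j) / sum over k != i of S(d_i, d_k),
    as in Eq. (7)."""
    sims = {j: sim(cases[i], cases[j]) for j in range(len(cases)) if j != i}
    total = sum(sims.values())
    return {j: s / total for j, s in sims.items()}  # weights sum to 1
```

With three equally similar cases, each of the two neighbors of case 0 receives weight 0.5.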

The working process of using survival analysis is presented in Algorithms 3 and 4, being explored as an important element to process the time to revocation using the 98th percentile confidence level in the process survival time step.

3.6. Multi-agent architecture

In this proposal, the agents extend the architecture using different approaches according to the current scenario, where application execution, VM monitoring and the fault-tolerant execution plan can be managed dynamically and autonomously for a set of cloud providers and their respective services and resources.
According to the definition and classification presented in [27], in this proposed environment, the composition of spot instances is considered a distributed, accessible, non-deterministic, dynamic and discrete environment.
C
Our proposal contains a set of autonomous agents composed of a quadruple BRAAG = <A, B, P, G>, where:
– A stands for the set of agents;
– B describes the set of available behaviors of the agents, where a behavior b is defined by a subset of conditions, having cond(b) ⊆ PR;
– P is a finite set of propositions to be used in the agents' assertions;
– G depicts the set of goals to be achieved, having G ⊆ PR.
An abstract view of the agents and their autonomy can be formalized. First, let us assume E = {e0 , e1 , e2 , . . . , en } as a set of finite discrete states, e.g. the auction environment and price changes, and VM availability with their execution states. Each agent has a set of possible behaviours B = {β1 , β2 , β3 , . . . , βj }, which transform the state of the environment. The decision of which β will be triggered depends on the state e, i.e. E responds with a set of states and the agent defines a set of behaviours to be executed, BE. The BE corresponds to a sequence of interleaved environment states and agent behaviours:

$$BE: e_0 \xrightarrow{\beta_0} e_1 \xrightarrow{\beta_1} e_2 \xrightarrow{\beta_2} \cdots \xrightarrow{\beta_{u-1}} e_u$$

As observed, each change of e triggers actions, and this continues until a final state eu is reached without consequent behaviors.
The history H is represented as the set of all possible finite sequences (over E and B), H BE as the subset of tuples T = <ex , βx > representing state ex triggering behaviour βx , and H E as the subset of tuples T = <βx , ex+1 > representing the new state ex+1 as the consequence of βx .
To represent the effect that an agent behaviour has on an environment, a state transformer function is defined as τ : H BE → ϕ(E). So, an environment Σ is formalized as a triple Σ = <E, e0 , ϕ>, where E represents a subset of states, e0 ∈ E is the initial state, and ϕ is the state transformer function, triggered after agent reasoning.
after agent reasoning.
To understand the elements of the proposal regarding the agents and their reasoning modeling, the multi-agent architecture is presented in Fig. 5 and some definitions can be formalized in the following way: sen-avail is composed of a sensor that observes the availability of spot instances in a cloud environment; sen-price is a finite set of sensors that monitor spot instance price changes. According to

Fig. 5. Proposed multi-agent architecture.

the nature of an auction, the price varies according to supply and demand and the system needs to be updated; sen-fail represents a set of sensors that observe a cloud environment with available and running spot instances, listening for revocation events; sen-exec is composed of sensors that obtain a set of n running spot instances, composed of a BoT T with a subset of tasks T = {t1 , t2 , . . . , tn }. Each spot instance receives T and executes the same BoT independently; B is a bag of pre-defined implemented behaviors, individually created to achieve their respective objectives according to a fault-tolerant execution plan, with a set defined as B = {β1 , β2 , β3 , . . . , βn }; database represents a local knowledge database that contains the set of all price changes and processed data, including the cases used by the CBR model, if applicable; ξ depicts a core function that evaluates an observed event to decide which behavior will be executed by a defined set of actions; location is a flag that indicates whether the agent is local or remote. Agents are inserted in a centralized container that allows remote agents, and this attribute indicates whether the agent is a local or a connected remote agent.
The proposed architecture is composed of a finite set of agents that share the entire environment and communicate with each other using a multi-agent communication protocol layer to enable interoperability with different technologies. The integration with external cloud infrastructures occurs through pluggable modules to support other cloud providers. Moreover, an API that offers external access to the facade agent (Verifier) is provided to enable integration with external systems.


From the layer definitions presented in Fig. 5 and the agents' PEAS detailed in Table 4, a brief description follows: the Verifier agent checks user input and chooses which resources and instance types will be used based on user requirements, which include application files, estimated execution time and the amount of CPU and memory required; the Price Monitor agent monitors a set of available price changes in the cloud environment. Each spot instance price change triggers a process on the Core agent to generate a case to be added to the CBR knowledge database with the actual time until revocation based on the new record; the Core agent defines the most appropriate fault-tolerant technique and its respective parameters. To achieve high levels of accuracy, it uses an approach to predict the time until revocation [4], with a reasoning process detailed in the next sections. It produces the elements to calculate a survival rate, which is mandatory information to support agent decisions; the Executor Manager agent is responsible for managing the cloud spot instances, creating and allocating n spot instances to meet application requirements, i.e. if the user needs 50 parallel executions, 50 VMs (spot instances) will be created and their execution will be managed. These instances will be used to run tasks and monitor their execution, respecting the previous definitions, i.e., cloud region, zone, instance types, fault-tolerant technique, and parameters; the Executor

Table 4
Agents' PEAS definitions

Verifier
– Performance: Receives user application parameters, data and files.
– Environment: Partially observable, deterministic, sequential, static and discrete.
– Actuators: Validate input data; define cloud provider and instances that match the requirements of the user tasks.
– Sensors: Analyze provided data and files; evaluate instance prices and availability; define instance types.

Price Monitor
– Performance: Acquires a set of instance types to monitor and their respective regions and zones; identifies each instance type's price changes.
– Environment: Observable, deterministic, sequential, dynamic and continuous.
– Actuators: Keeps the instance price database updated.
– Sensors: Analyze current instance price; evaluate instance price changes; inform price change events.

Core
– Performance: Receives, from the Verifier agent, the cloud provider and respective instance definitions; retrieves the case database.
– Environment: Partially observable, deterministic, episodic, static and discrete.
– Actuators: Define the execution plan to distribute tasks and run an application with the defined fault-tolerant technique and respective parameters.
– Sensors: Request case database; apply the similarity model; recover similar cases to calculate the failure based on time until revocation per instance.

Executor Manager
– Performance: Receives the execution plan.
– Environment: Observable, stochastic, sequential, dynamic and continuous.
– Actuators: Keep instance and task execution state; evaluate proposed case solution; retrieve case state.
– Sensors: Request spot instances; copy required files to the VM; run applications; evaluate application execution; end the instance when execution finishes.

Executor Monitor
– Performance: Obtains a set of running instances to monitor.
– Environment: Observable, deterministic, sequential, dynamic and continuous.
– Actuators: Provide status data from running tasks on VMs; report failure events to the Recover agent.
– Sensors: Monitor remote task execution; send information about a VM failure; re-validate events to avoid false-positive failures.

Recover
– Performance: Task execution failure when a bid fault occurs.
– Environment: Observable, deterministic, sequential, dynamic and continuous.
– Actuators: Keep instance execution state; ensure the fault-tolerant technique and parameter application; evaluate proposed case solution; retrieve case state.
– Sensors: Recover when a bid fault occurs according to the fault-tolerant plan.

Monitor is a lightweight agent that monitors instance execution and is used to quickly report failures to be recovered by a Recover agent. Even though the Executor Manager agent monitors spot instance execution, the Executor Monitor agent is more efficient since it behaves as an external observer; the Recover agent is instantiated when a failure occurs, being informed by a message from the Executor Monitor agent and applying the recovery processes to guarantee application execution. As with the Executor Manager agent, the Recover agent respects the previous definitions about the fault-tolerant technique and parameters.
In our scenario, each agent has a defined environment, e.g. the Price Monitor agent is limited to observing only the auction environment, whereas the Executor and Recover agents share the same environment. A brief sequence diagram, which summarizes the interaction between the agents from a new BoT submission until a failure detection event, is presented in Fig. 6 as a set of eight steps, as follows: Step 01: the user provides the application and parameters to the Verifier agent; Step 02: the Verifier agent validates

Fig. 6. Sequence diagram of BoT and behavior in the presence of a failure.

user input and asks the PriceMonitor agent to inform an appropriate instance type according to the user parameters. At this time, the Verifier agent is responsible for rejecting or accepting the user submission according to the architecture requirements, e.g. application files, bid-value, custom checkpointing interval and application time; Step 03: after the PriceMonitor response, the Verifier agent asks the Core agent to create an execution plan. As a result, the Core agent will receive the chosen instance and user params; Step 04: the Core agent retrieves similar cases according to the current day-of-week, hour-in-day, region, zone and instance type, calculates the time until revocation using similarity functions (Eq. (5)) and defines a fault-tolerant execution plan, according to Algorithm 4. As a result, the agent informs the most appropriate fault-tolerant execution plan to be used in the current application execution; Step 05: the Core agent asks the Executor Manager agent to start execution on n instances, according to user needs; Step 06: the Executor Manager is the agent responsible for managing cloud spot instances. At this time, the agent requests a new spot instance and waits for availability; Step 07: if a failure is detected, the Executor Monitor agent asks the Recover agent to inform the failure to the Core agent and start the recovery process. As a result, a new execution plan will be processed (as stated in Step 04) while the Recover agent initializes the recovery process to start a new execution using the latest consistent state; Step 08: considering that the recovery instant is different, a new execution plan needs to be calculated by the Core agent until a successful execution.

Considering a user execution requirement with n instances, there should be one Verifier agent, one Core agent, n Executor Manager agents, n Executor Monitors and, when necessary, n Recover agents. In the presence of a failure, the Executor Monitor agent signals the Recover agent, having r temporary Recover agents, where r is a positive integer with r ⩽ n, i.e. if an error occurs, they will recover according to the defined fault-tolerant execution plan.
As an agent-based project, the characterization of the agents is necessary for the understanding of their objectives and interactions. As proposed by [27], the agents' characterization is presented through the Performance, Environment, Actuators and Sensors (PEAS) aspects in Table 4.
Given an application’s execution time (appTime) and TS , a production rule [11] can be used to support
agent decisions, as detailed in Algorithm 4.
Algorithm 4 Defining the appropriate fault-tolerant execution plan
1: procedure DefineExecutionPlan(appTime, region, zone, instType, bidValue)
2:   System Initialization
3:   validData ← paramsValidation(appTime, region, zone, instType, bidValue)
4:   if validData then   ▷ App is able to execute
5:     if appTime > 60 then   ▷ Time in minutes
6:       dayOfWeek, hourInDay ← extractCurrentTime()
7:       params ← [region, zone, instanceType, 0.98, dayOfWeek, hourInDay]
8:       survivalTime ← ProcessSurvivalTime(params)   ▷ (Algorithm 3)
9:       priceHistorySet ← recoverPriceHistoryFrom(90)   ▷ Time in days
10:      successRate ← 0
11:      for priceRow in priceHistoryList index i do   ▷ Each price change record
12:        if extractTime(priceRow) == dayOfWeek, hourInDay then
13:          currentPrice ← priceRow.currentPrice
14:          errorOcurred ← False
15:          instant ← priceRow.time
16:          for instant increasing 60 do   ▷ Time in minutes
17:            if priceRow.price > priceHistorySet[instant].price then
18:              errorOcurred ← True
19:          if not errorOcurred then
20:            successRate++
21:      chkInterval ← calculateCheckpointInterval(successRate)
22:    else
23:      chkInterval ← 0
24:  return ExecutionPlan(appTime, region, zone, instType, bidValue, chkInterval)
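The success-rate loop of Algorithm 4 (lines 11-20) amounts to replaying historical prices and counting start instants whose window survives without a bid fault; a minimal sketch with invented data structures:

```python
def success_rate(price_at, bid, window_min, starts):
    """Fraction of simulated executions surviving window_min minutes without
    the spot price exceeding the bid.
    price_at: {minute: spot price}; starts: candidate start minutes."""
    ok = 0
    for t0 in starts:
        # a bid fault occurs if any price in the window exceeds the bid
        if all(price_at.get(t, 0.0) <= bid for t in range(t0, t0 + window_min)):
            ok += 1
    return ok / len(starts)
```

A high rate supports using the predicted survival time to space checkpoints; a low rate pushes the agent toward the conservative checkpointing fallback described below.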
A required parameter (appTime) is mandatory and is used in the first step: a checkpoint interval will either be adopted in a new execution or be ignored. In [36,40], the authors consider distinguishing executions longer than one hour to be the simplest and most intuitive, yet effective, way of dealing with the cost/reliability trade-off when running applications on spot instances.
The reasoning detailed in Algorithm 4 considers an execution with no checkpointing as the primary fault-tolerant plan to be applied when appTime is considered short (line 23), i.e. less than 60 minutes, not requiring any other parameters. When appTime is considered long (line 5), the survival rate and its success are considered mandatory parameters, with the survival rate calculated between lines 11 and 20, counting a success when the instant time overtakes the processed survival time (line 20).
If the success rate is high, a convenient checkpointing interval is calculated from TS as follows. Given an application's required execution time (TA ), a checkpoint overhead (CO ), a max interval parameter in minutes defined by the user (TUchk ), and TS , estimated by the STM, a convenient checkpoint interval (CI) can be obtained as follows:
CI = min(TS − CO , TA , TUchk ) (8)
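Equation (8) is a direct minimum of the three candidate bounds; for example, with TS = 120 min, CO = 2 min, TA = 300 min and TUchk = 60 min (illustrative values, not from the paper's experiments), the user-defined cap binds:

```python
def checkpoint_interval(t_s, c_o, t_a, t_u_chk):
    # CI = min(TS - CO, TA, TUchk), as in Eq. (8)
    return min(t_s - c_o, t_a, t_u_chk)

print(checkpoint_interval(120, 2, 300, 60))  # -> 60
```

Subtracting the checkpoint overhead CO from the predicted survival time TS ensures the checkpoint completes before the predicted revocation instant.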

Fig. 7. Success rate according to the amount of data used (in months).

When the survival rate offers low confidence, a checkpoint fault-tolerant plan which does not use TS is considered as a last alternative. In a dynamic execution scenario, job migration can occur if a considerable price change pattern is observed by the agents, which request a new instance considering a different bid strategy according to convenient recent values.
The main goal of our proposal is to observe the actual environment and find an optimal fault-tolerant plan and parameters that minimize the total runtime and reduce costs. To reach these goals, our agents use the approach presented in [4], which uses CBR, survival analysis, and quantiles of the time until revocation to find a probable survival time and use it as part of a strategy to avoid occurrences of failures. In addition to [4], elements of this approach can extend the fault-tolerant definitions by using other techniques, such as a recency factor to obtain better results, no checkpointing, or even user-defined exception handling.

4. Experiments and results

4.1. Experimental scenario

A set of Amazon’s spot instances and it respective price change history was used to evaluate our
TH

proposal and predict the TS , used as input to define a fault-tolerant execution plan with respective pa-
rameters. We extend our experiment and adding the reasoning process to define fault-tolerant parameters
for 78 instances, including all zones (us-west-1b, us-west-1c, us-west-2a, us-west-2b, us-west-2c) of the
US-WEST-1 and US-WEST-2 regions.
Using 19 months of real price changes, provided and collected from Amazon AWS from April 2017 to October 2018 (approximately 21 million records), our experiments simulate 1,362,816 scenarios using 78 instances, 4 bid strategies (actual, mean of last week, median of last week, mean of last day) and two regions (US-WEST-1 and US-WEST-2) in a 13-week scenario (from August to October) with all combinations of day-of-week and hour-in-day.
In order to define the optimal range of data to use for case generation and survival curve estimation,
we calculated the success rate as a function of the number of months of previous data used. The results
can be found in Fig. 7.
According to Fig. 7, two important features require comment. First, there is a general trend of increasing
success as the number of months of data used increases. This is to be expected, as long as the data gener-
ating process has not changed significantly over the period analyzed, since more data allows us to more
accurately estimate the non-parametric survival curves, leading to better estimates of the revocation time.
Second, there is a significant dip in the success rate at 9 months. This occurs because of the introduction
of new instance types which have few observations, leading to poor estimation and increased failures for
these instance types, reducing the overall success rate.
J.P. Araújo Neto et al. / Towards increasing reliability of Amazon EC2 spot instances 279

Table 5
Generated STM of the r4.xlarge instance using values from the 98th confidence level, extracted from the respective survival curves (times until revocation in minutes; rows: day of the week 1-7, columns: hour in day 0-23)
Day\Hour 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
1 202 170 125 148 120 97 120 101 82 116 139 130 291 380 321 234 187 207 170 134 104 100 90 206
2 172 120 128 83 109 77 84 98 68 115 110 106 99 112 130 105 159 155 136 110 87 106 259 179
3 135 169 167 136 99 105 127 95 70 79 96 98 79 101 106 101 78 101 103 80 88 120 153 112
4 142 103 118 104 136 100 147 94 69 85 164 133 119 92 91 99 99 84 136 84 83 136 139 196
5 143 117 149 160 128 92 103 87 68 84 130 111 90 81 95 105 91 106 124 95 84 97 158 186
6 139 153 136 108 90 72 102 91 70 100 172 127 93 82 115 110 97 153 108 126 82 149 121 226
7 176 183 277 221 197 151 130 156 98 81 198 135 109 116 125 84 103 156 120 100 80 153 117 181
Fig. 8. Heat map of survival success compared to real times until revocation for the r3.large instance.

Given the increasing success rate as more observations are used, we prefer to use all the data collected over the entire period. From the collected data, approximately 110 million cases were generated, considering all zones and instances from the US-WEST-1 and US-WEST-2 regions.

4.2. Evaluating time until revocation



Each day-of-week and hour-in-day relationship has its own survival curve, represented as a Survival Time Matrix (STM), as illustrated in Table 5, which shows times until revocation, in minutes, obtained using the 98th confidence level and extracted from the respective survival curves. The values in the STM are the results from the similar cases recovered by the similarity function in a case database, calculated from the best experiment results for confidence levels between the 90th and 98th. The best results were achieved using the 98th as the confidence level.
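As one concrete reading of how a single STM entry is produced, the sketch below walks a non-parametric (Kaplan-Meier style) survival curve, given as (time, survival probability) steps, and returns the largest time whose survival probability still meets the chosen confidence level. The curve literal and all names are illustrative assumptions.

```python
# Hedged sketch: extract a time-until-revocation estimate from a survival
# curve, mirroring the 98th-confidence-level quantile used for the STM.

def time_at_confidence(curve, confidence=0.98):
    """Largest time t whose survival probability S(t) still meets `confidence`.

    `curve` is a list of (time, survival_probability) pairs sorted by
    increasing time; survival probabilities only decrease.
    """
    best = 0
    for t, s in curve:
        if s >= confidence:
            best = t        # survival is still high enough at time t
        else:
            break           # first drop below the confidence level: stop
    return best

# Example step curve: survival drops below 0.98 after 120 minutes.
km_curve = [(30, 1.00), (60, 0.99), (120, 0.98), (180, 0.95), (240, 0.90)]
time_at_confidence(km_curve)  # → 120
```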
To evaluate the STM times, a set of experiments comparing STM values with real scenarios was created. Each value was compared with an auction simulation: for each day-of-week and hour-in-day relationship, a success counter is incremented by 1 when the time until revocation extracted from the STM is achieved by the simulation, and by 0 otherwise, as detailed in Algorithm 4. The results are shown in the heat map presented in Fig. 8.
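The success counting just described can be sketched as a cell-by-cell comparison between the STM predictions and the survival times observed in the simulation. The names below are assumptions; the real auction simulation stands behind `observed`.

```python
# Sketch of the STM evaluation: a cell counts as a success (1) when the
# simulated instance survived at least the predicted time, and 0 otherwise.

def evaluate_stm(stm, observed):
    """Return a success matrix comparing STM predictions to observations.

    stm[d][h]      -- predicted minutes until revocation for day d, hour h
    observed[d][h] -- minutes the instance actually survived in the simulation
    """
    return [[1 if observed[d][h] >= stm[d][h] else 0
             for h in range(len(stm[d]))]
            for d in range(len(stm))]

# Tiny 2-day x 2-hour example.
stm = [[120, 90], [100, 110]]
observed = [[130, 80], [100, 200]]
evaluate_stm(stm, observed)  # → [[1, 0], [1, 1]]
```

Summing the resulting matrix over all simulations yields the per-cell counts rendered as the heat maps.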
Using the same experimental scenario, 1,362,816 simulations were executed. An example for the memory optimized instance type (r3.large) is illustrated in Fig. 8, achieving success rates around 80% with a standard deviation of 1.56 percentage points. Considering all 78 instances' simulations, the success rate goes up to around 85%. This occurs because some instances with elevated monetary costs have stable price variations, allowing long times until revocation.
An unprotected gap can be observed on Wednesday and Thursday between 12pm and 6pm (18 h), with 7 failures in 13 attempts, showing that another bid strategy is needed to increase the observed time until revocation and achieve better success rates.

Table 6
A new STM' created after applying the recency strategy in the ŜF(time) function (rows: day of the week 1-7, columns: hour in day 0-23)
Day\Hour 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
1 180 120 152 128 110 67 101 101 71 96 99 112 211 240 288 221 117 170 120 111 93 70 80 189
2 100 100 112 80 109 60 65 89 80 120 101 160 120 112 150 150 139 112 130 140 97 112 233 119
3 115 122 117 168 120 98 97 96 93 90 91 91 71 111 93 100 70 151 111 86 89 130 143 101
4 132 131 108 97 106 80 124 90 69 85 124 111 112 111 101 80 89 74 126 88 88 122 132 156
5 144 111 143 120 113 112 113 90 60 60 111 111 93 88 95 105 91 116 112 99 80 78 118 177
6 139 168 120 98 97 96 93 90 91 91 71 127 93 82 115 110 97 153 108 126 82 149 121 226
7 176 91 71 111 93 100 70 151 111 198 93 90 91 91 71 127 93 82 115 110 90 113 116 112
Fig. 9. Heat map of survival success after applying the recency factor in ŜF.

4.3. Updating times until revocation using a recency factor

To achieve better results, a new strategy can be incorporated into the ŜF(time) function, in which more recently generated cases and results receive greater weight. This is one way to deal with changes in the data generating process over the period. In this strategy, the interval used in our experiment was reduced to 15 months, from April 2017 to June 2018, and a new validation period was added to increase the accuracy of our approach.
To refine the STM values and include a recency factor in the times until revocation, the last 4 weeks (July 2018) were used to simulate real times until revocation, creating a new subset of cases and calculating the median times among them to reproduce a recent scenario. With this change, a new STM was generated (STM'), representing new values in the day-of-week and hour-in-day relationship with survival times that consider recent data. The generated STM' is presented in Table 6, with new values compared to Table 5.
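One simple reading of this recency adjustment is that the historical STM value for a day/hour cell is superseded by the median of the survival times observed in the recent validation window. The function and parameter names are assumptions; other weightings between old and recent data are equally possible.

```python
# Illustrative sketch of the recency update for one STM cell: replace the
# historical value with the median of recently observed survival times.

import statistics

def recency_update(stm_value, recent_times):
    """Update a historical STM cell from recent observations."""
    if not recent_times:
        return stm_value    # no recent data: keep the historical value
    # Give recent observations full weight via their median; this is one
    # simple interpretation of the strategy, not the only one.
    return statistics.median(recent_times)

# Example: a cell historically predicted 202 min, but recent weeks saw less.
updated = recency_update(202, [180, 175, 190, 170])  # → 177.5
```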
Better results were achieved with this recency change, as can be observed in Fig. 9. Giving greater weight to more recent results allowed a gain of 12%, reaching a 92% success rate under the same conditions of our experiment. The comparative gains can be observed in Table 7 for the instance types illustrated in Fig. 10.
The new heat map with experimental results for the day-of-week and hour-in-day relationship can be observed in Fig. 9, showing increased success compared to Fig. 8. Green areas mean that the time prediction was achieved in a real auction scenario. Considering that Amazon AWS provides a wide selection of instance types optimized to fit different use cases, giving users the flexibility to choose the appropriate mix of resources for their applications, a set of optimized instance types is used in our experiments, and a subset is presented in Fig. 10.
Table 7
Success rates (in percent) obtained by the experiments in Fig. 10
                  (a & b) (c & d) (e & f) (g & h)
Original strategy 76%     94%     80%     32%
Recency strategy  91%     96%     89%     39%

Fig. 10. Set of heat map results considering distinct instance types.

A compute optimized instance type was used in (a) and (b), a kind of instance used for compute-intensive workloads that delivers cost-effective high performance at a low price-per-compute ratio.
The results of an accelerated computing instance with GPU resources optimized for graphics-intensive
applications can be found in (c) and (d). The results shown in (e) and (f) represent a storage optimized
instance with low latency and very high random I/O performance. Lastly, (g) and (h) represent a general
purpose instance that provides a balance of compute, memory, and network resources. A compilation of
the results obtained by the experiments presented in Fig. 10 can be observed in Table 7.
Observing the results for (g & h), only a small gain was obtained even with the recency strategy, from 32% to 39%. This kind of instance shows considerable price changes over time, requiring a different strategy that considers more recent price change patterns.
Compared to Fig. 10(h), better results were achieved using the last 7 days as the validation period in our recency factor strategy, as shown in Fig. 11.

Fig. 11. New heat map of the m4.xlarge instance considering a recency of 7 days.

Fig. 12. Reducing checkpointing and total cost compared to [20,33,36,37,39–41].

Fig. 13. Reducing total time execution compared to [20,33,36,37,39–41].



5. Comparative study

Given the prediction accuracy presented in [4] (92%), a reliable time until revocation extracted from the STM can be used to define an appropriate fault-tolerant execution plan and parameters, e.g. the checkpoint fault-tolerant mechanism and its intervals, as detailed in Algorithm 4.
Considering an application that requires time TA = 1440 minutes, with a checkpointing overhead time of Ct minutes and a user-defined Tchk = TA − Ct in case of unexpected running delays, we can compare the different approaches in the related works with respect to the number of checkpoints, total costs (Fig. 12), and total execution time (Fig. 13).
Regarding the use of the checkpoint/restore fault-tolerant technique, commonly used when TA is considered long, we can compare different approaches in the literature with respect to total execution time (Fig. 13) and number of checkpoints with total cost (Fig. 12), using Eq. (8), as follows: (1) and (2) with a fixed Tchk = 60 min [37,40,41]; (3) using a fixed Tchk = 60 − Ct, but considering overhead times [36];
(4a) and (4b), which consider each price change as a trigger for a new checkpoint [39], using Tchk = 96 min and Tchk = 145 min respectively, recovered from real records in our price change database, with [20] treating each price change as a failure (if the checkpoint limit is exceeded, the maximum value is used); (5a), (5b) and (5c) with fixed intervals of Tchk = 10 min, Tchk = 15 min and Tchk = 20 min, respectively [33]; and (6), which uses Tchk = 30 min as a fixed interval [41].

Table 8
Results of total time (in minutes) from different scenarios
Strategy TA = 300 TA = 400 TA = 900 TA = 1200 TA = 1440
 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30
1 325 350 450 430 460 580 975 1050 1350 1365 1470 1890 1560 1680 2160
2 325 350 450 430 460 580 975 1050 1350 1365 1470 1890 1560 1680 2160
3 325 350 450 435 470 610 980 1060 1380 1370 1480 1920 1570 1700 2220
4a 600 600 600 1280 800 790 1380 1860 1800 1740 2220 2520 1920 2400 2880
4b 600 600 600 1600 800 790 1800 2350 1800 1985 2710 2520 2165 2890 2880
5a 375 450 750 500 600 1000 1125 1350 2250 1575 1890 3150 1800 2160 3600
5b 400 500 900 530 660 1180 1200 1500 2700 1680 2100 3780 1920 2400 4320
5c 450 600 1200 600 800 1600 1350 1800 3600 1890 2520 5040 2160 2880 5760
6 350 400 600 465 530 790 1050 1200 1800 1470 1680 2520 1680 1920 2880
7a 305 310 330 405 410 430 915 930 990 1280 1300 1380 1460 1480 1560
7b 310 320 360 410 420 460 930 960 1080 1300 1340 1500 1485 1530 1710
7c 305 310 330 405 410 430 905 910 930 1265 1270 1290 1445 1450 1470
7d 305 310 330 405 410 430 915 930 990 1280 1300 1380 1460 1480 1560
7e 305 310 330 405 410 430 910 920 960 1280 1300 1380 1460 1480 1560
7f 305 310 330 405 410 430 905 910 930 1270 1280 1320 1450 1460 1500
In (7) we present our approach's results with respect to different TS scenarios: (7a) TS = TA; (7b) half the total time (TS = TA/2); (7c) evaluating a new time after each checkpoint, using the same TS = TA/2; (7d) a value slightly below the total time (TS = TA − overhead time); (7e) a value slightly above the total time (TS = TA + overhead time); and (7f) with TS set to double TA.
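The idea behind the (7x) variants can be sketched as deriving a checkpoint schedule from the predicted survival time TS: save once per predicted survival window, so at most one window of work is lost on revocation. The names and the optional safety margin below are assumptions, and this sketch does not try to reproduce the exact per-scenario counts of Table 10.

```python
# Hedged sketch: checkpoint count driven by a predicted survival time TS.

def checkpoints_from_ts(ta: int, ts: int, margin: int = 0) -> int:
    """Number of checkpoints when saving once per predicted survival window.

    ta     -- application time in minutes
    ts     -- predicted survival time in minutes
    margin -- optional safety margin subtracted from TS before each save
    """
    window = max(ts - margin, 1)   # checkpoint shortly before TS elapses
    if window >= ta:
        return 1                   # one safety checkpoint covers the run
    return -(-ta // window)        # ceil(ta / window) checkpoints
```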
Since some related works consider only compute-intensive applications, the cost of the checkpoint is
low and short intervals do not significantly affect the execution time. Experimental results for different
TA and Ct are presented in Tables 8–10.

Table 8 illustrates the total time results. Although different behaviors are observed, in general the agents perform well when using TS and the calculated time until revocation to define the checkpoint interval, this being the best choice compared to the other strategies, whether Ct represents shorter or longer CPU- or memory-intensive applications. The application's total execution time is reflected in the total cost, so a strategy that decreases this time helps to reduce monetary costs for cloud users.
The number of checkpoints, CA, and the length of time for each checkpoint, Ct , impact application
execution time. Each checkpoint overhead time Ct increases the total used time UT as follows:
    UT = Σ_{i=1..CA} Ct,i + TA                                    (9)
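Eq. (9) can be worked through directly: each checkpoint adds one overhead Ct to the application time TA. The sketch below assumes a constant Ct and approximates the checkpoint count for a fixed interval Tchk with floor division; names are ours, and the per-strategy counts actually used are those of Table 10.

```python
# Worked form of Eq. (9): UT = sum of per-checkpoint overheads + TA.

def used_time(ta: int, ct: int, n_checkpoints: int) -> int:
    """Total used time with a constant per-checkpoint overhead ct."""
    return n_checkpoints * ct + ta

def fixed_interval_checkpoints(ta: int, tchk: int) -> int:
    """Checkpoints taken when saving every `tchk` minutes over `ta` minutes."""
    return ta // tchk

# Strategy (1) from Table 8 with Tchk = 60, TA = 300 and Ct = 5:
n = fixed_interval_checkpoints(300, 60)   # 5 checkpoints
total = used_time(300, 5, n)              # → 325 minutes, as in Table 8
```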
Table 8 demonstrates that a gain of 49.17% in total execution time can be observed compared to (4b) when short TA and Ct are used, and 72.5% when Ct increases. Moreover, a 74.48% time reduction was achieved for long Ct compared to (5c). Furthermore, the worst case previously observed (Tchk = 10) is a better strategy than a price-change-triggered strategy when the applications are neither compute nor memory intensive and TA is short. Similarly, Table 10 presents the number of checkpoints used by the different strategies. These values affect the total execution time presented in Table 8 and the total cost per instance, as observed in Table 9.
All collected data and source code are available at the project website and can be used to reproduce
our experiments and test other scenarios not contemplated in this paper.

Table 9
Results of total cost (in US dollars) charged in different scenarios
Strategy TA = 300 TA = 400 TA = 900 TA = 1200 TA = 1440
 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30
1 30.44 32.78 42.15 40.28 43.09 54.33 91.33 98.35 126.45 127.86 137.69 177.03 146.12 157.36 202.32
2 30.44 32.78 42.15 40.28 43.09 54.33 91.33 98.35 126.45 127.86 137.69 177.03 146.12 157.36 202.32
3 30.44 32.78 42.15 40.74 44.02 57.14 91.79 99.29 129.26 128.32 138.63 179.84 147.06 159.23 207.94
4a 56.20 56.20 56.20 119.89 74.93 74.00 129.26 174.22 168.60 162.98 207.94 263.04 179.84 224.80 269.76
4b 56.20 56.20 56.20 149.87 74.93 74.00 168.60 220.12 168.60 185.93 253.84 263.04 202.79 270.70 269.76
5a 35.12 42.15 70.25 46.83 56.20 93.67 105.38 126.45 210.75 147.53 177.03 295.05 168.60 202.32 337.20
5b 37.47 46.83 84.30 49.64 61.82 110.53 112.40 140.50 252.90 157.36 196.70 354.06 179.84 224.80 404.64
5c 42.15 56.20 112.40 56.20 74.93 149.87 126.45 168.60 337.20 177.03 236.04 472.08 202.32 269.76 539.52
6 32.78 37.47 56.20 43.55 49.64 74.00 98.35 112.40 168.60 137.69 157.36 236.04 157.36 179.84 269.76
7a 28.57 29.04 30.91 37.94 38.40 40.28 85.70 87.11 92.73 119.89 121.77 129.26 136.75 138.63 146.12
7b 29.04 29.97 33.72 38.40 39.34 43.09 87.11 89.92 101.16 121.77 125.51 140.50 139.09 143.31 160.17
7c 28.57 29.04 30.91 37.94 38.40 40.28 84.77 85.24 87.11 118.49 118.96 120.83 135.35 135.82 137.69
7d 28.57 29.04 30.91 37.94 38.40 40.28 85.70 87.11 92.73 119.89 121.77 129.26 136.75 138.63 146.12
7e 28.57 29.04 30.91 37.94 38.40 40.28 85.24 86.17 89.92 119.89 121.77 129.26 136.75 138.63 146.12
7f 28.57 29.04 30.91 37.94 38.40 40.28 84.77 85.24 87.11 118.96 119.89 123.64 135.82 136.75 140.50

Table 10
Results of checkpoint counts calculated from different scenarios
Strategy TA = 300 TA = 400 TA = 900 TA = 1200 TA = 1440
 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30 Ct=5 Ct=10 Ct=30
1 5 5 5 6 6 6 15 15 15 21 21 21 24 24 24
2 5 5 5 6 6 6 15 15 15 21 21 21 24 24 24
3 5 5 5 7 7 7 16 16 16 22 22 22 26 26 26
4a 60 30 10 96 40 13 96 96 30 96 96 42 96 96 48
4b 60 30 10 160 40 13 180 145 30 145 145 42 145 145 48
5a 15 15 15 20 20 20 45 45 45 63 63 63 72 72 72
5b 20 20 20 26 26 26 60 60 60 84 84 84 96 96 96
5c 30 30 30 40 40 40 90 90 90 126 126 126 144 144 144
6 10 10 10 13 13 13 30 30 30 42 42 42 48 48 48
7a 1 1 1 1 1 1 3 3 3 4 4 4 4 4 4
7b 2 2 2 2 2 2 6 6 6 8 8 8 9 9 9
7c 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
7d 1 1 1 1 1 1 3 3 3 4 4 4 4 4 4
7e 1 1 1 1 1 1 2 2 2 4 4 4 4 4 4
7f 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2

6. Conclusion

A resilient multi-agent architecture was presented in this paper to provide an efficient way to use spot instances to run users' applications, achieving better levels of reliability and reducing total execution time and costs, with a focus on an important element of a resilient environment: fault-tolerant behavior. The architecture uses a prediction approach to support agent decisions in a dynamic scenario. To the best of our knowledge, there is no published work that combines intelligent agents with CBR and survival analysis to predict failure events and an FT mechanism based on adaptive checkpointing to increase the reliability of user applications on spot instances.
We believe the proposed approach can be used to define fault-tolerant techniques and their respective parameters. Moreover, to obtain longer survival times, new bid strategies can be introduced, e.g. using bid values defined by a percent increase over actual instance prices. Using real price traces of spot instances, our experiments have confirmed the accuracy of the proposed model.

We identified instances for which our methodology yielded poor success rates. For these instances, the data generating process appears to change more quickly than for other instances. To capture these changes we implemented a recency factor, which gives greater weight to more recent observations and increases success rates.
As future work, the proposed architecture can be extended to deal with multiple cloud regions from the same provider, considering data transfer costs between different regions. In addition, a multiple-provider approach can be used, extending the number of pricing models according to each cloud provider. Also, other strategies can be explored to exploit more recent data, such as comparing recent instance price changes to update fault-tolerant parameters. To increase reliability, a combination of fault-tolerant plans can be used, e.g. a multi-strategic fault-tolerant framework.

Acknowledgments

This work was supported in part by the Brazilian Coordination for the Improvement of Higher Education Personnel (CAPES) through grant number 1441250. Prof. Célia Ghedini Ralha thanks the support received from the Brazilian National Council for Scientific and Technological Development (CNPq) for research productivity in the Computer Science area through grant number 311301/2018-5.
References
[1] A. Aamodt and E. Plaza, Case-based reasoning: Foundational issues, methodological variations, and system approaches, AI Communications 7(1) (March 1994), 39–59.


[2] M. Al-Ayyoub, Y. Jararweh, M. Daraghmeh and Q. Althebyan, Multi-agent based dynamic resource provisioning and
monitoring for cloud computing systems infrastructure, Cluster Computing 18(2) (June 2015), 919–932.
[3] M. Al-Kuwaiti, N. Kyriakopoulos and S. Hussein, A comparative analysis of network dependability, fault-tolerance,
reliability, security, and survivability, IEEE Communications Surveys Tutorials 11(2) (June 2009), 106–124.
[4] J.P. Araujo Neto, D.M. Pianto and C.G. Ralha, A Prediction Approach to Define Checkpoint Intervals in Spot Instances,
in: Proceedings of the 11th International Conference on Cloud Computing, CLOUD 2018, SCF 2018, Volume 10967,
Springer, Seattle, WA, USA, 2018, pp. 84–93.


[5] J.P. Araujo Neto, D.M. Pianto and C.G. Ralha, A resilient agent-based architecture for efficient usage of transient servers
in cloud computing, in: 2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom),
2018, pp. 218–225.
[6] J. Bajo, F. De la Prieta, J.M. Corchado and S. Rodríguez, A low-level resource allocation in an agent-based cloud
computing platform, Applied Soft Computing 48 (November 2016), 716–728.


[7] R. Buyya, C.S. Yeo, S. Venugopal, J. Broberg and I. Brandic, Cloud computing and emerging IT platforms: Vision, hype,
and reality for delivering computing as the 5th utility, Future Generation Computer Systems 25(6) (June 2009), 599–616.
[8] W. Cirne, F. Brasileiro, J. Sauvé, N. Andrade, D. Paranhos, E. Santos-neto and R. Medeiros, Grid computing for bag of
tasks applications, in: 3rd IFIP Conference on E-Commerce, E-Business and EGovernment, 2003.
[9] C. Colman, C. Develder and M. Tornatore, A survey on resiliency techniques in cloud computing infrastructures and
applications, IEEE Communications Surveys & Tutorials 18(3) (2016).
[10] D.R. Cox, Analysis of survival data, 1st edition, Routledge, 1984.
[11] R. Davis, B. Buchanan and E. Shortliffe, Production rules as a representation for a knowledge-based consultation pro-
gram, Artificial Intelligence 8(1) (1977), 15–45.
[12] F. De la Prieta, S. Rodríguez, J. Bajo and J.M. Corchado, +cloud: A virtual organization of multiagent system for
resource allocation into a cloud computing environment, in: Transactions on Computational Collective Intelligence XV,
N.T. Nguyen, R. Kowalczyk, J.M. Corchado and J. Bajo, eds, Springer Berlin Heidelberg, Berlin, Heidelberg, 2014, pp.
164–181.
[13] E.N.M. Elnozahy, L. Alvisi, Y.-M. Wang and D.B. Johnson, A survey of rollback-recovery protocols in message-passing
systems, ACM Computing Surveys 34(3) (September 2002), 375–408.
[14] A. Iosup, S. Ostermann, M.N. Yigitbasi, R. Prodan, T. Fahringer and D.H.J. Epema, Performance analysis of cloud
computing services for many-tasks scientific computing, IEEE Transactions on Parallel and Distributed Systems 22(6)
(June 2011), 931–945.

[15] A. Iosup, O. Sonmez, S. Anoep and D. Epema, The performance of bags-of-tasks in large-scale distributed systems,
in: Proceedings of the 17th International Symposium on High Performance Distributed Computing – HPDC ’08, ACM
Press, 2008, pp. 97–108.
[16] B. Javadi, R.K. Thulasiram and R. Buyya, Characterizing spot price dynamics in public cloud environments, Future
Generation Computer Systems 29(4) (June 2013), 988–999.
[17] B. Javadi, R.K. Thulasiramy and R. Buyya, Statistical modeling of spot instance prices in public cloud environments, in:
2011 Fourth IEEE International Conference on Utility and Cloud Computing, IEEE, 2011, pp. 219–228.
[18] A. Jula, E. Sundararajan and Z. Othman, Cloud computing service composition: A systematic literature review, Expert
Systems with Applications 41(8) (June 2014), 3809–3824.
[19] K.D. Kumar and E. Umamaheswari, Prediction methods for effective resource provisioning in cloud computing: A
survey, Multiagent and Grid Systems 14(3) (September 2018), 283–305.
[20] K. Lee and M. Son, DeepSpotCloud: Leveraging cross-region GPU spot instances for deep learning, in: 2017 IEEE 10th
International Conference on Cloud Computing (CLOUD), IEEE, 2017, pp. 98–105.
[21] W.Q. Meeker and L.A. Escobar, Statistical methods for reliability data, Wiley, New York, 1998.

[22] P. Mell and T. Grance, The NIST definition of cloud computing, National Institute of Standards and Technology 53(6) (2009).
[23] B. Meroufel and G. Belalem, Adaptive checkpointing with reliable storage in cloud environment, Multiagent and Grid Systems 13(3) (September 2017), 253–268.
[24] R.G. Miller, Jr, Survival analysis, 2nd edition, John Wiley & Sons, 2011.
[25] A.-M. Oprescu and T. Kielmann, Bag-of-Tasks Scheduling under Budget Constraints, in: 2010 IEEE Second Interna-
tional Conference on Cloud Computing Technology and Science, IEEE, 2010, pp. 351–359.
[26] C.G. Ralha, A.H.D. Mendes, L.A. Laranjeira, A.P.F. Araújo and A.C.M.A. Melo, Multiagent system for dynamic re-
source provisioning in cloud computing platforms, Future Generation Comp Syst 94 (May 2019), 80–96.
[27] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 3rd edition, Prentice Hall Press, Upper Saddle
River, NJ, USA, 2010.
[28] Amazon Web Services, Amazon EC2 spot instances, https://aws.amazon.com/ec2/spot, accessed in January 2018.
[29] P. Sharma, D. Irwin and P. Shenoy, Portfolio-driven Resource Management for Transient Cloud Servers, Proceedings of
the ACM on Measurement and Analysis of Computing Systems 1(1) (June 2017), 5:1–5:23.
[30] S. Shastri, A. Rizk and D. Irwin, Transient guarantees: Maximizing the value of idle cloud capacity, in: SC’16: Proceed-
ings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016, pp.
992–1002.
[31] U. Siddiqui, G.A. Tahir, A.U. Rehman, Z. Ali, R.U. Rasool and P. Bloodsworth, Elastic jade: Dynamically scalable
multi agents using cloud resources, in: 2012 Second International Conference on Cloud and Green Computing, 2012,
pp. 167–172.
[32] A. Stahl, Defining similarity measures: Top-down vs. bottom-up, in: Advances in Case-Based Reasoning, S. Craw and
A. Preece, eds, Springer Berlin Heidelberg, Berlin, Heidelberg, 2002, pp. 406–420.
[33] S. Subramanya, T. Guo, P. Sharma, D. Irwin and P. Shenoy, Spoton: A batch computing service for the spot market, in:
Proceedings of the Sixth ACM Symposium on Cloud Computing, ACM, 2015, pp. 329–341.
[34] X. Tang, X. Yang, G. Liao and X. Zhu, A shared cache-aware task scheduling strategy for multi-core systems, Journal
of Intelligent & Fuzzy Systems 31(2) (July 2016), 1079–1088.
[35] P.T. Vlacheas, V. Stavroulaki, P. Demestichas, S. Cadzow, S. Gorniak and D. Ikonomou, Ontology and taxonomies of
resilience, in: Tech Rep, European Network and Information Security Agency, 2011.
[36] W. Voorsluys and R. Buyya, Reliable Provisioning of Spot Instances for Compute-intensive Applications, in: IEEE 26th
International Conference on Adv Information Networking and Applications, 2012, pp. 542–549.
[37] W. Voorsluys, S. Garg and R. Buyya, Provisioning spot market cloud resources to create cost-effective virtual clusters,
in: International Conference on Algorithms and Architectures for Parallel Processing, Springer, 2011, pp. 395–408.
[38] Y. Yang, X. Peng and X. Wan, Security-aware data replica selection strategy for Bag-of-Tasks application in cloud
computing, Journal of High Speed Networks 21(4) (November 2015), 299–311.
[39] S. Yi, A. Andrzejak and D. Kondo, Monetary cost-aware checkpointing and migration on amazon cloud spot instances,
IEEE Transactions on Services Computing 5(4) (2012), 512–524.
[40] S. Yi, D. Kondo and A. Andrzejak, Reducing costs of spot instances via checkpointing in the amazon elastic compute
cloud, in: 2010 IEEE 3rd International Conference on Cloud Computing, 2010, pp. 236–243.
[41] J. Zhou, Y. Zhang and W. Wong, Fault tolerant stencil computation on cloud-based gpu spot instances, IEEE Transactions
on Cloud Computing (2018).

Authors’ Bios

José Pergentino de Araújo Neto holds a bachelor’s degree in Information Systems from the Faculdades
Integradas de Patos (2005) and a Master’s degree in Software Engineering from the Center for Advanced
Studies and Systems in Recife, Brazil (2009). He is currently a PhD student in the Informatics Grad-
uate Program at the Department of Computer Science, University of Brasília, Brazil. He is a lecturer
at the Institute of Higher Education of Brasília. He has experience in computer science with emphasis
on information systems, software architecture, distributed systems, artificial intelligence and multiagent
systems.

Donald Matthew Pianto (dpianto@gmail.com) holds a Ph.D. in Physics from the University of Illinois
at Urbana Champaign, USA (1995), a PhD in Computational Mathematics from the Federal University of Pernambuco, Brazil (2008), and a Bachelor's degree in Physics from SUNY at Stony Brook University,
USA (1994). He is currently an Adjunct Professor at Statistics Department, University of Brasília, Brazil.
He is a member of the Graduate Program in Applied Computing in the Department of Computer Science
and Economics of the Public Sector in the Department of Economics, both programs in the University
of Brasília. He has experience in the area of probability and statistics with emphasis in econometrics,
statistical analysis of neural data, statistical analysis of images, and estimation of causal effects.

Célia Ghedini Ralha (ghedini@unb.br) holds a Ph.D. in Computer Science from Leeds University, Eng-
land (1996) and a M.Sc. in Informatics from Aeronautics Institute of Technology, Brazil (1990). She is
an Associate Professor at the Department of Computer Science, University of Brasília, Brazil. She has
participated in many research projects and published several papers in international journals and confer-
ences. She is a member of the Informatics Graduate Program at the Department of Computer Science,
University of Brasília and a senior member of the Brazilian Computer Society. She receives a research
productivity grant in the Computer Science area from the Brazilian National Council for Scientific and
Technological Development (CNPq). Her current research interests include design of intelligent and
knowledge-based systems, multiagent systems, agent-based simulation and multiagent planning.