1 - Availability and Reliability Modeling of VM Migration As Rejuvenation On A System Under Varying Workload

Software Quality Journal (2020) 28:59–83
https://doi.org/10.1007/s11219-019-09474-1
Availability and reliability modeling of VM migration as rejuvenation on a system under varying workload
Matheus Torquato1,2 · Paulo Maciel3 · Marco Vieira1
Published online: 4 March 2020

© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract
Cloud computing serves as a platform for diverse types of applications, from low-priority
to critical. Some of these applications require high levels of system availability and reli-
ability. Developing methods for cloud computing availability and reliability evaluation is
of utmost importance. In this paper, we propose a set of models for availability and reli-
ability evaluation of a virtualized system with VMM software rejuvenation enabled by
VM migration scheduling. To improve the models' fidelity to a real environment, we added
a specific sub-model to represent the aspects of workload variation. Our main goal is to
find the proper VM migration schedule to maximize system availability and to analyze the
impact of such a schedule on the system reliability. Our results include the following: (1)
the appropriate rejuvenation schedule to maximize availability in each proposed scenario;
(2) downtime reduction when comparing the system with and without rejuvenation; and
(3) reliability analysis of different scenarios of workload variation considering the proper
rejuvenation schedules. The evaluated scenarios range from systems without high workload
demand (peakDuration = 0 h per day) to systems with only high workload demand
(peakDuration = 24 h per day). Our results show a significant improvement in availability
and reliability due to VM migration scheduling. In scenarios with a heavy workload, the
downtime avoidance caused by software rejuvenation surpasses 3.39 days, and the reliability
gain passes 86%.
Keywords Software aging and rejuvenation · VM migration · Availability · Reliability · Cloud computing
Acronyms
MTTF - Mean Time To Failure
MTTR - Mean Time To Repair
SPN - Stochastic Petri Net
SRN - Stochastic Reward Net
VM - Virtual Machine
VMM - Virtual Machine Monitor
Matheus Torquato
mdmelo@dei.uc.pt
Extended author information available on the last page of the article.

1 Introduction
Cloud computing serves as a platform for several types of applications, ranging from low-
priority to critical. Thus, research on the improvement of cloud availability and reliability is
of utmost importance. One of the issues in this context is software aging. Software aging is a
cumulative process that affects long-running execution. Software aging can lead the system
to hangs or failures due to the accumulated effects of bugs. Several papers have reported software
aging evidence in the cloud-virtualized components (Araujo et al. 2011a, b; Langner and
Andrzejak 2013; de Melo et al. 2017). For instance, the paper (Matos et al. 2012b) reported
software aging issues in the Virtual Machine Monitor (VMM), one of the core components
of the cloud virtualized environment.
In the software aging context, some previous papers reported the relation of workload
intensity and software aging accumulation (Vaidyanathan and Trivedi 1999; Bovenzi et al.
2011). These papers' results show that the higher the workload, the faster software
aging accumulates. So it is essential that software aging management techniques also
consider workload intensity.
Software rejuvenation is used to counteract software aging (Huang et al. 1995a). Soft-
ware rejuvenation actions usually comprise application restart or operating system reboot.
The goal of software rejuvenation is to clear accumulated software aging status, bringing
the software to a stable state (i.e., without software aging accumulation effects). Regarding
VMM software aging specifically, Torquato et al. (2017, 2018a) present experimental
results to show the effectiveness of using Virtual Machine (VM) migration as support for
VMM software rejuvenation. The technique consists of moving the VM from a machine
with VMM software aging status to a machine without VMM aging accumulation. After
VM migration completion, the system performs a software rejuvenation action in the VMM
with accumulated software aging. More details about the considered rejuvenation technique
are in Section 3.2.
This paper proposes a set of models for availability and reliability evaluation of a vir-
tualized system with VMM software rejuvenation enabled by VM migration scheduling.
The models also consider the workload variation. Our main goal is to find the VM migra-
tion schedules which maximize the system availability and evaluate their impact on system
reliability, taking account of the workload variation aspect.
The considered system architecture has three components: Main Node—which runs the
VM, Standby Node—target for VM migration and a VM, which runs the desired application
(a detailed description of system architecture and operational behavior is in Section 3).
We adopted Stochastic Reward Nets (SRN) to design the models (Ciardo et al. 1993;
Muppala et al. 1994). The use of SRNs allows the computation of reward rates in Stochastic
Petri Nets (SPN). We developed two SRN models. The first model is an availability model.
We use it to compute the optimal rejuvenation schedule to maximize system steady-state
availability. We also present an annual downtime comparison between the system with and
without rejuvenation. Our second model is a reliability model. We use it to quantify the
probability of service continuity, observing the optimal rejuvenation schedules obtained
from availability evaluation. Therefore, we can notice the impact of applying VM migration
as rejuvenation on system reliability.
In our evaluations, we assume that the workload variation obeys a 1-day cycle. So
we use a variable peakDuration to represent the duration of high workload demand in
hours. Moreover, we also assume that the period of low workload demand is equal to
24 − peakDuration. In the evaluations, we vary the peakDuration from 0 to 24 h with
a 1-h step. For availability evaluation, we change the rejuvenation schedule from 1 to 168 h
(a week) with a 1-h step. In the reliability evaluation, we vary the time from 0 to 720 h using
an 80-h step. A brief discussion about the possible limitations of this evaluation strategy is in
Section 3.4.
Our results show that software rejuvenation brings significant improvements
in system availability and reliability. From the availability perspective, the use of rejuvenation
policies in scenarios with a heavier workload (i.e., peakDuration = 24) decreases the
annual downtime by more than 3 days when compared with the system without rejuvenation.
We also observed that system availability degrades more in systems with
a higher workload. From the reliability point of view, our results show that the improvement
due to rejuvenation is more significant in systems with a high workload. In the scenario
with 24 h of peak workload demand, in a 160-h interval, the system reliability improvement
caused by rejuvenation is 86%. That is, the probability that the system remains failure-free
from the start of its operation up to 160 h is 86% higher when applying
software rejuvenation. Our detailed results are in Section 5.
The highlights of this paper are as follows:
– Comprehensive dependability modeling comprising availability and reliability of a
system with rejuvenation based on VM migration scheduling;
– The models cover the impact of workload variation on system availability and reliabil-
ity;
– A considerable set of results, including the following:
– Proper VM migration schedules to maximize system availability;
– Downtime comparison between the system with and without rejuvenation;
– Analysis of the VM migration scheduling impact on system reliability.
This paper is an extension of our previous work (Torquato and Vieira 2018). We expand
our contributions as follows: (i) We reduced the availability model complexity, and now we
can compute the steady-state and transient metrics in a monolithic model; (ii) We divided
the aging accumulation into more phases to improve the representation of the workload
variation behavior; (iii) We added new sections for reliability modeling and evaluation; and
(iv) We modified the proposed models to remove the guard functions, improving models’
readability.
To the best of our knowledge, this is the first paper to tackle both perspectives, availability,
and reliability, on a system with VM migration scheduling as VMM software rejuvena-
tion subject to varying workload. This work is a step towards more realistic scenarios with
different levels of workload demand. We believe the approach presented in this paper can
be extended to other situations and also serve as support for the virtualized environments
decision-making process.
The rest of this paper is organized as follows. Section 2 presents basic concepts of
dependability modeling, and software aging and rejuvenation. Section 3 has the assump-
tions of our work. These assumptions include system architecture, considered rejuvenation
technique, failure and repair behavior, and workload variation behavior. Section 4 has our
proposed models. Section 5 has the obtained results and discussion. Section 6 presents the
comparison of our paper with related works. Finally, Section 7 has our conclusions and future
research directions. Appendix A presents the approach used in our deterministic transitions
for enabling memory in the SRN analysis through a technique based on the age memory
policy. Appendix B presents a brief description of the TimeNet tool.
2 Background
This section presents some relevant concepts used in our research. Section 2.1 presents
concepts of dependability modeling. Section 2.2 introduces the fundamentals of software
aging and rejuvenation.
2.1 Dependability modeling
Dependability is an umbrella term which comprises several system attributes like availabil-
ity, reliability, safety, maintainability, and integrity (Avizienis et al. 2004). In the context of
this paper, we are dealing only with availability and reliability. Availability is the readiness
of correct service, while reliability is the continuity of correct service (Avizienis et al.
2001).
Events that affect system dependability (e.g., outages, crashes and failures) are, usually,
unexpected. Therefore, it is hard to set up an approach to system dependability evaluation
through measurements. Thus, many researchers use models for dependability evaluation.
There are several types of models which can be used for dependability evaluation. A com-
prehensive guide for dependability modeling is in the book (Trivedi and Bobbio 2017). In
this paper, we use SRN models for our evaluations.
SRN models allow reward measure computation. SRN models can represent system
interactions and dependencies, which make them appropriate and powerful for depend-
ability evaluation (Malhotra and Trivedi 1994). SRN models are being extensively used
for dependability evaluation (Mural et al. 1999; Constantinescu 2005; Trivedi et al. 1993).
The paper (Malhotra and Trivedi 1995) presents more details about the use of SRN for
dependability evaluation.
2.2 Software aging and rejuvenation
Software testing is, usually, a necessary task before software release to the public. Soft-
ware testing involves the verification of software status under different conditions of system
configuration and workloads (Royce 1970). However, even with extensive and diligent software
testing procedures, some errors and bugs may remain undetected (Grottke and Trivedi
2007). Some bugs are only activated after a long period of software execution. Therefore,
identifying the root causes of such bugs is a complicated task. These types of bugs are known as
Heisenbugs (Grottke and Trivedi 2005).
Aging-related bugs are similar to Heisenbugs. However, aging-related bugs arise under
specific conditions (e.g., lack of computational resources) that are hard to forecast or
reproduce (Vaidyanathan and Trivedi 2001). Software aging is the consequence of the accumulated
effects of aging-related bugs (Parnas 1994). Software aging leads the system to a
degraded state in which the system is more likely to suffer hangs or failures. Software aging
accumulation is directly affected by the workload submitted to the system (Cotroneo et al.
2014; Vaidyanathan and Trivedi 1999; Bovenzi et al. 2011).
The paper (Huang et al. 1995b) presents the first definitions of software rejuvenation.
The authors define software rejuvenation as a proactive technique to avoid aging effects
from reaching critical levels. The rejuvenation actions rely on restarting an application to
conduct it to a clean state, without software aging effects accumulation.
Previous studies present analytical modeling approaches to evaluate the impacts of software aging and
rejuvenation (Trivedi et al. 2000; Vaidyanathan and Trivedi 2005). Some papers propose
scheduling of software rejuvenation actions (Melo et al. 2013a, b; Torquato et al. 2018b) to
determine when to perform rejuvenation actions to maximize system availability.
3 Assumptions
This section has our assumptions for the proposed modeling and evaluation.
3.1 System architecture
The system architecture has three components: Main Node, VM, and Standby Node. The
Main Node runs the VM. The VM runs the desired application. The Standby Node is a
machine used for VM arrival after migration. The Standby Node turns into the Main Node
after VM migration completion. There is also a remote storage volume which enables VM
live migration. A network switch connects all the components. Figure 1 depicts the system
architecture.
In this paper, we only consider the software aging effects in the VMM component. In the
virtualized environments, the Main Node runs the VMM. So, in our models, only the Main
Node and its hosted VM may suffer problems due to software aging. To overcome software
aging problems, we propose a software rejuvenation strategy based on VM migration. The
idea behind the rejuvenation technique (explained in detail in the next section) is to move
the VM from the physical machine (i.e., Main Node) with software aging problems to a
physical machine without aging accumulation (i.e., Standby Node). So we apply the Virtual
Machine migration to perform this VM movement. We consider the remote storage volume
to enable VM live migration (Clark et al. 2005). VM live migration allows the VM move-
ment with reduced downtime. The remote storage volume persists the VM’s virtual disk.
So, while live migrating, the system only has to transfer the memory pages and processor
state instead of all the VM components (processor state, memory, and virtual disk).
Fig. 1 System architecture

3.2 VMM rejuvenation technique
The VMM rejuvenation technique considered in this paper has four stages. In the first stage,
the Main Node and VM run without software aging accumulation. In this stage, the Standby
Node is running. As Standby Node is not running Virtual Machines, the Virtual Machine
Monitor (VMM) does not suffer from software aging accumulation. As time passes, the
VMM starts to experience software aging accumulation. This event leads to the second
stage. In this stage, the Main Node presents symptoms of software aging accumulation.
These symptoms of software aging are usually detected as performance degradation, and
system hangs. When the rejuvenation interval is reached, the VM migration is triggered.
It is essential to highlight that, in this strategy, also known as time-based rejuvenation, the
mechanism of VM migration submission is unaware of the system status. Therefore, the
system can perform unnecessary VM migrations.
The next stage is the VM migration. Assuming that Standby Node is running, the VM
is moved from Main Node to Standby Node. In the last stage, the Standby Node takes the
role of Main Node. In this stage, we also perform software rejuvenation on previous Main
Node to clear the software aging accumulation. After software rejuvenation, the previous
Main Node machine can receive VM migrations in the environment. Figure 2 summarizes
the approach.
Fig. 2 VMM rejuvenation technique
3.3 Failure and repair behaviors
The system becomes non-operational after a VM internal failure (application or guest OS
failure). After a VM failure, the system becomes operational again once the VM is repaired.
As the VM depends on the Main Node to perform its operations, failures on the Main Node affect VM
availability. We consider two types of Main Node failure: aging and non-aging failures. By
non-aging failure, we mean hardware or host OS failures. The software aging accumulation
behavior in the Main Node is related to the Virtual Machine Monitor (VMM). After a
Main Node failure, the system becomes operational again through a two-step recovery: repair the Main
Node and reboot (or instantiate) the VM. Failures on the Standby Node do not represent a system
failure, but they prevent VM migration.
The models do not cover the following failures: (i) network failures; (ii) VM
migration failures; (iii) failures in the Remote Storage Volume.
3.4 Workload variation behavior
In this paper, we are considering the workload variation with two states: (i) peak, when the
workload demand is high, and (ii) off-peak, when the workload demand is low. We also
assume a 1-day cycle of workload variation. So we use the variable peakDuration
to denote the number of hours of high workload demand in a day. Figure 3 summarizes the
considered workload variation cycle.
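The two-state cycle can be written as a small helper. This is only an illustrative sketch: the function name `workload_state` and the convention that every 24-h cycle starts with the peak period are assumptions, not part of the model description.

```python
def workload_state(t_hours: float, peak_duration: float) -> str:
    """Return "peak" or "off-peak" for time t_hours in a 24-h cycle.

    Assumption: every cycle starts with the peak period, which lasts
    peak_duration hours, followed by 24 - peak_duration hours of off-peak.
    """
    if not 0 <= peak_duration <= 24:
        raise ValueError("peak_duration must be between 0 and 24 hours")
    position_in_day = t_hours % 24
    return "peak" if position_in_day < peak_duration else "off-peak"
```

With `peak_duration = 0` the helper always returns "off-peak", and with `peak_duration = 24` it always returns "peak", matching the two extreme scenarios of the evaluation.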
We highlight that the two-level workload and the peakDuration measured in hours may
not suffice for complex systems. For example, there are systems with more than two
workload levels. In such cases, our modeling framework has to be extended to cover the intermediate
levels of workload intensity. Besides that, there are systems with bursty workloads, whose bursts
usually last less than an hour. Alternatively, some systems have multiple workload bursts
in a day. For such cases, it is possible to modify the variables of the duration of peak and
off-peak periods to approximate the actual system workload profile. However, the idea of a
two-level workload may be enough to represent systems with more predictable workload
intensities. For example, some commercial systems are expected to receive a high workload
demand during some hours of the day. The same happens to systems inside a university,
which usually face high workload demand during the working hours of the day.
Software aging accumulation is faster when the workload is high and slower when the
workload is low. Software aging accumulation is usually modeled using hypo-exponential
distributions (Machida et al. 2013; Melo et al. 2013b). Therefore, we used a four-phase
Erlang model, as presented in Torquato and Vieira (2018) and Torquato et al. (2018b), to
describe the aging accumulation.
However, four phases may not be enough to capture the dynamics of peak and off-peak
workload variation. So we divided each phase into ten sub-phases. The software aging rate
of each sub-phase depends on the workload state: aging is faster when the workload
is in the peak state, and slower when it is in the off-peak state. An exponential
transition represents each sub-phase.
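To make the effect of the workload-dependent sub-phases concrete, the following Monte Carlo sketch samples the time needed to complete all forty sub-phases (four phases of ten sub-phases). It is not the SRN solution: it ignores non-aging failures and migrations, assumes each cycle starts with the peak period, and takes the mean sub-phase delays `mean_high` and `mean_low` as hypothetical inputs, since the model parameters are not reproduced here. Because each sub-phase is exponential, resampling the delay when the workload switches is exact (memoryless property).

```python
import random

def time_to_aging_failure(peak_duration, mean_high, mean_low, rng, stages=40):
    """Sample one time (hours) until all aging sub-phases have completed.

    mean_high / mean_low: mean sub-phase delay in peak / off-peak state
    (hypothetical values supplied by the caller).
    """
    duration = {True: float(peak_duration), False: 24.0 - peak_duration}
    in_peak = duration[True] > 0
    remaining = duration[in_peak]        # time left in the current workload state
    t = 0.0
    for _ in range(stages):
        while True:
            mean = mean_high if in_peak else mean_low
            dt = rng.expovariate(1.0 / mean)
            if dt < remaining:           # sub-phase completes in this state
                t += dt
                remaining -= dt
                break
            # The workload switches first; memorylessness lets us resample.
            t += remaining
            in_peak = not in_peak
            if duration[in_peak] == 0:   # skip an empty period (0 h or 24 h of peak)
                in_peak = not in_peak
            remaining = duration[in_peak]
    return t

def mean_time_to_aging_failure(peak_duration, mean_high, mean_low,
                               samples=2000, seed=1):
    """Average the sampled times over many independent runs."""
    rng = random.Random(seed)
    total = sum(time_to_aging_failure(peak_duration, mean_high, mean_low, rng)
                for _ in range(samples))
    return total / samples
```

For instance, with hypothetical mean delays of 1 h (peak) and 4 h (off-peak), the estimate is close to 40 h when `peak_duration = 24` and close to 160 h when `peak_duration = 0`, with intermediate values in between.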
Fig. 3 Workload variation cycle (Peak: system with high workload, lasting peakDuration hours; Off-peak: system with low workload, lasting 24 − peakDuration hours)

4 Models
We used the tool TimeNet 4.4 (Zimmermann 2017) to design and evaluate the models;
Appendix B presents more details about the tool. We divided this section into two
subsections: Section 4.1 details the availability model, and Section 4.2
describes the reliability model.
4.1 Availability model
Figure 4 has our proposed availability model. Despite being a monolithic model, we high-
light three sections of the model as sub-models: (a) Clock Model; (b) Workload Model; and
(c) Main Model.
Clock Model represents the VM migration scheduling process. The deterministic tran-
sition Trigger represents the time interval for VM migration submission. Instead of
considering a random variable, a deterministic transition considers a deterministic value as
the transition delay. Transition Trigger firing removes the token from place Clock and
puts it on the place Schedule. In this state, when place Schedule has one token, it rep-
resents that the VM migration is ready. So, at this point, the model has one of the conditions
satisfied for transition StartLM firing. But, VM migration also depends on two more con-
ditions: (i) VM up and running (represented by a token in place UP) and (ii) Standby Node,
which is the target of VM migration, also up and running (represented by a token in the
place SN UP). Transition StartLM firing removes the tokens from places UP and SN UP
and deposits a token in the place LM. Place LM with a token represents the start of VM
migration process. Besides that, place LM with a token also enables the firing of transition
ResetClock. So we assume that the clock resets right after the VM migration starts. After
transition ResetClock fires, a token goes to the place Clock, restarting the time counting
for the next VM migration.
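The enabling conditions of the Clock Model can be followed as a small token game. The sketch below is a simplified illustration, not the TimeNet model: markings are plain dictionaries, the place SN UP is written `SN_UP` so it can be used as a key, the function names are hypothetical, and the assumption that ResetClock consumes the token in Schedule is ours (the text only states that it is enabled by LM and puts a token in Clock).

```python
def enabled_start_lm(m):
    """StartLM needs a pending schedule, a running VM, and a running Standby Node."""
    return m["Schedule"] >= 1 and m["UP"] >= 1 and m["SN_UP"] >= 1

def fire_start_lm(m):
    """Remove the tokens from UP and SN_UP and deposit one token in LM."""
    if not enabled_start_lm(m):
        raise ValueError("StartLM is not enabled in this marking")
    new = dict(m)
    new["UP"] -= 1
    new["SN_UP"] -= 1
    new["LM"] += 1
    return new

def fire_reset_clock(m):
    """Restart the clock once the migration has started.

    Assumption: ResetClock moves the token from Schedule back to Clock.
    """
    if m["LM"] < 1 or m["Schedule"] < 1:
        raise ValueError("ResetClock is not enabled in this marking")
    new = dict(m)
    new["Schedule"] -= 1
    new["Clock"] += 1
    return new
```

Starting from `{"Clock": 0, "Schedule": 1, "UP": 1, "SN_UP": 1, "LM": 0}`, firing StartLM and then ResetClock leaves one token in LM and one in Clock, so the next migration interval starts counting while the current migration is still in progress.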
Fig. 4 Availability SRN model

Workload Model represents the workload variation. The workload variation has two levels:
peak and off-peak. The token in the place Peak represents the situation when the system
is receiving a high workload demand (peak). When the peak period ceases, the system goes
to an off-peak period. This event is represented by PeakPeriod firing, when the token is
removed from Peak and deposited in the place OffPeak. After the off-peak period, the
system returns to a peak period. OffPeakPeriod firing represents this event, removing
the token from place OffPeak and depositing in the place Peak, and the cycle restarts.
The workload model defines the rate for the software aging accumulation process. We rep-
resent this behavior using arcs from place Peak to the transition PhaseHigh, and from
place OffPeak to the transition PhaseLow. As presented in Wang et al. (2007), we also
used deterministic behavior to represent the workload variation dynamics. We used a 10-stage
Erlang distribution to approximate the deterministic behavior of transitions Trigger,
PeakPeriod, and OffPeakPeriod (Malhotra and Reibman 1993). Therefore, by using
the Erlang sub-net, we can keep the time counting for the peak and off-peak period, even
with token re-entering into places Peak or OffPeak. For example, when the transition
PhaseHigh fires, it produces a token re-entrance into place Peak. The token re-entrance
will cause a reset in the time counting of the transition PeakPeriod. However, as we are
using an Erlang sub-net to represent the deterministic transition, the time counting is pre-
served by another segment of the net. This mechanism is based on the age memory policy
of Stochastic Petri Nets (Ajmone Marsan et al. 1995). Appendix A presents more details on
this strategy. We decided not to include the Erlang sub-net in the representation to improve
the readability of the model.
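As a rough guide to how close a 10-stage Erlang distribution is to a deterministic delay, the sketch below computes the per-stage rate that preserves the mean delay and the resulting coefficient of variation. These are standard Erlang properties; the function names are hypothetical.

```python
import math

def erlang_stage_rate(delay_hours: float, stages: int = 10) -> float:
    """Rate of each exponential stage so that the Erlang mean equals delay_hours."""
    return stages / delay_hours

def erlang_cv(stages: int = 10) -> float:
    """Coefficient of variation (std / mean) of a k-stage Erlang: 1 / sqrt(k)."""
    return 1.0 / math.sqrt(stages)
```

For a 24-h delay and ten stages, each stage has rate 10/24 per hour and the coefficient of variation is about 0.316, compared with 1 for a single exponential transition and 0 for a truly deterministic one.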
The Main model represents the behavior of VM, Main Node, and Standby Node compo-
nents. The token in the place UP is the core of the model, which represents that VM and
Main Node are running. Also, as mentioned earlier, the token in the place SN UP repre-
sents that the Standby Node is running. Standby Node failures are represented by the firing
of transition SN f. This transition firing moves the token from the place SN UP to place
SN DW. Standby Node becomes operational again after a repair. Standby Node repair pro-
cess is denoted by the SN r firing, moving the token from place SN DW to place SN UP.
Standby Node unavailability does not affect overall system availability directly, because
the system availability is measured from the service hosted in the VM. However, Standby
Node unavailability prevents VM migration, which can lead the system to software aging
failures.
Also, in the Main model (upper part), we have the VM migration process. The software
rejuvenation begins after submission of VM migration (transition StartLM firing), which
is enabled observing the schedule of Clock model and the presence of tokens in places UP
and SN UP. The StartLM firing removes the tokens from places UP and SN UP and puts a
token in the place LM. This behavior represents that the system experiences downtime during
VM migration (Clark et al. 2005). After VM Migration downtime (represented by transition
LM dwt), the system returns to activity. As presented in Section 3.2, after VM migration, we
perform a rejuvenation in the Node with software aging status. This event is represented by
the place SN W (symbolizes that the Standby Node is waiting for rejuvenation) and transition
Rej (representing the rejuvenation process). The transition Rej moves a token back to the
place SN UP, representing that the Standby Node is operational again, and can receive new
VM migrations.
Above the Workload model, we have an Erlang sub-net which represents the software
aging accumulation. We decided to use Erlang because this distribution can represent the
gradual increasing of the software aging process (Wajid and Shuaib Khan 2006).1 The
immediate transition Aging fires when the place UP has tokens. As mentioned in the
Section 3.4, the Erlang phase delay depends on the state of workload. Software aging pro-
cess is closely related to workload. The higher the workload, the faster is the depletion of
resources. To represent these characteristics, we use two different delays for our Erlang sub-
net. The place AgingStart immediately receives forty tokens from Aging transition
when place UP has tokens. Those tokens represent the remaining aging sub-phases before a
software aging failure. Also, we consider four phases with ten sub-phases each. The deci-
sion between the concurrent transitions PhaseHigh and PhaseLow depends on the state
of the Workload model. If place Peak has tokens, then transition PhaseHigh will be
enabled. Otherwise, the transition PhaseLow will be enabled. The transition PhaseHigh
has a shorter delay than transition PhaseLow, representing that the aging process is faster in the
presence of high workload. The firing of either PhaseHigh or PhaseLow puts a token in the
place Accumulation. To avoid problems with token aging, we adopted the Exclusive
Server semantics in transitions PhaseHigh and PhaseLow.
Software aging accumulation is indeed represented by the places Accumulation and
Critical. The place Critical represents the phases of software aging accumulation,
and the place Accumulation represents sub-phases of software aging accumulation.
We decided to use this approach as an improvement for the model proposed in our pre-
vious paper (Torquato and Vieira 2018). So we continue to consider four phases, but we
divided each phase into ten sub-phases. Therefore, the model can capture the dynamics
of workload variation with more fidelity. The transition AgingAcc fires when the place
Accumulation has ten tokens. The place Critical receives tokens from transition
AgingAcc. When place Critical has four tokens, it represents that the system reaches
a software aging failure. The event of software aging failure is represented by transition
AgingFailure. This transition removes the token from the place UP and puts a token in
place DW2. The system returns to activity after repair (transition Repair).
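The phase and sub-phase bookkeeping described above amounts to a quotient and a remainder. The sketch below assumes that each firing of AgingAcc consumes the ten tokens in Accumulation and adds one token to Critical; the constant and function names are hypothetical.

```python
SUB_PHASES_PER_PHASE = 10   # tokens in Accumulation that trigger AgingAcc
PHASES_TO_FAILURE = 4       # tokens in Critical that trigger AgingFailure

def aging_marking(completed_sub_phases: int):
    """Return (tokens in Critical, tokens in Accumulation, aging failure reached?)."""
    critical, accumulation = divmod(completed_sub_phases, SUB_PHASES_PER_PHASE)
    failed = critical >= PHASES_TO_FAILURE
    return critical, accumulation, failed
```

After 25 completed sub-phases the marking holds two tokens in Critical and five in Accumulation; the fortieth sub-phase puts the fourth token in Critical, which enables AgingFailure.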
The transitions ClearAging1, ClearAging2, and ClearAging3 represent the
removal of software aging accumulation. We consider that the software aging effects are
removed in three situations: (i) VM migration, (ii) VM failure or (iii) Main Node failure.
VM migration is intended to support software rejuvenation. Thus, after a VM migration,
the software aging status is cleared. We also assume that the repair process of the VM or
Main Node comprises rejuvenation actions (i.e., common rejuvenation actions). We represent
those behaviors by using an inhibitor arc2 from the place UP to each one of the transitions
ClearAging.
The Main model also comprises the behavior of non-aging failures. The transition VM f represents
a non-aging VM failure. After firing, this transition removes the token from the place
UP and inserts it on the place VM DW. The system returns to activity after VM repair. The
transition VM r represents the VM repair after a non-aging failure. However, the system
can experience a Main Node failure before VM repair. This circumstance is represented by
MN f2 firing.
Place DW represents that both VM and Main Node are down. A Main Node failure can
occur after a VM failure (transition MN f2) or before a VM failure (transition MN f). The
firing of transitions MN f or MN f2 deposits a token in the place DW. The system recovery
after a Main Node failure has two steps. The first is the repair of the Main Node, represented
by transition MN r firing. This transition firing removes a token from the place DW and
puts it in the place VM S. The second step of system recovery is the VM reboot. The
transition VM rb denotes the VM reboot process, removing the token from VM S and putting
it in the place UP. In some cases, the system manager may decide to instantiate a new VM
after a Main Node failure. In such cases, as the delays for creating and rebooting VMs are
similar, we consider VM rb as the transition for the VM instantiation.
1 Just a clarification note: Erlang with fewer phases can represent a gradual increase. So we use a four-phase
Erlang sub-net to represent software aging. However, Erlang with more phases tends toward deterministic behavior.
Therefore, we used a ten-phase Erlang sub-net to represent the Clock behavior.
2 Arc terminating in a small circle instead of an arrow. In our case, the inhibitor arc enables the transition in
the absence of tokens in its input places.
The system is operational when the VM and Main Node are running. Place UP with a
token represents this system state. Therefore, to obtain system availability, we computed the
probability that the place UP has tokens (Availability = P{#UP > 0}).
To help in the model’s reading, Table 1 has the immediate transitions and their meaning.
4.2 Reliability model
Our reliability model is quite similar to the availability model presented in the earlier
section. However, as we are dealing with reliability, we have to adapt the model to remove
any repair behavior (Trivedi and Bobbio 2017). So we removed all the behavior after the
transitions VM f and MN f, and also the transition Repair. The rest of the model remains
the same as in the availability model. Figure 5 presents the proposed reliability model.
We use the reliability model to compute the transient probability that the system will
remain failure-free from the start of operation (i.e., time = 0) to a specific time t. We do
not consider the VM migration downtime as a system failure. So the metric is obtained from
the probability of places UP or LM to have tokens at time t. The metric can be summarized
as follows: Reliability(t) = P{(#UP > 0) OR (#LM > 0)}(t).
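Both metrics are indicator rewards over markings: availability counts the markings where UP has tokens, and reliability counts the markings where UP or LM has tokens. The sketch below shows how such rewards would be evaluated over a marking distribution. The distribution format (pairs of marking and probability) and the function names are assumptions for illustration; in practice the SRN tool computes the steady-state or transient distribution.

```python
def availability_reward(marking):
    """1 when the service is up (place UP marked), else 0."""
    return 1.0 if marking.get("UP", 0) > 0 else 0.0

def reliability_reward(marking):
    """1 when no failure has occurred: UP or LM marked (migration is not a failure)."""
    return 1.0 if marking.get("UP", 0) > 0 or marking.get("LM", 0) > 0 else 0.0

def expected_reward(distribution, reward):
    """Expected reward over (marking, probability) pairs."""
    return sum(prob * reward(marking) for marking, prob in distribution)
```

For a hypothetical distribution with probability 0.97 in a marking with UP, 0.01 in a marking with LM, and 0.02 in a down marking, the availability reward evaluates to 0.97 and the reliability reward to 0.98, because the migration downtime counts against availability but not against reliability.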
5 Results and analysis
We present two sets of results in this section. The first consists of the availability evaluation
results, and the second has the reliability evaluation results. For both evaluations, we used
the same set of parameters presented in Table 2. These parameters are based on the previous
experimental studies and consolidated papers (Machida et al. 2013; Torquato et al. 2018b;
Kim et al. 2009).
Table 1 Immediate transitions and their meaning
Transition          Meaning
ResetClock          Restart time counting for VM migration
StartLM             Start of VM migration process
Aging               Start of software aging accumulation
AgingAcc            Software aging accumulation phase completion
AgingFailure        System failure occurrence due to software aging
ClearAging(1→3)     Software aging accumulation clearance
Fig. 5 Reliability SRN model
5.1 Availability evaluation results
The goal of the availability evaluation is to define the proper rejuvenation policies to
maximize system availability. Because each VM migration has an associated downtime, migrating
too often may impair system availability. On the other hand, less frequent migrations
may allow failures due to software aging. Thus, there is an optimal rejuvenation schedule
that maximizes system availability.
To reach the proposed goal, we conduct a sensitivity analysis on the delay time of the
transition Trigger to find possible effects on system availability. We vary the transition
Trigger delay from 1 to 168 h (a week) with 1-h step. Obeying the workload variation
behavior presented in Section 3.4, we vary peakDuration from 0 to 24 h with 1-h step.
Figure 6 presents a surface plot with the availability results of all scenarios.
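The structure of this sweep can be sketched as follows. The availability function used here is a hypothetical toy stand-in, not the SRN evaluation performed with TimeNET: it only mimics the tradeoff between migration downtime (which grows as the trigger gets shorter) and aging failures (which grow with the trigger and with peakDuration). Its constants are arbitrary assumptions, so the numbers it produces are not the results of Table 3.

```python
from typing import Callable, Dict, Iterable, Tuple


def best_trigger_per_scenario(
    evaluate: Callable[[int, int], float],
    triggers: Iterable[int],
    peak_durations: Iterable[int],
) -> Dict[int, Tuple[int, float]]:
    """For each peakDuration, return (best trigger, availability at that trigger)."""
    triggers = list(triggers)
    best = {}
    for peak in peak_durations:
        scored = [(evaluate(trigger, peak), trigger) for trigger in triggers]
        availability, trigger = max(scored)
        best[peak] = (trigger, availability)
    return best


def toy_availability(trigger_h: int, peak_h: int) -> float:
    """Hypothetical stand-in for the model evaluation (arbitrary constants)."""
    migration_cost = 1e-3 / trigger_h             # frequent migrations add downtime
    aging_cost = 1e-6 * trigger_h * (1 + peak_h)  # rare migrations allow aging failures
    return 1.0 - migration_cost - aging_cost


# Same grid as the sensitivity analysis: 168 trigger values x 25 peakDuration values.
best = best_trigger_per_scenario(toy_availability, range(1, 169), range(0, 25))
for peak in (0, 12, 24):
    trigger, availability = best[peak]
    print(f"peakDuration={peak:2d} h -> trigger={trigger:3d} h, availability={availability:.6f}")
```

In the actual evaluation, each call to the evaluation function would correspond to one steady-state solution of the availability model.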
The first conclusion is that the software rejuvenation impacts are more significant in
systems with heavier workloads. Nevertheless, it is also important to highlight the benefits
obtained from software rejuvenation. So we compare the availability results of the system
with and without rejuvenation (i.e., baseline availability). To compute the baseline availabil-
ity, we removed the tokens related to the VM Migration in the proposed model. We find the
optimal rejuvenation schedule for all scenarios and compare the obtained availability with
the baseline availability. Our obtained results are in Table 3.
The table leads to the same conclusion: software rejuvenation is more effective
in scenarios with a heavier workload. Besides that, we can notice that scenarios with higher
peakDuration require more frequent migrations to achieve higher availability levels.
Investigating the possible side effects of more frequent migration is left for future work.
To improve the comparison, and to help visualize the effects of software rejuvenation,
we also compare the annual downtime of the system with rejuvenation against the annual
downtime of the baseline. We present the downtime reduction
metric, which is the difference between the downtime of the system with and without
Table 2 Parameters used in the models
Transition name | Description | Mean time
OffPeakPeriod | Period of off-peak incoming workload (per day) | 24 - peakDuration
PeakPeriod | Period of peak incoming workload (per day) | peakDuration = (0 → 24 h)
Repair | Repair time after a software aging failure | 1 h
PhaseHigh | Time to aging in presence of high workload | 2.5 h
PhaseLow | Time to aging in presence of low workload | 12.5 h
MN f, MN f2 | Main Node failure delay | 1236.70588 h
MN r | Main Node repair delay | 1.09433 h
SN f | Standby Node failure delay | 1236.70588 h
SN r | Standby Node repair delay | 1.09433 h
VM f | Virtual Machine failure delay | 2880 h
VM r | Virtual Machine repair delay | 30 min
VM rb | Virtual Machine reboot delay | 5 min
LM dwt | VM live migration downtime | 4 s
Rej | Rejuvenation node delay | 2 min
Trigger (availability evaluation) | Interval to VM live migration | 1 → 168 h
Trigger (reliability evaluation) | Interval to VM live migration | 1 → 720 h
Fig. 6 Steady-state availability results
Table 3 Appropriate rejuvenation policies for considered scenarios
Peak duration (h) | Optimal rejuvenation trigger (h) | Availability | Availability without rejuvenation
0 | 158 | 0.99886769 | 0.99776706
1 | 133 | 0.99886637 | 0.99706615
2 | 116 | 0.99886502 | 0.99674177
3 | 103 | 0.99886365 | 0.99641653
4 | 92 | 0.99886225 | 0.99609065
5 | 83 | 0.99886085 | 0.99576427
6 | 76 | 0.99885945 | 0.99543753
7 | 70 | 0.99885805 | 0.99511050
8 | 65 | 0.99885666 | 0.99478324
9 | 60 | 0.99885529 | 0.99445581
10 | 56 | 0.99885393 | 0.99412823
11 | 53 | 0.99885259 | 0.99380052
12 | 50 | 0.99885128 | 0.99347272
13 | 48 | 0.99884997 | 0.99314483
14 | 45 | 0.9988487 | 0.99281687
15 | 43 | 0.99884745 | 0.99248884
16 | 41 | 0.9988462 | 0.99216077
17 | 40 | 0.99884499 | 0.99183265
18 | 38 | 0.99884378 | 0.99150449
19 | 37 | 0.99884257 | 0.99117629
20 | 35 | 0.99884135 | 0.99084809
21 | 34 | 0.99884016 | 0.99051990
22 | 33 | 0.99883894 | 0.99019180
23 | 32 | 0.9988377 | 0.98986399
24 | 31 | 0.99883638 | 0.98953703
rejuvenation. Figure 7 presents the downtime reduction results. We also include a trend line
connecting the points.
In the scenario with the heaviest workload (peakDuration = 24), the downtime reduc-
tion is 3.39 days per year. Even in scenarios with a lighter workload, the downtime reduction is
substantial. For example, in the scenario with peakDuration = 0, the downtime reduc-
tion is about 9.64 h per year. The trend line in the figure suggests that the downtime reduction
grows approximately linearly with peakDuration. Therefore, we may expect to achieve even more
substantial rejuvenation effects in scenarios with peakDuration > 24.
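The downtime reduction values can be reproduced from the availabilities in Table 3 with a short calculation. The sketch below assumes a 365-day year (525,600 min), which is consistent with the values reported in min/yr; the function names are ours and used only for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 min, assuming a 365-day year


def annual_downtime_min(availability: float) -> float:
    """Expected downtime per year, in minutes, for a steady-state availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def downtime_reduction_min(with_rej: float, without_rej: float) -> float:
    """Downtime without rejuvenation minus downtime with rejuvenation (min/yr)."""
    return annual_downtime_min(without_rej) - annual_downtime_min(with_rej)


# (availability with rejuvenation, availability without rejuvenation) from Table 3
table3 = {
    0: (0.99886769, 0.99776706),
    24: (0.99883638, 0.98953703),
}

for peak, (with_rej, without_rej) in table3.items():
    reduction = downtime_reduction_min(with_rej, without_rej)
    print(
        f"peakDuration={peak:2d} h: {reduction:.2f} min/yr "
        f"= {reduction / 60:.2f} h = {reduction / 1440:.2f} days"
    )
```

For peakDuration = 0 this gives about 578.49 min/yr (9.64 h), and for peakDuration = 24 about 4887.74 min/yr (3.39 days), matching the values discussed above.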
5.2 Reliability evaluation results
We conduct the reliability evaluation using the suggested rejuvenation policies obtained
from the availability evaluation (see Table 3). The evaluation consists of a sensitivity
analysis of the reliability metric, in which we vary the time (t) from 0 to 720 h using an 80-h step.
We highlight that the analysis is not intended to find the best rejuvenation schedule to
maximize system reliability. In our scenarios, the best rejuvenation schedule for reliability
Fig. 7 Downtime reduction (comparison). x-axis: peak duration (h), from 0 to 24; y-axis: downtime reduction (min/yr), i.e., downtime without rejuvenation minus downtime with rejuvenation. Data labels range from 578.49 min/yr (peak duration 0 h) to 4887.74 min/yr (peak duration 24 h).
maximization is the policy with the shortest delay between migrations. Frequent migrations
prevent software aging failures. However, frequent migrations impair system avail-
ability (see Fig. 6). So we conducted the reliability evaluation considering the policies for
availability maximization. This way, we can verify the impact of the same rejuvenation
policies on system reliability.
Our results are in Fig. 8 (see footnote 3). The black dots are related to the system with rejuve-
nation, and the gray dots are related to the system without rejuvenation. We also include a
trend line (continuous line) in the plot. The plot also presents a 95% confidence interval for
the evaluation, which is highlighted using dashed lines.
From the results, we can draw the following conclusions: (i) Reliability decay is steeper
in scenarios with heavier workload (i.e., system with more exposure to peak workload); (ii)
Reliability improvement due to software rejuvenation becomes noticeable with the increase
of t; and (iii) In all scenarios, the system without rejuvenation tends to suffer failure earlier
than the system with rejuvenation.
To improve the reliability comparison of the system with and without rejuvenation, we
plot the reliability results for t = 160 h. We aim to highlight the probability that the system
remains failure-free in its first week (see footnote 4) of usage. The results are in Fig. 9. The error bars in
the plot represent the 95% confidence interval for the results. The trendlines (dotted lines)
in the plot use the moving average technique.
Until peakDuration = 7 h, the reliability results of the system with and without reju-
venation are nearly the same. So it is possible to conclude that the suggested rejuvenation
policies for maximum system availability only cause a significant reliability improvement in systems with
peakDuration > 7 h. For example, in the scenario with 24 h of peak workload, the proba-
bility that the system without rejuvenation remains failure-free until 160 h equals 0%, while
in the system with rejuvenation, the same probability equals 86% ± 3%.
The trendline of the system with rejuvenation reveals that the reliability levels tend to be
the same for all considered scenarios. That suggests a baseline value for system reliability
when applying availability-oriented rejuvenation policies. To improve the baseline values,
the system manager may apply more frequent migrations in the system, which will produce
availability reductions. In future work, we intend to conduct a Multi-Criteria Decision
Analysis to study the tradeoffs between system availability and reliability when applying
software rejuvenation policies.
6 Related works
It is important to highlight some papers that provide guidelines for our cloud computing
behavior modeling and evaluation, such as Kim et al. (2009), Dantas et al. (2012), and Matos et al.
(2012a). Despite presenting a relevant context for cloud computing availability evaluation,
these papers do not focus on software aging and rejuvenation or VM migration scheduling.
Their contributions are more general than ours. Our specific contributions are related to
software aging and rejuvenation on the cloud with VM migration as rejuvenation.
Li et al. (2014) present a comprehensive dependability evaluation of a Hadoop Clus-
ter hosted in clouds. The results comprise availability, reliability, and maintainability. The
Footnote 3: PD means peak duration and RT means rejuvenation trigger.
Footnote 4: A week has 168 h.
Fig. 8 Reliability results

Fig. 9 Reliability results (160 h). x-axis: peak duration (h), from 0 to 24; y-axis: reliability at 160 h, from 0.00 to 1.00. Series: With Rejuvenation, Without Rejuvenation, and a moving average for each.
authors use Markov Chains for the evaluation. The paper does not cover software aging and
rejuvenation aspects.
Machida et al. (2010, 2013) propose availability models for a system with rejuvenation
enabled by VM migration. Our previous research (Melo et al. 2013a, b) deals with the
same subject. Thein et al. also present papers in this topic (Thein and Park 2009; Thein
et al. 2008). However, different from these papers, we also cover workload variation and
reliability aspects.
Okamura et al. (2014) propose a model for transient analysis of a system with reju-
venation based on VM migration. The authors use phase-type expansion to evaluate two
migration approaches: cold-VM and warm-VM. The presented results are focused on point-
wise availability. Torquato et al. (2018b) also propose a comparison of two
migration policies: Warm-Standby and Cold-Standby migrations. The set of results includes
availability and power consumption analysis. Unlike these papers, our models also cover
reliability evaluation.
The papers (Xie et al. 2004) and (Wang et al. 2007) provide important insights for our
paper. Our idea for the workload variation model is based on these papers. These papers
include performance evaluation, which is one of our future works. However, these papers
lack reliability evaluation. Besides that, we include a specific rejuvenation technique (VM
migration) in our models. So we also take account of the details of VM migration, such as
VM migration downtime and Standby Node availability. Moreover, different from Xie et al.
(2004), we also consider failures from other sources (e.g., hardware failures).
Our previous paper (Torquato and Vieira 2018) was the first step of our research. Expand-
ing our contributions, we are now presenting a new availability model with more phases
to capture the dynamics of workload variation. We also improved the monolithic model
presented in Torquato and Vieira (2018) changing some deterministic transitions to Erlang
sub-nets and turning the guard functions into transitions. Using this approach, we can com-
pute the steady-state and transient metrics. Finally, we include new sections with reliability
modeling and evaluation.
Table 4 summarizes the related work comparison.
Table 4 Related work comparison
Papers | Availability evaluation | Reliability evaluation | Software aging and rejuvenation | Workload variation
Kim et al. (2009), Dantas et al. (2012), and Matos et al. (2012a) | Yes | No | No | No
Li et al. (2014) | Yes | Yes | No | No
Machida et al. (2010, 2013), Melo et al. (2013a, b), Thein et al. (2008, 2009), Okamura et al. (2014), Torquato et al. (2018b) | Yes | No | Yes | No
Xie et al. (2004) and Wang et al. (2007) | Yes. Also include performability metrics | No | Yes | Yes
Torquato and Vieira (2018) | Yes | No | Yes | Yes
This paper | Yes | Yes | Yes | Yes
7 Conclusions and future works
This paper presented models for availability and reliability evaluation of a system with VM
migration as rejuvenation under a varying workload. Our evaluation is based on Stochas-
tic Reward Nets. The results include the appropriate VM migration schedule to maximize
availability in each considered scenario. As the workload intensity affects software aging
directly, the scenarios with lower workload have the best availability results. Nevertheless, sce-
narios with higher workload have substantial improvement in availability (when compared
with the system without software rejuvenation). In some scenarios, the downtime avoidance
is more than 3 days per year.
The results also show a noticeable improvement in system reliability when applying
VM migration as rejuvenation. We compute that, in the scenario with 24 h per day of high
workload demand, the reliability improvement at t = 160 h is 86% ± 3%. Our results are useful to
show the VM migration scheduling impacts on system availability and reliability.
In future works, we intend to include performance aspects in the models. As presented
in Wang et al. (2007) and Xie et al. (2004), a feasible way is to add a sub-model based on
queues to compute job blocking probability and throughput metrics.
We also intend to include a sensitivity analysis in the VM migration downtime parameter
to observe possible side-effects caused by different VM migration approaches. Another
goal is to expand the model to represent scenarios with more VMs and physical machines.
Besides that, it is also important to investigate other rejuvenation policies, such as threshold-based
and hybrid (threshold and time-based) approaches.
Another relevant future work is the proposal of a Multi-Criteria Decision Making process
to analyze the tradeoffs of availability and reliability when using VM migration as support
for VMM software rejuvenation.
As secondary targets, we aim to add relevant transient metrics such as reliability, interval
reliability, and pointwise availability (Suzuki et al. 2003; Dohi et al. 2018). To achieve this
goal, we should redesign our models and reward functions to represent these metrics.
Acknowledgments This paper is an extended version of the paper (Torquato and Vieira 2018). We want to
thank the anonymous reviewers for the valuable comments to improve the research presented in this paper.
Funding information This work has been partially supported by Portuguese funding institution FCT -
Foundation for Science and Technology, Ph.D. grant SFRH/BD/146181/2019 and project ATMOSPHERE,
funded by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement no
777154.
Appendix A: Approach used in the deterministic transitions
In the design of our models, we faced two significant challenges: (1) the use of deterministic
transitions, which substantially increases the time for model evaluation, and (2) the use of guard
functions, which usually affect model readability as they hide specific system behaviors in
some transitions.
To tackle these challenges, we adopted the following approach. For challenge (1), we
replaced the deterministic transitions with 10-phase Erlang subnets that are capable of
reproducing the deterministic behavior. For challenge (2), we added the specific arcs and
transitions that represent the intended guard function behavior.
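The effect of the phase approximation can be illustrated with a small simulation. A k-phase Erlang delay with mean d is the sum of k exponential phases, each with mean d/k, and its coefficient of variation is 1/sqrt(k), so the delay concentrates around d as k grows. The sketch below is only an illustration of this property and is not part of the SRN models; the 24-h mean delay, the sample size, and the function names are assumptions chosen for the example.

```python
import random
import statistics


def erlang_sample(mean_delay: float, phases: int, rng: random.Random) -> float:
    """Sum of `phases` exponential stages whose total mean is `mean_delay`."""
    phase_rate = phases / mean_delay
    return sum(rng.expovariate(phase_rate) for _ in range(phases))


def empirical_cv(mean_delay: float, phases: int, n: int, rng: random.Random) -> float:
    """Empirical coefficient of variation (std/mean) over n samples."""
    samples = [erlang_sample(mean_delay, phases, rng) for _ in range(n)]
    return statistics.pstdev(samples) / statistics.mean(samples)


rng = random.Random(42)
for phases in (1, 10, 50):
    cv = empirical_cv(mean_delay=24.0, phases=phases, n=5000, rng=rng)
    print(f"{phases:2d} phases: empirical CV = {cv:.3f}, theoretical = {phases ** -0.5:.3f}")
```

With 10 phases the coefficient of variation is about 0.32, so the Erlang subnet approximates the deterministic delay rather than matching it exactly; more phases tighten the approximation at the cost of a larger state space.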
In the representation of the guard function behavior, we used token swaps in some places
of the SRN model. Figure 10 presents a token swap with a deterministic transition. The
Fig. 10 Token swap with deterministic transition
problem with token swaps is that each time the transition TokenSwap fires, the time
counting for the deterministic transition T0 is reset due to the token re-entrance in the place
P0.
To overcome the token swap problem, we adopted a technique based on the age memory
policy (Ajmone Marsan et al. 1995). In the age memory policy, the SRN is able to persist
the cumulative enabling time of a transition since the last time it was fired. The age mem-
ory policy is useful to model multitask systems, where the cumulative work in one task is
preserved when the system switches to others. In our case, we used the 10-phase Erlang
sub-net as the age memory policy mechanism. Figure 11 presents the adopted approach.
Fig. 11 Token swap with age memory policy
As soon as IT0 fires, the place PEr0 receives 10 tokens. Those tokens represent the time
counting for the transition EPh firing. It is possible to notice that, regardless of TokenSwap
firing, the time counting is preserved.
We used this approach in all the deterministic transitions presented in this paper. We
decided to represent the approach using a deterministic transition to improve the readability
of the model.
Appendix B: A brief description of TimeNET Tool
This description is based on paper (Zimmermann 2017).

TimeNET is a tool for modeling and evaluating several variants of SPNs. It allows the
user to calculate reward measures from the models. It also supports the evaluation of models
with exponential, deterministic, and non-exponential transitions. The tool provides numeri-
cal analysis and simulation methods to compute transient and steady-state solutions for the
models. TimeNET also supports Colored Petri Nets.
The software is built using a Java graphical interface, shell scripts, and C++ algorithms.
The tool is available in 32- and 64-bit versions for Windows and Linux.
TimeNET tool is developed and maintained by the Systems and Software Engineering
Group, TU Ilmenau, Germany. It is available free of charge for non-commercial use from
http://timenet.tu-ilmenau.de/.
References
Araujo, J., Matos, R., Maciel, P., Matias, R. (2011a). Software aging issues on the eucalyptus cloud com-
puting infrastructure. In 2011 IEEE international conference on systems, man, and cybernetics (SMC)
(pp. 1411–1416). IEEE.
Araujo, J., Matos, R., Maciel, P., Matias, R., Beicker, I. (2011b). Experimental evaluation of software
aging effects on the eucalyptus cloud computing infrastructure. In Proceedings of the middleware 2011
industry track workshop (p. 4). ACM.
Avizienis, A., Laprie, J.-C., Randell, B., et al. (2001). Fundamental concepts of dependability. University of
Newcastle upon Tyne, Computing Science.
Avizienis, A., Laprie, J.-C., Randell, B., Landwehr, C. (2004). Basic concepts and taxonomy of dependable
and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33.
Bovenzi, A., Cotroneo, D., Pietrantuono, R., Russo, S. (2011). Workload characterization for software aging
analysis. In 2011 IEEE 22nd international symposium on Software reliability engineering (ISSRE) (pp.
240–249). IEEE.
Ciardo, G., Blakemore, A., Chimento, P.F., Muppala, J.K., Trivedi, K.S. (1993). Automated generation and
analysis of markov reward models using stochastic reward nets. In Linear algebra, markov chains, and
queueing models (pp. 145–191). Springer.
Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Pratt, I., Warfield, A. (2005). Live migra-
tion of virtual machines. In Proceedings of the 2nd conference on symposium on networked systems
design & implementation-Volume 2 (pp. 273–286). USENIX Association.
Constantinescu, C. (2005). Dependability evaluation of a fault-tolerant processor by gspn modeling. IEEE
Transactions on Reliability, 54(3), 468–474.
Cotroneo, D., Natella, R., Pietrantuono, R., Russo, S. (2014). A survey of software aging and rejuvenation
studies. ACM Journal on Emerging Technologies in Computing Systems (JETC), 10(1), 8.
Dantas, J., Matos, R., Araujo, J., Maciel, P. (2012). An availability model for eucalyptus platform: An analysis
of warm-standy replication mechanism. In 2012 IEEE international conference on Systems, man, and
cybernetics (SMC) (pp. 1664–1669). IEEE.
de Melo, M.D.T., Araujo, J., Umesh, IM., Maciel, P.R.M. (2017). Sware: An approach to support software
aging and rejuvenation experiments. Journal on Advances in Theoretical and Applied Informatics, 3(1),
31–38.
Dohi, T., Zheng, J., Okamura, H., Trivedi, K.S. (2018). Optimal periodic software rejuvenation policies based
on interval reliability criteria. Reliability Engineering & System Safety, 180, 463–475.
Grottke, M., & Trivedi, K. (2005). A classification of software faults. Journal of Reliability Engineering
Association of Japan, 27(7), 425–438.
Grottke, M., & Trivedi, K.S. (2007). Fighting bugs: Remove, retry, replicate, and rejuvenate. Computer,
40(2), 107–109.
Huang, Y., Kintala, C., Kolettis, N., Fulton, N.D. (1995a). Software rejuvenation Analysis, module and
applications. In Ftcs (p. 0381). IEEE.
Huang, Y., Kintala, C., Kolettis, N., Fulton, N.D. (1995b). Software rejuvenation Analysis, module and appli-
cations. In 1995. FTCS-25. Digest of papers., twenty-fifth international symposium on Fault-tolerant
computing (pp. 381–390). IEEE.
Kim, D.S., Machida, F., Trivedi, K.S. (2009). Availability modeling and analysis of a virtualized system. In
2009. PRDC’09. 15th IEEE pacific rim international symposium on Dependable computing (pp. 365–
371). IEEE.
Langner, F., & Andrzejak, A. (2013). Detecting software aging in a cloud computing framework by
comparing development versions. In 2013 IFIP/IEEE international symposium on integrated network
management (IM 2013) (pp. 896–899). IEEE.
Li, H., Zhao, Z., He, L., Zheng, X. (2014). Model and analysis of cloud storage service reliability based on
stochastic petri nets. Journal of Information & Computational Science, 11(7), 2341–2354.
Machida, F., Kim, D.ong.S., Trivedi, K.S. (2010). Modeling and analysis of software rejuvenation in a server
virtualized system. In 2010 IEEE second international workshop on Software aging and rejuvenation
(woSAR) (pp. 1–6). IEEE.
Machida, F., Kim, D.S., Trivedi, K.S. (2013). Modeling and analysis of software rejuvenation in a server
virtualized system with live vm migration. Performance Evaluation, 70(3), 212–230.
Malhotra, M., & Reibman, A. (1993). Selecting and implementing phase approximations for semi-markov
models. Stochastic Models, 9(4), 473–506.
Malhotra, M., & Trivedi, K.S. (1994). Power-hierarchy of dependability-model types. IEEE Transactions on
Reliability, 43(3), 493–502.
Malhotra, M., & Trivedi, K.S. (1995). Dependability modeling using petri-nets. IEEE Transactions on
reliability, 44(3), 428–440.
Ajmone Marsan, M., Balbo, G., Conte, G., Donatelli, S., Franceschinis, G. (1995). Modelling with
generalized stochastic Petri nets Vol. 292. New York: Wiley.
Matos, R.D.S., Maciel, P.R.M., Machida, F., Kim, D.S., Trivedi, K.S. (2012a). Sensitivity analysis of server
virtualized system availability. IEEE Trans. Reliab., 61(4), 994–1006.
Matos, R., Araujo, J., Alves, V., Maciel, P. (2012b). Characterization of software aging effects in elastic stor-
age mechanisms for private clouds. In 2012 IEEE 23rd international symposium on software reliability
engineering workshops (ISSREW) (pp. 293–298). IEEE.
Melo, M., Araujo, J., Matos, R., Menezes, J., Maciel, P. (2013a). Comparative analysis of migration-based
rejuvenation schedules on cloud availability. In 2013 IEEE international conference on systems, man,
and cybernetics (SMC) (pp. 4110–4115). IEEE.
Melo, M., Maciel, P., Araujo, J., Matos, R., Araujo, C. (2013b). Availability study on cloud computing
environments Live migration as a rejuvenation mechanism. In 2013 43rd annual IEEE/IFIP international
conference on Dependable systems and networks (DSN) (pp. 1–6). IEEE.
Muppala, J., Ciardo, G., Trivedi, K.S. (1994). Stochastic reward nets for reliability prediction. Communica-
tions in reliability, maintainability and serviceability, 1(2), 9–20.
Mura, I., Bondavalli, A., Zang, X., Trivedi, K. S. (1999). Dependability modeling and evaluation of phased
mission systems: a dspn approach. In Dependable computing for critical applications 7, 1999 (pp. 319–
337). IEEE.
Okamura, H., Yamamoto, K., Dohi, T. (2014). Transient analysis of software rejuvenation policies in virtu-
alized system Phase-type expansion approach. Quality Technology & Quantitative Management, 11(3),
335–351.
Parnas, D.L. (1994). Software aging. In Proceedings of the 16th international conference on Software
engineering (pp. 279–287). IEEE Computer Society Press.
Royce, W.W. (1970). Managing the development of large software systems. In Proceedings of IEEE
WESCON (vol. 26). Los Angeles.
Suzuki, H., Dohi, T., Kaio, N., Trivedi, K.S. (2003). Maximizing interval reliability in operational software
system with rejuvenation. In 2003. ISSRE 2003. 14th international symposium on Software reliability
engineering (pp. 479–490). IEEE.
Thein, T., Chi, S.-D., Park, J.S. (2008). Availability modeling and analysis on virtualized clustering with
rejuvenation. International Journal of Computer Science and Network Security.
Thein, T., & Park, J.S. (2009). Availability analysis of application servers using software rejuvenation and
virtualization. Journal of computer science and technology, 24(2), 339–346.
Torquato, M., Maciel, P., Araujo, J., Umesh, I.M. (2017). An approach to investigate aging symptoms and
rejuvenation effectiveness on software systems. In 2017 12th iberian conference on Information systems
and technologies (CISTI) (pp. 1–6). IEEE.
Torquato, M., Araujo, J., Umesh, I. M., Maciel, P. (2018a). Sware A methodology for software aging and
rejuvenation experiments. Journal of Information Systems Engineering & Management, 3(2), 15.
Torquato, M., Umesh, I.M., Maciel, P. (2018b). Models for availability and power consumption evaluation of
a private cloud with vmm rejuvenation enabled by vm live migration. The Journal of Supercomputing,
1–25.
Torquato, M., & Vieira, M. (2018). Interacting srn models for availability evaluation of vm migration as
rejuvenation on a system under varying workload. In 2018 IEEE International symposium on software
reliability engineering workshops (ISSREW) (pp. 300–307). IEEE.
Trivedi, K.S., Ciardo, G., Malhotra, M., Sahner, R.A. (1993). Dependability and performability analysis. In
Performance evaluation of computer and communication systems (pp. 587–612). Springer.
Trivedi, K.S., Vaidyanathan, K., Goseva-Popstojanova, K. (2000). Modeling and analysis of software aging
and rejuvenation. In 2000.(SS 2000) proceedings. 33rd annual Simulation symposium (pp. 270–279).
IEEE.
Trivedi, K.S., & Bobbio, A. (2017). Reliability and availability engineering: modeling, analysis, and
applications. Cambridge: Cambridge University Press.
Vaidyanathan, K., & Trivedi, K.S. (1999). A measurement-based model for estimation of resource exhaustion
in operational software systems. In Issre (pp. 84). IEEE.
Vaidyanathan, K., & Trivedi, K.S. (2001). Extended classification of software faults based on aging. In Fast
abstract, int. Symp. Software Reliability Eng. Hong Kong.
Vaidyanathan, K., & Trivedi, K.S. (2005). A comprehensive model for software rejuvenation. IEEE
Transactions on Dependable and Secure Computing, 2(2), 124–137.
Wajid, R.A., & Shuaib Khan, M. (2006). Comparative distributions of hazard modeling analysis. Pakistan
Journal of Statistics and Operation Research, 2(2), 127–134.
Wang, D., Xie, W., Trivedi, K.S. (2007). Performability analysis of clustered systems with rejuvenation under
varying workload. Performance Evaluation, 64(3), 247–265.
Xie, W., Hong, Y., Trivedi, K.S. (2004). Software rejuvenation policies for cluster systems under vary-
ing workload. In 2004. Proceedings 10th IEEE pacific rim international symposium on dependable
computing (pp. 122–129). IEEE.
Zimmermann, A. (2017). Modelling and performance evaluation with timenet 4.4. In International confer-
ence on quantitative evaluation of systems (pp. 300–303). Springer.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Matheus Torquato received his Master’s Degree in Computer Science
from the Federal University of Pernambuco. He received his
Bachelor’s Degree in Computer Science from the Federal University
of Alagoas. He also has a certificate in Computer Networks, received
from Federal Institute of Alagoas. He already worked on building and
managing of Cloud Computing Private Environments. His research
interests comprise subjects like Cloud Computing, Performance and
Dependability Evaluation, Computer Networks and Security. He is
currently on leave from his teaching activities at the Federal Institute
of Alagoas, Campus Arapiraca to pursue Ph.D. at the University of
Coimbra. His website is http://www.matheustorquato.com.
Paulo Maciel received the degree in electronic engineering in 1987
and the M.Sc. and Ph.D. degrees in electronic engineering and computer
science from the Federal University of Pernambuco, Recife,
Brazil, respectively. He was a faculty member with the Department of
Electrical Engineering, Pernambuco University, Recife, Brazil, from
1989 to 2003. Since 2001, he has been a member of the Informat-
ics Center, Federal University of Pernambuco, where he is currently
an Associate Professor. In 2011, during his sabbatical from the Fed-
eral University of Pernambuco, he stayed with the Department of
Electrical and Computer Engineering, Edmund T. Pratt School of
Engineering, Duke University, Durham, NC, USA, as a Visiting
Professor. His current research interests include performance and
dependability evaluation, Petri nets and formal models, encompass-
ing manufacturing, embedded, computational, and communication
systems as well as power consumption analysis. Dr. Maciel is a
Research Member of the Brazilian Research Council.
Marco Vieira received the Ph.D. degree from UC, Portugal, in 2005.
He currently is a Full Professor with the University of Coimbra,
Coimbra, Portugal. His research interests include dependability and
security assessment and benchmarking, fault injection, software pro-
cesses, and software quality assurance, subjects in which he has
authored or coauthored more than 200 papers in refereed conferences
and journals. He has participated and coordinated several research
projects, both at the national and European level. He has served on
program committees of the major conferences of the dependabil-
ity area and acted as referee for many international conferences and
journals in the dependability and security areas.
Affiliations
Matheus Torquato1,2 · Paulo Maciel3 · Marco Vieira1
Paulo Maciel
prmm@cin.ufpe.br
Marco Vieira
mvieira@dei.uc.pt
1 Centre for Informatics and Systems, Department of Informatics Engineering, University
of Coimbra (CISUC-DEI, UC), Coimbra, Portugal
2 Federal Institute of Alagoas (IFAL), Campus Arapiraca, Arapiraca, Brazil
3 Centro de Informática, Universidade Federal de Pernambuco (CIn-UFPE), Recife, Brazil
