
124 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 2, NO. 2, APRIL-JUNE 2005

A Comprehensive Model
for Software Rejuvenation
Kalyanaraman Vaidyanathan, Member, IEEE, and Kishor S. Trivedi, Fellow, IEEE

Abstract—Recently, the phenomenon of software aging, one in which the state of the software system degrades with time, has been
reported. This phenomenon, which may eventually lead to system performance degradation and/or crash/hang failure, is the result of
exhaustion of operating system resources, data corruption, and numerical error accumulation. To counteract software aging, a
technique called software rejuvenation has been proposed, which essentially involves occasionally terminating an application or a
system, cleaning its internal state and/or its environment, and restarting it. Since rejuvenation incurs an overhead, an important
research issue is to determine optimal times to initiate this action. In this paper, we first describe how to include faults attributed to
software aging in the framework of Gray’s software fault classification (deterministic and transient), and study the treatment and
recovery strategies for each of the fault classes. We then construct a semi-Markov reward model based on workload and resource
usage data collected from the UNIX operating system. We identify different workload states using statistical cluster analysis and estimate
transition probabilities and sojourn time distributions from the data. Corresponding to each resource, a reward function is then defined
for the model based on the rate of resource depletion in each state. The model is then solved to obtain estimated times to exhaustion
for each resource. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for
failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal
rejuvenation schedules that maximize availability or minimize downtime cost.

Index Terms—Availability, measurement-based dependability evaluation, semi-Markov reward models, software aging, software
rejuvenation, workload characterization.

• K. Vaidyanathan is with the Scalable Systems Group, Sun Microsystems, San Diego, CA 92121. E-mail: kalyan.vaidyanathan@sun.com.
• K.S. Trivedi is with the Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708-0294. E-mail: kst@ee.duke.edu.

Manuscript received 27 Jan. 2004; revised 17 Nov. 2004; accepted 31 Mar. 2005; published online 3 June 2005.
For information on obtaining reprints of this article, please send e-mail to: tdsc@computer.org, and reference IEEECS Log Number TDSC-0022-0104.
1545-5971/05/$20.00 © 2005 IEEE. Published by the IEEE Computer Society.

1 INTRODUCTION

THE phenomenon of “software aging,” one in which the state of the software or its environment degrades with time, has been reported and investigated by several studies [17], [25]. The primary causes of this degradation are the exhaustion of operating system resources, data corruption, and numerical error accumulation. Eventually, this may lead to performance degradation of the software or crash/hang failure or both. Some common causes of “software aging” are memory bloating and leaking, unreleased file-locks, data corruption, storage space fragmentation, and accumulation of round-off errors [17]. Since aging leads to transient failures in software systems, environment diversity [28], a software fault tolerance technique that is based on changing the operating environment, can be employed proactively to prevent degradation or crashes. This involves occasionally stopping the running software, “cleaning” its internal state or its environment, and restarting it. Such a technique known as “software rejuvenation” was proposed by Huang et al. [25]. This counteracts the aging phenomenon in a proactive manner by removing the accumulated error conditions and freeing up operating system resources. Garbage collection, flushing operating system kernel tables and reinitializing internal data structures are some actions by which the internal state or the environment of the software can be cleaned.

Although we use the by-now-established phrase “software aging,” it should be clear that no deterioration of the software system per se is implied but rather, the software appears to age due to the degradation of the operating environment (for example, gradual depletion of resources) [5]. Likewise, “software rejuvenation” actually refers to rejuvenation of the environment in which the software is executing.

Software rejuvenation has been implemented in several different systems. The AT&T billing applications [25] and telecommunications switching software [2] both have some form of rejuvenation implemented. Proactive fault management of this type is not only used in high-availability software, but also in safety-critical systems. Preventive maintenance, to maximize the probability of successful mission completion, has been proposed for spacecraft systems [40]. Proactive fault management was also recommended for the Patriot missiles’ software system [33], where the recommendation was for the computer system to be switched off and on every eight hours.

More recently, two kinds of rejuvenation policies have been implemented in cluster systems to improve performance and availability, by taking advantage of the failover feature [9], [27], [45]. In the periodic policy, rejuvenation of the cluster nodes is done in a rolling fashion after every deterministic interval. In the prediction-based policy, the time to rejuvenate is estimated based on the collection and statistical analysis of system data. Microsoft’s IIS 5.0 Web server features a software rejuvenation policy known as
process recycling. The popular Web server software, Apache, implements a form of rejuvenation by killing and recreating processes after a certain number of requests have been served [31]. Software rejuvenation is implemented in specialized transaction processing servers [8]. Rejuvenation has also been proposed for cable and DSL modem gateways [13], in Motorola’s Cable Modem Termination System [32], and in middleware applications [7] for failure detection and prevention. Automated rejuvenation strategies have been proposed in the context of self-healing and autonomic computing systems [23].

In this paper, we first describe how to include faults attributed to software aging in the framework of Gray’s software fault classification (deterministic and transient) and study the treatment and recovery strategies for each of the fault classes. This helps us understand the nature of software faults and their impact on system availability and performance and aids in choosing the best possible recovery strategy when a fault is triggered. We then construct a semi-Markov reward model based on workload and resource usage data collected from the UNIX operating system (shown in Fig. 1). For this model, we first identify different workload states using statistical cluster analysis, then we estimate the transition probabilities and the state sojourn time distributions from the measured data. The measured sojourn time distributions fit either hypoexponential or hyperexponential theoretical distributions leading to a semi-Markov model rather than a Markov model. Corresponding to each resource, a reward function is then defined for the model based on the rate of resource depletion in different states. The workload-based model results are then compared with the results obtained by using the purely time-based approach [17]. The model is then solved to obtain estimated times to exhaustion for the resources. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal rejuvenation schedules that maximize availability or minimize downtime cost.

Fig. 1. Our approach for model construction and solution.

The main contributions of this paper are 1) a measurement-based model for capturing the effect of system workload on operating system resources, 2) an investigation of the effect of workload on system resources, particularly with respect to exhaustion, and 3) the development of a comprehensive model used to compute optimal rejuvenation schedules that maximize availability or minimize downtime cost.

The rest of this paper is organized as follows: In Section 2, we show how to include faults attributed to software aging into the framework of Gray’s classification. Section 3 gives an overview of related work and brings out some differences between this work and earlier work. The experimental setup and data collection are briefly explained in Section 4. Section 5 discusses the clustering approach and state transition model for system workload characterization. Modeling of resource depletion is explained in Section 6. The workload model behavior and results are discussed in Section 7. Section 8 combines the measurement-based workload model with an availability model to obtain an optimal rejuvenation schedule. Section 9 summarizes the contributions of the paper.

2 CLASSIFICATION AND TREATMENT OF SOFTWARE FAULTS

In this section, we describe how we can include software faults attributed to software aging into Jim Gray’s fault classification [20] and discuss the various fault tolerance techniques to deal with these faults in the operational phase of the software. Particular attention is given to environment diversity, explaining its need, various approaches, and methods in practice.

2.1 Classification of Software Faults

Faults, in both hardware and software, can be classified according to their phase of creation or occurrence, system boundaries (internal or external), domain (hardware or software), phenomenological cause, intent, and persistence [4]. In this section, we restrict ourselves to the classification of software faults based on their phase of creation.

Some studies have suggested that since software is not a physical entity and, hence, not subject to transient physical phenomena (as opposed to hardware), software faults are permanent in nature [24]. Some other studies classify software faults as both permanent and transient. Gray [20] classifies software faults into Bohrbugs and Heisenbugs. Bohrbugs are essentially permanent design faults and, hence, almost deterministic in nature. They can be identified easily and weeded out during the testing and debugging phase (or early deployment phase) of the software life cycle. A software system with Bohrbugs is analogous to a faulty deterministic finite state machine. Heisenbugs, on the other hand, are design faults that behave in a way similar to hardware transient or
intermittent faults. Their conditions of activation occur rarely or are not easily reproducible. These faults are extremely dependent on the operating environment (other programs, OS, and hardware resources). Hence, these faults result in transient failures, i.e., failures that may not recur if the software is restarted. Some typical situations in which Heisenbugs might surface are boundaries between various software components, improper or insufficient exception handling, and interdependent timing of various events. It is for this reason that Heisenbugs are extremely difficult to identify through testing. In fact, any attempt to detect such a bug may alter the operating environment enough to change the symptoms. A software system with Heisenbugs is analogous to a faulty nondeterministic finite state machine. A mature piece of software in the operational phase, released after its development and testing stage, is more likely to experience failures caused by Heisenbugs than by Bohrbugs. Most recent studies on failure data have reported that a large proportion of software failures are transient in nature [20], [21], caused by phenomena such as overloads or timing and exception errors [11], [39]. The study of failure data from Tandem’s fault tolerant computer system indicated that 70 percent of the failures were transient failures, caused by race conditions and timing problems [30].

We designate faults attributed to software aging as aging-related faults. Aging-related faults can fall under Bohrbugs or Heisenbugs depending on whether the failure is deterministic (repeatable) or transient. Fig. 2 illustrates this classification. Following are examples of software faults in each of these categories. A software fault that is environment independent and, hence, deterministic, falls under the category of nonaging-related Bohrbug (for example, a set of inputs resulting in the same failure every time). If the software bug, for example, is related to the arrival order of messages to a process, it is classified as a nonaging-related Heisenbug. Reordering and replaying the messages might result in the system working correctly. A bug causing a gradual resource exhaustion deterministically every time is classified as an aging-related Bohrbug. A bug causing an unknown resource leak during rare instances, which are difficult to reproduce, could be classified as an aging-related Heisenbug.

Fig. 2. Venn diagram of software fault types.

2.2 Software Fault Tolerance Techniques

In this section, we discuss the various software fault tolerance techniques that can be used for the specific fault classes discussed above. Design diversity has been advocated as a technique for software fault tolerance mainly to deal with Bohrbugs [3]. It relies on the assumption of independence between multiple variants of software. However, as some studies have shown, this assumption may not always be valid [29]. Design diversity has also been used to treat Heisenbugs, as in the GUARDS [34] and DELTA-4 [10], [35] systems. Since there are multiple versions of software operating, it is not likely that all of them will experience the same transient failure. One of the disadvantages of design diversity is the high cost involved in developing multiple variants of software.

Data diversity [1] can work well with Bohrbugs and is less expensive to implement than design diversity. To some extent, data diversity can also deal with Heisenbugs since different input data is presented and, by definition, these bugs are nondeterministic and nonrepeatable.

Environment diversity is a recent approach to software fault tolerance. The components that affect the behavior of a software are the volatile state (program stack and data segments), the persistent state (user files), and the operating system environment (all resources like swap space, file systems, communications channels, etc.) [46]. Transient faults typically occur in computer systems due to design faults in software which result in unacceptable and erroneous states in the operating system environment. Environment diversity therefore advocates reexecuting the software in a different operating environment [24], [28]. Although this technique has long been used in an ad hoc manner, only recently has it gained recognition and importance. Usually, restart or reexecution is done at the instance of a failure in the software. Examples of environment diversity techniques include retrying the operation, restarting the application, and rebooting the node. The retry and restart operations can be done on the same node or on another spare (cold/warm/hot) node. Tandem’s fault tolerant computer system [30] is based on the process pair approach. It was noted that many application failures did not recur once the application was restarted on the second processor. This was due to the fact that the second processor provided a different environment which did not trigger the same error conditions which led to the failure of the application on the first processor. Hence, in this case (as well as in Avaya’s SwiFT [18]), hardware redundancy coupled with software replication¹ was used to tolerate most of the software faults.

1. Identical copies.

For aging-related bugs, environment diversity can be particularly effective if utilized proactively in the form of software rejuvenation. Rejuvenation can be triggered either by time (deterministic intervals) or by measurement and analysis of data of the system condition.

3 RELATED WORK

The study of software aging and rejuvenation can be broadly classified into two approaches: the measurement-based approach and the analytic modeling approach. The first approach applies statistical analysis to data collected from systems and applies trend analysis or other techniques to determine a window of time over which to perform rejuvenation in order to prevent unplanned outages. The analytic modeling approach assumes failure and repair time distributions of a system and obtains an optimal rejuvenation schedule to maximize availability, or minimize loss probability or downtime cost.
Garg et al. [17] present a measurement-based approach for detecting and estimating trends and times to exhaustion of operating system resources due to software aging. The data collection technique used in the present paper, including the workload and resource usage variables monitored, is the same as in their work. While the work by Garg et al. [17] considers only time-based trend detection and estimation of resource exhaustion, Vaidyanathan and Trivedi [44] take the system workload into account for building a model to estimate resource exhaustion times. In this paper, we discuss a comprehensive model, which first extends the workload-based approach by performing transient analysis and formulating the estimated time to resource exhaustion as the mean time to accumulated reward in a semi-Markov reward model. We then develop an upper-level availability model that accounts for failure and rejuvenation, and helps us derive optimal rejuvenation schedules. Cassidy et al. [8] have developed an approach to rejuvenation for large online transaction processing servers. Using pattern recognition methods, they find that 13 out of several monitored parameters deviate from “normal” behavior just prior to a crash, providing sufficient warning to initiate rejuvenation. Li et al. [31] present an approach based on time series analysis to predict resource usage trends in a Web server while subjecting it to an artificial workload.

Several papers have dealt with determining the optimal times to perform preventive maintenance of operational software systems, through analytical models. The accuracy of this approach is determined by the assumptions made in the model for capturing aging. In many papers [14], [15], [25], [40], only the failures causing unavailability of the software are considered, while Pfening et al. [36] assume a gradually decreasing service rate of a software that serves transactions. Garg et al. [16], however, consider both these effects of aging together in a single model. Models proposed in some of the papers [14], [25] are restricted to hypoexponentially distributed time to failure, while models proposed in others [15], [36], [40] can accommodate general distributions, but only for the specific aging effect they capture. Generally distributed time to failure, as well as the service rate being an arbitrary function of time, are allowed in the work by Garg et al. [16]. Only their model captures the effect of load on aging. In the work by Dohi et al. [12], software rejuvenation models are formulated using semi-Markov processes and optimal rejuvenation schedules that maximize availability and minimize cost are derived analytically. Nonparametric statistical algorithms to estimate the optimal schedules are developed given a sample data of failure times. The use of rejuvenation has been extended to cluster systems [9], [45], and analytical models of the implementation show that employing software rejuvenation in cluster systems results in a significant increase in system availability and decrease in downtime cost. In [47], models are presented for software rejuvenation in cluster systems under varying workloads. Bobbio et al. [6] present fine-grained degradation models where one can identify the current degradation level based on the observation of a system parameter. Optimal rejuvenation policies based on a risk criterion and an alert threshold are then presented.

In this work, we bridge the gap between the measurement-based and the analytic modeling approaches by first building a measurement-based semi-Markov workload model. Both the structure (states and state transitions) and the parameters (transition probabilities and sojourn time distributions) are determined using statistical estimation on measured data. Measured data on the rate of resource depletion in each state is used as reward rate leading to the computation of the estimated time to exhaustion of each resource. These results are then used to build a higher-level semi-Markov availability model. This comprehensive model takes into account both reactive recovery following a failure due to resource exhaustion and rejuvenation, and is used to derive optimal rejuvenation schedules.

Other related work in measurement-based dependability evaluation is based on either measurements made at failure times [11] or at error observation times [42]. In our case, we monitor the system performance variables continuously since we are interested in trend estimation and, hence, in predicting the time to next failure and not in observing interfailure times or identifying error patterns. Hsueh et al. [26] use a semi-Markov reward model to estimate the cost of different types of errors. The clustering method they use is very similar to ours but, in their work, reward rates are assigned based on error rates while, in our case, reward rates are attached to states reflecting the rate of resource depletion in that state.

4 EXPERIMENTAL SETUP AND DATA COLLECTION

4.1 SNMP-Based Distributed Resource Monitoring Tool

The SNMP (Simple Network Management Protocol)-based distributed resource monitoring tool developed by Garg et al. [17] was used for our data collection. SNMP is an application protocol which offers network management services in the Internet protocol suite. The manager, the agent, and the MIB (Management Information Base) form the main constituents of an SNMP-based management tool. SNMP defines a client-server relationship between a manager and an agent, and the MIB describes the information that can be obtained and/or modified through interactions between the manager and the agent.

4.2 Data Collection

The resource monitoring tool referred to previously was used to collect operating system resource usage data (physical/virtual memory usage, file/process table usage, etc.) and system activity data (paging activity, CPU utilization, etc.) from nine heterogeneous UNIX workstations, which were connected by an Ethernet LAN at the Duke Department of Electrical and Computer Engineering. In our setup, shown in Fig. 3, a central monitoring station runs the manager program which sends get requests periodically to each of the agent programs running on the monitored workstations. The agent programs in turn obtain data for the manager from their respective machines by executing various standard UNIX utility programs like pstat, iostat, and vmstat. Also shown in the figure along with the monitored workstations are their respective operating systems and primary functions in the department.
Fig. 3. Experimental setup for data collection.

The objects or parameters monitored on the workstations include those that describe the state of the operating system resources, state of the processes running, information on the /tmp file system, availability and usage of network related resources, and information on terminal and disk I/O activity. More than 100 such parameters were monitored at regular intervals (10 min) for more than three months.

In this paper, we only discuss the results for the data collected from the machine Rossby. The two resources selected for study are usedSwapSpace and realMemoryFree since they are considered to be leading indicators of aging [17], [25]. Other resources like fileTablesize and processTablesize can also be considered, but are not addressed in this paper. Also, only the longest stretch of sample points in which no reboots or failures occurred was used for building the model, which is described in the following section.

5 WORKLOAD CHARACTERIZATION AND MODELING

We have used the two-stage method [37] for constructing our semi-Markov workload model, where we define a matrix P and a vector H(t). The transitions of the semi-Markov process can be thought of as taking place in two stages. Consider a transition from state i to state j. In the first stage, the chain stays in state i for some amount of time described by the sojourn time distribution $H_i(t)$. In the second stage, the chain moves to state j determined by the probability $p_{ij}$. The sojourn time distributions $H_i(t)$ are allowed to be general by the semi-Markov assumption whereas a (homogeneous continuous-time) Markov chain would have allowed only exponential distributions.

The system workload was characterized by obtaining a number of variables pertaining to CPU activity and file system I/O. We then identify a small number of representative workload states through statistical clustering techniques. Once the states are identified, the transition probabilities from one state to another and the distribution of sojourn time in each state are estimated. Thus, we obtain a complete state-transition model to describe the system workload dynamics. The details of the model construction are explained in the rest of this section.

In this study, we use the following variables to characterize the workload:

• cpuContextSwitch: The number of process context switches performed during the measurement interval (10 min in our case).
• sysCall: The number of system calls made during the interval.
• pageIn: The number of page-in operations (pages fetched in from file system or swap device) during the interval.
• pageOut: The number of page-out operations (pages pushed out to file system or swap device) during the interval.

Thus, a point in a four-dimensional space, (cpuContextSwitch, sysCall, pageIn, pageOut), represents the measured workload for a given interval of time. These variables are thus used to define and characterize the system workload. We next partition the data points into clusters that contain similar points based on some predefined criteria. A statistical clustering algorithm is used for achieving this.

5.1 Cluster Analysis

The goal of cluster analysis is to determine a partition of a given set of points into groups or clusters such that the points in a single cluster are more similar, according to a certain criterion, to each other than to points in a different cluster. In our case, we use an iterative nonhierarchical clustering algorithm called Hartigan’s k-means clustering algorithm [22]. The objective of this algorithm is to divide a given set of points into k clusters so that the within-cluster sum of squares is minimized.

If the variables for clustering are not expressed in homogeneous units, a scale change or normalization must be performed. In our analysis, we used the normalization method based on the following transformation:

\[ x'_i = \frac{x_i - \min_i\{x_i\}}{\max_i\{x_i\} - \min_i\{x_i\}}, \]

where $x'_i$ is the normalized value of $x_i$. This transformation restricts the values of all the variables to the range 0 to 1. We need to eliminate the outliers in the data before applying the scaling technique since they tend to distort the transformation. Outliers in the data were identified and eliminated by an analysis of the cumulative distribution of each parameter, although they were assigned to clusters in the final stage. The statistics of the measured workload variables are shown in Table 1.

The k-means clustering algorithm was applied to the workload data and this resulted in eight clusters. The number of final clusters is highly data dependent. We compute a ratio from $W_i$ and $W_{i+1}$, scaled by $(n - i - 1)$, where $W_i$ is the within-cluster sum of squares for $i$ clusters and $n$ is the total number of data points. The clustering algorithm starts with one cluster and keeps adding one at a time until the desired value of this ratio is reached.
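The normalization and clustering steps can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a plain Lloyd-style k-means with a fixed k rather than Hartigan’s variant and stopping rule, the six sample points are made up, and the final helper simply counts changes between consecutive cluster labels as one plausible way to turn the labelled sequence into state transition probabilities.

```python
import math

def min_max_normalize(points):
    """Scale every column of `points` to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

def k_means(points, k, iterations=100):
    """Lloyd-style k-means (an assumption, not Hartigan's algorithm)."""
    # Deterministic start: take k points spread evenly through the data.
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

def estimate_transitions(labels, k):
    """Relative frequency of moves between distinct consecutive states."""
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(labels, labels[1:]):
        if a != b:
            counts[a][b] += 1
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]

# Made-up samples of (cpuContextSwitch, sysCall, pageIn, pageOut) per interval.
samples = [
    (100, 2000, 5, 1), (120, 2100, 4, 2),
    (5000, 90000, 300, 250), (5200, 88000, 310, 240),
    (110, 1900, 6, 1), (4900, 91000, 290, 260),
]
normalized = min_max_normalize(samples)
labels, centroids = k_means(normalized, k=2)
transitions = estimate_transitions(labels, k=2)
print(labels)
print(transitions)
```

Normalizing first matters here because sysCall counts are orders of magnitude larger than page counts and would otherwise dominate the Euclidean distance.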
TABLE 1
Statistics for the Workload Variables Measured

TABLE 2
Statistics for the Workload Clusters

The statistics for the eight workload clusters are shown in Table 2. Also shown in the table is the percentage of the sample data points in each cluster. We observe that more than 75 percent of the points belong to clusters 4, 5, and 7, which are relatively light workload states. Clusters 1 and 8 are heavy workload states and contain significantly fewer data points.

5.2 The State-Transition Model

The next step, after the clusters and centroids are identified, is to build a state-transition model for the system workload. This is done by determining the transition probabilities from one state to another. The transition probability $p_{ij}$ from a state i to a state j can be estimated from the sample data using the following formula [26], [43]:

\[ p_{ij} = \frac{\text{observed no. of transitions from state } i \text{ to state } j}{\text{total observed no. of transitions from state } i}. \]

State transition probabilities were estimated from the observed data for the eight workload states and the resulting (8 × 8) transition probability matrix, P, is shown below:

\[
P = \begin{bmatrix}
0.000 & 0.155 & 0.224 & 0.129 & 0.259 & 0.034 & 0.165 & 0.034 \\
0.071 & 0.000 & 0.136 & 0.140 & 0.316 & 0.026 & 0.307 & 0.004 \\
0.122 & 0.226 & 0.000 & 0.096 & 0.426 & 0.000 & 0.113 & 0.017 \\
0.147 & 0.363 & 0.059 & 0.000 & 0.098 & 0.216 & 0.088 & 0.029 \\
0.033 & 0.068 & 0.037 & 0.011 & 0.000 & 0.004 & 0.847 & 0.000 \\
0.070 & 0.163 & 0.023 & 0.535 & 0.116 & 0.000 & 0.023 & 0.070 \\
0.022 & 0.049 & 0.003 & 0.003 & 0.920 & 0.003 & 0.000 & 0.000 \\
0.307 & 0.077 & 0.154 & 0.231 & 0.077 & 0.154 & 0.000 & 0.000
\end{bmatrix}.
\]

5.3 Sojourn Time Distribution

To completely specify the semi-Markov process, we need to determine the distribution of sojourn time in each state. This distribution for all the workload states was fitted to either 2-stage hyperexponential or 2-stage hypoexponential distribution functions. The sojourn times in workload states $W_1$, $W_5$, and $W_7$ were fitted to hypoexponential distributions, while the sojourn times in all other states were fitted to hyperexponential distributions [43]. The fitted distributions were tested using the Kolmogorov-Smirnov test [43] at a significance level of 0.01.

Fig. 4 shows the Kolmogorov-Smirnov test for sojourn time distribution in workload state 4 and the fitted distributions for all the workload states are listed in Table 3.

Even though our model was constructed from real measurements, the clustering method results only in an approximate model for the workload. We made a quick check of the model accuracy by comparing the steady state probability of occupying a particular workload state computed from the model to the estimated probability from the observed data, i.e., the fraction of the length of time the system was in that workload state to the total length of the period of observation. These values, shown in Table 4, match closely.

6 MODELING RESOURCE DEPLETION

The previous section discussed the development of a semi-Markov process to describe the system workload. Since our original aim was to estimate the exhaustion of system resources as a function of workload, we need to incorporate the effect of workload on the resources in this model. Therefore, a reward function corresponding to each system resource considered for analysis is assigned to the model. This will enable us to study the evolution of the system

Fig. 4. Kolmogorov-Smirnov test for sojourn time distribution in \(W_4\).


TABLE 3
Sojourn Time Distributions in the Eight Workload States
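The transition probability estimate used for the workload model (observed transitions from state i to state j, divided by all observed transitions out of state i) can be sketched in a few lines. This is a minimal illustration, not the authors' tooling: the function name `estimate_transition_matrix` and the sample sequence `visits` are hypothetical, and the sequence is assumed to list successive distinct workload states, so there are no self-transitions.

```python
from collections import Counter

def estimate_transition_matrix(states, n_states):
    """Estimate p_ij = (# transitions i -> j) / (# transitions out of i)."""
    # Count every consecutive pair (i, j) in the observed state sequence.
    counts = Counter(zip(states, states[1:]))
    matrix = []
    for i in range(n_states):
        total = sum(counts[(i, j)] for j in range(n_states))
        # A state that is never left gets an all-zero row in this sketch.
        row = [counts[(i, j)] / total if total else 0.0 for j in range(n_states)]
        matrix.append(row)
    return matrix

# Hypothetical sequence of visited workload states (0-based labels).
visits = [0, 1, 2, 1, 0, 2, 1, 0, 1, 2]
P_hat = estimate_transition_matrix(visits, 3)
```

Each row of `P_hat` with at least one observed departure sums to one, which matches the structure of the matrix P reported for the eight workload states.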

resource in the different states of the workload model. The reward rate, \(r_{ij}\), for each workload state, \(i\), and for each resource, \(j\), is assigned the estimated slope (described in the next subsection) of the depletion of resource \(j\) in the workload state \(i\).

We consider two resources here for our analysis: usedSwapSpace and realMemoryFree. The time plots of these two resources as measured on machine Rossby between two successive reboots/failures are shown in Fig. 5. There are no reboots during this period. The time interval for which the plots are shown for the resources corresponds to the time interval for which the workload data used for

TABLE 4
Comparison of State Occupancy Probabilities
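The model-side occupancy probabilities compared in Table 4 come from a semi-Markov process. A standard way to obtain them is to weight the stationary vector of the embedded chain by the mean sojourn time in each state. The sketch below applies that textbook relation to the matrix P given earlier. The mean sojourn times are hypothetical placeholders (Table 3 is not reproduced here), and plain power iteration stands in for whatever solver the authors used.

```python
# Transition probability matrix P of the embedded chain, as given in Section 5.
P = [
    [0.000, 0.155, 0.224, 0.129, 0.259, 0.034, 0.165, 0.034],
    [0.071, 0.000, 0.136, 0.140, 0.316, 0.026, 0.307, 0.004],
    [0.122, 0.226, 0.000, 0.096, 0.426, 0.000, 0.113, 0.017],
    [0.147, 0.363, 0.059, 0.000, 0.098, 0.216, 0.088, 0.029],
    [0.033, 0.068, 0.037, 0.011, 0.000, 0.004, 0.847, 0.000],
    [0.070, 0.163, 0.023, 0.535, 0.116, 0.000, 0.023, 0.070],
    [0.022, 0.049, 0.003, 0.003, 0.920, 0.003, 0.000, 0.000],
    [0.307, 0.077, 0.154, 0.231, 0.077, 0.154, 0.000, 0.000],
]

def embedded_stationary(P, iterations=5000):
    """Power iteration for the stationary vector v of the embedded chain (v = vP)."""
    n = len(P)
    v = [1.0 / n] * n
    for _ in range(iterations):
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = sum(v)
        v = [x / total for x in v]  # renormalize to limit rounding drift
    return v

def smp_occupancy(P, mean_sojourn):
    """Long-run fraction of time in each state: v_i * h_i / sum_j v_j * h_j."""
    v = embedded_stationary(P)
    weights = [vi * hi for vi, hi in zip(v, mean_sojourn)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical mean sojourn times (one common time unit); NOT the Table 3 values.
mean_sojourn = [5.0, 12.0, 8.0, 6.0, 20.0, 4.0, 15.0, 3.0]
occupancy = smp_occupancy(P, mean_sojourn)
```

The empirical counterpart is simply the fraction of the observation period spent in each workload state, so the two vectors can be compared entry by entry.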

Fig. 5. Time plots of resources usedSwapSpace and realMemoryFree in machine Rossby.

TABLE 5
Slope Estimates (in KB/10 min) for usedSwapSpace and realMemoryFree at Different Workload States
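The slope estimates in Table 5 rely on Sen's nonparametric procedure from Section 6.1: the median of the slopes over all pairs of observations. A minimal sketch follows, assuming equally spaced samples indexed 0, 1, 2, ... (one index step per 10-minute sample) and using `None` to mark a missing observation. The function name `sen_slope` is illustrative, and the confidence interval computation is not included.

```python
from statistics import median

def sen_slope(values):
    """Median of the pairwise slopes (x_i - x_j) / (i - j) over all pairs i > j."""
    # Keep (index, value) pairs for observations that are present.
    points = [(i, x) for i, x in enumerate(values) if x is not None]
    slopes = [
        (xi - xj) / (i - j)
        for k, (j, xj) in enumerate(points)
        for (i, xi) in points[k + 1:]
    ]
    return median(slopes)

clean = sen_slope([0, 2, 4, 6, 8])          # perfectly linear data
with_outlier = sen_slope([0, 2, 4, 100, 8])  # one gross error
with_gap = sen_slope([1, None, 5, 7])        # one missing sample
```

All three calls return a slope of 2.0, which illustrates why the median of pairwise slopes is preferred over least squares when the data contain outliers or gaps.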

building the state-transition model (in the previous section) was sampled from.

6.1 Computing the Slope
If a linear trend is present in the data, linear regression methods can be used to estimate the true slope by computing the least squares estimate of the slope. The slope obtained by this method can deviate greatly from the true slope if there are gross errors or outliers in the data [19]. Therefore, we estimate the true slope of a resource depletion at every workload state by using a nonparametric procedure developed by Sen [38]. Sen’s method is not greatly affected by gross data errors or outliers, and it can be computed even with missing data.

The procedure for estimating the slope by this method is as follows: First, \(n'\) slope estimates are calculated for all pairs of points at \(i\) and \(j\) for which \(i > j\), as \(m_{ij} = (x_i - x_j)/(i - j)\). We thus obtain \(n' = n(n - 1)/2\) slope estimates, where \(n\) is the total number of data observations. The median of these \(n'\) values is the required slope estimate. It is also possible to obtain a two-sided confidence interval about the true slope [19].

Table 5 shows the slope obtained by Sen’s method for these two resources, for all the eight workload states. The slopes shown in this table are the slopes of the respective resource depletion during the longest interval for which the system was at that workload state. For a particular workload state, it is possible to obtain the slope of resource depletion during different visits to the state. We expect the different slopes thus obtained to be relatively close. Figs. 6 and 7 show the observed values of usedSwapSpace and realMemoryFree, respectively, at workload state 4 during two different visits (with different sojourn times for each visit). The estimated slopes (in KB/10 min) and 95 percent confidence intervals are also given in parentheses beside the slope estimates. We observe that the slopes in both instances are within the same confidence limits. Thus, in our model, these slopes correspond to the reward rates for each workload state for usedSwapSpace and realMemoryFree. The observation that slopes in a given workload state are within the same confidence intervals (refer to Figs. 6 and 7), together with the observation that slopes across different workload states are different (refer to Table 5) validates our assumption that resource depletion does depend on the system workload and the rates of exhaustion vary with workload changes. We observe that the slopes for usedSwapSpace in all the workload states are nonnegative, and the slopes for realMemoryFree are nonpositive in all the workload states except in one. This shows that usedSwapSpace increases whereas realMemoryFree decreases over time. This validates the “software aging” phenomenon [17], [25]. It is also generally observed that the higher the system activity, the higher the resource depletion. The 95 percent confidence intervals for the slope are relatively wide for usedSwapSpace in workload state \(W_1\) and for realMemoryFree in workload state \(W_8\). This reflects the relatively small number of data points in these states and accounts for a high degree of variability.

Fig. 6. usedSwapSpace in \(W_4\) during two different visits.

7 RESULTS FOR THE WORKLOAD MODEL
The irreducible semi-Markov reward model (SMP) described above was converted into an irreducible Markov reward model (MRM) before solving since all the sojourn times have phase-type distributions. For examples and

details of such a Markovizing transformation, see [26], [43].

First, we compute the expected steady state reward rate, which can be considered the rate of depletion of the resource averaged over all workload states. This can then be compared with the slope obtained using the (workload-independent) pure time-based approach of [17]. The expected reward rate at time \(t\) is then computed in order to determine how quickly the steady state is reached. Finally, the expected time to accumulate reward \(r\) is computed in order to estimate the mean time to exhaustion of the resource.

Fig. 7. realMemoryFree in \(W_4\) during two different visits.

The following procedure is used to compute the expected time to accumulate reward \(r\). The states of the MRM can be partitioned into two sets based on the assigned reward rates: \(S\), which contains all states with positive reward rates, and \(S^z = \Omega - S\), which contains all the states with zero reward rates. Based on this partition, the generator matrix \(Q\) of the MRM can be rearranged and partitioned into four submatrices such that

\[
Q = \begin{bmatrix} Q_1 & Q_2 \\ Q_3 & Q_4 \end{bmatrix},
\]

where \(Q_1\) contains all the transitions within \(S\), \(Q_2\) contains the transitions from \(S\) to \(S^z\), \(Q_3\) contains the transitions from \(S^z\) to \(S\), and \(Q_4\) contains all the transitions within \(S^z\). Similarly, the reward rate matrix \(R\) can be partitioned such that

\[
R = \begin{bmatrix} R_1 & 0 \\ 0 & 0 \end{bmatrix},
\]

where \(R_1 = \mathrm{diag}_{i \in S}[r_i]\) and each \(r_i\) corresponds to the ordering of states in \(Q_1\).

Let \(T(r)\) represent the time to accumulate reward \(r\). The expected time to accumulate reward \(r\), \(E[T(r)]\), is given by [41]

\[
E[T(r)] = \pi(0) \begin{bmatrix} C(r) & -C(r) Q_2 Q_4^{-1} \\ -Q_4^{-1} Q_3 C(r) & -Q_4^{-1} + Q_4^{-1} Q_3 C(r) Q_2 Q_4^{-1} \end{bmatrix} e,
\]

where \(\pi(0)\) is the initial probability vector,

\[
C(r) = \int_0^r e^{u R_1^{-1} B} \, du \; R_1^{-1},
\]

\(B = Q_1 - Q_2 Q_4^{-1} Q_3\), and \(e\) is a column vector of all ones.

7.1 Model Solution
The Markov reward model describing the system workload and resource depletion was solved partly using the SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator) [37] software package developed by researchers at Duke University.

Figs. 8 and 9 show the instantaneous reward rate (or the transient slope estimate) of usedSwapSpace and realMemoryFree respectively. The increase in the slope in the beginning can be attributed to the initial conditions and the dynamics of the stochastic model. The values settle down to a stable value fairly quickly, i.e., the steady state is reached within a short time.

Fig. 8. Transient slope estimate of usedSwapSpace.

Fig. 9. Transient slope estimate of realMemoryFree.

Table 6 gives the estimates for slope and time to exhaustion for usedSwapSpace and realMemoryFree computed

TABLE 6
Estimates for Slope (in KB/10 min) and Time to Exhaustion (in Days) for usedSwapSpace and realMemoryFree
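Two of the quantities behind Table 6 are easy to illustrate: the steady-state expected reward rate (the occupancy-weighted average of the per-state slopes) and the time-based exhaustion estimate from the line y = mx + c. The sketch below uses hypothetical occupancy probabilities, slopes, and initial value. Only the 312,724 KB maximum for usedSwapSpace is taken from the text. The workload-based time to exhaustion in the paper is the mean time to accumulate reward from the Markov reward model, so this linear extrapolation covers only the time-based approach.

```python
def steady_state_slope(occupancy, slopes):
    """Expected steady-state reward rate: sum over states of pi_i * r_i."""
    return sum(p * r for p, r in zip(occupancy, slopes))

def time_to_exhaustion_days(initial_kb, max_kb, slope_kb_per_10min):
    """Solve max = m * x + c for x (in 10-minute intervals) and convert to days."""
    if slope_kb_per_10min <= 0:
        raise ValueError("slope must be positive for an increasing resource")
    intervals = (max_kb - initial_kb) / slope_kb_per_10min
    return intervals * 10 / (60 * 24)

# Hypothetical two-state example: occupancy probabilities and slopes (KB/10 min).
avg_slope = steady_state_slope([0.5, 0.5], [100.0, 300.0])

# Hypothetical initial usage and slope; 312,724 KB is the usedSwapSpace maximum.
days = time_to_exhaustion_days(100_000, 312_724, 100.0)
```

For a decreasing resource such as realMemoryFree, the same function applies after the sign change described in the text, with usage growing from zero up to the initial free amount.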

using both the time based (workload independent) and the workload-based approaches. The slope for the time-based approach is obtained through Sen’s slope estimate [17]. The slope for the workload based estimation is obtained as the expected reward rate at steady state from the model. The 95 percent confidence interval for the workload-based slope estimations is larger since it allows for a larger degree of variability that possibly takes into account peaks in the data. The workload based estimation of time to resource exhaustion is computed as the mean time to accumulate reward \(r\), where \(r\) is the maximum amount of resource which can be consumed. In other words, \(E[T(r_{max})]\) is computed for each resource of interest. For a decreasing resource usage (negative reward rates), the problem is transformed into an increasing resource usage problem by changing all negative reward rates to positive rates and by assuming the resource usage increases from zero to the initial value (assumed to be the maximum). To estimate the time to resource exhaustion using the time-based approach, we use the initial intercept \(c\), the slope estimate \(m\) and the simple linear equation \(y = mx + c\). The initial points for both the estimations are taken to be the average value over the first few days. The maximum value for the usedSwapSpace in machine Rossby was 312,724 Kilobytes.

We find that workload based estimations give a lower time to resource exhaustion than the corresponding time based estimations [17]. It was observed that machines failed due to resource exhaustion much before the times to resource exhaustion estimated by the time based method. The upper confidence level for the workload based estimation can be used to estimate the peaks in resource usage. The lower confidence level estimate gives the general overall trend without being affected by the peaks. The reason that peaks in the data are more or less covered by the workload-based estimation is that the workload model takes into account those workload states that are likely to generate peaks in the data. Hence, the slope estimated by the workload model is larger. The time-based estimation removes these peaks and gives us only a general overall trend.

A comparison between the estimated times to exhaustion can help us ascertain which resources are important for us to monitor and manage; in this case, realMemoryFree is the more important or critical resource. The slope estimated using system workload is better than that estimated using the workload independent method. The workload based trend estimation coincides with the final values of realMemoryFree and the upper confidence level trend estimate coincides with the final peak values of realMemoryFree.

8 COMPREHENSIVE MODEL
In the measurement-based model described in the previous section, explicit failure and rejuvenation are not captured, i.e., there are no failure states or rejuvenation states. All the workload states are “up” states or available states where the software is doing useful work.

The macrostates of a software system employing rejuvenation can be represented by the semi-Markov process in Fig. 10 [16]. The software is available for service in state A. It can either fail in this state and enter state B (failure state) or a rejuvenation trigger can take it to state C. The software is considered unavailable in both states B and C. It is brought back to state A from B after (reactive) recovery and from state C after the completion of (proactive) rejuvenation.

The subordinated process in state A for the transaction-based system described in [16] was a queuing system with states representing the number of current transactions in the system. The measurement-based workload model developed in Section 5 can be substituted for this subordinated process as shown in Fig. 10.

The distribution of time to failure, \(F(t)\), of the software starting from the available state, A, is the distribution of time to exhaustion of the resource in the workload model. Since the computation of this distribution, in general, is very complicated [41] (more so with some states having zero reward rates), we approximate the distribution of time to failure with an increasing failure-rate (IFR) distribution

Fig. 10. Macrostates of the software behavior.
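The three-macrostate model of Fig. 10 can be explored numerically once the time to failure is approximated by a 2-stage Erlang distribution and the rejuvenation trigger period is deterministic, as Section 8 goes on to do. The sketch below is one possible implementation under those assumptions, not the authors' code. All function and parameter names are illustrative (`lam` is the Erlang rate, `delta` the trigger period, `gamma_f` and `gamma_r` the mean recovery and rejuvenation times, `c_f` and `c_r` the hourly costs), and plain bisection on the stationarity condition stands in for the fixed-point iteration described in Section 8.1. It requires `gamma_f > 2 * gamma_r`, which holds for the Section 8.2 inputs.

```python
import math

def _terms(delta, lam):
    e = math.exp(-lam * delta)
    p_ac = e * (1 + lam * delta)              # P(trigger fires before failure)
    hold = (2 - e * (2 + lam * delta)) / lam  # mean holding time in state A
    return p_ac, hold

def availability(delta, lam, gamma_f, gamma_r):
    p_ac, hold = _terms(delta, lam)
    return hold / (hold + gamma_f - p_ac * (gamma_f - gamma_r))

def downtime_cost(delta, lam, gamma_f, gamma_r, c_f, c_r):
    p_ac, hold = _terms(delta, lam)
    num = gamma_f * c_f + p_ac * (gamma_r * c_r - gamma_f * c_f)
    return num / (hold + gamma_f - p_ac * (gamma_f - gamma_r))

def bisect_root(f, lo, hi, tol=1e-9):
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

def best_period_availability(lam, gamma_f, gamma_r):
    # Stationarity condition of availability(delta), bracketed on (0, upper).
    k = lam * (gamma_f - 2 * gamma_r)
    f = lambda d: (gamma_f - gamma_r) * math.exp(-lam * d) + k * d - gamma_f
    return bisect_root(f, 0.0, gamma_f / k)

def best_period_cost(lam, gamma_f, gamma_r, c_f, c_r):
    # Stationarity condition of downtime_cost(delta), bracketed on (0, upper).
    a, b = gamma_f * c_f, gamma_f * c_f - gamma_r * c_r
    k = lam * (a - 2 * gamma_r * c_r + lam * gamma_f * gamma_r * (c_f - c_r))
    f = lambda d: b * math.exp(-lam * d) + k * d - a
    return bisect_root(f, 0.0, a / k)

# Section 8.2 inputs, expressed in hours: mean time to failure of 41.38 days.
lam = 2 / (41.38 * 24)
d_avail = best_period_availability(lam, 4.0, 1.0)      # hours
d_cost = best_period_cost(lam, 4.0, 1.0, 5000.0, 500.0)  # hours
```

Dividing `d_avail` and `d_cost` by 24 gives trigger periods in days that can be compared against the values reported in Section 8.2. Evaluating `availability` and `downtime_cost` on a grid of periods reproduces the kind of curves shown in Figs. 11 and 12.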

which has the same mean as that of the job completion time. We split the time to failure into two exponential stages with each exponential stage having a mean of \(\mu/2\), where \(\mu\) is obtained as the mean time to resource exhaustion in Section 7. This results in the time to failure distribution \(F(t)\) being a 2-stage Erlang given by

\[
F(t) = 1 - e^{-\lambda t} - \lambda t e^{-\lambda t}
\]

with parameter \(\lambda = 2/\mu\).

The rejuvenation trigger period (from state A to state C), \(\delta\), is assumed to be deterministic. The distributions of time to recover (from state B to state A) and time to perform rejuvenation (from state C to state A) are considered general with mean times \(\gamma_f\) and \(\gamma_r\), respectively.

8.1 Model Solution
As previously explained, the software can be in one of three states at time \(t\): available for service (state A), failure state (state B), and rejuvenation state (state C).

Solving for the embedded discrete time Markov chain steady-state probabilities [43], we get [16] \(\pi_A = 1/2\), \(\pi_B = p_{AB}/2\), and \(\pi_C = p_{AC}/2\), where \(p_{AC}\) is given by

\[
p_{AC} = 1 - F(\delta) = e^{-\lambda\delta}(1 + \lambda\delta)
\]

and

\[
p_{AB} = 1 - p_{AC}.
\]

Let \(H_A(t)\) be distribution of holding time in state A of the semi-Markov process. The mean holding time in state A, \(E[H_A(t)]\), is given by

\[
E[H_A(t)] = \int_0^{\delta} (1 - F(t)) \, dt = \frac{1}{\lambda}\left[2 - e^{-\lambda\delta}(2 + \lambda\delta)\right].
\]

As discussed in the previous section, the distributions of sojourn time in states B and C are considered general, with means \(\gamma_f\) and \(\gamma_r\), respectively.

8.1.1 Steady-State Availability
The steady-state availability of the software, i.e., the steady-state probability of the semi-Markov process being in state A, can be obtained as a function of \(\delta\) from the formula [43]

\[
\text{Avail} = \pi_A(\delta) = \frac{\pi_A E[H_A(t)]}{\pi_A E[H_A(t)] + \pi_B \gamma_f + \pi_C \gamma_r}
= \frac{\frac{1}{\lambda}\left[2 - e^{-\lambda\delta}(2 + \lambda\delta)\right]}{\frac{1}{\lambda}\left[2 - e^{-\lambda\delta}(2 + \lambda\delta)\right] + \gamma_f - e^{-\lambda\delta}(1 + \lambda\delta)(\gamma_f - \gamma_r)}. \tag{1}
\]

To obtain the value of optimal rejuvenation trigger period, \(\delta^*\), which maximizes availability, \(\pi_A(\delta)\), we differentiate \(\pi_A(\delta)\) with respect to \(\delta\) and set this value to 0. Therefore,

\[
\frac{d\pi_A(\delta)}{d\delta} = 0.
\]

Substituting for \(\pi_A(\delta)\) from (1), differentiating and simplifying the equation, we get

\[
(\gamma_f - \gamma_r) e^{-\lambda\delta^*} + \lambda(\gamma_f - 2\gamma_r)\delta^* - \gamma_f = 0,
\]

where \(\delta^*\) is the optimal rejuvenation trigger period and \(\lambda = 2/\mu\) as indicated earlier (i.e., \(\mu\) is mean time to resource exhaustion from the workload model). The above equation can be written as a fixed point equation of the form

\[
\delta^* = \frac{1}{\lambda} \ln\left[\frac{\gamma_f - \gamma_r}{\gamma_f - \lambda(\gamma_f - 2\gamma_r)\delta^*}\right] = g_a(\delta^*). \tag{2}
\]

It is easy to see that \(\delta^*\) lies in the open interval

\[
\left(0, \; \frac{\gamma_f}{\lambda(\gamma_f - 2\gamma_r)}\right).
\]

The solution of \(\delta^*\) can now be obtained by the bisection method, where two initial values \(\delta_1\) and \(\delta_2\) are picked such that \(\delta_1 < g_a(\delta_1)\) and \(\delta_2 > g_a(\delta_2)\). This computation is repeated several times by narrowing the distance between \(\delta_1\) and \(\delta_2\) each time till the desired accuracy of \(\delta^*\) is reached.

Over a time interval \([0, T]\), the expected uptime is given by \(U = T \times \text{Avail}\). The maximum (optimal) expected uptime is then given by \(U^* = T \times \text{Avail}^*\), where \(\text{Avail}^*\) corresponds to the optimal availability computed using \(\delta^*\).

8.1.2 Expected Downtime Cost
Another objective of rejuvenation could be to minimize the total downtime cost over a given interval. The total cost per unit time due to downtime is given by

\[
\text{Cost} = \pi_B \times c_{failure} + \pi_C \times c_{rejuv},
\]

where \(c_{failure}\) and \(c_{rejuv}\) are costs per unit time due to failure and rejuvenation respectively. Steady-state probabilities \(\pi_B\) and \(\pi_C\) can be computed in a way similar to the one for computing \(\pi_A\). Therefore,

\[
\text{Cost} = C(\delta) = \frac{\pi_B \gamma_f \times c_{failure} + \pi_C \gamma_r \times c_{rejuv}}{\pi_A E[H_A(t)] + \pi_B \gamma_f + \pi_C \gamma_r}
= \frac{\gamma_f \times c_{failure} + e^{-\lambda\delta}(1 + \lambda\delta)(\gamma_r \times c_{rejuv} - \gamma_f \times c_{failure})}{\frac{1}{\lambda}\left[2 - e^{-\lambda\delta}(2 + \lambda\delta)\right] + \gamma_f - e^{-\lambda\delta}(1 + \lambda\delta)(\gamma_f - \gamma_r)}. \tag{3}
\]

As before, to obtain the value of optimal rejuvenation trigger period, \(\delta'\), which minimizes downtime cost, \(C(\delta)\), we differentiate \(C(\delta)\) with respect to \(\delta\) and set this to 0. Therefore,

\[
\frac{dC(\delta)}{d\delta} = 0.
\]

Substituting for \(C(\delta)\) from (3), differentiating and simplifying the equation, we get

\[
(\gamma_f c_{failure} - \gamma_r c_{rejuv}) e^{-\lambda\delta'} + \left[\gamma_f c_{failure} - 2\gamma_r c_{rejuv} + \lambda\gamma_f\gamma_r(c_{failure} - c_{rejuv})\right]\lambda\delta' - \gamma_f c_{failure} = 0,
\]

where \(\delta'\) is the optimal rejuvenation trigger period. The above equation can be written as a fixed point equation of the form

\[
\delta' = \frac{1}{\lambda} \ln\left[\frac{\gamma_f c_{failure} - \gamma_r c_{rejuv}}{\gamma_f c_{failure} - \left[\gamma_f c_{failure} - 2\gamma_r c_{rejuv} + \lambda\gamma_f\gamma_r(c_{failure} - c_{rejuv})\right]\lambda\delta'}\right] = g_c(\delta'). \tag{4}
\]

Now, \(\delta'\) lies in the open interval

\[
\left(0, \; \frac{\gamma_f \, c_{failure}}{\lambda\left[\gamma_f c_{failure} - 2\gamma_r c_{rejuv} + \lambda\gamma_f\gamma_r(c_{failure} - c_{rejuv})\right]}\right).
\]

The solution of \(\delta'\) can now be obtained by the bisection method as explained earlier.

Over a time interval \([0, T]\), the expected cost due to downtime is given by \(C = T \times \text{Cost}\). The minimum expected downtime cost is then given by \(C' = T \times \text{Cost}'\), where \(\text{Cost}'\) corresponds to the optimal cost per unit of downtime computed using \(\delta'\).

8.2 Results for the Comprehensive Model
To illustrate the applicability of the model, we assume that the mean time to recover from a failure, \(\gamma_f\), is 4 hours and the mean time to rejuvenate the system, \(\gamma_r\), is 1 hour. The mean time to failure is 41.38 days (from Section 5). We also assume a cost of 5,000 dollars/hour for failure and 500 dollars/hour for rejuvenation. We compute the expected uptime and the expected cost over an interval of 1,000 days.

From (2) and (4), we get

\[
\delta^* = 496.56 \ln\left[\frac{3}{4 - 0.00403\,\delta^*}\right]
\]

\[
\delta' = 496.56 \ln\left[\frac{19{,}500}{20{,}000 - 38.33625\,\delta'}\right].
\]

Solving the above equations, we get \(\delta^* = 36.10\) days and \(\delta' = 5.60\) days. Fig. 11 shows the plot of the rejuvenation trigger period versus the expected uptime and Fig. 12 shows the plot of rejuvenation interval versus the expected cost due to downtime. As we can see, the results of the optimal rejuvenation trigger periods obtained graphically agree with the results obtained by solving the equations for \(\delta^*\) and \(\delta'\).

Fig. 11. Expected uptime versus rejuvenation trigger period.

Fig. 12. Expected cost versus rejuvenation trigger period.

9 CONCLUSION
In this paper, we included aging-related faults into Gray’s classification of software faults and discussed various methods to deal with these faults. We also dealt with software rejuvenation, a specific form of environment diversity that is gaining importance as an effective preventive maintenance technique. The main contribution of our work is a measurement-based model that incorporates the effect of system workload on operating system resources and an approach to investigate its effect on software aging. Since many studies have suggested strong correlations between workload and system reliability/availability, this model is an improvement over the purely time-based model [17]. We were able to clearly demonstrate the relation between the system workload and resource exhaustion and, thus, validate our assumption that workload does affect the way various system resources are depleted. Not only were the system workload dynamics captured in the model, but also the effect of workload on resource depletion was quantified by means of reward rates or slopes in the model. In doing this, we have also validated the “software aging” phenomenon with respect to resource exhaustion. The quantification of resource exhaustion for different workload states helps us in obtaining a better estimation for exhaustion rates and time to exhaustion than just time based estimation. The analysis of relative importance among different resources and estimated time to resource exhaustion [17] can be done here with greater accuracy. Finally, we also integrated the analytical model and the measurement-based model to obtain a comprehensive model. The model was solved to obtain the optimal rejuvenation trigger period for maximizing expected uptime and for minimizing expected cost due to downtime. Although the estimations obtained from these methodologies cannot be taken to be estimations of actual machine failure times since they might depend on various other factors, all these results take us a step further toward predicting actual failure times and may help us in proposing new or better preventive maintenance policies like “software rejuvenation.” These models are very generic and can be applied to any other data set. The models described in this paper are offline models, as opposed to online models [9], [45]. Once a model is built, it can be used until the configuration or the workload pattern of the machine is changed.

A possible extension of this work could be to consider the system workload in a more fine-grained manner and to determine the relation between each individual process or a group of similar processes and resource consumption.

Future work along these lines could be to model the distribution of time to failure more accurately and take multiple resources into consideration while developing the model.

ACKNOWLEDGMENTS
This work was done while the first author was at Duke University. The project was supported in part by Telcordia Technologies as a core project of the CACC, by the JPL REE project under Contract 1216658, by the AFOSR under MURI grant F49620-1-0327, by the ARO under a MURI grant C-DAAD19 01-1-0646, “Mathematics of Failures in Complex Systems” and by the SITAR project of the DARPA OASIS Grant N66001-00-C.

REFERENCES
[1] P.E. Amman and J.C. Knight, “Data Diversity: An Approach to Software Fault Tolerance,” Proc. 17th Int’l Symp. Fault Tolerant Computing, pp. 122-126, June 1987.
[2] A. Avritzer and E.J. Weyuker, “Monitoring Smoothly Degrading Systems for Increased Dependability,” Empirical Software Eng. J., vol. 2, no. 1, pp. 59-77, 1997.
[3] A. Avizienis and L. Chen, “On the Implementation of N-Version Programming for Software Fault Tolerance During Execution,” Proc. IEEE COMPSAC 77 Conf., pp. 149-155, Nov. 1977.
[4] A. Avizienis, J.-C. Laprie, and B. Randell, “Fundamental Concepts of Dependability,” LAAS Technical Report No. 01-145, LAAS, France, Apr. 2001.
[5] Y. Bao, X. Sun, and K. Trivedi, “Adaptive Software Rejuvenation: Degradation Models and Rejuvenation Schemes,” Proc. Int’l. Conf. Dependable Systems and Networks (DSN-2003), June 2003.
[6] A. Bobbio, A. Sereno, and C. Anglano, “Fine Grained Software Degradation Models for Optimal Rejuvenation Policies,” Performance Evaluation, vol. 46, pp. 45-62, 2001.
[7] T. Boyd and P. Dasgupta, “Preemptive Module Replacement Using the Virtualizing Operating System,” Proc. Workshop Self-Healing, Adaptive and Self-Managed Systems (SHAMAN 2002), June 2002.
[8] K. Cassidy, K. Gross, and A. Malekpour, “Advanced Pattern Recognition for Detection of Complex Software Aging in Online Transaction Processing Servers,” Proc. Int’l Conf. Dependable Systems and Networks (DSN 2002), June 2002.
[9] V. Castelli, R.E. Harper, P. Heidelberger, S.W. Hunter, K.S. Trivedi, K. Vaidyanathan, and W. Zeggert, “Proactive Management of Software Aging,” IBM J. Research & Development, vol. 45, no. 2, Mar. 2001.
[10] M. Chereque, D. Powell, P. Reynier, J.-L. Richier, and J. Voiron, “Active Replication in Delta-4,” Proc. 22nd IEEE Int’l. Symp. Fault Tolerant Computing (FTCS-22), pp. 28-37, July 1992.
[11] R. Chillarege, S. Biyani, and J. Rosenthal, “Measurement of Failure Rate in Widely Distributed Software,” Proc. 25th IEEE Int’l Symp. Fault Tolerant Computing, pp. 424-433, July 1995.
[12] T. Dohi, K. Goseva-Popstojanova, and K.S. Trivedi, “Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule,” Proc. 2000 Pacific Rim Int’l Symp. Dependable Computing (PRDC 2000), Dec. 2000.
[13] C. Fetzer and K. Hostedt, “Rejuvenation and Failure Detection in Partitionable Systems,” Proc. Pacific Rim Int’l Symp. Dependable Computing (PRDC 2001), Dec. 2001.
[14] S. Garg, A. Puliafito, and K.S. Trivedi, “Analysis of Software Rejuvenation Using Markov Regenerative Stochastic Petri Net,” Proc. Sixth Int’l Symp. Software Reliability Eng., pp. 180-187, Oct. 1995.
[15] S. Garg, Y. Huang, C. Kintala, and K.S. Trivedi, “Minimizing Completion Time of a Program by Checkpointing and Rejuvenation,” Proc. 1996 ACM SIGMETRICS Conf., pp. 252-261, May 1996.
[16] S. Garg, A. Puliafito, M. Telek, and K.S. Trivedi, “Analysis of Preventive Maintenance in Transactions Based Software Systems,” IEEE Trans. Computers, vol. 47, no. 1, pp. 96-107, Jan. 1998.
[17] S. Garg, A. van Moorsel, K. Vaidyanathan, and K. Trivedi, “A Methodology for Detection and Estimation of Software Aging,” Proc. Ninth Int’l Symp. Software Reliability Eng., pp. 282-292, Nov. 1998.
[18] S. Garg, Y. Huang, C.M.R. Kintala, K.S. Trivedi, and S. Yagnik, “Performance and Reliability Evaluation of Passive Replication Schemes in Application Level Fault Tolerance,” Proc. Fault Tolerant Computing Symp. (FTCS 1999), pp. 322-329, June 1999.
[19] R.O. Gilbert, Statistical Methods for Environmental Pollution Monitoring. New York: Van Nostrand Reinhold, 1987.
[20] J. Gray, “Why Do Computers Stop and What Can Be Done About It?” Proc. Fifth Symp. Reliability in Distributed Software and Database Systems, pp. 3-12, Jan. 1986.
[21] J. Gray, “A Census of Tandem System Availability between 1985 and 1990,” IEEE Trans. Reliability, vol. 39, pp. 409-418, Oct. 1990.
[22] J.A. Hartigan, Clustering Algorithms. New York: Wiley, 1975.
[23] Y. Hong, D. Chen, L. Li, and K.S. Trivedi, “Closed Loop Design for Software Rejuvenation,” Proc. Workshop Self-Healing, Adaptive and Self-Managed Systems (SHAMAN 2002), June 2002.
[24] Y. Huang, P. Jalote, and C. Kintala, “Two Techniques for Transient Software Error Recovery,” Lecture Notes in Computer Science, vol. 774, pp. 159-170, 1994.
[25] Y. Huang, C. Kintala, N. Kolettis, and N.D. Fulton, “Software Rejuvenation: Analysis, Module and Applications,” Proc. 25th Symp. Fault Tolerant Computing (FTCS-25), pp. 381-390, June 1995.
[26] M.C. Hsueh, R.K. Iyer, and K.S. Trivedi, “Performability Modeling Based on Real Data: A Case Study,” IEEE Trans. Computers, vol. 37, no. 4, pp. 478-484, Apr. 1988.
[27] “IBM Netfinity Director Software Rejuvenation,” White Paper, IBM Corp., Research Triangle Park, N.C., Jan. 2001.
[28] P. Jalote, Y. Huang, and C. Kintala, “A Framework for Understanding and Handling Transient Software Failures,” Proc. Second ISSAT Int’l Conf. Reliability and Quality in Design, 1995.
[29] J.C. Knight and N.G. Leveson, “An Experimental Evaluation of the Assumption of Independence in Multiversion Programming,” Software Eng. J., vol. 12, no. 1, pp. 96-109, 1986.
[30] I. Lee and R.K. Iyer, “Software Dependability in the Tandem GUARDIAN System,” IEEE Trans. Software Eng., vol. 21, no. 5, pp. 455-467, May 1995.
[31] L. Li, K. Vaidyanathan, and K.S. Trivedi, “An Approach to Estimation of Software Aging in a Web Server,” Proc. Int’l Symp. Empirical Software Eng. (ISESE 2002), Oct. 2002.
[32] Y. Liu, Y. Ma, J.J. Han, H. Levendel, and K.S. Trivedi, “Modeling and Analysis of Software Rejuvenation in Cable Modem Termination System,” Proc. Int’l Symp. Software Reliability Eng. (ISSRE 2002), Nov. 2002.
[33] E. Marshall, “Fatal Error: How Patriot Overlooked a Scud,” Science, p. 1347, Mar. 1992.
[34] D. Powell, J. Arlat, L. Beus-Dukic, A. Bondavalli, P. Coppola, A. Fantechi, E. Jenn, C. Rabejac, and A. Wellings, “GUARDS: A Generic Upgradable Architecture for Real-Time Dependable Systems,” IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 6, June 1999.
[35] D. Powell, “Distributed Fault Tolerance: Lessons from Delta-4,” IEEE Micro, vol. 14, no. 1, Feb. 1994.
[36] A. Pfening, S. Garg, A. Puliafito, M. Telek, and K.S. Trivedi, “Optimal Rejuvenation for Tolerating Soft Failures,” Performance Evaluation, vols. 27-28, pp. 491-506, Oct. 1996.
[37] R.A. Sahner, K.S. Trivedi, and A. Puliafito, Performance and Reliability Analysis of Computer Systems - An Example-Based Approach Using the SHARPE Software Package. Norwell, Mass.: Academic Publishers, 1996.
[38] P.K. Sen, “Estimates of the Regression Coefficient Based on Kendall’s Tau,” J. the Am. Statistical Assoc., vol. 63, pp. 1379-1389, 1968.
[39] M. Sullivan and R. Chillarege, “Software Defects and Their Impact on System Availability - A Study of Field Failures in Operating Systems,” Proc. 21st IEEE Int’l Symp. Fault-Tolerant Computing, pp. 2-9, 1991.
[40] A.T. Tai, S.N. Chau, L. Alkalaj, and H. Hecht, “On-Board Preventive Maintenance: Analysis of Effectiveness and Optimal Duty Period,” Proc. Third Int’l Workshop Object Oriented Real-Time Dependable Systems, Feb. 1997.
[41] M. Telek, A. Pfening, and G. Fodor, “An Effective Numerical Method to Compute the Moments of the Completion Time of Markov Reward Models,” Computer Math. Applications, vol. 36, no. 8, pp. 59-65, 1998.

[42] A. Thakur and R.K. Iyer, “Analyze-NOW - An Environment for Collection and Analysis of Failures in a Network of Workstations,” Proc. Int’l Symp. Software Reliability Eng., pp. 14-23, Apr. 1996.
[43] K.S. Trivedi, Probability and Statistics, with Reliability, Queuing, and Computer Science Applications, second ed. John Wiley, 2001.
[44] K. Vaidyanathan and K.S. Trivedi, “A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems,” Proc. 10th IEEE Int’l Symp. Software Reliability Eng., pp. 84-93, Nov. 1999.
[45] K. Vaidyanathan, R.E. Harper, S.W. Hunter, and K.S. Trivedi, “Analysis and Implementation of Software Rejuvenation in Cluster Systems,” Proc. Joint Int’l Conf. Measurement and Modeling of Computer Systems, ACM SIGMETRICS 2001/Performance 2001, June 2001.
[46] Y.-M. Wang, Y. Huang, P.-Y. Chung, P. Vo, and C. Kintala, “Checkpointing and Its Applications,” Proc. Symp. Fault Tolerant Computer Systems, pp. 22-31, June 1995.
[47] W. Xie, Y. Hong, and K.S. Trivedi, “Software Rejuvenation Policies for Cluster Systems under Varying Workload,” Proc. 10th Int’l Pacific Rim Dependable Computing Symp. (PRDC 2004), Mar. 2004.

Kalyanaraman Vaidyanathan received the BE degree in computer science from the University of Madras, India, and the MS and PhD degrees in electrical and computer engineering from Duke University (2002). He was a recipient of the IBM Graduate Fellowship Award in 2000. His research interests include software reliability and performance and dependability evaluation of computer systems. He is currently a research engineer in the Scalable Systems Group, Sun Microsystems, San Diego, California, exploring proactive fault monitoring techniques through telemetry and pattern recognition. He is a member of the IEEE.

Kishor S. Trivedi holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, North Carolina. He has been on the Duke faculty since 1975. He is the author of a well-known text entitled Probability and Statistics with Reliability, Queuing and Computer Science Applications with a thoroughly revised second edition being published by John Wiley. He has also published two other books entitled Performance and Reliability Analysis of Computer Systems (Kluwer Academic Publishers) and Queueing Networks and Markov Chains (John Wiley). His research interests are in reliability and performance assessment of computer and communication systems. He has published more than 390 articles and lectured extensively on these topics. He has supervised 39 PhD dissertations. He has made seminal contributions in software rejuvenation, solution techniques for Markov chains, fault trees, stochastic Petri nets, and performability models. He has actively contributed to the quantification of security and survivability. He is a fellow of the IEEE and a Golden Core Member of IEEE Computer Society. He is a codesigner of HARP, SAVE, SHARPE, and SPNP software packages that have been well circulated.
