IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 2, NO. 2, APRIL-JUNE 2005
A Comprehensive Model
for Software Rejuvenation
Kalyanaraman Vaidyanathan, Member, IEEE, and Kishor S. Trivedi, Fellow, IEEE
Abstract—Recently, the phenomenon of software aging, one in which the state of the software system degrades with time, has been
reported. This phenomenon, which may eventually lead to system performance degradation and/or crash/hang failure, is the result of
exhaustion of operating system resources, data corruption, and numerical error accumulation. To counteract software aging, a
technique called software rejuvenation has been proposed, which essentially involves occasionally terminating an application or a
system, cleaning its internal state and/or its environment, and restarting it. Since rejuvenation incurs an overhead, an important
research issue is to determine optimal times to initiate this action. In this paper, we first describe how to include faults attributed to
software aging in the framework of Gray’s software fault classification (deterministic and transient), and study the treatment and
recovery strategies for each of the fault classes. We then construct a semi-Markov reward model based on workload and resource
usage data collected from the UNIX operating system. We identify different workload states using statistical cluster analysis and estimate
transition probabilities and sojourn time distributions from the data. Corresponding to each resource, a reward function is then defined
for the model based on the rate of resource depletion in each state. The model is then solved to obtain estimated times to exhaustion
for each resource. The results from the semi-Markov reward model are then fed into a higher-level availability model that accounts for
failure followed by reactive recovery, as well as proactive recovery. This comprehensive model is then used to derive optimal
rejuvenation schedules that maximize availability or minimize downtime cost.
Index Terms—Availability, measurement-based dependability evaluation, semi-Markov reward models, software aging, software
rejuvenation, workload characterization.
1 INTRODUCTION
Garg et al. [17] present a measurement-based approach for detecting and estimating trends and times to exhaustion of operating system resources due to software aging. The data collection technique used in the present paper, including the workload and resource usage variables monitored, is the same as in their work. While the work by Garg et al. [17] considers only time-based trend detection and estimation of resource exhaustion, Vaidyanathan and Trivedi [44] take the system workload into account for building a model to estimate resource exhaustion times. In this paper, we discuss a comprehensive model, which first extends the workload-based approach by performing transient analysis and formulating the estimated time to resource exhaustion as the mean time to accumulated reward in a semi-Markov reward model. We then develop an upper-level availability model that accounts for failure and rejuvenation, and helps us derive optimal rejuvenation schedules. Cassidy et al. [8] have developed an approach to rejuvenation for large online transaction processing servers. Using pattern recognition methods, they find that 13 out of several monitored parameters deviate from “normal” behavior just prior to a crash, providing sufficient warning to initiate rejuvenation. Li et al. [31] present an approach based on time series analysis to predict resource usage trends in a Web server while subjecting it to an artificial workload.

Several papers have dealt with determining the optimal times to perform preventive maintenance of operational software systems through analytical models. The accuracy of this approach is determined by the assumptions made in the model for capturing aging. In many papers [14], [15], [25], [40], only the failures causing unavailability of the software are considered, while Pfening et al. [36] assume a gradually decreasing service rate of software that serves transactions. Garg et al. [16], however, consider both these effects of aging together in a single model. Models proposed in some of the papers [14], [25] are restricted to hypoexponentially distributed time to failure, while models proposed in others [15], [36], [40] can accommodate general distributions, but only for the specific aging effect they capture. A generally distributed time to failure, as well as a service rate that is an arbitrary function of time, is allowed in the work by Garg et al. [16]. Only their model captures the effect of load on aging. In the work by Dohi et al. [12], software rejuvenation models are formulated using semi-Markov processes, and optimal rejuvenation schedules that maximize availability and minimize cost are derived analytically. Nonparametric statistical algorithms to estimate the optimal schedules are developed given sample data of failure times. The use of rejuvenation has been extended to cluster systems [9], [45], and analytical models of the implementation show that employing software rejuvenation in cluster systems results in a significant increase in system availability and decrease in downtime cost. In [47], models are presented for software rejuvenation in cluster systems under varying workloads. Bobbio et al. [6] present fine grained degradation models where one can identify the current degradation level based on the observation of a system parameter. Optimal rejuvenation policies based on a risk criterion and an alert threshold are then presented.

In this work, we bridge the gap between the measurement-based and the analytic modeling approaches by first building a measurement-based semi-Markov workload model. Both the structure (states and state transitions) and the parameters (transition probabilities and sojourn time distributions) are determined using statistical estimation on measured data. Measured data on the rate of resource depletion in each state is used as the reward rate, leading to the computation of the estimated time to exhaustion of each resource. These results are then used to build a higher-level semi-Markov availability model. This comprehensive model takes into account both reactive recovery following a failure due to resource exhaustion and rejuvenation, and is used to derive optimal rejuvenation schedules.

Other related work in measurement-based dependability evaluation is based on either measurements made at failure times [11] or at error observation times [42]. In our case, we monitor the system performance variables continuously since we are interested in trend estimation and, hence, in predicting the time to next failure and not in observing interfailure times or identifying error patterns. Hsueh et al. [26] use a semi-Markov reward model to estimate the cost of different types of errors. The clustering method they use is very similar to ours but, in their work, reward rates are assigned based on error rates while, in our case, reward rates are attached to states reflecting the rate of resource depletion in that state.

4 EXPERIMENTAL SETUP AND DATA COLLECTION

4.1 SNMP-Based Distributed Resource Monitoring Tool
The SNMP (Simple Network Management Protocol)-based distributed resource monitoring tool developed by Garg et al. [17] was used for our data collection. SNMP is an application protocol which offers network management services in the Internet protocol suite. The manager, the agent, and the MIB (Management Information Base) form the main constituents of an SNMP-based management tool. The SNMP protocol defines a client-server relationship between a manager and an agent, and the MIB describes the information that can be obtained and/or modified through interactions between the manager and the agent.

4.2 Data Collection
The resource monitoring tool referred to previously was used to collect operating system resource usage data (physical/virtual memory usage, file/process table usage, etc.) and system activity data (paging activity, CPU utilization, etc.) from nine heterogeneous UNIX workstations, which were connected by an Ethernet LAN at the Duke Department of Electrical and Computer Engineering. In our setup, shown in Fig. 3, a central monitoring station runs the manager program which sends get requests periodically to each of the agent programs running on the monitored workstations. The agent programs in turn obtain data for the manager from their respective machines by executing various standard UNIX utility programs like pstat, iostat, and vmstat. Also shown in the figure along with the monitored workstations are their respective operating systems and primary functions in the department.
128 IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 2, NO. 2, APRIL-JUNE 2005
TABLE 1
Statistics for the Workload Variables Measured
TABLE 2
Statistics for the Workload Clusters
The statistics for the eight workload clusters are shown in Table 2. Also shown in the table is the percentage of the sample data points in each cluster. We observe that more than 75 percent of the points belong to clusters 4, 5, and 7, which are relatively light workload states. Clusters 1 and 8 are heavy workload states and contain significantly fewer data points.

5.2 The State-Transition Model
The next step, after the clusters and centroids are identified, is to build a state-transition model for the system workload. This is done by determining the transition probabilities from one state to another. The transition probability $p_{ij}$ from a state $i$ to a state $j$ can be estimated from the sample data using the following formula [26], [43]:

$$p_{ij} = \frac{\text{observed no. of transitions from state } i \text{ to state } j}{\text{total observed no. of transitions from state } i}.$$

State transition probabilities were estimated from the observed data for the eight workload states, and the resulting $(8 \times 8)$ transition probability matrix $P$ is shown below:

$$P = \begin{bmatrix}
0.000 & 0.155 & 0.224 & 0.129 & 0.259 & 0.034 & 0.165 & 0.034 \\
0.071 & 0.000 & 0.136 & 0.140 & 0.316 & 0.026 & 0.307 & 0.004 \\
0.122 & 0.226 & 0.000 & 0.096 & 0.426 & 0.000 & 0.113 & 0.017 \\
0.147 & 0.363 & 0.059 & 0.000 & 0.098 & 0.216 & 0.088 & 0.029 \\
0.033 & 0.068 & 0.037 & 0.011 & 0.000 & 0.004 & 0.847 & 0.000 \\
0.070 & 0.163 & 0.023 & 0.535 & 0.116 & 0.000 & 0.023 & 0.070 \\
0.022 & 0.049 & 0.003 & 0.003 & 0.920 & 0.003 & 0.000 & 0.000 \\
0.307 & 0.077 & 0.154 & 0.231 & 0.077 & 0.154 & 0.000 & 0.000
\end{bmatrix}.$$

5.3 Sojourn Time Distribution
To completely specify the semi-Markov process, we need to determine the distribution of sojourn time in each state. This distribution for all the workload states was fitted to either 2-stage hyperexponential or 2-stage hypoexponential distribution functions. The sojourn times in workload states $W_1$, $W_5$, and $W_7$ were fitted to hypoexponential distributions, while the sojourn times in all other states were fitted to hyperexponential distributions [43]. The fitted distributions were tested using the Kolmogorov-Smirnov test [43] at a significance level of 0.01.

Fig. 4 shows the Kolmogorov-Smirnov test for the sojourn time distribution in workload state 4, and the fitted distributions for all the workload states are listed in Table 3.

Even though our model was constructed from real measurements, the clustering method results only in an approximate model for the workload. We made a quick check of the model accuracy by comparing the steady state probability of occupying a particular workload state computed from the model to the estimated probability from the observed data, i.e., the ratio of the length of time the system was in that workload state to the total length of the period of observation. These values, shown in Table 4, match closely.

6 MODELING RESOURCE DEPLETION
The previous section discussed the development of a semi-Markov process to describe the system workload. Since our original aim was to estimate the exhaustion of system resources as a function of workload, we need to incorporate the effect of workload on the resources in this model. Therefore, a reward function corresponding to each system resource considered for analysis is assigned to the model. This will enable us to study the evolution of the system
resource in the different states of the workload model. The reward rate, $r_{ij}$, for each workload state, $i$, and for each resource, $j$, is assigned the estimated slope (described in the next subsection) of the depletion of resource $j$ in the workload state $i$.

We consider two resources here for our analysis: usedSwapSpace and realMemoryFree. The time plots of these two resources as measured on machine Rossby between two successive reboots/failures are shown in Fig. 5. There are no reboots during this period. The time interval for which the plots are shown corresponds to the time interval from which the workload data used for building the state-transition model (in the previous section) was sampled.

TABLE 4
Comparison of State Occupancy Probabilities

TABLE 5
Slope Estimates (in KB/10 min) for usedSwapSpace and realMemoryFree at Different Workload States

6.1 Computing the Slope
If a linear trend is present in the data, linear regression methods can be used to estimate the true slope by computing the least squares estimate of the slope. The slope obtained by this method can deviate greatly from the true slope if there are gross errors or outliers in the data [19]. Therefore, we estimate the true slope of a resource depletion at every workload state by using a nonparametric procedure developed by Sen [38]. Sen’s method is not greatly affected by gross data errors or outliers, and it can be computed even with missing data.

The procedure for estimating the slope by this method is as follows: First, $n'$ slope estimates are calculated for all pairs of points at $i$ and $j$ for which $i > j$, as $m_{ij} = (x_i - x_j)/(i - j)$. We thus obtain $n' = n(n-1)/2$ slope estimates, where $n$ is the total number of data observations. The median of these $n'$ values is the required slope estimate. It is also possible to obtain a two-sided confidence interval about the true slope [19].

Table 5 shows the slope obtained by Sen’s method for these two resources, for all the eight workload states. The slopes shown in this table are the slopes of the respective resource depletion during the longest interval for which the system was at that workload state. For a particular workload state, it is possible to obtain the slope of resource depletion during different visits to the state. We expect the different slopes thus obtained to be relatively close. Figs. 6 and 7 show the observed values of usedSwapSpace and realMemoryFree, respectively, at workload state 4 during two different visits (with different sojourn times for each visit). The estimated slopes (in KB/10 min) are also given, with the 95 percent confidence intervals in parentheses beside the slope estimates. We observe that the slopes in both instances are within the same confidence limits. Thus, in our model, these slopes correspond to the reward rates for each workload state for usedSwapSpace and realMemoryFree.

Fig. 6. usedSwapSpace in W4 during two different visits.

The observation that slopes in a given workload state are within the same confidence intervals (refer to Figs. 6 and 7), together with the observation that slopes across different workload states are different (refer to Table 5), validates our assumption that resource depletion does depend on the system workload and that the rates of exhaustion vary with workload changes. We observe that the slopes for usedSwapSpace in all the workload states are nonnegative, and the slopes for realMemoryFree are nonpositive in all the workload states except one. This shows that usedSwapSpace increases whereas realMemoryFree decreases over time. This validates the “software aging” phenomenon [17], [25]. It is also generally observed that the higher the system activity, the higher the resource depletion. The 95 percent confidence intervals for the slope are relatively wide for usedSwapSpace in workload state $W_1$ and for realMemoryFree in workload state $W_8$. This reflects the relatively small number of data points in these states and accounts for a high degree of variability.

7 RESULTS FOR THE WORKLOAD MODEL
The irreducible semi-Markov reward model (SMP) described above was converted into an irreducible Markov reward model (MRM) before solving since all the sojourn times have phase-type distributions. For examples and
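As an illustration of the slope procedure in Section 6.1, Sen’s estimate can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used for the measurements: the function name `sen_slope` and the representation of the data as (time index, value) pairs are hypothetical choices made here, missing observations are handled simply by leaving their pairs out, and the confidence interval computation described in [19] is not shown.

```python
from itertools import combinations
from statistics import median


def sen_slope(samples):
    """Return Sen's slope estimate for a list of (time_index, value) pairs.

    Every pair of observations contributes one slope
    m_ij = (x_i - x_j) / (i - j); the estimate is the median of these slopes.
    Missing observations are simply absent from `samples` (an assumption of
    this sketch).
    """
    slopes = [
        (x_i - x_j) / (i - j)
        for (j, x_j), (i, x_i) in combinations(samples, 2)
        if i != j  # skip duplicate time indices to avoid division by zero
    ]
    if not slopes:
        raise ValueError("need at least two observations at distinct times")
    return median(slopes)


if __name__ == "__main__":
    # Hypothetical data: a linear trend of 3 units per step with one gross outlier.
    series = [(i, 3 * i + 1) for i in range(10)]
    series[5] = (5, 1000)
    print(sen_slope(series))
```

Because the estimate is a median of pairwise slopes, the single outlier in the hypothetical series above does not move it away from the underlying trend, which is the robustness property the text relies on.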
First, we compute the expected steady state reward rate, which can be considered the rate of depletion of the resource averaged over all workload states. This can then be compared with the slope obtained using the (workload-independent) pure time-based approach of [17]. The expected reward rate at time $t$ is then computed in order to determine how quickly the steady state is reached. Finally, the expected time to accumulate reward $r$ is computed in order to estimate the mean time to exhaustion of the resource.

The following procedure is used to compute the expected time to accumulate reward $r$. The states of the MRM can be partitioned into two sets based on the assigned reward rates: $S$, which contains all states with positive reward rates, and $S^z = \bar{S}$, which contains all the states with zero reward rates. Based on this partition, the generator matrix $Q$ of the MRM can be rearranged and partitioned into four submatrices such that

$$Q = \begin{bmatrix} Q_1 & Q_2 \\ Q_3 & Q_4 \end{bmatrix},$$

where $Q_1$ contains all the transitions within $S$, $Q_2$ contains the transitions from $S$ to $S^z$, $Q_3$ contains the transitions from $S^z$ to $S$, and $Q_4$ contains all the transitions within $S^z$. Similarly, the reward rate matrix $R$ can be partitioned such that

$$R = \begin{bmatrix} R_1 & 0 \\ 0 & 0 \end{bmatrix},$$

where $R_1 = \mathrm{diag}_{i \in S}[r_i]$ and each $r_i$ corresponds to the ordering of states in $Q_1$.

Let $T(r)$ represent the time to accumulate reward $r$. The expected time to accumulate reward $r$, $E[T(r)]$, is given by [41]

$$E[T(r)] = \pi(0) \begin{bmatrix} C(r) & -C(r)\,Q_2 Q_4^{-1} \\ -Q_4^{-1} Q_3\, C(r) & -Q_4^{-1} + Q_4^{-1} Q_3\, C(r)\, Q_2 Q_4^{-1} \end{bmatrix} e,$$

where

$$C(r) = \int_0^r e^{u R_1^{-1} B}\, du\; R_1^{-1},$$

$B = Q_1 - Q_2 Q_4^{-1} Q_3$, and $e$ is a column vector of all ones.

7.1 Model Solution
The Markov reward model describing the system workload and resource depletion was solved partly using the SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator) [37] software package developed by researchers at Duke University.

Figs. 8 and 9 show the instantaneous reward rate (or the transient slope estimate) of usedSwapSpace and realMemoryFree, respectively. The increase in the slope in the beginning can be attributed to the initial conditions and the dynamics of the stochastic model. The values settle down to a stable value fairly quickly, i.e., the steady state is reached within a short time.

Fig. 9. Transient slope estimate of realMemoryFree.

Table 6 gives the estimates for slope and time to exhaustion for usedSwapSpace and realMemoryFree computed
using both the time-based (workload-independent) and the workload-based approaches. The slope for the time-based approach is obtained through Sen’s slope estimate [17]. The slope for the workload-based estimation is obtained as the expected reward rate at steady state from the model. The 95 percent confidence interval for the workload-based slope estimates is larger since it allows for a larger degree of variability that possibly takes into account peaks in the data. The workload-based estimate of time to resource exhaustion is computed as the mean time to accumulate reward $r$, where $r$ is the maximum amount of resource which can be consumed. In other words, $E[T(r_{max})]$ is computed for each resource of interest. For a decreasing resource usage (negative reward rates), the problem is transformed into an increasing resource usage problem by changing all negative reward rates to positive rates and by assuming the resource usage increases from zero to the initial value (assumed to be the maximum). To estimate the time to resource exhaustion using the time-based approach, we use the initial intercept $c$, the slope estimate $m$, and the simple linear equation $y = mx + c$. The initial point for both estimations is taken to be the average value over the first few days. The maximum value for usedSwapSpace in machine Rossby was 312,724 Kilobytes.

TABLE 6
Estimates for Slope (in KB/10 min) and Time to Exhaustion (in Days) for usedSwapSpace and realMemoryFree

We find that the workload-based estimations give a lower time to resource exhaustion than the corresponding time-based estimations [17]. It was observed that machines failed due to resource exhaustion much before the times to resource exhaustion estimated by the time-based method. The upper confidence level for the workload-based estimation can be used to estimate the peaks in resource usage. The lower confidence level estimate gives the general overall trend without being affected by the peaks. The reason that peaks in the data are more or less covered by the workload-based estimation is that the workload model takes into account those workload states that are likely to generate peaks in the data. Hence, the slope estimated by the workload model is larger. The time-based estimation removes these peaks and gives us only a general overall trend.

A comparison between the estimated times to exhaustion can help us ascertain which resources are important for us to monitor and manage; in this case, realMemoryFree is the more important or critical resource. The slope estimated using the system workload is better than that estimated using the workload-independent method. The workload-based trend estimate coincides with the final values of realMemoryFree, and the upper confidence level trend estimate coincides with the final peak values of realMemoryFree.

8 COMPREHENSIVE MODEL
In the measurement-based model described in the previous section, explicit failure and rejuvenation are not captured, i.e., there are no failure states or rejuvenation states. All the workload states are “up” states or available states where the software is doing useful work.

The macrostates of a software system employing rejuvenation can be represented by the semi-Markov process in Fig. 10 [16]. The software is available for service in state A. It can either fail in this state and enter state B (failure state) or a rejuvenation trigger can take it to state C. The software is considered unavailable in both states B and C. It is brought back to state A from B after (reactive) recovery and from state C after the completion of (proactive) rejuvenation.

Fig. 10. Macrostates of the software behavior.

The subordinated process in state A for the transaction-based system described in [16] was a queuing system with states representing the number of current transactions in the system. The measurement-based workload model developed in Section 5 can be substituted for this subordinated process as shown in Fig. 10.

The distribution of time to failure, $F(t)$, of the software starting from the available state, A, is the distribution of time to exhaustion of the resource in the workload model. Since the computation of this distribution, in general, is very complicated [41] (more so with some states having zero reward rates), we approximate the distribution of time to failure with an increasing failure-rate (IFR) distribution
which has the same mean as that of the job completion time. We split the time to failure into two exponential stages with each exponential stage having a mean of $\mu/2$, where $\mu$ is obtained as the mean time to resource exhaustion in Section 7. This results in the time to failure distribution $F(t)$ being a 2-stage Erlang given by

$$F(t) = 1 - e^{-\lambda t} - \lambda t\, e^{-\lambda t}$$

with parameter $\lambda = 2/\mu$.

The rejuvenation trigger period (from state A to state C), $\delta$, is assumed to be deterministic. The distributions of time to recover (from state B to state A) and time to perform rejuvenation (from state C to state A) are considered general with mean times $\mu_f$ and $\mu_r$, respectively.

8.1 Model Solution
As previously explained, the software can be in one of three states at time $t$: available for service (state A), failure state (state B), and rejuvenation state (state C).

Solving for the embedded discrete time Markov chain steady-state probabilities [43], we get [16] $\nu_A = 1/2$, $\nu_B = p_{AB}/2$, and $\nu_C = p_{AC}/2$, where $p_{AC}$ is given by

$$p_{AC} = 1 - F(\delta) = e^{-\lambda\delta}(1 + \lambda\delta)$$

and

$$p_{AB} = 1 - p_{AC}.$$

Let $H_A(t)$ be the distribution of holding time in state A of the semi-Markov process. The mean holding time in state A, $E[H_A(t)]$, is given by

$$E[H_A(t)] = \int_0^{\delta} (1 - F(t))\, dt = \frac{1}{\lambda}\left[2 - e^{-\lambda\delta}(2 + \lambda\delta)\right].$$

As discussed in the previous section, the distributions of sojourn time in states B and C are considered general, with means $\mu_f$ and $\mu_r$, respectively.

… $A(\delta)$ with respect to $\delta$ and set this value to 0. Therefore, …

where $\delta^*$ is the optimal rejuvenation trigger period and $\lambda = 2/\mu$ as indicated earlier (i.e., $\mu$ is the mean time to resource exhaustion from the workload model). The above equation can be written as a fixed point equation of the form

$$\delta^* = \frac{1}{\lambda} \ln\left[\frac{\mu_f - \mu_r}{\mu_f - \lambda\delta^*(\mu_f - 2\mu_r)}\right] = g_a(\delta^*). \qquad (2)$$

It is easy to see that $\delta^*$ lies in the open interval

$$\left(0,\; \frac{\mu_f}{\lambda(\mu_f - 2\mu_r)}\right).$$

The solution for $\delta^*$ can now be obtained by the bisection method, where two initial values $\delta_1$ and $\delta_2$ are picked such that $\delta_1 < g_a(\delta_1)$ and $\delta_2 > g_a(\delta_2)$. This computation is repeated several times, narrowing the distance between $\delta_1$ and $\delta_2$ each time until the desired accuracy of $\delta^*$ is reached.

Over a time interval $[0, T]$, the expected uptime is given by $U = T \cdot Avail$. The maximum (optimal) expected uptime is then given by $U^* = T \cdot Avail^*$, where $Avail^*$ corresponds to the optimal availability computed using $\delta^*$.

8.1.2 Expected Downtime Cost
Another objective of rejuvenation could be to minimize the total downtime cost over a given interval. The total cost per unit time due to downtime is given by

$$Cost = \pi_B\, c_{failure} + \pi_C\, c_{rejuv},$$

where $c_{failure}$ and $c_{rejuv}$ are costs per unit time due to failure and rejuvenation, respectively. Steady-state probabilities $\pi_B$ and $\pi_C$ can be computed in a way similar to the one for computing $\pi_A$. Therefore,

$$Cost = C(\delta) = \frac{\nu_B\, \mu_f\, c_{failure} + \nu_C\, \mu_r\, c_{rejuv}}{\nu_A\, E[H_A(t)] + \nu_B\, \mu_f + \nu_C\, \mu_r} = \frac{\mu_f c_{failure} + e^{-\lambda\delta}(1 + \lambda\delta)(\mu_r c_{rejuv} - \mu_f c_{failure})}{\frac{1}{\lambda}\left[2 - e^{-\lambda\delta}(2 + \lambda\delta)\right] + \mu_f + e^{-\lambda\delta}(1 + \lambda\delta)(\mu_r - \mu_f)}.$$

$$\cdots + \lambda^{2}\delta\, \mu_f \mu_r\, (c_{failure} - c_{rejuv}) - \mu_f c_{failure} = 0,$$
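The bisection procedure for the fixed point equation (2) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the tolerance, and the parameter values in the example are hypothetical, the sketch assumes $\mu_f > 2\mu_r > 0$ so that the open interval given above is well defined, and availability is taken here as the long-run fraction of time spent in state A, using the same denominator as the cost expression in Section 8.1.2.

```python
import math


def availability(delta, mu, mu_f, mu_r):
    """Fraction of time in state A for trigger period `delta` (assumed form)."""
    lam = 2.0 / mu                                    # 2-stage Erlang rate
    p_ac = math.exp(-lam * delta) * (1.0 + lam * delta)   # 1 - F(delta)
    hold_a = (2.0 - math.exp(-lam * delta) * (2.0 + lam * delta)) / lam
    downtime = mu_f + p_ac * (mu_r - mu_f)            # p_AB*mu_f + p_AC*mu_r
    return hold_a / (hold_a + downtime)


def g_a(delta, lam, mu_f, mu_r):
    """Right-hand side of the fixed point equation (2)."""
    return math.log((mu_f - mu_r) / (mu_f - lam * delta * (mu_f - 2.0 * mu_r))) / lam


def optimal_trigger(mu, mu_f, mu_r, tol=1e-9):
    """Solve delta = g_a(delta) by bisection on (0, mu_f / (lam*(mu_f - 2*mu_r)))."""
    if not (mu_f > 2.0 * mu_r > 0.0):
        raise ValueError("this sketch assumes mu_f > 2*mu_r > 0")
    lam = 2.0 / mu
    upper = mu_f / (lam * (mu_f - 2.0 * mu_r))
    lo = upper * 1e-12            # here delta > g_a(delta)
    hi = upper * (1.0 - 1e-12)    # here delta < g_a(delta)
    if not (lo > g_a(lo, lam, mu_f, mu_r) and hi < g_a(hi, lam, mu_f, mu_r)):
        raise ValueError("initial values do not bracket the fixed point")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid < g_a(mid, lam, mu_f, mu_r):
            hi = mid              # fixed point lies below mid
        else:
            lo = mid              # fixed point lies at or above mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Hypothetical values: mean time to exhaustion 100, recovery 10, rejuvenation 1.
    d = optimal_trigger(mu=100.0, mu_f=10.0, mu_r=1.0)
    print(d, availability(d, 100.0, 10.0, 1.0))
```

The two starting points play the roles of $\delta_1$ and $\delta_2$ in the text: the upper one satisfies $\delta < g_a(\delta)$ and the lower one satisfies $\delta > g_a(\delta)$, and each iteration halves the interval between them.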
Future work along these lines could be to model the [17] S. Garg, A. van Moorsel, K. Vaidyanathan, K. Trivedi, “A
Methodology for Detection and Estimation of Software Aging,”
distribution of time to failure more accurately and take Proc. Ninth Int’l Symp. Software Reliability Eng., pp. 282-292, Nov.
multiple resources into consideration while developing the 1998.
[18] S. Garg, Y. Huang, C.M.R. Kintala, K.S. Trivedi, and S. Yagnik,
model. “Performance and Reliability Evaluation of Passive Replication
Schemes in Application Level Fault Tolerance,” Proc. Fault Tolerant
Computing Symp. (FTCS 1999), pp. 322-329, June 1999.
ACKNOWLEDGMENTS [19] R.O. Gilbert, Statistical Methods for Environmental Pollution Mon-
itoring. New York: Van Nostrand Reinhold, 1987.
This work was done while the first author was at Duke [20] J. Gray, “Why Do Computers Stop and What Can Be Done About
University. The project was supported in part by Telcordia It?” Proc. Fifth Symp. Reliability in Distributed Software and Database
Systems, pp. 3-12, Jan. 1986.
Technologies as a core project of the CACC, by the JPL REE [21] J. Gray, “A Census of Tandem System Availability between 1985
project under Contract 1216658, by the AFOSR under MURI and 1990,” IEEE Trans. Reliability, vol. 39, pp. 409-418, Oct. 1990.
[22] J.A. Hartigan, Clustering Algorithms. New York: Wiley, 1975.
grant F49620-1-0327, by the ARO under a MURI grant [23] Y. Hong, D. Chen, L. Li, and K.S. Trivedi, “Closed Loop Design for
C-DAAD19 01-1-0646, “Mathematics of Failures in Complex Software Rejuvenation,” Proc. Workshop Self-Healing, Adaptive and
Self-Managed Systems (SHAMAN 2002), June 2002.
Systems” and by the SITAR project of the DARPA OASIS [24] Y. Huang, P. Jalote, and C. Kintala, “Two Techniques for Transient
Grant N66001-00-C. Software Error Recovery,” Lecture Notes in Computer Science,
vol. 774, pp. 159-170, 1994.
[25] Y. Huang, C. Kintala, N. Kolettis, and N.D. Fulton, “Software
REFERENCES Rejuvenation: Analysis, Module and Applications,” Proc. 25th
Symp. Fault Tolerant Computing (FTCS-25), pp. 381–390, June 1995.
[1] P.E. Amman and J.C. Knight, “Data Diversity: An Approach to [26] M.C. Hsueh, R.K. Iyer, and K.S. Trivedi, “Performability Modeling
Software Fault Tolerance,” Proc. 17th Int’l Symp. Fault Tolerant Based on Real Data: A Case Study,” IEEE Trans. Computers, vol. 37,
Computing, pp. 122-126, June 1987. no. 4, pp. 478-484, Apr. 1988.
[2] A. Avritzer and E.J. Weyuker, “Monitoring Smoothly Degrading
[27] “IBM Netfinity Director Software Rejuvenation,”White Paper,
Systems for Increased Dependability,” Empirical Software Eng. J.,
IBM Corp., Research Triangle Park, N.C., Jan. 2001.
vol. 2, no. 1, pp. 59-77, 1997.
[28] P. Jalote, Y. Huang, and C. Kintala, “A Framework for Under-
[3] A. Avizienis and L. Chen, “On the Implementation of N-Version
standing and Handling Transient Software Failures,” Proc. Second
Programming for Software Fault Tolerance During Execution,”
ISSAT Int’l Conf. Reliability and Quality in Design, 1995.
Proc. IEEE COMPSAC 77 Conf., pp. 149-155, Nov. 1977.
[29] J.C. Knight and N.G. Leveson, “An Experimental Evaluation of
[4] A. Avizienis, J.-C. Laprie, and B. Randell, “Fundamental Concepts
the Assumption of Independence in Multiversion Programming,”
of Dependability,” LAAS Technical Report No. 01-145, LAAS,
Software Eng. J., pp. 96-109, vol. 12, no. 1, 1986.
France, Apr. 2001.
[5] Y. Bao, X. Sun, and K. Trivedi, “Adaptive Software Rejuvenation: [30] I. Lee and R.K. Iyer, “Software Dependability in the Tandem
Degradation Models and Rejuvenation Schemes,” Proc. Int’l. Conf. GUARDIAN System,” IEEE Trans. Software Eng., vol. 21, no. 5,
Dependable Systems and Networks (DSN-2003), June 2003. pp. 455-467, May 1995.
[6] A. Bobbio, A. Sereno, and C. Anglano, “Fine Grained Software Degradation Models for Optimal Rejuvenation Policies,” Performance Evaluation, vol. 46, pp. 45-62, 2001.
[7] T. Boyd and P. Dasgupta, “Preemptive Module Replacement Using the Virtualizing Operating System,” Proc. Workshop Self-Healing, Adaptive and Self-Managed Systems (SHAMAN 2002), June 2002.
[8] K. Cassidy, K. Gross, and A. Malekpour, “Advanced Pattern Recognition for Detection of Complex Software Aging in Online Transaction Processing Servers,” Proc. Int’l Conf. Dependable Systems and Networks (DSN 2002), June 2002.
[9] V. Castelli, R.E. Harper, P. Heidelberger, S.W. Hunter, K.S. Trivedi, K. Vaidyanathan, and W. Zeggert, “Proactive Management of Software Aging,” IBM J. Research & Development, vol. 45, no. 2, Mar. 2001.
[10] M. Chereque, D. Powell, P. Reynier, J.-L. Richier, and J. Voiron, “Active Replication in Delta-4,” Proc. 22nd IEEE Int’l Symp. Fault Tolerant Computing (FTCS-22), pp. 28-37, July 1992.
[11] R. Chillarege, S. Biyani, and J. Rosenthal, “Measurement of Failure Rate in Widely Distributed Software,” Proc. 25th IEEE Int’l Symp. Fault Tolerant Computing, pp. 424-433, July 1995.
[12] T. Dohi, K. Goseva-Popstojanova, and K.S. Trivedi, “Statistical Non-Parametric Algorithms to Estimate the Optimal Software Rejuvenation Schedule,” Proc. 2000 Pacific Rim Int’l Symp. Dependable Computing (PRDC 2000), Dec. 2000.
[13] C. Fetzer and K. Hostedt, “Rejuvenation and Failure Detection in Partitionable Systems,” Proc. Pacific Rim Int’l Symp. Dependable Computing (PRDC 2001), Dec. 2001.
[14] S. Garg, A. Puliafito, and K.S. Trivedi, “Analysis of Software Rejuvenation Using Markov Regenerative Stochastic Petri Net,” Proc. Sixth Int’l Symp. Software Reliability Eng., pp. 180-187, Oct. 1995.
[15] S. Garg, Y. Huang, C. Kintala, and K.S. Trivedi, “Minimizing Completion Time of a Program by Checkpointing and Rejuvenation,” Proc. 1996 ACM SIGMETRICS Conf., pp. 252-261, May 1996.
[16] S. Garg, A. Puliafito, M. Telek, and K.S. Trivedi, “Analysis of Preventive Maintenance in Transactions Based Software Systems,” IEEE Trans. Computers, vol. 47, no. 1, pp. 96-107, Jan. 1998.
[31] L. Li, K. Vaidyanathan, and K.S. Trivedi, “An Approach to Estimation of Software Aging in a Web Server,” Proc. Int’l Symp. Empirical Software Eng. (ISESE 2002), Oct. 2002.
[32] Y. Liu, Y. Ma, J.J. Han, H. Levendel, and K.S. Trivedi, “Modeling and Analysis of Software Rejuvenation in Cable Modem Termination System,” Proc. Int’l Symp. Software Reliability Eng. (ISSRE 2002), Nov. 2002.
[33] E. Marshall, “Fatal Error: How Patriot Overlooked a Scud,” Science, p. 1347, Mar. 1992.
[34] D. Powell, J. Arlat, L. Beus-Dukic, A. Bondavalli, P. Coppola, A. Fantechi, E. Jenn, C. Rabejac, and A. Wellings, “GUARDS: A Generic Upgradable Architecture for Real-Time Dependable Systems,” IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 6, June 1999.
[35] D. Powell, “Distributed Fault Tolerance: Lessons from Delta-4,” IEEE Micro, vol. 14, no. 1, Feb. 1994.
[36] A. Pfening, S. Garg, A. Puliafito, M. Telek, and K.S. Trivedi, “Optimal Rejuvenation for Tolerating Soft Failures,” Performance Evaluation, vols. 27-28, pp. 491-506, Oct. 1996.
[37] R.A. Sahner, K.S. Trivedi, and A. Puliafito, Performance and Reliability Analysis of Computer Systems—An Example-Based Approach Using the SHARPE Software Package. Norwell, Mass.: Academic Publishers, 1996.
[38] P.K. Sen, “Estimates of the Regression Coefficient Based on Kendall’s Tau,” J. the Am. Statistical Assoc., vol. 63, pp. 1379-1389, 1968.
[39] M. Sullivan and R. Chillarege, “Software Defects and Their Impact on System Availability—A Study of Field Failures in Operating Systems,” Proc. 21st IEEE Int’l Symp. Fault-Tolerant Computing, pp. 2-9, 1991.
[40] A.T. Tai, S.N. Chau, L. Alkalaj, and H. Hecht, “On-Board Preventive Maintenance: Analysis of Effectiveness and Optimal Duty Period,” Proc. Third Int’l Workshop Object Oriented Real-Time Dependable Systems, Feb. 1997.
[41] M. Telek, A. Pfening, and G. Fodor, “An Effective Numerical Method to Compute the Moments of the Completion Time of Markov Reward Models,” Computer Math. Applications, vol. 36, no. 8, pp. 59-65, 1998.
[42] A. Thakur and R.K. Iyer, “Analyze-NOW—An Environment for Collection and Analysis of Failures in a Network of Workstations,” Proc. Int’l Symp. Software Reliability Eng., pp. 14-23, Apr. 1996.
[43] K.S. Trivedi, Probability and Statistics, with Reliability, Queuing, and Computer Science Applications, second ed. John Wiley, 2001.
[44] K. Vaidyanathan and K.S. Trivedi, “A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems,” Proc. 10th IEEE Int’l Symp. Software Reliability Eng., pp. 84-93, Nov. 1999.
[45] K. Vaidyanathan, R.E. Harper, S.W. Hunter, and K.S. Trivedi, “Analysis and Implementation of Software Rejuvenation in Cluster Systems,” Proc. Joint Int’l Conf. Measurement and Modeling of Computer Systems, ACM SIGMETRICS 2001/Performance 2001, June 2001.
[46] Y.-M. Wang, Y. Huang, P.-Y. Chung, P. Vo, and C. Kintala, “Checkpointing and Its Applications,” Proc. Symp. Fault Tolerant Computer Systems, pp. 22-31, June 1995.
[47] W. Xie, Y. Hong, and K.S. Trivedi, “Software Rejuvenation Policies for Cluster Systems under Varying Workload,” Proc. 10th Int’l Pacific Rim Dependable Computing Symp. (PRDC 2004), Mar. 2004.

Kalyanaraman Vaidyanathan received the BE degree in computer science from the University of Madras, India, and the MS and PhD degrees in electrical and computer engineering from Duke University (2002). He was a recipient of the IBM Graduate Fellowship Award in 2000. His research interests include software reliability and performance and dependability evaluation of computer systems. He is currently a research engineer in the Scalable Systems Group, Sun Microsystems, San Diego, California, exploring proactive fault monitoring techniques through telemetry and pattern recognition. He is a member of the IEEE.

Kishor S. Trivedi holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, North Carolina. He has been on the Duke faculty since 1975. He is the author of a well-known text entitled Probability and Statistics with Reliability, Queuing and Computer Science Applications with a thoroughly revised second edition being published by John Wiley. He has also published two other books entitled Performance and Reliability Analysis of Computer Systems (Kluwer Academic Publishers) and Queueing Networks and Markov Chains (John Wiley). His research interests are in reliability and performance assessment of computer and communication systems. He has published more than 390 articles and lectured extensively on these topics. He has supervised 39 PhD dissertations. He has made seminal contributions in software rejuvenation, solution techniques for Markov chains, fault trees, stochastic Petri nets, and performability models. He has actively contributed to the quantification of security and survivability. He is a fellow of the IEEE and a Golden Core Member of IEEE Computer Society. He is a codesigner of HARP, SAVE, SHARPE, and SPNP software packages that have been well circulated.