
1 Dept. of Mathematics and Computer Science

San Jose State University

San Jose, CA 95192-0103

sathaye@mathcs.sjsu.edu

Phone : (408) 924-5124

Fax : (408) 924-5080

2 Center

Dept. of Electrical and Computer Engineering

Duke University, Durham, NC 27708-0291

{kst, sramani}@ee.duke.edu

Phone : (919) 660-5269

Fax : (919) 660-5293

Abstract

As computer systems continue to be applied to mission-critical environments, techniques to evaluate their dependability become more and more important. Of the dependability measures used to characterize a system, availability is one of the most important. Techniques to evaluate a system's availability can be broadly categorized as measurement-based and model-based. Measurement-based evaluation is expensive, as it requires building a real system, taking measurements, and then analyzing the data statistically. Model-based evaluation, on the other hand, is inexpensive and relatively easy to perform. In this paper, we first look at some availability modeling techniques and then take up a case study from an industrial setting to illustrate the application of the techniques to a real problem. Although easier to perform, model-based availability analysis poses problems such as the largeness and complexity of the models developed, which make the models difficult to solve. This paper also illustrates several techniques to deal with largeness and complexity issues.

1 Introduction

Complex computer systems are widely used in different applications ranging from flight control and command and control systems to commercial systems like information and financial services. These applications demand high performance and high availability. Availability evaluation addresses failure and recovery aspects of a system, while performance evaluation addresses processing aspects and assumes that the system components do not fail. For gracefully degrading systems, a measure that combines system performance and availability aspects is more meaningful than separate measures of performance and availability. These composite measures are called performability measures. The two basic approaches to evaluate the availability/performance measures of a system are measurement-based and model-based. In measurement-based evaluation, the required measures are estimated from measured data using statistical inference techniques. The data is measured from a real system or its prototype. In the case of availability evaluation, measurements are not always feasible, either because the system has not been built yet or because it is too expensive to conduct experiments. That is, in a high availability system one would need to measure data from several systems to gather good sample data. Injecting faults into a system, on the other hand, can be an expensive procedure. Model-based evaluation is the cost-effective solution, as it allows system evaluation without having to build and measure a system. In this paper we discuss availability modeling techniques and their usage in practice. To emphasize the practicality of these techniques, we discuss their pros and cons with respect to a case study, VAXcluster systems of Digital Equipment Corporation (DEC).

In this paper, we first discuss different availability modeling approaches in Section 2. In Section 3, we discuss the benefits of utilizing a composite availability and performance model in practice instead of a pure availability model. Our discussion emphasizes this point using a model developed for multiprocessors to determine the optimal number of processors in the system. In Section 4, we present a case study to demonstrate the utility of availability modeling in a corporate environment.

2 Modeling Approaches

Model-based evaluation can be through discrete-event simulation, or analytic models, or hybrid

models combining simulation and analytic parts. A discrete-event simulation model can depict

detailed system behavior, as it is essentially a program whose execution simulates the dynamic

behavior of the system and evaluates the required measures. An analytic model consists of a set

of equations describing the system behavior. The evaluation measures are obtained by solving

these equations. In simple cases closed-form solutions are obtained, but more frequently numerical

solutions of the equations are necessary.

The main benefit of discrete-event simulation is the ability to depict detailed system behavior in the models. Also, the flexibility of discrete-event simulation allows its use in performance, availability and performability modeling. The main drawback of discrete-event simulation is the long execution time, particularly when tight confidence bounds are required in the solutions obtained. Also, carrying out a "what if" analysis requires rerunning the model for different input parameters. Advances in simulation speed-up such as regenerative simulation, importance sampling, importance splitting, parallel and distributed simulation also need to be considered.

Analytic models are more of an abstraction of the real system than a discrete-event simulation

model. In general, analytic models tend to be easier to develop and faster to solve than a simulation

model. The main drawback is the set of assumptions that are often necessary to make analytic

models tractable. Recent advances in model generation and solution techniques as well as computing

power make analytic models more attractive. In this paper, we discuss model-based evaluation using

analytic techniques and how one can achieve results that are useful in practice.

A system modeler can choose either state space or non-state space analytical modeling techniques. The choice of an appropriate modeling technique to represent the system behavior is dictated by factors such as the measures of interest, the level of detailed system behavior to be represented and the capability of the model to represent it, ease of construction, and availability of software tools to specify and solve the model. In this section, we discuss several non-state space and state space modeling techniques.

(Footnote: DEC is now known as COMPAQ.)

Non-state space models can be solved without generating the underlying state space. Practically speaking, these models can easily be used for solving systems with hundreds of components because many relatively good algorithms are available for solving such models [15]. The non-state space models can be evaluated to compute measures like system availability, reliability and system mean time to failure (MTTF). The two main assumptions used by the models are statistically independent failures and independent repair units for components. The non-state space modeling techniques used to evaluate system availability are reliability block diagrams and fault trees. In gracefully degradable systems, knowledge of the performance of the system is also essential. Non-state space modeling techniques used to evaluate system performance are product-form queuing models and task-precedence graphs [9].

In a reliability block diagram (RBD) each component of the system is represented as a block [9, 12]. The blocks are then connected in series and/or parallel based on the operational dependency between the components. If all the components need to be operational for the system to be up, the blocks in an RBD are connected in series. If, on the other hand, the system can survive with at least one component, the blocks are connected in parallel. An RBD can be used to model availability if the repair times (and failure times) are all independent. Figure 1(a) shows a multiprocessor availability model with n processors where at least one processor is required for the system to be up; the RBD therefore represents a simple parallel system. Given a failure rate $\gamma$ and repair rate $\tau$, the availability of processor Proc_i is given by

$$A_i = \frac{\tau}{\gamma + \tau}, \tag{1}$$

and the availability of the parallel system is

$$A = 1 - \prod_{i=1}^{n} (1 - A_i) = 1 - \left(\frac{\gamma}{\gamma + \tau}\right)^{n}. \tag{2}$$
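Equations (1) and (2) can be evaluated directly. The following is a minimal sketch, not part of the original study: the function names and the numeric rates are illustrative assumptions, and the series case is included only to mirror the series/parallel rule stated above.

```python
from typing import Sequence


def component_availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of one repairable component, as in Equation (1)."""
    return repair_rate / (failure_rate + repair_rate)


def parallel_availability(availabilities: Sequence[float]) -> float:
    """Parallel RBD: the system is down only if every block is down, as in Equation (2)."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def series_availability(availabilities: Sequence[float]) -> float:
    """Series RBD: the system is up only if every block is up."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


if __name__ == "__main__":
    gamma = 1.0 / 6000.0  # assumed failure rate per hour (illustrative)
    tau = 1.0             # assumed repair rate per hour (illustrative)
    a = component_availability(gamma, tau)
    for n in (1, 2, 3):
        print(n, parallel_availability([a] * n))
```

Because the blocks are assumed independent, the parallel case multiplies unavailabilities and the series case multiplies availabilities; neither needs the state space.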

A fault tree [9], like a reliability block diagram, is useful for availability analysis. It is a pictorial representation of the sequence of events/conditions to be satisfied for a failure to occur. A fault tree uses and, or and k-of-n gates to represent this combination of events in a tree-like structure. To represent situations where one failure event propagates failure along multiple paths in the fault tree, fault trees can have repeated nodes. Several efficient algorithms for solving fault trees exist. These include algorithms for series-parallel systems (for fault trees without repeated components) [17], a multiple variable inversion (MVI) algorithm called the LT algorithm to obtain a sum of disjoint products (SDP) from the mincut set [18], and the factoring/conditioning algorithm that works by factoring a fault tree with repeated nodes into a set of fault trees without repeated nodes [20]. Binary decision diagram (BDD)-based algorithms can be used to solve very large fault trees [21, 22]. Figure 1(b) shows the fault tree model for our multiprocessor system. UA_i represents the unavailability of processor i. The and gate indicates that the multiprocessor system fails when all n processors become unavailable. The output of the top gate of the fault tree represents failure of the parallel multiprocessor system.

[Figure 1: (a) Reliability block diagram, (b) Fault tree]

Reliability block diagrams and fault trees cannot easily handle more complex situations such as

failure/repair dependencies and shared repair facilities. In such cases, more detailed models such

as the state space models are required. Here we discuss some Markovian state space models.

i. Markovian Models

Markov Chains

In this section we will consider homogeneous Markov chains. A homogeneous continuous

time Markov chain (CTMC)[12] is a state space model, in which each state represents various

conditions of the system. In homogeneous CTMCs, transitions from one state to another

occur after a time that is exponentially distributed. The arcs representing a transition from

one state to another are labeled by the constant rate corresponding to the exponentially

distributed time of the transition. If a state in the CTMC has no transitions leaving it, then

that state is called an absorbing state, and a CTMC with one or more such states is said to be

an absorbing CTMC. For the multiprocessor example, we now illustrate how a Markov chain

can be developed to capture shared repair and multiple failure modes.

The parameters associated with the system availability model that we now develop for our multiprocessor system are the failure rate $\gamma$ of each processor and the processor repair rate $\tau$. A processor fault is covered with probability $c$ and not covered with probability $1 - c$. After a covered fault, the system is up in a degraded mode after a reconfiguration delay. An uncovered fault, on the other hand, is followed by a longer delay imposed by a reboot action. The reconfiguration and reboot delays are assumed to be exponentially distributed with means $1/\delta$ and $1/\beta$ respectively. In practice the reconfiguration and reboot times are extremely small compared to the times between failures and repairs, hence we assume that failures and repairs do not occur during these actions. System availability can be modeled using the Markov chain shown in Figure 2. For the system to be up, at least one processor out of the n processors needs to be operational. The state $i$, $1 \le i \le n$, in the Markov model represents that $i$ processors are operational and $n - i$ processors are waiting for on-line repair. The states $X_{n-i}$ and $Y_{n-i}$, for $i = 0, \ldots, n - 2$, represent that the system is undergoing a reconfiguration and is being rebooted, respectively. We compute the steady-state probability $\pi_i$ for each state $i$ [2]. Then the system unavailability, defined as a function of n, is given by

$$UA(n) = 1 - \sum_{i=1}^{n} \pi_i .$$

[Figure 2: Markov chain availability model of the multiprocessor system]

By appropriately choosing reward rates (or weights) for each state, the appropriate measure can be obtained for the model at hand. For example, for the multiprocessor example, if the reward rates are defined as $r_i = 1$ for states $i = 1, \ldots, n$ and $r_i = 0$ otherwise, then the expected steady-state reward rate gives the steady-state availability.
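The steady-state solution described above can be sketched numerically. The code below is an illustration under stated assumptions, not the model as solved in [2]: the transition structure is a reading of the prose (a covered fault leads through a reconfiguration state and an uncovered fault through a reboot state to one fewer operational processor, the last processor fails directly into state 0, and a single shared repair facility restores one processor at a time), and all function names and numeric values are hypothetical.

```python
def solve_steady_state(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    m = len(Q)
    # Balance equations are the rows of Q transposed; one of them is redundant,
    # so the last one is replaced by the normalisation condition.
    A = [[Q[j][i] for j in range(m)] for i in range(m)]
    A[-1] = [1.0] * m
    rhs = [0.0] * (m - 1) + [1.0]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(m):
            if r != col and A[r][col] != 0.0:
                f = A[r][col] / A[col][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                rhs[r] -= f * rhs[col]
    return [rhs[i] / A[i][i] for i in range(m)]


def multiprocessor_unavailability(n, gamma, tau, c, delta, beta):
    """UA(n) = 1 - sum of steady-state probabilities of states with >= 1 processor up."""
    # States: ("up", i) for i = 0..n, plus reconfiguration ("X", i) and
    # reboot ("Y", i) states for i = 2..n (assumed structure, see lead-in).
    states = [("up", i) for i in range(n + 1)]
    states += [(kind, i) for kind in ("X", "Y") for i in range(2, n + 1)]
    index = {s: k for k, s in enumerate(states)}
    Q = [[0.0] * len(states) for _ in states]

    def add(src, dst, rate):
        Q[index[src]][index[dst]] += rate
        Q[index[src]][index[src]] -= rate

    for i in range(n + 1):
        if i >= 2:
            add(("up", i), ("X", i), i * gamma * c)        # covered fault
            add(("up", i), ("Y", i), i * gamma * (1 - c))  # uncovered fault
            add(("X", i), ("up", i - 1), delta)            # reconfiguration ends
            add(("Y", i), ("up", i - 1), beta)             # reboot ends
        elif i == 1:
            add(("up", 1), ("up", 0), gamma)               # last processor fails
        if i < n:
            add(("up", i), ("up", i + 1), tau)             # shared repair
    pi = solve_steady_state(Q)
    return 1.0 - sum(pi[index[("up", i)]] for i in range(1, n + 1))


if __name__ == "__main__":
    # Illustrative values only: rates are per hour, delta = 360 means 10 seconds.
    for n in range(1, 5):
        ua = multiprocessor_unavailability(n, 1 / 6000, 1.0, 0.9, 360.0, 12.0)
        print(n, ua * 8760 * 60, "minutes of downtime per year")
```

For n = 1 the chain reduces to a two-state model, so the result can be checked against $\gamma / (\gamma + \tau)$.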

Stochastic Petri Nets and Reward Nets [9]

A Petri net [9] is a more concise and intuitive way of representing a situation to be modeled. It is also useful for automating the generation of large state spaces. A Petri net consists of places, transitions, arcs and tokens. Tokens move from one place to another along arcs through transitions. The number of tokens in the places represents the marking of a Petri net. If the transition firing times are stochastically timed, the Petri net is called a stochastic Petri net (SPN). If the transition firing times are exponentially distributed, the underlying reachability graph, representing transitions from one marking to another, gives the underlying homogeneous CTMC for the situation being modeled.

For the multiprocessor system, let us say we are interested in finding the probability that an incoming task is turned away because all n processors are tied up by other tasks being processed. The parameters associated with this pure performance model are the arrival rate of tasks, the service rate of tasks, the number of buffers and a deadline on task response times. The performance model assumes that the arriving tasks form a Poisson process of rate $\lambda$ and that the service requirements of tasks are independent and identically distributed with an exponential distribution of mean $1/\mu$. A deadline $d$ is associated with each task. Let us also take the number of buffers available for storing incoming tasks as $b$. We could use an M/M/n/b queue, represented by the generalized stochastic Petri net (GSPN, which also allows immediate transitions) shown in Figure 3, for our performance model. Timed transitions are represented by thick rectangles and immediate transitions by thin lines. Place proc contains the number of processors available. Initially there are n tokens here, representing n processors. When transition arr fires (this happens only if a buffer is available), a token is removed from proc and put in place serving, representing one less free processor. Transition service has a firing rate that depends on the number of tokens in place serving (indicated by the notation #).

[Figure 3: GSPN model of the M/M/n/b queue, with places buffer, serving and proc and transitions arr, request and service]

Transition arr is disabled (indicated by the inhibitor arc from place buffer) when there are $b - n$ tokens in place buffer, since there can only be $b$ tasks in the system: $n$ in place serving and $b - n$ in place buffer. Therefore the probability that an incoming task is rejected is the

In practical system design, a pure availability model may not be enough for systems such as gracefully degradable ones. In conjunction with availability, the performance of the system as it degrades needs to be considered. This requires a "performability" model that includes both performance and availability measures. In the next section, we present an example of a system for which a performability measure is needed.

Consider the multiprocessor system again, but with a varying number of processors, n, each with the same capacity. A key question asked in practice concerns the number of processors needed. As we discuss below, the optimal configuration in terms of the number of processors is a function of the chosen measure of system effectiveness [2]. We begin our sizing based on measures from a "pure" availability model. Next we consider sizing based on system performance measures. Last we consider a composite performance and availability measure to capture the effect of system degradation.

A CTMC model for the failure/repair characteristics of the multiprocessor system is shown in Figure 2. The details of the model were already discussed when introducing Markov chains in the previous section. The downtime during an observation interval of duration $T$ is given by $D(n) = UA(n) \times T$. The results shown in Figure 4 assume $T$ is 1 year, i.e., 8760 hours. In Figure 4(a) we plot the downtime $D(n)$ against $n$ for varying values of the mean reconfiguration delay using $c = 0.9$, $\gamma = 1/6000$ per hour, and $\beta = 12$ per hour. In Figure 4(b), we plot $D(n)$ against $n$ for different coverage values with a mean reconfiguration delay of 10 seconds. We conclude from these results that the availability benefits of multiprocessing (i.e., an increase in availability with an increase in the number of operational processors) are possible only if the coverage is near-perfect and the reconfiguration delay is very small, or if most of the other processors are able to carry out useful work while a fault is being handled. We further observe that for most practical parameter values, the optimal number of processors is 2 or 3. In the next subsection we consider a performance-based model for the multiprocessor sizing problem.

[Figure 4: Downtime D(n) versus number of processors n for (a) different mean reconfiguration delays, (b) different coverage values]

Along the lines of the GSPN example discussed in the previous section, we used an M/M/n/b queuing model with finite buffer capacity (see Figure 3) to compute the probability that a task is rejected because the buffer is full. In Figure 5 we plot the loss probability as a function of n for different values of the arrival rate. We observe that the loss probability reduces as the number of processors increases. The conclusion from the performance model of the fault-free system is that the system improves as the number of processors is increased. The details of this model and results are presented in [2].

The above models point out the deficiency of simply considering a pure availability or performance measure. The pure availability measure ignores different levels of performance at various system states, while the pure performance measure ignores the failure/repair behavior of the system. The next section considers combined measures of performance and availability.
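The loss probability of an M/M/n/b queue follows from its birth-death structure. The sketch below is illustrative only: the function name and the sample rates are assumptions, and b is taken, as in the GSPN description, to be the maximum number of tasks in the system (n in service and b - n waiting).

```python
def mmnb_loss_probability(arrival_rate, service_rate, n, b):
    """Probability that an arriving task finds all b positions occupied."""
    if b < n:
        raise ValueError("b counts all tasks in the system, so b >= n is required")
    # weights[k] is proportional to the steady-state probability of k tasks.
    weights = [1.0]
    for k in range(1, b + 1):
        busy = min(k, n)  # processors in service when k tasks are present
        weights.append(weights[-1] * arrival_rate / (busy * service_rate))
    # Poisson arrivals see time averages, so the loss probability is P(b tasks).
    return weights[-1] / sum(weights)


if __name__ == "__main__":
    # Illustrative rates (tasks per second); b is fixed while n varies.
    for n in range(1, 5):
        print(n, mmnb_loss_probability(2.0, 1.0, n, 4))
```

With b fixed, adding processors shortens the time tasks spend in the system, which is why the loss probability falls as n grows.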

Different levels of performance can be taken into account by attaching a reward rate $r_i$, corresponding to some measure of performance, to each state $i$ of the failure/repair Markov model in Figure 2. The resulting Markov reward model can then be analyzed for various combined measures of performance and availability. The simplest reward rate assignment is to let $r_i = i$ for states with $i$ operational processors and $r_i = 0$ for down states. With the reward assignment shown in Table 1, we can compute the capacity-oriented availability, COA(n), as the expected reward rate in the steady state. COA(n) is an upper bound on system performance that equates performance with system capacity.

Table 1: Reward rates for COA(n)

State                                      Reward rate, r_i
0                                          0
i, 1 <= i <= n                             i
X_{n-i} and Y_{n-i}, i = 0, ..., n-2       0

When $i$ processors are operational we used an M/M/i/b queuing model (such as the GSPN in Figure 3) to describe the performance of the multiprocessor system. We then assigned a reward rate of $r_i = 0$ for each down state $i$ and a reward rate of $r_i = T_b(i)$, which is the throughput for a system with $i$ processors and $b$ buffers, for all other states (see Table 2). With this assignment, the expected reward rate at steady state computes the throughput-oriented availability, TOA(n). In Figure 6, we plot COA(n) and TOA(n) for different values of the arrival rate $\lambda$.

Table 2: Reward rates for TOA(n)

State                                      Reward rate, r_i
0                                          0
i, 1 <= i <= n                             T_b(i), throughput of system with i processors and b buffers
X_{n-i} and Y_{n-i}, i = 0, ..., n-2       0

These two measures show that in order to process a heavier workload more than two processors are needed. The measures COA and TOA are not adequate measures of system effectiveness, as they obliterate the effects of failure/repair and merely show the effects of system capacity and load. In [2], a measure of system effectiveness, total loss probability, is proposed that "equally" reflects fault-free behavior and behavior in the presence of faults. The total loss probability is defined as the sum of the rejection probability due to the system being down or full and the probability of a response time deadline being violated. The total loss probability is computed by using the following reward rate assignments (Table 3): $r_i = 1$ if $i$ is a down state and $r_i = q_b(i) + (1 - q_b(i)) P(R_i(b) > d)$ if $i$ is an operational state, where $q_b(i)$ is the probability of task rejection when the buffer is full for a system with $i$ operational processors, $R_i(b)$ is the response time for a system with $i$ operational processors and $b$ buffers, and $d$ is the deadline on task response time. In Figure 7, we plot the total loss probability as a function of n for different values of the task arrival rate. We observe that the optimal number of processors increases with the task arrival rate, tighter deadlines and smaller buffer spaces.

Table 3: Reward rates for total loss probability

State                                      Reward rate, r_i
0                                          1
i, 1 <= i <= n                             q_b(i) + (1 - q_b(i)) P(R_i(b) > d)
X_{n-i} and Y_{n-i}, i = 0, ..., n-2       1

[Figure 7: Total loss probability vs. number of processors for different task arrival rates]

In this section we discuss a case study to demonstrate that, in practice, the choice of an appropriate model type is dictated by the availability measures of interest, the level of detailed system behavior to be represented, ease of model specification and solution, representation power of the model type selected, and access to suitable tools or toolkits that can automate model specification and solution.

In particular we describe the availability models for Digital Equipment Corporation's (DEC) VAXcluster system. VAXclusters are used in different application environments, and hence several availability, reliability and performability measures need to be computed. VAXclusters used as commercial computer systems in a general data processing environment require us to evaluate system availability and performability measures. To consider VAXclusters in highly critical applications like life support systems and financial data processing systems, we evaluate many system reliability measures. These two types of measures were not adequate for some financial institution customers of VAXclusters. We therefore evaluated task completion measures to compute the probability of application interruption during its execution period.

[Figure 8: Hardware topology of a VAXcluster: VAX processors, HSCs and disks connected through a star coupler]

VAXclusters are closely coupled systems that consist of two or more VAX computers, one or more hierarchical storage controllers (HSCs), a set of disk volumes and a star coupler [5]. The processor (VAX) subsystem and the storage (HSC and disk volume) subsystem are connected through the star coupler by a CI bus. The star coupler is omitted from the availability models as it is assumed to be a passive connector, and hence extremely reliable. Figure 8 shows the hardware topology of a VAXcluster.

Our availability modeling approach considers a VAXcluster as two independent subsystems

namely, the processing subsystem and the storage subsystem. Therefore, the availability of the

VAXcluster is the product of the availability of each subsystem. In the following sections, we

develop a sequence of increasingly powerful availability models, where the level of modeling power

is directly proportional to the level of complex behavior and characteristics of VAXclusters included

in the model.

Our discussion of the availability models developed for the VAXcluster system is organized as follows. In Sections 4.1 and 4.2 we develop models using non-state space techniques, namely reliability block diagrams and fault trees, respectively. In Section 4.3.1, we develop a CTMC, or rather a Markov reward model. The utility of this model in practice is limited, as the size of the model grows exponentially with the number of processors in the VAXcluster. Models that avoid largeness are discussed in Sections 4.3.2 and 4.4. The model in Section 4.3.2 uses a two-level decomposition for the processor subsystem of the VAXcluster [16]. For the model in Section 4.4, on the other hand, an iterative scheme is developed [10]. The approximate nature of the results prompted a largeness tolerance approach. In Section 4.5, a concise stochastic Petri net is developed for VAXclusters consisting of uniprocessors [1]. The next approach, in Section 4.6, considers realistic heterogeneous configurations, where the VAXclusters consist of uniprocessors and/or multiprocessors [3, 4], using stochastic reward nets [14].

(Footnote: CI stands for Computer Interconnect.)

The first model of VAXclusters uses a non-state space method, namely the reliability block diagram. This approach was seen in the availability model of VAXclusters by Balkovich et al. [6]. We use this approach to partition the VAXcluster along functional lines, which allows us to model each component type separately. In Figure 9, the block diagram represents a VAXcluster configuration with n processors, n HSCs and n disks.

[Figure 9: Reliability block diagram of the VAXcluster: the processing subsystem (VAX blocks) and the storage subsystem (HSC and disk blocks)]

We assume that the VAXcluster is down if all the components of any of the three subsystems are down. We assume that the times to failure of all components are mutually independent, exponentially distributed random variables. We also assume each component to have an independent repair facility. The repair time here is a 2-stage hypoexponentially distributed random variable, with the first phase being the travel time for the repairman to get to the field and the second phase being the actual repair time. On evaluating this model as a pure series-parallel availability model, the expression for the VAXcluster availability is given by:

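$$A = \left[1 - \left(\frac{\lambda_P}{\lambda_P + \left(\frac{1}{\mu_P} + \frac{1}{\mu_F}\right)^{-1}}\right)^{n}\right] \left[1 - \left(\frac{\lambda_H}{\lambda_H + \left(\frac{1}{\mu_H} + \frac{1}{\mu_F}\right)^{-1}}\right)^{n}\right] \left[1 - \left(\frac{\lambda_D}{\lambda_D + \left(\frac{1}{\mu_D} + \frac{1}{\mu_F}\right)^{-1}}\right)^{n}\right] \tag{3}$$

Here,

$1/\lambda_P$ is the mean time between VAX processor failures.

$1/\lambda_H$ is the mean time between HSC failures.

$1/\lambda_D$ is the mean time between disk failures.

$1/\mu_F$ is the mean field service travel time.

$1/\mu_P$, $1/\mu_H$ and $1/\mu_D$ are the mean times to repair a VAX processor, HSC and disk respectively.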
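Equation (3) can be evaluated with a few lines of code. The sketch below is illustrative: the function and parameter names and the sample rates are assumptions, and the mean restoration time of each component is taken to be the sum of the mean travel time and the mean repair time, which is the mean of the 2-stage hypoexponential distribution described above.

```python
def component_unavailability(failure_rate, repair_rate, travel_rate):
    """Steady-state unavailability of one component with travel plus repair."""
    mean_restore_time = 1.0 / repair_rate + 1.0 / travel_rate
    return failure_rate / (failure_rate + 1.0 / mean_restore_time)


def vaxcluster_availability(n, lam_p, lam_h, lam_d, mu_p, mu_h, mu_d, mu_f):
    """Series connection of three parallel groups of n components, as in Equation (3)."""
    availability = 1.0
    for lam, mu in ((lam_p, mu_p), (lam_h, mu_h), (lam_d, mu_d)):
        # A group is down only when all n of its components are down.
        availability *= 1.0 - component_unavailability(lam, mu, mu_f) ** n
    return availability


if __name__ == "__main__":
    # Hypothetical rates per hour, chosen only to exercise the formula.
    for n in (1, 2, 3):
        print(n, vaxcluster_availability(n, 1e-3, 5e-4, 2e-4, 0.5, 0.5, 0.5, 1.0))
```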

The assumption that a VAXcluster is down when all the components of any of the three subsystems are down is not in tune with reality. For a VAXcluster to be operational the system should

meet quorum, where quorum is the minimum number of VAXes required for the VAXcluster to

function.

[Figure 10: Fault tree model for the VAXcluster: the top event "Cluster failure" is an OR gate over three (n-k+1)-of-n gates, one each for the processor, HSC and disk unavailabilities]

In this section, we present a fault tree model for the VAXcluster configuration discussed in Section 4.1. Figure 10 is a model for the VAXcluster with n processors, n HSCs, and n disks. Observe that in a block diagram model the structure tells us when the system is functioning, while in a fault tree model the structure tells us when the system has failed. In addition, we have extended the model to include a quorum required for operation. The cluster is operational as long as k out of n processors, HSCs and disks are up. The negation of this operational information is depicted in the fault tree as follows. The topmost node denotes "Cluster Failure" and the associated "OR" gate specifies that a cluster fails if (n - k + 1) processors, (n - k + 1) HSCs, or (n - k + 1) disks are down. The steady-state unavailability of the cluster, $U_{cluster}$, is given by

$$U_{cluster} = 1 - (1 - U_P)(1 - U_H)(1 - U_D), \tag{4}$$

where

$$U_i = \sum_{|J| \ge n-k+1} \Big(\prod_{j \in J} U_{i_j}\Big) \Big(\prod_{j \notin J} (1 - U_{i_j})\Big)$$

for $i$ = P, H or D (processors, HSCs or disks), $U_{i_j}$ is the unavailability of the j-th component of type $i$, and $J$ is the set of indices of all failed components.
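The quorum-based fault tree can be evaluated without enumerating every subset J. The sketch below is a minimal illustration under the independence assumption stated for combinatorial models: it builds the distribution of the number of failed components in each group and then combines the three groups through the OR gate. Function names and the sample values are hypothetical.

```python
def at_least_m_failed(unavailabilities, m):
    """P(at least m components are down) for independent components."""
    # dist[f] = probability that exactly f of the components seen so far are down.
    dist = [1.0]
    for u in unavailabilities:
        nxt = [0.0] * (len(dist) + 1)
        for f, p in enumerate(dist):
            nxt[f] += p * (1.0 - u)  # this component is up
            nxt[f + 1] += p * u      # this component is down
        dist = nxt
    return sum(dist[m:])


def cluster_unavailability(proc_u, hsc_u, disk_u, k):
    """OR gate over three (n-k+1)-of-n gates, one per component type."""
    all_groups_up = 1.0
    for group in (proc_u, hsc_u, disk_u):
        n = len(group)
        all_groups_up *= 1.0 - at_least_m_failed(group, n - k + 1)
    return 1.0 - all_groups_up


if __name__ == "__main__":
    # Hypothetical component unavailabilities; quorum k = 2 out of n = 3.
    print(cluster_unavailability([0.01] * 3, [0.02] * 3, [0.005] * 3, 2))
```

The per-group computation takes time quadratic in n, whereas summing over all subsets J directly would take time exponential in n.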

The RBD and fault tree VAXcluster availability models are very limited in their depiction of the failure/recovery behavior of the VAXcluster. For example, they assume that each component has its own repair facility, and that there is only one failure/recovery type. In fact, combinatorial models like RBDs, fault trees and reliability graphs require system components to behave in a stochastically independent manner. Dependencies of many different kinds exist in VAXclusters and hence, combinatorial models are not entirely satisfactory for such systems. State space modeling techniques like Markovian models can include different kinds of dependencies. In the following sections, we develop state space models for the processing subsystem and the storage subsystem separately, and use a hierarchical technique to combine the models of the two subsystems to obtain an overall system availability model.

[Figure 11: CTMC for a 2-processor VAXcluster]

4.3.1 Continuous Time Markov Chain

We now develop a more detailed continuous time Markov chain (CTMC) model for the VAXcluster, showing two types of failure and a coverage factor for each failure type. By using a state space model, we are also able to incorporate shared repair for the processors in a cluster. We assumed that the times between failures, the repair times and other recovery times are exponentially distributed, and developed an availability model for an n-processor ($n \ge 2$) VAXcluster using a homogeneous continuous-time Markov chain. A CTMC for a 2-processor VAXcluster developed in [1] is shown in Figure 11. The following behavior is characterized in this CTMC. A processor is either up or down. There are two types of failure: permanent and intermittent. A processor recovers from a permanent failure by a physical repair and from an intermittent failure by a processor reboot. These failures are further classified as covered or uncovered. A covered processor failure causes a brief (on the order of seconds) cluster outage to reconfigure the failed processor out of the cluster and back into the cluster after it is fixed. Thus if a quorum is still formed by the operational processors, a covered failure causes a small loss in system time. An uncovered failure causes the entire cluster to go down until it is rebooted, even if the remaining operational processors still form a quorum. The mean times to permanent and intermittent failure are $1/\lambda_P$ and $1/\lambda_I$ hours respectively. The mean repair, mean processor reboot, and mean cluster reboot times are given by $1/\mu_P$, $1/\mu_{PB}$ and $1/\mu_{CB}$ respectively. Let $c$ and $k$ denote the coverage factors for permanent and intermittent failures respectively. Realistically, the mean reconfiguration time ($1/\mu_{IN}$) to map a processor into the cluster and the time ($1/\mu_{OUT}$) to map it out of the cluster are different. In the CTMC of Figure 11, we have assumed that $\mu_{IN} = \mu_{OUT} = \mu_T$. The states of the Markov chain are represented by $(abc, d)$, where

14

b = number

of processors down with intermittent failure

8

>

0 if both processors are up

>

>

>

>

>

p if one processor is being repaired

>

>

>

>

>

<b if one processor is being rebooted

c = >c if cluster is undergoing a reboot

>

>

t if cluster is undergoing a reconguration

>

>

>

>

>

r if two processors are being rebooted

>

>

>

:s if one is being rebooted and other repaired

(

d = 0 cluster up state

1 cluster down state

(5)

where $P_{abc,d}$ denotes the steady-state probability that the process is in state (abc, d). We computed the availability of the VAXcluster system by solving the above CTMC using SHARPE [9], a software package for availability/performability analysis. The main problem with this approach was that the size of the CTMC grew exponentially with the number of processors in the VAXcluster system. This largeness posed two challenges: (1) the capability of the software to solve models with thousands of states for VAXclusters with n > 5 processors, and (2) the problem of actually generating the state space. In the next section we address these two drawbacks.
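To make the "solve the CTMC, then sum the up-state probabilities" step concrete, the following is a minimal sketch in Python. It is not the VAXcluster CTMC of Figure 11 and it does not use SHARPE: the three-state chain (number of processors up, one shared repair facility), the rates `lam` and `mu`, and the helper `steady_state` are all assumptions chosen only for illustration.

```python
from typing import Dict, List, Tuple


def steady_state(states: List[str],
                 rates: Dict[Tuple[str, str], float]) -> Dict[str, float]:
    """Solve pi * Q = 0 with sum(pi) = 1 by Gauss-Jordan elimination."""
    n = len(states)
    idx = {s: k for k, s in enumerate(states)}
    # Augmented matrix [Q^T | 0]: row j is the balance equation of state j.
    a = [[0.0] * (n + 1) for _ in range(n)]
    for (src, dst), rate in rates.items():
        a[idx[dst]][idx[src]] += rate   # probability flow into dst
        a[idx[src]][idx[src]] -= rate   # probability flow out of src
    # One balance equation is redundant; replace it by the normalisation.
    a[n - 1] = [1.0] * n + [1.0]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(n):
            if row != col:
                factor = a[row][col] / a[col][col]
                for c in range(col, n + 1):
                    a[row][c] -= factor * a[col][c]
    return {s: a[idx[s]][n] / a[idx[s]][idx[s]] for s in states}


# Hypothetical rates (per hour), chosen only for illustration.
lam, mu = 1.0 / 5000.0, 1.0 / 4.0
states = ["2", "1", "0"]            # number of processors up
rates = {("2", "1"): 2 * lam, ("1", "0"): lam,
         ("1", "2"): mu, ("0", "1"): mu}
pi = steady_state(states, rates)
availability = pi["2"] + pi["1"]    # up while at least one processor is up
print(f"availability = {availability:.9f}")
```

The dense elimination used here scales cubically with the number of states, which is one way to see why a state space that grows exponentially with the number of processors quickly becomes a practical problem.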

In this section, we discuss a VAXcluster availability model that avoids the largeness associated with a Markov model. To reduce the complexity of a large system, Ibe et al. [16] developed a two-level hierarchical model. The bottom level is a homogeneous CTMC and the top level is a combinatorial model. This top-level model was represented as a network of diodes (or three-state devices). The approximate availability model developed for the analysis made the following assumptions [16].

1. The behavior of each processor was modeled by a homogeneous CTMC, and it was assumed that this processor did not break the quorum rule. This assumption is justified by the fact that the probability of VAXcluster failure due to loss of quorum is relatively low.

2. Each processor has an independent repairman. This assumption is justified because the authors observed that the MTBF was large compared to the MTTR.

These assumptions allowed the authors to decompose the n-processor VAXcluster into n independent subsystems. Further, the states of the CTMC for the individual processors were classified into the following three superstates:

X = the set of states in which the processor is up.

Y = the set of states in which the cluster is down due to a processor failure.

Z = the set of states in which the processor is down but the cluster is up.

The authors compared the superstates to the three states of a diode: X, Y and Z correspond to the up state, the short-circuit state and the open-circuit state respectively. Let $n_X$, $n_Y$ and $n_Z$ denote the number of processors in superstates X, Y and Z respectively, and let $P_X$, $P_Y$ and $P_Z$ denote the probabilities that a processor is in superstate X, Y and Z. The cluster is up when no processor is in superstate Y and at least one processor is in superstate X, so Ibe et al. [16] defined the availability $A_n$ of an n-processor VAXcluster as follows:

$$
\begin{aligned}
A_n &= \sum_{n_X=1}^{n} \binom{n}{n_X} P_X^{n_X} P_Y^{0} P_Z^{n-n_X} \\
    &= \sum_{n_X=0}^{n} \binom{n}{n_X} P_X^{n_X} P_Z^{n-n_X} - \frac{n!}{0!\,(n-0)!} P_X^{0} P_Z^{n} \\
    &= (P_X + P_Z)^n - P_Z^n
\end{aligned}
\tag{6}
$$
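As a quick numerical check of the closed form $(P_X + P_Z)^n - P_Z^n$, the sketch below compares it with the explicit binomial sum over $n_X = 1, \ldots, n$. The function names and the superstate probabilities are assumptions made for illustration; they are not values from [16].

```python
from math import comb


def cluster_availability(p_x: float, p_z: float, n: int) -> float:
    """Closed form: (P_X + P_Z)^n - P_Z^n."""
    return (p_x + p_z) ** n - p_z ** n


def cluster_availability_sum(p_x: float, p_z: float, n: int) -> float:
    """Explicit sum: n_X processors in X, none in Y, the rest in Z."""
    return sum(comb(n, n_x) * p_x ** n_x * p_z ** (n - n_x)
               for n_x in range(1, n + 1))


# Hypothetical superstate probabilities (P_Y is the remainder).
P_X, P_Z = 0.9990, 0.0009
for n in range(1, 9):
    closed = cluster_availability(P_X, P_Z, n)
    summed = cluster_availability_sum(P_X, P_Z, n)
    print(n, f"{closed:.10f}", f"{summed:.10f}")
```

Because the formula depends on n only through two powers, evaluating a different cluster size costs nothing beyond solving the single-processor CTMC once.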

The authors could analyze different VAXcluster configurations by simply varying the number of processors n in the above equation. The main drawbacks of this approach are the approximate nature of the solution (versus an exact solution) and the need to make simplifying assumptions, one of them being an independent repairman for each processor. In the next section, we illustrate another approximation technique to deal with large subsystems.

In this section, we discuss a novel availability model for VAXclusters with large storage subsystems. In [10], a fixed-point iteration scheme was used over a set of CTMC sub-models. The decomposition of the model into sub-models controlled the state space explosion, and the iteration modeled the repair priorities between the different storage components. The model considered configurations with shadowed (commonly known as mirrored) disks and characterized system as well as application level recovery.

Figure 12 shows the block diagram of an example storage system configuration that will be used to demonstrate the technique. The configuration consists of two HSCs and a set of disks. The disks are further classified into two system disks and two application disks. The operating system resides on the system disks, and the user accounts and other application software on the application disks. Further, it is assumed that the disks are shadowed, and dual pathed and ported between the two HSCs [10]. A disk dual pathed between two HSCs can be accessed cluster-wide in a coordinated way through either HSC. In case one of the HSCs fails, a dual ported disk can be accessed through the other HSC after a brief failover period.

Figure 12: Block diagram of the example storage system (HSC1 and HSC2, System Disk 1 and 2, Application Disk 1 and 2)

Figure 13: CTMC models: (a) HSC subsystem, (b) SDisk subsystem, (c) ADisk subsystem

We now discuss the sequence of approximation models developed to compute the availability of the storage system in Figure 12. The first model assumed that each component in the block diagram has its own repair facility. We assumed that the repair time is a 2-stage hypoexponentially distributed random variable, with the first phase being the travel time and the second phase being the actual repair time. This model can be solved as a pure series-parallel availability model to compute the availability of the storage system, similar to the solution of the RBD in Section 4.1.
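A minimal sketch of such a series-parallel evaluation is given below. It relies on two standard facts: the mean of a 2-stage hypoexponential repair time is the sum of the two stage means, and the steady-state availability of an independently repaired component is MTTF / (MTTF + MTTR). The block structure (each redundant pair in parallel, the three pairs in series) is our reading of the configuration described above, and all numeric values and function names are assumptions for illustration only.

```python
def component_availability(mttf_h: float, travel_h: float, repair_h: float) -> float:
    """Steady-state availability of one independently repaired component."""
    mttr_h = travel_h + repair_h        # mean of the 2-stage hypoexponential
    return mttf_h / (mttf_h + mttr_h)


def parallel(*avail: float) -> float:
    """Redundant block: down only if every member is down."""
    unavailability = 1.0
    for a in avail:
        unavailability *= (1.0 - a)
    return 1.0 - unavailability


def series(*avail: float) -> float:
    """Series block: up only if every member is up."""
    result = 1.0
    for a in avail:
        result *= a
    return result


# Hypothetical mean times in hours (not taken from [10]).
a_hsc = component_availability(mttf_h=20000.0, travel_h=2.0, repair_h=1.0)
a_sdisk = component_availability(mttf_h=30000.0, travel_h=2.0, repair_h=4.0)
a_adisk = component_availability(mttf_h=30000.0, travel_h=2.0, repair_h=4.0)

storage = series(parallel(a_hsc, a_hsc),
                 parallel(a_sdisk, a_sdisk),
                 parallel(a_adisk, a_adisk))
print(f"storage availability = {storage:.12f}")
```

Only the mean repair time enters this first model; the hypoexponential shape starts to matter once the travel phase is represented explicitly as a state of its own.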

In the second, improved model, we removed the assumption of independent repair. Instead, it is assumed that a repair facility is shared within a subsystem. The storage system is now modeled as a two-level hierarchical model. The bottom level consists of three independent CTMC models, namely HSC, SDisk and ADisk, representing the HSC, system disk and application disk subsystems respectively. The top level consists of a reliability block diagram representing a series network of the three subsystems. Figure 13(a), (b), (c) shows the CTMC models of the three subsystems. The reliability block diagram at the top level is shown in Figure 14.

The states of the CTMC models adopt the following convention.

- State nX represents that n components of the subsystem are operational, where n can take the values 0, 1 or 2.
- State $T_{nX}$ represents that the field service has arrived and that (n - 1) components of the subsystem are operational and the rest are under repair.

In the above notation, X is H, S or A, where H is associated with the HSC subsystem model, S with the system disk subsystem model and A with the application disk subsystem model.

Figure 14: Top level reliability block diagram for the storage subsystem (HSC, SDISK and ADISK in series)

The steady state availability of the storage subsystem in Figure 14 is given by

$$A = A_H A_S A_A \tag{7}$$

$$A_X = P_{2X} + P_{1X} + P_{T_{2X}} \tag{8}$$

where $P_{iX}$ and $P_{T_{iX}}$ are the steady-state probabilities that the Markov chain is in state iX and state $T_{iX}$ respectively.
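The two-level evaluation can be sketched as follows, assuming the up states of subsystem X are 2X, 1X and T2X, as the availability expression above suggests. The steady-state probabilities in the dictionary are made-up placeholders standing in for the output of a CTMC solver, and the function names are ours.

```python
from typing import Dict


def subsystem_availability(probs: Dict[str, float], x: str) -> float:
    """A_X = P_2X + P_1X + P_T2X for subsystem label x in {'H', 'S', 'A'}."""
    return probs[f"2{x}"] + probs[f"1{x}"] + probs[f"T2{x}"]


def storage_availability(all_probs: Dict[str, Dict[str, float]]) -> float:
    """Top-level series block: A = A_H * A_S * A_A."""
    result = 1.0
    for x in ("H", "S", "A"):
        result *= subsystem_availability(all_probs[x], x)
    return result


# Hypothetical steady-state probabilities of the up states only.
example = {
    "H": {"2H": 0.9980, "1H": 0.0015, "T2H": 0.0004},
    "S": {"2S": 0.9970, "1S": 0.0020, "T2S": 0.0008},
    "A": {"2A": 0.9970, "1A": 0.0020, "T2A": 0.0008},
}
print(f"A = {storage_availability(example):.8f}")
```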

In the third approximation we took into account disk reload and system recovery, which cover the following activities. When a disk subsystem experiences a failure, data on the disk may be corrupted or lost. After the disk is repaired, the data is reloaded onto the disk from an external source, such as a backup disk or tape. While the reload is a local activity of a disk subsystem, recovery is a global system-wide activity. This behavior is incorporated in the Markov models of Figure 15(a), (b), (c) as follows. The HSC Markov model is enhanced by including application recovery states $R_{2H}$ and $R_{1H}$ after the loss of both HSCs in the HSC subsystem. The system disk Markov model is extended by incorporating reload states $L_{2S}$ and $L_{1S}$, and application recovery states $R_{2S}$ and $R_{1S}$. The reload followed by application recovery starts immediately after the first disk is repaired. We further assume that a component could suffer failures during a reload and/or recovery. The application disk Markov model is extended similarly to the system disk model by including reload states $L_{2A}$ and $L_{1A}$, and recovery states $R_{2A}$ and $R_{1A}$. The expression for the steady-state availability of the storage subsystem is similar to the expression obtained in the second approximation.

In the fourth approximation, the assumption of an independent repair facility for each subsystem is eliminated. The repair facility is shared between subsystems, and when more than one component is down, the following repair priority is assumed: (1) any subsystem with all components failed is repaired first; (2) otherwise, an HSC is repaired first, a system disk second, and an application disk third. This repair priority scheme does not change the Markov model for the HSC subsystem, but it changes the models for the system and application disk subsystems. The system disk has the second highest priority; hence the system disk repair rate $\mu_D$ is slowed down by multiplying it by $P_1$, the probability that both HSCs are operational, given that field service is present and the system is not in a recovery mode. $P_1$ is given by

$$P_1 = \frac{P_{2H}}{P_{2H} + P_{T_{1H}} + P_{T_{2H}}} \tag{9}$$

In [10] it is assumed that a component can be repaired during recovery. The system disk repair rate $\mu_D$ from the recovery states is then slowed down by multiplying it by $P_2$, where

$$P_2 = \frac{P_{R_{2H}}}{P_{R_{1H}} + P_{R_{2H}}} \tag{10}$$

Figure 15: CTMC models: (a) System Recovery Included for HSC subsystem, (b) Disk Reload and System Recovery Included for SDisk subsystem, (c) Disk Reload and System Recovery Included for ADisk subsystem

The application disk has the lowest repair priority, which is enforced by probabilistically slowing down its repair rate. The repair rate from the non-recovery states is slowed down by multiplying $\mu_D$ by $P_3$, where

$$P_3 = \frac{P_{2H}}{A} \cdot \frac{P_{2S}}{B} \tag{11}$$

Here $A = P_{2H} + P_{T_{1H}} + P_{T_{2H}}$ and $B = P_{2S} + P_{T_{1S}} + P_{T_{2S}} + P_{L_{1S}} + P_{L_{2S}}$. $P_3$ expresses the probability that both HSCs are operational given that the HSC subsystem is not in the recovery states or in states with less than two HSCs operational, and that both system disks are operational given that the system disk is in non-recovery states or states with more than one system disk up. The steady-state availability is computed as in the first approximation.

The above approximations included the field service travel time for each subsystem. In the real world, if a field service person is present and repairing a component in one subsystem, he would also respond to a failure in another subsystem. Thus in this case we should not include the travel time twice. Also, the field service would follow the repair priority described above. The Markov model for each subsystem can be modified by iteratively checking for the presence of the field service person in the other two Markov models. The field service person is assumed to wait on site until reload and recovery are completed in the SDisk and ADisk subsystems, and until recovery is completed in the HSC subsystem.

The HSC subsystem is extended as follows. The rate of each transition due to a component failure is probabilistically split using the variable $y_1$ (or $1 - y_1$). The probability that the field service is present for repairing a component in either of the two disk subsystems is,

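$$y_1 = (1 - P_{2S} - P_{1S} - P_{0S}) + (1 - P_{2A} - P_{1A} - P_{0A}) - \big((1 - P_{2S} - P_{1S} - P_{0S})(1 - P_{2A} - P_{1A} - P_{0A})\big) \tag{12}$$

The initial value of $y_1$ is assumed to be 0 in the first iteration. The value of $y_1$ computed above is then used for the next iteration.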

The system (application) disk subsystem is extended as follows. The rate of every transition due to a component failure that occurs in the absence of the repair person in the system (application) disk subsystem is multiplied by $y_2$ (or $1 - y_2$). The expression for $y_2$ is similar to the expression for $y_1$, except that S (A) is replaced by H. This takes into account that the field service may be present in the HSC and/or application (system) disk subsystem.
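The iteration itself can be illustrated with a deliberately small toy, which is not the model of [10]. Each of two hypothetical subsystems is a three-state chain (UP, WAIT for travel, REPAIR with field service on site). A failure jumps straight to REPAIR with probability y (service already on site in the other subsystem) and to WAIT with probability 1 - y. Each subsystem's y is the other subsystem's steady-state REPAIR probability, so the two sub-models are solved alternately, starting from y = 0, until the values stop changing. All rates and names are assumptions.

```python
from typing import Tuple

Params = Tuple[float, float, float]   # (failure rate, travel rate, repair rate)


def service_present(lam: float, mu_travel: float, mu_repair: float, y: float) -> float:
    """Steady-state probability of REPAIR in the chain UP / WAIT / REPAIR.

    Balance equations: pi_WAIT * mu_travel = pi_UP * lam * (1 - y)
                       pi_REPAIR * mu_repair = pi_UP * lam
    """
    wait_ratio = lam * (1.0 - y) / mu_travel
    repair_ratio = lam / mu_repair
    p_up = 1.0 / (1.0 + wait_ratio + repair_ratio)
    return p_up * repair_ratio


def fixed_point(sub_a: Params, sub_b: Params,
                tol: float = 1e-12, max_iter: int = 100) -> Tuple[float, float, int]:
    """Iterate y_a = f_b(y_b), y_b = f_a(y_a) starting from y = 0."""
    y_a = y_b = 0.0
    for iteration in range(1, max_iter + 1):
        new_y_a = service_present(*sub_b, y_b)   # service on site in B helps A
        new_y_b = service_present(*sub_a, y_a)   # service on site in A helps B
        delta = max(abs(new_y_a - y_a), abs(new_y_b - y_b))
        y_a, y_b = new_y_a, new_y_b
        if delta < tol:
            return y_a, y_b, iteration
    raise RuntimeError("fixed-point iteration did not converge")


# Hypothetical rates per hour.
sub_a = (1.0e-3, 0.5, 1.0)
sub_b = (2.0e-3, 0.5, 0.25)
y_a, y_b, iterations = fixed_point(sub_a, sub_b)
print(f"y_a = {y_a:.6e}, y_b = {y_b:.6e}, iterations = {iterations}")
```

In this toy the coupling is weak because failures are rare relative to repairs, so the iteration settles within a handful of passes.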

In a similar manner, the next approximation refined the model by taking into account the global nature of system recovery. That is, if a recovery is ongoing in one subsystem, the other two subsystems are forced to go into recovery. The approximated effect of global recovery is achieved with an iterative scheme that allows for interaction between the sub-models. The final approximation modified only the HSC subsystem model, to incorporate the effect of an HSC failover (footnote 5) as shown in Figure 16.

Figure 16: CTMC model of the HSC subsystem including failover (states 2H, 1H, 0H, PFAIL, SFAIL, RPFAIL, RSFAIL)

Footnote 5: Failover is the procedure of switching to an alternate path or component after failure of a path or a component [19]. During the HSC failover period all the disks are switched on to the operational HSC.

In state 2H, instead of a single failure transition with rate $2\lambda_H$, we now have three failure transitions. If the primary HSC fails, the model transitions from state 2H to state PFAIL with rate $\lambda_H$, and PFAIL transitions to state 1H after a failover to the secondary HSC with rate $\mu_{FD}$. In state 2H, if the failure of the secondary HSC is detected (with probability $P_{det}$), a transition to state 1H occurs with rate $P_{det}\lambda_H$; if it is not detected, a transition to state SFAIL occurs with rate $(1 - P_{det})\lambda_H$. The steady state availability of the HSC subsystem is then given by,

H

(13)

The steady state availability of the storage subsystem is given by Equation 7. In [10], after various experiments, it was observed that the storage downtime is more sensitive to the detection of a secondary HSC failure than to the average failover time.

In this section we discuss a VAXcluster availability model that tolerates largeness and automates the generation of large Markov models. Ibe, Sathaye et al. [1] used generalized stochastic Petri nets (SPNs) to model VAXclusters. The authors used the software tool SPNP [8] to generate and solve the Markov model underlying the SPN. In fact, the SPN model in [1] allows extensions to permit specifications at the net level, hence the resulting model is a stochastic reward net.

Figure 17 shows a partial SPN model of the VAXcluster system. The details of the entire model in [1] are beyond the scope of this paper. The place $P_{UP}$ with N tokens represents the initial condition that all N processors are up. The processors can suffer a permanent or an intermittent failure, represented by the timed transitions $t_{PERM}$ and $t_{INT}$ respectively. The firing rates of the transitions $t_{PERM}$ and $t_{INT}$ are marking dependent: they are $\#(P_{UP}, i)\lambda_P$ and $\#(P_{UP}, i)\lambda_I$ respectively, where $\#(P_{UP}, i)$ represents the number of tokens in place $P_{UP}$ in marking i. The place $P_{PERM}$ ($P_{INT}$) represents that a permanent (intermittent) failure has occurred. When a permanent (intermittent) failure occurs, it is covered with probability c (k) and uncovered with probability 1 - c (1 - k). Covered permanent and intermittent failures are represented by the immediate transitions $t_{PC}$ and $t_{IC}$ respectively. Uncovered permanent and intermittent failures are represented by the immediate transitions $t_{PU}$ and $t_{IU}$ respectively. A failure is considered to be covered only if the number of operational processors is at least l, the quorum. The input and output arcs with multiplicity l from and to the place $P_{UP}$ ensure quorum maintenance.

Figure 17: Partial SPN model of the VAXcluster system (Intermittent Block, Permanent Block and Cluster Reconfiguration Block)

In Figure 17, the block labeled "Cluster Reconfiguration Block" represents a group of reconfiguration places in the complete model of [1]. In addition, the following behavior is represented by the SPN:

- A covered permanent failure is not possible while the cluster is being rebooted after an uncovered failure (token in either $P_{UIF}$ or $P_{UPF}$). This is represented by an inhibitor arc from $P_{UIF}$ and $P_{UPF}$ to the immediate transition $t_{PC}$.
- It is assumed that a failure does not occur while the cluster is being reconfigured. This is represented by the inhibitor arcs from the "Cluster Reconfiguration Block" to $t_{PERM}$ and $t_{INT}$.
- A processor under reboot can suffer a permanent failure. This is represented by the fact that when there is a token in $P_{REB}$, both the transitions $t_{REB}$ and $t_{IP}$ are enabled.

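The steady-state availability is given by:

$$A = \sum_{i} r_i \pi_i \tag{14}$$

where the sum runs over the markings i and the reward rate $r_i$ is

$$
r_i = \begin{cases}
1, & \text{if } (\#(P_{UP}, i) \ge l) \wedge [\#(P_{\text{ofClusterRebootPlaces}}, i) < 1 \wedge \#(P_{UIF}, i) < 1] \wedge [\#(P_{\text{ClusterReconfigurationBlock}}, i) < 1] \\
0, & \text{otherwise}
\end{cases}
$$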
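A reward-based availability of this kind can be sketched as follows. The sketch assumes that a marking counts as "up" when all three conditions hold together: a quorum of processors is up, no cluster reboot is in progress, and no cluster reconfiguration is in progress. The place name `RECONFIG` (standing for the whole reconfiguration block), the markings and their probabilities are hypothetical; in practice the probabilities come from solving the CTMC underlying the SPN.

```python
from typing import Dict, List

Marking = Dict[str, int]   # place name -> number of tokens


def reward(marking: Marking, quorum: int) -> float:
    """Reward rate 1 for an up marking, 0 otherwise (assumed conjunction)."""
    has_quorum = marking["P_UP"] >= quorum
    no_cluster_reboot = marking["P_UIF"] < 1 and marking["P_UPF"] < 1
    no_reconfiguration = marking["RECONFIG"] < 1
    return 1.0 if (has_quorum and no_cluster_reboot and no_reconfiguration) else 0.0


def availability(markings: List[Marking], probs: List[float], quorum: int) -> float:
    """Expected steady-state reward: sum of r_i times the probability of marking i."""
    return sum(reward(m, quorum) * p for m, p in zip(markings, probs))


# Hypothetical markings of a 3-processor cluster with quorum 2.
markings = [
    {"P_UP": 3, "P_UIF": 0, "P_UPF": 0, "RECONFIG": 0},   # fully up
    {"P_UP": 2, "P_UIF": 0, "P_UPF": 0, "RECONFIG": 1},   # reconfiguring
    {"P_UP": 1, "P_UIF": 0, "P_UPF": 0, "RECONFIG": 0},   # quorum lost
    {"P_UP": 2, "P_UIF": 1, "P_UPF": 0, "RECONFIG": 0},   # cluster reboot
]
probs = [0.9970, 0.0010, 0.0015, 0.0005]
print(f"A = {availability(markings, probs, quorum=2):.4f}")
```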

In this section, we present an SPN model that considers realistic VAXcluster configurations, which include uniprocessors and multiprocessors [3]. The heterogeneity in the model allows each multiprocessor to contain a varying number of processors, and each VAX in the VAXcluster to belong to a different VAX family. Henceforth, we refer to this SPN model as the heterogeneous model. Throughout this section we refer to a single multiprocessor system as a machine, which consists of two components: one or more processors and a platform. The platform consists of the memory, power, cooling, console interface module and I/O channel adapters [3]. As in the uniprocessor case in the above sections, we depict covered and uncovered permanent and intermittent failures. In addition, we depict the following failure/recovery behavior for a multiprocessor. A processor failure in a machine requires a machine reboot to map the faulty processor offline. The entire machine is not operational during the processor repair and the reboot following the repair. Before and after every machine reboot, a cluster reconfiguration maps the machine out of and into the cluster. The platform components of a machine are single points of failure for that machine [3]. In addition, the following features are included:

- The option of including the quorum disk. A quorum disk acts as a virtual node in the
- The option of including failure/recovery behavior like unsuccessful repair and unsuccessful reboots. An unsuccessful repair is modeled by introducing a faulty repair. Faulty repair means that diagnostics called out a wrong FRU (field replaceable unit) or that the right FRU was requested but is DOA (dead on arrival). Unsuccessful reboot means a processor reboot did not complete and has to be repeated.
- The option of including detailed cluster reconfiguration information. For example, if the quorum does not exist in reality, the time to form a cluster is longer than the usual reconfiguration time.

The overall model structure is shown in Figure 18. The VAXcluster is modeled as a 1 × N array, where each plane represents the subnet of a machine consisting of $M_i$ processors. The place $P_{UP}$ in plane i contains $M_i$ tokens, and represents the initial condition that the machine is up with $M_i$ processors. The cluster reconfiguration subnet consists of a single place clust_reconfig, which initially contains a single token. Whenever a machine i initiates a reconfiguration, the subnet associated with it flushes the token out of clust_reconfig and returns the token after the reconfiguration. This ensures that all machines in the cluster participate in a cluster state transition. Similarly, the quorum disk is treated as a separate subnet which interacts with each of the N subnets. The number of processors in a machine is varied by varying the number of tokens $M_i$ in each subnet. The model allows the machines in the cluster to belong to different VAX series because every subnet handles the failure/recovery behavior of a machine separately. If $M_i = 1$ in a subnet then the machine follows the failure/recovery behavior of a uniprocessor. The detailed subnet associated with a machine is beyond the scope of this paper, and is presented in [3].

Figure 18: Overall model structure (subnets for Machine 1 through Machine N, Quorum Disk, Field Service and Cluster Transition)

The heterogeneous SPN model included various extensions such as variable arcs, enabling functions (guards) and rate type functions, and hence is a Stochastic Reward Net [14]. For example, when an intermittent covered failure transition fires, the rate type function for the rate $\lambda_{IC}$ is defined as:

If (Mach_UP + (mark(P_qdup) × NQV) QU)
    λ_IC = platint + (mark(P_UP) × λ_I)
else
    λ_IC = k × (platint + (mark(P_UP) × λ_I))

where Mach_UP represents the number of machines up in the cluster, platint is the platform intermittent failure rate, $\lambda_I$ is the processor intermittent failure rate and k is the probability of a covered intermittent failure. The marking mark(P_qdup) > 1 implies the quorum disk is up, and NQV is the number of quorum votes assigned to the quorum disk. The number of votes needed for the VAXcluster to be operational is given by QU.

On the other hand, the rate type function for the intermittent uncovered failure rate is given by

    λ_IU = (1 - k) × (platint + (mark(P_UP) × λ_I))

The heterogeneous VAXcluster model is available if:

1. mark(clust_reconfig) = 1, that is, a cluster reconfiguration is not in progress, and
2. NU + (mark(P_qdup) × NQV) ≥ QU,

where NU is obtained using the following algorithm.

Initial: NU = 0
For i = 1, ..., N
    If ((mark(Platform_failure, i) = 0) AND (mark(P_UP, i) > 0) AND
        (Repair and Reboot Transitions Disabled)) then
        NU = NU + 1

NU counts the machines that are up: a machine is up if no platform failure has occurred, at least one processor is up, and no repair or reboot is in progress.
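The counting algorithm and the two availability conditions translate directly into code. The sketch below is only an illustration: the `Machine` record and its field names are hypothetical stand-ins for the markings of one machine subnet, and the vote test is taken to be "at least QU votes", consistent with QU being the number of votes needed for the cluster to be operational.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Machine:
    platform_failure: int      # mark(Platform_failure, i)
    procs_up: int              # mark(P_UP, i)
    repair_or_reboot: bool     # True while a repair or reboot is in progress


def machines_up(machines: List[Machine]) -> int:
    """NU: machines with no platform failure, a processor up, and no repair/reboot."""
    nu = 0
    for m in machines:
        if m.platform_failure == 0 and m.procs_up > 0 and not m.repair_or_reboot:
            nu += 1
    return nu


def cluster_available(machines: List[Machine], clust_reconfig: int,
                      qdisk_up: int, nqv: int, qu: int) -> bool:
    """Both conditions: no reconfiguration in progress and enough votes."""
    votes = machines_up(machines) + qdisk_up * nqv   # assumed: one vote per machine
    return clust_reconfig == 1 and votes >= qu


# Hypothetical 3-machine cluster with a one-vote quorum disk and QU = 3.
cluster = [
    Machine(platform_failure=0, procs_up=4, repair_or_reboot=False),
    Machine(platform_failure=0, procs_up=1, repair_or_reboot=False),
    Machine(platform_failure=1, procs_up=2, repair_or_reboot=False),
]
print(machines_up(cluster), cluster_available(cluster, 1, qdisk_up=1, nqv=1, qu=3))
```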

This heterogeneous model was evaluated using the SPNP package [8], which solves the SPN by analyzing the underlying CTMC. The state space cardinality of the CTMC isomorphic to the heterogeneous model increases with the number of machines in the VAXcluster, as well as with the number of processors in each machine. In [3] we dealt with this problem by using a technique that truncates the state space [7]. The technique is implemented by specifying a truncation level K for processor failures in the model; the maximum value of K is $M_1 + M_2 + \cdots + M_N$. The value K specifies that the reachability graph, and hence the corresponding CTMC, is generated up to K processor failures. This is implemented in the model by means of an enabling function associated with all the failure transitions. The enabling function disables all the failure transitions if $\sum_{i=1}^{N} (M_i - \text{mark}(P_{UP}, i)) \ge K$. This technique is justified as follows:

- In real systems, most of the time the system has a majority of its components operational [7]. This means the probability mass is concentrated on a relatively small number of states in comparison to the total number of states in the model.
- We observed the impact of varying the truncation level on the availability measures for an example heterogeneous cluster, and concluded that the effect was minimal.
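The truncation guard can be sketched as a small predicate. This is not SPNP code: the function name, its argument layout, and the reading "disable once K processors have failed" are assumptions made for illustration.

```python
from typing import Sequence


def failures_enabled(initial_procs: Sequence[int],
                     procs_up: Sequence[int],
                     k: int) -> bool:
    """Enabling function for all failure transitions.

    initial_procs[i] is M_i and procs_up[i] is mark(P_UP, i). Failure
    transitions stay enabled only while fewer than k processors have failed,
    so no marking with more than k failures is ever generated.
    """
    failed = sum(m - up for m, up in zip(initial_procs, procs_up))
    return failed < k


# Hypothetical cluster: three machines with 4, 2 and 1 processors.
M = [4, 2, 1]
print(failures_enabled(M, [4, 2, 1], k=2))   # no failures yet
print(failures_enabled(M, [3, 2, 0], k=2))   # two failures already present
```

With k equal to the total number of processors the guard never blocks a failure while any processor is still up, so the full state space is generated; smaller values of k trade a small amount of probability mass for a much smaller reachability graph.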

We used the heterogeneous model to evaluate not only measures associated with standard system availability, but also measures of system reliability and task completion. In the system reliability class, we evaluated measures like the frequency of failures and the frequency of disruptive outages. A disruptive outage is defined as any outage that exceeds the specified tolerance limit of the user. The task completion measures evaluated the probability that the application is interrupted during its execution period.

In this paper we discuss an example measure from each of the three classes of measures [3]:

- Mean Cluster Downtime D in minutes per year: This is a system availability measure and represents the average amount of time the cluster is not operating during a one year observation period. The expression for D in terms of the steady state cluster availability A is given by

$$D = (1 - A) \times 8760 \times 60 \tag{15}$$

- Frequency of Disruptive Reconfigurations (FDR): This is a system reliability measure and represents the mean number of reconfigurations which exceed the specified tolerance duration during the one year observation period. We evaluate FDR as,

rg

(16)

where rg is the cluster reconfiguration (in, out or formation) rate, thresh is the specified tolerance duration on the reconfiguration times, and $P_{rg}$ is the probability that a reconfiguration is in progress.

- Probability of Task Interruption under Pessimistic Assumption (Prob_Psm): This is a task completion measure. It measures the probability that a task which initially finds the system available, and which needs x hours for execution, is interrupted by any failure in the cluster. This is a pessimistic assumption because the system does not tolerate any interruption, including the brief reconfiguration delays. The expression for Prob_Psm is given by:

$$
\text{Prob\_Psm} = \frac{\sum_{j \in \text{Upstate}} \left(1.0 - e^{-\sum_{k=1}^{N} \left(c_{k,j}(\lambda_{p,k} + \lambda_{i,k}) + (\lambda_{plt,k} + \lambda_{plt\_int,k})\right) x}\right) P_j}{A} \tag{17}
$$

where for machine k, $c_{k,j}$ is the number of operational processors in state j, $P_j$ is the probability of being in operational state j, $\lambda_{p,k}$ ($\lambda_{i,k}$) is the processor permanent (intermittent) failure rate, $\lambda_{plt,k}$ ($\lambda_{plt\_int,k}$) is the platform permanent (intermittent) failure rate, and A is the cluster availability.

In [3], we used these three measures for a particular configuration to study the impact of truncation. Table 4 presents the number of states and the number of transitions (arcs) of the underlying CTMC, together with the three measures, for each truncation level.

Table 4: Effect of the truncation level on the state space size and on the three measures

Trunc.  No. of   No. of    Mean cluster        Freq. of disruptive       Prob. of task
level   states   arcs      downtime (min/yr)   reconfig. (thresh=10 s)   interruption (t=1000 s)
1       348      948       12.91732078         9.96432050                0.00032767
2       2088     7110      13.00257283         9.96751584                0.00032767
3       6394     26686     13.00258549         9.96751604                0.00032767
4       13236    66596     13.00258549         9.96751604                0.00032767
5       20728    122746    13.00258549         9.96751604                0.00032767

The measures change only slightly between truncation levels 1 and 3 and are identical, to the reported precision, for levels 3, 4 and 5. We therefore conclude that the state space of the SPN model for the heterogeneous cluster can be truncated without impacting the results.
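To relate the downtime figures to availability, the conversion D = (1 - A) × 8760 × 60 can be applied in both directions. The short sketch below does this for the downtime reported at the higher truncation levels (13.00258549 minutes per year); the function names are ours.

```python
MINUTES_PER_YEAR = 8760 * 60   # hours per year times minutes per hour


def mean_downtime_minutes(availability: float) -> float:
    """Mean cluster downtime in minutes per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes_per_year: float) -> float:
    """Inverse conversion: availability implied by a yearly downtime."""
    return 1.0 - minutes_per_year / MINUTES_PER_YEAR


a_converged = availability_from_downtime(13.00258549)
print(f"implied availability = {a_converged:.8f}")
print(f"round trip downtime  = {mean_downtime_minutes(a_converged):.8f} min/yr")
```

On this scale the difference between truncation levels 1 and 3, about 0.085 minutes per year, corresponds to a change in availability of roughly 1.6e-7.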

5 Conclusion

We started the paper by briefly discussing various non-state space and state space availability and performance modeling approaches. Using the problem of deciding the optimal number of processors in an n-component parallel multiprocessor system, we showed the limitations of a pure availability or performance model, and emphasized the need for a composite availability and performance model. Finally, we took a case study from a corporate environment and demonstrated an application of the techniques in a real situation. Several approximations and assumptions were made, and validated before use, in order to deal with the size and complexity of the models encountered.


References

[1] O. Ibe, A. Sathaye, R. Howe and K. S. Trivedi, "Stochastic Petri Net Modeling of VAXcluster System Availability", Proc. Third International Workshop on Petri Nets and Performance Models (PNPM89), pp. 112-121, Kyoto, Japan, 1989.

[2] K. S. Trivedi, A. Sathaye, O. Ibe, and R. Howe, "Should I Add a Processor?", Proc. 23rd Annual Hawaii Conference on System Sciences, pp. 214-221, January 1990.

[3] A. Sathaye, K. S. Trivedi and R. Howe, "Availability Modeling of Heterogeneous VAXcluster Systems: A Stochastic Petri Net Approach", Proc. of International Conference on Fault-Tolerant Systems, Varna, January 1990.

[4] J. Muppala, A. Sathaye, R. Howe and K. S. Trivedi, "Dependability Modeling of a Heterogeneous VAXcluster System Using Stochastic Reward Nets", Hardware and Software Fault Tolerance in Parallel Computing Systems, D. Avresky (ed.), pp. 33-59, Ellis Horwood Ltd., 1992.

[5] N. P. Kronenberg, H. M. Levy, W. D. Strecker, R. J. Merwood, "VAXclusters: A Closely Coupled Distributed System", ACM Trans. Computer Systems, Vol. 4, pp. 130-146, May 1986.

[6] E. Balkovich, P. Bhabhalia, W. Dunnington, and T. Weyant, "VAXcluster Availability Modeling", Digital Technical Journal, No. 5, pp. 69-79, September 1987.

[7] R. Muntz, E. de Souza e Silva, and A. Goyal, "Bounding Availability of Repairable Computer Systems", IEEE Trans. on Computers, Vol. 38, No. 12, pp. 1714-1723, December 1989.

[8] G. Ciardo, J. Muppala and K. S. Trivedi, "SPNP: Stochastic Petri Net Package", Proc. Third Int. Workshop on Petri Nets and Performance Models (PNPM89), pp. 142-151, Kyoto, Japan, 1989.

[9] R. Sahner, A. Puliafito and K. S. Trivedi, Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package, Kluwer Academic Publishers, Boston, 1995 (418 pages).

[10] A. Sathaye, K. Trivedi and D. Heimann, "Approximate Availability Models of the Storage Subsystem," Technical Report, DEC, September 1988.

[11] D. Siewiorek and R. Swarz, The Theory and Practice of Reliable System Design, Digital Press, 1982.

[12] K. S. Trivedi, Probability and Statistics with Reliability, Queuing and Computer Science Applications, Prentice-Hall, Englewood Cliffs, NJ, 1982 (624 pages).

[13] L. Tomek and K. S. Trivedi, "Fixed Point Iteration in Availability Modeling", Informatik-Fachberichte, Vol. 283: Fehlertolerierende Rechensysteme, M. Dal Cin (ed.), pp. 229-240, Springer-Verlag, Berlin, 1991.

[14] D. R. Avresky, Hardware and software fault tolerance in parallel computing systems, Ellis Horwood Ltd., New York, 1992.

[15] H. Sun, X. Zang and K. S. Trivedi, "A BDD-based Algorithm for Reliability Analysis of Phased-Mission Systems", IEEE Transactions on Reliability, Vol. 48, No. 1, pp. 50-60, March 1999.

[16] O. Ibe, R. Howe and K. S. Trivedi, "Approximate Availability Analysis of VAXcluster Systems," IEEE Transactions on Reliability, Vol. 38, No. 1, pp. 146-152, April 1989.

[17] T. Luo and K. S. Trivedi, "An improved algorithm for coherent-system reliability", IEEE Transactions on Reliability, Vol. 47, No. 1, pp. 73-78, 1998.

[18] J. Muppala and K. S. Trivedi, "Numerical transient solution of finite Markovian queueing systems", Queueing and related models, U. N. Bhat and I. V. Basawa (eds.), pp. 262-284, Oxford University Press, 1992.

[19] Introduction to VAXcluster Application Design, Digital Equipment Corporation, 1984.

[20] A. Satyanarayana and A. Prabhakar, "New topological formula and rapid algorithm for reliability analysis of complex networks", IEEE Transactions on Reliability, Vol. 27, pp. 82-100, 1978.

[21] S. A. Doyle and J. B. Dugan, "Dependability assessment using binary decision diagrams", Proc. 25th Intl. Symposium on Fault Tolerant Computing, pp. 249-258, 1995.

[22] S. A. Doyle, J. B. Dugan and M. Boyd, "Combinatorial models and coverage: a binary decision diagram (BDD) approach", Proc. Annual Reliability and Maintainability Symposium, pp. 82-89, 1995.
