
ELSEVIER

0951-8320(95)00055-0

Reliability Engineering and System Safety 51 (1996) 241-257


© 1996 Elsevier Science Limited
Printed in Northern Ireland. All rights reserved
0951-8320/96/$15.00

An overall model for maintenance optimization
Jorn Vatn, Per Hokstad & Lars Bodsberg
SINTEF Safety and Reliability, 7034 Trondheim, Norway

This paper presents an approach for identifying the optimal maintenance
schedule for the components of a production system. Safety, health and
environment objectives, maintenance costs and costs of lost production are all
taken into consideration, and maintenance is thus optimized with respect to
multiple objectives. Such a global approach to maintenance optimization
requires expertise from various fields, e.g., decision theory, risk analysis and
reliability and maintenance modelling. Further, close co-operation between
management, maintenance personnel and analysts is required to achieve a
successful result. In the past this has been a major obstacle to the extensive use
of proper maintenance optimization methods in practice, and techniques to
promote communication between the parties involved in the optimization
process are an essential element in the suggested approach. A simple step by
step presentation of the required modelling is provided. Contrary to most
current methods of RCM (Reliability Centered Maintenance), the approach is
based on an analytic model, and therefore gives a sound framework for
carrying out a proper maintenance optimization. The approach is also flexible
as it can be carried out at various levels of detail, e.g., adapted to available
resources and to the management's willingness to give detailed priorities with
respect to objectives on safety vs production loss. © 1996 Elsevier Science
Limited.

1 INTRODUCTION

Nowadays, Reliability Centered Maintenance (RCM)
is probably the global analysis method which is most
frequently used for identifying a cost effective
maintenance programme for a plant. The goal of
RCM is to identify the 'optimal' maintenance for each
component, taking safety requirements, maintenance
costs and costs of lost production into account, see
Refs 1-5. No doubt, the RCM methodology gives a
practical and structured approach for arriving at a
recommended maintenance strategy for each component.
However, there exists no sound foundation
for claiming that the maintenance strategy derived
from the RCM approach is in any sense 'optimal'.
On the other hand, there exists a vast number of
academic papers describing narrow maintenance
models, which are rarely, if ever, used in practice.
Most of these papers are too abstract, and the models
that could be useful are difficult to identify among this
large number of suggested models; see the discussion
in Ref. 6. Further, in order to achieve real
improvements, there is a need now both to integrate
the maintenance fully into the total business system of
a company and to formulate an approach with a sound
theoretical basis, see e.g. Ref. 7.
The present paper describes a global approach for
quantifying the costs and benefits of the maintenance
programme of a production system/plant. It is model
based and thus will allow us to carry out an
optimization in a well defined sense. The method is so
far restricted to incorporating the most fundamental
maintenance strategies, but the effect of these
maintenance rules on the overall costs is explicitly
modelled. Further, it is a main objective of the
method to promote communication between management,
maintenance engineers and reliability engineers,
so that they can agree on the basis for the decisions,
and understand the reasons why one decision is
preferred to another.
The method is, to some extent, based on utility
theory (Refs 8-10), and it is a prerequisite that the
management identifies the main objectives of the company,
e.g., regarding safety, environmental damage, maintenance
cost and loss of production. Further, it is
necessary to define system performance measures
with respect to these objectives, and the plant
management must provide (quantitative) preferences
with respect to these performance measures. The
reliability analyst uses this information to establish an
overall loss function, which merges the various
performance measures into an overall measure for
degree of goal achievement. This approach is based on
the so-called multiple attribute theory, see Ref. 9, and
provides the basis for a real optimization with respect
to types and frequencies of maintenance.
The present method requires reliability modelling to
relate the system performance measures to the
dependability (e.g., availability or failure frequency)
of the components. Further, a model is needed to
specify the relationship between the component
dependability and the maintenance strategy/frequency
chosen for that component.
We have here applied the term 'dependability', in
spite of the fact that it is currently causing some
dispute. We adopt the definition of 'dependability'
given in Ref. 11 as 'The collective term used to describe
the availability performance and its influencing factors:
reliability performance, maintainability performance
and maintenance support performance.'
Thus, well-known models from various fields are
combined to obtain a complete, structured approach
for quantifying the effect of various maintenance
strategies on the overall performance of a production
system. A unified framework is thus provided, not
requiring any subtle theory, but nevertheless forming
a firm basis for decisions on maintenance optimization. Further, the focus is on main aspects of plant
activity, and it is the aim to provide an insight into the
relation between maintenance and economic return
and safety. In particular, influence diagrams are used
as a means for promoting communication between the
persons involved, see Section 2. The approach is
flexible, as the analysis can be carried out at various
levels of detail. The method is here presented as a
total analysis of the plant, but it will often be
appropriate to break down the analysis, using the
approach to perform detailed studies of critical
subsystems.
An example to illustrate the approach is provided at
the end of the paper. Some theoretical questions
related to the approach will be treated in future
papers.
Abbreviations and notation

CM        Corrective maintenance
ETA       Event tree analysis
FTA       Fault tree analysis
MTTF      Mean time to failure
MTTR      Mean time to repair
MTTT      Mean time to test
PM        Preventive maintenance
RCM       Reliability centered maintenance
TTF       Time to failure
A         Availability
U         1 - A, Unavailability
E_k       Undesired event no. k
F_k       Frequency ('likelihood') of undesired event, E_k (system level)
DC_ij     Damage category i, j
X_ij      Number of events resulting in DC_ij
F_ij      Frequency of DC_ij = E(X_ij)
p_ij/k    Probability of DC_ij, given E_k
f         1/MTTF, (Component) failure frequency
T         Length of test/PM interval
R(t)      P(TTF > t), Survival function
f(t)      -R'(t), Probability density function (pdf)
z(t)      f(t)/R(t), Hazard rate
v(t)      Mean number of failures in (0, t]
⊔         Co-product, x ⊔ y = 1 - (1 - x)(1 - y)

2 ANALYSIS FRAMEWORK AND INFLUENCE DIAGRAMS

Influence diagrams (Refs 12, 13) are used to visualize the
overall analysis framework. An influence diagram
demonstrates the relations between the decisions to be
taken, e.g., regarding type (and frequency) of
maintenance, and the various measures of system
performance, see Fig. 1. Formally, an influence
diagram is a directed graph, G = (N, A), where N is
the set of nodes, and A is a set of arcs connecting the
nodes.
Figure 1 shows the relationships between three
types of nodes:
1. Decision nodes, representing decisions on
- component quality, as measured by the
  inherent hazard rate (being the hazard rate
  that would appear if no preventive maintenance
  is carried out)
- maintenance tasks to be taken for various
  components, e.g.,
    - scheduled restoration/replacement
    - functional testing (e.g., to detect dormant
      failures)
    - condition monitoring
- logistics, e.g., the availability of spare parts
  and maintenance personnel
2. Performance nodes, representing
(a) component performance, as measured by e.g.,
- number of failures of the component (say, for one
  year)
- number of on-demand failures of the component
  (per year)
- total component down-time due to repairs
  (per year)
(b) intermediate system safety performance, as
measured by e.g.,
- number of accidents of a specific type (per
  year)
(c) ultimate system performance, as measured by
- total system down-time, due to repairs (per
  year)
- number of system shut-downs (per year)
- number of injured persons at the plant (per
  year)
- number of killed persons at the plant (per
  year)
- total amount of pollution in cubic meters (per
  year)
- hours of maintenance (per year)
3. Value node, representing the overall system
performance measure, e.g., given as the total loss
obtained by merging the above system performance
measures.

Fig. 1. Influence diagram showing effect of maintenance.
[Figure: columns 1, 2a, 2b, 2c and 3; decision nodes 'Component
quality', 'Maintenance tasks' and 'Logistics (spare parts)'; groups
labelled 'Costs' and 'Ultimate system performance'; value node
'Total Loss' ('Overall cost'); legend: decision node, performance
node, value node.]

The decision nodes which represent the choices
regarding type and frequency of maintenance are the
nodes which will be the main concern in the present
discussion. The decisions which are taken regarding
the maintenance of the various components will affect
their performance, as measured, e.g., by the number
of occurring failures. These influences are indicated by
arrows, see Fig. 1. Similarly, observe that the number
of hours of maintenance (node 2c) is directly
influenced by the type and frequency of the chosen
maintenance.
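
As a rough illustration of the formal definition G = (N, A), a reduced influence diagram can be held as a set of typed nodes and a list of arcs. The sketch below is only a hypothetical subset in the spirit of Fig. 1: the node names, the selection of arcs and the helper `influenced_by` are assumptions made for illustration, not part of the method itself.

```python
from collections import defaultdict

# Hypothetical node set N, each tagged with one of the three node types.
NODE_TYPES = {
    "maintenance_tasks": "decision",      # node type 1
    "component_failures": "performance",  # node type 2a
    "type_A_accidents": "performance",    # node type 2b
    "system_downtime": "performance",     # node type 2c
    "maintenance_hours": "performance",   # node type 2c
    "total_loss": "value",                # node type 3
}

# Hypothetical arc set A, as (from, to) pairs.
ARCS = [
    ("maintenance_tasks", "component_failures"),
    ("maintenance_tasks", "maintenance_hours"),
    ("component_failures", "type_A_accidents"),
    ("component_failures", "system_downtime"),
    ("type_A_accidents", "total_loss"),
    ("system_downtime", "total_loss"),
    ("maintenance_hours", "total_loss"),
]


def successors(arcs):
    """Build an adjacency list {node: [direct successors]} from the arcs."""
    graph = defaultdict(list)
    for src, dst in arcs:
        graph[src].append(dst)
    return graph


def influenced_by(start, arcs):
    """Return every node reachable from `start` by following arcs."""
    graph = successors(arcs)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


affected = influenced_by("maintenance_tasks", ARCS)
```

Listing the nodes reachable from a decision node is one simple way to check, together with the experts, which performance measures a given maintenance decision is assumed to influence and which it is not.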
In the influence diagram of Fig. 1 there are three
different types of performance nodes. The component
performance nodes (2a) are directly affected by the
maintenance decisions. The ultimate performance
nodes (2c) represent those system measures which are
essential to the decision maker, e.g., total system
down time, the number of persons injured, etc. These
should be defined based on the overall objectives of
the system, e.g.,
- high personnel safety
- low environmental threat
- low risk of material damage
- low production losses
- low operation and maintenance cost.

Between these two types of node, we have
intermediate safety performance nodes (2b), representing
the number of various accidents (undesired
events). For example, accident A might be a fire,
accident B might be a vessel rupture, etc. By specifying
such events, it is made clear to the decision makers
and analysts that the evaluation of system safety is
entirely based on the specified events/scenarios. Such
a limitation of the analysis is, of course, essential. To
perform a maintenance optimization, it is often not
feasible to perform a complete risk analysis, and the
effect of the chosen maintenance programme on safety
may be assessed by a somewhat simplified approach,
considering a rather limited number of representative
scenarios only. However, it is important that these
scenarios are representative with respect to the
involved risk.
Note that the states of the performance nodes are
formally given as random variables. However, the
actual maintenance optimization will normally be
based on the mean of these variables.
The value node represents the total value function
of the management, combining the various system
performance measures into a measure of overall
system performance. This overall performance could,
for instance, be given as the total loss (i.e., costs) per
year, or as the utility measured in net profit per year.
Thus the value function is also formally a random
variable, and the overall goal of the analysis is to
establish a set of maintenance actions minimizing the
expected (mean) value of the value function. Note
that, in order to establish the value function, the
management must make some
quantifications of (relative) costs related to the
ultimate system performance measures. Some of these
costs, in particular those related to safety (e.g.,
number of persons injured/killed) are rather controversial, and ways of handling this potential problem
are further discussed in Section 3.2.
The main purpose of the influence diagram is to
promote communication between the persons involved
in the analysis and to clarify the overall modelling
approach. The diagram shows which maintenance
options exist, which relations/influences are modelled
and which are not. Referring to Fig. 1, it is seen that,
in the present model, the repair time depends neither
on component quality (i.e., inherent hazard rate) nor
on maintenance strategy. However, if such dependencies are judged to be relevant by those involved in
the maintenance optimization process, the model
should of course be modified, and appropriate arrows
be included in the influence diagram. Thus, the
influence diagram of Fig. 1 demonstrates exactly
which accidents (here A and B) contribute to the
safety evaluations, and which system performance
measures form the basis for the decisions.
A main task of the maintenance optimization
process is to agree upon level of detail, i.e., what to
include in the analysis. Figure 2 represents the
influence diagram of a somewhat simplified optimization analysis. Here, some nodes have been removed.
In particular, the number of occurrences of type A
and B accidents are given as the ultimate system
performance measures with respect to safety. This
change will result in a considerable simplification of
the required reliability modelling, and it is also
expected to be easier for the decision maker to
provide the costs related to these system performance
measures than to decide on the cost for instance
related to 'no. of persons killed' (see Section 3.2 for
further discussion).
For an analysis related to Fig. 2 the overall system
performance could, for instance, be measured by a
simple loss function of the form

$$\text{Loss} = C_1 \cdot (\#\text{Downtime hours}) + C_2 \cdot (\#\text{Shutdowns}) + C_3 \cdot (\#\text{Type A accidents}) + C_4 \cdot (\#\text{Type B accidents}) + C_5 \cdot (\#\text{Hours of maintenance}) \qquad (1)$$

Fig. 2. Influence diagram of simplified analysis.


Here there are five (ultimate) system performance
measures, and C_k, k = 1, 2, ..., 5, are the costs
(losses) related to these. For instance, C_1 is the cost of
one hour of downtime. Further, a proper reliability
model will provide the mean value of the various
system performance measures as a function of the
chosen maintenance frequencies, and thus the
maintenance strategy which minimizes the expected
loss can be identified.
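
To make the role of the linear loss function of eqn (1) concrete, the following sketch evaluates the expected loss for two candidate maintenance strategies and picks the cheaper one. All cost figures and mean values below are invented for illustration, and the function name `expected_loss` is an assumption; in a real analysis the mean values would come from the reliability model and the costs from the decision maker.

```python
def expected_loss(costs, mean_measures):
    """Expected loss for a linear loss function such as eqn (1):
    the sum over measures of (cost per unit) * (mean value per year)."""
    if costs.keys() != mean_measures.keys():
        raise ValueError("costs and mean_measures must cover the same measures")
    return sum(costs[name] * mean_measures[name] for name in costs)


# Hypothetical costs C1..C5 per unit of each performance measure.
costs = {
    "downtime_hours": 10_000.0,
    "shutdowns": 50_000.0,
    "type_A_accidents": 2_000_000.0,
    "type_B_accidents": 500_000.0,
    "maintenance_hours": 100.0,
}

# Hypothetical yearly mean values for two maintenance strategies.
strategies = {
    "low_pm": {
        "downtime_hours": 40.0,
        "shutdowns": 4.0,
        "type_A_accidents": 0.02,
        "type_B_accidents": 0.05,
        "maintenance_hours": 500.0,
    },
    "high_pm": {
        "downtime_hours": 20.0,
        "shutdowns": 2.0,
        "type_A_accidents": 0.01,
        "type_B_accidents": 0.02,
        "maintenance_hours": 1500.0,
    },
}

risk = {name: expected_loss(costs, means) for name, means in strategies.items()}
best = min(risk, key=risk.get)
```

With these invented numbers the extra maintenance hours of the second strategy are outweighed by the reduction in downtime, shutdowns and accidents; with other cost figures the ranking could of course be reversed.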
The persons involved in the process of establishing
an appropriate maintenance level are analysts, experts
and the decision maker. The experts will typically be
maintenance and operating personnel and the decision
maker will typically be the manager of the company.
All of these should have well defined tasks in the
process. The analysts (reliability engineers) administrate the process and are responsible for performing the
reliability modelling required to account for the
arrows drawn from decision nodes to performance
nodes. The analysts must also establish models taking
care of the arrows from component performance to
system performance. However, the experts are
essential in deciding which arrows to include in the
influence diagram. The last arrows, from system
performance to the value node, represent the main
task of the decision makers (management), as they
must provide priorities with respect to the specified
system performance measures. In the present paper it
will be assumed that there is only one decision maker.
To overcome the problems arising with several
decision makers, multiple influence diagrams can be
used, see Ref. 14 for an illustrative example.
The maintenance optimization process described
here is presented as a method for 'global' optimization
of plant maintenance. Obviously, this will be a very
demanding task. However, the analysis can be carried
out at various levels, and the overall analysis could be
carried out in steps. One could start with an analysis
at a high level, identifying the subsystems being most
critical with respect to goal achievement. These could
be chosen as the systems of more detailed analyses,
where an optimization is carried out with respect to
maintenance frequencies of the components of a
particular subsystem.
The framework of a maintenance optimization
approach presented above demonstrates the main
objections to the standard RCM approach. The RCM
methodology will not base the decisions on an overall
performance measure (node 3). Further, the underlying reliability model, here demonstrated, e.g., by the
relation between nodes 1 and 2a, is rather shaky.
3 A METHOD FOR MAINTENANCE OPTIMIZATION

Although the present method is based on utility
(decision) theory, it has also been an objective that it
should be developed according to the main principles
of systems analysis (Ref. 15).
3.1 Main steps of the analysis

The analysis method is carried out in four steps:
1. Define the problem. The system boundary and
the objective of the analysis are defined.
2. Establish the loss function and preferences. The
main objectives of plant activity are identified,
and the form of the loss function is decided in
this step.
3. Dependability modelling ('Description of the
world'). The degree of goal attainment is
quantified by a dependability model.
4. Result compilation. The expected value of the
overall loss function is established, and a
minimization of this is carried out with respect to
frequency of the identified PM activities.
In step 1 it is required to (1) define the main
objectives of the plant activity, and (2) define the
boundaries of the system to be analysed. System
boundaries are established through a discussion
between the analyst(s) and the decision maker. The
type of questions addressed/excluded must also be
agreed upon. For example: which alternative PM
strategies are considered? Should questions related to
spare parts be included? The influence diagram is used
actively during this process. The decision alternative
to be evaluated in the study should be clearly
identified (decision nodes).
The present approach is based on having the
following objective of the analysis:
- to identify a loss function that corresponds to the
  overall objectives of the plant, and to minimize the
  mean of this function with respect to the type and
  amount of PM carried out for plant components.
The loss function will typically be a function of some
variables, e.g., performance measures like downtime
hours and number of shut downs, see eqn (1). The
expectation of the loss function with respect to these
variables will in this paper be denoted the risk
function.
The steps 2-4 required to achieve this objective are
discussed below.
3.2 Establishing loss function and preferences

The overall system performance measure (loss)
represents a synthesis of the management objectives
for plant activity, and is, as such, the basis for the
maintenance optimization model. Thus, the process of
establishing the overall loss function is a very crucial
part of the method. It is the objectives and
preferences of the management which are the main
inputs to this step in the analysis.

3.2.1 The risk function


The loss function measures the degree of goal
attainment with respect to the identified main
objectives of the plant. The risk function (i.e., the
expected loss) is established in four steps:
1. The plant management provides a list of the
main objectives of the plant, and from this list
they and the analysts agree on which performance measures are appropriate for measuring
the degree to which these objectives are fulfilled.
This discussion is carried out assisted by the
influence diagram. Some compromising may be
required to keep the analysis at a reasonable
level of complexity. Thus, the analyst is
responsible for ensuring that they keep to
performance measures which make the analysis
manageable, e.g., not too costly.
2. The form of the loss function is then obtained by
combining the chosen system performance
measures with cost figures in an appropriate way.
Equation (1) represents one very simple example
of such a loss function. For simplicity we restrict
ourselves to linear loss functions in the present
description of the method.
3. The costs related to each performance measure
are also assessed. Some of these costs are
'objective' figures, e.g., man-hour costs and costs
of production unavailability. However, costs
related to safety may require difficult and
controversial prioritizations by the management,
and will be discussed separately, see below.
4. By replacing the performance measures with
their mean values, the loss function is transformed to a risk function, which can then be
minimized. Dependability modelling is required
to express the attributes (mean value of the
performance measures) as a function of maintenance action/frequency. This reliability and
maintenance modelling is the topic of Section
3.3. Note that for non-linear loss functions the
true risk function cannot be obtained by
inserting expected values of the performance
measures; instead, the expectation of the entire loss
function should be calculated with respect to the
performance measures.
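
The remark on non-linear loss functions in step 4 can be checked numerically. The sketch below uses an invented discrete distribution for the yearly number of shutdowns and two invented loss functions (one linear, one quadratic); it only illustrates that inserting the mean value reproduces the risk in the linear case but not in the non-linear case.

```python
def expectation(func, distribution):
    """E[func(X)] for a discrete distribution given as {value: probability}."""
    return sum(prob * func(value) for value, prob in distribution.items())


# Hypothetical distribution of the number of shutdowns per year.
shutdowns = {0: 0.5, 1: 0.3, 4: 0.2}


def linear_loss(x):
    # Hypothetical linear loss: a fixed cost per shutdown.
    return 100.0 * x


def quadratic_loss(x):
    # Hypothetical non-linear loss: cost grows with the square of the count.
    return 100.0 * x ** 2


mean_shutdowns = expectation(lambda x: x, shutdowns)

# Linear case: loss at the mean equals the mean of the loss.
linear_plug_in = linear_loss(mean_shutdowns)
linear_risk = expectation(linear_loss, shutdowns)

# Non-linear case: the plug-in value understates the true risk here.
quadratic_plug_in = quadratic_loss(mean_shutdowns)
quadratic_risk = expectation(quadratic_loss, shutdowns)
```

For a convex loss such as the quadratic one, the plug-in value is never larger than the true risk, so a simplified calculation based on mean values would tend to understate the expected loss.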

3.2.2 Safety performance measures


One objection to a method as described here is the
potential difficulties of relating measures of safety
performance to economic losses, and combining these
into one overall loss function. In particular, it is a
rather sensitive matter to quantify the cost of loss of a
human life. Another problem is that safety performance
measures tend to require rather complex
dependability analyses, which make the total analysis
too complex and costly to be attractive for plant
management.
However, it is now high time to meet this challenge
in order to achieve a proper maintenance optimization
method. The problem of including safety aspects in
the analysis must be approached, rather than be
'swept under the carpet' due to the controversies
involved. A maintenance optimization method such as
RCM can never achieve credibility when claiming to
prioritize safety without trying, at least in a rough
way, to quantify safety and in some way relate this to
the costs involved.
Various types of safety performance measures can
be defined, cf. Figs 1 and 2. These measures pose
quite different requirements on the decision maker's
willingness/ability to quantify the costs involved. We
consider three options:

(i) Number of accidents


This type of safety performance measure is illustrated
in Fig. 2. In this case the exact consequences of the
various accidents are not evaluated. This has two
distinct advantages. Firstly, the controversial question
of, e.g., assessing the cost of a human life is avoided,
and therefore it is expected that the manager is quite
willing/able to assess the cost of the event; i.e., he
states how much he is willing to pay to avoid such an
event. Secondly, the dependability analysis becomes
quite simple.

(ii) Health and environmental damage


This is the type of safety performance measure
illustrated in Fig. 1:
- number of persons injured;
- number of persons killed;
- cubic metres of pollution of a specific type.
It is expected that the decision maker will be more
reluctant to give his cost estimates with respect to
these performance measures than those in (i). Further,
the measures (ii) will result in an overall model which
is more complex, as ETA of various accidents must be
performed.

(iii) Occurrences of damage categories


An intermediate approach will be to define the
consequences by 'damage categories' rather than
specifying the exact consequence of the activity in
terms of number of injured/killed, etc. (see Ref. 10).
Typical damage categories of accidents are:
- debilitating injury or long-term illness;
- 1-3 fatalities;
- more than 3 fatalities;
- pollution greater than x m^3 of a specified type.
The undesired event (accident) can be developed in
a rather rough way to predict the probability of the
various damage categories.

Relative costs
The introduction of relative weights can be a helpful
tool for achieving costs related to safety performance
measures, see Ref. 10. The decision maker is first
asked to give the relative costs (weights) of the chosen
damage categories.
Figure 3 illustrates the results from such a weighting
procedure, for personnel safety measures. According
to this figure the decision maker indicates that he is
willing to pay 10 times more to avoid an event
involving fatalities, than an event involving debilitating injury (one death corresponds to 10 debilitating
accidents). Default values could be provided for the
weights. However, a decision maker can change these
to reflect his own preferences.
In addition to the weights, a so-called anchoring
point must be provided. That is, one of the damage
categories must be assigned an exact cost. Note that
the damage category giving rise to the least
controversies could be chosen as anchoring point.
Now consider the three damage categories of Fig. 3.
The safety performance measures are X_11, X_12 and
X_13, defined as the number of events (per year)
resulting in DC_11, DC_12 and DC_13, respectively. As an
example, assume that damage category DC_13 is chosen
as the anchoring point, and that the cost of one event
leading to DC_13 is set to C_1 = 5 · 10^6 (see Ref. 10 for
a discussion). Then the contribution to the loss
function from personnel safety measures is given by:

$$L_{\text{Safety}}(X_{11}, X_{12}, X_{13}) = [0.01 \cdot X_{11} + 0.1 \cdot X_{12} + X_{13}]\, C_1 = [0.05 \cdot X_{11} + 0.5 \cdot X_{12} + 5 \cdot X_{13}] \cdot 10^6 \qquad (2)$$

Fig. 3. Weighting of damage categories.
[Figure: weight plotted against damage category.
DC_11 = Minor/moderate injury or illness (weight 0.01);
DC_12 = Debilitating injury or serious long-term illness (weight 0.1);
DC_13 = Death (weight 1).]


In conclusion, through co-operation between
analyst and decision maker, it is a manageable task to
identify an overall loss function for which appropriate
input can be obtained with reasonable effort, and which
can be adapted to the level of ambition of the analysis.
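
The weighting and anchoring procedure described above can be sketched as a small calculation. The weights 0.01, 0.1 and 1 follow Fig. 3; the anchoring cost and the yearly event counts below are assumed values chosen only for illustration, and the function name `safety_loss` is hypothetical.

```python
# Relative weights of the personnel safety damage categories (Fig. 3).
WEIGHTS = {"DC11": 0.01, "DC12": 0.1, "DC13": 1.0}


def safety_loss(events_per_year, weights, anchor_category, anchor_cost):
    """Loss contribution from personnel safety.

    The cost of one event in a damage category is the anchoring cost
    scaled by that category's weight relative to the anchor category.
    """
    anchor_weight = weights[anchor_category]
    return sum(
        weights[category] / anchor_weight * anchor_cost * count
        for category, count in events_per_year.items()
    )


# Assumed mean numbers of events per year in each damage category.
events = {"DC11": 2.0, "DC12": 0.1, "DC13": 0.001}

# Assumed anchoring cost for one event in the anchor category DC13.
loss_anchor_dc13 = safety_loss(events, WEIGHTS, "DC13", 5.0e6)

# The same preferences expressed with DC12 as anchoring point instead:
# one DC12 event then costs one tenth of the DC13 figure.
loss_anchor_dc12 = safety_loss(events, WEIGHTS, "DC12", 5.0e5)
```

Because only the ratios of the weights matter, the decision maker may anchor on whichever damage category is least controversial; as long as the anchoring cost is consistent with the weights, the resulting loss contribution is the same.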

Total loss function


The contribution from loss of safety to the overall
loss function was highlighted above. Other contributions are more straightforward to establish, see
example in Section 3.4. The total loss function is then
easily established if it can be assumed that it is an
additive function. So if the main objectives of the
plant are related to (personnel) safety, environmental
threat, material damage, production loss and maintenance costs, the overall loss function is written

$$L = L_{\text{Safety}} + L_{\text{Environment}} + L_{\text{Material}} + L_{\text{ProdLoss}} + L_{\text{Maintenance}} \qquad (3)$$

3.3 Dependability modelling

The task to be undertaken in step 3, 'description of
the world', is to model the effect that the maintenance
activities have on the system attributes (= mean value
of the performance measures). The modelling is
established in two steps:
Step 3a. Identify how the system attributes are
         given by the dependability of the
         components.
Step 3b. Identify how different maintenance actions
         affect component dependability.
These steps are discussed in Sections 3.3.1 and 3.3.2,
respectively.

3.3.1 System model


The relevant performance measures have been
identified in step 2, see Section 3.2.2. These are all
related to undesired events (i.e., events causing some
costs), see Fig. 1, where the undesired events are
'production unavailable', 'production shuts down',
'accident A', 'accident B' and 'maintenance
performed'.
In general, a set of undesirable events, E_k, k =
1, 2, ..., K, is identified, each representing a scenario.
Usually E_k will be the top event of a fault tree, and an
FTA is carried out to identify the relation
between this top event and various component failure
events (i.e., the causes indicated in Fig. 4). This FTA
modelling provides the 'likelihood', F_k, of undesired
event E_k. This F_k is either the frequency of the event
(e.g., mean number of failures per year of operation),
or the probability of a failure on demand.
It may also be necessary to investigate the
consequences of an undesired event, E_k. If, for
instance, the consequences are given as occurrences of
'damage categories' (see Section 3.2), the probability
p_ij/k that damage category DC_ij occurs is found by
performing an ETA.
The overall concept is illustrated in Fig. 4, indicating
the 'input' and 'output' of one undesired event, E_k.
The total frequency of one particular damage category
DC_ij is given by

$$F_{ij} = \sum_k F_k \, p_{ij/k} \qquad (4)$$

The frequency, F_ij, of damage category DC_ij
represents the input to the risk function.
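
The summation in eqn (4) amounts to weighting each undesired event's frequency by the conditional probability of the damage category and adding over the events. A minimal sketch follows; the event names, frequencies and conditional probabilities are invented for illustration and would in practice come from the FTA and ETA.

```python
def damage_category_frequencies(event_frequencies, conditional_probs):
    """Total frequency of each damage category, following eqn (4):
    the sum over undesired events of F_k * P(damage category | E_k)."""
    totals = {}
    for event, frequency in event_frequencies.items():
        for category, prob in conditional_probs.get(event, {}).items():
            totals[category] = totals.get(category, 0.0) + frequency * prob
    return totals


# Hypothetical frequencies F_k (events per year) from the fault trees.
event_frequencies = {"E1": 0.01, "E2": 0.1}

# Hypothetical conditional probabilities from the event trees.
conditional_probs = {
    "E1": {"DC12": 0.5, "DC13": 0.2},
    "E2": {"DC12": 0.1, "DC13": 0.01},
}

frequencies = damage_category_frequencies(event_frequencies, conditional_probs)
```

The resulting frequencies are the mean numbers of events per year in each damage category, i.e., the quantities that enter the risk function in place of the random counts.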
A challenging part of the analysis is to identify a set
of hazardous events which will form the basis for
evaluating safety performance (e.g., choosing the
events 'Accident A' and 'Accident B' in Fig. 1). The
number of potential events which can be included is
unlimited, and the selection of scenarios to include is a
creative rather than mechanistic process. There are,
however, certain principles for the structuring of the
undesirable events (see, e.g., Ref. 16). Particularly
important is the use of the 'pinch-point' principle,
which means that undesirable events should be
selected so that the consequences of the event are
independent of the event cause. This means, for
instance, that E_k is defined as the occurrence of a
specific damage category, thereby avoiding the
evaluation of an ETA for this event. Furthermore, the
selected undesirable events should, as far as possible,
be disjoint. Some examples of hazardous events
related to an oil production installation are:
E_1: explosion/rupture of tank (e.g., separator) due to
     overpressure;
E_2: hydrocarbon release not initiated by tank rupture;
E_3: fire, not initiated by hydrocarbon release.

3.3.2 Component model


The 'likelihood', F_k, of an undesired event E_k may be
determined by an FTA. The basic events of the fault
tree represent component failure modes. Thus, by
performing standard FTA, the frequencies, F_k, will be
given as a function of the dependability of the
components.
The component dependability can be measured, for
instance, by failure frequency, f, or average
availability, A. All relevant components have one or
several functions. Any component will always have a
function related to support of production, and we
refer to the production availability of the component.
Components of safety systems have, in addition, a
safety function, with a performance measure denoted
safety availability. Failure of a safety function can
either be loss of a protective function, e.g., a safety
valve fails to close, or failure of a component whose failure
poses a safety problem, e.g., external leakage through
a valve. In general we can define an availability
relative to any function of the component.
The component dependability will be determined by
(1) the inherent hazard rate (given by the quality level
of a new component), and (2) the maintenance which
is performed on it during operation. The main
question addressed here is how various maintenance
activities will affect the dependability.
Dependability measures
Let T be the time to failure (TTF) after a new
component is installed, and let R(t) = P(T > t) be the
survival function of the TTF. If a component has two
functions as discussed above, a TTF should be
introduced for both. Thus, for, e.g., fire & gas
detectors there are two survival functions, one for
time to loss of safety and one for time to loss of
production. The MTTF (Mean Time To Failure) is

$$\text{MTTF} = E(T) = \int_0^\infty R(t)\,dt \qquad (5)$$

and the inherent hazard rate equals z(t) = -R'(t)/R(t).
This is the hazard rate that would be
observed for time to first failure, provided no PM is
carried out. It is also denoted, for instance, the naked
failure rate, see Ref. 17.
A repairable component, being as good as new after
each failure, has the asymptotic failure frequency,

$$f = 1/\text{MTTF} \qquad (6)$$

This equals the average frequency of failures,
corresponding to the average ROCOF (Rate of
OCcurrence Of Failure), see Refs 18, 19. In fact, the misused term
'failure rate' could be applied as a synonym for
f = 1/MTTF. If the component is unavailable for a
time MTTR (Mean Time To Repair) at each failure,
the average availability equals

$$A = 1/(1 + f \cdot \text{MTTR}) \qquad (7)$$

The average unavailability (caused by failures with
frequency f) equals

$$U = 1 - A \qquad (8)$$

Fig. 4. Risk quantification.
[Figure: component failures (causes) lead to the undesired event,
whose consequences lead to damage categories.]


For a repairable c o m p o n e n t

(with no reliability

A model.for maintenance optimization


trend) the four alternative dependability measures,
MTTF, f, A and U, provide very much the same
information. Observe that the relations (7) and (8) are
valid for any c o m p o n e n t with failure frequency f ,
irrespective of maintenance action. Using eqns (5) and
(6) we implicitly assume that times between failures
are a series of identically distributed r a n d o m
variables, i.e., R(t) in (6) does not change from failure
to failure.
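As a small illustration of relations (6)-(8), the sketch below converts an MTTF and an MTTR into the other dependability measures. The function name and the numerical inputs are assumptions made for this example only; they are not taken from the paper.

```python
def dependability_measures(mttf_hours: float, mttr_hours: float) -> dict:
    """Return f, A and U for a repairable component that is as good as new
    after each failure (no reliability trend)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    f = 1.0 / mttf_hours              # eq. (6): asymptotic failure frequency
    a = 1.0 / (1.0 + f * mttr_hours)  # eq. (7): average availability
    u = 1.0 - a                       # eq. (8): average unavailability
    return {"f": f, "A": a, "U": u}


# Hypothetical component: MTTF = 10 000 h, MTTR = 10 h
example = dependability_measures(mttf_hours=10_000.0, mttr_hours=10.0)
```

Any one of MTTF, f, A and U (together with MTTR) determines the others here, which is the sense in which the four measures carry the same information.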

Reliability trend
The case of having increasing/decreasing ROCOF
(reliability trend) is not treated in the present paper.
Here we restrict ourselves to the 'standard' models,
where the optimal maintenance strategy will be found,
assuming that components can be brought to a state as
good as new by PM and/or repair. Ignoring an
existing trend will of course invalidate the results. The
effect of any trend can, however, to some extent be
eliminated by updating the reliability optimization
analyses on a regular basis. In other situations it could
be required to introduce models handling trends in a
more formal manner.

Maintenance models
An extensive number of maintenance models are discussed in the literature, see Ref. 20 for a survey. For demonstration purposes five of the most well-known maintenance models for repairable systems are reviewed, providing expressions for failure frequency and availability resulting from the following maintenance models:
1: replace component by failure (i.e., corrective maintenance (CM) only)
2: Age Replacement Policy (ARP)
3: Block Replacement Policy (BRP)
4: Minimal Repair Policy (MRP)
5: periodic testing to detect hidden failures.
A major area which is not covered by the models above is 'on-condition' replacement models (Ref. 2). These models are rather complicated to handle, and often require inputs which are not available.


Maintenance model 1 above represents the possibility of not performing preventive maintenance (PM) at all. Models 2-4 are the most well-known maintenance models for evident failures, i.e., failures that are revealed immediately upon occurrence. Several generalizations are found, e.g., in Ref. 21. Model 5 applies for hidden (dormant) failures. This model is relevant for standby equipment and detectors, having failures that are only revealed by functional testing or actual demands. Note that for this standard model it is assumed that a component is 'as good as new' after a test, and this may actually require a replacement rather than a test.
The actual PM and CM carried out for models 1-5 are summarized in Table 1. For all models, except model 4, the component is 'as good as new' after repair. The 'minimal repair' strategy for model 4 means that the component is repaired to the state it had immediately before a failure occurred. Observe that for model 5 we have classified the replacements performed at fixed intervals as PM (and not CM), even though the component may have failed at this instant.
The expressions of f and/or U = 1 - A for these models are summarized in Table 2. These results are either found directly in the literature, see, e.g., Refs 18, 21-23, or they are rather straightforward to derive. All expressions are given in terms of the R(t) and z(t) of the TTF of a component which is as good as new and which is not maintained (cf. model 1). The expressions in Table 2 demonstrate how the replacement interval, $\tau$, affects the failure frequency and unavailability of the component subject to a specific maintenance strategy. The expressions for f are for some models also given for the special case that the TTF follows a Weibull distribution, i.e. $R(t) = \exp(-(\lambda t)^{\alpha})$. Observe that alternatives to the Weibull distribution of course exist, see, e.g., Ref. 24.
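To make the role of the interval $\tau$ concrete, the following sketch evaluates failure frequency and unavailability for the five models when the TTF is Weibull distributed. It is an illustration only: the expressions coded here are the standard renewal-theory forms for these policies and are assumed, not confirmed, to coincide with the entries of Table 2. The function names, the Simpson integration helper and all numerical inputs are hypothetical.

```python
import math


def survival(t: float, lam: float, alpha: float) -> float:
    """Weibull survival function R(t) = exp(-(lam * t) ** alpha)."""
    return math.exp(-((lam * t) ** alpha))


def integrate(func, a: float, b: float, n: int = 2000) -> float:
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0


def f_corrective_only(lam: float, alpha: float) -> float:
    """Model 1: f = 1 / MTTF, with MTTF = Gamma(1/alpha + 1) / lam."""
    return lam / math.gamma(1.0 / alpha + 1.0)


def f_age_replacement(tau: float, lam: float, alpha: float) -> float:
    """Model 2 (ARP): failures per unit time over a renewal cycle."""
    mean_cycle = integrate(lambda t: survival(t, lam, alpha), 0.0, tau)
    return (1.0 - survival(tau, lam, alpha)) / mean_cycle


def f_block_replacement(tau: float, lam: float, alpha: float) -> float:
    """Model 3 (BRP), using the approximation V(tau) ~ 1 - R(tau)."""
    return (1.0 - survival(tau, lam, alpha)) / tau


def f_minimal_repair(tau: float, lam: float, alpha: float) -> float:
    """Model 4 (MRP): average of the hazard rate z(t) over (0, tau]."""
    return lam ** alpha * tau ** (alpha - 1.0)


def u_periodic_test(tau: float, lam: float, alpha: float) -> float:
    """Model 5: mean fraction of the test interval spent in a failed state."""
    return 1.0 - integrate(lambda t: survival(t, lam, alpha), 0.0, tau) / tau


# Hypothetical inputs: lam = 1e-4 per hour, alpha = 1.5, tau = 1000 hours
f_brp = f_block_replacement(1000.0, 1e-4, 1.5)
f_mrp = f_minimal_repair(1000.0, 1e-4, 1.5)
```

For an exponential TTF (alpha = 1) models 1, 2 and 4 all return the constant rate lam, and the model 5 unavailability is close to lam * tau / 2 when lam * tau is small, which gives a quick sanity check on the sketch.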
For model 3, the frequency of failures is $f = V(\tau)/\tau$, where V(t) is the mean number of failures in the interval (0, t], see, e.g., Ref. 23. However, if the interval $\tau$ is short compared to MTTF, we have $V(\tau) \approx 1 - R(\tau)$, and the approximate expression of Table 2 will be sufficiently accurate. Models 3 (BRP) and 4 (MRP) are rather similar, but have different repair strategies. In the MRP model a minimal repair is performed if the component fails, and in the BRP model a replacement is carried out. So, these models represent two extreme repair strategies, and it is reassuring that if the TTF follows a Weibull distribution these two models give approximately the same results (for small and moderate $\tau$). Thus, without specifying the exact amount of repair by failure, $f \approx \lambda^{\alpha}\tau^{\alpha-1}$ is a reasonable approximation in the Weibull case. If TTF is exponentially distributed with parameter $\lambda$, the unavailability, U, of model 5 will, for small $\lambda\tau$, approximately equal $\lambda\tau/2$, which is the well-known and frequently used formula for safety unavailability (Mean Fractional Dead Time).
The unavailability due to performing periodic PM must be accounted for separately. For models 3-5, this unavailability is $U_1 = \mathrm{MTTT}/(\mathrm{MTTT} + \tau)$, where MTTT = average time the function is unavailable in order to perform a test/PM. The two unavailabilities, U and $U_1$, should be added to give the total (approximate) unavailability, due to maintenance.

Table 1. Maintenance strategies

| Maintenance model | Corrective maintenance | Preventive maintenance |
|---|---|---|
| 1: Corrective maintenance only | Replace by failure | None |
| 2: Age Replacement Policy (ARP) | Replace by failure | Replace after a time $\tau$ of operation |
| 3: Block Replacement Policy (BRP) | Replace by failure | Replace at fixed instants, $k\tau$ (k = 1, 2, ...) |
| 4: Minimal Repair Policy (MRP) | 'Minimal repair' by failure | Replace at fixed instants, $k\tau$ (k = 1, 2, ...) |
| 5: Periodic testing; hidden failures | None | Replace at fixed instants, $k\tau$ (k = 1, 2, ...) |

Table 2. Main results for maintenance strategies

| Model | Failure frequency, f (general) | Failure frequency, f (Weibull) | Unavailability, U, due to failures |
|---|---|---|---|
| 1 | $1 / \int_0^\infty R(t)\,dt$ | $\lambda / \Gamma(\alpha^{-1} + 1)$ | $1 - \dfrac{1}{1 + f \cdot \mathrm{MTTR}}$ |
| 2 | $(1 - R(\tau)) / \int_0^\tau R(t)\,dt$ | | $1 - \dfrac{1}{1 + f \cdot \mathrm{MTTR}}$ |
| 3 | $\approx (1 - R(\tau)) / \tau$ | $(1 - e^{-(\lambda\tau)^{\alpha}}) / \tau \approx \lambda^{\alpha}\tau^{\alpha-1}$ | $1 - \dfrac{1}{1 + f \cdot \mathrm{MTTR}}$ |
| 4 | $\dfrac{1}{\tau}\int_0^\tau z(t)\,dt$ | $\lambda^{\alpha}\tau^{\alpha-1}$ | $1 - \dfrac{1}{1 + f \cdot \mathrm{MTTR}}$ |
| 5 | | | $1 - \dfrac{1}{\tau}\int_0^\tau R(t)\,dt$ |

3.3.3 Splitting hazard rate on failure mode and failure cause

The maintenance models 1-5 form the basis for identifying a maintenance strategy/frequency for the components. However, a maintenance strategy should ideally be defined relative to specific failure causes and specific failure modes, and so it is required to perform a splitting w.r.t. component failure types.

Failure mode
A basic event in the fault tree for an event $E_k$ represents the occurrence of a specific failure mode of a component. For safety systems, having two main functions (related to safety and regularity), there are also two main failure modes, often denoted Fail-To-Operate and Spurious Operation, respectively, e.g., see Ref. 25. If $E_k$ represents a safety related event, the basic events will usually refer to the failure mode Fail-To-Operate and maintenance model 5 applies, whereas in other fault trees the models 1-4 apply for the basic events.
If the two failure modes of a component enter two different fault trees, the maintenance strategies with respect to these failure modes must of course be co-ordinated. For the component of a safety system, a complete PM is carried out at each test, i.e., the component is 'as good as new' with respect to both hidden and evident failures. Of course model 5 applies with respect to dormant failures, but this means that the component is 'replaced' at fixed instants, $k\tau$ (k = 1, 2, ...), also with respect to evident failures, and either a BRP or an MRP applies with respect to these.
Failure cause
The specification of a maintenance strategy related to a specific failure mode is based on the hazard rate corresponding to this basic event, i.e., $z_{\text{basic event}}(t)$. For instance, if $E_k$ is the event spurious system trip, then $z_{\text{basic event}}(t)$ could be the hazard rate corresponding to the failure mode spurious closing for a shutdown valve (SDV). The failure mechanism/cause is important for deciding whether an aging effect (or burn in effect) will be present, and in order to identify an optimal maintenance strategy we split the $z_{\text{basic event}}(t)$ according to main failure causes. Further, it is sensible to identify the part where the failure occurs.
Considering the failures of, for instance, a shutdown valve (SDV), we present some examples of failure cause categories. Examples 1-2 are relevant for the basic event 'spurious operation (closing) of an SDV', and examples 3-5 apply for the basic event 'fail to operate (close) an SDV'.
1: Trip of the pilot valve, due to electronic failure.
2: Leakage of control line due to mechanical damage (e.g., falling object).
3: Mechanical degradation of the main valve (includes several failure mechanisms, corrosion, erosion, ...).
4: Internal leakage in the actuator, due to pitting caused by condensation in the bottom of the actuator or damaged seal material.
5: Blocking of control line due to control fluid contamination.
Based on such a classification we write the hazard rate of the relevant basic event as

$$z_{\text{basic event}}(t) = z_1(t) + z_2(t) + z_3(t) + \cdots \tag{9}$$

where $z_j(t)$ is the contribution to the hazard rate due to failure category j. Observe that this modelling is based on the assumption that any failure of the component can be 'allocated' one cause category, and that a purely additive model applies for the hazard rate.
The knowledge of the various failure causes/mechanisms must now be collected from maintenance personnel (and field data) to predict the form of each of these hazard rates, $z_1(t), z_2(t), \ldots$ A maintenance strategy will then be chosen for each of these, cf. Fig. 5. For some failure categories, the hazard rate is decreasing (or constant), and no PM should be carried out; otherwise there should be a PM carried out at an interval, $\tau$. Thus, for each failure category we choose one of the 5 maintenance models, according to a decision logic similar to that applied in today's RCM analysis.
The intervals, $\tau_j$, are at this stage still unknown, as it is the objective of the total approach to arrive at optimal values for these. The grouping into failure categories must of course not be too detailed; otherwise the maintenance strategy becomes unmanageable. One or two (possibly three) $\tau$'s per component will usually be sufficient. If two $\tau$'s are identified, prior knowledge should be utilized to identify which of these $\tau$'s is the largest, and define this to be a multiple of the smaller $\tau$. Without such restrictions the optimal strategy will be difficult to manage and become very unattractive.
The next step is to find the failure frequency, $f_j$, or the unavailability, $U_j$, corresponding to failure category j (i.e. to failures with hazard rate $z_j(t)$). The $f_j$ and $U_j$ are calculated directly from the appropriate formula in Table 2. For models 1-3 and 5 the corresponding survival function is required:

$$R_j(t) = \exp\left(-\int_0^t z_j(u)\,du\right) \tag{10}$$

The total 'failure rate' of the basic event equals

$$f_{\text{basic event}} = \sum_j f_j \tag{11}$$

The unavailability, $U_j$, can also be calculated for each failure category, and these are added to get the total unavailability (failure probability) of the basic event, i.e., approximately $U_{\text{basic event}} \approx \sum_j U_j$.
To summarize, the following steps are carried out to specify a complete maintenance strategy of a component failure mode, see Fig. 5:
- identify main failure causes, and split the hazard rate, $z_{\text{basic event}}(t)$, into the contributions of the various failure causes;
- predict the form of a hazard rate, $z_j(t)$, and choose one of the maintenance models 1-5 for each failure cause;
- retrieve the appropriate failure frequency and/or unavailability from Table 2;
- add the failure frequency and/or the unavailability of each failure category to obtain the total frequency/unavailability for the basic event, giving a measure of component dependability.
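The steps above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions: each failure cause is given a Weibull hazard rate, only model 1 (no PM) and model 4 (minimal repair with replacement every $\tau$) are covered, and the class name, function names and numerical values are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FailureCause:
    name: str
    lam: float             # Weibull scale parameter (per hour), assumed known
    alpha: float           # Weibull shape parameter
    tau: Optional[float]   # PM interval in hours; None means no PM (model 1)


def cause_frequency(cause: FailureCause) -> float:
    """Failure frequency f_j of one failure cause category."""
    if cause.tau is None:
        # Model 1: f = 1 / MTTF for a Weibull TTF
        return cause.lam / math.gamma(1.0 / cause.alpha + 1.0)
    # Model 4: f = lam^alpha * tau^(alpha - 1)
    return cause.lam ** cause.alpha * cause.tau ** (cause.alpha - 1.0)


def basic_event_frequency(causes: Iterable[FailureCause]) -> float:
    """Additive model: the basic event frequency is the sum of the f_j."""
    return sum(cause_frequency(c) for c in causes)


# Hypothetical split of 'spurious closing of an SDV' into two causes
sdv_spurious = [
    FailureCause("pilot valve trip (electronic)", lam=1e-4, alpha=1.0, tau=None),
    FailureCause("control line leakage", lam=1e-4, alpha=1.1, tau=1440.0),
]
f_total = basic_event_frequency(sdv_spurious)
```

The same pattern applies to unavailabilities: compute $U_j$ per failure cause with the chosen model and add them to approximate the unavailability of the basic event.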

In short, this shows how to go from the component hazard rate to a recommended maintenance strategy, and further find an expression for the component's dependability as a function of the $\tau$'s. By inserting these in the various FTAs, the complete expression for the overall risk function is obtained. It then only remains to determine the optimal $\tau$'s, corresponding to the chosen maintenance strategies.

Fig. 5. Maintenance strategy for various failure causes, and component dependability.

Note that the failure cause level of a component often corresponds to a part failure on a lower level. For example, a pump is a component, and one failure cause is a broken impeller (part). Scheduled replacement of the impellers does not, however, affect other failure causes, such as, e.g., a seal oil system failure. This means that the $\tau$'s for the various failure causes are generally not the same. The discussion on whether it is efficient to pack the corresponding maintenance tasks for one component into one maintenance schedule (package) or not is beyond the scope of this paper.

3.4 Results compilation--a numerical example


The suggested approach is illustrated by a small example.

3.4.1 System description


A simplified process system is used as an example system, see Fig. 6. The process medium is pre-processed in two separate vessels. Each vessel has 100% capacity, so the two vessels provide full redundancy. Each vessel has its own supply pump, and the two pumps are driven by a common motor. The process is completed in the reactor.
To avoid high pressure in the reactor, various protective devices are supplied in order to shut off the process medium. Under normal production, the process control valve (PCV) regulates the inlet of the process medium. Upon detecting a high level in the reactor, a level transmitter (LT) transmits a signal to the PCV so that the flow through the PCV is reduced. If the process control system fails, the process shut-down system represents the final safety barrier. Upon high reactor pressure, a pressure transmitter (PT) activates the process shut-down valve (SDV), and the process medium is shut off.


3.4.2 Establishing the loss function


Initially, the influence diagram of Fig. 7 is established, showing overall performance measures related to five main objectives: personnel safety, environmental threat, material damage, loss of production and maintenance costs. The safety evaluation will be based on the occurrence of three undesired events, defined as
$E_1$ = reactor explosion
$E_2$ = external leakage from pumps or valves
$E_3$ = spurious trip of process.
Observe that the event $E_3$, apart from causing production loss, also can cause material damage.
Safety performance will be measured with respect to the occurrence of, in total, 7 damage categories, related to personnel safety, environmental threat and material damage, respectively.
1. Personnel safety
$DC_{11}$ = minor/moderate injury or illness
$DC_{12}$ = debilitating injury or serious illness
$DC_{13}$ = death.
2. Environmental threat
$DC_{21}$ = (minor) release of process medium (e.g. due to leakage in valve or pump)
$DC_{22}$ = major release of process medium (e.g. upon explosion in reactor).
3. Material damage
$DC_{31}$ = minor damage to pumps or motor (e.g., upon spurious trip of reactor)
$DC_{32}$ = major damage to reactor (e.g., upon explosions).
Further, introduce $X_{ij}$ as the number of events (per year) resulting in $DC_{ij}$, and these $X_{ij}$'s are then the 7 system performance measures with respect to safety. The performance measures with respect to loss of production are
$X_{41}$ = number of production shutdowns (spurious trips) per year, and
$X_{42}$ = number of hours without production (during a year).

Fig. 6. Example of process system (PT = pressure transmitter, LT = level transmitter).

Fig. 7. Influence diagram for Example.

Finally, the performance measures with respect to maintenance activities are
$X_{51,l}$ = number of tests (per year) for failure cause l,
$X_{52,l}$ = number of preventive replacements (per year) for failure cause l and
$X_{53,l}$ = number of corrective repairs (per year) for failure cause l.
Note that, for each component, maintenance activities may be due to various failure causes.
Also, the costs associated with all these performance measures have to be established. The assessment of costs due to the occurrence of various damage categories requires that the management performs some prioritizations, see Section 3.2.2. Now assume that the management prioritizations result in the weights 0.01, 0.1 and 1.0 for the above three personnel safety categories, giving the following loss function for personnel safety:

$$L_{Safety}(X_{11}, X_{12}, X_{13}) = [0.01 X_{11} + 0.1 X_{12} + X_{13}]\,C_1 = [0.05 X_{11} + 0.5 X_{12} + 5 X_{13}] \cdot 10^6\,\$ \tag{12}$$

where $C_1$ = the cost of one event resulting in $DC_{13}$ = $5 \cdot 10^6$ $ is the anchoring point.
The relative weights of the DCs for environmental threat are given as 0.01 and 1, and the cost of one event resulting in $DC_{22}$ is $C_2 = 10^6$ $. Thus, the loss function with respect to environmental threat is:

$$L_{Environment}(X_{21}, X_{22}) = [0.01 X_{21} + X_{22}]\,C_2 = [0.01 X_{21} + X_{22}] \cdot 10^6\,\$ \tag{13}$$

Assuming the weights 0.001 and 1 for material damage, and the cost $C_3 = 10^5$ $ for $DC_{32}$, we get the following loss function for material damage:

$$L_{Material}(X_{31}, X_{32}) = [0.001 X_{31} + X_{32}]\,C_3 = [0.001 X_{31} + X_{32}] \cdot 10^5\,\$ \tag{14}$$

Further, costs related to loss of production are estimated to:
Cost per trip = $C_{41}$ = 1 000 $
Cost per hour for loss of production = $C_{42}$ = 5 000 $.
Thus, the loss due to trips is given by:

$$L_{Production} = C_{41} X_{41} + C_{42} X_{42} \tag{15}$$

The costs for test and repair/replacement for the various components are given in Table 4. In general, the overall 'loss' related to maintenance activity is given by:

$$L_{Maintenance} = \sum_{l} \left[ \frac{C_{51,l}}{\tau_{Test,l}} + \frac{C_{52,l}}{\tau_{Replacement,l}} + C_{53,l}\,X_{53,l} \right] \tag{16}$$

Here l sums over the number of failure causes. The number of tests per year equals $X_{51,l} = 1/\tau_{Test,l}$, where $\tau_{Test,l}$ is the length of the appropriate test interval. Further, the number of replacements due to PM equals $X_{52,l} = 1/\tau_{Replacement,l}$. The corresponding costs are denoted $C_{51,l}$ and $C_{52,l}$, respectively. Observe that it is assumed that the repair costs equal the replacement costs. Due to the assumed additivity, the overall loss function is now obtained from:

$$L = L_{Safety} + L_{Environment} + L_{Material} + L_{Production} + L_{Maintenance} \tag{17}$$

and the risk (mean loss) is found by replacing $X_{ij}$ with its mean, $F_{ij} = E(X_{ij})$.
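The additive loss function of eqns (12)-(17) is simple enough to evaluate directly. The sketch below uses the weights and anchoring costs quoted in the text; the function name, the dictionary layout for the $X_{ij}$ and the numerical inputs in the example call are assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

C1, C2, C3 = 5e6, 1e6, 1e5      # anchoring costs for DC13, DC22, DC32 ($)
C41, C42 = 1_000.0, 5_000.0     # cost per trip, cost per hour of lost production ($)

# One maintenance entry per failure cause:
# (cost_test, tests_per_year, cost_pm, pm_per_year, cost_cm, cm_per_year)
MaintenanceEntry = Tuple[float, float, float, float, float, float]


def total_loss(x: Dict[Tuple[int, int], float],
               maintenance: Iterable[MaintenanceEntry] = ()) -> float:
    """Overall yearly loss L for event counts x[(i, j)] = X_ij (missing keys count as 0)."""
    def get(i: int, j: int) -> float:
        return x.get((i, j), 0.0)

    safety = (0.01 * get(1, 1) + 0.1 * get(1, 2) + get(1, 3)) * C1   # eq. (12)
    environment = (0.01 * get(2, 1) + get(2, 2)) * C2                # eq. (13)
    material = (0.001 * get(3, 1) + get(3, 2)) * C3                  # eq. (14)
    production = C41 * get(4, 1) + C42 * get(4, 2)                   # eq. (15)
    maint = sum(c_t * n_t + c_pm * n_pm + c_cm * n_cm                # eq. (16)
                for c_t, n_t, c_pm, n_pm, c_cm, n_cm in maintenance)
    return safety + environment + material + production + maint     # eq. (17)


# Hypothetical yearly means F_ij used in place of X_ij to obtain the risk
risk = total_loss({(1, 3): 1e-4, (4, 1): 2.0, (4, 2): 10.0},
                  maintenance=[(750.0, 24.0, 1000.0, 0.0, 1000.0, 0.5)])
```

Passing the means $F_{ij}$ instead of observed counts gives the risk (mean loss), since the loss is linear in the $X_{ij}$.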

3.4.3 System dependability

Notation
In the following, dependability parameters are identified by various indices. The first index refers to the actual failure mode, whereas the second index represents the component identifier. The abbreviations are defined in Table 4.
Explicit formulae for the frequencies $F_k$ of the undesired events $E_k$ (k = 1, 2, 3) are in this example easily obtained without using FTA.
The safety protection of the reactor fails if both the process control system and the shut-down system fail. The frequency of (safety critical) failures initiated by the process control system is $f = f_{FTO,PCV} + f_{FTO,LT}$. The (safety) unavailability of the shut-down system is $U = 1 - A_{FTO,SDV} A_{FTO,PT}$. The frequency of loss of safety function is then given by

$$F_1 = F = (1 - A_{FTO,SDV}\,A_{FTO,PT})(f_{FTO,PCV} + f_{FTO,LT}) \tag{18}$$

So, in the example, it is assumed that the frequency of reactor explosions ($F_1$) equals the frequency of loss of safety protection (F), i.e., $F_1 = F$. The frequency of external leakage (EXL) from pumps or valves equals

$$F_2 = 2 f_{EXL,PU} + f_{EXL,SDV} + f_{EXL,PCV} \tag{19}$$

Further, the frequency of production shut-down (spurious trip rate) is:

$$F_3 = f_{LMF,MO} + (f_{LMF,PU} + f_{LMF,PP})(1 - A_{LMF,PU}\,A_{LMF,PP}) + f_{SO,SDV} + f_{SO,PT} + f_{SO,PCV} + f_{SO,LT} + f_{LMF,RE} \tag{20}$$

An ETA should now be used to assess the frequency of the various safety related damage categories. It is assumed that the ETA of the hazardous event $E_1$ (reactor explosion) resulted in three possibilities,
Consequence 1 (minor) results in $DC_{11}$ (probability 0.6)
Consequence 2 (moderate) results in both $DC_{12}$ and $DC_{32}$ (probability 0.35)
Consequence 3 (major) results in $DC_{13}$, $DC_{22}$ and $DC_{32}$ (probability 0.05).
These probabilities will explain the probabilities $p_{ij,1}$, given in the first line of Table 3. Further, from Table 3, it is seen that $E_2$ (external leakage) with probability 1 leads to $DC_{21}$, and $E_3$ (spurious trip) with probability 0.5 leads to $DC_{31}$. The frequency of damage category $DC_{ij}$ is now given by $F_{ij} = \sum_k F_k \cdot p_{ij,k}$ (i = 1, 2, 3).

Table 3. Probability, $p_{ij,k}$, of damage category $DC_{ij}$, given an undesired event $E_k$

| Undesirable event | Frequency | $DC_{11}$ | $DC_{12}$ | $DC_{13}$ | $DC_{21}$ | $DC_{22}$ | $DC_{31}$ | $DC_{32}$ |
|---|---|---|---|---|---|---|---|---|
| $E_1$ = Reactor explosion | $F_1$ | 0.6 | 0.35 | 0.05 | 0 | 0.05 | 0 | 0.4 |
| $E_2$ = External leakage from pumps or valves | $F_2$ | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| $E_3$ = Spurious process trip | $F_3$ | 0 | 0 | 0 | 0 | 0 | 0.5 | 0 |

The attributes related to loss of production are $F_{41}$ = mean number of production shutdowns, and $F_{42}$ = mean number of hours without production (during a year). Of course $F_{41}$ equals $F_3$ found above. Further, $F_{42} = 1 - A$, where the production availability (A) in the present example equals

$$A = A_{LMF,MO}\,\bigl[1 - (1 - A_{LMF,PU}\,A_{LMF,PP})^2\bigr]\,A_{SO,SDV}\,A_{SO,PT}\,A_{SO,PCV}\,A_{SO,LT}\,A_{LMF,RE} \tag{21}$$

Observe that no system analysis is required to determine the loss due to maintenance, see formula (16) above.

3.4.4 Component dependability

The system attributes, $F_{ij}$ (i = 1, 2, 3), are now expressed in terms of component dependability parameters such as component availability and component failure frequency. The component dependability parameters further depend on the type and level of maintenance, cf. Section 3.3.2.
For model 4 we will assume a Weibull distributed TTF, and the frequency of failures is given by (see Table 2) $f_j = \lambda^{\alpha}\tau^{\alpha-1}$. If proper lifetime data are available, i.e. failure times are independent and identically distributed (i.i.d.), the parameters $\alpha$ and $\lambda$ can be obtained by standard estimation in the Weibull distribution. The i.i.d. assumptions are often not exactly valid, and consequently the parameters are not obtained so easily. One approach then is to estimate the (average) MTTF in the non-homogeneous data set by simply averaging all failure times, and then use some expert judgement to obtain the shape parameter $\alpha$. The parameters of interest in this situation are hence MTTF and $\alpha$. Thus, the data supplied for the example system is presented in terms of the MTTF and $\alpha$ rather than in $\lambda$ and $\alpha$, using an alternative expression for $f_j$:

$$f_j = \left(\frac{\Gamma(\alpha^{-1} + 1)}{\mathrm{MTTF}}\right)^{\alpha} \tau^{\alpha-1} \tag{22}$$

For model 5 it is assumed that TTF is exponentially distributed, and the (safety) availability is given by $A \approx 1 - \tau/(2\,\mathrm{MTTF})$. Some example data are shown in Table 4, to illustrate the technique.
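As a worked illustration of how the component formulas feed the system formulas, the sketch below evaluates the MTTF-based expression for model 4, the model 5 availability, and then the frequency of loss of safety protection as the product of shut-down system unavailability and process control failure frequency. The function names are hypothetical, each component is represented by a single failure cause for brevity, and the input values are illustrative only rather than a reproduction of the paper's results.

```python
import math

HOURS_PER_DAY = 24.0
HOURS_PER_YEAR = 8760.0


def f_model4(mttf: float, alpha: float, tau: float) -> float:
    """Failure frequency (per hour) under minimal repair with PM every tau hours,
    parameterized by MTTF and shape: lam = Gamma(1/alpha + 1) / MTTF."""
    lam = math.gamma(1.0 / alpha + 1.0) / mttf
    return lam ** alpha * tau ** (alpha - 1.0)


def availability_model5(mttf: float, tau: float) -> float:
    """Safety availability for hidden failures tested every tau hours
    (exponential TTF, tau small compared to MTTF)."""
    return 1.0 - tau / (2.0 * mttf)


# Illustrative inputs (assumed): MTTF in hours, intervals in days
f_pcv = f_model4(mttf=10_000.0, alpha=1.2, tau=180 * HOURS_PER_DAY)
f_lt = f_model4(mttf=10_000.0, alpha=1.1, tau=180 * HOURS_PER_DAY)
a_sdv = availability_model5(mttf=20_000.0, tau=15 * HOURS_PER_DAY)
a_pt = availability_model5(mttf=15_000.0, tau=15 * HOURS_PER_DAY)

# Frequency of loss of safety protection, per hour and per year
f1_per_hour = (1.0 - a_sdv * a_pt) * (f_pcv + f_lt)
f1_per_year = f1_per_hour * HOURS_PER_YEAR
```

With alpha = 1 the model 4 expression collapses to 1/MTTF regardless of $\tau$, which reflects that preventive replacement brings no benefit when the hazard rate is constant.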
The final step in the analysis is to carry out the optimization. The optimization of intervals can be treated as a 'resource allocation problem' and solved by standard operational research procedures such as linear and dynamic programming, see Ref. 26. To solve the optimization problem, we have introduced some simplifications. First we have transformed the problem to a discrete problem by assuming that the actual interval length is either two weeks, one month, two months, three months, six months, one year, one and a half years or two years. To solve the discrete problem it could be possible to find the solution by exhaustive computer searching. Instead of doing that, the practical approach used here is to select initial values for each interval length, then obtain an optimum for each interval by taking one at a time. This procedure can then be repeated once or twice. The 'optimum' interval lengths thus obtained are shown in the rightmost column of Table 4. We observe that the intervals are shorter for equipment applying model 5 than for equipment applying model 4. The reason for this is the linear drop in unavailability, $U(\tau) = \tau/(2\,\mathrm{MTTF})$ in model 5, whereas the drop in the frequency, $f(\tau) = \lambda^{\alpha}\tau^{\alpha-1}$, is rather low for $\alpha$ being in the order of 1.5. The result for the two failure modes of the pump may at first look surprising; for the failure mode LMF, the replacement interval is four times longer than for the EXL failure mode in spite of the fact that $\mathrm{MTTF}_{LMF} < \mathrm{MTTF}_{EXL}$. This can be explained by the lower replacement cost related to the EXL failure mode. Further the EXL failure mode has an effect on the environment, with a corresponding high penalty (cost).
Finally, note that if there is some uncertainty with respect to choice of maintenance (model 1-5), an optimization can also be carried out with respect to these. Another way to utilize the analysis is to start from initial values and perform an importance ranking of the $\tau$'s with respect to the given risk function, showing which give the highest potential for improvement.
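The one-interval-at-a-time procedure described above can be sketched as a simple coordinate-wise search over the discrete set of interval lengths. This is an illustration, not the authors' implementation: the candidate values in days approximate the eight interval lengths listed in the text, and `toy_risk` is a hypothetical stand-in for the overall risk function.

```python
from typing import Callable, List, Sequence

# Two weeks ... two years, expressed in days (approximate, assumed values)
CANDIDATE_DAYS = [14, 30, 60, 90, 180, 365, 548, 730]


def coordinate_search(risk: Callable[[Sequence[float]], float],
                      start: Sequence[float],
                      candidates: Sequence[float] = CANDIDATE_DAYS,
                      sweeps: int = 3) -> List[float]:
    """Optimize one interval at a time, keeping the others fixed,
    and repeat the sweep a few times or until nothing changes."""
    current = list(start)
    for _ in range(sweeps):
        changed = False
        for i in range(len(current)):
            best = min(candidates,
                       key=lambda c: risk(current[:i] + [c] + current[i + 1:]))
            if best != current[i]:
                current[i] = best
                changed = True
        if not changed:
            break
    return current


def toy_risk(taus: Sequence[float]) -> float:
    """Hypothetical risk: PM cost falls with tau, failure cost grows with tau."""
    return sum(1000.0 / t + 0.5 * t for t in taus)


intervals = coordinate_search(toy_risk, start=[365, 365])
```

Such a search evaluates far fewer combinations than exhaustive enumeration, at the price of possibly stopping in a local optimum when the intervals interact strongly through the risk function.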
Table 4. Component data used in the analysis. Data is supplied for each failure cause for all failure modes. The two rightmost columns are results from the analysis

| Component | Fail. mode | Failure cause | MTTF (hrs) | Shape ($\alpha$) | MTTR (hrs) | Cost test | Cost repl. | Maint. task | $\tau$ (days) |
|---|---|---|---|---|---|---|---|---|---|
| PU | LMF | Mechanical wear | 5000 | 1.50 | 3.0 | 0 | 2500 | Model 4 | 360 |
| PU | EXL | Loose seals | 10000 | 1.20 | 3.0 | 0 | 750 | Model 4 | 90 |
| PP | LMF | Deposition | - | 2.00 | 3.0 | 0 | 5000 | NA | - |
| MO | LMF | Mechanical wear | 10000 | 1.10 | 1.0 | 0 | 500 | Model 4 | 360 |
| MO | LMF | Control & monitoring | 10000 | 1.00 | 1.0 | 0 | 500 | NA | - |
| PCV | FTO | Mech. deg. valve/act | 10000 | 1.20 | 15.0 | 0 | 4000 | Model 4 | 180 |
| PCV | FTO | Contr. line blocked | 10000 | 1.20 | 15.0 | 0 | 4000 | Model 4 | 360 |
| PCV | SO | El. pilot failure | 10000 | 1.00 | 10.0 | 0 | 1000 | NA | - |
| PCV | SO | Contr. line leakage | 10000 | 1.10 | 10.0 | 0 | 1000 | Model 4 | 90 |
| PCV | EXL | Loose seals | - | 1.20 | 10.0 | 0 | 750 | NA | - |
| SDV | FTO | Mech. deg. valve/act | 20000 | 1.00 | 10.0 | 1000 | 4000 | Model 5 | 15 |
| SDV | FTO | Contr. line blocked | 20000 | 1.00 | 10.0 | 1000 | 4000 | Model 5 | 15 |
| SDV | SO | El. pilot failure | 7500 | 1.00 | 10.0 | 0 | 1000 | NA | - |
| SDV | SO | Contr. line leakage | 7500 | 1.10 | 10.0 | 0 | 1000 | Model 4 | 60 |
| SDV | EXL | Wear | - | 1.20 | 10.0 | 0 | 750 | NA | - |
| LT | FTO | Natural aging | 10000 | 1.10 | 2.0 | 0 | 1000 | Model 4 | 180 |
| LT | SO | Set point drift | 10000 | 1.20 | 2.0 | 0 | 500 | Model 4 | 90 |
| PT | FTO | Natural aging | 15000 | 1.00 | 2.0 | 750 | 1000 | Model 5 | 15 |
| PT | SO | Set point drift | 10000 | 1.20 | 2.0 | 0 | 500 | Model 4 | 90 |
| RE | LMF | Wear | 10000 | 1.50 | 10.0 | 0 | 10000 | Model 4 | 180 |

Abbreviations:
Components: PU = Pump, MO = Motor, PP = Preprocessing vessel, SDV = Shut-down valve, PCV = Process control valve, PT = Pressure transmitter, LT = Level transmitter, RE = Reactor vessel.
Failure modes: FTO = Fail to operate, SO = Spurious operation, LMF = Loss of main function, EXL = External leakage.

4 SUMMARY

Maintenance optimization methods combine most RAMS (Reliability-Availability-Maintainability-Safety) techniques into one unified approach, for instance involving
- reliability analysis, e.g., the modelling of life time distributions and root cause analysis,
- maintenance analysis, in principle combining any PM and CM model,
- safety and risk analysis, e.g., the use of acceptance criteria and scenario analysis.
Further, decision theory is used, and data analysis
and expert judgement methods are essential in order
to establish credible input data to the analysis. Finally,
thorough operational experience is required to
establish optimal methods. The need for this diverse
expertise may be one of the reasons why it has been
so hard to implement model based maintenance
approaches in practice.
It has been a main objective of this paper to collect
essential results and models from various fields, and
give a unified presentation. From this we claim that a
model based approach to maintenance optimization is
truly a manageable task.
The present p a p e r essentially provides the overall
description of a concept for maintenance optimization,
and some detailing and generalizations may be desired
for practical applications. So this is also a very fruitful
area for further interdisciplinary research, and
S I N T E F intend to develop this concept further. Some
interesting model extensions of the method are:
• include models for imperfect repair (both for PM
  and CM) for a specific failure category,
• include models for opportunity maintenance,
• include models for condition based maintenance,
• elaborate on redundancy modelling, e.g., common
  cause failures and staggered testing,
• include use of acceptance criteria from a 'loss
  function point of view',
• present a practical approach for large systems, e.g.,
  including a screening process, possibly using
  measures of importance,
• allow for use of true utility functions in situations
  where the effect of the alternatives on the attributes
  is not known in advance, i.e., there is uncertainty
  about the parameters describing 'the world' (Ref. 27),
• make more explicit use of decision logic. The decision
  logic of current RCM practice could be maintained.
  The various methods developed in this article
  should be used as decision support when using the
  logic,
• specify an MTTR model, i.e., the relation between
  the MTTR and maintenance activity (step 3b),
• compile the results (interval lengths of various
  maintenance tasks) into so-called maintenance
  schedules.
The main assumptions and restrictions for using the
approach as described in the present paper are:
• the various failure modes of a component have
  individual hazard rates that sum up to the total
  hazard rate, z(t), of the component,
• the hazard rate of a specific failure mode is a sum of
  contributions from the various failure cause
  categories (any failure belongs to one failure
  category only),
• the TTFs corresponding to the hazard rates of
  specific failure modes/failure causes behave
  as independent random variables. In particular,
  repair of one failure category does not affect the
  TTFs of other failure causes, i.e., no censoring.
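
The additive hazard structure assumed above can be illustrated with a minimal sketch. The Weibull form of each cause-specific hazard, the parameter values and all names (FAILURE_MODES, component_hazard, etc.) are assumptions made for illustration only; they are not taken from the model in this paper. The sketch shows that, when the cause-specific TTFs are independent, the product of their survival functions gives a component survival function whose hazard rate is the sum z(t) of the individual hazard rates.

```python
import math

# Hypothetical (shape, scale) Weibull parameters for each failure cause,
# grouped by failure mode. Values are illustrative assumptions only.
FAILURE_MODES = {
    "fail_to_close": [(2.0, 10.0), (1.0, 50.0)],
    "external_leak": [(3.0, 20.0)],
}


def weibull_hazard(t, shape, scale):
    """Hazard rate of a Weibull distributed time to failure."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def weibull_survival(t, shape, scale):
    """Survival function R(t) = exp(-(t/scale)^shape)."""
    return math.exp(-((t / scale) ** shape))


def mode_hazard(t, causes):
    """Hazard rate of one failure mode: sum over its failure causes."""
    return sum(weibull_hazard(t, k, s) for k, s in causes)


def component_hazard(t, modes=FAILURE_MODES):
    """Total hazard rate z(t): sum over all failure modes."""
    return sum(mode_hazard(t, causes) for causes in modes.values())


def component_survival(t, modes=FAILURE_MODES):
    """Survival of the component when all cause-specific TTFs are independent:
    the product of the cause-specific survival functions."""
    prod = 1.0
    for causes in modes.values():
        for k, s in causes:
            prod *= weibull_survival(t, k, s)
    return prod


def numeric_hazard(t, h=1e-5):
    """Hazard recovered from the survival function as -d/dt ln R(t)."""
    return -(math.log(component_survival(t + h))
             - math.log(component_survival(t - h))) / (2 * h)
```

Comparing component_hazard(t) with numeric_hazard(t) at any t > 0 gives agreement up to numerical error, which is the sense in which independence and additive hazard rates go together in this sketch.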
The approach is intended to include all failure
types (see e.g. Ref. 25), physical as well as functional
failures. (The component has a functional failure when
it is unable to perform its function without the presence
of physical failures.) Failures due to design errors are not
affected by PM strategies, but should be included in
the overall modelling.

ACKNOWLEDGEMENT
The present paper is written with funding from the
Growth Point Centre in Safety and Reliability at
SINTEF and NTH (Norwegian Institute of Technology).

REFERENCES
1. Nowlan, F.S. & Heap, H.F., Reliability centered
maintenance. DDC No. AD-A066579, Defense
Documentation Center, Defense Logistics Agency,
Alexandria, VA, USA, 1978.
2. Moubray, J., Reliability-Centred Maintenance, RCM II,
Butterworth-Heinemann Ltd, Oxford, 1991.
3. Anderson, R.T. & Neri, L., Reliability-Centered
Maintenance, Management and Engineering Methods,
Elsevier Science Publishers Ltd, London, 1990.
4. Moss, M.A., Designing for Minimal Maintenance
Expense, The Practical Application of Reliability and
Maintainability, Marcel Dekker, Inc., New York, 1985.
5. Sandtorv, H. & Rausand, M., RCM - closing the loop
between design reliability and operational reliability.
Maintenance, 6 (1994) 13-21.
6. Dekker, R., Smit, A. & van Rijn, C., Mathematical
models for the optimization of maintenance and their
application in practice. Maintenance, 9 (1994) 22-26.
7. Sherwin, D., Future prospects for maintenance
management and optimization. Presented at Maintenance '94:
The Business of Maintenance in Europe, Gothenburg,
Sweden, 7-11 March 1994.
8. de Groot, M.H., Optimal Statistical Decisions.
McGraw-Hill, Inc., New York, 1970.
9. Keeney, R.L. & Raiffa, H., Decisions with Multiple
Objectives: Preferences and Value Tradeoffs. John Wiley
& Sons, New York, 1976.
10. The LIPS Committee for Utilization and Technical
Evaluation, Laboratory Integration Prioritization
System, LIPS. Draft Version 1.0, 1994.

11. IEC 50(191), Dependability and quality of service. In
International Electrotechnical Vocabulary, International
Electrotechnical Commission, Geneva, 1990, p. 7.
12. Shachter, R.D., Evaluating influence diagrams.
Operations Res., 34 (1986) 871-882.
13. Moosung, J.A.E. & Apostolakis, G.E., The use of
influence diagrams for evaluating severe accident
management strategies. Nucl. Tech., 99 (1992) 142-157.
14. Hong, Y. & Apostolakis, G.E., Conditional influence
diagrams in risk management. Risk Analysis, 13 (1993)
625-636.
15. Blanchard, B. & Fabrycky, W., Systems Engineering and
Analysis, Prentice-Hall, Inc., Englewood Cliffs, NJ,
1990.
16. Kaplan, S., Risk assessment and risk management -
basic concepts and terminology. In Risk Management:
Expanding Horizons in Nuclear Power and Other
Industries, Hemisphere Publ. Corp., Boston, MA, 1991,
pp. 11-28.
17. Cooke, R., The analysis of competing failure modes. In
Proc. PSAM-II, San Diego, USA, 20-25 March 1994.
18. Høyland, A. & Rausand, M., Reliability Theory, Models
and Statistical Methods, John Wiley & Sons, NY, 1994.
19. Ascher, H. & Feingold, H., Repairable Systems
Reliability. Modelling, Inference, Misconceptions and
Their Causes, Marcel Dekker, Inc., New York, 1984.
20. Valdez-Flores, C. & Feldman, R.M., A survey of
preventive maintenance models for stochastically
deteriorating single-unit systems. Nav. Res. Log. Quart., 36
(1989) 419-446.
21. Block, H., Borges, W. & Savits, T., Preventive
maintenance policies. In Reliability and Quality Control,
(Ed. A. P. Basu), North-Holland, Amsterdam, 1986, pp.
101-106.
22. Hokstad, P., Reliability modelling and the Martingale
intensity process. In SRE-Symp. Reliability of
Repairable Systems, SRE-93, Malmö, November 1993.
23. Taylor, H.M. & Karlin, S., An Introduction to
Stochastic Modeling, Academic Press, NY, 1984, pp.
278-289.
24. Hokstad, P. & Frøvig, A.T., A model for degraded and
critical failures for a component with periodic testing.
Reliab. Engng System Safety, 51 (1996).
25. Bodsberg, L., VULCAN - a vulnerability calculation
method for process safety systems. SINTEF Report
STF75 A94051, Trondheim, Norway, 1994.
26. Phillips, D.T., Ravindran, A. & Solberg, J., Operations
Research: Principles and Practice, John Wiley & Sons,
New York, 1976.
27. Vatn, J., Optimizing replacements intervals under
uncertainty. Submitted for publication, 1994.
28. Bodsberg, L., Comparative study of quantitative models
for hardware, software and human reliability
assessment. Quality & Reliab. Engng Int., 9 (1993) 501-518.
29. Bodsberg, L. & Hokstad, P., A system approach to
reliability and life-cycle cost of process safety-systems.
IEEE Trans. Reliab., 44 (1995).
