
Reliability Centered Maintenance

SUMMARY
This report summarizes the main elements of Reliability centered maintenance (RCM). The
presentation is to a great extent based on the outline of the RCM methodology by Rausand & Vatn
(1998). In this presentation we have made an effort to include ideas and examples from railway
applications.
RCM is a method for maintenance planning developed within the aircraft industry and later adapted to
several other industries and military branches. This report presents a structured approach to RCM, and
discusses the various steps in the approach. The availability of reliability data and operating experience
is of vital importance for the RCM method. The RCM method provides a means to utilize operating
experience in a more systematic way. Aspects related to utilization of operating experience are therefore
addressed specifically. In this paper, RCM is put into a risk analysis framework, taking advantage of
reliability modelling in a more structured way than in more traditional RCM approaches.

TABLE OF CONTENT
SUMMARY
TABLE OF CONTENT
1 INTRODUCTION
2 A CONCEPTUAL MODEL FOR RCM
3 MAIN STEPS OF AN RCM ANALYSIS
Step 1: Study preparation
Step 2: System selection and definition
Step 3: Functional failure analysis (FFA)
Step 4: Critical item selection
Step 5: Data collection and analysis
Step 6: Failure modes, effects and criticality analysis
Step 7: Selection of Maintenance Actions
Step 8: Determination of Maintenance Intervals
Step 9: Preventive maintenance comparison analysis
Step 10: Treatment of non-MSIs
Step 11: Implementation
Step 12: In-service data collection and updating
4 DISCUSSIONS AND CONCLUSIONS
General benefits:
Problem areas in the analysis:
Conclusions:
REFERENCES

M. Rausand and J. Vatn. Reliability Centered Maintenance. In C. G. Soares, editor, Risk and Reliability in
Marine Technology. Balkema, Holland, 1998

1 INTRODUCTION
The reliability centered maintenance (RCM)
concept has been on the scene for more than 20
years, and has been applied with considerable
success within the aircraft industry, the military
forces, the nuclear power industry, and more
recently within the offshore oil and gas industry.
Experiences from the use of RCM within these
industries (see e.g. Sandtorv & Rausand 1991)
show significant reductions in preventive maintenance (PM) costs while maintaining, or even
improving, the availability of the systems.
According to the Electric Power Research
Institute (EPRI) RCM is:
a systematic consideration of system
functions, the way functions can fail,
and a priority-based consideration of
safety and economics that identifies
applicable and effective PM tasks.
The main focus of RCM is hence on the system
functions, and not on the system hardware.
Several textbooks and reports presenting the
RCM concept have been published. The most
important books are Nowlan and Heap (1978),
Moubray (1991), Smith (1993), Anderson &
Neri (1990), and Moss (1985). These textbooks
provide a good introduction to RCM, but most
of them define the basic concepts with limited
rigor. The main
ideas presented in these textbooks are more or
less the same, but the detailed procedures are
rather different.

In all these books there is generally more focus
on maintenance than on reliability. The use of
reliability data sources like OREDA (1997) is
not emphasized at all. Smith (1993), for example,
states on page 103 that: ". . . , it is the author's
experience that any introduction of quantitative
reliability data or models into the RCM process
only clouds the PM issue and raises credibility
questions that are of no constructive value."
The main objective of this paper is to present a
structured approach to RCM, and to put more
focus on reliability models and methods in the
RCM process.

2 A CONCEPTUAL MODEL FOR RCM
Most of the available PM models (Valdez-Flores
and Feldman 1989) are based on the assumptions
that (1) only single units are considered, and
that (2) the cost of a single unit failure can easily
be quantified in (discounted) monetary units. In
RCM we have to consider the entire PM
program, i.e. several units simultaneously. It is
further required to consider failure consequences
which cannot be measured directly in monetary
units. In the present paper, we will split the
possible failure consequences into the following
four consequence classes:
S: Safety of personnel
E: Environmental impact
A: Production availability
C: Material losses (costs)
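[Figure 1: basic events B1, B2, B3 (fault tree analysis) lead to an undesired event; barriers determine which of the consequences C1, C2, C3 follow (event tree analysis); the consequences are combined into a total loss. Maintenance activities M1, M2, M3 are shown acting on the model. The fault tree and event tree analyses together form the risk analysis.]

Figure 1 A conceptual model for RCM based on risk analysis
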

To measure all the consequences in monetary
units, we have to define economic values of the
life and health of persons, and of different
environmental aspects. This is at best a difficult
and controversial task. A new conceptual model
for the RCM approach is illustrated in Figure 1.
This model is based on the ideas presented by
Vatn et al. (1996). The basis for the new
conceptual model is a traditional risk analysis.
The risk analysis approach is based on a number
of so-called undesired events in the system.
An undesired event is typically the onset of a
possible accident, for example a gas leakage or
an unintended stop of a compressor. In this
context the term accident is defined in a very
broad sense, including all events causing a loss
related to one of the consequence classes defined
above.
An undesired event may be caused by a number
of basic events B1, B2, . . . . The basic events may
comprise failures of technical items, human
errors, and environmental impacts. The basic
events are often identified and modeled by fault
tree analysis. If failure rates and other necessary
data are available for the basic events, the fault
tree analysis will provide estimates of the
frequency of occurrence of the various undesired
events.
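
The fault tree step described above can be illustrated with a minimal sketch. It assumes that the basic events are statistically independent and that the undesired event is fed by a single OR or AND gate; the function names and the numerical rates are hypothetical and are not taken from the report.

```python
from math import prod


def or_gate_probability(basic_event_probs):
    """P(at least one basic event occurs), assuming independent basic events."""
    return 1.0 - prod(1.0 - p for p in basic_event_probs)


def and_gate_probability(basic_event_probs):
    """P(all basic events occur), assuming independent basic events."""
    return prod(basic_event_probs)


def or_gate_frequency(basic_event_rates):
    """Frequency of the undesired event under an OR gate.

    Each basic event alone triggers the undesired event, so the
    occurrence rates (e.g. per year) simply add up.
    """
    return sum(basic_event_rates)


# Hypothetical occurrence rates per year for basic events B1, B2, B3.
rates = {"B1": 0.02, "B2": 0.005, "B3": 0.001}
undesired_event_frequency = or_gate_frequency(rates.values())
print(f"Undesired event frequency: {undesired_event_frequency:.3f} per year")
```

A real fault tree combines many gates and may contain repeated basic events, in which case a minimal cut set evaluation is needed rather than this gate-by-gate arithmetic.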
The consequences of an undesired event will
normally depend on the barriers that are
established to prevent escalation of the
undesired event. On an oil production platform
the barriers may comprise emergency shutdown
(ESD) systems, pressure relief systems, fire
walls, fire fighting systems, etc. The use of two
or more parallel tracks in the railway
infrastructure can be considered as a barrier that
limits the consequences of a turnout failure. The
possible consequence chains starting from an
undesired event are often identified and modeled
by event tree analysis, supplemented by various
physical models, like fire and explosion models.
The output of the event tree analysis will be a set
of possible consequences C1, C2, . . . . If the
necessary input data are available for the barriers
and physical models, the event tree analysis will
provide frequencies or probabilities of the
various consequences. To build models for
railway applications, it is essential to know the
timetable, the configuration of the line, etc.
By analyzing all the undesired events in this way
we will in principle end up with a complete
consequence spectrum for the system, i.e. a
listing of all possible consequences together
with an estimated frequency of each
consequence.
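
As a rough illustration of how an event tree turns one undesired event into part of a consequence spectrum, the sketch below enumerates every combination of working and failed barriers. It assumes independent barriers with known failure-on-demand probabilities; the barrier names and all numbers are hypothetical, and a real event tree would usually prune branches and attach physical consequence models.

```python
from itertools import product


def consequence_spectrum(event_frequency, barrier_failure_probs):
    """Map each set of failed barriers to the frequency of that outcome.

    event_frequency: frequency of the undesired event (e.g. per year).
    barrier_failure_probs: dict of barrier name -> probability of failure on demand.
    """
    names = list(barrier_failure_probs)
    spectrum = {}
    for outcome in product((False, True), repeat=len(names)):
        probability = 1.0
        for name, failed in zip(names, outcome):
            p_fail = barrier_failure_probs[name]
            probability *= p_fail if failed else 1.0 - p_fail
        failed_barriers = tuple(n for n, failed in zip(names, outcome) if failed)
        spectrum[failed_barriers] = event_frequency * probability
    return spectrum


# Hypothetical undesired event occurring 0.1 times per year with two barriers.
spectrum = consequence_spectrum(0.1, {"ESD system": 0.01, "fire wall": 0.05})
for failed_barriers, frequency in spectrum.items():
    print(failed_barriers or "all barriers work", f"{frequency:.6f} per year")
```

Summing the frequencies over all branches returns the frequency of the undesired event, which is a useful consistency check on the tree.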
A traditional risk analysis stops with the
consequence spectrum. If possible we should,
however, combine the effects of the
consequences into a total loss measure (loss
function). This can in some cases be done
without too much controversy.
The system maintenance activities M1, M2, . . .
will affect the frequencies of both the basic
events and the barrier failures. An effective PM
task may prevent a failure of a process unit or a
barrier (e.g. a safety valve).
On the other hand, failures may also be caused
by human or procedural errors during
maintenance. Experience has shown that many
major accidents have occurred either during
maintenance or because of wrongly executed
maintenance.
The overall objective of RCM is to establish PM
tasks that are applicable and effective with
respect to the consequence classes defined
above. To be effective, a PM task must therefore
provide a reduced expected loss related to one or
more of the four consequence classes.

3 MAIN STEPS OF AN RCM ANALYSIS

The RCM analysis may be carried out as a
sequence of activities. Some of these activities,
or steps, are overlapping in time, as illustrated in
Figure 2. The RCM process comprises the
following steps:
1. Study preparation
2. System selection and definition
3. Functional failure analysis (FFA)
4. Critical item selection
5. Data collection and analysis
6. Failure modes, effects and criticality analysis
(FMECA)
7. Selection of maintenance actions
8. Determination of maintenance intervals
9. Preventive maintenance comparison analysis
10. Treatment of non-critical items
11. Implementation
12. In-service data collection and updating
The various steps are discussed in the following
sections with a focus on Steps 1-8. The time
sequence of Steps 1-8 is illustrated in Figure 2.
The sizes of the boxes do not reflect the required
workload in the steps.

[Figure 2: boxes for Steps 1 to 8 (1 Study prep., 2 System selection, 3 FFA, 4 Critical item selection, 5 Data collection and analysis, 6 FMECA, 7 Maintenance tasks, 8 Maintenance intervals) placed along a time axis and partly overlapping.]

Figure 2 The RCM process

Step 1: Study preparation
The main objectives of an RCM analysis are:
1. to identify effective maintenance tasks,
2. to evaluate these tasks by some cost-benefit
analysis, and
3. to prepare a plan for carrying out the
identified maintenance tasks at optimal
intervals.
If a maintenance program already exists, the
result of an RCM analysis will often be to
eliminate inefficient maintenance tasks.
Before an actual RCM analysis is initiated, an
RCM project group should be established, see
e.g. Moubray (1991) pp. 16-17. The RCM
project group should include at least one person
from the maintenance function and one from the
operations function, in addition to an RCM
specialist.
In Step 1 "Study preparation" the RCM project
group should define and clarify the objectives
and the scope of the analysis. Requirements,
policies, and acceptance criteria with respect to
safety and environmental protection should be
made visible as boundary conditions for the
RCM analysis.
The part of the plant to be analyzed is selected in
Step 2. The type of consequences to be
considered should, however, be discussed and
settled on a general basis in Step 1. Possible
consequences to be evaluated may comprise:
(i) risk to humans,
(ii) environmental damages,
(iii) delays and cancellation of travels,
(iv) material losses or equipment damage,
(v) loss of market shares, etc.
The possible consequence classes cannot be
measured in one common unit. It is therefore
necessary to prioritize between measures affecting
the various consequence classes. Such a
prioritization is not an easy task and will not be
discussed in this presentation. The trade-off
problems can to some extent be solved within a
decision theoretical framework (Vatn 1995 and
Vatn et al. 1996).
RCM analyses have traditionally concentrated
on PM strategies. It is, however, possible to
extend the scope of the analysis to cover topics
like corrective maintenance strategies, spare part
inventories, logistic support problems, etc. The
RCM project group must decide what should be
part of the scope and what should be outside.
The resources that are available for the analysis
are usually limited. The RCM group should
therefore be selective about what to look
into, realizing that the analysis cost should not
outweigh the potential benefits.
In many RCM applications the plant already has
effective maintenance programs. The RCM
project will therefore be an upgrade project to
identify and select the most effective PM tasks,
to recommend new tasks or revisions, and to
eliminate ineffective tasks. These changes are
then applied within the existing programs in a way
that allows the most efficient allocation of
resources.
When applying RCM to an existing PM
program, it is best to utilize, to the greatest
extent possible, established plant administrative
and control procedures in order to maintain the
structure and format of the current program.
This approach provides at least three additional
benefits:
(i) It preserves the effectiveness and
success of the current program.
(ii) It facilitates acceptance and implementation of the project's recommendations
when they are processed.
(iii) It allows incorporation of improvements
as soon as they are discovered, without the
necessity of waiting for major changes to
the PM program or analysis of every
system.

Step 2: System selection and definition
Before a decision to perform an RCM analysis at
a plant is taken, two questions should be
considered:
(i) For which systems is an RCM analysis
beneficial compared with more traditional
maintenance planning?
(ii) At what level of assembly (plant, system,
subsystem . . . ) should the analysis be
conducted?
Regarding the first question, all systems may in
principle benefit from an RCM analysis. With
limited resources, we must, however, usually
set priorities, at least when introducing the
RCM approach in a new plant. We should start
with the systems that we assume will benefit
most from the analysis. The following criteria
may be used to prioritize systems for an RCM
analysis:
(i) The failure effects of potential system
failures must be significant in terms of
safety, environmental consequences,
production loss, or maintenance costs.
(ii) The system complexity must be above
average.
(iii) Reliability data or operating experience
from the actual system, or similar systems,
should be available.
Most operating plants have developed an
assembly hierarchy, i.e. an organization of the
system hardware elements into a structure that
looks like the root system of a tree. In the
offshore oil and gas industry this hierarchy is
usually referred to as the "tag number system".
Several other names are also used. Moubray
(1991), for example, refers to the assembly
hierarchy as the "plant register".
The following terms will be used in this paper
for the levels of the assembly hierarchy:
Plant: A logical grouping of systems that
function together to provide an output or product
by processing and manipulating various input
raw materials and feed stock. An offshore gas
production platform may e.g. be considered as a
plant. For railway applications a plant might be a
maintenance area, where the main function of
that plant is to ensure satisfactory
infrastructure functionality in that area. Moubray
(1991) refers to the plant as a "cost center".
System: A logical grouping of subsystems that
will perform a series of key functions, which
often can be summarized as one main function,
that are required of a plant (e.g. feed water,
steam supply, and water injection). The
compression system on an offshore gas
production platform may e.g. be considered as a
system. Note that the compression system may
consist of several compressors with a high
degree of redundancy. Redundant units
performing the same main function should be
included in the same system. It is usually easy to
identify the systems in a plant, since they are
used as logical building blocks in the design
process.
The system level is usually recommended as the
starting point for the RCM process. This is
further discussed and justified for example by
Smith (1993) and in MIL-STD-2173. This
means that on an offshore oil/gas platform the
starting point of the analysis should be for
example the compression system, the water
injection system or the fire water system, and
not the whole platform.
The systems may be further broken down into
subsystems, sub-subsystems, etc. For the
purpose of the RCM process the lowest level of
the hierarchy should be what we will call an
RCM analysis item:
RCM analysis item: A grouping or collection of
components which together form some
identifiable package that will perform at least
one significant function as a stand-alone item
(e.g. pumps, valves, and electric motors). For
brevity, an RCM analysis item will in the
following be called an analysis item. By this
definition a shutdown valve, for example, is
classified as an analysis item, while the valve
actuator is not. The actuator is supporting
equipment to the shutdown valve, and only has a
function as a part of the valve. The importance
of distinguishing the analysis items from their
supporting equipment is clearly seen in the
FMECA in Step 6. If an analysis item is found
to have no significant failure modes, then none
of the failure modes or causes of the supporting
equipment are important, and therefore do not
need to be addressed. Similarly, if an analysis
item has only one significant failure mode, then
the supporting equipment only needs to be
analyzed to determine if there are failure causes
that can affect that particular failure mode
(Paglia et al. 1991). Therefore only the failure
modes and effects of the analysis items need to
be analyzed in the FMECA in Step 6. An
analysis item is usually repairable, meaning that
it can be repaired without replacing the whole
item. In the offshore reliability database
OREDA (1992) the analysis item is called an
equipment unit. The various analysis items of a
system may be at different levels of assembly.
On an offshore platform, for example, a huge
pump may be defined as an analysis item in the
same way as a small gas detector. If we have
redundant items, e.g. two parallel pumps, each
of them should be classified as an analysis item.
When we in Step 6 of the RCM process identify
causes of analysis item failures, we will often
find it suitable to attribute these failure causes to
failures of items on an even lower level of
indenture. The lowest level is normally referred
to as components.
Component: The lowest level at which
equipment can be disassembled without damage
or destruction to the items involved. Smith
(1993) refers to this lowest level as Least
Replaceable Assembly (LRA), while OREDA
(1997) uses the term maintainable item.
It is very important that the analysis items are
selected and defined in a clear and unambiguous
way in this initial phase of the RCM process,
since the following analysis will be based on
these analysis items. If the OREDA database is
to be used in later phases of the RCM process, it
is recommended as far as possible to define the
analysis items in compliance with the
"equipment units" in OREDA.
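
The assembly hierarchy defined above (plant, system, analysis item, component) could be held in a small data structure such as the following sketch. The class names, field names and the example plant are hypothetical illustrations and do not reflect the OREDA data model or any particular tag number system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AnalysisItem:
    """Lowest level analyzed in the RCM process, e.g. a pump or a valve."""
    name: str
    # Components (maintainable items) to which failure causes may be attributed.
    components: List[str] = field(default_factory=list)


@dataclass
class System:
    """Logical grouping performing one main function; the usual starting point."""
    name: str
    analysis_items: List[AnalysisItem] = field(default_factory=list)


@dataclass
class Plant:
    name: str
    systems: List[System] = field(default_factory=list)

    def list_analysis_items(self) -> List[Tuple[str, str]]:
        """Return (system name, analysis item name) pairs for the whole plant."""
        return [(s.name, item.name) for s in self.systems for item in s.analysis_items]


# Hypothetical example: redundant pumps are separate analysis items in one system.
plant = Plant("Gas production platform", [
    System("Water injection", [
        AnalysisItem("Pump A", ["impeller", "seal"]),
        AnalysisItem("Pump B", ["impeller", "seal"]),
        AnalysisItem("Shutdown valve", ["valve body"]),
    ]),
])
print(plant.list_analysis_items())
```

Keeping redundant units in the same system, each as its own analysis item, follows the recommendation in the text.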

Step 3: Functional failure analysis (FFA)
The objectives of this step are:
(i) to identify and describe the system's
required functions,
(ii) to describe input interfaces required for the
system to operate, and
(iii) to identify the ways in which the system
might fail to function.

Step 3(i): Identification of system functions
The objective of this step is to identify and
describe all the required functions of the system.
In many guidelines and textbooks (e.g. Cross
1994), it is recommended that the various
functions are expressed in the same way, as a
statement comprising a verb plus a noun, for
example "close flow", "contain fluid", "transmit
signal".
A complex system will usually have a high
number of different functions. It is often difficult
to identify all these functions without a
checklist. The checklist or classification scheme
of the various functions presented below may
help the analyst in identifying the functions. The
same scheme will be used in Step 6 to identify
functions of analysis items. The term item is
therefore used in the classification scheme to
denote either a system or an analysis item.
1. Essential functions: These are the functions
required to fulfill the intended purpose of
the item. The essential functions are simply
the reasons for installing the item. Often an
essential function is reflected in the name of
the item. An essential function of a pump is
for example to pump a fluid.
2. Auxiliary functions: These are the functions
that are required to support the essential
functions. The auxiliary functions are
usually less obvious than the essential
functions, but may in many cases be as
important as the essential functions. Failure
of an auxiliary function may in many cases
be more critical than a failure of an essential
function. An auxiliary function of a pump is
for example containment of the fluid.
3. Protective functions: The functions intended
to protect people, equipment and the
environment from damage and injury. The
protective functions may be classified
according to what they protect, as safety
functions, environment functions, and
hygiene functions.
Safety protective functions are further
discussed e.g. by Moubray (1991) pp. 40-42.
An example of a protective function is
the protection provided by a rupture disk on
a pressure vessel (e.g. a separator).
4. Information functions: These functions
comprise condition monitoring, various
gauges and alarms etc.
5. Interface functions: These functions apply to
the interfaces between the item in question
and other items. The interfaces may be
active or passive. A passive interface is for
example present when an item is a support
or a base for another item.
6. Superfluous functions: According to
Moubray (1991) "Items or components are
sometimes encountered which are
completely superfluous. This usually
happens when equipment has been modified
frequently over a period of years, or when
new equipment has been overspecified".
Superfluous functions are sometimes present
when the item has been designed for an
operational context that is different from the
actual operational context. In some cases
failures of a superfluous function may cause
failure of other functions.
For analysis purposes the various functions of an
item may also be classified as:
(a) On-line functions: These are functions
operated either continuously or so often that
the user has current knowledge about their
state. The termination of an on-line function
is called an evident failure.
(b) Off-line functions: These are functions that
are used intermittently or so infrequently that
their availability is not known by the user
without some special check or test. The
protective functions are very often off-line
functions. An example of an off-line
function is the essential function of an
emergency shutdown (ESD) system on an oil
platform. The termination of an
off-line function is called a hidden failure.
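
The distinction between evident and hidden failures can be captured in a few lines. This is only a sketch of the definitions above; the enum and function names are hypothetical.

```python
from enum import Enum


class FunctionMode(Enum):
    ON_LINE = "on-line"    # state is currently known to the user
    OFF_LINE = "off-line"  # state is only known after a special check or test


def failure_type(mode: FunctionMode) -> str:
    """Termination of an on-line function is evident, of an off-line function hidden."""
    return "evident" if mode is FunctionMode.ON_LINE else "hidden"


def needs_check_or_test(mode: FunctionMode) -> bool:
    """Hidden failures are only revealed by some special check or test."""
    return failure_type(mode) == "hidden"


# Example from the text: the essential function of an ESD system is off-line.
print(failure_type(FunctionMode.OFF_LINE))
```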
Note that this classification of functions should
only be used as a checklist to ensure that all
relevant functions are revealed. Discussions
about whether a function should be classified as
essential or auxiliary etc. should be avoided.
Also note that the classification of functions
here is used at the system level. Later the same
classification of functions is used in the failure
modes, effects and criticality analysis (FMECA)
in Step 6 at the analysis item level.
The system may in general have several
operational modes (e.g. running and standby),
and several functions for each operational mode.
The essential functions are often obvious and
easy to establish, while the other functions may
be rather difficult to reveal.
Step 3(ii): Functional block diagrams
The various system functions identified in Step
3(i) may be represented by functional diagrams
of various types. The most common diagram is
the so-called functional block diagram. A
simple functional block diagram of a pump is
shown in Figure 3.
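[Figure 3: the function "Pump fluid" drawn as a block inside the system boundary, with the inputs "Fluid in" and "El. power", a control signal from the control system, influence from the environment, and the output "Fluid out".]

Figure 3 Functional block diagram for a pump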
The necessary inputs to a function are illustrated
in the functional block diagram together with the
necessary control signals and the various
environmental stressors that may influence the
function.
It is generally not required to establish
functional block diagrams for all the system
functions. The diagrams are, however, often
considered as efficient tools to illustrate the
input interfaces to a function. The functional
block diagram is recommended for RCM by
Smith (1993). A detailed description of this type
of diagram is given by e.g. Pahl and Beitz
(1984).
In some cases we may want to split system
functions into subfunctions on an increasing
level of detail, down to functions of analysis
items. The functional block diagrams may be
used to establish this functional hierarchy in a
pictorial manner, illustrating series-parallel
relationships, possible feedbacks, and functional
interfaces (Blanchard & Fabrycky 1981).
Alternatives to the functional block diagram are
reliability block diagrams and fault trees.
Functional block diagrams are also
recommended by IEC 812 as a basis for failure
modes, effects and criticality analysis (FMECA)
and will therefore be a basis for Step 6 in the
RCM procedure.
Step 3(iii): System failure modes
The next step of the FFA is to identify and
describe how the various system functions may
fail.
Since we will need the following concepts also
in the FMECA in Step 6, we will use the term
"item" to denote both the system and the analysis
items. According to accepted standards (IEC
50(191)) failure is defined as "the termination
of the ability of an item to perform a required
function".
British Standard BS 5760, Part 5 defines failure
mode as "the effect by which a failure is
observed on a failed item". It is important to
realize that a failure mode is a manifestation of
the failure as seen from the outside, i.e. the
termination of one or more functions.
In most of the RCM references the system
failure modes are denoted "functional failures".
Failure modes may be classified in three main
groups related to the function of the item:
(i) Total loss of function: In this case a function
is not achieved at all, or the quality of the
function is far outside what is considered
acceptable.
(ii) Partial loss of function: This group may be
very wide, and may range from the nuisance
category almost to the total loss of function.
(iii) Erroneous function: This means that the
item performs an action that was not
intended, often the opposite of the intended
function.


A variety of classification schemes for failure
modes have been published. Some of these
schemes, e.g. Blache & Shrivastava (1994), may
be used in combination with the function
classification scheme in Step 3(i) to ensure that
all relevant system failure modes (functional
failures) are identified.
In the following we will need to classify failures
as:
Sudden failures: Failures that could not be
forecast by prior testing or examination.
Gradual failures: Failures that could be forecast
by testing or examination. A gradual failure will
represent a gradual drifting out of the
specified range of performance values. The
recognition of gradual failures requires
comparison of actual device performance with a
performance specification, and may in some
cases be a difficult task. An example of a
gradual failure situation is illustrated in Figure 4.
The specified performance is illustrated by the
target value, together with the acceptable
deviation from this target value. As soon as the
actual performance drifts outside the acceptable
deviation, we have a failure.
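
The failure criterion for a gradual failure, performance drifting outside the acceptable deviation from the target value, can be expressed as a simple check. The sketch below assumes a symmetric acceptable deviation around the target and uses hypothetical readings; neither the function name nor the numbers come from the report.

```python
def first_failure_index(measurements, target, acceptable_deviation):
    """Return the index of the first reading outside target +/- acceptable_deviation.

    Returns None if the performance stays within the specified range.
    """
    for index, value in enumerate(measurements):
        if abs(value - target) > acceptable_deviation:
            return index
    return None


# Hypothetical performance readings drifting away from a target value of 100.
readings = [100.0, 99.0, 97.5, 95.5, 94.0]
print(first_failure_index(readings, target=100.0, acceptable_deviation=5.0))
```

In practice the difficulty noted in the text lies in obtaining trustworthy performance readings and a well-defined specification, not in the comparison itself.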
[Figure 4: performance plotted against time, showing the target value, the acceptable deviation, and the point at which the drifting performance constitutes a failure.]

Figure 4 Example of a gradual failure

An important type of failure is the so-called
ageing failure:
Ageing failures: Failures whose probability of
occurrence increases with the passage of time, as
a result of processes inherent in the item. Ageing
failures are also sometimes called wear-out
failures.
An ageing failure is normally caused by some
physical, chemical or other processes that are
deteriorating the item. These processes are
usually referred to as failure mechanisms. The
ageing failure is sometimes a gradual failure,
meaning that the performance of the item is
gradually drifting out of the specified range. In
other cases the ageing failure will be sudden.
The inherent resistance of the item may
gradually be reduced until a failure occurs. The
performance of the item may in such cases be
perfect until the failure occurs.
The system failure modes (functional failures)
may be recorded on a specially designed FFA-form
that is rather similar to a standard FMECA
form. An example of an FFA-form is presented
in Figure 5.

System:                 Performed by:           Page:   of:
Ref. drawing no.:       Date:

Operational mode | Function | Function requirements | System failure mode | Criticality: S | E

Figure 5 Example of an FFA-form

In the first column of Figure 5 the various
operational modes of the system are recorded.
For each operational mode, all the relevant
functions of the system are recorded in column
2. The performance requirements to the
functions, like target values and acceptable
deviations (ref. Figure 4) are listed in column 3.
For each system function (in column 2) all the
relevant system failure modes are listed in
column 4. In column 5 a criticality ranking of
each system failure mode (functional failure) in
that particular operational mode is given. The
reason for including the criticality ranking is to
be able to limit the extent of the further analysis
by disregarding insignificant system failure
modes. For complex systems such a screening is
often very important in order not to waste time
and money.
The criticality ranking depends on both the
frequency/probability of the occurrence of the
system failure mode, and the severity of the
failure. The severity must be judged at the plant
level.
In the conceptual RCM model in Figure 1 the
system failure modes will be undesired events.
In addition the undesired events will also
include accidental events (like external impacts)
that are not normally identified as a loss of
system function. Such events are usually
identified by using various risk identification
checklists.
The severity ranking should be given in the four
consequence classes: (S) safety of personnel, (E)
environmental impact, (A) production
availability, and (C) economic losses. For each
of these consequence classes the severity should
be ranked as for example (H) high, (M) medium,
or (L) low. How we should define the
borderlines between these classes will depend
on the specific application.
If at least one of the four entries is (M) medium
or (H) high, the severity of the system failure
mode should be classified as significant, and the
system failure mode should be subject to further
analysis.
The frequency of the system failure mode may
also be classified in the same three classes. (H)
high may for example be defined as more than
once per 5 years, and (L) low less than once per
50 years. As above, the specific borderlines will
depend on the application.
The frequency classes may be used to prioritize
between the significant system failure modes.
If all the four severity entries of a system failure
mode are (L) low, and the frequency is also (L)
low, the criticality is classified as insignificant,
and the system failure mode is disregarded in the
further analysis. If, however, the frequency is
(M) medium or (H) high the system failure
mode should be included in the further analysis
even if all the severity ranks are (L) low, but
with a lower priority than the significant system
failure modes.
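
The screening rules above can be summarized in a few lines of code. The following Python sketch is only an illustration under stated assumptions: the function name `screen_failure_mode`, the dictionary layout and the three returned labels are hypothetical and are not prescribed by the method itself.

```python
from typing import Dict

CLASSES = ("S", "E", "A", "C")   # safety, environment, availability, cost
RANKS = ("L", "M", "H")          # low, medium, high


def screen_failure_mode(severity: Dict[str, str], frequency: str) -> str:
    """Classify a system failure mode following the screening rules in the text.

    severity  -- rank ("L", "M" or "H") for each consequence class S, E, A, C
    frequency -- rank ("L", "M" or "H") of the failure mode frequency
    The returned labels are illustrative names, not terms from the method.
    """
    if set(severity) != set(CLASSES):
        raise ValueError("severity must contain exactly the classes S, E, A, C")
    if frequency not in RANKS or any(r not in RANKS for r in severity.values()):
        raise ValueError("ranks must be 'L', 'M' or 'H'")

    # At least one medium or high severity entry: significant, analyze further.
    if any(severity[c] in ("M", "H") for c in CLASSES):
        return "significant"
    # All severities low, but the failure mode is not infrequent:
    # keep it in the analysis with a lower priority.
    if frequency in ("M", "H"):
        return "lower priority"
    # All severities low and frequency low: disregard.
    return "insignificant"


if __name__ == "__main__":
    print(screen_failure_mode({"S": "L", "E": "M", "A": "L", "C": "L"}, "L"))
    print(screen_failure_mode({"S": "L", "E": "L", "A": "L", "C": "L"}, "H"))
    print(screen_failure_mode({"S": "L", "E": "L", "A": "L", "C": "L"}, "L"))
```

The frequency rank could additionally be used to order the significant failure modes, as suggested in the text; that ordering is left out of the sketch.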
If we were able to define a total loss function in
the conceptual model in Figure 1, the criticality
of the various system failure modes (undesired
events) could be assessed explicitly. This
approach will, however, not be cost-efficient in
most practical applications.

Step 4: Critical item selection
The objective of this step is to identify the
analysis items that are potentially critical with
respect to the system failure modes (functional
failures) identified in Step 3(iii). These analysis
items are denoted functional significant items
(FSI). Note that some of the less critical system
failure modes have been disregarded at this stage
of the analysis. Further, the two failure modes
total loss of function and partial loss of
function will often be affected by the same
items (FSIs).
For simple systems the FSIs may be identified
without any formal analysis. In many cases it is
obvious which analysis items have influence
on the system functions.
For complex systems with an ample degree of
redundancy or with buffers, we may need a
formal approach to identify the functional
significant items. In the conceptual model in
Figure 1 the analysis item failures are classified
as basic events. This means that the causal
analysis in the conceptual model should be
pursued down to the analysis item level and not
further. As explained in section 2, the basic
events will also comprise events that are not
classified as analysis item failures, like human
errors and environmental impacts. In the
conceptual model, fault tree analysis is
suggested as a suitable technique for
identification and modeling of basic events.
Depending on the complexity of the system,
other techniques like reliability block diagrams
or Monte Carlo simulation (see e.g. Høyland and
Rausand 1994) may be more suitable. In a
petroleum production plant there are often a
variety of buffers and rerouting possibilities.
Rerouting will also be possible in railway
applications. For such systems, Monte Carlo
next event simulation may often be the only
feasible approach.

If failure rates and other necessary input data are
available for the various analysis items, it is
usually a straightforward task to calculate the
relative importance of the various analysis items
based on a fault tree model or a reliability block
diagram. A number of importance measures are
discussed by Høyland and Rausand (1994). In a
Monte Carlo model it is also rather
straightforward to rank the various analysis
items according to criticality.

The main reason for performing this task is to
screen out items that are more or less irrelevant
for the main system functions, i.e. in order not to
waste time and money analyzing irrelevant
items.

In addition to the FSIs, we should also identify
items with high failure rate, high repair costs,
low maintainability, long lead time for spare
parts, or items requiring external maintenance
personnel. These analysis items are denoted
maintenance cost significant items (MCSI).

The functional significant items and
the maintenance cost significant items together are
denoted maintenance significant items (MSI).
Some authors, e.g. Smith (1993), claim that such
a screening of critical items should not be done;
others, e.g. Paglia et al. (1991), claim that the
selection of critical items is very important in
order not to waste time and money. We tend to
agree with both. In some cases it may be
beneficial to focus on critical items, in other
cases we should analyze all items.

In the FMECA analysis of Step 6, each of the
MSIs will be analyzed to identify their possible
impact upon failure on the four consequence
classes: (S) safety of personnel, (E) environmental impact, (A) production availability, and
(C) economic losses. This analysis is partly
inductive and will focus on both local and
system level effects. From the present step we
know that a failure of an MSI may have impact
on one or more of the system functions. In
addition, the failure of an MSI may have several
local effects and also effects on system level not
involving the identified system functions. There
may also be analysis items that are not
classified as MSIs but that have negative effects on
the system level not involving the identified
system functions. This observation may be seen
as an argument for not screening out so-called
non-critical items.


[Figure 6: the system functions obtained by functional analysis (system functions I, II, III and X) linked to the analysis items (MSI A, MSI B, MSI C, MSI N and analysis item M), with the MSIs considered marked.]

Figure 6 Relation between top level system functions and analysis items

In Step 6 a complete FMECA is carried out for
all the MSIs. The FMECA is partly an inductive
analysis that identifies all the local and system
level consequences of the MSI failure modes.
This means that other (top level) functions than
those identified may be considered in the
FMECA. This is illustrated in Figure 6, where
the system function X is affected by analysis
item N.

On the other hand, there might be important
items which are omitted from the FMECA
because the corresponding top level functions
were overlooked. This is the case for analysis
item M in Figure 6, which has an impact on
system function X. The only way to ensure that
all functions are considered is to include all
items in the FMECA analysis. However, this will
often lead to a too comprehensive analysis.

Step 5: Data collection and analysis
The data necessary for the RCM analysis may
according to Sandtorv & Rausand (1991) be
categorized in the following three groups:
1. Design data
   - System definition: a description of the
     system boundaries including all subsystems
     and equipment to fulfill the main functions
     of the system.
   - System breakdown: the assembly hierarchy as
     described in Step 2.
   - A technical description of each subsystem,
     such as the structure of the subsystem,
     capacity and functions (e.g. input and
     output).
   - System performance requirements, e.g.
     desired system availability, environmental
     requirements.
   - Requirements related to maintenance/testing,
     e.g. according to rules and regulations.
2. Operational data
   - Performance requirements
   - Operating profile (continuous or
     intermittent operation)
   - Calendar and accumulated operating time for
     overhauls
   - Maintenance and downtime costs
   - Control philosophy (remote/local and
     automatic/manual)
   - Environmental conditions
   - Maintainability
   - Recommended maintenance for each analysis
     item based on manufacturer specification,
     general guidelines or standards, or in-house
     recommended practice.
   - Failure information. When a failure occurs
     the following registrations are relevant:
       - System number (tag number) of the
         analysis item
       - Calendar time
       - Accumulated operating time to the
         failure
       - Failure event
       - Failure mode
       - Failure cause
       - Downtime
       - Failure consequences
       - Repair time (active and passive)
3. Reliability data
   Reliability data may be derived from the
   operational data. The reliability data is used
   to decide the criticality, to mathematically
   describe the failure process and to optimize
   the time between PM tasks. The reliability
   data includes:
   - Mean time to failure (MTTF).
   - Mean time to repair (MTTR).
   - Failure rate function z(t).
   - A functional relation between the value of
     condition monitoring information and the
     failure rate z(t).

The failure rate function is briefly described in
the following. Let T denote the time from when an
item is put into operation at time t = 0 until a
potential failure occurs. The item may be either
new or used when it is put into operation. In
many cases the item will be put into operation
again after a refurbishment or after a failure has
been corrected. The uncertainties in the time to
failure T may be described by the distribution
function F(t) = Pr(T ≤ t), or the probability
density function f(t) = F'(t). The probability
density function f(t) may be expressed as:

$$f(t)\,\Delta t \approx \Pr(t < T \le t + \Delta t)$$

Hence, f(t)Δt is approximately equal to the
probability that the item will fail in the time
interval (t, t+Δt].
The life distribution is often most effectively
characterized by the so-called failure rate, or
force of mortality (FOM). The failure rate
function z(t) may be expressed as:

$$z(t)\,\Delta t \approx \Pr(t < T \le t + \Delta t \mid T > t)$$

If we consider an item that has survived the time
interval (0, t], i.e. T > t, then the probability that
the item will fail in the time interval (t, t+Δt] is
approximately z(t)Δt. In many cases the failure
rate will be an increasing function of time,
indicating that the item is deteriorating. In other
cases the failure rate may be decreasing,
indicating that the item is improving. There are
even cases where the failure rate is decreasing in
one time interval and increasing in another. In
some cases we may predict the form of the
failure rate curve based on knowledge about the
relevant failure mechanisms. An example of a
failure rate function for a deteriorating item is
given in Figure 7.
A popular class of life distributions is the
Weibull distribution where the failure rate
function is given by:

$$z_W(t) = \alpha \lambda (\lambda t)^{\alpha - 1}$$

When α > 1, the failure rate function z_W(t) is
seen to be increasing, meaning that the item is
deteriorating. It is also seen that the degree of
deterioration increases with α. When α < 1, the
failure rate function z_W(t) is decreasing, meaning
that the item is improving. When α = 1, the
failure rate is constant, meaning that failures are
truly random.

[Figure 7: failure rate plotted against time, with the wearout limit marked.]

Figure 7 Failure rate with identifiable wearout limit
In some cases the value of the shape parameter
may be estimated based on knowledge about the
relevant failure mechanisms, i.e. based on expert
judgment.
Several other life distributions are available.
Among the most popular are the lognormal
distribution, the Birnbaum-Saunders distribution, and the inverse Gaussian distribution.
All of these distributions are rather flexible, and
may be used for detailed modeling of specific
failure mechanisms. For the purpose of this
paper, however, the class of Weibull distributions is sufficiently flexible to be the preferred
distribution. In the rest of this paper we will
therefore assume that the time to failure follows
a Weibull distribution.
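
To make the Weibull assumption concrete, the following Python sketch evaluates a Weibull failure rate and checks numerically that z(t)Δt approximates the conditional probability of failure in a short interval. It assumes the parametrization with shape parameter α (`alpha`) and scale parameter λ (`lam`), i.e. a failure rate αλ(λt)^(α-1) and survivor function R(t) = exp(-(λt)^α); the function names and the numbers in the example are illustrative only.

```python
import math


def weibull_failure_rate(t: float, alpha: float, lam: float) -> float:
    """Failure rate z_W(t) = alpha * lam * (lam * t)**(alpha - 1), for t > 0."""
    if t <= 0 or alpha <= 0 or lam <= 0:
        raise ValueError("t, alpha and lam must all be positive")
    return alpha * lam * (lam * t) ** (alpha - 1)


def weibull_survivor(t: float, alpha: float, lam: float) -> float:
    """Survivor function R(t) = Pr(T > t) = exp(-(lam * t)**alpha)."""
    return math.exp(-((lam * t) ** alpha))


def conditional_failure_probability(t: float, dt: float,
                                    alpha: float, lam: float) -> float:
    """Pr(t < T <= t + dt | T > t), computed from the survivor function."""
    r_t = weibull_survivor(t, alpha, lam)
    return (r_t - weibull_survivor(t + dt, alpha, lam)) / r_t


if __name__ == "__main__":
    alpha, lam = 2.0, 0.5          # hypothetical deteriorating item (alpha > 1)
    for t in (1.0, 2.0, 4.0):
        z = weibull_failure_rate(t, alpha, lam)
        p = conditional_failure_probability(t, 1e-4, alpha, lam)
        # z * dt and the conditional probability should nearly coincide
        print(t, z, z * 1e-4, p)
```

With α = 1 the same function returns the constant rate λ, which corresponds to the "truly random" failures of the exponential distribution.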
The various reliability parameters may be
estimated from relevant operational data.
Estimation techniques are thoroughly discussed
in Høyland and Rausand (1994).

The operational and reliability data are collected
from available operating experience and from
external files where reliability information from
systems with similar design and operating
conditions can be found (e.g. data banks, data
handbooks, field data from own data storage,
manufacturers' recommendations). The external
information available should be considered
carefully before it is used, because such
information is generally available at a much
coarser level than what is indicated in points (2)
and (3) above. The following three points should
be considered before reliability data is used:
- What are the system boundaries for the
  system (analysis item) from which the data
  arrives?
- What are the specific operating and
  maintenance features that may influence
  the data validity?
- Is the time scale used calendar time,
  operating time, or some other time
  concept?

A most valuable source of reliability data is the
OREDA handbook (1997) and the OREDA
database. OREDA contains data from a wide
range of offshore equipment. The data has been
collected mainly from platform maintenance
records from the whole North Sea area and from
the Mediterranean Sea. The handbook presents
generic data, while more detailed, manufacturer
specific data is available in the database. The
OREDA database is, however, available only for
the participants in the OREDA project.

At the outset of the analysis, the relevant
reliability data may often be scarce, because of
little or no operating experience. The initial
information used may, however, later be
adjusted based on updated information and
experience.

In some situations there is a complete lack of
reliability data. This is the case when developing
a maintenance program for new systems. The
maintenance program development starts long
before the equipment enters service. Helpful
sources of information can then be experience
data from similar equipment, directions from
manufacturers and results from testing. The
RCM method will even in this situation provide
useful information. A successful application of
RCM requires an extensive amount of
information. Both qualitative and quantitative
data are required. A systematic approach to the
collection phase is essential. The results of the
total RCM process depend highly on the quality
of the input data.

Step 6: Failure modes, effects and criticality analysis
The objective of this step is to identify the
dominant failure modes of the MSIs identified
during Step 4.

System:                  Performed by:
Ref. drawing no.:        Date:                  Page:    of:

Description of unit: MSI | Operational mode | Function | Failure mode
Effect of failure: Consequence class (S, E, A, C) | Worst case probability (S, E, A, C)
MTTF | Criticality | Failure cause | Failure mechanism | %MTTF | Failure characteristic | Maintenance action | Failure characteristic measure | Recommended interval

Figure 8 RCM FMECA-form


A wide variety of different FMECA forms are
used in the main RCM references. The FMECA
form used in our approach is presented in Figure
8. The various columns in this FMECA form are
discussed below:
MSI: This will typically be the analysis item
number in the assembly hierarchy (tag number),
optionally with a descriptive text.

Operational mode: The MSI may have various
operational modes, for example running and
standby.

Function: For each operational mode, the MSI
may have several functions. A function of a
standby water supply pump is for example to
start upon demand.

Failure mode: A failure mode is the manner by
which a failure is observed, and is defined as
non-fulfillment of one of the equipment
functions.

Effect of failure/Severity class: The effect of a
failure is described in terms of the worst case
outcome with respect to safety (S),
environmental impact (E), production
availability (A), and direct economic cost (C).
The effect can either be specified by means of
consequence classes, or some numerical severity
measure. A failure of an MSI will not
necessarily give a worst case outcome due to
e.g. redundancy, buffer capacities, etc. A
conditional likelihood field is therefore
introduced.

Worst case probability: The worst case
probability is defined as the probability that an
equipment failure will give the worst case
outcome. To obtain a numerical probability
measure, a system model is required. This will
often be inappropriate at this stage of the
analysis, and a descriptive measure may be used.
Proposed classes are serial, redundancy, cold
standby, hot standby, and buffer.

MTTF: Mean time to failure for each failure
mode is recorded. Either a numerical measure or
likelihood classes may be used.

Criticality: The criticality field is used to tag off
the dominant failure modes according to some
criticality measure. A criticality measure should
take failure effect, worst case probability and
MTTF into account. "Yes" is used to tag off the
dominant failure modes.

The information described so far should be
entered for all failure modes. A screening may
now be appropriate, giving only dominant
failure modes, i.e. items with high criticality.
For the dominant failure modes the following
fields are required:

Failure cause: For each failure mode there may
be several failure causes. An MSI failure mode
will typically be caused by one or more
component failures. Note that supporting
equipment to the MSIs entered in the FMECA
form is for the first time considered at this step.
In this context a failure cause may therefore be a
failure mode of a supporting equipment. A "fail
to close" failure of a safety valve may for
example be caused by a broken spring in the
failsafe actuator.

Failure mechanism: For each failure cause, there
is one or several failure mechanisms. Examples
of failure mechanisms are fatigue, corrosion, and
wear.

%MTTF: The MTTF was entered on an MSI
failure mode level. It is also relevant to enter the
MTTF for each failure mechanism. To simplify,
a per cent is given, and MTTF can be calculated
for each failure mechanism. The %MTTF will
obviously be only an approximation since the
failure mechanisms usually are strongly
interdependent.

Failure characteristic: Failure propagation may
be divided into three classes.
1. The failure propagation can be measured by
   one or several (condition monitoring)
   indicators. The failure is referred to as a
   gradual failure.
2. The failure probability is age-dependent, i.e.
   there is a predictable wearout limit. The
   failure is referred to as an ageing failure.
3. Complete randomness. The failure cannot be
   predicted by either condition monitoring
   indicators or by measuring the age of the
   item. The time to failure can only be
   described by an exponential distribution, and
   the failure is referred to as a sudden failure.

Maintenance action: For each failure
mechanism, an appropriate maintenance action
may hopefully be found by the decision logic in
Step 7. This field can thus not be completed
until Step 7 is performed.

Failure characteristic measure: For gradual
failures, the condition monitoring indicators are
listed by name. Ageing failures are described
by an ageing parameter, i.e. the shape
parameter (α) in the Weibull distribution is
recorded.

Recommended maintenance interval: The
identified maintenance action is performed at
intervals of fixed length. The length of the
interval is found in Step 8.

Step 7: Selection of Maintenance Actions
This phase is the most novel compared to other
maintenance planning techniques. A decision
logic is used to guide the analyst through a
question-and-answer process. The input to the
RCM decision logic is the dominant failure
modes from the FMECA in Step 6. The main
idea is for each dominant failure mode to decide
whether a preventive maintenance task is
suitable, or whether it will be best to let the item
deliberately run to failure and afterwards carry
out a corrective maintenance task. There are
generally three reasons for doing a preventive
maintenance task:
(a) to prevent a failure
(b) to detect the onset of a failure
(c) to discover a hidden failure
Only the dominant failure modes are subjected
to preventive maintenance. To obtain
appropriate maintenance tasks, the failure causes
or failure mechanisms should be considered. The
idea of performing a maintenance task is to
prevent a failure mechanism from causing a failure.
Hence, the failure mechanisms behind each of
the dominant failure modes should be entered
into the RCM decision logic to decide which of
the following basic maintenance tasks are
applicable:
1. Continuous on-condition task (CCT)
2. Scheduled on-condition task (SCT)
3. Scheduled overhaul (SOH)
4. Scheduled replacement (SRP)
5. Scheduled function test (SFT)
6. Run to failure (RTF)

Continuous on-condition task (CCT) is a
continuous monitoring of an item to find any
potential failures. An on-condition task is
applicable only if it is possible to detect reduced
failure resistance for a specific failure mode
from the measurement of some quantity.

Example:
A distance gauge might be used to measure the
distance between the switch point and stock
rail to detect that the 3 mm limit will be
reached. At a predefined level (e.g. 2.7 mm),
the system alerts the maintenance crew, which
carries out an appropriate maintenance action.

Scheduled on-condition task (SCT) is a
scheduled inspection of an item at regular
intervals to find any potential failures. There are
three criteria that must be met for an
on-condition task to be applicable:
1. It must be possible to detect reduced failure
   resistance for a specific failure mode.
2. It must be possible to define a potential
   failure condition that can be detected by an
   explicit task.
3. There must be a reasonably consistent age
   interval between the time of potential failure
   and the time of failure.

Example:
A manual inspection every second month will
reveal whether the 3 mm limit is soon being
reached. Appropriate maintenance action can
be issued.

There are two disadvantages of a scheduled
versus a continuous on-condition task:
- The man-hour cost of inspection is often
  larger than the cost of installing the sensor.
- Since the scheduled inspection is carried out
  at fixed points of time, one might miss
  situations where the degradation is faster
  than anticipated.

An advantage of a scheduled on-condition task
is that the human operator is then able to sense
information that a physical sensor will not be
able to detect. This means that traditional
walk-around checks should not be totally skipped
even if sensors are installed.
Condition monitoring is discussed in Nowlan &
Heap (1978), and statistical models are
presented in e.g. Aven (1992) and Valdez-Flores
& Feldman (1989).

Scheduled overhaul (SOH) is a scheduled
overhaul of an item at or before some specified
age limit, and is often called hard time
maintenance.
An overhaul task can be considered applicable to
an item only if the following criteria are met
(Nowlan & Heap 1978):
1. There must be an identifiable age at which
   the item shows a rapid increase in the item's
   failure rate function.
2. A large proportion of the units must survive
   to that age.
3. It must be possible to restore the original
   failure resistance of the item by reworking it.

Examples:
Rehabilitation of wooden sleepers' borings
every three years. Lubrication of the chair/slideplate every three days. Cleaning every
month.

Scheduled replacement (SRP) is scheduled
discard of an item (or one of its parts) at or
before some specified age limit. A scheduled
replacement task is applicable only under the
following circumstances (Nowlan & Heap
1978):
1. The item must be subject to a critical failure.
2. Test data must show that no failures are
   expected to occur below the specified life
   limit.
3. The item must be subject to a failure that has
   major economic (but not safety)
   consequences.
4. There must be an identifiable age at which
   the item shows a rapid increase in the failure
   rate function.
5. A large proportion of the units must survive
   to that age.

Example:
Replacement of the motor every year. The
motor is then either overhauled to a "good as
new" condition, or replaced in the
maintenance depot.

The criteria given for using the various tasks
should only be considered as guidelines for
selecting an appropriate task. A task might be
found appropriate even if some of the criteria are
not fulfilled.

Scheduled function test (SFT) is a scheduled
inspection of a hidden function to identify any
failure. A scheduled function test task is
applicable to an item under the following
conditions (Nowlan & Heap 1978):
1. The item must be subject to a functional
   failure that is not evident to the operating
   crew during the performance of normal
   duties.
2. The item must be one for which no other type
   of task is applicable and effective.

Example:
Sighting or hammer blow every year to detect
loose lockspikes fastening chairs/baseplates on
wooden sleepers.

Run to failure (RTF) is a deliberate decision to
run to failure because the other tasks are not
possible or the economics are less favorable.
In many situations one maintenance task may
prevent several failure mechanisms. For
example function testing of an ESD-valve (with
an offline function) will reveal any failure
mechanisms causing a hidden failure. Hence in
some situations it is better to put failure modes
rather than failure mechanisms into the RCM
decision logic.
Note also that if a failure cause for a dominant
failure mode corresponds to a supporting
equipment, the supporting equipment should be
defined as the item to be entered into the
RCM decision logic.
The RCM decision logic is shown in Figure 9.
Note that this logic is much simpler than those
found in standard RCM references, e.g.
Moubray (1991). It should be emphasized that
such a logic can never cover all situations. For
example in the situation of a hidden function
with ageing failures, a combination of scheduled
replacements and function tests is required.

Does a failure alerting measurable indicator exist?
  Yes: Is continuous monitoring feasible?
         Yes: Continuous on-condition task (CCT)
         No:  Scheduled on-condition task (SCT)
  No:  Is ageing parameter α > 1?
         Yes: Is overhaul feasible?
                Yes: Scheduled overhaul (SOH)
                No:  Scheduled replacement (SRP)
         No:  Is the function hidden?
                Yes: Scheduled function test (SFT)
                No:  No PM activity found (RTF)

Figure 9 Maintenance Task Assignment/Decision logic
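
The decision logic of Figure 9 is simple enough to be written as a small function. The Python sketch below follows the question sequence as it can be read from the figure (indicator, continuous monitoring, ageing parameter, overhaul, hidden function); the function name, argument names and the use of the task abbreviations as return values are assumptions made for illustration.

```python
def select_maintenance_task(indicator_exists: bool,
                            continuous_monitoring_feasible: bool,
                            ageing_parameter: float,
                            overhaul_feasible: bool,
                            function_hidden: bool) -> str:
    """Return the task abbreviation suggested by the simplified decision logic.

    CCT/SCT: continuous/scheduled on-condition task, SOH: scheduled overhaul,
    SRP: scheduled replacement, SFT: scheduled function test, RTF: run to failure.
    """
    # Question 1: does a failure alerting measurable indicator exist?
    if indicator_exists:
        return "CCT" if continuous_monitoring_feasible else "SCT"
    # Question 2: is the ageing (Weibull shape) parameter larger than 1?
    if ageing_parameter > 1:
        return "SOH" if overhaul_feasible else "SRP"
    # Question 3: is the function hidden?
    return "SFT" if function_hidden else "RTF"


if __name__ == "__main__":
    # Hypothetical failure mechanism: no indicator, clear ageing, no overhaul possible
    print(select_maintenance_task(False, False, 2.5, False, False))
```

As noted in the text, such a logic cannot cover all situations; a hidden function with ageing failures, for example, ends in SOH or SRP here although a combination with function tests is required.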


Step 8: Determination of Maintenance Intervals
The RCM decision logic was qualitatively used
to establish preventive maintenance tasks. These
tasks are performed at times kτ, k = 1, 2, ....
Hence, for each task, the optimal interval τ
should be decided. When balancing costs, we
realize that the preventive maintenance cost
increases with decreasing τ, and the cost of
unplanned failures decreases with decreasing τ.
In this presentation, only three simple models are
discussed. Model 1 and Model 2 are appropriate
models for scheduled rework/replacement tasks,
while Model 3 may be used for scheduled
function testing. For more general models, see
Valdez-Flores & Feldman (1989).

To determine the optimal interval τ some crucial
information is required. First we need
information about cost structures, i.e. the total
cost of the preventive maintenance action and
the total cost of a failure which the maintenance
action was supposed to prevent. Note that the
models are developed for single unit systems.
For redundant systems, a unit failure need not
give a system failure. If the cost of a system
failure is cs, then the cost element to use in the
model should be p·cs + cr, where p is the
probability that a failure of the (redundant) unit
will cause a system failure, and cr is the
repair/replacement cost of the unit.
In addition to cost structures, information about
the actual failure distribution is necessary. This
information will typically be mean time to
failure (MTTF), and the shape parameter α for
units where ageing, wear, corrosion etc. are
present. Note that the failure information should
be obtained at a failure cause level, i.e.
corresponding to the failure cause the preventive
maintenance task is designed for.
Model 1 Minimal repair policy
The minimal repair policy describes a single
unit system subjected to preventive replacement
at periods of fixed length. To be formal, the unit
is put into operation at time t = 0, and replaced
at times kτ for k = 1, 2, .... If the unit fails in an
interval ((k-1)τ, kτ], a minimal repair, see e.g.
Høyland and Rausand (1994), is performed. The
situation where the unit is replaced upon a
failure in ((k-1)τ, kτ], i.e. a block replacement
policy, is discussed in Model 2.
The total cost of a minimal repair is denoted cm
and the total cost of a replacement is cp. It will
be convenient to introduce the cost ratio cp/cm.
Typically cp/cm >> 1, or at least cp/cm > 1. In
special situations we can even have cp/cm < 1.
The expected cost per unit time is:

$$C(\tau) = \frac{c_p + c_m W(\tau)}{\tau}$$

where W(t) = E(N(t)) is the expected number of
failures in (0, t].
We will consider a so-called Weibull process
with W(t) = (λt)^α. In this case the time from t = 0
until the first failure has a Weibull distribution
with survivor function R(t) = exp(-(λt)^α).
It can be shown that the expected cost per unit
time, C(τ), is minimized when:

$$\tau = \frac{1}{\lambda}\left(\frac{c_p}{c_m(\alpha - 1)}\right)^{1/\alpha}$$

provided α > 1.
Hence, to optimize the replacement interval,
estimates for the parameters cm, cp, α and λ are
required. cm is the total cost of a minimal repair,
including any harm to material, personnel and
environment. Assessing a value of cm may
therefore cause controversies. α and λ are the
parameters in the failure distribution of the item.
Often it is more convenient to specify the failure
distribution in terms of mean time to failure
(MTTF) and the shape parameter α, yielding:

$$\tau = \frac{\mathrm{MTTF}}{\Gamma(1/\alpha + 1)}\left(\frac{c_p}{c_m(\alpha - 1)}\right)^{1/\alpha}$$
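
As a worked illustration of Model 1, the Python sketch below evaluates the cost rate and the optimal interval for a Weibull process. It assumes W(t) = (λt)^α and a cost rate (c_p + c_m W(τ))/τ, from which setting the derivative to zero gives τ = (1/λ)(c_p/(c_m(α - 1)))^(1/α); the function names and the numerical values are hypothetical.

```python
import math


def cost_rate_minimal_repair(tau: float, c_p: float, c_m: float,
                             alpha: float, lam: float) -> float:
    """Expected cost per unit time: (c_p + c_m * (lam * tau)**alpha) / tau."""
    return (c_p + c_m * (lam * tau) ** alpha) / tau


def optimal_interval_minimal_repair(c_p: float, c_m: float,
                                    alpha: float, lam: float) -> float:
    """Interval minimizing the cost rate; only defined for alpha > 1."""
    if alpha <= 1:
        raise ValueError("no finite optimum unless alpha > 1")
    return (1.0 / lam) * (c_p / (c_m * (alpha - 1.0))) ** (1.0 / alpha)


def optimal_interval_from_mttf(c_p: float, c_m: float,
                               alpha: float, mttf: float) -> float:
    """Same optimum, with the scale given through MTTF = Gamma(1/alpha + 1) / lam."""
    lam = math.gamma(1.0 / alpha + 1.0) / mttf
    return optimal_interval_minimal_repair(c_p, c_m, alpha, lam)


if __name__ == "__main__":
    # Hypothetical numbers: replacement cost 1, minimal repair cost 4
    tau = optimal_interval_minimal_repair(c_p=1.0, c_m=4.0, alpha=2.0, lam=1.0)
    print(tau, cost_rate_minimal_repair(tau, 1.0, 4.0, 2.0, 1.0))
```

Evaluating the cost rate slightly below and above the returned interval is a quick way to confirm that the value is indeed a minimum.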
Table 1 Optimal replacement interval relative to MTTF. For a given value of α and of the cost ratio cu/cp, the table
entry should be multiplied with MTTF to give the optimum replacement interval length

        Cost ratio cu/cp
  α       2                                   10     20     50    100    200
 1.2   13.40  12.20  12.95  12.62  2.050   .897   .393   .165   .090   .050
 1.5    8.19   7.97   1.22    .85   .590   .432   .253   .133   .083   .052
 1.7    6.60   1.59    .83    .66   .503   .389   .247   .141   .093   .061
 2.0    4.84    .86    .67    .57   .464   .377   .259   .161   .113   .080
 2.5     .99    .71    .60    .54   .461   .394   .294   .202   .152   .115
 3.0     .82    .67    .59    .54   .478   .421   .331   .242   .192   .152
 4.0     .75    .66    .61    .57   .523   .476   .398   .316   .265   .223
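
Entries such as those in Table 1 can be approximated numerically. The following Python sketch is one possible approach and not necessarily the one used to produce the table: it assumes the block replacement cost rate (c_p + c_u W(τ))/τ of Model 2, approximates the renewal function W(t) by a simple discretization of the renewal equation W(t) = F(t) + ∫ W(t - x) dF(x), and searches a grid for the minimizing interval. All function names, the step count and the search range are illustrative assumptions.

```python
import math


def weibull_cdf(t: float, alpha: float, lam: float) -> float:
    """Weibull distribution function F(t) = 1 - exp(-(lam * t)**alpha)."""
    return 1.0 - math.exp(-((lam * t) ** alpha))


def renewal_function(cdf, t_max: float, n_steps: int):
    """Approximate W on the grid 0, h, ..., t_max from the renewal equation.

    Uses W_n = F_n + sum_j W_{n-j} * (F_j - F_{j-1}), a first-order scheme
    whose error shrinks as the step h = t_max / n_steps is reduced.
    """
    h = t_max / n_steps
    F = [cdf(i * h) for i in range(n_steps + 1)]
    W = [0.0] * (n_steps + 1)
    for n in range(1, n_steps + 1):
        conv = 0.0
        for j in range(1, n + 1):
            conv += W[n - j] * (F[j] - F[j - 1])
        W[n] = F[n] + conv
    return h, W


def optimal_block_interval(cost_ratio: float, alpha: float, mttf: float = 1.0,
                           t_max_factor: float = 2.0, n_steps: int = 800) -> float:
    """Grid search for the interval minimizing (1 + cost_ratio * W(tau)) / tau.

    cost_ratio is c_u / c_p; the search is limited to (0, t_max_factor * mttf],
    so optima beyond that range are reported at the upper boundary.
    """
    lam = math.gamma(1.0 / alpha + 1.0) / mttf
    h, W = renewal_function(lambda t: weibull_cdf(t, alpha, lam),
                            t_max_factor * mttf, n_steps)
    best_tau, best_cost = h, float("inf")
    for n in range(1, n_steps + 1):
        cost = (1.0 + cost_ratio * W[n]) / (n * h)  # cost rate in units of c_p
        if cost < best_cost:
            best_tau, best_cost = n * h, cost
    return best_tau


if __name__ == "__main__":
    # With mttf = 1 the result is directly comparable with a Table 1 entry
    print(optimal_block_interval(cost_ratio=10.0, alpha=2.0))
```

For small cost ratios and α close to 1 the cost curve is very flat, so a numerical optimum is sensitive to the discretization and to the search range.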

Model 2 Block replacement policy


The block replacement policy describes a
singleunit system put into operation at time t =

0. The unit is replaced at times kt for k=1,2,. . .


and at failures. The cost of a planned
replacement is denoted cp, and the total cost of
19

Rausand&Vatn

Reliability Centered Maintenance

an unplanned replacement, i.e. a failure is cu. Let


W(t) denote the renewal function, see e.g.
Hyland and Rausand (1994), for the lifetime
distribution of the unit. The average cost per unit
time is:
&( ) =

Point where we can find out that


LWLVIDLOLQJ("potential failure")

Acceptable
deviation

If the times between failures are Weibull distributed, the renewal function W(t) entering the
numerator cp + cu W(τ) of the cost expression can be found by the algorithm given by Smith and
Leadbetter (1963). Numerical methods are, however, required to find the optimal interval τ. In
Table 1, numerical values for the optimal replacement interval are given relative to MTTF.
In order to use Table 1 the value of α must be specified. During the data analysis in Step 5 the
value of α should have been found by e.g. expert judgment.

Model 3 - Functional testing

This model is appropriate for scheduled function testing. Consider a protective device with a
constant failure rate λ. A functional test of the device is performed at times kτ for
k = 1, 2, ... The cost of a functional test is ct. If a failure is detected upon a test, the
device is replaced at a cost of cr. Further assume that the device is demanded with a frequency
f, i.e. the rate of critical situations. A hazardous situation occurs if the protective device
fails upon a demand. The total cost of such a situation is ch.
The expected cost per unit time is:

\[ C(\tau) \approx \frac{c_t}{\tau} + c_r\left(\lambda - \frac{\lambda^2 \tau}{2}\right) + f c_h \frac{\lambda \tau}{2} \]

yielding an optimal interval τ:

\[ \tau = \sqrt{\frac{2 c_t}{\lambda\,(f c_h - \lambda c_r)}} = \sqrt{\frac{2\,\mathrm{MTTF}\, c_t}{f c_h - c_r/\mathrm{MTTF}}} \]
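The closed-form test interval for Model 3 can be checked numerically. The sketch below assumes the approximate cost rate C(τ) ≈ ct/τ + cr(λ − λ²τ/2) + f·ch·λτ/2, which is reasonable only when λτ is small; the function names and the example figures are hypothetical and not taken from the source.

```python
import math


def functional_test_cost(tau, lam, c_t, c_r, f, c_h):
    """Approximate expected cost per unit time for test interval tau (small lam*tau assumed)."""
    test_cost = c_t / tau
    replacement_cost = c_r * (lam - lam ** 2 * tau / 2.0)
    hazard_cost = f * c_h * lam * tau / 2.0
    return test_cost + replacement_cost + hazard_cost


def optimal_test_interval(lam, c_t, c_r, f, c_h):
    """tau = sqrt(2*c_t / (lam * (f*c_h - lam*c_r))); needs f*c_h > lam*c_r."""
    margin = f * c_h - lam * c_r
    if margin <= 0:
        raise ValueError("no finite optimum: f*c_h must exceed lam*c_r")
    return math.sqrt(2.0 * c_t / (lam * margin))


if __name__ == "__main__":
    # Hypothetical figures: failure rate per hour, costs in arbitrary money units.
    lam, c_t, c_r, f, c_h = 1e-4, 50.0, 1000.0, 1e-3, 1e6
    tau = optimal_test_interval(lam, c_t, c_r, f, c_h)
    print(f"optimal test interval: {tau:.1f} hours")
    print(f"cost rate at optimum: {functional_test_cost(tau, lam, c_t, c_r, f, c_h):.3f}")
```

The interval shortens when the hazard cost ch or the demand rate f grows, and lengthens when testing becomes more expensive.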


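Model 4 - Scheduled on-condition tasks, the concept of P-F intervals

The idea behind a scheduled on-condition task is that a coming failure is alerted by some
degradation in performance, or that some indicator variable is alerting about the failure.

Figure 10 P-F interval (performance/condition versus time, showing the target value, the
acceptable deviation, the point P where we can find out that it is failing ("potential
failure"), the failure point F, and the P-F interval between them)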


In Figure 10 the performance is viewed as a function of time. The point P is the first point
in time where we are able to reveal the onset of a failure. When the performance is below some
limiting value a failure will occur. The time from when a potential failure is detectable until
a failure occurs is denoted the P-F interval. The length of the P-F interval is assumed to vary
from time to time, and is therefore modeled as a random variable. In order to establish an
optimal maintenance interval, τ, the following quantities must be defined:
D:     Delay time, i.e. the time from when a potential failure is revealed until an appropriate
       corrective action is completed. For simplicity the delay time is considered as a
       deterministic quantity.
ci:    Cost of (manual) inspection.
cu:    Cost of (unplanned) failure.
TPF:   P-F interval (random variable).
μ:     E(TPF) = Mean value of P-F interval.
σ:     SD(TPF) = Standard deviation of P-F interval.
MTTF:  Mean time to failure if no corrective maintenance is carried out.
The expected cost per unit time is given by:

\[ C(\tau) = \frac{c_i + c_u \int_0^{\tau} \lambda e^{-\lambda t} \Pr(T_{PF} < \tau - t + D)\,dt}{\tau} \qquad (1) \]

where λ = 1/(MTTF − μ), and we have assumed that the time from when the component is in a
perfect state until a potential failure is revealed is exponentially distributed.
In order to optimize Eq. (1), numerical values are required for ci, cu, MTTF, D, μ and σ.
Numerical methods are usually required to optimize Eq. (1). The calculations will be simplified
if we choose a distribution for TPF with a closed form of the cumulative distribution function.
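A minimal sketch of how Eq. (1) might be evaluated and optimized numerically is given below. It assumes that the cost rate is C(τ) = [ci + cu ∫ λ e^(−λt) Pr(TPF < τ − t + D) dt] / τ with the integral taken over (0, τ), and, purely for illustration, that TPF is exponentially distributed with mean μ so that its distribution function has a closed form. The function names, the midpoint integration rule and the candidate grid are assumptions, not part of the source.

```python
import math


def pf_cdf_exponential(x, mu):
    """Pr(T_PF < x) for an exponential P-F interval with mean mu (illustrative assumption)."""
    return 1.0 - math.exp(-x / mu) if x > 0 else 0.0


def failure_probability(tau, lam, mu, delay, n=2000):
    """Midpoint-rule value of integral_0^tau lam*exp(-lam*t) * Pr(T_PF < tau - t + delay) dt."""
    h = tau / n
    total = 0.0
    for k in range(n):
        t = (k + 0.5) * h
        total += lam * math.exp(-lam * t) * pf_cdf_exponential(tau - t + delay, mu)
    return total * h


def cost_rate(tau, c_i, c_u, mttf, mu, delay):
    """Expected cost per unit time for inspection interval tau, with lam = 1/(MTTF - mu)."""
    lam = 1.0 / (mttf - mu)
    return (c_i + c_u * failure_probability(tau, lam, mu, delay)) / tau


def optimal_inspection_interval(c_i, c_u, mttf, mu, delay, candidates):
    """Return the candidate interval with the lowest cost rate (plain grid search)."""
    return min(candidates, key=lambda tau: cost_rate(tau, c_i, c_u, mttf, mu, delay))


if __name__ == "__main__":
    # Hypothetical figures: time in hours, costs in arbitrary money units.
    best = optimal_inspection_interval(1.0, 100.0, 1000.0, 100.0, 5.0, range(1, 201))
    print("best inspection interval on the grid:", best)
```

Other distributions for TPF can be tried by swapping `pf_cdf_exponential` for another distribution function; the grid search itself does not depend on that choice.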
Model 5 - Continuous on-condition tasks

The idea of continuous on-condition monitoring is to measure one or more indicator variables.
The reading of the component in this manner can be used to detect a coming failure. The
variable being monitored is denoted X(t) in Figure 11.

Figure 11 Continuous monitoring (X(t) versus time, with the "Action Limit" and "Failure Limit"
levels and the failure point marked)

In Figure 11 the deteriorating process is shown. Here X(t) can be interpreted as the cumulative
damage at time t. When the damage exceeds some limit, a failure occurs. In Figure 11 we have
also shown an action limit, at which a maintenance action is to be taken. The challenge here is
to decide the optimal action limit. No general approach seems applicable here since the
solution is highly dependent on how X(t) is modeled. Aven (1992) discusses one method where an
underlying shock model is assumed.

Step 9: Preventive maintenance comparison analysis

Two overriding criteria for selecting maintenance tasks are used in RCM. Each task selected
must meet two requirements:
• It must be applicable
• It must be effective
Applicability: meaning that the task is applicable in relation to our reliability knowledge
and in relation to the consequences of failure. If a task is found based on the preceding
analysis, it should satisfy the applicability criterion.
A PM task will be applicable if it can eliminate a failure, or at least reduce the probability
of occurrence to an acceptable level (Hoch 1990), or reduce the impact of failures.
Cost-effectiveness: meaning that the task does not cost more than the failure(s) it is going
to prevent.
The PM task's effectiveness is a measure of how well it accomplishes that purpose and whether
it is worth doing. Clearly, when evaluating the effectiveness of a task, we are balancing the
cost of performing the maintenance against the cost of not performing it. In this context, we
may refer to the cost as follows (Hoch 1990):
1. The cost of a PM task may include:
• the risk of maintenance personnel error, e.g. maintenance introduced failures
• the risk of increasing the effect of a failure of another component while the one is out of
  service
• the use and cost of physical resources
• the unavailability of physical resources elsewhere while in use on this task
• production unavailability during maintenance
• unavailability of protective functions during maintenance of these
• the more maintenance you do, the more risk you will expose your maintenance personnel to
2. On the other hand, the cost of a failure may include:
• the consequences of the failure should it occur (i.e. loss of production, possible violation
  of laws or regulations, reduction in plant or personnel safety, or damage to other equipment)
• the consequences of not performing the PM task even if a failure does not occur (i.e. loss
  of warranty)
• increased premiums for emergency repairs (such as overtime, expediting costs, or high
  replacement power cost).
Balancing the various cost elements to achieve a
global optimum will always be a challenge. The
conceptual RCM model in Figure 1 may be a
starting point. If such a model could be
established, and the various cost elements
incorporated, the trade-off analysis is reduced to
an optimization problem with a precisely
defined mathematical model.
Often the resources available for the RCM analysis do not permit building such an overall
model, hence we cannot expect to achieve a global optimum. Sub-optimization can to some extent
be achieved by simplifying the model in Figure 1. For example, one could consider only one
consequence at a time and/or only one maintenance task at a time.

Step 10: Treatment of non-MSIs
In Step 4 critical items (MSIs) were selected for
further analysis. A remaining question is what to
do with the items which are not analyzed. For
plants already having a maintenance program it
is reasonable to continue this program for the
non-MSIs. If a maintenance program is not in
effect, maintenance should be carried out
according to vendor specifications if they exist,
else no maintenance should be performed. See
Paglia et al. (1991) for further discussion.

Step 11: Implementation
A necessary basis for implementing the result of
the RCM analysis is that the organizational and
technical maintenance support functions are
available. A major issue is therefore to ensure
the availability of the maintenance support
functions. The maintenance actions are typically
grouped into maintenance packages, each
package describing what to do, and when to do
it.
As indicated at the outset of this paper, many accidents are related to maintenance work.
When implementing a maintenance program it is therefore of vital importance to consider the
risk associated with the execution of the maintenance work. Checklists could be used to
identify potential risk involved with maintenance work:
• Can maintenance people be injured during the maintenance work?
• Is a work permit required for execution of the maintenance work?
• Are means taken to avoid problems related to re-routing, by-passing etc.?
• Can failures be introduced during maintenance work?
• etc.
Task analysis, see e.g. Kirwan & Ainsworth (1992), may be used to reveal the risk involved
with each maintenance job. See Hoch (1990) for a further discussion on implementing the RCM
analysis results.

Step 12: In-service data collection and updating
As mentioned earlier, the reliability data we have access to at the outset of the analysis may
be scarce, or even next to none. In our opinion, one of the most significant advantages of RCM
is that we systematically analyze and document the basis for our initial decisions, and, hence,
can better utilize operating experience to adjust those decisions as operating experience data
is collected. The full benefit of RCM is therefore only achieved when operation and maintenance
experience is fed back into the analysis process.
The process of updating the analysis results is also important due to the fact that nothing
remains constant, best seen by considering the following arguments (Smith 1993):
• The system analysis process is not perfect and requires periodic adjustments.
• The plant itself is not a constant since design, equipment and operating procedures may
  change over time.
• Knowledge grows, both in terms of understanding how the plant equipment behaves and how
  technology can increase availability and reduce costs.



Reliability trends are often measured in terms of a non-constant ROCOF (rate of occurrence of
failures), see e.g. Høyland & Rausand (1994). The ROCOF measures the probability of failure as
a function of calendar time, or global time since the plant was put into operation. The ROCOF
may change over time, but within one cycle the ROCOF is assumed to be constant. This means that
analysis updates should be so frequent that the ROCOF is fairly constant within one period.
In contrast to the ROCOF, the failure rate, or FOM (force of mortality), measures the
probability of failure as a function of local time, i.e. the time elapsed since the last
repair/replacement. The FOM, however, cannot be considered constant; if it were, there would be
no rationale for performing scheduled replacement/repair.
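The last point can be made concrete with the block replacement cost rate of Model 2, C(τ) = (cp + cu W(τ))/τ. The short derivation below assumes a constant FOM λ (exponential lifetimes), for which the renewal function is W(τ) = λτ:

```latex
\[
C(\tau) = \frac{c_p + c_u W(\tau)}{\tau}
        = \frac{c_p + c_u \lambda \tau}{\tau}
        = \frac{c_p}{\tau} + c_u \lambda ,
\qquad
\frac{dC}{d\tau} = -\frac{c_p}{\tau^{2}} < 0 .
\]
```

The cost rate decreases monotonically towards cu λ as τ grows, so under a constant FOM no finite replacement interval is optimal and scheduled replacement only adds cost.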
The updating process should be concentrated on three major time perspectives (Sandtorv &
Rausand 1991):
• Short term interval adjustments
• Medium term task evaluation
• Long term revision of the initial strategy
The short term update can be considered as a revision of previous analysis results. The input
to such an analysis is updated reliability figures, either due to more data or due to updated
data because of reliability trends. This analysis should not require many resources, as the
framework for the analysis is already established. Only Step 5 and Step 8 in the RCM process
will be affected by short term updates.
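One possible way to carry out such a short term update, offered here as an assumption rather than as the method of the source, is a gamma-Poisson (Bayesian) update of a constant failure rate: the initial expert judgment acts as a prior, and the failures observed in service sharpen it. The function name and the idea of expressing the strength of the prior as an equivalent number of operating hours are illustrative choices.

```python
def update_failure_rate(prior_mttf, prior_weight_hours, n_failures, hours_observed):
    """Gamma-Poisson update of a constant failure rate.

    The prior is Gamma(shape=a, rate=b) with mean a/b = 1/prior_mttf, where
    b = prior_weight_hours expresses how many hours of operating experience
    the expert judgment is considered to be worth (an assumption made by the analyst).
    Returns the posterior mean failure rate and the MTTF it implies.
    """
    b = prior_weight_hours
    a = prior_weight_hours / prior_mttf
    a_post = a + n_failures          # failures observed add to the shape
    b_post = b + hours_observed      # accumulated operating time adds to the rate
    rate = a_post / b_post
    return rate, 1.0 / rate


if __name__ == "__main__":
    # Hypothetical case: expert MTTF of 1000 h weighted as 2000 h of experience,
    # followed by 3 failures in 1000 h of operation.
    rate, mttf = update_failure_rate(1000.0, 2000.0, 3, 1000.0)
    print(f"updated failure rate: {rate:.5f} per hour, implied MTTF: {mttf:.0f} h")
```

The updated figure can then be fed back into the interval models of Step 8, which is why only the data analysis and the interval determination need to be revisited.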
The medium term update will also review the basis for the selection of maintenance actions in
Step 7. Analysis of maintenance experience may identify significant failure causes not
considered in the initial analysis, requiring an updated FMECA in Step 6. The medium term
update therefore affects Steps 5 to 8.
The long term revision will consider all steps in the analysis. It is not sufficient to
consider only the system being analyzed; it is required to consider the entire plant with its
relations to the outside world, e.g. contractual considerations, new laws regulating
environmental protection etc.

DISCUSSIONS AND CONCLUSIONS

The following summarizes some main benefits, drawbacks and problems encountered during
application of the RCM method in some offshore case studies.

General benefits
Cross-discipline utilization of knowledge: To fully utilize the benefits of the RCM concept,
one needs contributions from a wider scope of disciplines than what is common practice. This
means that an RCM analysis requires contributions from the following three discipline
categories working closely together:
1. System/reliability analyst
2. Maintenance/operation specialist
3. Designer/manufacturer
Not all of these categories need to take part in the analysis on a full-time basis. They
should, however, be deeply involved in the process during pre- and post-analysis review
meetings, and in the quality review of the final results. The result of this is that knowledge
is extracted and commingled across traditional discipline borders. It may, however, cost more
at the outset to engage all these personnel categories.
Traceability of decisions: Traditionally, PM programs tend to be "cemented". After some time
one hardly knows on what basis the initial decisions were made, and therefore does not want to
change those decisions. In the RCM concept all decisions are taken based on a set of analytical
steps, all of which should be documented in the analysis. When operating experience
accumulates, one may go back and see on what basis the initial decisions were taken, and adjust
the tasks and intervals as required based on the operating experience. This is especially
important for initial decisions based on scarce data.
Recruitment of skilled personnel for maintenance planning and execution: The RCM way of
planning and updating maintenance requires more professional skills, and is therefore a
greater challenge for skilled engineers. It also provides the engineers with a broader and
more attractive way of working with maintenance than what sometimes is common today.
Cost aspects: As indicated, RCM will require more effort, both in skills and man-hours, when
first being introduced in a company. It is, however, documented by many companies and
organizations that the long term benefits will far outweigh the initial extra costs. One
problem is that the return on investment has to be looked upon in a long term perspective,
something that the management is not always willing to take a chance on.
Benefits related to PM-program achievement: Based on the case studies we have carried out, and
experience published by others, the general achievements of RCM in relation to traditional
PM-programs may be summarized as follows:
• By careful analysis of the failure consequences, the amount of PM tasks can often be reduced,
  or replaced by corrective tasks or more dedicated tasks. We have therefore chosen to include
  corrective maintenance as a possible outcome of the RCM analysis.
• Emphasis has been changed from periodic rework or overhaul tasks of the large
  assemblies/units to more dedicated object oriented tasks. Consequently, condition monitoring
  has been more frequently used to detect specific failure modes.
• Requirement for spare parts has been reduced as a result of a better justification for
  replacements.
• Design solutions have been discovered that were not optimal from a safety and plant economic
  point of view.


Problem areas in the analysis
Identification of Maintenance Significant Items: In some cases there may be very little to
achieve by limiting the analysis to only include the MSIs. Smith (1993) argues that
concentrating on critical components (MSIs) is plainly wrong and that in most analyses it
excludes important equipment from appropriate attention. He writes (page 82):
    ". . . we should be very careful not to prematurely discard components as non-critical
    until we have truly identified their proper tie and priority status to the functions and
    functional failures."
Other authors argue that the main objective of the RCM process is to create a basis for
maintenance evaluation and task adjustment. The selection of MSIs will reduce this basis and
result in an insufficient evaluation process.
The rationale for working with the MSIs only was to reduce the analysis work. Thus there is
always a risk of an insufficient analysis when the non-MSIs are not subjected to a formal
analysis. In our presentation the criterion for classifying an analysis item as non-MSI is:
"The item should not affect any of the critical system failure modes" (Step 4). By using such
an approach the criterion for disregarding an item is traceable, and may be reevaluated later.
Further, we believe that this criterion makes very good sense.
Lack of reliability data: As indicated, the full benefit of the RCM concept can only be
achieved when we have access to reliability data for the items being analyzed. Is RCM then
worthless if we have no or very poor data at the outset? The answer to this question is no;
even in this case the RCM approach will provide some useful information for assessing
maintenance tasks. PM intervals will, however, not be available. As a result of the analysis,
we should at least have identified the following:
• We know whether the failure involves a safety hazard to personnel, environment or equipment
• We know whether the failure affects production availability
• We know whether the failure is evident or hidden
• We have a better criterion for evaluating cost-effectiveness
Lack of reliability data will always be a problem. First of all there are problems with
getting access to operational data of sufficient quality. Next, even if we have data, it is
not straightforward to obtain reliability data from the operational data. Before we discuss
some problems with collecting and using operational data, it should be emphasized that there
will never be a complete lack of reliability figures. Even if no operational data is
available, expert judgment will be available. However, the uncertainty in the reliability
figures can be very large.
Based on our various engagements in the OREDA project and other data collection projects on
offshore installations, we have experienced the following common difficulties related to
acquisition of failure data:
• Data is generally very repair oriented and not directed towards describing failure causes,
  modes and effects.
• How the failure was detected is rarely stated (e.g. by inspection, monitoring, PM, tests,
  casual observation). This information is very useful in order to select applicable tasks.
• Failure modes can sometimes be deduced, but this is generally left to the data collector to
  interpret.
• The true failure cause is rarely found, but the failure symptom can to some extent be traced.
• Failure effect on the lower indenture level is reasonably well described, but may often be
  missing on higher indenture level (system level).
• Operating conditions when the failure occurred are frequently missing or vaguely stated.
As mentioned above, there are often problems with estimating reliability data from the
operational data. Reliability data comprises MTTR and MTTF figures together with the failure
rate function. Reasonable estimates for MTTR and MTTF may be found by various averaging
techniques. The failure rate function, i.e. the ageing parameter, is much harder to obtain.
Available estimation techniques require no reliability trend (in calendar time) for the
unit(s) being considered. Further, if several units are used to enlarge the data set, these
units should be operated identically under the same environmental conditions. The requirements
above are very seldom fulfilled, hence the estimation techniques may collapse. We therefore
recommend use of expert judgment to establish appropriate ageing parameters. The ageing
parameter is a measure of how deterministic a failure is, and it is reasonable to believe that
this measure is relatively constant for each failure cause. On the other hand, it seems
meaningless to establish a general set of recommended MTTF-figures for the various failure
mechanisms.
Trade-off analyses: There are four major criteria for the assessment of the consequences of a
failure: safety, environment, production availability, and economic losses. During the
analysis, we have to quantify these measures to some extent to be able to use them as decision
criteria. Further, a trade-off analysis is required to balance each means against the
different consequences. Referring to Figure 1 we need to consider the effect of the
maintenance tasks M1, M2, ... on the consequences C1, C2, ... This will require comprehensive
reliability models. Further, the transformation of the consequences C1, C2, ... into a
unidimensional loss function is at best a difficult and controversial task. A framework for
dealing with these problems is given in Vatn et al. (1996).
Assessing proper interval: The RCM concept is very valuable in assessing the proper type of PM
task, but traditionally RCM does not include any tool for deciding optimal intervals. The "new"
framework for RCM given in Figure 1, together with the standard PM-models listed under Step 8,
are believed to form a very sound basis for deciding optimal intervals.

Conclusions
RCM is not a simple and straightforward way of optimizing maintenance, but it ensures that one
does not jump to conclusions before all the right questions are asked and answers given. RCM
can in many respects be compared with Quality Assurance (QA). By rephrasing the definition of
QA, RCM can be defined as:
    All systematic actions required to plan and verify that the efforts spent on preventive
    maintenance are applicable and cost-effective.
Thus, RCM does not contain any basically new method. Rather, RCM is a more structured way of
utilizing the best of several methods and disciplines. Malik (1990) postulates: ". . . there
is more isolation between practitioners of maintenance and the researchers than in any other
professional activity". We see the RCM concept as a way to reduce this isolation by closing
the gap between the traditionally more design related reliability methods and the practically
oriented operating and maintenance personnel.

REFERENCES
R. T. Anderson and L. Neri. Reliability-Centered Maintenance. Management and Engineering
    Methods. Elsevier Applied Science, London, 1990.
T. Aven. Reliability and Risk Analysis. Elsevier Science Publishers, London, 1992.
K. M. Blache and A. B. Shrivastava. Defining failure of manufacturing machinery & equipment.
    Proceedings Annual Reliability and Maintainability Symposium, pages 69-75, 1994.
B. S. Blanchard and W. J. Fabrycky. System Engineering and Analysis. Prentice-Hall, Inc.,
    Englewood Cliffs, New Jersey 07632, 1981.
BS 5760-5. Reliability of systems, equipments and components; Part 5: Guide to failure modes,
    effects and criticality analysis (FMEA and FMECA). British Standards Institution, London,
    1991.
N. Cross. Engineering Design Methods: Strategies for Product Design. John Wiley & Sons,
    Chichester, 1994.
R. R. Hoch. A Practical Application of Reliability Centered Maintenance. The American Society
    of Mechanical Engineers, 90-JPGC/Pwr-51, Joint ASME/IEEE Power Gen. Conf., Boston, MA,
    21-25 Oct., 1990.
A. Høyland and M. Rausand. Reliability Theory; Models and Statistical Methods. John Wiley &
    Sons, New York, 1994.
IEC 50(191). International Electrotechnical Vocabulary (IEV) - Chapter 191 - Dependability and
    quality of service. International Electrotechnical Commission, Geneva, 1990.
IEC 812. Analysis Techniques for System Reliability - Procedures for Failure Modes and Effects
    Analysis (FMEA). International Electrotechnical Commission, Geneva, 1985.
B. Kirwan and L. K. Ainsworth. A Guide to Task Analysis. Taylor & Francis, London, 1992.
M. A. Malik. Reliable preventive maintenance scheduling. AIEE Trans., 11:221-228, 1990.
M. A. Moss. Designing for Minimal Maintenance Expense. The Practical Application of
    Reliability and Maintainability. Marcel Dekker, Inc., New York, 1985.
J. Moubray. Reliability-centred Maintenance. Butterworth-Heinemann, Oxford, 1991.
F. S. Nowlan and H. F. Heap. Reliability-centered Maintenance. Technical Report AD/A066579,
    National Technical Information Service, US Department of Commerce, Springfield, Virginia,
    1978.
NPD. Regulations concerning implementation and use of risk analyses in the petroleum
    activities. Norwegian Petroleum Directorate, P.O.Box 600, N-4001 Stavanger, Norway, 1991.
OREDA-97. Offshore Reliability Data. Distributed by Det Norske Veritas, P.O.Box 300, N-1322
    Høvik, Norway, 3rd edition, 1997. Prepared by SINTEF Industrial Management, N-7034
    Trondheim, Norway.
A. M. Paglia, D. D. Barnard, and D. E. Sonnett. A Case Study of the RCM Project at V.C. Summer
    Nuclear Generating Station. 4th International Power Generation Exhibition and Conference,
    Tampa, Florida, US, 5:1003-1013, 1991.
G. Pahl and W. Beitz. Engineering Design. The Design Council, London, 1984.
M. Rausand and J. Vatn. Reliability Centered Maintenance. In C. G. Soares, editor, Risk and
    Reliability in Marine Technology. Balkema, Holland, 1997.
H. Sandtorv and M. Rausand. RCM - closing the loop between design and operation reliability.
    Maintenance, 6, No.1:13-21, 1991.
A. M. Smith. Reliability-Centered Maintenance. McGraw-Hill, Inc, New York, 1993.
D. J. Smith. Reliability, Maintainability and Risk, Practical methods for engineers.
    Butterworth Heinemann, Oxford, 4th edition, 1993.
W. L. Smith and M. R. Leadbetter. On the renewal function for the Weibull distribution.
    Technometrics, 5:393-396, 1963.
Weapon Systems and Support Equipment. Reliability-Centered Maintenance. Requirements for Naval
    Aircraft. US Department of Defense, Washington DC 20301, 1986.
C. Valdez-Flores and R. M. Feldman. A survey of preventive maintenance models for
    stochastically deteriorating single-unit systems. Naval Research Logistics Quarterly,
    36:419-446, 1989.
J. Vatn. Maintenance Optimization from a Decision Theoretical Point of View. In Proceedings,
    ESREL'95, pages 273-285, London, 1995. Chameleon Press Limited.
J. Vatn, P. Hokstad, and L. Bodsberg. An overall model for maintenance optimization.
    Reliability Engineering and System Safety, 51:241-257, 1996.