
White Paper

Achieving Plant Safety & Availability Through Reliability Engineering and Data Collection

Date: 19 December 2006


Author(s): Dr. M.J.M. Houtermans, T. Vande Capelle, M. AL-Ghumgham

Risknowlogy B.V.
Brunner bron 2
6441 GX Brunssum
The Netherlands
www.risknowlogy.com

RISKNOWLOGY Experts in Risk, Reliability and Safety



© 2002 - 2007 Risknowlogy B.V.

All Rights Reserved

Printed in The Netherlands

This document is the property of, and is proprietary to Risknowlogy. It is not to be disclosed in whole or in part and no portion of this document shall be
duplicated in any manner for any purpose without Risknowlogy’s expressed written authorization.

Risknowlogy, the Risknowlogy logo, and Functional Safety Data Sheet are registered service marks.


Achieving Plant Safety & Availability Through Reliability Engineering and Data Collection

Dr. M.J.M. Houtermans


Risknowlogy B.V., Brunssum, The Netherlands
T. Vande Capelle
HIMA Paul Hildenbrandt GmbH + Co KG, Brühl, Germany
M. Al-Ghumgham
SAFCO, Jubail, Kingdom of Saudi Arabia

Abstract
Worldwide, chemical and other processing plants are trying to implement reliability programs to
improve plant safety while maintaining plant availability. These programs can vary significantly
in size and complexity. Any reliability program, such as a preventive maintenance (PM) program,
always consists of one or more reliability models and the reliability data needed to execute these
models. The successful implementation and use of these programs therefore depends heavily on the
accuracy of the reliability models and on the availability of realistic data, or at least data that
is as close to realistic as possible.
The objective of this paper is to give the reader a better understanding of the importance of
reliability engineering, focusing on the collection of reliability data for process availability,
safety and preventive maintenance programs. The paper first explains what reliability engineering
is and why it is important for processing plants to have a reliability program. Second, it gives an
overview of the different programs in use by companies today (preventive maintenance, risk based
inspections, etc.). Next, the paper focuses on the role of reliability modeling and data. It
explains the kind of reliability data needed and the currently available sources for this data. The
focus is on the data available within the plant itself, which can be collected, among other ways,
from the data archived in the DCS. An excellent time to collect data is during scheduled plant
turnarounds. Using the example of a control valve, the paper explains the role that humans can play
in improving the administration of reliability data during a plant turnaround. The importance of
failure analysis is also addressed. Finally, the paper concludes with an example of a decision
support model that depends heavily on accurate reliability data. This example demonstrates how a
plant owner can benefit from a practical application of reliability engineering.

1 Introduction
Worldwide, chemical and other processing plants are trying to implement reliability programs to
improve plant safety while maintaining plant availability. These programs can vary significantly
in size and complexity. Any reliability program, such as a preventive maintenance (PM) program,
always consists of one or more reliability models and the reliability data needed to execute these
models. The successful implementation and use of these programs therefore depends heavily on the
accuracy of the reliability models and on the availability of realistic data, or at least data that
is as close to realistic as possible.


At the start of any reliability data program good data is usually missing. In that case companies
depend on external data sources (e.g., handbooks, databases, and expert opinions) that do not
necessarily represent the situation at their own plant. Data needs to be collected for each piece of
equipment, device or instrument needed to operate the plant. Many companies find that, when the
reliability program is first used, the collected data needs further fine tuning because there is
an offset between the model and the actual situation observed in the plant. Once the lack of
data or the uncertainty in the data starts to decrease, the models become more accurate and the
companies start to reap the benefits of the implemented reliability programs. Plant availability and
safety both increase, more preventive maintenance takes place, and the total lifetime operating
cost (TLOC) decreases because there is less unscheduled maintenance and there are fewer associated
spurious trips of the plant.
The objective of this paper is to give the reader a better understanding of the importance of
reliability engineering, focusing on the collection of reliability data for process availability,
safety and preventive maintenance programs. The paper first explains what reliability engineering
is and why it is important for processing plants to have a reliability program. Second, it gives an
overview of the different programs in use by companies today (preventive maintenance, risk based
inspections, etc.). Next, the paper focuses on the role of reliability modeling and data. It
explains the kind of reliability data needed and the currently available sources for this data. The
focus is on the data available within the plant itself, which can be collected, among other ways,
from the data archived in the DCS. An excellent time to collect data is during scheduled plant
turnarounds. Using the example of a control valve, the paper explains the role that humans can play
in improving the administration of reliability data during a plant turnaround. The importance of
failure analysis is also addressed. Finally, the paper concludes with an example of a decision
support model that depends heavily on accurate reliability data. This example demonstrates how a
plant owner can benefit from a practical application of reliability engineering.

2 Reliability engineering
Reliability engineering plays an important but undervalued role in today’s processing plants around
the world. Many companies might not realize it, but reliability engineering lies at the heart of total
asset management, a popular buzzword in industry today. What total asset management entails is not
yet clearly defined, but it incorporates elements such as reliability centered maintenance (RCM),
total productive maintenance (TPM), design for maintainability, design for reliability, life cycle
costing, loss prevention, probabilistic risk assessment and others. The objective of total asset
management is to arrive at the optimum cost-benefit-risk asset solution to meet our desired
production levels. In other words, how can we spend the least money on our plant and still meet our
production targets while maintaining process availability and process safety? Many aspects are
involved in achieving this, but when it comes to the hardware and software that we use in our plant,
reliability engineering is the discipline to use.
Reliability engineering is a very broad discipline. It is practiced by engineers who design the
hardware and/or software of individual products, but also by engineers who use these products and
integrate them into larger systems. A reliability engineer in a plant has a task similar to that of
a reliability engineer who is responsible for the design of a transmitter or valve. They apply
similar techniques, only on a different scale and with a different focus.
Reliability itself is defined as the probability that a product or system meets its specification over a
given period of time [1]. The word specification is of course very broad and a product might have
several functions. One can calculate the reliability of each individual function or of all functions
together which make up the specification. The term time can also be replaced by distance, or cycles
or other units as appropriate. In other words it is very important to be clear when we talk about
“reliability” as it can have different meanings to different people and in different situations. In a plant
we can calculate process availability, unavailability, probability of fail dangerous, fail safe, etc., which
are all aspects of, and related to, reliability. In general reliability deals with probability of failure of
components, products and systems and is therefore at the heart of disciplines like hazard and risk
analysis, loss prevention, maintenance programs, quality assurance and so on.
Reliability engineering is thus the discipline of ensuring that a product or system will be reliable
when operated in a specified manner. It is performed throughout the entire life cycle of a product or
system, including design, development, test, manufacturing, operation, maintenance and repair. In
process plants it is often a staff function whose prime responsibility is to ensure that maintenance
techniques are effective, that equipment is designed and modified to improve maintainability, that
ongoing technical maintenance problems are investigated, and that appropriate corrective and
improvement actions are taken. In reality it is much broader than that. Reliability engineering
deals with every aspect of a component or system, from making a reliable design, to reviewing
operating and maintenance procedures, or even to setting up a reliability data collection program.
In many plants reliability engineering is also called maintenance engineering.

3 Overview of reliability programs


Reliability engineering plays a role in many well known reliability programs. Typical programs that
are not always associated with reliability engineering, but of which it is often an important pillar, are:
• (Probabilistic) risk assessment
• (Functional) safety assessment
• Condition based maintenance
• Preventive maintenance (PM)
• Risk based inspections (RBI)
• Reliability centered maintenance (RCM)

Probabilistic risk and safety assessment depends heavily on reliability engineering techniques and
theory. With risk assessments we try to establish the risk associated with operating a process plant.
Risk assessment often uses a "top-down" approach to establish and rank the risk of individual areas of
a plant and of process equipment, and eventually to establish the risk associated with the complete
facility or process plant. Risk is defined as the combination of consequences and frequencies. We can
only determine the frequency of an event if we know the individual probabilities of the equipment
failures associated with that event. To carry out a risk assessment we need to know
how often a pump fails, a valve is stuck open or the instrument air is lost. Determining these
probabilities is the discipline of reliability engineering. Without proper failure rate data for
equipment we cannot establish a quantitative risk level.
Once the risk level is established, it may be too high and therefore need to be reduced, or it may
be low enough but need to be maintained at that level. Standards like IEC 61508 [2] and IEC
61511 [3] are based on this concept. When we need to reduce the risk we can either reduce the
consequence or reduce the frequency of the hazardous event. We can reduce the frequency by
implementing a safety system, but this safety system needs to be reliable enough: so reliable that
it reduces the risk to a level that we can accept. This
means that we not only need reliable safety system components but also an
appropriate safety system design to achieve overall reliability. In order to maintain our level of risk we
also need to maintain our process plant and safety system. This is why reliability engineering is often
called maintenance engineering: it makes sure that the assumptions we made during our risk and
safety assessments remain valid throughout the life of our facility. Being able to collect failure rate
data or predict failure behavior can help us in our maintenance strategy.
One program used for the prediction of failures is condition based maintenance or predictive
maintenance [5]. As its name implies, it means that we perform maintenance based on the condition
of the equipment subject to maintenance. We try to measure the condition of equipment in order to
assess whether it will fail during some future period. The objective is to avoid failure, and thus we
either maintain or replace the product just in time. What actually needs to be monitored depends on
the equipment. It can mean, for example, that we measure particles in the lubrication oil of a gearbox
or that we apply statistical process control techniques and monitor the performance of the
equipment. If we associate reliability theory with maintenance, we can try to predict
probabilistically when to perform maintenance. This is called reliability centered maintenance [5]. It is a
structured process, originally developed in the airline industry, which depends heavily on reliability
data and expert systems to interpret that data.
Condition based maintenance or reliability centered maintenance can still mean that we are too late
or that the maintenance occurs at an inconvenient time. To prevent this we can instead
implement preventive maintenance. The strategy of preventive maintenance is to replace or overhaul
a piece of equipment at a fixed interval, regardless of its condition at that time or the expected
probability of failure. It is purely based on time. Reliability and decision modeling can demonstrate that
it is often more cost effective to replace a piece of equipment at a scheduled time, before it has
failed, than to wait until it fails unexpectedly. In this way replacements can be made at, for example,
scheduled plant shutdowns.
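The cost argument for scheduled replacement can be made concrete with a small sketch. It uses a classical age replacement model with a Weibull lifetime, which is a technique chosen here for illustration and not necessarily the decision model the authors have in mind. All parameter values (characteristic life `eta`, shape `beta`, and the two costs) are hypothetical.

```python
import math

def reliability(t, eta, beta):
    """Weibull survival probability R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

def cost_rate(interval, eta, beta, cost_planned, cost_failure, steps=2000):
    """Expected cost per hour when replacing at `interval` or at failure,
    whichever comes first (age replacement policy)."""
    dt = interval / steps
    # Trapezoidal integral of R(t) over [0, interval] = expected cycle length.
    cycle = sum(0.5 * dt * (reliability(i * dt, eta, beta)
                            + reliability((i + 1) * dt, eta, beta))
                for i in range(steps))
    r_end = reliability(interval, eta, beta)
    cycle_cost = cost_planned * r_end + cost_failure * (1.0 - r_end)
    return cycle_cost / cycle

# Hypothetical wear-out item: eta = 20000 h, beta = 3, failure 10x as costly
# as a planned replacement. None of these numbers come from the paper.
params = dict(eta=20000.0, beta=3.0, cost_planned=1.0, cost_failure=10.0)
candidates = range(2000, 40001, 2000)
best = min(candidates, key=lambda t: cost_rate(t, **params))
# An interval so long that it is practically never reached = run to failure.
run_to_failure = cost_rate(200000.0, **params)

print(f"best interval among candidates: {best} h")
print(f"cost rate at best interval:     {cost_rate(best, **params):.2e} per h")
print(f"cost rate when run to failure:  {run_to_failure:.2e} per h")
```

The advantage of scheduled replacement in this sketch depends on the assumed wear-out behavior (`beta` greater than 1) and on failures being more expensive than planned replacements. With a constant failure rate, replacing early brings no benefit in this model.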
No single program is the best strategy for a plant. Most likely, different programs are applied to
different pieces of equipment. Some equipment lends itself perfectly to condition based
or reliability centered maintenance; for other equipment these are not suitable at all and preventive
maintenance is more appropriate.

4 Reliability modeling
Reliability engineering heavily depends on probabilistic methods. In order to predict something,
whether it is the reliability of a piece of equipment or a complete process plant, we first need a
reliability model. There are many different techniques and methods developed over time that we can
use to make models. If we make models of (complex) systems for the purpose of prediction we
usually depend on one or more techniques like:
• Reliability block diagrams
• Fault trees
• Markov models
• Monte Carlo simulation

Other techniques exist as well, but these are very common ones. Figure 1 shows a safety function
required to reduce the risk associated with high temperature in a vessel. To protect the vessel
against over temperature, a safety system has been built with two temperature sensors connected via
two transmitters to a logic solver. The logic solver consists of an input board, a CPU board and an
output board. The input board uses two input channels while the output board uses three output
channels. These three output channels are required because we need to open two relays to stop two
pumps, and one solenoid valve needs to close in order to open a drain valve.
For this system we can do all kinds of analyses, e.g., calculation of the probability of fail safe,
the probability of fail dangerous, the availability of the safety function, the unavailability of the
process due to spurious trips, the desired periodic proof test interval, optimization of maintenance
strategies and so on. To perform these analyses we need a reliability model of this safety function.
Three different reliability models have been created of the same function, i.e., a reliability block
diagram, a fault tree and a Markov model, shown in Figure 2, Figure 3, and Figure 4 respectively. To
actually perform calculations we need to fill the models with reliability data.
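As an illustration of how such a model is evaluated once data is available, a reliability block diagram for this safety function can be sketched in a few lines of Python. The sketch assumes independent failures, treats the two sensing paths (sensor, transmitter, input channel) as redundant, and puts all other blocks in series. Every probability value is a hypothetical placeholder, not data from this paper.

```python
from math import prod

def series_failure(p_fail):
    """Failure probability of blocks in series: the chain fails if any block fails."""
    return 1.0 - prod(1.0 - p for p in p_fail)

def parallel_failure(p_fail):
    """Failure probability of redundant blocks: the group fails only if all fail."""
    return prod(p_fail)

# Hypothetical per-block failure probabilities (placeholders, not plant data).
p = {"T": 1e-3, "TM": 2e-3, "I": 5e-4, "CC": 1e-4, "CPU": 2e-4,
     "O": 5e-4, "R": 1e-3, "SOV": 2e-3, "DRAIN": 5e-3}

path = series_failure([p["T"], p["TM"], p["I"]])   # one sensing path
inputs = parallel_failure([path, path])            # two redundant sensing paths
rest = [p["CC"], p["CPU"], p["CC"],                # logic solver
        p["O"], p["O"], p["O"],                    # three output channels
        p["R"], p["R"], p["SOV"], p["DRAIN"]]      # relays, solenoid, drain valve
system = series_failure([inputs] + rest)

print(f"P(one sensing path failed) = {path:.3e}")
print(f"P(safety function failed)  = {system:.3e}")
```

With these placeholder numbers the redundant sensing paths contribute very little, and the result is dominated by the non-redundant blocks in the series chain. This is the kind of insight a model gives once it is filled with data.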


[Figure 1: safety function specification and its hardware realization. Specification: "Measure the temperature in the reactor and if the temperature exceeds 65 °C then open the drain valve and stop the supply pumps to the reactor. This function needs to be carried out within 3 seconds and with safety integrity SIL 3". Hardware, from sensing through logic solving to actuating: temperature sensors T1 and T2 with transmitters TM 1 and TM 2; a logic solver with input channels I1 to I8, common circuitry, a CPU and output channels O1 to O8; relays R1 (Pump A) and R2 (Pump B) and a solenoid valve (SOV) for the drain valve.]

Figure 1 - From specification to hardware design of the safety instrumented system [6]

[Figure 2: block diagram with two sensing paths (T1, TM 1, I1 and T2, TM 2, I2) feeding a series chain of CC, CPU, CC, O1, O2, O3, R1, R2, SOV and ESD SV. © Risknowlogy 2002-2005]

Figure 2 - Block diagram safety function [6]

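[Figure 3: fault tree with top event "Safety Function Failed" and the intermediate events "Input Failed", "Logic Failed" and "Output Failed". Basic events: T1, Tm1, I1 (Path 1 Failed); T2, Tm2, I2 (Path 2 Failed); CC, CPU, CC, O1, O2, O3, SOV, Drain Valve, R1, R2, R3.]

Figure 3 - Simplified fault tree diagram safety function [6]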


[Figure 4: Markov model with the states "System OK", "Path 1 Failed", "Path 2 Failed" and "Failed".]

Figure 4 - Markov model safety function [6]

5 Reliability data and data collection


The reliability models in the previous section are useless if we cannot fill them with appropriate
reliability data. Not many companies have their own reliability database and collection program, and
they therefore need to depend on different sources for the data. It is very important, though, to use
the best possible data. If the data is uncertain, i.e., if we do not know whether it is
correct, the results will be uncertain as well. In [7] Rouvroye demonstrates the effect of uncertain
data on the safety performance of a HIPPS installation. The probability of a dangerous failure on
demand, i.e., the probability that the safety function of the HIPPS cannot be carried out when
required, is shown in Figure 5. This figure shows that the results can be a factor of 10 better or worse,
which can have a significant impact in the safety world, where a factor of 10 means a difference in SIL
[2]. Whether uncertain data really has an impact depends on the kind of problem we are trying to
solve. In general, the more accurate the data, the better the results will be. There are also
techniques available that allow us to determine the influence of uncertain data on the results. These
techniques fall under sensitivity analysis and make it possible to determine whether it is worth
spending time and resources on finding better data.
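One simple way to see how data uncertainty propagates into a result is a Monte Carlo sketch like the one below. It is not the model used in [7]. It assumes a single channel with the common simplified approximation PFDavg = λ_du × TI / 2, and a lognormally distributed dangerous undetected failure rate. The median rate, the error factor and the proof test interval are hypothetical values.

```python
import math
import random
import statistics

def pfd_avg(lambda_du, test_interval_h):
    """Simplified average PFD of a single channel: lambda_du * TI / 2."""
    return lambda_du * test_interval_h / 2.0

def pfd_percentiles(median_rate, error_factor, test_interval_h, n=20000, seed=1):
    """Sample an uncertain failure rate; return the 10th, 50th and 90th PFD percentiles."""
    rng = random.Random(seed)
    # Lognormal with the given median. The error factor is taken as the ratio
    # of the 95th percentile to the median (an assumption of this sketch).
    sigma = math.log(error_factor) / 1.645
    mu = math.log(median_rate)
    samples = [pfd_avg(rng.lognormvariate(mu, sigma), test_interval_h)
               for _ in range(n)]
    deciles = statistics.quantiles(samples, n=10)  # 9 cut points: 10%, ..., 90%
    return deciles[0], statistics.median(samples), deciles[8]

# Hypothetical input: median lambda_du = 1E-6 /h, error factor 3, yearly proof test.
p10, p50, p90 = pfd_percentiles(1e-6, 3.0, 8760.0)
print(f"10th percentile: {p10:.2e}")
print(f"median:          {p50:.2e}")
print(f"90th percentile: {p90:.2e}")
```

If the spread between the 10th and 90th percentile straddles a SIL boundary, the uncertainty matters for the decision at hand and better data is worth the effort. If it does not, the available data may already be good enough.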


[Figure 5: Probability of Failure on Demand (PFD), logarithmic axis from 1E-05 to 1E-01, plotted against time (0 to 17472 hours), with curves for the 10th percentile, the median and the 90th percentile.]

Figure 5 - Reliability calculations of a HIPPS with uncertain data, two different periodic proof
test intervals and two periodic proof test coverages [7]
Basically the following data sources exist in industry:
• End user maintenance records
• Industry databases
• Reliability standards
• Handbooks
• Manufacturer data
• Documented reliability studies
• Expert opinions
• Published papers

The most preferred data is always the data from the plant itself. Usually this data is collected via
maintenance records. Your own data is the best data for obvious reasons. Look at it this way: when
two companies buy the same valve, but one company uses the valve on an offshore platform in the
North Sea while the other uses it in a plant in the desert, we cannot
expect both valves to have the same failure behavior. The failure behavior of a device is influenced
not only by environmental parameters but also by its operational use and the maintenance strategy
of the company. Since no two companies are the same (and probably not even two factories within
a company), their similar devices will not fail at the same rate. Thus the best data is
the data you collect yourself.
If this kind of data is not available, the next best source is data from industry
databases. Figure 6 shows two industry databases or handbooks that can be used. One is the
OREDA [8] database and the other is the SINTEF [9] handbook. Both contain reliability data collected
over time. OREDA holds reliability data collected from offshore companies in the North Sea and
the SINTEF handbook holds reliability data specifically for safety equipment. Several other databases
and handbooks exist that can be used, but no matter who delivers the data, it is
important to tailor the data so that it is useful for the situation at hand.

Figure 6 - Two examples of industry databases, OREDA [8] and SINTEF [9]
When collecting reliability data we need to make sure that we document the right information when
a piece of equipment fails. Basically we are interested in three types of information, i.e., the failure
rate, the failure modes, and the repair times of a device. Unfortunately, many of the maintenance records
that we use are not suitable for reliability data collection because the desired information is not
recorded, or is recorded in a way that cannot be used. It is very important that we get an overview of
how often a device fails and how it has failed. Before we can document that information we first need to
be clear about the function of that device.
For example the function of an ESD valve is to close upon demand. This valve can have the
following general failure modes:
• Stuck open
• Stuck closed
• Stuck in position
• Leakage

A control valve has a different function than an ESD valve. For a control valve we might be
interested in failure modes like:
• Moves too fast
• Moves too slow
• Stuck in position
• Leakage

What a failure mode really means can only be determined when we understand it in the larger context
of the plant. We need to understand what the functionality of a device is
when it is used. Consider the following two valves. One valve controls the flow in an inlet pipe of a
vessel, while the other valve is a drain valve for the same vessel. The valve on the inlet pipe is
normally open and should close upon demand. The drain valve is normally closed and should open
upon demand. Both valves have the same failure modes, like stuck open, stuck closed, stuck in position
or leakage, but the effects of these failure modes are opposite for the two valves. Thus, it is
important to understand the function of a device at device level and at system level in order to be
able to properly document the failure behavior.
Failure rates can be collected at different levels, but the only correct level is the
failure mode level. In practice the maintenance department should track, for each type of device, the
number of devices installed, the operating hours of each device and the times at which devices have
failed. This information, in combination with the failure modes, allows us to calculate the failure rate
per failure mode, and that is exactly what we need for our reliability models.
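A minimal sketch of this bookkeeping is shown below. It assumes a constant failure rate and estimates the rate per failure mode as the number of recorded failures in that mode divided by the accumulated operating hours of all installed devices. The function name and the records are hypothetical.

```python
from collections import Counter

def failure_rates_per_mode(operating_hours, failure_modes):
    """Estimate a constant failure rate (per hour) for each failure mode.

    operating_hours: operating hours of each installed device.
    failure_modes:   one entry per recorded failure, naming its failure mode.
    """
    total_hours = sum(operating_hours)
    if total_hours <= 0:
        raise ValueError("total operating hours must be positive")
    counts = Counter(failure_modes)
    return {mode: n / total_hours for mode, n in counts.items()}

# Hypothetical records: 10 valves, 5 years in service each, 4 recorded failures.
hours = [5 * 8760] * 10
failures = ["stuck open", "leakage", "stuck open", "stuck in position"]
rates = failure_rates_per_mode(hours, failures)

for mode, rate in sorted(rates.items()):
    print(f"{mode}: {rate:.2e} /h")
```

With only a handful of failures the estimates carry a large statistical uncertainty, which is one more reason to keep collecting data over several turnarounds.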
For each device we should basically collect the failure rates per failure mode but in practice many
companies do not have this kind of information. Often they need to work the other way around in
order to determine the failure rates per mode. Consider the safety industry where they are only
interested in 4 different failure modes [11]:
• Safe detected
• Safe undetected
• Dangerous detected
• Dangerous undetected

Only electronic devices can benefit from diagnostics and have detected failure modes. A partial
stroke test is not a diagnostic test, as diagnostic tests are defined as frequent tests that run fully
automatically [10]. Most partial stroke setups, however, require human interaction. Therefore it is in most
cases not possible to define safe detected and dangerous detected failure modes for mechanical devices
like valves, only undetected failure modes. This also makes sense. When a valve is stuck
open and a partial stroke test is performed once every 6 months, we potentially do not know about
this stuck-at failure for 6 months. Detecting the failure with a partial stroke test after 6 months is good,
but too slow to take much advantage of it. Therefore only devices with built-in tests that run
automatically and frequently are useful here, as they allow us to act upon a failure immediately.
If we have the following information available then we can calculate the four failure modes as
desired:
• The overall failure rate of a device (λ), which includes all failures of the device regardless of their failure mode
• The safe ratio of the failures (SR), i.e., the fraction of all failures of a device that are safe failures
• The safe diagnostic coverage of a device (SDC), i.e., the percentage of all safe failures that can be detected through diagnostic tests
• The dangerous diagnostic coverage of a device (DDC), i.e., the percentage of all dangerous failures that can be detected through diagnostic tests

Consider the following example where we can calculate the four failure rates important in the safety
industry from the following basic data:
• λ = 5.5E-6 /h
• SR = 80%
• SDC = 90%
• DDC = 90%

This basic data results in the following four failure rates:


• Safe detected failure rate λ_sd = 5.5E-6 × 0.8 × 0.9 = 3.96E-6 /h
• Safe undetected failure rate λ_su = 5.5E-6 × 0.8 × (1.0 - 0.9) = 0.44E-6 /h
• Dangerous detected failure rate λ_dd = 5.5E-6 × (1.0 - 0.8) × 0.9 = 0.99E-6 /h
• Dangerous undetected failure rate λ_du = 5.5E-6 × (1.0 - 0.8) × (1.0 - 0.9) = 0.11E-6 /h
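The same arithmetic can be written as a small Python function, which is convenient when many devices have to be processed. The function and parameter names are illustrative, and `safe_fraction` is interpreted as the share of all failures that are safe, as in the calculation above.

```python
def split_failure_rate(lam, safe_fraction, sdc, ddc):
    """Split an overall failure rate (per hour) into the four safety-related rates."""
    lam_safe = lam * safe_fraction
    lam_dangerous = lam * (1.0 - safe_fraction)
    return {
        "sd": lam_safe * sdc,                  # safe detected
        "su": lam_safe * (1.0 - sdc),          # safe undetected
        "dd": lam_dangerous * ddc,             # dangerous detected
        "du": lam_dangerous * (1.0 - ddc),     # dangerous undetected
    }

# Basic data from the example: lambda = 5.5E-6 /h, SR = 80%, SDC = DDC = 90%.
rates = split_failure_rate(lam=5.5e-6, safe_fraction=0.8, sdc=0.9, ddc=0.9)
for mode, value in rates.items():
    print(f"lambda_{mode} = {value:.2e} /h")
```

A useful sanity check is that the four rates always add up to the overall failure rate again.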

These are the kinds of calculations that companies make when they do not have their own reliability
data. In reality one does not get the information in this form, as the maintenance department collects
information at failure mode level. If you have the failure rate information at failure mode level, it is
possible to calculate the overall failure rate, the safe ratio and the safe and dangerous diagnostic
coverage factors. More and more suppliers of devices are providing end-users with this kind of detailed
product information, though. Consider the functional safety data sheet© in Figure 7, where this basic failure
rate information was used to also calculate factors like the safe failure fraction, the MTTFsafe,
MTTFdangerous, etc.
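The reverse direction can be sketched as follows. The sketch assumes constant failure rates (so that MTTF = 1/λ) and uses the IEC 61508 definition of the safe failure fraction, i.e., all failures except the dangerous undetected ones divided by all failures. Whether the data sheet in Figure 7 uses exactly these formulas is an assumption, and the function name is illustrative.

```python
def summarize_rates(sd, su, dd, du):
    """Derive overall rate, safe ratio, diagnostic coverages, SFF and MTTFs
    from the four failure rates (all per hour, safe and dangerous rates nonzero)."""
    safe = sd + su
    dangerous = dd + du
    total = safe + dangerous
    return {
        "lambda": total,
        "SR": safe / total,                   # fraction of failures that are safe
        "SDC": sd / safe,                     # safe diagnostic coverage
        "DDC": dd / dangerous,                # dangerous diagnostic coverage
        "SFF": (safe + dd) / total,           # safe failure fraction
        "MTTF_safe_h": 1.0 / safe,
        "MTTF_dangerous_h": 1.0 / dangerous,
    }

# The four rates from the example above, in failures per hour.
summary = summarize_rates(sd=3.96e-6, su=0.44e-6, dd=0.99e-6, du=0.11e-6)
for name, value in summary.items():
    print(f"{name}: {value:.4g}")
```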


Figure 7 - Functional safety data sheet© with basic reliability data

The only reliability data still missing to make the model complete is repair data and proof
test data. Many product suppliers make statements about how long it takes to repair their transmitter
or valve, and often this is considered to be 8 or 24 hours. In practice only the end-user knows how
long it will take to repair a particular device. It depends on many different factors. For example, is the
failed device in stock or not? If we do not have it in stock, how long does it take to order it and to have
it shipped to the desired location? If we do have it in stock, how long does it actually take to
replace it? Do we have only one repair crew or do we have multiple repair crews available? In our
model we can assume that something takes only 8 hours to repair, but if in reality it takes
30 days, our calculation results are not worth much. The closer our
model is to practice, the more useful reliability engineering will be.
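How much the repair time assumption matters can be shown with a short sketch. It assumes a single repairable device with a constant failure rate and uses the standard steady-state relation unavailability = MTTR / (MTTF + MTTR). The failure rate is the one from the earlier example, and the two repair times are the 8 hours and 30 days mentioned in the text.

```python
def steady_state_unavailability(failure_rate, repair_time_h):
    """Unavailability of a repairable device: MTTR / (MTTF + MTTR)."""
    mttf = 1.0 / failure_rate
    return repair_time_h / (mttf + repair_time_h)

rate = 5.5e-6  # failures per hour, reused from the earlier failure rate example
u_short = steady_state_unavailability(rate, 8.0)        # assumed 8 hour repair
u_long = steady_state_unavailability(rate, 30 * 24.0)   # actual 30 day repair

print(f"repair in 8 hours: unavailability = {u_short:.2e}")
print(f"repair in 30 days: unavailability = {u_long:.2e}")
print(f"ratio: {u_long / u_short:.0f}")
```

Since the mean time to failure is much longer than either repair time, the unavailability scales almost linearly with the repair time, so an optimistic repair assumption translates almost directly into an optimistic result.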

5.1 Reliability Practiced Program


Companies collect reliability data in many different ways, and where possible they try to automate the
data collection process. Normally, reliability centered maintenance programs work from an offline
database, which is developed for chemical equipment plants. Reliability engineers feed the plant
specific static data into the database and use it for RCM modeling. The more data becomes available,
the closer the reliability model gets to the actual situation in the field. Fortunately, these
days many chemical plants are operated through DCS controls and a lot of the plant data is archived in
the DCS and/or the plant information system. Unfortunately, the collected data is not being used or
transmitted to the RCM database, which limits its application and usefulness.
In addition, although plant hardwired failures are required in the construction of RCM programs, it is
observed that chemical plants tend to avoid component failure analysis. Unfortunately, the
maintenance role is often solely responsible for replacing failed components to assure plant
availability. To get better feedback and results from a reliability centered maintenance
program it is of utmost importance that root cause analysis of failed components is practiced more
often. This would lead to better data and thus improve the overall maintenance and repair strategy.

5.2 Component Failure Analysis

Many plant operators have established a methodology or procedure for repairing or
replacing failed components in the plant. The following is an example of such a
procedure [11,12]. The purpose of this procedure is to provide a methodology for
analyzing component failures that happen in a plant and its systems. It is assumed that
component failures occur in a random fashion, which makes failure occurrence a difficult,
complex process to predict. The objective of this procedure is to document available data
on past failures and to strengthen our knowledge of failures. This will certainly help to
predict, and may even prevent, future failures and thus incidents. Success can only be
achieved if the maintenance, operations, and engineering departments work together
when applying this policy. The benefit of this procedure is that it establishes a follow-up
system with an objective approach to question and demand close
coordination in implementing incident recommendations.

1. A component that fails in a system “X” shall be isolated and removed. If the
failure has caused a plant shutdown, a new certified component shall be
installed to speed up plant start-up activities. The certificate for the new
component shall be issued by the OEM (Original Equipment Manufacturer).
2. The failed component shall be clearly tagged. The maintenance engineer shall
record the component's ID, its function, the failure description, and the physical
and environmental state. A sample form is shown in Figure 8. The component
shall then be shipped to the OEM by the maintenance department so that the
OEM can analyze the faulty component and issue the final failure report.
3. The maintenance department shall issue a copy of the inspection report to
operations and engineering.
4. Operations shall call an overview meeting to discuss the failure report and
prepare an action plan if the inspection report contains serious
recommendations.
5. Examples of component failures can be tabulated for the past two years. This
table can be updated every 6 months; maintenance will review it and insert the
updated information, with a target date for completing each activity.
6. Random site visits shall be made regularly to ensure that preventive
maintenance has been done. The team should be led by operations.

COMPONENT SYSTEM FAILURE ANALYSIS - REQUEST FORM

Plant / Unit               ______________________________________________

Engineer                   ______________________________________________

Equipment ID               ______________________________________________

Equipment Function         ______________________________________________

Time of Failure            ______________________________________________

Fault Description          ______________________________________________
                           ______________________________________________

Unit Operations Approval   ______________   __________   _________________
                           Name             ID           Signature

OEM Address                Name:       ___________________________________
                           Email:      ___________________________________
                           Fax / Tel:  ___________________________________

Surrounding Status Record  ______________________________________________
                           ______________________________________________

Figure 8 - Sample Form Maintenance Record
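
If the information on the form in Figure 8 is captured electronically, it can feed directly into reliability data. The following is a minimal sketch of that idea. The class name, field names, and the helper function are hypothetical and simply mirror the form fields; the failure rate estimate (number of failures divided by observed operating hours) assumes a constant failure rate.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FailureRecord:
    """One completed request form (field names are hypothetical)."""
    plant_unit: str
    engineer: str
    equipment_id: str
    equipment_function: str
    time_of_failure: datetime
    fault_description: str
    surrounding_status: str


def observed_failure_rate(records, equipment_id, observed_hours):
    """Failures per hour for one equipment item, assuming a constant rate."""
    failures = sum(1 for r in records if r.equipment_id == equipment_id)
    return failures / observed_hours


# Example with made-up records for a hypothetical transmitter "PT-101".
records = [
    FailureRecord("Unit 1", "Engineer A", "PT-101", "Pressure measurement",
                  datetime(2005, 3, 14, 9, 30), "Output frozen", "Dry, 35 C"),
    FailureRecord("Unit 1", "Engineer B", "PT-101", "Pressure measurement",
                  datetime(2005, 11, 2, 22, 5), "Drift", "Humid, 40 C"),
    FailureRecord("Unit 1", "Engineer A", "FV-200", "Flow control",
                  datetime(2005, 6, 1, 13, 0), "Stuck closed", "Dusty"),
]

rate = observed_failure_rate(records, "PT-101", observed_hours=8760.0)
print(f"PT-101 observed failure rate: {rate:.2e} per hour")
```

The point of the sketch is only that a consistently filled-in form already contains what is needed to update plant-specific failure rates.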

5.3 Human Interaction & Control Valve Example


In many plants, human interaction plays a major role in daily job planning as well as in major plant
turnaround jobs. A plant turnaround is normally scheduled every three years. For every
plant turnaround there is a defined list of jobs to be planned, and people plan material and
resources accordingly to accomplish these big tasks. The human factor therefore plays a main
role in organizing job resources and administrative follow-up, which leads to an optimized execution
of all jobs within cost and time constraints. An RCM program alone cannot be used to manage
plant turnaround jobs, but integrating the human role into this process will facilitate it.
Some further questions need to be asked, and to make the typical questions clear,
we will use a control valve example to illustrate the approach.
For preventive maintenance (PM) of control valves, we can ask many questions, such as:
• What is the frequency of PM for critical valves?
• Is this a critical control valve? Is this an emergency shutdown valve? Is this a high pressure
service valve? Is this valve fully open, closed, or regulating?
• How many items in the control valve require checking? (A control valve can have many
accessories, such as a solenoid valve, positioner, regulator set, booster set, I/P unit, and valve
internals.)
• What sort of tests are required as part of the PM process? (For example, a leak test based on the
valve leakage class and a hydro test based on the line pressure rating.)
• Is there any certificate check for major accessory items such as the solenoid valve, positioner,
and I/P unit?
• Is there any internal component to be replaced, such as the plug, seat, or valve soft kit?
• Is there any external component that shall be replaced (such as the solenoid valve, I/P unit,
pneumatic set, diaphragm, etc.) based on the plant standard or applied practices?
• Is there any certificate to be issued for each small component whose failure could cause the
valve to malfunction?
• Is there a bypass line on the valve that allows maintenance to do PM on the valve while
the plant is running?

As one can see, after establishing answers to these and other questions, we can take a closer
look at real reliability models and see how, through an audit and validation process, we can enhance
the RCM models in the real world.

6 Practical Example of How to Benefit from Reliability Engineering

The following is an example of how end-users can benefit from reliability engineering. In this example
the question is whether a plant should use a single pressure transmitter (1oo1) for a certain
application, or whether it would benefit from redundant (1oo2) or even triple redundant (2oo3) pressure
sensors. In particular, the question is whether there are financial benefits, if any, in using more than one
pressure sensor. A risk-based methodology was used to determine the scenario cost associated with
the three pressure sensor architectures.
It is assumed that a sensor can fail in four ways:
• Safe detected
• Safe undetected
• Dangerous detected
• Dangerous undetected

The failure rates for each of these possible failure modes are calculated with the values from Table 1.

Table 1 – Reliability data sensor

Parameter                            Value
Overall failure rate of the sensor   8.6E-6 / h
Safe ratio                           50%
Safe detected diagnostics            25%
Dangerous detected diagnostics       25%
Common cause                         5%
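
The split of the overall failure rate into the four failure modes can be sketched as follows. The sketch assumes the usual convention that the safe ratio divides the overall rate into a safe and a dangerous part, and that each diagnostic percentage is the detected fraction of its part; the paper does not spell out the formula, so this is an interpretation. The common cause percentage is not used here because it only matters when more than one sensor is installed.

```python
# Values from Table 1.
LAMBDA_TOTAL = 8.6e-6       # overall failure rate, per hour
SAFE_RATIO = 0.50           # fraction of failures that are safe
SAFE_DETECTED = 0.25        # detected fraction of safe failures
DANGEROUS_DETECTED = 0.25   # detected fraction of dangerous failures

# Assumed convention: split by safe ratio first, then by diagnostics.
lambda_safe = LAMBDA_TOTAL * SAFE_RATIO
lambda_dangerous = LAMBDA_TOTAL * (1.0 - SAFE_RATIO)

rates = {
    "safe detected": lambda_safe * SAFE_DETECTED,
    "safe undetected": lambda_safe * (1.0 - SAFE_DETECTED),
    "dangerous detected": lambda_dangerous * DANGEROUS_DETECTED,
    "dangerous undetected": lambda_dangerous * (1.0 - DANGEROUS_DETECTED),
}

for mode, rate in rates.items():
    print(f"{mode:22s} {rate:.3e} / h")
```

Under this convention the four rates add up to the overall failure rate again, which is a useful sanity check on the input data.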

Table 2 shows data related to the process. The mission time is the time the pressure sensors are
operated. The periodic test interval is the time between periodic proof tests. The periodic test
coverage represents the percentage of failures that can be detected. The demand rate represents the
number of demands that come from the process. A demand means that the safety function needs to
be carried out and thus needs to be available (the pressure sensor needs to work in order to carry out
the safety function).

Table 2 – Process data

Parameter                Value
Mission time             10 years
Periodic test interval   6 months
Periodic test coverage   100%
Process demand rate      6 per year

The financial data from Table 3 is used to estimate the cost associated with the three different sensor
architectures.

Table 3 – Financial data

Parameter                                             Value
Cost sensor*                                          $ 5,000.00 / sensor
Cost associated with a spurious trip of the plant**   $ 1,000,000.00 / trip
Cost associated with an accident**                    $ 15,000,000.00 / accident

* These costs include all costs of the sensor, including installation, repair cost, etc.
** These costs include all associated costs (repair, production loss, etc.).

The three different architectures all have the same possible operating modes or failure scenarios.
These scenarios are:
• Operational – The sensor subsystem has no failures that affect the measurement of the
sensor.
• Trip – The sensor subsystem has failed in a way that the associated logic solver (DCS or
safety PLC) can only decide to trip the process.
• Dangerous – The sensor subsystem has failed in a way that the associated logic solver (DCS
or safety PLC) cannot take any action when demanded from the process.

For each of these scenarios the probability of occurrence is calculated using the Markov modeling
technique [12]. For all architectures, Markov models are created that allow us to calculate the
probabilities associated with these scenarios. Once the probability of each scenario is known, we can
calculate the cost associated with that scenario. The expected cost over the mission time for each
sensor subsystem is then the sum of the probability-weighted costs of its scenarios.
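
To give a feeling for how such a Markov model works, the following is a minimal discrete-time sketch for a single (1oo1) sensor with the three states operational, trip, and dangerous. It is not the authors' model and is not expected to reproduce the figures in this paper. The state layout, the 8-hour repair time after a trip, the treatment of all safe and dangerous detected failures as trips, and the use of time-averaged probabilities are all assumptions made for this illustration; only the failure rate data, mission time, and proof test interval come from Table 1 and Table 2.

```python
# Illustrative discrete-time Markov model of a 1oo1 sensor (assumed structure).
LAMBDA = 8.6e-6        # overall failure rate per hour (Table 1)
SAFE_RATIO = 0.5       # Table 1
DD_COVERAGE = 0.25     # dangerous detected diagnostics (Table 1)

LAMBDA_DU = LAMBDA * (1 - SAFE_RATIO) * (1 - DD_COVERAGE)
LAMBDA_TRIP = LAMBDA - LAMBDA_DU   # assumption: all other failures cause a trip
REPAIR_RATE = 1 / 8.0              # assumption: 8 h mean repair time after a trip

MISSION_HOURS = 10 * 8760          # 10 years (Table 2)
PROOF_TEST_HOURS = 4380            # 6 months (Table 2)


def average_state_probabilities(proof_test_interval=None):
    """Step the chain hour by hour; return time-averaged (ok, trip, dangerous)."""
    p_ok, p_trip, p_dang = 1.0, 0.0, 0.0
    sums = [0.0, 0.0, 0.0]
    for hour in range(1, MISSION_HOURS + 1):
        new_ok = p_ok * (1 - LAMBDA_TRIP - LAMBDA_DU) + p_trip * REPAIR_RATE
        new_trip = p_trip * (1 - REPAIR_RATE) + p_ok * LAMBDA_TRIP
        new_dang = p_dang + p_ok * LAMBDA_DU
        p_ok, p_trip, p_dang = new_ok, new_trip, new_dang
        if proof_test_interval and hour % proof_test_interval == 0:
            # Perfect proof test (100% coverage): undetected failures are repaired.
            p_ok, p_dang = p_ok + p_dang, 0.0
        sums[0] += p_ok
        sums[1] += p_trip
        sums[2] += p_dang
    return tuple(s / MISSION_HOURS for s in sums)


without_test = average_state_probabilities()
with_test = average_state_probabilities(PROOF_TEST_HOURS)
print("no proof test (ok, trip, dangerous):", without_test)
print("6-month test  (ok, trip, dangerous):", with_test)
```

Even this simplified model shows the qualitative effect discussed in this section: without a proof test the undetected dangerous failures accumulate over the mission time, while a periodic proof test keeps the dangerous state probability low.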
Based on these assumptions the results are presented in Figure 9 and Figure 10. Figure 9 shows the
results over a mission time of 10 years without performing a periodic proof test. Figure 10 shows the
same model with a periodic proof test performed every 6 months. The results are based on
the weighted scenario cost. For each sensor subsystem we calculate the probability that the subsystem
either:
• Is operational;
• Has caused a plant trip;
• Has failed dangerous.
As the subsystem must be in one of these three states at all times, the total probability adds up
to 1. Please note that the results only apply for these assumptions, as they were applicable to this
particular customer. In this case the results clearly favor the 2oo3 sensor architecture. The pressure
sensor system clearly benefits from a periodic scheduled proof test every 6 months. The probability
weighted scenario costs for the three architectures are in this case:
• 1oo1 subsystem: $1,185,421.60;
• 1oo2 subsystem: $103,572.38;
• 2oo3 subsystem: $60,792.63.
In all three cases the dangerous scenario contributes the most to the overall weighted cost. This is
due to the 6 demands per year and the high cost associated with a possible accident. The periodic
proof test improves the system significantly, as a comparison of Figure 9 and Figure 10 shows. The
overall improvement achieved through periodic proof testing is summarized in Table 4.
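
The probability weighted scenario cost can be sketched as follows for the 1oo1 subsystem with proof testing, using the total costs and probabilities printed in Figure 10. The printed figures appear to weight the dangerous scenario additionally by the demand rate of 6 per year; this is an inference from the numbers, not a formula stated in the paper, and it is marked as an assumption in the code. Because the printed probabilities are rounded to five decimals, the result only approximates the printed total of $1,185,421.60.

```python
DEMANDS_PER_YEAR = 6  # Table 2

# (scenario, total cost in $, probability) for 1oo1 with proof test, Figure 10.
scenarios_1oo1 = [
    ("Operational", 5_000.00, 0.98688),
    ("Trip", 105_000.00, 0.00000),
    ("Dangerous", 15_005_000.00, 0.01311),
]


def weighted_scenario_cost(scenarios, demands=DEMANDS_PER_YEAR):
    """Sum of cost x probability over all scenarios.

    Assumption (inferred from the figures): the dangerous scenario is
    additionally multiplied by the number of demands per year.
    """
    total = 0.0
    for name, cost, probability in scenarios:
        factor = demands if name == "Dangerous" else 1
        total += cost * probability * factor
    return total


total_1oo1 = weighted_scenario_cost(scenarios_1oo1)
print(f"1oo1 weighted scenario cost: ${total_1oo1:,.2f}")
```

The same function can be applied to the rows of the 1oo2 and 2oo3 subsystems to compare the architectures on one cost figure.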

Installed        Failure       Initial          Business                Total Cost       Scenario      Probability Weighted
Subsystem        Modes         Investment   +   Interruption Cost   =                    Probability   Scenario Cost

1oo1 Subsystem   Operational   $5,000.00    +   $0.00               =   $5,000.00        0.76632       $3,831.62
                 Trip          $5,000.00    +   $100,000.00         =   $105,000.00      0.00006       $5.87
                 Dangerous     $5,000.00    +   $15,000,000.00      =   $15,005,000.00   0.23362       $21,032,862.62
                 Total                                                                   1.00000       $21,036,700.10

1oo2 Subsystem   Operational   $10,000.00   +   $0.00               =   $10,000.00       0.93564       $9,356.35
                 Trip          $10,000.00   +   $100,000.00         =   $110,000.00      0.00012       $12.71
                 Dangerous     $10,000.00   +   $15,000,000.00      =   $15,010,000.00   0.06425       $5,786,273.05
                 Total                                                                   1.00000       $5,795,642.11

2oo3 Subsystem   Operational   $15,000.00   +   $0.00               =   $15,000.00       0.87195       $13,079.24
                 Trip          $15,000.00   +   $100,000.00         =   $115,000.00      0.00000       $0.18
                 Dangerous     $15,000.00   +   $15,000,000.00      =   $15,015,000.00   0.12805       $11,535,745.22
                 Total                                                                   1.00000       $11,548,824.64

Figure 9 – Decision model result without periodic proof testing

Installed        Failure       Initial          Business                Total Cost       Scenario      Probability Weighted
Subsystem        Modes         Investment   +   Interruption Cost   =                    Probability   Scenario Cost

1oo1 Subsystem   Operational   $5,000.00    +   $0.00               =   $5,000.00        0.98688       $4,934.42
                 Trip          $5,000.00    +   $100,000.00         =   $105,000.00      0.00000       $0.31
                 Dangerous     $5,000.00    +   $15,000,000.00      =   $15,005,000.00   0.01311       $1,180,486.86
                 Total                                                                   1.00000       $1,185,421.60

1oo2 Subsystem   Operational   $10,000.00   +   $0.00               =   $10,000.00       0.99895       $9,989.55
                 Trip          $10,000.00   +   $100,000.00         =   $110,000.00      0.00001       $0.68
                 Dangerous     $10,000.00   +   $15,000,000.00      =   $15,010,000.00   0.00104       $93,582.16
                 Total                                                                   1.00000       $103,572.38

2oo3 Subsystem   Operational   $15,000.00   +   $0.00               =   $15,000.00       0.99949       $14,992.37
                 Trip          $15,000.00   +   $100,000.00         =   $115,000.00      0.00000       $0.00
                 Dangerous     $15,000.00   +   $15,000,000.00      =   $15,015,000.00   0.00051       $45,800.25
                 Total                                                                   1.00000       $60,792.63

Figure 10 – Decision model result with periodic proof testing

Table 4 – How the periodic proof test improves the system significantly

Architecture   Without Proof Test   With Proof Test   Improvement (factor)
1oo1           $21.0 Mil            $1.2 Mil          17.5
1oo2           $5.8 Mil             $103 k            58
2oo3           $11.5 Mil            $60 k             191
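
The improvement factor is the weighted cost without proof testing divided by the weighted cost with proof testing. The short sketch below recomputes it from the unrounded totals in Figure 9 and Figure 10. The factors in Table 4 appear to be based on rounded cost values, so the recomputed factors differ slightly from the table.

```python
# Total probability weighted scenario cost in $, from Figure 9 and Figure 10.
totals = {
    "1oo1": (21_036_700.10, 1_185_421.60),
    "1oo2": (5_795_642.11, 103_572.38),
    "2oo3": (11_548_824.64, 60_792.63),
}

# Improvement factor = cost without proof test / cost with proof test.
improvement = {
    architecture: without_test / with_test
    for architecture, (without_test, with_test) in totals.items()
}

for architecture, factor in improvement.items():
    print(f"{architecture}: improvement factor {factor:.1f}")
```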

7 Conclusions
This paper has addressed plant safety and availability through the lens of reliability engineering and
reliability data collection. The paper explained what reliability engineering is, how reliability models
can be made, and what kind of data needs to be collected. It demonstrated through practical examples
how reliability data can be collected, what problems may arise, and how plants can benefit from good
reliability data.

8 References
1. Bentley, J.P., An Introduction to Reliability & Quality Engineering. John Wiley & Sons,
ISBN 0-582-08970-0, 1993
2. IEC, Functional safety for electrical / electronic / programmable electronic safety-related
systems. IEC 61508, IEC, Geneva, 1999
3. IEC, Functional safety: safety instrumented systems for the process industry sector.
IEC 61511, IEC, Geneva, 2003
4. Condition based maintenance
5. Moubray J., Reliability Centered Maintenance, 2nd Edition, ISBN: 0831130784, April
1997
6. Vande Capelle, T., Houtermans, M.J.M., Functional Safety For End-users and system
integrators,
7. Rouvroye, J.L., et al., Uncertainty in safety, New Techniques For The Assessment And
Optimisation Of Safety In Process Industry. American Society of Mechanical
Engineers, 1994
8. Det Norske Veritas, OREDA, Offshore Reliability Data, 2nd Edition, ISBN 82 515 0188 1,
1992
9. SINTEF, Reliability data for safety instrumented systems. PDS Data handbook, 2004
Edition. SINTEF, September 2004.
10. Velten-Philipp, W., Houtermans, M.J.M., The effect of diagnostic and proof testing on
safety related systems. Control 2006, Glasgow, Scotland, 30 August – 1 September,
2006
11. Houtermans M.J.M, IEC 61508: An Introduction to Safety Standard for End-Users.
SISIS 2004, Buenos Aires, Argentina, September 2004
12. Billinton R., Allan R.N., Reliability Evaluation of Engineering Systems, Concepts and
Techniques. Pitman Books Limited, London, 1983.
13. Al-Ghumgham M.A., “On A Neural Network-Based Fault Detection Algorithm”; Chapter 4 of
Master Research Thesis in fulfillment of Master Degree program for Control Systems
Engineering, KFUPM, 1992.
14. Al-Ghumgham M.A., Hermoso A., Humaidi M.A., “Safety and reliability: Two faces of A
coin for Ammonia Plant ESD System”; ISA EXPO 2005, Chicago, USA.
