
Root Cause Analysis (RCA)

An essential element of Asset Integrity Management and Reliability Centered Maintenance Procedures

Dr Jens P. Tronskar
Definition of Root Cause Analysis (RCA)

Root Cause Analysis (RCA) is a structured process that uncovers the physical, human, and latent causes of any undesirable event in the workplace.

Can be:
•Single or multidiscipline cases
•Small or large cases
Some other definitions
Failure Cause –
• The physical or chemical processes, design defects, quality defects, part misapplication, or other processes that are the basic reason for failure or that initiate the physical process by which deterioration proceeds to failure.
• The circumstances during design, manufacture, or operation that have led to a failure.
Failure Effect – The consequence(s) a failure mode has on the operation, function, or status of an item.
Failure – The termination of an item's ability to perform a required function.
Failure Mode – The effect by which a failure is observed on the failed item.
Root Cause Analysis (RCA)

Indispensable component of proactive and reliability centred maintenance
Uses advanced investigative techniques
Applies corrective actions
Eliminates early life failures
Extends equipment lifetime
Minimizes maintenance

Traditional maintenance strategies tend to neglect something important:

Identification and correction of the underlying problem.
A Root Cause Analysis will disclose:

Why the incident, failure or breakdown occurred
How future failures can be eliminated by:
– changes to procedures
– changes to operation
– training of staff
– design modifications
– verification that new or rebuilt equipment is free of defects which may shorten life
– verification that repair and reinstallation are performed to acceptance standards
– identification of any factors adversely affecting service life and implementation of mitigating actions
[Figure: Production (improved availability "up-time" and increased production) plotted against the era of maintenance strategies, in four stages: Reactive maintenance, Periodic maintenance, Predictive maintenance (condition monitoring), and Proactive maintenance/RCFA strategies. Today's level is marked.]
Reactive maintenance

• Run the equipment until breakdown


• Overhaul and repair
• Extensive unplanned downtime and recurrent
repair
Periodic maintenance

Scheduled calendar or interval-based maintenance
Expensive components exchanged even
without signs of wear or degradation
Unexpected failures with incorrect schedules
and component change-out
Predictive maintenance by
condition monitoring
Apply technologies to measure the condition of
machines
Predict when corrective action should be
performed before extensive damage to the
machinery occurs
Short and long-term benefits of
Proactive Maintenance Strategies
involving RCFA:

Optimization of service conditions:


Increased production
Reduced downtime
Reduced cost of maintenance
Increased safety
Experience and statistical data
MMS DATABASE
Information on equipment design and service conditions
Failure statistics, e.g. MTBF
Description of service failures, approach and methods
for failure investigation
Consequences of failure:
Downtime/pollution and spillage/secondary damages
Causes of failures
Recommendations and remedial actions
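
As an illustration of the failure statistics such a database could hold, the sketch below estimates MTBF (mean time between failures) as operating time divided by the number of failures. The record layout, the tag number and the hours are hypothetical and are not taken from any MMS database.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureRecord:
    """Hypothetical failure entry: equipment tag and resulting downtime."""
    tag_no: str
    downtime_hours: float


def mtbf_hours(observation_hours: float, records: List[FailureRecord]) -> float:
    """MTBF = operating (up) time in the period / number of failures."""
    if not records:
        return float("inf")  # no failures observed in the period
    uptime = observation_hours - sum(r.downtime_hours for r in records)
    return uptime / len(records)


# Assumed example: one year of observation (8760 h) with three failures.
records = [
    FailureRecord("P-101", 12.0),
    FailureRecord("P-101", 30.0),
    FailureRecord("P-101", 6.0),
]
print(mtbf_hours(8760.0, records))  # 2904.0
```

Subtracting downtime from the observation period is one common convention; some databases divide calendar time by the failure count instead, which gives a slightly higher figure.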
Methods and analytical tools to identify
the causes of failure or breakdown
Review background data
Loss Causation Model and RCA methods and working
process
Detailed analyses of failed parts/components:
Analyse service conditions
Utilise experience data from data bases or other sources
Laboratory investigation
The Loss Causation Model
LACK OF CONTROL → BASIC CAUSES → IMMEDIATE CAUSES → INCIDENT → LOSS

LACK OF CONTROL: Inadequate System; Inadequate Standards; Inadequate Compliance to Standards
BASIC CAUSES: Personal Factors; Job/System Factors
IMMEDIATE CAUSES: Substandard Acts; Substandard Conditions
INCIDENT: Inadequately Controlled Event
LOSS: Unintended Harm or Damage

Lack of control and basic causes: the main causes…
Immediate causes: something is done wrong or gone wrong
Incident: a failure
Loss: here the losses occur

© Det Norske Veritas
Data Collection
•Interviews
•Documents (paper) evidence
•Parts/component evidence
Interviewing Considerations
• Where to interview
• Who to interview
• Condition of people
at the scene
• How to handle
multiple witnesses
• How to handle after
the incident
• How to work with
teams
Investigation techniques
• A number of named techniques that are
commonly used within RCA:
– Step-method
– FMEA
– Bow-tie
– Event Tree
– Fault Tree
– Interview
– Fish Bone
– Why-Why
• The techniques have strengths and weaknesses
depending on the situation.
Methods for RCA; Content

• Data Collection
– Interviews
– Paper and technical evidence
• Methods for RCA
– STEP
– FMEA
– FTA
STEP 1: Register Equipment Incidents
Purpose: Register off-spec operation/performance, survey and condition monitoring data

Start: Triggered by off-spec operation/performance, survey and condition monitoring data
Stop: Incident logged in Maximo

Input to process:
• Off-spec operation/performance: equipment failure, trips, abnormalities
• Survey/Inspections/Audits/Reviews and Condition Monitoring by Maintenance

Process control:
• Issue Run-Log or Work Request in Maximo
• Assess cause of failure
• Perform short-term corrective action

Expected output from process:
• History of Condition Monitoring, Surveys, and Recommended Maintenance Action in Maximo
• Off-spec operation/performance logged in Maximo
• Failure report in Maximo: equipment failures, trips, abnormalities

Resources: Operation log, Operation department, Maintenance department, Maximo
STEP 2: Trigger Mechanism for RCA
Purpose: Evaluate need for RCA

Start: Registered HSE issues or off-spec operation/performance incidents
Stop: Start RCA

Input to process:
• Off-spec operation/performance: equipment failures, trips, abnormalities
• Surveys, Audits, Inspection, Reviews and Condition monitoring by Maintenance

Process control:
• Single incidents with high production loss or repair cost
• Prepare monthly report per site
• RAM
• Do Preliminary LCC; Actual Loss/Cost vs Investment (Replacement)
• Prepare quarterly report for HQ

Expected output from process:
• Incidents above trigger level: Recommended RCA Case
– Single operation incidents with production loss/repair cost > X
– Off-spec operation vis-à-vis (KPI)
– Multiple operating incidents per Tag no./Equipment type
– High risk findings from survey/CM
• Incidents below trigger level, and mitigation not cost effective: No Action

Resources: Plant Reliability Engineer/Senior Planning Engineer; HQ Senior Reliability Engineer; Reliability Engineer (Plant/HQ)
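
The trigger logic of STEP 2 can be sketched roughly as follows. This is only an illustration: the slide leaves the cost trigger X unspecified, so it is a parameter here, and the repeat-incident threshold, field names, tag numbers and amounts are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """Hypothetical incident entry, e.g. as exported from the maintenance system."""
    tag_no: str
    loss_and_repair_cost: float
    high_risk_finding: bool = False  # flagged by survey or condition monitoring


def recommend_rca(incidents: List[Incident], cost_trigger_x: float,
                  repeat_trigger: int) -> List[str]:
    """Return the tag numbers that exceed any of the assumed trigger levels."""
    recommended = set()
    counts = {}
    for inc in incidents:
        counts[inc.tag_no] = counts.get(inc.tag_no, 0) + 1
        # Single incident with production loss/repair cost > X, or a high risk finding.
        if inc.loss_and_repair_cost > cost_trigger_x or inc.high_risk_finding:
            recommended.add(inc.tag_no)
    # Multiple incidents on the same tag number.
    for tag, n in counts.items():
        if n >= repeat_trigger:
            recommended.add(tag)
    return sorted(recommended)


incidents = [
    Incident("P-101", 5_000), Incident("P-101", 3_000), Incident("P-101", 2_000),
    Incident("K-201", 250_000),
    Incident("V-301", 1_000),
    Incident("E-401", 500, high_risk_finding=True),
]
print(recommend_rca(incidents, cost_trigger_x=100_000, repeat_trigger=3))
# ['E-401', 'K-201', 'P-101']
```

Tags that stay below every trigger (V-301 in this example) fall into the "No Action" branch.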
STEP 3: Appoint the RCA Team
• Minor RCAs:
– Run within a department, using the procedure
• Larger RCAs:
– Leader – appointed by the Plant manager
– Facilitator – reliability engineer.
– Discipline(s) or specialists at specific plant

• Optional to involve:
– Disciplines from other sister plants
– HQ-Engineering support and technical staff
– Vendor
– Failure laboratories
– Other 3rd parties
– Specialist
STEP 4: The Root Cause Analysis
The main RCA report
1 Description of the Incident(s)
An incident is the event that precedes the loss or potential loss. This section should include a description of what happened.
Include all aspects related to the incidents, like outage time, cost of repair, people involved, tools in use, operational status,
weather conditions etc.

2 Immediate Cause(s)
The immediate causes of an incident are the circumstances that immediately preceded the contact and can usually be seen or
sensed. For example, if the incident is an oil spill, the immediate cause could be a broken seal. The Immediate Causes
are often the same as the failure codes registered in Maximo.

3 Basic Cause(s)
Basic Causes are the real causes behind the immediate causes: the reasons why the substandard acts and conditions
occurred, the factors that, when identified, permit meaningful management control. In the case of an oil spill caused by a broken
seal, the Basic Causes could be that the seal was of the wrong type, that it had a design fault, or that it was installed
incorrectly.

4 Lack of Control
Lack of Control means insufficient oversight of the activities from design to planning and operation. Control is achieved
through standards and procedures for operation, maintenance and acquisition, and follow-up of these. If an oil spill has
occurred because of incorrect installation of a seal, the Lack of Control could be related to inadequate procedures for
checking after maintenance.
Loss/Incident

Immediate Causes

Basic Causes

Lack of Control
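
The four report levels can be held in a simple record, which makes it easy to see whether an analysis has been carried all the way back to Lack of Control. The sketch below is a hypothetical structure, not the reporting system itself; the oil spill entries follow the example in the report description, while the "Pump trip" draft is invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RcaReport:
    """Hypothetical record mirroring the sections of the main RCA report."""
    incident: str
    immediate_causes: List[str] = field(default_factory=list)
    basic_causes: List[str] = field(default_factory=list)
    lack_of_control: List[str] = field(default_factory=list)

    def missing_levels(self) -> List[str]:
        """Names of the cause levels that have not been filled in yet."""
        levels = {
            "immediate_causes": self.immediate_causes,
            "basic_causes": self.basic_causes,
            "lack_of_control": self.lack_of_control,
        }
        return [name for name, items in levels.items() if not items]


oil_spill = RcaReport(
    incident="Oil spill",
    immediate_causes=["Broken seal"],
    basic_causes=["Seal installed incorrectly"],
    lack_of_control=["Inadequate procedures for checking after maintenance"],
)
# Invented draft that stops at the immediate cause.
draft = RcaReport(incident="Pump trip", immediate_causes=["Winding failure"])

print(oil_spill.missing_levels())  # []
print(draft.missing_levels())      # ['basic_causes', 'lack_of_control']
```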
RCA reporting system
Methods for RCA

• STEP; Sequential Time Event Plotting
• FMEA; Failure Mode and Effect Analysis
• FTA; Fault Tree Analysis
• + common sense, engineering/operational experience
STEP; Sequential Time Event Plotting
1. Identify actors
2. Identify events
3. Link 1 & 2
4. Mark substandard acts/deviations

[Figure: STEP diagram. Actors are listed down the left and the time line runs across. Actor 1: Event 3, Event 5, Event 7, Accident. Actor 2: Event 1. Actor 3: Event 2. Actor 4: Event 6. Actor 5: Event 4. Deviation 1 and Deviation 2 are flagged on the events.]
…all links are AND gates


FMEA; Failure Mode and Effect Analysis
Loss/Consequence: Pump not started
Columns: Function/Object; Failure Mode; Failure Cause; Consequence for System/Component; Detection; Likelihood (low – possible – high); Comment

Pump
• Broken axle. Cause: fatigue. Detection: none
• Impeller. Cause: corrosion/wear. Consequence: loss of pressure. Detection: pressure indicator
El. Motor (failure mode: fail to operate)
• Winding. Detection: none
• Soft-starter. Cause: unknown. Detection: none
• Switch. Cause: in off position. Detection: none
Signal/Alarm
• Sensor. Failure mode: fail to operate. Consequence: wrong signal to control unit. Detection: none
• High Temp. Protection. Failure mode: fail to operate. Consequence: no detection of failure and larger damage
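
An FMEA sheet like the one above lends itself to simple screening, for example listing the failure causes that have no means of detection. The sketch below is a minimal illustration: the rows follow the pump example as far as it can be read from the slide, and the field names and the grouping by object are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FmeaRow:
    obj: str                  # function/object
    failure_mode: str
    cause: str
    detection: Optional[str]  # None means no means of detection


rows = [
    FmeaRow("Pump", "Broken axle", "Fatigue", None),
    FmeaRow("Pump", "Loss of pressure (impeller)", "Corrosion/wear", "Pressure indicator"),
    FmeaRow("El. motor", "Fail to operate", "Winding", None),
    FmeaRow("El. motor", "Fail to operate", "Soft-starter (cause unknown)", None),
    FmeaRow("El. motor", "Fail to operate", "Switch in off position", None),
    FmeaRow("Signal", "Wrong signal to control unit", "Sensor fails to operate", None),
]


def undetected_by_object(rows: List[FmeaRow]) -> Dict[str, int]:
    """Count, per object, the rows that have no means of detection."""
    counts: Dict[str, int] = {}
    for row in rows:
        if row.detection is None:
            counts[row.obj] = counts.get(row.obj, 0) + 1
    return counts


print(undetected_by_object(rows))
# {'Pump': 1, 'El. motor': 3, 'Signal': 1}
```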
Fault Tree
What is a Fault Tree?
• Identifies causes for an assumed failure (top event)
• A logical structure linking causes and effects
• Deductive method
• Suitable for potential risks
• Suitable for failure events

[Figure: example fault tree. Top event A is an OR gate over basic event E1 (Component 1) and intermediate event E2; E2 is an AND gate over basic events E3 (Component 2) and E4 (Component 3).]
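
A fault tree can also be evaluated numerically. The sketch below assumes the example tree is read as: top event A = E1 OR E2, with intermediate event E2 = E3 AND E4. It further assumes that the basic events are independent. The probabilities are invented for illustration.

```python
def p_and(*probs: float) -> float:
    """AND gate: all independent input events must occur."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def p_or(*probs: float) -> float:
    """OR gate: at least one of the independent input events occurs."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


# Assumed basic event probabilities (hypothetical values).
p_e1 = 0.01   # Component 1 fails
p_e3 = 0.10   # Component 2 fails
p_e4 = 0.20   # Component 3 fails

p_e2 = p_and(p_e3, p_e4)   # intermediate event
p_top = p_or(p_e1, p_e2)   # top event A

print(round(p_top, 4))  # 0.0298
```

With these numbers the single component under the OR gate and the redundant pair under the AND gate contribute about equally, which is the kind of structural insight a fault tree is meant to give.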
Which one to use?
• STEP:
– For complex events with many actors
– When time sequence is important
• FMEA:
– Getting an overview of all potential failures
– Easy to use
• FTA:
– Identifies structure between many
different failure causes
– Non-homogenous case (different
disciplines)
Detailed analyses of failed
parts/components
Typical examples of systems/equipment
that can be analyzed:
Electrical generators
Heat exchangers
Subsea equipment
Valves
Control systems
Pumps
Fire and gas detectors
Sensors and measuring devices
Components of gas turbines
Compressors
Cranes and lifting equipment
Well and down hole drilling equipment
Proactive maintenance through
Root Cause Failure Analysis
(RCFA)
Maintenance strategy based on systematic and
detailed knowledge of the causes of failure and
breakdown
Systematic removal of failure sources
Prevent repetitive problems
Minimise maintenance down-time
Extend equipment life
RCFA evaluates factors affecting
service performance such as:
Materials/corrosion/environment
Changes in operational conditions
Stresses and strains
Presence of defects and their origin,
nature and consequences
Design
Welding procedures and material
weldability
The most common causes of
service failures or breakdown:
Incorrect operation
Poorly performed or inadequate
maintenance
Incorrect installation and bad
workmanship
Incorrect repair introducing new defects
Poor quality manufacture leading to sub-
standard components
Poor design
Examples of problems disclosed
by the laboratory investigation
as part of the RCFA:
GEARS
• Incorrect material
• Incorrect heat treatment
• Incorrect design
• Incorrect assembly
• Corrosion
• Lubricating problems
• Vibration
• Incorrect surface treatment
• Geometric imperfections
• Incorrect operation
• Fatigue or overloading
Examples of problems disclosed
by the laboratory investigation
as part of the RCFA:
BOLTS
• Incorrect material
• Poor design
• Manufacturing defects
• Incorrect assembly
• Corrosion
• Vibration
• Poor or incorrect surface treatment
• Geometric imperfections
• Incorrect application
• Incorrect torque or overloading
Examples of problems disclosed
by the laboratory investigation
as part of the RCFA:
BALL-/ROLLER BEARING
• Poor design
• Manufacturing defects
• Poor alignment and balance
• Seal failure
• Electrical discharge (arcing)
• Overload
• Inadequate lubrication
• Vibration
• Contamination
• Fretting
• Corrosion
Root Cause Failure Analysis
Disclosed Failure of:

MAIN BEARING
• Heavily worn raceway, cracking of
casehardened surface, plastic deformation of
sealing groove
• The main cause of failure was overloading of
the bearing.
Actions/recommendation:
• Reanalysis by FEM and redesign
Root Cause Failure Analysis Disclosed
Failure of:
O-RING
• Four gas leaks on TLP platform equipment in HP & IP service
• Caused by explosive decompression (ED) of O-Ring
• Actions/recommendation: Change to another O-Ring type with other elastomer
Examples of problems disclosed by
the laboratory investigation as part
of the RCFA:
DRIVE SHAFTS
• Incorrect material quality
• Incorrect design
• Poor quality manufacture
• Geometric imperfections
• Incorrect operation
• Surface defects
• Corrosion
• Incorrect balance and alignment
• Incorrect assembly
• Fatigue or overloading
ROOT CAUSE FAILURE ANALYSIS
DISCLOSED:

Bearing Breakdown
• Axial overloading
• Thrust washers fitted in both bearing housings
• Incorrect assembly
Actions/recommendation:
Remove thrust washers from one of the bearing
housings
ROOT CAUSE FAILURE ANALYSIS
DISCLOSED:

Gear Breakdown
• Broken gear tooth. Fatigue initiated from
quench cracks.
• Fabrication induced defects (Basis for
discussion of liability and subsequent claims
against manufacturer)
Actions/recommendation:
Fitting of new gears where heat treatment and
case hardening procedure had been verified
to be correct
ROOT CAUSE FAILURE ANALYSIS
DISCLOSED:

Damaged pinion and gear wheel


Severe surface deformation on one side of teeth
No surface hardening
Incorrect lubrication
Actions/recommendations:
Renew gear wheel and pinion with components
that have been verified to have correct surface
hardening. Change lubricant and revise
lubrication procedure.
Typical components
that can be analysed
Gears
Bearings
Bolted connections
Shafts
Impellers
Pistons/cylinders
Motor rotors/stators
Pressurized components and pressure vessels
Steel wire ropes
Hydraulic components
Welded joints
Reliability assessment

Management Process-1

SW: Other..

Operator Process-2

… considering total system reliability!


STEP
(Sequentially Time Event Plotting)
STEP Method
(Sequentially Time Event Plotting)
• Capturing of the sequential events leading up to an
accident.
• Can be a simple timeline
• Investigation of larger incidents/accidents where the
time sequence is important
• Handles complex events with:
– several actors
– several events in parallel
– a longer time horizon
• Should include both equipment, control and human
actions


Example of a simple STEP diagram
Case: Manual valve sealing oil leakage
Time line: January, May, June

Actors and events:
• Engineer: Missed annual inspection of valve (Deviation 1)
• Sealing: Sealing becomes dry and brittle; Inadequate tightening
• Valve: Oil leakage
• Operator: Manually moving the valve
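
A STEP diagram is in effect a small graph of events, each tied to an actor and a position on the time line. The sketch below encodes the valve example as a minimal illustration. The ordering of the events, the links between them and the choice of which events cause the oil leakage are assumptions, since the slide layout does not fix them unambiguously.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Event:
    name: str
    actor: str
    order: int                # assumed position on the time line
    deviation: bool = False   # marked as a substandard act/deviation
    caused_by: List[str] = field(default_factory=list)  # all must occur (AND gate)


events = [
    Event("Missed annual inspection of valve", "Engineer", 1, deviation=True),
    Event("Sealing becomes dry and brittle", "Sealing", 2,
          caused_by=["Missed annual inspection of valve"]),
    Event("Inadequate tightening", "Sealing", 2),
    Event("Manually moving the valve", "Operator", 3),
    Event("Oil leakage", "Valve", 4,
          caused_by=["Sealing becomes dry and brittle", "Inadequate tightening",
                     "Manually moving the valve"]),
]


def out_of_order_links(events: List[Event]) -> List[Tuple[str, str]]:
    """Links where a cause does not come before its effect on the time line."""
    by_name = {e.name: e for e in events}
    return [(c, e.name) for e in events for c in e.caused_by
            if by_name[c].order >= e.order]


def deviations_leading_to(events: List[Event], target: str) -> List[str]:
    """Walk the links backwards from the target and collect marked deviations."""
    by_name = {e.name: e for e in events}
    seen, stack, found = set(), [target], []
    while stack:
        event = by_name[stack.pop()]
        if event.name in seen:
            continue
        seen.add(event.name)
        if event.deviation:
            found.append(event.name)
        stack.extend(event.caused_by)
    return sorted(found)


print(out_of_order_links(events))                     # []
print(deviations_leading_to(events, "Oil leakage"))   # ['Missed annual inspection of valve']
```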
FMEA
Failure Mode and Effect Analysis

FMECA
Failure Mode and Effect Criticality
Analysis
FMEA (Cause-Consequence)
(Failure Mode and Effects Analysis)
• Overview of failure mode and effect for a
complex machinery/operation
• Getting an overview of all potential failure
causes and effects at an initial stage of an
investigation
• Requires detailed knowledge of the problem in
question
• Easy to use for both events and for potential
losses where risk is included
• Not good at handling time series
Technique/Working Process
Analysis Goal
System definition
•System boundaries
•Operational state
•Limitations, assumptions
System description
•Documentation
•Division into sub-systems (e.g. functional decomposition)
Analysis planning
•Find expert team
•Plan expert sessions (when, what, who?)
•Make documentation available
Expert sessions
•Guided brainstorming to collect information
•Fill in forms
Likely Causes
Evidence Finding
•Inspections
•Failure Analysis
•Interview
Exclusion
Final Causes
Cases/Examples
Offshore Gas production
Statistics from 320 incidents/"RCA" cases

Total Losses: Ca. 100 mill$/yr
• Design: 33 %
• Personal related: 26 %
• Other: 18 %
• Lack of management of work: 15 %
• Preventive Maintenance: 8 %
[Figure: Immediate Causes - Substandard Conditions. Bar chart of the number of immediate causes (N, axis 0 to 180) for A1.3: Failure during service, A1.4: Failure during startup, A1.5: Failure during maintenance.]

[Figure: Immediate Causes - Substandard Acts. Bar chart, vertical axis 0 to 25, one bar per substandard act category.]
Basic Causes
[Figure: Basic Causes - work related. Bar chart, vertical axis "No of events" (0 to 70), one bar per work-related basic cause.]

[Figure: Basic Causes - Personal Factors. Bar chart, vertical axis 0 to 25, one bar per personal factor.]
Explosion and fire at refinery
[Figure: Refinery Explosion & Fire. Labels: localised corrosion in overhead piping; debutanizer overhead receiver; debutanizer column.]
Longford Gasplant
Rich oil de-ethanizer reboiler
Root Cause Failure Analysis
DISCLOSED:
BRITTLE FRACTURE IN CHANNEL TO TUBESHEET WELD
• Low temperature due to process upset caused brittle fracture initiation from root of weld containing lack of fusion defect
• Actions/recommendations: Reconstruct using low temperature steel grade, carry out proper UT. Modify operation procedure and controls to prevent future process upsets.
Damage mechanism: Brittle fracture
RCFA of LNG Plant Failure
RCFA of WHRU
Metallurgical investigation
Findings

• Explosion caused by trip of turbine and leak
from WHRU gas coil to header weld
• Following gas leak, auto-ignition of air/gas
mixture occurred. The auto-ignition temperature
was equal to the surface temperature of the
equipment based on instrument readings
• Weld failure due to creep/fatigue and time
dependent embrittlement of weld HAZ
• Damage was caused by air/gas mixture
explosion equivalent to 68 kg TNT
Failure of 24” OD subsea clad pipeline
Corrosion in 24” OD clad pipeline
