Utilizing RCA For Process Related Failures

Utilizing RCA for Process Related Issues
Presented at SMRP Conference – 2006
Written by: Ken Latino
Practical Reliability Group

P.O. Box 284
Daleville, VA 24083-0284
302-525-4309
info@practicalreliabilitygroup.com
www.practicalreliabilitygroup.com
Normally when we think of Root Cause Analysis (RCA) we think of a machine that has
failed. Perhaps a drive shaft has fractured on a critical piece of equipment and we
need to find out what happened. Although this is a common use of RCA there are
other types of events that lend themselves to this type of analysis. These include
process issues, quality defects, customer complaints, and many others.
In actuality, process issues that cause daily upsets are perfect candidates for RCA.
Whether you are having trouble removing moisture from a product, shutting down a
production line due to continual conveyor trips, exceeding the limits on critical
process variables or anything in between, RCA can help. These issues tend to be very
costly to a plant because they are ongoing and can cause significant losses in
production and product quality and, ultimately, customer dissatisfaction.
Before we begin talking about the specifics of using RCA on process issues, it is
important to define what is meant by RCA. Although there is not an official standard
for performing RCA, there are some specific guidelines that are needed to perform
any RCA:
A factual definition of the failure
Data, Data, Data!!!
A cross-functional team
A systematic process for analyzing the data
A mechanism for addressing the corrective actions
A means of ensuring that the corrective actions were effective
Let’s begin by discussing these critical factors. Most analysis teams fail because they
are trying to solve a problem that does not exist. This may seem silly, but it is
surprising how often the analysis team defines the problem improperly. Instead of
focusing on the consequences and factual failure modes, they tend to focus on
symptoms that often are not the real problem. We will discuss this in more detail
later in this paper, but for now just remember that you cannot solve problems that
“do not exist”. Accurately defining the problem is the key to a successful analysis.
I often correlate RCA investigations with police investigations. I ask my clients what
the first step is in a police investigation. The answer is always to collect the
evidence. The success of a police investigation is totally based on the evidence.
While we are not looking for “whodunit” in our investigations, we still require the
physical data to make our case. Without data, we are simply guessing, using trial and
error! This is a very expensive method for solving problems. In order to be
successful, you must collect pertinent data about the issues you are studying. For
example, if we are having a problem with controlling process temperature, we will
need to collect data on when the problem started, process changes, process flow
diagrams, maintenance histories, and much more.
PRG - Utilizing RCA for Process Related Issues

I use a simple acronym to help collect this data. It is simple to remember, and works
for almost any type of investigation. It is called the 5P’s, and it stands for the five
categories of data that need to be collected for any analysis:
Parts
People
Position
Paper
Paradigms
The idea is for the analysis team to discuss what data will be necessary to determine
the root causes of the issue based on these five categories. They would need to
determine what parts would have to be analyzed to solve the problem, for example
failed heat exchanger tubes, broken coupling bolts, or instruments.
Positional information would include things like the time of the events. This is
particularly important when you are performing analysis on process related issues that
happen repeatedly. The idea is to determine if there is a pattern or correlation for
when the events are occurring. Other positional information is related to key process
variables, location of the events, and the like. Positional information is arguably the
most important of the 5P’s, especially when dealing with process related issues.
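As a minimal sketch of looking for a timing pattern in repeated events, the snippet below groups event timestamps by shift. The timestamps, the three-shift boundaries, and the names `trip_times`, `shift_of`, and `events_by_shift` are all assumptions made up for illustration; real data would come from shift logs or the control system historian.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event log: times of repeated conveyor trips (illustrative data only).
trip_times = [
    "2006-03-01 06:10", "2006-03-01 06:45", "2006-03-01 14:20",
    "2006-03-02 06:05", "2006-03-02 06:30", "2006-03-02 22:15",
    "2006-03-03 06:20",
]

def shift_of(ts: datetime) -> str:
    """Map a timestamp to an assumed three-shift pattern (06-14, 14-22, 22-06)."""
    if 6 <= ts.hour < 14:
        return "day"
    if 14 <= ts.hour < 22:
        return "evening"
    return "night"

def events_by_shift(timestamps):
    """Count events per shift so a timing pattern, if any, stands out."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in timestamps]
    return Counter(shift_of(t) for t in parsed)

counts = events_by_shift(trip_times)
for shift, n in counts.most_common():
    print(f"{shift}: {n} trips")
```

A cluster in one shift or one hour is not a cause in itself. It is positional data that helps the team form hypotheses worth verifying.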
Paper information is fairly obvious, but there are a couple of items that should always
be collected. Drawings like process flow diagrams (PFD) and piping and
instrumentation diagrams (P&ID) are extremely useful in any RCA. The analysis team
needs a visual reference for the problem they are studying and drawings like these
usually fit the bill. If relevant, machine or equipment cutaway drawings can also be
useful. These can usually be acquired from the vendor or the manufacturer of the
equipment. Other common paper items that should be considered are maintenance
histories, shift logs, inspection reports, specifications, and a host of others. Be
careful with paper data because it can quickly be overwhelming. If paper data is
secure and available, then you can always collect it later in the analysis as you need
it. Do not collect more than you need, or it can stifle the analysis unnecessarily.
As in a police investigation, interviews with knowledgeable people are a must. People
information usually comes in the form of a discussion or interview, and should be
specific to the problem at hand. In some cases, it will come from eyewitness accounts of
the issue, or from a third party like a vendor or sister facility that might be having
similar issues. Just remember that if you are trying to get eyewitness accounts of the
issue, it is best to get to the individual as quickly as possible so that their
information is not lost to fading short-term memory or colored by the opinions of others.

Finally, the last “P” represents paradigms. Paradigms are the collective opinions of
others. They typically become evident after a number of interviews with plant
personnel. They usually come in the form of statements like “It is a design problem”
or “It is maintenance’s fault it continues to fail”. Although paradigms are not facts,
they are perceived as facts and can often dictate how we deal with the issue at hand.
For example, if we have a problem with leaking tubes it might be a paradigm that it is
a metallurgical issue, and therefore, we have to solve the problem with different
metals. It may have nothing to do with the metallurgy, but if that is the paradigm,
then the problem will persist.
We have heard about the virtues of cross-functional teams for years. The fact is it is
hard to solve problems in a vacuum. We need the opinions and knowledge of others.
If conventional wisdom were enough, then the problems would already be solved.
The reason they persist is that the chronic causes fall outside of conventional wisdom.
I suggest having representation from operations, maintenance, and technical services
involved in an analysis. These, coupled with an unbiased facilitator, will provide the
catalyst for a successful outcome. Let me expand for a moment on the unbiased
facilitator. It is human nature to assign a complex problem to the person we feel has
the most knowledge and experience. However, this is precisely the wrong thing to
do. The facilitator should not be the expert in the particular problem, but rather an
expert in RCA methodology and facilitation. This gives them the ability to ask the
tough questions since they have nothing to gain or lose from the outcome of the
analysis.
Once the data has been collected and a formal team is in place, we can now begin to
systematically determine the real root causes of the problem. I want to highlight
“causes” because every failure or problem has multiple contributing causes. There
are many techniques for systematically determining root causes. Some that come to
mind are fault trees, fishbone diagrams, cause and effect diagrams, and many others.
I prefer to use a logic tree for looking at process related issues. Logic trees are
similar to fault trees except that they are historical and not probabilistic. The key to
any scientific method for solving a problem is to develop hypotheses, and then use
facts and data to prove or disprove the hypotheses.

A logic tree is constructed on an event (aka failure event), modes (aka failure modes),
hypotheses, and causes. Below is a simple depiction of the logic tree structure.
[Figure: logic tree levels from top to bottom: Event, Mode, Hypothesis, Physical Causes, Human Causes, Latent Causes]
Figure 1 – Logic Tree Structure
The first two levels of the logic tree define the problem. A better way to think of the
Event block is the consequence of the event. For example, if we have repeated
conveyor failures, then the consequence would be lack of feed to a downstream
process.
The modes are the factual reasons for the event (consequence). For example, the
conveyor might have repeated trips, or perhaps the rollers continue to malfunction.
The key to both the mode and the event components of the logic tree is that they
are facts. The top of the logic tree must be facts, or the process will not be
successful. Remember, you cannot solve problems that do not exist!
The next step is the formulation of ideas. These are hypotheses or “educated
guesses” about how the modes have occurred in the past. Because these
hypotheses are guesses, they need to be validated to see whether they are true or not.
This step is critical to the process. I always tell my clients that if you are not planning
on verifying the hypotheses, then you should not bother performing the RCA in the
first place. Without the verification step, you are simply guessing.
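One way to keep track of which hypotheses still need verification is to hold the logic tree as a small data structure. The sketch below is only an illustration: the `Node` class, the `open_hypotheses` helper, and the three conveyor hypotheses are assumptions invented for the example, not part of any RCA standard or tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One block in a logic tree: an event, mode, hypothesis, or cause."""
    label: str
    kind: str                        # "event", "mode", "hypothesis", ...
    verified: Optional[bool] = None  # None = not checked yet; True/False = result
    children: List["Node"] = field(default_factory=list)

def open_hypotheses(node: Node) -> List[str]:
    """Collect hypotheses that nobody has proven or disproven with data yet."""
    found = []
    if node.kind == "hypothesis" and node.verified is None:
        found.append(node.label)
    for child in node.children:
        found.extend(open_hypotheses(child))
    return found

# Event and mode follow the conveyor example in the text; the hypotheses are hypothetical.
tree = Node("Lack of feed to downstream process", "event", True, [
    Node("Repeated conveyor trips", "mode", True, [
        Node("Motor overload", "hypothesis", False),           # disproven by data
        Node("Belt tracking off-center", "hypothesis", True),  # confirmed by data
        Node("Faulty trip sensor", "hypothesis"),              # still a guess
    ]),
])

print(open_hypotheses(tree))
```

Anything the helper returns is still a guess, so the analysis is not finished while that list is non-empty.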

Verifying hypotheses is an iterative process. Eventually, the generation of
hypotheses will result in the discovery of causes. There are three types of causes:
Physical
Human
Latent
Let me explain the difference between these different types of causes. Physical root
causes are the physical mechanisms that cause the modes, and ultimately, the
consequences to occur. Examples of physical causes are high vibration, overload, and
corrosion, just to name a few. The human root causes are related to human
intervention: not doing something you were expected to do, or doing
something that you were not supposed to do. Examples of human root causes are
things like misalignment, opening the wrong valves, running a machine beyond its
design limitations, and many others. Generally, people do not wake up in the
morning and decide to make things fail or run poorly at work. Most workers want to
keep their jobs and want to do a good job at work. Therefore we need to dig a little
deeper into why people make these mistakes of omission or commission. The reasons are
called latent or system roots. These are the underlying systems that dictate how
work gets done. Examples of latent root causes are things like time pressures to
perform a repair, incorrect or nonexistent procedures, lack of knowledge or skill, and
many others. These are the real “ROOT CAUSES” of failure. If these underlying or
latent issues are not addressed, the problem will persist. I like to make the analogy to
a weed in your yard. If you simply cut it at the stem it will grow back, but if you dig
up the underlying roots it will not reappear.
The predictable thing about these causes is that they always come in that order
(Physical, Human and Latent). In order to be successful, you must identify all of the
causes. It is virtually impossible to eliminate a problem if you do not first identify
the physical, human, and latent roots. I often see analysis teams make mistakes by
identifying things like poor design or poor maintenance practices at the top of their
logic tree. These are latent issues, and you cannot accurately identify those until you
know what the physical causes are. If it is a design issue, then you must first identify
what about that design makes the mode occur. Is it too big, too small, too fast, or
too slow?
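The physical, human, latent ordering can be treated as a simple completeness check on each branch of the tree. The following is a rough sketch under that assumption; the helper names and the example chain (built from the cause examples mentioned earlier) are illustrative only.

```python
# Drill-down order described in the text: physical first, then human, then latent.
DRILL_DOWN_ORDER = ["physical", "human", "latent"]

def missing_levels(chain):
    """Return the cause levels a branch has not reached yet, in drill-down order."""
    present = {level for level, _ in chain}
    return [level for level in DRILL_DOWN_ORDER if level not in present]

def in_drill_down_order(chain):
    """True if the causes in the branch appear in physical, human, latent order."""
    positions = [DRILL_DOWN_ORDER.index(level) for level, _ in chain]
    return positions == sorted(positions)

# Hypothetical branch that stops at the human cause.
chain = [
    ("physical", "Overload"),
    ("human", "Machine run beyond its design limitations"),
]

print(missing_levels(chain))       # levels still to be identified
print(in_drill_down_order(chain))  # whether the branch follows the expected order
```

A branch that reports a missing latent level has stopped at what someone did, not at why the system allowed it.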

Once the causes have been identified and verified, it is now time to implement
corrective actions to ensure that the problem will not continue to cause the negative
consequences. This seems like common sense, but many analysis teams fail to
convince others of the need to make the necessary changes to alleviate the
consequences of the problem. Some things are simple and require little or no
authorization from management, while others require management support and
backing. Analysis teams need to be able to accurately communicate the business
need for the corrective actions so that management will buy in to the effort. Many
companies have formal processes for recommendations and the tracking of those
recommendations. If this is the case at your company, then find out how the system
works and make sure to take advantage of the work process.
Last of all, you need to monitor the success or effectiveness of the corrective actions.
For example, if you are having repetitive conveyor failures, then you need to track
the number of those events before and after the corrective actions have been
implemented to verify that they were effective at reducing the consequences.
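A before-and-after comparison like this can be as simple as comparing event rates over equal periods. The counts and the eight-week windows below are invented numbers, and the function names are assumptions for this sketch.

```python
def events_per_week(event_count: int, weeks: float) -> float:
    """Average number of events per week over an observation window."""
    return event_count / weeks

def percent_reduction(before_rate: float, after_rate: float) -> float:
    """Reduction in event rate, as a percentage of the rate before the fix."""
    return 100.0 * (before_rate - after_rate) / before_rate

# Hypothetical counts: 48 conveyor failures in the 8 weeks before the
# corrective actions, 10 failures in the 8 weeks after.
before = events_per_week(48, 8)
after = events_per_week(10, 8)
reduction = percent_reduction(before, after)

print(f"before: {before:.2f}/week, after: {after:.2f}/week, reduction: {reduction:.1f}%")
```

Using the same window length on both sides keeps the comparison fair. A short window after the change may not yet show whether the improvement holds.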
Now that we understand the basic RCA approach, let’s examine how this technique
can and should be used to address process failures. First let me explain the
difference between what I am calling process failures and more traditional equipment
failures.
Many times RCA is only used when there is a catastrophic failure of an asset or there
is a safety or environmental incident. Although these issues must be analyzed to
ensure that they do not happen again, they are typically sporadic in nature, meaning
that they are somewhat rare events. Rather than focus on sporadic
events, I would like to focus on the more common chronic process events. I am not
talking about the pump that might fail every 6 months due to a seal leak. I am talking
about process issues like bottling lines that jam every few minutes in a beverage
plant, web breaks in a printing press or paper making machine, or product quality
problems due to the inability to control key process variables. These are events that
are ongoing and are affecting the bottom line each and every day.
These issues are somewhat easier to solve due to one very important reason. They
provide us the ability to collect lots of data about the problem. This is unlike
sporadic failures where you really only get one opportunity to collect the failure data.
If you do not collect the data immediately, then it is impossible to recover it at a
later time. The steps for analyzing a process failure are not really that much
different than studying a mechanical type of failure. The key is to define the failure
event and modes accurately and factually.

Let’s examine a couple of examples to make the ideas more concrete. I once worked
with a cigarette manufacturer who was having problems with “rod breaks”. A rod is
the actual paper and tobacco rolled into a continuous “rod” before it is actually cut
into individual cigarettes. This event was happening many times a day on most of the
cigarette making machines. In actuality, it was happening literally hundreds of times
a day. Each event only took a few minutes to correct with only limited interaction
from the operator. When they added up the frequency times the cost of the few
minutes of downtime, it ended up being a multi-million dollar problem.
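The arithmetic behind a figure like this is frequency times duration times the cost of lost time. The numbers below are hypothetical and are not the manufacturer's actual figures. They only show how a few minutes per event can add up.

```python
def annual_cost(events_per_day: float, minutes_per_event: float,
                cost_per_minute: float, operating_days: int) -> float:
    """Yearly loss from a chronic event: frequency x duration x cost of downtime."""
    return events_per_day * minutes_per_event * cost_per_minute * operating_days

# All inputs are assumed values for illustration only.
loss = annual_cost(
    events_per_day=300,    # "hundreds of times a day" across the machines
    minutes_per_event=3,   # "a few minutes" to correct each break
    cost_per_minute=10.0,  # assumed value of one minute of lost production
    operating_days=350,    # assumed operating days per year
)

print(f"Estimated annual loss: ${loss:,.0f}")
```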
When we began to assist in the analysis, we asked how the rod was breaking. Since
they had obviously given this issue a great deal of thought, we figured it would be
easy to get the answer to this simple question. It turned out that they were not sure
how to answer the question. They said that it just broke. We pressed on because we
needed to know the exact “mode” of the event. Since they were not sure how to
answer, we had them do some data preservation work to help describe the mode(s)
for a broken rod. It was determined that there were several modes, but there were
two that occurred most frequently. We called them Straight Slice and Jagged Edge, and
each ran either from NE to SW or from NW to SE.
[Figure: four rod break patterns: Straight Slice NE-SW, Straight Slice NW-SE, Jagged Edge NE-SW, Jagged Edge NW-SE]
Figure 2 – Rod Break Example
By clearly describing the failure modes in this way, we were able to more clearly
define the problem. Although there were many ways that a rod had broken in the
past, these were the most common and the ones in most need of a solution.
So the logic tree definition was created to very specifically delineate the modes of a
rod break. The causes for each of these failure modes could be similar or, like in
many cases, totally different. For this reason, it is critical to clearly define how the
failure is currently occurring.

Below is an example of how the logic tree problem definition was created.
Event: Unplanned Outage of Cigarette Machine
Modes: Straight Slice NE-SW; Straight Slice NW-SE; Jagged Edge NE-SW; Jagged Edge NW-SE
Figure 3 – Rod Break Logic Tree Failure Definition
Once the top of the logic tree is defined and factual, then the process for developing
hypotheses is consistent with any other type of RCA. Start by simply asking a series of
“How Can” questions, starting on the mode in question, and slowly work down to the
physical, human and latent roots.
Problem Definition
Event: Unplanned Outage of Cigarette Machine
Mode(s): Straight Slice NE-SW; Straight Slice NW-SE; Jagged Edge NE-SW; Jagged Edge NW-SE
How Can?: Tobacco Issue; Paper Issue; Machine Issue; Glue Issue
Physical: Operating Parameter Incorrect; Paper snagged on guide
Human: Speed of rod too fast; Speed of rod too slow; Misalignment of rod guide
Why?
Latent: Alignment of guide not deemed critical; No procedure for rod guide alignment; No time for proper alignments
Figure 4 – Sample Logic Tree
Let’s look at a few more examples of defining the problem for process
related failures.

High Temperature Issues of Hydrocarbon in a Petrochemical Plant
A temperature indicator identifies that the temperature on the outlet of the heat
exchanger is above the specified level. The consequence is that the product is off-
spec, and could potentially cause an over-pressurization situation downstream.
Problem Definition
Event: Off-Spec Product
Mode: Temperature exceeds 96 C at TI-010 at outlet of HEX-010
Product Issue; Instrumentation Malfunction on TI-010; Heat Exchanger Issue; Cooling Water Issue
Less than adequate flow of cooling water; Cooling water temperature exceeds spec.
Restriction of Flow; P-101 not providing required flow
Piping Issue; Valve Issue
…
Figure 5 – Petrochemical Product Cooling Issue
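A mode such as "temperature exceeds 96 C at TI-010" is factual only if the logged readings support it. As a small sketch, the check below counts readings above the limit. The reading values and the `exceedances` helper are made up for illustration. Only the 96 C limit comes from the example.

```python
LIMIT_C = 96.0  # limit taken from the mode statement in the example

# Hypothetical logged readings from TI-010, in degrees C (illustrative data only).
readings = [93.5, 95.8, 96.4, 97.1, 95.9, 96.2, 94.0]

def exceedances(values, limit):
    """Return (index, value) pairs for readings strictly above the limit."""
    return [(i, v) for i, v in enumerate(values) if v > limit]

over = exceedances(readings, LIMIT_C)
print(f"{len(over)} of {len(readings)} readings exceed {LIMIT_C} C")
for index, value in over:
    print(f"  reading #{index}: {value} C")
```

If no readings exceed the limit, the team is working on a problem that the data does not show, and the mode should be restated before any hypotheses are developed.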
Steel Mill Example
A steel mill is experiencing an issue where a width gauge is indicating that the width
of a roll of steel is too narrow, and does not meet customer specifications.
Problem Definition
Event: Off-Spec Product
Mode: Narrowing of the width at the width gauge
Figure 6 – Problem Definition for Coil Issue

Gas Plant Example
A gas plant is cleaning gas for its downstream customers. Operations indicate that
there is a foaming situation in the amine scrubber. The foaming is causing the plant
to have unplanned downtime, resulting in a restriction of service to its customers.
Problem Definition
Event: Unplanned Unit Outage
Mode: High differential pressure in the amine scrubber
Figure 7 – Problem Definition for Gas Plant Foaming Issues
These are just a few examples of defining the problem for process issues. If
you can successfully define the problem, then the odds of a successful analysis improve
dramatically. I would challenge you to go out and look for process issues (defects) at
your facility and apply these simple yet powerful techniques.
As I mentioned earlier, in the past there has been a mindset in industry that RCA is a
tool for large catastrophic system or asset failures, or for safety and environmental
incidents. Although these are excellent uses of RCA, this mindset misses many of the large
process opportunities that are robbing our facilities each and every day.
These techniques, coupled with the data, knowledge, and experience of our workforce,
allow us to solve almost any problem. The key is to make sure that you properly
define the problem based on facts and not assumptions. If you can master this skill,
then solving the problems is just a matter of data validation and perseverance. As
my son’s soccer coach always likes to say, “can’t is a cowardly word”. No problem is
too large to solve given the right mix of tools, techniques, and people.
