
ROOT CAUSE ANALYSIS (RCA)
Getting to the Bottom of It: Root Cause Analysis Steps, Tools, and Examples
What is root cause analysis?
By definition, root cause analysis is the process of
finding the underlying cause for an effect we
observe or experience.
In the context of failure analysis, RCA is used to
find the root cause of frequent machine
malfunctions or a significant machine
breakdown.
 RCA is a reactive process, meaning it’s performed after the
event occurs. But once a root cause analysis is done, its findings
work as a proactive mechanism, since they can be used to
anticipate and prevent similar problems before they occur.
 If you fix a symptom of the problem, but you don’t fix the
actual cause of the problem, there’s a high chance the
failure will happen again.
 For example, suppose you replace the broken belt but don’t
change the misaligned part causing the belt to overheat
and break.
 In that case, you could bet your paycheck that the belt is
going to fail again. RCA follows the chain of cause and
effect to pinpoint the underlying problem that, once
eliminated, makes all the other faults disappear.
The RCA process does not guarantee an
outcome
 Conducting root cause analysis can be very complicated. It
involves a vast amount of data collection and review.
 The result of a root cause analysis isn’t always black and white.
It can’t always tell you if the problem you identified is the root
cause.
 You will often get only a strong correlation between cause and
effect and not the exact cause.
 From there, you’ll have to use your experience and
professional knowledge to judge whether to investigate further
or not.
 RCA is a craft that requires specialized knowledge and
in-the-field experience, which means you’re likely the best
person for the job here.
 Without that expertise, any fixes implemented will likely be
just a cosmetic solution to the problem. In the worst-case
scenario, the changes made could actually make the
situation worse.
 Despite these limitations, RCA is still a powerful tool for
understanding and improving the fundamental nature
of systems and procedures.
Industry applications
 Over the years, RCA has evolved to work within various
fields, each with its own unique needs and approach.
The most apparent use of RCA is in the medical field. The
TV show House is an excellent example of RCA in
action.
 In the show, a complex and bizarre medical case usually
shows up at the hospital. The doctors are stumped! That
is until the unconventional wildcard Dr. House jumps in
and saves the day with his crazy theories and methods.
Aside from the healthcare field, many other
industries use root cause analysis regularly. Some
of them are:
 manufacturing (machine failure analysis)
 industrial engineering and robotics
 industrial process control and quality control
 information technology (software testing, incident management,
cybersecurity analysis)
 complex event processing
 disaster management and accident analysis
 pharmaceutical research
 change management
 risk and safety management
 These industries will generally use one specific type of root cause analysis
that fits their situation best. Below are some examples of different types of
RCA methodologies used by various fields and industries.
Different types of RCA
RCA comes in different forms depending on the problem you’re
trying to solve. Here’s what they look like:
 Safety-based RCA comes from the field of occupational safety
and health, as well as accident analysis. This type of root cause
analysis is used to determine why an accident happened at
work (i.e., why someone cut themselves or why a part was
accidentally dropped by a worker at heights).
 Production-based RCA is used in the field of manufacturing to
ensure quality control. You might use this to find out why the
injection-molded plastic parts are coming off the line warped.
 Process-based RCA is used in business and manufacturing to
determine the fault in a process or a system. This might be used in
accounting to determine why vendors aren’t getting paid on time.
 Failure-based RCA is used in engineering and maintenance to
determine the root cause of any type of equipment failure.
 Systems-based RCA originated as a combination of the root cause
analysis techniques listed above. It combines two or more methods
of RCA and can be used in a wide variety of fields and
applications.
When to perform a root cause analysis?
 When you’re doing an RCA to determine the source of a fault, you’ll usually find three
basic types of causes:
1. physical causes
2. human causes
3. organizational causes

 You can also do a root cause analysis if you want to drill down and find out exactly
why a process or procedure is producing better-than-average results. By identifying
the cause of a positive event, you could presumably replicate it and see those results
elsewhere. Even if it’s time-intensive, one round of RCA can mean a lot of bang for
your buck.

 Keep in mind that RCA requires a significant investment of time, manpower, and
money. And it will likely cause further disruption in the specific production line or the
system you’re working on. So bearing that in mind, you don’t need to (and you
shouldn’t) do RCA for every single fault.
 Persistent faults
If the same fault occurs over and over, it’s worth investigating. If the same defect
is repeatedly happening, you can assume that it won’t be cleared simply by
fixing the visible problem. There is an underlying reason for the recurring faults.
These types of incidents need to be investigated with RCA.
 Critical failure
To determine if a failure is critical, you can look at the cost to the plant or the
total downtime due to the particular failure. When a critical failure occurs, it
needs to be investigated to identify the root cause to help avoid this situation in
the future. Explosions at an oil rig and airplane crashes are examples of critical
failures that need to be investigated.

 Failure impact
There are critical machines and critical subprocesses in any system. A failure of
one of these machines will halt the entire operation because there may not be
a backup or mitigation plan for that particular machine. In this case, how critical
the machine is will determine whether or not to do RCA.
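The three triggers above (persistent faults, critical failure, failure impact) can be sketched as a simple triage check. This is only an illustration: the `FaultRecord` fields and the threshold values are hypothetical, and each plant would set its own.

```python
from dataclasses import dataclass


@dataclass
class FaultRecord:
    name: str
    occurrences: int           # times the fault recurred in the review window
    downtime_hours: float      # total downtime attributed to the fault
    cost: float                # total cost to the plant
    on_critical_machine: bool  # True if there is no backup or mitigation plan


# Hypothetical thresholds, chosen only for the example.
MIN_OCCURRENCES = 3
MIN_DOWNTIME_HOURS = 8.0
MIN_COST = 10_000.0


def rca_reasons(fault):
    """Return the list of triggers that justify running an RCA (empty if none)."""
    reasons = []
    if fault.occurrences >= MIN_OCCURRENCES:
        reasons.append("persistent fault")
    if fault.downtime_hours >= MIN_DOWNTIME_HOURS or fault.cost >= MIN_COST:
        reasons.append("critical failure")
    if fault.on_critical_machine:
        reasons.append("failure impact")
    return reasons


belt = FaultRecord("belt breaks", occurrences=5, downtime_hours=2.0,
                   cost=500.0, on_critical_machine=False)
print(rca_reasons(belt))  # ['persistent fault']
```

An empty list means the fault does not currently justify the time and disruption of a full RCA.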
The 3 Rs of Root Cause Analysis
No doubt you’ve heard these 3 Rs: “reduce, reuse, recycle” or maybe
even “reading, writing, arithmetic.” But RCA also has its own system of
3 Rs: Recognize, Rectify, Replicate.
Recognize
 The actual cause of a problem is not always apparent,
and simple cosmetic fixes usually don’t do much to
correct the underlying fault.
 Even though RCA can be an elaborate, time-consuming
exercise, we do it to pinpoint the actual cause so we can
take corrective actions that will eliminate future issues.
 As mentioned earlier, RCA can also be done to identify
the reason for an unexpected positive outcome.
 This first step is when you notice something’s not working
quite right. The machine is leaking fluid, making a weird
sound, or not running as productively as it usually does.
 This is when it’s time to put on your detective cap and find
out what’s going on.
Rectify
 Once you’ve recognized the root cause, it’s time to start a corrective
course of action. If the root cause is addressed, the same problem
should not crop up again.
 If the same problem reappears, it’s likely because the cause you
identified was not actually the root cause.
 In this case, you might have to go through the RCA process again to
make sure that you get to the actual root cause.
 For example, you notice the machine is leaking fluid, so you patch the
hole in the metal. If you stop seeing fluid on the ground under the
machine, you’ve solved the problem, and you’ve taken care of the
root issue.
 But if a leak crops up again in a week, it’s time to run another RCA to
find out if there are other holes in the metal or if gaskets are failing.
Replicate
 Once you’ve identified and rectified the root cause, your
next step is to ensure it will not happen again at any
point during the process or system.
 Sometimes you’ll want to do an RCA to get to the
bottom of an unexpectedly good outcome. In that case,
you will test whether the same factors can be replicated
in other scenarios and environments.
 Suppose there were issues with faulty parts coming off
the line, but you’ve since fixed the issue.
 The next step would be to try to replicate the problem. If you
can no longer reproduce it, you likely fixed the root issue.
How to do a root cause analysis?
RCA can be accomplished using many different tools
and techniques. And even though those processes may
look different, they all arrive at the same end goal: fixing
the root cause of the issue.
 But it’s important to not stop investigating when you find a
correlation between events.
 Correlation means there is a link between two events, but it
doesn’t automatically mean that one event caused the other.
 That’s why it’s essential to continue your sleuthing until you find a
causal relationship. Find out what event caused another event.
This will help you find the actual root cause.
 From the data collected, chronological sequencing, and
clustering, we should be able to create a causal graph (or use
one of the root cause analysis tools we discuss later).
 You can use this graph to represent the relationship between
various events that occurred and the data collected.
 The different paths in the graph are given different probability weights,
so the graph can serve as a visual tool to track down the root cause.
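A minimal sketch of such a weighted causal graph, reusing the broken-belt example from earlier. The node names and probability weights are hypothetical, and the graph is assumed to be acyclic. Each effect maps to its candidate causes, and the most probable chain is the one whose weights multiply to the largest value.

```python
# effect -> list of (candidate cause, estimated probability); values are made up.
CAUSAL_GRAPH = {
    "belt breaks": [("belt overheats", 0.7), ("belt is old", 0.3)],
    "belt overheats": [("pulley misaligned", 0.8), ("ambient heat", 0.2)],
}


def most_likely_root(graph, symptom):
    """Return (path, probability) for the most probable chain from symptom to a root.

    A root is any node with no recorded causes. Assumes the graph has no cycles.
    """
    causes = graph.get(symptom, [])
    if not causes:
        return [symptom], 1.0
    best_path, best_prob = None, -1.0
    for cause, weight in causes:
        path, prob = most_likely_root(graph, cause)
        if weight * prob > best_prob:
            best_path, best_prob = [symptom] + path, weight * prob
    return best_path, best_prob


path, prob = most_likely_root(CAUSAL_GRAPH, "belt breaks")
print(" <- ".join(path), round(prob, 2))
```

The weights only rank hypotheses. The highest-scoring path tells you where to look first, not what the root cause is.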
Fixing the root cause should eliminate the issues.
If the symptoms occur again, it’s time to return
to the drawing board and conduct RCA again.
 Once the problem is solved, you will need to take proactive steps to ensure it doesn’t
happen again. There can be multiple solutions applied to solve a single issue.
 For example, the root cause could be the wear of a bearing, which happened much
earlier than expected. In this case, the procedure has to be adjusted to replace the
bearing earlier.
 Other steps to prevent the fault from recurring include changes to the maintenance
schedule, different modes of maintenance, changes in design, different OEM
vendors, etc.
 The implemented solution will have to be in line with the available resources. So, if the
root cause is pushing the machine too hard, the obvious answer is to shorten the
machine run time.
 However, if the production schedule doesn’t allow for shortened runtimes, another
solution might be scheduling more preventive maintenance.
Tried-and-true RCA tools and techniques
There are many tried and trusted frameworks available to execute RCA.
None of these methods are foolproof, but they provide a solid
base for how to go about root problem investigation.
 Each method has its own list of benefits and
shortfalls. Some methods are more suitable for
different industries and types of problems.
5 Whys analysis
 5 Whys is the original technique developed by Sakichi
Toyoda for root cause analysis at Toyota factories. It means
questioning everything with a ‘why’, just like a curious
child.
 Keep asking ‘why’ until you’ve reached the root cause.
You can continue this process until you reach a stage
where there is no need to ask ‘why’ again. At that point,
you should have reached the root cause of the
problem.
 As a rule of thumb, asking and answering 5
successive ‘whys’ should be more than enough to
reveal the root cause of most problems. Hence the
name ‘5 Whys’ analysis.
Benefits of the 5 Whys:
helps identify the root cause of a problem
offers an understanding of how one
process can cause a chain of problems
helps determine the relationship between
different root causes
highly effective without complicated
evaluation techniques
When to use the 5 Whys:
for simple to moderately complex
problems
more complex issues may need this
method in conjunction with another
any time human error is involved in the
issue
Fishbone diagram (a.k.a. Ishikawa diagram)
 The Ishikawa method for root
cause analysis emerged from
quality control techniques
employed in the Japanese
shipbuilding industry by Kaoru
Ishikawa.
 The shape of the resulting
diagram looks like a fishbone,
which is why it is called a
fishbone diagram.
 This diagram is built on the
idea that multiple factors can
lead to a failure/event/effect.
The 5 M framework (shown above) from
the Toyota Production System is commonly used for RCA
with the Ishikawa method.

The 5 Ms are:
man/mind power
machines
measurement
methods
material
*The problem or fault is written down at the far right end, where
the fish head would be.
*The cause of the problem is represented along the horizontal
line.
*Further effects and their respective causes are written down
along the fish bones representing each of the 5 Ms.
*This process continues until the team is convinced that the root
cause is identified.
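The structure described above can be captured as a small data structure before anything is drawn. In this sketch the problem comes from the warped-parts example earlier in the deck. The individual causes are hypothetical, and the text outline merely stands in for the drawn diagram.

```python
FIVE_MS = ("man/mind power", "machines", "measurement", "methods", "material")


def build_fishbone(problem, causes):
    """Group (category, cause) pairs under the 5 Ms; the problem is the fish head."""
    bones = {m: [] for m in FIVE_MS}
    for category, cause in causes:
        if category not in bones:
            raise ValueError(f"unknown category: {category}")
        bones[category].append(cause)
    return {"problem": problem, "bones": bones}


def render(diagram):
    """Render the diagram as an indented text outline."""
    lines = [f"Problem: {diagram['problem']}"]
    for category, items in diagram["bones"].items():
        lines.append(f"  {category}")
        lines.extend(f"    - {item}" for item in items)
    return "\n".join(lines)


diagram = build_fishbone(
    "Injection-molded parts come off the line warped",
    [
        ("machines", "mold cooling channels partly blocked"),
        ("methods", "cycle time shortened without revalidation"),
        ("material", "resin not dried before use"),
        ("measurement", "mold temperature sensor out of calibration"),
    ],
)
print(render(diagram))
```

Categories that stay empty after brainstorming are worth a second look, since they may point to causes the team has not yet considered.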

Benefits of the fishbone diagram:


 a good way to brainstorm within a defined structure
 helps to visually diagram a problem or condition’s root cause
 helps to show bottlenecks in the process
 helps to find ways to improve the process
When to use a fishbone diagram:
to analyze a complex problem with many
causes
when you need a different view of the issue
to identify root causes
to identify bottlenecks and identify issues
where a process doesn’t work
Failure mode and effects
analysis (FMEA)
 FMEA is a proactive approach to root cause analysis, aimed at preventing potential
failures of a machine or system.
 It is a combination of reliability engineering, safety engineering, and quality control efforts. It
tries to predict future failures and defects by analyzing past data.
A diverse cross-functional team is essential when using
FMEA. You will need to clearly define and communicate
the scope of the analysis to your team members.
Each subsystem, design, and process is closely
reviewed. The purpose, need, and function of each
system are questioned.
Potential failure modes are brainstormed. Failure of
similar processes and products in the past can also be
analyzed.
The potential effects and disruptions that could be
caused by each of the identified failure modes are
assessed and used to calculate its risk priority number (RPN).
If the failure mode has a higher RPN than a company is
comfortable with, you can address this by changing one or
more factors outlined in the image above.
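A short sketch of the RPN calculation. The image referred to above is not reproduced here, so the three factors used below (severity, occurrence, and detection, each scored from 1 to 10) follow the conventional RPN definition and are an assumption about what it showed. The failure modes, scores, and acceptance threshold are hypothetical.

```python
def rpn(severity, occurrence, detection):
    """Risk priority number: the product of three scores on a 1-10 scale."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("scores are expected on a 1-10 scale")
    return severity * occurrence * detection


# (failure mode, severity, occurrence, detection): hypothetical team scores.
FAILURE_MODES = [
    ("seal leak", 7, 4, 3),
    ("bearing seizure", 9, 2, 6),
    ("sensor drift", 4, 5, 8),
]

RPN_THRESHOLD = 100  # hypothetical limit the company is comfortable with

ranked = sorted(
    ((name, rpn(s, o, d)) for name, s, o, d in FAILURE_MODES),
    key=lambda item: item[1],
    reverse=True,
)
needs_action = [name for name, value in ranked if value > RPN_THRESHOLD]
print(ranked)
print(needs_action)  # ['sensor drift', 'bearing seizure']
```

Lowering any one of the three scores, for example by adding an inspection step that improves detection, lowers the RPN of that failure mode.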
Benefits of FMEA:
enables early identification of a failure point
captures the collective knowledge of a team
improves the quality, reliability, and safety of the
process
a logical, structured approach for identifying
process areas of concern
reduces process development time and cost
documents and tracks risk reduction activities
When to use the FMEA
methodologies:
when designing a new product, process, or
service (DFMEA)
when you’re going to update a current way of
doing things
when you have a plan for quality improvement
when you need to understand the failures in a
process and improve upon them (PFMEA)
Fault tree analysis (FTA)
 Fault tree analysis is a
method for root cause
analysis that uses Boolean
logic (using AND, OR, and
NOT) to figure out the cause
of failure.
 It was developed at Bell
Laboratories to evaluate an
intercontinental ballistic
missile (ICBM) launch
control system for the U.S. Air
Force.
Fault tree analysis tries to map the logical
relationships between faults and the
subsystems of a machine.
The fault you are analyzing is placed at the
top of the chart. If either of two causes alone
can produce the effect, they are
combined with a logical OR operator.
For example, if a machine can fail while in
operation or while under maintenance, it is a
logical OR relationship.
If two causes need to occur
simultaneously for the fault to happen, it
is represented with logical AND.
For example, if a machine only fails
when the operator pushes the wrong
button AND the relay fails to activate, it is a
logical AND relationship.
It is represented using the Boolean AND
symbol.
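The gate logic above can be sketched as a small recursive evaluator. The tree below combines the two examples from the text into one hypothetical fault tree. The event names are illustrative, and a node is either a basic-event name or a tuple of a gate followed by its inputs.

```python
def evaluate(node, state):
    """Evaluate a fault tree node against a dict of basic-event truth values.

    A node is either a basic-event name (str) or a tuple (gate, child, ...),
    where gate is "AND", "OR", or "NOT".
    """
    if isinstance(node, str):
        return state[node]
    gate, *children = node
    results = [evaluate(child, state) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    if gate == "NOT":
        if len(results) != 1:
            raise ValueError("NOT takes exactly one input")
        return not results[0]
    raise ValueError(f"unknown gate: {gate}")


# Top event: the machine fails (hypothetical tree).
MACHINE_FAILS = (
    "OR",
    ("AND", "operator pushes wrong button", "relay fails to activate"),
    "fault introduced during maintenance",
)

state = {
    "operator pushes wrong button": True,
    "relay fails to activate": False,
    "fault introduced during maintenance": False,
}
print(evaluate(MACHINE_FAILS, state))  # False: the relay caught the operator error
```

Flipping `"relay fails to activate"` to `True` makes the top event occur, which shows why both inputs of an AND gate must fail together.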
Benefits of using a fault tree
analysis:
 uses deduction to find the causes of each event, like the
5 Whys
 highlights the critical elements related to system failure
 creates a visual representation for analysis
 can focus on one area of failure at a time
 exposes system behavior and possible interactions
 accounts for human error
 promotes effective communication
When to use a fault tree
analysis:
 when the effect of a failure is known — to find out how it
might be caused by a combination of other factors
 when designing a solution — to identify ways it may fail
in order to make the solution more robust
 to identify risks in a system
 to find failures that can cause the failure of all parts of a
“fault-tolerant system”
Pareto charts
 A Pareto chart indicates the frequency
of defects and their cumulative effects.
Italian economist Vilfredo Pareto
recognized a common theme with
almost all frequency distributions he
could observe.
 There is a vast imbalance between the
share of causes and the share of effects
they produce.
 He proposed that in any system, roughly 80% of
the results (or failures) are caused by
roughly 20% of all potential reasons.
 The principle is dubbed the Pareto
principle (some know it as the 80-20
rule). This skew between cause and
effect is evident in many different
distributions, from wealth distribution
among people to failures in a machine.
With the 80-20 principle in mind, you can use
Pareto analysis to dig into failures and possible
causes.
To start, draw a bar graph that includes the
frequency of faults and causes. With this graph,
it’s easier to see the skew between causes and
failures. Usually, you’ll see how a small
percentage of factors cause the majority of
faults.
Next, you’ll analyze the causes that contribute to
the largest number of faults and take corrective
action to eliminate the most common defects.
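The counting behind the bar graph can be sketched in a few lines: tally faults by cause, sort by frequency, and track the cumulative share. The fault log is hypothetical, and the 80% cutoff is just the rule of thumb from the Pareto principle.

```python
from collections import Counter


def pareto_table(fault_causes):
    """Return (cause, count, cumulative %) rows, most frequent cause first."""
    counts = Counter(fault_causes)
    total = sum(counts.values())
    table, running = [], 0
    for cause, count in counts.most_common():
        running += count
        table.append((cause, count, round(100 * running / total, 1)))
    return table


def vital_few(table, cutoff=80.0):
    """Return the top causes that together reach the cumulative cutoff."""
    selected = []
    for cause, _count, cumulative in table:
        selected.append(cause)
        if cumulative >= cutoff:
            break
    return selected


# Hypothetical fault log: one entry per recorded fault, labeled with its cause.
fault_log = (["misalignment"] * 12 + ["worn bearing"] * 5
             + ["operator error"] * 2 + ["power dip"])

table = pareto_table(fault_log)
for cause, count, cumulative in table:
    print(f"{cause:15} {'#' * count:12} {cumulative:5.1f}%")
print(vital_few(table))  # ['misalignment', 'worn bearing']
```

In this made-up log, two of the four causes account for 85% of the faults, so they are the ones to investigate first.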
Benefits of using Pareto charts:
defects are ranked in order of severity,
with the most severe handled first
can determine the cumulative impact of
the defect
offers a better explanation of defects that
need to be resolved first
When to use a Pareto chart:
 to analyze problems or causes in a process that involves the
frequency of occurrence, time, or cost
 to narrow down a list of problems to find the most significant
 to analyze a problem with a broad list of causes to identify specific
components
 Pareto charts work great for determining the priority for taking up
root cause analysis. According to the Pareto principle, eliminating
the top 20% of failure causes can reduce the
overall number of malfunctions by roughly 80%.
 Pareto charts will indicate the top failure causes to be further
investigated and addressed, according to the criticality of the
machine, the impact of a specific part’s failure, or a combination of
the two.
