Root Cause Analysis – Quality of Process?

By Robert J. Latino, Sr. VP, Reliability Center, Inc.

Abstract: We have all heard the term Root Cause Analysis (RCA), and we likely all interpret its meaning differently. This is the single biggest reason we see for the ineffective use of RCA: lack of communication, or miscommunication, among its users. If we are all using various forms of RCA, then when we compare our results we are not comparing apples with apples. We will explore the discipline necessary to bring consistency to our RCA application, thus greatly improving the credibility and communication of the results.

Since the evolution of Total Productive Maintenance (TPM) in the United States, there has been a consistent movement towards exploring the quality of the process versus the quality of the product. Before the advent of TPM, quality organizations were typically content with testing the quality of the product as it came off the line as a finished product. While this was an admirable concept at the time, we learned that by then it was too late if we found quality defects. The entire product and/or lot would have to be reworked at great expense to the organization.

Then the TPM concepts of W. Edwards Deming were introduced and they pushed the “quality of process” concept. In short, this meant that we would measure key variables within the process stages and monitor for any unacceptable variances. In this manner, we can correct the process variation and prevent the production of off-spec products. This era has continued into the 21st century with the introduction of Six Sigma.

Now take the above summaries of the application of TPM and let's apply them to a non-manufacturing process such as RCA. As we discussed earlier, RCA means different things to different people. Many consider undisciplined efforts such as “trial and error” as their RCA approach. This means that we perceive a problem to exist and we go right to what the most obvious cause is TO US! This is the “finished product” approach.
We do not validate any of our assumptions; we just assume a cause, spend money to implement a fix and hope it works. Experience shows this approach to be ineffective and very expensive.

Now let's apply the TPM concept to a disciplined method of RCA such as a Logic Tree used in the PROACT® process. A Logic Tree strives to graphically represent the cause and effect relationships that lead to the surfacing of the undesirable event. In this approach, we must clearly identify the undesirable event and its associated modes with supporting facts. Facts are supported by some essence of science, direct observation and documentation. They cannot be hearsay or assumptions!

In the example below, most people would insist that we start with a bearing failure. However, when the event occurred, why was it brought to our attention? It was not brought to our attention because the bearing failed. It was brought to our attention because the failed bearing caused a pump to stop pumping something. Therefore the last effect that caused us concern was the pump failure. One reason (or mode) that the pump failed was the bearing failing. This is clearly evidenced by the failed bearing (physical evidence). The top of the Logic Tree may look like this:

[Figure 1.0 – Event and Mode Supported by Facts. Event: fact – DCS verification. Mode: fact – physical bearing.]

Continuing our search backwards for the cause and effect relationships, we would then ask, “How can a bearing fail?” Our hypotheses may be erosion, corrosion, fatigue or overload. How can we prove which are true? We would simply have a metallurgical lab analyze the failed bearing for us and produce an analysis report. For the sake of this example, let's say our metallurgical report indicated fatigue patterns only. Now our Logic Tree would advance a level and look like the following:

[Figure 2.0 – Additional Hypotheses]

You can now see that as we develop new sets of hypotheses, we are proving what we say at each level of the process. This is demonstrating quality of process. When we continue this iterative process, we are validating our conclusions each step of the way. This way, when we draw our conclusions about root causes, they will be right because we have supported them with fact and not merely assumption. This also means that when we agree to spend money to overcome the identified causes, the money will be well spent because the problem will not recur.

Applying TPM to thought processes is not a new concept, however. Scientific experimentation follows the same premise. When conducting scientific experiments, we must first develop a hypothesis and then an appropriate test method to draw valid conclusions. If you think about it, any investigative occupation must follow this “quality of process” approach. Detectives, NTSB investigators, doctors and fire investigators must all develop hypotheses and prove what they say.
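To make the idea of proving each level concrete, the following is a minimal sketch in Python of how one level of hypotheses might be recorded. It is not part of the PROACT® process itself; the class name `Hypothesis`, its fields and the helper `verified_only` are assumptions made for illustration. The data mirrors the bearing example above, where the metallurgical report indicated fatigue patterns only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    """One answer to a 'How can ...?' question (illustrative structure)."""
    description: str
    evidence: List[str] = field(default_factory=list)  # facts that support it
    ruled_out: bool = False  # set when a test disproves the hypothesis

    @property
    def status(self) -> str:
        if self.ruled_out:
            return "disproved"
        # With no supporting fact, the hypothesis is still only an assumption.
        return "verified" if self.evidence else "unverified"


def verified_only(hypotheses: List[Hypothesis]) -> List[str]:
    """Keep only the hypotheses that are backed by at least one fact."""
    return [h.description for h in hypotheses if h.status == "verified"]


# "How can a bearing fail?" with the outcome described in the text.
bearing_hypotheses = [
    Hypothesis("erosion", ruled_out=True),
    Hypothesis("corrosion", ruled_out=True),
    Hypothesis("fatigue", evidence=["metallurgical report: fatigue patterns"]),
    Hypothesis("overload", ruled_out=True),
]

print(verified_only(bearing_hypotheses))  # ['fatigue']
```

Only the verified hypothesis is carried down to the next level of the tree; a hypothesis with no evidence stays "unverified" and is not built upon.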

In an effort to move our cultures toward precision, we must treat our administrative processes with the TPM concepts in mind as well. The TPM approach is applicable to Equipment, Process and Human situations. We must not limit ourselves by applying the concepts narrowly.

Root Cause Analysis: Quality of Process (2)
By Robert J. Latino, Sr. VP Strategic Development, Reliability Center, Inc.

Abstract: If we reflect back twenty-plus years, we will recall that most of our quality efforts were directed at checking the quality of the final product in the finishing and packaging stages. By that point, if something was found defective, we would have to scrap the entire run or lot of products produced. Then came the TPM initiatives that stressed “quality of process” and we started to implement Statistical Process Control (SPC) and Statistical Quality Control (SQC). We started to look at quality “during” the manufacturing process, ensuring that when the finished product came off the line, it was a quality product. Can we do the same with Root Cause Analysis (RCA)?

Taking the TPM parallel described in the abstract above, let’s see if it applies to non-manufacturing processes such as RCA. If asked, almost everyone will say they are doing Root Cause Analysis (RCA). And to a large extent, they will be correct in their own minds. This is because of how they define RCA versus how the person asking the question defines RCA. It is as if we asked a sample population, “Do you live a healthy life?” The majority would reply with an emphatic YES. However, what does healthy mean to these people? To some it means that they are alive, to others it means that they eat right and exercise, to others it means that they are emotionally sound and to others it may mean that they are content in their religious beliefs.

So how many ways can someone interpret RCA? Some believe it is 1) having the local expert provide us a solution, some believe it is 2) brainstorming in a room and drawing conclusions from hearsay, and some believe in 3) the use of a disciplined thought process to seek true root causes.

When the perceived expert provides a solution as an individual, we are more apt to trust their instincts, spend the money on their solution and see if it works. Sometimes it works, but more often than not, it does not. Checking to see if the solution works is like checking quality only at the finished product stage. It is too late if a defect is found!

When teams are used to brainstorm using quality techniques such as fishbone and/or 5 WHYS, they will usually draw conclusions based on majority opinion. This means that solutions tend to be implemented based on the consensus of the group’s opinions, not on any factual basis where tests prove that these opinions are correct. Again, we are checking quality of the final product and not of the thought process that drew the conclusions.

When teams use a disciplined RCA process that requires hypotheses to be developed as to how something could occur, and then REQUIRES verification with some essence of science as to whether each is true or not, then we are employing quality of process! This is because we are proving our hypotheses with facts rather than relying on hearsay, assumptions and ignorance.



To demonstrate these points, look at the following abbreviated example:

[Figure 1.0 - PROACT® RCA Disciplined Logic Tree]

The above depicts a disciplined thought process called PROACT®. Let’s think back to our RCA scenarios. If a critical pump were to fail, in some cases we would get our best engineers to take a look at it. They would do their engineering magic themselves and might conclude that a different type of bearing (perhaps more heavy duty) should be in this service. We would change out the bearings for the newly designed ones. Given the above scenario, would the problem go away?

What if we get our brainstorming teams together, and everyone looks at the past performance of the pump and its maintenance history and concludes that the new lubricant they are using is the cause and should be changed? Under the above scenario, would the problem go away?

Utilizing the disciplined approach above, we are going to have to have the bearing reviewed by metallurgists. They will send back a report concluding (with science) that there is evidence to support the presence of fatigue. We ask ourselves, How can fatigue occur on the bearing? We hypothesize that it can come from high vibration. We check our vibration monitoring records and conclude that there is evidence of excessive vibration. How can we have excessive vibration? We hypothesize that it can come from imbalance, resonance and misalignment. We check our balance certifications and our vibration records for resonance, and find no evidence to support that they are contributors. We ask the mechanic who aligned the pump to align it again and observe his practices. From the observation, we can conclude that he does not know how to properly align. When we ask, Why would he not align it properly?, we find that he was never trained in how to align, he was using worn alignment tools and no procedure existed to follow. Now we know the REAL root causes, so we can develop solutions that, when implemented, WILL WORK!!

Using the PROACT® disciplined process, we are utilizing quality of process versus quality of product. The facts are leading us to our conclusions, not hearsay. We are not using “trial and error” solutions to see if they work. By the time we get to solutions, we know they will work because we have maintained quality of the RCA process.

While the undisciplined RCA approaches are attractive to organizations because they produce a quick answer, that does not mean that the answer is correct. They are quick approaches because they lack proof that they are correct. True RCA involves taking the time to prove what we say, before we spend money to prove we are wrong!!
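The drill-down described above can be sketched as a small tree walk in Python. This is an illustration only, not PROACT® software: the `Node` class, the `root_causes` function and the wording of the proof strings are assumptions that paraphrase the pump example. The walk refuses to build on any block that has no proof attached, which is the "quality of process" rule in code form.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A block on the Logic Tree; `proof` stays None until a fact supports it."""
    label: str
    proof: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def root_causes(node: Node) -> List[str]:
    """Return the deepest proven causes reachable only through proven blocks."""
    if node.proof is None:
        return []  # unproven hypothesis: do not draw conclusions from it
    proven_children = [c for c in node.children if c.proof is not None]
    if not proven_children:
        return [node.label]
    found: List[str] = []
    for child in proven_children:
        found.extend(root_causes(child))
    return found


# The pump example; proof strings paraphrase the narrative above.
misalignment = Node("misalignment", "observed the mechanic aligning the pump", [
    Node("never trained in alignment", "follow-up with the mechanic"),
    Node("worn alignment tools", "follow-up with the mechanic"),
    Node("no alignment procedure", "follow-up with the mechanic"),
])
vibration = Node("high vibration", "vibration monitoring records", [
    Node("imbalance"),   # balance certifications showed no evidence
    Node("resonance"),   # vibration records showed no evidence
    misalignment,
])
tree = Node("pump failure", "event occurred", [
    Node("bearing failure", "failed bearing", [
        Node("fatigue", "metallurgical report", [vibration]),
    ]),
])

print(root_causes(tree))
# ['never trained in alignment', 'worn alignment tools', 'no alignment procedure']
```

Note that a quick answer such as "use a heavier-duty bearing" never appears in the output, because nothing in the tree proves it.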

Root Cause Analysis: Quality of Process (3)
By Robert J. Latino, Sr. VP Strategic Development, Reliability Center, Inc.

Where Does Root Cause Analysis Stop, At the HOW or the WHY?

Abstract: When most people conduct their version of a Root Cause Analysis (RCA), where do they usually stop? How do they know when they are done? How do they know that the problem will not recur? These questions represent reality when we are the ones in the field working on a pressing problem with management on our backs. If we consider ourselves manufacturing detectives, are we content with stopping at the “HOWS” or the “WHYS”?

I was watching a TV series the other night, my favorite by the way, called CSI: Crime Scene Investigation. It is a series about forensic specialists who use high-tech tools to prove and disprove hypotheses, mainly for prosecutors and detectives. The entire show revolves around various crime scenes and how the cases are built to prepare a “solid case” for court.

Putting this perspective into our world as RCA analysts, we too must build a “solid case.” However, our court is not likely going to be a judge and/or jury, but rather a select number of managers from whom we are going to request money to implement RCA recommendations. While the objectives may be different, the means to attain them are similar. In both instances, we must prove a solid case in order to obtain the desired ends. In the criminal detective’s instance the goal is a conviction. In the analyst’s case, the goal is to implement recommendations to prevent recurrence of the undesirable event.

Looking at it this way, when we typically conduct analyses, are we more like the forensic engineer, or like the prosecutor and detective looking to win a case? What is the difference between the two roles? The forensic engineer’s role is simply to determine with science HOW the event occurred. This means that a certain sequence of cause and effect relationships linked up and resulted in the undesirable event. Their role is to prove that each hypothesis did or did not occur. They in essence will map out HOW the crime occurred and be able to prove that it happened just that way.

Now let’s look at the role of the prosecutor and the detectives. How do they fit into the big picture? Their role is typically to determine the WHY. The forensic engineers provided them the HOW pieces of the puzzle; now the detectives and the prosecutors must determine WHY the crime was committed. In other words, they must identify the motive of the person who triggered the HOW (the sequence of events that led to the outcome or the crime).

This is the same for us in industry. We use our technology (vibration monitoring, infrared imaging, electron microscopy, stress analysis, etc.) to prove and disprove our hypotheses, but our analysts must explore WHY people make decisions that result in undesirable outcomes or failure. Take, for instance, the Logic Tree example below that we used in Part II of this series.

[Picture 1.0 - PROACT® RCA Disciplined Logic Tree. Latent roots shown: Inadequate Training (LATENT), Improper Tools (LATENT), No Procedures (LATENT).]

The undesirable outcome is that some pump failed to perform its intended function. In an effort to prove our “solid case” we must understand the cause and effect relationships that lead up to the event. This will involve using science to prove our hypotheses. In the above case let’s explore HOW the pump could have failed and use science to prove our case:

HYPOTHESIS                                VERIFICATION TECHNIQUES
Erosion, Corrosion, Fatigue & Overload    Metallurgical Analysis
High Vibration                            Vibration Monitoring Instruments
Misalignment                              Laser Alignment Technology

These questions answer the HOW, but what about the WHY? In this case someone misaligned a pump and that decision resulted in a sequence of cause and effect relationships that caused the pump to fail prematurely. The “forensics” confirmed for us the HOW, but WHY would a person choose to align in that fashion? This is where we need to understand the motive, the reason WHY people make decisions that are in error. As analysts, if we were to go deeper and understand the thought process or the rationale for such a decision (Latent Root), we would uncover the real ROOT CAUSES of WHY physical failure occurs. People often misalign because they were never trained in proper alignment practices, no procedure exists outlining alignment as a required practice with specifications, and/or the current alignment equipment is worn or inadequate for the application.

If we do not explore the WHY, then the HOW is likely to recur. In this example, if we merely change out the failed bearing, does the problem go away for good? Even if we identify an excessive vibration and take measures to identify it sooner so that we can better predict impending failure, does that make the problem go away? If we discipline the mechanic for not aligning properly, does that make the issue go away?

As you can tell, none of these commonly applied solutions will totally prevent the recurrence of the pump failure. Only identifying the WHY that triggers the physical root will prevent recurrence.
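One way to check a set of recommendations against this distinction is sketched below in Python. The HOW and WHY labels follow the pump example; the mapping of each recommendation to the cause it acts on, and the last three recommendations themselves, are hypothetical and are included only to show the contrast.

```python
# Cause levels from the pump example: HOW = proven physical sequence,
# WHY = latent reason behind the decision to align that way.
CAUSES = {
    "bearing failure": "HOW",
    "high vibration": "HOW",
    "misalignment": "HOW",
    "inadequate training": "WHY",
    "improper tools": "WHY",
    "no procedures": "WHY",
}

# Hypothetical mapping of each proposed fix to the cause it acts on.
RECOMMENDATIONS = {
    "replace the failed bearing": "bearing failure",
    "detect excessive vibration sooner": "high vibration",
    "discipline the mechanic": "misalignment",
    "train mechanics in alignment": "inadequate training",
    "replace worn alignment tools": "improper tools",
    "write an alignment procedure": "no procedures",
}


def addresses_why(recommendation: str) -> bool:
    """True when the fix acts on a latent (WHY) cause, not only on the HOW."""
    return CAUSES[RECOMMENDATIONS[recommendation]] == "WHY"


for rec in RECOMMENDATIONS:
    level = "WHY" if addresses_why(rec) else "HOW only"
    print(f"{rec}: {level}")
```

The three commonly applied solutions questioned above all come out as "HOW only", which is why the article argues the failure is likely to recur.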

If you now reflect on your current RCA efforts, do you stop at the HOW (forensics level) or at the WHY (detective level)?

EPA: Lean Manufacturing and the Environment

Total Productive Maintenance (TPM)

Total Productive Maintenance (TPM) seeks to engage all levels and functions in an organization to maximize the overall effectiveness of production equipment. This method further tunes up existing processes and equipment by reducing mistakes and accidents. Whereas maintenance departments are the traditional center of preventive maintenance programs, TPM seeks to involve workers in all departments and levels, from the plant floor to senior executives, to ensure effective equipment operation. Autonomous maintenance, a key aspect of TPM, trains and focuses workers to take care of the equipment and machines with which they work.

TPM addresses the entire production system lifecycle and builds a solid, plant-floor-based system to prevent accidents, defects, and breakdowns. TPM focuses on preventing breakdowns (preventive maintenance), "mistake-proofing" equipment (or poka-yoke) to eliminate product defects or to make maintenance easier (corrective maintenance), designing and installing equipment that needs little or no maintenance (maintenance prevention), and quickly repairing equipment after breakdowns occur (breakdown maintenance).

The goal is the total elimination of all losses, including breakdowns, equipment setup and adjustment losses, idling and minor stoppages, reduced speed, defects and rework, spills and process upset conditions, and startup and yield losses. The ultimate goals of TPM are zero equipment breakdowns and zero product defects, which lead to improved utilization of production assets and plant capacity.

Method and Implementation Approach
TPM is focused primarily on keeping machinery functioning optimally and minimizing equipment breakdowns and associated waste by making equipment more efficient, conducting preventative, corrective, and autonomous maintenance, mistake-proofing equipment, and effectively managing safety and environmental issues. TPM seeks to eliminate six major losses that can result from faulty equipment or operation, as summarized below.

Six major losses that can result from poor maintenance, faulty equipment or operation

Loss Category: Costs to Organization

Unexpected breakdown losses: Results in equipment downtime for repairs. Costs can include downtime (and lost production opportunity or yields), labor, and spare parts.

Set-up and adjustment losses: Results in lost production opportunity (yields) that occurs during product changeovers, shift change or other changes in operating conditions.

Stoppage losses: Results in frequent production downtime from zero to 10 minutes in length that is difficult to record manually. As a result, these losses are usually hidden from efficiency reports and are built into machine capabilities but can cause substantial equipment downtime and lost production opportunity.

Speed losses: Results in productivity losses when equipment must be slowed down to prevent quality defects or minor stoppages. In most cases, this loss is not recorded because the equipment continues to operate.

Quality defect losses: Results in off-spec production and defects due to equipment malfunction or poor performance, leading to output which must be reworked or scrapped as waste.

Equipment and capital investment losses: Results in wear and tear on equipment that reduces its durability and productive life span, leading to more frequent capital investment in replacement equipment.

Organizations typically pursue the four techniques below to implement TPM. Kaizen events can be used to focus organizational attention on implementing these techniques (see profile of the Kaizen lean method).

1. Efficient Equipment: The best way to increase equipment efficiency is to identify the losses, among the six described above, that are hindering performance. To measure overall equipment effectiveness, a TPM index, Overall Equipment Effectiveness (OEE), is used. OEE is calculated by multiplying overall equipment availability, performance and product quality rate, each expressed as a percentage. With these figures, the amount of time spent on each of the six big losses, and where most attention needs to be focused, can be determined. It is estimated that most companies can realize a 15-25 percent increase in equipment efficiency rates within three years of adopting TPM.

2. Effective Maintenance: Thorough and routine maintenance is a critical aspect of TPM. First and foremost, TPM trains equipment operators to play a key role in preventive maintenance by carrying out "autonomous maintenance" on a daily basis. Typical daily activities include precision checks, lubrication, parts replacement, simple repairs, and abnormality detection. Workers are also encouraged to conduct corrective maintenance, designed to further keep equipment from breaking down, and to facilitate inspection, repair and use. Corrective maintenance includes recording the results of daily inspections, and regularly considering and submitting maintenance improvement ideas.

3. Mistake-Proofing: Known as poka-yoke in lean manufacturing contexts, mistake-proofing is the application of simple "fail-safing" mechanisms designed to make mistakes impossible or at least easy to detect and correct. Poka-yoke devices fall into two major categories: prevention and detection.
   o A prevention device is one that makes it impossible for a machine or machine operator to make a mistake. For example, many automobiles have "shift locks" that prevent a driver from shifting into reverse unless their foot is on the brake.
   o A detection device signals the user when a mistake has been made, so that the user can quickly correct the problem. In automobiles, a detection device might be a warning buzzer indicating that keys have been inadvertently left in the ignition.

4. Safety Management: The fundamental principle behind TPM safety and environmental management activities is addressing potentially dangerous conditions and activities before they cause accidents, damage, and unanticipated costs. Like maintenance, safety activities under TPM are to be carried out continuously and systematically. Focus areas include
   o the development of safety checklists (e.g., to detect leaks, unusual equipment vibration, or static electricity)
   o the standardization of operations (e.g., materials handling and transport, use of protective clothing, etc.)
   o the coordination of nonrepetitive maintenance tasks (e.g., especially those involving electrical hazards, toxic substances, open flames, etc.).
   In many cases, equipment can be modified (see mistake-proofing) to minimize the likelihood of equipment malfunction and upset conditions.
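The OEE calculation described under Efficient Equipment above can be sketched in a few lines of Python. The function name `oee`, the use of fractions between 0 and 1 rather than percentages, and the sample figures are assumptions for illustration; they are not taken from any particular plant or from the text.

```python
def oee(availability: float, performance: float, quality: float) -> float:
    """Overall Equipment Effectiveness as the product of three rates.

    Each rate is a fraction between 0 and 1 (for example, 0.90 for 90 percent).
    """
    rates = {
        "availability": availability,
        "performance": performance,
        "quality": quality,
    }
    for name, value in rates.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return availability * performance * quality


# Hypothetical figures: 90% availability, 95% performance, 98% quality rate.
example = oee(availability=0.90, performance=0.95, quality=0.98)
print(f"OEE = {example:.1%}")  # OEE = 83.8%
```

Because the three rates are multiplied, a modest shortfall in each compounds: no single rate above is below 90 percent, yet the resulting OEE is under 84 percent.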


Implications for Environmental Performance
Potential Benefits: Properly maintaining equipment and systems helps reduce defects that result from a process. A reduction in defects can, in turn, help eliminate waste from processes in three fundamental ways:
1. fewer defects decrease the number of products that must be scrapped;
2. fewer defects also mean that the raw materials, energy, and resulting waste associated with the scrap are eliminated;
3. fewer defects decrease the amount of energy, raw material, and wastes that are used or generated to fix defective products that can be re-worked.

TPM can increase the longevity of equipment, thereby decreasing the need to purchase and/or make replacement equipment. This, in turn, reduces the environmental impacts associated with raw materials and manufacturing processes needed to produce new equipment.

TPM often attempts to decrease the number and severity of equipment spills, leaks, and upset conditions. This typically reduces the solid and hazardous wastes (e.g., contaminated rags and adsorbent pads) resulting from spills and leaks and their clean-up.

Potential Shortcomings: Failure to consider the environmental aspects or impacts associated with equipment during mistake-proofing and equipment efficiency improvement can leave potential waste minimization and pollution prevention opportunities on the table. For example, equipment can often be modified to reduce or eliminate spills, leaks, overspray, and misting that increase clean-up needs.

TPM can result in increased use of cleaning supplies, particularly if the root causes of unclean conditions are not addressed. Cleaning supplies may contain solvents and/or chemicals that can result in air emissions or increased waste generation.

Useful Resources
Campbell, John Dixon. Uptime: Strategies for Excellence in Maintenance Management (Portland, Oregon: Productivity Press, 1995).
The Japan Institute of Plant Maintenance, ed. TPM for Every Operator (Portland, Oregon: Productivity Press, 1996).
Leflar, James. Practical TPM: Successful Equipment Management at Agilent Technologies (Portland, Oregon: Productivity Press, 2001).
Robinson, Charles and Andrew Ginder. Introduction to Implementing TPM: The North American Experience (Portland, Oregon: Productivity Press, 1995).
Suzuki, Tokutaro, ed. TPM in Process Industries (Portland, Oregon: Productivity Press, 1994).