
NATIONAL PETROCHEMICAL & REFINERS ASSOCIATION

1899 L STREET, NW, SUITE 1000


WASHINGTON, DC 20036

MC-02-81

SOME POINTS TO HIGHLIGHT ON MACHINERY


FAILURE ANALYSIS

By

Luiz Otavio Amaral Affonso


Mechanical Engineer

Petrobras
Cubatao, SP, Brasil

Presented at the

NPRA
2002 Annual Refinery & Petrochemical Plant
Maintenance Conference and Exhibition
May 7-10, 2002
San Antonio Convention Center
San Antonio, TX
This paper has been reproduced for the author or authors as a courtesy by the National Petrochemical &
Refiners Association. Publication of this paper does not signify that the contents necessarily reflect the
opinions of the NPRA, its officers, directors, members, or staff. NPRA claims no copyright in this work.
Requests for authorization to quote or use the contents should be addressed directly to the author(s).
1 - INTRODUCTION
This paper deals with Machinery Failure Analysis, how it should be performed, what we can
expect to learn from it and what we can do with the knowledge developed after many machine
failures are analyzed in a certain plant.

The “How it should be performed” part of the discussion will be limited to some highlights,
where we are going to explain some very important basic concepts that can make the difference
between a good and a bad investigation. In fact, it is not possible to present a full course on this
subject in the limited space available here.
The “What can we learn from it” part can be summarized as follows: it should be clear
to everyone who works on this matter that a failure is a very good opportunity (although not the
only one) to find a weak link in our system. This is exactly what we expect to learn from
our failure analysis efforts.
And finally, there is the “What do we do now” section, with some material on how to effectively
use the information extracted from the machine failure analyses to improve plant
performance, be it mechanical reliability, maintenance cost, safety record or something else.
This will not, obviously, be a complete presentation on machinery troubleshooting. Again, we
are going to discuss some general guidelines that can help in the effort of improving machinery
reliability.

Some real case examples are presented, where the importance of the concepts explained earlier
can be easily seen.

2 - HOW TO TAKE ADVANTAGE OF A FAILURE ANALYSIS PROGRAM


A brief description of the systematic approach used by a large oil refinery to greatly improve the
machinery reliability and reduce maintenance costs will be presented in this section. A summary
of the benefits achieved is also included. A somewhat more complete exposition can be found in
reference 1. The main points here have been:
1. Machinery failures were analyzed and the results fed into a computer database;
2. Statistical information extracted from the database was used to guide the machinery reliability
improvement efforts.
Such a strategy requires, besides some expenditure on new components, new machines, etc., deep
changes in the behavior of the plant crew, as it was necessary to change the focus from
“repairing the machines” to “avoiding the machinery failures”. A strong search for the root
causes of the problems became the norm among plant personnel.
This proactive attitude can be developed in a number of ways, the most important one being
training people. It is the mechanics and the maintenance foremen who will do most of this failure
analysis job, so they should know what they are expected to do. It is true that a great amount of
information can be gathered by these technical people even without any formal training in failure
analysis, but, if we want to make the most out of these programs, everyone involved should
receive
some training. This training should cover the investigation techniques and some review of the
way the machines and components work. Much more information on this subject can be found
in reference 2.
After some time it was possible to know the most frequent causes of machinery
problems in the plant, and these root causes could be adequately treated. These most frequent
problems can be called the “plant failure modes and causes”; some examples are mechanical seal
failures, pump cavitation and excessive piping strain. Any cause of machinery failure
could appear here, but we focus on the ones that are observed most frequently in a
certain plant, that is, the failure modes that actually occur. The main reason for grouping similar
problems together is that they can be solved in groups. For example, a great number of seal
failures were eliminated just by introducing a standard design with adequate features,
suitable for most refinery sealing services.
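The grouping of failures into “plant failure modes and causes” can be sketched as a simple frequency ranking. The records below are hypothetical and only illustrate the idea; the machine tags, the counts and the function name are assumptions, not data from the refinery database.

```python
from collections import Counter

# Hypothetical failure records: (machine tag, failure mode). Illustrative only.
records = [
    ("P-101A", "mechanical seal failure"),
    ("P-101B", "mechanical seal failure"),
    ("P-205A", "pump cavitation"),
    ("P-310B", "excessive piping strain"),
    ("P-101A", "mechanical seal failure"),
    ("P-205A", "pump cavitation"),
]


def rank_failure_modes(records):
    """Return (mode, count, cumulative share) tuples, most frequent first."""
    counts = Counter(mode for _tag, mode in records)
    total = sum(counts.values())
    ranked = []
    cumulative = 0
    for mode, count in counts.most_common():
        cumulative += count
        ranked.append((mode, count, cumulative / total))
    return ranked


for mode, count, share in rank_failure_modes(records):
    print(f"{mode:28s} {count:3d}  cumulative {share:.0%}")
```

The modes at the top of such a list are the candidates for a group solution, such as the standard seal design mentioned above.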
Although the results of the failure analysis have been used to guide the improvement actions, it
should be mentioned that there were other programs going on at the same time, like
modernization of the predictive maintenance system, training programs for the mechanics and
operators, component standardization programs and so on. Of course, all of these were tightly
connected to the results of the machine failure analysis, as all the action was focused on the
worst problems.
All these changes made possible a 68% reduction in the failure rate of the 1500 pieces of
mechanical equipment, from 1179 failures in 1990 to 375 in 1997. This significant reduction in the
failure rate has translated into other improvements: lower maintenance cost, less
machinery waiting for maintenance, fewer people assigned to machinery
maintenance and improved safety. Production losses caused by machinery failures have been
virtually eliminated.
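The reduction quoted above can be checked with a few lines of arithmetic. The three input figures come from the text; treating the machine population as a constant 1500 in both years is an assumption made here only to express the result per machine.

```python
machines = 1500          # pieces of equipment (from the text, assumed constant)
failures_1990 = 1179     # failures recorded in 1990 (from the text)
failures_1997 = 375      # failures recorded in 1997 (from the text)

reduction = (failures_1990 - failures_1997) / failures_1990
rate_1990 = failures_1990 / machines   # failures per machine per year
rate_1997 = failures_1997 / machines

print(f"reduction: {reduction:.0%}")                        # reduction: 68%
print(f"1990: {rate_1990:.2f} failures per machine-year")   # 0.79
print(f"1997: {rate_1997:.2f} failures per machine-year")   # 0.25
```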
3 – FAILURE ANALYSIS PRACTICE
This section describes a generic failure analysis procedure. Techniques and precautions
necessary for this task are also discussed.

3.1 – Failure Analysis Objectives


The main reason why we analyze the defects that occur in process plant machinery is very
simple: we want to learn why the problems happen, so we can do something to avoid them.
When we analyze a machine failure, we try to understand the characteristics of the failed system
so we can find out why it did not perform its intended function safely. It is always worth
remembering that we do so to avoid the repetition of the failure. A failure analysis effort that does
not result in corrective action being implemented is of very limited value.

3.2 – Some Things That We Should Not Forget When Analyzing Machinery Failures

We are now going to discuss some important points on machinery failure analysis. Of course, we
have to follow the well-known general procedure: collect information, organize the information,
analyze the data, determine the failure mode, find out the basic causes of the failure, determine
what the best corrective actions are, implement these corrective actions and evaluate the results.

This will take a lot more time and effort than it seems when we see these main steps listed so
briefly.

Even assuming that we stick to the general procedure, there are lots of things that can go wrong
if we don’t pay attention to some other, sometimes forgotten, rules. These are described below:

a) Focus on the first component to fail


When we examine a wrecked machine, it may be difficult to find where the problem began.
We should never forget that, in most cases, only the event that triggered the failure
matters to us, everything else being only a consequence of this first failure. The author has
many times had to persuade people not to examine
all the remains of a failed machine when the origin of the problem could be clearly identified. This
is one of the things we are going to learn from the first real case example, at the end of the
paper.

b) Don’t forget to take a look at the weakest component in the failure sequence (or at the
possibility of hidden failure modes)
This is the exception to the first rule. Sometimes a simple problem can destroy a
machine just because there was some hidden “weak link” in the chain. This was the
case when, for example, the failure of a small steam turbine driving an auxiliary oil pump caused the trip of a
huge air blower, with all the deleterious effects on the process. Of course there was a spare oil
pump, and the spare pump worked properly. The problem here was a jammed check valve
that let the oil in the emergency tank drain before the spare pump had time to accelerate. More
details about this other real case example will be discussed at the end of the paper.
This hidden failure mode is one of the biggest nightmares of the instrumented safety system
designer. It can affect any component whose failure is not obvious until the
component is required to work. This is easy to understand when we think of a jammed safety valve.

c) Look for more than one root cause


It can be difficult to resist the temptation of finishing a failure analysis job when we find one
valid root cause. This is exactly what we have been searching for, so it may seem that we have
the job done when we find one of the root causes of the event.
This will not be true every time, and we always have to look for more than one root cause. The
main risk we run when we don’t discover all of them, or at least the main ones, is that we can
only implement corrective action against the causes we know. The problem
will be only partially solved, and the root causes that have not been treated may
show up as a repetition of exactly the same failure we thought we would never see again. A very
good example of this situation will be discussed at the end of the paper.

d) Examine the maintenance and operational history of the machine


It is well known that past and present maintenance and operating conditions can greatly influence
the machine life. So, it becomes obvious that they should be taken into account.

e) Develop a machinery failure analysis database
Our brain tends to remember things selectively. Sometimes we will not forget an old girlfriend
(even though we may prefer not to remember), other times we may forget to pay the rent. The
general rule says that we remember more easily the things that do not conflict with our
overall conceptions about the world. Herein lies the importance of using a computer database as a
“memory expansion”. We will simply not be able to remember everything, so this is a great help
when we have to decide what the most effective ways to improve machinery reliability are.

It is not enough to write failure analysis reports and maintenance check lists. Reports should be
used as a communication tool to inform other people in the organization what the
conclusions were and what must be done next. A computer database should be used, as pointed out
before, as a “memory expansion” device.
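As an illustration of what one entry in such a database might hold, here is a minimal sketch. The field names, the machine tag and the date are hypothetical and are not taken from the plant’s actual system; the example entry loosely follows the steam turbine case discussed later in the paper.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class FailureRecord:
    """One analyzed failure; field names are illustrative, not from the paper."""
    machine_tag: str
    event_date: date
    failure_mode: str
    first_component_to_fail: str
    root_causes: list[str] = field(default_factory=list)       # often more than one
    corrective_actions: list[str] = field(default_factory=list)
    actions_implemented: bool = False


def open_items(records):
    """Records whose corrective actions have not been implemented yet."""
    return [r for r in records if r.corrective_actions and not r.actions_implemented]


example = FailureRecord(
    machine_tag="TP-01",            # hypothetical tag
    event_date=date(2000, 1, 1),    # placeholder date, not the real event date
    failure_mode="turbine over-speed",
    first_component_to_fail="over-speed trip mechanism",
    root_causes=["design", "operation", "maintenance"],
    corrective_actions=["replace over-speed mass with the new design"],
)
print(len(open_items([example])))
```

Keeping the implementation status in the record makes it possible to list the analyses that have not yet led to corrective action, which is the point where much of the value of an analysis is lost.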

f) Take into consideration the operating mode of the failed mechanism


This point may need some further explanation: we should seek an explanation of the failure
that does not conflict with the way the machine works. The interaction between the
components and between the components and working fluids will, in most instances, determine
the failure mode.

A good example was experienced by the author: a small air turbine used to actuate a
certain process valve was showing reduced life. The valve stem was somewhat “sticky”, which
caused a certain overload on the actuator. The maintenance foreman decided that this had been
the cause of the accelerated wear and reduced life of the air turbine. As he explained, he was
thinking of the air turbine as an electric motor that can burn out if overloaded. This is
obviously wrong, since an overloaded air turbine simply slows down or stalls, and further
investigation showed that poor lubrication was the cause of the problem.

Another instructive way to improve our failure analysis capabilities is to analyze the way the
mechanism operates and use this information to help explain the observed damage. Once
more, the example discussed later will shed some light on this subject, describing how an
understanding of the operating mode of the steam turbine trip system helped explain a certain
failure.

g) Learn why some machines don’t fail


Does it sound obvious? Sometimes an easy way to improve the reliability of a certain machine is
to find a similar machine in a similar service that shows good reliability. When we compare
both machines we will, many times, discover that “equal” machines are not as “equal” as we
thought, the difference being the solution to our problem.
3.3 – How Far Shall We Go?
This is not exactly part of the failure analysis itself, but, as it is very easy to spend more effort
than is adequate, some strategy is necessary to help optimize the use of the available resources.
The best way to handle this question is to concentrate the failure analysis effort on the events that
have the greatest potential to affect the results of the company.

The desired result for the process industry is, normally, related to financial results, safety and
environmental issues and possibly many other areas. A good starting point, which will fit most
companies, is:
a) Non-repetitive failures that have no actual or potential safety, environmental or production
impact are treated as the least important ones. The craftsman and his supervisor should perform the
failure analysis by asking why the event happened, 5 or 6 times in succession. This is the old 5 Whys method;
b) When we face a repetitive failure, or a failure that has had some actual or potential impact on
safety, environment or production, we are facing a problem that may be a threat to the
company results. In this case, the failure analysis should be more careful and
detailed, and should be performed by a group containing engineering, maintenance and operations
personnel.
Some question may be raised as to what constitutes a repetitive failure. This concept will be used
to separate the “critical” from the “non-critical” failures, so it must be defined in advance. Again,
this can be decided by the plant itself, but a good starting point is to call repetitive the failures
that recur more often than the plant average, that is, at intervals shorter than the plant Mean
Time Between Failures (MTBF).
If this criterion results in too large a failure analysis workload, it can be adjusted accordingly, as it
is only a prioritization criterion.
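The starting-point rule above can be sketched as a small decision function. The function names and the way the plant MTBF is estimated (machines in service divided by failures per year) are assumptions made for illustration; each plant would define its own criterion.

```python
def plant_mtbf_years(machine_count, failures_per_year):
    """Plant MTBF in years: machine-years in service per recorded failure."""
    return machine_count / failures_per_year


def analysis_level(years_since_previous_failure, plant_mtbf, has_impact):
    """Suggest the depth of analysis for one failure event.

    years_since_previous_failure: None if the machine has no earlier failure.
    has_impact: actual or potential safety, environmental or production impact.
    """
    repetitive = (
        years_since_previous_failure is not None
        and years_since_previous_failure < plant_mtbf
    )
    if repetitive or has_impact:
        return "team analysis (engineering, maintenance, operations)"
    return "5 Whys by craftsman and supervisor"


# 1500 machines and 375 failures per year are the 1997 figures quoted earlier.
mtbf = plant_mtbf_years(1500, 375)
print(mtbf)                                          # 4.0
print(analysis_level(1.5, mtbf, has_impact=False))   # repetitive: team analysis
print(analysis_level(None, mtbf, has_impact=False))  # first failure: 5 Whys
```

If the resulting workload is too large, the threshold passed as `plant_mtbf` can simply be lowered, in line with the remark that this is only a prioritization criterion.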
4 - A STEAM TURBINE FAILURE ANALYSIS EXAMPLE
This section describes a complete failure analysis performed due to a steam turbine over-speed
event that resulted in a pump fire. The reason for this description is to serve as an example of the
procedure and sequence of the analysis. The analysis described is by no means general, applying
only to the specific event.
Many basic concepts discussed earlier are illustrated by this example, which increases its value as a
teaching tool. The description of the event lists the data available on the
day after the incident, then adds information in the same sequence as it was found in the real
investigation.
4.1 – Description of the Event
The system involved was a huge boiler fuel system. The machine concerned was a steam-turbine-driven
boiler fuel oil pump. The unit was operating normally until, at 4:00 PM, the unit DCS
warned of an unexpected increase in the pump discharge pressure. An inspection of the pump
showed that nothing seemed to be wrong, except that the pump speed was higher than normal.
The pump was running at 4100 RPM, whereas the normal speed should be 3600 RPM and the
trip speed should be 3900 RPM.
The pump and turbine vibrations were very small, despite the increased speed. The operator tried
to adjust the speed by turning the hydraulic governor knob, but this action was not successful.
The steam inlet valve (a gate valve) was then used to try to adjust the pump speed, again without success.
With the end of the shift approaching, it was decided that the turbine governor would be
removed to the workshop on the next day. It should be pointed out here that the spare electric-motor-driven
pump was available for operation.
Around 3:00 AM the next day, leaking hot fuel caused a fire. Fortunately, it was
controlled easily by the refinery fire brigade. After the event it was found that the pump was
completely destroyed and that some of the pump piping had ruptured.
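The speeds reported at the first inspection already carry an important message, which a short calculation makes explicit. The three speeds are taken from the text; the variable names are only illustrative.

```python
normal_rpm = 3600    # normal operating speed (from the text)
trip_rpm = 3900      # over-speed trip setting (from the text)
observed_rpm = 4100  # speed found at the first inspection (from the text)

trip_margin = trip_rpm / normal_rpm - 1          # trip setting above normal speed
excess_over_trip = observed_rpm / trip_rpm - 1   # observed speed above trip setting
trip_should_have_acted = observed_rpm > trip_rpm

print(f"trip setting is {trip_margin:.1%} above normal speed")      # 8.3%
print(f"observed speed is {excess_over_trip:.1%} above trip speed")  # 5.1%
print(f"trip device should have acted: {trip_should_have_acted}")    # True
```

A turbine running steadily above its trip setting indicates that the over-speed protection, and not only the governor, is no longer working.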

4.2 – Data Collection
Inspections of the pump and turbine internals led to the conclusion that the cause of the
destruction of the pump was the turbine over-speed. The evidence that supported this conclusion
was:
a) The pump shaft was distorted and the wear rings and mechanical seal were severely rubbed, as would
be expected from operation above the first critical speed;
b) The turbine bearings showed signs of overheating, although no problems were found in the
lubrication system;
c) The turbine shaft was cut by the over-speed disc, disconnecting the governor and the
over-speed trip.
When the pump over-accelerated, its vibration became high enough to destroy the
pump and the piping.
Other information gathered during inspection of the pump and turbine:
a) The turbine bearings did not fail catastrophically and the turbine blades were not damaged;
b) There was severe damage to the over-speed trip mechanism parts that were attached to the
shaft. The rest of the turbine trip system components (except for the disc and mass) and the
governor were in good working order;
c) The over-speed adjusting screw was out of its correct position. The lock washer was
distorted and out of its correct position.
Drawings of the trip system can be seen below. The pictures describe the over-speed trip
mechanism and the turbine thrust bearing area. This information is extremely important to the
failure analysis.
The available process data included the pump discharge pressure and the de-aerator pressure,
the latter being an indirect measure of the turbine exhaust pressure. These variables are shown in the
plot below, extracted from the unit DCS.
It can be seen that there is a sudden increase in the pump discharge pressure around 4:00 PM and
that the de-aerator pressure drops suddenly in the middle of the night. This sudden drop was
explained by the effect of the heavy rain that occurred at that time. The rain cooled down the
low-pressure steam system, reducing the de-aerator pressure.

Figure 4.1 – Cross-section drawing of the turbine showing the thrust bearing area.

Figure 4.2 – The over-speed trip device, as seen from the shaft end side.

Figure 4.3 – Plot of the pump discharge and de-aerator pressures on the day of the event.
This turbine over-speed trip device was adjusted and tested about two weeks before the incident.
On this occasion, the nuts were not chemically locked, as required by the turbine maintenance
instructions.
A good description of the damage that was caused to the over-speed trip mechanism can be seen
in the next pictures. It can be seen that the turbine shaft was cut by the friction with the over-
speed disc. It can also be seen that the over-speed mass was distorted. Another picture shows the
damage to the inside diameter of the disc.
4.3 – Analysis of the event
The analysis of the accident should explain everything that has been
observed. The operation of the over-speed trip mechanism should also be taken into account. The
following sequence is thought to explain every detail of the event:
a) The over-speed adjusting screw was displaced by the centrifugal force. This is supported by
the position where the screw was found. The change in the position of the screw increased
the force on the lock washer and pushed it out of position;
b) The mass was then held only by the head screw, which was slowly worked out of its position,
allowing the mass to move out of position as well. The threads were not damaged. After some
time the screw no longer held the mass, and the mass was thrown out of the disc by the
centrifugal force. Unfortunately, the mass left the disc without touching the trip trigger,
which would have tripped the turbine. The dimensions of the screw allow this sequence to
happen;
c) The over-speed mass hit the spring of the emergency valve and held the disc, which could
no longer rotate. The spring shows damage that supports this conclusion;
d) As the disc was no longer able to rotate, it rubbed against the shaft until the shaft was
completely cut. When the shaft was cut, the governor and the over-speed trip were disconnected.
At this time, the operators observed the increase in pump discharge pressure, as the governor
was trying to increase the turbine speed. The turbine was no longer under control, its speed
being set only by the steam enthalpy drop;
e) When the de-aerator pressure was reduced by the cooling effect of the rain, the turbine
accelerated further into over-speed, destroying the pump.

Figure 4.4 - Detail of the wear damage to the inside diameter of the disc.

Figure 4.4 – Detail of the damage to the over-speed mass.

The causes of the accident were:


a) A design error in the over-speed trip mechanism. The lock washer could easily be displaced
from its position and the head screw thread was too short, allowing the mass to leave
the disc without touching the trigger. The last picture shows the difference between the old
and new designs;
b) An operator error, as the turbine was not taken out of service when it became clear that the
governor was out of order. It should be noted that the spare pump was available;
c) Finally, a maintenance error, as the screws were not chemically locked during the last
maintenance job. This made it easier for them to move out of position.

Figure 4.5 – Comparison of the damaged and new over-speed masses. Note the difference in the
screw length and in the diameter of the lock washer.
4.4 – Conclusion
The principles stressed in this example are:
a) There were multiple root causes of the failure. We should not be satisfied when we find only
one root cause. This is enough only in the simplest cases;
b) The importance of taking into account the maintenance and operation history and the
operating conditions at the time of the event;
c) The importance of taking into account the working mode of the mechanism to analyze the
failure. It should also be noted that only visual inspections have been necessary in this case;
d) The importance of focusing on the first component to fail, as everything else will be only a
consequence of the first failure.

5 - A CAT-CRACKER BLOWER TRIP DUE TO AN AUXILIARY OIL PUMP TRIP
The big turbines and compressors that operate in critical unspared service usually have an oil
supply system that is designed to be almost fail-proof. So, there are spare pumps, spare drivers,
spare coolers, etc. The real case we are going to discuss now is a big problem that occurred in
one of these big machines after a small failure that should not have caused so many headaches to
the refinery.

The huge air blower was operating normally until it suddenly tripped. Everyone knows that an
FCC blower trip can mean big trouble for the refinery’s operations, even if we don’t think about the
enormous loss of money that results. The cat-cracker air blower sends air to a regenerator that is
full of high-temperature (~700 °C), very abrasive catalyst. There is a check valve to protect the
blower from harmful contact with this hot material in case of a trip.
The cause of the trip of the blower was the following:
a) The auxiliary turbine oil cooler was leaking, causing the trip of the auxiliary turbine;
b) The check valve that should have prevented the oil in the pressurized reservoir from flowing back to the
pumps while the motor-driven pump was accelerating was jammed, allowing the oil pressure
to drop below the trip setting.
In this case, the first component to fail was the auxiliary oil cooler, but the blower trip occurred
because the check valve did not work properly. It should be noted that the check valve defect
was discovered only when the valve was required to work. This component was the “weak link” that
caused a small failure (oil cooler leakage) to turn into a big failure. We should pay more
attention to the check valve problem, as it can magnify the oil cooler failure many times. The
cooler failure cannot, by itself, harm anything.
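The role of the “weak link” can be expressed as very simple logic. This is a deliberately simplified sketch of the oil system described above, with hypothetical function and parameter names; a real system has more components and timing effects than three booleans can represent.

```python
def blower_trips(running_pump_lost, standby_pump_starts, check_valve_holds):
    """True if oil pressure falls below the trip setting (simplified logic)."""
    if not running_pump_lost:
        return False
    # The pressurized reservoir can only bridge the standby pump's
    # acceleration time if the check valve keeps its oil from flowing back.
    bridged = standby_pump_starts and check_valve_holds
    return not bridged


# The event described above: the standby pump started, but the valve was jammed.
print(blower_trips(True, standby_pump_starts=True, check_valve_holds=False))  # True
# The same initial failure with a healthy check valve.
print(blower_trips(True, standby_pump_starts=True, check_valve_holds=True))   # False
```

The check valve state has no effect on the result while the running pump is healthy, which is why its failure stays hidden until the day it is needed.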
6 - CONCLUSION
Machinery failure analysis can be a very valuable tool to help improve machinery reliability. It
has been shown that some simple precautions may help improve the quality of the analysis.
These simple rules cannot, however, be a substitute for formal training and field experience.
They should be thought of as guidelines that help someone who already knows the job perform
better.

7 - REFERENCES
7.1 - Affonso, Luiz Otávio A.: “Improving the Reliability of the Machinery in the Cubatão
Refinery”, Proceedings of 7th Process Plant Reliability Conference, Houston, TX, 1998;
7.2 - Affonso, Luiz Otávio A.: “Equipamentos Mecânicos – Análise de Falhas e Solução de
Problemas”, Editora Quality Mark, Rio de Janeiro, Brasil, 2002.
