
Minimizing Risk & Optimizing Use of Medical Products

Effective Risk Management and Quality Improvement by
Application of FMEA and Complementary Techniques

Benjamin A. Berman November 2003

Introduction
This paper provides my expert opinion of the use and effectiveness of Failure Modes and Effects
Analysis (FMEA) for managing risks and improving quality in several industrial domains. I also
consider and evaluate several other analytical techniques as complementary extensions of FMEA. [1]
The opinions that I express in this paper are based on a thorough review that I conducted of industry
standards and procedures for risk management, FMEA techniques, and FMEA applications in
aviation and other industries. I also base these opinions on my 25 years of experience in
transportation management and analysis, airline flight operations, safety investigation management,
safety research, and airline accident investigation. I have ten years of experience on the staff of the
U.S. National Transportation Safety Board (NTSB), concluding my service there as the Chief of the
Major Investigations Division. In that position, I managed the overall investigative effort for U.S. air
carrier accidents from the field investigation to the public board meeting and final accident report. I
also managed the U.S. Government’s participation in foreign aviation accidents. My previous NTSB
experience included management of flight operations, air traffic control, and meteorological aspects
of air carrier accident investigations; on-scene and follow-up investigations of flight operations for
several major accident investigations including the USAir flight 427 Boeing 737 accident near
Pittsburgh and ValuJet flight 592 DC-9 accident in the Everglades; and management of research
programs on flight crew human factors and regional air safety issues, both of which were adopted
and published by the NTSB. I am a pilot for a major U.S. air carrier, qualified in the Boeing 737 and
two other transport category aircraft types. I have consulted with the National Aeronautics and
Space Administration (NASA), the World Bank, the European Bank for Reconstruction and
Development, the U.S. President’s Aviation Safety Commission, and several airlines, financial
institutions, airport authorities, and other private entities on safety and analytical matters. I received
the A.B. degree summa cum laude in Economics from Harvard College and am a member of the Phi
Beta Kappa Society.

FMEA—Summary and Definition


According to the Society of Automotive Engineers (SAE) International Aerospace Recommended
Practice (ARP) 5580, Recommended Failure Modes and Effects Analysis (FMEA) Practices for Non-Automobile
Applications, FMEA is “a formal and systematic approach to identifying potential system failure
modes, their causes, and the effects of the failure mode occurrence on the system operation…FMEA
provides a basis for identifying potential system failures and unacceptable failure effects that prevent
achieving design requirements from postulated failure modes…FMEA is used in many system
design analyses including assessing system safety, planning system maintenance activities, defining
provisions for fault recovery, fault tolerance, and failure detection and isolation, and identifying
design modifications and corrective actions needed to mitigate the effects of a failure on the system.”

[1] I acknowledge and thank ParagonRx, LLC for its support of my review of risk-management methodologies
and the writing of this paper. All opinions expressed herein are my own and do not necessarily represent the
opinions, policies, and products of ParagonRx, LLC.
©2009 ParagonRx, LLC

The basic FMEA process involves examining each basic hardware, software, personnel, or functional
element of a system, identifying all the ways in which that element can fail (failure modes), assessing
the effects of each failure mode upon the function of other elements of the system and the entire
system (failure effects), and then assessing the criticality of the failure effects. Integral to the FMEA
process is the specification of corrective actions that will prevent critical failures or restore critical
functions.
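The steps above can be illustrated with a minimal Python sketch of a worksheet row. The field names, the severity scale, and the sample entries are hypothetical and are not taken from SAE ARP5580 or from any actual worksheet; the sketch only shows how elements, failure modes, effects, and corrective actions relate to one another.

```python
from dataclasses import dataclass

# Hypothetical severity scale, ordered from least to most severe.
SEVERITY_SCALE = ["none", "minor", "major", "catastrophic"]


@dataclass
class WorksheetRow:
    element: str                 # hardware, software, personnel, or functional element
    failure_mode: str            # one way in which the element can fail
    local_effect: str            # effect on neighboring elements
    system_effect: str           # effect on the entire system
    severity: str                # one of SEVERITY_SCALE
    corrective_action: str = ""  # empty until an action has been specified


def open_critical_rows(rows, threshold="major"):
    """Return rows at or above the threshold severity that lack a corrective action."""
    cutoff = SEVERITY_SCALE.index(threshold)
    return [
        row for row in rows
        if SEVERITY_SCALE.index(row.severity) >= cutoff and not row.corrective_action
    ]


# Fictitious entries for illustration only.
worksheet = [
    WorksheetRow("valve A", "sticks closed", "no flow to actuator",
                 "loss of primary function", "catastrophic"),
    WorksheetRow("valve A", "leaks", "reduced pressure",
                 "degraded performance", "minor"),
    WorksheetRow("sensor B", "reads high", "false alarm",
                 "unnecessary shutdown", "major", "add redundant sensor"),
]

for row in open_critical_rows(worksheet):
    print(row.element, "/", row.failure_mode, "still needs a corrective action")
```

Keeping the corrective action on the same row as the failure mode mirrors the point that specifying corrective actions is integral to the analysis rather than a separate step.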
FMEA typically uses a worksheet for analyzing data and documenting the results. The worksheet
proceeds, left to right, from the component identification, to the associated failure modes, to the
failures’ effects at various levels of the system (including detectability of the failure modes/effects),
to their risk, reliability, or quality consequences. The following is an example of an FMEA worksheet
that was prepared by the SAE for analysis of a fictitious aerospace application:


Source: SAE ARP926B, p. 32.


The criticality, or level of risk, of a failure is a combination of the severity of the effect and the
probability of its occurrence. Under FMEA the severity is estimated qualitatively with each effect
assigned to one of several categories ranging from none to catastrophic, and the probability is
assessed either qualitatively or quantitatively (the latter if failure rate data are available from
previous experience or from laboratory or field experimentation). The severity and probability
assessments are combined into an overall assessment of the risk level of the failure effect as being
acceptable or unacceptable, along the lines of the following graphic from Federal Aviation
Administration (FAA) guidance material:

Source: FAA Advisory Circular 25.1309-1A, System Design and Analysis, p. 7
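As a rough sketch of how severity and probability might be combined into an acceptable/unacceptable judgment, the following Python fragment encodes an inverse relationship: the more severe the effect, the less probable it must be. The three-level scales and the decision rule are simplifying assumptions made for illustration; they are not a reproduction of the FAA graphic, and a real analysis would use the categories defined in the governing standard.

```python
# Assumed, simplified category scales (illustrative only).
SEVERITIES = ["minor", "major", "catastrophic"]                      # least to most severe
PROBABILITIES = ["probable", "improbable", "extremely improbable"]   # most to least likely


def is_acceptable(severity, probability):
    """Assumed rule: an effect is acceptable only if its probability category
    is at least as remote as its severity category is severe."""
    severity_rank = SEVERITIES.index(severity)
    remoteness_rank = PROBABILITIES.index(probability)
    return remoteness_rank >= severity_rank


# Print the full matrix, one row per severity category.
for severity in SEVERITIES:
    verdicts = [
        "acceptable" if is_acceptable(severity, p) else "UNACCEPTABLE"
        for p in PROBABILITIES
    ]
    print(f"{severity:>12}: {verdicts}")
```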


One aspect of the FMEA process that is often ignored in discussions of the methodology (perhaps
because it is not represented on the FMEA worksheet) is the importance of documenting and
retaining all assumptions, including rationales for failure rates and effects categorization that
underlie the FMEA worksheet entries. This is specifically cited by the SAE in its recommended
standard ARP4761, appendix G, section 3.2.1.
My review of FMEA utilization in aerospace and several other fields suggests that the most common
applications of FMEA are in product design and manufacturing processes. FMEA has not typically
been applied to the post-manufacturing environment (such as product distribution and field usage
by providers, operators, maintainers, and customers); however, post-manufacturing applications are
not specifically excluded in FMEA standards. In fact, in SAE ARP5580 section 6.1.1 (5), “failure
conditions caused by the operational and maintenance environment” are specifically cited among the
failure modes to be considered.

Cross-industry acceptance and use of FMEA


FMEA is firmly established as a risk analysis and risk management methodology. Originating in the
U.S. military during the 1940s and supported by a military specification beginning in 1949 (MIL-P-1629,
Procedures for Performing a Failure Mode, Effects, and Criticality Analysis), FMEA methods and
applications were officially accepted as a recommended practice for aerospace engineering by the
SAE beginning in 1967 under ARP926, Fault/Failure Analysis Procedure. FMEA had become a standard
part of the design process in the aerospace industry by the 1980s and has been in continuous use
through the present. For example, the Boeing Commercial Airplane Group relied upon FMEA to
substantiate the safety and reliability of design changes for two generations of the Boeing 737
commercial airliner: the 737-300/400/500 series, first produced in the mid-1980s, and the “next
generation” 737-600/700/800/900 series, first produced in the late 1990s and early 2000s. I have
personally examined numerous FMEA documents and FMEA-based safety analyses prepared by
aircraft manufacturers for original and modified transport-category aircraft designs (these FMEA
applications are proprietary to the manufacturers). In addition to these aviation applications of
FMEA, the late 1980s saw the application of FMEA to design and manufacturing processes by a major
U.S. automobile manufacturer, and these practices were recognized by the automotive industry
under the auspices of the Automotive Industry Action Group (AIAG) and the SAE (Surface Vehicle
Recommended Practice J-1739, first issued in 1994). Currently, FMEA is recognized by the SAE
(ARP5580, Recommended Failure Modes and Effects Analysis (FMEA) Practices for non-Automobile
Applications), the FAA (Advisory Circular 25.1309-1A, System Design and Analysis), and the National
Aeronautics and Space Administration (NPG 8715.3, NASA Safety Manual, and NSTS 22206,
Instructions for Preparation of FMEA and CIL). In a subsequent section of this paper, I will provide an
example of a successful government-sponsored (and therefore non-proprietary) aviation industry
application of FMEA that resulted in a significant improvement in commercial air carrier flight
safety.
FMEA has also been applied successfully in a wide range of other domains. For example, FMEA is
being used to analyze design and maintenance issues in building structures (Anker Nielson, Ph.D.,
“Use of FMEA, Failure Modes Effects Analysis on Moisture Problems in Buildings,” Building Physics
2002—6th Nordic Symposium). Also, engineers have applied FMEA to design and manufacturing
processes in the semiconductor industry (Steven Martin and Bedwyr Humphreys, “FMEA Speeds
Time to Market in Photonic IC Manufacturing”, Compound Semiconductor, November 2002). The
authors concluded, “The FMEA technique has been successfully implemented at MetroPhotonics,
aiding in the rapid development and the successful launch of the SurePath product suite…Time to
market and development costs were greatly reduced through the selection of optimum system
alternatives (through FMEA), resulting in a successful product launch within four months of
concept” (Martin and Humphreys, p. 69).
FMEA has become established as a standard methodology for risk management in the healthcare
industry. Under Joint Commission on Accreditation of Healthcare Organizations (JCAHO) Standard
LD.5.2, adopted July 1, 2000, healthcare organizations are required to proactively identify and
manage potential risks to patient safety, using FMEA and root cause analysis to analyze at least one
high-risk process annually. The U.S. Veteran’s Administration has developed and begun
implementation of an application of FMEA that the agency customized for healthcare delivery
(Joseph DeRosier, Erik Stalhandske, James P. Bagian, and Tina Nudell, “Using Health Care Failure
Mode and Effect Analysis™: The VA National Center for Patient Safety’s Prospective Risk Analysis
System,” The Joint Commission Journal on Quality Improvement, Vol. 28, No. 5, May 2002). Private health
care organizations (for example, Kaiser Permanente) have begun to implement FMEA-based
processes (Kaiser Permanente, Failure Modes and Effects Analysis Team Instruction Guide, March 2002).
Although healthcare-related applications of FMEA have considered some aspects of pharmaceutical
delivery (for example, Institute for Healthcare Improvement, “Sample FMEA: Comparison of Five
Medication Dispensing Scenarios,” 2003), I am not aware that a comprehensive analysis of
pharmaceutical distribution, delivery, and use, treating all post-manufacture activities as an
integrated system, has been performed to date using FMEA or any alternative, formal risk-
management methodology.

Advantages of FMEA
I suggest that FMEA has several general advantages for organizations seeking to improve quality and
safety:


First, FMEA is a structured process that promotes disciplined elicitation of ideas about the kinds of
failures that may occur, careful analysis of specific risk/hazard areas, proper documentation of
sources and assumptions, and identification of interventions that manage risks to an acceptable level.
Regarding the ultimate goal of risk management, in most applications the FMEA process requires
intervention in each identified adverse outcome until the residual level of risk is acceptable.
Further, as a “bottom-up process” proceeding from the failure of an individual component of a system
to the effects on the entire system, FMEA helps organizations identify unforeseen, undesired
outcomes. Its best applications are prospective, facilitating the control or mitigation of adverse
outcomes before they occur.
Also, FMEA explicitly considers the detectability of failure modes, and thus it promotes
consideration of failures that can remain latent; that is, failures that have no immediate effect and (if
they remain undetected) are capable of resulting in adverse effects when combined with subsequent
failure modes or events (however, as is discussed below, the basic FMEA methodology may need to
be modified to fully address latent failures).

Limitations of FMEA
SAE ARP5580 provides the following “cautions” for the application of FMEA:
• First, a FMEA traditionally considers only non-simultaneous failure modes. Each failure
  mode is considered individually, assuming that all other system components are performing
  as designed. Hence, a typical FMEA provides limited insight into the following anomalous
  behaviors:
    1. the effects of multiple component failures on system functions, and
    2. latent manifestations of defects such as timing, sequencing, etc.
• Second, the prioritization of the failure modes for corrective actions is substantially
  subjective. Thus, care should be taken in decision making when using any quantitative
  aspects of the numbers presented in the analysis (SAE ARP5580, Section 3.3).
I concur that the basic approach of FMEA is to consider single failures and that a typical FMEA
application handles multiple (simultaneous/sequential) failures with difficulty (later in this paper, I
will suggest several extensions to FMEA that are capable of addressing these issues).
Further, I suggest that the following additional general limitations exist for FMEA:
First, as FMEA has typically been applied in aerospace engineering, designers are permitted to rely
upon human performance (such as interventions by pilots and mechanics) to mitigate the adverse
effects of hardware and software component or system failures. However, in doing so, no
consideration is given to imperfect human performance. For example, FAA guidance for
aircraft certification states, “If…a potential failure condition can be alleviated or overcome…without
requiring exceptional pilot skill or strength, credit can be taken for correct and appropriate action”
(FAA AC 25.1309-1A, paragraph 11). The assessment of “exceptional” skill or strength is subjective,
and once a specific human response to a failure mode is determined to require unexceptional skill or
strength, FMEA typically assumes that the human will intervene reliably every time that the failure
mode occurs. I believe that this is an unrealistic assumption for human performance, and as a
common treatment of human performance in FMEAs it constitutes a limitation of the typical FMEA
methodology.
Also, as FMEA typically has been applied in design/process applications, there is no inherent
feedback to the FMEA process from the actual failure modes and outcomes experienced in field use.
However, this feedback is not excluded by the FMEA process and the continuing refinement of an
FMEA through feedback has been explicitly recognized as an important aspect of system safety
analysis in some applications.

Keys to successful application of FMEA


I believe that several additional issues are important for obtaining satisfactory results from an FMEA.
First, while FMEA is a structured technique that provides a comprehensive analysis, it is difficult (or
impossible) to prospectively identify all possible failure modes/adverse outcomes from a complex
component or functional element of a system. Because even the best FMEA effort may leave some
failure modes and effects undiscovered, after completing an FMEA it is essential to avoid concluding
that all risks have been compensated for or controlled. This suggests that FMEA analysts need to
maintain an open and creative attitude about identifying failure modes and assessing their effects
and consequences. It also establishes the rationale for obtaining, analyzing, and reacting to feedback
from field use and operations, and for treating the FMEA as a “living document” that will be
revisited and revised on a continuing basis.
Further, while planning and performing an FMEA, it is essential to understand the scope of the
analysis and to choose a proper scope that will allow the evaluation of all critical risks that can result
from failure modes. For example, many FMEAs are limited to design issues and do not necessarily
consider manufacturing variations or errors. An FMEA of an aircraft part that includes several linkages may not
consider the cumulative effect (stack-up) of the manufacturing tolerances that are allowed for
each individual linkage as a possible contributor to failure modes and effects. Even if the scope of the
FMEA for this part is enlarged to include manufacturing processes and therefore considers tolerance
stack-up, the analysis still may not consider the effects of failure modes that remain downstream
from the processes that have been included within the analytical scope, such as improper
maintenance or use. When considering all of a product’s failure modes and effects in all
environments, a still broader scope of analysis might reveal additional factors that significantly affect
safety and quality. For example, consider a pharmaceutical product with an adverse side effect that
poses a risk to some users. One option for controlling the risks of these side effects would be for the
Food and Drug Administration (FDA) to withdraw approval for the product. However, because the
product also has therapeutic value, withdrawal of the product may actually result in a net reduction
of patient health and safety, even considering the adverse consequences of the side effects. The net
therapeutic benefit of the product relative to its side effects will not be identified by an FMEA of its
design, manufacturing, and use—unless the withdrawal of the product is considered as a failure
mode and the scope of analysis is broadened to consider the net consequences of non-use.
In addition to considering downstream effects in scoping the analysis, it is essential to recognize that
the interventions selected in an FMEA to mitigate an identified risk can also introduce their own
failure modes and effects having critical risks. Interventions should be designed to “first, do no
harm;” that is, they should introduce no new uncorrected failure modes. This suggests that FMEA
should be performed on each intervention, as well. In some cases controlling the hazard from one
failure mode can increase the hazard from another, and this may require consideration of multiple
simultaneous or sequential failures as an extension of FMEA.
Also, while interpreting the results of an FMEA, it is essential to understand the derivation and
limitations of the probability analysis that is incorporated in the evaluation of the risks associated
with failure effects. The probability that a failure mode will occur can be obtained from engineering,
field, or registry data such as historic component failure rates; the probability that a functional
element or complex component will fail can be estimated by combining the failure rates of sub-
assemblies or sub-systems. Failure rates may be obtained from laboratory research if actual field data
are unavailable. Lacking both field and laboratory data, failure mode probabilities may be
estimated. The FMEA analyst’s confidence in the results should depend on the derivation of these
probabilities. An additional probabilistic element in some FMEA applications is the likelihood that
an effect of stated severity will follow from a failure mode. This element needs to be estimated in a
similar manner, with confidence in the results of the analysis once again depending on the source of
the probability estimates. Another probabilistic element can enter FMEA when considering
interventions to control or mitigate an identified risk; here, the probability that the intervention will
successfully address the risk needs to be estimated.
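The probabilistic elements described above can be chained together in a short Python sketch. It rests on two assumptions that the text does not state: that sub-assembly failures are independent, and that the failure of any one sub-assembly fails the whole element (a series arrangement). The function names and the numbers are hypothetical.

```python
def element_failure_probability(sub_assembly_probs):
    """Probability that at least one sub-assembly fails, assuming independent
    failures and a series arrangement (any single failure fails the element)."""
    p_all_survive = 1.0
    for p in sub_assembly_probs:
        p_all_survive *= (1.0 - p)
    return 1.0 - p_all_survive


def residual_risk(p_failure_mode, p_effect_given_mode, p_intervention_succeeds):
    """Probability that the failure mode occurs, produces the stated effect,
    and is NOT caught by the intervention."""
    return p_failure_mode * p_effect_given_mode * (1.0 - p_intervention_succeeds)


# Hypothetical per-exposure failure probabilities for three sub-assemblies.
p_mode = element_failure_probability([1e-4, 2e-4, 5e-5])

# Hypothetical: the severe effect follows half the time, and the
# intervention succeeds 90 percent of the time.
print(f"element failure probability: {p_mode:.6e}")
print(f"residual risk: {residual_risk(p_mode, 0.5, 0.9):.6e}")
```

Each input in this chain carries its own uncertainty, so the confidence placed in the final figure can be no better than the confidence placed in the weakest estimate.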
Failure and reliability rates are particularly difficult to estimate when human performance is
involved. The FAA states in its design guidance material that “quantitative assessments of the
probabilities of crew error are not considered feasible” (FAA AC 25.1309-1A, paragraph 11); as I have
already discussed, the FAA then turns at times to the unrealistic assumption that humans perform
with perfect reliability. In other domains, performance by trained professionals has been estimated
as being satisfactory in 30-60 percent of exposures to a demanding task. Although the reliability level
of human performance is highly variable depending on the nature of the task, environment, and
individual, it is probably best to assume that human performance in systems often may be much less
reliable than what is demanded of hardware and software systems, and accordingly to plan
compensations when humans may be responsible for detecting primary failure modes or for
intervening to mitigate failure effects.
Review of FMEA applications in various industries suggests that there is no standard definition for
an acceptable level of risk. Based on the high volume of operations with consequent risk exposure
and the public’s low tolerance for mishaps, commercial aviation design and manufacturing is held to
a stringent reliability criterion: certification guidance requires that every failure having catastrophic
consequences must be demonstrated to be extremely improbable; the FAA defines “extremely
improbable failure conditions” as “those having a probability on the order of 1 × 10^-9 or less”
(FAA AC 25.1309-1A, paragraph 10). In contrast, FMEA applications in other industrial domains accept
catastrophic outcomes with probabilities that may be orders of magnitude more likely. An
interesting criterion for aviation design that incorporates both probability and severity factors
establishes that “in general, a failure condition resulting from a single failure mode of a device cannot
be accepted as being extremely improbable” (FAA AC 25.1309-1A, paragraph 2-g). Thus, every
failure mode having catastrophic consequences, regardless of its estimated likelihood, must be
mitigated by a redundant system or a means of reliably detecting the failure before it occurs (the FAA
guidance does suggest that “…in very unusual cases, however, experienced engineering judgment
may enable an assessment that such a failure mode is not a practical possibility.”).
When considering the effectiveness of interventions in mitigating the risks of failure effects, a
significant implication of probability analysis is the assumption of independent events. Normally,
the probability of two events both occurring is the probability of one event multiplied by the
probability of the other event. For example, consider an aircraft component that FMEA determines to
have an unacceptable failure rate. To control this risk, designers require the mechanic to check the
component before each flight and also require the pilot to recheck the component during the taxi-out
checklist. If there is a 10 percent chance of the mechanic forgetting to check the component and also a
10 percent chance of the pilot skipping the same item on the checklist, the probability of the check
being omitted by both persons is only 1 in 100. In this manner, adequate reliability can be obtained
from two somewhat unreliable human performances by imposing multiple, redundant interventions.
However, this analysis assumes that the pilot and mechanic events are independent, while in reality
these events may interact: a pilot who knows that the mechanic is supposed to be checking the
component may grow to rely on the mechanic and become less likely to perform the re-check. As
another example, consider a pharmaceutical product that requires patients to receive periodic lab
tests to detect possible adverse side effects. Multiple, redundant interventions are designed to ensure
that patients receive the lab tests: doctors and pharmacists are both instructed to track the due dates
for the tests and notify patients. However, if doctors become aware that pharmacists are tracking the
due dates, the doctors may become less likely to perform this tracking themselves; the multiple
interventions then collapse to a single intervention and the redundancy is lost. Whenever the assumption
of independent events is violated and the likelihood of one event becomes a function of another
event, it is impossible to conclude that the desired reliability will result from multiple interventions.
Therefore, interventions must be designed and implemented so as to provide and preserve the
independence of the events.
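The arithmetic of the mechanic and pilot example can be sketched in Python as follows. The 10 percent figures come from the example above; the 60 percent figure for a pilot who has come to rely on the mechanic is a hypothetical value chosen only to show the direction of the effect.

```python
def probability_both_omit(p_mechanic_omits, p_pilot_omits_given_mechanic_omits):
    """Multiplication rule: P(both omit) = P(mechanic omits) * P(pilot omits | mechanic omits).
    Under independence, the conditional probability equals the pilot's own omission rate."""
    return p_mechanic_omits * p_pilot_omits_given_mechanic_omits


# Independent events, as in the text: 10 percent each, so 1 in 100.
independent = probability_both_omit(0.10, 0.10)

# Dependent events: the pilot relies on the mechanic, so the pilot's omission
# rate rises to a HYPOTHETICAL 60 percent.
dependent = probability_both_omit(0.10, 0.60)

print(f"independent checks: {independent:.3f}")
print(f"dependent checks:   {dependent:.3f}")
print(f"risk multiplier:    {dependent / independent:.1f}x")
```

As the pilot's conditional omission rate approaches 100 percent, the joint probability approaches the mechanic's omission rate alone, which is the collapse to a single intervention described above.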

Complementary analytical techniques


In its Safety Manual, NASA states that “risk assessment should use the simplest methods that
adequately characterize the probability and severity of undesired events.” The NASA manual
further states, “Qualitative methods that characterize hazards and failure modes and effects should
be used first…quantitative methods are to be used when qualitative methods do not provide an
adequate understanding of failures, consequences, and events” (NASA NPG 8715.3).
A variety of analytical methods are available to apply to risk management, in addition to FMEA. I
will briefly define and discuss several of these methods and indicate how they can be used to
complement FMEA and extend its applications into areas in which FMEA is otherwise inherently
limited.
I have described the FMEA method as a “bottom-up” approach that attempts to identify failure
effects (some of which may not yet have occurred in actual use of the product) by starting with
individual component failures, imagining the ways the component can fail, and then proceeding up
the chain of the system to subsequent failures and consequences. Further, I identified the bottom-up
orientation of FMEA as advantageous for a prospective, accident-prevention program.
Some alternative analytical methods are “top-down” in that they begin with the ultimate system
consequence or failure event and then proceed down into the system to identify why the failure
occurred. These methods perform well as retrospective analyses; for example, investigations of
accidents or incidents that have already occurred. However, top-down methods can also be useful in
prospective analysis; for example, when concerned about a severe consequence, recognizing that the
primary FMEA method may miss some failure effects, it may also be helpful to analyze beginning
with the consequence itself and to search creatively for other sub-system functions or component
failures that might bring about the undesired result.
The SAE’s recommended standard for the general evaluation of aircraft safety (ARP4761, Guidelines
and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment)
describes an over-arching “System Safety Assessment” (SSA) process. SSA integrates FMEA and
some of the following approaches, as required, to thoroughly evaluate all of the failure modes, failure
effects, and risks of a system and show that the entire system (the aircraft) operates at the required
level of safety/reliability despite all anticipated failure modes.
Functional Hazard Analysis (FHA) is a top-down approach that is most often performed at the
beginning of a design effort, when the final specifications for a product have not yet been settled but
its basic functions are already established. Using engineering judgment and knowledge from similar
efforts, analysts review the basic functions of a product or process and suggest system-level
hazardous outcomes for further analysis. This method allows the safety/quality improvement
process to begin early in product development, at least at a level of broad generality.
Methods similar to FHA also can be applied retrospectively, after a product is fielded. One
successful application is Hazard Analysis and Critical Control Points (HACCP), which is used in the food
services industries to evaluate the entire chain of food production and distribution, identifying and
controlling sources of food contamination. This application seems amenable to the simpler FHA
methodology rather than a formal FMEA.
Fault Tree Analysis (FTA) is a more formal top-down approach to identifying the causal links between
functional breakdowns and their antecedents in events or failures of lower-level components. The
FTA begins with the system-level failure or consequence that the analysts want to understand.
Proceeding down through the system from the top-end level to the underlying processes and
components, the analysis results in a graphical representation of the combinations of subsystem and
component failures that can result in the system event. The fault tree (so-named because it resembles
the root structure of a tree) uses standard notations of Boolean logic to denote precursor or lower-
level events that must occur individually (“or-gate”) or in combination (“and-gate”) to bring about
the higher level event. In this manner, FTA directly incorporates multiple causation
(simultaneous/sequential) events. Further, when failure rates are added to each component of the
tree diagram, the probabilities of each of the lower-level events can be added or multiplied to
estimate the probability of the ultimate system-level event.
The following is an example of FTA provided by the SAE:

Source: SAE ARP926B, p. 46.
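The gate arithmetic described above can be sketched in Python. This is an illustrative fragment rather than the SAE example: the tree structure and the probabilities are hypothetical, and all basic events are assumed to be independent. Multiplication applies to and-gates; for or-gates, simple addition is only an approximation that holds when the probabilities are small, so the exact complement form is shown alongside it.

```python
import math


def and_gate(probabilities):
    """All input events must occur: multiply (independent events assumed)."""
    return math.prod(probabilities)


def or_gate(probabilities):
    """At least one input event occurs: exact form for independent events."""
    return 1.0 - math.prod(1.0 - p for p in probabilities)


def or_gate_rare_event(probabilities):
    """Additive approximation of an or-gate, adequate only for small probabilities."""
    return sum(probabilities)


# Hypothetical tree: the top event occurs if component A fails,
# OR if components B and C both fail.
p_a, p_b, p_c = 1e-6, 1e-3, 1e-3
p_b_and_c = and_gate([p_b, p_c])

top_exact = or_gate([p_a, p_b_and_c])
top_approx = or_gate_rare_event([p_a, p_b_and_c])

print(f"top event (exact):       {top_exact:.6e}")
print(f"top event (approximate): {top_approx:.6e}")
```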


As a top-down approach, FTA may identify one or more underlying causes of the top-level event but
omit others that might be identified in the bottom-up FMEA. An additional limitation of FTA is that
the methodology (unlike FMEA) does not represent the severity of consequences; hence, it is difficult
to assess the risks of failure and evaluate them with respect to the available countermeasures,
without also undertaking an FMEA.
Because it handles multiple failures, various multiple causations as expressed through Boolean logic,
and the associated probabilities rather naturally, FTA also complements FMEA where the latter is
limited. I suggest that FTA notation and techniques should be applied selectively to explore multiple
failures and associated probabilities once these factors have been identified in the basic FMEA.
Another advantage of FTA when used in combination with FMEA is the top-down check of the
bottom-up process that I have already described. FTA might be applied selectively, once again, to
confirm that FMEA has not omitted catastrophic outcomes. I would consider selective application of
FTA as a complementary extension to the basic FMEA methodology. This is explicitly recognized by
the SAE in ARP926B.
Probabilistic Risk Assessment (PRA) has been adopted by NASA as a formal methodology for
analyzing “the probability (or frequency) of occurrence of a consequence of interest, and the
magnitude of that consequence, including assessment and display of uncertainties.” (Michael A.
Greenfield, “Risk Management Tools,” NASA Langley Research Center presentation, May 2, 2000).
A key contribution of PRA is that it considers, tracks, and documents the current state of knowledge
and certainty of the probabilities that are employed in basic FMEA and other analyses. One
significant limitation of PRA, as defined by NASA, is that the methodology requires specific
experience-based failure rate data for the components and functions that are being analyzed. As a
result, I suggest that it may be difficult to apply formal PRA to “softer” areas such as human
performance in FMEA interventions.
Markov Analysis (MA) is a specialized probabilistic analysis especially well suited to evaluating the
failure effects and consequences of high-technology systems that include self-monitoring, self-
repairing and self-reconfiguring functionalities. MA is capable of handling these complex
relationships between failure mode, effect, and consequence by representing the relationship as a
chain, each element in the chain in an operational or non-operational state, and the movement
between states as a system of differential equations. I would suggest that MA is a good methodology
to employ as a complement to basic FMEA and FTA when the nature of the components,
environment, or operators require it; otherwise, in accordance with the principle of minimizing the
complexity of risk analysis, MA does not appear warranted in most applications.
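To make the "system of differential equations" concrete, the following is a minimal sketch of the simplest possible Markov model: one repairable component with an operational and a non-operational state. The failure and repair rates are hypothetical, and forward Euler integration is used only for transparency; it is compared against the known closed-form solution for this two-state case.

```python
import math


def availability_euler(failure_rate, repair_rate, t_end, dt=0.001):
    """Integrate dP_up/dt = -failure_rate*P_up + repair_rate*P_down (forward Euler).

    The component starts in the operational state (P_up = 1).
    """
    p_up, p_down = 1.0, 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        flow_fail = failure_rate * p_up * dt      # probability moving up -> down
        flow_repair = repair_rate * p_down * dt   # probability moving down -> up
        p_up += flow_repair - flow_fail
        p_down += flow_fail - flow_repair
    return p_up


def availability_exact(failure_rate, repair_rate, t):
    """Closed-form probability of being operational at time t (two-state model)."""
    total = failure_rate + repair_rate
    return repair_rate / total + (failure_rate / total) * math.exp(-total * t)


lam = 0.01   # hypothetical failures per hour
mu = 0.5     # hypothetical repairs per hour

p_numeric = availability_euler(lam, mu, t_end=10.0)
p_closed = availability_exact(lam, mu, t=10.0)
print(f"P(operational at 10 h): numeric={p_numeric:.6f}, closed form={p_closed:.6f}")
print(f"long-run availability: {mu / (lam + mu):.6f}")
```

Realistic self-monitoring and self-reconfiguring systems need many more states than this, which is the main reason the method is worth its added complexity only when the system demands it.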
To summarize these alternative methodologies, it is quite possible to extend a basic FMEA into areas
in which the FMEA method is limited, including multiply caused events, simultaneous or sequential
events, and the estimation of probabilities of failure modes, effects, and consequences (and our
confidence in the estimated probabilities), by applying selected aspects of FTA and PRA to the
FMEA. I do not suggest that complete, formal FTA and PRA need to be undertaken in every FMEA
application; rather, these methodologies should be drawn from as required.

Complementary field reporting and data analysis systems from aviation


In a previous section, I mentioned the importance of feeding information from the post-
manufacturing user communities and processes back into the FMEA to ensure that the consequences
of failure modes that arise only in product use (perhaps because they were rare events and did not
occur during design and testing) are recognized and compensated for once they have been
discovered. There are several fairly recent developments in aviation industry reporting and analysis
systems, potentially useful for refining and refreshing an FMEA on a continuing basis, that may also
have applications in other industries.
Aviation Safety Action Programs (ASAP) are cooperative reporting systems for persons active in
commercial aviation operations, including pilots, mechanics, and aircraft dispatchers, to report the
events that happen in daily line operations. ASAP reports are non-jeopardy; in fact, if a person
reports an event to ASAP independently of enforcement action by the regulatory authority (FAA)
then the FAA will typically waive sanctions for any regulatory violation related to the event. This
waiver of sanctions motivates personnel to report the information. ASAP reflects the aviation
system’s recognition that for human failings, obtaining the information is often more important than
punishing the transgressions, most of which are inadvertent in any case. A key feature of the
ASAP program is the Event Review Team, comprising representatives from the airline, the pilots’
association, and the FAA, which meets periodically to review all submitted ASAP reports and act on
the information in the reports. ASAP is considered to be successful in revealing, disseminating, and
promoting resolution of adverse events in daily flight operations that would otherwise remain
unknown. ASAP applications are increasingly popular in commercial aviation. These programs are
described in official FAA guidance (Advisory Circular 120-66B, Aviation Safety Action Program).
Whereas ASAP obtains information from the personnel in the aviation system, Flight Operational
Quality Assurance (FOQA) programs tap into the volumes of parametric data generated during
regular flight operations and recorded continuously by on-board solid state recording equipment
(similar to, but usually distinct from the crash-hardened Digital Flight Data Recorders that are used
in accident investigations). In FOQA, the greatest challenges are handling mass data and then
interpreting the information. Initial applications of FOQA concentrated on identifying events in
which normal flight parameters (such as airspeed limitations, g-loading, touchdown relative to
target) were exceeded. The programs are beginning to delve beyond exceedance monitoring to the
consideration of within-specification performance statistics, including both the means and the
distributions about them, which can then define the norms of the industry. There is also a growing
trend in FOQA programs to link the information obtained from FOQA with information derived
from ASAP about the same events. This facilitates the combined analysis of “what” happened (from
FOQA) and “why” it happened (ASAP, to the extent that the personnel involved in the event were
aware of why they performed the way that they did). A long-term NASA research program, the
Aviation Performance Measuring System, is encouraging the establishment of FOQA programs
at various U.S. airlines and enhancing data analysis along these lines. Most of the major U.S. air
carriers are generating and collecting FOQA data on at least their more modern fleet types (these
aircraft are equipped with the required data busses). FOQA programs are described in the Flight
Safety Foundation’s Flight Safety Digest, July-September 1998, “Aviation Safety: U.S. Efforts to
Implement Flight Operational Quality Assurance Programs.” Although analogous data may not be
available in other applications, FOQA demonstrates the value of routine monitoring of the use of
products in the field, including the identification of product misuse (exceedances in FOQA) and the
characterization of norms for product use.
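The two stages of FOQA-style analysis described above, exceedance monitoring followed by characterization of within-specification norms, can be sketched as follows. The parameter, the sample values, and the threshold are hypothetical and chosen only to illustrate the idea; real programs process far larger data sets with dedicated tools.

```python
from statistics import mean, stdev

# Hypothetical touchdown points, in feet beyond (+) or short of (-) the target.
touchdown_ft = [120, -40, 310, 80, 950, 15, 200, -10, 1400, 60]
LIMIT_FT = 900  # hypothetical exceedance threshold


def split_exceedances(values, limit):
    """Separate out-of-limit events from within-specification observations."""
    exceed = [v for v in values if abs(v) > limit]
    within = [v for v in values if abs(v) <= limit]
    return exceed, within


exceedances, within_spec = split_exceedances(touchdown_ft, LIMIT_FT)

# Stage 1: exceedance monitoring (analogous to detecting product misuse).
print("exceedances:", exceedances)

# Stage 2: norms of routine use, described by the mean and the spread about it.
norm_mean = mean(within_spec)
norm_sd = stdev(within_spec)
print(f"within-spec mean: {norm_mean:.1f} ft, standard deviation: {norm_sd:.1f} ft")
```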
The Continuing Analysis and Surveillance System (CASS) is an aviation reporting and analysis
system that concentrates on tracking product failure modes, effects, and consequences in actual line
maintenance operations. CASS is one of the oldest data-driven quality assurance programs,
beginning in 1964 and tracing its history to industry concerns about several maintenance-related air
carrier accidents during the 1950s. Air carriers are required to implement CASS by Federal aviation
regulations (14 CFR Part 121.373); interestingly, CASS is the only safety management/quality
assurance system that has been specifically mandated by the FAA. CASS is defined by the FAA as a
“structured process to identify factors that could lead to an accident or incident through collection
and evaluation of information that can be used as indicators of the degree of maintenance program
effectiveness and performance…accomplished through a closed-loop, continuous cycle of
surveillance, investigations, data collection and analysis, corrective action, corrective action
monitoring, and back to surveillance.” (FAA AC 120-16D, Air Carrier Maintenance Programs, and AC
120-79, Developing and Implementing a Continuing Analysis and Surveillance System).
Event reporting systems with many similarities to these aviation systems are being developed and
used in other industries, including healthcare. I think that review of the characteristics and
implementation of ASAP, FOQA, and CASS may enhance similar systems in alternative industries,
particularly as these aviation systems are applied in combination to obtain information that only the
personnel in the system can report, additional mass data about regular operations, and specific
product and personnel failures in the post-manufacturing environment. Also, I suggest that
information systems with these characteristics can be effective feedback mechanisms for the ongoing
analysis of failure modes, effects, and consequences through FMEA.

The Boeing 737 Flight Controls Engineering Test and Evaluation Board: a successful
application of extended FMEA
On September 8, 1994, USAir flight 427, a Boeing 737-300 airplane, crashed while maneuvering to
land at Pittsburgh International Airport, Pittsburgh, Pennsylvania. All of the 132 persons aboard
were killed, and the airplane was destroyed. The accident occurred in clear weather with light
winds, during the hours of daylight. After a three-year investigation, the National Transportation
Safety Board (NTSB) determined that the probable cause of this accident was “loss of control of the
airplane resulting from the movement of the rudder surface to its blowdown limit…The rudder
surface most likely deflected in a direction opposite to that commanded by the pilots as a result of a
jam of the main rudder power control unit servo valve secondary slide to the servo valve housing
offset from its neutral position and overtravel of the primary slide.” (National Transportation Safety
Board, Uncontrolled Descent and Collision With Terrain, USAir Flight 427, Boeing 737-300, N513AU,
Near Aliquippa, Pennsylvania, September 8, 1994, NTSB AAR-99/01, adopted on 3/24/99).
Before this accident the rudder system of the 737 had been evaluated by Boeing and the FAA, in full
compliance with existing certification requirements, using failure analysis (a less rigorous version of
FMEA) for the original design reviews performed during the 1960s and FMEA for new-model
reviews performed during the 1980s and 90s. Because the rudder systems had not been completely
redesigned in the new model 737s, the FAA required only a very limited scope for the FMEAs
conducted in the 80s and 90s. Despite these analyses and consistent with their limited scope, the
NTSB investigation determined that the airplane’s rudder system was subject to several previously
unidentified single-point failures that could have catastrophic results. One or more of these failure
modes was most likely involved in the rudder system jam and reversal, which led to the fatal
accidents.
The NTSB issued numerous safety recommendations related to its findings regarding the Boeing 737
rudder system and unusual attitude recovery procedures for flight crews. In Safety
Recommendation A-99-21, the NTSB recommended to the FAA:
Convene an engineering test and evaluation board to conduct a failure analysis to identify
potential failure modes, a component and subsystem test to isolate particular failure modes
found during the failure analysis, and a full-scale integrated systems test of the Boeing 737
rudder actuation and control system to identify potential latent failures and validate operation of
the system without regard to minimum certification standards and requirements in 14 Code of
Federal Regulations Part 25. Participants in the engineering test and evaluation board should
include the Federal Aviation Administration (FAA); National Transportation Safety Board
technical advisors; the Boeing Company; other appropriate manufacturers; and experts from
other government agencies, the aviation industry, and academia. A test plan should be prepared
that includes installation of original and redesigned Boeing 737 main rudder power control units
and related equipment and exercises all potential factors that could initiate anomalous behavior
(such as thermal effects, fluid contamination, maintenance errors, mechanical failure, system
compliance, and structural flexure). The engineering board’s work should be completed by
March 31, 2000 and published by the FAA.
In response to this recommendation, the Engineering Test and Evaluation Board (ETEB) was
convened in May 1999 and completed its work in July 2000 with the issuance of a final report.
(Federal Aviation Administration, 737 Flight Controls Engineering Test and Evaluation Board Final
Report, July 20, 2000.) The staff of the ETEB was detailed from the FAA, Boeing (Commercial, Space,
and Military Airplane divisions), Air Line Pilots Association, Ford Motor Company, Air Transport
Association, Interstate Aviation Commission (Russia), NASA, and U.S. Navy.
According to the ETEB’s report, the group conducted:
• A failure analysis of the flight control system to identify potential failure modes;
• Component and subsystem tests to isolate particular failure modes found during the failure
analysis; and
• Full-scale integrated systems tests, including ground and flight testing, of the … 737 rudder
actuation and control system to identify potential latent failures and to validate the operation
of the system (ETEB Final Report, p. 2-3).
The ETEB noted that normal certification procedures for aircraft and components require
consideration of the probabilities of a failure mode or adverse effect. However, the ETEB chose to
evaluate the severity of failure mode consequences without regard to their probability of occurrence.
The ETEB’s rationale for this approach was that the Boeing 737 had experienced approximately four
serious failures of its rudder system in 100 million flight hours, two of which had resulted in fatal
accidents. Therefore, the failures under investigation were extremely rare but of extremely adverse
outcome. Consequently, it was considered appropriate to treat any failure mode with the potential
for catastrophic consequences as of the highest risk level, regardless of how unlikely the failure mode
or effect. A related goal of this new analysis was to “focus…on rare failures that may not have been
considered in the original certification requirements” because those failures had been considered
extremely improbable (ETEB Final Report, p. 2-8). The ETEB described its analytical approach as
follows:
The ETEB conducted a comprehensive and detailed failure modes and effects analysis (FMEA)
for the complete rudder control system…Preliminary hazard classifications were assigned to
each failure, based on the predicted severity and the ability of the flight crew to maintain control
of the airplane and conduct a safe landing. For all failures classified as “catastrophic (Class I)” or
“hazardous (Class II),” the ETEB conducted failure simulations using a detailed high-fidelity
simulation of the rudder control system. In addition, the ETEB conducted pilot-in-the-loop
failure simulations using a motion-base flight simulator. The purpose was to identify the impact
of the failures on the operation of the airplane following flight crew actions. The hazard
classifications of the failures were updated, based on the combined results from these two
simulation activities (ETEB Final Report, p. 2-7).
These tests and simulations were used to verify and validate the hazard levels that had preliminarily
been assigned to the failure modes. Because some failures and interventions had unexpected
consequences in the testing, the feedback from these verifications was extremely important and
influential in the final conclusions and recommendations of the ETEB. This demonstrates how an
FMEA that is open to feedback and change, either from testing or field experience, can provide much
better results than a one-time evaluation.
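The severity-first ranking and the feedback step described above can be sketched as a small data structure. This is an illustration under stated assumptions, not the ETEB's actual worksheet logic: the failure mode names, probabilities, and the `FailureMode`, `needs_simulation`, and `revise_class` names are hypothetical, and the hazard class is simply overwritten with whatever class the testing indicated.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FailureMode:
    name: str
    hazard_class: int    # 1 = Class I (catastrophic), 2 = Class II (hazardous), 3 = Class III (major)
    probability: float   # recorded, but deliberately not used for ranking


def needs_simulation(mode):
    """Class I and Class II failures proceed to simulation, regardless of probability."""
    return mode.hazard_class <= 2


def revise_class(mode, observed_class):
    """Return a copy of the worksheet entry carrying the class indicated by testing."""
    return replace(mode, hazard_class=observed_class)


# Hypothetical preliminary worksheet entries.
worksheet = [
    FailureMode("valve jam (hypothetical)", 1, 1e-9),
    FailureMode("sensor drift (hypothetical)", 3, 1e-3),
    FailureMode("linkage wear (hypothetical)", 2, 1e-6),
]

# Rank by severity alone: the rarest failure still heads the list.
ranked = sorted(worksheet, key=lambda m: m.hazard_class)
to_simulate = [m.name for m in worksheet if needs_simulation(m)]

# Feedback: testing shows the crew cannot reliably recover, so the class is raised.
revised = revise_class(worksheet[2], observed_class=1)

print([m.name for m in ranked])
print(to_simulate)
print(revised)
```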
The ETEB illustrated the verification and feedback built into the FMEA in the following figure from
its final report:

Source: ETEB Final Report, p. 2-6


The full range of hazard classifications followed standard FAA practice and was defined as follows
by the ETEB:

Source: ETEB Final Report p. 3-3


The ETEB used a standard adaptation of the FMEA analysis form (see table). It is interesting to note
how the form explicitly recognized the mitigating effects of flight crew actions in response to
equipment malfunctions (columns 5, 7, and 8).

Source: ETEB Final Report, p.3-2


Although the possibility of imperfect flight crew performance (a realistic expectation for human
intervention in a complex or stressful situation) was not explicitly modeled on the FMEA worksheet,
the ETEB accomplished this important extension to the basic FMEA by validating and revising
assumptions about the reliability of flight crew performance through its testing process. The ETEB
found that flight crews were not able to reliably intervene and mitigate the consequences of rudder
component failures in some operational circumstances, and these revised expectations were entered
into the final versions of the FMEA worksheets.
The following figure provides an excerpt of an actual FMEA worksheet. This worksheet includes a
finding of catastrophic severity for a failure effect that could not be mitigated:

Source: ETEB Final Report, appendix A, p. 95


Another useful extension that the ETEB added to the basic FMEA was the explicit consideration of
latent (preexisting, undetected) failures combined with active failures. Although FMEA is not
considered to be well-suited to the analysis of multiple failure modes, the ETEB was able to readily
analyze these sequential failure combinations by treating the latent and active failures as a single
combined failure mode for subsequent evaluation of the failure effects and consequences. This
manual extension of the FMEA method was effective for linked pairs of failures; I think that it
would have been very complicated to use this method to track and display triple or more complex
failure combinations, but such combinations were not required.
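The pairing technique, and the reason it becomes unwieldy beyond pairs, can be illustrated with a short sketch. The failure names are placeholders rather than entries from the ETEB report, and the function name `combined_failure_modes` is hypothetical.

```python
from itertools import product
from math import comb

# Placeholder failure lists (not taken from the ETEB report).
latent = ["latent A", "latent B", "latent C", "latent D"]
active = ["active X", "active Y", "active Z"]


def combined_failure_modes(latent_failures, active_failures):
    """Treat each latent/active pair as one combined failure mode for the worksheet."""
    return [f"{l} + {a}" for l, a in product(latent_failures, active_failures)]


pairs = combined_failure_modes(latent, active)
print(len(pairs), "pair-wise combined failure modes, e.g.", pairs[0])

# Two latent failures plus one active failure: the worksheet grows quickly.
triple_count = comb(len(latent), 2) * len(active)
print(triple_count, "combinations of two latent failures with one active failure")
```

Each combined entry can then be carried through the ordinary FMEA columns (effect, severity, mitigation) exactly as a single failure mode would be.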
The table that follows (from ETEB Final Report, p. 3-40) provides a sample of the new latent/active
failure combinations that the ETEB was able to identify and analyze using FMEA:

The FMEA undertaken by the ETEB was successful in identifying a large number of previously
unknown or unevaluated failure modes, several of which had the potential to result in catastrophic
consequences. The following are excerpted from the results presented by the ETEB in its final report:
The [Boeing] 737 rudder control system is susceptible to a number of:
• Failures and jams that can cause uncommanded rudder motion;
• Failures and jams that affect the operation of both the rudder main and standby power
control units (PCU), thereby defeating the independence of the two systems; and
• Latent failures.
These failure modes are single failures, single jams, or latent failures in combination with a detectable
failure or jam.
The rudder control system of the Initial and Classic Model 737s with the modifications required by
the applicable FAA [Airworthiness Directives]…have:
• 14 single failures and jams, and 12 latent failure combinations, that have Class I failure effects
in the takeoff and landing regimes. These same failure modes have 4 Class I effects and 22
Class III (major) effects in the rest of the flight envelope.
• 8 single failures and jams, and 11 latent failure combinations, that have Class II failure effects.
(ETEB Final Report, p. 1-3)
The ETEB drew strong conclusions about factors influencing the efficacy of human interventions to
mitigate rudder system failures:
The ETEB conducted 40 hours of pilot-in-the-loop rudder failure simulations with 10 pilot and co-
pilot flight crews from four airlines.
• In general, the flight crews found the existing Jammed or Restricted Rudder Emergency
Procedure difficult to use.
• The flight crews appeared to have received little training in the use of the Jammed or
Restricted Rudder Emergency Procedure or the Uncommanded Yaw or Roll Emergency
Procedure.
• The lack of a clear and unambiguous display of rudder position made it difficult for the
crews to diagnose uncommanded rudder deflections and take prompt corrective actions.
• Uncommanded rudder hardover deflections during takeoff and landing resulted in Class I
failure effects [i.e., human intervention was not reliably effective] (ETEB Final Report, p. 1-4).
The ETEB’s investigation of latent failure effects using extended FMEA methods resulted in a
conclusion that “there are several latent failures that, when combined with one additional single
failure or jam, result in Class I or Class II failure effects. There are insufficient inspections for these
latent failures” (ETEB Final Report, p. 1-5).
As I have indicated throughout, no FMEA can be considered complete unless it leads to the
mitigation of the unacceptable risks that the analysis identifies. The ETEB’s application of FMEA
resulted in the following recommendations for redesign of the rudder system:
Modify the Boeing Model 737 rudder control system to ensure that:
• No single failure or single jam of the rudder control system will cause uncommanded motion
of the rudder surface that results in a Class I failure effect;
• No combination of failures or jams will result in a Class I failure effect, except for those
combinations that are shown to be extremely improbable; and
• No probable single failure or jam will have an effect worse than Class IV.
In addition, The Boeing Company should consider providing a fail-safe rudder control
system design that provides protection from latent failures that contribute to a Class I failure
effect (ETEB Final Report, p. 1-6).
As a result of these recommendations (and the preceding accident investigation causal findings and
recommendations of the NTSB), the Boeing 737 rudder system has been redesigned to provide
reliable redundancy, and a major hardware retrofit program is underway for the entire fleet.
To mitigate risks pending completion of this fleet retrofit, the ETEB also provided the following
recommendations to improve the risk mitigation value of human (pilot and mechanic) interventions
following a rudder system failure:
• Revise and simplify the current “Jammed or Restricted Rudder” emergency procedure.
• Provide additional training to flight crews in the use of the “Jammed or Restricted Rudder”
emergency procedure and the related “Uncommanded Yaw or Roll” emergency procedure.
• Display rudder position to the flight crew.
• Alert flight crews and maintenance crews to the signs of rudder malfunctions, such as
uncommanded pedal motion (ETEB Final Report, p. 1-6).
These recommendations targeted at improving human performance have been partially implemented
by the aircraft manufacturer and FAA, from 2000 to present. Despite the limitations that remain in
human interventions, it is most significant, I believe, that the result of the FMEA performed by the
ETEB was to render the designers’ expectations for human performance, and the design’s reliance on
human intervention, much more consistent with realistic human capabilities and limitations. This
was a strong contributor to the accuracy and applicability of the FMEA’s results and its ability to
improve system safety.
In all, I believe that the ETEB process was a very successful example of the application of FMEA
extended with (1) top-down analysis (the program began with foreknowledge that the end-level
adverse event to eliminate or mitigate was flight control malfunction leading to loss of aircraft
control), (2) consideration of multiple (latent) failures, (3) realistic consideration of human
performance during interventions, and (4) feedback from external data sources to FMEA revision. In
the ETEB application, FMEA was not supplemented by data-driven analysis of conditional
probabilities; this was an appropriate, conservative response to the extremely rare/extremely
hazardous nature of the environment and threats.
The ETEB’s work shows how the basic FMEA combined with complementary extensions can form a
comprehensive safety analysis that results in real safety improvement. The excellent results of the
ETEB program are equally a testament, I think, to a strong effort to creatively re-think the failure
modes and effects for a system that had been thought to be completely well-understood and
thoroughly time-tested by 100 million hours of field use. This creativity and openness are necessary
ingredients for any successful analysis.

Conclusions about FMEA


Based on the foregoing review, I conclude the following about the Failure Modes and Effects
Analysis methodology:
• FMEA is a sound methodology for basic, structured risk management and quality
improvement analysis.
• The ideal approach can be to use FMEA as the backbone for analysis that also includes the
integration of complementary methods, as required; for example, it may be appropriate to
apply elements of FTA or PRA to understand and explore the proper scope of analysis, the
significance of failure effects, and the effectiveness of risk management interventions.
• Thoughtful application of FMEA can identify when these extensions are required and can
integrate and document the results of an extended analysis.
• The limited reliability of humans in complex systems argues for multiple, redundant,
independent interventions when relying on humans to detect failure modes or actively
intervene to mitigate failure effects.
• FMEA, as extended with appropriate top-down, probabilistic, and feedback methods, is an
excellent framework for risk management and quality improvement in the post-design/post-
manufacture (field distribution, application, or user) environment, including the human
performance aspects of this environment.
