
No Fault Found:

The Search for the


Root Cause
Other SAE books of interest

Integrated Vehicle Health Management:


The Technology
Edited by Ian K. Jennions
(Product Code: R-429)

Integrated Vehicle Health Management:


Implementation and Lessons Learned
Edited by Ian K. Jennions
(Product Code: R-438)

Counterfeit Electronic Parts


and Their Impact on Supply Chains
By Kirsten M. Koepsel
(Product Code: T-130)

For more information or to order a book, contact:


SAE INTERNATIONAL
400 Commonwealth Drive
Warrendale, PA 15096
Phone: +1.877.606.7323 (U.S. and Canada only)
or +1.724.776.4970 (outside U.S. and Canada)
Fax: +1.724.776.0790
Email: CustomerService@sae.org
Website: books.sae.org
No Fault Found:
The Search for the
Root Cause
By Samir Khan, Paul Phillips,
Christopher J. Hockley, and Ian K. Jennions

Warrendale, Pennsylvania, USA


400 Commonwealth Drive
Warrendale, PA 15096

E-mail: CustomerService@sae.org
Phone: +1.877.606.7323 (inside USA and Canada)
+1.724.776.4970 (outside USA)
Fax: +1.724.776.0790

Copyright © 2015 SAE International. All rights reserved.


No part of this publication may be reproduced, stored in a retrieval system, distributed,
or transmitted, in any form or by any means without the prior written permission of
SAE International. For permission and licensing requests, contact SAE Permissions, 400
Commonwealth Drive, Warrendale, PA 15096-0001 USA; email: copyright@sae.org;
phone: +1.724.772.4028; fax: +1.724.772.9765.

SAE Order Number R-441


http://dx.doi.org/10.4271/r-441


Information contained in this work has been obtained by SAE International from sources believed to be
reliable. However, neither SAE International nor its authors guarantee the accuracy or completeness of
any information published herein and neither SAE International nor its authors shall be responsible for
any errors, omissions, or damages arising out of use of this information. This work is published with the
understanding that SAE International and its authors are supplying information, but are not attempting
to render engineering or other professional services. If such services are required, the assistance of an
appropriate professional should be sought.

ISBN-Print 978-0-7680-8122-0
ISBN-PDF 978-0-7680-8227-2
ISBN-epub 978-0-7680-8229-6
ISBN-prc 978-0-7680-8228-9

To purchase bulk quantities, please contact:


SAE Customer Service
E-mail: CustomerService@sae.org
Phone: +1.877.606.7323 (inside USA and Canada)
+1.724.776.4970 (outside USA)
Fax: +1.724.776.0790

Visit the SAE International Bookstore at

BOOKS.SAE.ORG
Table of Contents
Acknowledgments..............................................................................ix
Chapter 1: Introduction........................................................................ 1
1.1 Background.................................................................................................................1
1.1.1 Maintenance and NFF–Historical Perspective.......................................3
1.1.2 The Growth of NFF within Aerospace ...................................................4
1.1.3 NFF Related Literature..............................................................................8
1.2 The NFF Phenomenon...............................................................................................9
1.3 The Cost of NFF.......................................................................................................13
1.4 Scope of This Book...................................................................................................14
1.5 References.................................................................................................................15

Chapter 2: Basics and Clarification of Terminology......................... 17


2.1 Introduction..............................................................................................................17
2.2 Systems Basics..........................................................................................................18
2.3 Failure and Types of Failure...................................................................................19
2.4 Fault and Types of Fault.........................................................................................20
2.5 Maintenance and Related Terms...........................................................................21
2.6 No Fault Found Terminology.................................................................................23
2.6.1 NFF Classification.....................................................................................23
2.6.2 Case Study–The Impact of Inconsistent Terminology.........................30
2.6.3 Other Related Terms.................................................................................31
2.7 Nomenclature...........................................................................................................33
2.8 Conclusion................................................................................................................36
2.9 References.................................................................................................................36

Chapter 3: The Human Influence...................................................... 39


3.1 Introduction..............................................................................................................39
3.2 The Human Element...............................................................................................40
3.2.1 Organizational Context............................................................................40
3.2.2 Communication........................................................................................42
3.2.3 Human Factors Impacting NFF..............................................................44
3.3 The Maintenance Engineer and System Interactions.........................................46
3.3.1 Typical Maintenance Processes in Civil Aircraft..................................46
3.3.2 Hardware Interactions.............................................................................47
3.3.3 Software Interactions...............................................................................48
3.3.4 Environment Interactions........................................................................49
3.4 Human Factors Survey...........................................................................................49
3.4.1 Introduction...............................................................................................49
3.4.2 Aircraft Testing Resources ......................................................................50
3.4.3 Aircraft Maintenance Manuals...............................................................52
3.4.4 Organizational Pressures.........................................................................53
3.4.5 Maintenance Engineer: Competence and Training..............................55

3.5 Best Practice Guidelines..........................................................................................58
3.6 Conclusion................................................................................................................59
3.7 References.................................................................................................................60
Chapter 4: Availability in Context.....................................................61
4.1 Introduction..............................................................................................................61
4.2 Aerospace Maintenance Practice...........................................................................62
4.3 The Quality of Maintenance Systems ..................................................................64
4.4 Design for Maintenance and System Effectiveness............................................66
4.5 Availability................................................................................................................67
4.5.1 The Multiple Facets of Availability........................................................67
4.5.2 Design Requirements for RAM..............................................................71
4.6 The Impact of NFF on Availability........................................................................73
4.7 A Process for Improvement....................................................................................77
4.7.1 Overview....................................................................................................77
4.7.2 A Methodology for Monitoring NFF In-Service..................................80
4.7.3 Unit Removal Datasheets........................................................................80
4.8 Conclusion................................................................................................................82
4.9 References.................................................................................................................83

Chapter 5: Safety Perceptions.......................................................... 85


5.1 Introduction..............................................................................................................85
5.2 Faults and Safety–Some Perceptions.....................................................................86
5.3 A Conceptual Discussion........................................................................................87
5.4 The Regulatory Issues in the Air Environment...................................................89
5.5 Faults and the Link with Maintenance Errors.....................................................91
5.5.1 The Maintenance Contribution...............................................................91
5.5.2 Operational Pressure................................................................................92
5.5.3 The Human Factors Contribution..........................................................93
5.5.4 Diagnostic Maintenance Success............................................................96
5.6 NFF and Air Safety–A Case Study........................................................................97
5.7 Conclusion................................................................................................................98
5.8 References.................................................................................................................99

Chapter 6: Operating Policies for Management Guidance............ 101


6.1 Introduction............................................................................................................101
6.2 Through-Life Engineering Services Context......................................................102
6.3 Policy Requirements..............................................................................................108
6.4 The NFF Control Process...................................................................................... 111
6.5 Application Example.............................................................................................122
6.5.1 Introduction.............................................................................................122
6.5.2 Implementation Prerequisites...............................................................122
6.5.3 Application..............................................................................................123
6.6 Conclusion..............................................................................................................125
6.7 References...............................................................................................................126

Chapter 7: A Benchmark Tool for NFF............................................ 127
7.1 Introduction............................................................................................................127
7.2 Benefits of NFF Management...............................................................................127
7.3 Challenges of Investigating NFF.........................................................................130
7.3.1 Technical Challenges..............................................................................131
7.3.2 Commercial Challenges.........................................................................131
7.4 A Proposed Tool for Managing NFF...................................................................132
7.4.1 The Benchmark Tool...............................................................................132
7.4.2 A NFF Maturity Model..........................................................................133
7.4.2.1 Maturity Model Scoring Matrix..........................................133
7.4.2.2 Mitigation Plan......................................................................142
7.4.2.3 Visual Capability...................................................................142
7.5 Deployment of the Tool.........................................................................................143
7.5.1 Stage 1.......................................................................................................143
7.5.2 Stage 2.......................................................................................................143
7.5.3 Stage 3.......................................................................................................144
7.5.4 Stage 4.......................................................................................................144
7.6 Summary of the Tool.............................................................................................144
7.7 References...............................................................................................................144

Chapter 8: Improving System and Diagnostic Design...................145


8.1 Introduction............................................................................................................145
8.2 Diagnostics Design and NFF................................................................................146
8.2.1 In-Service Feedback Activities..............................................................147
8.2.2 Diagnostic Design Activities.................................................................148
8.3 System Design and System Integrity..................................................................149
8.4 Testability................................................................................................................150
8.4.1 Testability Standards..............................................................................151
8.5 Design for Diagnosis.............................................................................................152
8.6 Information Feedback to Diagnostic Design.....................................................153
8.7 Level of Training....................................................................................................154
8.8 User-Interaction and System Design...................................................................155
8.9 Conclusion..............................................................................................................155
8.10 References...............................................................................................................156

Chapter 9: Technologies for Reducing No Fault Found.................159
9.1 Introduction............................................................................................................159
9.2 Advanced Diagnostics..........................................................................................160
9.2.1 Health and Usage Monitoring of Electrical Systems.........................160
9.2.2 Built-In Test..............................................................................................160
9.2.2.1 Enhanced Understanding of System/Fault Topology.....162
9.2.2.2 BIT Code Diagnostics...........................................................163
9.2.3 Monitoring and Reasoning of Failure Precursors..............................163
9.2.4 Monitoring Life-Cycle Loads................................................................165
9.3 Improvements to Testing Abilities......................................................................166
9.3.1 Testability as a Design Variable............................................................166
9.3.2 Functional and Integrity Testing..........................................................167
9.3.3 Testing Under Environmental Conditions..........................................169
9.3.4 Management of the Test Station ..........................................................170
9.3.5 Tracking Spare Part Units......................................................................171
9.4 Conclusion..............................................................................................................172
9.5 References...............................................................................................................173

Chapter 10: Summary and Ideas for Future Work.......................... 175


Index..................................................................................................181
About the Authors...........................................................................195

Acknowledgments
The authors wish to acknowledge the support and opportunity provided by the
EPSRC Centre for Innovative Manufacturing in Through-Life Engineering Services
in the preparation of this book.  The Centre was set up by the EPSRC in June 2011
as a partnership between Cranfield and Durham Universities with five main projects, research into NFF being one of them. Initial research established that there was very little knowledge and information available in the public domain about NFF. What information did exist seemed to be confined to specialist publications or readily available only to limited groups and industries. An early aspiration of the project therefore
was to redress this deficiency with a book that would provide a foundation in the
subject for engineers and managers alike. The research and knowledge gained
during the NFF project within the Centre has provided both the material, and
certainly the inspiration, for this book.

We are most grateful to the Centre Director, Professor Raj Roy, for his encouragement, and to SAE for the opportunity. In particular we would like to thank Monica Nogueira at SAE, who has helped us achieve our aspirations. We trust that you, the reader, will find this a useful book in your efforts to understand and, hopefully, reduce the occurrence of NFF in your area of responsibility.

Chapter 1
Introduction

1.1 Background
In today’s society we are all strongly dependent on the correct functioning of technical
systems, and this dependence has made us vulnerable to their failure. Any disruption,
due to degradation or anomalous behavior, is of major concern, not only to us (the user)
but also to the manufacturers, suppliers, operators, and maintainers of the equipment. It
can have adverse effects on safety, operation, and brand name, and it can directly reduce
the profitability of all elements of the value chain.

Such disruption can come in various forms. Slow degradation, giving enough time to
source and fit a new part, is a relatively minor inconvenience. A part that breaks (rapid
degradation) is somewhat more annoying, as the ability of the equipment to perform
its function is lost until the part is replaced. It is, nonetheless, tolerated, as the link
between cause and effect can be understood and easily remedied. Neither of these is
as troublesome as anomalous behavior, in which systems or subsystems do not act in
accordance with design intent. In this category, one can include emergent behaviors
in which, principally due to the complexity of subsystem interaction, a system can
demonstrate a response with action that was unintended. Another area in the same
category is that of faults where the root-cause diagnosis cannot be identified. In such
cases, a suspect component is replaced only for it to be found that the fault has not gone
away, and when the component is tested on a bench, it is found to be working normally.
This area has been given the name No Fault Found (NFF) and is the subject of this
book. When NFF occurs, other components are replaced until functionality returns,
without a clear idea as to why. Due to our inability to diagnose such problems, the cost incurred has, until recently, simply been accepted as part of “the cost of doing business.” However, with all companies now operating much more efficiently than one or two decades ago, this (large) overhead can no longer be tolerated, and the constituents of the problem have to be examined.


Figure 1.1 shows a generic operating environment that can be found across many different
sectors (e.g., aerospace, rail, and energy), which operate expensive, complex equipment. The
original equipment manufacturer (OEM) supplies an asset to an operator who is going to use
it as part of a business to make a profit. The operator needs the equipment to be regularly
maintained, and the maintainer will have access to the OEM’s supply chain for spare parts.
All of this is done with respect for standards and, in some industries, certification. This picture is much more complicated in real life, but it serves the purpose here of exposing all of the areas where NFF could have an effect, and that therefore have to be explored to reduce the (significant) cost of this effect. All of the areas shown are covered, to some degree, in this book.

[Figure 1.1 content, reconstructed from the original graphic:
Fleet: in-flight data; CMC messages; post-flight reports.
Maintenance & Operations, Logistics, and OEM (Design, Engineering, Manufacturing): maintenance scheduling; policies / procedures; tools / techniques; operational schedule; fault diagnosis; availability; test coverage; profit; manpower availability; turnaround time; repeat removals; asset tracking; training; cost of NFF (warranty); obsolescence; troubleshooting; safety; human factors; diagnostics; culture; spares supply; certification & testing; FMECAs; system test data; service contracts; reliability; maintainability; technical publications.
Supply Chain: parts provision; inventory management.
Standards: ARINC 429, etc.]

Figure 1.1 General operating environment.

Two aligned fields that deserve mention at this point, as they add complexity and richness
to Figure 1.1, are Product Service Systems (PSS) and Integrated Vehicle Health Management
(IVHM). PSS [1-1], or servitization [1-2], arose from some OEMs transforming their business
model from selling a product to selling a service. In the product scenario, income is derived
from the original sale, and future income is dependent on the sale of spare parts. In the
service scenario, a maintenance contract is sold at the same time as the asset, and hence a
steady monthly income is derived in return for effective maintenance; the OEM has become
the maintainer, captured more of the value chain, and assumed more of the NFF “cost of doing business.” IVHM [1-3] arose to better inform the OEMs of the behavior of their assets
in service. It provides data from sensors on the asset and processes it, via diagnostic or
prognostic algorithms, into actionable information. With the aim of improved fault isolation,
IVHM provides the vital signal as to the component degradation that Operations and
Maintenance need to begin their respective jobs.
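
As a toy illustration of that sensor-data-to-information flow (the function, threshold, and readings below are invented for illustration and are not an actual IVHM algorithm), a minimal diagnostic step might reduce raw readings to an actionable flag for Operations and Maintenance:

```python
def diagnose(vibration_readings, threshold=7.0):
    """Toy diagnostic step: reduce raw sensor data to actionable information (illustrative only)."""
    worst = max(vibration_readings)
    return {
        "degradation_suspected": worst > threshold,  # the actionable signal
        "worst_reading": worst,
    }

# Hypothetical on-asset vibration samples processed into maintenance-relevant information.
print(diagnose([3.2, 4.1, 8.6]))  # {'degradation_suspected': True, 'worst_reading': 8.6}
```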

1.1.1 Maintenance and NFF–Historical Perspective


To understand the role that NFF plays in today’s maintenance practices, it is helpful
to examine some of the historical developments over the years. Although a complete
chronology of the use of the term No Fault Found is beyond the scope of this text, an early
reference to NFF for a maintenance-related theme can be identified from 1954 [1-4]. Even
though the article was discussing a maintenance plan for airborne radio equipment, it made
note of the implications of unscheduled removals whose faults could not be confirmed
during bench testing:

“Item 1 includes instances in which malfunctioning was reported to aircrew or maintenance personnel, but could not be confirmed on test. There are four possible causes for such removals: (1) Erroneous reporting: i.e., no fault existed, (2) Wrong diagnosis: i.e., the fault was in a unit other than that was removed, (3) Incorrect or wrongly applied test procedures: i.e., the wrong tests were applied, or the right tests were incorrectly applied, or (4) Invalid tests: i.e., the prescribed tests do not disclose the fault.”

All of the four factors listed are still applicable in today’s systems, where the level of NFF has
certainly proven to be an issue that can throw any modern maintenance policy into disarray.

Figure 1.2 captures the natural evolution of maintenance systems since the 1950s, showing
the trend toward increasing component complexity and system integration, and hence
system maintenance requirements. To address these challenges, the discipline of system
engineering has been promoted to accommodate the development of interacting systems.
From a NFF point-of-view, this means that systems have become more intricate and
complicated, and any reported failures may require an even more sophisticated and rigorous
testing procedure to elicit reliable and dependable results.

Across industry, the aerospace sector has reported the major share of NFF failures, primarily within aircraft avionics, and hence the next section is devoted to NFFs in that sector. These avionics problems not only indicate the correlation between the increasing number of electronic components within modern systems and the NFF rate, but also demonstrate how, over time, a non-issue can develop into a strategic concern for an organization within a competitive environment.


[Figure 1.2 content, reconstructed from the original graphic: a timeline spanning roughly 1950 to 2010, charting increasing system complexity and the shift from maintenance as “fix-if-broke,” through maintenance affecting profit and maintenance driven by complex technical issues, to maintenance as a joint partnership. Milestones shown include preventive maintenance, productive engineering, increasing maintenance costs, systems reliability engineering, the establishment of the Maintenance Steering Group (MSG), the Boeing 747, US military adoption of RCM, MSG-3, most organizations adopting the MSG approach, increasing use of FMEA, equipment monitoring, life-cycle engineering, high demands of air travel, availability-based contracting, recession, ARINC 672, and the REMM methodology.]


   
Figure 1.2 Maintenance evolution over time.

1.1.2 The Growth of NFF within Aerospace


By the late 1960s, a much more competitive marketplace had developed, with an increasing
intolerance of downtime. Coupled with rising system complexities and maintenance
requirements, the civil aviation industry recognized that overhaul activities were not only
costly but were also failing to reach acceptable levels of safety. It was hence recognized
that business and management practices, which had worked well for smaller aircraft, were
simply inadequate for large, complex, high-value assets. During the development of the
Boeing 747, the aviation industry started to search for improved reliability, questioning
maintenance strategies and disregarding the long-established rule that old components are
“the most likely culprits.”

As a consequence, a Maintenance Steering Group (MSG), comprising aerospace manufacturers, operators, and the Federal Aviation Administration, was commissioned to analyze current practices. This initiative led to the development of reliability centered maintenance (RCM) concepts, with an emphasis on reliability improvements, cost reduction, and good maintenance practices [1-5]. Work continued from this on specific standards, introduced over the years, to improve system maintenance and reduce overall scheduling costs [1-6], [1-7]. These documents recognize NFF events as hidden failures. These are functional failures that do not become evident to the operating personnel under normal circumstances (i.e., the immediate overall operation of the system remains unaffected) (see Figure 1.3). If the functional failure is not hidden, then it is considered evident.

[Figure 1.3 content: a system block diagram in which a subsystem contains redundant Components A and B; the failure mode of Component B is hidden.]

Figure 1.3 Failure in only Component B will not affect the functionality of the overall system.
Components A and B must fail simultaneously for a functional failure to become evident. Such
redundant paths may not always have failure warnings for each component, and hence such
failures will remain hidden.
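
As a minimal sketch of the hidden-failure idea in Figure 1.3 (the status flags below are hypothetical), a redundant pair keeps delivering its function while only one path has failed, so the failure is not evident to the operator:

```python
# Minimal sketch of a hidden failure in a redundant pair (hypothetical status flags).
component_a_ok = True
component_b_ok = False   # Component B has failed

# The subsystem delivers its function if either redundant path still works.
subsystem_functional = component_a_ok or component_b_ok   # True

# The functional failure only becomes evident when both components have failed.
failure_evident = not subsystem_functional                # False

# With no per-component failure warning, B's failure therefore remains hidden
# until Component A also fails.
print(subsystem_functional, failure_evident)              # True False
```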

For modern industry, however, NFF has been an area in which huge sums of money can be
wasted through increasingly high levels of unscheduled removals. As mentioned, this is due
to the continued trend in increasing system complexities and the resulting cost implications
on maintenance programs. Additionally, business culture within civil airlines has changed,
with a stronger desire to reduce costs while increasing the availability of their aircraft.
Recognizing the lack of any NFF standards or procedures, these organizations adjusted their practices according to their own understanding and requirements. This also meant that each organization started using its own terminology to describe NFF, a vocabulary that continues to grow (see chapter 2 on Terminology). The lack of a single, common, descriptive standard term also indicated that NFF was not well understood, and hence led to confusion from both technical and nontechnical standpoints. These developments will be traced more completely
in later chapters of this book.

Work published throughout the 1980s and 1990s [1-8], [1-9], [1-10] indicates the demand for
greater system reliability, availability, and maintainability at lower cost. This also created
awareness of new failure processes, improvements in management practices, and new
technologies that could improve the understanding of system, subsystem, and component
level health. As maintenance, reliability, and risk became more significant for design
engineers, environmental and safety issues became paramount through legislation [1-11],
[1-12], and hence maintenance procedures had to be refined. This led to advancements
in a number of maintenance and health-monitoring techniques, including condition
monitoring, quality standards, expert systems, and reliability centered maintenance, to
name but a few. Many aircraft were being used well beyond the years envisioned by their design, which created serious concerns about maintaining airworthiness and safety.
Increasing demands for air travel promoted various safety studies and thus established the
requirements for upgrading existing systems while accommodating advances in avionics at
reduced costs. Given the growing total cost of ownership, tight maintenance budgets, and
attempts to remain competitive, organizations were looking to maximize the return on their contracts rather than enhance their maintenance practices. This period witnessed a rise in
the number of unscheduled removals. Commercial contracts did not acknowledge NFF as an issue, and no mechanisms were put in place to calculate its true costs. With no defined metrics or responsibility, NFF continued to waste resources and time, adding to maintenance costs, downtime, and unavailability of systems.

The continued increase in avionic complexity led to an increase in the NFF rate. Practitioners
realized that the issue was affecting their reputation (and relationships within the supply
chain) and hence required a solution. In 1998, a Reliability Enhancement Methodology and
Modeling (REMM) project was initiated to support the reliability enhancement of electronic
systems. The project included a statistical model to facilitate the assessment of reliability throughout the product life cycle [1-13], [1-14]. One of the major deliverables was to investigate
the root cause of NFF events within the aerospace industry and provide a comprehensive
breakdown of potential reasons for these. This list included:

• Operator policies (e.g., short turnaround times, availability of spares, aircrew mission
priorities)
• Failure recording / reporting (e.g., quality of aircrew debrief, poor data coding)
• Maintenance practice (e.g., training of maintainers, accuracy of technical publications)
• Repeat removals (e.g., minimal use of maintenance history)
• Workshop effectiveness (e.g., pressure to produce throughput, staff training)
• Test coverage (e.g., test philosophy across maintenance levels, comprehensiveness of test)
• Interpretation of results (e.g., fault code interpretation, training of workshop staff)
• Intermittent system connection (e.g., connector integrity, harness / loom integrity)
• Product containing intermittent fault (e.g., solder joints, PCB weakness)
• System design (e.g., built-in test equipment—BITE—coverage, software tolerances)
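
To make a breakdown like the one above actionable, an organization might encode the categories so that individual NFF events can be tagged and trended. The sketch below is illustrative only; the enumeration and the example event record are hypothetical and not part of the REMM deliverable:

```python
from enum import Enum

class NffRootCauseArea(Enum):
    """Candidate NFF root-cause areas, following the breakdown listed above."""
    OPERATOR_POLICY = "Operator policies (turnaround times, spares, mission priorities)"
    FAILURE_REPORTING = "Failure recording / reporting (debrief quality, data coding)"
    MAINTENANCE_PRACTICE = "Maintenance practice (training, technical publications)"
    REPEAT_REMOVALS = "Repeat removals (minimal use of maintenance history)"
    WORKSHOP_EFFECTIVENESS = "Workshop effectiveness (throughput pressure, staff training)"
    TEST_COVERAGE = "Test coverage (test philosophy, comprehensiveness of test)"
    RESULT_INTERPRETATION = "Interpretation of results (fault codes, workshop training)"
    INTERMITTENT_CONNECTION = "Intermittent system connection (connectors, harness / loom)"
    INTERMITTENT_PRODUCT = "Product containing intermittent fault (solder joints, PCB weakness)"
    SYSTEM_DESIGN = "System design (BITE coverage, software tolerances)"

# Hypothetical usage: tag an NFF event so suspected contributors can be trended over time.
event = {
    "unit": "example-LRU-001",
    "reason_for_removal": "spurious fault indication",
    "suspected_areas": [NffRootCauseArea.TEST_COVERAGE,
                        NffRootCauseArea.INTERMITTENT_CONNECTION],
}
print([area.name for area in event["suspected_areas"]])
```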

Much of this work, which was never openly published, emphasized the importance of
developing practical guidance for system designers to facilitate reducing NFFs in both
current and future products. With continuously evolving reliability requirements, and
contract types in the commercial world, it seemed that guaranteeing a fixed failure rate,
for high-value products, may not satisfy end user objectives and government regulations,
especially if these products suffer from a high number of NFF events.

In 2008, in a step fundamental to later advances on NFF, a set of procedures was introduced in the form of the ARINC 672 Report [1-15]. This report was directly aimed at providing a basis
for a structured process to address NFF within the aviation industry. It also introduced a
definition of NFF:

“Removal of equipment from service for reasons that cannot be verified by the maintenance process (shop or elsewhere).”


This document highlights criteria for decision making regarding root causes, and
describes the importance of taking management action at an early stage of the component
repair cycle. It further discusses the means of reducing cost by avoiding unnecessary unit
removals from the aircraft, while placing an emphasis on the requirements of the design
and production stages. The overall aim is to develop a much more fault-tolerant system
with reduced NFF rates. The document further outlines the importance of establishing
cross disciplinary features and design solutions, which can be applied to any engineering
design or application. This involves multi-disciplinary modeling of the root causes of
NFF achieved by understanding and modeling electronic, mechanical, and software
interactions. Such an approach could potentially lead toward the development of design
guides (or handbooks) that cover processes that must be followed to eliminate any root
cause of NFF at the design stage. Of course, these need to cover accurate fault models, fault trees, system understanding (to aid in recognizing false built-in test—BIT—alarms, such as those caused by sensors), and system synchronization problems (allowing root causes of BIT deficiencies to be identified). Furthermore, to complement any design guidance and rule sets, solutions are
required to link service experience with design knowledge to generate an official guidance
standard, which will reduce NFF occurrences throughout a system’s life cycle. These must
be evaluated through a series of practical case evaluations within collaborating companies
and through expert judgement.

As a final comment on history, noting the developments that did not occur can be useful.
These missing elements include topics on human factors and industrial standards, which are
both important to ensure correct identification, reporting, and mitigation of related issues.
The earliest documented call for NFF standardization can be traced back to 1987 [1-16] when
research took place into testability attributes of electronic equipment, specifically to mitigate
NFF. This, however, is yet to be achieved across all test / maintenance levels. In fact, recent
research shows that the major drivers that still hinder the diagnostic process are the lack of
standards, insufficient information clarity, and inappropriate use of taxonomies.

Human factors seem to have received renewed emphasis in the mid-2000s with the
publication of related literature that details how the lack of training and experience during
the troubleshooting process affects NFF events [1-17], [1-18], [1-19]. Past studies, based on
anecdotal evidence on the topic, often highlighted that the dominant factors were “soft” in nature, such as communication and training. This will be addressed in more detail in
chapter 3 (Human Influence).

In conclusion, NFF issues still exist and continue to pose significant problems to the efficient
maintenance and repair of aircraft. The subject area of NFF is not only being tackled within
the organizations that are living with its effects but has also seen dedicated research efforts
from a wide range of international academic institutions. Comprehensive international
standards that deal with the NFF issue are lacking, and inconsistent terminology further
complicates matters. It would, therefore, be useful to work toward harmonizing generic
maintenance-related standards that use a common terminology and framework. It appears
that ARINC 672 has been a positive development in this direction.


1.1.3 NFF Related Literature


By analyzing the academic peer-reviewed literature, it is possible to evaluate research efforts in the field of NFF and to categorize the relevant published articles based on the orientation of their NFF-related content [1-20]. The following four categories have been identified as being the most influential topic areas:

• Fault diagnostics—includes research into sensors, testing, troubleshooting, fault isolation manuals, built-in-tests, and environmental testing
• System design—includes hardware and software design, operational feedback, key
performance indicators, benchmarking, and cost trade-off studies
• Human factors—includes communication, training and education, correct equipment
usage, warranty claims, and accountability
• Data management—includes data trending, e-logs, and data fusion / mining

In total, 154 published papers for the period 1990–2014 were identified as being specific to No
Fault Found; these included 38 published journal articles that are primarily concerned with
NFF, 84 related conference papers, and an additional 32 secondary journal publications that
were cited in the NFF specific literature. Focusing on the primary 38 NFF journal articles,
Figure 1.4 illustrates an increasing trend in the total number of publications, with all four
categories showing a similar increase.

Figure 1.4 Classification of NFF related journal publications since 1990.


The increase in NFF literature is indicative of a growing awareness of the problem, which is
rooted in the need to:

• Improve the availability of equipment and vehicles


• Reduce the turnaround times for passenger aircraft
• Provide an efficient and cost effective maintenance service
• Mitigate the cost impact of NFF warranty claims, particularly in the consumer electronics and automotive industries

More specifically, three driving forces are at work here, the first of which relates to the
increasingly competitive business market. Cost savings have, therefore, had to be found,
and tackling wasteful and inefficient maintenance has become a prime target for cost
reduction. NFF has emerged as one of the (large) hidden costs, and must be addressed.
Likewise, government spending cuts have significantly downsized both armed forces
personnel and the purchasing of spare parts. This forces maintenance activities to
modernize and become smarter. The second driving force is the increase in contracting
for availability. Here, maintenance and repair is the responsibility of a third party who
guarantees the customer that they will have a specific availability for their equipment
or vehicle. NFF plays an important role in contracting for availability, as the commercial contracts must define who is responsible for NFF costs such as retest, recertification, etc. The final driving factor is the increased complexity of engineering systems coupled with a reduction in the skill-set of the relevant maintenance personnel. This has increased the complexity of the diagnostic task and has influenced the push toward automated diagnostic systems to remove human error.

An interesting conclusion is that over recent years, despite an acknowledgement of the


negative cost impact, the bulk of the interest appears to be related to system reliability,
diagnostic design, and maintenance testing. Business-oriented or cost-related publications
are few.

1.2 The NFF Phenomenon


The use of the term No Fault Found varies significantly across industries. It is used across
multiple sectors to describe the scenario whereby a symptom of a fault is observed, and yet
one or both of the following are true:

• Subsequent tests did not recreate the same symptom.


• Inspections revealed no existing condition that is known to correlate with the reported
symptom.

Figure 1.5 is a schematic of what is known as a traditional approach to the classification


of fault investigation findings. A line replaceable unit (LRU) such as an engine is removed
from an operational aircraft, and as a result of troubleshooting, the symptom of a fault is
identified. The LRU is returned to its maintenance provider (i.e., a Maintenance, Repair, and Overhaul organization, MRO), along with a description of its reason for removal (RFR). The maintainer then conducts an inspection and/or test of the LRU and classifies it based on the findings
as either Fault Confirmed or No Fault Found, the descriptions of which are:


Figure 1.5 Traditional classification of fault investigation findings.

Fault Confirmed: A fault is confirmed within the LRU, which may be either hardware
or software, and that fault is known to cause the unit’s removal symptom. Colloquially,
therefore, the LRU is termed as having “symptom to cause.”

No Fault Found (NFF): No fault was identified with the LRU that is known to cause the
unit’s removal symptom. The LRU may be found to exhibit faults upon return to the MRO,
yet if none of these faults are known to correlate with the unit’s removal symptom, then
(rightly) they are reported as secondary faults, and the unit is classified as NFF.

This traditional view does not capture the reality of the classification problem within a
complex maintenance program. It was challenged by considering the case in which there may have been a wrong removal and the fault symptoms remained on the aircraft [1-14]. An
adaptation of the framework resulted and is given in Figure 1.6. Here a new category of NFF
is introduced: NFF–Confirmed Not LRU.

NFF–Confirmed Not LRU: The removed LRU is deduced to be not faulty with respect to its
reason for removal. The assertion is made following a logical analysis of data pertaining to
its removal.

In essence, a unit removal can be placed into the NFF–Confirmed Not LRU category
typically for one of two reasons, these being:

• The fault symptom stayed on the aircraft after the LRU removal.
• The subject LRU was one of multiple LRUs that were removed in the same
troubleshooting instance and for the same symptom, and one of the other units (i.e., not
the subject LRU), when tested by the maintenance facility, was found to exhibit a fault
that correlates with the symptom.


Figure 1.6 Enhanced NFF classification framework.

Logically, if two or more LRUs were removed in the same troubleshooting instance and
for the same symptom, and if one of those units was found upon inspection and/or test
to exhibit a fault that correlates with the symptom, then the remaining unit(s) should be
considered serviceable with respect to the reason for which they were removed. These units
are classified as NFF–Confirmed Not LRU.
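
A minimal sketch of the classification logic described above, assuming hypothetical record fields (the function and field names are illustrative and not drawn from ARINC 672 or any particular maintenance system): a removal is Fault Confirmed when a correlating fault is found in the shop, can be reclassified as NFF–Confirmed Not LRU when the symptom persisted on the aircraft or a co-removed unit explained it, and otherwise remains NFF.

```python
def classify_removal(correlating_fault_found: bool,
                     symptom_persisted_on_aircraft: bool,
                     co_removed_unit_explained_symptom: bool) -> str:
    """Classify a unit removal per the enhanced NFF framework (illustrative only).

    correlating_fault_found: shop inspection/test found a fault known to cause
        the reason for removal (secondary faults do not count).
    symptom_persisted_on_aircraft: the original symptom remained after removal.
    co_removed_unit_explained_symptom: another LRU removed in the same
        troubleshooting instance was confirmed to carry the correlating fault.
    """
    if correlating_fault_found:
        return "Fault Confirmed"
    if symptom_persisted_on_aircraft or co_removed_unit_explained_symptom:
        # Logical evidence that the removed unit was not the culprit.
        return "NFF - Confirmed Not LRU"
    # No correlating fault and no exonerating evidence: the unit stays NFF.
    return "No Fault Found"

# Example: two LRUs removed for the same symptom; the other unit was confirmed faulty.
print(classify_removal(correlating_fault_found=False,
                       symptom_persisted_on_aircraft=False,
                       co_removed_unit_explained_symptom=True))
# -> NFF - Confirmed Not LRU
```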

Building from the above descriptions, two further key scenarios are associated with NFF;
they are termed NFF–Faulty and NFF–Not Faulty, as shown in Figure 1.7 [1-21]. These
scenarios are very different and require fundamentally different treatments. The NFF–
Faulty scenario is the situation in which the LRU is actually faulty with respect to its reason
for removal (the on-wing troubleshooting had been successful). If the unit is returned to
service without this fault being addressed, it will then go on to exhibit the same symptoms
that resulted in its original removal—in other words, it will become a “Repeat Offender.”
Examples of this category will mostly occur where components exhibit a fault in the
environment where they are used because of extraneous forces. Vibration or temperature
may cause the unit to register as being faulty on-wing, and be removed for this reason, but
may also cause the fault to be not readily diagnosable on the bench.


Figure 1.7 Further NFF classifications.

However, perhaps the fault was revealed during inspection and/or test but the engineer
either did not have sufficient system knowledge or sufficiently detailed symptom data
to correctly associate “symptom-to-cause.” What was then in actuality the Primary Fault
is instead recorded as Secondary. A variation on this theme is the situation in which the
fault is not identified, yet it is unknowingly rectified. An example of this may be loose unit
connections causing electrical intermittent faults that are inadvertently retightened during
maintenance. In both of these cases, the NFF–Faulty units will not repeat offend.

The NFF–Not Faulty scenario is the situation in which the subject unit does not have a
fault that relates to the removal symptom. The troubleshooting process has, therefore, been
unsuccessful due to reasons such as the following:

• The fault may be within another removed LRU.


• The fault may be within another unit still on-wing and with risk of reoccurrence.
• The action of changing the LRU re-seated an intermittent connection, and the cause of
the on-wing symptom has been addressed for a period.
• Perhaps the symptom was related to a system issue (e.g., a result of the condition or state of two particular units used in combination; when these are separated, the symptom cause is nulled).

As previously discussed, with knowledge of findings on other removed LRUs and scrutiny
of the original aircraft for persistence of the symptom, it may be possible to translate a unit
from the NFF–Not Faulty scenario into NFF–Confirmed Not LRU. In the absence of such
data and a process to review it, the unit will stay classified as NFF.

1.3 The Cost of NFF

NFF poses problems and a financial burden to almost everyone who is involved with through-life support service, from the operators and customers, to the manufacturers and their suppliers. The sources of cost incurred by business because of NFF are captured in Figure 1.8. These lists are an attempt to be inclusive but, with very diverse business models and sectors, other sources may well emerge.

[Figure 1.8 content, reconstructed from the original graphic:
Operations and Maintenance: lost man hours; maintenance cost; warranty cover (dependent on contracts, cost of non-conformance); production cost; machine unavailability; intangible cost (e.g., loss of future business); safety; cost of advance tests.
Original Equipment Manufacturer: capital expenditure; inventory maintenance; obsolescence cost; repair cost; safety.
Supply Chain: intangible cost (e.g., losses in productivity, customer goodwill); packaging and handling cost; machine downtime; transportation cost; safety.
Stakeholder: intangible cost (e.g., reputation); warranty cover (dependent on contracts); cost of in-tolerance failures; system interface training; safety.]

Figure 1.8 Sources of NFF cost throughout the business.

The direct costs of parts and labor can easily be captured, but other major impacts upon
business costs (often hidden) are not easily quantifiable. These include costs incurred within the supply chain and in maintenance performance, as well as indirect effects such as customer perception and futile maintenance effort. Customers fall into two categories: those that maintain their own fleet of aircraft, ships, or other vehicles, and those that subcontract their fleet maintenance (either completely or partially). NFF events will impose
a burden on both of their maintenance operations, leading to financial implications due to
increased downtime of the equipment and additional supply chain costs. A reduction in the
overall operational availability will also occur, depending on the reliability, maintainability,
and logistical factors, all of which contribute to the cost of resolving a NFF. The costs
involved with NFF issues can often be quantified by measuring the proportion of the repair
budget that is spent or “wasted” on the maintenance activities involved in locating the root
cause of the NFF.

Figures published by the Air Transport Association (ATA) in 1997 estimated annual NFF
costs for an airline operating 200 aircraft at $20M, or $100,000 per aircraft per year. It is
likely that a similar figure is true for today’s airline industry, even though such a figure
is not currently available. Other studies show that some 4500 NFF events were costing ATA member airlines $100M annually. Recent efforts within the United States Air Force
to mitigate NFF focused on tackling individual avionics equipment, such as the Modular Low Power Radio Frequency (MLPRF) unit for the F-16. It was found that in excess of $2M in
maintenance costs were being incurred annually for just this one unit at the maintenance
depots [1-22]. Boeing’s 787 Dreamliner has recently raised safety concerns after overheating
batteries caught fire while the aircraft was parked at Boston’s Logan International Airport.
Investigations indicated a number of potential causes and faulty components for the fire,
with each case ending in a NFF [1-23]. Businesses experienced a direct knock-on effect, as
many airlines had to ground their aircraft due to safety concerns. Analysts estimated that, while these aircraft were out of service, Boeing lost $393M, with a resulting impact upon its production line and future deliveries. This issue probably cost hundreds
of millions of dollars on its own, as airlines are likely to seek financial compensation for
their delays and loss of service. Such high costs provide the incentive to tackle the NFF
problem, but the underlying reasons must be understood and separately resolved in
each organization.
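
The published figures above reduce to simple per-aircraft and per-event estimates; the short sketch below just reproduces that arithmetic, using the 1997 ATA figures quoted in the text (the variable names are hypothetical and the result is not current data).

```python
# Worked arithmetic from the figures quoted above (1997 ATA estimates).
annual_nff_cost_per_airline = 20_000_000   # $20M for an airline operating 200 aircraft
fleet_size = 200
cost_per_aircraft_per_year = annual_nff_cost_per_airline / fleet_size
print(f"${cost_per_aircraft_per_year:,.0f} per aircraft per year")   # $100,000

# Separate study: some 4500 NFF events costing ATA member airlines $100M annually.
events_per_year = 4500
total_member_cost = 100_000_000
approx_cost_per_event = total_member_cost / events_per_year
print(f"~${approx_cost_per_event:,.0f} per NFF event")               # ~$22,222
```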

To add to these very evident costs, organization and contract anomalies contribute further
to the problem. It has become clear during the course of recent discussions and conferences
that the influence of NFF on the maintenance regime and system availability is evident to
maintenance managers, as they are responsible for spares and manpower provisioning. It
is not so evident to the maintenance engineers who are just doing the next job, or to the
level above the maintenance manager, as the metrics to measure such problems are not in
place. This is further compounded because not many contracts envisage NFF issues, let
alone define who pays for them. Since effective maintenance management is paramount in
the resolution and reduction of such events, contractual obligations must be recognized as a
vital phase in the need to improve supporting actions and budgeting for NFF reduction.

1.4 Scope of This Book


As stated earlier, this book will cover most of the areas shown in Figure 1.1, with the
exception of cost, which was covered briefly in the last section. With such a new subject, the
book aims to bring Systems Engineering and Quality Management procedures to bear on
NFF and illustrate their use with a number of use cases.

With the plethora of definitions and terminology used in the NFF area, chapter 2 deals
with terminology, and aims to give a consistent set of definitions for the book, something
that will hopefully be embraced by the community. This is followed by chapter 3 on the
importance of the human in the loop. Sometimes engineers or technologists forget that a
piece of equipment or a technology is not, in itself, the solution to a problem. It is only when
the humans that operate the system buy in, and totally understand and accept new methods,
that success is achieved.

Next, the book tackles NFF and systems effectiveness, looking at its component parts of
availability and safety in chapters 4 and 5, respectively. The ideas expressed in chapter 5
are, perhaps, the most speculative in the book but aim to give current thinking in regard to
NFF and safety.


The book then moves on to aspects of management. Operating policies are discussed in
chapter 6 before chapter 7 proposes a benchmarking tool for NFF. The use of this tool will
allow organizations to see at what level they are addressing the NFF problem and what is
involved in moving to the next level.

The complexity brought about by modern embedded electronics poses unprecedented challenges in maintenance and repair, threatening customer satisfaction and causing increased warranty costs. This, of course, has an impact on the testability of an item and may result in incorrect fault detection, and hence contribute to the NFF phenomenon. In light of this, chapter 8 looks at system and diagnostic design, and chapter 9 examines the technologies that can improve diagnosis.

The book concludes with chapter 10, which gives an overall summary and thoughts on
future direction.

1.5 References
1-1. Tukker, A., and U. Tischner, eds. New Business for Old Europe: Product Service
Development, Competitiveness and Sustainability. Sheffield, UK: Greenleaf
Publishing, 2006.

1-2. Vandermerwe, S., and J. Rada. “Servitization of Business: Adding Value by Adding Services.” European Management Journal 6, no. 4 (1988): 314–324.

1-3. Jennions, I. K., ed. Integrated Vehicle Health Management–Perspectives on an Emerging Field, ISBN 978-0-7680-6432-2. Warrendale, PA: SAE International, 2011.

1-4. Bushby, T. R. W. “A Maintenance Plan for Airborne Radio Equipment.” In Transactions of the IRE Professional Group on Aeronautical and Navigational Electronics 3, 2–7. 1954.

1-5. Kinnison, H. A., and T. Siddiqui. “Part 1: Fundamentals of Maintenance.” In Aviation Maintenance Management, 15–32. McGraw-Hill, 2012.

1-6. IEC 60300-3-11. Dependability management—Part 3-11: Application guide—Reliability centred maintenance. 2009.

1-7. SAE JA1012. A guide to the reliability-centered maintenance (RCM) standard. Warrendale, PA: SAE International, 2011.

1-8. Simpson, W., B. Kelly, and A. Gilreath. Predictors of organizational level testability
attributes, Publication 1511-02-2-4179. Annapolis, Maryland: ARINC Research
Corporation, 1986.

1-9. Morris, N. M., and W. B. Rouse. “Review and evaluation of empirical research in
troubleshooting.” Human Factors: The Journal of the Human Factors and Ergonomics
Society 27, no. 5 (1985): 503–530.


1-10. CENELEC, EN50126 Railway applications. The specification and demonstration of reliability, availability, maintainability and safety (RAMS). 1999.

1-11. Allan, R. Air Navigation: The Order and the Regulations. Civil Aviation Authority, 2003.

1-12. Kinnison, H. A., and T. Siddiqui. “Chapter 19, Maintenance Safety.” In Aviation
Maintenance Management, 237. McGraw-Hill, 2012.

1-13. James, I., J. Marshall, M. Evans, and B. Newman. “Reliability metrics and the REMM
model.” In Proceedings of annual symposium—reliability and maintainability RAMS
(iD:1), 474–9. 2004.

1-14. James, I., D. Lumbard, I. Willis, and J. Goble. “Investigating no fault found in the
aerospace industry.” In Reliability and Maintainability Symposium, 441–446. IEEE, 2003.

1-15. ARINC Working Group 672. Guidelines for the reduction of no fault found (NFF):
ARINC. 2008.

1-16. Simpson, W. R., A. E. Gilreath, and B. A. Kelley. Predictors of organizational-level
testability attributes (No. 1511-02-2-4179). Annapolis, MD: ARINC Research Corp., 1987.

1-17. Murphy, D. M., and M. E. Paté‐Cornell. “The SAM framework: Modeling the effects
of management factors on human behavior in risk analysis.” Risk Analysis 16, no. 4
(1996): 501–515.

1-18. Tustin, W. “NFF or No Fault Found.” Test Engineering and Management 68,
no. 1 (2006): 30.

1-19. Sauer, J., G. R. J. Hockey, and D. G. Wastell. “Effects of training on short- and long-
term skill retention in a complex multiple-task environment.” Ergonomics 43, no. 12
(2000): 2043–2064.

1-20. Khan, S., P. Phillips, I. Jennions, and C. Hockley. “No Fault Found events in
maintenance engineering Part 1: Current trends, implications and organizational
practices.” Reliability Engineering & System Safety 123 (2014): 183–195.

1-21. Neal, J. “An investigation of NFF unit removals in civil aerospace,” MSc Thesis.
Cranfield University, 2013.

1-22. Söderholm, P. “A system view of the No Fault Found (NFF) phenomenon.” Reliability
Engineering & System Safety 92, no. 1 (2007): 1–14.

1-23. Denning, S. “What went wrong at Boeing.” Strategy & Leadership 41, no. 3 (2013): 36–41.

Chapter 2
Basics and Clarification
of Terminology

2.1 Introduction
Any problem-solving process starts off by describing the problem. This enables the
understanding required to implement a complete solution that can ensure that the
problem does not occur again. When describing an engineering problem, it is important
to first recognize existing terminology and the context in which it is used. This is
because some terms are often used in various ways, leading to confusion and ambiguity
within the community. The case with NFF is similar, and clarifying some of its concepts
is surprisingly difficult. Determining a description or consequence of NFF events is
a subtle process; current literature on NFF is inconsistent and does not
always consider the entire scope of the topic.

Over the past two decades, many attempts have been made to define NFF problems;
however, these descriptions vary from business to business–some appear to have a
structured process while others take up a more flexible attitude by accepting it as part of
maintenance. A move toward formal discussion and investigation is necessary
to enable a common taxonomy within the subject area.

This chapter is an attempt to document the consensus view on concepts within NFF.
The chapter not only will clarify the use of NFF terminology but also will help facilitate
technical discussions later in this book. It presents a basic description that allows
various concepts to be characterized, and lists common terms (in a bold typeface) while
documenting some variations related to systems, failures, and maintenance.


2.2 Systems Basics


Many definitions of a system can be found in the literature, some of which have been listed
in Table 2.1. They all indicate that a system is a combination of entities that interact with
each other. These entities can include physical, behavioral, or symbolic elements that are
linked together to exchange information. A system has a hierarchical structure, and it can be
composed of, or decomposed into, several individual entities that can implement a function,
or deliver a service.

Table 2.1 Definitions of a System


A system may be viewed as a combination of interacting elements organized to achieve one or more stated purposes [2-1]
A system is a network of interdependent components that work together to try to accomplish the aim of the system [2-2]
A system is a deterministic entity comprising an interacting collection of discrete elements [2-3]

The function of a system is what a system does. The service of a system is what it is intended
to deliver. These requirements are described within the functional specification, which
contains the technical requirements for items, materials, or services, together with
benchmarks that determine whether the requirements have been fulfilled. Functions and services can have
inherent requirements from regulatory bodies, business, and the economy.

Stakeholders make up a group who can affect, or be affected by, the achievement of the
system requirements. The behavior of a system is what the system does to implement these
requirements when it is being delivered to its stakeholders.

The number of interacting entities within a system is an indication of the complexity of that
system. Highly complex systems can be perceived as being complicated because of their
behaviors, which can be attributed to one or more of the following characteristics:

• A large number of entities


• A large number of links between entities
• Nonlinearities and discontinuous links
• Emergent behaviors
• Uncertain characteristics of entities and their links

The integrity of a complex engineering system can be described as its mandated
operational and technical attributes. These are associated with the design, assurance,
and verification functions, which have been prescribed within the system requirements.
Integrity demonstrates the system’s ability to function correctly without being degraded by
environmental variations, upgrades, and maintenance activities.

The system life cycle, in systems engineering, covers all phases of the system’s life,
including system conception, design and development, implementation, integration,
distribution, operation, maintenance and support, retirement, phase-out, and disposal [2-4].


2.3 Failure and Types of Failure


Refinement of descriptions is important in functional specifications. This helps differentiate
between terms such as defects, malfunctions, failures, and faults, which can be classified by
factors such as cause, degree, type, etc.

A failure is defined as the state or condition of not meeting a desired or intended objective
(e.g., a service stops performing its required operation). Describing different types of
failures implies the existence of a classification. Table 2.2 gives an indication of the different
classifications of failures.

Table 2.2 Examples of Failures and Their Classification


By cause: Human (misuse, maintenance induced); Inherent – manufacturing flaws; Wear-out – aging
By time: Sudden termination of function; Gradual degradation
By degree: Intermittent; Partial
By type: Critical; Major; Minor

Failure symptoms can be good health indicators of an impending failure, as they often
manifest themselves in various ways before the event has taken place. Pre-failure event
symptoms present the opportunity to detect any early changes in system performance that
can affect system integrity. These are measurable performance indicators that compare
existing conditions with ideal ones. Post-failure events are used to describe the influence
of a failure after it has occurred. Their different types and implications are summarized
in Table 2.3.

Table 2.3 Post-Failure Events


Failure Effects: The immediate impact of the event (e.g., equipment ceasing to work)
Failure Mechanism: The process(es) that result in the event (e.g., corrosion of components due to operation in a marine environment)
Failure Cause: The reason(s) that result in the event (e.g., bad design)
Failure Consequence: The impact of the event (e.g., power station cannot produce electricity)
Failure Criticality: The magnitude of the failure consequence that can be used for evaluating the severity of the event to give a risk ranking (the combination of consequence and likelihood of an event)


2.4 Fault and Types of Fault


A fault is the inability to function or incorrect functioning of an entity. It can be an inherent
weakness of the design or implementation, and can result in a failure.

A fault may either be active or latent within the entity. It can lead to a failure when
certain conditions are met, and will vary considerably, depending on frequency and
consequence. The ability to identify the pattern of these two characteristics is called the fault
reproducibility.

Faults can be categorized into four types:

• Hard faults are permanent faults that can be reproduced easily.


• Transient faults are momentary faults that are triggered due to internal energy within
the entity.
• Intermittent faults are faults that are activated only under certain conditions and
cannot systematically be reproduced. These malfunctions only occur at irregular
intervals, with normal functionality at all other times.
• Human error faults are faults caused by humans. They may be inadvertent or accidental,
and they may occur for many reasons ranging from management pressure and personal
stress to training, among others.

These four faults types can be introduced or detected at various phases of the system
life cycle:

• Design and development phases: faults are caused inherently due to internal design
decisions.
• Integration phase: faults are caused by interacting entities.
• Operation, maintenance, and support phases: faults will occur during service or
operation under external influence.

Some examples of the causes of faults in each phase are given in Table 2.4.

Table 2.4 Examples of Fault Causes During Various System Phases


Design and development – Intermittent: inadequate built-in test, lines of code, intolerance failures, redundancy, conformity issues; Hard: inadequate design; Transient: inadequate design, environment; Human error: poor logic, designer error
Integration – Intermittent: fault propagation, many lines of code, legacy systems; Hard: inadequate design; Transient: inadequate design; Human error: propagation
Operation, maintenance, and support – Intermittent: degradation; Hard: degradation; Transient: environment; Human error: user incompetence, operational pressure


Fault diagnosis is the process of localization and recognition of the reasons that cause a fault.

Other useful fault-related terms, illustrated in the short sketch after this list, include:

• Fault avoidance prevents the occurrence, or minimizes the possibility, of faults before
they can result in a system failure.
• Fault detection is the process of automatic or manual recognition that a fault
has occurred.
• Fault isolation is the process of determining where a fault has occurred so that an
appropriate recovery can be initiated.
• Fault tolerance ensures that the system continues to be operational in the presence of
a fault.
• Fault recovery is the process of regaining operational status via reconfiguration, even
in the presence of faults.
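To make the relationship between detection, isolation, and recovery concrete, the brief sketch below walks one monitored quantity through those steps. The channel names, nominal value, threshold, and duplex redundancy scheme are purely illustrative assumptions, not values or architectures taken from this book.

# Minimal sketch, assuming a duplex (primary/backup) channel and a fixed tolerance band.
NOMINAL = 28.0   # hypothetical nominal bus voltage, volts
LIMIT = 2.0      # hypothetical allowed deviation, volts

def detect_fault(reading: float) -> bool:
    """Fault detection: recognize that behavior has left its specified band."""
    return abs(reading - NOMINAL) > LIMIT

def isolate_fault(readings: dict) -> list:
    """Fault isolation: narrow the fault down to the channel(s) that are out of band."""
    return [name for name, value in readings.items() if detect_fault(value)]

def recover(readings: dict) -> str:
    """Fault recovery/tolerance: reconfigure onto a healthy channel if one remains."""
    faulty = isolate_fault(readings)
    healthy = [name for name in readings if name not in faulty]
    return healthy[0] if healthy else "no healthy channel - maintenance action required"

channels = {"primary_bus": 23.5, "backup_bus": 28.1}
print(isolate_fault(channels))   # ['primary_bus']
print(recover(channels))         # 'backup_bus'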

2.5 Maintenance and Related Terms


Maintenance is the combination of all technical and administrative actions, including
supervision actions, intended to retain an entity in, or restore it to, a state in which it can
carry out a required function. A maintenance echelon, or maintenance line, is a physical
location within an organization where specified levels of maintenance are carried out (e.g.,
a repair shop). The maintenance lines are characterized by the skill of the personnel, the
facilities available, the location, etc.

Maintenance, and its repair activities, is expected to achieve a high success rate in all
modifications that take place during the system life cycle. This includes identification of a root
cause, if there is one, or positive identification that there is no root cause. Only in this way can
the correct and most appropriate maintenance activity be carried out, allowing integrity of the
removed unit to be established, and hence for it to be returned safely to service.

It is worth mentioning that maintenance and fault tolerance are related concepts. The
distinction between them in this book is that maintenance involves the participation of
external entities (e.g., the maintenance personnel), while fault tolerance is a design attribute.

Preventive maintenance is the process of performing specific inspections, tests,
measurements, adjustments, or part replacements, and is specifically aimed at preventing
failures. These preventive actions are taken at predetermined intervals based upon a
time interval such as hours or days, or the number of operations, such as the number
of landings in the case of landing gear. Preventive maintenance takes place during
scheduled maintenance.
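As a small illustration of interval-based preventive maintenance, the check below flags a task as due when either an elapsed-hours or a usage-count (e.g., landings) interval has been reached. The function name, intervals, and figures are invented for illustration only.

from typing import Optional

def preventive_task_due(hours_since_last: float, cycles_since_last: int,
                        hours_interval: Optional[float] = None,
                        cycles_interval: Optional[int] = None) -> bool:
    """Due when any defined interval (operating hours or cycles/landings) has been reached."""
    due_by_hours = hours_interval is not None and hours_since_last >= hours_interval
    due_by_cycles = cycles_interval is not None and cycles_since_last >= cycles_interval
    return due_by_hours or due_by_cycles

# Hypothetical landing-gear inspection driven by landings rather than flight hours.
print(preventive_task_due(hours_since_last=320.0, cycles_since_last=410,
                          cycles_interval=400))   # True: the landings interval has been reached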

Corrective maintenance follows the principle of “run to failure,” where the effect is not
necessarily serious or disruptive to the mission. The corrective maintenance action consists
of replacing a failed system, subsystem, or component to ensure that full, fault-free operating
condition is restored. Of course, corrective maintenance also covers those unexpected
failures that can be serious or disrupt the mission. It takes place during scheduled or
unscheduled maintenance.


On-condition maintenance can support preventive maintenance where components
are replaced based upon observation and test results. Each of these activities is further
supported by corrective maintenance, which will only be conducted in response to
discrepancies or failures during operation.

Availability is the probability that the system or equipment used under stated conditions
will be in an operable and committable state at any given time. It has many different
elements including reliability, maintainability, and supportability. However, availability
comes in different forms, ranging from inherent or intrinsic availability, which only the
designer can influence with the reliability and maintainability designed in, to operational
availability, which is seen by the user and will be affected by the support environment that
is not designed into the equipment.

We expect that when complex systems are put into service they are fault and failure free,
but the nature of reliability is that it is a probability, and a fault or failure can occur in the
first 10 hours of operation or after 1000 hours of operation. Reliability is the probability that
the system will work for a period of time under stated environmental conditions without
a failure. The metric used by designers to assess the overall system reliability is the mean
time between failures (MTBF). MTBF has many variations, such as mean time before
unscheduled removal (MTBUR) and mean time before critical failure (MTBCF). These
are used when it is desirable to differentiate between types of failures–by reviewing
the components, the failure modes, and the modes of operation. Similarly, mean time to
repair (MTTR) represents the average time required to repair a failed component or device.
Another useful metric is TSI (time since installation), often cited by NFF experts as a key
metric for NFF monitoring. If a technician knows the TSI for an LRU, in conjunction with
the fleet average TSI, this can aid technician-level troubleshooting decisions. The TSI
metric is much more useful than MTBUR statistics for addressing NFF.
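To show how these metrics relate in practice, the sketch below computes inherent availability from MTBF and MTTR using the common formulation Ai = MTBF / (MTBF + MTTR), and flags an LRU whose TSI sits well below the fleet average, a pattern that can prompt closer troubleshooting before another removal. The function names, the 0.5 ratio threshold, and the sample figures are illustrative assumptions, not values taken from this book.

def inherent_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Inherent availability: uptime fraction determined by designed-in reliability and repairability."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def flag_early_removal(unit_tsi_hours: float, fleet_avg_tsi_hours: float,
                       ratio_threshold: float = 0.5) -> bool:
    """Flag a unit whose TSI is far below the fleet average TSI.

    A removal long before the fleet-average service life suggests the reported fault
    may lie elsewhere (installation, wiring, test equipment) rather than in the unit
    itself; the threshold here is purely illustrative.
    """
    return unit_tsi_hours < ratio_threshold * fleet_avg_tsi_hours

print(f"Ai = {inherent_availability(2000.0, 4.0):.4f}")                            # 0.9980
print(flag_early_removal(unit_tsi_hours=150.0, fleet_avg_tsi_hours=1200.0))        # True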

Maintainability is a design attribute establishing that a failed entity will be restored to
operational effectiveness within a given period of time when the repair action is carried
out in accordance with prescribed procedures. This attribute is concerned with the ease of
repairing the system after a failure has been discovered, or modifying the system to include
new features. Maintenance is the actual task performed on the entity; it includes operations
such as the inspection, removal, and repair and is often misused in place of maintainability.
They are, of course, inextricably linked, but it is maintainability that should be given strong
consideration during the initial system development phase. This is driven by the fact that
the required maintenance and its associated costs are accrued over the system life cycle,
and can significantly affect the overall cost of ownership. The process that encompasses the
measures taken to reduce these costs and the maintenance burden is known as design for
maintainability (DfM). In addition to costs, these measures often include conformance to
specification, frequency of repairs, and maintenance work orders per year.


Quality is a measure of conformance to specification.

Safety is the property that reflects a system’s ability to operate, normally or abnormally,
without danger of causing human injury, or death, and without damage to the system’s
environment.

Repairability reflects the extent to which the entity can be repaired in the event of a failure.

Testability is a qualitative design attribute that determines the degree to which an entity
can be tested under required conditions, which are included as part of the overall design
objectives, goals, thresholds, and constraints. One of the goals of testability is to increase the
fault coverage of an entity. This requires making use of design techniques that make test
generation, and test application, cost effective—an optimization process known as design for
testability (DfT).
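Fault coverage of the kind targeted by DfT is usually expressed as the fraction of the theoretically detectable (modeled) faults that a given test set actually detects. A minimal bookkeeping sketch is shown below; the fault identifiers and test results are invented purely for illustration.

def fault_coverage(detected_faults: set, modeled_faults: set) -> float:
    """Fraction of the modeled (theoretically detectable) faults that the tests detect."""
    if not modeled_faults:
        return 0.0
    return len(detected_faults & modeled_faults) / len(modeled_faults)

# Hypothetical fault universe for a small board and the faults its test set detects.
modeled = {"U1.pin3_stuck_high", "U2.open_via", "C4.short", "R7.drift"}
detected = {"U1.pin3_stuck_high", "C4.short", "R7.drift"}
print(f"Fault coverage: {fault_coverage(detected, modeled):.0%}")  # 75%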

2.6 No Fault Found Terminology


Now that system, failure, fault, and maintenance have been defined, this section will attempt
to clarify the special case of No Fault Found events. It will illustrate the need for correct
terminology with a use case and finish by defining some specific NFF terms.

2.6.1 NFF Classification


The NFF phenomenon does not take place in isolation—it is one possible result of a sequence
of events that begin with a warning or alarm (reported fault) on board the main equipment.
This results in a series of actions at various maintenance levels, until the decision to tag
an entity as NFF is taken. Over the course of the past couple of decades, NFF has been
associated with various phases of the system life cycle, and yet its theoretical understanding
has proven to be rather limited. Table 2.5 lists some of the different acronyms that have been
used to indicate the same problem, and indicates that it would be useful to look generally at
how NFF events are recognized across various industries.

A recent international survey [2-5] into the causes and perceptions of NFF in the aerospace
industry shows that approximately half of the respondents prefer the use of the acronym
NFF—including the UK and most of Europe. However, the other half of the respondents refer
to it in a variety of other terms. Within the United States, terms such as Re-Test OK (RTOK),
Trouble-Not-Isolated (TNI), Fault Not Indicated (FNI), and No Trouble Found (NTF) have
appeared to be the more common variants. Fundamentally, they are all being applied to
similar events that require further investigation and correct classification, and hence action.
But this will not be achieved until the terms are aligned. This abundance in the number of
terms indicates the need for a description.


Table 2.5 A Non-exhaustive List of Acronyms Often Associated with the No-Fault Found Phenomena
Acronym Full form
NFF No Fault Found
UTRF Unable To Reproduce (or Replicate) Fault
RA Repeat Arising
NPR No Problems Reported
CND Can Not Duplicate
CNRF Can Not Reproduce Fault
NFI No Fault Indications
CNF Cause Not Found
RTOK Re-Test OK
NFI No Fault Indicated
FCDI Fault Cleared During Investigation
NDF No Defect Found
NFA No Fault Apparent
NEOF No Evidence of Failure
NAD No Apparent Defect (or Damage)
FNF Fault Not Found
NPF No Problem Found
NTF No Trouble Found
TNI Trouble Not Identified

Aerospace engineers have regularly expressed their belief that the term No Fault Found
is itself a hindrance to identifying the correct diagnosis of an
operational fault. In one major aerospace organization, the term Fault Not Found has been
used as a positive alternative. The reason is that the term NFF is not clear enough and
suggests an attitude of resignation that there was no fault in the first instance. But the fact
is that something has caused that maintenance action, be it a real fault, or a troubleshooting
error that needs to be identified. The term Fault Not Found (FNF) therefore is being used
to drive a cultural shift as it implies that “there is still work to be done.” This indicates a
proactive approach to the problem. It can, however, be the case that NFF, FNF, or whichever
term is used will, in reality, need to distinguish between different cases and types.

The most popular terms in the United States are Cannot Duplicate (CND) and Re-Test OK
(RTOK). There is a similar level of ambiguity in the use of these terms where often at first
line maintenance the event is labeled CND, and within the depth maintenance the event is
often labeled as RTOK. The distinguishing characteristic between a RTOK and a CND is that
RTOKs can only be determined after a subsequent level of repair, whereas CNDs happen
within the same level of repair [2-6]:


“At any test level, a fault may be recognized and localized to a unit. However, when the
unit is tested at a subsequent test level, the recognition or localization of the fault may
be unsuccessful. This situation can occur for a number of reasons. One possibility is that
having correctly recognized, and appropriately localized, the fault at the preceding level,
attempts to replicate the test results at the subsequent level is [sic] unsuccessful. Another
possibility is the fault being incorrectly recognized or localized at the preceding level.”

The above explanation recognizes that the level of test is an important factor. In relation to
this point, the following descriptions can be defined:

• Cannot Duplicate (CND) concerns a faulty unit, whose faulty behavior cannot be
repeated consistently. If the failure cannot be verified (i.e., it is not reproducible), a NFF
situation is recorded, and this event is called a CND. This is followed by a manual
troubleshooting process in which the maintainer will make the decision on how many
units should be replaced, based on their training and experience.
• Retest OK (RTOK) describes the unit now being able to pass the test and perform its
function, where the unit was determined to be faulty at the previous level. For example,
a board in an LRU is reported to be faulty. When the board is swapped, the LRU is
“fixed.” The removed board is then sent for further testing, where it passes all board
test(s) and hence is labeled RTOK.

Understanding what is actually meant by terms such as those in Table 2.5, in the same
way as the coherent descriptions for RTOK and CND, will provide insights into whether
a single common term should be used as a default—or whether multiple terms should be
used depending on the circumstances of occurrence. For example, at second maintenance
level the drivers that resulted in an RTOK occurred because of an inadequate outdated
troubleshooting guide, but at third level different drivers were caused by test equipment that
lacked the necessary sensitivity [2-7].
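A small classification helper along the lines of the RTOK/CND distinction above might look like the sketch below: an event cleared at the same level of repair as the original report is logged as CND, while one that tests good only at a subsequent level is logged as RTOK. The function name, field choices, and level numbering are assumptions for illustration, not a standard.

def classify_unconfirmed_fault(reported_level: int, cleared_level: int) -> str:
    """Label an unconfirmed fault report using the RTOK/CND distinction.

    reported_level / cleared_level: maintenance level (1 = first line, 2 = shop, ...)
    at which the fault was reported and at which the unit tested serviceable.
    """
    if cleared_level == reported_level:
        return "CND"    # could not duplicate within the same level of repair
    if cleared_level > reported_level:
        return "RTOK"   # re-tested OK at a subsequent level of repair
    raise ValueError("cleared_level cannot precede reported_level")

print(classify_unconfirmed_fault(reported_level=1, cleared_level=1))  # CND
print(classify_unconfirmed_fault(reported_level=1, cleared_level=2))  # RTOK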

It is evident that no code of practice is in place to ensure correct identification, reporting,
and management of these problems. To date, no published literature has been found that
specifically addresses how the disparity between terminology and descriptions affects the
ability to deal with diagnostic failures.

Possible descriptions of the NFF phenomenon, covering the range of acronyms above, are
listed in Table 2.6.


Table 2.6 Various Descriptions That Have Appeared to Describe the NFF Problem

1. A reported fault for which the root cause cannot be found; in other words, a diagnostic failure [2-8].
   Notes: This implies that NFF is a diagnostic failure, and certainly many incidents of NFF can be described in this way. But perhaps NFF needs to be considered in a much wider context than this.
2. A reported fault for which there was never a root cause [2-9].
   Notes: These incidents are described as failures, but nevertheless will result in nugatory (futile) maintenance efforts. The question of how to recognize a lack of skill, training, or execution of procedures becomes important when a mistake or bad decision has consequences that lead to economic losses.
3. Removals of equipment from service for reasons that cannot be verified by the maintenance process (shop or elsewhere) [2-10].
   Notes: The problem is wider than this, as it should also cover cases when no fault is found at the aircraft or equipment, and as a result it is returned to service with nothing found or replaced. Also, many faults that are classified as NFF do not result in equipment removal from the aircraft. Many varied factors can cause this, including operator policy, operational expediency, redundancy, and recording a fault as acceptable to be deferred because it is not in contravention of the Minimum Equipment List (MEL) defined by the regulatory authority. This also implies that NFF is actually a failure in the maintenance effort.
4. Any reported fault that results in nugatory maintenance and logistical effort [2-9].
   Notes: A broader description focused on the wasted maintenance efforts in troubleshooting the reported fault.
5. No Fault Found is a reported failure that cannot be confirmed, recognized, or localized during diagnosis and therefore cannot be repaired [2-11].
   Notes: An extension of description 1.
6. The inability to replicate field failure during repair shop test/diagnosis [2-12].
   Notes: Specific to environment and diagnostics.
7. A failure that may have occurred but cannot be verified, or replicated at will, or attributed to a specific root cause, failure site, or failure mode [2-13].
   Notes: An extension of description 6.
8. A fault indication that triggers a maintenance action where no fault exists [2-14].
   Notes: This implies that there never was a fault in the entity (or system) and gives an indication of the human influence.


The range of descriptions in Table 2.6 highlights that manufacturers, suppliers, operators,
and the like are aware that there is a problem and have acknowledged its existence within
the system life cycle, but have not converged on a common description. This leads to the
following concerns:

• How can a true gauge of the problem be investigated if no standardized term is used in
the maintenance history?
• Are all of these terms (in Table 2.5) accurate (i.e., do they actually describe the same
event), or are there subtle differences that need to be recognized?

The key concepts in the descriptions of Table 2.6 are “no root cause,” “diagnostic failure,”
and “nugatory troubleshooting efforts.” Let’s explore these three in a bit more detail.

The first keyword—root cause—is a fundamental problem. It is the earliest point at which an
action will reduce the chances of the incident from repeating. Root causes can be classified
into three categories:

• Physical
• Human
• Latent

A physical root cause can include a variety of material reasons including broken
connections, inadequate test coverage, and damaged wiring. Consideration of these reveals
subtle differences in some of the claimed NFF causes that were identified as equipment
faults, as complete equipment failures, or as defects in the equipment design or maintenance.
It is important here to establish some consistent terminology that can be used:

• Equipment fault is the result of some physical degradation within a system that results
in a percentage or intermittent loss of functionality. If it is left unrepaired, then a fault
will progress to a system failure.
• Equipment failure is a complete loss of functionality below acceptable safety and
operating limits.
• System defect is an inherent design, manufacturing, or maintenance flaw that can
develop into a system fault.

A human root cause includes human factors, errors, or what might be called “slip-ups.”
This also includes application of poor logic, inadequate training, and lack of experience.
Associated terminology includes:

• Operator error is when the system operator, dealing with the system or entity under
test or repair, incorrectly used procedures, tools, or test equipment, and/or incorrectly
interpreted results; in other words, the operator was at fault.
• Human error at first line is a human error at the first line that results in identifying a
good system or entity under test as faulty; subsequent maintenance levels verify that the
suspect system or entity is not faulty.
• Human error at depth is a human error at the depth or shop level, which results in the
identification of a faulty system or entity as being serviceable.


A latent root cause is one that cannot be identified for a number of reasons. These might
be because of organizational problems, inadequate procedures and processes, poor culture
that does not encourage thoroughness, or perhaps too much pressure to return equipment
to the customer.

The second keyword—diagnostic failure—is the inability to carry out a successful fault
diagnosis process. The term has been used loosely by engineers and has grown to mean
different things to different disciplines, particularly within the aerospace industry. A fault
diagnostic process can be carried out through several maintenance echelons, each having
different test capabilities.

The problem begins during operational service when a fault is reported (e.g., by the operator).
Independent functionality tests will then be ordered on the suspect LRU at first line
maintenance to verify the fault. An LRU is an essential support item or component that is
removed and can be replaced at field level to restore an end item to an operationally ready
condition. If it is not reproducible, a diagnostic failure will be recorded. This description
assumes an ambiguity group of one within the diagnosis. Typically multiple LRUs are
pulled, and a diagnostic failure is reported for each LRU for which no failure can be found.
As each unit proceeds up the supply chain with policies such as 1st – 4th (if 4 units can be
removed), this problem is only exacerbated. Figure 2.1 further details the process when more
off-line tests are carried out on the failed units within the maintenance shop/depth.

The procedure here is to isolate the fault to a group of shop replaceable units (SRUs) that
are suspected of being the source of the failure. An SRU is a modular component that is
designed to be replaced by a technician in a workshop. Unlike LRUs that can be removed
and replaced in the field, SRUs are typically removed in a designated maintenance or testing
facility. Depending upon the accuracy of the fault diagnosis at this stage, ideally only one
SRU will be called out; less precise diagnostics can call out two, three, or more SRUs. The
called out units are then sent to the second or third line maintenance for more functional
testing. If the units pass at this stage, another diagnostic failure will be recorded. Two
possible scenarios will then exist—either the SRU is healthy and has been falsely replaced,
or it is faulty and the diagnostic testing is inadequate. The impact of this process will
eventually classify the unit as beyond economic repair (i.e., the state of the unit where
its estimated repair cost significantly exceeds a certain percentage—typically 80%—of its
replacement value).
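The beyond-economic-repair test described above reduces to a simple comparison, sketched below together with the recording of one diagnostic failure per called-out unit that tests good. The 80% threshold follows the text; the record structure, identifiers, and cost figures are illustrative assumptions only.

def beyond_economic_repair(repair_cost: float, replacement_value: float,
                           threshold: float = 0.80) -> bool:
    """True when the estimated repair cost exceeds the threshold share of replacement value."""
    return repair_cost > threshold * replacement_value

# Hypothetical shop results for SRUs called out from one LRU removal.
called_out_srus = [
    {"id": "SRU-14", "passed_shop_test": True,  "repair_cost": 0.0,     "replacement_value": 12000.0},
    {"id": "SRU-22", "passed_shop_test": False, "repair_cost": 10500.0, "replacement_value": 12000.0},
]

for sru in called_out_srus:
    if sru["passed_shop_test"]:
        print(f'{sru["id"]}: diagnostic failure recorded (unit tests serviceable)')
    elif beyond_economic_repair(sru["repair_cost"], sru["replacement_value"]):
        print(f'{sru["id"]}: beyond economic repair')
    else:
        print(f'{sru["id"]}: repair and return to service')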

The final keyword—nugatory troubleshooting efforts—is futile fault diagnosis activity
carried out within the preventive and corrective maintenance process. It is one of the prime
results of false alarms that lead to wasting of resources such as inspection time, man-hours,
expenditure, equipment, spare-parts provisioning, etc. This is a laborious task that
substantially increases the time to recover from a failure.


Figure 2.1 The chain of events resulting in NFF from the aerospace perspective.

The three keywords—root cause, diagnostic failure, and nugatory troubleshooting efforts—
are instrumental in understanding NFF, as they encompass the multidisciplinary nature of
the problem. NFF can be caused by:

• A combination of the three root causes (physical, human, and latent)


• Diagnostic failures at various maintenance levels
• Wasted efforts during maintenance

Acceptance or rejection of the descriptions in Table 2.6 is driven by a lack of understanding
of the NFF problem from a systems point of view. Most of them have associated NFF with a
physical root cause (that is unknown) and ignore the human and latent elements of the
problem. Instead, it should be recognized that there are other elements within the chain of
events, which are influenced by organizational culture, processes, procedures, human
behaviors, and maintenance practice. These should be the drivers that must be understood
and established by standardizing NFF taxonomy, with a view to creating a high level of coherence.

Standards help overcome technical barriers by promoting organizational success through
better workflow paradigms and maintenance strategies. Because a diagnostic failure
is a multidisciplinary issue, establishing a formal methodology, process, criteria, and
practice can help reduce its consequences. However, one interesting point raised by the
discussions in this chapter is where exactly the diagnostic failure occurs within the
maintenance process. Most of the descriptions of the NFF phenomenon do not give any
consideration to this and therefore lead to the assumption that the level of test is
unimportant (see descriptions 4, 5, and 6 in Table 2.6).

For all the reasons stated above, the definition of NFF adopted in this book is:

NFF is a reported fault for which the root cause cannot be found.

2.6.2 Case Study–The Impact of Inconsistent Terminology


The Harrier aircraft operated by the UK Royal Air Force (RAF) can be used to illustrate this
problem. To gauge the size of the problem based upon “wasted work hours,” maintenance
data from both forward and depth domain were independently filtered for all events tagged
with the default phrase “01–No fault found after check/test” within the centralized database
system [2-11]. Forward support is defined as those logistics processes and functions that are
focused on, or provide immediate support to, the operating environment and are optimized
with an emphasis on operational effectiveness. Depth support is defined as those logistic
processes and functions that underpin the support of platforms and associated equipment,
and place an emphasis on sustainability, efficiency, and cost.

For the forward domain, the results returned an average NFF occurrence of around 3.5%
over a three-year period. This is significantly different from what is claimed in much of the
available literature on the subject, which indicated an average NFF rate of around 42% across
multiple industries. What was later found was that despite the maintenance staff being able
to populate a “Work Carried Out (WCO)” field with a default phrase, the reporting was
laden with a multitude of additional maintenance actions. As a consequence, within the
forward domain some 13 different NFF phrases were identified as being routinely used
and not coded under the 01- NFF label and thus were not counted as NFF. Examples are no
problems reported, fault not found, no apparent damage, no failure codes on, and tested
satisfactorily. Strong evidence shows that many of these events were misreported or that
false feedback terms such as “fault cleared itself during investigation” were also used to
circumvent formalities and provide a positive result rather than a negative event.

Expanding the database query, therefore, to include all these “real” NFF events produced
an average figure of 10.5%, a significant increase on the original figure that was generated
on the presumption that reporting was accurate and consistent. A similar picture was seen
within the depth domain where the figure leaped to 27%. Three of the key findings of this
research relating to disparity between terminologies were as follows:

• The problem is being underreported by staff.


• The disparity between original and NFF labeled events is caused by the significant
miscoding of events brought on by the availability of many NFF related terms.
• The output of data from the available database within the forward domain is erroneous
and cannot be relied upon to provide an accurate picture of aircraft/fleet health and the
cost of NFF.

The likelihood is high that a similar situation will occur in other industries, such as civil
aviation and rail vehicle infrastructure, where best practice and past experiences could be
shared if an appropriate knowledge transfer platform were developed. But for now,
the more generalized recommendations would be to work toward:

• Ensuring that reporting is based upon a set of accepted and standardized phrases and
terms to avoid false reporting and thus ensure that all applicable events are captured.
• Simplifying the functionality of recording systems, which would restrict erroneous use
of terms.
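In the spirit of the first recommendation, a filter such as the sketch below could be run over a maintenance database to count every event whose free-text "Work Carried Out" entry matches an agreed list of NFF phrases, rather than only the default code. The phrase list, field names, and sample records are assumptions for illustration, not the RAF dataset described above.

# Hypothetical, non-exhaustive set of phrases accepted as indicating an NFF event.
NFF_PHRASES = {
    "no fault found after check/test",
    "no problems reported",
    "fault not found",
    "no apparent damage",
    "tested satisfactorily",
    "fault cleared itself during investigation",
}

def is_nff_event(work_carried_out: str) -> bool:
    """True if the free-text WCO entry contains any accepted NFF phrase."""
    text = work_carried_out.lower()
    return any(phrase in text for phrase in NFF_PHRASES)

def nff_rate(records: list) -> float:
    """Fraction of maintenance records classified as NFF events."""
    if not records:
        return 0.0
    return sum(is_nff_event(r.get("wco", "")) for r in records) / len(records)

sample = [
    {"wco": "01 - No fault found after check/test"},
    {"wco": "Fault cleared itself during investigation"},
    {"wco": "Replaced hydraulic pump and function tested"},
]
print(f"NFF rate: {nff_rate(sample):.1%}")  # 66.7%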

2.6.3 Other Related Terms


Table 2.7 lists terms (and their descriptions) that are often associated with the subject area.

Table 2.7 Related NFF Terms


Ambiguity Group: A collection of failure mechanisms for which diagnostics can detect a fault and can isolate the fault to that collection, yet cannot further isolate the fault to any subset of the collection.
Automatic Test Equipment: A system used to test a unit, where all the control and decision processes are performed with little manual intervention. The ATE responds to pre-programmed test procedures that apply stimuli and makes decisions based on measurement results. An ATE is used both for detection and diagnosis of failure.
BIT Hard Failure: A failure occurs in a BIT subsystem, and the BIT reports a malfunction in a system that is not host to the BIT; consequently maintenance personnel verify the unit to be serviceable, and the subsystem failure remains hidden.
BIT Error: The BIT (built-in-test) equipment at any level records and reports an equipment malfunction that cannot be reproduced. This can be due to design errors, transient errors, external faults, or temporal environmental induced faults.
BIT Transient Error: Component degradation in the BIT subsystem causes a failure of a transient nature, resulting in an erroneous report of a malfunction in the host system, and the transient behavior subsequently is not exhibited during testing by maintenance personnel.
Drivers of NFF: The reasons that resulted in a fault being reported as a NFF.
Diagnostic Coverage: The ratio of failures detected (by a given test program, test procedure, or set of tests) to the entire theoretically detectable failure population, expressed as a probability-weighted percentage. Also called Fault Coverage or Fault Detection Coverage.
Diagnostic Ambiguity: The situation that arises when the diagnostic is able to detect a fault, but the smallest fault group to which that fault can be isolated contains more than one repair item. In this case some good units will be delivered to the next level of assembly, and when these good units pass the tests, they will be rendered as Re-Test OK.
Diagnosability: The inherent ease in diagnosing a circuit or system that leads to a faster, more comprehensive, more reliable, and cost effective diagnosis.

Design Attribute: A specification that defines a characteristic of the entity or system. For example, maintainability is a system design attribute that has a great impact in terms of ease of maintenance by considering factors such as accessibility, simplicity, modularization, and diagnosability.
False Alarms: A call for maintenance action where none was needed.
False NFF: These are caused by procedural shortfalls during maintenance activities, where a fault is found but is not associated with the original reported fault. This leads to a fault being documented, with the event consequently not being recorded as NFF and therefore hidden from the statistics. This can indicate that the original fault is still present within the system, but an incomplete test procedure or the inability to reproduce fault conditions means it remains undetected.
False Removal: The removal of a good part suspected as being defective due to inconclusive diagnostics (e.g., diagnostic ambiguity), inaccurate diagnostic information, inefficient IETM information, or inadequate maintainer training. False removals contribute to high spares consumption, high turnaround times, low operational availability, and high RTOK rates.
Hidden Faults: These are faults which are not identified by any of the acceptance tests, resulting in a test escape. If a hidden fault is suspected of being present, then the entity under test would be categorized as Fault Not Found.
Integrated Diagnostics: The Integrated Diagnostics (ID) process is a structured process that maximizes the effectiveness of diagnostics by integrating pertinent elements, such as testability, automatic and manual testing, training, maintenance aiding, and technical information, as a means for providing a cost-effective capability to detect and isolate unambiguously all faults known or expected to occur in weapon systems and equipment to satisfy weapon system mission requirements [2-15]. For complex systems, Integrated Systems Diagnostics is used, with the process of creating these being Integrated Systems Diagnostics Design.
Latent BIT Design Error Manifestation: As a product of coincidence, an appropriate sequence of events occurs that causes a latent design error in a system to manifest itself; subsequently, maintenance personnel cannot reproduce the sequence of events that precipitates the error manifestation.
NFF Confirmed External: The case of diagnostic failure referring to the wrong unit being suspected of being at fault—or, in other words, having no Functional Failure.
NFF Occurrence: The frequency of observed NFF events.
Non Functional Failure: We consider this case to arise due to either an error in the built-in-test (BIT) or through human/operator errors.

Rogue Unit: A generic term in aircraft maintenance for a repairable component that has exhibited irregular service life patterns after repairs.
Transient Failure: Component degradation in the system causes a failure of a transient nature, resulting in a report of a malfunction of the system; the transient behavior subsequently is not exhibited during testing by maintenance personnel.
True NFF: These types of NFF issues are recorded when there was definitely no fault present within the system, and yet a fault was reported due to operator error or a wrong interpretation of symptoms.

2.7 Nomenclature
The following is a list of acronyms used throughout this book.

A-799 U.S. Navy fault code


AAIB Air Accident Investigation Bulletin
AIMS Airplane Information Management System
ALARP As low as reasonably practical
ALDT Administrative and logistic delay time
AME Aircraft maintenance engineers
AoR Area of responsibility
ATA Air Transport Association
ATE Automatic test equipment
BIT Built-in test
BITE Built-in test equipment
CAA Civil Aviation Authority
CBIT Continuous BIT
CBM Condition based monitoring
CMOS Complementary metal-oxide semiconductor
CND Cannot Duplicate
CNF Cause Not Found
CNRF Can Not Reproduce Fault
CSI Cycles since installation
CSLV Cycles since last shop visit
CSO Cycles since overhaul
DfM Design for maintainability
DfT Design for testability
DH Duty holders
DMC Direct maintenance cost


DRACAS Data Reporting Analysis and Corrective Action System


EMRP Enterprise and Materials Resource Planning
EPGWS Enhanced Proximity Ground Warning System
FCDI Fault Cleared During Investigation
FDE Flight deck effect
FFA Fraction of false alarms
FFD Fraction of faults detected
FFI Fraction of faults isolated
FI Fault isolation
FMEA Failure modes and effects analysis
FMECA Failure modes effects and criticality analysis
FNF Fault Not Found
FPGA Field programmable gate array
FRACAS Failure Reporting, Analysis and Corrective Action System
FSR Field service representatives
FTA Fault Tree Analysis
HUMS Health and Usage Monitoring Systems
IBIT Interruptive BIT
IETM Interactive Electronic Technical Manual
IVHM Integrated vehicle health management
LC Labor cost
LRU Line replaceable unit
MAA Military Aviation Authority
MACMT Mean active corrective maintenance time
MAPMT Mean active preventive maintenance time
MART Mean active repair time
MC Material cost
MMEL Master Minimum Equipment List
MMH Maintenance man hours
MRO Maintenance repair overhaul
MSG Maintenance Steering Group
MTB NFF Mean time between no fault found
MTBCF Mean time before critical failure
MTBF Mean time between failures
MTBUR Mean time before unscheduled removal
MTTR Mean time to repair
NAD No Apparent Defect (or Damage)
NDF No Defect Found


NEOF No Evidence of Failure


NFA No Fault Apparent
NFF No Fault Found
NFI No Fault Indications
NFI No Fault Indicated
NPF No Problem Found
NPR No Problems Reported
NTF No Trouble Found
ODH Operational duty holder
OEM Original equipment manufacturers
OMS On-board maintenance system
OT Operating time
PCB Printed circuit board
QMS Quality Management System
RA Repeat Arising
RAF Royal Air Force
RAMS Reliability, Availability, Maintainability and Supportability
RCM Reliability Centered Maintenance
REMM Reliability Enhancement Methodology and Modeling
RFA Rate of false alarm
RFID Radio-frequency identification
RFR Reason for removal
RGT Reliability growth tests
RtL Risks to life
RTOK Retest OK
RTS Return-to-service
SB Service Bulletins
SIL Service Instruction Letters
SRU Shop replaceable unit
ST Standby time
TAP Test access points
TC Total cycles
TCM Total corrective maintenance time
TDI/O Test dependent input/output
TES Through-life engineering services
TME Test and measuring equipment
TNI Trouble Not Identified
TPM Total preventive maintenance time


TSI Time since installation


TSLV Time since last shop visit
TSO Time since overhaul
TT Total time
UOR Urgent Operational Requirements
URD Unit Removal Datasheet
UTRF Unable To Reproduce (or Replicate) Fault
UUT Unit-under-test
WCO Work carried out
WLC Whole life costs

2.8 Conclusion
The objectives of this chapter were to present a number of definitions, discuss the
distinguishing characteristics of failure and fault, provide an understanding of maintenance
and its related terms, and, finally, to establish a coherent description of the NFF
problem. These descriptions can be considered a basis for studying more advanced and
specific problems with identifiable symptoms, so that their relationships can be mapped
to achieve diagnostic success.

The lack of standardization in this area has resulted in different terms being used to describe
the same events in maintenance engineering. This hides the scale of the problem; a common
term would provide meaningful statistics for the problem and allow easy identification of its
true cost and effect. The term NFF appears to be favored within the aerospace sector, where
they have recognized the need to understand the distinctions between root causes, faults,
and the influencing factors covering the entire maintenance process.

This chapter provides a minimum set of definitions and concepts to be used in the NFF
field of study. It provides initial steps toward a better understanding of NFF drivers, and a
baseline for several continuing discussions on NFF terminology utilization.

2.9 References
2-1. ISO/IEC 15288. Systems and software engineering–System life cycle processes. 2002.

2-2. Deming, W. E. The New Economics for Industry, Education and Government. MIT CAES,
Cambridge, MA, 1993.

2-3. Vesely, W. E., F. F. Goldberg, N. H. Roberts, and D. F. Haasl. NUREG-0492, Fault tree
handbook. Nuclear Regulatory Commission, Washington DC, 1981.

2-4. Blanchard, B. S., and W. J. Fabrycky. Systems Engineering and Analysis, Fourth Edition, 19.
Prentice Hall, 2006.


2-5. Huby, G. “No fault found: aerospace survey results.” Copernicus Technology Ltd.
Technical Report. Copernicus Technology Ltd, UK, 2012.

2-6. Söderholm, P. “A system view of the No Fault Found (NFF) phenomenon.” Reliability
Engineering & System Safety 92, no. 1 (2007): 1–14.

2-7. Ungar, L. “Design for diagnosability guidelines.” IEEE Instrumentation &
Measurement Magazine (2007): 24–32.

2-8. Cockram, J., and G. Huby. “No fault found (NFF) occurrences and intermittent
faults: improving availability of aerospace platforms/systems by refining
maintenance practices, systems of work and testing regimes to effectively
identify their root causes.” Paper presented at the CEAS European Air and Space
Conference, October 26–29, Manchester.

2-9. Hockley, C., and P. Phillips. “The impact of no fault found on through-life
engineering services.” Journal of Quality in Maintenance Engineering 18, no. 2
(2012): 141–153.

2-10. ARINC Working Group 672. Guidelines for the reduction of no fault found (NFF):
ARINC, 2008.

2-11. Roke, S. “Harrier no fault found reduction.” MSc dissertation in engineering
business management. Cranfield University, 2009.

2-12. Kirkland, L. V. “Why did we add LabVIEW applications to our ATLAS TPSs?”
AUTOTESTCON, 266–271. IEEE, September 12-15, 2011.

2-13. Qi, H., S. Ganesan, and M. Pecht. “No-fault-found and intermittent failures in
electronic products.” Microelectronics Reliability 48 (2008): 663–674.

2-14. Ungar, L. Y. Causes and Costs of No Fault Found Events. Advanced Test Engineering
(A.T.E.) Solutions, Inc. El Segundo, CA, 2015.

2-15. MIL-STD-1814, Department of Defense Handbook: Integrated diagnostics (Feb. 14, 1997).

Chapter 3
The Human Influence

3.1 Introduction
During troubleshooting activities, maintainers primarily rely on their experience,
fault isolation manuals, training, and organizational culture. These practices
support recognizing and understanding the chain of events involved for effective
maintenance, while promoting efforts at the system level to ensure that failures are
correctly evidenced and used to develop knowledge of the operations. However, these
practices are heavily dependent on human perception, contractual agreements, and
the cost for the maintenance of systems. Individuals play an important role within
this context, and hence it is essential to note their influence. Factors such as misread
reports, operator error, poor system understanding, or job pressures can lead to what
is sometimes known as “true NFF events.” It is also of great significance that there
appears to be a degree of mistrust within organizations about the occurrence of such
incidents, not just within the supply chain. This is based on the view that operators
are not always prepared to admit to making errors during their operations, and they
often register a fault rather than admit they are baffled. Of course, not all erroneous
reports are deliberate; some may stem from a number of causes such as high workload
and complexity.

Within the maintenance arena, NFF occurs during the interaction between human
and key resources such as aircraft test equipment, integrated onboard maintenance
systems, and technical maintenance manuals that are aimed at supporting engineers
with diagnostic tasks. The accuracy and usability of such support systems will directly
impact the quality of work performed by humans in the system. Linking this with
additional performance-affecting influences such as operational pressures, stresses,
motivation, and psychological factors leads to humans being a prime element in the
study of the causes of NFF.


In this chapter, this human element in NFF events is analyzed. Effective procedures
and techniques can contribute to good performance; studying the human influence and
interaction with them can help us to understand how the problem develops into a NFF event.

3.2 The Human Element


Technological advances within the high-end transport, defense, and energy industries
(among others) have led to a drive for ever-improving, technologically driven, efficient
maintenance operations. However, as tempting as it is to credit such technical innovations
for any progress and improvement, or even for the reduction in maintenance
failings, it must be acknowledged that such developments cannot be exploited without the
application of competent and qualified human beings. These essential human components
operate as part of the same system as any integrated technology, and the processes or
procedures to operate, maintain, and interpret the technology. The field of study that is
concerned with quantifying and measuring such performance characteristics of human
beings is known as human factors. Human factors, often considered synonymous with
ergonomics, is “the science that facilitates maximum human productivity, consistent quality,
and long-term worker health and safety” [3-1]. It measures the work demands imposed at
the workplace and compares these with the workforce’s capabilities. When task demands
exceed human capabilities, performance will decline, allowing human errors to occur. These
errors have the potential to manifest as safety-compromising incidents (see chapter 5), which
deteriorate productivity and can damage an organization’s reputation [3-2].

The notion that NFF is an output of a process (chapter 2), and that the human element plays
a major role in ensuring that the process is followed correctly, only reinforces the influence
humans have on NFF events.

In chapter 1, the topic of human factors was classified as one of the four most influential
contributors to NFF events, and a source of much frustration among maintainers. This
argument arises for a number of reasons:

• The equipment requiring maintenance is diverse in circuitry, configuration, and


function.
• The environment often seriously attenuates human performance.
• The knowledge, training, and skill of technicians have limits.
• NFF training packages are lacking.
• Management support factors typically are less than ideal.
• Personal relationships between teammates can pose problems.

All of these factors contribute to the real climate surrounding the NFF issue.

3.2.1 Organizational Context


The notion of NFF incorporates several key players who range from individual staff
members and regulatory bodies to OEMs, MRO providers, and a wide variety of other
elements of the maintenance supply chain. It also includes an organization’s maintenance
setup, staff skills and experience, the working environment, technological capabilities,
and the commercial contractual agreements that cover an organization’s maintenance


obligations. In the organizational context, it is important to understand how these elements
relate, function, and work together to reveal how NFF manifests itself during maintenance
activities. For example, at the top-most level, a regulatory body would impose the necessary
rules and requirements that determine the local activities for inspection and repair, which
personnel will have to abide by [3-3]. At this point, organizations are responsible for quality
control functions, such as carrying out inspections and performing audits in compliance
with regulatory bodies (e.g., the Civil Aviation Authority). They are also responsible for
quality assurance of the maintenance system, checking engineering change orders, auditing
and investigating maintenance activities and components for errors, examining records, and
any troubleshooting process. Any failure or disruption in carrying out these two functions
may cause maintenance errors and inefficiencies, with resulting financial repercussions. An
organization’s maintenance plan would typically support actions at three levels:

1. Strategic level: Priorities and critical targets are established in accordance with business
goals. The strategic level is represented by senior management.
2. Tactical level: Resources necessary to achieve the maintenance plan, which include
requirements, planning, and scheduling, are determined. The tactical level is
represented by mid-level management.
3. Operational level: Maintenance tasks are performed in the scheduled time. The
operational level is represented by the maintenance staff.

Generally, the staff members at the operational level, which encompasses the “work on the
ground,” have a good understanding of the NFF phenomena. Testing and repair work takes
place here, and the operational level personnel are the ones who will identify a NFF. This
is largely due to the nature of the problem, because it appears primarily during system
operation, and hence the on-field personnel are the first ones to experience its symptoms.
At this stage, NFF has the potential to economically affect the system operation due to
incorrect fault diagnoses, waste of resources, and unproductive use of time, which add
to maintenance costs, downtime, and unavailability of the system. It can further damage
the reputation and relationships within the supply chain, which is where the tactical
level personnel will become involved as they experience the shortage of spares, and the
maintainers waste time looking for faults that cannot be isolated. Due to time pressures,
the tactical level personnel will need to make decisions on whether to allow their staff to
keep searching for the symptom-to-cause relationship of the reported fault so they can
remove the NFF label. Alternatively, they must accept the NFF and send the equipment back
through the certification loop, or order further investigations by sending the equipment to
a deeper level of maintenance. At the strategic level, NFF events do not appear to pose an immediate financial burden, due to a lack of metrics, and hence the strategic level personnel struggle
to understand the long-term consequences that NFF events inflict on engineering practices.
However, decisions made at the organization’s strategic level directly influence the tactical
and operational performances. It is suggested, therefore, that if the cost of NFF at the
strategic level were clear, it would enable NFF resolution to become an integrated part of the
continuous improvement strategy of the organization.

This situation has been the subject of many discussions, and it clarifies why the NFF
phenomenon has not been able to attract much attention for resolution, despite being a
known issue for many decades. Other NFF facts include:

• NFF is not evidenced to have caused an injury.
• Commercial contracts do not acknowledge it as a problem.
• No one knows the true costs involved.

Mentioned at the outset of this book was the comment that, given the variety of NFF sources,
each industry approaches NFF differently. This is due to their individual interests and
differing viewpoints, and an organizational philosophy (low cost vs. a premium business
carrier) is at the root of maintenance policies. For example, should a component be taken
offline for more investigation, should an aircraft be grounded for a period of time, or can
problem resolution be deferred? As a result, the situation arises in which internal pressure
is placed upon the maintenance personnel to reduce their maintenance turnaround times
to match the organization’s policies. This leads to a culture where units are replaced rather
than the root cause of a failure being identified and fixed. Here, it is also important to
highlight that although cultural factors are similar to human factors, they tend to focus
on the collective aspect rather than the individual. Often, organizations can be overly
bureaucratic and cumbersome in their response to change and might not even recognize
that they have a problem [3-4]. It seems that culture is becoming more widely acknowledged
as one of the most significant contributory factors of NFF events. This is attributed to the
behavior, skill sets, and communication between an organization’s technicians, engineers,
and management personnel.

Many engineers agree that their organizations are conscious of operator satisfaction and of reducing the cost of NFF-related events, and they are becoming more assertive about the cost and resolution of these problems. Operators replace an LRU and observe that the fault goes away, but the maintenance organization cannot find the fault and has to (directly or indirectly) charge the customer for it. Bureaucracy increases the friction between the two sides.

All these factors contribute to, or are prime causes of, the elements we are considering in the
following sections.

3.2.2 Communication
Today, communication in the maintenance function has an increasingly important mission.
That mission is to convey information of the many facets associated with the business [3-5].
Communication can significantly influence the NFF phenomenon in many ways, including
the following:

• Poorly communicated procedures
• Poor test procedure descriptions that are misunderstood
• Incomplete or missing reports
• Lack of training

Figure 3.1 illustrates the typical flow of information from the origin of a fault or reported
symptom until it reaches its target, and throughout this route the probability for NFF events
can grow at any phase. At the origin is “THE” problem that has instigated an event. By the
time the details reach the intended target, its interpretation will vary dramatically due to
misunderstandings, miscommunication, human subjectivity, and even self-interest. As a
consequence, “A” problem will be solved rather than “THE” problem, and without thorough
investigations of whether the root cause has been addressed or not.

Origin (“THE” problem) → Interpretation (competence, fault codes, service bulletins, manuals, etc.) → Written/verbal (language, attitude, style of writing, etc.) → Transit (telephone, email, on desk, prioritization, meetings, etc.) → Interpretation (competence, attitude) → Target (“A” problem)
Figure 3.1 The route of a problem resulting in a NFF.

To differentiate between “THE” and “A,” companies need to plan in-depth discussions on
such topics to learn about them and propose solutions. This also reinforces an organization’s
attitude toward understanding the NFF problem. Some key areas where emphasis must be
placed are discussed in the following paragraphs.

3.2.2.1 Preparing Accurate Reports


It has been acknowledged that if a fault diagnostic report is made at the same time as
carrying out the troubleshooting process (also known as concurrent reporting), it is carried
out within the short-term memory (STM). Conversely, when a report is made after having
carried out the task (also known as retrospective verbalization), it is retrieved from the
long-term memory (LTM). Because retrieval from LTM is more fallible than from STM,
retrospective verbal reports are less valid than concurrent reporting. Thus, if maintainers
wish to achieve a better understanding of the nature of the fault diagnostic process, they
should rather concentrate their efforts on collecting concurrent verbal reports [3-6]. This
means that it is best to carry out the troubleshooting as well as the fault diagnostic report at
the same time to establish the most accurate report. Organizations can help by automating
this process, thus reducing the burden on the maintainer.
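
As an illustration only, here is a minimal sketch (in Python, with hypothetical class and field names rather than any particular organization's system) of how concurrent reporting could be automated, timestamping each troubleshooting step as it is performed so the report is built from short-term memory rather than reconstructed later:

import json
from datetime import datetime, timezone

class ConcurrentFaultLog:
    """Collects troubleshooting steps as they happen (concurrent reporting)."""

    def __init__(self, unit_id, reported_symptom):
        self.record = {
            "unit_id": unit_id,
            "reported_symptom": reported_symptom,
            "started_utc": datetime.now(timezone.utc).isoformat(),
            "steps": [],
        }

    def log_step(self, action, observation):
        # Each step is stamped at the moment it is carried out,
        # so nothing has to be recalled from long-term memory later.
        self.record["steps"].append({
            "time_utc": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "observation": observation,
        })

    def export(self, path):
        # The finished report can accompany the removed unit or be fed
        # into the organization's reporting system.
        with open(path, "w") as f:
            json.dump(self.record, f, indent=2)

# Example usage (made-up values):
log = ConcurrentFaultLog("LRU-1234", "Intermittent display blanking")
log.log_step("Ran BIT from maintenance panel", "No fault codes returned")
log.log_step("Inspected connector J3", "Slight corrosion on pins 4-5")
log.export("fault_report_LRU-1234.json")

A tool of this kind keeps the burden on the maintainer low while still producing the concurrent record that the research cited above favors.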

3.2.2.2 Consistency
Another concern is the lack of consistency in maintenance practice, which slows down the
overall process of rectifying failures. Lack of consistency can result from the following:

• Lack of correspondence between designers, airlines, and repair organizations
• Varied interests in businesses
• Not using standards for information exchange (e.g., S-Series ILS)

It is important to understand that an airline can solve some issues, but it cannot address
other problems that can only be dealt with by either the designer or the repair organization
(e.g., the airline can do nothing if the repair organization has a limited test bench). The rising
complexity of systems, the interdependent mixture of hardware and software technologies,
as well as the imperfect transfer of knowledge between the test designers and the engineers
increase the difficulties of detecting all possible failure modes. A solution to this would be
to share information and knowledge between organizations, while providing feedback at all
levels of the relevant chain.

3.2.2.3 Feedback
Two distinct groups are identified clearly:

• The in-field staff or the operational team, who receives limited feedback about the NFF
phenomenon
• The tactical team, which obtains some feedback

From an aviation maintainer’s viewpoint, receiving limited feedback can perhaps be attributed
to two reasons. First, if a part is to be installed on an aircraft but has come back from the repair shop with an NFF tag, there is still the possibility that it carries a defect, even though no defect was highlighted on the test bench. So, in theory, the part is at risk of failing again soon, this time perhaps in flight. The responsibility then falls to the pilots and maintainers, who will need to observe the behavior of the NFF-labeled part during installation and subsequent testing. In some airlines, pilots do receive a technical report listing all NFF parts that were installed on the aircraft, seven days prior to the actual flight. This allows them to recognize that certain units have a higher probability of failure. Such reports should also be shared with maintainers, who would benefit from the awareness; a similar technical report system for maintainers would therefore be beneficial to put in place.

Second, some maintainers have voiced the opinion that it would be desirable to receive feedback about the parts they removed that come back labeled NFF, so that they can improve their troubleshooting. Because they obtain limited feedback, they often have little (or no) idea of:

• The extent of the problem
• The NFF percentage
• Which parts were most affected

These opinions are consistent with the observation made by other experts on the difficulties
arising from lack of feedback. At the moment, it seems that maintainers do things as they
have always been doing them (driven by the organizational culture), simply because they
have never been informed of the possibility that what they were doing was inadequate.
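
As a sketch only (assuming a simple list of removal records with hypothetical field names, not any particular airline's system), the kind of feedback summary the maintainers describe could be generated along these lines in Python:

from collections import Counter

def nff_feedback_report(removal_records):
    """Summarize removals whose shop finding came back as NFF.

    Each record is assumed to be a dict such as:
    {"part_number": "PN-100", "removed_by": "AME-7", "shop_finding": "NFF"}
    """
    nff_records = [r for r in removal_records if r["shop_finding"] == "NFF"]
    total = len(removal_records)
    return {
        "total_removals": total,
        "nff_removals": len(nff_records),
        # The overall NFF percentage the maintainers say they never see:
        "nff_percentage": round(100.0 * len(nff_records) / total, 1) if total else 0.0,
        # Which parts are most affected:
        "most_affected_parts": Counter(r["part_number"] for r in nff_records).most_common(5),
    }

# Example usage with made-up data:
records = [
    {"part_number": "PN-100", "removed_by": "AME-7", "shop_finding": "NFF"},
    {"part_number": "PN-100", "removed_by": "AME-2", "shop_finding": "confirmed fault"},
    {"part_number": "PN-205", "removed_by": "AME-7", "shop_finding": "NFF"},
]
print(nff_feedback_report(records))

Even this small amount of routinely shared information would address the three gaps listed above: the extent of the problem, the NFF percentage, and which parts are most affected.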

3.2.3 Human Factors Impacting NFF


Maintenance is one aspect of a complex system, where (human) entities perform varied tasks
in an environment with time pressures, sparse feedback, and sometimes difficult ambient
conditions. These situational characteristics, in combination with generic human erring
tendencies, can influence various NFF event “triggering factors,” which include:

3.2.3.1 Discrepancy in Terminology


Terminology was seen as a critical factor in chapter 2. It was shown that ambiguity,
misuse, and misunderstanding of the numerous terms have caused inconsistencies within
maintenance working groups and committees.

3.2.3.2 Operational Pressure


Operational requirements for businesses with short turnaround times are driven by
commercial pressures and are an unavoidable fact of business, particularly in the airline
industry. Customer satisfaction and business reputation, not to mention penalties for delays
and cancellations, all contribute to operational pressures in many organizations such as
airline and rail industries. Pressures, whether self-imposed or organizationally imposed,
can result in mistakes being made and details being missed by the maintenance personnel.
Internal factors include poorly designed organizational workload policies, inadequate
staffing, last-minute changes in schedules, long working hours, and the lack of an effective
safety culture. The external operating pressures include market competition, contingency
events, and the demanding legislative framework. Nighttime is often devoted to scheduled
inspection and repair tasks, especially in the rail and aerospace industries, as well as fixing
failures reported from the previous day. Operational pressures at night can be exacerbated by human fatigue [3-7].

3.2.3.3 Training
One of the more commonly reported contributing factors to NFF is inadequate training. This
is an area that is affected by different aptitudes and skills for recognizing and interpreting
faults. At present, there is little training available on NFF, but because modern educational
standards usually specify the “learning outcomes” for maintenance tasks, teaching NFF
should be easy to associate with the core engineering competencies now required.

3.2.3.4 Lack of Experience


Troubleshooting processes are affected not only by training and tools, but are also heavily
dependent upon experience [3-8]. This is important, as increased levels of system complexity
are a major cause of NFF events, and the experience of maintenance personnel is critical
to provide system familiarity. When the system is complex, unless the maintainer is
knowledgeable or experienced, they will simply send the whole unit for repair rather than
carry out further troubleshooting to identify the component at fault. In operational conditions,
expert knowledge for fault diagnosis can be lacking because of the unavailability of experts,
perhaps due to different shifts, sickness, or a holiday, which then compounds the problem.

Because all of these are fundamentally different issues, an organization may be challenged
to recognize what might be a systemic human factors problem. Another dimension of the
problem is when senior management does not understand, nor recognize, that NFF events
are an issue. This can be caused by a number of reasons, including the following:

• The cost of NFF is not seen by the senior management, and even if it is, it is small when
compared to an aircraft delay or cancellation.
• No clear end-to-end FRACAS (Failure Reporting, Analysis and Corrective Action
System) is available. Organizations do not always use the standards, making it difficult
to investigate units sent to suppliers with sufficient reasons for removals.
• Financial pressures are present to have the system (such as an aircraft) serviceable.
• The majority of senior managers are not involved with actual troubleshooting, testing or
data capture.

Undoubtedly, NFF events are influenced by other factors, such as mission requirements, new
technology use, and many outside influences over which management has no control. Of
course, these cannot be ignored, although their effect can be minimized in an otherwise
good work environment. The answer may lie in studying how various systems interact
with each other, in the midst of NFF issues. Let us now take a look at how these various
interactions can relate to the problem.

3.3 The Maintenance Engineer and System Interactions


3.3.1 Typical Maintenance Processes in Civil Aircraft
Investigating the human influence on NFF requires an understanding of the maintenance
process and the human interaction between hardware and software elements, along with
their interaction within the maintenance environment. The NFF phenomenon will differ
depending on which part of the maintenance organization is experiencing it. With front-line
aircraft maintenance, NFF events originate from a fault condition that triggers a warning
to alert the pilot of possible system degradation. A chain of events, which is initiated by the
pilot experiencing a fault situation, is then set in motion. However, subsequent diagnosis
and maintenance intervention by the repair organization may be ineffective, as the same
symptom may occur on the next flight. Figure 3.2 illustrates a typical chain of events that
may lead to several cases of NFF within civil aerospace.

As can be identified within the maintenance process shown in Figure 3.2, a resulting NFF
can occur at several points, as listed in Table 3.1.

[Figure 3.2 (flowchart) boxes: reported by operator; fill in technical logbook; call for maintenance; fault diagnosis by technician in accordance with manufacturer documents; decision to remove part? (no: tagged serviceable, maintainer fills in technical logbook, system released); remove part; install new part; removed part tagged unserviceable; fault indication / built-in self-test indicator; under warranty?; repair order to send ‘defective’ part for repair; send off for repair; repair shop attempts to fix problem; no fault found during test; fill in test details in logbook; part now tagged serviceable; part put back in stock.]
Figure 3.2 The simplified repair process during a maintenance action.

Table 3.1 Example NFF Causes During the Repair Process

• Fault indication: The fault is incorrectly interpreted by the operator.
• Fill in technical log book: The operator concisely writes down what he/she believes is the problem in the log book.
• Fault diagnosis by technician in accordance with manufacturer documents: The manufacturer’s manuals will contain all expected fault modes that can occur during system operation. However, the new fault mode may not have been updated in the documentation.
• Decision to remove part: The technician is inexperienced with the fault/failure and accidentally removes the wrong component.
• Repair shop attempts to fix the problem: Bench tests on the removed component are carried out under normal conditions; therefore, the fault does not manifest itself this time. If the item is still under warranty, the attitude would be to send the component off without confirming whether there was a fault in the first place.
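
To make the stages of Figure 3.2 and the example causes in Table 3.1 concrete, here is a small, purely illustrative Python sketch (the stage names and example triggers are paraphrased from the figure and table, not taken from any real maintenance system):

# Stages of the simplified repair process (Figure 3.2) annotated with the
# example NFF causes from Table 3.1. Purely illustrative.
REPAIR_PROCESS = [
    ("fault indication", "operator misinterprets the fault"),
    ("fill in technical log book", "log entry too concise to reproduce the problem"),
    ("fault diagnosis per manufacturer documents", "new fault mode missing from the manuals"),
    ("decision to remove part", "inexperienced technician removes the wrong component"),
    ("repair shop attempts to fix the problem", "bench test under normal conditions cannot reproduce the fault"),
]

def nff_risk_points(process=REPAIR_PROCESS):
    """List each stage together with the way an NFF can be introduced there."""
    for stage, example_cause in process:
        print(f"{stage}: possible NFF trigger -> {example_cause}")

nff_risk_points()

Walking the process in this way makes it clear that an NFF outcome can be seeded at any stage, long before the unit reaches the test bench.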

The main reasons for unsuccessful diagnosis and repair of reported faults can be attributed
to a number of possibilities including the inability to reproduce the conditions under which
the fault symptom first materialized, inadequate test procedures, or simply human error.
The vital link in the flight safety chain, which was discussed in chapter 1, is the maintenance
engineer who has the responsibility for certifying an aircraft as fit for flight.

The next section explores the interactions of engineers with hardware, software, and
environment, in turn. It includes some technical aspects of NFF problems associated
with the complexity of aircraft systems, functionality of TME (or aircraft BIT), equipment
reliability, and procedural issues such as discrepancies in manuals and incorrect reporting.

3.3.2 Hardware Interactions


A maintenance engineer interacts with various hardware components (such as LRUs,
avionics systems, or ground support equipment) when conducting aircraft maintenance
tasks. These tasks often include checking system interfaces and interconnections,
reproduction of the flight environment, equipment test functionality, and system
interactions. During these checks, many obscure or intermittent fault conditions can be
found within the interconnections between systems. A common solution is to remove electrical power and then reapply it for the purposes of maintenance, which can be sufficient to clear the original fault. Indeed, this is a routine way of removing BIT codes.

A major aspect that must be noted is the situation in which the fault symptoms were first reported (by the operator), which can be exceedingly difficult for the maintainer to reproduce. For many
types of equipment, and aircraft in particular, it is impossible to replicate the vibration,
temperatures, and humidity that were being experienced when the fault first occurred.
Modern aircraft electronic systems share data and functionality among many subsystems and components rather than operating autonomously. Consequently, engineers are expected to rely upon automated BIT to help with fault diagnosis, and therefore they must have the
confidence in the BIT system’s ability to detect faults. This assumes that complex aircraft
systems have been designed with a BIT function capable of detecting all events that lead
to fault indications. The ability of a maintainer to accurately deploy the use of BIT as a
fault-finding tool and effectively interpret its results is the key to robust fault diagnosis and
subsequent repair action. Such systems require that maintenance tasks be carried out by
maintenance engineers who have the training, knowledge, and confidence in its use.

Another aspect within these hardware interactions is that maintenance personnel are often
required to use stand-alone TME that augments their diagnostic toolset. At face value, this
may give an impression of enhancing the individual’s ability to diagnose and reproduce
faults accurately. However, this is not necessarily the case, and limitations attributed to this
must be addressed and understood. For example, specialized equipment generally does
not have the capability to simulate the actual loading conditions, such as vibrations and
temperatures that are being experienced by the system.

The onset of new technologies, and the need to maximize equipment availability with
reduced maintenance, has resulted in an increase in system interactions. This is opposed to
the federated systems found on aging aircraft. Integrated operations within systems lead
to fault symptoms that may not have been considered possible during system design (e.g., a
failure within system X has a knock-on effect to produce a fault symptom in system Y). Such
a scenario presents additional challenges to maintenance engineers when attempting to
undertake fault diagnosis against the reported symptom.

3.3.3 Software Interactions


Software interactions, which encompass non-physical aspects of the maintenance
environment, include maintenance manuals and procedures, check-lists, and computer
software interfaces. The past few decades have seen increasing sophistication in this
area, in which engineers make use of software tools (for communicating, training, data
trending, etc.) in accomplishing their maintenance work. In fact, communication tools can be
considered the most important interface system within maintenance. These include e-mails,
bulletin boards, e-logs, and databases, which are used to share information within the team,
allowing simultaneous access, flexibility in presentation, and important updates. However,
for these interactions to be effective, the correct application of the software tools is required.
Some issues that can be associated with the human-software link can include poorly written
manuals, misinterpretation of procedures, non-compliance with procedures and manuals,
untested or difficult to use computer software, and poorly designed BIT/BITE [3-9].

In the NFF context, particular attention is focused on the maintainer’s use of aircraft
maintenance manuals when undertaking fault diagnosis; typically, these are now accessed
electronically. Aircraft maintenance manuals aim to provide the maintenance engineer with
the most effective sequence of diagnostic activities, using the most applicable methodologies
and tools. Since electronic manuals are easily updated, they offer the maintainer more “up
to the minute” information than paper systems would ever achieve. Although electronic
manuals can also have weaknesses, including the sequencing of tasks, inadequate accuracy,
and ease of use for the maintainer, such weaknesses may go unnoticed and generate NFF
events that will continue for some time until finally rectified.

The maintainer’s access to maintenance procedures has dramatically evolved over recent
years due to the onset of high-speed computer networks and portable computing power.
Maintainers no longer rely solely upon cumbersome paper versions of fault diagnosis
procedures but rather have access to electronic versions of manuals viewed on computer
base stations, portable tablets, or onboard aircraft integrated maintenance systems. This
presents an added requirement for such interactions; not only must procedures be useable
and accurate, but also the hardware used to view them must be capable of accurately
presenting the information, and ergonomically suited to the user.

3.3.4 Environment Interactions


Environment interactions can be broad based and examine human factor implications in
different contexts:

• Physical environment: includes the physical environment as the workplace, such as the
maintenance hangar or workshop
• Working conditions: includes working patterns, management structures, training, and
company organizational structure

Environment-related implications cannot be ignored, as they potentially have the most
significant impact on the behavior of maintenance personnel, and influence their ability to
undertake effective fault diagnosis. Aviation maintenance is generally undertaken in a fast-
moving environment where engineers are regularly challenged by time pressures, limited
supervision, and difficult working conditions, which can result in human error. Lack of
time, and the associated pressure, is a major issue within aviation maintenance due to the
penalties, such as financial and reputational, if the aircraft is not available for its role of
carrying fare-paying passengers, which is the primary source of income for operators.

3.4 Human Factors Survey


This section presents international survey results on the contribution of maintenance-related
human factors to No Fault Found events on aircraft systems [3-10]. This work involved
research into maintainer human interactions with off-aircraft test equipment and integrated
maintenance systems. These systems include hierarchical listing of all parts, their manuals,
schedule management systems, and various activities that can minimize lost production
time. They can also provide engine maintenance information, depending on fault code data
received from on-board engine performance monitoring systems. Such systems can play
an instrumental role in recording maintenance actions for the purpose of validating and
generating warranty claim applications. The systems described here are not intended to disclose any actual organization, product, or individual.

3.4.1 Introduction
The project, undertaken in 2014, developed a set of recommended best practice guidelines
focusing on mitigating human factors implications that arise from engineers interacting
with complex systems when conducting maintenance tasks. The study revealed that key
resources such as aircraft test equipment, integrated on-board maintenance systems, and
technical maintenance manuals failed to support engineers when undertaking diagnostic
tasks. Reasons for this included poor fitness for purpose and lack of training in their
use. The combined effect of the research findings is that aircraft maintenance personnel
are unable to consistently undertake accurate and timely fault diagnosis tasks due to
shortcomings in maintenance resources and organizational support. This can result in
unwanted NFF occurrences and prevent organizations from achieving maintenance and
operational objectives.

The study introduced the possibility of human factors in aircraft maintenance contributing
to the NFF problem. The objective of the research was to collect and evaluate human
factors-related data to identify maintenance process improvements that mitigate NFF events. The work was completed at Heriot-Watt University, in cooperation with
the Engineering and Physical Sciences Research Council Centres for Innovative
Manufacturing in Through-life Engineering Services. It allows industry and operators to
have a greater understanding and recognition of the root causes of NFF occurrences and
the associated business costs they present.

To achieve its objective, the work extracted quantitative and qualitative data from surveys
that drew a significant response, with a large number of returns received from a number
of organizations.

The survey was answered by 188 participants. The majority of responses came from certified
Category B (Cat B) licensed engineers. Cat B License holders come in two categories: B1
engineers are responsible for aircraft mechanical and propulsion systems, and B2 engineers
are responsible for electronic and avionic systems. The survey identified four key areas in
which improvements should be made:

• Aircraft testing resources
• Aircraft maintenance manuals
• Organizational pressure
• Maintenance engineer: competence and training

These areas are addressed in the following sections.

3.4.2 Aircraft Testing Resources


The purpose of the aircraft testing resources section of the survey was to establish if aircraft
testing resources, which include off-aircraft TME and aircraft installed monitoring systems,
allow maintainers to diagnose reported faults accurately and efficiently.

The survey requested participants to state what percentage of TME used in support of fault
diagnosis they are competent with, regarding its operation and functionality. These data are
presented in Figure 3.3 and are broken down between aircraft trade and industry experience.

The data show that approximately 80% of AMEs in both trades have more than 10 years’
experience—demonstrating a positive correlation between experience and TME competency.
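
For readers who want to reproduce the kind of breakdown shown in Figure 3.3 on their own data, here is a minimal sketch in plain Python (the response fields are hypothetical, not the actual survey returns) of cross-tabulating stated TME competency by trade and experience band:

from collections import defaultdict

def competency_by_trade_and_experience(responses):
    """Average the stated '% of TME competent with' per (trade, experience band).

    Each response is assumed to look like:
    {"trade": "avionics", "experience_band": "> 10 yrs", "tme_competency_pct": 80}
    """
    buckets = defaultdict(list)
    for r in responses:
        buckets[(r["trade"], r["experience_band"])].append(r["tme_competency_pct"])
    return {key: sum(values) / len(values) for key, values in buckets.items()}

# Example usage with made-up responses:
sample = [
    {"trade": "mechanical", "experience_band": "> 10 yrs", "tme_competency_pct": 85},
    {"trade": "mechanical", "experience_band": "< 5 yrs", "tme_competency_pct": 60},
    {"trade": "avionics", "experience_band": "> 10 yrs", "tme_competency_pct": 90},
]
print(competency_by_trade_and_experience(sample))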

[Figure 3.3 (bar chart): percentage of TME the respondent is competent with, plotted against industry experience (>10 years, 5–10 years, <5 years) for the mechanical and avionics trades.]
Figure 3.3 TME competency by aircraft trade and industry experience.

The sharp jump in competency levels for those with less than five years’ experience can
possibly be attributed to the low numbers of respondents who fall into this category. The main
reasons for lack of competency were cited as the following:

• Infrequent use of equipment
• Lack of training
• Test equipment being too complicated to use

A large, and challenging, factor in aircraft test equipment can be attributed to its design. The drive toward a more-electric aircraft presents a need for functional integration of BIT. Unfortunately, the technical skills required to manage even well-designed and user-friendly maintenance systems may be lacking, and such systems can be limited in isolating all diagnosed faults, at least to a level where only the offending equipment requires removal. In such situations, the broad range of operational and management applications that support fault investigations can be limited, and the commonly used measures, such as NFF (or CND, RTOK, and FNF), do not provide the required visibility. Many engineers voiced their need for additional fault detection capabilities within their standard test equipment, including for intermittent faults, transient faults, and false alarms. However, no universal agreement has been reached on how such capability should be classified and processed, or even the level of granularity to be included.

Figure 3.4 shows a breakdown of AME perceptions on their ability to use BIT. Mitigating
problems associated with the usage and interactions with BIT will have practical implications
on this aspect of the aircraft maintenance chain. AME training philosophies, therefore, need
to be reviewed to ensure that individuals are competent with the use of testing resources,
particularly when using complex aircraft systems that help with fault diagnosis.

[Figure 3.4 (pie chart) response categories: limited knowledge/unconfident in use; confident in use but not aware of full functionality; very confident in use and functionality; too complex/prefer not to use; other.]

Figure 3.4 AME ability/confidence in the use of OMS.

It was also identified that NFFs were more apparent on modern computerized flight
decks when compared to traditional analog-based systems, leading to a requirement for
better diagnostics.

3.4.3 Aircraft Maintenance Manuals


This part of the survey was focused on the provision and content of maintenance support
manuals. The use of maintenance manuals is an essential and mandatory requirement,
and engineers must rely heavily on troubleshooting guides located within the manuals to
diagnose reported faults quickly. A case can be made that the availability of manuals, and the quality and comprehensiveness of their technical content, to some extent restricts the ability of AMEs to do this. Worryingly, AMEs also reveal that they are compelled to use personal
experience to supplement the maintenance manual when undertaking fault diagnosis. Many
state as the reason for this that manuals fail to provide sufficient information to diagnose
faults. The evidence also indicates that manuals are not always available when required.

Given that the regulatory requirements state that all aircraft maintenance work be conducted
in accordance with specific manuals, the evidence presented poses significant issues. Data
surrounding the availability of manuals and their ability to help diagnose faults is presented
in Figure 3.5.

Figure 3.5 is split into four sections corresponding to rates of NFF ranging from less than
10% up to rates greater than 50%. In each of the four sections, the availability of manuals is
categorized as full (high availability) through too few (low availability). For each of the four
NFF rates, a measure on how well the manuals support fault diagnostic tasks is included.
For all rates of NFF, the majority of respondents state that manuals “mostly diagnose” faults,
with minimal responses stating that they “always” diagnose reported faults. Following
this, it was investigated as to whether the provision and technical content of maintenance
support manuals provide the user with sufficient information for accurate diagnostics. An
overwhelming majority of respondents indicate that the use of manuals alone could not be
relied upon. Therefore, individual expertise would have to be called upon when undertaking

diagnostic tasks. Failure to ensure accurate diagnosis, being difficult to follow, and being too time consuming to use are the significant reasons why AMEs prefer to use expertise instead of manuals when undertaking diagnostic tasks.

[Figure 3.5 (grouped bars): availability of manuals (full, most, some, few) and the ability of manuals to diagnose faults (always diagnose, mostly diagnose, mostly not diagnose), plotted against the percentage of tasks resulting in NFF (<10%, 10–30%, 30–50%, >50%).]

Figure 3.5 The use of maintenance manuals for fault diagnosis.

Serious implications arise when inadequate fault isolation manuals lead to their lack of
use, in favor of expertise being used instead. It is possible that unauthorized maintenance
practices will be adopted, leading to a range of adverse implications, including flight safety
issues. Less-experienced AMEs, who lack expert system knowledge, rely more heavily on
manuals to guide them when undertaking corrective tasks. In this case, the deficiencies in
manuals may lead to prolonged maintenance times and an increase in NFF occurrences.
The practical implications of solving this issue and mitigating the problems exposed in
these findings are potentially significant. Aircraft manufacturers may need to review their
approaches to the design and production of maintenance manuals. They must ensure that
documents always provide sufficient information for the maintainer to diagnose all fault
symptoms that are generated by systems. At a local level, operators must ensure that a full
range of manuals is available when required for use.

3.4.4 Organizational Pressures


The survey was conducted by asking respondents, all of whom worked in line maintenance
jobs, a series of questions to ascertain the time and organizational pressure they were under
in their regular job. It focused on corrective maintenance actions involving fault diagnostic
tasks. Analysis of the findings, which are presented in Figure 3.6, shows the responses for a
number of different issues.

[Figure 3.6: number and percentage of responses (never, rarely, occasionally, mostly, always) for: having sufficient time to diagnose faults; deviating from procedures to fix faults on time; pressure influencing ability to diagnose faults; use of manuals restricting ability to diagnose.]

Figure 3.6 Effects of time and pressure.

The engineering perception of time available to undertake fault diagnosis tasks is plotted
as the solid columns. Approximately 65% (124 responses) of engineers believe that, on most
occasions, they have sufficient time to do their job, but 16% (31 responses) believe they mostly
do not. The issue of available time was raised during the case study interview, and drew the
following response: “time is always limited because the aircraft, with or without the fault,
has got to fly the next day to make money, so you’ve only got a short period.”

Other pressure-related characteristics are plotted as the three curves in Figure 3.6. The
data show:

• 40% of engineers occasionally deviate from procedures to fix faults on time


• 30% of engineers occasionally feel pressure that influences their ability to diagnose
faults
• 20% of engineers need to use manuals, which occasionally restricts their ability to
undertake accurate diagnosis (e.g., when the manual is out of date)

Respondents were also asked to select the most significant factors that lead to a lack of time
to accurately diagnose reported faults. These results are presented in Figure 3.7.

Each available option has drawn a significant response, but most engineers quote manpower
shortages to meet the workload and lack of availability of equipment as the primary reasons.
Content analysis of the qualitative survey responses revealed other recurring themes that
lead to lack of time to diagnose faults, and includes the following:

• Shortage of aircraft spare parts
• Not fully trained on all systems/equipment

[Figure 3.7 (pie chart) response categories: aircraft/system complexity; inaccurate maintenance procedures; unavailability of tools/test equipment; manpower shortages/high workload; other.]

Figure 3.7 Factors leading to lack of time for accurate diagnosis.

The following response from a case study interview highlighted issues that can lead to a lack
of time:

“You may not have the spares that you’d like to substitute in
to see if there’s a problem, or the aircraft may be in a position
that’s preventing some other air work being done on other
aircraft and have to be moved.”

Even though organizational pressures are an accepted industry problem, their impact on
the NFF phenomenon should not be ignored. When faced with lack of time, AMEs may
be inclined to go for a “quick fix” that involves replacing a component that is most likely
to clear the reported fault, as opposed to following a course of action that involves the
use of all available resources to undertake a thorough fault diagnosis. In many cases, this
shortcut approach may cure the reported problem, but when it fails to do so, it results in an
additional NFF arising (the documented consequences associated with this were outlined
in chapter 2). It is the responsibility of line management and other support functions
within an organization to manage this and implement strategies to ensure that negative
impacts are minimized.

3.4.5 Maintenance Engineer: Competence and Training


3.4.5.1 Competence
Legislative requirements often determine the need for different qualifications in each sector.
In the UK, all engineering institutions follow the UK-SPEC (UK Standard for Professional
Engineering Competence) standards as guidance to the level of competence necessary to
meet business demands. These advise on “the breadth and depth of underpinning
scientific and mathematical knowledge to permit understanding and skills appropriate to
applying engineering principles to existing or future engineering problems and processes”
[3-11]. However, the focus of current training courses on “use a method” type approaches
needs to change toward a greater understanding of the methods themselves. Current NFF
investigative approaches do not reflect the depth or academic appreciation required to solve
the problem. The survey results, shown in Figure 3.8, demonstrate these arguments, which
are evident across the UK aviation industry.

[Figure 3.8 (pie chart) response categories: lack of training; infrequent use; too complex; equipment unserviceable; little confidence in functionality; other.]

Figure 3.8 Reasons for lack of competence.

The majority of respondents stated infrequent use of TME as the main reason for lack of
competence, while a significant proportion (34%) believed lack of training was to blame. A
training-related theme was also prevalent in free text responses to this question:

“…lack of education in the industry”

“...need recurrent training”

“…not trained in electrical test equipment”

3.4.5.2 Training
The purpose of the training section of the survey was to understand the level of training and
professional development that engineers have undertaken with respect to aircraft systems
maintenance, and the use of testing resources to support such maintenance. Engineers
were also asked if they would benefit from additional training and the type of training they
would require.

The range of training received and categorized for each of the military, fixed wing
(commercial aircraft), and rotary sectors is shown in Figure 3.9. Approximately 95% of AMEs
across all three sectors indicated they would benefit from additional training. The range of
training received on maintenance of aircraft systems and on test equipment follows a similar
pattern across all sectors. A significant proportion of AMEs indicated they had received
“very limited” training, particularly on test equipment.

[Figure 3.9: range of training received (on all, on most, very limited, none) on aircraft systems maintenance and on test equipment, and the percentage who would benefit from more training, broken down by military, fixed wing, and rotary wing sectors.]

Figure 3.9 Training data overview.

The level of training received on aircraft systems maintenance and testing resources, shown
by the blue and red curves against the left hand vertical axis, follows a similar pattern across
all sectors. The range of training received peaks at “on most” for both aircraft systems and
test equipment in each sector, with responses in the region of 45% to 60%. A significant
proportion of engineers indicated they had received “limited” training, particularly on test
equipment. This peaked at 37% in the rotary wing sector.

In the case study interview, an engineer was asked to explain his experiences of training
courses in relation to the aircraft systems and test equipment. His response included the
following: “you don’t get any specific training on test equipment, you’ve got to pick it up
yourself and work it out for yourself, so normally you don’t do it until you need to do it and
so it’s done and you have a few hours with the aircraft, how do we do this? It’s not the best.”

An overwhelming majority revealed that they would benefit from additional training to
support them in the execution of their maintenance duties. Many AMEs feel as though they
receive very limited training on off-aircraft TME and integrated aircraft OMS, and this
inhibits their ability to undertake diagnostic tasks. It is, however, generally accepted that the
quality of training received was of a high standard.

Around 95% of respondents, across all three sectors, indicated that they would benefit from
additional training. The type of training is broken down in Figure 3.10.

[Figure 3.10 (pie chart) response categories: aircraft system operation and functionality; onboard maintenance systems/BITE; use of manuals; use of TME; other.]

Figure 3.10 Additional training needs.

A large number of engineers revealed they would benefit from training on testing
aspects—69 opting for on-board maintenance systems (or BIT) and 105 on the use of
equipment—while more than 100 individuals believed they would benefit from system
operation and functionality training. When probed about the benefit of additional training
during a case study interview, one engineer did not believe he would benefit from this, but
he did raise an interesting point regarding learning from experience: “I suppose if you
had to, you could experience some sort of way you can learn what some other operator
experienced: what happened here and this is how we got round it.”

3.5 Best Practice Guidelines


The research work presented in this chapter so far was focused on discussing the complex
nature of maintenance systems and how human influences can contribute to the NFF
problem. Identification of the shortcomings can lead to a series of best practice guidelines
that can help with NFF management at the AME level. Issues that may involve technical
redesign or hardware improvement and modification strategies have not been considered
and will be dealt with in later chapters.

The conclusions from consideration of maintenance engineers and their interactions with
complex systems are as follows:

• AMEs must be provided with a locally managed database that specifically tracks NFF occurrences and impacts. The purpose of this is to record specific NFF events that will contain metrics, including maintenance man-hours spent and components replaced, to allow AMEs to learn from experience, and historical data that will enable management to interpret the cost of NFF with greater accuracy (a minimal sketch of such a record structure follows this list).
• Tuition on the use and functionality of specialized test equipment must be incorporated
into existing training provision, such as aircraft type training courses, or EASA part
66 training. Additionally, operators must ensure that key individuals attend dedicated
equipment training courses.
• Aircraft type training courses must be reviewed continuously to ensure that the use
and functionality of on-board maintenance and BITE systems is addressed in sufficient
detail to equip AMEs with the skills to diagnose system faults accurately.

• Although not discussed here, implementing quarantine procedures can reduce LRU
repair costs by ensuring that only LRUs requiring repair enter the repair cycle.
• AMEs need to have access to as wide an experience and knowledge resource as possible
to learn from others who operate the same aircraft type. This suggests the need for
a cross-industry supported best-practice sharing portal, which is dedicated to fault
diagnosis and corrective maintenance. This portal would be a useful step forward in
NFF management. The portal would be specific to a particular aircraft type and be
freely shared with all operators of that aircraft type. Its purpose would be to detail
specific aircraft system faults and the maintenance activity undertaken that repaired
the reported fault. AMEs faced with ongoing NFF issues could access information and
possibly be able to find a solution to their problem based upon events that occurred
elsewhere.
• NFF awareness training centered on the human factors domains is needed and should
be introduced into the existing EASA part 66 syllabus or other continuing professional
development programs.
• Airline operators and maintenance organizations must undertake periodic training that
is based on a needs analysis process, centered around AME interaction with aircraft
systems and their associated maintenance. The purpose of this is to identify skill gaps
within the AME population and provide the necessary training.
• Introduction of a maintenance manual feedback system, particularly focusing on
fault isolation manuals, is also needed. The purpose of this is to allow AMEs to report
on shortcomings in such manuals that are resulting in NFF, and to use the system
to instigate a fast-track amendment process. This system may be linked to existing
feedback mechanisms used by the industry to identify and resolve deficiencies in
manuals.
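
As a sketch only of the locally managed NFF database recommended in the first guideline above (the table and column names are illustrative assumptions, not a prescribed schema), something as simple as the following would already capture the metrics mentioned:

import sqlite3

# Minimal, illustrative schema for tracking NFF occurrences locally.
SCHEMA = """
CREATE TABLE IF NOT EXISTS nff_event (
    event_id           INTEGER PRIMARY KEY,
    occurred_on        TEXT NOT NULL,   -- ISO date of the maintenance action
    aircraft_type      TEXT,
    system_affected    TEXT,
    reported_symptom   TEXT,
    component_replaced TEXT,            -- part number, if any
    man_hours_spent    REAL,            -- maintenance man-hours consumed
    shop_finding       TEXT             -- e.g. 'NFF' or 'confirmed fault'
);
"""

def record_nff_event(db_path, event):
    """Insert one NFF event (a dict keyed by the column names) into the local database."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(SCHEMA)
        conn.execute(
            "INSERT INTO nff_event (occurred_on, aircraft_type, system_affected, "
            "reported_symptom, component_replaced, man_hours_spent, shop_finding) "
            "VALUES (:occurred_on, :aircraft_type, :system_affected, "
            ":reported_symptom, :component_replaced, :man_hours_spent, :shop_finding)",
            event,
        )

# Example usage with made-up values:
record_nff_event("nff_local.db", {
    "occurred_on": "2015-03-02",
    "aircraft_type": "narrow-body",
    "system_affected": "air data",
    "reported_symptom": "airspeed disagree on climb",
    "component_replaced": "PN-100",
    "man_hours_spent": 3.5,
    "shop_finding": "NFF",
})

Aggregating such records over time is what would allow management to put a cost on NFF with greater accuracy, as the guideline intends.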

3.6 Conclusion
This chapter captures an understanding of the human aspect that can cause the NFF
phenomenon. NFF events are often subjected to miscellaneous interpretations whenever
organizations and individuals describe the problem. Therefore, before NFF management can
be considered, there is some merit in understanding the role that humans play throughout the chain of events.

To summarize the output of this chapter, if we seek to minimize the human influence within
the NFF phenomenon, the main areas of focus are:

• Staff communication skills should be honed through training on processes/procedures.
• Continuous feedback should be provided to maintainers and system operators.
• Formal training to develop an understanding of NFF should be provided.
• Improvement in fault reporting procedures for hidden faults (e.g., there should be
further tracking of fault information, reporting the tests carried out by maintainers on
a unit, or operational conditions of the fault, and sending this information to the repair
workshop to give it a better chance to identify and fix the fault).
• Maintainers would benefit from further support on complex fault isolation tasks,
correction of the shortcomings of fault manuals, and assistance with issues concerning
night work.

3.7 References
3-1. Burke, M. J. Applied ergonomics handbook. Lewis Publishers, Ann Arbor; CRC Press, Boca Raton, FL, 1991.

3-2. Dhillon, B. S. “Human factors in maintenance and maintainability.” Human reliability: with human factors, 139–152. Elsevier, 2013.

3-3. Knotts, R. M. “Civil aircraft maintenance and support: Fault diagnosis from a business perspective.” Journal of Quality in Maintenance Engineering 5, no. 4 (1999): 335–348.

3-4. Khan, S., P. Phillips, I. Jennions, and C. Hockley. “No Fault Found events in maintenance engineering Part 1: Current trends, implications and organizational practices.” Reliability Engineering & System Safety 123 (2014): 183–195.

3-5. Wise, J. A., V. D. Hopkin, and D. J. Garland, eds. Handbook of aviation human factors. CRC Press, 2009.

3-6. Sauer, J., G. R. J. Hockey, and D. G. Wastell. “Effects of training on short- and long-term skill retention in a complex multiple-task environment.” Ergonomics 43, no. 12 (2000): 2043–2064.

3-7. Brombacher, A. C., P. C. Sander, P. J. Sonnemans, and J. L. Rouvroye. “Managing product reliability in business processes ‘under pressure.’” Reliability Engineering & System Safety 88, no. 2 (2005): 137–146.

3-8. Hawkins, F. Human Factors in Flight (2nd ed.). Aldershot, UK: Ashgate, 1987.

3-9. Kim, C., and H. Christiaans. “‘Soft’ usability problems with consumer electronics: the interaction between user characteristics and usability.” Journal of Design Research 10, no. 3 (2012): 223–238.

3-10. Pickthall, N. “Contribution of Maintenance Human Factors to No Fault Founds on Aircraft Systems Engineering.” Proceedings of the 3rd International Conference in Through-life Engineering Services, November 4–5, 2013, Cranfield, UK.

3-11. Engineering Council. UK standard for professional engineering competence: Third Edition. Available at http://www.engc.org.uk/professional-qualifications/standards/UK-SPEC. 2014.

Chapter 4
Availability in Context

4.1 Introduction
The operational performance and efficiency of a system is affected by the amount
of maintenance it requires, a consequence of its design and the faults and failures
encountered in operation. Maintenance, therefore, has an impact on both safety and
operational availability. Depending on the nature of the system’s task, more emphasis
is placed on either of these. For example, with military aircraft in a wartime situation,
safety is of course important, but availability is vital and will at times have to supersede
safety in the interests of winning the battle. For commercial aircraft though, safety
is paramount, and availability is always subservient to the requirements of safety.
However, a commercial operator loses revenue and reputation from delays and
cancellations; this can encourage a culture of replacing anything and everything to
solve a fault that might risk a delay or cancellation. Safety is preserved, but such activity
will almost certainly generate NFFs in the components removed at the next level of
repair, as some will have been removed speculatively in a blanket act of maintenance to
ensure the fault is remedied. While these examples are both taken from aerospace, the
principle applies to any system in operation, whether it be another form of transport or
systems such as manufacturing machines.

The ability for a system to achieve its required operational capability, from either
a safety or availability viewpoint, is known in some areas as system effectiveness.
This concept embraces both safety and the constituent parts of availability, namely
reliability, maintainability, and logistics support. One must also recognize that by
using operational capability, we include the performance attributes of the system. The
availability part of system effectiveness will be covered in this chapter, while the safety
aspect will be addressed in the next chapter.

Looking back at the definition and taxonomy of No Fault Found provided in chapter
2, we see that NFF is concerned with difficulties in identifying the symptom-to-cause
relationship and identifying the specific location of the fault within a system. NFF therefore
has an impact on the time taken to return the system to a serviceable condition, and thus
can be a significant maintenance issue. If the system is returned as serviceable without
the fault that caused the original problem being identified, there is a potential safety issue
and every likelihood of the system failing again. If a component in the system is removed
unnecessarily or is removed several times for the same issue, then there is a significant
impact on availability.

In this chapter we explore the relationship between NFF and availability, with a focus on its prime components of system reliability and maintainability. We begin by setting the scene
and describe the maintenance environment for civil aerospace, and how these practices
are changing as the industry moves further toward service-orientated business models.
Again, for simplicity, we use aerospace to provide context for the discussion and to provide
examples, but hopefully the reader can interpret the principles and apply them to other
operating environments. The core concepts of reliability, availability and maintainability
(RAM) are discussed and linked by investigating the equation for operational availability.
Taking this equation, the impact of NFF on operational availability is demonstrated
and shown to be a problem rooted in the reliability and maintainability domain. This
emphasizes that many NFFs are indeed symptoms of poor design and/or test, which are, likewise, primary root causes of NFF. Finally, a process for monitoring in-service NFF rates for the
purpose of tackling NFF at the design stage, as part of a given improvement methodology, is
outlined together with some examples.
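
For reference, a standard textbook form of the operational availability equation (quoted here as a common formulation, not necessarily the exact expression developed later in the chapter) is

\[
A_o = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}} = \frac{\text{MTBM}}{\text{MTBM} + \text{MDT}}
\]

where MTBM is the mean time between maintenance and MDT is the mean maintenance downtime, including logistics and administrative delays. Anything that lengthens downtime, such as the repeated removals and retests associated with NFF, therefore reduces A_o directly.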

4.2 Aerospace Maintenance Practice


Aerospace maintenance practice provides one of the most safety-conscious and highly regulated environments in which to investigate the NFF relationship with availability. Maintenance
programs for key systems such as avionics, engines, and landing gear are made up of several
types of maintenance policies such as preventive, corrective, and on-condition maintenance.

All these policies will provide for redesign, if necessary to solve fleet-wide issues, and
will result from a growing amount of evidence based upon the actual usage of the system.
The process should ensure that engineering modifications are made to address safety or
reliability issues that were not anticipated in the original design. Ideally, information from
maintenance activity, performed by all users of the system, should feed back into any
redesign activity.

Figure 4.1 shows the relationship over time between uptime and downtime: the operational tasks carried out by a system, usually preceded by some waiting or standby time. Interspersed will be periods
of corrective maintenance for faults and failures and also periods for scheduled preventive
maintenance. These periods may well require very different levels of resources, varying
from single individuals to a team of several engineers.


Figure 4.1 Uptime and downtime for various operational states. [Figure: a timeline in which uptime comprises operational tasks and standby, and downtime comprises corrective maintenance and preventive/scheduled maintenance.]

Much of the major aircraft maintenance and repair work is out-sourced to service providers
who carry out MRO operations on behalf of the aircraft operators. Operators will not, and
legally cannot, compromise the safety of their aircraft and passengers. So they will look for
the optimal combination of affordability, expertise, flexibility, and the ability to offer the best
solutions, when faced with the choice of MRO provider.

MRO activity is expected to experience dramatic worldwide growth over the next couple of decades [4-1]. Currently, the European market holds a 26% share of the worldwide civil MRO business, compared with the 39% held by North America. However, several hurdles must be overcome by these established MRO providers if they are to maintain their lead in the global marketplace:

• Growing competition from the Middle East


• Greater competition from OEMs who seek to provide a whole-life supply and
maintenance service contract
• Continued pressure from airlines to reduce costs

These are forcing changes in the global aviation maintenance industry. MRO providers
are expanding their geographical reach and capabilities in a bid to become regional and
global full-service providers. At the same time, overall spending on MRO is expected to increase universally as fleets age, despite the pressure from airlines to reduce costs. Additional
spending will increase expectations and pressure for service value, including achieving first-
time-fix, which therefore highlights again the need to eradicate NFFs.

The demand for more predictive maintenance strategies is growing, with higher
reliability and practical solutions to complement out-sourced maintenance repair work
[4-2]. To further drive cost reductions, airlines are also seeking to incorporate ever more
sophisticated maintenance management solutions into their aircraft. At the same time,
they are reducing required investment in inventory to reduce committed capital and help
improve their business operations and operational availability—all of which are affected
by the NFF problem.

Reducing NFF, therefore, becomes a core part of any change in maintenance strategy
for operators and the service solutions that MRO suppliers can provide. This will help
in reducing the levels of corrective and unexpected maintenance, and hence optimize
maintenance effectiveness on aircraft fleets. The changing face of the aviation industry
requires that maintenance management become increasingly tailored toward individual
customers’ needs with cost-effective solutions being found, offering compromises between
customer involvement and the level of commitment required from the providers. Figure
4.2 shows a matrix with different maintenance solutions and the level of commitment and
partnerships required by the operators and MRO providers for a variety of maintenance
contract types.

In Figure 4.2, the relationship between the level of MRO support that is needed versus the
involvement from aircraft operators, for various maintenance contract types, is shown.
Involvement is represented as low to high for MRO support and high to low for aircraft
operator involvement. Traditional maintenance evolution has been along the diagonal
from bottom right to top left. Time and materials, equating to “fix when broke,” involves
low (reactive) participation from both operator and MRO provider. Predictive maintenance
significantly improves this state by being more proactive, and gets both parties more
involved. CBM (top left) represents, for a number of companies, a highly desirable business
model for proactive maintenance, but demands high commitment from both sides. Other
possibilities are also shown in Figure 4.2, including through-life support. This is where an
MRO provider contracts to maintain an asset through its life, taking total commitment from
the MRO and low to medium involvement from the operator [4-3].

Figure 4.2 MRO support vs. operator involvement for maintenance contract types. [Figure: a matrix plotting MRO support (low to high) against aircraft operator involvement (high to low); contract types range from time and materials, preventative actions, customized payment schemes, and predictive maintenance up to all-inclusive overhauls, condition-based maintenance policy and condition monitoring, and through-life support.]

4.3 The Quality of Maintenance Systems


The concept of maintenance is all about inspecting and servicing equipment. Its purpose is to
ensure that the equipment can be put back into service, in a fit condition, to last until the next
maintenance intervention. The term “maintenance” is associated with finding failures and
repairing them. Jack Hessberg, Boeing’s Chief Mechanic, recognized that most maintenance
is reactive and was quoted as saying: “maintenance is nothing more than the management
of (faults) and failures” [4-4]. Maintenance can therefore be considered as the management of
deterioration. Here, it is important to note that this management of deterioration, faults, and
failures, is driven by the consequences of the occurrence of these conditions.


As stated at the beginning of this chapter, system effectiveness concerns both safety and availability, and there is a large difference between civil and military applications. This produces
subtly different cultures and behaviors between these two groups. The need to achieve
dispatch reliability for an airline will be paramount, as the economic consequences of
delays, or worse, cancellations, can be damaging. Consequently, the maintenance staff will
do almost anything to achieve the minimum delay when faced with a failure at the gate.
The desire of many airlines is naturally to minimize delays at the gate, and if this means
changing three boxes rather than carrying out the diagnostics to find the root cause of
the failure and the exact box at fault, assuming this can be done, then three boxes will be
changed. In peace-time operations, this culture in the military would be unusual, as there
is rarely such a time pressure and particularly now that so many civilian companies are
providing the support. In actual operations, however, where battle-winning availability is vital, the same culture will prevail in order to maintain availability.

A second factor is also at play here; it underlines maintenance as part of the safety system.
Civilian airliners are built to fail-safe principles in which every system and part of the
design is meticulously analyzed for the consequences of its failing. Should that failure occur, there must be an alternative load path or alternate system to provide redundancy.
Military aircraft, however, are built to safe-life principles, in which maintenance is a key
factor in providing the early warning of failure before it is catastrophic.

Various maintenance techniques are increasingly used and incorporated into the design of
both military and civilian aircraft to provide maintenance assistance. The groundwork for this begins with defining the maintainability goals. These must include quantifiable measures for
the repair process, as well as a quantitative description of the way in which repairs are to
be undertaken. For example, it is necessary to determine which components will need to
be repaired rather than discarded or replaced, the level of repair, test equipment and skills
required, and the maintenance scheduling. There will always be a financial trade-off while
carrying out these activities, reflecting the resources needed to support maintenance. In
recent years, globalization and technological innovation have changed the way a system
is maintained, with techniques such as condition monitoring, built-in tests, and integrity
monitoring systems, which offer maintenance staff early indications of both impending
and actual failures. Communication and documentation are now being recognized as
vital elements for successful troubleshooting and cost justification. Achieving near zero
downtime while integrating more sophisticated health management systems into existing
designs introduces new challenges that affect cost, design periods, availability of experts, etc.
To the customer, what matters is that all operational equipment remains functional. This is the key to avoiding unnecessary downtime and helps in making informed decisions to avoid (or
mitigate) the consequences of a failure.

However, no matter how well a maintenance system is designed, there is always the
possibility that it will contain deficiencies (due to decisions and trade-offs in design) that
can lead to difficulties in the quality of maintenance in service. Therefore, it is necessary
when any maintenance system is designed, and before it becomes operational, to thoroughly
test it to identify any potential problems. The ultimate responsibility for recognizing,
interpreting, and compensating for the deficiencies of the maintenance process rests with
the human maintainers. These maintainers are fallible, and maintenance-induced issues arise often enough to be statistically significant (i.e., they do not get it right all of the time). Other causes can be
identified, including poor design of human tasks, poorly perceived maintenance operating
procedures, and inadequate training, as well as the pressures of the job. Even by trying to
understand the physical behavior and mechanisms of these errors, it is unrealistic to believe
that they could be eradicated entirely. The operating conditions, and the maintenance
environment itself, are always subjected to unpredictable fluctuations that will have
unforeseen consequences—often leading to NFFs.

4.4 Design for Maintenance and System Effectiveness


When complex equipment or systems are designed, engineers typically identify the potential
failure modes and their effects on the system using a failure modes and effects analysis
(FMEA). An FMEA is based on experience with the last design, and with similar equipment,
but it cannot truly represent a new asset that has not yet been exposed to operational use. Indeed,
the FMEA will be updated throughout the asset’s lifetime, as field data become available.
When the equipment enters service—as the “real world” imposes itself—some faults that
were anticipated actually happen, and some will never happen. The FMEA will never be
truly accurate, as it is based on predictions, but it is essential in determining how best to:

• Employ on-board diagnostic technologies (e.g., built-in tests) to detect failures


• Implement prognostics and health management
• Implement condition monitoring strategies
• Implement trend monitoring to detect potential failures (impending functional failures)

These strategies assist in preparing troubleshooting procedures, in advance, for analyzing


the functionality of the system. They can help differentiate among the many possible root
causes of anticipated failures.

A fraction of the theoretically possible failure modes will make an appearance—and it is


those that are of most interest in NFF. The weaknesses in a piece of equipment will become
known during its operation. Things that fail on one aircraft are more likely to fail on another
aircraft of the same design, operated in similar conditions. But most importantly, some
real-world faults were not anticipated by the design engineers, and therefore the traditional
diagnostic systems may not be able to detect them [4-5].

Incomplete understanding of a system design can result in deficient interfaces between the
maintenance actors and the maintainable equipment. This problem, in essence, arises from
a disparity between those predictions in the FMEA and what is actually now happening
during operation in terms of faults and operating conditions. It is essential that maintenance
react to this actual experience by modifying the test procedures, fault isolation manuals, and
FMEAs. This relationship is illustrated in Figure 4.3. If an event was not foreseen during
the design stage, then the necessary fault isolation tools will not be available, thus making
diagnosis difficult, unless a correct diagnosis occurs largely by chance, and increasing the potential for NFFs.


Figure 4.3 Anticipated faults vs. actual faults.

The value of field experience in reducing NFFs relies on developing practical means both to deploy field experience and to capture it for feedback to design engineering. Completing the service feedback loop in its entirety is the basis for through-life engineering [4-6], in which real-world experiences are incorporated in the design knowledge base as efficiently
as possible. Technicians around the world are discovering novel causes of failure on a daily
basis—many of which are identified as a direct result of NFF investigations. Field experience
needs to be shared by inserting it directly into the troubleshooting workflow so that others
will identify the cause of the problem on their first attempt, whenever or wherever it next
occurs. Furthermore, field experience can assist design engineers in improving the reliability
of the next design of the equipment. At the core of the challenge to better troubleshoot and
reduce NFF is closing the gap between anticipated failures and the real failures that appear
in service.

4.5 Availability
4.5.1 The Multiple Facets of Availability
Chapter 2 presents commonly accepted definitions for reliability and maintainability.
Availability, however, poses a different challenge. Several (mathematical) definitions of
availability have been proposed. Understanding these definitions is critical to seeing the influence of NFF and to determining the entity responsible for addressing NFF's impact on availability.

Availability is the key requirement for any manager of an organization or resources. The
manager is responsible for the delivery of a capability or an output, and the relevant
equipment must be available when needed. No matter how capable an asset is, and how
much potential it has, if it does not work when needed, it is virtually useless.


The definition of availability from chapter 2 is:

• Availability is the probability that the system or equipment used under stated conditions
will be in an operable and committable state at any given time.

It is often expressed as a percentage: the time the equipment is available for use as a proportion
of the total time. It is a function of operating time (the reliability of the equipment) and the
downtime, which is affected by the maintainability and also the supportability. The support
provided, such as tools, test equipment, spares, and resources can also have a large impact on
availability, which is covered later. These aspects are often referred to as the logistic support
or merely logistics, but they are more correctly included under supportability. It is important
to remember that the definition of availability includes various elements. The first is “under
stated conditions.” This recognizes the different environmental conditions that can affect
availability, such as geographic area of operation and the usage of the system. The next element
concerns probability of success, which reflects the operational consequence of failure, such
as what effect there will be on other systems or platforms [4-7]. Modern commercial aircraft
now have built-in redundancy in electronic systems, in particular, such as reversionary modes
which allow the system to revert to a basic level of operation if a fault occurs. This also applies
if the system needs to use spare capacity from another system, thus limiting the consequences
of faults and failures. The probability of success also considers the other available systems
and, most importantly, the supportability and logistic impact or consequences of failure.
The last element concerns time and reinforces the difficulties of measuring availability, as
availability will not be constant over time. The definition says "at any given time," but this has to be tied in with the word "committable," indicating the period of standby before
the required mission starts. The equipment is assumed to be in full working order and ready
for the mission, and thus to be available for use at a constant level going forward. However,
once in use, the availability level will fluctuate. Consequently, availability metrics are usually
measured as functions of time, and in their simplest form, they are measured as a ratio of
available time to total time. We call the available time “uptime,” and the remaining time is
therefore “downtime,” or the time the equipment is unavailable (Figure 4.1). Uptime can be
viewed as reliability and downtime as time to repair. If we consider the downtime as the time
to repair and the uptime as the time it is working between failures, we can convert uptime and
downtime to MTBF and MTTR (at the vehicle level). Consequently, as shown in Equation 4.1:

A = Uptime / Total Time = Uptime / (Uptime + Downtime) = MTBF / (MTBF + MTTR)    (4.1)

We now need to be more precise because both downtime and MTTR can mean different
things to different people. Does MTTR just mean the active repair time, which is the time
the equipment is down and being worked upon? Or does it mean the total time including
waiting for manpower, tools, test equipment, or spares? To address this question, we need to
consider three forms of availability:

• Inherent Availability (Ai)


• Achieved Availability (Aa)
• Operational Availability (Ao)


Inherent or intrinsic availability recognizes that only the active time for repair or
maintenance is to be considered. Instead of the imprecise term MTTR, we use a more
precise term, mean active repair time (MART) or mean active corrective maintenance time
(MACMT). These terms accept that it is only the active times that the designer can influence.
For example, the designer can design in maintainability features to make access and
maintenance easier, and improve diagnosis features. Ai can be calculated using Equation 4.2.

Ai = Uptime / (Uptime + Active Repair Time) = MTBF / (MTBF + Mean Active Repair Time)    (4.2)

Achieved availability recognizes, under ideal support conditions, that scheduled


maintenance will be needed and will reduce availability. It also recognizes that the designer
can only influence the active time taken on both corrective and scheduled maintenance,
and thus includes the term mean active preventive maintenance time (MAPMT) as well as
MACMT. This still assumes an ideal support environment in which everything, including manpower, tools, test equipment, and spares, is immediately available. What the
user is really interested in is actual operational conditions, not ideal conditions, and thus
this definition is included for completeness.

Operational availability addresses this situation. Unlike the ideal case, which takes no account of standby periods or of logistic and administrative delays, we must now consider all things that can cause delays, including time to complete paperwork, time
waiting for manpower, tools, test equipment and, of course, time waiting for spares. These
are all combined into a term called administrative and logistic delay time (ALDT). Uptime
must also recognize the time spent on standby, and so operational availability is determined
by Equation 4.3.

Ao = (OT + ST) / (OT + ST + TPM + TCM + ALDT)    (4.3)
where:

OT = operating time

ST = standby time

TPM = total preventive maintenance time

TCM = total corrective maintenance time

ALDT = administrative and logistic delay time

So, the uptime is made up of the sum of operating time (OT), when the equipment is actually being used for its intended function, and standby time (ST), when the equipment is in a state where it can be used when necessary. The downtime is also made up of several distinct elements, as shown. It includes the total corrective maintenance (TCM) time, which is
the time spent rectifying faults/failures that have occurred; the time taken for scheduled
maintenance, which is the total preventive maintenance (TPM) time; and the total
administrative and logistic delay time (ALDT), as discussed above.
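
To make the relationship between these time elements concrete, the short Python sketch below computes inherent availability (Equation 4.2) and operational availability (Equation 4.3). It is purely illustrative; the function names and the example figures are assumptions made for this sketch, not data from any particular fleet.

def inherent_availability(mtbf_h: float, mart_h: float) -> float:
    """Equation 4.2: Ai considers only the mean active repair time (MART)."""
    return mtbf_h / (mtbf_h + mart_h)

def operational_availability(ot_h: float, st_h: float, tpm_h: float,
                             tcm_h: float, aldt_h: float) -> float:
    """Equation 4.3: Ao includes standby time plus all maintenance and delay time."""
    uptime = ot_h + st_h
    downtime = tpm_h + tcm_h + aldt_h
    return uptime / (uptime + downtime)

# Illustrative figures for one year of operation (hypothetical values).
ai = inherent_availability(mtbf_h=500.0, mart_h=4.0)
ao = operational_availability(ot_h=3200.0, st_h=5000.0,
                              tpm_h=150.0, tcm_h=120.0, aldt_h=290.0)
print(f"Ai = {ai:.3f}, Ao = {ao:.3f}")

Note how Ao falls below Ai once preventive maintenance and ALDT are included; this is the margin that Figure 4.5 later illustrates.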

ALDT is the time taken to obtain a spare or is the delay in waiting for the appropriate
personnel, manuals, tools, or test equipment to become available. ALDT is also quite
dependent on the human influence factors discussed in chapter 3, and therefore does not
lend itself well to prediction. The other category, active repair time, which is sensitive to
environment, technician skill level, procedures, etc., can also be further divided into the
following categories:

• Dismantling and access


• Fault recognition and detection
• Fault location and diagnosis
• Fault correction/repair
• System/repair verification
• Recovery and completion

The total length of time that a piece of equipment is down for maintenance when a failure
occurs will statistically vary from one failure instance to another. The downtime for a piece
of equipment will be predicted during the equipment’s design as part of the availability
requirements. However, if unknown failures begin occurring, such as those that can result
in NFFs, then there will be an impact on the predicted downtime. A repair process that is subject to NFF will generate a large number of relatively short repair periods, as in the case where components are replaced speculatively or where the wrong fix is applied to avoid categorizing the equipment as NFF. This results in the fault repeatedly returning, with similar "quick fixes" being applied; this issue is common when aircraft are maintained and operated at different bases. From the NFF perspective, ALDT will be heavily affected by these repetitive short repair periods, as an increasing number of replacement parts are removed from the spare parts pool and an increasing number of speculatively replaced units are sent back
through the repair loop. This type of repair will usually occur at the operational site on
detection of a fault symptom. A second alternative, which would occur at the maintenance
base, can happen when a NFF occurs and there is an insistence on locating the symptom-
cause before the equipment can be placed back into service. This would result in long repair
times, usually longer than predicted, as a direct result of the diagnostic difficulties.

Given this discussion, Figure 4.1 can now be modified, as shown in Figure 4.4, to show
the periods of corrective maintenance and preventive maintenance all containing varying
periods of ALDT.


Figure 4.4 Uptime and downtime for various operational states, including ALDT. [Figure: as Figure 4.1, with administrative and logistic delay time shown within the downtime periods.]

It is perhaps useful to highlight the importance of inherent availability when considering


the effect of NFF. Figure 4.5 shows the limit of designed-in availability, which is always eroded by the need for some repair and maintenance work.

Figure 4.5 The importance of inherent availability. [Figure: availability plotted against the cost of logistic support; Ai sits below 100% by the margin lost to active repair and maintenance work (the R&M "designed limit"), and Ao sits below Ai by the further margin lost to logistic delays.]

4.5.2 Design Requirements for RAM


While the user is really interested in operational availability, it must be remembered that
reliability, maintainability, and the other constituent parts of availability are crucial and
must be established for the economic and environmental success of systems. Some key
processes are recommended to achieve the best possible outcomes:

• Requirements definition is about defining and managing the requirements with clear
definitions for reliability, maintainability, and availability. They are key activities during
the design stage along with their specification and delivery by suppliers.


• Reliability improvement includes data collection, reliability analysis, and improvement


during design using planned Reliability Growth Tests (RGTs) and a specific Data
Reporting Analysis and Corrective Action System (DRACAS).
• Performance monitoring is about collecting and correlating performance with
operational usage under representative environmental circumstances. It will be
important to establish its parameters before delivery and during use.
• Design for ease and affordable cost of maintenance includes design of maintainability
features and effective test features both in the system itself and for maintenance.
Attention to design that makes maintenance both cost-effective and affordable is also
essential to reducing the cost of through-life support.

The design process for RAM is achieved using formal tools, such as FMECA and FTA. The
use of such tools is essential to avoid poor reliability and poor maintenance issues and, of course, supportability problems in service. It is not appropriate to go into these tools in any
more detail here, as there are many books on RAM [4-8], [4-9]. Nevertheless, data are required
for the management and reduction of NFF, and the feedback of data into these RAM tools is a
significant factor in achieving the improvements needed during design [4-10], [4-11].

A framework for RAM and supportability (RAMS) requirements in the context of the total
system is illustrated in Figure 4.6.

Figure 4.6 System operational requirements. [Figure: system requirements break down into operational availability requirements, performance requirements, and constraints (money, time, weight, space); the availability requirements in turn break down into reliability requirements (mean time between failures), maintainability requirements (mean time to repair), and supply effectiveness requirements (tolerable risk of shortages).]

Once a set of preferred system requirements has been specified, it can be broken down
into operational requirements and those features of the operational availability that will
be managed. It will also be necessary to consider specific requirements and any given
constraints, so that trade-off studies can be made to optimize the operational requirements.
At this stage, it is often very difficult to establish many of the requirements that will be
included in the ALDT metric, and hence difficult to specify the operational availability
requirement. Nevertheless, it is the customer's responsibility to specify the requirement for
operational availability, and for the OEM then to design-in effective RAM requirements to
deliver the inherent availability. Only then can the support solution be specified to deliver
the best ALDT solution.

In corrective maintenance, much of the time is spent on locating a defect that may require a
sequence of disassembly and reassembly. Being able to predict the time required to establish
fault location can be extremely difficult, and can increase downtime affecting the overall
operational availability of the system. The ability to automate this fault diagnosis, with
advanced technologies and techniques, can help to predict the downtime more accurately.
This helps to determine some of the operational availability parameters. However, the
prediction of the set of events that can occur as a result of a certain availability requirement,
or determining the system attributes that are required to meet a desired availability in
a “real-world” situation, is difficult. It is all the more difficult because once the system has
entered the real world, the operating environment and usage may be somewhat different
than that envisaged or specified at the design stage. If this is the case, the change of
operating conditions further complicates the ability to design effective methods to predict
the downtime more accurately.

4.6 The Impact of NFF on Availability


To discuss the impact that NFF has on availability, we first need to consider Equation 4.1. The
denominator, the sum of the uptime and downtime, represents a fixed time interval. This
can be considered and assessed over a logical period of, say, one year, with uptime and downtime expressed as fractions of that period. Consequently, the less time the system spends under repair (downtime), the more time the system is available for use (uptime), as described by Equation 4.4 and Equation 4.5.

Uptime + Downtime = 1    (4.4)

Uptime = 1 − Downtime    (4.5)

There is, at this time, no clear formulation on how NFF impacts availability as a quantitative
measure. Yet, it is felt intuitively by many that NFF will have some negative impact on
availability. For example, if we consider the fact that a fault requires an LRU to be removed
from the aircraft for test, but the test cannot confirm that there is a fault, then a NFF has
occurred at the shop level. The original fault either still exists on the aircraft, or perhaps on
refitting the LRU the whole system proves to be serviceable. In this case, a NFF situation has
occurred at the complete system level. Availability has suffered with the overall downtime
increasing, and the uptime has been reduced as described by Equation 4.5. However,
because we are dealing with both the overall system—the aircraft—and the removed system,
this additional downtime attributed to NFF for the removed system should perhaps be
considered separately from the other downtime factors, and will manifest itself in Equation
4.6 as the term dT_NFF.

Availability = Uptime = 1 − (Downtime + dT_NFF)    (4.6)


In this equation, dT_NFF represents any additional corrective maintenance and logistics time attributed to NFF on the removed system. Note that the downtime still includes an element
of NFF time spent on the initial diagnosis and LRU removal.
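
As a rough numerical illustration of Equation 4.6, the sketch below expresses downtime and the NFF term as fractions of a one-year reporting period. The figures are hypothetical and are chosen only to show how even a modest dT_NFF erodes availability.

HOURS_PER_YEAR = 8760.0

def availability_with_nff(downtime_h: float, nff_delay_h: float,
                          period_h: float = HOURS_PER_YEAR) -> float:
    """Equation 4.6, with times expressed as fractions of the reporting period."""
    return 1.0 - (downtime_h / period_h + nff_delay_h / period_h)

# Hypothetical example: 350 h of planned and corrective downtime, plus 120 h of
# additional delay caused by NFF removals, speculative swaps, and retests.
print(f"Without NFF: {availability_with_nff(350.0, 0.0):.3f}")
print(f"With NFF:    {availability_with_nff(350.0, 120.0):.3f}")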

In many cases, the removed LRU is replaced by a new LRU, provided that a replacement is in stock.
However, in a significant number of cases, and certainly when time or operational pressure
are not factors, a suspect LRU will be taken off the aircraft for intensive bench testing. Often,
this might be done to prove new complex fault analysis methods or to obtain more rigorous
levels of testing available at the shop level. This illustrates that NFF on removed systems and components has a potential negative impact upon operational availability, even where it might be thought that operational availability is not affected. In addition, associated costs of
man-hours, test equipment, facilities, and transport involved in this off-aircraft testing often
present hidden costs to the business.

In many businesses, the availability of a system is typically only measured as a factor of its
uptime and downtime, without really assessing or understanding its actual reliability. It is
clear though that, once the design is final, improving either reliability or maintainability
is very difficult. But often RAM problems are not identified until the entire system is
built, and users who are involved in trials and tests under representative conditions have
provided their feedback. At this stage, however, it is too late to make design changes purely
for RAM improvements. Hence, the problem of NFF, represented in downtime and by dT_NFF, does
not manifest itself until the equipment enters service. Consequently, the influence of NFF
on downtime does not become clear until it happens, perhaps well into the life cycle of the
system with all the attendant frustration and loss of availability. It is important to note that
ALDT is largely ignored at the design stage because so many factors are still unclear. Yet, it is
mainly ALDT that drives the level of operational availability and that usually has a greater
effect than corrective or preventive maintenance.

Having made clear that the designer can only influence the inherent or achieved availability,
it is instructive to look at the balance that the designer will consider. Figure 4.7 illustrates the
relationship between the requirements for inherent availability (Ai), reliability (denoted as
R), and maintainability (denoted as M). For a given inherent availability requirement, several
alternate combinations of R and M can be proposed. For example, in order to achieve a given 90% availability level, the requirement can be 4 failures per cycle with a corresponding downtime of 30 minutes per failure, or 10 failures per cycle with a downtime of 12 minutes per failure instance. Achieving this availability requirement will be
determined by investment within reliability (reducing the failure rate) or maintainability
(improvements in diagnostic and repair times).
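
This trade-off can be sketched numerically: for a target inherent availability, each assumed number of failures per operating cycle implies a maximum allowable repair time. The short Python example below reproduces the figures quoted above under an assumed 20-hour operating cycle; both the cycle length and the 90% target are assumptions for illustration.

def max_repair_minutes(target_availability: float, cycle_hours: float,
                       failures_per_cycle: int) -> float:
    """Maximum mean downtime per failure that still meets the availability target."""
    allowed_downtime_min = (1.0 - target_availability) * cycle_hours * 60.0
    return allowed_downtime_min / failures_per_cycle

# Assumed 20-hour operating cycle and a 90% availability target.
for n in (4, 10):
    print(f"{n} failures per cycle -> at most "
          f"{max_repair_minutes(0.9, 20.0, n):.0f} min per repair")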


Figure 4.7 Maintainability requirements against reliability.

It is from these considerations at the design stage that designers can decide on the trade-off
between reliability and maintenance times. They will also evaluate how rates of NFF will lengthen the maintenance times and therefore allow fewer failures if the user's availability specification is still to be achieved.
Failure to consider the effects of NFF will hamper any possible increase in availability.
From the definitions of NFF provided in chapter 2, it is clear that the phenomenon is not
related directly to the probability of failure, but rather it is attributed to the probability of
detecting, locating, and repairing a failure. However, it is clear that support factors are also
key features in driving operational availability. By considering availability as a function of
reliability, maintainability, and support effectiveness, Equation 4.7 can be obtained.

Ao = f(R, M, S)    (4.7)

As reliability is concerned with faults and failures, it stands to reason that the higher the
probability of failure, the more failures (particularly unexpected failures) will occur, leading
to more opportunities for NFF during maintenance. Similarly, maintainability is concerned
with how easily faults are recovered from, and if faults are difficult to diagnose with
increasing NFF rates, then times for return-to-service will increase, and availability will be
reduced. In summary:

“The effect on availability as a result of NFF will be a


reduction in uptime quantified by increased MTTR and/or a
reduction in MTBUR.”

Of course, the NFF problem is complex and also heavily influenced by the entire operational
and maintenance environment. Such an environment is full of unpredictable factors that
will increase NFF rates and contribute to a loss of availability. So that NFF can really be
understood and quantified, a way of measuring its impact is required. A function describing
NFF will be complex and contain a variety of other factors described by Equation 4.8.

NFF = f(R, M | S, P, C, X_1, ..., X_n)    (4.8)

where:

P is the system engineering performance requirements.

C is the general system constraints such as cost, size, time, and weight.

X_1, ..., X_n represent unspecified factors such as vulnerability, environment, technical competence, resource constraints, and random events.

All these factors will also drive up NFF rates. It should be noted that within this function,
only the terms to the left of the | (i.e., R and M) are controllable; S is difficult to quantify as
noted previously; C and P are fixed design requirements; and X_1, ..., X_n are unpredictable and cannot
be controlled. A selection of these factors is discussed at length in chapters 3, 8, and 9.

A high-quality design leads to high reliability and enables good maintenance via
maintainability requirements. The reliability can then be measured in the field, along
with the required maintenance and testing, reflecting the quality of the designed-in
maintainability. Looking at the NFF function in Equation 4.8, the reliability of a system can be taken to represent the quality of the design, while maintainability represents the
quality of maintenance and/or testing. It can therefore be stated:

“NFF can be a symptom of a poor design and/or a symptom


of poor maintenance processes or tests. Therefore, the root cause of some NFF events is embedded within reliability
and maintainability design requirements.”

We have also seen that reliability and maintainability influence the availability. The
interpretation of Equation 4.8 is that a controlled and accepted level of NFF would be
reached at the stage that the following is true:

“The availability of a system is satisfactory given that the


support effectiveness (logistics), the engineering performance requirements, the constraints, the cost, and the ability to manage unpredicted events leave an NFF burden that has been reduced to the minimum level possible."


Further research is required in this area to identify a useful mathematical form to describe
the NFF problem. The usefulness of this would be that once the rate of NFF on a particular
system has been identified as a problem, a process of identifying its impact and where best
to target the NFF mitigation processes can then be decided. This would require a root cause
analysis to be performed to target potential reasons or root causes for the NFF problem.
This quantifiable NFF function would be of great help in identifying the business case for
potential solutions. Identifying measures to monitor the rate of NFF is significantly easier,
even though no accepted industry standard measures exist at present. Several options are
discussed in the next section.

4.7 A Process for Improvement


4.7.1 Overview
The measure of a system’s ability to meet its requirements using life-cycle costing is known
as the system's cost effectiveness. The use of cost effectiveness allows all of the interrelated concepts introduced and discussed in this chapter to be brought together.
For complex systems, the trend is that higher reliability requirements lead to higher
acquisition costs and lower operating costs. Costs for the whole life of the system would be
calculated at the design stage and can include an element in the maintenance calculations
that accounts for a certain percentage of maintenance actions resulting from NFF.
Traditionally, NFF is never a major consideration at the design stage, even though it impacts
heavily on operations, as discussed earlier. However, many designers remain unaware that
it is of concern, and it is rarely included in a design engineer’s requirements. To address
NFF effectively, this must change, with NFF being considered a priority at the design stage.
As discussed earlier, NFF is a symptom of poor design, and tackling it at the design stage
is therefore dealing with NFF at its root cause. Some research has been done in this area,
including work being undertaken in the UK to map design characteristics with NFF rates,
development of NFF prediction models, cost calculations and assessment, and information
standardization and mechanisms for in-service monitoring.

The recommended procedure for evaluating the effect of NFF on a design, to assess the cost
impact and to make redesign recommendations, is illustrated in Figure 4.8.

Beginning with the specification of a system (unit) design, evaluation of similar designs should
be made, including a review of previous systems with respect to their NFF rates, so that key
characteristics that can influence the occurrence of NFF are avoided. These characteristics
will be actual features of the hardware, such as the number of components and connections,
the area covered by BIT, the unit topology, or the overall complexity and installation factors.
Other characteristics can include the knowledge of the current operating environment and
predictability of the future operating environments, the knowledge of how the equipment
is currently used outside its design specification, and the level of maintenance required. The
usefulness of this information will manifest itself in the development of a model that can
predict, or assess, the burden of NFF in respect to overall maintenance actions. Having some
metrics during the design stage that define the burden of NFF will both make the problem visible and focus attention on its reduction. The metrics, though, must identify the NFF issue on
the system itself and the NFF problem that will be associated with the items or components
removed for external testing or repair. Those removed items for which the fault is confirmed to lie elsewhere are designated NFF–Confirmed External. Useful metrics can thus be split into two general
categories: mean time between NFF and mean time between NFF–Confirmed External.

Figure 4.8 Process for improving NFF at the design stage.

4.7.1.1 Mean Time Between No Fault Found (MTB NFF)


The category of MTB NFF is used to identify when a removed LRU is known to be the
source of the fault symptom that resulted in its removal and subsequent maintenance action.
However, the actual fault has yet to be determined within the LRU, and a symptom-to-
cause relationship, which relates the observed symptoms and the LRU removal, remains
undetermined. The MTB NFF is described by the relationship shown in Equation 4.9.

MTB NFF (h) = (Sum of operational hours in the subject period) / (Count of NFF unit removals in the subject period)    (4.9)

This category of NFF, however, does not capture the reality of the classification problem within a complex maintenance program, and does not consider the instances when the wrong LRU has been removed and the fault symptoms have remained onboard the aircraft. If this is the case and it can be confirmed, then the category MTB NFF–Confirmed
External can be applied to the removed unit, identifying the removed unit as not being the
source of the fault leading to its removal.


4.7.1.2 Mean Time Between No Fault Found–Confirmed External


This category of MTB NFF–Confirmed External is used when the removed LRU is deduced
to be “not faulty” with respect to its reason for removal. The assertion is made following a
logical analysis of data pertaining to its removal. In essence, a unit removal can be placed
into the “NFF–Confirmed External” category typically for one of two reasons:

• The fault symptom stayed on the aircraft after the LRU removal
• The subject LRU was one of multiple LRUs that were removed in the same
troubleshooting instance for the same symptom, and one of the other units (i.e., not the
subject LRU), when tested by the maintenance facility, was found to exhibit a fault that
correlates with the symptom.

The mean time between occurrences for this type of NFF event is given by Equation 4.10.

MTB NFF–Confirmed External (h) = (Sum of operational hours in the subject period) / (Count of NFF–Confirmed External unit removals in the subject period)    (4.10)
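
As a sketch of how Equations 4.9 and 4.10 might be computed from removal records, the following Python fragment assumes a simple list of removal events tagged with a classification field. The record structure and label strings are assumptions for the example only, not an industry standard.

from dataclasses import dataclass

@dataclass
class Removal:
    lru_part_number: str
    classification: str  # assumed labels: "NFF" or "NFF-Confirmed External"

def mtb(operational_hours: float, removals: list, category: str) -> float:
    """Equations 4.9 and 4.10: operational hours divided by removals in the category."""
    count = sum(1 for r in removals if r.classification == category)
    return operational_hours / count if count else float("inf")

# Hypothetical fleet figures for one reporting period.
removals = [Removal("ABC-123", "NFF"),
            Removal("ABC-123", "NFF-Confirmed External"),
            Removal("XYZ-900", "NFF")]
print(f"MTB NFF = {mtb(12000.0, removals, 'NFF'):.0f} h")
print(f"MTB NFF-Confirmed External = "
      f"{mtb(12000.0, removals, 'NFF-Confirmed External'):.0f} h")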

4.7.1.3 Estimating Maintenance Costs


The direct maintenance cost (DMC) of a commercial aircraft makes a significant contribution to an aircraft's cost of ownership and is calculated using Equation 4.11, where (MMH_off + MMH_on) is the total maintenance man-hours for maintenance performed on-wing and off-wing.

DMC = (MMH_off + MMH_on) × LC + MC    (4.11)

where:

LC is labor cost

MC is material cost
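
A minimal numerical sketch of Equation 4.11 follows. The man-hour and rate figures are invented purely to show the arithmetic and are not representative of any aircraft type.

def direct_maintenance_cost(mmh_off: float, mmh_on: float,
                            labor_rate: float, material_cost: float) -> float:
    """Equation 4.11: DMC = (MMH_off + MMH_on) * LC + MC."""
    return (mmh_off + mmh_on) * labor_rate + material_cost

# Hypothetical figures: 40 h off-wing plus 10 h on-wing at $95/h, and $3,500 of material.
print(f"DMC = ${direct_maintenance_cost(40.0, 10.0, 95.0, 3500.0):,.2f}")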

A DMC model would be used to calculate the cost impact, which can be monetary or
measured in terms of availability, of the predicted NFF. If the cost, which is evaluated
against the whole life costs (WLC) and which needs to be calculated at the design stage,
is unacceptable, this information, along with the MTB NFF measures and NFF design
guidelines, can be used to redesign or modify the equipment under study. Details of
potential redesign activities and identified guidelines are provided in chapter 6. If, however,
the cost is acceptable, the equipment is then placed into service, where appropriate
measures to monitor the in-service NFF performance would be implemented. Such in-
service monitoring must be stored in an accessible database along with all other relevant
information relating to NFF occurrences, repeat offenders, flight deck messages, etc. and
must be easily retrievable. Such a database, where it does exist currently, almost always
relies on dedicated human management and is rarely easy to automate. The data are usually rich but unmined, and the various data and information types are owned by
different parties and often incompatible. A process of standardizing the information is
vital to make the system effective. Information standardization to mitigate NFF is further
discussed in chapter 9.


4.7.2 A Methodology for Monitoring NFF In-Service


A successful methodology for monitoring NFF in-service will always require accurate
symptom data. This will usually require agreement and cooperation between the OEM,
MRO, and operators to gain access to this data, which will be stored in a range of executive
asset management information technology (IT) systems, technical records, and aircraft
health monitoring databases. Such systems are data rich but information poor and require
sophisticated analysis and manipulation in order to extract useful information. This is
because much of the data is in free text, and is usually brief and highly subjective. A manual
system is labor intensive and often perceived as too expensive versus the expected benefits. A
bespoke organization- or system-specific process would first need to be established to enable
the symptom and removal data to be input to the maintenance management system. Regular
face-to-face meetings between technical services, engineers, and maintainers will also help to
extract the maximum information and to highlight the high NFF rates on particular systems.
Consequently, the clear recommendation is to establish a formal process, including generating
what we can call “unit removal datasheets” for each LRU under a NFF investigation.

The purpose of unit removal data sheets is two-fold. First, they allow a multitude of data,
which includes both the data normally provided and the data specifically requested for a
NFF investigation, all to be collected into one place. The collated information can be used in
symptom-cause analysis to identify the cause and assign the relevant category: either NFF or NFF–Confirmed External, as described earlier in this chapter. The application for this is illustrated in
chapter 9, which discusses tools and techniques for NFF resolution. The unit removal data
sheet (URD), therefore, must be populated with a wide range of raw data at regular intervals,
the process for which is described next.

4.7.3 Unit Removal Datasheets


As an example of a URD, consider an LRU that has data available from a variety of sources
including the capability for BIT—meaning the system conducts functional self-tests and
continuous monitoring—and will annunciate when a fault condition is detected. For BIT,
the computer that detects the fault will generate what is known as a fault code or, in some spheres, a maintenance message: a low-level message that can be associated with a higher-level cockpit message (warning). Messages received at
the cockpit level are referred to as flight deck effect (FDE). A post-flight report would be
generated on the presence of any FDE message received and is a document that usually
prints directly from the aircraft’s central maintenance system at the end of each flight.
The post-flight report includes flight deck messages and maintenance messages as well
as any other pertinent information. In some cases, these messages are captured within
the aircraft health monitoring systems, and are also downloaded during flight to the
maintenance organization.

The data types on the URD, which can be of interest for an LRU that has been removed as
a result of on-wing troubleshooting, include basic reason-for-removal data, technical log
entries relating to the same LRU, similar fault history on the same aircraft, and post-flight
reports following removal of the LRU. Basic removal data will include the removed LRU’s
part number, serial number and the date of removal, the registration number of the aircraft
that it was removed from, the technical log page number and line relating to the removal
action, time since installation (TSI), and, very important for troubleshooting, the airline’s
stated reason for the removal of the unit. Technical log entries for the unit would be required
for a stated period prior to the unit’s removal, and entries relating to the removed unit’s
symptoms would be highlighted. Aircraft fault history would be extracted directly from
aircraft health monitoring systems, providing details of flight deck messages and underlying
maintenance messages relating to the flight when the fault appeared. Again, these data
should also cover a defined period pre- and post-unit removal. It is necessary to have
access to the part numbers and serial numbers of any additional units that could have been
removed at the same time as a result of on-wing troubleshooting.
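
To show how such a unit removal datasheet might be represented for automated collation, the sketch below defines a simple record structure in Python. The field names are assumptions drawn from the data types listed above and do not represent a standardized schema.

from dataclasses import dataclass, field

@dataclass
class UnitRemovalDatasheet:
    """One record per LRU removal under NFF investigation (illustrative fields only)."""
    part_number: str
    serial_number: str
    removal_date: str                 # e.g., "2015-03-14"
    aircraft_registration: str
    tech_log_reference: str           # technical log page and line for the removal action
    time_since_installation_h: float
    stated_reason_for_removal: str
    prior_tech_log_entries: list = field(default_factory=list)
    flight_deck_effects: list = field(default_factory=list)    # FDE messages
    maintenance_messages: list = field(default_factory=list)   # BIT fault codes
    co_removed_units: list = field(default_factory=list)       # part/serial numbers

urd = UnitRemovalDatasheet("ABC-123", "SN0457", "2015-03-14", "G-EXMP",
                           "TL 1042/3", 812.5, "Display blanking reported by crew")
print(urd.part_number, "-", urd.stated_reason_for_removal)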

Once these data sets and pieces of information have been collated into the URD, it will then
need to be analyzed to provide any meaningful information with regard to the NFF rate
using the process as shown in Figure 4.9.

Figure 4.9 Schematic showing the key aircraft-related data sources. [Figure: the sources include aircraft fleet health monitoring data, airline maintenance data, airline asset configuration data and part life tracking, LRU MRO data, LRU OEM design data, and platform (e.g., engine) OEM system data.]

Piecing all this information together allows the instances that affect MTB NFF to be tracked over time, showing an increasing or decreasing trend, and allows repeat arisings to be identified. It will also help in identifying symptom-to-cause relationships. Most important, by tracking the rates of NFF, this information can be used to evaluate increases in system downtime and hence to quantify the effect of NFF on availability.
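
A minimal sketch of this trend tracking is shown below: MTB NFF is computed over successive reporting periods so that an increasing or decreasing trend becomes visible. The quarterly hours and counts are illustrative assumptions only.

def mtb_nff_trend(period_hours, nff_counts):
    """MTB NFF per reporting period; a falling value signals a worsening NFF rate."""
    return [h / n if n else float("inf") for h, n in zip(period_hours, nff_counts)]

# Hypothetical quarterly fleet operating hours and NFF-classified removals.
hours = [3000.0, 3100.0, 2900.0, 3050.0]
counts = [4, 6, 9, 12]
for quarter, value in enumerate(mtb_nff_trend(hours, counts), start=1):
    print(f"Q{quarter}: MTB NFF = {value:.0f} h")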


4.8 Conclusion
The purpose of this chapter was to explore how availability is interlinked with the issue of
NFF. To do this, the impacts that reliability and maintainability have on the availability of
an aircraft or individual system were discussed. How reliability and maintainability also
contribute to, and exacerbate, the NFF problem and how high levels of NFF contribute to a
reduced rate of availability were also shown.

Making this link to maintainability and reliability, both of which are intrinsic parts of a
design, illustrates and reinforces the definition of NFF as being a symptom of poor design
or test. The ability to affect NFF from a design point of view by improving reliability
and maintainability requirements is clear. However, to do that requires active feedback
between operational cases of NFF and equipment designers to determine the design
changes that are necessary.

A methodology for doing this was presented along with an illustration of how multiple sources
of data and information concerning NFF-related parameters can be captured and analyzed.
This process helps to identify the level of NFF as well as any emerging trends between rates,
types of equipment that suffer NFF, and the costs in terms of loss of availability.

These are discussed in further detail, with new opportunities for identifying and/or making
system design improvements, in chapter 8. New potential technology mitigation strategies
are addressed in chapter 9.

4.9 References
4-1. Jenson, D. “Europe’s Challenges In a Dynamic MRO Market.” Cited April 4, 2009.
Available from http://www.aviationtoday.com/. 2008.

4-2. Global Industry Analysts Inc. “Machine condition monitoring equipment – a global
strategic business report.” ID: 338488. 2014.

4-3. Ward, Y., and A. Graves. “Through-life management: the provision of total customer
solutions in the aerospace industry.” International Journal of Services Technology and
Management 8 no. 6 (2007): 455–477.

4-4. Hessberg, J. “Functionability Management – A Tribute to Jack Hessberg.” Presented


at the 23rd MIRCE International Symposium, Dec. 3–5, 2013.

4-5. Khan, S., P. Phillips, C. Hockley, and I. Jennions. “No Fault Found events in
maintenance engineering Part 2: Root causes, technical developments and future
research.” Reliability Engineering & System Safety 123 (2014): 196–208.

4-6. Roy, R., A. Shaw, J. Erkoyuncu, and L. Redding. “Through-Life Engineering


Services.” DOI: 10.1177/0020294013492283. Measurement and Control 46 no. 6 (2013):
172–175.


4-7. Stapelberg, R. F. “Availability and Maintainability in Engineering Design.”


Handbook of reliability, availability, maintainability and safety in engineering design, 290–
527. Springer Science & Business Media, 2009.

4-8. Kinnison, H. A., and T. Siddiqui. “Part 3: Aircraft Management, Maintenance, and
Material Support.” Aviation Maintenance Management, 143–194. McGraw-Hill, 2012.

4-9. Kinnison, H. A., and T. Siddiqui. “Part 4: Oversight Functions.” Aviation maintenance
management, 195–216. McGraw-Hill, 2012.

4-10. AIA/ASD S5000F. “International specification for operational and maintenance data
feedback.” 2014.

4-11. S4000P. “International specification for developing and continuously improving


preventive maintenance.” 2013.

Chapter 5
Safety Perceptions

5.1 Introduction
It can be argued that unless a NFF occurrence (which can have numerous potential causes) is a repetitive fault that is influencing the system performance, an isolated incident cannot be considered a safety issue. Nevertheless, if NFF increases the
probability of a hazardous or catastrophic event, then there must surely be a concern
that the root cause is a safety issue. For example, the danger of not dealing adequately
with NFF events relating to intermittent faults is demonstrated through an incident on-
board a BMI A321 [5-1] at 36,000 ft that was carrying 43 passengers en route between
Khartoum and Beirut. An intermittent failure in the electrical power-generator system
presented numerous symptoms, which included an uncontrollable rudder trim, causing
the left wing to dip by 10° and the aircraft to deviate from its intended course by 37 km.
In addition to this, both the pilot and co-pilot’s instruments were affected, with the
primary and navigational flight displays among other instruments flickering or going
entirely blank. In this particular case, the aircraft landed safely, but it does highlight,
from a safety perspective, the need for intermittent faults to be successfully detected
and localized during maintenance testing.

The above scenario is part of an ongoing discussion as to whether or not NFF is linked
with safety, and, although this discussion has not reached a concrete conclusion, it
would portray an incomplete picture of NFF in the current book if this subject was not
addressed. The objective of this chapter is, therefore, to inform the reader of the pros
and cons of the possible connections between NFF and safety. In doing so, it reaches its
own conclusion regarding the connection and suggests possible ways forward. Safety in
the aircraft environment is paramount, and we can use the air environment to illustrate
the potential problem for other environments, such as rail and medical, where safety is
also critical.


5.2 Faults and Safety—Some Perceptions


The root cause of a fault should generate symptoms that are obvious during the
equipment’s operation, and which then demand some maintenance and diagnosis. If this
is successful, the fault is rectified, but if not, the fault may remain on the system as it is
tested satisfactorily and/or cannot be reproduced. The fault then may reoccur during the
equipment’s next operation. Of course, the nature of the fault will determine whether it
is a safety issue or not, and it is perhaps too simple to say that some will be a safety issue
and some will not. In the commercial aircraft environment, we have aircraft designed
to fail-safe principles. This means that there is built-in redundancy, and any unforeseen
failure will have an alternate load path and the ability to continue in operation until the
next maintenance inspection. But, in the military environment, aircraft are built to safe-
life principles, without the redundancy and built-in safety afforded by fail-safe design.
This means that the equipment has an overall designed life for its safe operation, but there
is not necessarily redundancy designed to cope with unforeseen failures. While on the
surface this might suggest that any fault that occurs on a fail-safe aircraft is not going to
be a safety issue, a fault that remains on the aircraft, without successful diagnosis, can
certainly contribute to distraction of the pilots and a near accident, as illustrated in the
previous example and in a case study later in this chapter.

The system operator and its maintenance organization will be interested in the continued
reliability of the system, as well as the link between NFF and reliability; any effect on safety
should, therefore, be considered. If the root cause of a fault has not been addressed (either
in the system or the LRU), the consequence of that root cause persisting is that the
reliability of the component will be diminished; for instance, how long will it run before
a further fault? In terms of the safety case for the system, the original figure used in that
safety case is now incorrect. If the component or system now at risk is part of a defined
safety-critical system, such as one governed by the MMEL in aerospace, this should be a major
concern, because the probability of a catastrophic or hazardous event must have increased.
The reason is that the component will have had an assessment of its criticality in the original
design, which will have been used in the fault tree analysis that supported the original
design safety case. Taking an aerospace example again, the design has to demonstrate that
the likelihood of a defined hazardous event is less than ten to the minus seven (10⁻⁷, or 1 in
10 million), and that of a catastrophic event is less than ten to the minus nine (10⁻⁹, or 1 in
1,000 million). Clearly, the fact that a component or system has been subject to a failure
that cannot be identified, yet now remains in service, means that the defined hazardous
event has a much increased likelihood of occurrence and that the original design requirement
and safety case are no longer met.
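As a purely illustrative sketch of this arithmetic (the fault-tree structure and failure probabilities below are invented for the example and are not taken from any certified design), consider a hazardous top event that requires two independent channels to fail. If one channel is quietly failing far more often than the safety case assumed, because its root cause keeps being returned to service as NFF, the computed top-event probability rises accordingly:

```python
# Illustrative only: invented failure probabilities for a two-channel,
# AND-gated hazardous event; not data from any real safety case.

def top_event_probability(p_channel_a: float, p_channel_b: float) -> float:
    """Probability of the hazardous top event when both independent
    channels must fail (simple AND gate, independence assumed)."""
    return p_channel_a * p_channel_b

# Figure assumed in the original safety case (per flight hour).
p_design = top_event_probability(1e-4, 1e-4)    # 1e-8, within a 1e-7 target

# The same system with one channel degraded by an unresolved root cause
# that now fails, say, 100 times more often than assumed.
p_degraded = top_event_probability(1e-2, 1e-4)  # 1e-6, ten times worse than the target

print(f"Safety case figure : {p_design:.0e} per flight hour")
print(f"Degraded channel   : {p_degraded:.0e} per flight hour")
```

The particular numbers do not matter; the point is that the certified figure silently stops being true once a faulty unit is repeatedly declared NFF and refitted.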

Anecdotal evidence from an unscientific survey conducted with those who are interested in
the NFF subject showed that there was nearly an equal split between those who thought “yes”
it was a safety issue, those who said “no,” and those who said “it depends.” Reasons included:

• Many said “yes” because they equated it to the impact of the symptom on the crew and
on system redundancy.
• Those who said “no” equated NFF with the maintenance process, saying it was merely
an economic impact or a process issue.
• Those who said “it depends” said it depended on the system affected and on whether
the NFF becomes a repeat arising.

One perception of NFF is that it is a chain of events, not a single entity [5-2]. It starts with a
cause, which manifests itself as symptoms that may well be transitory. Establishing the exact
root cause may take time and considerable diagnostic effort, and demand a series of actions,
events, or situations, such as environmental conditions, all to be connected or reproduced to
achieve success. If breaks occur in this chain, then a NFF will result.

Terminology is another contributing factor and means that we inextricably link the NFF
with Fault Not Found (FNF) and repeat arisings, yet these are not the same. NFF implies
there may not have been a fault in the first place and a certain acceptance that the equipment
is ready to be put back into operation. FNF, on the other hand, indicates an acceptance
that there is a fault which has just not been identified. Repeat arisings again signify an
acceptance of the fact that the fault keeps reoccurring, but has not been traced successfully.
The psychology of acceptance is perhaps an indication that safety is being compromised
or not given the attention that it should receive.

Some engineers, managers, and operators are perceived to have a lack of awareness of the
safety implications due to:

• The original fault root cause
• An inability to reproduce the symptom
• The number and frequency of repeat arisings

5.3 A Conceptual Discussion


Let us first consider safety in relation to aircraft, as the principles will certainly be
transferable to other industries. Aircraft are subject to an Aircraft Type Certification
Process, which provides assurance that the aircraft has been designed, tested,
and is now certified as safe to operate within a permitted flight envelope. This flight
envelope specifies things such as all-up weights, landing and take-off speeds, rates of
climb and descent, and a myriad of other limits that have been established and fully tested.
Together with this certification and approval there will be a definition of the potential
catastrophic and hazardous events that have been identified. The design has to ensure that
the likelihood of a defined hazardous event is less than ten to the minus seven (10⁻⁷, or 1 in
10 million) and that a catastrophic event is less than ten to the minus nine (10⁻⁹, or 1 in 1,000
million). The certification process thus gives approval to the system design and its ability to
meet the safety requirements laid down by the regulatory authorities for the maintenance
of airworthiness. Within the design process will be features that contribute to safety or
provide pass/fail and go/no-go limits. These are such things as BITE systems or test facilities
and processes that provide a positive or negative output. As part of the design and the
certification process, aircraft also have a MMEL, which provides a definitive list of systems,
instruments, and equipment that must be operative before the aircraft can be dispatched.
Through such things as the MMEL and the Aircraft Type Certification, the CAA establishes
the aircraft’s airworthiness standard. Any maintenance that must be completed
has to return the aircraft to an airworthy state.


If NFF is then considered within this overall model of airworthiness, the multiple types
of NFF must be examined. Clearly, NFF consists of two main scenarios: the first is at the
platform level, where the symptom cannot be reproduced and all tests show that the system
and platform are serviceable. In this case, the fault is most likely an intermittent one that
may occur again. The second is where the most likely component(s) or LRUs are removed
for testing elsewhere. Figure 5.1 shows the types of NFF that can occur following such a
removal for testing.

[Figure 5.1: flowchart from an arising through product removal and product test; if a fault is found, the LRU history is reviewed to separate a confirmed LRU fault from a repeated arising related to a previous FNF event; if no fault is found, the fault is either confirmed elsewhere (confirmed not LRU) or declared Fault Not Found (FNF).]

Figure 5.1 Multiple types of NFF (Source: Adapted from [5-3]).

Now let’s examine each case along the bottom row in Figure 5.1, from left to right. In the first
case, the removed LRU is tested and the fault is confirmed in the LRU; this is not a NFF
case but a successful fault diagnosis. In the next case, the removed LRU is tested but the
fault cannot be found; here the fault is either confirmed elsewhere or it is still not found.
Where the fault is confirmed elsewhere, the LRU itself does not have a fault and was
incorrectly removed, although the fault still exists somewhere else in the system. Where the
fault cannot be confirmed at all, the case is declared Fault Not Found (FNF), and we really
do not know whether an original fault existed.

With a removed LRU that has a confirmed fault on test, we still have one case that is NFF (the
right side of Figure 5.1). This is the case where testing has confirmed a fault, but on reviewing
the previous history, the fault can be related to a previous FNF event. This is, therefore, a
repeat arising, and while it is now a fault that has been found, an associated NFF or FNF has to
be attributed. The mere fact that the aircraft has continued to fly with the fault must surely be
counted as a potential safety issue, depending, of course, on the system affected.
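The decision logic of Figure 5.1 can be summarized in a short sketch; this is only an illustration of the classification just described (the function name and its inputs are invented for the example), not a prescribed maintenance procedure.

```python
# Illustrative sketch of the Figure 5.1 classification; names are invented.

def classify_removal(fault_found_in_lru: bool,
                     fault_confirmed_elsewhere: bool,
                     related_previous_fnf: bool) -> str:
    """Classify the outcome of an LRU removal and subsequent bench test."""
    if fault_found_in_lru:
        if related_previous_fnf:
            # A fault is found now, but the history links it to an earlier
            # unresolved event: a repeat arising.
            return "Repeated arising (an earlier NFF/FNF must be attributed)"
        return "Confirmed LRU fault (successful diagnosis)"
    if fault_confirmed_elsewhere:
        # The LRU was removed incorrectly; the fault lies elsewhere.
        return "Confirmed not LRU (fault exists elsewhere in the system)"
    return "Fault Not Found (FNF)"

print(classify_removal(True, False, True))    # repeat arising
print(classify_removal(False, False, False))  # FNF
```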

So it is suggested that we have three categories or types of NFF or FNF that have potential
safety implications; these are:

• The repeat arising

• The fault not found, nor able to be confirmed elsewhere or at all

• The fault confirmed elsewhere

To see how they are possibly linked, we should look at another model, shown in Figure 5.2.

[Figure 5.2: block diagram linking A/C Type Certification, fault detection, the maintenance system, fault isolation tools, maintenance practice, operator policy, and maintenance “error” to an observed system fault, a No Fault Found outcome, and the resulting safety issue, which increases the probability of hazardous/catastrophic events.]

Figure 5.2 NFF and safety link (Source: [5-3]).

Here, we see that fault detection systems that have been designed from the outset as a
constituent part of the aircraft certification process, together with the maintenance system
itself, will deal with an observed fault. Fault isolation tools and available diagnostic tools
will attempt to isolate the fault. However, both the operator’s policies and processes as
well as maintenance practices that are employed in the organization will govern whether
a NFF is declared. Within this environment, consideration also must be given to whether
maintenance errors might contribute to the outcome. This will be discussed in more detail
later in this chapter.

5.4 The Regulatory Issues in the Air Environment


As previously noted, the Aircraft Type Certification Process is designed to ensure that an
aircraft enters service with its airworthiness specified and met. The regulatory authority in
the UK, the CAA, and a similar body in the United States, the FAA, ensure that regulations
are in place for continuing airworthiness. These regulations involve everything from air
traffic procedures to maintenance schedules and processes. Central to these regulations is
the need to protect safety and to maintain the continued airworthiness of aircraft. At
present, while NFF is recognized as an issue, no regulations explicitly link NFF with safety.
Indeed, motivation is needed to produce a new regulation, and as yet the case has not been
made for a link between NFF and safety that would require one. One such link is outlined below.

Continuing airworthiness means all of the processes involved in ensuring that, at any time
in its operating life, the aircraft complies with the airworthiness requirements in force and is
in a condition for safe operation. As far as the UK is concerned, the industry is governed by
the European Law dealing with continuing airworthiness requirements [5-4]. It states:

The continuing airworthiness of aircraft and components shall be ensured in accordance with the provisions of Annex I [5-5].

Annex I is also known as Part M, and here we find more detailed guidance and responsibilities. For instance, M.A.201 in Annex I gives the responsibilities as follows:

The [aircraft] owner is responsible for the continuing airworthiness of an aircraft and shall ensure that no flight takes place unless:

• The aircraft is maintained in an airworthy condition, and,
• Any operational and emergency equipment fitted is correctly installed and serviceable or clearly identified as unserviceable, and,
• The airworthiness certificate remains valid, and,
• The maintenance of the aircraft is performed in accordance with the approved maintenance programme as specified in M.A.302. [5-5]

In the UK military, following the catastrophic crash of the Nimrod aircraft XV230 over
Afghanistan, with the loss of 14 crew, the Ministry of Defence (MOD) has had to adopt a root
and branch change to its airworthiness management. This has resulted in the creation of the
MAA and alignment with the CAA regulations. The MAA has created Duty Holders (DH), who
are legally accountable for the safe operation of systems in their area of responsibility
(AoR) and for ensuring that risks to life (RtL) are reduced to at least tolerable and ALARP
(as low as reasonably practicable) [5-6]. The ALARP principle is that the residual risk shall
be as low as reasonably practicable: for a risk to be ALARP, it must be possible to demonstrate
that the cost involved in reducing it any further would be grossly disproportionate to the
benefit achieved. Each operational aircraft fleet has an Operational Duty Holder (ODH), a
very senior “operator” who is personally legally responsible and accountable for the
airworthiness, maintenance, and safe use of the air systems in the defined AoR.


With the regulations in place in the civil sector, and perhaps even more onerous ones in the
military sector, one might assume that a prima facie case could be made for linking NFF
with safety and with the duty holder’s responsibilities for continued airworthiness.
Nevertheless, so far the case has not been made, perhaps merely because the issue of NFF
and its potential implications for safety have not yet received sufficient publicity.

5.5 Faults and the Link with Maintenance Errors


5.5.1 The Maintenance Contribution
We are all familiar with the concept of maintenance: it is about inspecting and servicing
equipment to ensure it is able to be put back into service in a fit condition, to last until the
next maintenance intervention. Most people also associate maintenance with finding failures
and repairing them. Indeed, definitions in the air environment indicate a very necessary
and vital flight safety link with maintenance and the need for it to be undertaken. These
definitions will cite the need for maintenance to ensure or restore aircraft integrity. Aircraft
integrity is a phrase used to describe the absence of faults and failures in the entire aircraft
structure and systems that might jeopardize safety or successful completion of the mission.
Consequently, the view expressed by Jack Hessburg1, that maintenance is “nothing more
than the management of failures” [5-7], should be accepted.

It is important to note, however, that the management of failures is driven by the
consequences of the occurrence of failures [5-8]. These are:

• The impact on safety
• The impact on operational availability

Both are vital, and the impact on safety receives huge attention—and rightly so. The impact
on availability, however, receives more attention in commercial aviation, where delays
and cancellations cost money and reputation, and ultimately affect shareholders and
profits. However, in military aviation, availability is beginning to receive more attention
as resources and numbers of aircraft are reduced and more commercial ways are found to
provide support and availability. Yet, the whole process of the management of failures and
the need for maintenance is different between military aircraft and civilian aircraft. While
both will consider the impact on safety in the same way, the impact on availability will be
largely economic for the airline industry, but in the military will be driven by the need for
a battle-winning edge. This also produces subtly different cultures and behaviors between
these two groups. The need to achieve dispatch reliability for an airline will be paramount
as the economic consequences of delays, or worse, cancellations, can be very damaging.
Consequently, the maintenance staff will do almost anything to achieve the minimum delay
when faced with a failure “at the gate.” The culture in many airlines is one that minimizes
delays at the gate, and if this means changing three LRUs rather than carrying out thorough
diagnostics to find the root cause of the failure and the exact LRU at fault, then three LRUs
will be changed. In peacetime operations, this culture would be unusual in the military,
particularly now when so many civilian companies are providing the maintenance support
to the military. However, in actual operations, where battle-winning availability is vital,
the same culture may well pervade.

1. Jack Hessburg (1934–2013) was Boeing Chief Mechanic with responsibility for all maintenance design
during development of the Boeing 777.

A second factor—aviation safety—is also at play here. Civilian airliners are built to fail-safe
principles, in which every system and part of the design is meticulously analyzed for the
consequences of it failing. Should that be a possibility, an alternative load-path or alternate
system must exist to provide redundancy. Military aircraft, however, are built to safe-life
principles, in which maintenance is a key factor in providing the early warning of failure
before it is catastrophic [5-9].

Regardless, though, of whether the system is fail-safe or safe-life, various maintenance
techniques are increasingly used and incorporated into the design of both military and
civilian aircraft to provide maintenance assistance. Commercial aircraft such as the Boeing
777 use a huge amount of condition monitoring of all forms to monitor the deterioration of
systems and components. The information is collected by Boeing’s AIMS. BIT and BITE are
part of the whole AIMS system and contribute to the management of failures or impending
failures. By identifying faults through condition and health monitoring, the AIMS will
reconfigure systems or divert usage to other systems using spare capacity or redundancy
so that the need for urgent maintenance is avoided or postponed. AIMS also continually
monitors the systems and sends data to the maintenance management staff to enable them
to analyze the system and its information to determine both impending and actual faults
and failures. The necessary maintenance can then be planned at a convenient time for the
maintenance team, who can be alerted and positioned with the right skills, the right test
equipment, the right spares, and within the most appropriate servicing window [5-10].

5.5.2 Operational Pressure


The pressure in commercial operations on maintenance staff is often overwhelming. Aircraft
delays, cancellations, and lack of aircraft availability not only mean lost revenue but have
a knock-on effect in customer perception. Reputation is hard won but all too easily lost if
delays or cancellations occur. Delays and cancellations from 2003–2014 for U.S. domestic
carriers averaged more than 22% [5-11]. While some of these are due to uncontrollable
issues such as weather or air traffic control, a number are because of aircraft faults or
maintenance delays. The U.S. Department of Transportation’s Bureau of Transportation
Statistics tracks the on-time performance of domestic flights operated by large air carriers.
Summary information is provided on the number of on-time, delayed, and canceled flights:
on time 77.86%, delay (including late arrival) 20.24%, and cancellation 1.77% (between June
3 and October 14). For purposes of this report, a flight is considered delayed if it arrived
at (or departed) the gate 15 minutes or more after the scheduled arrival (departure) time
as reflected in the Computerized Reservation System. The information is based on data
submitted by reporting carriers.

The pressure on maintenance staff then becomes extreme, yet safety should still be
paramount. In that case, the easiest solution to a fault or failure will be taken, perhaps
without time for proper diagnosis. If the system can be reset and tests satisfactorily, the
fault is no longer present, or more correctly is no longer evident! Yet it may need certain
conditions of vibration, temperature, or humidity while airborne to provide the situation
when it will fail again. Operational pressure might also suggest that changing three boxes
will solve the failure, and so it does, but now two of the boxes will prove to be NFF when
tested further down the support chain. In some cases, speed and operational imperatives
will have masked the failure, which may then reoccur at an inappropriate moment during
the next flight. The integrity of maintenance staff is all that stands in the way of whether
a fault or failure is solved in the most efficient and cost-effective way. On some occasions,
surely, speed and operational pressure will win, and a dormant fault will remain on the
aircraft, or in the removed component. The operational pressure is created, of course, by the
organization and the humans who manage the organization. There are also human factors
at work within the maintenance organization, which relies on the maintenance personnel to
undertake work. These human factors must also be acknowledged and understood.

5.5.3 The Human Factors Contribution


When humans are involved, errors can occur for any number of reasons. The CAA
goes further, stating:

It is an unequivocal fact that whenever men and women are involved in an activity, human error will occur at some point [5-12].

In a paper on the taxonomy of dependability [5-13], albeit one associated with
computing, it is shown how dependability is made up of three elements: attributes, the
means (of delivering dependability), and threats. The attributes are availability, reliability,
safety, confidentiality, integrity, and maintainability. The means by which dependability
can be delivered are fault tolerance, fault prevention, fault forecasting, and fault
removal, and the threats are faults, errors, and failures [5-13]. While the paper concerns
dependability of computers and computer systems, the concept of dependability is equally
valid for any engineering system or service delivery. In the context of understanding
the contribution of NFF to air safety, it is important to distinguish between faults, errors,
and failures, and the definitions and taxonomy presented are as good as any. In sum,
they propose that failure means that a system “deviates from the correct service state
and fails to deliver the required service. The deviation is called an error.” The adjudged
or hypothesized cause of an error is called a fault, but a fault is either active, and thus
causes an error, or it is dormant. On the other hand, in his acclaimed book “Human Error,”
Professor James Reason defines error as follows:

Error will be taken as a generic term to encompass all those occasions in which a planned sequence of mental or physical activities fails to achieve its intended outcome, and when these failures cannot be attributed to the intervention of some chance agency [5-14].


So when considering maintenance and the resolution of faults and failures, errors must be
considered as well. However, in the context of this chapter, it is not the error in the physical
system that is of concern, but rather the maintenance errors associated with the humans
performing the maintenance.

Costs are associated with maintenance errors. First, maintenance errors cost lives. Second,
maintenance errors cost money. Maintenance errors also can cost a company its reputation.
Maintenance errors can be thought of as resulting from what Reason describes as “The Error
Chain.” Simple errors often combine to create a catastrophe; by themselves, they would not
be a problem, but the combination becomes serious. The Error Chain can cost a company
millions in rework and lost revenue, and invites unwelcome attention from regulators.
Examples of maintenance errors are:

• Incorrect installation of components
• Fitting the wrong part
• Electrical wiring discrepancies
• Loose objects left in areas
• Inadequate lubrication
• Access panels, fairings, or cowlings not secured
• Fuel or oil caps not secured
• Safety or gear pins not removed before aircraft departure

Maintenance technicians work in a variety of environments, often extremely challenging
ones, to deliver the outputs that are required. The performance of those maintenance tasks
is affected and obstructed by many things, yet the technician will be coping by using
both conscious and subconscious approaches to deliver the desired performance. The
subconscious will be delivered as automatic or emotional actions, whereas the conscious
approach will be delivered with logical and rational activities. The conscious actions include
activities delivered according to rules and procedures, or based on experience, knowledge,
and training. The maintenance activities delivered with a subconscious approach will
include those activities done automatically without thinking, and could involve fast reaction
and perhaps repetitive activities. Maintenance errors are usually obvious and can be traced
to one or more causes. These were identified and christened the “dirty dozen” by Gordon
Dupont in 1997 [5-15] as a concept developed while he was working for Transport Canada.
The “dirty dozen” now forms part of an elementary training program for understanding
human performance in maintenance. The “dirty dozen” have since been expanded, but still
form the basis for the ways that people’s ability to perform effectively and safely degrades,
which could thus lead to maintenance errors. They are well known in the commercial airline
industry and feature prominently in maintenance training courses. They are:

• Stress
• Fatigue
• Lack of communication
• Lack of assertiveness
• Complacency
• Distraction
• Pressure
• Lack of resources
• Lack of knowledge
• Lack of awareness
• Norms (where incorrect procedures or quick fixes become the normal way of working)
• Lack of teamwork

These are all human factors, as they impact the ability of maintenance personnel to perform
effectively and safely. Any one of these factors, or a combination of them, can result in a
maintenance error or the failure to detect a fault. It is this latter point that is often dismissed
or not considered and where the connection with NFF can be critical in its impact on air
safety. The inability to locate or find a fault does not usually have such an obvious cause, and
is usually not considered a maintenance error. Yet, if the “dirty dozen” is considered in the
context of fault finding and achieving diagnostic success, many of those factors will actually
cause a NFF to be registered. Table 5.1 considers each of the “dirty dozen” and assesses
whether each can contribute to, or even directly cause, a NFF.

Table 5.1 The Dirty Dozen and an Assessment of Their Contribution or Cause of NFF

Dirty Dozen Factors | Contribute to, or Cause NFF? | Comment
Stress | Yes | Stress affects concentration, motivation, and clear thinking, which are essential for successful diagnosis of complex faults.
Fatigue | Yes | Fatigue hampers the ability to think clearly and to successfully diagnose the cause of a fault, and will extend the time needed to solve complex faults.
Lack of Communication | Yes | Rushed or poor communication, poor briefing, and poor description of fault symptoms can often lead to NFF being declared.
Lack of Assertiveness | Yes | When directed by the supervisor to a specific course of action, a technician who lacks assertiveness will fail to question a course of action he/she knows to be incorrect. This may lead to a NFF.
Complacency | Yes | The maintenance action carried out is the usual fix and course of action, which may result in a temporary fix of intermittent or connector faults but does not get to the real root cause.
Distraction | Possibly | The technician may miss crucial elements of diagnostic procedures due to distraction and thus not find the fault.
Pressure | Yes | Pressure may involve changing three items just to make sure the cause of the fault is covered. This subsequently creates a NFF further down the support chain.
Lack of Resources | Yes | Inadequate resources will hamper diagnosis (e.g., unsuitable test equipment may be used, or lesser-skilled technicians tasked who do not have the capability to successfully diagnose the root cause).
Lack of Knowledge | Yes | Inadequate training will cause poor diagnosis, and the need for checks and supervision will be missed.
Lack of Awareness | Possibly | Similar to lack of or poor training, and lack of experience to use the best diagnostic process.
Norms (e.g., short cuts and unauthorized procedures) | Yes | Some short cuts will have become the normal solution for some faults, and will have become the preferred, yet unauthorized, first solution to be tried, as it usually clears the fault.
Lack of Teamwork | Possibly | The inability of a team to work successfully together may result in a NFF as a way of shortening the maintenance time so that the team has the least time working together.

It is clear, then, that the human factors described by the “dirty dozen,” which cause
maintenance errors and have been shown to cause safety issues or accidents, are also the
factors that will contribute to NFF. It is, therefore, logical to conclude that there is a strong
link, albeit a circumstantial one at present: the causes and effects of NFF are also a safety
issue with the potential to cause accidents.

5.5.4 Diagnostic Maintenance Success


Having made the link between maintenance errors and NFF, it is worth looking at
maintenance support and guidance. Where does the technician get help? In modern aircraft
it is increasingly from the OMS. The OMS on the Boeing 777, as an example, is part of the
AIMS and provides direct computer access to many of the systems on the aircraft so that
any maintenance action can start with direct access to as much data and information as
possible. It consists of a central maintenance computer as the central AIMS core that takes
inputs from condition monitoring systems and BITE. There are direct access points around
the aircraft where a maintenance engineer can plug in a terminal. However, BIT and BITE
have their own inherent problems. They have become central to the diagnosis of faults, yet
their reliability is constrained by the ever-increasing complexity of the systems they are
monitoring. Subtle relationships between systems
must be understood by the designer of the BIT and BITE. More and more parameters can
be monitored, from vibrations and pressures to avionic performance, and even structural
health. So, the complexity and difficulty of producing reliable test routines continually
increases [5-16]. What is needed for success here is a logical method for effective fault
consolidation. If BITE falsely identifies component failures, components may be designated
as faulty when they are not. Perhaps the fault has, in fact, been caused by another component
that feeds data into the first one, an example of what is
known as cascading faults. Complex digital circuits are extremely sensitive to power surges
and transient voltages, which cause the monitoring circuits to register a fault. When a reset
or a test fails to reproduce the fault, a NFF is generated, and the BIT/BITE starts to get a poor
reputation for identifying spurious faults that cannot be reproduced. Referring to Figure 5.1,
this could well contribute to the NFF and FNF cases shown. As aircraft design and systems
such as AIMS and OMS have been developed, the danger for the maintenance organization
is an overload of data. In excess of 100 BITE messages can describe the condition of one
system such as the landing gear. Does this help the engineer with diagnostics? There are now
too many options, and the path of least resistance may be taken, especially if operational
pressure results in insufficient time to diagnose the fault carefully. If the human factors
contribution is added to the ever more complex problem of achieving maintenance diagnostic
success, the potential for a NFF to be recorded and an error chain to be created is huge.
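As a hedged illustration of what such fault consolidation might look like (the dependency map and message names below are invented, and this is not how AIMS or any particular OMS works), BITE messages raised against downstream components can be suppressed when an upstream data source has also reported, so that the engineer sees candidate root causes rather than the whole cascade:

```python
# Illustrative sketch of fault consolidation over a dependency graph.
# The dependency map and BITE message names are invented for the example.

# upstream component -> components that consume its data (and may raise
# cascaded faults when the upstream source misbehaves)
FEEDS = {
    "air_data_computer": ["autopilot", "display_unit"],
    "autopilot": ["display_unit"],
}

def consolidate(reported: set) -> set:
    """Keep only reported faults with no reporting upstream source,
    i.e., the likely root causes of the cascade."""
    downstream_of_reported = {
        dst for src, dsts in FEEDS.items() if src in reported for dst in dsts
    }
    return reported - downstream_of_reported

messages = {"air_data_computer", "autopilot", "display_unit"}
print(consolidate(messages))  # {'air_data_computer'}
```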

5.6 NFF and Air Safety—A Case Study


Within the UK, military examples exist that would appear to have serious flight safety
implications. One such example concerns faults with the Merlin helicopter and its radio in
Afghanistan, in which transmit/receive faults were often not obvious to the pilots but also
could not be replicated on the ground. In the civilian environment, a recent AAIB report
describes an incident on September 11, 2010, which very nearly led to the crash of a Dash-8
Q400 aircraft, G-JECF [5-17], and it is a case study that demonstrates the link all too well.

During approach to Exeter airport, the aircraft experienced a failure of the number 1 Input
Output Processor (IOP 1). The flight crew became distracted by this failure and was
unaware that the altitude select mode of the flight director had become disengaged. As a
consequence, the aircraft descended below its cleared altitude. Descent continued until,
alerted by an EGPWS alarm, “Terrain, Terrain, Pull Up,” the pilots took manual control,
climbed the aircraft, and reestablished the glide path. The maintenance action following the
incident recorded NFF, with the relevant circuit breaker being reset and the system testing
satisfactorily. The aircraft was released for service with a request for further reports from
the aircrew. The subsequent detailed AAIB investigation found that the IOP 1 failure was
caused by an intermittent electrical contact arising from a cracked solder on two pins of a
transformer on the IOP 1 power supply module. This IOP fault happened on this aircraft
no fewer than eight times between August 22 and October 8, 2010. In each case, the fault had
been recorded as NFF, with various maintenance actions completed, such as swapping the
unit with IOP 2. Indeed, after the first swap on September 20, it was then IOP 2 that was recorded
as faulty. Yet, it was not until October 8 that the faulty serial numbered item was removed
and replaced and sent to the OEM for investigation. It was established that extensive
tests were needed by the OEM to finally reproduce the fault on this IOP. Subsequently, it
was proven that the part had an intermittent fault caused by cracks in the solder of some
surface-mounted components on one of the electronic boards. IOP failures were a common
occurrence, but the units often tested satisfactorily on the ground or were declared
serviceable after resetting a circuit breaker or reinstalling the processor.


Removals were not uncommon, but the NFF rate was high, with only 20% of IOP failures
being confirmed across the fleet. Even among those returned to the OEM, a number were
declared NFF. Consequently, the company has instituted a procedure in which serial numbers
must be tracked more carefully with linkage to reported faults. To reduce the risk of further
IOP units with intermittent faults being declared serviceable and subsequently fitted to
aircraft, the following safety recommendation was made:

Safety Recommendation 2012-019

It is recommended that Thales Aerospace review the Input Output Processor test procedures to improve the detection of intermittent failures of the ERACLE power supply module in order to reduce the number of faulty units being returned to service.

In this incident, we also see several of the “dirty dozen” maintenance errors occurring.
There was complacency with the general acceptance of the repeat occurrence of IOP failures
for which only 20% were confirmed faults. There was a lack of communication within the
organization where fault history and repeat arising information were often not available,
as the aircraft would be operating and staging overnight at different bases with different
maintenance teams.
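A minimal sketch of the kind of serial-number tracking the operator introduced is given below; the record layout and the alert threshold are assumptions made for illustration, not the company's actual procedure.

```python
# Illustrative sketch: tracking NFF events against unit serial numbers so
# that repeat arisings remain visible across bases and maintenance teams.
# The record layout and alert threshold are assumptions, not the operator's
# actual procedure.

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class RemovalEvent:
    serial_number: str
    reported_symptom: str
    shop_finding: str  # e.g., "NFF" or "confirmed fault"

nff_history = defaultdict(list)

def record_event(event: RemovalEvent, alert_after: int = 2) -> None:
    """Log a shop finding and flag units repeatedly declared NFF."""
    if event.shop_finding == "NFF":
        nff_history[event.serial_number].append(event)
        if len(nff_history[event.serial_number]) > alert_after:
            print(f"ALERT: unit {event.serial_number} has "
                  f"{len(nff_history[event.serial_number])} NFF events; "
                  "quarantine and investigate before refitting.")

for _ in range(3):
    record_event(RemovalEvent("IOP-001", "IOP 1 FAIL", "NFF"))
```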

5.7 Conclusion
It is clear that NFF is a serious problem for the airline industry, in particular as it affects
aircraft availability and causes delays and cancellations, all of which have a damaging effect
on airline revenue and reputation. Airlines thus treat NFF in a number of ways. Many will
accept high NFF rates if their delays and cancellations are minimized; ensuring reputation
and revenue is paramount. Others may hide the issue, or may be experiencing NFF without
actually realizing it or its cost to their business. However, NFF has many causes, and
studies show that the human factors that cause maintenance errors and those that cause or
contribute to NFF are very similar. The human factors described as the “dirty dozen” have
received a great deal of publicity, as they have been accepted as the factors behind the
maintenance errors that cause aircraft incidents and accidents. The link between NFF and
aircraft safety is, however, yet to be fully understood and accepted. If there is such
similarity between the causes of NFF and the causes and impact of maintenance errors, it is
only a matter of time before an accident and loss of life can be directly linked to the
occurrence of NFF. How many times, for instance, must the same intermittent fault be
classified NFF before a full and thorough diagnosis takes place? In the highly regulated
world of aviation, it is not yet fully understood by those who are responsible and
accountable that there is a link between NFF and safety. Unless that case can be made, there
is little chance that the regulatory authorities will seek to change current practice. Despite
this, a recent Air Accidents Investigation Branch (AAIB) report makes a clear link between
NFF and a potential near accident [5-17].


5.8 References
5-1. Kaminski-Morrow, D. “BMI A321 strayed off course as pilots battled electrical failure.” Flight Global, 2010.

5-2. Khan, S., P. Phillips, I. Jennions, and C. Hockley. “No Fault Found events in maintenance engineering Part 1: Current trends, implications and organizational practices.” Reliability Engineering & System Safety 123 (2014): 183–195.

5-3. James, I. “The link between NFF and Safety.” Presented at the 3rd Conference on Through-life Engineering Services, 2014.

5-4. European Legislation Commission Regulation (EC) No 2042/2003. European Aviation Safety Agency, publication date 20/10/2003.

5-5. Official Journal of the European Union, L315/4. Available at: http://eur-lex.europa.eu/legal-content/en/ALL/?uri=OJ:L:2003:236:TOC. 2003.

5-6. MAA Manual of Air Safety, RA 1020. Available at: http://www.raf.mod.uk/rafcms/mediafiles/974f1f4f_5056_a318_a8e6b0715a5e79d7.pdf.

5-7. Hessburg, J. “Functionability Management–A tribute to Jack Hessburg.” Presented at the 23rd MIRCE International Symposium, Dec. 3–5, 2013.

5-8. ARINC Working Group 672. Guidelines for the Reduction of No Fault Found (NFF). ARINC, 2008.

5-9. Alderliesten, R. “Lecture: Introduction to Aerospace Engineering II.” Available at: http://ocw.tudelft.nl/courses/aerospace-engineering/introduction-to-aerospace-engineering-ii/lectures/fail-safe-safe-life/. 2014.

5-10. Moir, I., A. Seabridge, and M. Jukes. Civil Avionics Systems. John Wiley & Sons, 2013.

5-11. U.S. Department of Transportation, Bureau of Transportation Statistics. “National Transportation Statistics.” BTS TranStats, 2010.

5-12. CAA. Available at: http://www.caa.co.uk/docs/1743/COR001.pdf. 2002.

5-13. Avižienis, A., J. C. Laprie, and B. Randell. “Dependability and its threats: a taxonomy.” In Building the Information Society, 91–120. Springer, U.S., 2004.

5-14. Reason, J. Human Error. Cambridge University Press, Cambridge, 1990.

5-15. Dupont, G. “The dirty dozen errors in maintenance.” In: Proceedings of the Eleventh FAA Meeting on Human Factors Issues in Aircraft Maintenance and Inspection, 1997.

5-16. Qi, H., S. Ganesan, and M. Pecht. “No-fault-found and intermittent failures in electronic products.” Microelectronics Reliability 48, no. 5 (2008): 663–674.

5-17. AAIB Bulletin 6/2012, G-JECF, EW/C2010/09/04. Available at: http://www.aaib.gov.uk/publications/bulletins/june_2012/dhc_8_402_dash_8__g_jecf.cfm. 2012.

Chapter 6
Operating Policies for Management Guidance

6.1 Introduction
This book discusses various facets of the NFF problem over the entire system life cycle.
So far, these have included issues such as availability, human interaction, management,
system design, and technology. The range of these topics gives an idea of the complexity
within which the phenomenon occurs. Although the complete elimination of this
complexity is not a realistic expectation, a structured approach or a defined policy will
help minimize its effects [6-1].

Policies are guidelines that assist in achieving the defined objectives (what must be
achieved) of a function [6-2]. These indicate a course of action to be followed to cope
with situations as they arise. To verify the applicability of a policy, organizations will
assess if it can help in effectively addressing a single existing problem, or help with a
recurring situation.

The research behind this book has revealed that organizations seldom define
NFF mitigation policies, and hence they rarely exist in any written form. Therefore,
it is useful to talk about some policy requirements related to NFF events, occurring
during maintenance, which can help management recognize the interrelationships
that exist between the various functions (or departments) of the organization. This
will promote an organization-wide understanding of the principal causes that
result in the NFF phenomena.

To put everything in perspective, the next section highlights the through-life
engineering services context. This will help the reader to understand the fundamental
principles, relationships, mechanisms, and interactions married to the NFF issue. This
topic is followed by a discussion about the policies that should be in place to facilitate
the NFF reduction process. Fortunately, some guidelines, in the form of ARINC 672 [6-3],
already exist. These guidelines can help make the task of establishing operating policies a
little less troublesome, and they are presented next. The chapter concludes with a case study.

By the end of this chapter, the reader will be able to plan out a structured approach to
address the NFF problem, in terms of how to proceed to achieve their maintenance
objectives, and provide criteria for:

• Decision making regarding root causes
• Taking maintenance actions at an early stage of the component repair cycle
• Ultimately, reducing costs by avoiding unnecessary unit removals

6.2 Through-Life Engineering Services Context


Figure 6.1 presents a matrix of stakeholder interaction, in one organization or many,
against a systems engineering breakdown of the asset. More information on this overall
topic can be found in Redding and Roy [6-4]. The illustration highlights the various phases
and levels involved within the complete life of an engineering system. These include:

• System Engineering Design phase (system realization)
  • Development
  • Integration and Validation
• System Operation phase (system use)
  • Operation
• System Maintenance phase (Maintenance and Support)

The phases can be focused on the following four levels: Component, System, Aircraft, and
Fleet. This kind of abstraction helps to identify which entity is affected in the corresponding
phase.

The System Design phase has a significant impact on the System Operation and Maintenance
phases, because it determines all of the requirements, constraints, and restrictions to
guarantee a service.

Within the System Design phase, various activities, such as defining the architecture,
components, modules, interfaces, and data for a system to satisfy specified requirements, are
involved at the different levels. It should be understood that many operational characteristics
will have their origins within this phase, including system behavior, BITE, troubleshooting,
training requirements, interface control, etc.


[Figure 6.1: matrix of stakeholder interaction across the System Design, System Operation, and System Maintenance phases at the Fleet, Aircraft, System, and Component levels, covering items such as product planning/scheduling, benchmarking, knowledge management, system engineering and tools, standards/regulations, airworthiness, system and component testing, component use, and changes and upgrades.]

Figure 6.1 Through-life engineering services in aerospace.

The Systems Operation and Maintenance phases will span the longest time, when the system
is being used and put into service. During this time, the maintenance and its corresponding
support also form the central activities that will impact system availability.

To investigate any fault/failure, all four levels highlighted in Figure 6.1 must be considered.
This is because specific documentation and competencies are developed for their
implementation (e.g., the maintenance of components may require operational information
from the System level for repair).

When diagnosing a fault, it is possible that the fault behavior cannot be replicated within a
particular level. The reason for this could be that a fault that is detected in one level might be
triggered from another level. The worst-case situation is when the symptom and root-cause
of the fault all exist at different levels. To resolve such instances, studying the interactions
between these different levels becomes crucial for fault isolation and resolution.


The complexity of the worst-case problem lies in the fact that failures can, for the most part,
be resolved on the basis of the recorded data, user-filled reports, existing maintenance
documentation, and on-field engineering experience. When NFF events are reported, this
approach will prove insufficient. Figure 2.1 (chapter 2) illustrated this idea in which the
existing diagnostic process is insufficient during maintenance, and the primary aim is to
isolate the fault to a “group of SRUs.”

Figure 6.2 expands upon this argument by showing the stakeholders and processes involved
during the phases of a component’s life cycle (at the component level). It aims to depict the
complexity of the interactions that exist, and thus the inherent potential NFF issues. Further
details of these interactions are listed in Table 6.1.

Figure 6.2 The stakeholders and their interaction at component level.


Table 6.1 Organizational Interactions

Between System Design and Operation:
• The Manufacturer, in addition to initial design/production and delivery of the aircraft to the Operator, is the design authority and provides maintenance support. Maintenance Documentation, Service Bulletins (SB), and Service Instruction Letters (SIL) all come from the Manufacturer.
• The Manufacturer often receives Maintenance Support Requests from the Operator and can also be asked for Maintenance Data from the Operator.
• The Aircraft is the subject of consideration, which is, among others, usually equipped with some form of On-Board Maintenance System (OMS), Logbook, and technical documentation, etc.
• The Operator will receive the Maintenance Records from the aircraft’s OMS/Logbook. In response, the Maintenance Organization (or the operator’s Engineering Department) will carry out Maintenance and deliver this information in some suitable format to the OMS and the Logbook (on board the aircraft) or to the maintenance information system. This department will receive Performance Data from the aircraft and will then provide the required Engineering Support.

To / From Maintenance:
• Line Maintenance will receive Maintenance Status information from the Aircraft via the Logbook and will perform the required Maintenance Actions. They will make corresponding Logbook Entries to document the Return-to-Service (RTS) status.
• Shop Maintenance will receive Unserviceable LRUs from Line Maintenance for testing, troubleshooting, calibration, repair, etc. It will provide Serviceable LRUs for replacement.
• Shop Maintenance will deliver Unserviceable SRUs to the Supplier or OEM of that equipment for testing, troubleshooting, calibration, repair, etc. It can receive Serviceable SRUs from the Supplier/OEM in return.

To / From Supplier/OEM:
• The Supplier/OEM will support Shop Maintenance by replacing Unserviceable SRUs with Serviceable SRUs. Unserviceable SRUs may undergo bench testing, troubleshooting, calibration, repair, etc. They can even reach their End-of-Life and Beyond-Economical-Repair status.
• The Supplier/OEM will deliver Systems/Components to the aircraft Manufacturer for initial production, and can also sometimes receive unserviceable/rogue Systems/Components back.


Within the interactions highlighted in Table 6.1, a number of issues arise:

• The OEM may not understand the circumstances of a failure [6-5]: NFF is inherently a byproduct of a lack of detail about the environment in which the failure occurred, and of the inability of testing to replicate that environment and fault. In other words, a component is tagged as NFF by the supplier, or repair station, due to a lack of incoming information about the part and/or bench test procedures that are too restrictive. This means that a test bench that reproduces the actual environmental conditions may be necessary to find the cause of NFF.
• Reliance on the Acceptance Test Procedure to identify faults: During the troubleshooting
procedure, the manufacturer will have issued a set of procedures (for particular fault
codes/failure modes) that were developed during the system design phase. However,
when these fail to identify the problem, other resources must be brought to bear
during system operation: help escalation channels, technician training, supporting
documentation, etc. Due to this, it is often difficult to define a fixed set of test procedures
that can verify the full functionality of a component. Consequently, this situation will
lead to a log report that contains spurious fault detection (e.g., operator/pilot reports on
faults may not correspond to the test logs, resulting in overlooked maintenance issues).
• An over-sensitive BIT system intolerant of intermittency [6-6]: The design of a BIT
system is a non-trivial task and relies deeply on the knowledge of all the system
interactions. As electronic equipment evolves into ever-more complex systems,
BIT is increasingly depended upon to provide in situ fault detection and isolation
capabilities. Failures reported by over-sensitive BIT tests can be costly, and are likely
to result in component replacement, recertification, or inevitable loss of availability
of the equipment. The nature of BITs will be, in some way, dependent upon a set of
predefined statistical limits for the various parameters that are being monitored. It is
important to recognize that BITs will report failures when either they have exceeded
a specified threshold, or when the intermittency of the BIT measurements throws
the test results outside of the testing limits. The former of these is a direct result of
component failure, such as a burned-out resistor. The latter occurs when a parameter that has intermittent errors is measured by an instrument that has its own noise; a minimal sketch of such a persistence-based threshold check is given after this list.
• Intermittent faults not detected by test equipment [6-7]: Intermittent faults are arguably the most problematic of the NFF events due to their elusive nature, which makes detection by standard test equipment difficult. The faulty state will often lie dormant until a
component is back in operational use, where it eventually causes further unit removals
unless a genuine cause is found. It should be emphasized that these failures are not
always present during testing, which makes them troublesome to isolate. This situation
can result in repeated removals of the same equipment for the same symptom, with
each rejection resulting in the equipment being tagged as NFF. At this stage, probability
is high that there will be loss of system functionality, integrity, and, perhaps, even an
unacceptable compromise in safety requirements.
• The nature of repairs does not reflect the original failure: This argument goes back to
Figure 3.1 (chapter 3 on Human Influence), where we start with “THE” problem, and
over time are left with “A” problem, which is not related to the root cause. The original
defect is likely to reappear, and as a result of unsuccessful troubleshooting attempts, it
will directly result in unscheduled maintenance jobs.
• Multiple rejections for apparently the same failure [6-8]: The ability to recognize a
failure is of paramount importance in mitigating the effects of NFF events. The key to
distinguishing failures is to implement the necessary procedures to track the underlying
conditions in which they occur. These underlying conditions include the environment,
the platform on which the components were installed, number of operating hours/
cycles, number of hours since the component’s last overhaul, and a genuine reason for
the generated removal codes. In addition to this, the history of the operating platform
(be that a wind turbine, aircraft, or train) should be recorded to determine the exact
effects the failure has on the overall system.
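As a purely illustrative sketch of the threshold behavior described above (the monitored parameter, its limits, and the persistence count are invented, not taken from any real BIT specification), one common way to tolerate measurement noise is to latch a failure only when a reading stays outside its limits for several consecutive samples:

```python
# Illustrative only: parameter values, limits, and persistence count are
# invented, not taken from any real BIT specification.

def bit_check(samples, low, high, persistence=3):
    """Latch a failure only if the parameter stays out of limits for
    `persistence` consecutive samples; isolated noise spikes are ignored."""
    consecutive = 0
    for value in samples:
        if value < low or value > high:
            consecutive += 1
            if consecutive >= persistence:
                return "FAIL"
        else:
            consecutive = 0
    return "PASS"

noisy_but_ok = [5.0, 5.1, 9.9, 5.0, 5.2, 9.8, 5.1]   # isolated spikes only
hard_failure = [5.0, 9.9, 9.8, 10.1, 10.0, 9.9]      # sustained exceedance

print(bit_check(noisy_but_ok, low=4.0, high=6.0))  # PASS
print(bit_check(hard_failure, low=4.0, high=6.0))  # FAIL
```

A persistence filter of this kind reduces spurious BIT reports, but it is a design trade-off: set too long, it will also mask the genuine intermittent faults discussed above.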

This list is non-exhaustive, but the above discussion does help to show that the NFF
phenomenon creates time-consuming problems and costly bottlenecks within the
maintenance program that must be controlled. This could be achieved by introducing a
policy regulating reported NFF events within organizations (discussed in the next section).
To close the loop from the previous discussion, Figure 6.3 is a system-level view of how these
issues manifest themselves and what strategies can be developed to mitigate their effects.

[Figure 6.3: cause-and-effect map in which issues (intermittent equipment faults, intolerant BIT design, ATPs that do not capture intermittent faults, minimal reporting of failure circumstances, and minimal OEM engineering investigation) drive frequent aircraft failures and a high OEM NFF rate, while actions (BIT design changes to reduce sensitivity to intermittent failures, on-aircraft diagnostic software, improved reporting of failure circumstances, additional testing in a representative environment, and detailed engineering analysis by the OEM in addition to the ATP) lead to identified hardware and software defects, corrective actions, and faults being found.]

Figure 6.3 Developing a NFF reduction process.


A reduction in NFF can be achieved through the following:

• Software updates: A software design change can be initiated, but roll out will be
dependent on the various organizations involved, especially if third parties/contractors
are also involved.
• Improved testing: A hardware refurbishment program can be established to address
soldering problems.
• Warranty considerations: Contractual changes could be negotiated to facilitate a
refurbishment program.
• Procedural changes: An on-aircraft failure diagnosis procedure can be provided to enable
detailed data analysis before removing equipment. This information, which can be shared
with the OEM, can help in duplicating the conditions in which the fault took place.
• Effective communication: A close liaison between the OEM, operator, and maintainer
can assist to maximize effectiveness of repairs, including detailed analysis of the data by
the OEM.

It should be emphasized that all corrective action must be related to the original fault to
prevent it from repeating itself.

6.3 Policy Requirements


Now that we have discussed the context and some of the NFF issues that occur during
system operation and maintenance, we will move our attention toward the requirements a
NFF mitigation policy should consider. These should include:

• The scope and limits of NFF events: This book creates a distinction between two types
of NFF events. A True NFF event includes factors such as misread reports, operator error,
job pressures, and poor system understanding. The other type is a False NFF event
that is caused by procedural shortfalls during maintenance activities. It indicates that a
fault is still present within the system, but an incomplete test procedure or the inability
to reproduce fault conditions means it remains undetected. For an organization, this
will determine what is within its terms of reference (jurisdiction) for NFF investigation,
and what should be excluded. This is not an argument against including (or excluding)
particular NFF events, but stresses that such a decision should be made on the basis of
explicit directives, definitions, and mutual understanding with the organization
(or sector).
• Type and level of troubleshooting expected: This refers to the amount and intensity
of the troubleshooting that is expected to determine whether a component must be
removed or not (e.g., what levels of checks are required on suspected units, how many
times can a rogue unit be put back into service?) Some aerospace organizations have a
policy on rogue units, which says that they will remove a unit from service if it has been
tagged as an NFF three times. Of course, such factors will rely on establishing a balance
between the costs incurred and the time taken to carry out the troubleshooting process.
If an organization is inclined toward cost savings, the level of troubleshooting might
be limited. This, however, embodies two other factors, which include the functional
performance of the component and the organization’s reputation. The functional
performance is concerned with the seamless operation of a component (e.g., we know
there are intermittent faults, but this does not affect the component’s ability to achieve a


level of performance at the moment). There is no guarantee that these intermittent faults
will increase over time and result in a functional performance failure. The cost savings
aspect will influence this decision.
The other factor, the organization’s reputation, is determined mainly by an
organization’s culture (i.e., if they are aiming to improve customer relations, or win
more contracts, they will aim at resolving intermittent fault issues as a demonstration
of their attention to detail and hence overspend on the total cost within the
troubleshooting process).
Therefore, the standard for the level of troubleshooting needed, for managing both
functional performance and reputation, must be dictated by means of the policy.
• Role and responsibilities of management: Investigating NFF is not just a peripheral
activity within an organization, and this must be reflected within the senior
management to establish a common understanding with regard to the consequence of
NFF on an organization. Principally, managers participate in (or facilitate) the decision-
making process for the allocation of resources, the development and implementation
of strategic plans, and the establishment of intervention and control strategies. Due to
their role, managers implement strategies and practices that aim to improve standards
and other related tasks. Aviation managers, in particular, may be required to establish
approaches for effective control, reduce the incidence of aviation accidents, increase the
level of aviation operations, and mitigate the variables that are likely to result
in undesired and damaging outcomes that compromise safety. It should also be noted
that achieving such goals remains a daunting task, especially when the operational
context of the industry is open, dynamic, and complex.
It should be clear by now that the influence of NFF, on maintenance plans and system
availability, is far more evident to maintenance managers. This statement is supported
by several cogent arguments that indicate the complexities within commercial contracts,
organizational bureaucracies, and the lack of adequate metrics for costing the impact of
NFF units. Therefore, the policy must recognize the role of senior management as a vital
function in the need to improve supporting actions and budgeting for NFF reduction.
The responsibilities can be divided into three distinct levels, each providing various
vantage points on how decisions are made from each level:
• Top-level management: This level should aim to provide an overall view of
knowledge management challenges for the global business, including:
• Developing (or recognizing) a maintenance strategy that conforms to the
business’ vision.
• Agreeing on contracts including warranty and indemnities in the case of hidden
failures.
• Consenting to provide adequate training, material resources, and authorization for
investigating NFF occurrences.
• Defining a specific allocation of funds for NFF within the yearly maintenance
budget.
• Defining responsibilities to foster a culture that promotes interdisciplinary
integration of maintenance lines on all levels—identify key performance indicators.
• Reviewing yearly statistics on hidden system failures and evaluating them against
formal customer requirements and safety standards.


• Middle-level management: This level should concentrate on developing procedures
and implementing NFF strategies, based on the priorities set by the top-level
management:
• Developing procedures (or instructions) that adhere to the NFF strategy set out by
the top-level management.
• Monitoring the implementation of reactive maintenance objectives and
performance.
• Communicating and providing feedback to senior management on NFF information
related to the incident, including resources used, statistics, and performance.
• Ensuring troubleshooting competence levels are preserved.
• Ensuring solutions and expertise are fed back to design to suggest improvements.
• Making comprehensive decisions under operational pressure.
• Liaising and negotiating with OEM, or third-parties and outsourcing partners.
• First-line management: This level should be provided with fault isolation manuals
(or troubleshooting guides) that contain all possible fault codes and step-by-step
procedures. Personnel should be encouraged to follow best practice guides and be
involved with:
• Implementing and coordinating appropriate actions.
• Making comprehensive decisions under operational pressure.
• Personnel practices: Other essential factors that need to be covered within a NFF
mitigation policy include reporting and training. Adequate reporting can ensure that
correct and sufficient data are collected and recorded to allow maintainers at all levels
in Figure 6.1 to have the complete fault history of a suspected component. This may
include reports from manufacturers (or subcontractors) with a component on its return,
detailing the original fault and any work that was carried out. Such documentary
evidence will help achieve two goals: it helps to record on-field knowledge and it will
grab the attention of management.
The former goal addresses our discussion that most of the knowledge for reducing NFF
events still exists with only a few experienced experts, or in personalized organizational
databases. These experiences need to be preserved and disseminated through training
courses and interactions with other experts, to ensure that levels of expertise are
retained. The latter goal is a byproduct of reporting—which creates awareness. It will
help to answer questions such as: “Why were certain actions taken?” or “What were the
root causes that may have caused the failure?” or “What training can be provided to
improve productivity?” and “How do we modify diagnostic systems to further enhance
the troubleshooting process?” This indicates a much more proactive approach on both
individual and organizational levels. With structured reporting in place, many of the
problems associated with management’s attitude toward NFF may well resolve
themselves.
The other factor just mentioned is training. Modern technology brings many
new pressures and complications to businesses striving to remain competitive,
emphasizing the importance of providing training to the maintenance
personnel who build or repair the company’s assets. Even though some intelligence


may be designed into the system for anticipated failures, investigating NFF events is
rather challenging. In such cases, policies must deal with the following points:
• Raising awareness of the operators and maintainers on how and why NFF
events occur
• Regulating fault manual updates
• Recording and disseminating on-field experience
• Maintaining the level of on-field competence
• Allowing access to applied knowledge (i.e., what worked and what did not).
The effectiveness of a maintenance system can only be as good as the people who
control it, and no effort should be spared when it comes to training.
This section provides the building blocks for a policy that can assist design, management,
and maintenance personnel on their mission to control NFF events. Its establishment will
need sufficient knowledge regarding the system and the ability to accommodate existing
requirements within available means. Working out the details and incorporating them into a
“How-to” guideline will certainly mark a major milestone in the maintenance community’s
mission to control the NFF phenomenon.

The following section presents an approach that can help with establishment of a policy as
part of the NFF reduction process.

6.4 The NFF Control Process


“We don’t have any specific NFF mitigation policies yet, but we want to reduce its impact right
now. What should we do?” Fortunately, there is a process that can be tailored and customized
for specific operational environments for the reduction of NFF. This section presents an outline
of this process, which is called “Guidelines for the reduction of No Fault Found” [6-3].

Published in 2008, this report proposed a set of guidelines specifically for the aviation
industry to deal with the NFF problem. The purpose was to:

Help the industry understand the nature of NFF and to develop conclusions that each
individual organization could use to help it address and solve the problem.

The report was the first to discuss the idea of focusing on the complete life cycle of electronic
equipment (i.e., from the initial design stages through to their deployment and maintenance).
This not only allowed organizations to investigate current NFF events, but also helped to close
the loop and feed the acquired knowledge back—influencing equipment design of the future.


As illustrated in Figure 6.4, this setup provides criteria for decision making regarding root
causes, and describes the importance of taking maintenance actions at an early stage of
the component repair cycle. It further highlights the necessary means of reducing costs
by avoiding unwanted component removals from the aircraft. It should be noted that the
guidelines are intended to be customized for specific operational environments.

[Figure 6.4 elements: within the Air Transport Environment (Design/Production, Flight Operations, Line Operations, Shop Operations), the process moves through Establish Candidates (capture and assess data, establish criteria, identify candidate), Establish Source (establish criteria, establish causes, determine sources), Select Solution (establish criteria, generate possibilities, select solutions), and Implement Solution (record data, feed back results).]
Figure 6.4 The ARINC 672 NFF reduction process provides an interactive framework for
various domains when customizing interdisciplinary processes.


The guideline suggested a framework, created to narrow down the complex nature of the
problem to something more manageable, by dividing the problem into “domains,” which
include:

• Design/Production
• Flight Operations
• Line Maintenance Operations
• Shop Maintenance Operations

Each domain, in turn, contains the following “categories”:

• Documentation
• Communication
• Training
• Testing
• Systems and Components

The “Establish Candidates” phase emphasizes various data collection aspects related to
the maintenance event. This is required to establish a set of selection criteria to be used by
the organization to implement the NFF reduction process. This may include information
such as the LRU part/serial numbers, financial impacts, unscheduled removal rates, period
over which occurrences have been observed, history, etc. These data are used to short list
(identify) candidates. ARINC 672 leaves the detail of this stage to each user/application,
because it is highly individual for different airlines/operators. It assumes that the NFF
selection stage has already revealed a potential NFF candidate for a particular NFF problem
that is being experienced.
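To make the candidate short-listing idea concrete, the sketch below filters collected removal data against simple selection criteria. It is an illustrative assumption only: the record fields, threshold values, and function names are not prescribed by ARINC 672, which leaves this stage to each operator.

```python
# Hypothetical sketch of the "Establish Candidates" step: short-list LRUs whose
# unscheduled-removal rate and financial impact exceed operator-defined criteria.
# Field names and threshold values are illustrative only.
from dataclasses import dataclass

@dataclass
class RemovalRecord:
    part_number: str
    serial_number: str
    unscheduled_removals_12m: int   # removals observed over the review period
    nff_removals_12m: int           # removals later declared NFF in the shop
    cost_per_removal: float         # estimated cost of one shop visit

def establish_candidates(records, min_removals=3, min_nff_ratio=0.5, min_cost=10000.0):
    """Return part numbers that meet all selection criteria."""
    candidates = []
    for r in records:
        nff_ratio = (r.nff_removals_12m / r.unscheduled_removals_12m
                     if r.unscheduled_removals_12m else 0.0)
        financial_impact = r.nff_removals_12m * r.cost_per_removal
        if (r.unscheduled_removals_12m >= min_removals
                and nff_ratio >= min_nff_ratio
                and financial_impact >= min_cost):
            candidates.append((r.part_number, nff_ratio, financial_impact))
    # Highest financial impact first, so the costliest NFF drivers are tackled first
    return sorted(candidates, key=lambda c: c[2], reverse=True)

if __name__ == "__main__":
    fleet = [
        RemovalRecord("PN-1234", "SN-001", 6, 4, 5500.0),
        RemovalRecord("PN-9876", "SN-104", 2, 0, 4200.0),
    ]
    print(establish_candidates(fleet))
```

In practice, the criteria and their weighting would come from the organization's own NFF reduction policy and cost data.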

The “Establish Source” phase helps to determine the most likely NFF source, many of
which have been listed in Table 6.2 [6-3]. Comparing the observed NFF behavior with the
domain and category helps to identify the major NFF driver in the event. At this stage, it
is assumed that a NFF problem has been observed and a possible candidate selected. Then
Table 6.2 is scanned in a left-to-right and top-down sequence to determine if the observed
situation is covered by the entries in one or more cells of the matrix.
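One way to picture this scan, purely as an illustrative sketch, is to hold Table 6.2 as a lookup keyed by (category, domain) and return every cell whose listed sources match the observed behavior; the cell contents below are abridged from the table and the function is hypothetical.

```python
# Illustrative sketch only: Table 6.2 held as a (category, domain) lookup so an
# observed NFF situation can be matched to likely sources and recommendations.
# Cell contents are abridged; the full table appears in the text below.
NFF_MATRIX = {
    ("Documentation", "Line operations"): {
        "sources": ["Deficient TSM/FIM", "Deficient logbook entries"],
        "recommendations": ["Improve TSM/FIM descriptions", "Improve Logbook recordkeeping"],
    },
    ("Communication", "Shop operations"): {
        "sources": ["Insufficient communication with line maintenance"],
        "recommendations": ["Improve internal communication between shop and line maintenance"],
    },
}

CATEGORIES = ["Documentation", "Communication", "Training", "Testing", "Systems and Components"]
DOMAINS = ["Design/production", "Flight operations", "Line operations", "Shop operations"]

def establish_source(observed_keywords):
    """Scan the matrix top-down (categories) and left-to-right (domains) and
    return every cell whose listed sources mention one of the observed keywords."""
    matches = []
    for category in CATEGORIES:
        for domain in DOMAINS:
            cell = NFF_MATRIX.get((category, domain))
            if not cell:
                continue
            if any(kw.lower() in s.lower() for s in cell["sources"] for kw in observed_keywords):
                matches.append((category, domain, cell["recommendations"]))
    return matches

print(establish_source(["logbook"]))
```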


Table 6.2  Sources/Causes of NFF and Their Recommended Remedial Actions

Documentation Category

Domain: Design/production
NFF Sources:
• Inadequate design/production documentation
Recommendation:
• Obtain feedback from field/operations
• Produce unambiguous accurate documents (e.g., FCOM, AMM, CMM, TSM, FIM, SIL)
• Produce timely temporary revisions of pertinent documents (e.g., FCOM, AMM, CMM)
• Obtain/produce clear and unambiguous requirements in documentation (note, both actor and recipient roles)
• Analyze in-service events to detect requirements related deficiencies
• Follow-up with in-service findings in general

Domain: Flight operations
NFF Sources:
• Misleading/incomplete documentation, processes, and procedures
• Unclear reporting in Logbook
• Unaware of available documentation from aircraft manufacturer
Recommendation:
• Improve reporting processes and procedures (e.g., Logbook entries)
• Improve reporting means/tools
• Improve documentation
• Improve documentation/information review and distribution process


Table 6.2  Sources/Causes of NFF and Their Recommended Remedial Actions (continued)

Documentation Category (continued)

Domain: Line operations
NFF Sources: Deficiencies in:
• Aircraft Maintenance Manual (AMM)
• Troubleshooting Manual (TSM)/Fault Isolation Manual (FIM)
• Component BITE user’s manual (considered to be part of TSM/FIM)
• Logbook entries
• Repair history keeping
• Configuration management
• Awareness of sources and available documentation (e.g., Aircraft Manufacturer, OEM, Supplier, Operators)
Recommendation:
• Improve TSM/FIM descriptions
• Improve AMM descriptions
• Provide information feedback to design/production
• Improve reporting discipline (e.g., Removal Tag, Logbook)
• Improve Logbook recordkeeping
• Improve history keeping
• Improve documentation/information review and distribution process

Domain: Shop operations
NFF Sources: Deficiencies in:
• BITE decoding information in CMM
• Component Maintenance Manual (CMM)
• Multipurpose ATE TPS implementation
• Removal data
Recommendation:
• Provide information feedback to Design/Production and Line
• Request removal data


Table 6.2  Sources/Causes of NFF and Their Recommended Remedial Actions (continued)

Communication Category

Domain: Design/production
NFF Sources:
• Insufficient level of communication between and/or among design engineering and test engineering
• Insufficient level of communication between design organization and end user
Recommendation:
• Obtain feedback from field/operations
• Inform field of changes/updates/status of corrective actions
• Institute sufficient level of internal communication between design engineering and test engineering
• Institute sufficient level of external communication between design engineering and end user (e.g., Flight, Line, Shop)

Domain: Flight operations
NFF Sources:
• Imprecise/lack of reporting by the previous flight crew
• Lack of/insufficient communication between flight crew and aircraft maintenance
Recommendation:
• Improve communication between rotating flight crews
• Improve communication between flight crew and maintenance
• Introduce an information acknowledgement process


Table 6.2  Sources/Causes of NFF and Their Recommended Remedial Actions (continued)

Communication Category (continued)

Domain: Line operations
NFF Sources:
• Insufficient communication with flight operations
• Insufficient communication with shop operations (maintenance organization, OEM, etc.)
• Insufficient communication among line operations
• Insufficient feedback to manufacturer
Recommendation:
• Disseminate (good) tribal knowledge/behavior
• Improve communication between maintenance and flight crew
• Improve internal communication among line maintenance
• Improve internal communication between line maintenance and shop (maintenance organization, OEM, etc.)
• Improve external communication between line maintenance and external maintenance organizations
• Introduce an information acknowledgement process

Domain: Shop operations
NFF Sources:
• Lack of/insufficient communication with line maintenance
• Lack of/insufficient communication with external repair service providers
• Lack of/insufficient communication with OEM
Recommendation:
• Disseminate (good) tribal knowledge/behavior
• Feedback of workshop experience related to faults found to Design/Production
• Improve internal communication among shop
• Improve internal communication between shop and line maintenance
• Improve external communication between shop and repair service providers (e.g., maintenance organization, OEM)
• Introduce an information acknowledgement process


Table 6.2  Sources/Causes of NFF and Their Recommended Remedial Actions (continued)

Training Category

Domain: Design/production
NFF Sources:
• Insufficient design experience and knowledge of new technology implementation consequences
• Lack of operational and maintenance environment knowledge
Recommendation:
• Provide opportunities for appropriate experience and training in/exposure to new technology
• Obtain/gain in-depth operational and maintenance environment knowledge
• Provide sufficient knowledge of test requirements

Domain: Flight operations
NFF Sources:
• Lack of understanding of normal (specified) system behavior
• Insufficient training in observed fault interpretation
• Unexpected system behavior
Recommendation:
• Improve training to understand the usage/operation of aircraft systems
• Improve specific/appropriate training to report observed flight deck/cabin effects

Testing Category

Domain: Design/production
NFF Sources:
• Insufficient design for testability expertise
• Insufficient knowledge of test requirements (e.g., RTS, ATP, V and V Test)
• Poor test design
• Test implementation
Recommendation:
• Ensure design for testability both on system and component level
• On-aircraft performance criteria/tolerances to be consistent with shop RTS criteria

Domain: Flight operations
NFF Sources: No explicit test activity
Recommendation: No explicit test activity


Table 6.2  Sources/Causes of NFF and Their Recommended Remedial Actions (continued)

Training Category (continued)

Domain: Line operations
NFF Sources: Insufficient training related to:
• System operation
• Fault reporting
• OMS usage
• GSE usage
• Troubleshooting and repair techniques
Recommendation:
• Propagate tribal knowledge to the advantage of all actors involved
• Provide training related to: system operation; fault reporting; OMS usage; GSE usage; fault isolation/troubleshooting and repair techniques; removal documentation

Domain: Shop operations
NFF Sources: Insufficient training related to:
• System operation
• Component operation
• Test equipment operation
• Repair techniques
Recommendation:
• Propagate tribal knowledge to the advantage of all actors involved
• Provide awareness training to consult all available maintenance history
• Provide training related to: system operation; component operation; test equipment operation; shop repair techniques

Testing Category (continued)

Domain: Line operations
NFF Sources: Insufficient:
• Fault isolation/troubleshooting manual test procedures
• Aircraft maintenance manual test procedures
• Time to perform necessary testing (TAT constraints)
Recommendation:
• Improve troubleshooting and test documentation
• Allow sufficient time for line maintenance to perform necessary testing

Domain: Shop operations
NFF Sources: Insufficient:
• Design for testability
• RTS test implementation
• Test coverage
• Awareness of unit repair history
• Time to perform necessary fault isolation/troubleshooting (TAT constraints)
Recommendation:
• Review previous removal history before starting any testing
• Report test specification implementation issues
• Allow sufficient time for shop maintenance to perform necessary fault isolation/troubleshooting
• Develop additional test procedures based on experience


Table 6.2  Sources/Causes of NFF and Their Recommended Remedial Actions (continued)

System/Components Category

Domain: Design/production
NFF Sources:
• Unclear design requirements
• Poor BITE specifications
• Limitations in hardware/software
• Poor system specification
• Insufficient integration effort
Recommendation:
• Produce comprehensive maintenance documentation (e.g., AMM, CMM)
• Provide effective/efficient means for analysis
• Ensure design that allows sufficient test coverage
• Provide test specification based on feedback obtained from operations
• Provide OMS capability that will support a shorter/simplified fault isolation time requirement
• Develop adequate BITE capability (coverage and interpretation/decoding – line and shop)
• Select hardware/software capabilities to match the design requirements
• Use design methodology that ensures visibility of functional dependencies between systems
• Provide sufficient integration means/resources
• Allocate adequate resources

Domain: Flight operations
NFF Sources:
• Unexpected system behavior
• Incorrect system operation by user
Recommendation:
• Provide proper feedback of unexpected behavior
• Improve flight crew awareness on system behavior changes


Table 6.2  Sources/Causes of NFF and Their Recommended Remedial Actions (continued)

System/Components Category (continued)

Domain: Line operations
NFF Sources:
• Misleading/insufficient BITE information
• No flight fault history recorded
• Limitations in fault isolation
Recommendation:
• Report to Design/Production for design and manufacturing improvements as applicable, e.g.: observed/BITE reported malfunctions; possible faults; reliability issues; “work-arounds” to cope with known issues

Domain: Shop operations
NFF Sources:
• Insufficient BITE decoding information
• Insufficient design for testability
• Poor overhaul and repair means/workmanship (e.g., soldering)
• BITE data memory is erased in shop maintenance automatically
Recommendation:
• Report to Design/Production for design and manufacturing improvements as applicable, e.g.: observed faults/BITE reported malfunctions; reliability issues
• Provide shop operator with a selectable option to erase fault history – no automatic erase


The “Select Solution” phase refers to the various recommended solutions listed in Table 6.2,
which correspond to the particular category and domain identified in the previous step. Once
the source(s)/cause(s) have been established, progress is made toward selecting an appropriate
solution. This can lead to one or more possible solutions, and the process requires narrowing
these down to select the most appropriate/preferred one, after which an action is determined
and implemented.
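As a hedged illustration of this narrowing-down step (the criteria, weights, and scores are invented for the example and are not part of ARINC 672), candidate recommendations could be ranked with a simple weighted score:

```python
# Hypothetical sketch of narrowing candidate solutions: each recommendation drawn
# from Table 6.2 is scored against operator-chosen criteria and the best one selected.
# Criteria names, weights, and scores are illustrative only.
def select_solution(candidates, weights):
    """candidates: {solution_name: {criterion: score 1..5}}; returns (best_name, scores)."""
    def weighted(scores):
        return sum(weights[c] * s for c, s in scores.items())
    return max(candidates.items(), key=lambda item: weighted(item[1]))

weights = {"cost": 0.4, "implementation_time": 0.2, "expected_nff_reduction": 0.4}
candidates = {
    "Improve TSM/FIM descriptions": {"cost": 4, "implementation_time": 3, "expected_nff_reduction": 3},
    "Introduce acknowledgement process": {"cost": 5, "implementation_time": 4, "expected_nff_reduction": 2},
}
best, scores = select_solution(candidates, weights)
print(best)
```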

Ultimately, ARINC 672 can help to improve an organization’s position by developing:

• A methodological approach that helps with understanding the essential underlying
principles, mechanisms, relationships, and interactions of a NFF situation
• A comprehensive understanding of NFF
• A NFF reduction framework that can be tailored for application to a wide variety of
tasks
• An integrated approach covering both the system life cycle (from design/production
through operation/support) and the system hierarchy (from fleet level down to
component level), which also focuses on closing the loop from operation/support
back to design/production
• Scope for improvements based on future experience gained from actual NFF
reduction implementations

If such processes are adhered to, a clearer picture will emerge about an organization’s
culture and attitude toward its maintenance activities. Unless management is fully
behind such endeavors, any attempts to implement reduction procedures may struggle to
demonstrate their applicability. In the case of the NFF phenomenon, time, money, and effort
must be made available, and not just reluctantly conceded.

6.5 Application Example


6.5.1 Introduction
The following is part of a case study presented in the ARINC 672 publication to demonstrate
how to implement some of its ideas and strategies. Here, this example will not cover the
complete process chain, but will rather demonstrate how a shop maintenance organization
will be able to improve the analysis toward reduction, better control, and avoidance of NFF.

6.5.2 Implementation Prerequisites


To establish an effective reduction process, a number of basic prerequisites are essential.
The following is applicable to any medium to large size component workshop. Its
implementation is based on the acquisition of basic component and operational data:

• The basic component data: This can include component part numbers, serial numbers,
supply classification/code for manufacturers, etc.
• The operational time control data related to individual components and customers: The
proper data must be collected using prescribed procedures. This includes the total time
(TT), total cycles (TC), time since overhaul (TSO), cycles since overhaul (CSO), time since
last shop visit (TSLV), cycles since last shop visit (CSLV), time since installation (TSI),
cycles since installation (CSI), etc. The idea should be to ensure that the data remain
consistent between periods, to reflect similar activities in a uniform way.


• A shop record database with structured documentation of all workshop events that
took place. These will include the reason for removal, shop findings, shop actions, and all
available counters of operational time control data.
• Status reports of all components on repair and their condition data.
• Data attributed to availability (e.g., MTBR and MTBF), which can provide information on
the operating reliability of the component.
• Meaningful combinations of component parameters that can be adapted to different
technologies to improve selection.
Contrary to the generally practiced generation of component reliability data (MTBUR, MTBF,
etc.), this implemented ARINC 672 solution aims to provide the shop with component status
and condition data in a proactive way. The process provides shop technicians with valuable
inputs on the most probable technical state of the unit with regard to NFF events. The data
are generated by analysis of all data gathered during previous shop visits (i.e., by using the
existing shop record data as a knowledge base).
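A minimal sketch of such a shop record, assuming illustrative field names, might carry the basic component data together with the operational time-control counters listed above:

```python
# Minimal sketch (illustrative field names) of a shop-record entry carrying the
# basic component data and the operational time-control counters described above.
from dataclasses import dataclass, field
from datetime import date
from typing import List

@dataclass
class ShopVisit:
    visit_date: date
    reason_for_removal: str
    shop_findings: str          # e.g. "fault confirmed", "no fault found"
    shop_actions: str
    tt: float                   # total time (flight hours)
    tc: int                     # total cycles
    tso: float                  # time since overhaul
    cso: int                    # cycles since overhaul
    tslv: float                 # time since last shop visit
    cslv: int                   # cycles since last shop visit
    tsi: float                  # time since installation
    csi: int                    # cycles since installation

@dataclass
class ComponentRecord:
    part_number: str
    serial_number: str
    manufacturer_code: str
    visits: List[ShopVisit] = field(default_factory=list)

    def nff_visits(self):
        """Return the visits whose findings were recorded as no fault found."""
        return [v for v in self.visits if "no fault found" in v.shop_findings.lower()]
```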

6.5.3 Application
As illustrated in Figure 6.5, the first step is to create a model based on relevant component
removal data.

Figure 6.5 The basic process of component removal.

The idea is to build and maintain a database of expected “gold” values that will be
used for making future decisions on classifying suspected components. This can help
define additional maintenance troubleshooting efforts. For these gold values, if a normal
distribution is assumed for the available component data parameters, mean values can
be created for the individual selected parameters (or a combination of parameters for TSI,


CSI, calendar months, etc.). Over the entire population of one part number, the results will
represent the expected values for that part number’s population.

The expected values can either be provided by the manufacturer or derived from the recorded
life and maintenance history of the component.

The next step is to use these expected values as a comparison when dealing with a
removed component. A decision on the component status (green, yellow, or red) can
be made by applying predetermined thresholds for each parameter. This will help to
classify the component status. A green status indicates a good unit (i.e., there is no NFF). A
yellow status indicates some concerns on the unit (e.g., this may be a second or third shop
visit for this component within a specified time period). A red case clearly indicates a poor
history of the component.

This model will not deliver any deterministic results, but will give a strong indication
toward problematic components.
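As a rough illustration of how the gold values and the traffic-light classification could be computed (the normality assumption and parameters follow the description above, but the numbers, thresholds, and function names are invented):

```python
# Illustrative sketch: "gold" values as per-parameter means over one part number's
# population (normal distribution assumed), then a traffic-light status for a
# removed unit based on how far below the expected values its counters fall.
# Threshold values are examples only.
import statistics

def gold_values(history):
    """history: list of dicts sharing the same parameter keys (e.g. TSI, CSI)."""
    keys = history[0].keys()
    return {k: (statistics.mean(h[k] for h in history),
                statistics.stdev(h[k] for h in history)) for k in keys}

def classify(unit, gold, yellow_sigma=1.0, red_sigma=2.0):
    """Green: near expected life; yellow/red: removed increasingly early."""
    worst = 0.0
    for k, (mean, std) in gold.items():
        if std == 0:
            continue
        shortfall = (mean - unit[k]) / std   # positive when the unit came in early
        worst = max(worst, shortfall)
    if worst >= red_sigma:
        return "red"
    return "yellow" if worst >= yellow_sigma else "green"

history = [{"TSI": 4200, "CSI": 2100}, {"TSI": 3900, "CSI": 2000}, {"TSI": 4500, "CSI": 2300}]
print(classify({"TSI": 1200, "CSI": 600}, gold_values(history)))   # -> "red"
```

A unit removed far earlier than the expected values for its part-number population is flagged yellow or red and becomes a candidate for the extended investigation described next.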

Once the problematic component(s) are shortlisted, a criterion must be defined on what
classifies them as “Chronic” components. The term Chronic was introduced in the ARINC
672 report for units that have suffered at least three removals (true/false NFFs or not)
within a defined time period. This was used, rather than calling them “Rogue units,” which
are defined in the ATA spec 109 as: “A specified LRU that has three premature removals
accumulating less than 500 hours of aircraft operation, or a total of five removals within a
12-month period.”

Depending on which time data can be shared between the involved parties (i.e., the
airline, manufacturer, and OEM), the following parameters can be considered for a chronic
acceptance criterion:

• Calendar time
• Time on Wing (TSI, TSLV)

The chosen time period in this example is 18 calendar months. “Calendar time” has been
chosen, as “Time on Wing” is not always systematically provided to the OEM along with
the removed LRU. Once the component is classified as “chronic,” it enters an extensive
investigation process that goes beyond the normal bench testing procedures.
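A simple sketch of these acceptance criteria, using illustrative data structures and one possible reading of the ATA Spec 109 wording quoted above, could look like this:

```python
# Sketch of the acceptance criteria discussed above: "chronic" as three or more
# removals within 18 calendar months, and one reading of the ATA Spec 109 "rogue"
# definition (three premature removals under 500 aircraft hours, or five removals
# within 12 months). Data structures are illustrative.
from datetime import date, timedelta

def is_chronic(removal_dates, window_months=18, min_removals=3):
    window = timedelta(days=window_months * 30)          # approximate calendar window
    dates = sorted(removal_dates)
    for i in range(len(dates) - min_removals + 1):
        if dates[i + min_removals - 1] - dates[i] <= window:
            return True
    return False

def is_rogue(removals):
    """removals: list of (removal_date, aircraft_hours_since_installation)."""
    dates = sorted(d for d, _ in removals)
    premature = [d for d, hrs in removals if hrs < 500]
    three_premature = len(premature) >= 3
    five_in_12_months = is_chronic(dates, window_months=12, min_removals=5)
    return three_premature or five_in_12_months

removals = [(date(2014, 1, 5), 310), (date(2014, 6, 20), 220), (date(2015, 2, 14), 480)]
print(is_chronic([d for d, _ in removals]), is_rogue(removals))
```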

For the different types of LRUs, a range of troubleshooting tools and techniques can be used:

• BITE logging and analysis


• Automatic test equipment tests
• Visual inspections of the LRU
• Vibration test
• Temperature test
• Humidity test
• EMC test
• Specific SRU test


The sequence in which this range of tests has to be performed to achieve the most efficient
investigation will need to be established through experience and the circumstances in which
a component was used. Such investigation can then be performed sequentially, per the
predetermined scheme. This sequence may be stopped any time upon joint decision of
the repair technician, the engineering specialist, and the maintenance manager if some
significant findings arise.

Every step of this investigation process must be recorded in what is called an
“investigation form.” The findings will help technicians to assess what solutions must be
proposed to the operator:

• If a failure has been found, a repair will be performed and the unit will be returned
to service.
• If no failure has been found, the operator is contacted to determine if the operator wants
this unit back in service, or if it has to be removed from service.

A replacement solution (such as standard exchange, loan, etc.) can also be proposed.
However, this will depend on how the warranty/contracts have been put in place between
the two businesses. Either way, the chronic unit investigation form will serve as the
dedicated communication means between the OEM and the operator. It will provide the
stage-by-stage details that will describe how the component has failed over its life. Some of
its contents can include the following:

• Component history (previous maintenance operations performed)


• Investigations and tests performed in the frame of the routine unit process and the
chronic unit process, and the associated results
• Technical conclusions of the system specialist
• The relation between the repair findings and the original airline complaint

Furthermore, it is useful to assign a dedicated manager with the task of monitoring component
backlogs and maintaining communication. This will make sure that all components falling into
the scope of the definition of a chronic unit, as described here, are taken into account. Once
senior management is in possession of the investigation process reports and manager inputs
in a quantitative form, a decision can be made that weighs the available NFF finances against
the efficiency of the troubleshooting effort.

6.6 Conclusion
No matter how irregular activities may appear to be, concerted policies can be instituted to
get them under control. Requirements must be analyzed, simplified, and reduced to bare
essentials; only then can unnecessary complicated procedures be avoided.

This chapter discussed the through-life service context, and its interactions, inside which
a NFF policy has to operate. The requirements for putting a NFF policy in place and how
NFF mitigation processes can help control such events during troubleshooting were then
addressed. There is a need for strategic proactive thinking, as well as collective, industry-
wide cooperation, to deal with the NFF issue. Having a specific policy can help dictate


programs for managing and controlling some of the attitudes and practices toward NFF
events. All the information collected must be relevant to the organization’s policy and its
objectives. The policy’s main functions are to monitor the performance of maintenance
activities and highlight any need for corrective actions, to monitor the impact of those
actions, and to provide data to justify adjusting maintenance intervals or procedures. It
is not the intention of the authors to increase the workload of managers, who will have to
incorporate these ideas within their existing plans. This discussion rather aims to ensure
that improvements within the troubleshooting process can be performed more easily and
that a more efficient resolution of NFF evolves. There is a need for solutions that can easily
be standardized and adopted by industry as a whole. These concepts need joint efforts
from manufacturers, maintenance organizations, and regulators to mutually control the
root cause precursors. Having a NFF policy in place also tends to increase awareness of the
subject area, which will reflect as improvements in maintenance manuals, management
attitudes, and financial justifications.

6.7 References
6-1. James, I., D. Lumbard, I. Willis, and J. Goble. “Investigating no fault found in the
aerospace industry.” In: Reliability and Maintainability Symposium, 441–446. IEEE,
2003.

6-2. Priel, Victor. “Objectives, benefits and policies.” In: Systematic Maintenance
Organisation, 10–20. Macd. & E., 1975.

6-3. ARINC 672: Guidelines for the reduction of No Fault Found.

6-4. Redding, L. E., and R. Roy. Through-life Engineering Services: Motivation, Theory &
Practice, 55–70. Springer, UK, ISBN 978-3-319-12110-9, 2014.

6-5. Gary Teng, S., S. M. Ho, D. Shumar, and P. C. Liu “Implementing FMEA in a
collaborative supply chain environment.” International Journal of Quality & Reliability
Management 23 no. 2 (2006): 179–196.

6-6. Carson, R. S. “BITE is not the answer (but what is the question?).” In: Digital Avionics
Systems Conference, 1998. Proceedings, 17th DASC. The AIAA/IEEE/SAE (Vol. 1, B41–1).
IEEE.

6-7. Cockram, J., and G. Huby. “No fault found (NFF) occurrences and intermittent
faults: improving availability of aerospace platforms/systems by refining
maintenance practices, systems of work and testing regimes to effectively
identify their root causes,” paper presented at the CEAS European Air and Space
Conference, 26-29 October, Manchester.

6-8. Khan, S., P. Phillips, C. Hockley, and I. Jennions. “No Fault Found events in
maintenance engineering Part 2: Root causes, technical developments and future
research.” Reliability Engineering & System Safety 123, 196–208. 2014.

Chapter 7
A Benchmark Tool for NFF

7.1 Introduction
The purpose of this chapter is to provide an awareness of the generic benefits and
needs associated with managing NFF within an organization and to recommend a
systematic approach to managing NFF. Within the NFF related field, a plethora of
different technologies, processes, and practices are championed as the best solution to
NFF; however, the reality is that the dynamic nature of the maintenance environment,
the complexity of the NFF problem, and organizational capabilities and needs mean
that there is no one single or complete fix-all solution. Organizations must evaluate
themselves and select a tailored option that suits them. Currently, no accepted
method has been established for such self-evaluation in the context of NFF, but this
chapter does present a suggested tool that can be deployed to help in determining an
organization’s current status and ability to manage NFF and implement corrective
solutions [7-1]. Starting with an introduction to the needs and the benefits that are
associated with managing the NFF problem, the chapter demonstrates the challenges
of implementing a NFF management system in terms of technical and commercial
points of view, and then illustrates a methodology to evaluate the NFF management
needs in an organization. This methodology is based upon a tool developed as a
means of benchmarking the effectiveness of an organization’s ability to identify,
quantify, and accept the NFF problem and its capability to implement a management
strategy. The tool draws on theories and practice that have been successful in a range
of continuous improvement ideologies.

7.2 Benefits of NFF Management


There will always be a measurable impact on business output if a reported fault is not
correctly fixed the first time that it is reported. It therefore becomes crucial to ensure
the existence of a robust maintenance policy that recognizes the existence of NFF and


one that is reactive to the negative aspects of NFF. A well-organized policy provides the
following benefits [7-2]:

• An increase in overall system availability, and a decrease in downtime


• A reduction of “wasted” resources that would have been used to investigate the incident,
such as man-hours, equipment, spares, etc.
• A growing end-user confidence
• Extended useful life of components
• Elimination of repetitive tasks for the same fault

Perhaps the one major advantage that emanates from the decrease in repetitive tasks is
the reduction of “wasted” resources, which can instead be used on more
productive activity. The value of adopting systems to combat NFF is most likely to
be seen in savings in maintenance costs, as discussed in chapter 3. The reduction in
time spent dealing with NFF will offer a competitive advantage in maintenance decision-
making, which is crucial for any organization. This will help manufacturers retain
customers and attract new business through increased confidence that is generated in their
products; it will also mean that NFF solutions become a key part of formulating future
maintenance strategies.

The airline industry has seen a rapid increase in operators over the past decade, particularly
in low-cost short haul operations. The nature of the budget airline business success is its
ability to operate a large aircraft fleet, coupled with high aircraft availability and short
turnaround times [7-3], while keeping ticket costs low. For such factors to remain and for
airlines to create a business winning advantage, strategic maintenance management
(incorporating NFF reduction) has to become one of the significant factors in their operations
management to keep high availability and short turnaround times. A proactive approach to
NFF and more efficient maintenance can help push the business forward, as illustrated in
Figure 7.1, which could be drawn generically for MRO management but is here used in the
context of NFF.

The purpose of Figure 7.1 is to highlight the progressive nature of the maintenance business
as NFF is tackled with an increasingly proactive approach. Chapter 3 discussed many of the
impacts of not dealing with NFF in terms of increased downtime and cost. The inability or
even lack of desire to combat the NFF issue that contributes to increased downtime will hold
an organization back in terms of their business objectives and development. Among other
factors and considerations, any loss of availability and loss of service as a result, leads to
falling profits and negative reputation. In this case, there will be a struggle to meet the needs
of a market containing competitors committed to reducing NFF through best practice. The
situation can, however, be improved by recognizing the NFF problem and implementing
the necessary mechanisms and strategies to combat the problem. The ability to tackle the
NFF issue will rely on the application of best practice, and any strategy must be sustainable
and robust against any changes in organizational structure, personnel, new equipment, or
process changes. Being able to support a sustainable NFF reduction strategy results in the
ability to identify emerging NFF issues at the earliest possible stage and react before they
begin consuming large amounts of maintenance resources and reduce availability.


[Figure 7.1 elements: an increasingly proactive approach to dealing with NFF progresses through four stages. Stage 1: correct the worst problems (be worse than competitors); Stage 2: adopt best practice (be as good as competitors); Stage 3: link maintenance with operations strategy (be clearly the best in the industry); Stage 4: proactive NFF management that redefines expectations and gives an operations and business winning advantage. The points between the stages correspond to the ability to support, to implement, and to drive the strategy.]

Figure 7.1 Potential effects of proactively dealing with NFF on an aircraft operator’s business.

Efficiency here would mean clearly operating at the forefront of the industry when it
comes to NFF related maintenance. That philosophy includes quick and timely fault
diagnostics and isolation, reduced corrective maintenance turnaround times, reduced
unscheduled removals, and the ability to perform more effective in situ maintenance. These
abilities are perhaps regarded as somewhat novel, but to truly achieve a proactive NFF
mitigation strategy with the ability to ensure that the root cause of NFF is eradicated,
a redefinition of expectations is required. As we have already seen, NFF is a symptom of
bad design and/or an inefficient maintenance process, which makes these the root cause of
any NFF. The information gained through following a NFF reduction strategy must also be
linked to the design of new equipment, thus ensuring that NFF is increasingly unlikely to
occur as older equipment is replaced. If this is achieved, then the reduction of NFF begins to
act as a key driver for improving the delivery of reliability, availability, and maintainability
through improved design. The ability to do this would redefine expectations and certainly
provide a business winning advantage.

In Figure 7.1, four stages of continuous progression move toward successfully implementing,
supporting, and driving a strategy toward reducing the NFF issue. An organization that
only has the capability or will to react to the issues when they are already present, and
embedded within the maintenance arena, risks being held back when compared to an
organization that can act proactively as in stages 3 and 4.


7.3 Challenges of Investigating NFF


Organizations striving to tackle inefficient maintenance will face a number of challenges,
both technical and nontechnical, which may include:

• Identifying target improvements


• Training maintenance personnel
• Recording and understanding relevant data
• Establishing requirements of test equipment
• Designing out root causes of inefficiency
• Reducing the time, effort, and cost of troubleshooting
• Improving the decision making process

This is clearly not an exhaustive list, but it does capture some of the key concerns. Managing
an effective maintenance policy under such situations can be problematic, and attempting
the judicious resolution of challenges may require more:

• Time
• Finance and investment
• Skills and experience
• Test capability
• Condition data—environmental, trending, records, etc.
• Spare units
• Patience

One important factor is the cost implication of the NFF incident investigation, as each of
the above mentioned factors will come at a price. It could be quantified by measuring the
proportion of the repair budget that is spent, or rather wasted, on the maintenance activities
involved in locating the root cause of the failure. However, due to the complexity of the
issue, several industries do not know exactly how much NFF is costing them. In addition,
establishing a standard way of measuring the costs is difficult because of the complexity of
external influences and contributors such as costs in the supply chain, man-hour costs, and
the cost of system down-time, as well as indirect effects such as customer perception and
the maintenance organization’s capacity and efficiency. Because no complete, robust, and
reliable cost model is currently available, nor even an established universal NFF metric
for assessment of the impact of NFF, many business departments are afraid to admit that
shortcomings exist. Therefore, they do not provide any budget for NFF [7-4].

Identifying the faults that are NFF will certainly require a reassessment of the current test
coverage, the development of new maintenance troubleshooting tools and techniques, and
changes to the management and information capture processes. Many organizations defend
their established practices, making do with whatever resources they currently have available,
and thus create barriers for any change in the future. The attitude, “if it isn’t broke, don’t
fix it” is often the adversary of innovative ideas and concepts, and this blinkered approach
within the organization will hamper attempts to tackle the situation. This culture of belief—


“the way we do things” as the only correct and successful way—embodies the beliefs and
attitudes that have been indoctrinated within many organizations and are reflected in their
structure and policies. Combating these challenges will require a mix of changes from both
a technology and commercial perspective.

7.3.1 Technical Challenges


Making the decision to embrace a new method or process to combat NFF has a strong
potential to disrupt an organization, in that large-scale innovation will cause disruptive
changes to long-standing and established working practices. Technologies, such as
improved sensors, test equipment, or data analysis tools are aimed at improving the
performance within the diagnostics area and seek to improve the operational performance
of the system. However, their introduction is best achieved through “evolutionary” change,
while demonstrating diagnostic reliability, validating cost-benefit models, and reducing
operational risks. The integration of new technologies inevitably faces difficulties and a
number of challenges for the community of engineers and technical specialists as they seek
to eradicate the NFF problem. Some examples of these difficulties are:

• The technology and frameworks are available but often not adopted or are under-
utilized due to excessive bureaucracy and over complicated processes.
• Performance characteristics for any adopted system of tackling NFF are usually
untested and not validated, leading to a lack of confidence.
• A wealth of data is often available from users, but access to the data is complicated or
impossible due to incompatible systems or sensitivity/confidentiality issues.
• Data are available, but much of it is not converted into meaningful information.

7.3.2 Commercial Challenges


Many and varied commercial challenges exist with any business that needs its assets
to be operationally available for the maximum amount of time. If they are undergoing
maintenance, they are not earning revenue and are consuming resources such as spares and
man-hours. If the assets are to generate revenue, they must be reliable and be available. The
pressure this generates provides challenges for managers and staff to be as efficient and
effective as possible in delivering the best possible availability and operational performance.
Customers, who are disadvantaged or disappointed, can provide a great deal of negative
publicity, which will then adversely affect the reputation of the business. An example would
be airlines having to delay or cancel a flight due to a fault that cannot be found during the
turnaround period before the next flight. The pressure to change something will often be so
great that several possible suspect units might be replaced to ensure that the fault has been
removed. Only one of the units may be faulty, but several units will now need to be bench
tested or repaired with the obvious associated costs. Consequently, senior management will,
in some cases, generate a culture that puts asset availability as the priority to protect revenue
and reputation. In such cases, additional maintenance and costs further down the supply
chain are deemed as secondary considerations and are often not obvious, yet they will be a
commercial challenge and will be paid by the parent organization. This presupposes that the
organization has enough data to understand where those costs are actually falling.


A further commercial challenge, therefore, is to provide a system that tracks the relevant
fault data such that costs of fault rectification can be identified easily and items can be
tracked with the appropriate and relevant history associated with each one. Such databases
can be costly, however, and must be justified with a cost-benefit analysis. Linkage of the
maintenance system with Enterprise and Materials Resource Planning (ERP/MRP) could
optimize the information required to track parts and the repairs carried out throughout
the support chain. Indeed, any system to monitor and reduce NFF will have a cost but also
may generate a commercial benefit. The challenge is first to identify the costs that are being
consumed wherever they lie in the organization and its support chain, but then to show the
benefit of the solutions to mitigate the NFF problem. As with many commercial challenges,
this may well demand a spend-to-save philosophy and an open minded attitude in the
face of commercial and fiscal pressure. What is required, therefore, is an assessment of the
organization’s performance. A benchmarking tool is a great first step.
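Returning to the fault-data tracking idea above, a minimal sketch of the kind of record such a system might hold (hypothetical fields, not a specific ERP/MRP schema) shows how rectification costs can be rolled up per tracked item and fed into a cost-benefit analysis:

```python
# Minimal sketch (hypothetical fields) of a fault-tracking record that keeps the
# repair history and costs with each tracked item, so the cost of fault
# rectification can be rolled up per part and fed into a cost-benefit analysis.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RepairEvent:
    reported_fault: str
    outcome: str          # "repaired", "no fault found", ...
    man_hours: float
    labour_rate: float
    parts_cost: float

    @property
    def total_cost(self):
        return self.man_hours * self.labour_rate + self.parts_cost

@dataclass
class TrackedItem:
    part_number: str
    serial_number: str
    history: List[RepairEvent] = field(default_factory=list)

    def nff_cost(self):
        """Sum the cost of every repair event that ended as no fault found."""
        return sum(e.total_cost for e in self.history if e.outcome == "no fault found")

item = TrackedItem("PN-1234", "SN-001", [
    RepairEvent("Intermittent display blanking", "no fault found", 6.0, 85.0, 0.0),
    RepairEvent("Intermittent display blanking", "repaired", 9.5, 85.0, 430.0),
])
print(round(item.nff_cost(), 2))
```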

7.4 A Proposed Tool for Managing NFF


7.4.1 The Benchmark Tool
Industrial working groups comprised of representatives from airlines, manufacturers,
maintenance providers, and suppliers of diagnostic solutions dedicated to understanding
NFF and overcoming its impact in both the UK and U.S. have expressed the need to have a
robust method to benchmark NFF improvements against. The purpose of such a tool would
not be to directly reduce NFF, but rather to help the organization improve its appreciation of
the problem. The tool would identify areas that are most likely to need attention and those
most likely to respond quickly. It would also allow an organization to identify the most
costly and critical areas and to prepare a strategy to deal with those issues [7-5]. Such a tool
should also establish the maturity level, on a sliding scale, at which the organization accepts
NFF as an impact on cost effectiveness as well as equipment safety and airworthiness.
Maturity, in the context of the current tool that is presented in this chapter, is defined as the
level at which the organization understands NFF and commits resource to reduce its effect.

The tool illustrated here is not designed for external parties to benchmark an organization,
but rather it would allow organizations to provide an honest appraisal of themselves. The
organization may then seek external consultation if they wish to make improvements
to increase maturity. Benchmarking can provide an internal appraisal of the potential
possibilities and benefits in reducing NFF and will provide a comparison across industries
that can highlight key areas of underperformance. Benchmarking can be designed to allow
improvement targets to be set by identifying the mechanisms or tools required to achieve
the set targets. The benefits of benchmarking can be summarized as:

• Provide an internal appraisal of NFF reduction capability


• Provide cross industry comparisons
• Highlight key areas of underperformance
• Allow improvement targets to be set


7.4.2 A NFF Maturity Model


The Maturity Model defines the maturity of the organization’s approach and capability to
deal with, and proactively reduce, NFF. It is designed as a three-part tool consisting of:

• Maturity Model Scoring Matrix (the major part of the model)


• Capability Spider Plot
• Mitigation Planning Sheet

The maturity model scoring matrix is the core part of the tool, used for visualization and
interpretation of results. It is populated from a series of questionnaires, allowing the current
maturity state of the organization to be identified. This is supported by a capability spider
plot, which allows clear visualization of the current maturity level against either past
assessments or future targets. It also allows benchmark comparison against any other
comparable department of the organization (for example, between individual aircraft
fleets). The final part of the toolset is a mitigation planning sheet that helps to identify the
appropriate tools and techniques that are required for continuous improvement.
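As an illustrative sketch only, a capability spider plot could be drawn with a standard plotting library; the axis labels follow the organizational elements and solution categories described later in this chapter, and the scores are invented:

```python
# Illustrative capability spider (radar) plot: current maturity versus a target,
# plotted against the five assessment categories used by the scoring matrix.
# Scores are made-up examples.
import numpy as np
import matplotlib.pyplot as plt

categories = ["Executive", "Operational", "Tactical", "Architect solutions", "Evaluate solutions"]
current = [2, 3, 3, 2, 1]
target = [4, 4, 4, 3, 3]

angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]                      # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for scores, label in [(current, "Current"), (target, "Target")]:
    values = scores + scores[:1]
    ax.plot(angles, values, label=label)
    ax.fill(angles, values, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 5)                         # maturity levels 1-5
ax.legend(loc="upper right")
plt.show()
```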

7.4.2.1 Maturity Model Scoring Matrix


So that any assessment can be made on the organization’s capability to implement a NFF
management strategy, its current capabilities have to be assessed and quantified. Without
quantification, any improvement in the state of the organization’s ability to drive down NFF
will be limited. This is required to ensure that improvements have been achieved. For this
purpose, the use of a scoring matrix is proposed that assesses the current state against 5
maturity levels (1 through 5), with levels 2 and 4 acting as intermediate levels. The scoring
matrix should contain a list of organizational elements that require evaluation in terms
of their maturity level. In the context of the NFF reduction planning tool, these are set as
executive, operational, and tactical capabilities, each of which were discussed in chapter 4.
Descriptions of these three organizational levels and the meaning of the maturity scoring
are provided in due course, but first it is important to understand how it is possible to obtain
a score that requires information and analysis of the organization’s practices. In order to do
this the use of a set of questionnaires, designed for the specific business unit, should be used.

Such questionnaires are used in many continuous improvement programs and are the ideal
mechanism for gaining insight into an organization’s existing routines and practices. Three
sets of questionnaires are required; they are aimed at maturity levels 1, 3, and 5 and are
directed at the appropriate personnel within the necessary business units, maintenance
lines, and test departments. It should be noted that not all of the organization’s maintenance
elements require interrogation. Those elements of the organization that do require
assessment should be identified at the start of the decision to deploy the tool—deployment
mechanism to support this is discussed in section 7.4.3. To capture responses, the maturity
questionnaires must be designed with a series of closed questions requiring Yes/No answers,


which will be used to determine the maturity level in the maturity matrix against the
various defined categories that will be discussed later. The appropriate scoring would be
calculated as:

• Fully Conforming–All answers “Yes”


• Partially Conforming–Over 60% (but not 100%) of answers “Yes”
• Not Conforming–Less than 60% answers “Yes”

Partially conforming represents the state between assessment levels. For example, if the
organization is fully conforming to level 1 but, upon further evaluation, they are only
partially conforming to level 3—then it can be established that the current maturity would
be level 2. Likewise, fully conforming to level 3 and partially conforming to level 5 would
indicate a maturity level 4.
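A small sketch of this scoring logic, assuming three questionnaires of yes/no answers (one each for levels 1, 3, and 5), could reduce the responses to a single maturity level:

```python
# Sketch of the conformance scoring just described: three questionnaires (levels
# 1, 3, and 5) of yes/no answers are reduced to a single maturity level 1-5.
def conformance(answers):
    """answers: list of booleans (True = 'Yes')."""
    if not answers:
        return "not"
    ratio = sum(answers) / len(answers)
    if ratio == 1.0:
        return "fully"
    return "partially" if ratio > 0.6 else "not"

def maturity_level(level1, level3, level5):
    """Apply the rules in the text: full conformance at one level plus partial
    conformance at the next level indicates the intermediate level (2 or 4)."""
    c1, c3, c5 = conformance(level1), conformance(level3), conformance(level5)
    if c1 != "fully":
        return 1
    if c3 == "fully":
        return 5 if c5 == "fully" else (4 if c5 == "partially" else 3)
    return 2 if c3 == "partially" else 1

print(maturity_level([True] * 10, [True] * 7 + [False] * 3, [False] * 10))  # -> 2
```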

The questionnaire responses should be supported by implicit evidence, so within the
maturity questionnaires, suggestions on the type of evidence must be provided. The scoring
matrix will be evaluated against 5 maturity levels (levels 1, 3, and 5 are given in Table 7.1,
Table 7.2, and Table 7.3, respectively) that are defined as:

• Level 1 (Initial)—At this maturity level, the organization has an awareness that NFF
exists and that it has a measurable impact upon maintenance and repair, but there is
very little drive from the executive level to measure these costs and to develop or to
implement the necessary processes, standards, or technologies to reduce this cost. The
organization at this maturity level delivers the necessary maintenance and repair
service for the customer, but with an acceptance of NFF, which results in the service not
being efficient and cost competitive.
• Level 2—Interim between maturity levels 1 and 3, but at this level the organization can
demonstrate increased proactive responses, compared to level 1, in areas such as data
collection and analysis. The impact of the problem can be gauged along with root causes,
even though these may not be accurately substantiated.
• Level 3 (Managing)—The organization has standardized processes in place for the
delivery of reduced NFF rates. The organization has a strong understanding and
drive toward the need to gain effective control over NFF occurrences to deliver a cost-
effective and optimized service. The NFF problem is managed and controlled against
identified performance measures, with the need for continuous improvement in NFF
reduction being incorporated into the business philosophy. While the processes and
culture are in place to reduce NFF, such reductions are not always being delivered and
followed up successfully.
• Level 4—Interim maturity between levels 3 and 5. At this level, the organization will
conduct continuous quality control activity to ensure that appropriate data are being
collected with usable NFF statistics for performance evaluation. Improvements are being
monitored and delivered within a reasonably robust continuous improvement culture.
• Level 5 (Optimizing)—At this level, the organization is fully equipped to deal with NFF.
The organization has well-managed processes within an effective Quality Maintenance
System (QMS) and pursues successful and measurable continuous improvement (CI).
CI works across the organization to drive efficiency, reducing NFF duplication, and it is
willing and able to invest in NFF solutions to optimize the quality of the maintenance
and repair service.

It might be noted that levels 1, 3, and 5 described here equate to the 3 points between the 4
stages in Figure 7.1 that were described earlier.

Managing NFF needs to encompass the entire organization from top to bottom. It is
appropriate, therefore, to include a distinction between the organizational elements of
Executive, Operational, and Tactical as parts of an integrated solution. In the context of this
NFF toolset, the elements of the organization are defined as:

• Executive—The organization has commitment from the top executive level of the
business to drive maintenance efficiency in the management, control, and reduction
of NFF.
• Operational—The organization provides effective day-to-day operational management
and reporting of the NFF phenomena, which encompasses a system-wide view of the
problem including people, resources, and business impacts.
• Tactical—The tactical level provides actual solutions at the workplace for processes,
procedures, and work practices that will reduce the NFF count.

However, even if the necessary systems are in place within one of the organizational levels,
it is still necessary to assess the ability of these to identify, deliver, and evaluate solutions to
the NFF problem. Therefore, the maturity model also includes two categories to represent
this as follows:

• Architect Solutions—The organization has appropriate management mechanisms in place
to define and control decision making, so that it can deliver integrated solutions that
address the full value of investing in NFF mitigation.
• Evaluate Solutions—The organization has a set of performance measures and systems in
place to ensure that the solutions, including individual elements, are verified against the
requirements, including a cost-benefit analysis.

Once an assessment of current capabilities has been carried out, recommendations on how
to make improvements are required. This requires the selection of the appropriate Tools
and Techniques (T&T) that can be used to make the necessary improvements. With the
addition of a T&T assessment, the tool can suggest a selection of T&T that can be employed
to aid the organization in achieving its own targets in dealing with the NFF problem. The
T&T are broken down into two sections: one aimed at diagnosing the organizational
problems associated with high levels of NFF, and one that will help improve the current
NFF levels. These will help the organization reach higher levels of maturity in dealing
with NFF.

Because each organization will be different, only a generic list of T&T possibilities is
presented here in Table 7.1, Table 7.2, and Table 7.3, where the overall maturity model, scoring
matrix, and their relation to specific tools and techniques are provided.

Table 7.1 Tools and Techniques for Level 1 Maturity

Level 1–Initial

At this level of maturity, the organization has awareness that NFF exists and that it has an
impact upon maintenance operations. While it recognizes that NFF phenomena will have a
direct impact upon cost, there is little drive from the top level of management to measure
these costs and implement the necessary processes, procedures, and technologies to mitigate
the problem. The organization at this level delivers the necessary maintenance and repair
service for the customer, but with the acceptance that NFF is a factor, meaning that the
service may not necessarily be efficient and cost-effective. The organization needs to be more
proactive in data gathering to gauge the level of the problem, identifying the root causes and
implementing formalized policies driven from the top management level.

Category definitions:

• Executive – The problem of NFF is recognized and its impact acknowledged, but
management lacks a clear drive to reduce the impact of NFF. Appropriate standards and
policies are identifiable, but their use is limited. The organization has a communication
plan in place to ensure the problem is understood, but it lacks coherency. Risk management
of the impact of NFF is only adopted at a local or individual level.
• Operational – There is a reactive response to NFF. The problem of NFF is dealt with by a
culture of acceptance and inevitability.
• Tactical – NFF data are recorded using standard maintenance recording, but not
necessarily followed up. Statistics on NFF rates/occurrences can be extracted, but only
through significant manual handling of data. Test capability is limited to standardized
functional testing only. Troubleshooting guides and technical manuals are in place but
require updating or improvement for ease of use and effectiveness against NFF
occurrences. Diagnostic failure is considered to be high.
• Define and deliver solutions – The problem of NFF is ill-defined and not often understood
by all the relevant staff, and there is very little in the way of proactive solution mapping.
The NFF problem is dealt with as a collection of incohesive elements, making deployable
solutions difficult to implement. Solutions are generally missing, or they overlap,
duplicating capability.
• Evaluate solutions – Improvements are implemented, but the evaluation against realistic
scenarios is limited. Requirements capture for improvements exists but does not form part
of a final evaluation.

Tools and techniques to assist in achievement:

• Process Flow Charts – these help to describe the process, visualize the link between
activities, provide a clear picture of what is happening, and help in identifying lack of
standard practice.
• Cause and Effect Diagrams – these are used to identify all the causes that contribute to a
particular effect and are useful in helping with brainstorming and locating areas for
improvement.
• Unit Removal Sheets – used to collect data about an activity in a way that is easy to use
and analyze.
• Pareto Charts – used to assess the relative importance of different identifiable causes of
the problem and identify the cause of the problem that occurs most frequently.
• Histograms – allow the collated data to be arranged into groups and patterns to be
identified.
• Scatter Diagrams – useful in seeing if there is a direct relationship or correlation between
two different elements.
• Run Charts – used to see how something varies with time.

Table 7.2 Tools and Techniques for Level 3 Maturity

Level 3–Managing

The organization has standard processes in place for the delivery of reduced NFF rates.
The organization understands the need to gain effective control over NFF to deliver a
cost-effective and optimized service. The problem is managed and controlled against
identified performance measures, with the need for continuous improvement in NFF
reduction being recognized and incorporated into the business philosophy. The organization
at this level will conduct quality control activity to ensure that appropriate data are being
collected to generate NFF statistics for performance evaluation. However, reductions are not
always evaluated and delivered successfully.

Category definitions:

• Executive – Senior Management ensures that requirements for NFF monitoring and
reduction are specified. Policies and standards are in place and well maintained. Roles,
responsibilities, impacts, and accountabilities are clearly defined and understood. The
organization’s communication plan encompasses all relevant stakeholders, allowing them
to understand how their actions impact the management of NFF. Risk management of
NFF cost is used to help prioritize improvement activities.
• Operational – The need to control NFF is widely understood at all operational
organizational levels. Appropriate data are collected and analyzed to understand the
behavior of the NFF process and to evaluate where continuous improvements can be
made. A high level of understanding of NFF is built into the organization’s resource
management. A proactive approach to management tries to identify and control avoidable
errors that cause NFF in the process.
• Tactical – Appropriate metrics are defined against which NFF reduction capability can be
measured, and appropriate data recording systems are in place. Recording, analyzing, and
retaining of NFF specific data is routine, but automatic identification of repetitive defects,
repeat offenders, or rogue units is not utilized. The diagnostic capability is strong but
relies primarily on human expert interpretation of test results.
• Define and deliver solutions – The organization identifies and delivers solutions in the
form of a collection of isolated and individual elements. A robust solution plan is in place.
The requirement for an overall solution linking individual elements together to form a
cohesive whole is recognized. The problem being addressed is well defined, and the path
to the solution is defined and understood by relevant staff. Individual solution elements
are mapped out to avoid conflicts and overlaps. A communication and review plan is in
place between the relevant teams responsible for individual elements.
• Evaluate solutions – There exists the capability and know-how for evaluating process and
operational improvements. Improvement evaluation is focused on localized improvements
and does not consider the impact within the wider business context. Evaluations are
primarily concerned with the internal stakeholder organization and only consider
external stakeholders in an artificial context. Lessons learned are captured but remain
localized.

Tools and techniques to assist in achievement (in all cases, start with a Maturity Assessment
to diagnose areas/issues for attention):

• Control Charts – used to identify acceptable variations in a process.
• Stakeholder analysis – allows assessment of who is affected by the problem and how.
• Benefits assessment – necessary to determine the appropriate level of spend in achieving
pre-defined goals.
• Development/adoption of policy – articulate management policy regarding NFF with the
clear intent to improve on existing policies and policy conformity.
• Key performance indicators – organization KPIs set the targets for improvements, both in
terms of area of improvement and level.
• Skills matching – ensuring that the correct skills are available and deployed in the correct
way.
• Training – training needs are identified and provided, they are reviewed periodically, and
staff training is kept current.

Table 7.3 Tools and Techniques for Level 5 Maturity

Level 5–Optimizing

The organization has well-managed processes within an effective QMS and pursues successful
and continuous improvement in removing NFF incidents. It works across the organization to
drive efficiency, reducing NFF duplication, and is willing and able to invest in NFF solutions to
optimize the quality of the maintenance and repair process.

Category definitions:

• Executive – The standards and policies that are in place are coherent and consistent with
external stakeholders’ standards and policies. Appropriate compliance mechanisms are in
place to ensure coherence to policies, standards, roles, and responsibilities. Clearly defined
policies for the management of NFF are understood, implemented, and systematically
adhered to. The organization has communication plans in place to enable coherent sharing
of information, data, and best practice across the organization, maintenance lines,
customers, and all stakeholders. Risks of NFF impact and costs are managed across the
organization and value chain.
• Operational – The organization continuously improves processes to make diagnostic
success more efficient. There is a strong understanding of how NFF diagnosis within the
organization impacts external stakeholders and customers, and the organization works to
ensure this impact is minimized. The organization acts to identify problems and errors and
engages stakeholders to ensure the optimal solutions are applied. It continually analyzes
the situation to ensure that NFF is under control and proactively adapts to changing
situations and needs.
• Tactical – Data are modeled to provide NFF trending across the organization. Systems are
in place to automatically identify repetitive defects, repeat offenders, rogue units, etc. The
data provide quantifiable measures of organizational impacts. Data and information are
easily retrievable from multiple maintenance lines and maintenance sites. Mechanisms
are in place to act as a “safety net” to capture suspect items and to prioritize them based
on the cost of requiring further in-depth and in-house testing; this may include specialized
capability. The maintenance process has the test capability to adopt integrity testing
alongside standard ATE functional testing for suspected NFF units. The diagnostic
capability is strongly supported by up-to-date technical manuals, troubleshooting guides,
and more automatic diagnostic capabilities to remove human error.
• Define and deliver solutions – All teams have a very clear vision of the problem being
addressed and can articulate what a successful future state will look like. The end point
vision actively guides all decisions and the routes to take to achieve them. All solution
elements join together to form an organizational level solution, with the necessary
interfaces being well defined and coherent. Solution interface specifications exist and are
being worked by interfacing activity leads.
• Evaluate solutions – There is an active plan of action to evaluate the effectiveness of
improvements, both as individual elements and as a whole system. Improvement
evaluation is realistic and tests the practicality for all stakeholder organizations against
the original theoretical requirements. Lessons learned are captured and actively shared
across the organization.

Tools and techniques to assist in achievement (in all cases, start with a Maturity Assessment
to diagnose areas/issues for attention):

• Standardized information – a practice of ensuring that all data and information are
collected and recorded in a standardized way to ensure continuity and transferability.
• Automated data collection and analyses – data that are required are automatically
collected and collated; data that are not required for the particular problem are not
allowed to cloud issues.
• Predictive analytics – the adoption of predictive analytics allows large amounts of data to
be easily mined for information, automatically identifying problem cases and aiding in
mitigating against any impacts.
• Enhanced diagnostics – advanced tools are used for diagnostics, which can include bespoke
test equipment or more intelligent onboard systems.
• Information feedback – information pertaining to a problem, such as its occurrences,
impacts, causes, and solutions, is freely accessible to other parts of the organization,
regardless of location. This information also has a direct feedback loop into equipment
design.

It should be noted that these tables do not contain an exhaustive list, nor does their
inclusion mean that they are the correct tools to adopt for any particular situation;
rather, they provide examples that can be tailored to the user’s needs. The T&T in Table 7.1
are representative of mechanisms aimed at identifying the
problem and its causes. Adopting such approaches will allow the organization to make
the transition to a higher level of maturity. Likewise, in Table 7.2, the T&T represented are
indicative of an effective continuous Quality Management System (QMS), and those in
Table 7.3 are indicative of a continuous improvement approach with significant investment
in new technology and improvements.

7.4.2.2 Mitigation Plan


Once an assessment has been carried out, the next stage is to identify and agree on a plan of
action that will result in an agreed organizational level of improvement. This requires the
use of a mitigation planning sheet, illustrated in Table 7.4. The mitigation planning sheet
allows the current status, which has been assessed across the five evaluation categories, to be
set against a target level that the organization will strive to reach (this is the information that
is captured graphically in the spider plot). Once a target level has been set, the mechanism
for achieving this also needs to be identified and agreed upon. This mechanism is split into
three columns: what will be done, who will be responsible for driving these actions, and
when it will be achieved. The date when it will be achieved can also be used to set the dates
when the next review must take place.

Table 7.4 Example of a Mitigation Planning Sheet

                              Assessment                        Agreed Actions (Priority Items)
Capability Drivers            Current Status    Target Level    What      Who      When
Executive                     L1                L3
Operational
Tactical
Define & deliver solutions
Evaluate solutions
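
As a simple illustration of how a row of such a planning sheet might be captured
electronically, the sketch below uses a small Python data structure. The field names mirror
the columns of Table 7.4; the example action, owner, and dates are invented for illustration
only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class MitigationAction:
    """One row of a mitigation planning sheet (columns as in Table 7.4)."""
    capability_driver: str   # e.g., "Executive", "Operational", "Tactical"
    current_status: str      # assessed maturity, e.g., "L1"
    target_level: str        # agreed target, e.g., "L3"
    what: str                # agreed action (priority item)
    who: str                 # person responsible for driving the action
    when: date               # completion date, also used to set the next review


# Illustrative plan with a single agreed action for the Executive driver.
plan = [
    MitigationAction("Executive", "L1", "L3",
                     "Define and publish an NFF policy and communication plan",
                     "NFF Champion", date(2016, 6, 30)),
]

# The latest completion date can be used as the date of the next review.
next_review = max(action.when for action in plan)
print(next_review)
```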

7.4.2.3 Visual Capability


Once a benchmarking assessment has been carried out to identify an organization’s current
capability, and the future improvement targets that will enable the firm to deal with NFF
across the five benchmarking categories, a Capability Spider Plot (Figure 7.2) is used to
visualize the current state against the target improvements outlined in the previous section
on mitigation planning. Other potential uses for the capability spider plot include combining
individual organizational scores within a particular industry and comparing them against
those in a separate industry (for example, automotive vs. aerospace). The benefit of this is
that if capability in one area is stronger in one industry, it may highlight the potential for
transfer of best practice between industries.
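
For readers who wish to reproduce a plot of this kind, the sketch below draws a basic
capability spider plot with matplotlib. The five category names follow the benchmarking
categories used above; the current and target scores are invented purely for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

categories = ["Executive", "Operational", "Tactical",
              "Define & deliver solutions", "Evaluate solutions"]
current = [1, 2, 2, 1, 1]   # illustrative current maturity scores (1-5)
target = [3, 3, 4, 3, 3]    # illustrative agreed target levels

# Compute one angle per category and close the polygons by repeating
# the first point at the end of each series.
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]
current += current[:1]
target += target[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, current, label="Current state")
ax.fill(angles, current, alpha=0.2)
ax.plot(angles, target, label="Target")
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories)
ax.set_ylim(0, 5)
ax.legend(loc="lower right")
plt.show()
```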

Figure 7.2 Visualization of the benchmarking tool scores against target improvement scores.

7.5 Deployment of the Tool


The NFF benchmarking tool could be deployed in a similar fashion to many other
continuous improvement systems and would consist of a four-stage process as outlined next.

7.5.1 Stage 1
In the first stage, a representative of the organization, perhaps known as the “NFF Champion,”
is allocated the task of overseeing the improvement process. The NFF Champion would
be selected based on having the necessary organizational authority to undertake the
assessment tasks. In a cross-industry process, multiple personnel may be selected
for this task. The NFF Champion’s primary role is to identify the process owners, who will
be the direct line managers of the maintenance staff, and operators who are the people that
perform the various tasks and activities that make up the process being assessed. From this
group of process owners, suitably experienced assessors will be selected who are responsible
for developing and defining the questionnaires and collating the responses and other
evidence. Together the NFF Champion and the process owners will agree on the scope of the
assessment and identify the appropriate assessors. The NFF Champion and process owners
will be responsible for overall management of the assessment and for communicating the
aims and intentions across the organization.

7.5.2 Stage 2
The NFF Champion will decide on the level of maturity to be assessed—if no previous
assessments have been carried out, then an initial (Level 1) assessment must be carried
out. Level 3 and Level 5 assessments will be carried out once the majority of categories
have entered the corresponding interim maturity levels, Level 2 and Level 4, respectively,
where appropriate.

7.5.3 Stage 3
The full assessment is carried out across the organization using the appropriate maturity
questionnaires for the selected maturity level being assessed. Questionnaires should be
deployed down to the level of the person(s) who are able to produce the necessary evidence
to support the responses. Once collected, the responses can then be collated and mapped
onto the scoring matrix.

7.5.4 Stage 4
Based upon the results of the assessment, it may be necessary to develop an improvement
action plan to be implemented and followed up with a reassessment at a later stage.

7.6 Summary of the Tool


Based upon a clear need to benchmark industrial capability to deal with NFF, this chapter
has presented a suggested methodology and tool. The freely available tool, which consists
of closed question capability maturity questionnaires, will provide insight into the
organizational process effectiveness, the organization’s culture, and its ability to architect,
deliver, and evaluate NFF mitigating solutions. A maturity scoring matrix, where the results
of the questionnaires are mapped onto a matrix to assess the maturity or capability level
against predefined categories, is used to assess current capabilities and response levels. The
maturity matrix categories (divided into subcategories) provide an easy visualization tool
that uses capability spider plots to review maturity against previous assessments, or even
between industries. In addition to this, an example of an action planning sheet has been
added for use in identifying the actions that are required to improve capability before the
next assessment. This includes a generic list of Tools & Techniques that are available to help
in diagnosing organizational problems with regard to NFF and what is required to solve
them. Overall, the tool would be expected to act as an organization’s self-assessment process.

7.7 References
7-1. Hockley, C., and P. Phillips. “The impact of no fault found on through-life engineering
services.” Journal of Quality in Maintenance Engineering 18, no. 2 (2012): 141–153.

7-2. Jambekar, A. B. “A systems thinking perspective of maintenance, operations, and process
quality.” Journal of Quality in Maintenance Engineering 6, no. 2 (2000): 123–132.

7-3. Mortada, M. A., T. Carroll, S. Yacout, and A. Lakis. “Rogue components: their effect and
control using logical analysis of data.” Journal of Intelligent Manufacturing 23, no. 2 (2012):
289–302.

7-4. Söderholm, P. “A system view of the No Fault Found (NFF) phenomenon.” Reliability
Engineering & System Safety 92, no. 1 (2007): 1–14.

7-5. Turner, J. R., ed. “Maturity Models for the Project-Oriented Company.” In Gower
Handbook of Project Management, 183–208. Gower Publishing, Ltd., 2014.

Chapter 8
Improving System and
Diagnostic Design

8.1 Introduction
This book has consistently emphasized that the true root cause of any NFF event is
embedded within the design of the affected system, and as such is a predictable
outcome of any designed system that enters service. All systems are susceptible to faults
and failures during service, but the faults that result in NFF are often unanticipated
during design, sometimes because they occur within acceptable operating tolerances.
This makes them exceptionally difficult to control, because diagnosing a failure that
occurs within the expected operating tolerances is inherently difficult. The most
important point, therefore, is to understand that NFF is an expected outcome of any
system design and can never truly be eradicated. It can, however, be brought down to
levels that are bounded within acceptable parameters for the system.

Many NFF occurrences are a direct consequence of the diagnostic process and, hence,
the design of the diagnostics. A number of mitigation strategies to combat NFF have
been elaborated upon in this book, but most are purely that, mitigation for when a
NFF related failure occurs—such as incorporating specialized test equipment. In
truth, though, the objective in reducing NFF to levels that are considered acceptable
according to the system’s specifications, from both a technical and economic viewpoint,
would be to design and manufacture systems that are increasingly immune to those
unanticipated failures that result in NFF. This is not just through focusing on improved
reliability and integrity, but it is also with a strong emphasis on the design of the
diagnostics that will be used to support the system throughout its operational life. This
requires enhanced systems understanding, improvements to the actual design integrity,
and most importantly a robust mechanism for translating in-service failure (and NFF
data) knowledge directly back into the design process to both validate a system’s diagnostics
and identify potential improvements. In truth, system design and diagnostic design are
separate activities but are interrelated, and the diagnostic design would be dependent upon
the system design. In fact, they should ideally be part of an integrated design process.

The purpose of this chapter is not to provide an in-depth review of systems or diagnostic
design, nor to focus on any individual (or specific) design, but rather to give an overview
of issues that relate NFF to the design of both the system and the design of the supporting
diagnostics. The chapter begins by considering the relationship between diagnostic
design and NFF. It continues by looking at how the integrity of a system is influenced
and determined by its design, and how degradation of integrity directly relates to NFF—
focusing on the lack of understanding regarding the propagation of component level faults
through a system’s hierarchy. The following section will then briefly discuss the possibility
of understanding a system’s attributes, such as the number of interconnects, topological
complexity, and subsystem interactions, to predict a system’s NFF burden throughout its
service life. To achieve this, the need for enhanced testability, such as design for test and
design for diagnosis, is discussed, along with a brief overview of current standards to aid in
the implementation of testability at the design stage. Reduction of NFF through improved
design always relies upon the capturing of in-service failure knowledge. The challenge of
capturing and feeding back information to design is highlighted in a later section. Finally,
attention is given to training considerations, from a military domain perspective, and to the
impact of user interaction on NFF occurrences, as highlighted through the consumer
electronics industry.

8.2 Diagnostics Design and NFF


Diagnostic design provides the necessary requirements, processes, techniques, and tools
that will be used to implement diagnostics on a system throughout its operational life. The
diagnostics design will be generated based on an understanding of a system’s function,
operating requirements, previous failure experience, and acceptable levels of fault isolation
ambiguity, as well as being heavily driven by safety and/or economic factors. Experts within
the realm of NFF have indicated a relationship between the number of NFF events at various
maintenance levels, system type, and test procedures/processes/equipment. Each of these
issues could be considered in isolation, but when they come together as an integrated
diagnostics process and NFF still exists at an unacceptable level, they point to inadequate
diagnostic design. Although no academic literature has been uncovered that provides a
direct measure correlating these factors, practitioners in the field are adamant that the link
is certainly there. Therefore, it is proposed that by understanding equipment design and
tracking NFF occurrences, improvements in electronic systems diagnostics can be made that
would result in systems that are increasingly immune to unacceptable rates of NFF.

A system’s design and its support mechanisms, including its diagnostic specifications, have
inherent factors that will lead to NFF occurrences. It is here that a strategy to bring these
NFF occurrences to an acceptable level should be emphasized. Areas identified as being of
significant interest in this task would include:

• The effect that NFF has on specific types of system
• Identifying the types of system and system attributes which have frequent NFF issues
• The rate at which NFF reoccurs and the main influencing root causes

Also of importance would be the need to understand the dependency that NFF events
have on repairable items, and how they may change throughout their operational life cycle.
Questions arising from this line of thought include:

• Do NFF problems become more common after initial repair than after the original
delivery?
• Does the number of repairs have any influence?
• Is there any impact of component modification?

It would also be beneficial to be able to gauge what percentage of NFF is attributed to say
intermittent faults, working practice, or inadequate troubleshooting. There has already been
some work in this area (for example, the concise industrial NFF survey carried out in 2012 by
Copernicus Technology Ltd [8-1]).
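
The questions above lend themselves to a simple exploratory analysis once removal records
are available. The sketch below, using pandas, groups unit removal records by the number of
prior repairs and computes the fraction that ended as NFF; the column names and the sample
data are hypothetical and would need to match whatever the organization’s maintenance
records actually contain.

```python
import pandas as pd

# Hypothetical removal records: one row per unit removal event.
records = pd.DataFrame({
    "unit_serial": ["A1", "A1", "B7", "B7", "B7", "C3"],
    "prior_repairs": [0, 1, 0, 1, 2, 0],
    "outcome": ["fault confirmed", "NFF", "NFF",
                "NFF", "fault confirmed", "fault confirmed"],
})

# NFF rate as a function of how many times the unit has already been repaired.
records["is_nff"] = records["outcome"].eq("NFF")
nff_by_repairs = (records.groupby("prior_repairs")["is_nff"]
                  .agg(nff_rate="mean", removals="count"))
print(nff_by_repairs)
```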

Many NFF occurrences are a direct consequence of diagnostic process and design. For
example, reducing fault isolation ambiguities to the smallest possible level (further described
in chapter 9) will ultimately reduce NFF. However, achieving this may be overly costly
and therefore not an acceptable specification for the diagnostics requirement. Therefore,
economics has an impact upon the diagnostic design, which again provides a link between
cost and NFF. Achieving a desired diagnostic design requires an economic case to be made,
which can be enhanced through the input of in-service failure knowledge. This in-service
failure information can aid in reducing NFF by identifying that the diagnostics have
been correctly characterized and indicating when and where remedial action and design
improvements are required. However, it should be emphasized that in-service knowledge
feedback can support diagnostics design, but design does not rely on it. It is a secondary
support mechanism aimed at verifying the success of a systems diagnostics performance.
The reduction of NFF will always primarily be achieved through improving the diagnostic
design, itself, where NFF inherently resides.

8.2.1 In-Service Feedback Activities


NFF events encompass a whole range of products in service, many of which are made up
of legacy systems with well-defined operational support practices. Although the root cause
of a NFF will begin with the failure of a component (or with unpredictable intermittent
faults, which may be part of an inherent design flaw), the end result is a failure within the
diagnostic process: the maintenance service’s procedures, equipment, testing capability,
and guidelines for that equipment were inadequate to isolate the problem. To
reduce the NFF event rate for in-service equipment, the conditions under which NFF
problems occur need to be considered in depth, and investigations should focus on the
following areas:

• Failure Knowledge Bases, novel FMECA tools, and troubleshooting guides specific for
NFF to improve diagnostic success rates.
• Research to pinpoint where in the maintenance process NFF is occurring (for example,
at a particular maintenance line, testing station, or under specific testing equipment).
• Development of assessment tools to assess maintenance capability/effectiveness, which
may include:
• Recording and cross referencing test station configuration and performance
statistics with NFF occurrences. This includes statistics on equipment calibrations.
• Ensuring that the testing environment is correct, and investigating whether testing
procedures need modification to consider multiple environmental factors (humidity,
temperature, vibration, etc.) simultaneously.
• Introduction of integrity testing as complementary to standard ATE (functional) testing
procedures.
• New testing techniques.
• Integration of on-board health and usage monitoring.
• Standardization for intermittent testing and procedures for dealing with
intermittent fault occurrences.

These areas will be further discussed in chapter 9 (Technologies for Reducing NFF).

8.2.2 Diagnostic Design Activities


When talking about influencing the design of a complex system’s diagnostics for the specific
purpose of reducing the impact of NFF, designers need to be clear on what actually needs
to be modified or redesigned. For example, does the system itself need to be modified or
redesigned with the aim of increased robustness to make it more fault tolerant? Or is the
equipment itself suitable for the task at hand, and is what is required a better diagnostics
design? Focusing on the latter of these two questions, if the diagnostic design has been
conducted as part of an integrated system diagnostic design, in which the design of the
diagnostics is an integrated part of the actual systems design, then the diagnostic capability
should be robust and would only require validating by capturing service data.

Often, huge amounts of service data are captured, including NFF occurrences. However,
what is not generally widespread is the act of quantifying the burden of NFF, identifying the
root causes, and understanding the impact of NFF related faults on coupled systems, which
would help in driving the diagnostics design. The key challenges to address here are:

• Development of design guidelines and standards to improve integrated systems
diagnostic designs that incorporate the reduction of NFF as a design goal
• Research into the relationships between system design characteristics and NFF related
attributes, such as the rate of false alarms and the fraction of faults isolated, to improve
integrated systems diagnostic design
• Modeling of complex interactions between systems/subsystems/components and their
physics of failure
• Development of a NFF burden/rate predictor for new designs or a NFF trending process
for legacy systems
• NFF specific maintenance cost models for design justification
• In-service monitoring and feedback to validate and verify the integrated systems
diagnostics

Many of these areas are discussed in the following sections; a minimal sketch of the
burden/rate predictor idea is given below.
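
The burden/rate predictor mentioned above is, in essence, a mapping from design attributes
to an expected NFF rate. The sketch below fits a simple least-squares model from a handful of
invented attribute values (interconnect count, a topological complexity score, and the number
of subsystem interactions) to observed NFF rates on legacy systems; both the attributes and
the data are assumptions for illustration, not a validated model.

```python
import numpy as np

# Hypothetical legacy systems:
# [interconnect count, complexity score, subsystem interactions]
attributes = np.array([
    [120, 3.2, 14],
    [450, 6.8, 40],
    [300, 5.1, 25],
    [800, 8.9, 60],
    [200, 4.0, 18],
], dtype=float)
observed_nff_rate = np.array([0.08, 0.22, 0.15, 0.31, 0.11])  # invented rates

# Ordinary least-squares fit with an intercept term.
X = np.column_stack([np.ones(len(attributes)), attributes])
coeffs, *_ = np.linalg.lstsq(X, observed_nff_rate, rcond=None)

# Predict the expected NFF rate for a proposed new design (invented attributes).
new_design = np.array([1.0, 350, 5.5, 30])
print(float(new_design @ coeffs))
```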

8.3 System Design and System Integrity


System integrity is a widely used metric that describes the robustness and reliability of a
system in terms of fault-free operating efficiency. As a system ages and interacts with the
outside environment, this integrity inevitably begins to reduce at a rate proportional to
the rate of degradation within the system’s elements. The link between integrity, reliability,
and NFF must be understood and recognized, and is paramount in ensuring that reliability
and availability metrics are being met, especially when these are being monitored and
assessed using in-service data. Three main aspects of a system combine when talking
about system integrity: the design of the system (which includes materials and build
quality), the operating environment, and the system’s usage. Therefore, to understand
system integrity, the region where the design interacts with the operational environment
and the system’s usage must be accurately identified and understood. This region is
illustrated in Figure 8.1.

Figure 8.1 The system integrity zone.

Estimation and prediction of the future operating environment and system usage will,
therefore, directly influence the system design to ensure that a specified level of reliability
is achieved while minimizing over-engineering. If these predictions are inaccurate or the
system’s usage requirements change, there will be an inherent impact on the vulnerable
points in the system. The system will degrade particularly rapidly at these vulnerable points:
interconnects in wiring, components, and PCBs. These problems also occur during the
assembly of new-build equipment and of equipment undergoing maintenance/upgrades, and
can cause such issues as intermittency. The impact of such low-level component degradation on the whole system is
not well understood. Consequently, the ability to back propagate observed failure symptoms
from the system level back to the component level is notoriously difficult, and as such the
required test approaches do not necessarily exist, leading to test escapes [8-2]. The result will
then be speculative system/subsystem replacements and, hence, higher-than-anticipated
rates of NFF. One approach would be, again, to focus on investments in higher-level test
equipment, possibly even having to invest in high-cost bespoke test equipment that may not
be possible due to economic restraints attached to the systems design. A more cost-effective
and future-proof approach, therefore, would be to look at the actual design and manufacture
of these vulnerable, degradation-prone points in the system.

As an example, consider the critical yet vulnerable system components, such as electrical
interconnect and wiring systems. As these components begin to degrade, perhaps through
natural wear, corrosion, environmental damage, etc., faults will begin to manifest,
propagate through the system, and eventually lead to a system-level fault indication.
The difficult task is then being able to trace this system-level symptom back to a specific
wiring harness or connector. Specific test regimes for these will not always be available, or
will be inadequate to avoid a NFF event. Avoidance of faults through improved integrity
is required and can be achieved through adopting new methods of connecting systems or
installing wiring.

For example, the main issue with a wiring harness under vibration is that wire-to-wire
interactions (and their interactions with other structures) can easily lead to chafing, which
will eventually damage the internal wiring. Poor installation then makes the wiring
harnesses difficult to access and test. A design solution does exist for this, which has begun
to take hold in the motorsports industry, and makes use of composite wiring sleeves. These
sleeves offer the inherent strength of composites to avoid chafing and internal wiring
damage. The sleeves can be molded so that they follow the contour of the structure, allowing
a permanently neat arrangement. The paths of individual wiring harnesses can easily be
visualized, as there is no risk of tangling. Changing the design of aircraft wiring systems
would significantly reduce NFF through reducing the rate of degradation that leads to a loss
of system integrity.

8.4 Testability
One design characteristic, which facilitates the testing and diagnostics, is testability. MIL-
STD 2165 defines testability as: “a design characteristic which allows the status (operable,
inoperable and degraded) to be determined and isolation of faults to be performed in a
timely and efficient manner.” The standard also states that good testability is when existing
faults can be confidently and efficiently identified [8-3]. Testability [8-4] can be implemented
at the design phase of equipment, as a measure to reduce NFF at early stages of development.

There are two types of testability at the system level:

1. Inherent Testability—the way a system is designed and the ability to observe
system behavior using a variety of stimuli. It is defined by the location, accessibility, and
sophistication of the tests and test points applicable to the system.
2. Achieved Testability—the way maintenance of a system is implemented. It is defined
by the results of the maintenance process (for example, false alarms, ambiguities, incorrect
isolations, and NFF).

For both types of testability, it is recommended that the testability analysis begin directly
at the design phase. This maximizes a system’s inherent testability, because beyond the
design phase, only achieved testability can be implemented. Considering testability as a
design characteristic, it can be split into two distinct but related types: design for test and
design for diagnosis. Design for test relates to the implementation of good design practices
that facilitate testing and is performed by the system designers. Design for
diagnosis is also performed by designers, but often in conjunction with system analysts. It
relates to the optimization of a design to facilitate diagnostics by incorporating optimized
test point placements and optimized diagnostic strategies.
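
Because achieved testability is judged from maintenance outcomes, it can be summarized
with a few simple figures of merit. The sketch below computes illustrative rates (fault
detection, correct isolation, false alarm, and NFF) from event counts; the counts and the
exact ratio definitions are assumptions for this example and are not taken from any of the
standards discussed in the next section.

```python
def achieved_testability_summary(detected, undetected, correct_isolations,
                                 incorrect_isolations, false_alarms, nff_events):
    """Illustrative figures of merit for achieved testability.

    All inputs are event counts gathered from the maintenance process.
    The ratio definitions here are simplified assumptions for this sketch.
    """
    total_faults = detected + undetected
    total_isolations = correct_isolations + incorrect_isolations
    total_indications = detected + false_alarms
    return {
        "fault_detection_rate": detected / total_faults if total_faults else None,
        "correct_isolation_rate": (correct_isolations / total_isolations
                                   if total_isolations else None),
        "false_alarm_rate": (false_alarms / total_indications
                             if total_indications else None),
        "nff_rate": nff_events / total_indications if total_indications else None,
    }


print(achieved_testability_summary(detected=180, undetected=20,
                                   correct_isolations=150, incorrect_isolations=30,
                                   false_alarms=25, nff_events=40))
```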

8.4.1 Testability Standards


A few existing standards have been developed for testability. MIL standard MIL-M-24100
[8-5], now superseded by MIL-STD-H-2165, was one of the first to be developed in the 1960s
for military applications. MIL-STD-Hdbk-2165 [8-6] is widely used by the Department
of Defense (DoD). This standard did not have precise and unambiguous definitions of
measurable testability “figures-of-merit” and relied mostly on a weighting scheme for
testability assessment. The continued integration of diagnostic environments, where
elements of automatic/manual testing, maintenance, training, and technical information
were required to work together with the testability element, pushed the need to maximize
the reuse of data, knowledge, and software through BIT and ATE. The IEEE later developed
documents that provide a formal basis, for the analytical component, of the Design for
Testability process [8-7], [8-8], [8-9]. These also standardized the interfaces for diagnostic
elements of “an intelligent test environment” and “representations of knowledge and data”
used in diagnostics. In 2008, the UK MoD also issued the Defense Standard 00-42 part 4
Reliability and Maintainability (R&M) Assurance Guide Part 4: Testability [8-10] document.
It provides testability guidance to industry and can be used as a contract reference. All these
standards provide guidance on how to design for testability and ways of validating how the
standards are met. The other advantage of such standards is that they allow the industry
to work to a common and known practice (standard), leading to more competition and
interoperability of systems. This also contributes to reducing the cost to replace equipment
when it becomes obsolete.

8.5 Design for Diagnosis


Designing for diagnostics (also known as diagnosability), in the first instance, relies
heavily on an understanding of the system; more specifically, understanding of the system
requirements. The process for diagnostic design has three key stages, which include
diagnostic development, diagnostic assessment, and diagnostic improvements. The first
stage, diagnostic development, should be integrated with the systems design and developed
simultaneously and updated as an iterative process. The diagnostic assessment stage again
should be performed in tandem with the systems design and is used to evaluate diagnostic
tools, providing the necessary feedback to both the diagnostic and system designers.
Diagnostic assessment is also used to determine requirement allocations, and the assessment
becomes more frequent as the system and diagnostics design becomes more mature. At this
stage, there are opportunities to evaluate the potential success of the diagnostics in keeping
NFF at the required/acceptable level. Finally, the last stage is the design improvements to
the diagnostics. Here, the diagnostics are assessed at the earliest phase of development and
updated as required. Again, at this stage, opportunities exist for improving diagnostics
based on any new fault and/or NFF related information.

Determining the fault detection and isolation capability of a system, with a known confidence
level at every level of the system’s hierarchy, is a key objective in ensuring that the rate of
NFF is kept at the acceptable level. It can be realized by ensuring that the ambiguity in
fault isolation is as low as possible. Ideally, fault isolation ambiguity, which describes the
maximum number of system elements to which a fault should be isolated, should be one.
That is, the fault is isolated to the correct system element, the first time, every time. A high
level of confidence in the diagnostics would help ensure this.
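
One common way to reason about fault isolation ambiguity at design time is through a
test-to-component dependency matrix: components that produce identical test signatures
cannot be told apart and therefore form an ambiguity group. The sketch below illustrates
the idea; the component names, tests, and matrix entries are invented for this example.

```python
from collections import defaultdict

# Hypothetical dependency matrix: for each component, which tests would fail
# (1) or pass (0) if that component alone were faulty.
dependency = {
    "power_supply":   (1, 1, 0),
    "sensor_board":   (0, 1, 1),
    "wiring_harness": (0, 1, 1),   # same signature as sensor_board
    "processor":      (1, 0, 1),
}

# Group components by their test signature; groups larger than one are
# ambiguity groups, and the largest group size is the worst-case ambiguity.
groups = defaultdict(list)
for component, signature in dependency.items():
    groups[signature].append(component)

ambiguity_groups = [members for members in groups.values() if len(members) > 1]
worst_case_ambiguity = max(len(members) for members in groups.values())

print(ambiguity_groups)        # [['sensor_board', 'wiring_harness']]
print(worst_case_ambiguity)    # 2 (the ideal described in the text is 1)
```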

Identifying the occurrence of failures within the system makes use of measurements of the
system behavior, results of autonomous tests, and result-driven empirical analysis. Making
sense of this information and data should be facilitated by enhanced diagnostic modeling,
which integrates all available information and data into an efficient diagnostic model of the
system, providing high-level and reliable diagnostic decisions. The benefits of this include:

• Confident and accurate fault detection
• Optimal fault isolation
• Reduced false alarm and false removal rates
• Reduced diagnostics time
• Improved operational availability
• Improved safety
• Reduced maintenance cost

The relationship between NFF and improved operational availability, time for diagnostics,
and reduced maintenance cost has been covered in chapter 4, while the impact on safety is
considered in chapter 5. Chapter 9 will consider fault detection confidence, fault isolation,
and false alarms from a technology viewpoint, but the purpose of the current discussion
is to emphasize that diagnostics are fundamental to reducing NFF, and that the NFF related
benefits stemming from diagnostics are rooted in the diagnostics design.

8.6 Information Feedback to Diagnostic Design


Technicians around the world are discovering novel causes of failures on a daily basis.
These are called novel because they have never before been experienced or observed for
that particular system. Even though they may have been anticipated during the system’s
design, they may have had a predicted low probability of occurrence, or a low system
impact, meaning that they may not have been incorporated into the design of the system’s
diagnostic features. The reason many failures that are not expected to occur, do occur, is
that during the design stage the system is expected to be operating within a specific set
of operational and environmental envelopes; a breach of the envelopes would signal a
predictable failure. However, these novel failures occur in-service and within the designed
operational tolerances, making them unpredictable and difficult to diagnose, and usually
resulting in NFF events. Ensuring that all possible failures are identifiable and correctly
prioritized, and that predictions of occurrence are reliable at the design stage, requires the
following to be implemented:

• Create a practical means by which to access field experience
• Capture that experience
• Feed it back into design engineering

This field experience needs to be shared by inserting it directly into the troubleshooting
workflow so that others will be able to identify the cause of the problem the next time that
it occurs on the first attempt, whenever or wherever it may occur. Furthermore, that field
experience, if used correctly, would be invaluable in assisting design engineers to improve
the reliability of the system. At the core of this design improvement would be the challenge
of better distinguishing, before the system is released into service, between “anticipated
failures” and the “real failures” that appear in service.
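
As a minimal sketch of how field experience might be inserted into the troubleshooting
workflow, the example below indexes resolved cases by the observed symptom so that the
recorded fix can be retrieved the next time that symptom appears. The record fields and
symptom strings are hypothetical; a real implementation would need controlled vocabularies
or text matching rather than exact string keys.

```python
from collections import defaultdict

# Symptom-indexed store of resolved field cases (all entries are illustrative).
knowledge_base = defaultdict(list)

def record_resolution(symptom, root_cause, corrective_action, platform):
    """Store a resolved case against the symptom that was observed."""
    knowledge_base[symptom].append({
        "root_cause": root_cause,
        "corrective_action": corrective_action,
        "platform": platform,
    })

def lookup(symptom):
    """Return previously recorded resolutions for an observed symptom."""
    return knowledge_base.get(symptom, [])

record_resolution(
    symptom="intermittent loss of sensor signal in cruise",
    root_cause="chafed wiring harness at bulkhead connector",
    corrective_action="re-route and sleeve harness; repair damaged conductor",
    platform="aircraft type X",
)

print(lookup("intermittent loss of sensor signal in cruise"))
```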

When complex equipment or systems are designed, engineers typically identify the potential
failure modes and effects on the system using FMEA. When the system enters service, and
the real world imposes itself, some faults that were anticipated will occur, and many will
never happen (Figure 3.3). Utilizing FMEA in the design process can, therefore, never
be truly accurate, or aid in reducing NFF directly, but it can help to determine tools and
techniques that can help reduce NFF. These tools and techniques would include:

• Employing on-board diagnostic technologies to detect failures.
• Implementing prognostics and health management strategies, including trend
monitoring to detect potential failures.
• Preparing troubleshooting procedures in advance for analyzing the functionality of the
system. This can help differentiate among the many possible root causes of anticipated
failures.
Eventually, a fraction of the theoretically possible failure modes will make an appearance.
The weaknesses in a system’s design will become evident during service, and as the
same system is often installed on multiple platforms, what occurs on one aircraft can be
expected on another aircraft being operated in similar conditions. In the case of faults that
were unanticipated during design, any designed diagnostic systems/processes will not
resolve them, resulting in them being reported as NFF. When experts eventually resolve
such unanticipated failures, where does the problem-solving knowledge reside after its
creation? This experience must be blended with existing diagnostic and prognostic tools
and techniques to inform design so that the number of unanticipated failures is reduced.
The following are key challenges to feeding in-service experience and knowledge back
into design:

• Storing experience-based knowledge and being able to deliver it at the time and place
when a specific symptom is observed so that the experience can be used to resolve the
problem on the first attempt
• Delivering knowledge in a common form that is useful to designers and other experts,
including less experienced staff
• Ensuring that the knowledge shared is of benefit to everyone at whom it is aimed and that
those beneficiaries can and do make use of it
• Integrating the knowledge with existing troubleshooting and design tools so that it
becomes part of the usual design workflow

8.7 Level of Training


One of the difficulties with NFF has been identified as a lack of training. Without
appropriate education or training, NFF events are unlikely to be traced back to design-
related issues at a repair station. This can be emphasized by considering the military domain.
Field engineers are usually provided with training up to diploma level in an engineering
field. Before they are deployed, they are trained to the level at which they are expected
to repair or maintain equipment in theater. However, reduced budgets and manpower
numbers influence the amount or level of training provided to a technician before they are
deployed to operations. Specific training is mainly provided by industry, often by the OEM
as part of the support package. As systems are becoming more complex and sophisticated,
technicians are only expected to replace a faulty LRU/item with a serviceable one, and then
send the faulty one back to the OEM for repair. This has limited the amount of skill required
and has increased dependency on the OEM. Furthermore, this increases the logistic burden
and MTTR if the system is returned to the fourth line for repair. This predominantly affects
Urgent Operational Requirements (UORs), which are currently first to fourth line repair,
and the average NFF rate is estimated to be up to 25% per unit or item, according to the
individuals that were interviewed. As a result of this complication, mainly driven by the
advanced technology required to carry out missions at operations, the Army is increasingly
relying on Field Service Representatives (FSRs) for mentoring and assistance when they cannot
resolve a fault in operation. It is within these FSRs that the blend of systems design
knowledge and service experience exists, providing the perfect basis for informing
design engineers of the need for improvements and/or system modifications.

The inadequacies in training level further add a burden to design engineers and could be
used as a case for driving forward intelligent and automated diagnostics, which would
remove the human factor from the NFF issue.

8.8 User-Interaction and System Design


One area that is related to system design and NFF, and that is often overlooked, is how that
system is utilized by the user. Incorrect interaction with a system by the user immediately
poses difficulties for a diagnostic process, which has been designed based on information,
metrics, and parameters that depend on the system to be used in a specific manner. For
example, is the user operating the system incorrectly because of inadequate training, or
has the system been designed without the end user in mind? This design issue is a major
contributor to the NFF issue across multiple industries. To highlight the case, we can
consider the consumer electronics industry, in which the annual bill for NFF returns is in the
billions of dollars.

The major impact area for operators from no fault found device returns arises because, when
customers buy a smartphone, it is increasingly likely to be a replacement for an existing
device [8-11]. Their expectation is that the new device will perform at least as fast as their
old device, and as fast as other devices owned by friends and family. If those expectations are
not met because that particular combination of device, application, and Internet provider suffers
from wasted data, the customer will return the device as “faulty.” The operator then spends
time and money testing the device without discovering any faults, and has to resell the
device as “refurbished” at a lower margin. This increases their support costs and makes the
device less profitable for them.

If a device is not functioning as expected, even with a user error fault, the reason for not
obtaining the expected functionality must be identified. One of the underlying reasons for
devices being returned as NFF is that the user often has the device configured incorrectly,
has misunderstood the device’s capabilities or functionality, or there is an underlying
hardware/software design problem that is having a secondary “fault” effect—but that
design root cause is not obvious. The user will often be required to contact a service
representative, who will talk them through a troubleshooting process; however, this can
create a frustrated customer when the help and advice provided is not solving the issue. The
service representative is not aware of how the device is currently being used and in what
conditions, or if the customer is providing the correct information—although this could be
rectified through real-time analysis of available data. The cost of service representatives
could also be significantly reduced if this real-time analysis for use in device troubleshooting
could be delivered directly to the device, allowing the user to identify the nature of the
problem.

8.9 Conclusion
Throughout this chapter, a variety of issues relating to NFF and design have been discussed,
and possibilities for some improvements and modifications have been summarized. There
is no doubt that the very nature of the NFF problem has its roots firmly embedded within
inadequacies present in the diagnostic design process. This is in no way laying the NFF
blame on designers; they will design to the required specifications, which almost certainly
will not include any reference to NFF—in fact, most may not even be aware of any in-
service issues that could enhance their designs. To minimize the variance between the
expected (acceptable) levels of NFF and the levels of NFF observed in the field, designers
need to be provided with information and knowledge captured in the field. This will
enable them to validate and verify that the diagnostics design is performing well and is
the right approach for the system, for its current operating requirements. More emphasis
must be placed on improved predictability of system usage and operating environments
to reduce the probability of unanticipated faults occurring. This is no easy task, and many
of the challenges highlighted in this chapter still remain far from resolved. In addition to
this, designers need to turn their attention to enhancing the testability of systems.
This will ensure that access to test points is easy, appropriate test equipment for the task
is identified, and the overall test coverage of the system is enhanced. If you cannot test
a system, then you cannot diagnose it, and NFF will always prevail. Finally, the often
overlooked contributor to NFF is the way in which the interaction between the user and
system is managed. Systems should always be designed with the user in mind, with full
training and support to avoid confusion and incorrect operation, which results in perceived
faults that again lead to a category of NFF.

8.10 References
8-1. Huby, G. “No Fault Found: Aerospace Survey Results,” Copernicus Technology Ltd.,
2012.

8-2. Gatej, J., L. Song, C. Pyron, and R. Raina. “Evaluating ATE features in terms of
test escape rates and other cost of test culprits.” In: Proceedings International Test
Conference, 2002. (1040–1049). IEEE, 2002.

8-3. MIL-STD-2165. Testability program for electronic systems and equipment. Department of
Defense. Available at http://www.testability.com/Reference/Documents/mil_std_2165.pdf. 1985.

8-4. Ungar, L. Y. “The Economics of Harm Prevention through Design for Testability.”
IEEE AUTOTESTCON 2008. A.T.E. Solutions Inc., LA, USA.

8-5. MIL-M-24100. Military specification manuals, technical: functionally oriented
maintenance manuals (FOMM) for equipment and systems. Departments and
Agencies of the Department of Defense, 1966.

8-6. MIL-STD-Hdbk-2165. Testability handbook for systems and equipment. Department of
Defense, 1995.

8-7. IEEE Std 1232-1995. Trial Use Standard for Artificial Intelligence and Expert
System Tie to Automatic Test Equipment (AI-ESTATE): Overview and Architecture.
Piscataway, New Jersey: IEEE Standards Press, 1995.

8-8. IEEE Std 1232.1-1997. Trial Use Standard for Artificial Intelligence and Exchange
and Service Tie to All Test Environments (AI-ESTATE): Data and Knowledge
Specification. Piscataway, New Jersey: IEEE Standards Press, 1997.

8-9. IEEE Std 1232.2-1998. IEEE Trial-Use Standard for Artificial Intelligence Exchange
and Service Tie to All Test Environments (AI-ESTATE): Service Specification.
Piscataway, NJ: IEEE Standards Press, 1998.

8-10. Def Stan 00-42. “Reliability and Maintainability (R&M) Assurance Guide Part 4:
Testability,” Defence Standard 00-42. 2008.

8-11. Overton, D. “‘No Fault Found’ returns cost the mobile industry $4.5 billion per
year.” Available at: http://www.wds.co/no-fault-found-returns-cost-the-mobile-
industry-4-5-billion-per-year/. Accessed May 2015.

Chapter 9
Technologies for
Reducing No Fault Found

9.1 Introduction
So far, throughout this book a variety of ideas and methodologies have been presented
that have been aimed at understanding the scale and the root causes of the NFF problem.
The next logical step is to identify the best way to combat it. A variety of solutions
have, in the past, been recommended and, in some cases, successfully implemented. The
most widely adopted approach has been to modify processes and procedures, adapting them
as required for specific NFF trending and statistical monitoring. A complete overhaul
might even be required, with entirely new procedures/processes adopted as part of
continuous improvement. Such measures will, of course, go some way toward drilling down
into the scale of NFF, its root causes, and even identifying those troublesome rogue units, but
they do not go as far as eradicating diagnostic ambiguities in complex equipment, which
could be overly expensive to achieve by these means alone. Achieving this reduction requires
the introduction of new technologies to support a continuous improvement program.

In this chapter, the authors propose some technology-based solutions, dividing them into
two categories. The first considers the use of enhanced diagnostics to improve on the
shortcomings of traditional techniques, which are reliant upon BIT, and an overly linear
diagnostics process arising from a lack of understanding of the system topology. Topology
in system design describes the configuration and interlinking of the components of the
system. To this end, we discuss the use of incorporating health and usage monitoring
systems (HUMS) into electrical systems. We also outline potential improvements to
BIT based upon an enhanced understanding of system topology to understand fault
propagation within a system, and BIT code diagnostics to identify spurious or corrupted BIT
codes. In addition, we also advocate the use of monitoring for potential failure precursors
and life-cycle loads—both of which can be invaluable for understanding system failure and
for designing enhanced testing within the maintenance arena.

The second category covered in this chapter is based upon the need for enhanced testing at
the maintenance station. We look at the need for ensuring that newly designed equipment
is testable, by incorporating testability as a specific design variable. Focusing on testing
improvement, the limitations of the information that common functional testing can provide
about the onset of system degradation are discussed. Emphasis is placed on the need for more
system integrity testing to be adopted. Our call for the monitoring of life-cycle loads is further
supported by focusing on the importance of integrating that data with environmental testing.
Finally, we consider the test station itself, and provide suggestions on how to improve a
station’s performance by correlating test equipment serial numbers with those of units
experiencing NFF, to identify any rogue test equipment; we also discuss the use of radio
frequency identification (RFID) tracking of units.

9.2 Advanced Diagnostics


9.2.1 Health and Usage Monitoring of Electrical Systems
The difficulties that are associated with achieving successful diagnostics arise through a
lack of knowledge on the extent of degradation of a system’s components while in operation.
In many systems, components are not inspected until a failure has occurred, and with fault
isolation (FI) ambiguities being larger than one, a NFF will occur. To explain this, consider
a system that has an FI ambiguity of two. If a fault occurs in the system, then the fault will
be isolated down to a maximum of two subsystems. One of these will contain the fault,
while the other will not—if both subsystems are removed for testing, the one without the
fault will be categorized NFF after the necessary testing has been completed. Monitoring of
component/subsystem integrity, which we term health monitoring, is aimed at reducing the
level of FI ambiguity to as low a number as possible.
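
To make the numbers concrete, the short sketch below (an illustrative calculation, not taken from any particular program) estimates how many serviceable units end up removed, and later declared NFF, per fault event when every unit in the ambiguity group is pulled and only one actually contains the fault.

```python
# Illustrative only: expected NFF removals per fault event as a function of
# fault-isolation (FI) ambiguity group size, assuming every unit in the group
# is removed and exactly one of them actually contains the fault.

def expected_nff_removals(ambiguity_group_size: int) -> int:
    """Serviceable units removed (and later declared NFF) per fault event."""
    if ambiguity_group_size < 1:
        raise ValueError("Ambiguity group size must be at least 1")
    return ambiguity_group_size - 1  # all units in the group except the faulty one

if __name__ == "__main__":
    for size in (1, 2, 4, 8):
        print(f"FI ambiguity {size}: {expected_nff_removals(size)} NFF removal(s) per fault event")
```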

It is not just the health of the component that would be of interest, though. Component
usage monitoring provides key data and information on conditions that the component
experienced, such as stresses or environmental changes, which would allow reproduction of
the fault to be achieved during maintenance testing. The ability to do all of this is known as
health and usage monitoring; three current approaches are applicable to electronic products:

• Built-in test
• Monitoring and reasoning of failure precursors
• Monitoring cumulative damage based on measured life-cycle loads

9.2.2 Built-In Test


The evolution of electronic equipment over the years has seen a growing dependency
upon BIT to provide a default fault detection and isolation capability [9-1], [9-2]. BIT is
an assortment of on-board hardware-software elements designed as a mechanism for
system error checking. Historically, BIT has been designed and used primarily for in-
field maintenance by the end user, but BIT is now finding its way into evermore diverse
applications that include oceanographic systems, multichip modules, large-scale integrated
circuits, power supply systems, avionics, and also passenger entertainment systems. The
term “system error checking” in the context of this discussion indicates system status
updates that provide valuable information to locate any system error. These updates may
then be translated into a particular component that is at fault and requires replacement.
To aid in the further understanding of BIT without delving into specific technical details,
Figure 9.1 presents an overview of how BIT is used in a processor system.

Figure 9.1 Example of BITE functions. (Purpose: non-intrusive monitoring of equipment
interfaces, etc., to ensure that they are operating within measurable limits, and provision of
information to the software for evaluation of the system’s capability to perform. Examples:
voltage and current meters to evaluate interfaces, power inputs, etc.; frequency monitoring of
analog interfaces; monitoring of peripheral equipment activities.)

Essentially, BIT operates in one of two ways. The first, known as interruptive BIT (I-BIT),
is when the equipment operates normally but is suspended during BIT operation. The
second, known as continuous BIT (C-BIT), is continuous monitoring of the equipment
without affecting normal operation. Even though both of these BIT concepts will have been
designed as a means to detect and locate equipment faults, a variety of shortcomings directly
contribute to the NFF phenomena.
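
As a toy sketch of this distinction (the class, method names, and threshold are invented for illustration), I-BIT applies a dedicated stimulus while normal operation is suspended, whereas C-BIT checks each in-service measurement as it arrives without interrupting operation.

```python
# Toy sketch of the two BIT modes; names and the simple threshold check are
# illustrative assumptions, not a real avionics implementation.

class ToyBit:
    def __init__(self, limit: float):
        self.limit = limit  # assumed pass/fail threshold for the monitored parameter

    def c_bit_sample(self, measurement: float) -> bool:
        """C-BIT: evaluate one in-service measurement without interrupting operation."""
        return measurement <= self.limit

    def i_bit_self_test(self, stimulus_response: list[float]) -> bool:
        """I-BIT: normal operation is suspended and a dedicated stimulus is applied."""
        return all(r <= self.limit for r in stimulus_response)

if __name__ == "__main__":
    bit = ToyBit(limit=5.0)
    print("C-BIT sample pass:", bit.c_bit_sample(4.2))
    print("I-BIT self-test pass:", bit.i_bit_self_test([3.9, 4.1, 5.6]))
```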

Over the last three generations of aircraft, systems sophistication has increased significantly,
yet the number of BIT signals has not grown nearly as rapidly as one might have expected. A
variety of reasons have likely contributed to this slow uptake of BIT. In the
context of the NFF argument, we choose to focus on the well-known issue of BIT providing
unreliable or often spurious fault-related signals. An issue such as this breaks down the
confidence and trust that a maintenance engineer has with the information they are receiving
from BIT, as illustrated by the human influence case study presented in chapter 3.

BIT design is not a trivial task. Its foundation is built upon deep knowledge of all the
interactions within the system. It is easy to see how this results in difficulties defining a
fixed set of test procedures that can verify full systems functionality. Within aerospace,
this is what leads to BIT log reports containing spurious fault detections. By spurious, we
mean those codes that cannot easily be correlated to any other information. For example, an
operator/pilot often reports a fault that has no correlation to the BIT log, resulting in
an overlooked need for maintenance. Also, even with the sophistication of modern BIT, the
issue still remains of units being removed from the aircraft after being reported as faulty
by BIT, only to be found upon subsequent testing to be perfectly healthy. The flipside of
this is that secondary faults (a fault that exists but was not the root cause of the observed
symptoms resulting in the equipment removal) may become apparent during testing. These
may not correlate to any of the BIT reports—in other words, they have gone undetected by the
BIT. As well as the false alarm issue, other factors such as the level of system coverage and

161
Chapter 9

inappropriate parameter threshold limits set within the BIT contribute to NFF events. To
solve this, two potential improvements could be investigated and developed:

• Enhanced understanding of system/fault topology


• BIT code diagnostics

9.2.2.1 Enhanced Understanding of System/Fault Topology


If a fault occurs at some location in the system, abstractly indicated in Figure 9.2 as a “real
fault event,” then the effects of this fault will not remain in that location. Systems are
composed of hundreds (if not thousands) of interconnected components, and the effect of
any fault event will propagate throughout the system. This propagation could be thought
of as a ripple or wave that, at some point, will come into contact with a monitoring point
(sensor), which triggers an alarm. With multiple monitoring points throughout the system,
the fault will, therefore, trigger multiple alarms at other locations within that system’s
topology. In such a scenario, detection of the propagated fault signal in differing locations
within the system immediately poses fault isolation difficulties in the traditional linear
fault isolation processes (i.e., an alarm from a sensor must have originated from a fault in a
subsystem/component directly connected to that sensor). For example, in Figure 9.2, sensors
1, 2, and 3 represent specific monitoring points in three different systems locations, each
being a perfectly healthy subsystem. As the impact of the fault event comes into contact with
the three sensors, their individual fault thresholds will be breached and alarms triggered.
Any of these subsystems could now be justifiably regarded as containing a fault and pulled
from the aircraft to undergo maintenance, and result in a NFF if the wrong one is removed.

Sensor 2

Sensor 1

Real fault
event

Sensor 3

Figure 9.2 Fault occurrence triggering alarms.

Taking this abstract example, it is fairly easy to see some possible improvements making
use of a much more nonlinear diagnostic process. The approach is analogous to that used to
determine the epicenter of an earthquake by triangulation, based on the relative signal
strengths at detection sites. This idea is also suitable for advanced system fault isolation, if the relative
signal strength at multiple detection sites is known, and the system topology is also known.
By this, we mean not just the interconnection of system elements, but also how events in one
system location can cause an event in another seemingly unrelated system location. Then
this information can be used to triangulate where the epicenter of the fault is within the
system, improving the fault isolation procedure.
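
A minimal sketch of the idea is given below, assuming the monitoring points' positions and relative alarm strengths are known; it uses a signal-strength-weighted centroid as a crude stand-in for the triangulation described above, and all names, coordinates, and strengths are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not the authors' implementation):
# estimate a fault "epicenter" from alarm signal strengths recorded at known
# monitoring points, using a signal-strength-weighted centroid as a crude
# stand-in for earthquake-style triangulation.

from dataclasses import dataclass

@dataclass
class SensorAlarm:
    name: str
    x: float          # sensor location in some assumed system coordinate frame
    y: float
    strength: float   # relative alarm strength (higher = closer to the fault)

def estimate_fault_epicenter(alarms: list[SensorAlarm]) -> tuple[float, float]:
    """Weighted centroid of alarming sensors; stronger signals pull harder."""
    total = sum(a.strength for a in alarms)
    if total <= 0:
        raise ValueError("No positive alarm strengths to triangulate from")
    x = sum(a.x * a.strength for a in alarms) / total
    y = sum(a.y * a.strength for a in alarms) / total
    return x, y

if __name__ == "__main__":
    alarms = [
        SensorAlarm("sensor 1", 0.0, 0.0, 0.9),
        SensorAlarm("sensor 2", 4.0, 0.0, 0.3),
        SensorAlarm("sensor 3", 0.0, 3.0, 0.4),
    ]
    print("Estimated fault epicenter:", estimate_fault_epicenter(alarms))
```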

9.2.2.2 BIT Code Diagnostics


The success or failure of BIT will always be dependent upon a set of predefined statistical
limits for the various parameters that are monitored. It is important to recognize at this
point that BIT will report failure for one of the following two reasons:

• A specified parameter has exceeded a set threshold value.


• The noise of the BIT measurements throws the test results outside of the testing limits,
even though the unit under test (UUT) meets required specifications.

The first of these is a direct result of component failure (for example, a burned-out
resistor). The second occurs when a measured parameter is corrupted by noise and has
been measured by an instrument that also has its own noise; this is common in integrated
manufacturing processes, digital system timings, and radar systems. The accuracy of
both of these cases is, of course, directly linked to the accuracy of the statistical thresholds.
The main concerns for this accuracy are a poor understanding of hardware/software
interactions, and the possibility that knowledge of the equipment’s operating environment
was not available when the thresholds were defined. Thresholds that are too low will
inevitably lead to BIT false alarms, and thresholds that are too high will result in missed
alarms. BIT, which provides a low-level fault indication, could also be inaccurate due to
difficulties understanding the BIT code. The code, itself, can become corrupted and result
in a meaningless interpretation, confounding the diagnostic process and increasing NFF
rates. One way to improve the use of BIT in diagnostics to reduce NFF would be to perform
diagnostics on the actual BIT code to identify spurious or corrupted codes.
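
The sketch below illustrates this threshold trade-off under the simplifying assumption (made purely for illustration) that the monitored parameter is corrupted by Gaussian noise: lowering the limit raises the false alarm probability, while raising it increases the chance of a missed alarm.

```python
# Hedged sketch: how a BIT threshold trades false alarms against missed
# detections when a monitored parameter is corrupted by Gaussian measurement
# noise. The distributions and numbers are illustrative assumptions only.

import math

def gaussian_tail(x: float, mean: float, sigma: float) -> float:
    """P(X > x) for X ~ N(mean, sigma^2)."""
    return 0.5 * math.erfc((x - mean) / (sigma * math.sqrt(2.0)))

def bit_threshold_tradeoff(threshold, healthy_mean, faulty_mean, sigma):
    false_alarm = gaussian_tail(threshold, healthy_mean, sigma)        # healthy unit trips BIT
    missed_alarm = 1.0 - gaussian_tail(threshold, faulty_mean, sigma)  # faulty unit passes
    return false_alarm, missed_alarm

if __name__ == "__main__":
    for threshold in (1.0, 1.5, 2.0):  # e.g., volts above nominal (assumed units)
        fa, ma = bit_threshold_tradeoff(threshold, healthy_mean=0.5, faulty_mean=2.5, sigma=0.5)
        print(f"threshold={threshold:.1f}: P(false alarm)={fa:.3f}, P(missed alarm)={ma:.3f}")
```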

9.2.3 Monitoring and Reasoning of Failure Precursors


The basis of health monitoring has been built upon the premise that indicators, as precursors
of failure, should always exist. These precursors are expected to take the form of some
change in a measurable parameter/signal of the system, which can then be correlated with
a subsequent failure mode. Using this causal relationship, the assumption is that failure
prediction can be made with an appropriate reasoning approach. The first step toward
achieving this would be to select the life-cycle parameters to be monitored. As an example,
Table 9.1 provides a summary of potential failure precursors for electronics. Selecting the
appropriate precursors to monitor will usually be done systematically through a FMECA, as
discussed in chapter 8.

Table 9.1 Potential Failure Precursors for Electronics


Electronic Subsystem / Failure Precursor Parameter
Switching power supply • DC output (voltage and current levels)
• Ripple
• Pulse width duty cycle
• Efficiency
• Feedback (voltage and current levels)
• Leakage current
• RF noise
Cables and connectors • Impedance changes
• Physical damage
• High-energy dielectric breakdown
CMOS IC • Supply leakage current
• Supply current variation
• Operational signature
• Current noise
• Logic level variations
Voltage controlled oscillators • Output frequency
• Power loss
• Efficiency
• Phase distortion
• Noise
Ceramic chip capacitors • Leakage current/resistance
• Dissipation factor
• RF noise
General purpose diodes • Reverse leakage current
• Forward voltage drop
• Thermal resistance
• Power dissipation
Electrolytic capacitors • Leakage current/resistance
• Dissipation factor
• RF noise
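
As a hedged illustration of precursor monitoring and reasoning, the sketch below smooths a monitored parameter (an invented supply leakage current trend, in the spirit of the CMOS precursors in Table 9.1) with an exponentially weighted moving average and flags the first sample at which the trend drifts past an assumed limit.

```python
# Illustrative sketch (assumed parameter names and limits): flag a potential
# failure precursor when an exponentially weighted moving average (EWMA) of a
# monitored signal, e.g., supply leakage current, drifts past a set limit.

def ewma(values, alpha=0.2):
    """Exponentially weighted moving average of a measurement stream."""
    smoothed, current = [], None
    for v in values:
        current = v if current is None else alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed

def first_precursor_index(values, limit, alpha=0.2):
    """Return the sample index where the EWMA first exceeds the limit, else None."""
    for i, s in enumerate(ewma(values, alpha)):
        if s > limit:
            return i
    return None

if __name__ == "__main__":
    leakage_ua = [5.0, 5.1, 5.0, 5.3, 5.6, 6.0, 6.8, 7.5, 8.9, 10.2]  # microamps (made up)
    print("Precursor flagged at sample:", first_precursor_index(leakage_ua, limit=7.0))
```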

Reflectometry is now a commonly used technology that aids in assessing the integrity of
cables and wiring with effective fault localization, particularly with intermittent faults such
as open or short circuits. Reflectometry methods send a high-frequency signal down the line,
which reflects back at impedance discontinuities. The location of the fault is then determined
by the phase shift between the incident and reflected signals. However, caution is urged
when using these methods, as little is known on the impedance profile of intermittent faults
(with the exception of open and short circuits).
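
For the closely related time-domain variant, the distance to the discontinuity follows directly from the round-trip delay of the reflected pulse; the sketch below shows that calculation, with the cable velocity factor and the measured delay chosen purely as assumptions.

```python
# Minimal sketch of the distance calculation behind time-domain reflectometry
# (TDR): a pulse reflects at an impedance discontinuity and the fault distance
# follows from the round-trip delay. Cable parameters here are assumptions.

C = 299_792_458.0  # speed of light in vacuum, m/s

def fault_distance_m(round_trip_delay_s: float, velocity_factor: float = 0.7) -> float:
    """Distance to the impedance discontinuity for a measured round-trip delay."""
    propagation_speed = velocity_factor * C           # signal speed in the cable
    return propagation_speed * round_trip_delay_s / 2.0  # divide by 2: out and back

if __name__ == "__main__":
    # e.g., a reflection observed 120 ns after the incident pulse (assumed value)
    print(f"Estimated fault location: {fault_distance_m(120e-9):.1f} m from the test point")
```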

Complementary metal-oxide semiconductor (CMOS) integrated circuits (ICs) are routinely
tested using supply current monitoring based upon the knowledge that a defective circuit
will produce a significantly different amount of current than is found in fault-free circuits.
Damaged solder points, a prime culprit of NFF, affect supply current and are notoriously
difficult to detect without extensive visual inspections. They do, however, produce large
variations in thermal resistance, which can be used as a potential method for monitoring
solder joint fatigue inside of the packaging of power modules. For example, the development
of a new solder-joint fault sensor would provide the ability to monitor selected I/O pins of
powered-off field programmable gate arrays (FPGAs). RF impedance can also be used as a
failure precursor, offering linear increases in impedance as damage increases, whereas the
dc resistance becomes constant.

9.2.4 Monitoring Life-Cycle Loads


The life-cycle environment of a product consists of manufacture, storage, handling,
operating, and non-operating conditions. Life-cycle loads, such as the examples given in
Table 9.2, are the mechanisms by which physical/performance degradation is accelerated,
reducing service life. Suppliers and operators, particularly within the airline industry,
spend significant resources attempting to determine root causes of NFF events. Yet without
measured field conditions, any root cause analysis can be problematic, and capturing this
information poses even more significant challenges, requiring additional specific sensing
equipment and data loggers.

Measuring parameters that may include vibration, temperature, power supply, functional
overload, and air pressure can help in understanding how those environmental factors
impact upon a particular failure. It is argued that NFF events would
be reduced by the ability to prioritize the order of components replaced during a fault
event based on probabilities established through life-cycle load monitoring. Life-cycle load
information also can aid in enhancing fault testing at the maintenance stations, by providing
valuable information relating to the systems operation that may be replicable during test.

Table 9.2 Examples of Life-Cycle Loads


Load / Load Conditions
Thermal • Steady-state temperature
• Temperature ranges, cycles, gradients
• Ramp rates
• Heat dissipation
Mechanical • Pressure magnitude / altitude
• Pressure gradient
• Vibration
• Shock load
• Acoustic level
• Strain
• Stress
Chemical • Aggressive versus inert environment
• Humidity level
• Contamination (fuel, oil, water)
• Ozone
• Pollution
Physical • Radiation
• Electromagnetic interference
Electrical • Current
• Voltage
• Power
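
One common way of turning measured life-cycle loads such as those in Table 9.2 into a health estimate is a cumulative damage calculation; the sketch below applies Miner's linear damage rule, with the load bins and allowable cycle counts invented purely for illustration.

```python
# Hedged sketch: Miner's linear damage rule turns counted life-cycle load
# cycles into a cumulative damage fraction. The load bins and the allowable
# cycles-to-failure below are invented for illustration only.

def miners_damage(observed_cycles: dict[str, int], cycles_to_failure: dict[str, int]) -> float:
    """Cumulative damage fraction; a value approaching 1.0 suggests end of life."""
    return sum(n / cycles_to_failure[level] for level, n in observed_cycles.items())

if __name__ == "__main__":
    observed = {"thermal_cycle_20C": 1_200, "thermal_cycle_60C": 150, "vibration_burst": 40}
    allowable = {"thermal_cycle_20C": 50_000, "thermal_cycle_60C": 4_000, "vibration_burst": 1_000}
    print(f"Estimated cumulative damage: {miners_damage(observed, allowable):.3f}")
```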

9.3 Improvements to Testing Abilities


9.3.1 Testability as a Design Variable
Standard electronics bench top testing makes use of what is commonly referred to as
automatic test equipment (ATE). ATE usually includes features like timing, signal strength,
duplicating the operating environment, loading, fan out, and correctly interconnecting the
UUT. The idea of ATE is to demonstrate the functionality of the UUT with varying input
conditions; the ability to do this is directly related to the systems testability [9-3]. Testability
is a design-related characteristic which, if designed well, offers the potential capability to
confidently and efficiently identify existing faults.

Testability as a design characteristic is usually approached from the bottom up, with
component and board-level testability built in, but with very little attention given to the
isolation of individual units within the full system. The number of tests and the information
content of test results, along with the location and accessibility of test points, define the
testability potential of the equipment. The two attributes that must be met for testability
success are:

• Confidence: this is achieved by frequently and unambiguously identifying only the


failed components or parts, with no removals of good items.
• Efficiency: this is achieved by minimizing the resources required to carry out the tests
and overall maintenance action. This includes minimal yet optimized man-hours, test
equipment, and training.

Conventional ATE methods used within a maintenance line, as required from the testability
design, are not always successful—if they were, then NFF would not be such a problem.
They do not always carry the necessary levels of confidence and efficiency, and may even
be inappropriate, leading again to many industries suffering NFF difficulties. This is
particularly evident in the case of attempting to detect and isolate intermittent faults at the
test station. The ability to test for short-duration intermittency at the very moment that it
reoccurs, using conventional methods, is so remote that it will almost certainly result in a
NFF [9-4]. The one major issue with designing component testability is that the focus is on
functionality and integrity of the system, and not on the ATE being tested.

For testability to be consistent within the design process, and to achieve the necessary levels
of confidence and efficiency, standard definitions, procedures, and tools must be
developed. A testability evaluation should not only provide any necessary predictions, but
also supply redesign information when testability attributes are predicted to be below
acceptable levels [9-5]. Four testability attributes can be identified (a sketch of how they
might be computed follows the list):

• Fraction of faults detected (FFD): Ideally this should be 100%. Any fault not detected by
either the BIT or ATE can result in total loss of system integrity and hence loss of full
functionality. In reality, some faults that are not safety/mission critical can be tolerated,
and so a FFD less than 100% may be acceptable when designing for testability.
• Fraction of faults isolated (FFI): If a detected failure is not isolated quickly and
efficiently with high confidence levels, then the system may end up being kept out of
operation for significant periods of time. The result of this is pressure on maintenance
personnel, who are then likely to adopt the “shotgun” approach of speculative LRU
replacements, adding pressure and complications to the sparing and logistics processes,
and increasing life-cycle costs. Appropriate measures of FFI include mean time to fault
isolation (MTFI), mean time to repair (MTTR), and rates of NFF.
• Fraction of false alarms (FFA): This is complementary to FFD and should ideally be
as close to zero as possible. High FFA will also lead to maintenance pressures and the
“shotgun” effect.
• Rate of false alarm (RFA): This is a measure of the rate at which detected faults result
in a false alarm upon investigation. It is computed as a time-normalized sum of false
alarms, where the normalization is either calendar time or operating hours.
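
The sketch below shows how these attributes might be computed from maintenance records; the input fields, the FFA denominator (all fault indications), and the use of operating hours to normalize RFA are assumptions made for illustration.

```python
# Illustrative computation of the four testability attributes from summary
# counts; field names and normalization choices are assumptions only.

def testability_metrics(faults_present, faults_detected, faults_isolated,
                        false_alarms, operating_hours):
    ffd = faults_detected / faults_present if faults_present else 0.0
    ffi = faults_isolated / faults_detected if faults_detected else 0.0
    total_indications = faults_detected + false_alarms
    ffa = false_alarms / total_indications if total_indications else 0.0
    rfa = false_alarms / operating_hours if operating_hours else 0.0  # per operating hour
    return {"FFD": ffd, "FFI": ffi, "FFA": ffa, "RFA": rfa}

if __name__ == "__main__":
    print(testability_metrics(faults_present=100, faults_detected=92,
                              faults_isolated=80, false_alarms=15,
                              operating_hours=5_000))
```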

If it is suspected that a NFF has occurred due to a lack of fault coverage by the ATE or BIT, then
there would be a requirement to use additional tools that are capable of identifying the root
cause of the problem. To achieve this, an understanding of the physics of the actual failures of
an asset within the operating environment is needed. Once this is known, appropriate test
equipment can be selected to support the ATE through interpretation of the physics of the failure.

9.3.2 Functional and Integrity Testing

The idea behind most standard ATE tests is to ascertain whether or not the UUT is able to
perform its required task—this is what we have referred to as functional testing. Functional
testing checks that a unit is in an acceptable working state. It can also provide much
information on the health and integrity of an asset. As the integrity of the system decreases
from a fault-free system to one with a hard fault, impact on cost and availability of that
system is significant, as illustrated in Table 9.3. Functionality testing stems from a one-flight-
at-a-time mentality (i.e., is it airworthy and serviceable for the next flight?) An availability-
focused maintenance strategy, however, results in large investments in fatigue life and
component life extension programs. It is argued that if appropriate systems integrity testing
was integrated with the process of confirming serviceability (system functionality), then
this would lead to enhanced levels of sustained systems availability. To underline this point,
Cockram and Huby [9-6] define the following:

Functional Test + Integrity Test = Improved Availability

Table 9.3 System Integrity vs. Cost Impact [9-7]

System Integrity / Cost Impact
100% (fault free): Robust design / Low: Minor scheduled maintenance
Intermittent / High: Repeat arisings, Phantom supply chain, Phantom maintenance policy
0% Hard Fault / Low: Stranger; Medium: Repeater; High: Runner

On several occasions throughout this book, the relationship between NFF and intermittent
faults has been mentioned. When circuits are tested one at a time, or just a few circuits
at a given time, then unless the intermittent fault occurs within the time window of the
test, as illustrated in Figure 9.3, the fault will go undetected. This testing blind spot, which
is compounded further by digital averaging of results, means that conventional testing
equipment does not provide effective test coverage for intermittency, one of the major
drivers for NFF.

Figure 9.3 An example of a missed intermittent event during testing.
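
To indicate the size of this blind spot, the sketch below assumes (purely for illustration) that intermittent events arrive as a Poisson process and computes the probability that at least one event falls inside the test window; short windows make detection very unlikely.

```python
# Illustrative sketch (assumed Poisson model, not from the book): probability
# that at least one short intermittent event occurs inside the test window,
# showing why a brief one-circuit-at-a-time test so often misses intermittency.

import math

def p_event_during_test(events_per_hour: float, test_window_s: float) -> float:
    """P(at least one intermittent event during the test), Poisson assumption."""
    rate_per_s = events_per_hour / 3600.0
    return 1.0 - math.exp(-rate_per_s * test_window_s)

if __name__ == "__main__":
    for window_s in (1, 10, 60, 600):
        p = p_event_during_test(events_per_hour=0.5, test_window_s=window_s)
        print(f"{window_s:>4} s test window: P(catch intermittent) = {p:.4f}")
```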

As a way of overcoming the issue of missing intermittent events due to the size of the test
window, a sensitive analyzer was introduced by the company Universal Synaptics, located
in Ogden, Utah, to simultaneously monitor test lines for voltage variation. Conducting a test
for intermittency across many simultaneous connections, using what has been termed an
analog neural network process, provides increased probability of detecting an intermittent
fault. Combining this with the reduction in the time taken to complete the test (because the
testing is performed for multiple points simultaneously, rather than testing one connecting
line at a time) means that exploiting analog neural-network equipment to detect and
eradicate intermittent faults in electrical and electronic aerospace components is potentially
one of the most effective test methodologies, as the overall test coverage is several orders
of magnitude higher than that of other methods. The approach is illustrated in Figure 9.4.

To address the NFF/intermittency problems, other alternatives, which try to use traditional
measurements, include methods such as tracking and comparing circuits down to fractions
of a milliohm, one-circuit at a time, against long-running records of similar measurements.
However, this approach has some major limitations. When an intermittent circuit is in
a temporary “working” state, it will generally pass such tests, and only those circuits
approaching hard-failure status will be detected this way. Also, measuring “fractions
of a milliohm” and attempting to take meaningful action based on these values is
extremely difficult, time-consuming, and requires precise control in the test set-up and test
environment. Appropriate test equipment is required to address the massive intermittent/
NFF issue, and to resolve all of the variables causing this unpredictability, to provide the
maintainer with a quick and comprehensive route to a successful outcome. Overcoming the
testing challenge posed by NFF/intermittency problems requires a different approach to that
of using conventional digital equipment predicated on accuracy of measurements and time-
consuming results analysis.

Figure 9.4 Automatic test equipment compared to analog neural network for intermittent
fault detection. (used with permission [9-7])

A variety of high-profile integrity testing methods are currently being championed. Most
notable of these are the use of X-ray and thermal imaging. X-ray inspections can highlight
shorts or coupling faults buried within the layers of multilayer printed circuit boards non-
invasively. Automated inline systems based on X-ray transmission have several advantages
over optical inspection. Optical inspections are restricted to surface inspection of visible
solder joints. Consequently, J-leads and ball grid arrays cannot be inspected by optical
means. More sophisticated features concerning the solder volume, fillet, voids, and solder
thickness can be determined reliably only by X-ray transmission. Therefore, the use of X-ray
inspection generally results in better test performance in terms of false alarm rate and
escape rate, and it is to be favored for closed-loop process control.

The use of infrared imaging for nondestructive evaluation of electrical component integrity
is also a well-known practice. The basic principle of using infrared imaging as an integrity
test is that faulty connections and components in an energized circuit will begin to heat up
before they fail. For many electrical components, such as resistors and capacitors, the build-
up of heat will be entirely normal, but for many others the build-up of heat—or even a
lack of heat—will indicate a problem.

9.3.3 Testing Under Environmental Conditions


Testing under laboratory conditions is not always the best way to test for failure. You can never
guarantee that a fault will manifest itself outside of the specific environmental conditions
experienced in service. Examples of this could include when the temperature widely fluctuates,
or a stress is applied in the form of vibration—conditions which will not normally be present
during laboratory testing. Most products will undergo environmental testing to prove their
reliability and robustness under the most extreme operating conditions as part of their
certification process, but a more subtle set of environmental tests can also be used as part of
the maintenance process, which tries to simulate a more normal mode of operation.

In effect, three main environmental conditions should be controlled for a good diagnostics
test: humidity, vibration, and temperature. However, testing standards do not require these
environmental factors to be applied together. Temperature and humidity will fluctuate with
variables such as altitude, time of year, and current weather patterns. Vibration is
dependent upon such things as smoothness
of roads/runways, location in the vehicle, and the vehicle activity (e.g., a fighter aircraft
cruising or in a battle scenario). These three conditions can be simulated with relative ease
through the use of market-available environmental chambers. An often overlooked area
when considering an environmental test is the orientation of the UUT, when embedded
within its operating platform. The orientation can mean that differing components are more
affected by vibration than if the UUT was in a different position. So the orientation of the
UUT should be a consideration when undergoing environmental testing.

9.3.4 Management of the Test Station


Test station capability is measured by how successfully the station can measure the functionality
of UUTs. This relies on a high level of confidence in the test station instruments and
procedures. In many cases, these, along with test results for an individual UUT, are not stored
and analyzed for signal and measurement trends. Such trends could be particularly useful
in identifying how different test results are produced, depending on which test station is
used. Within an organization’s maintenance operation, certain stations may perform better
than others in terms of reducing NFF events, even though the configuration of the individual
stations may appear identical. Factors that contribute to this include technician/engineer
perceptions, as discussed under human influence in chapter 3, instrument compatibility,
reliability, calibration, and health. In many cases, test engineers will intuitively know that one
station operates better than another, but will not have any idea just how much this compounds
the NFF rate, as this information for each station will not be quantified.

Relatively easy methods can be implemented to monitor test station configuration for the
purpose of test station-based NFF trending. These include the monitoring of instruments
and parts by serial numbers, either automatically or manually. This enables information to
be stored so that the test instrument used for a specific UUT can be identified and traced,
to see whether a higher number of NFF events is attributed to that test equipment’s serial number.
The benefits of this approach would be best realized in test facilities where instruments
are frequently swapped between stations. NFF can also be caused by inappropriate test
limit criteria, such as out-of-calibration factors. The data collected at test stations should be
used to correlate test failures with that station’s testing parameters, to check for near- or
out-of-limit calibration values.
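
A minimal sketch of such serial-number trending is shown below; the record layout and the flagging rule (an NFF rate well above the average across instruments) are assumptions chosen for illustration.

```python
# Sketch of test-station NFF trending (record layout is an assumption): count
# NFF outcomes per test-instrument serial number and flag any serial whose NFF
# rate is well above the average across the monitored instruments.

from collections import defaultdict

def nff_rate_by_serial(test_records):
    """test_records: iterable of (instrument_serial, outcome), outcome 'NFF' or 'fault confirmed'."""
    totals, nffs = defaultdict(int), defaultdict(int)
    for serial, outcome in test_records:
        totals[serial] += 1
        if outcome == "NFF":
            nffs[serial] += 1
    return {s: nffs[s] / totals[s] for s in totals}

def flag_suspect_instruments(test_records, factor=1.5):
    rates = nff_rate_by_serial(test_records)
    mean_rate = sum(rates.values()) / len(rates)
    return [s for s, r in rates.items() if r > factor * mean_rate]

if __name__ == "__main__":
    records = [("ATE-001", "NFF"), ("ATE-001", "fault confirmed"), ("ATE-001", "NFF"),
               ("ATE-002", "fault confirmed"), ("ATE-002", "fault confirmed"),
               ("ATE-003", "NFF"), ("ATE-003", "NFF"), ("ATE-003", "NFF")]
    print("Suspect test equipment serials:", flag_suspect_instruments(records))
```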

Figure 9.5 shows the high-level maintenance process for aircraft when NFFs are encountered.
The diagram also highlights the areas where test station configuration data and test result
data could be logged. This helps to investigate and understand discrepancies between the
field, LRU, and SRU test stages, to significantly reduce NFF events.

Figure 9.5 The repair process.

9.3.5 Tracking Spare Part Units


The ability to recognize rogue units is of paramount importance in mitigating the effects
of NFF events, and ensuring operating safety, particularly in the case of aircraft. The key
to distinguishing a rogue unit is to implement the necessary procedures to track suspected
rogue units by serial number showing the date installed and removed, the platform on
which the unit was installed, the number of operating hours/cycles, the number of hours
since its last overhaul, and a solid reason for the generated removal codes. In addition to
this, the history of the operating platform, be that a wind turbine, aircraft, or train, must be
recorded with an easy-to-use retrieval system. The importance of this type of historical data
is to aid in determining the exact effects the failure has on the overall system and whether
the replacement of the unit offers a high level of confidence of rectifying the problem.

Some airlines operate within a spare parts pool in which the policy is that if a unit is
returned to the pool labeled NFF more than three times (for example), then that unit will
be scrapped. This has advantages and disadvantages. The advantage is that the spare parts
pool will become less polluted with units that are NFF rogues, but at the same time, it also
provides the disadvantage of encouraging the culture of accepting NFF, and not searching
out the root cause. That root cause may be a fundamental manufacturing flaw present in
equivalent units, such as a batch of faulty capacitors that have been used in the unit’s production. Or,
likewise, it could be a system design flaw leading to integration faults, as discussed in
chapter 8. Either way, scrapping units in this way will inevitably lead to an increase in costs.
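
The sketch below illustrates one possible record structure for this kind of tracking; the field names are assumptions, and the three-NFF-returns threshold mirrors the example policy mentioned above.

```python
# Sketch of a rogue-unit tracking record (fields mirror those suggested above;
# the three-NFF-returns threshold is the example policy mentioned in the text).

from dataclasses import dataclass, field

@dataclass
class UnitHistory:
    serial_number: str
    removals: list = field(default_factory=list)  # each: dict with platform, removal_code, outcome

    def nff_count(self) -> int:
        return sum(1 for r in self.removals if r.get("outcome") == "NFF")

    def is_suspected_rogue(self, threshold: int = 3) -> bool:
        return self.nff_count() >= threshold

if __name__ == "__main__":
    unit = UnitHistory("SN-4711")
    for tail in ("G-ABCD", "G-EFGH", "G-ABCD"):  # invented platform tail numbers
        unit.removals.append({"platform": tail, "removal_code": "BIT fail", "outcome": "NFF"})
    print("Suspected rogue:", unit.is_suspected_rogue())
```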

Other airlines operate differently, and routinely tag and track units that are returned with
similar reported fault symptoms multiple times. These tagged units are then subjected to
special testing that would not usually be required, such as thermal shock and environmental
tests. Units tagged as rogue are also tracked by the tail number of the aircraft from which
they came. Technicians then monitor and track repetitive serial numbers using specialized
tools to help determine if the unit is a repetitive problem, or if the problem is fundamentally
an issue with the aircraft. In the case of airlines that are contracted into a spare parts pool
used by several airlines, the lack of “tracking by design” of units suspected of being rogue
means that an airline has no information regarding any unit that they take from the pool.

Advanced tracking methods have begun to gain popularity, particularly in the aircraft
industry, based upon RFID tracking for predictive maintenance and to support the tracking
of MRO work in progress [9-8]. In the repair process, multiple operations are conducted to
repair a complex engineered machine, such as an engine, which would include dismantling,
inspection, repairing, maintenance, and reassembling. Tracking and tracing of the status
of these processes and operations provides critical information for decision making. This
tracking and tracing is often performed manually, but the adoption of RFID as an automatic
identification technology has the potential to speed up processes, reduce recording errors,
and provide critical part history. The use of RFID technology to track units within a spare
parts pool, providing full service histories to the current user, also offers the ability to
reduce the number of NFF events by allowing visualization of rogue units in the spare parts
pool—reducing costs attributed to phantom supply chains.

9.4 Conclusion
Once the scale, impact, and causes of NFF have been identified, the problem of how to
address the NFF issue must be tackled. Two approaches have been defined: the first is
to implement new mitigation processes and procedures, and the second is to adopt and
implement new technology. This chapter has provided an overview of some of the more
advanced technological solutions in the battle against NFF. In general, these technologies
can be categorized as either advanced diagnostics or improvements to testing abilities.
Advanced diagnostics covers enhanced BIT, diagnostic reasoning, and health and usage
monitoring as well as the monitoring of life-cycle loads. The essence of advanced diagnostics
is to identify failure precursors, and to discover how faults propagate through a system, to
provide effective information on the probable failure type and root cause to the maintainer
for fault isolation.

The second category of improving testing abilities looks at ensuring that a system is
designed in a way that makes it accessible to testing, and that the test coverage across
the unit under test is maximized through design for testability methods. Extending
standard functional testing to include integrity testing would allow for weak points, such
as interconnects, to be checked for degradation, identifying potential failures before they
occur. Likewise, implementing environmental testing based on measured life-cycle loads
would allow a more realistic testing scenario to be devised. Improvements to testing abilities,
however, do not just include the introduction of new test equipment. They should also
extend to introducing technology to monitor and manage workstations, test equipment, and
spare parts to identify any erroneous elements in the maintenance chain.

All of these technologies currently exist and with dedicated effort can be developed to a
standard to be incorporated into aircraft systems and productive maintenance environments.
Without a step-change in technology, the problem of NFF, along with many other diagnostic
issues, will continue to plague equipment manufacturers, operators, and maintainers for a
long time to come.

9.5 References
9-1. Ungar, L. Y., and L. V. Kirkland. “Unravelling the cannot duplicate and retest ok
problem by utilising physics in testing and diagnoses.” In: AUTOTESTCON, 2008.
(550–555). IEEE, 2008.

9-2. Ungar, L. Y. “Design for diagnosability guidelines.” Instrumentation and Measurement
Magazine, IEEE, 11 no. 4 (2008): 24–32.

9-3. Simpson, W. R., and J. W. Sheppard. “System complexity and integrated diagnostics.”
IEEE Design & Test of Computers 8 no. 3 (1991): 16–30.

9-4. Qi, H., S. Ganesan, and M. Pecht. “No-fault-found and intermittent failures in
electronic products.” Microelectronics Reliability 48 no. 5 (2008): 663–674.

9-5. Sheppard, J. W., and W. R. Simpson. “Applying testability analysis for integrated
diagnostics.” Design & Test of Computers, IEEE, 9 no. 3 (1992): 65–78.

9-6. Cockram, J., and G. Huby. “No fault found (NFF) occurrences and intermittent
faults: improving availability of aerospace platforms/systems by refining
maintenance practices, systems of work and testing regimes to effectively identify
their root causes.” In: Proceedings of CEAS European Air and Space Conference, 2009.

9-7. Huby, G., and J. Cockram. “The system integrity approach to reducing the cost
impact of no fault found and intermittent faults.” In: UKRAeS Airworthiness and
Maintenance Conference, 2010.

9-8. Khan, S., P. Phillips, C. Hockley, and I. Jennions. “No Fault Found events in
maintenance engineering Part 2: Root causes, technical developments and future
research.” Reliability Engineering & System Safety 123 (2014): 196–208.

Chapter 10
Summary and Ideas
for Future Work

This book has covered a lot of ground and has hopefully introduced a number of new
ideas. Before closing, some of the ideas and issues raised are summarized under each
chapter. The commentary links ideas together from different parts of the book, and
seeks to show how the various contributions reinforce each other.

Introduction. The book opens with a general introduction to the background to NFF.
It begins with the general operating environment, and highlights the evolution of
maintenance, the NFF phenomena itself, and its growth in aerospace. Documented
growth in journal publications shows the rise of interest in NFF. To complete the chapter,
a short overview of the subject of cost is included. From a number of symposia on NFF,
held at Cranfield University over the last few years, the cost of NFF has been found
to be excessively high, and hence provides the main driver for further research and
publications such as this book.

Basics and Clarification of Terminology. Taxonomy is a constant source of problems,
in which a number of TLAs (three-letter acronyms) can stand for any process, fault,
operation, and maintenance action—leading to confusion and ambiguity. This
is especially true in NFF, and this chapter tried to set out the consensus view on
terminology and concepts to be used in the rest of the book, suggesting its adoption by
the general community. A nomenclature is brought together for ease of reference.

The Human Influence. While technology can provide the theoretical solution to a
number of NFF problems, it rests with a number of individuals, especially maintainers,
to enable it in difficult and often ambiguous situations. The organizational context,
including communications, is examined, followed by an exploration of NFF events
caused by humans. These include interaction with software, hardware, and the
environment. In most cases, better procedures and diagnostic equipment can
significantly ameliorate these effects. The chapter closes with a research case study into
the interaction between maintenance engineers, off-board test equipment, and integrated
maintenance procedures.

Availability in Context. This and the following chapter cover the two facets of system
effectiveness—availability and safety. Availability is a hugely important factor for an
operator, because the broad division between uptime and downtime can mean the
difference between profit and loss for a business. This chapter explores a number of different
definitions for availability, from various viewpoints, which enable designers, logisticians,
and maintainers to all see how they can contribute. NFF is then tied into these scenarios by
contributing to the asset’s downtime. To conclude, a process for improvement is suggested,
by following the idea of a Unit Removals Database.

Safety Perceptions. This is a contentious subject, as opinion is divided as to whether or not
NFF is connected to safety. The view put forward here is based on not being able to find a
root cause for a NFF, and hence having an airplane flying around with an (unknown) fault
on board. The argument is taken up at length with reference to the airworthiness regulations
and the “dirty dozen.” The latter are the drivers of human behavior that cause maintenance
errors, most of which can also cause NFFs. The chapter includes a case study.

Operating Policies for Management Guidance. This chapter looks at the NFF problem
from a through-life engineering services context (i.e., across the whole life—from design
through manufacture to disposal—of the asset). This leads to consideration of the many
organizational interactions that are necessary to produce complex assets, and hence the
issues that arise and need controlling through operating policies. The control process
adopted is that outlined in ARINC 672, which proposes a simple, four-step scheme for
eradicating the sources and causes of NFF through these different organizations. An
example is given to illustrate an effective reduction process.

A Benchmarking Tool for NFF. Coupled with the previous chapter, this chapter presents
an awareness of the generic benefits and needs associated with managing NFF. After the
benefits and challenges are enumerated, a methodology for managing NFF is proposed.
Originating from quality management, this tool enables companies to benchmark their NFF
capability and assess what is needed to advance their maturity, and achieve more benefit, in
dealing with NFF.

Improving System and Diagnostic Design. Throughout this book, it has been argued that if
the design was right, with test equipment designed to spot any deviation from design intent,
then we would not have the NFF problem. While this somewhat naïve viewpoint is true,
this chapter does not blame the designers, but rather the lack of information and knowledge
feedback from in-field experience. It looks at the relationship between a design and NFF—
how designs can be made more amenable to testing, root cause analysis, and user interaction
with the system.

Technologies for Reducing NFF. Previous chapters have dealt with modifying processes
and procedures as part of a continuous improvement program. In contrast, this chapter goes
to the heart of the NFF problem and proposes some technical solutions that can be used.
These are divided broadly into two parts. The first category considers the use of enhanced
diagnostics to overcome the shortcomings of traditional techniques, proposing incorporation
of HUMS concepts into electrical systems as well as enhanced understanding of BIT codes.
The second category considers how to improve testing at maintenance stations. This includes
testability as a design variable, as well as functional and integrity testing.

Naturally, not all of the available subject space has been explored in the book. With an
acceptance of what has been said in individual chapters, the following is a list of ideas of
areas for future work, either pragmatically within organizations, or addressing research
issues.

Management

• Cost / benefit analysis of reducing NFF. There is still much denial of NFF as a problem.
At the first-line maintenance level, there is anecdotal evidence and understanding, but
no real awareness of the underlying costs. Further up the management chain, there is
a certain denial of the costs, while at the top of the management chain, there is usually
blissful ignorance. Costs throughout the support chain are extremely difficult to capture,
and so there is a pressing need to establish evidence (an analysis) of the real costs of
NFF throughout the supply chain.
• The need for standards, regulations, and guidance. Standards can be useful, providing
they have something to contribute rather than providing an oppressive and restrictive
environment. Regulations are necessary if a link with safety can be proven. Guidance
is always useful, but it must be focused and useful. This book has tried to offer up
a common nomenclature, which in its own way would help to clarify some of these
matters.
• Culture. Too many organizations do not accept the waste of time and resources
involved with NFF. So much can be done to deliver improvements in through-life
support if there is an acceptance of the costs involved and a positive attitude to make
the necessary changes. However, this culture change requires the potential costs
involved to be established and the benefits articulated.
• What is a valid metric for NFF? No standardized metric is used for NFF. Current
practices may be focused on developing statistics for cost savings (less industry returns),
wasted man-hours, and spare part availability, among others. The use of one such
statistic will not capture the entire picture, whereas multiple statistics can cloud the
picture.
• What figure do you use as a safety metric?
• How do you evaluate a figure for spare part requirements in NFF?
• How do you evaluate the expenditure of time (i.e., man-hours, test preparation
times, transit times, etc.)?
• Can all these statistics be identified into a single “NFF Metric,” which identifies a
reliable figure for the impact of NFF?

• Some additional areas, which have not been addressed in this book, could be of benefit
to organizations:

1. Parts quarantine process


2. Improved decision support information to the troubleshooter, such as:
• TSI for LRUs
• Average fleet-wide TSI information
• MTBUR and MTBF trend information for components
• S/N level shop records for possible causing LRUs
3. Reliability department improved analytics tools to support:
• S/N level tracking of shop findings for major part categories
• Correlation of S/N shop findings with component removal maintenance
actions and observed aircraft faults
Technical

• Formulating a NFF theory. NFF is quantifiable in terms of statistics, usually relating
to wasted man-hours or the occurrence of maintenance faults resulting in a NFF.
But what has not been studied and formulated
into a physical and mathematical framework are the underlying scientific mechanisms
that describe NFF. Research over the past two years has indicated that NFF is, in
fact, a subset of reliability theory/engineering, but whereas this is well established
mathematically, the equations to describe No Fault Found have not yet been developed.
The target area for reduction of No Fault Found has been cited, both in academia
and industry, as being rooted in improving equipment design. A mathematical
formalization of the NFF theory would allow this to be realized in a similar way to
design for reliability, maintainability, or testability.
• Understanding intermittent faults. It is clear that intermittent fault occurrences are a
major technical cause of NFF events. It is also clear that there is a lack of fundamental
understanding of intermittency in electronics. This relies on the ability to describe the
various diagnostic interactions accurately—how mechanical, software, and electronic
elements work together. In some industries, adopting better prognostics has ensured
that important operational parameters are monitored at all times to identify adverse and
out-of-limits variations. Such technology has helped to introduce a change from a policy
of reactive maintenance to a predictive one, which provides information on the root
causes of failures.
• The connection with HUMS/IVHM and prognostics must be researched and
developed. It seems obvious that using HUMS information to aid in fault diagnosis will
help reduce NFF occurrences. However, this requires more research and evidence. The
natural link, then, is to relate prognostics to NFF, so that faults can be removed before
they manifest themselves as potential intermittent faults. Prognostics are designed
to warn of an impending failure before it actually occurs. This allows the item to be
removed and replaced at a convenient time rather than allowing the item to run to
failure, which may occur at an inconvenient time. Further research is necessary to
establish the reliability of such prognostic removals so that they do not contribute to the
numbers of NFF.

• The impact of increased use of condition monitoring systems on NFF. With the move
toward more automated fault detection and diagnostic systems such as condition
monitoring, an ever-increasing number of dedicated sensors will be integrated into
systems for this task. If, for any reason, these sensors do not provide the correct
information, items will be removed unnecessarily, found to be serviceable, and
categorized as NFF. Research is therefore required to answer two questions:
• Is there a link between condition monitoring and NFF, either positive or negative?
• How do we ensure verification and validation (V&V) of the sensors, and the
reliability of the information they provide, so that false removals are prevented?
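
As a minimal illustration of the statistical quantification referred to under "Formulating
an NFF theory" above, two descriptive metrics discussed elsewhere in this book, the NFF
rate and the mean time between NFF events (MTB NFF), can be written as follows; the
symbols are introduced here purely for this sketch:

\[
\text{NFF rate} = \frac{N_{\mathrm{NFF}}}{N_{\mathrm{rem}}}, \qquad
\text{MTB\,NFF} = \frac{T_{\mathrm{op}}}{N_{\mathrm{NFF}}}
\]

where N_NFF is the number of unscheduled removals confirmed as NFF in the period, N_rem is
the total number of unscheduled removals, and T_op is the cumulative fleet operating time
over the same period. A formal NFF theory, as argued above, would need to go beyond such
descriptive ratios and model the mechanisms (intermittency, limited test coverage, human
factors) that generate the events being counted.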
In conclusion, this book set out to address a topic of growing importance in aerospace:
reported faults for which a root cause cannot be found, that is, No Fault Found. It is the
first book of its kind and, as such, reflects an emerging field whose progress will overtake
the ideas expressed here in due course. It is nevertheless important to document where we
stand today, to make suggestions for future progress, and to enliven the debate around this
most important subject. In this, we hope we have succeeded.

Index

Acceptance test procedure, 106
Achieved testability, 151
Active fault, 20
Administrative and logistic delay time (ALDT), 69–70, 74
Aerospace, 24
growth of NFF within, 4–7
maintenance practices, 62–64
NFF perspective, 29f
through-life engineering services in, 103f
Air Accident Investigation Board (AAIB), 97
Air Transport Association (ATA), 13–14
Aircraft maintenance manuals, 52–53
use for fault diagnosis, 53f
Aircraft testing resources, 50–52
Aircraft Type Certification, 87, 89
Annex I, 90
Anomalous behavior, 1
ARINC 672, 6–7, 102, 111–122
case study, 122–125
As low as reasonably practical (ALARP), 90
Automatic test equipment (ATE), 166
compared to analog neural network, 169f
Availability, 9
and aerospace maintenance practice, 62–64
definition of, 22, 68
and design for maintenance and system effectiveness, 66–67
design requirements for RAM, 71–73
impact of NFF on, 73–77
introduction to, 61–62, 176
metrics for, 68–69, 73–76, 78–79
multiple facets of, 67–71
process for improvement, 77–81
and quality of maintenance systems, 64–66

Behavior, 18
Benchmarking
benefits of, 132
proposed tool, 132–144, 176
Best practice guidelines, 58–59
Boeing 747, 4
Boeing 777, 92
Boeing 787 Dreamliner, 14
Built-in test (BIT), 47–48, 96–97, 106, 160–163
code diagnostics, 163
enhanced understanding of system/fault topology, 162–163
Built-in-test equipment (BITE), 96–97
example of functions, 161f

Cannot duplicate (CND), 24–25
Capability spider plot, 142, 143f
Case studies
ARINC 672, 122–125
impact of inconsistent terminology, 30–31
NFF and air safety, 97–98
Categories, 113
Chronic component, 124
Civil aircraft, 4–5, 91–92
maintenance practices, 62–64
and safety, 61, 65, 92
typical maintenance processes in, 46–47
Classification, NFF, 11f, 12f, 23–30
Commercial challenges of investigating NFF, 131–132
Communication, 42–44
Competence, 55–56
reasons for lack of, 56f
Complexity, 18
Concurrent reporting, 43
Condition monitoring, 179
Consistency, 43–44
Continuous BIT (C-BIT), 161
Continuous improvement, 127, 133
Corrective maintenance, 21
Cost–benefit analysis of reducing NFF, 177
Cost effectiveness, 77
Costs, 130
maintenance, estimating, 79
of NFF, 9, 13–14
vs. system integrity, 167t

Dash-8 Q400, 97
Data, required, 122–123
Data management, 8
Datasheets, unit removal, 80–81
Defense Standard 00-42 part 4, 151
Depth support, 30
Design
design and development phases, 20
for diagnosis, 152
diagnostics, 145–148, 176
information feedback to, 153–154
for maintenance, 66–67
for maintainability (DfM), 22
system
and NFF, 145–146, 149–150
and user interaction, 155
for testability (DfT), 23, 151, 166–167
Diagnosis, designing for, 152
Diagnostic failure, 27, 28, 29–30
Diagnostics, 8
advanced, 160–165
built-in test (BIT), 47–48, 96–97, 106, 160–163
built-in-test equipment (BITE), 96–97, 161f
design of
and NFF, 146–148, 176
information feedback to, 153–154
health and usage monitoring of electrical systems, 160
monitoring and reasoning of failure precursors, 163–165
monitoring life-cycle loads, 165
Direct maintenance cost (DMC), 79
“Dirty dozen,” 94–96, 98
Domains, 113
Dupont, Gordon, 94

Electrical systems, health and usage monitoring of, 160
Electronics, failure precursors for, 164t
Entities, 18
Environment, maintenance engineer interactions with, 49
Environmental conditions, testing under, 169–170
Equipment failure, 27
Equipment fault, 27
Establish candidates, 113
Establish source, 113
Executive level, 135
Experience, lack of, 45–46

F-16, 14
Failure modes and effects analysis (FMEA), 66, 153
Failure precursors, monitoring, 163–165
Failures
in-service, 153
types of, 19
underlying conditions, 107
Fault avoidance, 21
Fault confirmed, 10–11
Fault coverage, 23
Fault detection, 21, 161
Fault diagnosis, 21
Fault investigation, traditional approach, 9–10
Fault isolation (FI), 21, 152, 160
Fault not found (FNF), 24, 87
Fault not indicated (FNI), 23
Fault propagation, 162
Fault recovery, 21
Fault reproducibility, 20
Fault tolerance, 21
Fault topology, 162–163
Faults
intermittent, 106, 168, 169f
understanding, 178
latent, 20
and maintenance errors
diagnostic maintenance success, 96–97
human factors contribution, 93–96
maintenance contribution, 91–92
operational pressure, 92–93
and safety, 86–87
secondary, 161
types of, 20–21
Feedback
to diagnostic design, 153–154
inadequate, 44
in-service, 147–148
Field service representatives (FSRs), 154
Forward support, 30
Fraction of false alarms (FFA), 167
Fraction of faults detected (FFD), 166
Fraction of faults isolated (FFI), 167
Function, 18
Functional specification, 18
Functionality testing, 167–169
Future work, ideas for, 177–179

General operating environment, 2f
Gold values, 123–124
Guidelines, best practices, 58–59
“Guidelines for the Reduction of No Fault Found” see ARINC 672

Hard fault, 20
Hardware, maintenance engineer interactions with, 47–48
Harrier case study, 30–31
Health and usage monitoring systems (HUMS), 160, 178
Hessberg, Jack, 64, 91
Hidden failures, 4–5
Human error at depth, 27
Human error at first line, 27
Human error fault, 20
Human factors, 7, 8, 39–46, 106, 175–176
best practice guidelines, 58–59
communication, 42–44
consistency, 43–44
contribution to faults, 93–96
definition of, 40
discrepancy in terminology, 44
feedback, 44
lack of experience, 45–46
maintenance engineer and system interactions, 46–49
operational pressure, 45
organizational context, 40–42
preparing accurate reports, 43
survey of, 49–58
aircraft maintenance manuals, 52–53
aircraft testing resources, 50–52
competence and training, 55–58
organizational pressures, 53–55
training, 45, 56–58, 110–111, 154
Human root cause, 27

IEEE, 151
Information, typical flow of, 43
Infrared imaging, 169
Inherent availability, 22
Inherent testability, 151
Initial maturity level, 134, 136–137t
Input output processor (IOP), 97–98
In-service feedback, 147–148
Integrated vehicle health management (IVHM), 2–3, 178
Integration phase, 20
Integrity, 18
system, 149–150
Integrity testing, 167–169
Intermittent faults, 20, 106, 168, 169f, 178
Interruptive BIT (I-BIT), 161
Intrinsic availability, 22
Investigation form, 125

Latent fault, 20
Latent root cause, 28
Life-cycle loads, monitoring, 165
Line replaceable unit (LRU), 9–12, 28, 73–74, 78, 79, 80, 88, 124

Maintainability, 22
vs. reliability, 75f
Maintenance
contribution to faults, 91–92
costs of, estimating, 79
definition of, 64
designing for, 66–67
diagnostic, contribution to faults, 96–97
evolution over time, 4f
historical perspective, 3
simplified repair process, 46f
terminology, 21–23
typical processes in civil aircraft, 46–47
Maintenance echelon, 21
Maintenance engineers
interactions with system, 46–49
competence and training, 55–58
perceptions on ability to use BIT, 52f
Maintenance errors, 94–95
Maintenance line, 21
Maintenance plan, three levels of, 41
Maintenance, repair, overhaul (MRO) providers, 63–64
Maintenance Steering Group (MSG), 4
Maintenance systems, quality of, 64–66
Management, 177–178
first-line, 110
middle-level, 110
operating policies, 101–126
role and responsibilities of, 109
top-level, 109–110
Manuals, aircraft maintenance, 52–53
Maturity model, 133–142
managing maturity level, 134, 137–139t
optimizing maturity level, 140–141t
Mean active corrective maintenance time (MACMT), 69
Mean active repair time (MART), 69
Mean time before critical failure (MTBCF), 22
Mean time before unscheduled removal (MTBUR), 22
Mean time between failures (MTBF), 22
Mean time between no fault found (MTB NFF), 78
Mean time between no fault found–confirmed external, 79
Mean time to repair (MTTR), 22
Merlin helicopter, 97
Military aircraft, 91–92
and safety, 61, 65
Military Aviation Authority (MAA), 90
MIL-M-24100, 151
MIL-STD 2165, 150
MIL-STD-H-2165, 151
MIL-STD-Hdbk-2165, 151
Mitigation plan, 142
Monitoring NFF in-service, 80

NFF–confirmed external, 79
NFF–confirmed not LRU, 10–11
NFF–faulty, 11–12
NFF–not faulty, 11, 12
No fault found (NFF)
acronyms associated with, 23–25
ARINC 672 control process, 6–7, 102, 111–122
background, 1–9, 175
benefits of managing, 127–129
case studies
ARINC 672, 122–125
impact of inconsistent terminology, 30–31
NFF and air safety, 97–98
challenges of investigating, 130–132
classification, 23–30
classification frameworks, 11f, 12f
cost of, 9, 13–14
definition of, 6–7, 30
descriptions of phenomena, 26t
and diagnostics design, 146–148, 176
example causes during repair process, 47t
formulating theory of, 178
growth within aerospace, 4–7
impact on availability, 73–77
and maintenance, historical perspective, 3
maturity model, 133–142
mean time between no fault found (MTB NFF), 78
methodology for monitoring in-service, 80
mitigation policy requirements, 108–111
nomenclature, 33–36
proactive approach to, 128–129
process for improving at design stage, 77–81
proposed benchmarking tool, 132–144, 176
deployment, 143–144
maturity model, 133–142
maturity model scoring matrix, 133–142
mitigation plan, 142
summary, 144
visual capability, 142
reducing, 108
advanced diagnostics for, 160–165
cost/benefit analysis of, 177
improvements to testing, 166–172
reduction process, 107f, 112f
relevant literature, 8–9
and safety, 89f
case study, 97–98
scope and limits of events, 108
sources/causes and recommended remedial actions, 114–121t
structured approach to, 101–126
example application, 122–125
and system design, 145–146, 149–150, 176
technologies for reducing, 159–172, 176
advanced diagnostics, 160–165
improvements to testing, 166–172
terminology, 5, 9–12
classification, 23–30
failure and types of failure, 19
fault and types of fault, 20–21
introduction, 17, 175
maintenance and related terms, 21–23
no fault found, 23–33
nomenclature, 33–36
related terms, 31–33t
system basics, 18
through-life engineering services, 102–108
traditional approach to, 9–10
troubleshooting expected, 108–109
types of, 88f
valid metric for, 177
No trouble found (NTF), 23
Noise corruption, 163
Nomenclature, 33–36
Nugatory troubleshooting efforts, 27, 28, 29

On-condition maintenance, 22
Operating policies, 101–126, 176
application example, 122–125
implementation prerequisites, 122–123
mitigation policy requirements, 108–111
Operation, maintenance, and support phases, 20
Operational level, 135
Operational pressure, 45
contribution to faults, 92–93
Operational requirements, 72f
Operator error, 27
Organizational context, 40–42
Organizational culture, 177
Organizational pressures, 53–55
effects of time and pressure, 54f
factors leading to lack of time, 55f
Original equipment manufacturers (OEM), 2–3, 106, 108, 154

Part M, 90
Personnel practices, 110–111
Physical root cause, 27
Post-failure event, 19
Pre-failure event, 19
Preparing accurate reports, 43
Preventive maintenance, 21
Product service systems (PSS), 2

Quality, 23

Rate of false alarm (RFA), 167
Reason, James, 93
Reflectometry, 164
Regulations, need for, 177
Regulatory issues, 89–91
Reliability, 22
Reliability, availability, and maintainability (RAM), design
requirements for, 71–73
Reliability centered maintenance (RCM), 4
Reliability Enhancement Methodology and Modeling (REMM)
project, 6
Repairability, 23
Reporting, 110
Re-test OK (RTOK), 23, 24–25
Retrospective verbalization, 43
RFID tracking, 172
Rogue units, 108, 124, 171
Root cause, 6–7, 27–29
human, 27
latent, 28
physical, 27

Safety, 23, 176
case study, 97–98
commercial vs. military aircraft, 61, 65
conceptual discussion, 87–89
and faults, 85–87
link with maintenance errors, 91–97
regulatory issues, 89–91
Secondary faults, 161
Select solution, 122
Service, 18
Shop replaceable units (SRUs), 28
Software, maintenance engineer interactions with, 48–49
Spare parts, tracking, 171–172
Stakeholder interactions, 103f, 104f, 105t
Stakeholders, 18
Standardization, 7
Standards
need for, 177
testability, 151
Survey of human factors, 49–58
System defect, 27
System design, 8, 102
and NFF, 145–146, 149–150, 176
and user interaction, 155
System effectiveness, 61, 65
System error checking, 160–161
System integrity, 149–150
System interactions, 46
environment, 49
hardware, 47–48
software, 48–49
System life cycle, 18
System maintenance phase, 102–103
System operation phase, 102–103
Systems, definitions of, 18

Tactical level, 135
Technical challenges of investigating NFF, 131
Technical systems, dependence on, 1
Terminology
discrepancy in, 44
failure and types of failure, 19
fault and types of fault, 20–21
inconsistent, impact of, 30–31
introduction, 17, 175
maintenance and related terms, 21–23
no fault found, 23–33
nomenclature, 33–36
related terms, 31–33t
system basics, 18
Test and measuring equipment (TME), maintainers’ competency, 51f
Test station, management of, 170
Testability, 23, 150–151, 166–167
Testing
functional and integrity, 167–169
improvements to, 160, 166–172
test station management, 170
testability as a design variable, 166–167
under environmental conditions, 169–170
Thresholds, alarm, 163
Through-life engineering services, 102–108
Tools and techniques (T&T), 135, 136–141t
Training, 56–58, 110–111
additional needs, 58f
inadequate, 45
level of, 154
survey data overview, 57f
Transient fault, 20
Triangulation, 163
Trouble-not-isolated (TNI), 23
Troubleshooting tools and techniques, 124–125

UK-SPEC (UK Standard for Professional Engineering Competence), 55–56
Unit removal datasheet (URD), 80–81
Unit under test (UUT), orientation of, 170
Unscheduled removals, 5, 6
User interaction, 155

Wiring harness, 150

X-ray inspection, 169

About the Authors
Samir Khan
Dr. Samir Khan has been a lecturer in aerospace engineering
at Coventry University since 2015. He completed his PhD
in control theory at Loughborough University in 2010. From
2011 to 2015 he was the lead researcher on the No Fault Found
research project at the Through-life Engineering Services
Centre at Cranfield University, collaborating with Rolls-Royce,
Jaguar Land Rover, BAE Systems, and the MoD. Prior to this role, he worked at Thales
Transportation as a systems engineer, where he was responsible for performing fault
analysis and condition monitoring from track-side feedback sensors. Dr. Khan’s current
research is focused on intelligent monitoring of intermittent failures and false
alarms in electronic systems. He is a chartered engineer and a member of IEEE and IET.

Paul Phillips
Dr. Phillips has over 10 years of research and development
experience, beginning at The University of Manchester,
where he focused his research on electromechanical system
failures and the development of condition monitoring
solutions. After completing his EngD in mechanical
engineering at the University of Manchester, he worked
as a postdoctoral researcher before joining the EPSRC
Centre for Through-life Engineering Services at Cranfield
University as the Project Manager in 2011. His role was instrumental in establishing and
growing research activities dedicated to the study of No Fault Found. Since 2014, Dr.
Phillips has held the position of head of advanced projects at UTC Aerospace Systems,
Marston, UK, where he is responsible for the management and leadership of engine and
environmental control systems related research and development.

Chris Hockley
Chris Hockley joined Cranfield University in 2003 after 35 years
in the Royal Air Force (RAF), where he specialized in aircraft
maintenance, including commanding an engineering wing,
providing support for two aircraft squadrons. He served in the
MoD in several appointments before joining the department
responsible for improving the reliability and maintainability
(R&M) of defence equipment. He completed a Defence Fellowship
to study R&M and has commanded the RAF’s R&M policy
department. Mr. Hockley is a chartered engineer who is currently the principal investigator
at the EPSRC Centre for Innovative Manufacturing in Through-Life Engineering Services
for the NFF project, seeking to reduce the occurrences of NFF in all industries. His
main research interests are in health and usage monitoring systems, prognostics health
monitoring, condition based monitoring, and delivering availability and support contracts.

Ian K. Jennions
Ian K. Jennions is a professor and director of the IVHM Centre,
Cranfield University, UK. He joined the Centre, which is funded
by a number of industrial partners, when it was founded in
2008 and has led its development and growth in research and
education since then. Previously, Jennions had worked for a
number of companies in the gas turbine industry over a 40-year career. He worked for Rolls-Royce, General Electric, and
Alstom in a number of technical roles, gaining experience in
aerodynamics, heat transfer, fluid systems, mechanical design, combustion, and, more
recently, IVHM. He has a mechanical engineering degree and a PhD in CFD, both from
Imperial College, London. He is a Director of the PHM Society, vice-chair of the SAE IVHM
Steering Group and contributing member of the HM-1 IVHM committee, and a Fellow of
IMechE, RAeS, and ASME. He is also the editor of five SAE books on IVHM.
