
Fundamentals of Reliability and Maintainability
Second Edition

Steven Davis
This book was written by a U.S. Government employee as part of his
normal duties; therefore, it is in the public domain.

Dedication
The second edition of this book is dedicated to the memory of my good
friend and roommate at Georgia Tech, Joseph Lester “Joey” Wilson
(Textiles, 1978), Feb. 27, 1956 — Jan. 3, 2004. Rest in peace, Joey.

Dr. Robert Abernethy begins his New Weibull Handbook with an excerpt from Oliver Wendell Holmes' "The Deacon's Masterpiece." I thought it appropriate to begin with the entire poem—a classic 19th century work on reliability.

The Deacon's Masterpiece
or,
The Wonderful "One-Hoss Shay"
A Logical Story
by Oliver Wendell Holmes

HAVE you heard of the wonderful one-hoss shay,
That was built in such a logical way
It ran a hundred years to a day,
And then, of a sudden, it—ah, but stay,
I'll tell you what happened without delay,
Scaring the parson into fits,
Frightening people out of their wits,—
Have you ever heard of that, I say?

Seventeen hundred and fifty-five.
Georgius Secundus was then alive,—
Snuffy old drone from the German hive.
That was the year when Lisbon-town
Saw the earth open and gulp her down,
And Braddock's army was done so brown,
Left without a scalp to its crown.
It was on the terrible Earthquake-day
That the Deacon finished the one-hoss shay.

Now in building of chaises, I tell you what,
There is always somewhere a weakest spot,—
In hub, tire, felloe, in spring or thill,
In panel, or crossbar, or floor, or sill,
In screw, bolt, thoroughbrace,—lurking still,
Find it somewhere you must and will,—
Above or below, or within or without,—
And that's the reason, beyond a doubt,
That a chaise breaks down, but doesn't wear out.
But the Deacon swore (as Deacons do,
With an "I dew vum," or an "I tell yeou ")
He would build one shay to beat the taown
'n' the keounty 'n' all the kentry raoun';
It should be so built that it couldn' break daown
"Fur," said the Deacon, "'t's mighty plain
Thut the weakes' place mus' stan' the strain;
'n' the way t' fix it, uz I maintain,
Is only jest
T' make that place uz strong uz the rest."

So the Deacon inquired of the village folk
Where he could find the strongest oak,
That couldn't be split nor bent nor broke,—
That was for spokes and floor and sills;
He sent for lancewood to make the thills;
The crossbars were ash, from the straightest trees,
The panels of white-wood, that cuts like cheese,
But lasts like iron for things like these;
The hubs of logs from the "Settler's ellum,"—
Last of its timber,—they could n't sell 'em,
Never an axe had seen their chips,
And the wedges flew from between their lips,
Their blunt ends frizzled like celery-tips;
Step and prop-iron, bolt and screw,
Spring, tire, axle, and linchpin too,
Steel of the finest, bright and blue;
Thoroughbrace bison-skin, thick and wide;
Boot, top, dasher, from tough old hide
Found in the pit when the tanner died.
That was the way he "put her through."
"There!" said the Deacon, "naow she'll dew!"

Do! I tell you, I rather guess
She was a wonder, and nothing less!
Colts grew horses, beards turned gray,
Deacon and deaconess dropped away,
Children and grandchildren—where were they?
But there stood the stout old one-hoss shay
As fresh as on Lisbon-earthquake-day!

EIGHTEEN HUNDRED;—it came and found
The Deacon's masterpiece strong and sound.
Eighteen hundred increased by ten;—
"Hahnsum kerridge" they called it then.
Eighteen hundred and twenty came;—
Running as usual; much the same.
Thirty and forty at last arrive,
And then come fifty, and FIFTY-FIVE.

Little of all we value here
Wakes on the morn of its hundredth year
Without both feeling and looking queer.
In fact, there 's nothing that keeps its youth,
So far as I know, but a tree and truth.
(This is a moral that runs at large;
Take it.—You're welcome.—No extra charge.)

FIRST OF NOVEMBER,—the Earthquake-day,—
There are traces of age in the one-hoss shay,
A general flavor of mild decay,
But nothing local, as one may say.
There couldn't be,—for the Deacon's art
Had made it so like in every part
That there wasn't a chance for one to start.
For the wheels were just as strong as the thills,
And the floor was just as strong as the sills,
And the panels just as strong as the floor,
And the whipple-tree neither less nor more,
And the back-crossbar as strong as the fore,
And spring and axle and hub encore.
And yet, as a whole, it is past a doubt
In another hour it will be worn out!

First of November, 'Fifty-five!
This morning the parson takes a drive.
Now, small boys, get out of the way!
Here comes the wonderful one-hoss shay,
Drawn by a rat-tailed, ewe-necked bay.
"Huddup!" said the parson.—Off went they.
The parson was working his Sunday's text,—
Had got to fifthly, and stopped perplexed
At what the—Moses—was coming next.
All at once the horse stood still,
Close by the meet'n'-house on the hill.
First a shiver, and then a thrill,
Then something decidedly like a spill,—
And the parson was sitting upon a rock,
At half past nine by the meet'n'-house clock,—
Just the hour of the Earthquake shock!
What do you think the parson found,
When he got up and stared around?
The poor old chaise in a heap or mound,
As if it had been to the mill and ground!
You see, of course, if you 're not a dunce,
How it went to pieces all at once,—
All at once, and nothing first,—
Just as bubbles do when they burst.

End of the wonderful one-hoss shay.
Logic is logic. That's all I say.

Acknowledgements

I would like to thank my R&M mentor, Seymour Morris, now at Quanterion Solutions Inc. and the Reliability Information Analysis Center
(RIAC). Seymour served as an R&M consultant during my time in the
Loader Program Office; he provided advice on R&M requirements and
testing, Life Cycle Cost (LCC) analysis, and copies of numerous R&M
articles. Although I have never taken a formal class taught by Seymour, I
have learned more about R&M from him than from all of the formal R&M
classes that I have ever taken.

I would like to thank my supervisor, Cynthia Dallis, for allowing me the time and providing the support and encouragement to write this book.

I would like to thank my co-worker in the Loader Program Office, Molly Statham, who learned R&M along with me.

I would like to thank Tracy Jenkins, our co-op at the time; she performed
the research and wrote the original version of Case Study 3: Fire Truck
Depot Overhaul.

I would like to thank Glenn Easterly, Director, Georgia College & State
University at Robins AFB, for his kind review of the first edition of this
book.

Finally, I would like to thank my co-workers, especially Valerie Smith, for their review and comments on the various drafts and the first edition of this book.

Fundamentals of Reliability & Maintainability (R&M)

Table of Contents
Introduction ............................................................................................... 1

1: Reliability.............................................................................................. 2

2: Maintainability..................................................................................... 6

3: Availability............................................................................................ 8

4: Introduction to Reliability Math: The Exponential Distribution. 18

5: Reliability Analyses............................................................................ 30

6: Reliability Growth Test (RGT)......................................................... 36

7: Reliability Qualification Test (RQT)................................................ 51

Case Study 1: Integrated Suitability Improvement Program (ISIP) 57

Case Study 2: BPU Reliability Feasibility Analysis ............................ 68

Case Study 3: Fire Truck Depot Overhaul .......................................... 77

Appendix 1: Developing a Textbook Reliability Program ................. 84

Appendix 2: Example R&M Requirements Paragraphs.................... 85

Appendix 3: Summary of χ2 Models ..................................................... 97

Appendix 4: Fractiles of the χ2 Distribution ........................................ 98

Appendix 5: Factors for Calculating Confidence Levels .................. 100

Appendix 6: Redundancy Equation Approximations Summary..... 102

Appendix 7: Summary of MIL-HDBK-781A PRST Test Plans ...... 104


Appendix 8: Summary of MIL-HDBK-781A Fixed-Duration Test Plans ............ 105

Appendix 9: Glossary........................................................................... 106

Fundamentals of Reliability & Maintainability (R&M)

Introduction

Why Study R&M?

The Foreword to Reliability Toolkit: Commercial Practices Edition, published in 1995, states:

The reliability world is changing: no longer are the commercial and military industrial approaches distinct. For
years the military has had its advocates for the use of
commercial off-the-shelf (COTS) equipment and
nondevelopmental items (NDI), but now military use of
commercial designs is required. The June 1994 Secretary of
Defense William Perry memorandum officially changes the
way the military develops and acquires systems. Military
standards and specifications are out (except with a waiver)
and commercial practices are in.1

With the acquisition reform of the mid-1990s, which culminated in the referenced “Perry memo,” the military emphasis on such subjects as
reliability and maintainability was greatly reduced, especially for such
quasi-commercial items as vehicles and ground support equipment. After
all, as the reasoning went, we have been directed to buy commercial or
modified commercial if possible; if the commercial suppliers have no
quantitative reliability and maintainability programs, how can the Air
Force require that they establish R&M programs for us?

However, the need for R&M did not go away. Poor reliability and
maintainability cause mission aborts, increase maintenance costs, and
reduce end item availability, leading to both operator and maintainer
frustration.

1 Reliability Toolkit: Commercial Practices Edition. Rome NY: Reliability Analysis Center (RAC) (now the Reliability Information Analysis Center (RIAC)), 1995, p. i.
Chapter 1: Reliability
The classic definition of reliability is “the probability that an item can
perform its intended function for a specified interval under stated
conditions.”2 The reliability of repairable items is traditionally measured
by mean time between failure (MTBF): “A basic measure of reliability for
repairable items. The mean number of life units during which all parts of
the item perform within their specified limits, during a particular
measurement interval under stated conditions.”3 This is calculated by:

MTBF = T / n,

where:

T is the number of life units and
n is the number of failures.

T is most commonly measured in operating hours, but can be any measure of “life units,” such as miles (or kilometers) or cycles.

A repairable item is “an item which, when failed, can be restored by corrective maintenance to an operable state in which it can perform all required functions.”4 Corrective maintenance is discussed in chapter 2.

The reliability of non-repairable systems is measured by mean time to failure (MTTF): “A basic measure of reliability for non-repairable items. The total number of life units of an item population divided by the number of failures within that population, during a particular measurement interval under stated conditions.”5 It follows that MTTF is calculated by:

2 MIL-HDBK-470A, “Designing and Developing Maintainable Products and Systems, Volume I and Volume II,” 4 Aug 1997, p. G-15, definition (2).
3 Ibid., p. G-11.
4 MIL-HDBK-338B, “Electronic Reliability Design Handbook,” 1 Oct. 1998, p. 3-17.
5 Ibid., p. 3-13.
MTTF = T / n,

where:

T is the number of life units and
n is the number of failures.
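Both point estimates reduce to a single division. As a minimal sketch in Python (the operating hours and failure count below are hypothetical, not drawn from any example in this book):

    def mean_life(life_units: float, failures: int) -> float:
        """Point estimate of MTBF (repairable items) or MTTF
        (non-repairable items): life units divided by failures."""
        if failures == 0:
            raise ValueError("no failures observed; use a confidence bound instead")
        return life_units / failures

    # Hypothetical fleet data: 12,000 operating hours and 8 relevant,
    # chargeable failures.
    print(mean_life(12_000, 8))  # 1500.0 hours

Note the zero-failure guard: with no failures the point estimate is undefined, which is one reason chapter 4 introduces confidence levels.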

Note that before we can calculate either MTBF or MTTF, we must know
the number of failures; therefore, we must define “failure.”

Failure

The classic definition of failure is “the event, or inoperable state, in which any item or part of an item does not, or would not, perform as previously specified.”6 Note the “would not” part of the definition. It does not matter whether the item actually failed to perform its function; all that matters is that it would not perform its function.

For the purposes of reliability calculations, only relevant, chargeable failures are included. A relevant, chargeable failure is most easily defined as a failure that is not non-relevant and not non-chargeable.

A non-relevant failure is a failure caused by installation damage; accident or mishandling; failure of the test facility or test-peculiar instrumentation; an externally applied overstress condition, in excess of the approved test requirements; normal operating adjustments (non-failures) specified in the approved technical orders; a dependent failure within the test sample, directly caused by a non-relevant or relevant primary failure; or human error.7

6 MIL-HDBK-470A, op. cit., p. G-5.
7 MIL-STD-721C, “Definition of Terms for Reliability and Maintainability”, 12 Jun 1981 (since cancelled), p. 4.
A dependent failure (also known as a secondary failure) is “a failure of
one item caused by the failure of an associated item(s). A failure that is
not independent.”8

A non-chargeable failure is a non-relevant failure; a failure induced by Government furnished equipment (GFE); or a failure of parts having a specified life expectancy and operated beyond the specified replacement time of the parts (e.g., wear out of a tire when it has exceeded its specified life expectancy).9

A relevant, chargeable failure, hereafter referred to as simply a failure, is, therefore, any failure other than a non-chargeable failure, as defined above.

Basic versus Mission Reliability

Note that no distinction has been made with regard to the severity of a failure: a failure with minor consequences counts the same as one that causes the destruction of the entire end item. This is basic reliability, also known as logistics reliability, since any failure will place a demand on the logistics system, whether limited to maintenance labor, maintenance labor plus one or more replacement parts, or replacement of the end item. A critical failure is defined as “a failure or combination of failures that prevents an item from performing a specified mission.”10 Note that this, in turn, requires a definition of a mission, which inherently cannot be defined by a reliability textbook. It is incumbent upon the user (the customer, or the warfighter) to define the mission.

Mission reliability, or critical reliability, is measured by mean time between critical failures (MTBCF), which is calculated by:

MTBCF = T / nc,

where:

T is the number of life units and
nc is the number of critical failures.

8 MIL-HDBK-338B, op. cit., p. 3-6.
9 MIL-STD-721C, op. cit., p. 4.
10 MIL-HDBK-338B, op. cit.

One common misperception is that redundancy improves reliability. Redundancy is defined as:

The existence of more than one means for accomplishing a given function. Each means of accomplishing the function need not necessarily be identical. The two basic types of redundancy are active and standby.

Active Redundancy - Redundancy in which all redundant items operate simultaneously.

Standby Redundancy - Redundancy in which some or all of the redundant items are not operating continuously but are activated only upon failure of the primary item performing the function(s).11

Redundancy improves critical reliability, but it reduces basic reliability, as it adds components that can fail.12 Redundancy also increases cost, weight, package space, and complexity. Redundancy, therefore, is not a panacea; it has its place; indeed, frequently it is necessary. However, due to the costs involved, redundancy should be used judiciously.
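The trade-off is easy to quantify under the constant failure rate assumption introduced in chapter 4. A brief sketch (the unit failure rate and mission length are hypothetical) comparing a single unit with a one-of-two active redundant pair:

    import math

    lam = 100e-6   # hypothetical unit failure rate, failures/hour
    t = 1000.0     # hypothetical mission length, hours

    # Single unit: mission reliability for t hours.
    r_single = math.exp(-lam * t)

    # Two active redundant units, one of two required: the function is
    # lost only if both fail, so critical (mission) reliability improves...
    r_mission = 1 - (1 - math.exp(-lam * t)) ** 2

    # ...but both units can fail and demand maintenance, so the basic
    # (logistics) failure rate doubles.
    lam_basic = 2 * lam

    print(f"single unit: R = {r_single:.4f}")   # 0.9048
    print(f"one-of-two:  R = {r_mission:.4f}")  # 0.9909
    print(f"basic failure rate: {lam:.1e} vs {lam_basic:.1e} per hour")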

11 Ibid., p. 3-16.
12 Reliability Toolkit: Commercial Practices Edition, op. cit., p. 41.
Chapter 2: Maintainability
The classic definition of maintainability is:

The relative ease and economy of time and resources with which an item can be retained in, or restored to, a specified condition when maintenance is performed by personnel having specified skill levels, using prescribed procedures and resources, at each prescribed level of maintenance and repair.13

Note that this definition includes “retained in” as well as “restored to.”

The “retained in” portion of maintainability is addressed by preventive maintenance (PM), which is defined as “all actions performed in an attempt to retain an item in specified condition by providing systematic inspection, detection, and prevention of incipient failures.”14 This includes, but is not limited to, scheduled maintenance, which is defined as “periodic prescribed inspection and/or servicing of products or items accomplished on a calendar, mileage, or hours operation basis.”15

The “restored to” portion of maintainability is addressed by corrective maintenance (CM), defined as:

All actions performed as a result of failure, to restore an item to a specified condition. Corrective maintenance can include any or all of the following steps: Localization, Isolation, Disassembly, Interchange, Reassembly, Alignment, and Checkout.16

A related term is unscheduled maintenance, defined as “corrective maintenance performed in response to a suspected failure.”17

13 MIL-HDBK-470A, op. cit., p. G-8, definition (1).
14 Ibid., p. G-14.
15 Ibid., p. G-15.
16 Ibid., p. G-3.
17 Ibid., p. G-17.
The corrective portion of maintainability is typically measured by mean
time to repair (MTTR) (also known as mean repair time (MRT)), defined
as: “the sum of corrective maintenance times at any specific level of
repair, divided by the total number of failures within an item repaired at
that level during a particular interval under stated conditions.”18 MTTR is
calculated by:

MTTR = CMT / n,

where:

CMT is the corrective maintenance time and
n is the number of failures.

Corrective maintenance time, also known as repair time, is “the time spent
replacing, repairing, or adjusting all items suspected to have been the
cause of the malfunction, except those subsequently shown by interim test
of the system not to have been the cause.”19
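As a worked illustration of the MTTR formula (the repair times below are hypothetical):

    # Hypothetical corrective maintenance times, in hours, for five failures.
    repair_times = [1.5, 0.75, 3.0, 2.25, 1.0]

    cmt = sum(repair_times)   # total corrective maintenance time
    n = len(repair_times)     # number of failures repaired
    print(f"MTTR = {cmt / n:.2f} hours")  # MTTR = 1.70 hours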

A maintenance event is defined as “one or more maintenance actions required to effect corrective and preventative maintenance due to any type of failure or malfunction, false alarm, or scheduled maintenance plan.”20 A maintenance action is defined as “an element of a maintenance event. One or more tasks (i.e., fault localization, fault isolation, servicing, and inspection) necessary to retain an item in or restore it to a specified condition.”21

18 Ibid., p. G-11.
19 Ibid., p. G-15.
20 Ibid., p. G-9.
21 Ibid.
Chapter 3: Availability
The classic definition of availability is “a measure of the degree to which
an item is in an operable and committable state at the start of a mission
when the mission is called for at an unknown (random) time.”22 It is
calculated by:

Availability = Uptime / (Uptime + Downtime),

where:

Uptime is the total time that the product is in customer’s possession and works and
Downtime is the total time that the product is inoperable.23

Availability is normally expressed as a percentage.

Note that the formula above is only a notional equation for availability:
depending on the exact definitions used for Uptime and Downtime, there
are four distinct measures of availability, and several additional variations
of the most frequently used variety, operational availability.

Inherent availability (Ai) is “a measure of availability that includes only the effects of an item design and its application, and does not account for effects of the operational and support environment.”24 It is calculated by:

22 Ibid., p. G-2.
23 Reliability Toolkit: Commercial Practices Edition, op. cit., p. 11.
24 MIL-HDBK-470A, op. cit., p. G-7.
Ai = MTBF / (MTBF + MTTR),

where:

MTBF is mean time between failure and
MTTR is mean time to repair.25

Achieved availability (Aa) is similar to Ai except that it includes both corrective and preventive maintenance. It is calculated by:

Aa = MTBM / (MTBM + MTTRactive),

where:

MTBM is mean time between maintenance (see operational availability, below, for further discussion) and
MTTRactive is mean time to repair (corrective and preventive maintenance).26

Operational availability (Ao) “extends the definition of Ai to include delays due to waiting for parts or processing paperwork in the Downtime parameter (MDT).”27 A common equation for calculating operational availability is:

Ao = MTBM / (MTBM + MDT),

where:

MTBM is mean time between maintenance and
MDT is mean downtime.28

25 Reliability Toolkit: Commercial Practices Edition, op. cit., p. 12.
26 Ibid.
27 Ibid.

MTBM is a term that is frequently misunderstood and, therefore, frequently misused. It is defined as “the total number of life units expended by a given time, divided by the total number of maintenance events (scheduled and unscheduled) due to that item.”29 It follows that MTBM is calculated by:

MTBM = T / m,

where:

T is the number of life units and
m is the number of corrective and preventive maintenance events.

Note that MTBM includes both scheduled and unscheduled maintenance events, that is, both corrective and preventive maintenance. However, it is frequently used as if it were mean time between unscheduled maintenance, MTBUM, and, essentially, a synonym for MTBF.

The MIL-HDBK-470A definition of mean downtime (MDT) is “the average time a system is unavailable for use due to a failure. Time includes the actual repair time plus all delay time associated with a repair person arriving with the appropriate replacement parts.”30 Note that this definition includes only corrective maintenance downtime, while preventive maintenance downtime should also be included in the operational availability equation above. The MIL-HDBK-338B definition of downtime is “that element of time during which an item is in an operational inventory but is not in condition to perform its required function,”31 which includes preventive maintenance downtime, as the item “is not in condition to perform its required function” while it is undergoing preventive maintenance. Active time is “that time during which an item is in an operational inventory.”32

28 Ibid.
29 MIL-HDBK-470A, op. cit., p. G-11.
30 Ibid., p. G-10.

Mean downtime is calculated by:

MDT = DT / m,

where:

DT is the total downtime and
m is the number of corrective and preventive maintenance events.

If we expand the operational availability equation above with the equations for MTBM and MDT, we have:

Ao = MTBM / (MTBM + MDT) = (T/m) / (T/m + DT/m) = T / (T + DT).

This gives an alternate but equivalent equation for operational availability and leads to a common variation:

31 MIL-HDBK-338B, op. cit., p. 3-5.
32 MIL-HDBK-470A, op. cit., p. G-1.
Ao = (T + ST) / (T + ST + DT),

where:

T is the number of life units,
ST is standby time, and
DT is the total downtime.

Recall our initial notional equation for availability:

Availability = Uptime / (Uptime + Downtime),

where:

Uptime is the total time that the product is in customer’s possession and works and
Downtime is the total time that the product is inoperable.33

In the first equation for operational availability, the item is credited with
Uptime only when it is being operated; in the second equation, the item is
credited with Uptime when it is operated and also when it is in standby—
ready to operate, but not in operation. Note that this increases both the
numerator and the denominator by the same amount; therefore, operational
availability including standby time is always greater than operational
availability without standby time. Also, note that for the case that ST=0
both equations for operational availability are equivalent.
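A short sketch, using hypothetical figures, shows how the same data yield different operational availability values depending on whether standby time is credited:

    T = 1_000.0   # operating hours (life units) in the period
    ST = 500.0    # standby hours: ready to operate, but not operating
    DT = 100.0    # total downtime hours in the period
    m = 20        # corrective plus preventive maintenance events

    mtbm = T / m   # mean time between maintenance
    mdt = DT / m   # mean downtime

    ao_operating = mtbm / (mtbm + mdt)      # equivalently T / (T + DT)
    ao_standby = (T + ST) / (T + ST + DT)   # credits standby time as Uptime

    print(f"Ao without standby time: {ao_operating:.1%}")  # 90.9%
    print(f"Ao with standby time:    {ao_standby:.1%}")    # 93.8%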

The upper limit of standby time is reached if we define standby time as all
time other than operating time and downtime:

33 Reliability Toolkit: Commercial Practices Edition, op. cit., p. 11.
STmax = CT − (T + DT),

where:

CT is calendar time,
T is the number of life units, and
DT is the total downtime.

In this variation, the item is credited with Uptime unless it is in Downtime. This can be expressed:

Ao = (CT − DT) / CT = 1 − DT / CT,

where:

CT is calendar time and
DT is the total downtime.

This form of operational availability is known as mission capability (MC) rate for Air Force aircraft and vehicle in-commission (VIC) rate for Air Force vehicles. In Air Force On-Line Vehicle Interactive Management System (OLVIMS) terminology, downtime is vehicle out-of-commission (VOC) time, which is further broken into vehicle-down-for-maintenance (VDM) time and vehicle-down-for-parts (VDP) time. Thus:

DT = VOC = VDM + VDP.

Let’s examine the effect of utilization rate (UR) on both variations of operational availability. Utilization rate is “the planned or actual number of life units expended, or missions attempted during a stated interval of calendar time.”34 It is calculated by:

UR = T / CT,

where:

T is the number of life units and
CT is calendar time.

34 MIL-HDBK-470A, op. cit., p. G-17.

Utilization rate is also typically expressed as a percentage.

The first equation for operational availability without standby time is:

Ao = MTBM / (MTBM + MDT).

If we assume that, for any given item, neither MTBM nor MDT is a
function of the utilization rate (a common assumption, approximately true
in many cases), it follows that Ao without standby time is also not a
function of the utilization rate.

However, for operational availability with maximum standby time:

Ao = 1 − DT / CT.

Since UR = T/CT, CT = T/UR. Also, since MDT = DT/m, DT = m × MDT. Finally, since MTBM = T/m, m = T/MTBM. Substituting in the previous equation for DT:

DT = m × MDT = (T/MTBM) × MDT = (T × MDT)/MTBM.

Substituting the above expressions for CT and DT into the equation for Ao with maximum standby time:

Ao = 1 − DT/CT = 1 − [(T × MDT)/MTBM] / (T/UR) = 1 − UR × MDT/MTBM.

Thus, if neither MTBM nor MDT is a function of the utilization rate, operational availability with maximum standby time (e.g., VIC rate or MC rate) is a function of the utilization rate, decreasing linearly as the utilization rate increases.
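To see the sensitivity, a brief sketch (the MTBM and MDT values are hypothetical) evaluates this last equation over a range of utilization rates:

    mtbm = 50.0   # hypothetical mean time between maintenance, hours
    mdt = 12.0    # hypothetical mean downtime, hours

    for ur in (0.05, 0.10, 0.20, 0.40):
        ao = 1 - ur * mdt / mtbm   # Ao with maximum standby time
        print(f"UR = {ur:.0%}: Ao = {ao:.1%}")
    # UR = 5%: Ao = 98.8%    UR = 10%: Ao = 97.6%
    # UR = 20%: Ao = 95.2%   UR = 40%: Ao = 90.4%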

If, however, you were to examine the Air Force vehicle reliability,
maintainability, and availability data as reported in OLVIMS, you would
find that the above equation does not hold. While, in general,
increasing the utilization rate decreases the VIC rate, the reported VIC rate
ranges from 85% to 95% with rare exception. Why? In practice, when the
utilization rate of a particular type of vehicle is high, that type of vehicle
will receive priority at the maintenance shop; it will move to the front of
the line, so to speak. Therefore, when the utilization rate is high, the mean
downtime will be reduced, resulting in an almost constant VIC rate.
Therefore, our assumption that MDT is not a function of the utilization
rate is not valid in practice.

Dependability (Do) is “a measure of the degree to which an item is operable and capable of performing its required function at any (random) time during a specified mission profile, given item availability at the start of the mission.”35 Dependability is calculated by:

Do = MTBCF / (MTBCF + MTTRS),

where:

MTBCF is mean time between critical failures and
MTTRS is the mean time to restore system.

Mean time to restore system (MTTRS) is:

A measure of the product maintainability parameter, related to availability and readiness: The total corrective maintenance time, associated with downing events, divided by the total number of downing events, during a stated period of time. (Excludes time for off-product maintenance and repair of detached components.)36

35 Ibid., pp. G-3 – G-4.

Misuse of Availability

The simplest misuse of availability is to mistake availability for a measure of reliability. While availability is a function of reliability, it is no more a measure of reliability than power is a measure of voltage, or distance is a measure of velocity.

Another misuse of availability is demonstrated by the following scenario.

1. The contractor cannot achieve the contractually required MTBF.
2. “However,” the contractor argues, “what you really need is availability, not reliability. What do you care if the MTBF is lower than the contract originally required, if you can still have the implied operational availability (Ao)? Sign a long-term contract logistics support (CLS) agreement with us, and we’ll expedite the parts to you to reduce the MDT to the point that you meet or exceed the originally implied operational availability!”

Of course, even though the customer may meet the agreed-upon operational availability:

1. Every failure must be repaired by the customer’s mechanics, so there is still a real additional maintenance labor cost to the customer for the less-than-originally required reliability.
2. The customer still ultimately pays for each part required, whether
on a per-part basis or a “power by the hour” basis. Since the item
has less-than-originally required reliability, it requires more-than-
originally required replacement parts, at an additional cost.
3. The expedited parts supply comes at a cost. In order to reduce the
possibility that a particular part will not be available, the contractor
may increase the inventory. In order to reduce shipping time, the
contractor may pre-position parts in several locations around the
world. Of course, increased inventory and increased number of
warehouses come at an increased cost—to the customer.

36 MIL-HDBK-338B, op. cit., p. 3-13.
4. Another approach the contractor can take to expediting parts is to
air freight everything, regardless of urgency. Of course, this also
comes at an additional cost.

By falling into the “availability trap,” the customer has lost, because:

1. The item does not meet the original reliability requirement.
2. Maintenance labor is increased.
3. Parts cost is increased.

The contractor, on the other hand, has won, because:

1. The original reliability requirement was eliminated.
2. The CLS agreement can be a real “cash cow.”

Thus, the contractor has ultimately been rewarded for poor performance.

Chapter 4: Introduction to Reliability Math: The
Exponential Distribution
Reliability as a Probability

In chapter 1, we saw that the definition for reliability is “the probability that an item can perform its intended function for a specified interval under stated conditions.”37 Therefore, the math of reliability is based upon probability and statistics.38 Section 5 of MIL-HDBK-338B provides an excellent discussion of reliability theory. This chapter is, essentially, condensed from that discussion, and consists primarily of quotes and excerpts from it.

The Cumulative Distribution Function

MIL-HDBK-338B states:

The cumulative distribution function F(t) is defined as the probability in a random trial that the random variable is not greater than t …, or

F(t) = ∫_{−∞}^{t} f(t) dt

where f(t) is the probability density function of the random variable, time to failure. F(t) is termed the “unreliability function” when speaking of failure. … Since F(t) is zero until t=0, the integration in Equation 5.1 can be from zero to t.39

37 Ibid., p. G-15, definition (2).
38 MIL-HDBK-338B, op. cit., p. 5-1.
39 Ibid., p. 5-2.
The Reliability Function

MIL-HDBK-338B continues:

The reliability function, R(t), or the probability of a device not failing prior to some time t, is given by

R(t) = 1 − F(t) = ∫_{t}^{∞} f(t) dt

The probability of failure in a given time interval between t1 and t2 can be expressed by the reliability function

∫_{t1}^{∞} f(t) dt − ∫_{t2}^{∞} f(t) dt = R(t1) − R(t2)

Failure Rates and Hazard Rates

MIL-HDBK-338B continues:

The rate at which failures occur in the interval t1 to t2, the failure rate, λ(t), is defined as the ratio of probability that failure occurs in the interval, given that it has not occurred prior to t1, the start of the interval, divided by the interval length. Thus,

λ(t) = [R(t1) − R(t2)] / [(t2 − t1)·R(t1)]

40 Ibid.
The hazard rate, h(t), or instantaneous failure rate, is defined as the limit of the failure rate as the interval length approaches zero, or

h(t) = (1/R(t))·[−dR(t)/dt]

The Bathtub Curve

Again from MIL-HDBK-338B:

Figure 4-1 shows a typical time versus failure rate curve for
equipment. This is the "bathtub curve," which, over the
years, has become widely accepted by the reliability
community. It has proven to be particularly appropriate for
electronic equipment and systems. The characteristic
pattern is a period of decreasing failure rate (DFR)
followed by a period of constant failure rate (CFR),
followed by a period of increasing failure rate (IFR).

FIGURE 5.4-1: HAZARD RATE AS A FUNCTION OF AGE42

41 Ibid.
42 Ibid., p. 5-28.
The Exponential Distribution

MIL-HDBK-338B states:

If h(t) can be considered to be a constant failure rate (λ), which is true for many cases for electronic equipment, …

R(t) = e^{−λt}

Thus, we see that a constant failure rate results in an exponential reliability function. The usefulness of the exponential distribution is not limited to electronic equipment, as the above quote implies; it also extends, in general, to repairable systems. MIL-HDBK-338B continues:

[The exponential distribution] is widely applicable for complex equipments and systems. If complex equipment consists of many components, each having a different mean life and variance which are randomly distributed, then the system malfunction rate becomes essentially constant as failed parts are replaced.

Thus, even though the failures might be wearout failures, the mixed population causes them to occur at random time intervals with a constant failure rate and exponential behavior.44

Reliability in Engineering Design also states, “For system level reliability calculations, the exponential [distribution] is usually a good model.”45

There is one more factor that makes the exponential distribution the single most important in reliability math.46 As MIL-HDBK-338B states, “The simplicity of the approach utilizing the exponential distribution, as previously indicated, makes it extremely attractive.”47 Not only is the exponential distribution a good model for electronic parts and complex systems in general, it also involves the simplest math.

43 Ibid., p. 5-5.
44 Ibid., p. 5-29.
45 Kailash C. Kapur and Leonard R. Lamberson, Reliability in Engineering Design. John Wiley and Sons, Inc., 1977, p. 235.
46 MIL-HDBK-338B, op. cit., p. 5-17.

Mean Time Between Failure (MTBF)

MTBF is commonly represented by Θ. For the exponential distribution, we have:

MTBF = Θ = 1/λ.48
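In code, with a hypothetical failure rate:

    import math

    lam = 250e-6    # hypothetical constant failure rate, failures/hour
    mtbf = 1 / lam  # 4,000 hours

    def reliability(t: float, lam: float) -> float:
        """Probability of surviving to time t at constant failure rate lam."""
        return math.exp(-lam * t)

    # Probability of completing a 100-hour mission without failure:
    print(f"MTBF = {mtbf:.0f} h, R(100) = {reliability(100, lam):.4f}")
    # MTBF = 4000 h, R(100) = 0.9753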

Reliability of Items in Series

For the exponential distribution, to calculate the reliability of items in series, add their failure rates:

λtotal = Σ_{i=1}^{n} λi.

Since (by definition) the failure rate is constant, there is no need to calculate reliability as a function of time, greatly simplifying the math associated with the exponential distribution. Since the failure rates are additive, tasks such as reliability predictions are also mathematically easy with the exponential distribution.
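A sketch of a simple series prediction under this rule (the part failure rates below are hypothetical):

    import math

    # Hypothetical part failure rates, failures per 10^6 hours.
    part_rates = [120.0, 45.5, 30.0, 4.5]

    lam_total = sum(part_rates) * 1e-6      # series model: rates add
    mtbf = 1 / lam_total                    # 5,000 hours
    r_1000 = math.exp(-lam_total * 1000.0)  # mission reliability, 1,000 hours

    print(f"system MTBF = {mtbf:,.0f} h, R(1000 h) = {r_1000:.3f}")  # 0.819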

Confidence Levels

A further advantage of the exponential distribution is the relative ease with which confidence levels can be calculated. Unlike such properties as length, weight, voltage, and time, reliability cannot be directly measured. Rather, it is measured by counting the number of failures over a period of life units. As we saw in chapter 1, this is calculated by:

47 Ibid., p. 5-29.
48 Ibid., p. 5-18.
MTBF = T / n,

where:

T is the number of life units and
n is the number of failures.

This is known as the observed reliability. Note that this is a point estimate
of the true reliability; since, as we have discussed, reliability is a
probability, there is a possibility that the true reliability is somewhat
better, or somewhat worse, than the observed reliability. We would like to
be able to calculate confidence levels, so that we could state, for example,
that the widget has a reliability of X hours MTBF measured at the 90%
confidence level. This would mean that we are 90% confident that the
MTBF of the widget is at least X hours, or, equivalently, that there is only
a 10% chance that the MTBF of the widget is less than X hours.

We will discuss calculation of confidence levels in the remainder of this chapter. We will more fully discuss the usefulness of confidence levels and their practical implications in chapter 7, Reliability Qualification Test (RQT).

As RADC Reliability Engineer’s Toolkit states:

There are two ways to end a reliability test, either on a specified number of failures occurring (failure truncated), or on a set period of time (time truncated). There are usually two types of confidence calculations of interest, either one sided (giving the confidence that an MTBF is above a certain value) or two sided (giving the confidence that an MTBF is between an upper and lower limit).49

In general, formal reliability tests are time truncated (terminated), that is,
successful completion of the test is defined as completing T hours of

49 RADC Reliability Engineer’s Toolkit. Griffiss Air Force Base, NY: Systems Reliability and Engineering Division, Rome Air Development Center, Air Force Systems Command (AFSC), 1988, p. A-47.
testing with no more than N failures. The results could be analyzed as a
failure truncated test if the test were terminated early due to excessive
failures or if one or more failures were discovered during the post-test
inspection.

Table 4-1: Summary of χ2 Models50

Failure Truncated Tests:
    Two-Sided Confidence Level: 2CΘ̂ / χ²_{(1−α/2), 2C} ≤ Θ ≤ 2CΘ̂ / χ²_{α/2, 2C}
    Single-Sided Confidence Level: Θ ≥ 2CΘ̂ / χ²_{(1−α), 2C}

Time Truncated Tests:
    Two-Sided Confidence Level: 2CΘ̂ / χ²_{(1−α/2), (2C+2)} ≤ Θ ≤ 2CΘ̂ / χ²_{α/2, 2C}
    Single-Sided Confidence Level: Θ ≥ 2CΘ̂ / χ²_{(1−α), (2C+2)}

where Θ̂ is the observed MTBF, C is the number of failures, 1−α is the confidence level, and χ²_{p, f} is the p fractile of the χ2 distribution with f degrees of freedom (Table 4-2).
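These models are mechanical to apply with any χ2 routine. A sketch using scipy (the test length and failure count are hypothetical); the last line reproduces the zero-failure, 80% one-sided lower-limit factor 0.62133 that appears in Table 4-3:

    from scipy.stats import chi2

    def mtbf_limits_time_truncated(T: float, C: int, conf: float):
        """Two-sided MTBF confidence limits for a time truncated test,
        per the models in Table 4-1 (conf = 1 - alpha)."""
        alpha = 1 - conf
        lower = 2 * T / chi2.ppf(1 - alpha / 2, 2 * C + 2)
        upper = 2 * T / chi2.ppf(alpha / 2, 2 * C) if C > 0 else float("inf")
        return lower, upper

    # Hypothetical test: 2,000 hours with 4 failures (observed MTBF 500 h).
    lo, hi = mtbf_limits_time_truncated(2000.0, 4, 0.80)
    print(f"80% two-sided limits: {lo:.0f} h to {hi:.0f} h")  # 250 h to 1146 h

    print(2 / chi2.ppf(0.80, 2))  # 0.62133..., matching Table 4-3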

Table 4-2: Fractiles of the χ2 Distribution51

Degrees of        Probability in Percent
Freedom (f)     10.0        20.0        80.0        90.0
2 0.21072 0.44629 3.2189 4.6052
4 1.0636 1.6488 5.9886 7.7794
6 2.2041 3.0701 8.5581 10.645
8 3.4895 4.5936 11.030 13.362
10 4.8652 6.1791 13.442 15.987
12 6.3038 7.8073 15.812 18.549
14 7.7895 9.4673 18.151 21.064
16 9.3122 11.152 20.465 23.542
18 10.865 12.857 22.760 25.989

50 Ibid.
51 Ibid., pp. A-48 – A-50. This table has been abridged to include only the 10% and 20% upper and lower confidence levels (those most commonly used in reliability calculations) and to delete the odd-numbered degrees of freedom, which are not used in confidence level calculations. It has been expanded to include more degrees of freedom and more significant digits.
20 12.443 14.578 25.038 28.412
22 14.041 16.314 27.301 30.813
24 15.659 18.062 29.553 33.196
26 17.292 19.820 31.795 35.563
28 18.939 21.588 34.027 37.916
30 20.599 23.364 36.250 40.256
32 22.271 25.148 38.466 42.585
34 23.952 26.938 40.676 44.903
36 25.643 28.735 42.879 47.212
38 27.343 30.537 45.076 49.513
40 29.051 32.345 47.269 51.805
42 30.765 34.157 49.456 54.090
44 32.487 35.974 51.639 56.369
46 34.215 37.795 53.818 58.641
48 35.949 39.621 55.993 60.907
50 37.689 41.449 58.164 63.167
52 39.433 43.281 60.332 65.422
54 41.183 45.117 62.496 67.673
56 42.937 46.955 64.658 69.919
58 44.696 48.797 66.816 72.160
60 46.459 50.641 68.972 74.397
62 48.226 52.487 71.125 76.630
64 49.996 54.337 73.276 78.860
66 51.770 56.188 75.425 81.085
68 53.548 58.042 77.571 83.308
70 55.329 59.898 79.715 85.527
72 57.113 61.756 81.857 87.743
74 58.900 63.616 83.997 89.956
76 60.690 65.478 86.135 92.166
78 62.483 67.341 88.271 94.374
80 64.278 69.207 90.405 96.578
82 66.076 71.074 92.538 98.780
84 67.876 72.943 94.669 100.98
86 69.679 74.813 96.799 103.18
88 71.484 76.685 98.927 105.37
90 73.291 78.558 101.05 107.57
100 82.358 87.945 111.67 118.50
1000 943.13 962.18 1037.4 1057.7

Table 4-3: Factors for Calculating Confidence Levels52

Failures                                     Factor
Time          All Other   60% Two-Sided/   80% Two-Sided/   60% Two-Sided/   80% Two-Sided/
Terminated    Cases       80% One-Sided    90% One-Sided    80% One-Sided    90% One-Sided
Lower Limit               Lower Limit      Lower Limit      Upper Limit      Upper Limit
0 1 0.43429 0.62133 4.4814 9.4912
1 2 0.25709 0.33397 1.2130 1.8804
2 3 0.18789 0.23370 0.65145 0.90739
3 4 0.14968 0.18132 0.43539 0.57314
4 5 0.12510 0.14879 0.32367 0.41108
5 6 0.10782 0.12649 0.25617 0.31727
6 7 0.09495 0.11019 0.21125 0.25675
7 8 0.08496 0.09773 0.17934 0.21477
8 9 0.07695 0.08788 0.15556 0.18408
9 10 0.07039 0.07988 0.13719 0.16074
10 11 0.06491 0.07326 0.12259 0.14243
11 12 0.06025 0.06767 0.11073 0.12772
12 13 0.05624 0.06290 0.10091 0.11566
13 14 0.05275 0.05878 0.09264 0.10560
14 15 0.04968 0.05517 0.08560 0.09709

52 The Rome Laboratory Reliability Engineer’s Toolkit. Griffiss Air Force Base, NY: Systems Reliability Division, Rome Laboratory, Air Force Materiel Command (AFMC), 1993, p. A-43. This table has been adapted and abridged to include only the 10% and 20% upper and lower confidence levels (those most commonly used in reliability calculations). It has been expanded to include more failures and more significant digits. Note that The Rome Laboratory Reliability Engineer’s Toolkit is in the public domain; it can, therefore, be freely distributed.
15 16 0.04697 0.05199 0.07953 0.08980
16 17 0.04454 0.04917 0.07424 0.08350
17 18 0.04236 0.04664 0.06960 0.07799
18 19 0.04039 0.04437 0.06549 0.07314
19 20 0.03861 0.04231 0.06183 0.06885
20 21 0.03698 0.04044 0.05855 0.06501
21 22 0.03548 0.03873 0.05560 0.06156
22 23 0.03411 0.03716 0.05292 0.05845
23 24 0.03284 0.03572 0.05048 0.05563
24 25 0.03166 0.03439 0.04825 0.05307
25 26 0.03057 0.03315 0.04621 0.05072
26 27 0.02955 0.03200 0.04433 0.04856
27 28 0.02860 0.03093 0.04259 0.04658
28 29 0.02772 0.02993 0.04099 0.04475
29 30 0.02688 0.02900 0.03949 0.04305
30 31 0.02610 0.02812 0.03810 0.04147
31 32 0.02536 0.02729 0.03681 0.04000
32 33 0.02467 0.02652 0.03559 0.03863
33 34 0.02401 0.02578 0.03446 0.03735
34 35 0.02338 0.02509 0.03339 0.03615
39 40 0.02071 0.02212 0.02890 0.03111
49 50 0.01688 0.01791 0.02274 0.02428
499 500 0.00189 0.00193 0.00208 0.00212

Chapter 5: Reliability Analyses
Reliability Modeling and Prediction

A reliability model is a mathematical model of an item used for predicting the reliability of that item—that is, the development of the model is the first step in the development of the reliability prediction. MIL-HDBK-338B states, “Reliability modeling and prediction is a methodology for estimating an item’s ability to meet specified reliability requirements.”53 However, it also states:
However, it also states:

Reliability models and predictions are not used as a basis for determining the attainment of reliability requirements. Attainment of these requirements is based on representative test results such as those obtained by using test plans from MIL-HDBK-781 (see Section 8 and Ref. [1]). However, predictions are used as the basis against which reliability performance is measured.54

Reliability Modeling

For basic (logistics) reliability, all items, including those intended solely
for redundancy and alternate modes of operation, are modeled in series.
As we saw in chapter 4, if the exponential distribution (constant failure
rate) is assumed, to calculate the reliability of items in series, add their
failure rates:

λtotal = Σ_{i=1}^{n} λi.

For critical (mission) reliability, redundant items are modeled in parallel. Note that there are several potential cases. For example, there can be three identical items, with only two required for success; or there can be a primary system and a backup system, which is only used after failure of the primary system. The following table summarizes the mathematical models for several common cases.

53 MIL-HDBK-338B, op. cit., p. 6-20.
54 Ibid.
Table 5-1: Redundancy Equation Approximations Summary55

Case 1: All units are active on-line with equal unit failure rates; (n−q) out of n required for success.
    With Repair (Equation 1):
        λ(n−q)/n = n!·λ^(q+1) / [(n−q−1)!·µ^q]
    Without Repair (Equation 4):
        λ(n−q)/n = λ / Σ_{i=n−q}^{n} (1/i)

Case 2: Two active on-line units with different failure and repair rates; one of two required for success.
    With Repair (Equation 2):
        λ1/2 = λA·λB·[(µA + µB) + (λA + λB)] / [µA·µB + (µA + µB)·(λA + λB)]
    Without Repair (Equation 5):
        λ1/2 = (λA²·λB + λA·λB²) / (λA² + λB² + λA·λB)

Case 3: One standby off-line unit with n active on-line units required for success. Off-line spare assumed to have a failure rate of zero; on-line units have equal failure rates.
    With Repair (Equation 3):
        λn/n+1 = n·[nλ + (1−P)·µ]·λ / [µ + n·(P+1)·λ]
    Without Repair (Equation 6):
        λn/n+1 = n·λ / (P+1)

55 The Rome Laboratory Reliability Engineer’s Toolkit, op. cit., p. 90.
Key:
λx/y is the effective failure rate of the redundant configuration where x
of y units are required for success
n = number of active on-line units. n! is n factorial (e.g.,
5!=5x4x3x2x1=120, 1!=1, 0!=1)
λ = failure rate of an individual on-line unit (failures/hour) (note that
this is not the more common failures/10^6 hours)
q = number of on-line active units which are allowed to fail without
system failure
µ = repair rate (µ=1/Mct, where Mct is the mean corrective maintenance
time in hours)
P = probability switching mechanism will operate properly when
needed (P=1 with perfect switching)

Notes:
1. Assumes all units are functional at the start
2. The approximations represent time to first failure
3. CAUTION: Redundancy equations for repairable systems should
not be applied if delayed maintenance is used.
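As a sketch, Equations 1 and 4 translate directly into code (the failure and repair rates below are hypothetical; note that rates are in failures per hour, per the key above):

    from math import factorial

    def lam_with_repair(n: int, q: int, lam: float, mu: float) -> float:
        """Equation 1: (n-q)-of-n active units, equal failure rates, with repair."""
        return factorial(n) * lam ** (q + 1) / (factorial(n - q - 1) * mu ** q)

    def lam_without_repair(n: int, q: int, lam: float) -> float:
        """Equation 4: (n-q)-of-n active units, equal failure rates, without repair."""
        return lam / sum(1 / i for i in range(n - q, n + 1))

    # Hypothetical 2-of-3 case: n=3, q=1, unit rate 100 failures/10^6 h,
    # 8-hour mean corrective maintenance time (mu = 1/Mct).
    lam, mu = 100e-6, 1 / 8.0
    print(lam_with_repair(3, 1, lam, mu))   # ~4.8e-07 failures/hour
    print(lam_without_repair(3, 1, lam))    # ~1.2e-04 failures/hour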

Reliability Prediction

MIL-HDBK-338B lists four reliability prediction techniques. They are:

(1) Similar Item Analysis. Each item under consideration is compared
with similar items of known reliability in estimating the probable
level of achievable reliability, then combined for higher level
analyses.
(2) Part Count Analysis. Item reliability is estimated as a function of
the number of parts and interconnections included. Items are
combined for higher level analysis.
(3) Stress Analyses. The item failure rate is determined as a function
of all the individual part failure rates as influenced by operational
stress levels and derating characteristics for each part.
(4) Physics-of-Failure Analysis. Using detailed fabrication and
materials data, each item or part reliability is determined using
failure mechanisms and probability density functions to find the
time to failure for each part. The physics-of-failure (PoF) approach
is most applicable to the wearout period of an electronic product’s
life cycle and is not suited to predicting the reliability during the
majority of its useful life.56

We will consider the parts count analysis technique below. For a discussion of the other techniques, see 6.4.5 of MIL-HDBK-338B.

Parts Count Reliability Prediction

[The following article is reprinted from The Rome Laboratory Reliability Engineer’s Toolkit.57]

A standard technique for predicting reliability when detailed design data such as part stress levels is not yet available is the parts count reliability prediction technique. The technique has a "built-in" assumption of average stress levels which allows prediction in the conceptual stage or source selection stage by estimation of the part types and quantities. This section contains a summary of the MIL-HDBK-217F, Notice 1 technique for eleven of the most common operational environments:

GB Ground Benign
GF Ground Fixed
GM Ground Mobile
NS Naval Sheltered
NU Naval Unsheltered
AIC Airborne Inhabited Cargo
AIF Airborne Inhabited Fighter
AUC Airborne Uninhabited Cargo
AUF Airborne Uninhabited Fighter
ARW Airborne Rotary Wing (i.e., Helicopter) (Both Internal
and External Equipment)
SF Space Flight

Assuming a series reliability model, the equipment failure rate can be expressed as:

56 MIL-HDBK-338B, op. cit., p. 6-44.
57 The Rome Laboratory Reliability Engineer’s Toolkit, op. cit., p. 92.
λEQUIP = Σ_{i=1}^{n} (Ni)(λgi)(πQi)

where

λEQUIP = total equipment failure rate (failures/10^6 hrs)
λgi = generic failure rate for the ith generic part type (failures/10^6 hrs)
πQi = quality factor for the ith generic part type
Ni = quantity of the ith generic part type
n = number of different generic part types

[End of the Parts Count Reliability Prediction article reprinted from The
Rome Laboratory Reliability Engineer’s Toolkit.]

Failure rate data for use in reliability predictions can be difficult to obtain.
The single best source for failure rate data for electronic components is
MIL-HDBK-217F(2), “Reliability Prediction of Electronic Equipment;”
for non-electronic components, use NPRD-95, “Nonelectronic Parts
Reliability Data 1995,” available from the Reliability Information
Analysis Center, 201 Mill Street, Rome NY 13440.

A sample top-level basic reliability prediction, from the Basic Expeditionary Airfield Resources (BEAR) Power Unit (BPU) Feasibility Analysis, is provided in Table 5-2.

Table 5-2: Top-Level Reliability Prediction (Feasibility Analysis) for the BPU

PD Description           NPRD-95 Description               Failure Rate            Mean Time Between   Page
                                                           (Failures/10E6 Hours)   Failure (hours)

Internal Combustion      Engine, Diesel (Summary)          14.2389                 70,230              2-88
Engine
Engine Cooling System    Heat Exchangers, Radiator         7.8829                  126,857             2-112
                         (Summary)
Brushless AC Generator   Generator, AC                     0.7960                  1,256,281           2-105
Voltage Regulator/       Regulator, Voltage (Summary)      5.5527                  180,093             2-166
Exciter System
Cranking Motor           Starter, Motor                    0.0212                  47,169,811          2-192
Controls                 (assumed to be included in        0                       N/A                 N/A
                         other items)
Governor                 (included in Engine, Diesel)      0                       N/A                 N/A
Other Devices            (assumed to be negligible         0                       N/A                 N/A
                         for initial analysis)
Total                                                      28.4917                 35,098              N/A
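The parts count formula reproduces this table directly. A sketch using the NPRD-95 rates above (one of each item, quality factors taken as 1):

    # (NPRD-95 description, failure rate in failures/10^6 hours), per Table 5-2.
    items = [
        ("Engine, Diesel (Summary)",            14.2389),
        ("Heat Exchangers, Radiator (Summary)",  7.8829),
        ("Generator, AC",                        0.7960),
        ("Regulator, Voltage (Summary)",         5.5527),
        ("Starter, Motor",                       0.0212),
    ]

    lam_equip = sum(rate for _, rate in items)  # N_i = 1, pi_Qi = 1 throughout
    mtbf = 1e6 / lam_equip

    print(f"total failure rate = {lam_equip:.4f} failures/10^6 h")  # 28.4917
    print(f"predicted MTBF     = {mtbf:,.0f} hours")                # 35,098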

Chapter 6: Reliability Growth Test (RGT)
[The following article is reprinted from Appendix 6 of RADC Reliability
Engineer’s Toolkit58, which is in the public domain, and can, therefore, be
freely distributed.]

6.1 RGT definition. MIL-STD-785 distinguishes reliability growth testing (RGT) from reliability qualification testing (RQT) as follows:

Reliability Growth Test (RGT): A series of tests conducted to disclose deficiencies and to verify that corrective actions will prevent recurrence in the operational inventory. (Also known as "TAAF"59 testing).

Reliability Qualification Test (RQT): A test conducted under specified conditions, by, or on behalf of, the government, using items representative of the approved production configuration, to determine compliance with specified reliability requirements as a basis for production approval. (Also known as a "Reliability Demonstration," or "Design Approval" test.)

6.2 RGT application effectiveness. An effective way to explain the concept of RGT is by addressing the most frequently asked questions relative to its use as summarized from "Reliability Growth Testing Effectiveness" (RADC-TR-84-20). For more information consult this reference and MIL-HDBK-189, "Reliability Growth Management."

Who pays for the RGT? Does the government end up paying more?

The usual case is that the government pays for the RGT as an additional
reliability program cost and in stretching out the schedule. The savings in
support costs (recurring logistics costs) exceed the additional initial
acquisition cost, resulting in a net savings in [Life Cycle Cost (LCC)]. The
amount of these savings is dependent on the quantity to be fielded, the
maintenance concept, the sensitivity of LCC to reliability, and the level of
development required. It is the old "pay me now or pay me later" situation

58 RADC Reliability Engineer’s Toolkit, op. cit., pp. A-63 – A-68.
59 “TAAF” stands for Test Analyze And Fix.
which in many cases makes a program manager's situation difficult
because his performance is mainly based on the "now" performance of
cost and schedule.

Does RGT allow contractors to "get away with" a sloppy initial design
because they can fix it later at the government's expense?

It has been shown that unforeseen problems account for 75% of the
failures due to the complexity of today's equipment. Too low an initial
reliability (resulting from an inadequate contractor design process) will
necessitate an unrealistic growth rate in order to attain an acceptable level
of reliability in the allocated amount of test time. The growth test should
be considered as an organized search and correction system for reliability
problems that allows problems to be fixed when it is least expensive. It is
oriented towards the efficient determination of corrective action. Solutions
are emphasized rather than excuses. It can give a nontechnical person an
appreciation of reliability and a way to measure its status.

Should all development programs have some sort of growth program?

The answer to this question is yes in that all programs should analyze and
correct failures when they occur in prequalification testing. A distinction
should be in the level of formality of the growth program. The less
challenge there is to the state-of-the-art, the less formal (or rigorous) a
reliability growth program should be. An extreme example would be the
case of procuring off-the-shelf equipment to be part of a military system.
In this situation, which really isn't a development, design flexibility to
correct reliability problems is mainly constrained to newly developed
interfaces between the "boxes" making up the system. A rigorous growth
program would be inappropriate but a [failure reporting and corrective
action system (FRACAS)] should still be implemented. The other extreme
is a developmental program applying technology that challenges the state-
of-the-art. In this situation a much greater amount of design flexibility to
correct unforeseen problems exists. Because the technology is so new and
challenging, it can be expected that a greater number of unforeseen
problems will be surfaced by growth testing. All programs can benefit
from testing to find reliability problems and correcting them prior to
deployment, but the number of problems likely to be corrected and the
cost effectiveness of fixing them is greater for designs which are more
complex and challenging to the state-of-the-art.
How does the applicability of reliability growth testing vary with the
following points of a development program?

(1) Complexity of equipment and challenge to state-of-the-art?

The more complex or challenging the equipment design is, the more likely there will be unforeseen reliability problems which can be surfaced by a growth program. However, depending on the operational scenario, the number of equipments to be deployed and the maintenance concept, there may be a high LCC payoff in using a reliability growth program to fine tune a relatively simple design to maximize its reliability. This would apply in situations where the equipments have extremely high usage rates and LCC is highly sensitive to MTBF.

(2) Operational environment?

All other factors being equal, the more severe the environment, the higher the payoff from growth testing. This is because severe environments are more likely to inflict unforeseen stress associated with reliability problems that need to be corrected.

(3) Quantity of equipment to be produced?

The greater the quantities of equipment, the more impact on LCC by reliability improvement through a reliability growth effort.

What reliability growth model(s) should be used?

The model to be used, as MIL-HDBK-189 says, is the simplest one that does the job. Certainly, the Duane is most common, probably with the AMSAA (Army Materiel Systems Analysis Activity) second. They both have advantages; the Duane being simple with parameters having an easily recognizable physical interpretation, and the AMSAA having rigorous statistical procedures associated with it. MIL-HDBK-189 suggests the
Duane for planning and the AMSAA for assessment and tracking. When
an RQT is required, the RGT should be planned and tracked using the
Duane model; otherwise, the AMSAA model is recommended for tracking
because it allows for the calculation of confidence limits around the data.

Should there be accept/reject criteria?

The purpose of reliability growth testing is to uncover failures and take corrective actions to prevent their recurrence. Having accept/reject criteria is a negative incentive for the contractor toward this purpose. Monitoring the contractor's progress and loosely defined thresholds are needed, but placing accept/reject criteria on a growth test, or using it as a demonstration, defeats the purpose of running it. A degree of progress monitoring is
necessary even when the contractor knows that following the reliability
growth test he will be held accountable by a final RQT. Tight thresholds
make the test an RQT in disguise. Reliability growth can be incentivized but shouldn't be. By rewarding a contractor for meeting a certain threshold in a shorter time, or by indicating that "if the RGT results are good, the RQT will be waived," the contractor's incentive to "find and fix" is diminished. The
growth test's primary purpose is to improve the design, not to evaluate the
design.

What is the relationship between an RQT and RGT?

The RQT is an "accounting task" used to measure the reliability of a fixed design configuration. It has the benefit of holding the contractor
accountable some day down the road from his initial design process. As
such, the contractor is encouraged to seriously carry out the other design
related reliability tasks. The RGT is an "engineering task" designed to
improve the design reliability. It recognizes that the drawing board design
of a complex system cannot be perfect from a reliability point of view and
allocates the necessary time to fine tune the design by finding problems
and designing them out. Monitoring, tracking, and assessing the resulting
data gives insight into the efficiency of the process and provides
nonreliability persons with a tool for evaluating the development's
reliability status and for reallocating resources when necessary. The two forms
of testing serve very different purposes and complement each other in
development of systems and equipments. An RGT is not a substitute for
an RQT, or any other reliability design tasks.

How much validity/confidence should be placed on the numerical
results of RGT?

Associating a hard reliability estimate with a growth process, while mathematically practical, has the tone of an assessment process rather than
an improvement process, especially if an RQT assessment will not follow
the RGT. In an ideal situation, where contractors are not driven by profit
motives, a reliability growth test could serve as an improvement and
assessment vehicle. Since this is not the real world, the best that can be
done if meaningful quantitative results are needed without an RQT, is to
closely monitor the contractor RGT. Use of the AMSAA model provides
the necessary statistical procedures for associating confidence levels with
reliability results. In doing so, closer control over the operating conditions
and failure determinations of the RGT must be exercised than if the test is
for improvement purposes only. A better approach is to use a less closely
controlled growth test as an improvement technique (or a structured
extension of FRACAS, with greater emphasis on corrective action) to fine
tune the design as insurance for an accept decision in an RQT. With this
approach, monitoring an improvement trend is more appropriate than
development of hard reliability estimates. Then use a closely controlled
RQT to determine acceptance and predict operational results.

6.3 Duane model. Because the Duane model is the one most commonly
used, it will be further explained. The model assumes that the plot of cumulative MTBF versus cumulative test time is a straight line when plotted on log-log paper. The
main advantage of this model is that it is easy to use. The disadvantage of
the model is it assumes that a fix is incorporated immediately after a
failure occurs (before further test time is accumulated). Because fixes are
not developed and implemented that easily in real life, this is rarely the
case. Despite this problem, it is still considered a useful planning tool.
Below is a brief summary of the Duane model.

a. Growth Rate: α = ∆MTBF/∆TIME

b. Cumulative MTBF: MTBF_C = (1/K)T^α

c. Instantaneous MTBF: MTBF_I = MTBF_C/(1 − α)

d. Test Time: T = [(MTBF_I)(K)(1 − α)]^(1/α)

e. Preconditioning period at which the system will realize an initial MTBF of MTBF_C: T_PC = (1/2)(MTBF_PRED)

where

K = a constant which is a function of the initial MTBF
α = the growth rate
T = the test time

The instantaneous MTBF is the model's mathematical representation of the MTBF if all previous failure occurrences are corrected. Therefore,
there is no need to selectively purge corrected failures from the data.
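
To make relationships b through d concrete, here is a minimal Python sketch (mine, not part of the Toolkit article); the function and variable names are illustrative only.

    def cumulative_mtbf(t, k, alpha):
        # Equation b: MTBF_C = (1/K) * T**alpha
        return t ** alpha / k

    def instantaneous_mtbf(t, k, alpha):
        # Equation c: MTBF_I = MTBF_C / (1 - alpha)
        return cumulative_mtbf(t, k, alpha) / (1.0 - alpha)

    def test_time(mtbf_i, k, alpha):
        # Equation d: T = [MTBF_I * K * (1 - alpha)]**(1/alpha)
        return (mtbf_i * k * (1.0 - alpha)) ** (1.0 / alpha)

Note that test_time is simply the inversion of equations b and c, so test_time(instantaneous_mtbf(t, k, a), k, a) returns t.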

The scope of the up-front reliability program, severity of the use environment and system state-of-the-art can have a large effect on the initial MTBF and, therefore, the test time required. The aggressiveness of
initial MTBF and, therefore, the test time required. The aggressiveness of
the test team and program office in ensuring that fixes are developed and
implemented can have a substantial effect on the growth rate and,
therefore, test time. Other important considerations for planning a growth
test are provided [below].

RGT Planning Considerations

• To account for down time, calendar time should be estimated to be roughly twice the number of test hours.

• A minimum test length of 5 times the predicted MTBF should always be used (if the Duane Model estimates less time). Literature commonly quotes typical test lengths of 5 to 25 times the predicted MTBF.
• For large MTBF systems (e.g., greater than 1,000 hours), the
preconditioning period equation does not hold; 250 hours is
commonly used.

• The upper limit on the growth rate is .6 (growth rates above .5 are rare).

[End of the RGT article reprinted from Appendix 6 of RADC Reliability Engineer’s Toolkit.]

Starting Point for the Duane Model

The above discussion of the Duane growth model is incomplete in one detail: it does not provide sufficient information to calculate the constant K.

According to MIL-STD-1635, the initial MTBF is 10 percent of the predicted MTBF;60 therefore, the first failure is expected to occur at T = MTBF_pred/10, and the cumulative MTBF at the first failure will be MTBF_C = MTBF_pred/10. By rearranging equation b above,

K = T^α/MTBF_C.

Substituting the initial values of T and MTBF_C, we have:

K = (MTBF_pred/10)^α/(MTBF_pred/10) = (MTBF_pred/10)^(α−1).

60
MIL-STD-1635, “Reliability Growth Testing,” 3 Feb 1978 (since
cancelled), p. 27.
Expected Number of Failures

The Duane Model can be used to calculate the number of failures expected
during an RGT.

Note that, by definition, the cumulative MTBF, MTBF_C, is the test time, T, divided by the number of failures, N:

MTBF_C = T/N.

However, MTBF_C can also be calculated by equation b above, so we have:

MTBF_C = T/N = T^α/K,

which can be solved for N:

N = T/(T^α/K) = K/T^(α−1).

Expected Time to Each Failure

The above analysis can be extended to predict the expected time to each
failure. Let i be the i-th failure and t_i be the time at which the i-th failure is expected to occur. By substituting i for N and t_i for T in the above equation, we have:

i = K/t_i^(α−1).

This can be rearranged to solve for t_i:

t_i = (K/i)^(1/(α−1)).

This can be used, for example, to predict how many failures will occur
during each week of an RGT so as to estimate the number of engineers
that will be required for the FRACAS needed to achieve the planned
growth rate. It can also be used for a Monte Carlo simulation of an RGT.
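
As a worked illustration (mine, not MIL-STD-1635's), the following Python sketch computes K from the predicted MTBF, the expected number of failures by time T, and the expected time of each failure; all names are illustrative.

    def duane_k(mtbf_pred, alpha):
        # Starting point above: K = (MTBF_pred/10)**(alpha - 1)
        return (mtbf_pred / 10.0) ** (alpha - 1.0)

    def expected_failures(t, k, alpha):
        # N = K / T**(alpha - 1)
        return k / t ** (alpha - 1.0)

    def time_of_failure(i, k, alpha):
        # t_i = (K / i)**(1 / (alpha - 1))
        return (k / i) ** (1.0 / (alpha - 1.0))

    # Example: predicted MTBF of 100 hours, growth rate 0.4; the first
    # failure is expected at MTBF_pred/10 = 10 hours.
    k = duane_k(100.0, 0.4)
    print(round(time_of_failure(1, k, 0.4), 1))         # 10.0
    print(round(expected_failures(1000.0, k, 0.4), 1))  # about 15.8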

Moving Average Method

The moving average method of monitoring an RGT is discussed in MIL-STD-1635, which states:

The moving average for a given number of failures is computed as the arithmetic mean of the corresponding times between selected [failures] sequentially and in
times between selected [failures] sequentially and in
reverse order of occurrence. For example, the moving
average of two failures is obtained by adding the [times
between the] last two failure times and dividing by two; for
three failures, by summing the [times between the] last
three failure times and dividing by three; and so forth. The
number of failures used in the computation is arbitrary but
should be restricted to ten or less.61

The number of failures used is typically identified by referring to an N-point moving average; for example, a 5-point moving average would use
the last five failures. Table 6-1 provides an example of the moving
average method.

There are three disadvantages to the moving average method.

1. It offers no method to project reliability growth or the required RGT time; therefore, it cannot be used for RGT planning. As a
result, the Duane Model is often used in conjunction with the
moving average method for projection and planning purposes.
2. It is volatile, and becomes more volatile as the number of failures used is reduced. This is exacerbated in a real-life RGT, where
failures often are not identified when they occur; rather, they tend
to be identified during inspections, resulting in several failures
grouped at one cumulative test time.

61
Ibid., p. 32.
3. It cannot be used prior to the N-th failure. However, there is a
work-around for this: use a 1-point moving average at the first
failure, a 2-point moving average at the second failure, etc., until
the N-th failure occurs.

With these disadvantages, why would anyone use the moving average
method?

1. It is simple; calculation of the current MTBF at any point can be done with a subtraction and a division. In comparison, the Duane Model requires determination of the current growth rate, either graphically or using regression analysis, and the AMSAA Method requires even more complicated calculations.
2. It offers an easy method of calculating confidence levels for the
current MTBF.62

It is this last point that makes the moving average method so attractive.
Recall the discussion of confidence levels in chapter 4. If a system has a
constant failure rate, we can easily calculate the confidence levels using
the χ2 distribution and the number of failures or the table of confidence
level factors. We know the number of failures during the period of interest
with the moving average method. The assumption of a constant failure rate
is not strictly true, as the point of an RGT is to introduce improvements,
and reduce the failure rate, as the RGT progresses. However, if the period
is relatively short in comparison to the entire RGT, the failure rate can be
assumed to be constant during the period. Further, if the failure rate is
decreasing, as it should in an RGT, the confidence levels would be
conservative. Thus, the moving average method provides an easy method
of calculating confidence levels for the current MTBF. We can calculate
the current MTBF at any point with a subtraction and a division; we can
calculate the two-sided confidence levels with two additional
multiplications.
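
A minimal Python sketch of this calculation (my illustration, assuming failure-terminated chi-squared limits as in chapter 4 and the scipy library) follows.

    from scipy.stats import chi2

    def moving_average_mtbf(cum_times, n):
        # Current MTBF from the last n failures: a subtraction and a division.
        start = cum_times[-n - 1] if len(cum_times) > n else 0.0
        return (cum_times[-1] - start) / n

    def mtbf_confidence_limits(window_time, n, confidence=0.80):
        # Two-sided chi-squared limits for n failures in window_time hours.
        a = 1.0 - confidence
        lower = 2.0 * window_time / chi2.ppf(1.0 - a / 2.0, 2 * n)
        upper = 2.0 * window_time / chi2.ppf(a / 2.0, 2 * n)
        return lower, upper

    # Example: failures 11 through 15 of Table 6-1.
    times = [1, 4, 8, 13, 20, 30, 42, 57, 78, 104, 136, 177, 228, 292, 372]
    print(moving_average_mtbf(times, 5))       # (372 - 104) / 5 = 53.6
    print(mtbf_confidence_limits(372 - 104, 5))

The divisions by the chi-squared values play the role of the two multiplications by tabulated confidence level factors mentioned above.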

62
Molly Statham and I developed this method while working in the
Loader Program Office in 1996. It is so simple, yet so powerful, I cannot
imagine that no one had previously developed it; however, I have never
seen it mentioned in the literature.
AMSAA Method

For a discussion of the AMSAA Method, see section 5.5.2 of MIL-HDBK-781A.

Block Modification Method

The Block Modification (or Block Mod) Method differs from the methods
previously described in that, rather than implementing corrective actions
as soon as they are available, they are held and implemented in groups as
Block Mods. The result is that a graph of the instantaneous or current
reliability will appear as a stair-step rather than a continuous line or curve.

There are two basic advantages to the Block Mod Method:

1. It reduces the total number of test item configurations, making it easier to determine which configuration was being tested at any particular time.
2. Since the configuration is constant in any block, each block can be
analyzed as an independent reliability test utilizing the confidence
level calculations from chapter 4.

There are also two basic disadvantages to the Block Mod Method:

1. It offers no method to project reliability growth or the required RGT time; therefore, it cannot be used for RGT planning. Further,
due to its discontinuous current reliability plot, the Duane Model is
only approximately valid; therefore, the Duane Model’s usefulness
for projection and planning purposes is limited.
2. It reduces the test time available for verifying the effectiveness of
corrective actions.

Recommended Approach

RADC Reliability Engineer’s Toolkit, in the discussion of RGT quoted above, states: “MIL-HDBK-189 suggests the Duane for planning and the AMSAA for assessment and tracking.”63 While the AMSAA Method is

63
RADC Reliability Engineer’s Toolkit, op. cit., p. A-64.
technically more rigorous than the moving average method, unless the
RGT is being used in lieu of an RQT—which is never recommended—the
moving average method is adequate. Therefore, the recommended
approach is to use the Duane Model for planning and the moving average
method for assessment and tracking.

Failure Purging

Failure purging is a potentially contentious issue in an RGT program. Failure purging is the removal of a failure from the RGT tracking process after the corrective action for that failure has been implemented and its effectiveness verified. However, RGT guidance and literature appear to be universally, and emphatically, opposed to failure purging.

MIL-HDBK-189, “Reliability Growth Management,” states:

… failure purging as a result of design fixes is an unnecessary and unacceptable procedure when applied to
determining the demonstrated reliability value. It is
unnecessary because of the recently developed statistical
procedures to analyze data whose failure rate is changing. It
is unacceptable for the following reasons:

a. The design fix must be assumed to have reduced the probability of a particular failure to zero.
This is seldom, if ever, true. Usually a fix will
only reduce the probability of occurrence, and in
some cases, fixes have been known to actually
increase the probability of a failure occurring.

b. It must be assumed that the design fix will not interact with other components and/or failure
modes. Fixes have frequently been known to
cause an increase in the failure rate of other
components and/or failure modes.

The hard fact is that fixes do not always fix; and, therefore,
the attitude of the government must be to defer judgment
until further testing is conducted. However, even after the
effectiveness of a design fix has been established, failures
associated with eliminated failure modes should not be
purged. The reason is—if there has been sufficient testing
to establish the effectiveness of a design fix, then an
appropriate reliability model will, by then, have sufficient
data to reflect the effect of the fix in the current reliability
estimate.

The above discussion, of course, applies to the demonstrated reliability values. It may, however, be
necessary to weight the effectiveness of proposed fixes for
the purpose of projecting reliability. However, the
difference between assessments and projections must be
clearly delineated.64

MIL-STD-1635, “Reliability Growth Testing” (now cancelled), states: “This [Duane Cumulative MTBF] plot shall not be adjusted by negating past failures because of present or future design changes.”65

RADC Reliability Engineer’s Toolkit, in the discussion of the Duane Model quoted above, states: “The instantaneous MTBF is the model’s
mathematical representation of the MTBF if all previous failure
occurrences are corrected. Therefore, there is no need to selectively purge
corrected failures from the data.”66

A variation on the failure purging theme is to purge all but the first
occurrence of a particular failure mode—essentially, once a particular
failure mode has been experienced, any recurrence of that failure mode is
ignored—even before a corrective action for that failure mode has been
identified. Obviously, this form of failure purging is also unacceptable.

Cumulative Test Time for Multiple Test Items

One practical point that is often overlooked is how to account for cumulative test time when using multiple test items. While it is possible to

64
MIL-HDBK-189, “Reliability Growth Management,” 13 Feb. 1981, pp.
87-88.
65
MIL-STD-1635, op. cit., p. 19.
66
RADC Reliability Engineer’s Toolkit, op. cit., p. A-66.
interpolate the test time accumulated on each test item at the instant any particular test item failed, this is unnecessary extra effort and adds little, if any, accuracy to
the result. Cumulative test time should be calculated using the test time for
each unfailed item as recorded prior to each failure.
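
As a simple illustration (mine, not from the cited guidance): if item A fails at 100 operating hours, and the most recently recorded times on unfailed items B and C are 90 and 80 hours, the cumulative test time at that failure is taken as 100 + 90 + 80 = 270 hours.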

“Testing does not improve reliability”

As stated in the handout for “R&M Design in System Acquisition,” “Testing does not improve reliability. Only incorporating corrective
actions that prevent the recurrence of failures actually improves the
reliability.”67 Unfortunately, it is easier to accumulate RGT test hours than
it is to develop and incorporate corrective actions; therefore, the contractor
has an incentive to continue testing even though the instantaneous
reliability is below the planned growth curve. The RGT should, therefore, be structured so that, when the instantaneous reliability falls significantly below the planned growth curve, testing ceases until the contractor has incorporated sufficient corrective actions that the projected reliability again meets the planned growth curve.

67
“R&M Design in System Acquisition,” Air Force Institute of
Technology (AFIT) QMT 335, 16 – 26 Oct 1989.
Table 6-1: Moving Average Method Example68

Failure   Test   Time Between   Cumulative        Moving Average MTBF
 Count    Time      Failure        MTBF       3 Point    4 Point    5 Point
   1         1          1           1.0
   2         4          3           2.0
   3         8          4           2.7          2.7
   4        13          5           3.3          4.0        3.3
   5        20          7           4.0          5.3        4.8        4.0
   6        30         10           5.0          7.3        6.5        5.8
   7        42         12           6.0          9.7        8.5        7.6
   8        57         15           7.1         12.3       11.0        9.8
   9        78         21           8.7         16.0       14.5       13.0
  10       104         26          10.4         20.7       18.5       16.8
  11       136         32          12.4         26.3       23.5       21.2
  12       177         41          14.8         33.0       30.0       27.0
  13       228         51          17.5         41.3       37.5       34.2
  14       292         64          20.9         52.0       47.0       42.8
  15       372         80          24.8         65.0       59.0       53.6
  16       473        101          29.6         81.7       74.0       67.4
  17       599        126          35.2        102.3       92.8       84.4
  18       757        158          42.1        128.3      116.3      105.8
  19       956        199          50.3        161.0      146.0      132.8
  20      1205        249          60.3        202.0      183.0      166.6
  21      1518        313          72.3        253.7      229.8      209.0
  22      1879        361          85.4        307.7      280.5      256.0
  23      2262        383          98.4        352.3      326.5      301.0
  24      2668        406         111.2        383.3      365.8      342.4
  25      3099        431         124.0        406.7      395.3      378.8

68
MIL-STD-1635, op. cit., p. 31.
Chapter 7: Reliability Qualification Test (RQT)
Reliability qualification test (RQT) is defined as “A test conducted under
specified conditions, by, or on behalf of, the government, using items
representative of the approved production configuration, to determine
compliance with specified reliability requirements as a basis for
production approval.”69

MIL-HDBK-781A Standard Test Plans

MIL-HDBK-781A provides a number of standard test plans which “contain statistical criteria for determining compliance with specified
reliability requirements and are based on the assumption that the
underlying distribution of times-between-failures is exponential.”70 An
understanding of the “statistical criteria” involved is necessary for
selection of an appropriate standard test plan or for development of a
custom test plan. The key terms are defined below:

“Consumer’s risk (β) is the probability of accepting equipment with a true mean-time-between-failures (MTBF) equal to the lower test MTBF (θ1).
The probability of accepting equipment with a true MTBF less than the
lower test MTBF (θ1) will be less than (β).”71

“Producer’s risk (α) is the probability of rejecting equipment which has a true MTBF equal to the upper test MTBF (θ0). The probability of rejecting
equipment with a true MTBF greater than the upper test MTBF will be
less than (α).”72

69
MIL-STD-785B, “Reliability Programs for Systems and Equipment
Development and Production,” 15 Sep. 1980 (since cancelled), p. 3.
70
MIL-HDBK-781A, “Handbook for Reliability Test Methods, Plans, and
Environments for Engineering, Development Qualification, and
Production,” 1 Apr 1996, p. 17.
71
Ibid., p. 6.
72
Ibid.
“The discrimination ratio (d) is one of the standard test plan parameters; it is the ratio of the upper test MTBF (θ0) to the lower test MTBF (θ1), that is, d = θ0/θ1.”73

“Lower test MTBF (θ1) is [the lowest value of MTBF which is acceptable]. The standard test plans will reject, with high probability, equipment with a true MTBF that approaches (θ1).”74 The lower test MTBF is the required MTBF.

“Upper test MTBF (θ0) is an acceptable value of MTBF equal to the discrimination ratio times the lower test MTBF (θ1). The standard test
plans will accept, with high probability, equipment with a true MTBF that
approaches (θ0). This value (θ0) should be realistically attainable, based on
experience and information.”75 The upper test MTBF is also known as the
“design to” MTBF.

“Predicted MTBF (θp) is that value of MTBF determined by reliability prediction methods; it is a function of the equipment design and the use
environment. (θp) should be equal to or greater than (θ0) in value, to ensure
with high probability, that the equipment will be accepted during the
reliability qualification test.”76

There are two types of standard test plans that are of interest here:
Probability Ratio Sequential Test (PRST) plans, summarized in Table 7-1,
and fixed-duration test plans, summarized in Table 7-2. The PRST plans
have a variable length. MIL-HDBK-781A provides the following
guidance for choosing between a fixed-duration test plan and a PRST plan.

A fixed-duration test plan must be selected when it is necessary to obtain an estimate of the true MTBF

73
Ibid.
74
Ibid., p. 7. Actually, the MIL-HDBK-781A definition reads, “Lower test MTBF (θ1) is that value which is unacceptable.” However, this is confusing.
75
Ibid.
76
Ibid.
demonstrated by the test, as well as an accept-reject
decision, or when total test time must be known in advance.

A sequential test plan may be selected when it is desired to accept or reject predetermined MTBF values (θ0, θ1) with predetermined risks of error (α, β), and when uncertainty in total test time is relatively unimportant. This test will save test time, as compared to fixed-duration test plans having similar risks and discrimination ratios, when the true MTBF is much greater than (θ0) or much less than (θ1).77

A sequential test plan is generally inappropriate for use with a firm-fixed-price type contract. Also, the wide range of test lengths possible creates
scheduling problems regardless of the contract type. While the guidance is
to plan for the maximum length,78 the program office must also plan for
reaching an “accept” decision early so as to avoid an unproductive gap in
the program. Therefore, use of a sequential test plan is not recommended
in most cases.

MIL-HDBK-781A provides the following guidance for choosing the discrimination ratio.

The discrimination ratio (d) … is a measure of the power of the test to reach a decision quickly…. In general, the
higher the discrimination ratio (d), the shorter the test. The
discrimination ratio (d) (and corresponding test plan) must
be chosen carefully to prevent the resulting (θ0) from
becoming unattainable due to design limitations.79

Review of Table 7-2 demonstrates the impact of the discrimination ratio on test time. The test plans with 10 percent nominal decision risks (that is,
consumer’s and producer’s risks) are IX-D, XII-D, and XV-D. Test plan
IX-D has a discrimination ratio of 1.5 and a test duration of 45.0 times the
lower test MTBF (θ1). The duration for test plan XII-D, which has a
discrimination ratio of 2.0, is 18.8 times the lower test MTBF (θ1), 41.8

77
Ibid., p. 17.
78
Ibid., p. 37.
79
Ibid., p. 19.
percent of that for test plan IX-D. Increasing the discrimination ratio to 3.0
(test plan XV-D) reduces the test duration to 9.3 times the lower test
MTBF (θ1), only 20.7 percent of that for test plan IX-D. Thus, doubling
the discrimination ratio, from 1.5 to 3.0, reduces the test duration by a
factor of 4.84. Therefore, the highest discrimination ratio that results in a
feasible upper test MTBF (θ0) should be selected.

Review of Table 7-2 also demonstrates the effect of decision risks on test
time. The test plans with a discrimination ratio of 2.0 and balanced
decision risks (that is, the nominal consumer’s risk is equal to the nominal
producer’s risk) are XII-D, XIV-D, and XX-D. Test plan XII-D, with 10
percent nominal decision risks, has a duration of 18.8 times the lower test
MTBF (θ1). Increasing the nominal decision risks to 20 percent (test plan
XIV-D) reduces the test duration to 7.8 times the lower test MTBF (θ1),
only 41.5 percent of that for test plan XII-D. Increasing the nominal
decision risks to 30 percent (test plan XX-D) reduces the test duration to
3.7 times the lower test MTBF (θ1), only 19.7 percent of that for test plan
XII-D. Thus, tripling the decision risks, from 10 percent to 30 percent,
reduces the test duration by a factor of 5.1. This is a classic case of trading
cost and schedule for risk reduction.

Another point to note is that, in order to reach an “accept” decision for any
of the test plans, the observed reliability must be significantly greater than
the required reliability, the lower test MTBF (θ1). Dividing the test
duration by the maximum number of failures to accept (except for test
plan XXI-D, of course) reveals that the minimum acceptable observed
reliability is 1.196 times the required reliability (test plan X-D). At the
other extreme, test plan XVII-D requires an observed reliability 2.15 times
the required reliability. Although this may seem to be unfair, it is the
nature of dealing with probabilities.
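
A short Python sketch (mine, using the Table 7-2 values) makes this check explicit; the minimum acceptable observed MTBF is simply the test duration divided by the maximum number of failures to accept.

    # (test duration in multiples of theta_1, maximum failures to accept);
    # test plan XXI-D is omitted because it accepts zero failures.
    plans = {
        "IX-D": (45.0, 36), "X-D": (29.9, 25), "XI-D": (21.5, 17),
        "XII-D": (18.8, 13), "XIII-D": (12.4, 9), "XIV-D": (7.8, 5),
        "XV-D": (9.3, 5), "XVI-D": (5.4, 3), "XVII-D": (4.3, 2),
        "XIX-D": (8.1, 6), "XX-D": (3.7, 2),
    }
    for name, (duration, max_accept) in plans.items():
        # Minimum observed MTBF on an accept decision, in multiples of theta_1
        print(name, round(duration / max_accept, 3))  # X-D: 1.196; XVII-D: 2.15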

Table 7-1: Summary of MIL-HDBK-781A PRST Test Plans 80

Test     Producer’s Risk   Consumer’s Risk   Discrimination
Plan        (α) (%)           (β) (%)          Ratio (d)
I-D           11.5              12.5              1.5
II-D          22.7              23.2              1.5
III-D         12.8              12.8              2.0
IV-D          22.3              22.5              2.0
V-D           11.1              10.9              3.0
VI-D          18.2              19.2              3.0
VII-D         31.2              32.8              1.5
VIII-D        29.3              29.9              2.0

80
Ibid., p. 36.
Table 7-2: Summary of MIL-HDBK-781A Fixed-Duration Test Plans81

Test     Producer’s    Consumer’s    Discrimination   Test Duration       Maximum
Plan     Risk (α) (%)  Risk (β) (%)    Ratio (d)      (multiples of θ1)   Failures to Accept
IX-D        12.0           9.9            1.5              45.0                 36
X-D         10.9          21.4            1.5              29.9                 25
XI-D        19.7          19.6            1.5              21.5                 17
XII-D        9.6          10.6            2.0              18.8                 13
XIII-D       9.8          20.9            2.0              12.4                  9
XIV-D       19.9          21.0            2.0               7.8                  5
XV-D         9.4           9.9            3.0               9.3                  5
XVI-D       10.9          21.3            3.0               5.4                  3
XVII-D      17.5          19.7            3.0               4.3                  2
XIX-D       29.8          30.1            1.5               8.1                  6
XX-D        28.3          28.5            2.0               3.7                  2
XXI-D       30.7          33.3            3.0               1.1                  0

81
Ibid., p. 131.
Case Study 1: Integrated Suitability Improvement Program (ISIP)

In the following paper, written in April 1997, I proposed establishing an Integrated Suitability Improvement Program (ISIP) for the Next Generation Small Loader (NGSL). The proposal was never approved. A
Generation Small Loader (NGSL). The proposal was never approved. A
year later, in April 1998, Mrs. Darleen Druyun, then the Air Force’s principal deputy assistant secretary for acquisition and management, decided to transfer the program from WR-ALC to WPAFB. The NGSL is now known as the Halvorsen Loader.

In particular, note the step-by-step approach to developing contractual reliability requirements, developing a traditional “textbook” reliability
program, demonstrating that it is unfeasible, and developing an alternative
two-step reliability program. Also, note the steps involved in selecting the
various reliability demonstration tests (RDTs) and parameters for the
various reliability growth tests (RGTs).

There is one apparent problem in the Operational to Contractual Requirements Translation section: as discussed in chapter 3, MTBF is
always greater than or equal to MTBM. The underlying issue is that in the
draft AMC Operational Requirements Document (ORD), AMC did not
use the standard definition of MTBM; rather, they essentially defined
MTBM as “mean time between visits to the maintenance shop.” This
explains my statement, “MTBM is essentially mean time between
workorder,” since a workorder will be opened for each visit to the shop.
Therefore, the metric should be MTBWO.

There was a logical reason for AMC to be concerned with MTBWO. As discussed in chapter 3, downtime in the Air Force vehicle community is
broken into vehicle-down-for-maintenance (VDM) and vehicle-down-for-
parts (VDP). Since experience shows that the awaiting maintenance time
is a large portion of VDM, and is accumulated on a per-shop-visit basis,
rather than a per-maintenance event basis, reducing the number of visits to
the shop in turn reduces the awaiting maintenance time.

Next Generation Small Loader (NGSL)
Integrated Suitability Improvement Program (ISIP)

Purpose

The Air Force Operational Testing and Evaluation Center (AFOTEC) conducted an Operational Assessment (OA) of two candidate loaders from
13 Nov 96 to 16 Jan 97 at Travis AFB CA. The OA results reveal that
both candidate loaders fall far short of the required reliability. (Although
the OA was too short to verify that the loaders comply with the reliability
requirements, the MTBCF of both loaders was so low that AFOTEC
calculated that there is less than 0.01 percent chance that either loader
complies with the 400 hour MTBCF requirement.) In order to achieve the
required reliability, an aggressive reliability growth process is needed.
This paper addresses an iterative approach to developing a reliability
program to achieve the required levels of reliability within the cost and
schedule constraints of the program.

In the typical RGT, the focus is exclusively on improving reliability, with no concern for measuring and improving maintainability or verifying the
technical orders (T.O.s). However, by broadening the scope of the
program to include maintainability and T.O.s, the benefits of the test can
be significantly increased, with no appreciable degradation to the
reliability growth process. In AFOTEC terminology, reliability,
maintainability, availability, and T.O.s are “suitability” parameters.
Hence, the term, “Integrated Suitability Improvement Program.”

Operational to Contractual Requirements Translation

The Draft AMC Operational Requirements Document (ORD) for the Next
Generation Small Loader (NGSL) (AMC 020-93 I, 7 Mar 1997) requires a
Mission Completion Success Probability (MCSP) of at least 86% for a 60
hour mission at Follow-On Operational Test and Evaluation (FOT&E); by
assuming a constant failure rate, this can be translated to a requirement for
400 hours mean time between critical failure (MTBCF). The ORD
Operational Availability (Ao) requirement is based on 100 hours mean
time between maintenance (MTBM) at FOT&E. MTBM is essentially
mean time between workorder. On-Line Vehicle Interactive Management
System (OLVIMS) data for the current Southwest Mobile Systems (SMS)
25K Loader shows an average of 1.36 to 1.64 failures per workorder
(FY93 through FY96); for this period, with the exception of FY94, the
average varied between 1.53 to 1.64. A ratio of 1.50 failures per
workorder is reasonable; this results in a 66.7 hour mean time between
failure (MTBF) requirement.
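
The two translations can be verified as follows (my arithmetic; the constant failure rate assumption is stated above). For the mission requirement, MCSP = e^(−t/MTBCF), so MTBCF = −t/ln(MCSP) = −60/ln(0.86) ≈ 398 hours, which rounds to the 400 hour requirement. For the basic requirement, MTBF = MTBM ÷ (failures per workorder) = 100/1.50 = 66.7 hours.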

Textbook Reliability Program

The textbook approach is to conduct a single reliability growth test (RGT), followed by a reliability demonstration test (RDT), during First Article
Testing (FAT) or Developmental Testing and Evaluation (DT&E).
Compliance with the ORD reliability requirements in an operational
environment would be verified by Qualification Operational Testing and
Evaluation (QOT&E) or Initial Operational Testing and Evaluation
(IOT&E). There would be no need to verify compliance with the
reliability requirements during FOT&E with this approach.

The first step is to select or develop appropriate RDT plans for both
MTBF and MTBCF. This involves selecting decision risks (consumer’s
risk, the probability of accepting loaders which do not meet the reliability
requirement, and producer’s risk, the probability of rejecting loaders
which do meet the reliability requirement), the discrimination ratio (a
measure of the power of the test to reach an accept/reject decision
quickly), and whether a fixed length or variable length (probability ratio
sequential test (PRST)) is to be used. For the first iteration, only fixed
length MIL-HDBK-781 test plans with 10% and 20% decision risks are
considered. These are summarized below.

                                     Discrim-    MTBF Test       MTBCF Test
Test    Producer’s   Consumer’s      ination     Duration        Duration
Plan    Risk (α)     Risk (β)        Ratio (d)   (hours)         (hours)
                                                 θ1 = 66.7 hrs   θ1 = 400 hrs
IXD        10%          10%            1.5         3,000           18,000
XD         10%          20%            1.5         1,993           11,960
XID        20%          20%            1.5         1,433            8,600
XIID       10%          10%            2.0         1,253            7,520
XIIID      10%          20%            2.0           827            4,960
XIVD       20%          20%            2.0           520            3,120
XVD        10%          10%            3.0           620            3,720
XVID       10%          20%            3.0           360            2,160
XVIID      20%          20%            3.0           287            1,720

Note that only one test plan, XVIID, can verify the 400 hour MTBCF
requirement in less than 2,000 hours. This test plan has a discrimination
ratio, d, of 3.0. The discrimination ratio is the ratio of the upper test limit
(the design-to reliability) to the lower test limit (the required reliability);
therefore, if this test plan is selected, the contractor should design to a
predicted 1,200 hour MTBCF, which may not be feasible. Selecting a test
plan with a discrimination ratio of 1.5 would reduce the upper test limit to
600 hours MTBCF. However, this would require a test length of 8,600
hours, which is too long to be practical. A reasonable compromise would
be test plan XIVD, with a discrimination ratio of 2.0, resulting in an upper
test limit of 800 hours. This test plan would require a test length of 3,120
hours.

The MTBF requirement is verified in parallel with the MTBCF requirement; therefore, any test plan requiring no more than 3,120 hours
can be selected without increasing the test length. Test plan XIID appears
to be a reasonable choice: it reduces both the consumer’s and producer’s
risks to 10%, while keeping a discrimination ratio of 2.0, resulting in a
133.3 hour upper test MTBF.
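
As a check (my arithmetic), the tabulated durations are simply the MIL-HDBK-781 duration multiples applied to the appropriate θ1: test plan XIVD for MTBCF requires 7.8 × 400 = 3,120 hours, and test plan XIID for MTBF requires 18.8 × 66.7 ≈ 1,253 hours.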

The next step is to determine the RGT length needed. The minimum
recommended test length is 5.0 times the predicted MTBF or MTBCF, or
4,000 hours for MTBCF. Test time can be calculated by:

T = [MTBFinstantaneous × K × (1 − α)]^(1/α)

where:

T is test time, in hours,
MTBFinstantaneous is the instantaneous MTBF or MTBCF,
K is a constant which is a function of the
initial MTBF or MTBCF, and
α is the reliability growth rate (not to be
confused with the producer’s risk).

To use this equation, a value for K is required. Cumulative MTBF can be calculated by:

MTBFcumulative = (1/K) × T^α,
K

where:

MTBFcumulative is the cumulative MTBF at time T.

This can be rearranged to solve for K:

K = T^α/MTBFcumulative.

According to MIL-STD-1635, MTBFcumulative is one tenth of the predicted MTBF or MTBCF at T of one half the predicted MTBF or MTBCF, or 100 hours, whichever is greater.82

82
Note that this is incorrect; MIL-STD-1635 actually states that the initial reliability is approximately 10% of the predicted reliability, which means that both T and MTBFcumulative for the initial failure would be MTBFpred/10. Of course, this affects all of the subsequent RGT calculations in this case study. For the correct equation for calculating K, see chapter 6.
If MTBFinstantaneous = MTBFpredicted = 133.3 hours MTBF or 800
hours MTBCF, K is as follows.

Reliability
Growth Rate    K for MTBF    K for MTBCF
    (α)
   0.20          0.18844       0.04143
   0.25          0.23723       0.05590
   0.30          0.29866       0.07543
   0.35          0.37598       0.10177
   0.40          0.47334       0.13732
   0.45          0.59590       0.18528
   0.50          0.75019       0.25000

Predicted test times are as follows.

Reliability    MTBF Test Time    MTBCF Test Time
Growth Rate       (hours)           (hours)
    (α)
   0.20          3,276,922        13,106,131
   0.25            316,405         1,265,471
   0.30             65,617           262,486
   0.35             21,018            84,069
   0.40              8,818            35,272
   0.45              4,418            17,672
   0.50              2,500            10,000

It is obvious that, even at the extremely aggressive 0.50 growth rate
(which may be unachievable in any case), a 10,000 hour RGT is too long
to be practical.
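
The two tables above can be reproduced with a short Python sketch (mine, not part of the original paper), using the case study's starting point of an initial MTBF of one tenth of the prediction at T equal to half the prediction or 100 hours, whichever is greater (but see footnote 82 regarding this starting point):

    def k_value(mtbf_pred, growth_rate):
        # K = T0**alpha / (MTBF_pred / 10), with T0 = max(MTBF_pred / 2, 100)
        t0 = max(mtbf_pred / 2.0, 100.0)
        return t0 ** growth_rate / (mtbf_pred / 10.0)

    def rgt_time(mtbf_target, mtbf_pred, growth_rate):
        # T = [MTBF_instantaneous * K * (1 - alpha)]**(1 / alpha)
        k = k_value(mtbf_pred, growth_rate)
        return (mtbf_target * k * (1.0 - growth_rate)) ** (1.0 / growth_rate)

    for a in (0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50):
        # MTBF column: predicted 133.3 hours; MTBCF column: predicted 800 hours
        print(a, round(rgt_time(133.3, 133.3, a)), round(rgt_time(800.0, 800.0, a)))

The output matches the tabulated K values and test times to within rounding.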

Two Step Reliability Program

The Draft AMC ORD addresses the reliability and availability required at
FOT&E. This allows for a System Maturity Matrix (SMM) which
provides for lesser values at measurement periods prior to FOT&E, such
as at QOT&E. In fact, this approach has been applied for the 60K Loader,
which requires 144 hours MTBCF and 22 hours MTBF at IOT&E. Since
the 60K MTBCF requirement at FOT&E is 400 hours, the same as the
NGSL, 144 hours would be a reasonable interim requirement for the
NGSL at QOT&E. However, the ratio of failures to critical failures has
been reduced from 6.67 for the 60K to 6.0 for the NGSL. Therefore, the
corresponding interim requirement would be 24 hours MTBF at QOT&E.

This approach would involve the following segments:

1. Initial RGT (which will be identified as RGT1)
2. Initial RDT (identified as RDT1)
3. QOT&E
4. Final RGT (identified as RGT2)
5. Final RDT (identified as RDT2)
6. FOT&E

It would be reasonable to select the same test plans previously chosen for
the “textbook approach” for RDT2. Therefore, the design-to/predicted
reliability levels would remain at 133.3 hours MTBF and 800 hours
MTBCF. RGT1 is essentially the same as the RGT in the “textbook
approach,” terminated early, and RGT2 is simply the continuation of
RGT1. The interim reliability requirements, 24 hours MTBF and 144
hours MTBCF, are verified by RDT1. For consistency, a discrimination
ratio of 2.0 should be selected. The fixed length MIL-HDBK-781 test plans with 10% and 20% decision risks are summarized below.

                                   MTBF Test     MTBCF Test
Test    Producer’s   Consumer’s    Duration      Duration
Plan    Risk (α)     Risk (β)      (hours)       (hours)
                                   θ1 = 24 hrs   θ1 = 144 hrs
XIID       10%          10%           451           2,707
XIIID      10%          20%           298           1,786
XIVD       20%          20%           187           1,123

Test plan XIVD is a reasonable choice for verifying MTBCF, while test
plan XIID is a reasonable choice for MTBF.

The test time required for RGT1 is the time predicted to achieve an instantaneous MTBF of 48 hours (2.0 times 24 hours) and an instantaneous MTBCF of 288 hours (2.0 times 144 hours). Test time is calculated as follows.

Reliability    MTBF Test Time    MTBCF Test Time
Growth Rate       (hours)           (hours)
    (α)
   0.20             19,839            79,254
   0.25              5,319            21,258
   0.30              2,179             8,711
   0.35              1,135             4,539
   0.40                686             2,743
   0.45                456             1,825
   0.50                324             1,296
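
As a spot check of the table (my arithmetic), the MTBCF entry at a 0.40 growth rate is T = [288 × 0.13732 × (1 − 0.40)]^(1/0.40) ≈ 2,743 hours, using the K value tabulated earlier.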

At a nominal 0.35 growth rate, the test time required for MTBCF is
excessive. However, if priority is given to developing corrective actions
for critical failures, and a 0.40 growth rate can be achieved, a 2,743 hour
RGT would be sufficient. To be prudent, this should be rounded up to a
3,000 hour fixed length test.

The test time required for RGT2 is the difference between the test time
required for the “textbook approach” (at this growth rate) and the test time
for RGT1. The RGT2 time for MTBCF is 32,273 hours. If a growth rate of
0.35 can be maintained for MTBF for RGT1 and RGT2, the RGT2 for
MTBF would be approximately 18,000 hours (since the MTBCF
requirement drove a 3,000 hour RGT1). Therefore, the MTBCF RGT2
time would be the longer. However, a dedicated RGT in excess of 30,000
hours is not feasible.

Reliability Growth Testing in an Operational Environment

The traditional reliability growth program is conducted during FAT or DT&E, at the contractor’s facility or a dedicated testing facility, under
simulated or stimulated operational conditions. The advantages of this
approach include the ability to accumulate test time in minimal calendar
time and the contractor having ready access to the test items for corrective
action implementation.

There are also disadvantages to the traditional approach. Regardless of the care taken during development of the test plan, the true correlation
between test hours and “average” operational hours is unknown. Secondly,
the ability to accumulate test time quickly can be detrimental: a
contractor, in a misguided attempt to reduce schedule (probably in an
attempt to recover from previous schedule slips), may choose to test 24
hours a day, seven days a week, with all available test items. Failure data
could soon overwhelm the failure reporting and corrective action system
(FRACAS). With few corrective actions developed and implemented, the
reliability growth rate will be very low, and little growth will actually take
place. The final disadvantage of the traditional approach is cost:
accumulating the required test time can be costly, especially with a labor
intensive system like an aircraft cargo loader (one operator hour per loader
hour, with additional labor required to lock and unlock pallets, operate the
loading dock conveyor system, record data, etc.).

An alternative approach is to conduct the RGT in an operational environment. This would require the contractor’s participation; therefore,
it would not be appropriate during dedicated IOT&E, dedicated QOT&E,
or FOT&E, as, with clearly defined exceptions, the contractor is
specifically excluded from participating in those events. However, it
would be appropriate during an OA, a combined DT&E/IOT&E or a
combined FAT/QOT&E.

There are several benefits to conducting the RGT in an operational
environment. The most obvious is the exact correlation between the test
hours and operational hours. Secondly, a more reasonable pace can be
maintained. During the OA, the three loaders tested averaged
approximately 5.0 hours per loader per day. A reasonable FRACAS
should be able to keep pace with the test, even if it includes several
loaders operated seven days a week. Finally, the additional cost to the Air
Force associated with accumulating test hours should be significantly less.
The cargo moved is cargo that would have to be moved in any case;
therefore, the cost of the operators, personnel to lock and unlock pallets,
fuel, etc., would have been spent even if the RGT were not performed.

Adding Maintainability and T.O. Improvement to an RGT:
An Integrated Suitability Improvement Program

In the typical RGT, the focus is on improving reliability, with no concern for measuring and improving maintainability or verifying the T.O.s. However, there is no technical reason that this should be so.

As corrective actions are implemented, the failure rates of the various subsystems change. This impacts the mean repair time (MRT) (also
known as mean time to repair (MTTR)). MRT is a weighted average
repair time, with more frequent repairs carrying a higher weight. Data
collected during the RGT can be used to track this changing MRT. In
addition, feedback from maintainers, identifying difficult or time-
consuming maintenance tasks or suggestions to improve the
maintainability of the loader, can be documented and captured by an
expanded FRACAS, thereby driving maintainability improvements.

RGT can also be expanded to include verification and improvement of the T.O.s. RGT should not replace initial T.O. verification; the initial T.O.s
should be validated and verified at the contractor’s facility prior to RGT.
However, by requiring that all maintenance actions be performed in
accordance with the T.O.s, except where the manuals are clearly in error,
additional errors can be identified, documented, and captured by the
expanded FRACAS. This also offers the opportunity to verify T.O.
changes resulting from corrective action implementation, correction of
previously identified errors, and improvements suggested by the
maintainers.

By broadening the scope of the program to include maintainability and
T.O.s, the benefits of the RGT can be significantly increased, with no
appreciable degradation to the reliability growth process.

Case Study 2: Basic Expeditionary Airfield Resources
(BEAR)
Power Unit (BPU)
Reliability Feasibility Analysis

1. PURPOSE

The purpose of this document is to present the results of an analysis that was performed in order to investigate the feasibility of the specified
reliability for the Basic Expeditionary Airfield Resources (BEAR) Power
Unit (BPU), as well as to present recommended changes to the draft
Purchase Description (PD) and its associated Statement of Work (SOW).

2. BACKGROUND

The BPU Requirements Correlation Matrix (RCM), dated 23 Jan 2007 (hereafter referred to as the RCM), lists reliability as a key performance
parameter (KPP) with a threshold of, “Mean Time Between Failure
(MTBF) shall be 1500 hrs.” However, neither the RCM itself nor any of
the available program documents provide any rationale to justify this
requirement.

Draft PD PD05WRLEEG11, dated 23 Feb 2007 (hereafter referred to as the draft PD), 1.1 Scope., states that the BPU “will be used as a prime
power unit to provide electrical power in support of BEAR forward
operating military bases.” 3.13.1 Reliability. of the draft PD states: “The
BPU engine-generator shall have a mean time between failure (MTBF) of
at least 1,500 hours (objective: 2,000 hours) measured at 90 percent one-
sided confidence limits.”

3. METHODOLOGY

Five different methodologies were used in order to estimate the reliability of the BPU.

1. Generator set reliability data from Nonelectronic Parts Reliability Data 1995 (NPRD-95)
2. A top-level parts count reliability prediction, utilizing component
reliability data from NPRD-95
3. Results of the Deployable Power Generation and Distribution
System (DPGDS) Operational Deployment Run (ODR), as
documented in the 8 Aug 2006 Memo for Record
4. Results of the MEP-012 First Article Test Endurance Run,
Contract Number F04606-89-D-0126-0001
5. Review of the reliability requirements for the Generator Set 750
kW, 50/60 Hertz, Mobile Prime – SM-ALC/MMIRE 84-04P
(Production) dated 5 Oct 1984, Change 1 dated 26 Dec 84 and
Generator Set 750 kW, 50/60 Hertz, Mobile Prime Model MEP-
012A – SM-ALC/LIEE ES930228 dated 28 Feb 93

3.1 Generator Set Reliability Data.

NPRD-95 contains the failure rate for one item that appears to be directly
comparable to the BPU engine-generator itself.

NPRD-95 Description     Failure Rate            Mean Time Between   Page
                        (Failures/10E6 Hours)   Failure (hours)
Generator, Diesel,           31.4131                31,834          2-106
Packaged, Continuous

The reported MTBF, 31,834 hours, is 2,122 percent of the RCM requirement, indicating that the RCM requirement is feasible.

3.2 Top-Level Parts Count Reliability Prediction.

The draft PD, 3.13.1 Reliability., continues:

The engine-generator is defined as the engine and related equipment (see 3.4 through 3.4.9.5); the generator and all
related equipment (see 3.5 through 3.5.10.6); the control
system and all related equipment (see 3.6); the DC
electrical systems (see 3.9 through 3.9.2); the external
interfaces (see 3.10 through 3.10.3); and the engine-
generator coupler. The engine-generator does not include
either the trailer (see 3.7 through 3.7.4) or the enclosure
(see 3.8).

A top-level parts count reliability prediction was performed using the above listed systems and parts.

Description in PD     NPRD-95 Description         Failure Rate     Mean Time Between   Page
                                                  (Failures/10E6   Failure (hours)
                                                  Hours)
Internal Combustion   Engine, Diesel                14.2389            70,230          2-88
Engine                (Summary)
Engine Cooling        Heat Exchangers,               7.8829           126,857          2-112
System                Radiator (Summary)
Generator, AC         Brushless AC Generator         0.7960         1,256,281          2-105
Regulator/Exciter     Regulator, Voltage             5.5527           180,093          2-166
System                (Summary)
Cranking Motor        Starter, Motor                 0.0212        47,169,811          2-192
Controls              (assumed to be included        0                   N/A           N/A
                      in other items)
Governor              (included in Engine,           0                   N/A           N/A
                      Diesel)
Other Devices         (assumed to be negligible      0                   N/A           N/A
                      for initial analysis)
Total                                               28.4917            35,098          N/A

The predicted reliability, 35,098 hours MTBF, is 2,340 percent of the RCM requirement, indicating that the RCM requirement is feasible. However, this margin will decrease as additional components are added to increase the accuracy of the prediction.
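
The series parts count arithmetic above amounts to summing failure rates; a minimal Python sketch (mine, with the rates transcribed from the table) is:

    # Failure rates in failures per 10^6 hours, from NPRD-95 as tabulated above
    rates = {
        "Internal Combustion Engine": 14.2389,
        "Engine Cooling System": 7.8829,
        "Generator, AC": 0.7960,
        "Regulator/Exciter System": 5.5527,
        "Cranking Motor": 0.0212,
    }
    total_rate = sum(rates.values())    # 28.4917 failures per 10^6 hours
    mtbf = 1e6 / total_rate             # about 35,098 hours
    print(round(total_rate, 4), round(mtbf))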

3.3 Deployable Power Generation and Distribution System (DPGDS) Reliability Data.

The BPU as analyzed in the top-level parts count reliability prediction discussed above is essentially the same as an Engine/Generator subsystem
as described in the DPGDS ODR Memo for Record, dated 8 Aug 2006:
“An Engine/Generator includes the Engine, Generator, and all directly
associated cooling systems, hydraulic systems, and specific
Engine/Generator controls in the [Primary Distribution Center].”
Therefore, the reliability of the DPGDS Engine/Generator can be used as
an estimate of the BPU engine-generator reliability.

The observed (point estimate) Engine/Generator reliability as measured in the ODR was 374 hours MTBF. This was only 24.9 percent of the RCM
requirement. This indicates that the RCM requirement is not feasible.

3.4 MEP-012 First Article Test Reliability Data.

The BPU is the replacement for the existing MEP-012. During the MEP-
012 First Article Test Endurance Run, two units were tested for 500 hours
each. Results are summarized in the following table:

Segment               AW0001                AW0002                 Total

                                   Failures
0 – 100 hours            0                     4                     4
101 – 200 hours          1                     1                     2
201 – 500 hours          0                     0                     0
Total                    1                     5                     6

                                 MTBF (hours)
Observed               500                   100                   166.7
80 Percent        128.5 ≤ MTBF ≤ 441   53.5 ≤ MTBF ≤ 158.5   94.8 ≤ MTBF ≤ 257
Confidence Level
1st 400 Hours          N/A                   N/A              37.8 ≤ MTBF ≤ 102.8
Last 600 Hours         N/A                   N/A              259.8 ≤ MTBF ≤ 5,677

Note that all six failures occurred in the first 400 hours of the test; two
failure modes were observed (fuse failures—four occurrences, and
improperly installed hose end on the fuel line—two occurrences). These
failure modes were identified, corrective actions were implemented, and
the units completed the remaining 600 hours with no further failures.
While the overall results were much less than the 1,500 hours MTBF
required for the BPU, the 600 hour failure-free period at the end of the test
indicates that the requirement may be feasible.

3.5 Reliability Requirements for the MEP-012.

The reliability requirements for the MEP-012 were reviewed to determine the MTBF that was considered feasible at the time that these engine-generators were purchased.

The specification for the MEP-012, Engineering Specification Generator Set 750 kW, 50/60 Hertz, Mobile Prime – SM-ALC/MMIRE 84-04P (Production) dated 5 Oct 1984, Change 1 dated 26 Dec 84, states:

3.4 Reliability. The lower test Mean Time Between Failure (MTBF) shall be 500 hours and the upper test MTBF shall be 1500 hours as defined in MIL-STD-781 (see 6.2.15 for the definition of failure).
The specification for the MEP-012A, Engineering Specification Generator
Set 750 kW, 50/60 Hertz, Mobile Prime Model MEP-012A – SM-
ALC/LIEE ES930228 dated 28 Feb 93, paragraph 3.3, is identical to
paragraph 3.4 quoted above.

The required reliability is the lower test value, 500 hours MTBF in each
case. This is only 33.3 percent of that required by the RCM. According to
The Rome Laboratory Reliability Engineer’s Toolkit, p. 16, “A 10-20%
reliability improvement factor is reasonable for advancement of
technology.” This would suggest that a feasible reliability requirement for
the BPU would be 550 to 600 hours MTBF, and indicates that the 1,500
hour MTBF threshold in the RCM is unfeasible. Due to the age of the
MEP-012 and the technology advancements that have been made in diesel engines, generators, and control systems since that time, a 50% reliability improvement factor may be reasonable.

These documents may explain the 1,500 hour MTBF reliability requirement in the BPU RCM. While the required reliability is the lower
test value, 500 hours, someone who is unfamiliar with reliability testing
could easily assume that the upper test value, 1,500 hours, is the required
reliability. Absent any program documentation to the contrary, this
appears to be the most logical explanation for the RCM reliability
threshold. Thus, by accident, it appears that the reliability threshold was
tripled without any justification or analysis to indicate that the higher
value is feasible.

Another explanation could come from a possible confusion between MTBF demonstrated at a particular confidence level and observed MTBF, which is a point estimate of the true MTBF. The MTBF requirement in the RCM is a requirement on the true reliability of the equipment, and compliance with this requirement must be demonstrated. As
reliability is a probability, it cannot be directly measured. That means that
there is a possibility that the true reliability is somewhat better, or
somewhat worse, than the observed reliability.

To demonstrate the requirement in the RCM, 1,500 hours MTBF, the draft
PD requires the contractor test eight BPUs for a total of 13,950 hours with
a maximum allowable five (5) failures; this will demonstrate the MTBF at
90 percent one-sided confidence limits (ref. 4.6.20.2 of the draft BPU PD
and MIL-HDBK-781A Fixed-Duration Test Plans, Test Plan XV-D). This
will result in a minimum observed MTBF = 13,950/5 = 2,790 hours.
Comparing this 2,790 hours observed MTBF to the observed DPGDS
MTBF for RAII, 1,200 hours, and observed MTBF for the ODR, 374
hours, the requirement of 1,500 hours MTBF is not feasible. Refer to the
table below to see how the numbers change if the requirement in the RCM
is changed from 1,500 hours MTBF to the suggested 750 hours MTBF. At
the end of the RQT, with only 5 failures allowed, the minimum observed
MTBF equals 1,395 hours. The requirement of 750 hours MTBF is
feasible.

Discrim-   Test Duration     Test Duration     Observed    Observed     Demonstrated    Demonstrated
ination    for MTBF=750      for MTBF=1,500    MTBF        MTBF         MTBF (at 750    MTBF (at 1,500
ratio, d   hrs (suggested)   hrs (as in RCM)   (at 750     (at 1,500    hrs and 90%     hrs and 90%
                                               hrs)        hrs)         confidence      confidence
                                                                        level)          level)
  1.5         33,750            67,500            938        1,875          750            1,500
  2.0         14,100            28,200          1,085        2,169          750            1,500
  3.0          6,975            13,950          1,395        2,790          750            1,500

4. RESULTS AND CONCLUSIONS

Results of this feasibility analysis are summarized in the following table.

Methodology              Projected MTBF        Percent of RCM       Feasibility and Risk
                         (hours)               Requirement
Generator Set            31,834                2,122                Feasible, Low Risk
Reliability Data
Top-level parts count    35,098                2,340                Feasible, Low Risk
reliability prediction
DPGDS ODR results        374                   24.9                 Not Feasible, High Risk
MEP-012 First            166.7 (observed,     11.1                  Not Feasible, High Risk
Article Test             entire test)
                         260 ≤ MTBF ≤ 5,677   17.3 to 378 (80      Feasible, Moderate Risk
                                              percent confidence
                                              level)
MEP-012/MEP-012A         500                   33.3                 Not Feasible, High Risk
Specification
Requirements

Ideally, the various methods utilized in a feasibility analysis will converge upon a single “most likely” result. However, in this case, the five
methodologies resulted in three divergent results. The most recent data,
from the most directly comparable system, indicates that the RCM
requirement is not feasible and would, therefore, result in a high risk of
program failure.

5. RECOMMENDATIONS

This analysis indicates that the RCM reliability requirement is not feasible
and would, therefore, result in a high risk of program failure. The BPU
reliability requirement should be reduced to a level that is feasible; 750
hours MTBF would appear to be an appropriate value, as this represents a
50 percent improvement over the MEP-012 requirements.

Note, however, that even a 750 hour MTBF requirement would not be low
risk. Both the DPGDS ODR results and the MEP-012 First Article Test
results indicate that achieving 750 hours MTBF will necessitate significant
engineering oversight. Recommend that the program include, at a
minimum:

1. Reliability modeling and predictions,
2. a Failure Reporting and Corrective Action System (FRACAS),
3. a formal Reliability Growth Test (RGT), and
4. a formal RQT.

Note that this feasibility analysis highlights a basic flaw with the DPGDS
design, which utilizes two engine-generators for each MEP-810 Prime
Power Unit (PPU). This doubles the predicted failure rate, and reduces the
predicted MTBF by half, for a PPU. Such a design would typically be
utilized to increase the mission (critical) reliability of the item, with
redundant components in parallel so that failure of one component would
not result in inability to complete the mission. However, in this
application, one engine-generator cannot supply sufficient power to
complete the mission; therefore, failure of either engine-generator will
result in mission failure. The DPGDS design suffers from the typical costs
of redundancy: increased cost, increased weight, increased package space,
increased complexity, and reduced basic (logistics) reliability, without
benefiting from the normal benefit of redundancy, increased mission
(critical) reliability.

Case Study 3: Fire Truck Depot Overhaul Study

This case study consists of the final report for Tracy Jenkins’s co-op project
from the summer of 2004. In a Mercer Engineering Research Center
(MERC) report entitled “Vehicle Depot Study,” dated 24 Aug 1998,
MERC provided the results and recommendations for a project:

… to study the existing U.S. Air Force vehicle depot
maintenance program. The vehicle groups evaluated were
fire fighting vehicles, aircraft tow tractors, aircraft loaders,
vacuum sweepers, and aircraft refuelers. The scope of the
study was to determine the effectiveness of the depot
program ….83

In order to quantitatively evaluate the effectiveness of the depot overhaul
program, MERC analyzed data from the Consolidated Analysis and
Reporting System (CARS) and Vehicle Interactive Management System
(VIMS) (actually, VIMS data is accessed through CARS):

The CARS and VIMS systems are means to track
maintenance data on ground support vehicles. MERC has
received both positive and negative feedback concerning
the reliability of the data from these systems. A sample
group of three (3) vehicles from each vehicle type
overhauled in FY96 was examined in order to analyze
maintenance information from the CARS system. The
vehicles were randomly selected. Data was gathered for the
year prior to overhaul (FY95) and the following year
(FY97). Comparisons of maintenance activities and costs
were made for the vehicles before and after depot
maintenance.84

The Maintenance Cost section of the MERC report concludes:

83. “Vehicle Depot Study Final Engineering Report,” Contract No. F09603-
98-F-0019, Document No. 080600-98062-F1, 24 Aug 1998, Mercer
Engineering Research Center (MERC), 135 Osigian Boulevard, Warner
Robins, GA 31088.
84. Ibid., p. 18.
Table 13 below shows a comparison of the average cost to
operate the equipment on a per hour basis. This is probably
the most important data collected relating to maintenance
costs, since it includes the amount of utilization and the
cost of maintenance. The cost of maintenance in this data
includes a sum of the parts costs, labor costs and fuel costs.
All vehicle groups show marked improvement in FY97.
Overall, there is a 37% drop in cost per hour to operate the
vehicles.85

Table 13: CARS Data Comparison - Cost/Hour

Vehicle Type        Avg Cost/Hour      Avg Cost/Hour       Percent
                    PreDepot (FY95)    PostDepot (FY97)    Change
MB-2 Tow Tractor    $12                $9                  -29%
Tymco Sweeper       $22                $10                 -54%
25K Loader          $44                $29                 -36%
40K Loader          $27                $17                 -35%
P-19 Fire Truck     $10                $7                  -33%
Overall Average                                            -37%

Since the MERC study considered data from only three vehicles of each
type, one year prior to and one year after depot overhaul, the data analyzed
was limited and the confidence in the results was low.

85. Ibid., p. 20.
Fire Truck Depot Overhaul Study
by Tracy Jenkins
Edited by Steven Davis

1. Abstract

The purpose of this study is to investigate the effects of depot overhaul on
the Air Force fire truck fleet.

2. Background

The intent of the overhaul is to increase the performance of the fire truck
fleet and extend the life expectancy of the current trucks. While in depot,
the truck is completely dismantled, defective parts are repaired or
replaced, and the vehicle is reassembled.

In August of 1998, Mercer Engineering Research Center (MERC)
provided their final report for a study of the Air Force vehicle depot
overhaul program. The study examined the effectiveness of the depot
program and the criteria for overhaul. MERC’s results confirmed the
effectiveness of the program and recommended continuing the depot
overhaul program to maintain reliability and consistency.

3. Methodology

This depot overhaul study was a continuation of the MERC task, limited
only to Air Force fire trucks. The vehicles used include the P-19 Aircraft
Rescue and Fire Fighting (ARFF) trucks and the P-22/P-24 structural
pumpers. These trucks have over ten years of service with the Air Force
and have been sent to depot within the last six years. Ten years of data was
collected from the Crash Rescue and ATAP depot programs. This data
includes 154 trucks from Crash Rescue and 34 trucks from ATAP. Crash
Rescue is the current depot contractor.

The Air Force supports a vehicle database (CARS) which stores
maintenance and cost information for each registered fire truck. This
database contains data regarding maintenance hours, failures, operational
and total hours, and repair costs. Using this data, information can be
calculated such as mean time between failure (MTBF), operation and
support costs (O&S costs), utilization rates (UR), and vehicle in
commission (VIC) rates. All information was analyzed as years before and
after depot.
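
As an illustration of these calculations, here is a minimal Python sketch; the function name, argument names, and the exact definitions of UR and VIC used below are assumptions for illustration, not the CARS definitions:

    # Hypothetical fleet metrics from CARS-style totals (names assumed).
    def fleet_metrics(op_hours, failures, os_cost,
                      possessed_hours, in_commission_hours):
        mtbf = op_hours / failures                  # mean time between failure, hours
        cost_per_hour = os_cost / op_hours          # O&S cost per operating hour
        ur = 100.0 * op_hours / possessed_hours     # utilization rate, percent (assumed)
        vic = 100.0 * in_commission_hours / possessed_hours  # VIC rate, percent (assumed)
        return mtbf, cost_per_hour, ur, vic

    # Example with made-up numbers:
    # fleet_metrics(op_hours=2890, failures=243, os_cost=50_400,
    #               possessed_hours=87_600, in_commission_hours=79_800)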

4. Results

The effects of performance before and after depot were analyzed. The data
were examined to determine trends between the two depot overhaul
programs and trends between the two types of trucks studied.

4.1 Trends Between Depot Facilities

Table 1 shows the effect depot overhaul had on the 154 vehicles from
Crash Rescue. MTBF increased 20.2 percent, a significant improvement;
however, O&S cost increased 16.9 percent, a significant degradation. This
is an unexpected result, as a reduction in unscheduled maintenance
tasks—indicated by the increase in MTBF—should result in a
corresponding reduction in O&S costs. The 3.0 percent reduction in UR is
insignificant and does not explain the results. With conflicting results for
MTBF and O&S cost, the effectiveness of the Crash Rescue depot
overhaul is inconclusive.

Table 1: All Crash Rescue (P-19, P-22, and P-24)

                  MTBF       UR        VIC Rate    O&S Cost Per
                  (hours)    (%)       (%)         Operating Hour
                                                   ($/hour)
Before            11.9       3.3       91.1        17.45
After             14.3       3.2       88.4        20.39
Percent Change    20.2 %     -3.0 %    -3.0 %      16.9 %

Table 2 shows the effects that the depot program had on the 34 ATAP
vehicles. MTBF increased 10.3 percent, again, a significant improvement.
However, O&S cost increased 23.2 percent, again, a significant
degradation. The 13.9 percent reduction in UR is significant, but does not
explain the results. With conflicting results for MTBF and O&S cost, the
effectiveness of the ATAP depot overhaul is also inconclusive.

Table 2: ATAP (P-19 and P-22)

                  MTBF       UR         VIC Rate    O&S Cost Per
                  (hours)    (%)        (%)         Operating Hour
                                                    ($/hour)
Before            10.7       3.6        94.1        14.10
After             11.8       3.1        93.6        17.37
Percent Change    10.3 %     -13.9 %    -0.4 %      23.2 %

The data from both Crash Rescue and ATAP demonstrated similar results,
with a significant increase in MTBF along with a similar significant
increase in O&S costs. The effects of the depot overhaul, therefore, do not
appear to depend on the depot overhaul contractor.

4.2 Trends Between Vehicle Types

The data were then analyzed according to ARFF and structural trucks to
determine if the depot overhaul program produced different results on
different types of vehicles.

Table 3 shows the effects that the depot program had on the P-19 ARFF
vehicles. MTBF increased 16.4 percent, again, a significant improvement.
However, O&S cost increased 2.6 percent, a slight degradation. The UR
did not change. With conflicting results for MTBF and O&S cost, the
effectiveness of the P-19 depot overhaul is also inconclusive.

Table 3: All P-19 (Crash Rescue and ATAP)

                  MTBF       UR        VIC Rate    O&S Cost Per
                  (hours)    (%)       (%)         Operating Hour
                                                   ($/hour)
Before            11.0       3.2       91.6        19.65
After             12.8       3.2       91.7        20.16
Percent Change    16.4 %     0 %       0.1 %       2.6 %

Table 4 shows the effects that the depot program had on the P-22 and P-24
structural pumpers. MTBF decreased 4.7 percent, a slight
degradation. However, O&S cost decreased 12.2 percent, a significant
improvement. The UR decreased 16.7 percent; however, this does not
explain the results. With conflicting results for MTBF and O&S cost, the
effectiveness of the P-22/P-24 depot overhaul is also inconclusive.

Table 4: All P-22/P-24 (Crash Rescue and ATAP)

                  MTBF       UR         VIC Rate    O&S Cost Per
                  (hours)    (%)        (%)         Operating Hour
                                                    ($/hour)
Before            12.8       3.6        91.2        13.73
After             12.2       3.0        90.8        12.05
Percent Change    -4.7 %     -16.7 %    -0.4 %      -12.2 %

The data for the P-19 ARFF vehicles and P-22/P-24 structural pumpers
demonstrated contrasting results, with a significant increase in MTBF for
the P-19 versus a slight reduction in MTBF for the P-22/P-24, and a
significant increase in O&S costs for the P-19 compared to a significant
reduction in O&S costs for the P-22/P-24. Again, the O&S costs were
directly related to the MTBF when an inverse relationship was expected.
With conflicting results for MTBF and O&S cost, the effectiveness of the
fire truck depot overhaul is also inconclusive.

5. Conclusions

The direct relationship between MTBF and O&S costs was unexpected
and is unexplained. In three of the four cases analyzed, O&S costs
increased after the depot overhaul, which eliminates any possible
justification of depot overhaul on the basis of reducing overall life cycle
cost (LCC).

In summary, the depot overhaul essentially allows the user to keep the
truck in operational status for several more years.

Appendix 1: Developing a Textbook Reliability Program

1. The user sets the operational requirement, either stated in terms of
MTBF, or in terms that the system engineer can translate into an MTBF.
The user requirement represents the true reliability the user believes is
needed in the field; the user should not attach a confidence level to this
value.

2. The system engineer performs a feasibility analysis to verify that the
user requirements are achievable and negotiates lower requirements with
the user if necessary. According to The Rome Laboratory Reliability
Engineer’s Toolkit, p. 16, “A 10-20% reliability improvement factor is
reasonable for advancement of technology.”86

3. The system engineer develops the contractual requirement from the user
requirement. Since it is impossible to test for true reliability, the system
engineer selects appropriate confidence levels and selects or develops an
appropriate reliability demonstration test plan. The contractual
requirement can be stated in a number of equivalent ways (for example,
stating any two of the values for lower test limit, upper test limit, and
discrimination ratio is equivalent to stating all three). In order to have a
“good chance” of satisfying the user requirement, the lower test limit
should be set equal to the user requirement. When the contractor passes
the selected RQT, the system engineer is (1 − β ) × 100% confident that the
items meet the user reliability requirement.
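
For example, if the user requirement is 750 hours MTBF, setting the lower
test limit θ1 = 750 hours and choosing a discrimination ratio of d = 3.0
fixes the upper test limit at θ0 = d × θ1 = 2,250 hours. Selecting Test Plan
XV-D (β ≈ 10 percent; see Appendix 8) then means that passing the RQT
gives the system engineer roughly 90 percent confidence that the items meet
the 750-hour user requirement.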

4. During design and development, the contractor should design to a
predicted reliability at least equal to the upper test limit. This will give the
contractor a “good chance” of passing the RDT.

5. A reliability growth test should be conducted prior to the RQT to
identify unforeseen failure modes and verify the effectiveness of the
contractor’s corrective actions. The RGT should be set up to achieve an
instantaneous reliability equal to the upper test limit of the RQT. This,
too, will give the contractor a “good chance” of passing the RQT.

86. The Rome Laboratory Reliability Engineer’s Toolkit, op. cit., p. 16.
Appendix 2: Example R&M Requirements Paragraphs

Example Paragraphs from the Draft BPU PD
(PD05WRLEEG11, dated 23 Apr 2007)

3.13 Reliability and maintainability.

3.13.1 Reliability. The BPU engine-generator shall have a mean time
between failure (MTBF) of at least 750 hours measured at 90 percent one-
sided confidence limits. The engine-generator is defined as the engine and
related equipment (see 3.4 through 3.4.9.5); the generator and all related
equipment (see 3.5 through 3.5.10.6); the control system and all related
equipment (see 3.6); the DC electrical systems (see 3.9 through 3.9.2); the
external interfaces (see 3.10 through 3.10.3); and the engine-generator
coupler. The engine-generator does not include either the trailer (see 3.7
through 3.7.4) or the enclosure (see 3.8). Definitions of reliability terms
shall be in accordance with B.3.

3.13.2 Maintainability.

3.13.2.1 Preventive maintenance. The recommended preventive
maintenance interval (PMI) shall be at least 250 (objective: 400)
operating hours. Preventive maintenance tasks shall not require more than
8.0 (objective: 4.0) man-hours.

3.13.2.2 Corrective maintenance. The BPU shall have a mean time to
repair (MTTR) of no greater than 4.0 hours.

3.13.2.3 Inspection and servicing provisions.

a. Routine servicing tasks and pre-use inspections shall require no
hand tools.

b. Drain plugs and filters shall be directly accessible and oriented
to have unimpeded drainage to a catch pan.

c. The BPU shall be designed with maximum usage of sealed
lifetime lubrication bearings.

d. The BPU shall be designed so the correct oil and coolant levels
can be checked while the unit is running.

4.6.20 Reliability tests. All of the preproduction BPUs shall be subjected
to a reliability growth test (RGT) in accordance with 4.6.20.1 and a
reliability qualification test (RQT) in accordance with 4.6.20.2. Three
BPUs shall be tested using JP-8 turbine fuel and three BPUs shall be
tested using DF-2. The remaining two BPUs shall be tested utilizing JP-8
prior to the first PMI, DF-2 from the first PMI to the second PMI, JP-8
from the second PMI to the third PMI, and so on, alternating fuels to the
conclusion of testing. All of the BPUs shall be operated in parallel with
each other and shall be loaded in accordance with the cyclic load schedule
below at 60 Hz and a power factor of 1.00. The cycle shall be repeated as
required to complete the specified test time. All of the BPUs shall be
operated continuously throughout the entire test period, except when a
BPU is taken off line for a PMI or investigation or repair of a failure or
implementation of a corrective action. PMIs shall be staggered, if possible,
so that only one BPU is off line for a PMI at any time. All requirements of
Appendix B shall apply to the reliability tests.

Cyclic Load Schedule

Percent of Rated Load Number of Hours at Each Load

50 24
75 24
25 24
100 24

4.6.20.1 RGT. An RGT shall be performed in order to identify and
eliminate systemic failure modes so as to increase the probability of an
“accept” decision at the conclusion of the RQT (see 4.6.20.2). The RGT
shall be planned, performed, and monitored using the Duane reliability
growth model; 5.5.1 of MIL-HDBK-781A shall be used as guidance. The
underlining assumption of the Duane model is that the plot of MTBF
versus time is a straight line on log-log plot. The intent is to achieve an
instantaneous reliability of at least 2,250 hours (the upper test limit for the
RQT) by the conclusion of the RGT. The BPUs shall be subjected to a
cumulative 11,250 hours of testing in accordance with 4.6.20. The
contractor shall develop a planned RGT curve so that progress can be
monitored and the RGT process can be revised as necessary. Success of
the RGT is dependent on the failure reporting, analysis, and corrective
action system (FRACAS) (see B.4.1).

Instantaneous reliability (MTBFi) shall be calculated by:

MTBFi = MTBFc / (1 − α),

where

MTBFc is cumulative reliability (mean time between failure) and
α (alpha) is the Duane reliability growth rate.

α shall be determined by regression analysis of the failures. Cumulative
reliability shall be calculated by:

MTBFc = T / n,

where

T is cumulative test time and
n is the number of failures.

Failure purging, either the removal of a failure from the RGT tracking
process after the corrective action for that failure has been implemented
and its effectiveness verified or the removal of all but the first occurrence
of a failure mode, shall not be allowed.

Cumulative test time at any failure shall be calculated by adding the test
time of the failed BPU with that of each of the other BPUs as recorded on
their data logs prior to the time of the failure.

In the event that the instantaneous reliability is less than 90 percent of the
planned reliability, accumulation of RGT hours shall cease until the
contractor has incorporated sufficient corrective actions so that the
projected reliability is greater than the planned growth curve.
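
A minimal Python sketch of the Duane bookkeeping described above; the
failure times are hypothetical, and a real RGT would use the FRACAS-
verified failure log:

    import numpy as np

    # Hypothetical cumulative test hours at each chargeable failure.
    failure_times = np.array([120.0, 400.0, 950.0, 1900.0, 3400.0, 5600.0])
    n = np.arange(1, len(failure_times) + 1)
    mtbf_c = failure_times / n            # cumulative MTBF = T / n

    # Duane model: log(MTBF_c) is linear in log(T); the slope is alpha.
    alpha, intercept = np.polyfit(np.log(failure_times), np.log(mtbf_c), 1)

    # Instantaneous MTBF at the current cumulative test time.
    mtbf_i = mtbf_c[-1] / (1.0 - alpha)
    print(f"alpha = {alpha:.2f}, instantaneous MTBF = {mtbf_i:.0f} hours")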
4.6.20.2 RQT. After successful completion of the RGT, a 6,975 hour
fixed-duration RQT shall be performed to demonstrate compliance with
3.13.1. Nominal consumer’s and producer’s risks shall be 10 percent; the
discrimination ratio shall be 3.0, and no more than five failures shall be
allowed. (Ref. Test Plan XV-D of MIL-HDBK-781A). Configuration
changes shall not be made during the RQT without approval of the
procuring activity.
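
As a cross-check against Appendix 8, Test Plan XV-D specifies a test
duration of 9.3 multiples of the lower test MTBF and five allowable
failures: 9.3 × 750 hours ≈ 6,975 hours, and the upper test MTBF is
3.0 × 750 = 2,250 hours, the instantaneous-reliability goal of the RGT in
4.6.20.1.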

4.6.21 Maintainability demonstration. Corrective maintenance tasks shall
be selected by the procuring activity from an approved list of tasks
provided by the contractor and shall be performed. This list shall represent
all expected failures over the life of the BPU. Thirty of the tasks shall be
selected by the Government. As part of this demonstration, the
recommended frequencies of the scheduled maintenance tasks and the
times recorded to accomplish the tasks shall be used to develop an
expected value of scheduled maintenance time per measure of use, such as
calendar time and hours of operation. The capability of performing both
preventive and corrective maintenance tasks by personnel wearing arctic
mittens and MOPP Level 4 Chemical Warfare Gear shall be demonstrated.

Example Reliability Appendix from the Draft BPU PD
(PD05WRLEEG11, dated 23 Apr 2007)

APPENDIX B

RELIABILITY

B.1 SCOPE

B.1.1 Scope. This appendix provides definitions and details a failure
reporting, analysis, and corrective action system (FRACAS) for use during
the preproduction and operational tests. This appendix is a mandatory part
of the specification. The information contained herein is intended for
compliance.

B.2 APPLICABLE DOCUMENTS. The following documents are
applicable to the PD to the extent specified herein.

B.2.1 Government documents.

B.2.1.1 Specifications, standards, and handbooks. The following
specifications, standards, and handbooks of the exact revision listed below
form a part of this specification to the extent specified herein.

DEPARTMENT OF DEFENSE HANDBOOKS

MIL-HDBK-470A   Designing and Developing Maintainable
                Products and Systems
MIL-HDBK-781A   Handbook for Reliability Test Methods,
                Plans, and Environments for Engineering,
                Development, Qualification, and Production

(Copies of these documents are available online at
http://assist.daps.dla.mil/quicksearch/ or www.dodssp.daps.mil or from the
Standardization Document Order Desk, 700 Robbins Avenue, Building
4D, Philadelphia, PA 19111-5094.)

B.3 DEFINITIONS

B.3.1 Discrimination ratio. (d) is one of the standard test plan parameters;
it is the ratio of the upper test MTBF (θ0) to the lower test MTBF (θ1);
that is, d = θ0 / θ1. (Ref. MIL-HDBK-781A)

B.3.2 Failure. The event, or inoperable state, in which any item or part of
an item does not, or would not, perform as previously specified. (Ref.
MIL-HDBK-470A)

B.3.3 Failure, chargeable. A failure that is not non-chargeable.

B.3.4 Failure, non-chargeable. A failure that is a non-relevant failure; that
is induced by Government furnished equipment or by operating,
maintenance, or repair procedures; or that is of a part having a specified
life expectancy and operated beyond the specified replacement time of the
part.

B.3.5 Failure, intermittent. Failure for a limited period of time, followed
by the item’s recovery of its ability to perform within specified limits
without any remedial action. (Ref. MIL-HDBK-470A)

B.3.6 Failure, non-relevant. A failure caused by installation damage;
accident or mishandling; failure of the test facility or test-peculiar
instrumentation; an externally applied overstress condition, in excess of
the approved test requirements; normal operating adjustments specified in
the approved operating instructions; or human error. A secondary failure
within the test sample, which is directly caused by a non-relevant or
relevant primary failure, is also a non-relevant failure. The secondary
failure must be proved to be dependent on the primary failure.

B.3.7 Failure, relevant. An intermittent failure; an unverified failure (a
failure which cannot be duplicated, which is still under investigation, or
for which no cause could be determined); a verified failure not otherwise
excluded as a non-relevant failure; or a pattern failure.

B.3.8 MTBF, lower test. (θ1) is the minimum acceptable value of MTBF.
The standard test plans will reject, with high probability,
equipment with a true MTBF that approaches (θ1). The lower test MTBF
is the required MTBF. (Ref. MIL-HDBK-781A)

B.3.9 MTBF, upper test. (θ0) is an acceptable value of MTBF equal to the
discrimination ratio times the lower test MTBF (θ1). The standard test
plans will accept, with high probability, equipment with a true MTBF that
approaches (θ0). This value (θ0) should be realistically attainable, based on
experience and information. The upper test MTBF is also known as the
“design to” MTBF. (Ref. MIL-HDBK-781A)

B.3.10 MTBF, predicted. (θp) is that value of MTBF determined by
reliability prediction methods; it is a function of the equipment design and
the use environment. (θp) should be equal to or greater than (θ0) in value,
to ensure with high probability, that the equipment will be accepted during
the reliability qualification test. (Ref. MIL-HDBK-781A)

B.3.11 Risk, consumer’s. (β) is the probability of accepting equipment with
a true mean-time-between-failures (MTBF) equal to the lower test MTBF
(θ1). The probability of accepting equipment with a true MTBF less than
the lower test MTBF (θ1) will be less than (β). (Ref. MIL-HDBK-781A)

B.3.12 Risk, producer’s. (α) is the probability of rejecting equipment
which has a true MTBF equal to the upper test MTBF (θ0). The probability
of rejecting equipment with a true MTBF greater than the upper test
MTBF will be less than (α). (Ref. MIL-HDBK-781A)

B.4 REQUIREMENTS

B.4.1 Failure reporting, analysis, and corrective action system (FRACAS).
A closed loop system shall be used to collect data on, analyze, and record
timely corrective action for all failures that occur during the preproduction
and operational tests. The contractor's existing FRACAS shall be utilized
with the minimum changes necessary to conform to this specification. The
system shall cover all test samples, interfaces between test samples, test
instrumentation, test facilities, test procedures, test personnel, and the
handling and operating instructions.

B.4.1.1 Problem and failure action. At the occurrence of a problem or
failure that affects satisfactory operation of a test sample, entries shall be
made in the appropriate data logs and the failed test sample shall be
removed from test, with minimum interruption to the other test samples
continuing on test.

B.4.1.1.1 Problem and failure reporting. A failure report shall be initiated
at the occurrence of each problem or failure of the contractor hardware or
software, or of Government-furnished equipment (GFE). The report shall
contain the information required to permit determination of the origin and
correction of failures. The existing failure report forms may be used with
minimum changes necessary to conform to the requirements of this
specification and shall include the information specified in a through c:

a. Descriptions of failure symptoms, conditions surrounding the
failure, failed hardware identification, and operating time (or
cycles) at the time of failure.

b. Information on each independent and dependent failure and the
extent of confirmation of the failure symptoms, the
identification of failure modes, and a description of all repair
actions taken to return the test sample to operational readiness.

c. Information describing the results of the investigation, the
analysis of all part failures, an analysis of the system design,
and the corrective action taken to prevent failure recurrence. If
no corrective action is taken, the rationale for this decision
shall be recorded.

B.4.1.1.2 Identification and control of failed items. A failure tag shall be
affixed to the failed part immediately upon the detection of any failure or
suspected failure. The failure tag shall provide space for the failure report
serial number and for other pertinent entries from the test sample failure
record. All failed parts shall be marked conspicuously or tagged and
controlled to ensure disposal in accordance with contract requirements.
Failed parts shall not be handled in any manner which may obliterate facts
which might be pertinent to the analysis. Failed parts shall be stored
pending disposition by the authorized approval agency of the failure
analysis.

B.4.1.1.3 Problem and failure investigations. An investigation and analysis
of each reported failure shall be performed. Investigation and analysis
shall be conducted to the level of hardware or software necessary to
identify causes, mechanisms, and potential effects of the failure. Any
applicable method (e.g., test, microscopic analysis, applications study,
dissection, X-ray analysis, spectrographic analysis, et cetera) of
investigation and analysis which may be needed to determine failure cause
shall be used. When the removed part is not defective or the cause of
failure is external to the part, the analysis shall be extended to include the
circuit, higher hardware assembly, test procedures, and subsystem if
necessary. Investigation and analysis of GFE failures shall be limited to
verifying that the GFE failure was not the result of the contractor's
hardware, software, or procedures. This determination shall be
documented for notification of the procuring activity.

B.4.1.1.4 Failure verification. Reported failures shall be verified as actual
failures or an acceptable explanation provided to the procuring activity for
lack of failure verification. Failure verification is determined either by
repeating the failure mode of the reported part or by physical or electrical
evidence of failure (leakage residue, damaged hardware, etc.). Lack of
failure verification, by itself, is not sufficient rationale to conclude the
absence of a failure.

B.4.1.1.5 Corrective action. When the cause of failure has been
determined, a corrective action shall be developed to eliminate or reduce
the recurrence of the failure. Repairs shall be made in accordance with
normal field operating procedures and manuals. The procuring activity
shall review the corrective actions at the scheduled test status review prior
to implementation. In all cases the failure analysis and the resulting
corrective actions shall be documented.

B.4.1.1.6 Problem and failure tracking and closeout. The closed loop
failure reporting system shall include provisions for tracking problems,
failures, analyses, and corrective actions. Status of corrective actions for
all problems and failures shall be reviewed at scheduled test status
reviews. Problem and failure closeouts shall be reviewed to assure their
adequacy.

B.4.2 Failure categories. All failures shall be classified as relevant or non-
relevant. Relevant failures shall be further classified as chargeable or non-
chargeable. The procuring activity will make the final determination of
failure classifications.

B.5 TESTING PROVISIONS

B.5.1 Reliability test requirements. The reliability tests shall be conducted
in accordance with the reliability test procedures which have been
approved by the procuring activity. Testing shall be continued until a
reject decision has been reached or the total required test time has been
completed, whichever comes first.

B.5.2 Reliability test records. Reliability test records shall be maintained
as specified in the approved test procedure.

B.5.3 Performance parameter measurements. The test sample performance
parameters to be measured and the frequency of measurement shall be as
specified herein. When the value of any required performance parameter is
not within specified limits, a failure shall be recorded. If the exact time of
failure cannot be determined, the failure shall be presumed to have
occurred at the time of the last recorded observation or successful
measurement of that same parameter. Observations and measurements
shall be made at the specified interval and recorded during the test cycle.
At least one set of measurements shall be recorded when a test sample is
first energized after any specified shutdown period.

B.5.4 Reliability compliance. Reliability compliance shall be reviewed by
the procuring activity after each test sample failure is categorized or at any
other appropriate time. Compliance shall be based on the total
accumulated test time and the total number of chargeable failures at the
time of the review.

Example Paragraphs from the Draft BPU SOW
(dated 15 Feb 2007)

3.6.2 Reliability and maintainability.

3.6.2.1 Reliability.

3.6.2.1.1 Basic reliability model. The contractor shall develop and
maintain a basic reliability model for the BPU engine-generator (see
3.13.1 of the PD). All equipment and associated quantities comprising
these parts shall be included in the model. All equipment, including those
intended solely for item redundancy and alternate modes of operation,
shall be modeled in series. A basic reliability block diagram shall be
developed and maintained for the items with associated allocations and
predictions in each reliability block. The basic reliability block diagram
shall be keyed and traceable to functional block diagrams, drawings, and
schematics, and shall provide the basis for accurate mathematical
representation of basic reliability. Nomenclature of elements of the item
used in the basic reliability block diagrams shall be consistent with that
used in functional block diagrams, drawings, schematics, weight
statements, power budgets, and specifications. The basic reliability model
shall be documented in the design analysis (see 3.6.4) and reviewed at the
design reviews.
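
Because every element of a basic reliability model is in series, the model
reduces (for constant failure rates) to a sum of failure rates:
λbasic = λ1 + λ2 + … + λn, with predicted basic-reliability
MTBF = 1/λbasic. This is why the two series engine-generators of the
DPGDS double the predicted failure rate and halve the predicted MTBF, as
noted in the feasibility analysis above.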

3.6.2.1.2 Basic reliability prediction. The contractor shall prepare and
maintain a basic reliability prediction for the BPU engine-generator (see
3.13.1 of the PD); it shall be based upon the associated basic reliability
model (see 3.6.2.1.1). All equipment and associated quantities comprising
these parts shall be included in the model except for documented
exclusions approved by the procuring activity. Failure rate data (or
equivalent reliability parameters) shall be consistent with the level of
detail of the basic reliability model and availability of procuring activity
approved relevant data sources for a comprehensive prediction (for
example, software reliability, human reliability, storage reliability, etc.).
The prediction shall be based upon the worst-case service use profile. All
data sources for failure rates, failure distribution, and failure rate
adjustment factors (for example, stress factors, duty cycle, etc.) shall be
identified for each reliability block. Data sources shall be MIL-HDBK-
217F (2), NPRD-95, or otherwise approved by the procuring activity. The
basic reliability prediction shall be documented in the design analysis (see
3.6.4) and reviewed at the design reviews.

3.6.2.2 Maintainability. The contractor shall prepare and maintain a
maintainability prediction for mean time to repair (MTTR) using MIL-
HDBK-470A, Appendix D, and MIL-HDBK-472, Notice 1, for guidance.
The model and failure rate data shall be consistent with that of the basic
reliability prediction (see 3.6.2.1.2). The maintainability prediction shall
be documented in the design analysis (see 3.6.4) and reviewed at the
design reviews.

Appendix 3: Summary of χ2 Models87

                       Two-Sided Confidence               Single-Sided Confidence
                       Level Models                       Level Models

Failure Truncated      2CΘ̂/χ²(1−α/2, 2C) ≤ Θ ≤           Θ ≥ 2CΘ̂/χ²(1−α, 2C)
Tests                  2CΘ̂/χ²(α/2, 2C)

Time Truncated         2CΘ̂/χ²(1−α/2, 2C+2) ≤ Θ ≤         Θ ≥ 2CΘ̂/χ²(1−α, 2C+2)
Tests                  2CΘ̂/χ²(α/2, 2C)

Notes:

C = number of failures occurring during the test
α = risk = 1 – confidence level
Θ̂ = point estimate MTBF = test time / C
χ²(P, f) = chi-squared statistical distribution value. P and f are given by
the parenthesized subscripts shown in the above table. P depends on the
confidence interval desired and f depends on the number of failures
occurring.
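
As an illustration, a minimal Python sketch (assuming SciPy is available)
of the single-sided, time-truncated model above:

    from scipy.stats import chi2

    def mtbf_lower_limit(test_time, failures, confidence=0.90):
        # One-sided lower confidence limit on MTBF, time-truncated test:
        # Theta >= 2C*Theta_hat / chi2(1-alpha, 2C+2), and 2C*Theta_hat = 2T.
        dof = 2 * failures + 2
        return 2.0 * test_time / chi2.ppf(confidence, dof)

    # Example: the 6,975-hour RQT with 5 failures demonstrates
    # 2*6975 / 18.549 ≈ 752 hours MTBF at 90 percent confidence.
    print(mtbf_lower_limit(6975.0, 5))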

87. RADC Reliability Engineer’s Toolkit, op. cit., p. A-47.
Appendix 4: Fractiles of the χ2 Distribution88

Degrees of          Probability in Percent
Freedom (f)     10.0        20.0        80.0        90.0
2 0.21072 0.44629 3.2189 4.6052
4 1.0636 1.6488 5.9886 7.7794
6 2.2041 3.0701 8.5581 10.645
8 3.4895 4.5936 11.030 13.362
10 4.8652 6.1791 13.442 15.987
12 6.3038 7.8073 15.812 18.549
14 7.7895 9.4673 18.151 21.064
16 9.3122 11.152 20.465 23.542
18 10.865 12.857 22.760 25.989
20 12.443 14.578 25.038 28.412
22 14.041 16.314 27.301 30.813
24 15.659 18.062 29.553 33.196
26 17.292 19.820 31.795 35.563
28 18.939 21.588 34.027 37.916
30 20.599 23.364 36.250 40.256
32 22.271 25.148 38.466 42.585
34 23.952 26.938 40.676 44.903
36 25.643 28.735 42.879 47.212
38 27.343 30.537 45.076 49.513
40 29.051 32.345 47.269 51.805

88. Ibid., pp. A-48 – A-50. This table has been abridged to include only the
10% and 20% upper and lower confidence levels (those most commonly
used in reliability calculations) and to delete the odd-numbered degrees of
freedom, which are not used in confidence level calculations. It has been
expanded to include more degrees of freedom and more significant digits.
42 30.765 34.157 49.456 54.090
44 32.487 35.974 51.639 56.369
46 34.215 37.795 53.818 58.641
48 35.949 39.621 55.993 60.907
50 37.689 41.449 58.164 63.167
52 39.433 43.281 60.332 65.422
54 41.183 45.117 62.496 67.673
56 42.937 46.955 64.658 69.919
58 44.696 48.797 66.816 72.160
60 46.459 50.641 68.972 74.397
62 48.226 52.487 71.125 76.630
64 49.996 54.337 73.276 78.860
66 51.770 56.188 75.425 81.085
68 53.548 58.042 77.571 83.308
70 55.329 59.898 79.715 85.527
72 57.113 61.756 81.857 87.743
74 58.900 63.616 83.997 89.956
76 60.690 65.478 86.135 92.166
78 62.483 67.341 88.271 94.374
80 64.278 69.207 90.405 96.578
82 66.076 71.074 92.538 98.780
84 67.876 72.943 94.669 100.98
86 69.679 74.813 96.799 103.18
88 71.484 76.685 98.927 105.37
90 73.291 78.558 101.05 107.57
100 82.358 87.945 111.67 118.50
1000 943.13 962.18 1037.4 1057.7
Appendix 5: Factors for Calculating Confidence Levels89

                                               Factor
Failures                      80% Two-Sided/   60% Two-Sided/   60% Two-Sided/   80% Two-Sided/
Time            All Other     90% One-Sided    80% One-Sided    80% One-Sided    90% One-Sided
Terminated      Cases         Lower Limit      Lower Limit      Upper Limit      Upper Limit
Lower Limit
0 1 0.43429 0.62133 4.4814 9.4912
1 2 0.25709 0.33397 1.2130 1.8804
2 3 0.18789 0.23370 0.65145 0.90739
3 4 0.14968 0.18132 0.43539 0.57314
4 5 0.12510 0.14879 0.32367 0.41108
5 6 0.10782 0.12649 0.25617 0.31727
6 7 0.09495 0.11019 0.21125 0.25675
7 8 0.08496 0.09773 0.17934 0.21477
8 9 0.07695 0.08788 0.15556 0.18408
9 10 0.07039 0.07988 0.13719 0.16074
10 11 0.06491 0.07326 0.12259 0.14243
11 12 0.06025 0.06767 0.11073 0.12772
12 13 0.05624 0.06290 0.10091 0.11566
13 14 0.05275 0.05878 0.09264 0.10560
14 15 0.04968 0.05517 0.08560 0.09709

89. The Rome Laboratory Reliability Engineer’s Toolkit, op. cit., p. A-43. This table
has been adapted and abridged to include only the 10% and 20% upper and
lower confidence levels (those most commonly used in reliability
calculations). It has been expanded to include more failures and more
significant digits. Note that The Rome Laboratory Reliability Engineer’s Toolkit is
in the public domain; it can, therefore, be freely distributed.
15 16 0.04697 0.05199 0.07953 0.08980
16 17 0.04454 0.04917 0.07424 0.08350
17 18 0.04236 0.04664 0.06960 0.07799
18 19 0.04039 0.04437 0.06549 0.07314
19 20 0.03861 0.04231 0.06183 0.06885
20 21 0.03698 0.04044 0.05855 0.06501
21 22 0.03548 0.03873 0.05560 0.06156
22 23 0.03411 0.03716 0.05292 0.05845
23 24 0.03284 0.03572 0.05048 0.05563
24 25 0.03166 0.03439 0.04825 0.05307
25 26 0.03057 0.03315 0.04621 0.05072
26 27 0.02955 0.03200 0.04433 0.04856
27 28 0.02860 0.03093 0.04259 0.04658
28 29 0.02772 0.02993 0.04099 0.04475
29 30 0.02688 0.02900 0.03949 0.04305
30 31 0.02610 0.02812 0.03810 0.04147
31 32 0.02536 0.02729 0.03681 0.04000
32 33 0.02467 0.02652 0.03559 0.03863
33 34 0.02401 0.02578 0.03446 0.03735
34 35 0.02338 0.02509 0.03339 0.03615
39 40 0.02071 0.02212 0.02890 0.03111
49 50 0.01688 0.01791 0.02274 0.02428
499 500 0.00189 0.00193 0.00208 0.00212
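
For example, for a time-terminated test with 5 failures, the 90 percent
one-sided lower confidence limit on MTBF is 0.10782 × (total test time);
with the 6,975 test hours of the RQT discussed earlier, 0.10782 × 6,975 ≈
752 hours, which agrees with the chi-squared calculation in Appendix 3.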
Appendix 6: Redundancy Equation Approximations Summary90
Configuration 1: All units are active on-line with equal unit failure rates.
(n−q) out of n required for success.

    With Repair (Equation 1):

        λ(n−q)/n = n!·λ^(q+1) / [(n−q−1)!·μ^q]

    Without Repair (Equation 4):

        λ(n−q)/n = λ / ( Σ 1/i ),  summed over i = n−q to n

Configuration 2: Two active on-line units with different failure and repair
rates. One of two required for success.

    With Repair (Equation 2):

        λ1/2 = λA·λB·[(μA + μB) + (λA + λB)] / [μA·μB + (μA + μB)·(λA + λB)]

    Without Repair (Equation 5):

        λ1/2 = (λA²·λB + λA·λB²) / (λA² + λB² + λA·λB)

Configuration 3: One standby off-line unit with n active on-line units
required for success. Off-line spare assumed to have a failure rate of zero.
On-line units have equal failure rates.

    With Repair (Equation 3):

        λn/(n+1) = n·[nλ + (1−P)·μ]·λ / [μ + n·(P+1)·λ]

    Without Repair (Equation 6):

        λn/(n+1) = nλ / (P+1)

90. The Rome Laboratory Reliability Engineer’s Toolkit, op. cit., p. 90.
Key:
λx/y is the effective failure rate of the redundant configuration where x
of y units are required for success
n = number of active on-line units. n! is n factorial (e.g.,
5!=5x4x3x2x1=120, 1!=1, 0!=1)
λ = failure rate of an individual on-line unit (failures/hour) (note that
this is not the more common failures/10^6 hours)
q = number of on-line active units which are allowed to fail without
system failure
µ = repair rate (µ=1/Mct, where Mct is the mean corrective maintenance
time in hours)
P = probability switching mechanism will operate properly when
needed (P=1 with perfect switching)

Notes:
1. Assumes all units are functional at the start
2. The approximations represent time to first failure
3. CAUTION: Redundancy equations for repairable systems should
not be applied if delayed maintenance is used.
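
A minimal Python sketch of Equations 1 and 4 for a two-out-of-three
active configuration (n = 3, q = 1); the failure and repair rates are assumed
values for illustration:

    from math import factorial

    lam = 100e-6   # unit failure rate, failures/hour (assumed)
    mu = 0.5       # repair rate, repairs/hour (Mct = 2 hours, assumed)
    n, q = 3, 1    # 2-out-of-3 active redundancy

    # Equation 1 (with repair): n! * lam^(q+1) / ((n-q-1)! * mu^q)
    lam_with_repair = factorial(n) * lam**(q + 1) / (factorial(n - q - 1) * mu**q)

    # Equation 4 (without repair): lam / sum of 1/i for i = n-q .. n
    lam_without_repair = lam / sum(1.0 / i for i in range(n - q, n + 1))

    # Effective failure rates of the configuration, failures/hour.
    print(lam_with_repair, lam_without_repair)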

Appendix 7: Summary of MIL-HDBK-781A PRST Test
Plans 91

Test      Producer’s    Consumer’s    Discrimination
Plan      Risk (α)      Risk (β)      Ratio (d)
          (%)           (%)
I-D 11.5 12.5 1.5
II-D 22.7 23.2 1.5
III-D 12.8 12.8 2.0
IV-D 22.3 22.5 2.0
V-D 11.1 10.9 3.0
VI-D 18.2 19.2 3.0
VII-D 31.2 32.8 1.5
VIII-D 29.3 29.9 2.0

91. MIL-HDBK-781A, op. cit., p. 36.


Appendix 8: Summary of MIL-HDBK-781A Fixed-
Duration Test Plans 92

Test      Producer’s    Consumer’s    Discrim-     Test Duration    Maximum
Plan      Risk (α)      Risk (β)      ination      (multiples       Failures
          (%)           (%)           Ratio (d)    of θ1)           to Accept

IX-D      12.0    9.9     1.5    45.0    36
X-D       10.9    21.4    1.5    29.9    25
XI-D      19.7    19.6    1.5    21.5    17
XII-D     9.6     10.6    2.0    18.8    13
XIII-D    9.8     20.9    2.0    12.4    9
XIV-D     19.9    21.0    2.0    7.8     5
XV-D      9.4     9.9     3.0    9.3     5
XVI-D     10.9    21.3    3.0    5.4     3
XVII-D    17.5    19.7    3.0    4.3     2
XIX-D     29.8    30.1    1.5    8.1     6
XX-D      28.3    28.5    2.0    3.7     2
XXI-D     30.7    33.3    3.0    1.1     0

92. Ibid., p. 131.
Appendix 9: Glossary

Army Materiel Systems Analysis Activity (AMSAA) Method – An RGT
method. “The AMSAA Method is based on the assumption that the times
between successive failures can be modeled as the intensity function of a
nonhomogeneous Poisson process.” (MIL-HDBK-781A, p. 16)
Availability – “A measure of the degree to which an item is in an operable
and committable state at the start of a mission when the mission is called
for at an unknown (random) time.” (MIL-HDBK-470A, p. G-2)
Availability, Achieved (Aa) – “Similar to Ai, except that preventive and
scheduled maintenance actions are factored into the Uptime variable
(MTBM).” (Reliability Toolkit: Commercial Practices Edition, p. 12)
Availability, Inherent (Ai) – “A measure of availability that includes only
the effects of an item design and its application, and does not account for
effects of the operational and support environment.” (MIL-HDBK-470A,
p. G-7)
Availability, Operational (Ao) – “Extends the definition of Ai to include
delays due to waiting for parts or processing paperwork in the Downtime
parameter (MDT).” (Reliability Toolkit: Commercial Practices Edition, p.
12)
Consumer’s Risk (β) – “The probability of accepting equipment with a
true mean-time-between-failures (MTBF) equal to the lower test MTBF
(θ1). The probability of accepting equipment with a true MTBF less than
the lower test MTBF (θ1) will be less than (β).” (MIL-HDBK-781A, p. 6)
Corrective Maintenance (CM) – “All actions performed as a result of
failure, to restore an item to a specified condition. Corrective maintenance
can include any or all of the following steps: Localization, Isolation,
Disassembly, Interchange, Reassembly, Alignment, and Checkout.” (MIL-
HDBK-470A, p. G-3)
Corrective Maintenance Time (CMT) – “The time spent replacing,
repairing, or adjusting all items suspected to have been the cause of the
malfunction, except those subsequently shown by interim test of the
system not to have been the cause.” (MIL-HDBK-470A, p. G-15)
Dependability (Do) – “A measure of the degree to which an item is
operable and capable of performing its required function at any (random)
time during a specified mission profile, given item availability at the start
of the mission.” (MIL-HDBK-470A, pp. G-3 – G-4)

Discrimination Ratio (d) – “One of the standard test plan parameters; it is
the ratio of the upper test MTBF (θ0) to the lower test MTBF (θ1); that is,
d = θ0 / θ1.” (MIL-HDBK-781A, p. 6)
Downtime – “That element of time during which an item is in an
operational inventory but is not in condition to perform its required
function.” (MIL-HDBK-338B, p. 3-5)
Duane Model – An RGT model that is based on the observation that a log-
log plot of the cumulative reliability versus cumulative test time will be a
straight line. It is named after its developer, J. T. Duane. (MIL-HDBK-
781A, p. 19)
Failure – “The event, or inoperable state, in which any item or part of an
item does not, or would not, perform as previously specified.” (MIL-
HDBK-470A, p. G-5)
Failure Purging – The removal of a failure from the RGT tracking process
after the corrective action for that failure has been implemented and its
effectiveness verified.
Failure Rate (λ(t)) – “The ratio of probability that failure occurs in the
interval, given that it has not occurred prior to t1, the start of the interval,
divided by the interval length.” (MIL-HDBK-338B, p. 5-2)
Failure Reporting And Corrective Action System (FRACAS) – “A closed
loop system … used to collect data on, analyze, and record timely
corrective action for all failures that occur during reliability tests. The
system should cover all test items, interfaces between test items, test
instrumentation, test facilities, test procedures, test personnel, and the
handling and operating instructions.” (MIL-HDBK-781A, p. 11)
Failure, Critical – “A failure or combination of failures that prevents an
item from performing a specified mission.” (MIL-HDBK-338B, p. 3-6)
Failure, Dependent – “A failure of one item caused by the failure of an
associated item(s). A failure that is not independent.” (MIL-HDBK-338B,
p. 3-6)
Failure, Non-Chargeable – A non-relevant failure; a failure induced by
Government furnished equipment (GFE); or a failure of parts having a
specified life expectancy and operated beyond the specified replacement
time of the parts (e.g., wear out of a tire when it has exceeded its specified
life expectancy). (Based upon MIL-STD-721C, p. 4)

Failure, Non-Relevant – A failure caused by installation damage; accident
or mishandling; failures of the test facility or test-peculiar instrumentation;
caused by an externally applied overstress condition, in excess of the
approved test requirements; normal operating adjustments (non-failures)
specified in the approved technical orders; dependent failures within the
test sample, which are directly caused by non-relevant or relevant primary
failures; or caused by human errors. (Based upon MIL-STD-721C, p. 4)
Failure, Relevant, Chargeable – Any failure other than a non-chargeable
failure.
Failure, Secondary – Another term for dependent failure.
Hazard Rate (h(t)) – “The limit of the failure rate as the interval length
approaches zero.” Also known as the instantaneous failure rate. (MIL-
HDBK-338B, p. 5-2)
Life Cycle Cost (LCC) – The sum of acquisition, logistics support,
operating, and retirement and phase-out expenses. (MIL-HDBK-470A, p.
G-8)
Lower Test MTBF (θ1) – The lowest value of MTBF which is acceptable.
“The standard test plans will reject, with high probability, equipment with
a true MTBF that approaches (θ1).” (derived from MIL-HDBK-781A, p.
7)
Maintainability – “The relative ease and economy of time and resources
with which an item can be retained in, or restored to, a specified condition
when maintenance is performed by personnel having specified skill levels,
using prescribed procedures and resources, at each prescribed level of
maintenance and repair.” (MIL-HDBK-470A, p. G-8, definition (1))
Maintenance Action – “An element of a maintenance event. One or more
tasks (i.e., fault localization, fault isolation, servicing, and inspection)
necessary to retain an item in or restore it to a specified condition.” (MIL-
HDBK-470A, p. G-9)
Maintenance Event – “One or more maintenance actions required to effect
corrective and preventative maintenance due to any type of failure or
malfunction, false alarm, or scheduled maintenance plan.” (MIL-HDBK-
470A, p. G-9)
Mean Downtime (MDT) – The average time during which an item is in an
operational inventory but is not in condition to perform its required
function. (derived from MIL-HDBK-338B, p. 3-5)
Mean Repair Time (MRT) – Another term for Mean Time To Repair
(MTTR).
Mean Time Between Critical Failure (MTBCF) – A measure of mission
(critical) reliability.
Mean Time Between Failure (MTBF) – “A basic measure of reliability for
repairable items. The mean number of life units during which all parts of
the item perform within their specified limits, during a particular
measurement interval under stated conditions.” (MIL-HDBK-470A, p.
G-11)
Mean Time To Failure (MTTF) – “A basic measure of reliability for non-
repairable items. The total number of life units of an item population
divided by the number of failures within that population, during a
particular measurement interval under stated conditions.” (MIL-HDBK-
338B, p. 3-12)
Mean Time To Repair (MTTR) – “The sum of corrective maintenance
times at any specific level of repair, divided by the total number of failures
within an item repaired at that level during a particular interval under
stated conditions.” (MIL-HDBK-470A, p. G-11)
Mean Time To Restore System (MTTRS) – “A measure of the product
maintainability parameter, related to availability and readiness: The total
corrective maintenance time, associated with downing events, divided by
the total number of downing events, during a stated period of time.
(Excludes time for off-product maintenance and repair of detached
components.)” (MIL-HDBK-338B, p. 3-13)
Predicted MTBF (θp) – “That value of MTBF determined by reliability
prediction methods; it is a function of the equipment design and the use
environment. (θp) should be equal to or greater than (θ0) in value, to ensure
with high probability, that the equipment will be accepted during the
reliability qualification test.” (MIL-HDBK-781A, p. 7)
Preventive Maintenance (PM) – “All actions performed in an attempt to
retain an item in specified condition by providing systematic inspection,
detection, and prevention of incipient failures.” (MIL-HDBK-470A, p. G-
14)
Producer’s Risk (α) – “The probability of rejecting equipment which has a
true MTBF equal to the upper test MTBF (θ0). The probability of rejecting
equipment with a true MTBF greater than the upper test MTBF will be
less than (α).” (MIL-HDBK-781A, p. 6)
Redundancy – “The existence of more than one means for accomplishing
a given function. Each means of accomplishing the function need not
necessarily be identical. The two basic types of redundancy are active and
standby.” (MIL-HDBK-338B, p. 3-16)
Redundancy, Active – “Redundancy in which all redundant items operate
simultaneously.” (MIL-HDBK-338B, p. 3-16)

Redundancy, Standby – “Redundancy in which some or all of the
redundant items are not operating continuously but are activated only upon
failure of the primary item performing the function(s).” (MIL-HDBK-
338B, p. 3-16)
Reliability – “The probability that an item can perform its intended
function for a specified interval under stated conditions.” (MIL-HDBK-
470A, p. G-15, definition (2))
Reliability Demonstration Test (RDT) – Another term for Reliability
Qualification Test (RQT).
Reliability Growth Test (RGT) – “A series of tests conducted to disclose
deficiencies and to verify that corrective actions will prevent recurrence in
the operational inventory. (Also known as “TAAF” testing.)” (MIL-STD-
785B, p. 3, definition for “Reliability development/growth test (RD/GT)”)
Reliability Qualification Test (RQT) – “A test conducted under specified
conditions, by, or on behalf of, the government, using items representative
of the approved production configuration, to determine compliance with
specified reliability requirements as a basis for production approval. (Also
known as a “Reliability Demonstration,” or “Design Approval” test.)”
(MIL-STD-785B, p. 3, definition for “Reliability qualification test
(RQT)”)
Reliability, Basic – “Measure of system’s ability to operate without
logistics support.” (Rome Laboratory Reliability Engineer’s Toolkit, p. 11)
Reliability, Critical – Another term for mission reliability.
Reliability, Logistics – Another term for basic reliability. (Rome
Laboratory Reliability Engineer’s Toolkit, p. 11)
Reliability, Mission – “Measure of system’s ability to complete mission.
Consider[s] only failures that cause [or would cause] a mission abort.”
(Rome Laboratory Reliability Engineer’s Toolkit, p. 11)
Reliability, Observed – “A point estimate of reliability equal to the
probability of survival for a specified operating time, t, given that the
equipment was operational at the beginning of the period.” (MIL-HDBK-
781A, p. 7)
Repair Time – Another term for Corrective Maintenance Time.
Repairable Item – “An item which, when failed, can be restored by
corrective maintenance to an operable state in which it can perform all
required functions.” (MIL-HDBK-338B, p. 3-17)
Scheduled Maintenance – “Periodic prescribed inspection and/or servicing
of products or items accomplished on a calendar, mileage, or hours
operation basis.” (MIL-HDBK-470A, p. G-15)

Test Analyze And Fix (TAAF) – Another term for Reliability Growth Test
(RGT).
Time, Active – “That time during which an item is in an operational
inventory.” (MIL-HDBK-470A, p. G-1)
Time, Standby – Time during which an item is ready to operate, but not in
operation.
Unscheduled Maintenance – “Corrective maintenance performed in
response to a suspected failure.” (MIL-HDBK-470A, p. G-17)
Upper Test MTBF (θ0) – “An acceptable value of MTBF equal to the
discrimination ratio times the lower test MTBF (θ1). The standard test
plans will accept, with high probability, equipment with a true MTBF that
approaches (θ0). This value (θ0) should be realistically attainable, based on
experience and information.” (MIL-HDBK-781A, p. 7)
Uptime – “Hours that product is in customer’s possession and works”
(Reliability Toolkit: Commercial Practices Edition, p. 11)
Utilization Rate (UR) – “The planned or actual number of life units
expended, or missions attempted during a stated interval of calendar time.”
(MIL-HDBK-470A, p. G-17)

