
QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL, VOL. 6, 251-257 (1990)

WHAT IS WRONG WITH THE EXISTING RELIABILITY PREDICTION METHODS?
KAM L. WONG
Kambea Industries, 1130 Ronda Dr., Manhattan Beach, California 90266, U.S.A.

SUMMARY
Inaccurate reliability predictions could lead to disasters such as the U.S. Space Shuttle failure. The question is: ‘what is wrong with the existing reliability prediction methods?’ This paper examines the methods for predicting the reliability of electronics. According to information in the literature, measured and predicted reliability can differ by a factor of five to twenty. Reliability calculated using the five most commonly used handbooks showed that there could be a 100 times variation. The root cause of the prediction inaccuracy is that many of the first-order factors are not explicitly included in the prediction methods. These factors include thermal cycling, temperature change rate, mechanical shock, vibration, power on/off, supplier quality difference, reliability improvement with respect to calendar years, and ageing. As the data provided in this paper indicate, neglecting any one of these factors could change the predicted reliability by several times. The reliability vs ageing-hour curve showed a 10 times change in reliability from 1000 ageing-hours to 10,000 ageing-hours. Therefore, to increase the accuracy of reliability prediction, these factors must be incorporated into the prediction methods.

KEY WORDS Reliability prediction; Mil-Hdbk-217; Environmental factors; Environmental stress screening; Ageing

INTRODUCTION

A quantitative measure of reliability is a necessity for making intelligent decisions in the deployment and operation of a system. Inaccurate reliability predictions could lead to disasters such as the U.S. Space Shuttle failure and the failure of the U.S. Iran rescue mission; an analysis of the rescue mission's success probability was given after the fact in Reference 1. In the other direction, pessimistic reliability predictions could lead to overstocking of spares and waste of money. The question is: ‘what is wrong with the existing reliability prediction methods?’ In this paper the problems with the reliability prediction methods for electronics will be examined.

THE RELIABILITY PREDICTION INACCURACY PROBLEM

When a system becomes available its quantitative reliability can be measured. However, if the system did not meet its reliability requirement, by then it would be too late. Indeed, that was the situation that led RCA to publish the first comprehensive reliability prediction handbook, TR1100, in 1956. It was hoped that, using the methods provided in the handbook, electronic equipment reliability could be predicted. Compared with current reliability prediction techniques, TR1100 was rather crude. Refinements were made and the document became the forerunner of Mil-Hdbk-217. Many more part types and many new parameters were added to reach the present version, 217E. In the meantime, commercial companies also attempted to develop their own reliability prediction methods. In particular, Bell Communications Research (Bellcore) has developed a model that is one step ahead of 217E: it partially takes into account decreasing equipment failure rates. Unfortunately, even with all these improvements the reliability predictions are still far from accurate, as indicated in Figure 1 and Table I. Figure 1, which was compiled many years ago, shows how inaccurate the predictions made using Mil-Hdbk-217B were. Table I (Reference 2), containing a comparison of Mil-Hdbk-217D predicted reliabilities vs experienced reliabilities, shows that the predictions were again very inaccurate.

The various methods, such as Mil-Hdbk-217 and the Bellcore reliability prediction procedure (RPP), generally include the following input parameters: part complexity, part technology, package technology, part application, electrical stress, temperature, manufacturing quality control level and gross environment. The mathematical techniques used to combine these parameters into reliability numbers include exponential expressions, multipliers and additions. The models are often simple approximations without scientific basis. For this reason, and because many of the critical factors affecting reliability were not even considered in the models, the predictions simply do not give the correct reliability numbers. Table II, from Reference 3, shows reliability predictions of several memory boards using various prediction procedures. The prediction procedures used were:
0748-8017/90/040251-07$05.00 Received 23 May 1990
© 1990 by John Wiley & Sons, Ltd. Revised 23 May 1990
252 K. L. WONG

Figure 1. Memory devices observed and Mil-Hdbk-217B predicted failure rates (log-log plot of observed vs predicted failure rate, with point estimates and upper and lower confidence limits)

(a) RPP: Bellcore reliability prediction procedure
(b) 217D: Mil-Hdbk-217D
(c) BT: British Telecom Handbook
(d) CNET: French National Center of Telecommunications failure data compilation
(e) NTT: Nippon Telegraph and Telephone Reliability Table
(f) Supplier A or B: memory board supplier’s own procedure

The predictions in Table II clearly indicate that each method came up with its own reliability number, different from the others. No wonder the predictions simply do not match the experienced numbers. Part of the reason for the inaccuracy lies in the inaccuracy of the values assigned to the various input parameters, which in turn affect the failure rates. Expanding the failure database to correlate the parameters with the failure rates can improve the accuracy in this area. The most important reason for the inaccuracy, however, is that many critical factors that influence reliability in a first-order manner are not even considered in the existing models. These critical factors will be discussed individually in the remainder of this paper.
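The prediction methods combine their inputs through exponential expressions, multipliers and additions, and Table II quotes each board failure rate both in FITs and as a percentage per year. The sketch below illustrates that generic structure under stated assumptions. The Arrhenius-style temperature term, the factor names and every numeric value in the hypothetical board are illustrative only; this is not the Mil-Hdbk-217 or Bellcore formulation. The last function assumes that Table II's percentage is expected failures per 100 boards per year (FITs x 8760 h x 1e-9 x 100), which would explain how the 217D row can exceed 100 per cent.

```python
import math
from typing import Sequence

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant, eV/K
HOURS_PER_YEAR = 8760.0


def temperature_factor(temp_c: float, activation_ev: float, ref_temp_c: float = 25.0) -> float:
    """Exponential (Arrhenius-type) multiplier relative to a reference temperature (assumed form)."""
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    return math.exp(activation_ev / BOLTZMANN_EV_PER_K * (1.0 / t_ref - 1.0 / t))


def part_rate_fit(base_fit: float, *multipliers: float) -> float:
    """Multiplicative step: base failure rate times each adjustment factor, in FITs."""
    rate = base_fit
    for m in multipliers:
        rate *= m
    return rate


def board_rate_fit(part_rates_fit: Sequence[float]) -> float:
    """Additive step: for a series system the part failure rates simply add."""
    return sum(part_rates_fit)


def fit_to_percent_per_year(rate_fit: float) -> float:
    """Expected failures per 100 boards per year (1 FIT = 1 failure per 1e9 hours)."""
    return rate_fit * 1e-9 * HOURS_PER_YEAR * 100.0


# Hypothetical board: every number here is illustrative, not taken from any handbook.
parts = [
    part_rate_fit(20.0, temperature_factor(55.0, 0.4), 2.0, 1.0),  # e.g. a memory IC
    part_rate_fit(5.0, temperature_factor(55.0, 0.2), 2.0, 1.0),   # e.g. a passive part
]
board = board_rate_fit(parts)
print(f"board: {board:.1f} FITs = {fit_to_percent_per_year(board):.4f} % per year")

# Converting the RPP row of Table II with the same assumption:
print(f"RPP: {fit_to_percent_per_year(38_500):.1f} % per year")  # RPP: 33.7 % per year
```

The converted values track the tabulated percentages closely (33.7 against the printed 33 for RPP), so the table appears to truncate rather than round.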
WHAT IS WRONG? 253

Table I. System operating reliability summary

Columns: (1) equipment type; (2) number of failures; (3) 1985 cumulative period hours; (4) 1985 cumulative demonstrated MTBF; (5) 90 per cent confidence MTBF; (6) predicted MTBF; (7) cumulative field MTBF divided by predicted MTBF; (8) low and (9) high of the range of quarterly MTBF values. MTBF values are in hours.

(1)  (2)  (3)        (4)      (5)      (6)      (7)    (8)      (9)
A    34   2,493,812  73,347   58,316   52,726   1.39   39,322   384,873
B    33   1,710,767  51,841   41,074   46,873   1.11   45,156   67,933
C    49   5,850,567  119,399  98,745   45,626   2.56   77,023   210,665
D    12   8,344,380  695,365  469,270  75,648   9.19   340,716
E    6    1,710,767  285,128  162,434  57,830   4.93   149,546
F    18   4,157,758  230,987  168,006  75,455   3.06   161,565  741,667
G    263  4,153,484  15,923   14,575   3,794    4.20   13,322   16,853
H    1    262,656    262,656  67,526   10,972   23.94  51,840
I    3    262,656    87,552   39,315   4,822    18.16  43,249
J    36   1,106,135  30,726   24,595   13,305   2.31   22,921   66,203
K    6    119,351    19,892   11,332   72,225   0.28   14,350   46,147
L    6    43,776     7,296    4,156    7,690    0.95   4,320    8,832
M    35   1,062,359  30,353   24,217   28,957   1.05   17,381   63,259
N    91   2,124,718  23,349   20,345   4,895    4.77   16,591   44,654
O    3    43,776     14,592   6,553    5,131    2.84   7,344
P    0    43,776     -        18,992   18,054   -
Q    62   4,157,758  67,061   56,713   28,443   2.36   53,855   81,356
R    6    2,078,879  346,480  197,386  85,329   4.06   203,390  741,667
S    41   2,078,879  50,704   41,177   35,291   1.44   35,317   103,863
T    10   1,040,471  104,047  87,534   109,355  0.95   76,577   363,441
U    2    986,784    493,392  185,406  94,883   5.20   172,572
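The derived columns of Table I can be recomputed from the failure counts and cumulative hours. The sketch below assumes two formulas that the paper does not state: the demonstrated MTBF is cumulative hours divided by failures, and the 90 per cent confidence MTBF is the usual one-sided lower bound for a time-terminated test, 2T/chi-square(0.90; 2r + 2). Spot-checking rows A, E and H this way reproduces the tabulated values to within a few hours. The chi-square quantile is computed with the standard library only, using the Poisson identity that holds for even degrees of freedom.

```python
import math


def chi2_cdf_even(x: float, dof: int) -> float:
    """CDF of a chi-square variable with an even number of degrees of freedom.

    Uses P(chi2_{2k} <= x) = 1 - P(Poisson(x / 2) <= k - 1).
    """
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    lam = x / 2.0
    term = math.exp(-lam)  # Poisson P(0)
    poisson_cdf = term
    for i in range(1, dof // 2):
        term *= lam / i    # Poisson P(i) from P(i - 1)
        poisson_cdf += term
    return 1.0 - poisson_cdf


def chi2_quantile_even(p: float, dof: int) -> float:
    """Invert chi2_cdf_even by bisection."""
    lo, hi = 0.0, float(dof)
    while chi2_cdf_even(hi, dof) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def demonstrated_mtbf(hours: float, failures: int) -> float:
    """Point estimate: cumulative operating hours per failure (assumed definition)."""
    return hours / failures


def lower_mtbf(hours: float, failures: int, confidence: float = 0.90) -> float:
    """One-sided lower confidence bound on MTBF for a time-terminated test (assumed)."""
    return 2.0 * hours / chi2_quantile_even(confidence, 2 * failures + 2)


# (equipment type, failures, cumulative hours, predicted MTBF) taken from Table I
rows = [("A", 34, 2_493_812, 52_726), ("E", 6, 1_710_767, 57_830), ("H", 1, 262_656, 10_972)]
for name, r, t, predicted in rows:
    mtbf = demonstrated_mtbf(t, r)
    print(f"{name}: MTBF {mtbf:,.0f} h, 90% lower {lower_mtbf(t, r):,.0f} h, "
          f"field/predicted {mtbf / predicted:.2f}")
```

Across the table, the field/predicted ratio in column (7) runs from 0.28 (type K) to 23.94 (type H), which is the inaccuracy the text refers to.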

Table II. Predicted reliability of a large memory board

              Board failure rate
Procedure     FITs          Percentage per year
RPP           38,500        33
217D          4,240,460     3713
BT            700           0.6
CNET          37,870        33
NTT           37,940        33
Supplier A    56,280        49
Supplier B    19,600        17

THE REAL ENVIRONMENTS

The so-called environmental factors used in the existing reliability prediction models are usually multipliers that modify the basic failure rates to reflect the influence of the particular environments on the generation of failures. These factors are usually classified into broad categories such as ‘ground fixed’, ‘ground mobile’, ‘airborne fighter, uninhabited’ and ‘missile launch’. Temperature is usually considered separately. Thus, except for temperature, these general environmental factors lump together many other kinds of real environments, e.g. vibration, mechanical shock, thermal cycling, thermal shock, thermal transient, electrical transient, power on/off, moisture, humidity, altitude, sand and dust, and chemical contamination. Many of these real environments depend on equipment usage. The following paragraphs discuss a number of the critical but neglected real environments.

Vibration and mechanical shock

Each of the general environmental factors used in the existing models implies a certain vibration and shock environment. For example, for the factor ‘airborne fighter uninhabited’, the vibration and shock spectrum and magnitude expected in the uninhabited compartment of a fighter are assumed. Unfortunately, knowing the general vibration level of an area is far from knowing the true vibration level in a unit. Each unit has its own environment, depending on where it is installed and on the exact mounting structure. Figure 2, from Reference 4, shows the response spectrum on a printed circuit board (PCB) subjected to the U.S. NAVMAT P-9492 standard environmental stress screening vibration input. From this figure one can discern that the vibration amplitudes were damped at some frequencies and amplified at others. Each unit would have its own response spectrum and, hence, its own failure rate when subjected to the same vibration input. Figure 3, from Reference 5, shows the time to failure of an air-to-air missile with respect to vibration level. The scatter of points in this figure indicates a very large variation in the response from missile to missile. Furthermore, the part type dominating the failure distribution changes with vibration intensity, as shown in Figure 4, from Reference 5. This is because different failure mechanisms exist within each part type. Thus the failure mechanisms and their responses to specific vibration and shock could easily cause a reliability variation of two to three times.
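The amplification and damping visible in Figure 2 can be illustrated with the simplest model that shows both: a single-degree-of-freedom (SDOF) oscillator under base excitation. This is a textbook idealization assumed here, not the analysis of Reference 4, and a real PCB has several modes. The natural frequency (300 Hz), damping ratio (5 per cent) and flat 0.04 G^2/Hz input below are hypothetical values. The response PSD is the input PSD times the squared transmissibility, which exceeds 1 below the square root of 2 times the natural frequency (peaking near resonance) and falls below 1 above it.

```python
import math


def transmissibility(freq_hz: float, natural_freq_hz: float, damping_ratio: float) -> float:
    """Base-to-mass acceleration transmissibility of an SDOF system."""
    r = freq_hz / natural_freq_hz              # frequency ratio f / f_n
    two_zeta_r_sq = (2.0 * damping_ratio * r) ** 2
    return math.sqrt((1.0 + two_zeta_r_sq) / ((1.0 - r * r) ** 2 + two_zeta_r_sq))


def response_psd(input_psd_g2_per_hz: float, freq_hz: float,
                 natural_freq_hz: float, damping_ratio: float) -> float:
    """Response PSD = T(f)^2 x input PSD, valid for a linear system."""
    return input_psd_g2_per_hz * transmissibility(freq_hz, natural_freq_hz, damping_ratio) ** 2


# Hypothetical board: 300 Hz natural frequency, 5 % damping, flat 0.04 G^2/Hz input.
FN_HZ, ZETA, INPUT_PSD = 300.0, 0.05, 0.04
for f in (50.0, 150.0, 300.0, 600.0, 1000.0):
    print(f"{f:6.0f} Hz: T = {transmissibility(f, FN_HZ, ZETA):5.2f}, "
          f"response = {response_psd(INPUT_PSD, f, FN_HZ, ZETA):.4f} G^2/Hz")
```

With these assumed values the resonant response is about 101 times the input PSD, so two boards with different natural frequencies or mountings can see very different stresses from the same screening input.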

Figure 2. PCB vibration response (y-axis: response PSD, G²/Hz, 0.001 to 10, log scale; x-axis: frequency, Hz, 10 to 1,000)

Thermal cycling and temperature transient

Thermal cycling and temperature transients are also implied in the general environmental factor used with the existing prediction models. Although the condition selected for the model might be correct on average, specific usage under the exact conditions could make a tremendous difference. Fighter aircraft operating in the arctic would experience thermal environments quite different from those of the same aircraft operating in the tropics. The sortie length would also affect the thermal cycle. Figure 5, from Reference 6, shows how the number of thermal cycles and the temperature range affect the number of failures. In general, for reasonable temperature extremes, thermal cycling effects outweigh sustained temperature effects. Thermal cycling causes mechanical expansion and contraction; thus, the equipment goes through fatigue cycles as a result of temperature cycling. The rate of temperature change is related to the thermal cycling rate and, hence, to the fatigue cycle frequency. For brittle materials a higher rate of change would accelerate failures, but for ductile materials, e.g. aluminium and solder, the effect is just the opposite. Figure 6, from Reference 7, shows that for the aluminium specimen tested, lowering the fatigue cycling frequency shortened the fatigue cycle life. Thus, in evaluating thermal cycling effects on failures, the specific material must be examined and taken into account in the prediction. This factor alone can also change the item reliability by several times.
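Figure 6 plots fatigue life against a composite parameter that trades cycling frequency against temperature. Its axis label is garbled in this copy, so the sketch below assumes the abscissa is ln(nu) + dH/(RT), with the natural logarithm chosen because it places room-temperature values inside the 35 to 70 axis range shown. The activation energy of 34 kcal/g-atom comes from the caption; R = 1.987 cal/(mol K) is the standard gas constant. Under that assumption, dropping from 1440 to 25 CPM lowers the parameter by ln(1440/25), about 4.05, which is the same shift as warming the 1440 CPM test from 300 K to about 323 K. That is one way to read the statement that slower cycling shortened the life of the ductile specimen.

```python
import math

GAS_CONSTANT = 1.987  # cal/(mol*K), consistent with the caption's kcal/g-atom units
DELTA_H = 34_000.0    # activation energy from the Figure 6 caption, cal/g-atom


def composite_parameter(freq_cpm: float, temp_k: float, delta_h: float = DELTA_H) -> float:
    """Assumed Figure 6 abscissa: ln(nu) + dH / (R * T)."""
    return math.log(freq_cpm) + delta_h / (GAS_CONSTANT * temp_k)


def equivalent_temperature(temp_k: float, freq_from_cpm: float, freq_to_cpm: float,
                           delta_h: float = DELTA_H) -> float:
    """Temperature at freq_to_cpm that gives the same parameter as (freq_from_cpm, temp_k)."""
    target = composite_parameter(freq_from_cpm, temp_k, delta_h)
    return delta_h / (GAS_CONSTANT * (target - math.log(freq_to_cpm)))


for nu in (1440.0, 25.0):  # the two test frequencies named in the Figure 6 caption
    print(f"{nu:6.0f} CPM at 300 K: parameter = {composite_parameter(nu, 300.0):.1f}")

# 25 CPM at 300 K sits where 1440 CPM cycling would need roughly 323 K.
print(f"equivalent temperature: {equivalent_temperature(300.0, 25.0, 1440.0):.0f} K")
```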

Figure 3. Measured MTBF as function of vibration level for an air-to-air missile (x-axis: vibration level, g r.m.s.)

Figure 4. Change of failure mode distribution with vibration level for air-to-air missile (x-axis: vibration level, g r.m.s.)

Figure 5. IC failures as function of temperature cycles for different temperature ranges (curves for -55°C/+125°C and 0°C/+55°C cycling; x-axis: cycles of temperature) © 1968 IEEE

Storage, shipping, non-operating and power on/off

When a system is not operating it is still going through environmental stresses. Thus non-operating failure rates should simply be the failure rates at zero electrical stress. Moreover, the non-operating failure rate should differ between storage, shipping, on-aircraft-ground and on-aircraft-flight conditions. It
was concluded from the 1979 RADC non-operating failure rates for avionics study (Reference 8) that the on-ground non-operating failures contributed 10 to 30 per cent of the total number of failures for aircraft with a utilization rate of 20 to 60 hours per month. Unfortunately, a system must be energized before one can find out whether there is a failure. Thus, non-operating failures and turn-on failures are extremely difficult, if not impossible, to separate, and the percentage cited above could come from either condition. Many missile storage studies have indicated that the failures appeared to be independent of the length of storage. This strongly suggests that turn-on was the culprit that caused most of the failures. Reference 9 provided data showing that in an aircraft sortie the failure rate starts high and then gradually decreases during the mission. This might also be related to the effects of power turn-on. Much is still unknown in this area, but storage, shipping, non-operating and power on/off effects on reliability must be addressed.

Figure 6. Composite N-temperature diagram (2450 psi). ASTM: high frequency, 1440 CPM; low frequency, 25 CPM; ν = frequency, CPM; T = temperature, K; R = gas constant; ΔH = activation energy, 34 kcal/g-atom (x-axis: log ν + ΔH/RT, 35 to 70)

Other environments

Temperature is an important factor; it has been incorporated in all existing models. Other environments, such as humidity, moisture, sand, dust, altitude and radiation, do not appear to be critical factors in themselves, although moisture could be very important when coupled with certain manufacturing defects. It is most likely that these factors would not contribute significantly to the prediction inaccuracies.

SUPPLIER QUALITY DIFFERENCE

The parts manufacturer quality control factor in the present prediction models was developed to reflect things such as periodic verification of parts’ mechanical integrity, long-term measured failure rate, minimum life expectancy, extent of parametric measurements and amount of environmental screening. These things are specified in the procurement specifications. However, there are many more elements, not controlled by the specifications, that are critical to the control of reliability. Different manufacturers building the same part to the same specification could easily have a 3 to 1 variation in reliability. Figure 7, from Reference 6, shows percentage failures as a function of temperature cycles for three different supplier groups. In this figure the worst supplier had 40 times the percentage failures of the best. Choosing a good supplier would make a much greater impact on reliability than the controls imposed through specifications. Specifications serve a good purpose to a certain extent. The problem is that any specification is limited in coverage: if one were to specify everything, the part cost would be prohibitive. If a supplier does not have a problem there is no need to impose a control and add cost. However, each supplier has its own specific problems, and this variation greatly controls the reliability of the parts and systems. The products from different suppliers could have a three times difference in reliability.

Figure 7. RF inductor failures as function of temperature cycles (for four vendors) (x-axis: temperature cycles, -55°C/+125°C, 0 to 300) © 1968 IEEE

SYSTEM AGE

In recent years system failure data have shown that electronic system failure rates decrease with system age. Figure 8, from Reference 10, gives failure rate data on a set of Digital Air Data Computers used by commercial airlines under manufacturer’s warranty. The data compilers had carefully removed from the compilation all systems that had improvement modifications, so that the change in failure rate would be a pure function of age. Figure 8 shows that by an ageing time of 15,000 h the failure rate had gone down from the original 400 per cent f./1000 h to 4 per cent f./1000 h, a 100 times change. From an ageing time of 1000 h to 10,000 h the change was 10 times. Figure 9 shows data presented in 1987 (Reference 11) on over 300 satellites, which also showed such ageing effects. Reference 12 provides a rationale for this phenomenon. Note
that ageing is not limited to the operating period: a system ages, although very slowly, when not operating. Any reliability prediction method must take this ageing effect into account; otherwise one could be off by a factor of 10.

Figure 8. Electronics failure rate versus average age in hours (x-axis: average age per LRU, hours, log scale) © 1979 IEEE

Figure 9. Unsmoothed failure ratios of spacecraft operation in orbit (x-axis: half-year intervals, 0 to 18) © 1987 IEEE

ENVIRONMENTAL STRESS SCREENING (ESS)

The data in Figure 8 can also be viewed as a curve reflecting reliability improvement through ESS, since in effect the computers were exposed to aircraft operating environments with respect to ageing time. If an ESS regimen can be designed to provide a 100 times stress acceleration over the actual aircraft environment, then through this ESS the failure rate of the computer can be lowered to the 4 per cent f./1000 h point after 150 h of ESS exposure (the 15,000 h of field ageing divided by the acceleration factor of 100). Thus ESS directly affects the reliability of the equipment. More details on the ESS effects were presented by this author at the May 1989 Institute of Environmental Sciences Annual Technical Meeting in a paper entitled ‘A new environmental stress screen theory for electronics’ (Reference 13).

Figure 10. Reliability improvement of digital circuit versus calendar time (x-axis: cumulative net units built; projected portion marked) (from Reference 14) © 1981 IEEE

RELIABILITY IMPROVEMENT VS CALENDAR YEAR

Figure 10, from Reference 14, shows how the failure rate of a part decreases with calendar year. This was due to the continuous upgrading of the manufacturing processes by all of the manufacturers as well as by the specific part supplier. In the computer reliability prediction by function regression equation in Mil-Hdbk-338, the coefficient for calendar year is the second most significant factor in the equation (the first being add/subtract time). This is because the equation was developed at a time when integrated circuit reliability was being improved at a tremendous rate and large quantities of integrated circuits were being used in computers. The reliability could easily double every three years. Thus, if the true reliability of a system built in a particular time frame is desired, this factor must be included.

CONCLUSIONS

It has been shown in this paper that the various electronics reliability prediction methods do not provide consistent results, and that the measured reliabilities do not agree with the predicted numbers. The root cause of the prediction inaccuracy lies in the fact that many of the first-order factors are not explicitly included in the prediction methods. These factors include thermal cycling, temperature change rate, mechanical shock, vibration, power on/off, supplier quality difference, reliability improvement with respect to calendar years, and ageing trends. Neglecting any one of these factors could cause a variation of several times in the predicted reliability. In order to make the predictions more accurate, these factors must be incorporated into the prediction methods.
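The numbers quoted in the system age, ESS and calendar-year sections reduce to simple arithmetic, sketched below under assumptions the paper does not state. First, the Figure 8 decline between two ages is treated as a local power law, failure rate proportional to t^(-alpha), so the 10 times drop from 1000 h to 10,000 h gives alpha = 1. Second, ESS is assumed to accelerate ageing uniformly by a constant factor, so 15,000 h of aircraft ageing at 100 times acceleration corresponds to 150 h of screening. Third, ‘reliability could easily double every three years’ is read as a steady exponential trend. The function names are hypothetical.

```python
import math


def power_law_exponent(t1_h: float, rate1: float, t2_h: float, rate2: float) -> float:
    """alpha in rate(t) = rate1 * (t / t1) ** (-alpha), fitted through two points."""
    return math.log(rate1 / rate2) / math.log(t2_h / t1_h)


def ess_hours(field_age_h: float, acceleration: float) -> float:
    """Screening time reaching the same 'age', assuming a uniform acceleration factor."""
    return field_age_h / acceleration


def calendar_improvement(years: float, doubling_period_years: float = 3.0) -> float:
    """Reliability gain if reliability doubles every doubling_period_years."""
    return 2.0 ** (years / doubling_period_years)


# Only the 10:1 ratio between 1000 h and 10,000 h matters, so relative rates are used.
alpha = power_law_exponent(1000.0, 10.0, 10_000.0, 1.0)
print(f"alpha = {alpha:.2f}")                              # alpha = 1.00
print(f"ESS time = {ess_hours(15_000.0, 100.0):.0f} h")    # ESS time = 150 h
print(f"9-year gain = {calendar_improvement(9.0):.0f}x")   # 9-year gain = 8x
```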

REFERENCES

1. G. E. Benz, ‘In defence of numerical techniques’, Quality and Reliability Engineering International, 2, 25-32 (1986).
2. L. R. Webster, ‘Field vs. predicted for commercial SATCOM terminals’, RAMS Proceedings, January 1986.
3. J. L. Spencer, ‘The highs and lows of reliability predictions’, RAMS Proceedings, January 1986.
4. R. M. Turner, ‘Random vibration screening of printed circuit boards’, Test Engineering and Management, December/January 1988-89.
5. D. B. Meeker and W. D. Everett, ‘U.S. Navy experience on the effects of carrier-aircraft environment on guided missiles’, AGARD Conference Proceedings, May 1979.
6. J. Q. Reynolds, ‘Effects of sustained temperature cycling on parts’, RAMS Proceedings, January 1968.
7. A. F. Madayag (ed.), Metal Fatigue Theory and Design, Wiley, 1969.
8. G. A. Kern, I. Quart, S. S. Tung and K. L. Wong, ‘Nonoperating failure rates for avionics study’, RADC-TR-80-136, Final Technical Report, 1980.
9. M. B. Shurman, ‘Time-dependent failure rates for jet aircraft’, RAMS Proceedings, January 1978.
10. A. G. Bezat and L. L. Montague, ‘The effect of endless burn-in on reliability growth projections’, RAMS Proceedings, January 1979.
11. H. Hecht and E. Fiorentino, ‘Reliability assessment of spacecraft electronics’, RAMS Proceedings, January 1987.
12. K. L. Wong and D. L. Lindstrom, ‘Off the bathtub onto the roller-coaster curve’, RAMS Proceedings, January 1988.
13. K. L. Wong, ‘A new environmental stress screening theory for electronics’, Proceedings of the Annual Technical Meeting, Institute of Environmental Sciences, May 1989.
14. G. Peattie, ‘Quality control of ICs’, IEEE Spectrum, October 1981.

Author’s biography:

Kam L. Wong is manager of the Engineering Division of Kambea Industries. Prior to his present position he was with Hughes Aircraft Co., where he served in many capacities including Corporate Reliability Manager and Manager of the Product Analysis Laboratory, which covered thermal dynamics, structural mechanics, electromagnetic compatibility and reliability. He was also in charge of environmental testing, hybrid circuit development, etched circuit fabrication and electronic assembly. He is coauthor of the book Reliability Engineering for Electronic Systems (Wiley) and author of many papers. He received the P.K. McElroy Award at RAMS in 1981, delivered a keynote speech at the European Electronic Conference in 1982, and received the Reliability, Test and Evaluation Award from the Institute of Environmental Sciences in 1989 for his pioneering work leading to improved understanding of long-term failure rate characteristics. He served as the Vice Chairman of the Environmental Stress Screening on Electronic Hardware Committee of IES for two years. Mr. Wong received his BSEE from the California Institute of Technology and his MSEE degree from the University of Southern California.
