
Applied Reliability


APPLIED RELIABILITY

Techniques for Reliability Analysis with Applied Reliability Tools (ART) (an EXCEL Add-In) and JMP® Software

AM216 Class 5 Notes

Santa Clara University

STAT-TECH ®

Spring 2010


AM216 Class 5 Notes

Accelerated Testing

(continued from Class 4 Notes)

Accelerated Test Example (Analysis in JMP)

Sample Sizes for Accelerated Testing

System Models

Series System

Parallel System

Analysis of Complex Systems

Standby Redundancy

Defective Subpopulations

Graphical Analysis

Mortals and Immortals

Models

Case Study

Class Project Example

Modeling the Field Reliability

Evolution of Methods

General Reliability Model

AMD Example


System Models

Series System

Consider a system made up of n components in series. If the i-th component has reliability $R_i(t)$, the system reliability is the product of the individual reliabilities, that is,

$$R_s(t) = R_1(t)\,R_2(t)\cdots R_n(t)$$

which we denote with the capital “pi” symbol for multiplication:

$$R_s(t) = \prod_{i=1}^{n} R_i(t)$$

The system CDF, in terms of the individual CDFs, is

$$F_s(t) = 1 - \prod_{i=1}^{n} \left[1 - F_i(t)\right]$$

The system failure rate is the sum of the individual component failure rates.

The system failure rate is higher than the highest individual failure rate.
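The series relationships can be sketched in a few lines of Python. This is a minimal illustration, not part of the class tools: the function names and the component values are hypothetical, chosen only to show the product and sum rules at one fixed time t.

```python
import math

def series_reliability(reliabilities):
    """R_s(t): product of the component reliabilities R_i(t)."""
    return math.prod(reliabilities)

def series_cdf(cdfs):
    """F_s(t) = 1 - prod(1 - F_i(t))."""
    return 1.0 - math.prod(1.0 - f for f in cdfs)

def series_failure_rate(failure_rates):
    """h_s(t): sum of the component failure rates h_i(t)."""
    return sum(failure_rates)

# Hypothetical component values at one fixed time t (not from the notes).
component_reliabilities = [0.98, 0.95, 0.99]
component_failure_rates = [1e-4, 2e-4, 5e-5]  # per hour, illustrative only

r_s = series_reliability(component_reliabilities)
f_s = series_cdf([1.0 - r for r in component_reliabilities])
h_s = series_failure_rate(component_failure_rates)
print(f"R_s = {r_s:.4f}, F_s = {f_s:.4f}, h_s = {h_s:.2e} per hour")
```

Because the rates add, the system rate always exceeds the largest component rate, which matches the statement above.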


System Models

Parallel System

Consider a system made up of n components in parallel. The system CDF is the product of the individual CDFs, that is,

$$F_s(t) = \prod_{i=1}^{n} F_i(t)$$

The system reliability is

$$R_s(t) = 1 - \prod_{i=1}^{n} \left[1 - R_i(t)\right]$$

System failure rates are no longer additive (in fact, the system failure rate is smaller than the smallest individual failure rate), but must be calculated using basic definitions.
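A small Python sketch of the parallel formulas follows. The function names and the two component reliabilities are assumptions made for illustration only.

```python
import math

def parallel_cdf(cdfs):
    """F_s(t): product of the component CDFs F_i(t)."""
    return math.prod(cdfs)

def parallel_reliability(reliabilities):
    """R_s(t) = 1 - prod(1 - R_i(t))."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

# Hypothetical component reliabilities at one fixed time t (not from the notes).
component_reliabilities = [0.90, 0.90]

r_p = parallel_reliability(component_reliabilities)
f_p = parallel_cdf([1.0 - r for r in component_reliabilities])
print(f"R_s = {r_p:.4f}, F_s = {f_p:.4f}")
```

The parallel system is more reliable than its best component, the mirror image of the series case.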


System Failure Rate

Two Parallel Components

A component has CDF F(t) and a failure rate h(t). Two components are used in parallel in a system. Determine the failure rate of the system.

SOLUTION The CDF for the two components in parallel is $F^2(t)$ and the PDF, by differentiation, is $2F(t)f(t)$. The failure rate of the system is

$$h_s(t) = \frac{f_s(t)}{1 - F_s(t)} = \frac{2F(t)\,f(t)}{1 - F^2(t)} = \frac{2F(t)}{1 + F(t)} \cdot \frac{f(t)}{1 - F(t)} = \frac{2F(t)}{1 + F(t)}\,h(t)$$

The result shows that the system failure rate is a factor 2F/(1+F) times the component failure rate. The smaller the component CDF, the bigger the improvement. Redundancy makes a larger difference in early life, and much less difference later on.
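The factor 2F/(1+F) can be checked numerically. The sketch below assumes an exponential component with a failure rate of 0.01 per hour (a value chosen for illustration, not given in the notes) and compares the closed-form result with a failure rate computed directly from the definition f_s/(1 - F_s), using a numerical derivative.

```python
import math

LAMBDA = 0.01  # assumed component failure rate per hour (illustrative)

def component_cdf(t):
    """Exponential component CDF F(t)."""
    return 1.0 - math.exp(-LAMBDA * t)

def system_cdf(t):
    """Two identical components in active parallel: F_s(t) = F(t)^2."""
    return component_cdf(t) ** 2

def system_rate_from_definition(t, dt=1e-4):
    """h_s = f_s / (1 - F_s), with f_s from a central difference."""
    f_s = (system_cdf(t + dt) - system_cdf(t - dt)) / (2.0 * dt)
    return f_s / (1.0 - system_cdf(t))

def system_rate_from_formula(t):
    """h_s = [2F / (1 + F)] * h, with h = LAMBDA for the exponential."""
    f = component_cdf(t)
    return 2.0 * f / (1.0 + f) * LAMBDA

for t in (10, 100, 400):
    print(t, system_rate_from_definition(t), system_rate_from_formula(t))
```

The two columns agree, and the factor grows toward 1 as F(t) approaches 1, which is why redundancy helps most in early life.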


Class Project

System Models

A) A component has reliability R(t) = 0.99. Twenty-five components in series form a system. Calculate the system reliability.

B) A component has reliability R(t) = 0.95. Three components in parallel form a system. Calculate the system reliability.


Reliability Block Diagrams

For components in series:

--[ A ]--[ B ]--

For components in parallel:

  +--[ A ]--+
--+         +--
  +--[ B ]--+


Example of Series-Parallel System: Big Rig

[Figure: a big rig (cab and trailer) with components labeled A through J, shown with its Reliability Block Diagram (RBD)]


Class Project

Complex Systems

A system consists of seven units: A, B, C, D, E, G, H.

For the system to function, unit A and either unit B or C and either D and E together or G and H together must be working. Draw the reliability block diagram for this setup.

Write the equation for the CDF of the system in terms of the individual component reliabilities, that is, the $R_i$, where i = A, B, C, ..., G, H. Hint: Consider the three subsystems: A alone; B with C; and D, E, G, H.


Standby Versus Active Redundancy

In contrast to active parallel redundancy, there is standby redundancy, in which the second component is idle until needed. Assuming perfect switching and no degradation of the idle component, standby redundancy results in higher reliability and lower maintenance costs than active parallel redundancy. An illustration, assuming exponentially distributed failure times, is shown below.

[Figure: System Failure Rates (2 Components). Curves: Single, Parallel, Standby. Vertical axis from 0 to 0.012; horizontal axis from 0 to 400.]
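The comparison of single, active parallel, and standby configurations can be sketched as follows. This assumes two identical exponential components with a failure rate of 0.01 per hour (an assumption; the notes do not state the value used in the plot), perfect switching, and no degradation while idle. Under those assumptions the standby system lifetime is the sum of two exponential lifetimes, so R(t) = exp(-λt)(1 + λt) and h(t) = λ²t/(1 + λt).

```python
import math

LAMBDA = 0.01  # assumed component failure rate per hour (illustrative)

def single_rate(t):
    """Constant failure rate of one exponential component."""
    return LAMBDA

def parallel_reliability(t):
    """Two active parallel components: R = 1 - F^2."""
    return 1.0 - (1.0 - math.exp(-LAMBDA * t)) ** 2

def parallel_rate(t):
    """Active parallel failure rate: [2F / (1 + F)] * LAMBDA."""
    f = 1.0 - math.exp(-LAMBDA * t)
    return 2.0 * f / (1.0 + f) * LAMBDA

def standby_reliability(t):
    """Standby with perfect switching: R = exp(-x) * (1 + x), x = LAMBDA * t."""
    x = LAMBDA * t
    return math.exp(-x) * (1.0 + x)

def standby_rate(t):
    """Standby failure rate: LAMBDA * x / (1 + x), x = LAMBDA * t."""
    x = LAMBDA * t
    return LAMBDA * x / (1.0 + x)

for t in (50, 100, 200, 400):
    print(t, single_rate(t), round(parallel_rate(t), 5), round(standby_rate(t), 5))
```

Both redundant configurations start at a failure rate of zero and approach the single-component rate from below, with standby staying lower than active parallel throughout.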


Series, Parallel Reliability in ART

In ART, select System Reliability. Enter necessary information. Click OK.


Reliability Experiment

Consider

We test 100 units for 1,000 hours. There are 30 failures by 500 hours, but no more by the end of the test.

Question: Are we dealing with two populations or just censored data?

Question: If we continue the test, will we see only a few more failures, or will the other 70 fail with the same life distribution?


Defect Models

Mortals versus Immortals

The usual assumption in reliability analysis is that all units can fail for a specific mechanism. If a defective subpopulation exists, only the fraction of units containing the defect is susceptible to failure. These are called mortals.

Units without the fatal flaw do not fail. These are called immortals.

The model for the total population of mortals and immortals becomes :

CDF = (fraction mortals) x CDF(mortals)

Reliability analysis focuses on the life distribution of the defective subpopulation and the mortal fraction.


Example of a Defective Subpopulation

A Processing Problem

Suppose we have 25 wafers in a lot, but only two wafers are contaminated with mobile ions due to a processing error.

If components are assembled from the 25 wafers, assuming equal yield per wafer, only 2/25 = 8% of the components can have the fatal “defect” that makes failure possible.

The components from the non-contaminated wafers will not fail for this mechanism since they are defect free; that is, we have a defective subpopulation.


Spotting a Defective Subpopulation

Graphical Analysis

Assume that a specified failure mode follows a lognormal distribution.

Plot the data on lognormal graph paper. If, instead of following a straight line, the points seem to curve away from the cumulative percent axis, it’s a signal that a defective subpopulation may be present.

If the test is run long enough, expect the plot to bend over and become asymptotic to the cumulative percent line that represents the proportion of defectives in the sample.


Defective Subpopulations

Graphical Analysis

Plot based on total sample (mortals and immortals).

Plot based only on mortal subpopulation.


Defect Model

Mortals and Immortals

The observed CDF $F_{obs}(t)$ is

$$F_{obs}(t) = p\,F_m(t)$$

where $F_m(t)$ is the CDF of the mortals and p is the fraction of mortals (units with the fatal defect) in the total sample.

For example, if there are 25% mortals in the population, and the mortal CDF at time t is 40%, then we would expect to observe about 0.25 × 0.40 = 0.10, or 10%, failures in the total random sample at time t.


Major Computer Manufacturer Reliability Data

Gate Oxide Fails

| Time (hours) | 24 | 48 | 168 | 500 | 1000 |
|---|---|---|---|---|---|
| Rejects | 201 | 23 | 1 | 1 | 1 |
| Sample Size | 58,000 | 57,392 | 10,000 | 2,000 | 1,999 |
| Censored | 407 | 47,369 | 7,999 | 0 | 1,998 |

Analysis by Company Using Lognormal Distribution

$T_{50}$: 1.149E32 hours

Sigma: 26.175


What Do These Numbers Mean?

Analysis by Company Using Lognormal Distribution

$T_{50}$: 1.149E32 hours

Sigma: 26.175

The plus and minus 3 sigma range of the time-to-failure distribution extends from 33 seconds to 1.66E62 years!

It takes seconds to get to 0.1% cumulative failures, but over 412,000 hours (that is, 47 years) to get to 1.00%!

Assuming everything can fail is misleading and unnecessary.
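The figures quoted above can be reproduced from the fitted lognormal parameters. The sketch below uses only the Python standard library; the function name is hypothetical, and the lognormal percentile formula t_p = T50 · exp(sigma · z_p) is the standard one.

```python
import math
from statistics import NormalDist

T50_HOURS = 1.149e32  # fitted median life from the company analysis
SIGMA = 26.175        # fitted lognormal shape parameter

def lognormal_time_at_cdf(p, t50, sigma):
    """Time at which a lognormal CDF reaches p: t50 * exp(sigma * z_p)."""
    z = NormalDist().inv_cdf(p)
    return t50 * math.exp(sigma * z)

t_0p1_hours = lognormal_time_at_cdf(0.001, T50_HOURS, SIGMA)
t_1p0_hours = lognormal_time_at_cdf(0.01, T50_HOURS, SIGMA)
minus_3_sigma_hours = T50_HOURS * math.exp(-3.0 * SIGMA)
plus_3_sigma_hours = T50_HOURS * math.exp(3.0 * SIGMA)

print(f"0.1% cumulative failures: {t_0p1_hours * 3600:.1f} seconds")
print(f"1.0% cumulative failures: {t_1p0_hours:,.0f} hours")
print(f"-3 sigma: {minus_3_sigma_hours * 3600:.0f} seconds")
print(f"+3 sigma: {plus_3_sigma_hours / 8760:.2e} years")
```

A fit whose percentiles span seconds to geological times is a strong hint that the all-units-can-fail assumption does not describe the data.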


Modeling with Defective Subpopulations

The same data, assuming 99% of the failures have occurred by 48 hours, can be modeled by a defective subpopulation fraction of 227/58,000 = 0.39% and a lognormal distribution of failure times for the mortals with $T_{50}$ = 10.6 hours and sigma = 0.68.

Practically 100% of failures occur by 168 hours. Any failures thereafter are probably not related to the defective subpopulation. For example, handling-induced failures are a possibility.
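As a rough check of the defective subpopulation fit, the sketch below evaluates the mortal lognormal CDF and the observed CDF (mortal fraction times mortal CDF) at the early readout times. The function names are hypothetical; the parameter values are the ones quoted above.

```python
import math
from statistics import NormalDist

P_MORTAL = 227 / 58_000  # estimated mortal fraction
T50_HOURS = 10.6         # mortal median life
SIGMA = 0.68             # mortal lognormal shape parameter

def mortal_cdf(t):
    """Lognormal CDF of the mortal subpopulation."""
    return NormalDist().cdf(math.log(t / T50_HOURS) / SIGMA)

def observed_cdf(t):
    """CDF seen in the whole sample: mortal fraction times mortal CDF."""
    return P_MORTAL * mortal_cdf(t)

for t in (24, 48, 168):
    print(f"{t:>4} h: mortal CDF = {mortal_cdf(t):.4f}, observed CDF = {observed_cdf(t):.5f}")
```

The mortal CDF is close to 99% at 48 hours and essentially 100% at 168 hours, consistent with the statements above.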


Defective Subpopulation Models

If we don’t consider mortals vs. immortals, we will incorrectly assume that all units can fail.

Projections of field reliability will be biased unless we identify the limited defective units.


Statistical Reliability Analysis and Modeling:

A Case Study

Analysis of Reliability Data

with Failures from a Defective Subpopulation


Reliability Study

Background

One lot of a device type with initial burn-in results at 168 hours, 125 °C:

Over 50% fallout due to bake recoverable failures

Since other lots, with similar manufacturing, might have escaped to a few customers, we needed to assess the field impact.

We were able to impound this lot, containing about 300 devices not burned-in.


Reliability Study

Design

Two static stresses:

179 units: 125 °C ambient

90 units: 150 °C ambient

30 units: Control

Frequent readouts at 2, 4, 8, 16, 32, 48, 68, 92, 116 hours


Purpose of Study

Reliability Modeling

Determine if fraction defective (mortals) model applies

Determine failure distribution (lognormal, parameters)

Determine if true acceleration is present

Determine activation energy for acceleration factors

Determine recovery kinetics with and without bake
- Is 24 hours at 150 °C necessary?
- Do devices recover at room temperature?


Modeling Procedure

Statistical Analysis Plan

Analyze cumulative percent failures plot versus time, both linear and probability plots.

Estimate fraction mortals for stress cells. Test for significant difference.

Plot fallout of mortals (reduced sample size) on lognormal probability graph. Check for linearity and equality of slopes.

Run maximum likelihood analysis. Test for equality of shape factors (sigmas). Estimate single sigma. Estimate median life $T_{50}$ for both cells.

Check model fit against original data.


Reliability Study

Bake Recoverable Failures

[Figure: Linear Plot of Cumulative Failures Versus Time. Series: 150 °C, 125 °C. Vertical axis: Cumulative Percent (0% to 80%); horizontal axis: Stress Time (Power on Hours), 0 to 120. Sample sizes: 150 °C = 90; 125 °C = 179.]


Reliability Study

Bake Recoverable Failures

[Figure: Probability Plots (No Adjustment for Mortals). Series: 150 °C, 125 °C. Vertical axis: Standard Normal Variate: Z; horizontal axis: Ln(Time to Failure). Sample sizes: 150 °C = 90; 125 °C = 179.]


Reliability Study

Bake Recoverable Failures

[Figure: probability plots based on the mortal subpopulation. Series: 150 °C, 125 °C. Vertical axis: Standard Normal Variate: Z; horizontal axis: Ln(Time to Failure). Mortal sample sizes: 150 °C = 64; 125 °C = 113.]


APL PROGRAM FOR MLE

GENLNEST

ENTER NUMBER OF CELLS: 2

CHOOSE CONF. LIMIT FOR BOUND IN PERCENT: 90

ENTER ANY EXACT TIMES OF FAILURE FOR CELL 1

ENTER START AND ENDPOINT OF ALL READOUT INTERVALS (INCLUDE ZERO’S)

SPREAD 2 4 8 16 32 48 68 92 116

ENTER CORRESPONDING NUMBERS OF FAILS PER INTERVAL (INCLUDE ZERO’S)

34 6 21 2 0 0 0 1 0

ENTER TIMES ALL FAILED UNITS WERE REMOVED FROM TEST (INCLUDING END OF TEST)

116

ENTER CORRESPONDING NUMBERS REMOVED

0

ENTER ANY EXACT TIMES OF FAILURE FOR CELL 2

ENTER START AND ENDPOINT OF ALL READOUT INTERVALS (INCLUDE ZERO’S)

SPREAD 2 4 8 16 32 48 68 92 116

ENTER CORRESPONDING NUMBERS OF FAILS PER INTERVAL (INCLUDE ZERO’S)

5 0 36 8 42 7 3 4 3

ENTER TIMES ALL FAILED UNITS WERE REMOVED FROM TEST (INCLUDING END OF TEST)

16 116

ENTER CORRESPONDING NUMBERS REMOVED

2 3

MAXIMUM LIKELIHOOD ESTIMATES

                              VARIANCE   VARIANCE   COVARIANCE
CELL    T50    SIGMA    MU     SIGMA       MU        MU SIGMA
1      1.90    1.208    .444   .0322     .0373e-1    .643e-2
2     15.08    1.060   2.714   .0059     .0104e-3    .266e-5

ESTIMATE BOUNDS (90 PERCENT CONFIDENCE)

        NUM.      NUM.
CELL   ON TEST    FAIL    T50 LOW   T50 UP   SIGMA LOW   SIGMA UP
1        64        64       1.38     2.63      .909       1.508
2       113       108      12.74    17.86      .933       1.187

WANT EQUAL T50’S OR SIGMAS OR BOTH IN SOME CELLS (Y/N)?

Y

CELLS: 1 2

TYPE 1 FOR EQUAL SIGMA’S, 2 FOR EQUAL MU’S, 3 FOR BOTH THE SAME: 1

THE ASSUMPTION OF EQUAL SIGMA’S CAN NOT BE REJECTED AT THE 95 PERCENT LEVEL. UNDER THIS ASSUMPTION, RESULTS LIKE OBSERVED OCCUR ABOUT 41.9 PERCENT OF THE TIME. (THE SMALLER THIS PERCENT, THE LESS LIKELY THE ASSUMPTION.)

MAXIMUM LIKELIHOOD ESTIMATES

                              VARIANCE   VARIANCE   COVARIANCE
CELL    T50    SIGMA    MU     SIGMA       MU        MU SIGMA
1      2.02    1.090    .704   .0051     .0247e-2    .538e-3
2     15.08    1.090   1.713   .0051     .0110e-2    .250e-5

ESTIMATE BOUNDS (90 PERCENT CONFIDENCE)

        NUM.      NUM.
CELL   ON TEST    FAIL    T50 LOW   T50 UP   SIGMA LOW   SIGMA UP
1        64        64       1.56     2.63      .972       1.207
2       113       108      12.68    17.54      .972       1.207

WANT EQUAL T50’S OR SIGMAS OR BOTH IN SOME CELLS (Y/N)?

N


Reliability Study

Bake Recoverable Failures

[Figure: Model Fit to Actual. Series: 150 °C, 125 °C, MLE Fit: 150 °C, MLE Fit: 125 °C. Vertical axis: Cumulative Percent Failures (0% to 80%); horizontal axis: Time (Power on Hours), 0 to 140.]


Projection to Field Conditions

Acceleration Statistics

Estimate acceleration factor between two stress cells:

AF = 15.08 / 2.02 = 7.465

Estimate activation energy, based on $T_j$ values, 35 °C above ambient:

$E_A$ = 1.375 eV

Estimate field $T_{50}$ based on $T_j$ at 55 °C ambient: field $T_{50}$ = 18,288 hours

Using field $T_{50}$, sigma = 1.090, lognormal distribution:

- project fallout and failure rates for various mortal fractions
- use customer field data to determine which mortal fraction applies
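The acceleration arithmetic can be retraced with the Arrhenius relation. The sketch below is an assumption-laden reconstruction rather than the original calculation: it takes Boltzmann's constant as 8.617E-5 eV/K, converts Celsius to kelvin by adding 273.15, and adds the stated 35 °C to each ambient temperature to get junction temperatures. Function names are hypothetical.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # eV per kelvin
SELF_HEATING_C = 35.0      # junction rise above ambient, from the slide

def kelvin(celsius):
    return celsius + 273.15

def activation_energy(af, t_low_c, t_high_c):
    """Solve the Arrhenius relation for E_A given AF between two temperatures."""
    return K_BOLTZMANN_EV * math.log(af) / (1.0 / kelvin(t_low_c) - 1.0 / kelvin(t_high_c))

def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """Acceleration factor from the use temperature to the stress temperature."""
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / kelvin(t_use_c) - 1.0 / kelvin(t_stress_c)))

t50_125c, t50_150c = 15.08, 2.02  # MLE medians (hours) with common sigma
af_cells = t50_125c / t50_150c
ea = activation_energy(af_cells, 125.0 + SELF_HEATING_C, 150.0 + SELF_HEATING_C)
field_t50 = t50_125c * arrhenius_af(ea, 55.0 + SELF_HEATING_C, 125.0 + SELF_HEATING_C)

print(f"AF = {af_cells:.3f}, E_A = {ea:.3f} eV, field T50 = {field_t50:,.0f} hours")
```

With these constants the result lands within about 0.1% of the 18,288 hours quoted above; the small difference depends on the constants and rounding used.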


Projection to Field Use

Bake Recoverable Fails

[Figure: Projected Field Fallout with Various Mortal Percentages. Curves for mortal percentages of 5%, 10%, 20%, 30%, 40%, 50%, and 66%. Vertical axis: Cumulative Percent (0% to 20%); horizontal axis: Time in Field (K Hours), 0 to 10.]


A Note of Caution

Analysis When Mortals Are Present

Since the analysis took into account the presence of a defective subpopulation, parameter estimates were accurate. The two customers, notified of the affected lots, used the analysis for decisions on how to treat the remaining product in the field.

If the assessment is not done correctly and there is a low incidence of mortals, the $T_{50}$ and sigma estimates for a lognormal distribution may become very large and inaccurate.


A Side Benefit

Screening a Wearout Mechanism

Note that it may be possible to screen a wearout failure mechanism if only a subpopulation of the units is mortal for that mechanism and sufficient acceleration is obtainable.

See the Trindade paper “Can Burn-in Screen Wearout Mechanism? Reliability Models of Defective Subpopulations - A Case Study” in the 29th Annual Proceedings of Reliability Physics Symposium (1991).


Class Project

Defect Models

50 components are put on stress. Readouts are at 10, 25, 50, 100, 200, 500, and 1,000 hours. The failure counts at the respective readouts are 2, 2, 4, 5, 4, 3, and 0.

1. Estimate the CDF for all units using the table below with n = 50.

| Time | Cum # Fails | CDF Est All Units (%) |
|---|---|---|
| 10 | 2 | |
| 25 | 4 | |
| 50 | 8 | |
| 100 | 13 | |
| 200 | 17 | |
| 500 | 20 | |
| 1000 | 20 | |

2. Plot the data on Weibull probability paper on the next page.

Does the data appear distributed according to a Weibull distribution or does a defect model seem possible?


Weibull Probability Paper


Note: “Percent Failure” scale on Weibull Probability paper is faint. Values are 99.9, 98.0, 90.0, 70.0, 50.0, 30.0, 20.0, 10.0, 5.0, 2.0, 1.0, 0.5, 0.2, 0.1, etc.


Class Project

Defect Model Estimates

Weibull Parameter Estimates for Mortal Population:

Characteristic Life (c)

Shape Parameter (m)

$$F(t) = 1 - e^{-(t/c)^m}$$

How could we confirm that the Weibull model for the mortal population fits the data? We estimate the CDF at three times and compare to observations.

| Time | Mortal CDF (Weibull Model) | Mortal Fraction | Model CDF for All Units | Empirical CDF All Units |
|---|---|---|---|---|
| 25 | 0.221 | 0.4 | | |
| 100 | 0.632 | 0.4 | | |
| 1000 | 1.000 | 0.4 | | |


Defective Subpopulations in ART

Enter failure information (readout times, cumulative failures) into columns. Under ART, select Defective Subpopulations… Enter required information. Click OK.


System Models

A General Model for the Field Reliability of Integrated Circuits

An Evolution in the Projection of Field Failure Rates


Failure Rate Calculations

Primitive Method

Assumptions

Constant failure rate

Single overall activation energy

Ambient temperatures

No separation of failure modes


Primitive Method

Problems with Calculations

Example

100 units are stressed for 1,000 hours at 125 °C. Assume no self heating. One unit fails at 10 hours for a mechanism with $E_A$ of 1.0 eV. A second unit fails at 500 hours for a failure mechanism with $E_A$ of 0.5 eV.

Primitive Method Calculation

Overall average activation energy: 0.75 eV

Acceleration Factor (125 °C to 55 °C): AF = 106

IFR (constant) at 55 °C:

[1E9x2/(10+500+98x1000)]/AF = 192 FITS


Primitive Method

Comparative Calculation

Individual Analysis by Failure Mechanism

Mechanism 1: $E_A$ = 1.0 eV, AF = 501

IFR (constant) at 55 °C:

[1E9/(10+500+98x1000)]/AF = 20 FITS

Mechanism 2: $E_A$ = 0.5 eV, AF = 22

IFR (constant) at 55 °C:

[1E9/(10+500+98x1000)]/AF = 461 FITS

Total IFR = 481 FITS
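The primitive and per-mechanism calculations can be compared side by side. The sketch below assumes the Arrhenius model with Boltzmann's constant taken as 8.617E-5 eV/K and a 273.15 offset to kelvin; the function names are hypothetical. Because it does not round the acceleration factors, the per-mechanism total comes out a little below the 481 FITS shown, which uses AF = 22.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # eV per kelvin

def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """Arrhenius acceleration factor from use to stress temperature."""
    t_use, t_stress = t_use_c + 273.15, t_stress_c + 273.15
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_use - 1.0 / t_stress))

def fits(n_failures, device_hours, af):
    """Constant failure rate at use conditions, in failures per 1E9 device hours."""
    return 1e9 * n_failures / device_hours / af

# 100 units, 1,000 hours: failures at 10 h and 500 h, 98 survivors.
device_hours = 10 + 500 + 98 * 1000

# Primitive method: both failures pooled under an averaged 0.75 eV.
af_avg = arrhenius_af(0.75, 55.0, 125.0)
primitive_fits = fits(2, device_hours, af_avg)

# Per-mechanism method: one failure each at 1.0 eV and 0.5 eV.
af_1, af_2 = arrhenius_af(1.0, 55.0, 125.0), arrhenius_af(0.5, 55.0, 125.0)
per_mechanism_fits = fits(1, device_hours, af_1) + fits(1, device_hours, af_2)

print(f"primitive: AF = {af_avg:.0f}, {primitive_fits:.0f} FITS")
print(f"per mechanism: AF = {af_1:.0f} and {af_2:.1f}, {per_mechanism_fits:.0f} FITS")
```

Averaging the activation energies understates the field failure rate by more than a factor of two, because the low-energy mechanism is accelerated far less than the average suggests.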


Failure Rate Calculations

Later Improved Method

Early failures (infant mortality) reported separately

Long-term life modeled with activation energy specific to failure mechanisms

Constant failure rate for long term life

Temperature acceleration calculated with junction temperatures


Later Method

Problems

Competing failure modes not adequately modeled with constant failure rate

Zero rejects and unidentified mechanisms often not treated

Bathtub curve approximated in flat region only because of constant failure rate


An Alternative Model

Three categories of possible failures:

Test Escapes

Defective Subpopulations

Competing Failure Mechanisms

The three D’s:

Defective

Deficient


Non-Functional Test Escapes

Quality issue

Inadequate testing at manufacturer or damaged after testing prior to customer receipt

Rejects “discovered” at customer; mistakenly called reliability failures

Assume zero in model


Defective Subpopulations

There are proportions of the total population at risk of failure. Defective units are called mortals. The ones without the defect are called immortals.

Defective subpopulations are generally associated with processing problems.

There are physical reasons why defective subpopulations should exist.

Always question the assumption (common in the traditional approach) that any observed failure type will eventually affect all other devices.


Competing Risks

There are failure mechanisms that can affect all units.

We call these mechanisms competing risks because several different types may exist and any one can cause the unit to fail.

These mechanisms are typically associated with design, processing, or material problems.

We model the failures using Weibull or Lognormal distributions.


General Reliability Model

[Equation: total CDF $F_T$ expressed in terms of $F_e$, $F_d$, and $F_N$; the operators did not survive extraction]

where

$$F_N = 1 - R_1 R_2 \cdots R_N$$

Activation energies are specific to failure mechanisms.

Zero rejects and unidentified mechanisms are included.

Generates complete bathtub curve!


General Reliability Model In Use at AMD

AMD Reliability Brochure 1994 Data


AMD Reliability Brochure 1994 Data


Appendix


Class Project

System Models

A) A component has reliability R(t) = 0.99. Twenty-five components in series form a system. Calculate the system reliability.

$R_s(t) = (0.99)^{25} = 0.778$ or 77.8%

B) A component has reliability R(t) = 0.95. Three components in parallel form a system. Calculate the system reliability.

$R_s(t) = 1 - (1 - 0.95)^3 = 0.9999$ or 99.99%


Class Project

Complex Systems

A system consists of seven units: A, B, C, D, E, G, H. For the system to function, unit A and either unit B or C and either D and E together or G and H together must be working. Draw the reliability block diagram for this setup.

           +--[ B ]--+   +--[ D ]--[ E ]--+
--[ A ]----+         +---+                +--
           +--[ C ]--+   +--[ G ]--[ H ]--+

Write the equation for the CDF of the system in terms of the individual component reliabilities, that is, the $R_i$, where i = A, B, C, ..., G, H. Hint: Consider the three subsystems: A alone; B with C; and D, E, G, H.

1) $R_A$

2) $R_{BC} = 1 - (1 - R_B)(1 - R_C)$

3) $R_{DEGH} = 1 - (1 - R_{DE})(1 - R_{GH}) = 1 - (1 - R_D R_E)(1 - R_G R_H)$

The system CDF is $F_S = 1 - R_S = 1 - R_A R_{BC} R_{DEGH}$
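The three-subsystem decomposition translates directly into code. In this sketch the function name and the component reliabilities (0.9 for every unit) are hypothetical values chosen only to exercise the formula.

```python
def system_cdf(r):
    """System CDF for: A in series with (B or C) in series with (D-E or G-H)."""
    r_bc = 1.0 - (1.0 - r["B"]) * (1.0 - r["C"])                  # B parallel C
    r_degh = 1.0 - (1.0 - r["D"] * r["E"]) * (1.0 - r["G"] * r["H"])  # D-E parallel G-H
    r_system = r["A"] * r_bc * r_degh                             # three blocks in series
    return 1.0 - r_system

# Hypothetical reliabilities: every unit at 0.9 (not from the notes).
reliabilities = {unit: 0.9 for unit in "ABCDEGH"}
f_s = system_cdf(reliabilities)
print(f"F_S = {f_s:.4f}")
```

Unit A dominates the result in this example because it is the only block without redundancy.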


Class Project

Defect Models

1. Estimate the proportion defective p and the number of mortals in the sample. Fill in the mortal CDF column in the table below.

| Time | Cum # Fails | CDF Est All Units (%) | CDF Est Mortals (%) |
|---|---|---|---|
| 10 | 2 | 2/50 = 4% | |
| 25 | 4 | 4/50 = 8% | |
| 50 | 8 | 8/50 = 16% | |
| 100 | 13 | 13/50 = 26% | |
| 200 | 17 | 17/50 = 34% | |
| 500 | 20 | 20/50 = 40% | |
| 1000 | 20 | 20/50 = 40% | |

2. Plot the data for the mortal subpopulation on the same sheet of paper. Does the fit look reasonable?

4. Estimate the characteristic life c = T 63 , the 63rd percentile.

5. Estimate the shape parameter m by drawing a line perpendicular to the “best fit by eye line” through the estimation point on the Weibull paper and reading the beta estimation scale.


Class Project

Defect Model Example

n = 50

| Time | Cum # Fails | CDF Est All Units (%) | CDF Est Mortals (%) |
|---|---|---|---|
| 10 | 2 | 2/50 = 4% | 2/20 = 10% |
| 25 | 4 | 4/50 = 8% | 4/20 = 20% |
| 50 | 8 | 8/50 = 16% | 8/20 = 40% |
| 100 | 13 | 13/50 = 26% | 13/20 = 65% |
| 200 | 17 | 17/50 = 34% | 17/20 = 85% |
| 500 | 20 | 20/50 = 40% | 20/20 = 100% |
| 1000 | 20 | 20/50 = 40% | 20/20 = 100% |

Estimated mortal fraction, p : 0.40 or 40%

CDF estimate for mortals is based on sample size of defective subpopulation.
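The table can be generated from the readout data. This sketch assumes, as the worked example does, that every mortal has failed by the last readout, so the final cumulative count is taken as the number of mortals; the variable names are hypothetical.

```python
from itertools import accumulate

N_UNITS = 50
readout_hours = [10, 25, 50, 100, 200, 500, 1000]
fails_per_interval = [2, 2, 4, 5, 4, 3, 0]

cum_fails = list(accumulate(fails_per_interval))
n_mortals = cum_fails[-1]       # assumes all mortals have failed by the last readout
p_mortal = n_mortals / N_UNITS  # estimated mortal fraction

cdf_all_units = [c / N_UNITS for c in cum_fails]
cdf_mortals = [c / n_mortals for c in cum_fails]

for t, c, f_all, f_m in zip(readout_hours, cum_fails, cdf_all_units, cdf_mortals):
    print(f"{t:>5} h  cum fails {c:>2}  all units {f_all:.0%}  mortals {f_m:.0%}")
```

The plateau at 20 failures from 500 to 1,000 hours is what justifies treating the remaining 30 units as immortals for this mechanism.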


Weibull Probability Plot


Class Project

Defect Model Example Model Check

Weibull Parameter Estimates for Mortal Population :

Characteristic Life (c): 100

Shape Parameter (m): 1.0

$$F(t) = 1 - e^{-(t/c)^m}$$

| Time | Mortal CDF (Weibull Model) | Mortal Fraction | Model CDF for All Units | Empirical CDF All Units |
|---|---|---|---|---|
| 25 | 0.221 | 0.4 | 0.088 | 0.08 |
| 100 | 0.632 | 0.4 | 0.253 | 0.26 |
| 1000 | 1.000 | 0.4 | 0.400 | 0.40 |
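The model check amounts to multiplying the Weibull mortal CDF by the mortal fraction and comparing with the empirical CDF. A minimal sketch, with hypothetical names and the parameter estimates from this example (c = 100, m = 1.0, p = 0.4):

```python
import math

C_LIFE = 100.0  # characteristic life estimate (hours)
M_SHAPE = 1.0   # shape parameter estimate
P_MORTAL = 0.4  # estimated mortal fraction

def weibull_cdf(t, c, m):
    """Weibull CDF: 1 - exp(-(t/c)^m)."""
    return 1.0 - math.exp(-((t / c) ** m))

# Empirical CDF for all units at the three check times, from the table.
empirical_cdf = {25: 0.08, 100: 0.26, 1000: 0.40}

model_cdf = {t: P_MORTAL * weibull_cdf(t, C_LIFE, M_SHAPE) for t in empirical_cdf}

for t in empirical_cdf:
    print(f"{t:>5} h  model {model_cdf[t]:.3f}  empirical {empirical_cdf[t]:.2f}")
```

The model and empirical values differ by less than one percentage point at each check time, which supports the defect model for these data.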


Class Project

[Figure: Defect Model Example, Defect Model p x Weibull CDF Plot. Vertical axis: CDF (0 to 1); horizontal axis: Times (Hrs), 0 to 1000.]