37. A producer of pocket calculators estimates that the calculators fail at a rate of one
every five years. The calculators are sold for $25 each with a one-year free
replacement warranty but can be purchased from an unregistered mail-order
source for $18.50 without the warranty. Is it worth purchasing the calculator with
the warranty?
38. For Problem 37, what length of period of the warranty equates the replacement
costs of the calculator with and without the warranty?
39. Zemansky's sells tires with a pro rata warranty. The tires are warranted to deliver
50,000 miles with the rebate based on the remaining tread on the tire. The tires fail
on the average after 35,000 miles of wear. Suppose the tires sell for $50 each with
the warranty. If failures occur completely at random, what would be a consistent
price for the tires if no warranty were offered?
40. Habard's, a chain of hardware stores, sells a variety of tools and home repair items.
One of their best wrenches sells for $5.50. Habard's will include a three-year free
replacement warranty for an additional $1.50. The wrench is expected to be subject
to heavy use and, based on past experience, will fail randomly at a rate of one
every eight years. Is it worth purchasing the warranty?
41. Consider the case in which the failure mechanism for the product does not obey
the exponential law. In that case, the cost under the free replacement warranty that
is indifferent to the cost of buying the item without a warranty is given by
$$C_1 = K[M(W_1) + 1],$$
where M(t) is known as the renewal function.
If the time between failures, T, follows an Erlang law with parameters $\lambda$ and
2, then
$$M(t) = \frac{\lambda t}{2} - 0.25 + 0.25e^{-2\lambda t} \quad \text{for all } t \ge 0.$$
(See, for example, Barlow and Proschan, 1965, p. 57.)
a. For Example 13.13, presented in this section, determine the indifference value
of the item with a free replacement warranty when the failure law follows an
Erlang distribution. Assume that $\lambda = 3$ to give the same value of E(T) as in the
example.
b. Is the value of the warranty larger or smaller than in the corresponding exponential
case? Explain the result intuitively.
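The renewal-function formula in Problem 41 can be evaluated numerically. The sketch below is only an illustration: it assumes that K is the price without a warranty and W_1 is the length of the free replacement warranty, and the numbers used (K = 100, W_1 = 1, lambda = 2) are hypothetical values, not those of Example 13.13.

```python
import math

def erlang2_renewal(t, lam):
    """Renewal function M(t) for Erlang lifetimes with parameters lam and 2."""
    return lam * t / 2 - 0.25 + 0.25 * math.exp(-2 * lam * t)

def frw_indifference_price(K, W1, lam):
    """Indifference price C_1 = K[M(W_1) + 1] under a free replacement warranty.

    K, W1, and lam are assumed names for the no-warranty price, the warranty
    length, and the Erlang rate parameter.
    """
    return K * (erlang2_renewal(W1, lam) + 1)

# Hypothetical inputs, chosen only to exercise the formula.
K, W1, lam = 100.0, 1.0, 2.0
M_at_zero = erlang2_renewal(0.0, lam)      # no renewals are expected in zero time
C1 = frw_indifference_price(K, W1, lam)
print(f"M({W1}) = {erlang2_renewal(W1, lam):.4f}, C_1 = {C1:.2f}")
```

Because M(0) = 0, a warranty of zero length gives C_1 = K, which is a quick sanity check on the formula.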
13.9 SOFTWARE RELIABILITY
Software reliability is a problem with characteristics different from hardware reliability
problems. Typically, new software possesses a few "bugs," or errors. Ideally, one
would like to remove all the bugs from the software before its release, but that may be
impossible. It is more reasonable to release the software when the number of bugs has
been reduced to an acceptable level. Predicting the number of remaining bugs is, however,
a difficult problem.
The importance of software reliability cannot be overemphasized. Quoting from
The Wall Street Journal (Davis, 1987):
The tiniest software bug can fell the mightiest machine, often with disastrous consequences.
During the past five years, software defects have killed sailors, maimed patients, wounded
Snapshot Application
RELIABILITY-CENTERED MAINTENANCE IMPROVES OPERATIONS AT THREE MILE
ISLAND NUCLEAR PLANT
The Three Mile Island nuclear facility located on the
Susquehanna River about 10 miles from Harrisburg,
Pennsylvania, is notorious in one respect. It was the site
of the worst nuclear power generating plant accident in
the United States. In March of 1979, Unit 2 underwent a
core meltdown as safety systems failed to lift nuclear
fuel rods from the core. The facility was shut down as a
result of the accident for the next six and one-half years,
finally reopening in October 1985. The plant, operated
by GPU Nuclear Corporation, has compiled one of the
most impressive records in the industry since it has
reopened. According to Fox et al. (1994), the plant was
ranked top in the world in 1989 on the basis of its capacity
factor (proportion of up-time).
In 1987, GPU began to consider the benefits of a
reliability-centered maintenance (RCM) approach to
preventive maintenance. They identified 28 out of a
total of 134 systems as viable candidates for RCM. These
28 systems included the main turbine, the cooling
water system, the main generator, and circulating water.
The RCM process relied on the following four basic
principles:
* Preserve system functions.
* Identify equipment failures that defeat those functions.
* Prioritize failure modes.
* Define preventive maintenance tasks for high-priority
failure modes.
The RCM project spanned the period of September
1988 to June 1994. A total of 3,778 components in the
28 subsystems came under consideration. By the end of
the program, preventive maintenance policies included
more than 5,400 tasks for these components. The cost of
implementing RCM was substantial: about $30,000 per
system. However, these costs were more than offset by
the benefits. Over the period 1990 to 1994, records show
a significant decline in plant equipment failures. In addition,
a reliability-based maintenance program can have
other benefits, including:
* Increased plant availability.
* Optimized spare parts inventories.
* Identification of component failure modes.
* Discovery of new plant failure scenarios.
* Training for engineering personnel.
* Identification of components that benefit from
revised preventive maintenance strategies.
* Identification of potential design improvements.
* Improved documentation.
Fox et al. (1994) report several lessons learned from
this experience. One is that it is better for the internal
maintenance organization, rather than an outside
agency, to direct the process. This avoids the "we versus
they" syndrome. Successful implementation is also more
likely in this case. A cost analysis checklist was developed
to screen failure modes. Finally, the team evolved an
efficient multiuser relational database software system to
facilitate RCM evaluations. This system reduced the time
required to perform the necessary analyses by 50 percent.
The lesson learned from this case is that a carefully
designed and implemented reliability-based preventive
maintenance program can have big payoffs for high-stakes
systems.
corporations and threatened to cause the government-securities market to collapse. Such
problems are likely to grow as industry and the military increasingly rely on software to run
systems of phenomenal complexity, including President Reagan's proposed "Star Wars"
antimissile defense system.
Several models have been proposed for estimating software reliability. However, we
will not present these models in detail because their utility has yet to be determined.
Jelinski and Moranda (1972) have suggested the following approach. Let N be the total
initial error content (i.e., the number of bugs) in the software. As the software undergoes
testing, the number of bugs is reduced. They assume that the failure rate (that is,
the likelihood of detecting a bug) is proportional to the number of bugs remaining in
the program, where $\phi$ is the proportionality constant. That is, the time until detection
of the first bug has the exponential distribution with parameter $N\phi$; the time between
detection of the first and the second bugs has exponential distribution with parameter
$(N - 1)\phi$; and so on.
Hence, as bugs are removed from the program, the amount of time required to detect
the next bug increases. After n bugs have been removed, one will have observed the
values of $T_1, T_2, \ldots, T_n$ representing the time between successive detections. These
observations are used to estimate $\phi$ and N using the maximum likelihood principle.
Based on these estimates, one could predict exactly how much testing would be
required in order to achieve a certain level of reliability in the software.
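To make the estimation step concrete, here is a minimal sketch of maximum likelihood fitting for the model just described, assuming the i-th gap between detections is exponential with rate phi times the number of bugs still present. The function names, the grid search over integer N, and the idealized data (gaps set to their expected values under a hypothetical N = 30 and phi = 0.05) are illustrative assumptions, not the authors' procedure.

```python
import math

def jm_profile_log_likelihood(times, N):
    """Log-likelihood at the best phi for a fixed initial bug count N.

    times[i] is the gap before the (i+1)-th detection, during which
    N - i bugs are assumed to remain in the program.
    """
    n = len(times)
    weighted = sum((N - i) * t for i, t in enumerate(times))
    phi = n / weighted  # closed-form maximizer in phi when N is held fixed
    log_lik = n * math.log(phi) + sum(math.log(N - i) for i in range(n)) - n
    return log_lik, phi

def jm_mle(times, n_max=1000):
    """Search integer N from n to n_max for the largest profile likelihood.

    If the answer equals n_max, the data may not show reliability growth
    and the estimate should not be trusted.
    """
    n = len(times)
    best = None
    for N in range(n, n_max + 1):
        log_lik, phi = jm_profile_log_likelihood(times, N)
        if best is None or log_lik > best[0]:
            best = (log_lik, N, phi)
    return best[1], best[2]

# Idealized, hypothetical data: 20 gaps equal to their expected values.
true_N, true_phi = 30, 0.05
gaps = [1.0 / (true_phi * (true_N - i)) for i in range(20)]

N_hat, phi_hat = jm_mle(gaps, n_max=200)
remaining = N_hat - len(gaps)
print(f"N_hat = {N_hat}, phi_hat = {phi_hat:.4f}, estimated bugs remaining = {remaining}")
```

With real, noisy detection times the estimates can be unstable, and the search may run to its upper limit when the gaps do not lengthen over time.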
Shooman (1972) suggests using a normalized error rate to measure the error content
in the program. He defines
p(t) = Errors per total number of instructions per month of debugging time
and develops a reliability model based on first principles. He demonstrates how this
model can be used to build a functional relationship between the amount of time
devoted to debugging and the reliability of the program.
The works of Jelinski and Moranda and of Shooman represent the foundation of the
theory of software reliability. Extensions of their methods have been considered. It
remains to be seen, however, if these methods provide accurate descriptions of the
problem and whether they ultimately will assist in predicting the time required to
achieve an acceptable level of reliability.
13.10 HISTORICAL NOTES
Much of the theory of reliability, life testing, and maintenance strategies has its roots
in actuarial theory developed by the insurance industry. Sophisticated mathematical
models for predicting survival probabilities date back to the turn of the century. Lotka
some of the connections between equipment replacement models and
The work of Weibull (1939 and 1951) laid the foundations for the
subject of fatigue life in materials.
Interest in reliability problems became considerably more widespread during World
War II when attempts were made to understand the failure laws governing complex
military systems. During the 1950s, problems concerning life testing and missile
reliability began to receive serious attention. In 1952 the Department of Defense
established the Advisory Group on Reliability of Electronic Equipment, which published
its first report on reliability in June of 1957.
The origins of the specific age replacement models presented in this chapter are unclear.
However, sophisticated age replacement models date back as far as the early
1920s (see Taylor, 1923, and Hotelling, 1925). The stochastic planned replacement
models presented in Section 13.7 form the basis for much of the research in replacement
theory, but the origins of these models are unclear as well.
Section 13.8, on warranties, is based on the paper by Blischke and Scheuer (1975).
Extensions and corrections of their work can be found in Mamer (1982). Readers
interested in pursuing further reading should refer to the excellent texts by Barlow and
Proschan (1965 and 1975) on reliability models, and by Gertsbakh (1977) on maintenance
strategies. Issues concerning the application of maintenance models are discussed
by Turban (1967) and Mann (1976).
13.11 SUMMARY
The purpose of this chapter was to review the terminology and the methodology of the
theory and application of reliability and maintenance models. Reliability theory is an area
of study that has received considerable attention from mathematicians. However, the mathematics
is of interest not only for its own sake. These models are extremely useful in an
operational setting in considering such issues as failure characteristics of operating equipment,
economically sound maintenance strategies, and the value of product warranties and
service contracts.
The complexity of the analysis depends upon the assumptions made about the random
variable T, which represents the lifetime of a single item or piece of operating equipment.
The distribution function of T, F(t), is the probability that the item fails at or before time
t ($P\{T \le t\}$), whereas the reliability function of T, R(t), is the probability that the item fails
after time t ($P\{T > t\}$). An important quantity related to these functions is the failure rate
function r(t), which is the ratio f(t)/R(t) of the probability density function and the reliability
function. If $\Delta t$ is sufficiently small, the term $r(t)\Delta t$ can be interpreted as the conditional
probability that the item will fail in the next $\Delta t$ units of time given that it has survived up
until time t.
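The interpretation of r(t) as a conditional probability per unit time can be checked numerically. The sketch below assumes a Weibull lifetime written as F(t) = 1 - exp(-alpha t^beta); that parametrization, the function names, and the values alpha = 0.5, beta = 2 are assumptions made for illustration.

```python
import math

def weibull_cdf(t, alpha, beta):
    """F(t) = 1 - exp(-alpha * t**beta), an assumed Weibull parametrization."""
    return 1.0 - math.exp(-alpha * t ** beta)

def weibull_pdf(t, alpha, beta):
    """Density f(t), the derivative of the CDF above."""
    return alpha * beta * t ** (beta - 1) * math.exp(-alpha * t ** beta)

def failure_rate(t, alpha, beta):
    """r(t) = f(t) / R(t), with R(t) = 1 - F(t)."""
    return weibull_pdf(t, alpha, beta) / (1.0 - weibull_cdf(t, alpha, beta))

def conditional_failure_prob(t, dt, alpha, beta):
    """P(fail in (t, t + dt] | survived to t), computed directly from F."""
    survive = 1.0 - weibull_cdf(t, alpha, beta)
    return (weibull_cdf(t + dt, alpha, beta) - weibull_cdf(t, alpha, beta)) / survive

# Hypothetical parameter values.
alpha, beta, t, dt = 0.5, 2.0, 1.5, 1e-4
approx = failure_rate(t, alpha, beta) * dt
exact = conditional_failure_prob(t, dt, alpha, beta)
print(f"r(t)*dt = {approx:.6e}, conditional probability = {exact:.6e}")
```

The two printed numbers agree closely for small dt and drift apart as dt grows, which is the sense in which the approximation holds only for a sufficiently small interval.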
The failure rate function provides considerable information about the aging characteristics
of operating equipment. In a manufacturing environment, we would expect that most
operating equipment would have an increasing failure rate function. That means it would be
more likely to fail as it ages. A decreasing failure rate function can arise when the likelihood
of early failure is high due to defectives in the population. The Weibull probability law
can be used to describe the failure characteristics of equipment having either an increasing
or a decreasing failure rate function.
Of interest is the case in which the failure rate function is constant. This case gives rise
to the exponential distribution for the lifetime of a single component. The exponential distribution
is the only continuous distribution possessing the memoryless property. This means
that the conditional probability that an item that has been operating up until time t fails in the
next s units of time is independent of t.
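A short numerical check of the memoryless property follows. It is a sketch under assumed values: the rate 0.2 and the ages tested are hypothetical, and the function names are introduced here for illustration.

```python
import math

def exp_survival(t, lam):
    """R(t) = exp(-lam * t) for an exponential lifetime with rate lam."""
    return math.exp(-lam * t)

def prob_fail_within(s, age, lam):
    """P(T <= age + s | T > age) for an exponential lifetime."""
    return (exp_survival(age, lam) - exp_survival(age + s, lam)) / exp_survival(age, lam)

lam, s = 0.2, 1.0                       # hypothetical rate and look-ahead window
ages = [0.0, 1.0, 5.0, 20.0]
probs = [prob_fail_within(s, age, lam) for age in ages]
unconditional = 1.0 - math.exp(-lam * s)

for age, p in zip(ages, probs):
    print(f"age {age:5.1f}: P(fail in next {s}) = {p:.6f}")
```

Every age gives the same probability as a brand-new item, 1 - exp(-lam s), so the age of the item carries no information about its remaining life.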
The Poisson process describes the situation in which a single piece of operating equipment
fails according to the exponential distribution and is replaced immediately upon failure.
When this occurs, the number of failures in a given time has the Poisson distribution,
the time between successive failures has the exponential distribution, and the time for
n failures to occur has the Erlang distribution.
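The link between exponential lifetimes and Poisson counts can be illustrated by simulation. The following is a rough sketch with hypothetical settings (rate 2.0, horizon 2.5, a fixed random seed); it relies on the fact that a Poisson count has mean and variance both equal to rate times horizon.

```python
import random

def count_failures(rate, horizon, rng):
    """Count failures in [0, horizon] when each lifetime is exponential(rate)
    and a failed item is replaced immediately."""
    clock, failures = 0.0, 0
    while True:
        clock += rng.expovariate(rate)   # lifetime of the item now in service
        if clock > horizon:
            return failures
        failures += 1

rng = random.Random(12345)               # fixed seed so the run is repeatable
rate, horizon, runs = 2.0, 2.5, 20000    # hypothetical settings
counts = [count_failures(rate, horizon, rng) for _ in range(runs)]

mean_count = sum(counts) / runs
var_count = sum((c - mean_count) ** 2 for c in counts) / (runs - 1)
print(f"sample mean = {mean_count:.3f}, sample variance = {var_count:.3f}, "
      f"rate * horizon = {rate * horizon:.3f}")
```

The sample mean and variance should both land near rate times horizon, as expected for a Poisson distribution.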
The chapter considered the reliability functions of complex systems of components.
It showed how to obtain the reliability functions for components in series and parallel from
the reliability functions of the individual components. The chapter also considered K out of N
systems, which function only if at least K components function.
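The three system structures can be sketched in a few lines. The code assumes independent components and, for the K out of N case, identical component reliability p; the function names and the numerical values are hypothetical.

```python
import math

def series_reliability(rels):
    """A series system works only if every component works."""
    out = 1.0
    for r in rels:
        out *= r
    return out

def parallel_reliability(rels):
    """A parallel system fails only if every component fails."""
    all_fail = 1.0
    for r in rels:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail

def k_out_of_n_reliability(k, n, p):
    """At least k of n independent, identical components (reliability p) work."""
    return sum(math.comb(n, j) * p ** j * (1.0 - p) ** (n - j)
               for j in range(k, n + 1))

# Hypothetical component reliabilities at some fixed time t.
rels = [0.9, 0.8]
print("series:  ", series_reliability(rels))
print("parallel:", parallel_reliability(rels))
print("2 of 3:  ", k_out_of_n_reliability(2, 3, 0.9))
```

Series and parallel systems are the two extremes of the K out of N structure: K = N gives the series result and K = 1 gives the parallel result when all components share the same reliability.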
Reliability issues form the basis of the maintenance models discussed in the latter
half of the chapter. An important measure of a system's performance is the availability,
which is the proportion of the time that the equipment operates. We treated both deterministic
age replacement models, which do not explicitly include the likelihood of
equipment failure, and stochastic age replacement models, which do. The stochastic
models allow for replacing the equipment before failure. This is of interest when items
have an increasing failure rate function and unplanned failures are more costly than
planned replacements.
Finally, we concluded the chapter with a discussion of the economic value of warranties.
A warranty is a promise supplied by the seller to the buyer to either replace the item with a
new one if it fails during the warranty period (free replacement warranty) or provide a
discount on the purchase of a new item proportional to the remaining amount of time
(or wear) in the warranty period (pro rata warranty). The issues surrounding warranties and
service contracts are similar, but service contract models are considerably more complex,
owing to the need to include multiple levels of repair.
Additional Problems on Reliability and Maintainability
42. A large national producer of appliances has traced customer experience with a popular
toaster oven. A survey of 5,000 customers who purchased the oven early in 2000 has
revealed the following:
Year Number of Breakdowns
2000 188
2001 58
2002 63
2003 R
2004 54
2005 n
a. Using these data, estimate $p_k$ = the probability that a toaster oven fails in its kth
year of operation, for k = 1, ...,
b. What is the likelihood that a toaster oven will last at least six years without failure
based on these data?
c. The discrete failure rate function has the form $r_k = p_k/R_{k-1}$, where $R_k$ is the probability
that a unit survives through period k. Determine the failure rate function for
the first five years of operation from the given data.
d. Suppose that you purchased a toaster oven at the beginning of 2004 and it is still
operating at the end of 2007. If the reliability has not changed appreciably from
2000 to 2007, use the results of part (c) to obtain the probability that it will fail
during the first two months of calendar year 2008.
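The discrete failure rate in part (c) can be computed mechanically from yearly failure counts. The sketch below uses made-up counts for a hypothetical population of 1,000 units, not the toaster oven data, and assumes R_0 = 1 (every unit is working at the start).

```python
def discrete_failure_rates(failures_by_year, population):
    """Return (p, R, r) where p[k] is the probability of failing in year k+1,
    R[k] the probability of surviving through year k+1, and
    r[k] = p[k] / R[k-1], taking R_0 = 1."""
    p = [f / population for f in failures_by_year]
    R, r = [], []
    survive_prev = 1.0                   # R_0: all units start in working order
    for pk in p:
        r.append(pk / survive_prev)      # failure rate among units still alive
        survive_prev -= pk
        R.append(survive_prev)
    return p, R, r

# Hypothetical counts chosen so that the failure rate is constant.
p, R, r = discrete_failure_rates([100, 90, 81], population=1000)
for year, (pk, Rk, rk) in enumerate(zip(p, R, r), start=1):
    print(f"year {year}: p = {pk:.3f}, R = {Rk:.3f}, r = {rk:.3f}")
```

In this made-up data set a tenth of the surviving units fail each year, so r_k stays at 0.1 even though the raw counts fall.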
43. Six thousand light bulbs light a large hotel and casino marquee. Each bulb fails
completely at random, and each has an average lifetime of 3,280 hours. Assuming that
the marquee stays lit continuously and bulbs that burn out are replaced immediately, how
many replacements must be made each year on the average?
44. The owner of the hotel mentioned in Problem 43 has decided that in order to decrease
the number of burned-out bulbs, she will replace all 6,000 bulbs at the start of each
year in addition to replacing the bulbs as they burn out. Comment on the effectiveness
of this strategy.
45. The owner of the hotel mentioned in Problem 43 falls on hard times and dispenses with
replacement of the bulbs. She notices that more than half of the bulbs have burned out
before the advertised average lifetime of 3,280 hours and decides to sue the light bulb
manufacturer for false advertising. Do you think she has a case? (Hint: What fraction of
the bulbs would be expected to fail prior to the mean lifetime?)
46. Continuing with the example of Problem 43, determine the following:
a. The proportion of bulbs lasting more than two years.
b. The probability that a bulb chosen at random fails in the first three months of
operation.
c. The probability that a bulb that has lasted for 10 years fails in the next three months
of operation.
47. Assume that the bulbs in Problem 43 are not replaced as they fail.
a. What fraction of the 6,000 bulbs are expected to fail in the first year?
b. What fraction of the bulbs surviving the first year are expected to fail in the second
year?
c. What fraction of the bulbs surviving the nth year are expected to fail in year n + 1 for
any value of n = 1, 2, ...?
d. Using the results of part (c), of the original 6,000 bulbs, how many would be
expected to fail in the fourth year of operation?
48. The mean value of a Weibull random variable is given by the formula
$$\mu = \alpha^{-1/\beta}\,\Gamma(1 + 1/\beta),$$
where $\Gamma$ represents the gamma function. The gamma function has the property that
$\Gamma(k) = (k - 1)\Gamma(k - 1)$ for any value of k > 1 and $\Gamma(1) = 1$. Notice that if k is an
integer, this results in $\Gamma(k) = (k - 1)!$. If k is not an integer, one must use the recursive
definition for $\Gamma(k)$ coupled with the following table. For values of $1 \le k \le 2$, $\Gamma(k)$
is given by
k      Γ(k)      k      Γ(k)
1.00   1.0000    1.55   .8889
1.05   .9735     1.60   .8935
1.10   .9514     1.65   .9001
1.15   .9330     1.70   .9086
1.20   .9182     1.75   .9191
1.25   .9064     1.80   .9314
1.30   .8975     1.85   .9456
1.35   .8912     1.90   .9618
1.40   .8873     1.95   .9799
1.45   .8857     2.00   1.0000
1.50   .8862
For example, this table would be used as follows:
$$\Gamma(3.6) = (2.6)\Gamma(2.6) = (2.6)(1.6)\Gamma(1.6) = (2.6)(1.6)(.8935) = 3.717.$$
a. Compute the expected failure time for Example 13.4 regarding copier equipment.
b. Compute the expected failure time for a piece of operating equipment whose failure
law is given in Example 13.2.
c. Determine the mean failure time for $\alpha$ = 35 and $\beta$ = 0.20.
d. Determine the mean failure time for $\alpha$ = 1.90 and $\beta$ = 0.45.
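The table-plus-recursion procedure can be mirrored in code and compared with a library gamma function. This is a sketch only: it assumes the mean formula reads mu = alpha^(-1/beta) Gamma(1 + 1/beta), the table lookup works only for arguments that reduce to a tabulated value, and the parameters alpha = 2, beta = 2 are hypothetical.

```python
import math

# Gamma(k) for 1 <= k <= 2 in steps of 0.05, rounded to four decimals.
GAMMA_TABLE = {
    1.00: 1.0000, 1.05: 0.9735, 1.10: 0.9514, 1.15: 0.9330, 1.20: 0.9182,
    1.25: 0.9064, 1.30: 0.8975, 1.35: 0.8912, 1.40: 0.8873, 1.45: 0.8857,
    1.50: 0.8862, 1.55: 0.8889, 1.60: 0.8935, 1.65: 0.9001, 1.70: 0.9086,
    1.75: 0.9191, 1.80: 0.9314, 1.85: 0.9456, 1.90: 0.9618, 1.95: 0.9799,
    2.00: 1.0000,
}

def gamma_from_table(k):
    """Apply Gamma(k) = (k - 1) * Gamma(k - 1) until k lies in [1, 2],
    then look the value up in the table (k >= 1 assumed)."""
    factor = 1.0
    while k > 2.0:
        factor *= (k - 1.0)
        k -= 1.0
    return factor * GAMMA_TABLE[round(k, 2)]

def weibull_mean(alpha, beta):
    """Mean of a Weibull lifetime, assuming F(t) = 1 - exp(-alpha * t**beta)."""
    return alpha ** (-1.0 / beta) * math.gamma(1.0 + 1.0 / beta)

table_value = gamma_from_table(3.6)      # (2.6)(1.6)(.8935)
library_value = math.gamma(3.6)
mean_example = weibull_mean(2.0, 2.0)    # hypothetical parameters
print(f"table: {table_value:.4f}, math.gamma: {library_value:.4f}, "
      f"Weibull mean: {mean_example:.4f}")
```

When beta = 1 the formula reduces to 1/alpha, the mean of an exponential distribution, which is a convenient check on the parametrization.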
49. Suppose that a particular light bulb is advertised as having an average lifetime of 2,000
hours and is known to satisfy an exponential failure law. Suppose for simplicity that the
bulb is used continuously. Find the probability that the bulb lasts
a. More than 3,000 hours.
b. Less than 1,500 hours.
c. Between 2,000 and 2,500 hours.
50. Applicational Materials sells several pieces of equipment used in the manufacture
of silicon-based microprocessors. In 2003 the company filled 130 orders for model
55212. Suppose that the machines fail according to a Weibull law. In particular,
the cumulative distribution function F(t) of the time until failure of any machine