Software Testing: Protocol Comparison

James Yen, David Banks, P. Black, L. J. Gallagher, C. R. Hagwood, R. N. Kacker, and L. S. Rosenthal
National Institute of Standards and Technology, Gaithersburg, MD 20899 USA
March 28, 1998

Abstract: Software testing is hard, expensive, and uncertain. Many protocols have been suggested, especially in the area of conformance verification. In order to compare the efficacy of these protocols, we have implemented a designed simulation experiment that examines performance in terms of testing costs and risks. We find that no single method dominates all others, and provide guidance on how to choose the best protocol for a given application.

1 Software Testing

Many researchers have proposed protocols for testing software, inspecting software, and deciding when to release software. Oddly enough, some of the seminal work in this area appeared decades ago, in the context of proofreading. More recently, as software quality became a key issue and as people acquired better empirical data on software bugs, the focus has become more driven by applications. Our paper describes a methodology to resolve uncertainty about the relative merits of competing protocols, and presents results that indicate how delicate such comparisons can be.

The problem of testing software entails a tradeoff between the cost of testing and the cost of undiscovered bugs. This balance is complicated by the difficulty in estimating the number and severity of undiscovered bugs, the difficulty of capturing those bugs, and the difficulty in creating realistic but mathematically tractable models for bugs. Practitioners have also developed testing procedures, which often work well but generally lack theoretical justification.

In practice, conformance testers rely upon a straightforward, fixed-effort protocol. Here one develops tests from a list of specifications, then verifies that the code passes all or nearly all tests. This strategy is expensive if the specification list is long, and often fails to find interactive flaws. Nonetheless, its simplicity makes it a commonly used procedure.

A slightly different procedure is used for tests made during software development. They generate a sensible test for every specification listed by the designers. If the software fails few of these, then those bugs are corrected and the software is released. But if the software fails many of the tests, then managers meet to decide how much additional testing effort should be allocated. This protocol is a form of two-stage sampling (Cochran, 1977).

A more mathematical approach is based on Starr's model for proofreading (1974). Here one makes assumptions about the distribution of the costs of bugs and then applies dynamic programming to determine when the expected cost of additional search exceeds the expected cost of the remaining bugs.

Under Starr's assumptions, an explicit answer can be calculated, but small changes in those assumptions make closed form solutions impossible. Although the robustness of Starr's model is a key concern in practice, his approach opened up a new way of formulating these problems.

The most significant of Starr's successors is the work of Dalal and Mallows (1988). They dramatically weakened Starr's assumptions, but still obtained locally asymptotically minimax solutions. The price of this generality was an enormous increase in the mathematical difficulty of solving the dynamic programming problem. Also, despite their unusually direct attention to reality in weakening Starr's assumptions, it remained unclear whether their solution had attractive small-sample properties; this made their solution unappealing in practice. Related work, in the general context of optimal module inspection, has been done by Easterling et al. (1991).

Besides these main strategies, there are a number of other techniques for software testing. Other workers use prior information to focus testing effort on modules that are likely to contain bugs, as in Kuo and Yang (1993, 1995). Dalal and Mallows (1998) describe tests that exercise modules or functions in pairs or triples, in the belief that most costly software errors occur as an interaction.

This paper describes a simulation experiment that enables comparisons among any set of software testing protocols. To illustrate the kinds of results one can achieve, we compare Starr's protocol against fixed-effort protocols that approximate current conformance testing practice. In the future, we plan to extend the simulation comparison to a more definitive set of protocols. Our simulation approach is generalizable, and has the potential to impact larger issues in software management, such as the maintenance of old code, the breakdown of modularity, and budgeting large projects.

Section 2 describes Starr's procedure and the fixed-effort protocols that we compare. Section 3 outlines the design of the simulation experiment. Section 4 discusses our results.

2 The Protocols

This section describes the two testing protocols compared in this paper. The first approximates versions of conformance testing, as currently practiced. The second is an implementation of Starr's optimal protocol (1974), with small adaptations to accommodate estimation issues not treated in the original paper.

We chose these protocols to compare because the first is simple to simulate and informs practical application, while the second represents the most mathematically sophisticated of the practicable protocols. If the latter does not offer much improvement over the former, then formal optimality properties are an unreliable guide to software validation.

2.1 Conformance Testing

In conformance testing one exercises the prescribed functionality of the software code by executing a series of tests. Usually this is done as black-box testing, where the tester has no knowledge of nor access to the actual code (cf. Beizer, 1995). Black-box testing is common in procurement, when a purchaser or third party must verify that proprietary software satisfies performance specifications. At the other extreme is white-box conformance testing, as is often done in-house by the software developer; here the testers have complete access to the code. In between these extremes is gray-box testing, where some of the code is available or there is knowledge of the organization and functions of the modules (cf. Banks et al., 1998).

All three kinds of conformance testing rely heavily on problem-specific information. The testing effort may be heavily concentrated in specific modules or functions that the tester believes to be problematic. This dependence on context is difficult to micro-model, but within modules, discussions with people at NIST who perform such tests (especially in the black-box scenario) show broad agreement that the outcome from one test is roughly independent of the outcome of another. Thus, as a first pass, we approximate conformance testing by a fixed-effort random testing protocol: the basic protocol consists of a fixed number of tests, the outcomes of which are approximately independent.

We examine six kinds of fixed-effort random testing protocols, corresponding to different levels of effort. The number of probes or tests is J, which takes the values 10, 50, 100, 500, 1000, and 5000; each probe is independent of the others. This range was chosen to bracket typical practice in small, non-critical software validation. For larger projects, or ones in which software failure has catastrophic consequences, one usually finds that testing is either managed independently for different modules, or that a more complex multi-stage inspection protocol is used.
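To make the fixed-effort protocol concrete, the following sketch (ours, not the paper's) shows how little machinery it requires: a loop with a fixed bound J around whatever single-test routine the tester already has. The probe argument is a hypothetical stand-in for one conformance test.

    import random

    def fixed_effort_test(probe, J, seed=0):
        """Run exactly J independent probes and collect what they expose.

        probe(rng) stands in for one conformance test: it returns an
        identifier for the defect it exposed, or None if the test passed.
        The protocol itself is nothing more than the fixed loop bound J.
        """
        rng = random.Random(seed)
        found = set()
        for _ in range(J):
            hit = probe(rng)
            if hit is not None:
                found.add(hit)
        return found

    # Toy probe: exposes one of five specification violations 10% of the time.
    toy_probe = lambda rng: rng.randrange(5) if rng.random() < 0.1 else None
    print(len(fixed_effort_test(toy_probe, J=500)))   # typically finds all five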

2.2 Starr's Protocol

Starr (1974) gave an explicit solution for the problem of deciding when to stop proofreading a document, assuming that each undiscovered typographical error entails a random cost and that the proofreader is paid by the hour. For this situation, it is obvious that after a certain point, it becomes unprofitable to continue proofreading. However, finding that point entails the solution of a nontrivial dynamic programming problem.

Consider a set with a large (infinite) number of objects, of which N are distinguished. The objects could be typographic errors, software bugs, or even prey. The problem is to identify as many as possible of the group of N objects, where there is a cost for searching and a reward for finding. Starr assumed that the times to capture of the distinguished objects were exponentially distributed with rate λ and mean 1/λ, meaning that the cumulative distribution function of the capture time X of an individual prey has the form

    P(X ≤ x) = 1 − e^{−λx},  x ≥ 0,    (1)

and the probability density of X is

    f(x) = λ e^{−λx},  x ≥ 0.    (2)

The values, or payoffs, of the bugs upon capture are independent random variables, independent of the capture times, with common mean a; thus the fact that a bug is hard to catch implies nothing about its payoff. Starr assumes that N, λ, and a are all known.

Although the above assumptions may seem unrealistic, they are for the most part present in the Jelinski-Moranda (J-M) model (1972), which is the most commonly used theory for analyzing software failure data. In the J-M model, the capture times for the software bugs have the same distributional form as in Starr's model. However, in the J-M model, N and λ are not assumed to be known.

In this paper, we develop an implementation of Starr's protocol that applies to software. The key extensions are methods to estimate the number of undiscovered bugs and the distribution of their costs, as discussed below; both extensions will be addressed further in subsequent work.

Let b be the unit cost of searching, and let k(t) be the number and v(t) be the total value of the bugs found by time t. Starr's stopping rule maximizes expected net payoff, where the net payoff is v(t) − bt: the total payoff of the bugs captured minus the search cost, which is b times the length of the search. The solution maximizes the expected total payoff over all possible stopping schemes that depend only upon the present and past; thus, for example, one can't use a rule that says "Stop when the undiscovered bugs have total value less than d" because one doesn't know the value of future bugs. Note that the capture model implies that once a bug is captured, it is instantaneously repaired without introducing new defects.

Starr's optimal procedure is the one that stops searching at the first time at which

    k(t) ≥ N − b/(aλ).    (3)

This optimal scheme says that one should continue searching until the number of bugs remaining in the code is no more than b/(aλ), where b/λ is the expected cost of capture for a single bug.

Of course, when one starts to debug software one does not know N, the actual number of bugs present, or even the rate λ or the average payoff a of the bugs. Therefore our attempt to automate Starr's procedure in a conformance testing context involves producing estimates N̂, λ̂, and â after each capture, and plugging them into (3). Thus we would stop searching at the first time that the following inequality holds:

    k(t) ≥ N̂ − b/(âλ̂).    (4)

This approximation to Starr's rule is asymptotically valid, and its small sample properties should be very similar to those of Starr's theory. Dalal and Mallows (1988) work with probabilistic models of software testing that are much more general and thus have the capacity to be more realistic.

We need now to discuss parameter estimation procedures that use only the real-time incoming data from the current software project. The obvious estimator of the average value a of the bugs is simply the arithmetic mean of the individual payoffs. But the accuracy of this estimate depends strongly upon the assumption of independence between payoff and the length of time to catch the bug. If this is suspect, one should consider Bayesian methods, as described in Section 3.
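The threshold b/(aλ) in (3) has a simple interpretation: when N − k(t) bugs remain, the waiting time to the next capture is exponential with rate (N − k(t))λ, so the next capture costs b/((N − k(t))λ) in expectation and earns a in expectation; continuing pays only while a exceeds that cost, that is, while N − k(t) > b/(aλ). The minimal sketch below (ours) implements the plug-in rule (4); it assumes the estimates N̂, λ̂, and â are supplied by whatever estimation scheme is in use, such as the one described next.

    def starr_should_stop(k_t, N_hat, lam_hat, a_hat, b):
        """Plug-in version of Starr's stopping rule, as in (4).

        k_t     : number of bugs captured so far
        N_hat   : current estimate of the total number of bugs
        lam_hat : current estimate of the capture rate lambda
        a_hat   : current estimate of the mean bug payoff
        b       : unit cost of searching
        Stop as soon as the estimated number of remaining bugs,
        N_hat - k_t, is no larger than b / (a_hat * lam_hat).
        """
        if lam_hat <= 0 or a_hat <= 0:
            return False            # degenerate estimates: keep searching
        return k_t >= N_hat - b / (a_hat * lam_hat)

    # Example: 7 bugs caught, an estimated 9 in total, search cheap relative
    # to payoffs -- the rule says to keep searching.
    print(starr_should_stop(k_t=7, N_hat=9.0, lam_hat=0.02, a_hat=1.0, b=0.005))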

The simultaneous estimation of N and λ is problematic and has produced considerable work in the literature, most of which features maximum likelihood methods in censored or truncated situations. For example, Deemer and Votaw (1955) produce a maximum likelihood estimate of λ by considering the capture times until a time T to be observations from an exponential distribution truncated at T. However, this maximum likelihood estimator often turns out to be 0. Similarly, Blumenthal and Marcus (1975), in a paper on estimating population size, show that the maximum likelihood estimator of N is often infinite. They attempt to alleviate that problem through a Bayesian analysis, which is impractical in our application. Joe and Reid (1985) produce different modifications of the maximum likelihood methods that utilize interval estimates and harmonic means. In the following discussion, we follow closely the approach of Joe and Reid (1985).

Suppose that the N capture times T_1, ..., T_N are independent and identically distributed according to an exponential distribution with rate λ. Let t_(1), ..., t_(k(t)) be the first k(t) ordered capture times, and assume that k(t) ≥ 1. Set w = ∑_{i=1}^{k(t)} t_(i)/t. Then Joe and Reid state that the maximum likelihood estimator of λ as a function of N is k(t)/[t(w + N − k(t))], so that the maximum likelihood estimator N̂ is the value of N that maximizes

    g(N | w, k(t)) = ∏_{i=1}^{k(t)} (N − i + 1) / (w + N − k(t))^{k(t)}.    (5)

Joe and Reid then divide the resulting values of the maximum likelihood estimator N̂ into three cases:

Case 1: If w ≤ k(t) / ∑_{i=1}^{k(t)} (1/i), then N̂ = k(t).

Case 2: If k(t) ≥ 2 and w ≥ (k(t) + 1)/2, then N̂ = ∞.

Case 3: If k(t) ≥ 2 and

    k(t) / ∑_{i=1}^{k(t)} (1/i) < w < (k(t) + 1)/2,    (6)

then the maximum likelihood estimator is the value of k satisfying

    m_{k, k(t)} ≤ w ≤ m_{k+1, k(t)},    (7)

where m_{k,r} = [1 − {(k − r)/k}^{1/r}]^{−1} − (k − r) for positive integers r ≤ k.

There is special trouble in Case 2, where the estimate N̂ = ∞ is useless for implementation of Starr's procedure, as it leads to the estimate λ̂ = 0. Joe and Reid produce modified estimators of N that circumvent this difficulty. These estimates are based on their proposed interval estimates (N̂_1, N̂_2) of N, where

    N̂_1(c) = inf{ N ≥ k(t) : g(N | w, k(t)) ≥ c g(N̂ | w, k(t)) }    (8)

and

    N̂_2(c) = sup{ N ≥ k(t) : g(N | w, k(t)) ≥ c g(N̂ | w, k(t)) }    (9)

for a given c between 0 and 1. Joe and Reid advocate combining the interval endpoints N̂_1, N̂_2 in a harmonic mean to produce an estimate of N of the form

    Ñ = [ (1/2)(N̂_1^{−1} + N̂_2^{−1}) ]^{−1}.    (10)

Cases 1 and 2 are known as the degenerate cases. For Case 1, N̂_1 = k(t), so that Ñ = 2k(t)N̂_2/(k(t) + N̂_2); and for Case 2, N̂_2 = ∞, leading to Ñ = 2N̂_1.

In our implementation of Starr's procedure, we use this harmonic mean estimator Ñ as our estimator N̂ for the degenerate cases, but use the maximum likelihood estimator in the nondegenerate case. Then, N̂ is used to produce an estimate λ̂ of λ by substituting N̂ for N in the maximum likelihood formula λ̂(N) = k(t)/[t(w + N − k(t))]. These are the adjusted estimates in our implementation of Starr's method. Simulations show that these estimates are quite good when k(t) is large and is a substantial proportion of N. As would be expected, the performances of these estimators are greatly diminished when k(t) is smaller in number and when it is a smaller proportion of N.

These methods for estimating the unknown parameters needed to generate Starr's stopping rule are not entirely satisfactory. Although testers do not know the exact values of N, λ, and a, they often have considerable prior information about these parameters from similar software projects that have been completed in the past. Also, experienced programmers have a good feel for the approximate rate of bugs present and their relative costs.
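Before turning to the role of prior information, here is a minimal sketch (ours) of the estimation step just described. It computes w, classifies the cases, maximizes the profile likelihood (5) by direct search over N in the nondegenerate case, and uses a harmonic-mean style adjustment when the raw maximum likelihood estimate would be infinite. The interval constant c = 0.5 and the truncation point N_max are arbitrary illustrative choices, and for brevity Case 1 is left at its maximum likelihood value rather than harmonically adjusted.

    import math

    def profile_likelihood(N, w, k):
        """g(N | w, k) from (5): prod_{i=1}^{k} (N - i + 1) / (w + N - k)^k."""
        log_g = sum(math.log(N - i + 1) for i in range(1, k + 1))
        log_g -= k * math.log(w + N - k)
        return math.exp(log_g)

    def estimate_N_lambda(capture_times, t, c=0.5, N_max=10_000):
        """Adjusted estimates of N and lambda in the spirit of Joe and Reid (1985).

        capture_times : search passes at which bugs were found, all <= t
        t             : current length of the search
        """
        k = len(capture_times)
        if k == 0:
            return None, None                      # nothing to estimate yet
        w = sum(capture_times) / t

        if w <= k / sum(1.0 / i for i in range(1, k + 1)):
            N_hat = float(k)                       # Case 1: all bugs appear to be found
        elif k >= 2 and w >= (k + 1) / 2.0:
            # Case 2: the MLE is infinite; fall back on twice the lower
            # interval endpoint N_1(c), i.e. the harmonic mean of N_1(c)
            # and an infinite upper endpoint.
            g_sup = 1.0                            # limit of g as N grows without bound
            N1 = next((N for N in range(k, N_max)
                       if profile_likelihood(N, w, k) >= c * g_sup), N_max)
            N_hat = 2.0 * N1
        else:
            # Case 3: the MLE is finite; find it by direct search.
            N_hat = float(max(range(k, N_max),
                              key=lambda N: profile_likelihood(N, w, k)))

        lam_hat = k / (t * (w + N_hat - k))        # MLE of lambda at N = N_hat
        return N_hat, lam_hat

    # Example: six captures that all came early in a 200-pass search, so the
    # estimate is that essentially every bug has already been found.
    print(estimate_N_lambda([3, 7, 15, 30, 55, 90], t=200.0))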

If. In particular. then apply each protocol and compare their success. Starr's research was in uential. the estimation of N and  requires only the tabulation of the bugs caught and their capture times. it remains unclear whether their asymptotic results enable testers to achieve signi cant cost reductions in the nite-horizon reality of standard software testing. 3 Simulation We need to assess competing protocols for software testing. they allowed the costs of the bugs to be time dependent. we concentrate on comparing Starr's protocol with several versions of xed-e ort random testing. one wants an empirical comparison. Here. it may be di cult to assess the value of each bug as it is found. But this approach is too expensive. and they obtained asymptotic results suggesting that the method applies when code is of inhomogeneous quality. In particular. is especially important. they dropped the distributional assumptions on the length of time needed to nd the bugs. The plan is to simulate di erent kinds and intensities of bugs in code. By applying each testing protocol across the same set of real software with known bugs and comparing the results.a good feel for the approximate rate of bugs present and their relative costs. then a procedure that estimates a without using prior information tends to underestimate the average payo . the average value of the bugs. However. at the beginning. use of prior information in estimating a. in contrast. In addition. the discovered bugs are all minor. One may have many minor bugs but a few important bugs. Starr's procedure should bene t from a Bayesian estimate of a. The parameters that govern bug characteristics are chosen to match the available empirical data and expert opinion. Thus we replace empirical evaluation with simulation. a de nitive evaluation becomes easy. This error leads Starr's procedure to premature and potentially disastrous termination. leading Dalal and Mallows 1988 to generalize the results in ways that were more faithful to the conformance testing paradigm. Ideally. Previous discussion has referenced the most promising strategies available in the literature. These considerations indicate a sequential Bayesian approach for updating prior information as data on the current software project become available. We emphasize that the simulation does not produce pseudocode that might actually 9 .

Rather, the simulation creates a list of records, each of which represents a single bug, together with enough contextual information that the protocols can operate. The attributes of the record indicate various characteristics of that bug. Specifically, the characteristics used in our simulation are:

Module: This indicates in which module or modules the bug occurs. We may seek to simulate design structures among the modules as well (e.g., tree, linear, or parallel), but this aspect is secondary.

Catchability: This is a score that measures how hard the bug is to discover.

Payoff: This indicates the reward for finding and repairing that particular bug, or the cost to the producer or tester if the bug is not found.

Capture Status: A 1 if the bug has been captured and 0 otherwise.

In future work additional fields will be added, such as the probability that correction introduces a new bug and pointers to other bugs in the same cluster, so that if one bug is found, others in its cluster can have their Catchability scores revised.

A debugging process is simulated using the above quantities and each of the inspection protocols. To conduct a single search pass through a particular module, one selects one of the bugs in the module at random. If that bug has not been captured (found and fixed), then a uniform random number is generated. If that random number is less than the catchability score C for that bug, then that bug is considered to have been discovered; otherwise, that search step fails. All the capture times in this simulation take the form of the number of discrete search passes that have been performed on the bug list.

The complete fidelity of the simulation to a real debugging procedure may be less important than the generation of simulation data that is sufficiently accurate that one can assess the relative performances of different search protocols. Often qualitative differences emerge even from relatively simplistic simulations. For example, it is important for practitioners to develop a feel for the number of passes J that are needed for successful conformance testing on different qualities of software. Similarly, those who implement some version of Starr's protocol need to understand its robustness to failure in the underlying assumptions. Neither of these gains depends strongly on the details of the simulation.
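Read literally, the search step described above takes only a few lines of code. The sketch below is one possible rendering (the record layout and names are ours, not the paper's): a bug is a small record, and a single search pass draws one bug at random and compares a uniform deviate with its catchability score.

    import random
    from dataclasses import dataclass

    @dataclass
    class BugRecord:
        module: int          # which module the bug lives in
        catchability: float  # chance a probe that lands on it finds it
        payoff: float        # reward for fixing it, or cost of leaving it in
        captured: bool = False

    def search_pass(bugs, rng):
        """One discrete search pass over a module's bug list.

        Pick one bug at random; if it is still uncaptured and a uniform
        draw falls below its catchability, the bug is discovered.
        Returns the bug found on this pass, or None if the pass failed.
        """
        bug = rng.choice(bugs)
        if not bug.captured and rng.random() < bug.catchability:
            bug.captured = True
            return bug
        return None

    # Example: repeated passes over a three-bug module.
    rng = random.Random(0)
    module = [BugRecord(0, 0.4, 1.0), BugRecord(0, 0.1, 2.5), BugRecord(0, 0.7, 0.3)]
    passes = 0
    while not all(b.captured for b in module):
        search_pass(module, rng)
        passes += 1
    print(passes)   # capture time of the last bug, in search passes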

Other attributes can be included to make the modeling more realistic. For example, one could categorize the type of error (typographic, logical, or an interaction between modules), or model bug clusters and fix failures. The simulation we report is an instructive pilot. If a user feels that the instantiation of the technique is unrealistic, it is simple to add new parameters and rerun the experiment. Our approach is inexpensive and extremely flexible; it can address other problems in software management than just inspection, and it also applies to conformance testing.

The simulation experiments were run for the simple case of a single module. Future tests are planned for multiple modules, each with differing characteristics, and incorporating some simple structures for module connectivity. Represented in the simulation are the cases where there are 10, 20, 50, and 150 bugs. These cases are meant to represent a small range of programs, from short, clean code to that which is relatively long and buggy.

The Catchability score takes values in the unit interval; the lower the score, the less likely it is that a bug will be found, based upon a discrete analog of the J-M model. Here the scores are randomly generated from a Beta(p, q) distribution with three different pairs for (p, q):

1. p, q = 0.5, 0.5, giving a symmetric U-shaped distribution;
2. p, q = 1, 1, giving the uniform (rectangular) distribution;
3. p, q = 10, 1, giving an asymmetric distribution with most of the weight concentrated near 0.

The first is a reasonable representation of code that is tested during manufacture, since many easily found bugs will be present. The last distribution specifically models third-party conformance testing, where the easy bugs have largely been found and removed in-house before inspection, for example when a vendor has failed to implement some functionality. These values span a range of plausible scenarios.

For these experiments, we used b = 0.005 as the unit cost of searching; the absolute value of this quantity is unimportant, since what matters is its ratio to bug payoffs.
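The bug populations for one cell of the experiment can be regenerated directly from these choices; the sketch below does so, using our own helper names. One caveat: which end of the unit interval a Beta(p, q) piles its weight on depends on the parameter convention, so the third pair may need its arguments swapped in a given library to reproduce the hard-to-catch scenario described above.

    import random

    # Settings from the pilot experiment.
    CATCHABILITY_PAIRS = [(0.5, 0.5), (1.0, 1.0), (10.0, 1.0)]   # (p, q), as listed above
    BUG_COUNTS = [10, 20, 50, 150]
    SEARCH_COST = 0.005                                          # b, the unit cost of searching

    def draw_catchabilities(n_bugs, p, q, rng):
        """Draw one replication's catchability scores from a Beta(p, q)."""
        return [rng.betavariate(p, q) for _ in range(n_bugs)]

    # Example: one replication of the 20-bug cell under the uniform setting.
    p, q = CATCHABILITY_PAIRS[1]
    scores = draw_catchabilities(20, p, q, random.Random(1))
    print(min(scores), max(scores))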

As to the payoff values, these need to be positive numbers that can take very large values representing critically important bugs. Here they are randomly generated from an exponential distribution with unit mean. Thus the typical ratio of a bug payoff to a unit of inspection time is 200; our domain experts consider this a reasonable starting point.

Of course, the various catchability and cost indices do not have to be randomly generated, but can be fixed by the user to ensure that the protocol choice will be sensitive to, for instance, the presence of elusive, expensive bugs. Similarly, the number and characteristics of modules, the structure of the modules, and the number of bugs in each module can be fixed or randomly generated, for example from a Poisson distribution. In practice, a smart software company might run a preliminary simulation tailored to their circumstance in order to cost out the appropriate amount and kind of inspection needed for a particular piece of code. All of the above parameters, including the number of modules and bugs, can be adjusted to accommodate a very wide range of possible scenarios.

4 Results

The simulation was replicated 10 times for each protocol and each combination of Catchability level and bug count. Thus there were 10 experiments in which 150 bugs were generated with random catchabilities drawn from a Beta(0.5, 0.5) distribution, along with independent random payoffs from an Exponential(1) distribution; similarly, there were 10 experiments for each of the other bug counts and catchability distributions. The same catchabilities and payoffs were then used for each of the seven protocols (Starr's protocol and six levels of fixed-effort random search). This reuse of the random values across protocols is a variance reduction technique that allows more powerful inference on contrasts among the protocols.

For each replication, we measured four quantities on the performance of the protocol. The first is the net payoff, which is the payoff of the bugs captured minus the cost of the search. The second is the total cost of the bugs left uncaptured by the search. To combine the previous information, we calculated a penalized net score, which is the net payoff minus the cost of undiscovered bugs; this represents the best composite measure for most practical applications. Finally, we recorded the total search time; for the fixed-effort protocols, this equals J, the number of probes.
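The four criteria are straightforward to compute once a protocol has finished. The sketch below (our names and record layout) scores a single run, with the penalized score taken as the net payoff minus the omitted cost, as defined above. Reusing the same drawn catchabilities and payoffs for all seven protocols within a replication then amounts to generating the bug list once and handing each protocol its own copy.

    def score_run(bugs, search_time, search_cost=0.005):
        """Four performance criteria for one protocol run.

        bugs        : list of (payoff, captured) pairs left behind by the run
        search_time : number of probes the protocol used (J for fixed effort)
        """
        payoff_found = sum(p for p, caught in bugs if caught)
        cost_omitted = sum(p for p, caught in bugs if not caught)
        net = payoff_found - search_cost * search_time
        return {"net": net,
                "omitted": cost_omitted,
                "penalized": net - cost_omitted,
                "time": search_time}

    # Example: three bugs, two captured, after a 100-probe fixed-effort run.
    print(score_run([(2.0, True), (0.4, True), (1.5, False)], search_time=100))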

73 5.5. it is notable that Starr's apparently poor performance diminishes as the number of bugs increases and the di culty of catching the bugs decreases. The advantage of Starr's is that it provides a stopping time. The performance of the Starr procedure as implemented as compared to random testing is mixed. the rst row shows the net payo for each of the seven protocols. The more bugs there are and the harder they are to nd.37 5. Starr's protocol had di culty in stopping early enough. Table 1 shows the results from the experiment in which there are 10 bugs. but there is large variation in search time for Starr's protocol.72 -15. on average.84 -2.0 50. Here Starr's protocol is more competitive.56 4.0 1000. Note that Starr's protocol performed rather poorly.98 0. then it is easy to do poorly.5 catchability distribution.90 0.5 distribution. either by searching much too long 13 .70 1.41 -15.86 3. 5000 do even better. if one chooses a bad value of J . However. To read Table 1.34 Avg.3 10. 500 did rather well. Time 4380.0 100. These four numbers were averaged across the ten replicates to produce the numbers in the following tables. the number of probes.0 Table 1: Average performance criteria for 10 bugs with Beta0. it is di cult to know an appropriate value of J a priori.12 2.63 6.07 7. the larger J should be. In particular.5.0 5000. The second row shows the average cost of undiscovered bugs. but it sounds a clearly cautionary note in selecting the inspection protocol. but protocols with J = 1000. 0.27 4.0 500.18 Penal. with catchability scores drawn from the Beta0.54 4.equals J . Fixed-e ort testing can be better than Starr's procedure in terms of net payo if one knows the right" value of J .16 Omitted 1. The third row combines the information. The other 11 tables are shown in the appendix and provide a more nuanced description. Starr J = 10 J = 50 J = 100 J = 500 J = 1000 J = 5000 Net Payo -13. 0.31 0. Net -15. the J = 500 protocol wins on this criterion. And the fourth row shows that.98 6.85 7. but that the xed-e ort protocols with J = 100. These results should be interpreted provisionally.

64 Penal. used for approving drug release. Starr J = 10 J = 50 J = 100 J = 500 J = 1000 J = 5000 Net Payo 8. to support inspection decisions about whether to release software.48 8.72 1. Starr's stopping rule as implemented does not choose the ideal search time. but it is usually close to the correct magnitude of J .35 15. 14 . More generally. We also plan to modify the methods of clinical trials.26 -6.14 13. Rather.1 10. thereby supporting intelligent choice of the protocol for a speci c application.0 5000.0 500. This is unsurprising|in problems of this complexity. These protocols will include versions of Dalal and Mallows' rule 1988.0 50. some protocols are better for certain situations. no method can be expected to dominate all the competitors.or not long enough.0 Table 2: Average performance criteria for 20 bugs with Beta0.20 3. Net -1. 0.5. this may partly re ect our choice of estimates used in implementing the protocol. there is no value of J for xed-e ort testing that is universally superior. we shall experiment with more realistic situations across a broader range of protocols.5 catchability distribution.69 -12. Similarly.85 7.30 0.40 13. The value of the simulation study is that it enables practitioners to understand the factors that distinguish the situations. Results are expressed as averages of ten replicates.60 15.0 1000.71 Avg. Time 112. for four di erent criteria. our interpretation is that the mathematical optimality results obtained by Starr do not transfer reliably into practical bene t of course. while others excel in di erent situations.65 3. For the future.34 13. as well as two-stage protocols that undertake a preliminary set of probes in order to determine how many probes to allocate in the second stage.0 100.07 Omitted 10.66 11.91 5. 5 Appendix: Simulation Tables The following tables display the results from a 734 simulation experiment that compares seven inspection protocols across three levels of catchability distribution and four levels of bug counts.62 11.46 1.95 -6.

63 -143.17 -13.68 -40. 0.5.22 50.0 500. 1.88 0. Time 10237. Time 399.63 5.0 100.Starr J = 10 Net Payo 34.60 54.40 22.88 8.10 37. Starr J = 10 J = 50 J = 100 J = 500 J = 1000 J = 5000 Net Payo -41.87 Avg.36 132.6 10.50 111.0 50.5 catchability distribution.0 100.0 Table 3: Average performance criteria for 50 bugs with Beta0. Starr J = 10 J = 50 J = 100 J = 500 J = 1000 J = 5000 Net Payo 113.70 6.28 15.92 9.73 9.43 116.87 9.04 23.63 8. 0.57 Omitted 13.0.68 4.0 5000.74 Avg. Time 1242.98 -112.0 J = 50 J = 100 J = 500 J = 1000 J = 5000 18.74 Omitted 1.07 -13.0 100.0 Table 5: Average performance criteria for 10 bugs with Beta1.73 6. 15 .29 1. Net 21.34 7.19 110.89 20.5.5 catchability distribution.92 42.0 500.60 4.0 50.64 1.0 5000.14 0.18 0.00 Penal.0 1000.45 30.22 95.00 8.94 6.43 35.55 5.0 Table 4: Average performance criteria for 150 bugs with Beta0. Net 80.0 1000.4 10.08 73.83 Omitted 32.38 20.49 Avg.03 0.11 35.88 4. Net -43.09 0.24 40.55 28.23 -12.0 catchability distribution.0 5000.30 33.06 Penal.0 500.0 10.32 -80.69 96.0 1000.31 40.01 45.96 Penal.97 148.

29 11.96 108.09 4.0 Table 8: Average performance criteria for 150 bugs with Beta1.26 8.0 500.0 100.89 Avg.05 112. 1.12 0.78 48. Starr J = 10 J = 50 J = 100 J = 500 J = 1000 J = 5000 Net Payo 94.32 36.0. Net 0.81 43.55 -18. 16 . Net 52.82 14.0.0 1000.61 23.19 6.80 4.47 Omitted 25.00 Penal.22 Avg.89 Omitted 4.18 -44.03 114.19 99.01 0.37 Avg.4 10.08 27.0 50.0 1000.73 25. Time 1116.Starr J = 10 J = 50 J = 100 J = 500 J = 1000 J = 5000 Net Payo -6.71 3.89 137.69 Penal.42 90.0 Table 6: Average performance criteria for 20 bugs with Beta1.10 50.0 Table 7: Average performance criteria for 50 bugs with Beta1.50 -105.0 5000.25 35.89 5.0 catchability distribution.05 14.76 -11.50 -6.06 12.39 -71.0 5000.22 27. Time 4155.0 500.93 37.60 12. Net -10.0 100. Time 290.0 5000.78 6.67 4.96 4.26 58.0 catchability distribution.79 1.78 -132.0 1000.49 4.0 J = 50 J = 100 J = 500 J = 1000 J = 5000 17.0 500.29 123.0 100. Starr J = 10 Net Payo 25.65 35.79 18.45 40.51 2.0 catchability distribution.0 50.75 Omitted 41. 1.3 10.59 12.9 10.81 -6.0.97 27.03 13. 1.31 0.51 43.38 Penal.64 106.

0 500.Starr J = 10 J = 50 J = 100 J = 500 J = 1000 J = 5000 Net Payo -6.91 44.55 -7.26 -0.78 0.94 0.46 31.0 100.59 -10.0 catchability distribution.66 Omitted 10.09 0.0 Table 9: Average performance criteria for 10 bugs with Beta10.0 5000. 1.80 51.46 -4.0 500.05 5.65 0.0 50.40 1.6 10.49 2.59 5.0.86 22.95 Omitted 15.69 Penal.42 Avg.0 1000.56 -3.94 -16.31 Avg.64 4.42 Omitted 1. Time 1428.0 1000. 1.77 8.89 -6.57 1.0 500. Net -8.27 -43. Time 601.2 10.84 16.10 2.18 18.19 16.0 Table 10: Average performance criteria for 20 bugs with Beta10.0.73 4.79 7.99 5.43 7.37 13.0.13 18.29 -16.09 2.0 1000.0 5000.64 Penal.85 -51.00 3. Net -2.57 25.10 4.0 catchability distribution.63 15. Starr J = 10 J = 50 J = 100 J = 500 J = 1000 J = 5000 Net Payo 29.39 -14.0 50.47 -36.99 0.3 10.16 6.0 5000.80 3. Starr J = 10 J = 50 J = 100 J = 500 J = 1000 J = 5000 Net Payo 8.54 9.23 -18. 17 .90 47.65 0.0 50.00 Penal.80 4.62 20.0 100.52 24.0 100.53 3. Net 13. Time 2717. 1.0 catchability distribution.35 Avg.53 13.0 Table 11: Average performance criteria for 50 bugs with Beta10.

and Votaw. New York. C. 1. R.32 5.0 5000.01 4. B. Black-Box Testing Techniques for Functional Testing of Software and Systems. References 1 Banks.0 500. 1995.95 86. Dashiell. NY.0 100. D. and Rosenthal. 18 . W.00 34. Sampling Techniques. 872-879.0 catchability distribution. 5 Dalal.. D. and Mallows. NISTIR-6129.0 50. L. L.89 141. Gaithersburg. Journal of the American Statistical Association. W. R. S. Hagwood. Annals of Mathematical Statistics. C. New York.913-922. 70.60 -143. S. Net 20. Covering designs for testing software. 1988. Kakcer. 1977..79 33. C. 498-504.31 50.0 1000. 4 Cochran. Wiley. L.93 Omitted 53. 6 Dalal. 1998. and Mallows.88 -137. 1998.05 -133.Starr J = 10 J = 50 J = 100 J = 500 J = 1000 J = 5000 Net Payo 74. 26. S. 3rd ed. National Institute of Standards and Technology. and Marcus.15 90. Time 3524. W.. 7 Deemer. 1975 Estimating population size with exponential failure.06 52. 1955 Estimation of parameters of truncated or censored exponential distributions.66 110.84 -39.0 Table 12: Average performance criteria for 150 bugs with Beta10.. When should one stop testing software? Journal of the American Statistical Association. Submitted for publication. 83. Software testing by statistical methods.90 Avg. R. NY. L..0. Gallagher. Wiley. G.87 -76.. R.87 144.38 139. MD.3 10.03 Penal. 3 Blumenthal. 2 Beizer..46 1.

8 Easterling, R. G., Mazumdar, M., Spencer, F. W., and Diegert, K. V. (1991). System-based component-test plans and operating characteristics: Binomial data. Technometrics, 33, 287-298.

9 Jelinski, Z. and Moranda, P. (1972). Software reliability research. In Statistical Computer Performance Evaluation, W. Freiberger, ed. Academic Press, London, 465-484.

10 Joe, H. and Reid, N. (1985). Estimating the number of faults in a system. Journal of the American Statistical Association, 80, 222-226.

11 Kuo, L. and Yang, T. (1993). A sampling-based approach to software reliability. Proceedings of the American Statistical Association Section on Statistical Computing, 165-170.

12 Kuo, L. and Yang, T. (1995). Bayesian computation of software reliability. Journal of Computational and Graphical Statistics, 4, 65-82.

13 Starr, N. (1974). Optimal and adaptive stopping based on capture times. Journal of Applied Probability, 11, 294-301.