A REPORT ON RANDOM TESTING*
Joe W. Duran
Southwest Research Institute
San Antonio, Texas 78284
Simeon Ntafos
The University of Texas at Dallas
Richardson, Texas 75080
*This work was supported in part by the National Science Foundation under Grant MCS-8003322.

ABSTRACT

Random testing of programs is usually (but not always) viewed as a worst case of program testing. Test case generation that takes into account the program structure is usually preferred. Path testing is an often proposed ideal for structural testing. Path testing is treated here as an instance of partition testing. (Partition testing is any testing scheme which forces execution of at least one test case from each subset of a partition of the input domain.) Simulation results are presented which treat path and partition testing in a reasonably favorable way, and yet still suggest that random testing may often be more cost effective. Results of actual random testing experiments are presented which tend to confirm the viability of random testing as a useful validation tool.

INTRODUCTION

In a recent paper on functional testing, Howden [10] remarks that "...much has been written about structural testing, but little about black box testing, the method it was supposed to replace, and over which it was supposed to be an improvement." He then reports on a successful refinement of black box functional testing, in which a detailed design "structure" is used to develop test cases. Another black box testing method which has not yet been adequately studied is random testing. There has been strong disagreement about its value. Myers [12] says, "Probably the poorest [test case design] methodology of all is random-input testing...". On the other hand, Thayer et al [13] recommend that the final testing of a program should be done with input cases chosen at random from its operational profile--the expected run-time distribution of inputs to the program. Additionally, they showed how quantitative estimates of a program's operational reliability can be inferred from random testing. Girard and Rault [5] have also proposed random testing as a valuable test case generation scheme. Further, recent results [3] show that path testing, a popular paradigm for structural testing, can lead to less satisfactory reliability estimates than a corresponding number of random test case executions.

Because the theoretical results presented in [3] suggest that random testing might be quite practical for many classes of programs, we have begun some simulations and experiments to investigate such a premise empirically. The question addressed in this paper is not "what reliability estimates can be made?" (though this is extremely important), but is rather "how good is random testing at finding errors?"

FAILURE RATES

Let θ be the probability that a program will fail to execute correctly on an input case chosen from a given input distribution. If the program is used for a long period of time with input from a particular operational profile (input distribution) then the failure rate actually experienced will converge toward θ. We thus refer to θ as the failure rate, which is a valuable measure of the operational reliability of the program.

Suppose the input domain D is partitioned into k subsets, D_1 ∪ D_2 ∪ ... ∪ D_k = D, and that a randomly chosen input case has probability p_i of being chosen from D_i. Then θ = Σ p_i θ_i, where θ_i is the failure rate for D_i. The term "partition testing" refers to any test data generation method which partitions the input domain and forces at least one test case to come from each subset. Path testing is a special case of partition testing. Since D could also be partitioned according to the alternative functions a program is supposed to perform, it is apparent that partition testing contains instances of both black box and structural testing.

Thayer et al [13] showed that if n random tests are carried out and x program failures are discovered, then θ*, the 1 - α upper confidence bound on θ, is the largest θ value such that

    \sum_{i=0}^{x} \binom{n}{i} \theta^{i} (1 - \theta)^{n-i} \geq \alpha .

(That is, there is a probability of 1 - α that the calculated bound, θ*, actually exceeds the true failure rate of the program. The θ* value is of course valid only with respect to the input distribution used for test data selection.)

If x = 0, then θ* = 1 - α^(1/n). It has been shown [3] that if n test runs are made according to a partition testing scheme, and no failures
are discovered, then, if there is no prior knowledge about the distribution of θ_i values, the resulting value of θ* is bounded below by 1 - α^(1/n). It is this result that led us to investigate random testing experimentally.
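For reference, θ* is easy to obtain numerically. The sketch below (an illustration in Python; the names are ours, and this is not code from [13] or from our experiments) finds the bound by bisection, using the fact that the binomial tail probability decreases as θ grows:

    # Illustrative sketch: 1 - alpha upper confidence bound theta* on the
    # failure rate, after n random tests with x observed failures.
    from math import comb

    def binom_cdf(x, n, theta):
        # probability of at most x failures in n trials at failure rate theta
        return sum(comb(n, i) * theta**i * (1 - theta)**(n - i)
                   for i in range(x + 1))

    def upper_bound(n, x, alpha=0.05, tol=1e-9):
        # largest theta with binom_cdf(x, n, theta) >= alpha (bisection;
        # binom_cdf is decreasing in theta)
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if binom_cdf(x, n, mid) >= alpha:
                lo = mid
            else:
                hi = mid
        return lo

    # For x = 0 this agrees with the closed form 1 - alpha**(1/n):
    # upper_bound(50, 0, 0.05) is approximately 0.058, as is 1 - 0.05**(1/50).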
ERROR FINDING ABILITY

Although operational reliability estimates from test results are desirable, it is also important to directly investigate the error finding ability of test methods. Ideally, one might wish to characterize this ability in terms of errors found per dollar, perhaps weighted by error severity. Being at present unable to characterize errors and their effects well enough to prepare any formal models for making "errors found per dollar" calculations, we pose a related question - "What is the probability that a set of test cases will discover at least one error?" (We assume that finding a single failure instance is equivalent to finding evidence of at least one error.)
For random testing, the probability Pr of finding at least one error in n tests is 1 - (1 - θ)^n. For a partition test method in which n_i test cases are chosen randomly from each D_i, the probability of finding at least one error is given by

    P_p = 1 - \prod_{i=1}^{k} (1 - \theta_i)^{n_i} .

With respect to the same partitioning,

    P_r = 1 - (1 - \theta)^{n} = 1 - \left( 1 - \sum_{i=1}^{k} p_i \theta_i \right)^{n} ,

where n = Σ n_i. The question can then be posed, "When is Pr ≥ Pp?" The answer depends upon the values of the (p_i, θ_i) pairs. In the following section we report on some simulations performed to help answer the question.
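Both probabilities are directly computable for any hypothesized configuration of (p_i, θ_i, n_i) values; the following sketch (ours, purely illustrative) does so:

    # Illustrative sketch: probability of at least one error under partition
    # testing (Pp) versus random testing (Pr), for given subset failure
    # rates theta[i], selection probabilities p[i], and test counts ni[i].

    def p_partition(theta, ni):
        # Pp = 1 - prod_i (1 - theta_i)^(n_i)
        prod = 1.0
        for t, n in zip(theta, ni):
            prod *= (1 - t) ** n
        return 1 - prod

    def p_random(theta, p, n):
        # Pr = 1 - (1 - sum_i p_i * theta_i)^n
        overall = sum(pi * ti for pi, ti in zip(p, theta))
        return 1 - (1 - overall) ** n

    # Example: k = 2 subsets, one test case each (n = 2)
    theta = [0.9, 0.01]
    p = [0.5, 0.5]
    print(p_partition(theta, [1, 1]))  # 0.901
    print(p_random(theta, p, 2))       # about 0.703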
PARTITION TESTING SIMULATIONS
The point of doing the work of partition testing is to find errors. A partition testing instance such as path testing will be good to the extent that a test case chosen from a subset will have a high probability of finding an error (i.e., exhibiting an execution failure) if one is present affecting that subset. Ideally, then, θ_i should be either 0 or 1 (without k having to grow too large). Our conversations with various programmers and computer scientists indicate that many feel that path testing approaches at least a little way toward this ideal. This is equivalent to believing that the distribution of θ_i values for program paths looks something like that shown in Figure 1. On the other hand, Howden's recent experiments [10] discovered many examples of "partially correct" (not in the Floyd sense) program paths. That is, he found many paths that compute correct values for some, but not all, of their input data. This may mean that the distribution of θ_i values for path testing is more nearly uniform than suggested by Figure 1. Thus, the simulation results reported below may be biased somewhat in favor of path testing over random testing.

[Figure 1 appears here: a hypothetical density of θ_i values, concentrated near θ_i = 0 and θ_i = 1.]
FIGURE 1. HYPOTHETICAL PROBABILITY DISTRIBUTION FOR θ_i VALUES

In order to investigate the effect on the Pr ≥ Pp question, we simulated a 25 subset partition testing scheme, with one randomly chosen test case per subset, i.e., k = n = 25 and n_i = 1. The θ_i's were chosen so that 2% of the time θ_i ≥ .98, and 98% of the time, θ_i < .049. The p_i values were chosen randomly from a uniform distribution. We found Pr ≥ Pp in 14 trials out of a total of 50. The mean value of (Pr/Pp) was 0.932. A histogram of the (Pr/Pp) values found is shown as Figure 2. Figure 3 gives a histogram of a similar experiment, run using the same distributions, but with k = n = 50.

[Figure 2 appears here: a histogram of the 50 Pr/Pp values, ranging roughly from .4 to 1.1, with mean .932.]
FIGURE 2. 25 PARTITIONS, 50 TRIALS

[Figure 3 appears here: a histogram of the 50 Pr/Pp values for the larger experiment, with mean .949.]
FIGURE 3. 50 PARTITIONS, 50 TRIALS

It is of particular interest that in the majority of cases where Pr < Pp, the difference was not great. This suggests that for those programs for which carrying out some contemplated partition testing scheme is much more expensive than performing an equivalent number of random tests, random testing might find more errors per unit cost.
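A trial of the kind just described can be replicated along the following lines (our reconstruction, not the original simulation code; the exact sampling distributions within the stated ranges are assumptions):

    # Illustrative sketch: one batch of Pr/Pp trials with k = n = 25, ni = 1.
    import random

    def one_trial(k=25):
        # 2% of the time theta_i >= .98, otherwise theta_i < .049
        # (assumed uniform within each range)
        theta = [random.uniform(0.98, 1.0) if random.random() < 0.02
                 else random.uniform(0.0, 0.049) for _ in range(k)]
        # p_i chosen uniformly, then normalized to sum to 1
        raw = [random.random() for _ in range(k)]
        total = sum(raw)
        p = [r / total for r in raw]
        pp = 1.0
        for t in theta:
            pp *= (1 - t)                 # one test case per subset
        pp = 1 - pp
        pr = 1 - (1 - sum(pi * ti for pi, ti in zip(p, theta))) ** k
        return pr / pp

    ratios = [one_trial() for _ in range(50)]
    print(sum(ratios) / len(ratios))      # compare with the reported mean of .932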
"partially correct" (not in the Floyd sense) pro-
gram paths. That is, he found many paths that RANDOM TESTING EXPERIMENTS
compute correct values for some, but not all, of
their input data. This may mean that the distri- Although the assumptions we have had to
bution of 0i values for path testing is more make so far are quite plausible, we cannot deter-
nearly uniform than suggested by Figure I. Thus, mine the extent of the usefulness of the results
without further work. Since "real life" θ_i distributions are difficult to determine, and since "Is Pr ≥ Pp?" is not the ideal question to pose, we turn to a direct empirical investigation of random testing. At this time, we have completed only a few programs, most of which are quite simple, but the results, presented below, are very interesting.

The first three programs in the "Common Blunders" chapter of Kernighan and Plauger [11] which contain other than initialization errors are a sine program, a sorting program and a binary search program. (The initialization errors were not considered because they are "too easy" to detect.)

SIN Program: There are three non-initialization errors which can cause the program to fail to execute according to specification. Fifty random test cases were run. The test cases detected one of the errors 11 out of 50 times, another was detected 24 out of 50 times, and a third, 45 out of 50 times.

SORT Program: 24 test cases of randomly generated lists of size ranging from 1 to 50 were executed. The "off-by-one" error of the program was detected by 21 of the 24 test cases.

BINARY SEARCH: 50 test cases were generated, with table size ranging from 0 to 20. Here, an "operational profile" of the program was considered. In 20% of the test cases the input element searched for was not in the table, but was in the table for the remaining 80% of the cases. The program's error was detected by 18 of the 50 test cases.

This last case brings up an important question. Should random testing be done by a uniform sampling of the input domain, or from an operational profile, or both? For reliability estimates, the operational profile should be used. We speculate that this may not generate data as effective, overall, for error detection as would uniform sampling, but it might be more effective for the more critical errors.
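Such a profile can be imposed directly in the test case generator. One plausible sketch (only the table size range and the hit/miss proportions come from the experiment above; the element value range and table construction are our assumptions):

    # Illustrative sketch: binary search test cases under the operational
    # profile above: 80% of searches hit the table, 20% miss it.
    import random

    def make_case(max_size=20):
        size = random.randint(0, max_size)
        table = sorted(random.sample(range(1000), size))  # assumed value range
        if table and random.random() < 0.8:
            key = random.choice(table)       # element present (80% of cases)
        else:
            key = random.randint(0, 999)
            while key in table:              # element absent (20% of cases)
                key = random.randint(0, 999)
        return table, key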
Next, we tested the line editor program discussed by Goodenough and Gerhart [6], using random character strings weighted for more frequent selection of blanks and new line characters. The proportions of test cases which detected the five errors given in Table III of [6] are shown in Table 1. Only one error at a time was present in the program during the experiment.

TABLE 1. ERRORS DETECTED IN LINE EDITOR PROGRAM

    Error No.    Detection Proportion
    1            5/10
    2            8/10
    3            9/10
    4            10/10
    5            9/10
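Strings of this kind can be produced with a weighted character generator (a sketch; the actual weights are not stated above, so the ones below are assumptions):

    # Illustrative sketch: random text weighted toward blanks and newlines.
    import random
    import string

    ALPHABET = list(string.ascii_letters + string.digits + ".,;*/=") + [" ", "\n"]
    WEIGHTS = [1] * (len(ALPHABET) - 2) + [15, 8]  # assumed weights

    def random_text(length):
        return "".join(random.choices(ALPHABET, weights=WEIGHTS, k=length))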
BUGGYFIND [1,2] contains a subtle error, actually made during a conversion of Hoare's FIND program. Boyer et al [1] state, "This bug is quite difficult to detect by conventionally testing the program with random array inputs..." Our randomly generated test data detected the error in 3 out of 50 cases. This may at first seem like poor performance but actually it is quite good. The test case generation and execution was easy to perform and did demonstrate a high error detection probability for the test sequence as a whole. (Consider that for a detection rate of .06, the probability of detection in 50 trials is approximately .95.) It is interesting to notice that branch testing is not adequate for detecting this error [1]. However, our 3 random test cases which detected the error did each cover all but one of the program's branches. Three other cases each covered all but one branch, without detecting the error. The total set of test data did ensure the execution of all branches.
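As a check of the parenthetical claim, with a per-test detection rate of .06 the chance of at least one detection in 50 independent runs is

    1 - (1 - 0.06)^{50} = 1 - 0.94^{50} \approx 1 - 0.045 \approx 0.95 .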
DeMillo et al [2] also experimented with random testing of FIND. Our test cases were chosen from vectors of size ranging from 1 to 10, with element values chosen uniformly from the range -100 to 100. Using a similar test generation method, DeMillo et al created a 1000 member set of random test cases which was equivalent in power, according to their mutant catching measure, to their carefully constructed seven member test case set referred to as D1 in their paper. From the description in [2], creating D1 appears to have required considerable effort. Additionally, they created a set of 100 random test cases. By their measure, it was only slightly inferior to the larger random set and to D1, leaving 11 live mutants rather than 10. Our results indicate that such a 100 vector test set would be nearly certain to catch the (presumably subtle) actual error in BUGGYFIND. Considering that the random test cases were inexpensive to generate and execute, mutation analysis can be viewed as confirming the cost effectiveness of the error finding ability of random testing in this particular experiment.
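The generation method just described amounts to only a few lines (a sketch; how FIND's position parameter was chosen is not stated above, so its selection here is an assumption):

    # Illustrative sketch: random test vectors for FIND -- sizes 1 to 10,
    # elements uniform on [-100, 100].
    import random

    def find_case():
        size = random.randint(1, 10)
        a = [random.randint(-100, 100) for _ in range(size)]
        f = random.randint(1, size)   # position parameter (assumed uniform)
        return a, f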
Moving on to larger programs containing actual errors, we tested the standard deviation program studied by Hetzel [8] and Howden [9]. All 25 randomly generated test cases detected the error, reminding us that many "real-life" errors are in fact easy to detect if one is careful to determine whether a test run failed to execute properly. No test data is good enough if the tester is unable or unwilling to determine whether the output is correct. This suggests that automatic or machine assisted output checking is highly desirable. It is essential if large numbers of random tests are to be performed.
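Machine-assisted checking can be as simple as running the program under test against a trusted reference on every random input and logging disagreements; a sketch (illustrative only; all names below are placeholders, not from any of the cited work):

    # Illustrative sketch: an automated output checker for random testing.
    def run_random_tests(program_under_test, reference, gen_case, n=1000):
        failures = []
        for _ in range(n):
            case = gen_case()
            if program_under_test(case) != reference(case):
                failures.append(case)   # record inputs whose outputs differ
        return failures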
In an attempt to get more subtle "real-life" errors, we looked backwards (in terms of issue number) through issues of Transactions on Mathematical Software, searching for two tractable (relative to our resources) programs in Collected Algorithms of the ACM for which errata have been recently published. Errata were found for algorithms 424 [4] and 408 [7]. (Note: We did not consider efficiency changes, portability problems, or the like.)
Algorithm 424 (Clenshaw-Curtis Quadrature): No actual experiments were needed. Virtually any test case will discover the error, so long as the tester carefully checks the output. The problem is that the error causes an incorrect secondary output value, which was apparently not checked. The definite integral value computed as the primary output is apparently correct.
Algorithm 408 (Sparse Matrix Package): Subroutine TRSPMX of the algorithm package is reported to have two errors. The first causes a failure whenever the input is a matrix with only one column. The second causes failure when all elements in the first row of the input matrix are zero. Since sparse matrix packages are presumably useful only when matrix dimensions are not very small, we set up the random test case generation to create no tests with row or column dimensions smaller than 10. (The maximum was set at 50.) When the density of test matrices was allowed to vary uniformly between 0 and .4, the second error was detected by 14 cases out of a total of 135. With the density varying between 0 and .2, the error was detected by 6 cases out of 38. However, the test data will obviously not detect the first error. It could, of course, be argued that the first error should have little effect on the operational reliability. If no lower limits are placed on the dimensions of the matrix, there is a small but non-zero probability that a randomly selected test case will detect the first error.
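The generator just described can be sketched as follows (illustrative; the nonzero entry values are our assumption):

    # Illustrative sketch: random sparse-matrix test cases -- dimensions
    # 10 to 50, with density drawn uniformly from [0, max_density].
    import random

    def sparse_case(max_density=0.4):
        rows = random.randint(10, 50)
        cols = random.randint(10, 50)
        density = random.uniform(0.0, max_density)
        return [[random.randint(1, 9) if random.random() < density else 0
                 for _ in range(cols)] for _ in range(rows)]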
CONCLUSIONS

The results compiled so far indicate that random testing can be cost effective for many programs. Our experiments have shown that random testing can discover some relatively subtle errors without a great deal of effort. We can also report that for the programs so far considered, the sets of random test cases which have been generated provide near total branch coverage. The unexecuted branches tend to be those which provide handling of exceptional cases. This tempts us to conjecture that extremal/special values testing (cf. [9]) in conjunction with random testing might be a powerful testing method, providing the additional option of obtaining a sound conservative reliability estimate. Further experimentation is needed to obtain better statistical evidence on the effectiveness of random testing. Also, models for studying the cost effectiveness of random testing and comparing it to that of other testing strategies need to be constructed.
REFERENCES

1. Boyer, R.S., B. Elspas, and K.N. Levitt, "SELECT--A Formal System for Testing and Debugging Programs by Symbolic Execution," Proc. 1975 Int'l. Conf. on Reliable Software, pp. 234-245.

2. DeMillo, R.A., R.J. Lipton, and F.G. Sayward, "Hints on Test Data Selection: Help for the Practicing Programmer," IEEE Computer, Vol. 11, No. 4, pp. 34-41, 1978.

3. Duran, J.W. and J.J. Wiorkowski, "Quantifying Software Validity by Sampling," IEEE Trans. Reliability, Vol. R-29, pp. 141-144, 1980.

4. Geddes, K.O., "Remark on Algorithm 424," ACM Trans. Math. Software, Vol. 5, p. 240, 1979.

5. Girard, E. and J.-C. Rault, "A Programming Technique for Software Reliability," Conf. Record, 1973 IEEE Symp. Computer Software Reliability, pp. 44-50.

6. Goodenough, J.B. and S.L. Gerhart, "Toward a Theory of Test Data Selection," IEEE Trans. Software Eng., Vol. SE-1, pp. 156-173, 1975.

7. Gustavson, F., "Remark on Algorithm 408," ACM Trans. Math. Software, Vol. 4, p. 295, 1978.

8. Hetzel, W.C., "An Experimental Analysis of Program Verification Methods," Ph.D. Dissertation, The University of North Carolina, 1976.

9. Howden, W.E., "Symbolic Testing - Design Techniques, Costs, and Effectiveness," National Bureau of Standards Tech. Report, NTIS #PB-268517, May 1977.

10. Howden, W.E., "Functional Program Testing," IEEE Trans. Software Eng., Vol. SE-6, pp. 162-169, 1980.

11. Kernighan, B.W. and P.J. Plauger, The Elements of Programming Style (2nd ed.), McGraw-Hill, New York, 1978.

12. Myers, G.J., The Art of Software Testing, John Wiley & Sons, New York, 1979, p. 36.

13. Thayer, T.A., M. Lipow, and E.C. Nelson, Software Reliability, North-Holland, Amsterdam, 1978.