
Automated Software Test Data Generation for Complex Programs

Christoph Michael & Gary McGraw

Reliable Software Technologies
21515 Ridgetop Circle, Suite 250, Sterling, VA 20166

Abstract

We report on GADGET, a new software test generation system that uses combinatorial optimization to obtain condition/decision coverage of C/C++ programs. The GADGET system is fully automatic and supports all C/C++ language constructs. This allows us to generate tests for programs more complex than those previously reported in the literature. We address a number of issues that are encountered when automatically generating tests for complex software systems. These issues have not been discussed in earlier work on test-data generation, which concentrates on small programs (most often single functions) written in restricted programming languages.

1 Dynamic test data generation

In this paper, we introduce the GADGET system, which uses a test data generation paradigm commonly known as dynamic test data generation. GADGET is the first system to apply non-local optimization techniques (including simulated annealing and genetic algorithms) to the problem of test data generation. Because GADGET uses advanced instrumentation techniques, it is applicable to programs that are larger and more complex than those traditionally reported in the literature.

Dynamic test data generation was originally proposed by [Miller and Spooner, 1976] and then investigated further with the TESTGEN system of [Korel, 1990, Korel, 1996], the QUEST/Ada system of [Chang et al., 1996], and the ADTEST system of [Gallagher and Narasimhan, 1997]. This paradigm treats parts of a program as functions that can be evaluated by executing the program, and whose value is minimal for those inputs that satisfy a test adequacy criterion such as code coverage. In this way, the problem of generating test data reduces to the better-understood problem of function minimization.

Standard approaches to dynamic test data generation suffer from two main problems:

1. Research prototypes place severe constraints on the language being analyzed (e.g., disallowing function calls or procedures) so that programs can more easily be instrumented by hand. QUEST/Ada requires the program under test to be instrumented by hand. TESTGEN only allows programs written in a subset of the PASCAL language. The problem with such limitations is that they prevent one from studying complex programs. The unchallenging demands of simple programs can make naive schemes like random test generation appear to work better than they actually do [McGraw et al., 1997].

2. The function minimization techniques applied are often overly simplistic. ADTEST and TESTGEN use gradient descent to perform function minimization, and this technique suffers when the objective function contains local minima. Although quantitative results were not reported on the performance of ADTEST, [Gallagher and Narasimhan, 1997] states that local minima cause difficulties for that system. TESTGEN's gradient descent system performed well in [Korel, 1996], but it was aided by a heuristic path selection strategy.

As a result of these two weaknesses, programs for which test data are generated in the literature are traditionally small, with few conditionals, little nesting, and simple control flow structures.

Function minimization using gradient descent suffers from some well understood weaknesses. In our work we apply more sophisticated techniques for function minimization: genetic search [Holland, 1975] and simulated annealing [Kirkpatrick et al., 1983]. In this paper, we compare the performances of simulated annealing, two implementations of genetic algorithms, and a form of gradient descent when they are applied to dynamic test data generation. We use a random test data generator to create a baseline for our comparisons.

(McGraw is the correspondence author.)
By automating instrumentation of a program for analysis, we are able to analyze programs including all C/C++ language constructs, including function and method calls. This opens the door for experimentation with more complex programs than those previously studied.

Our experiments demonstrate a potentially important difference between dynamic test data generation and other function minimization problems. On the programs we tested, coincidental discovery of test inputs satisfying new criteria was more common than their deliberate discovery. Although one might expect random test generation to be good at discovering things by coincidence, especially given a set of standard constraints [Duran and Ntafos, 1984], it did not perform well in our experiments. The guided search performed by more sophisticated techniques is good at setting up the coincidental discovery of tests satisfying new criteria, since these methods tend to concentrate on restricted areas of the input space.

1.1 Code coverage

Empirical results indicate that tests selected on the basis of test adequacy criteria such as code coverage are good at uncovering faults [Horgan et al., 1994, Chilenski and Miller, 1994]. Furthermore, test adequacy criteria are objective measures by which the quality of software tests can be judged. Neither of these benefits can be realized unless test data that satisfy the adequacy criteria can be found. Therefore there is a need to generate such tests automatically.

In practice, most test adequacy criteria require certain features of a program's source code to be exercised. A simple example is a criterion that says, "Each statement in the program should be executed at least once when the program is tested." Test methodologies that use such criteria are usually called coverage analyses, because certain features of the source code are to be covered by the tests. The example given above describes statement coverage.

There is a hierarchy of increasingly complex coverage criteria having to do with the conditional statements in a program. At the top of the hierarchy is multiple condition coverage, which requires the tester to ensure that every permutation of values for the Boolean variables in every condition occurs at least once. At the bottom of the hierarchy is function coverage, which requires only that every function be called once during testing (saying nothing about the code inside each function). Somewhere between these extremes is condition/decision coverage, which is the criterion we use in our test data generation experiments. A condition is an expression that evaluates to true or false, but does not contain any other true/false-valued expressions, while a decision is an expression that influences the program's flow of control. Condition/decision coverage requires that each branch in the code be taken and that every condition in the code be true and false at least once.

1.2 Function minimization

Our approach, after Korel, is based on the idea that parts of a program can be treated as functions. One can execute the program until a certain location in the code is reached, record the values of one or more variables at that location, and treat those values as though they were the value of a function. For example, suppose that a hypothetical program contains the condition if (pos >= 21) ... on line 324, and that the goal is to ensure that the true branch of this condition is taken. We must find an input that will cause the variable pos to have a value greater than or equal to 21 when line 324 is reached. A simple way to determine the value of pos on line 324 is to execute the program up to line 324 and then record the value of pos.

Let pos_324(x) denote the value of pos, recorded on line 324 when the program is executed on the input x. Then the function

    F(x) = 21 - pos_324(x),  if pos_324(x) < 21
           0,                otherwise

is minimal when the true branch is taken on line 324. Thus, the problem of test data generation is reduced to one of function minimization: to find the desired input, we must find a value of x that minimizes the objective function F(x).

This is unfortunately an oversimplification, because line 324 may not be reached for some inputs. There are two common solutions to this problem. First, one can treat the problem of reaching the desired location as a subproblem that must be solved before the minimization of F(x) can commence [Korel, 1990]. Second, one can amend the definition of F(x) so that it will have a very large value whenever the desired condition is not reached [Gallagher and Narasimhan, 1997]. Another possibility (the one that we adopt) is to implement an opportunistic strategy that seeks to cover whatever conditions it can reach [Michael et al., 1997].

Korel's subgoal chaining approach is advantageous when there is more than one path that reaches the desired location in the code. The test-data-generation algorithm is free to choose whichever path it wants (as long as it can force that path to be executed), and some paths may be better than others. In the TESTGEN system, heuristics are used to select the path that seems most likely to have an impact on the target condition.
In the ADTEST system of [Gallagher and Narasimhan, 1997], an entire path is specified in advance, and the goal of test data generation is to find an input that executes the desired path. Since it is known which branch must be taken for each condition on the path, all of these conditions can be combined in a single function whose minimization leads to an adequate test input. The ADTEST system begins by trying to satisfy the first condition on the path, and the second condition is added only after the first condition can be satisfied. As more conditions are reached, they are incorporated in the function that the algorithm seeks to minimize.

1.2.1 Coverage tables and opportunism

The QUEST/Ada system of [Chang et al., 1996] creates test data using rule-based heuristics. For example, one rule causes values of parameters to increase or decrease by a fixed constant percentage. The test adequacy criterion chosen by Chang et al. is branch coverage. The system creates a coverage table for each branch and marks those that have been successfully covered. The table is consulted during analysis to determine which branches to target for testing. Partially-covered branches are always chosen over completely-non-covered branches. We independently developed the coverage-table strategy, and use it in GADGET.

This procedure can be illustrated using the following code fragment, which comes from a control system:

1: if (err > bnd[0])
2:     rarea = area (1.0, (bnd[0] - bnd[1]));
3: else if ((err > bnd[2]))
4:     weight = rule (err, bnd[2], bnd[0]);
5: else ...

In order to reach the if condition on line 3, a test case must cause the condition on line 1 to be false. Any conditions on line 5 can only be reached when the condition on line 1 and the condition on line 3 are both false.

If the goal of test generation is to exercise all branches in the code, then there must both be test cases that cause the first condition to be true as well as some that cause it to be false. Therefore, when we have achieved branch coverage of line 1, we will already have at least one test case that reaches the condition on line 3. We are thus ready to begin looking for test cases that cover both branches of the decision on line 3. The situation is illustrated by Table 1.

    Decision    true    false
       1          X       X
       3          -       X

Table 1: A sample coverage table after Chang. This table can be automatically generated from the code fragment shown above. Such tables are used to decide where to apply function minimization during automatic test data generation.

Coverage tables provide a strategy for dealing with the situation where a desired condition is not reached. Instead of picking a particular condition as TESTGEN does, or picking a particular path like ADTEST, this strategy is opportunistic and seeks to cover whatever conditions it can reach. Although this is inefficient when one only wants to exercise a certain feature of the code under test, it can save quite a bit of unnecessary work if one wants to obtain complete coverage according to some criterion.

2 GADGET

GADGET's name reflects the fact that it originally used only genetic algorithms to perform function minimization (GADGET is an acronym expanding to "Genetic Algorithm Data GEneration Tool"). GADGET currently implements four optimization techniques for test generation: simulated annealing, gradient descent, a standard genetic algorithm, and a differential genetic algorithm. GADGET can also generate tests at random (providing our baseline approach). One of the five algorithms is chosen by an experimenter for each complete GADGET run. A run includes multiple executions of the test program.

GADGET automatically generates test data for arbitrary C/C++ programs, with no limitations on the permissible language constructs and no requirement for hand-instrumentation. We report test results for complex programs containing up to 2000 lines of source code with complex, nested conditionals. To our knowledge, these are the most complex programs for which results have been reported. By contrast, standard programs reported in the literature average 30 lines of code and have relatively simple conditionals [Yin et al., 1997, Korel, 1996, Chang et al., 1996]. Results of GADGET runs on small programs can be found in [McGraw et al., 1997]. We take advantage of GADGET's capability for processing complex programs by examining the impact of program complexity on the problem of dynamic test data generation.
Our gradient descent algorithm is quite simple. It begins with a seed input, and measures the objective function for all inputs in the neighborhood of the seed. The neighboring input that results in the best objective function value becomes the new seed, and the process continues until a solution is found or until no further improvement in the objective function seems possible. A number of techniques can be used to define what the neighbors of the seed are. In our experiments, the neighborhood was formed by incrementing and decrementing each input parameter, so that an input containing ten parameters (creating a ten-dimensional input space) has twenty neighbors. GADGET also allows the dimensionality of the input space to be adjusted by treating the input string as a series of bit fields, which are treated as integers when they are incremented or decremented during gradient descent. The algorithm permits random selection of the step-size, which determines how much an input value is incremented or decremented. For most of our experiments, the step sizes were chosen according to a Gaussian distribution whose mean depended on the experiment, and whose standard deviation was half of the mean.

Our simulated annealing algorithm is close to the standard Metropolis algorithm described in [Kirkpatrick et al., 1983]. For a given seed input, a randomly selected neighbor is generated with one of the same neighborhood functions used by gradient descent (the step size is also randomly selected, as above). If the new input results in an improved objective function, it becomes the new seed. If it does not improve the objective function, it still becomes the new seed with probability e^(-d/T), where d measures the worsening of the objective function value, and T is a parameter usually called the temperature of the system. T gradually decreases with time. Moves that make the objective function worse are more likely when the temperature is higher, and they are impossible when the temperature reaches zero. At zero temperature, the Metropolis algorithm allows a move to a new seed that does not change the objective function value. Because we encountered many flat regions in the objective function, we disallowed such moves at zero temperature unless none of the inputs in the neighborhood led to an improvement.

The first of our genetic algorithms (GA) is standard. It begins by encoding each input as a string of bits. A population of such inputs is randomly generated. In the first step, evaluation, the objective function is evaluated for each input, giving the input a fitness value. The next step, selection, is used to find two inputs that will be mated to contribute to the next generation. The two inputs are selected at random, but each input's probability of being chosen is proportional to its fitness. The third step is crossover (or recombination). If an input contains n bits, the crossover point is a randomly selected integer c, 0 < c < n. Two offspring A and B are created in such a way that the first c bits of A are copied from the first parent, while the first c bits of B are copied from the second. After the crossover point, each child takes its bits from the remaining parent, so that A's last n - c bits come from the second parent, while the last n - c bits of B come from the first. In addition to evaluation, selection, and recombination, genetic algorithms use mutation to guard against the permanent loss of information. Mutation simply results in the infrequent flipping of a bit within an input.

Our second genetic algorithm is the differential GA described in [Storn, 1996]. Here, an initial population is constructed as above. Recombination is accomplished by iterating over the inputs in the population. For each such input I, three mates, A, B, and C, are selected at random. A new input I' is created according to the following method, where we let A_i denote the value of the ith parameter in the input A, and likewise for the other inputs: for each parameter value I_i in the input I, we let I'_i = I_i with probability p, where p is a parameter to the genetic algorithm. With probability 1 - p, we let I'_i = A_i + γ(B_i - C_i), where γ is a second parameter of the GA. If I' results in a better objective function value than I, then I' replaces I; otherwise I is kept.
3 Experimental results

In this section, we describe several experimental results obtained with GADGET. All experiments report on condition/decision coverage criteria. We begin with a simple program, triangle, for which results are often reported in the literature. The second set of results was obtained from a suite of synthetic programs with varying complexity. These experiments were designed to explore the effect of program complexity on the difficulty of automatic test data generation. Our final set of results comes from a complex component of a Boeing 737 autopilot control program with about 2,000 lines of code. B737 includes both complex control structures and deeply nested conditionals.

3.1 Simple programs, including triangle

We begin our GADGET experimentation on a set of simple functions much like those reported in the literature [Chang et al., 1996, Korel, 1990, Yin et al., 1997]. These programs are roughly of the same complexity order, averaging 30 lines of code and all having relatively simple decisions. Complete results for Binary search, Bubble sort, Date range, Euclidean greatest common denominator, Insertion sort, Computing the median, Quadratic formula, and Warshall's algorithm can be found in [McGraw et al., 1997]. Averages from the results reported there show that random achieves 88.28% coverage, the standard GA 93.159%, and the differential GA 95.93%. Since the addition of simulated annealing and gradient descent to GADGET we have not experimented with simple programs.

Figure 1 shows how three of our five algorithms (standard GA, differential GA, and random) perform on the triangle program. The GA exhibits the best performance, something that holds across the board in simple program experiments. In general the differential GA shows slower convergence to a solution. It is taking advantage of a longer focus on a local area of the search space.

Our results for random test case generation resemble those reported elsewhere for cases where the code being analyzed was relatively simple. In [Chang et al., 1996], random test data generation was reported to cover 93.4% of the conditions on average. Although its worst performance (on a program containing 11 decision points) was 45.5%, it outperformed most of the other test generation schemes that were tried. Only symbolic execution had better performance for these programs. [Korel, 1996] reports on eleven programs averaging just over 100 lines of code. Overall, random test data generation was fairly successful, achieving 100% statement coverage for five programs, and averaging 76% coverage on the other six.

It is also interesting to compare our results with those obtained by [Korel, 1996] for three slightly larger programs. Simple branch coverage was the goal. Random test generation achieved 67%, 68%, and 79% coverage, respectively, on the three programs analyzed. Symbolic test generation achieved 63%, 60%, and 90% coverage, while dynamic test generation achieved 100%, 99%, and 100% coverage.

These results show a common trend: random test generation has at least an adequate performance on such programs, but for larger programs or more demanding coverage criteria (including condition/decision coverage), its performance deteriorates. The programs used in [Korel, 1996] were larger than those used in [Chang et al., 1996], and random test generation had poorer performance on the larger programs.

The results reported in [McGraw et al., 1997] use small programs, though the test-adequacy criterion is more stringent. In some cases (like the triangle program) random test generation is far less adequate even though resource limitations are not a telling factor: the final 90% of the randomly-generated tests failed to satisfy any new coverage criteria. In our subsequent experiments on larger programs, we find this to be a continuing trend: program size, program complexity, and coverage level all decrease the percentage of adequacy criteria that can be satisfied using random test generation.

Since the percentage of adequacy criteria satisfied by random test generation actually goes down when the programs become more complex, the suggestion is that something about large programs, other than the sheer number of conditions that must be covered, makes it difficult to generate test data for them. This is confirmed in our later experiments.

3.2 The role of complexity

In our second set of experiments, we create synthetic programs of varying complexity and use GADGET to generate test data satisfying condition/decision coverage. We are particularly interested in controlling two characteristics of programs: 1) how deeply conditional clauses were nested (we call this the nesting factor), and 2) the number of Boolean conditions in each decision (which we call the condition factor). The second parameter controls the complexity of individual conditional expressions; for example the decision ((i <= 0) || (j <= 0) || (k <= 0)) contains three conditions, and thus has a condition factor of three. We can classify programs according to their complexity with a function compl(nesting factor, condition factor). In our experiments, programs were generated with complexities compl(0,1), compl(3,2), and compl(5,3). All of these programs have ten input parameters, so they lead to a ten-dimensional search space for the optimization algorithms.

All of our optimization algorithms use the same parameters for each of the three synthetic programs. The two GA's have a population size of thirty, and they abandon a search if no progress is made in ten generations. Gradient descent and simulated annealing use step sizes selected according to a Gaussian distribution with mean 50 and standard deviation 25. For simulated annealing, the initial temperature is 0.1 and it is decreased in steps of 0.01. The temperature is decreased when twenty moves produce no improvement in the objective function.

Figures 2-4 show the pointwise mean performance over four runs of each test generation technique. For the simplest program, random test generation and the GAs quickly achieve high coverage (they are not distinguishable in the plot).
Figure 1: A convergence graph for the triangle program. The graph shows how performance (in terms of percent CDC coverage) improves as the number of executions increases. The curves represent the pointwise mean performance over five runs for each system.

Figure 2: Random test generation and the GAs perform well on the simple program, compl(0,1). Simulated annealing and gradient descent need many more executions to attain the same performance. Curves in Figures 1-3 represent the pointwise mean over four runs for each system.

Figure 3: As complexity increases, simulated annealing and gradient descent begin to out-pace the standard GA, but they are still somewhat more expensive. The differential GA's performance lagged (compl(3,2)).

Figure 4: High complexity programs are hard for all generators, but simulated annealing has the best performance. Gradient descent outperforms the GAs but is slightly more expensive than the standard GA.
catches up, and simulated annealing ultimately sur- ity has a drastic a ect not only on the performance
passes the other techniques, both take many execu- of our various optimization techniques, but also on
tions to do so. When we examined the details of each which strategies are best. Clearly, more research is
execution, we nd that the GAs perform well because needed on the relationships between program com-
they often manage to satisfy new criteria coinciden- plexity, search space characteristics, and dynamic test
tally | that is, they often nd inputs satisfying one generation strategies.
criterion while they are searching for an input that We can also conclude that coincidental satisfaction
satis es another. of new criteria plays an important role, especially in
Gradient descent and simulated annealing are not simpler programs. This allows the genetic algorithms
as good at uncovering useful inputs coincidentally. to perform as well as random test generation on sim-
They work as hard on the simple program as they ple programs, even though one might otherwise expect
do on the complex ones, but the extra e ort is not them to do as much unnecessary work as gradient de-
needed in this case. Ultimately, they do as well as scent or simulated annealing.
random test generation and the GAs, but they are 3.3 b737
unnecessarily expensive. In our nal study, we use the GADGET system
Figure 3 shows results for a program of intermediate on b737, a C program which is part of an autopilot
complexity, compl(3; 2). Random test generation does system. This code has 69 decision points and 2046
not do as well as before. The standard GA continues source lines of code (excluding comments), and it was
to discover many inputs by coincidence, but it quickly generated by a CASE tool. The program has a 186-
levels o . The di erential GA performs a more focused dimensional input space. Figure 5 shows the function
search than the standard GA, and it fails to nd as call graph of b737 which imparts some idea of its com-
many inputs by coincidence. Gradient descent and plexity as a program. (A similar graph for triangle
simulated annealing perform better than the GAs and has only four nodes.)
use fewer resources than the di erential GA, but they Ten attempts are made with each method to ob-
are still more expensive than the standard GA. tain condition/decision coverage of b737. For the two
Finally, Figure 4 shows the results for a program genetic algorithms, we make some attempt to tune
with compl(5; 3). Here, simulated annealing is clearly performance by adjusting the size of the population
more successful than any of the other techniques. Gra- and the number of generations that can elapse with-
dient descent is almost as successful, though the stan- out any improvement before the GAs give up. The
dard GA outperforms it in the early stages. The dif- goal of this ne-tuning is to maximize the percent-
ferential GA ultimately outperforms the standard GA. age of conditions covered, while keeping the execution
The total number of executions was smaller in the time low. For the standard genetic algorithm, we use
compl(5; 3) experiment than in compl(3; 2). The rea- populations of 100 inputs each, and allow 15 genera-
son for this is that none of the test generators at- tions to elapse when no improvement in tness is seen.
tempt to satisfy conditions they cannot reach. In the For the di erential GA we use populations of 25 in-
compl(3; 2) experiment, more conditions are reached, puts each, and allow 20 generations to elapse when
so all the generators do more work. there is no improvement in tness (the smaller pop-
We cannot conclude from these experiments that ulation size allows more generations to be evaluated
GAs perform poorly on complex programs compared with a given number of program executions, and we
to simulated annealing. With each of the optimization found that the di erential GA typically needed more
techniques we used, it is standard for practitioners generations because it converges more slowly than the
to tune parameters like the population size, anneal- standard GA).
ing schedule, and so on, in order to get better per- For gradient descent, we attempt to control the
formance. Mere coincidence may account for the fact number of executions by varying the size of the incre-
that our simulated annealing technique is better tuned ments and decrements in the input values, and we do
for the synthetic complex programs than the GAs are. the same for simulated annealing. We use randomly
Although we made some attempt to tune the optimiz- selected increments and decrements, chosen according
ers, nothing is known about how they should be tuned to a Gaussian distribution with mean 4 and a standard
for more complex programs. This is, after all, the rst deviation of 2. We use the same annealing schedule as
time most of these techniques have been used for test in our other experiments.
data generation. Figure 7 shows two convergence graphs comparing
We can conclude, however, that program complex- random test generation, the standard GA, and the





Figure 5: The function call graph of b737 imparts some idea of its complexity as a program.

Figure 6: The flow of control for a hypothetical program.

Figure 7 shows two convergence graphs comparing random test generation, the standard GA, and the differential GA on the left, with gradient descent and simulated annealing on the right. The graphs show the pointwise best, worst, and mean performance over ten separate runs of each system. For each point on the horizontal axis, the worst curve shows the worst performance of the optimizer over all runs at that particular time. Thus, the performance never dropped below this line on any run. Likewise, the performance never rises above the curve marked best.

The best performance is seen on one of the ten standard GA runs. In this run, genetic search is able to achieve more than 93% CDC code coverage, significantly better than random test generation, which only achieves 55% coverage. Overall, the standard GA performs best, with a mean performance noticeably higher than that of the differential GA, and far above that of random test generation. There is little variability in the performance of the standard GA and almost none in the performance of the random test generator, but the differential GA exhibits surprising variability between runs. This may be caused by the smaller population we used for the differential GA; with fewer inputs in the population, it is less likely that statistical fluctuations in fitness will cancel each other out.

Because b737 has a 186-dimensional input space, we had difficulty tuning gradient descent and simulated annealing so that they would only require 15,000 executions. Thus, the convergence graph on the right achieves about 82.5% coverage, somewhat below the other nonrandom techniques.

Here, as in our other experiments, most useful inputs are discovered coincidentally. The fact that most inputs are discovered by luck means that most criteria not satisfied by chance were not satisfied at all. In this respect, the b737 experiments shed some light on the true behavior of the test generators. We will examine a typical run of the standard GA and then discuss a run of the differential GA.

The execution of the standard GA we examine as a model was selected because its coverage results are close to the mean value of all coverage results produced by the standard GA on b737. In its 11,409 executions of b737, this run sought to satisfy adequacy criteria on twelve different conditions in the code. Of these twelve attempts, only one was successful. The remaining eleven attempts show little forward progress during ten generations of evolution. While making these failed attempts, however, the GA coincidentally discovers fourteen tests that satisfy conditions other than the ones it was working on at the time. The high coverage level that is finally attained is mostly due to these fourteen inputs. Indeed, the most successful executions have the shortest run-times precisely because so many inputs are found coincidentally. Many conditions get covered by chance before the GA is ready to begin working on them. This does not mean that the GAs would fail to find those inputs if a concerted
represents a di erent time period than the one on the attempt had been made.
left. However, we can simply cut o the executions The GA failed to cover the following eight condi-
tions, which we discuss further below:
of gradient descent and simulated annealing at 15,000 for (index = begin; index <= end && !termination;
runs, making their results comparable to those ob- index++)
tained for the other three techniques. As we see from if (((T <= 0.0) || (0.0 == Gain)))
the gure on the right, simulated annealing in this case if (o_5)
if (o_5)
achieves about 91% coverage on average|slightly be- if (((o_5 > 0.0) && (o < 0.0)))
low the level of the standard GA. Gradient descent if (FLARE)
if (FLARE)
if (DECRB)

[Figure 7 plots: left panel "b737 (GAs and random)", right panel "b737 (gradient descent and SA)"; vertical axes "% covered" from 50 to 100, horizontal axes "executions x 10^3" from 0 to 15 (left) and 0 to 50 (right).]

Figure 7: Convergence graph comparing performance of five systems on the b737 code. Three curves are shown for each system. They represent the best performance, the pointwise mean over ten runs, and the worst performance. The GAs both show much better performance than random test data generation, with the standard GA generally outperforming the differential GA. The performance of simulated annealing is just below that of the standard GA, and that of gradient descent slightly below that of the standard GA. Note the difference between the number of executions shown in the two plots.

During a typical run of the differential GA, the 15,982 executions of b737 involve 25 different attempts to satisfy specific criteria. Again, only one objective is obtained through evolution, though ten additional input cases, not necessarily prime objectives of the GA, are discovered.

The following are the conditions that the differential GA failed to cover:

for (index = begin; index <= end && !termination; index++)
if (o)
if (o)
if ((((OP - D2) < 0.0) && ((OP - D1) > 0.0)))
if (o_3)
if (((o_5 > 0.0) && (o < 0.0)))
if (FLARE)
if (((!LOCE) || ONCRS))
if (RESET)

Considered together, most of the decisions not covered by the GAs contain only a single Boolean variable, signifying a condition that can be either true or false. The technique we use to define our fitness function has limited effectiveness with such conditions; specifically, the handling of Boolean variables and enumerated types is currently inadequate [McGraw et al., 1997]. With an improved strategy for dealing with such conditionals, GA behavior should improve.

The GAs also fail to cover several conditions not containing Boolean variables, in spite of the fact that such conditions provide the GAs with useful fitness functions. The conditions not covered by the GAs all occur within decisions containing more than one condition, and this may account for the GA's difficulties. However, it is also important to bear in mind that these conditions do not tell the whole story, since the variables appearing in the condition may be complicated functions of the input parameters. In almost all cases where the optimizers fail, they do so because all the inputs they examine lead to the same value for the objective function (causing a plateau effect).

4 Coincidental coverage

Our examination of the five test generation techniques uncovers a number of interesting features of the dynamic test generation problem. Perhaps the most significant of these is that test criteria are often satisfied coincidentally. That is, the test generator can find inputs that satisfy one criterion even though it is searching for inputs to satisfy a different one. The ability to find good tests by coincidence apparently
plays a crucial role in the success of a dynamic test-data generation technique. This is in spite of the fact that random test generation, which should be good at finding inputs coincidentally, performs poorly in most of our experiments. The ability of combinatorial search algorithms to concentrate on a restricted part of the input space usually (but not always) gives them an advantage when it comes to discovering new inputs accidentally.

The two genetic algorithms were especially good at discovering inputs coincidentally, and it appears that this led to a sizable decrease in the computational resources they require. This is an interesting result, because the added sophistication of some combinatorial search algorithms slows them down, making them unsuitable for easy problems that could be solved by simpler techniques. In our experiments, coincidental discovery of inputs lets the genetic algorithms compete with simpler techniques on easy problems.

We also find that dynamic test generation frequently encounters plateaus in the objective function. These are regions where it is difficult or impossible for the algorithms to find inputs that change the objective function's value. Some of these plateaus are caused by true/false-valued variables appearing within conditions, but many of the plateaus cannot be explained in this way. It is somewhat surprising that plateaus should be encountered so often, especially when the search space has high dimensionality and allows the search algorithm to move in many directions.

4.1 Why random breaks down

The most interesting question raised by our experiments is the following: if the two GAs had so much success with inputs they happened on by chance, then why didn't random test generation, which ought to be good at finding things by chance, perform equally well?

Our explanation of this phenomenon is illustrated by the diagram in Figure 6, which represents the flow of control in a hypothetical program. The nodes represent decisions. Suppose that we do not have an input that takes the true branch of the condition labeled c. Because of the coverage-table strategy, GADGET does not attempt to find such an input until decision c can be reached (such an input must take the true branches of conditions a and b). When the GA starts trying to find an input that takes the true branch of c, inputs that reach c are used as seeds. During reproduction, some newly generated inputs will reach c and some will not, but those that do not will have poor fitness values, and they will not usually reproduce. Thus, during reproduction, the GA tends to generate inputs that reach c. Until the GA's goal is satisfied, all newly generated inputs will by definition take the false branch at c, and therefore they will all reach condition d. Each time a new input is generated that reaches c, there is a possibility it will exercise a new branch of d.

By contrast, inputs selected completely at random may be unlikely to reach d, because many will take the false branches of conditions a and b. Therefore randomly selected inputs are less likely to exercise new branches of d.

5 Conclusions

In this paper, we report on results from three sets of experiments using dynamic test data generation. Test data are generated for programs of various sizes, including some that are much more complex than those usually subjected to test data generation in the past. The following are some salient conclusions of our study:

- Coincidental satisfaction of new test criteria can play an important role. In general, we found that the most successful attempts to generate test data do so by satisfying many criteria coincidentally. This coincidental discovery of solutions is facilitated by the fact that a test generator must solve a number of similar problems, and may lead to considerable differences between dynamic test data generation and other optimization problems.

- Our test generation techniques have a more difficult time with some programs than others. In particular, programs having deeply nested conditional statements and many conditions per decision are hard to cover using our techniques.

- Plateaus in the objective function appear to be a common occurrence in dynamic test generation. Combinatorial optimization techniques tend to stall on such plateaus, and special techniques for dealing with them may be useful. One promising technique is [Korel, 1990]'s dynamic path selection; other strategies are deserving of future research.

- Our experiments suggest that the standard GA and simulated annealing tend to outperform non-combinatorial methods, including random and gradient descent. Further experimentation is needed to clarify which methods are most useful under what conditions and what objective functions are most useful for real programs. A criterion that is difficult to cover with one technique
may often be easier with another. We are considering adaptive strategies to deal with this problem, including the ability to switch search strategies dynamically according to current search data.

Our results show that automatically generating test data can be successfully accomplished using combinatorial optimization for complex programs written in C/C++. Though there are several remaining avenues to explore before the technology is fully mature, our initial experiments with GADGET are promising.

Acknowledgments The authors wish to thank Michael Schatz and Berent Eskikaya for many helpful contributions to this paper. Tom O'Connor and Brian Sohr contributed the GADGET acronym. This research has been made possible by the National Science Foundation under award number DMI-9661393. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect those of the National Science Foundation.

References

[Chang et al., 1996] Chang, K., Cross, J., Carlisle, W., and Liao, S. (1996). A performance evaluation of heuristics-based test case generation methods for software branch coverage. International Journal of Software Engineering and Knowledge Engineering, 6(4):585–608.

[Chilenski and Miller, 1994] Chilenski, J. and Miller, S. (1994). Applicability of modified condition/decision coverage to software testing. Software Engineering Journal, pages 193–200.

[Duran and Ntafos, 1984] Duran, J. and Ntafos, S. (1984). An evaluation of random testing. IEEE Transactions on Software Engineering, pages 438–444.

[Gallagher and Narasimhan, 1997] Gallagher, M. J. and Narasimhan, V. L. (1997). ADTEST: A test data generation suite for Ada software systems. IEEE TSE.

[Holland, 1975] Holland, J. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI. Reissued by MIT Press, 1992.

[Horgan et al., 1994] Horgan, J., London, S., and Lyu, M. (1994). Achieving software quality with testing coverage measures. IEEE Computer, 27(9):60–69.

[Kirkpatrick et al., 1983] Kirkpatrick, S., Gelatt, C., and Vecchi, M. (1983). Optimization by simulated annealing. Science, 220(4598):671–680.

[Korel, 1990] Korel, B. (1990). Automated software test data generation. IEEE Transactions on Software Engineering, 16(8):870–879.

[Korel, 1996] Korel, B. (1996). Automated test data generation for programs with procedures. In Proceedings of the 1996 International Symposium on Software Testing and Analysis, pages 209–215. ACM Press.

[McGraw et al., 1997] McGraw, G., Michael, C., and Schatz, M. (1997). Generating software test data by evolution. Technical report, Reliable Software Technologies, Sterling, VA. Submitted to IEEE Transactions on Software Engineering.

[Michael et al., 1997] Michael, C., McGraw, G., Schatz, M., and Walton, C. (1997). Genetic algorithms for test data generation. In Proceedings of the Twelfth IEEE International Automated Software Engineering Conference (ASE 97), pages 307–308, Tahoe, NV.

[Miller and Spooner, 1976] Miller, W. and Spooner, D. L. (1976). Automatic generation of floating point test data. IEEE TSE, SE-2(3):223–226.

[Storn, 1996] Storn, R. (1996). On the usage of differential evolution for function optimization. In Proc. NAFIPS '96, pages 519–523.

[Yin et al., 1997] Yin, H., Lebne-Dengel, Z., and Malaiya, Y. (1997). Automatic test generation using checkpoint encoding and antirandom testing. In Proceedings of the 8th International Symposium on Software Reliability Engineering (ISSRE), pages 84–95.