Figure 1: A convergence graph for the triangle program. The graph shows how performance (in terms of percent CDC coverage) improves as the number of executions increases. The curves represent the pointwise mean performance over five runs for each system.

Figure 2: Random test generation and the GAs perform well on the simple program, compl(0,1). Simulated annealing and gradient descent need many more executions to attain the same performance. Curves in Figures 1-3 represent the pointwise mean over four runs for each system.
[Plots for Figures 3 and 4: % CDC coverage vs. executions x 10^3, for compl(3,2) (roughly 50-80% coverage over 0-40 x 10^3 executions) and compl(5,3) (roughly 20-60% coverage over 0-15 x 10^3 executions). Curves: Random (R), Standard GA, Diff GA (dGA), Gradient desc (D), Sim anneal (SA).]
Figure 3: As complexity increases, simulated annealing and gradient descent begin to outpace the standard GA, but they are still somewhat more expensive. The differential GA's performance lagged. compl(3,2).

Figure 4: High complexity programs are hard for all generators, but simulated annealing has the best performance. Gradient descent outperforms the GAs but is slightly more expensive than the standard GA. compl(5,3).
catches up, and simulated annealing ultimately surpasses the other techniques, both take many executions to do so. When we examined the details of each execution, we found that the GAs perform well because they often manage to satisfy new criteria coincidentally; that is, they often find inputs satisfying one criterion while they are searching for an input that satisfies another.

Gradient descent and simulated annealing are not as good at uncovering useful inputs coincidentally. They work as hard on the simple program as they do on the complex ones, but the extra effort is not needed in this case. Ultimately, they do as well as random test generation and the GAs, but they are unnecessarily expensive.

Figure 3 shows results for a program of intermediate complexity, compl(3,2). Random test generation does not do as well as before. The standard GA continues to discover many inputs by coincidence, but it quickly levels off. The differential GA performs a more focused search than the standard GA, and it fails to find as many inputs by coincidence. Gradient descent and simulated annealing perform better than the GAs and use fewer resources than the differential GA, but they are still more expensive than the standard GA.

Finally, Figure 4 shows the results for a program with compl(5,3). Here, simulated annealing is clearly more successful than any of the other techniques. Gradient descent is almost as successful, though the standard GA outperforms it in the early stages. The differential GA ultimately outperforms the standard GA. The total number of executions was smaller in the compl(5,3) experiment than in compl(3,2). The reason for this is that none of the test generators attempt to satisfy conditions they cannot reach. In the compl(3,2) experiment, more conditions are reached, so all the generators do more work.

We cannot conclude from these experiments that GAs perform poorly on complex programs compared to simulated annealing. With each of the optimization techniques we used, it is standard for practitioners to tune parameters like the population size, annealing schedule, and so on, in order to get better performance. Mere coincidence may account for the fact that our simulated annealing technique is better tuned for the synthetic complex programs than the GAs are. Although we made some attempt to tune the optimizers, nothing is known about how they should be tuned for more complex programs. This is, after all, the first time most of these techniques have been used for test data generation.

We can conclude, however, that program complexity has a drastic effect not only on the performance of our various optimization techniques, but also on which strategies are best. Clearly, more research is needed on the relationships between program complexity, search space characteristics, and dynamic test generation strategies.

We can also conclude that coincidental satisfaction of new criteria plays an important role, especially in simpler programs. This allows the genetic algorithms to perform as well as random test generation on simple programs, even though one might otherwise expect them to do as much unnecessary work as gradient descent or simulated annealing.

3.3 b737

In our final study, we use the GADGET system on b737, a C program which is part of an autopilot system. This code has 69 decision points and 2046 source lines of code (excluding comments), and it was generated by a CASE tool. The program has a 186-dimensional input space. Figure 5 shows the function call graph of b737, which imparts some idea of its complexity as a program. (A similar graph for triangle has only four nodes.)

Ten attempts are made with each method to obtain condition/decision coverage of b737. For the two genetic algorithms, we make some attempt to tune performance by adjusting the size of the population and the number of generations that can elapse without any improvement before the GAs give up. The goal of this fine-tuning is to maximize the percentage of conditions covered, while keeping the execution time low. For the standard genetic algorithm, we use populations of 100 inputs each, and allow 15 generations to elapse when no improvement in fitness is seen. For the differential GA, we use populations of 25 inputs each, and allow 20 generations to elapse when there is no improvement in fitness (the smaller population size allows more generations to be evaluated with a given number of program executions, and we found that the differential GA typically needs more generations because it converges more slowly than the standard GA).

For gradient descent, we attempt to control the number of executions by varying the size of the increments and decrements in the input values, and we do the same for simulated annealing. We use randomly selected increments and decrements, chosen according to a Gaussian distribution with mean 4 and a standard deviation of 2. We use the same annealing schedule as in our other experiments.

Figure 7 shows two convergence graphs comparing random test generation, the standard GA, and the
[Figure 5 image: function call graph of b737, with nodes such as AUTOPILOT, RUDDER_CMD, PRE_FLARE, FLARE_CONTROL, LOC_ERROR, MODLAG, INTEGRATE, and others; rendered with daVinci V2.0.3. Figure 6 image: a small control-flow graph with true/false branch labels.]

Figure 5: The function call graph of b737 imparts some idea of its complexity as a program.

Figure 6: The flow of control for a hypothetical program.
differential GA on the left, with gradient descent and simulated annealing on the right. The graphs show the pointwise best, worst, and mean performance over ten separate runs of each system. For each point on the horizontal axis, the worst curve shows the worst performance of the optimizer over all runs at that particular time. Thus, the performance never drops below this line on any run. Likewise, the performance never rises above the curve marked best.

The best performance is seen on one of the ten standard GA runs. In this run, genetic search is able to achieve more than 93% CDC coverage, significantly better than random test generation, which only achieves 55% coverage. Overall, the standard GA performs best, with a mean performance noticeably higher than that of the differential GA, and far above that of random test generation. There is little variability in the performance of the standard GA and almost none in the performance of the random test generator, but the differential GA exhibits surprising variability between runs. This may be caused by the smaller population we used for the differential GA; with fewer inputs in the population, it is less likely that statistical fluctuations in fitness will cancel each other out.

Because b737 has a 186-dimensional input space, we had difficulty tuning gradient descent and simulated annealing so that they would only require 15,000 executions. Thus, the convergence graph on the right represents a different time period than the one on the left. However, we can simply cut off the executions of gradient descent and simulated annealing at 15,000 runs, making their results comparable to those obtained for the other three techniques. As we see from the figure on the right, simulated annealing in this case achieves about 91% coverage on average, slightly below the level of the standard GA. Gradient descent achieves about 82.5% coverage, somewhat below the other nonrandom techniques.

Here, as in our other experiments, most useful inputs are discovered coincidentally. The fact that most inputs are discovered by luck means that most criteria not satisfied by chance were not satisfied at all. In this respect, the b737 experiments shed some light on the true behavior of the test generators. We will examine a typical run of the standard GA and then discuss a run of the differential GA.

The execution of the standard GA we examine as a model was selected because its coverage results are close to the mean value of all coverage results produced by the standard GA on b737. In its 11,409 executions of b737, this run sought to satisfy adequacy criteria on twelve different conditions in the code. Of these twelve attempts, only one was successful. The remaining eleven attempts show little forward progress during ten generations of evolution. While making these failed attempts, however, the GA coincidentally discovers fourteen tests that satisfy conditions other than the ones it was working on at the time. The high coverage level that is finally attained is mostly due to these fourteen inputs. Indeed, the most successful executions have the shortest run-times precisely because so many inputs are found coincidentally. Many conditions get covered by chance before the GA is ready to begin working on them. This does not mean that the GAs would fail to find those inputs if a concerted attempt had been made.

The GA failed to cover the following eight conditions, which we discuss further below:

    for (index = begin; index <= end && !termination;
         index++)
    if (((T <= 0.0) || (0.0 == Gain)))
    if (o_5)
    if (o_5)
    if (((o_5 > 0.0) && (o < 0.0)))
    if (FLARE)
[Plots for Figure 7: % covered vs. executions x 10^3. Left panel "b737 (GAs and random)" (0-15 x 10^3 executions) shows best, worst, and mean curves for the standard GA, the differential GA, and random generation; right panel "b737 (gradient descent and SA)" (0-50 x 10^3 executions) shows best, worst, and mean curves for gradient descent and simulated annealing.]
Figure 7: Convergence graph comparing the performance of five systems on the b737 code. Three curves are shown for each system. They represent the best performance, the pointwise mean over ten runs, and the worst performance. The GAs both show much better performance than random test data generation, with the standard GA generally outperforming the differential GA. The performance of simulated annealing is just below that of the standard GA, and that of gradient descent slightly below that of the standard GA. Note the difference between the number of executions shown in the two plots.