
IEEE International Conference on Software Testing Verification and Validation Workshops

Comparison of Unit-Level Automated Test Generation Tools

Shuang Wang and Jeff Offutt


Software Engineering
George Mason University
Fairfax, VA 22030, USA
{swangb,offutt}@gmu.edu

Abstract

Data from projects worldwide show that many software projects fail and most are completed late or over budget. Unit testing is a simple but effective technique to improve software in terms of quality, flexibility, and time-to-market. A key idea of unit testing is that each piece of code needs its own tests, and the best person to design those tests is the developer who wrote the software. However, generating tests for each unit by hand is very expensive, possibly prohibitively so. Automatic test data generation is essential to support unit testing, and as unit testing receives more attention, developers are using automated unit test data generation tools more often. However, developers have very little information about which tools are effective. This experiment compared three well-known, publicly accessible unit test data generation tools, JCrasher, TestGen4J, and JUB. We applied them to Java classes and evaluated them based on their mutation scores. As a comparison, we created two additional sets of tests for each class. One test set contained random values and the other contained values to satisfy edge coverage. Results showed that the automatic test data generation tools generated tests with almost the same mutation scores as the random tests.

1 Introduction

An important goal of unit testing is to verify that each unit of software works properly. Unit testing allows many problems to be found early in software development. A comprehensive unit test suite that runs together with daily builds is essential to a successful software project [20]. As the computing field uses more agile processes, relies more on test driven development, and has higher reliability requirements for software, unit testing will continue to increase in importance.

However, some developers still do not do much unit testing. One possible reason is that they do not think the time and effort will be worthwhile [2]. Another possible reason is that unit tests must be maintained, and maintenance for unit tests is often not budgeted. A third possibility is that developers may not know how to design and implement high quality unit tests; they are certainly not taught this crucial knowledge in most undergraduate computer science programs.

Using automated unit test tools instead of manual testing can help with all three problems. Automated unit test tools can reduce the time and effort needed to design and implement unit tests, make it easier to maintain tests as the program changes, and encapsulate knowledge of how to design and implement high quality tests so that developers do not need to know as much. But an important question developers must answer is "which tool should I use?"

This empirical study looks at the most technically challenging part of unit testing, test data generation. We selected tools to evaluate empirically based on three factors: (1) the tool must automatically generate test values with little or no input from the tester, (2) the tool must test Java classes, and (3) the tool must be free and readily available (for example, through the web).

We selected three well-known, publicly accessible automated tools (Section 2.1). The first is JCrasher [3], a random testing tool that causes the class under test to "crash." The second is TestGen4J [11], whose primary focus is to exercise boundary value testing of arguments passed to methods. The third is JUB (JUnit test case Builder) [19], which is a framework based on the Builder pattern [7]. We use these tools to automatically generate tests for a collection of Java classes (Section 2.3).

As a control, our second step was to manually generate two additional sets of tests: a set of purely random tests for each class as a "minimal effort" comparison, and tests to satisfy edge coverage on the control flow graph.

Third, we seeded faults using the mutation analysis tool muJava [9, 10] (Section 2.4). MuJava is an automated class mutation system that automatically generates mutants for Java classes and evaluates test sets by calculating the number of mutants killed.
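For reference, the mutation score used throughout this paper is simply the percentage of generated mutants that a test set kills:

    mutation score = (mutants killed / mutants generated) x 100%

So, for example, a test set that kills 50 of 200 mutants (these numbers are illustrative, not from the study) has a mutation score of 25%.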

Finally, we applied the tests to muJava (Section 2.4) and compared their mutation scores (the percentage of mutants killed). Results are given in Section 3 and discussed in Section 4. Related work is presented in Section 5, and Section 6 concludes the paper.

2 Experimental Design

This section describes the design of the experiment. First, each unit test data generation tool used is described, then the process used to manually generate additional tests is presented. Next, the Java classes used in the study are discussed and the muJava mutation testing tool is presented. The process used in conducting the experiment is then presented, followed by possible threats to validity.

2.1 Subjects: Unit Testing Tools

Table 1 summarizes the three tools examined in this study. Each tool is described in detail below.

JCrasher [3] is an automatic robustness testing tool for Java classes. JCrasher examines the type information of methods in Java classes and constructs code fragments that will create instances of different types to test the behavior of the public methods with random data. JCrasher explicitly attempts to detect bugs by causing the class under test to crash, that is, to throw an undeclared runtime exception. Although limited by the randomness of the input values, this approach has the advantage of being completely automatic. No inputs are required from the developer.
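To make the general approach concrete, the sketch below shows the shape of this kind of random robustness testing: invoke each public method with random arguments and flag any exception that escapes. It is not JCrasher's actual implementation; the Counter class and the int-only argument handling are assumptions made for this illustration.

    import java.lang.reflect.Method;
    import java.util.Random;

    // Illustrative sketch (not JCrasher's code): call each public method of a class
    // under test with random int arguments and report any exception that escapes.
    public class RandomRobustnessSketch {
        static class Counter {                    // hypothetical stand-in for a class under test
            private int value;
            public void add(int x) { value += x; }
            public int half()      { return 100 / value; }   // fails when value == 0
        }

        public static void main(String[] args) {
            Random rand = new Random();
            Counter target = new Counter();
            for (Method m : Counter.class.getDeclaredMethods()) {
                Class<?>[] params = m.getParameterTypes();
                Object[] values = new Object[params.length];
                for (int i = 0; i < params.length; i++) {
                    values[i] = (params[i] == int.class) ? rand.nextInt() : null;  // ints only in this sketch
                }
                try {
                    m.invoke(target, values);
                } catch (Exception e) {
                    // an exception escaping the method under test is reported as a "crash"
                    System.out.println(m.getName() + " crashed: " + e.getCause());
                }
            }
        }
    }

JCrasher itself goes considerably further, building well-typed argument objects from the methods' type information rather than supplying only random integers.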
TestGen4J [11] automatically generates JUnit test cases from Java class files. Its primary focus is to perform boundary value testing of the arguments passed to methods. It uses rules, written in a user-configurable XML file, that define boundary conditions for the data types. The test code is separated from the test data with the help of JTestCase (http://jtestcase.sourceforge.net/).

JUB (JUnit test case Builder) [19] is a JUnit test case generator framework accompanied by a number of IDE-specific extensions. These extensions (tools, plug-ins, etc.) are invoked from within the IDE and must store generated test case code inside the source code repository administered by the IDE.
2.2 Additional Test Sets

As a control comparison, we generated two additional sets of tests for each class by hand, with some limited tool support. Testing with random values is widely considered the "weakest effort" testing strategy, and it seems natural to expect a unit test data generator tool to at least do better than random value generation. We wrote a special-purpose tool that generated random tests in two steps. For each test, the tool arbitrarily selected a method from the class to test. (The number of methods in each class is given in Table 2.) Then the tool randomly generated values for each parameter of that method. The tool did not parse the classes; the methods and parameters were hard-coded into tables in the tool. We decided to create the same number of random tests for each subject class as the tool from Section 2.1 that created the most tests for that class. For all the subject classes, JCrasher generated the most tests, so the study had the same number of random tests as JCrasher had.
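A minimal sketch of this kind of table-driven generator is shown below. It is not the tool used in the study; the Queue method names, parameter counts, and printed output format are illustrative assumptions.

    import java.util.List;
    import java.util.Random;

    // Sketch of a table-driven random test generator (not the study's actual tool).
    public class RandomTestSketch {
        // hard-coded table entry: method name plus the number of int parameters it takes
        record MethodEntry(String name, int intParams) {}

        static final List<MethodEntry> QUEUE_METHODS = List.of(
                new MethodEntry("enQueue", 1),    // hypothetical signatures for the Queue subject class
                new MethodEntry("deQueue", 0),
                new MethodEntry("isFull", 0));

        public static void main(String[] args) {
            Random rand = new Random();
            for (int t = 0; t < 5; t++) {         // emit five random test skeletons
                // step 1: arbitrarily select a method from the table
                MethodEntry m = QUEUE_METHODS.get(rand.nextInt(QUEUE_METHODS.size()));
                // step 2: generate a random value for each parameter
                StringBuilder argList = new StringBuilder();
                for (int i = 0; i < m.intParams(); i++) {
                    if (i > 0) argList.append(", ");
                    argList.append(rand.nextInt());
                }
                System.out.println("queue." + m.name() + "(" + argList + ");");
            }
        }
    }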
We elected to use a test criterion as a second control. Formal test criteria are widely promoted by researchers and educators, but are only spottily used in industry [8]. We chose one of the weakest and most basic test criteria: edge coverage on the control flow graphs. We created control flow graphs by hand for each method in each class, then designed inputs to cover each edge in the graphs.
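As a small illustration of the criterion, consider the method below, which is invented for this example and not taken from the subject classes. Its control flow graph has three edges, and two inputs are enough to cover them all.

    // Edge coverage illustration on a tiny method.
    public class EdgeCoverageExample {
        // CFG nodes: 1 = the if decision, 2 = the then branch, 3 = the return.
        // CFG edges: 1->2 (x < 0 true), 1->3 (x < 0 false), 2->3.
        public static int abs(int x) {
            if (x < 0) {      // node 1
                x = -x;       // node 2
            }
            return x;         // node 3
        }

        public static void main(String[] args) {
            System.out.println(abs(-5));  // covers edges 1->2 and 2->3, expected 5
            System.out.println(abs(7));   // covers edge 1->3, expected 7
        }
    }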
2.3 Java Classes Tested

Table 2 lists the Java classes used in this experiment. BoundedStack is a small, fixed-size implementation of a stack from the Eclat tool's website [14]. Inventory is taken from the MuClipse [18] project, the Eclipse plug-in version of muJava. Node is a mutable set of Strings that is a small part of a publish/subscribe system. It was a sample solution from a graduate class at George Mason University. Queue is a mutable, bounded FIFO data structure of fixed size. Recipe is also taken from the MuClipse project; it is a JavaBean class that represents a real-world Recipe object. Twelve is another sample solution. It tries to combine three given integers with arithmetic operators to compute exactly twelve. VendingMachine is from Ammann and Offutt's book [1]. It models a simple vending machine for chocolate candy.

Table 2. Subject Classes Used
Name LOC Methods
BoundedStack 85 11
Inventory 67 11
Node 77 9
Queue 59 6
Recipe 74 15
TrashAndTakeOut 26 2
Twelve 94 1
VendingMachine 52 6
Total 534 61

Table 1. Automated Unit Testing Tools
Name Version Inputs Interface
JCrasher 0.1.9 (2004) Source File Eclipse Plug-in
TestGen4J 0.1.4-alpha (2005) Jar File Command Line (Linux)
JUB 0.1.2 (2002) Source File Eclipse Plug-in

2.4 MuJava

Our primary measurement of the test sets in this experiment is their ability to find faults. MuJava is used to seed faults (mutants) into the classes and to evaluate how many mutants each test set kills.

MuJava [9, 10] is a mutation system for Java classes. It automatically generates mutants for both traditional mutation testing and class-level mutation testing. MuJava can test individual classes and packages of multiple classes. Tests are supplied by the users as sequences of method calls to the classes under test, encapsulated in methods in separate classes.

MuJava creates mutants for Java classes according to 24 operators that include object-oriented operators. Method level (traditional) mutants are based on the selective operator set of Offutt et al. [13]. After creating mutants, muJava allows the tester to enter and run tests, and evaluates the mutation coverage of the tests. In muJava, tests for the class under test are encoded in separate classes that make calls to methods in the class under test. Mutants are created and executed automatically.
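To illustrate what traditional mutants look like, the method and the two mutants below are written for this example; they are not actual muJava output for the subject classes. A test kills a mutant when the mutant's result differs from the original's for some test input.

    // Illustrative traditional mutants (not generated by muJava for the subject classes).
    public class MutantExample {
        // original method
        public static int pay(int credit, int price) {
            if (credit >= price) { credit = credit - price; }
            return credit;
        }
        // ROR mutant: ">=" replaced by ">"; only a test with credit == price can kill it,
        // e.g. pay(90, 90) returns 0 but payROR(90, 90) returns 90.
        public static int payROR(int credit, int price) {
            if (credit > price) { credit = credit - price; }
            return credit;
        }
        // AORB mutant: "-" replaced by "+"; killed by any test that makes a purchase and
        // checks the remaining credit, e.g. pay(100, 90) returns 10 but payAORB(100, 90) returns 190.
        public static int payAORB(int credit, int price) {
            if (credit >= price) { credit = credit + price; }
            return credit;
        }
    }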
Table 3. Classes and Mutants
Classes Traditional Class Total
BoundedStack 224 4 228
Inventory 101 50 151
Node 18 4 22
Queue 117 6 123
Recipe 101 26 127
TrashAndTakeOut 104 0 104
Twelve 234 0 234
VendingMachine 77 7 84
Total 976 97 1073

2.5 Experimental Conduct

Figure 1 illustrates the experimental process. The Java classes are represented by the leftmost box, P. Each of the three automated tools (JCrasher, TestGen4J, JUB) was used to create a set of tests (Test Set JC, ...). Then we manually created the random tests, then the edge coverage tests.

MuJava was designed and developed before JUnit and other widely used test harness tools, so it has its own syntactic requirements. Each muJava test must be in a public method that returns a String that encapsulates the result of the test. Thus many of the tests from the three tools had to be modified to run with muJava. This modification is illustrated in Figure 2.

Then muJava was used to generate mutants for each subject class and to run all five test sets against the mutants. This resulted in five mutation scores for each subject class (40 mutation scores in all). Because muJava separates scores for the traditional mutation operators from the class mutation operators, we also kept these scores separate.

2.6 Threats to Validity

As with any study that involves specific programs or classes, there is no guarantee that the classes used are "representative" of the general population. As yet, nobody has developed a general theory for how to choose representative classes for empirical studies, or how many classes we may need. This general issue has a negative impact on our ability to apply statistical analysis tools and reduces the external validity of many software engineering empirical studies, including this one.

Another question is whether the three tools used are representative. We searched for free, publicly available tools that generated tests and were somewhat surprised at how few we found. We initially hoped to use an advanced tool called Agitar Test Runner; however, it is no longer available to academic users.

Another possible problem with internal validity may be in the manual steps that had to be applied. Tests from the three tools had to be translated to muJava tests. These changes were only to the structure of the test methods, and did not affect the values, so it is unlikely that these translations affected the results. The random and edge coverage tests were generated by hand with some tool support. These tests were generated without knowledge of the mutants, so as to avoid any bias.

Figure 1. Experiment Implementation (diagram: the subject classes P are given to JCrasher, TestGen4J, JUB, and to the manual Random and Edge Cover steps; the resulting test sets JC, TG, JUB, Ran, and EC are each run against the muJava mutants to produce a mutation score)

Generated test (before conversion):

    public void test242() throws Throwable
    {
        try {
            int i1 = 1;
            int i2 = 1;
            int i3 = -1;
            Twelve.twelve (i1, i2, i3);
        }
        catch (Exception e)
        { dispatchException (e); }
    }

Converted muJava test (after conversion):

    public String test242() throws Throwable
    {
        try {
            int i1 = 1;
            int i2 = 1;
            int i3 = -1;
            return Twelve.twelve (i1, i2, i3);
        }
        catch (Exception e)
        { return e.toString(); }
    }

Figure 2. Conversion to MuJava Tests

3 Results

The number of mutants for each class is shown in Table 3. MuJava generates traditional (method level) and class mutants separately. The traditional operators focus on individual statements, and the class operators focus on connections between classes, especially in inheritance hierarchies. Node looks anomalous because of its small number of mutants. Node has only 18 traditional mutants over 77 lines of code. However, the statements are distributed over nine, mostly very small, methods. Most methods are very short (four have only one statement); they use no arithmetic, shift, or logical operators, have no assignments, and contain only a few decision statements. Thus, Node has few locations where muJava's mutation operators could be applied.

Results from running the five sets of tests under muJava are shown in Tables 4 through 7. Table 4 shows the number of tests in each test set and the mutation scores on the traditional mutants, in terms of the percentage of mutants killed. The "Total" row gives the sum of the tests for the subject classes and the mutation score across all subject classes. As can be seen, the random tests did better than the TestGen and JUB tests, but not as well as the JCrasher tests. The edge coverage tests, however, killed 24% more mutants than the strongest tool, with fewer tests.

Table 5 shows the same data for the class level mutants. There is more diversity in the scores from the three tools, and JCrasher killed 12% more class level mutants than the random tests did. However, the edge coverage tests are still far stronger, killing 21% more mutants than the strongest performing tool (JCrasher).

Table 6 combines the scores from Tables 4 and 5 for both traditional and class level mutants. Again, the JUB tests are the weakest; JCrasher, TestGen, and the random tests are fairly close; and the edge coverage tests kill far more mutants, 24% more than the JCrasher tests.

Table 7 summarizes the data for all subject classes for each set of tests. Because the number of tests diverged widely, we asked the question "how efficient is each test generation method?" To approximate efficiency, we computed the number of mutants killed per test. Not surprisingly, the edge coverage tests were at the high end with a score of 6.5. TestGen generated the fewest tests and came out as the second most efficient, at 5.2.
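As a worked example of this efficiency measure, using the totals reported in Table 7 and the 1,073 mutants in Table 3, the edge coverage and TestGen figures come out to approximately

    edge coverage: 0.663 x 1073 ≈ 711 mutants killed, and 711 / 109 tests ≈ 6.5
    TestGen:       0.288 x 1073 ≈ 309 mutants killed, and 309 / 59 tests  ≈ 5.2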

Table 4. Traditional Mutants Killed
Classes JCrasher TestGen JUB Edge Coverage Random
Tests % Killed Tests % Killed Tests % Killed Tests % Killed Tests % Killed
BoundedStack 21 22.8% 10 23.7% 11 17.4% 17 46.9% 21 22.3%
Inventory 20 68.3% 10 60.4% 11 68.3% 19 60.4% 20 65.3%
Node 16 50.0% 16 50.0% 9 33.3% 14 61.1% 16 50.0%
Queue 9 39.3% 5 34.2% 9 41.0% 10 51.3% 9 15.4%
Recipe 29 78.2% 10 70.3% 15 35.6% 22 75.2% 29 57.4%
TrashAndTakeOut 13 64.4% 2 30.8% 2 30.8% 5 70.2% 13 59.6%
Twelve 27 35.0% 1 0.0% 1 0.0% 12 86.3% 27 35.0%
VendingMachine 14 10.4% 5 9.1% 6 9.1% 10 75.3% 14 10.4%
Total 149 42.1% 59 28.0% 64 24.3% 109 66.2% 149 36.2%

Table 5. Class Mutants Killed


Classes JCrasher TestGen JUB Edge Coverage Random
Tests % Killed Tests % Killed Tests % Killed Tests % Killed Tests % Killed
BoundedStack 21 25.0% 10 25.0% 11 25.0% 17 75.0% 21 25.0%
Inventory 20 44.0% 10 36.0% 11 36.0% 19 76.0% 20 40.0%
Node 16 25.0% 16 25.0% 9 25.0% 14 25.0% 16 25.0%
Queue 9 33.3% 5 33.3% 9 33.3% 10 33.3% 9 16.7%
Recipe 29 73.1% 10 53.8% 15 11.0% 22 80.8% 29 38.5%
TrashAndTakeOut 13 N/A 2 N/A 2 N/A 5 N/A 13 N/A
Twelve 27 N/A 1 N/A 1 N/A 12 N/A 27 N/A
VendingMachine 14 0.0% 5 0.0% 6 0.0% 10 0.0% 14 0.0%
Total 149 46.4% 59 37.1% 64 25.6% 109 67.0% 149 34.0%

Table 6. Total Mutants Killed


Classes JCrasher TestGen JUB Edge Coverage Random
Tests % Killed Tests % Killed Tests % Killed Tests % Killed Tests % Killed
BoundedStack 21 22.8% 10 23.2% 11 17.5% 17 47.4% 21 22.4%
Inventory 20 60.3% 10 52.3% 11 57.6% 19 65.6% 20 57.0%
Node 16 45.5% 16 45.5% 9 31.8% 14 54.5% 16 45.5%
Queue 9 37.4% 5 32.5% 9 39.0% 10 48.8% 9 14.6%
Recipe 29 77.2% 10 66.9% 15 30.6% 22 76.4% 29 53.5%
TrashAndTakeOut 13 64.4% 2 30.8% 2 30.8% 5 70.2% 13 59.6%
Twelve 27 35.0% 1 0.0% 1 0.0% 12 86.3% 27 35.0%
VendingMachine 14 9.5% 5 8.3% 6 8.3% 10 69.0% 14 9.5%
Total 149 42.5% 59 28.8% 64 24.4% 109 66.3% 149 36.0%

JCrasher and the random tests were the least efficient; they generated a lot of tests without much obvious benefit, which adds a burden on the developers who must evaluate the results of each test.

Figure 3 illustrates the total percent of mutants killed by each test set in a bar chart. The difference between the edge coverage tests and the others is remarkable.

Tables 8 and 9 give a more detailed look at the scores for each mutation operator. The mutation operators are described on the muJava website [10]. No mutants were generated for five traditional operators (AODS, SOR, LOR, LOD, or ASRS) or for most of the class operators (IHI, IHD, IOP, IOR, ISI, ISD, IPC, PNC, PMD, PPD, PCI, PCC, PCD, PRV, OMR, OMD, OAN, EOA, or EOC), so they are not shown. The edge coverage tests had the highest scores for all mutation operators. None of the test sets did particularly well on the arithmetic operator mutants (operators whose names start with the letter 'A').

4 Analysis and Discussion

We have anecdotal evidence from in-class exercises that when hand crafting tests to kill mutants, it is trivially easy to kill between 40% and 50% of the mutants. This anecdotal note is confirmed by these data, where random values achieved an average mutation score of 36%. It was quite distressing to find that the three tools did little better than random testing. JCrasher was slightly better (a 6.5% higher total mutation score), TestGen was worse (a 7.2% lower mutation score), and JUB was worse still (an 11.6% lower mutation score).

An interesting observation from Table 6 is that the scores for VendingMachine are much lower for all sets of tests except edge coverage. The other four mutation scores are below 10%. The reason is probably the relative complexity of VendingMachine. It has several multi-clause predicates that determine most of its behavior:

    (coin!=10 && coin!=25 && coin!=100)
    (credit >= 90)
    (credit < 90 || stock.size() <= 0)
    (stock.size() >= MAX)

MuJava creates dozens of mutants on these predicates, and the mostly random values created by the three tools in this study have a small chance of killing those mutants. For example, a mutant that changes credit >= 90 to credit > 90 can only be killed by a test in which credit is exactly 90, a value that randomly chosen integers almost never supply.

Another interesting observation from Table 6 is that the scores for BoundedStack were the second lowest for all the test sets except edge coverage (for which it was the lowest). A difference in that class is that only two of the eleven methods have parameters. The three testing tools depend largely on the method signature, so fewer parameters may mean weaker tests.

Another finding is that JCrasher got the highest mutation score among the three tools. We examined the tests generated by JCrasher and concluded that this is because JCrasher uses invalid values to attempt to "crash" the class, as shown in Figure 4. JUB only generates tests that use 0 for integers and null for general subjects, and TestGen4J generates "normal" inputs, such as blanks and empty strings. JCrasher, of course, also created many more tests than the other two tools.

    public void test18() throws Throwable
    {
        try
        {
            String s1 =
                "˜!@$$%ˆ&*()_+{}|[]';:/.,<>?‘-=";
            Node n2 = new Node();
            n2.disallow (s1);
        }
        catch (Exception e)
        { dispatchException (e); }
    }

Figure 4. JCrasher Test

The measure of efficiency in Table 7 is a bit biased with mutation, as many mutants are very easy to kill. It is quite common for the first few tests to kill a lot of mutants and subsequent tests to kill fewer, leading to a sort of diminishing returns.

We separated the data for traditional and class mutants in Tables 4 and 5. JCrasher's tests had a slightly higher mutation score on the class mutants, and TestGen's, JUB's, and the random tests' scores were slightly lower. There was little difference in the mutation scores for the edge coverage tests. However, we are not able to draw any general conclusions from those data.

We also looked for a correlation between the number of tests for each class and the mutation score. With the JCrasher tests and the random tests, the largest number of tests (for Recipe in both cases) led to the highest mutation scores. However, the other three test sets showed no such correlation, and we see no correlation with the smallest numbers of tests. In fact, edge coverage produced the fewest tests for class Twelve (12) but had the highest mutation score (86.3%). Again, we can draw no general conclusions from these data.

5 Related Work

This research project was partially inspired by a paper by d'Amorim et al., which presented an empirical comparison of automated generation and classification techniques for object-oriented unit testing [4]. Their study compared pairs of test-generation techniques based on random generation or symbolic execution and test-classification techniques based on uncaught exceptions or operational models.

Table 7. Summary Data
Tool Tests % Killed (Traditional) % Killed (Class) % Killed (Total) Efficiency (Killed / Tests)
JCrasher 149 42.1% 46.4% 42.5% 3.1
TestGen 59 28.0% 37.1% 28.8% 5.2
JUB 64 24.3% 25.6% 24.4% 4.1
Edge Coverage 109 66.2% 67.0% 66.3% 6.5
Random 149 36.2% 34.0% 36.0% 2.6

Figure 3. Total Percent Mutants Killed by Each Test Set (bar chart: JCrasher 43%, TestGen 28%, JUB 24%, Edge Coverage 66%, Random 36%)

Specifically, they compared two tools that implement automated test generation: Eclat [15], which uses random generation, and their own tool Symclat, which uses symbolic generation. The tools also provide test classification based on an operational model and an uncaught exception model. The results showed that the two tools are complementary in revealing faults.

In a similar study on static analysis tools, Rutar et al. [17] compared five static analysis tools against five open source projects. Their results showed that none of the five tools strictly subsumes any of the others in terms of fault-finding capability. They proposed a meta-tool to combine and correlate the abilities of the five tools. Wagner et al. [21] presented a case study that applied three static fault-finding tools, as well as code review and manual testing, to several industrial projects. Their study showed that the static tools predominantly found different faults than manual testing, but a subset of the faults found by reviews. They proposed a combination of these three types of techniques.

An early research tool that implemented automated unit testing was Godzilla, which was part of the Mothra mutation suite of tools [5]. Godzilla used symbolic evaluation to automatically generate tests to kill mutants, and a later version incorporated dynamic symbolic evaluation and a dynamic domain reduction procedure to generate tests [12]. A more recent tool that uses very similar techniques is the Daikon invariant detector tool [6]. It augments the kind of symbolic evaluation that Godzilla used with program invariants, an innovation that makes the test generation process more efficient and scalable. Daikon analyzes values that a program computes when running and reports properties that were true over the observed executions. Eclat, a model-driven random testing tool, uses Daikon to dynamically infer an operational model consisting of a set of likely program invariants [15]. Eclat requires the classes to test plus an example of their use, such as a program that uses the classes or a small initial test suite. As Eclat's results may depend on the initial seeds, it was not directly comparable with the other tools in this study.

Another tool that is based on Daikon is Jov [22], which presents an operational violation approach for unit test generation and selection, a black-box approach that does not require specifications. The approach dynamically generates operational abstractions from executions of the existing unit test suite. These operational abstractions guide test generation tools to generate tests that violate them. The approach selects tests that violate operational abstractions for inspection.

Table 8. Mutation Scores per Traditional Mutation Operator
Operator Mutants JCrasher TestGen JUB Edge Coverage Random
AORB 56 32% 21% 30% 66% 29%
AORS 11 46% 27% 36% 55% 27%
AOIU 66 46% 32% 17% 79% 36%
AOIS 438 28% 24% 22% 53% 22%
AODU 1 100% 100% 100% 100% 100%
ROR 256 61% 25% 17% 79% 57%
COR 12 33% 25% 25% 58% 33%
COD 6 33% 33% 17% 50% 33%
COI 4 75% 75% 50% 75% 75%
LOI 126 53% 48% 44% 80% 48%
Total 976 42% 28% 24% 66% 36%

Table 9. Mutation Scores per Class Mutation Operator


Operator Mutants JCrasher TestGen JUB Edge Coverage Random
IOD 6 50% 50% 50% 50% 50%
JTI 20 95% 50% 25% 100% 55%
JTD 6 100% 100% 0% 100% 50%
JSI 13 0% 0% 0% 23% 0%
JSD 4 0% 0% 0% 50% 0%
JID 2 0% 0% 0% 0% 0%
JDC 6 83% 66% 83% 83% 67%
EAM 28 0% 0% 0% 57% 0%
EMM 12 100% 100% 100% 100% 100%
Total 97 46% 36% 26% 69% 34%

These tests exercise new behavior that had not yet been exercised by the existing tests. Jov integrates the use of Daikon and Parasoft Jtest [16] (a commercial Java testing tool). Agitar Test Runner is a commercial test tool that was partially based on Daikon and Godzilla, but was unfortunately not available for this study.

6 Conclusions

This paper compared three free, publicly accessible unit test tools on the basis of their fault finding abilities. Faults were seeded into Java classes with an automated mutation tool, and the tools' tests were compared with hand generated random tests and edge coverage tests.

Our findings are that these tools generate tests that are very poor at detecting faults. This can be viewed as a depressing comment on the state of practice. As users' expectations for reliable software continue to grow, and as agile processes and test driven development continue to gain acceptance throughout the industry, unit testing is becoming increasingly important. Unfortunately, software developers have few choices in high quality test data generation tools.

Whereas criteria-based testing has dominated the research community for more than two decades, industrial test data generators seldom, if ever, try to generate tests that satisfy test criteria. Tools that evaluate coverage are available, but they do not solve the hardest problem of generating test values. This study has led us to conclude that it is past time for criteria-based test data generation to migrate into tools that developers can use with minimal knowledge of software testing theory.

These tools were compared with only one test criterion, edge coverage on control flow graphs. This is widely known in the research community to be one of the simplest, cheapest, and least effective test criteria. Our anecdotal experience with manually killing mutants indicates that scores of around 40% are trivial to achieve and 70% is fairly easy to achieve with a small amount of hand analysis of the class's structure. This observation is supported by this study, in which random values reached the 40% level and edge coverage tests reached the 70% level. However, mutation scores of 80% to 90% are often quite hard to reach with hand-generated tests. This should be possible with more stringent criteria such as prime paths, all-uses, or logic-based coverage.

We have also observed that software developers have few educational opportunities to learn unit testing skills.

Despite the fact that testing consumes more than half of the software industry's resources, we are not aware of any universities in the USA that require undergraduate computer science students to take a software testing course. Very few universities do more than teach a lecture or two on testing in a general software engineering survey course, and the material that is presented is typically 20 years old. This study has led us to conclude that it is past time for universities to teach more knowledge of software testing to undergraduate computer science and software engineering students.

References

[1] Paul Ammann and Jeff Offutt. Introduction to Software Testing. Cambridge University Press, Cambridge, UK, 2008. ISBN 0-52188-038-1.

[2] AutomatedQA. TestComplete. Online, 2008. http://www.automatedqa.com/products/testcomplete/, last access December 2008.

[3] Christoph Csallner and Yannis Smaragdakis. JCrasher: An automatic robustness tester for Java. Software: Practice and Experience, 34:1025-1050, 2004.

[4] Marcelo d'Amorim, Carlos Pacheco, Tao Xie, Darko Marinov, and Michael D. Ernst. An empirical comparison of automated generation and classification techniques for object-oriented unit testing. In Proceedings of the 21st International Conference on Automated Software Engineering (ASE 2006), pages 59-68, Tokyo, Japan, September 2006. ACM / IEEE Computer Society Press.

[5] Richard A. DeMillo and Jeff Offutt. Constraint-based automatic test data generation. IEEE Transactions on Software Engineering, 17(9):900-910, September 1991.

[6] Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Transactions on Software Engineering, 27(2):99-123, February 2001.

[7] Erich Gamma, Richard Helm, Ralph Johnson, and John M. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995. ISBN 0-201-63361-2.

[8] Mats Grindal, Jeff Offutt, and Jonas Mellin. On the testing maturity of software producing organizations. In Testing: Academia & Industry Conference - Practice And Research Techniques (TAIC PART 2006), Windsor, UK, August 2006. IEEE Computer Society Press.

[9] Yu-Seung Ma, Jeff Offutt, and Yong-Rae Kwon. MuJava: An automated class mutation system. Software Testing, Verification, and Reliability, 15(2):97-133, June 2005.

[10] Yu-Seung Ma, Jeff Offutt, and Yong-Rae Kwon. muJava home page. Online, 2005. http://cs.gmu.edu/~offutt/mujava/, http://salmosa.kaist.ac.kr/LAB/MuJava/, last access December 2008.

[11] Manish Maratmu. TestGen4J. Online, 2005. http://developer.spikesource.com/wiki/index.php/Projects:testgen4j, last access December 2008.

[12] Jeff Offutt, Zhenyi Jin, and Jie Pan. The dynamic domain reduction approach to test data generation. Software: Practice and Experience, 29(2):167-193, January 1999.

[13] Jeff Offutt, Ammei Lee, Gregg Rothermel, Roland Untch, and Christian Zapf. An experimental determination of sufficient mutation operators. ACM Transactions on Software Engineering and Methodology, 5(2):99-118, April 1996.

[14] Carlos Pacheco and Michael D. Ernst. Eclat tutorial. Online. http://groups.csail.mit.edu/pag/eclat/manual/tutorial.php, last access December 2008.

[15] Carlos Pacheco and Michael D. Ernst. Eclat: Automatic generation and classification of test inputs. In 19th European Conference on Object-Oriented Programming (ECOOP 2005), Glasgow, Scotland, July 2005.

[16] Parasoft. Jtest. Online, 2008. http://www.parasoft.com/jsp/products/home.jsp?product=Jtest, last access December 2008.

[17] Nick Rutar, Christian B. Almazan, and Jeffrey S. Foster. A comparison of bug finding tools for Java. In Proceedings of the 15th International Symposium on Software Reliability Engineering, pages 245-256, Saint-Malo, Bretagne, France, November 2004. IEEE Computer Society Press.

[18] Ben Smith and Laurie Williams. Killing mutants with MuClipse. Online, 2008. http://agile.csc.ncsu.edu/SEMaterials/tutorials/muclipse/, last access December 2008.

[19] Mark Tyborowski. JUB (JUnit test case Builder). Online, 2002. http://jub.sourceforge.net/, last access December 2008.

[20] Sami Vaaraniemi. The benefits of automated unit testing. Online, 2003. http://www.codeproject.com/KB/architecture/onunittesting.aspx, last access December 2008.

[21] S. Wagner, J. Jurjens, C. Koller, and P. Trischberger. Comparing bug finding tools with reviews and tests. In 17th IFIP TC6/WG 6.1 International Conference on Testing of Communicating Systems, pages 40-55, May 2005.

[22] Tao Xie and David Notkin. Tool-assisted unit-test generation and selection based on operational abstractions. Automated Software Engineering Journal, 13(3):345-371, July 2006.

