Table 1 summarizes the three tools examined in this study. Each tool is described in detail below.

JCrasher [3] is an automatic robustness testing tool for Java classes. JCrasher examines the type information of methods in Java classes and constructs code fragments that will create instances of different types to test the behavior of the public methods with random data. JCrasher explicitly attempts to detect bugs by causing the class under test to crash, that is, to throw an undeclared runtime exception. Although limited by the randomness of the input values, this approach has the advantage of being completely automatic: no inputs are required from the developer.

TestGen4J [11] automatically generates JUnit test cases from Java class files. Its primary focus is boundary value testing of the arguments passed to methods. It uses rules, written in a user-configurable XML file, that define boundary conditions for the data types. The test code is separated from the test data with the help of JTestCase (http://jtestcase.sourceforge.net/).

JUB (JUnit test case Builder) [19] is a JUnit test case generator framework accompanied by a number of IDE-specific extensions. These extensions (tools, plug-ins, etc.) are invoked from within the IDE and must store generated test case code inside the source code repository administered by the IDE.

2.2 Additional Test Sets

As a control comparison, we generated two additional sets of tests for each class by hand with some limited tool support. Testing with random values is widely considered …

2.3 Java Classes Tested

Table 2 lists the Java classes used in this experiment. BoundedStack is a small, fixed-size implementation of a stack from the Eclat tool's website [14]. Inventory is taken from the MuClipse [18] project, the Eclipse plug-in version of muJava. Node is a mutable set of Strings that is a small part of a publish/subscribe system; it was a sample solution from a graduate class at George Mason University. Queue is a mutable, bounded FIFO data structure of fixed size. Recipe is also taken from the MuClipse project; it is a JavaBean class that represents a real-world recipe. Twelve is another sample solution; it tries to combine three given integers with arithmetic operators to compute exactly twelve. VendingMachine is from Ammann and Offutt's book [1]; it models a simple vending machine for chocolate candy.

Table 2. Subject Classes Used

Name              LOC   Methods
BoundedStack       85     11
Inventory          67     11
Node               77      9
Queue              59      6
Recipe             74     15
TrashAndTakeOut    26      2
Twelve             94      1
VendingMachine     52      6
Total             534     61
Table 1. Automated Unit Testing Tools

Name        Version              Inputs       Interface
JCrasher    0.1.9 (2004)         Source File  Eclipse Plug-in
TestGen4J   0.1.4-alpha (2005)   Jar File     Command Line (Linux)
JUB         0.1.2 (2002)         Source File  Eclipse Plug-in
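To make JCrasher's crash-detection strategy concrete, here is a minimal sketch of the idea, not JCrasher's actual implementation; the Fraction class and the small value pool are hypothetical. A test "fails" only when an undeclared RuntimeException escapes a call made with random data:

```java
import java.util.Random;

public class CrashSketch {
    // Hypothetical stand-in for a class under test.
    static class Fraction {
        private final int num, den;
        Fraction(int num, int den) { this.num = num; this.den = den; }
        int intValue() { return num / den; }   // crashes when den == 0
    }

    // JCrasher-style check: a test "fails" only if an undeclared
    // RuntimeException (here ArithmeticException) escapes the call.
    static boolean crashes(int num, int den) {
        try {
            new Fraction(num, den).intValue();
            return false;                      // normal completion: no bug found
        } catch (RuntimeException e) {
            return true;                       // crash: candidate robustness fault
        }
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        int found = 0;
        for (int i = 0; i < 100; i++) {
            // Random data drawn from a small pool, as a randomness-based tool might do.
            if (crashes(rnd.nextInt(3) - 1, rnd.nextInt(3) - 1)) found++;
        }
        System.out.println("crashing inputs found: " + found);
    }
}
```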
2.4 MuJava
Our primary measurement of the test sets in this experiment is their ability to find faults. MuJava is used to seed faults (mutants) into the classes and to evaluate how many mutants each test set kills.

MuJava [9, 10] is a mutation system for Java classes. It automatically generates mutants for both traditional mutation testing and class-level mutation testing, and it can test individual classes as well as packages of multiple classes. Tests are supplied by the users as sequences of method calls to the classes under test, encapsulated in methods in separate classes.

MuJava creates mutants for Java classes according to 24 operators, including object-oriented operators. Method-level (traditional) mutants are based on the selective operator set by Offutt et al. [13]. After creating mutants, muJava allows the tester to enter and run tests, and evaluates the mutation coverage of the tests. In muJava, tests for the class under test are encoded in separate classes that make calls to methods in the class under test. Mutants are created and executed automatically.

Table 3. Classes and Mutants

                            Mutants
Classes           Traditional   Class   Total
BoundedStack          224          4     228
Inventory             101         50     151
Node                   18          4      22
Queue                 117          6     123
Recipe                101         26     127
TrashAndTakeOut       104          0     104
Twelve                234          0     234
VendingMachine         77          7      84
Total                 976         97    1073

2.5 Experimental Conduct

Figure 1 illustrates the experimental process. The Java classes are represented by the leftmost box, P. Each of the three automated tools (JCrasher, TestGen4J, JUB) was used to create a set of tests (Test Set JC, ...). Then we manually created the random tests, and then the edge coverage tests.

MuJava was designed and developed before JUnit and other widely used test harness tools, so it has its own syntactic requirements. Each muJava test must be in a public method that returns a String that encapsulates the result of the test. Thus many of the tests from the three tools had to be modified to run with muJava. This modification is illustrated in Figure 2.

Then muJava was used to generate mutants for each subject class, and all five test sets were run against the mutants. This resulted in five mutation scores for each subject class (40 mutation scores in all). Because muJava separates scores for the traditional mutation operators from the class mutation operators, we also kept these scores separate.

2.6 Threats to Validity

As with any study that involves specific programs or classes, there is no guarantee that the classes used are "representative" of the general population. As yet, nobody has developed a general theory for how to choose representative classes for empirical studies, or how many classes we may need. This general issue has a negative impact on our ability to apply statistical analysis tools and reduces the external validity of many software engineering empirical studies, including this one.

Another question is whether the three tools used are representative. We searched for free, publicly available tools that generate tests and were somewhat surprised at how few we found. We initially hoped to use an advanced tool called Agitar Test Runner; however, it is no longer available to academic users.

Another possible problem with internal validity lies in the manual steps that had to be applied. Tests from the three tools had to be translated to muJava tests. These changes were only to the structure of the test methods and did not affect the values, so it is unlikely that these translations affected the results. The random and edge coverage tests were generated by hand with some tool support; they were generated without knowledge of the mutants, so as to avoid any bias.
[Figure 1. The experimental process: the subject classes P are input to each generation method (JCrasher, TestGen4J, JUB, manual random, and edge coverage), yielding one test set per method (Test Set JC, Test Set TG, Test Set JUB, Test Set Ran, ...); each test set is run against the muJava mutants to produce a mutation score.]
Figure 2. Modifying a tool-generated test to run under muJava.

Before:

    public void test242() throws Throwable
    {
      try {
        int i1 = 1;
        int i2 = 1;
        int i3 = -1;
        Twelve.twelve (i1, i2, i3);
      }
      catch (Exception e)
      { dispatchException (e); }
    }

After:

    public String test242() throws Throwable
    {
      try {
        int i1 = 1;
        int i2 = 1;
        int i3 = -1;
        return Twelve.twelve (i1, i2, i3);
      } catch (Exception e)
      { return e.toString(); }
    }
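Once tests have this String-returning form, the mutation analysis of Section 2.4 reduces to checking whether a test's result differs between the original class and a mutant. The sketch below is not muJava's implementation; the method and the hand-written mutant are hypothetical, with the mutant mimicking a traditional arithmetic-operator (AORB-style) change of + to -:

```java
public class MutantSketch {
    // Original method under test.
    static int add(int a, int b) { return a + b; }

    // Hand-written stand-in for an AORB mutant: '+' replaced by '-'.
    static int addMutant(int a, int b) { return a - b; }

    // A test "kills" the mutant if its inputs distinguish original from mutant.
    static boolean kills(int a, int b) {
        return add(a, b) != addMutant(a, b);
    }

    public static void main(String[] args) {
        System.out.println(kills(2, 3));  // true: 5 != -1, mutant killed
        System.out.println(kills(2, 0));  // false: 2 == 2, mutant survives
    }
}
```

The mutation score of a test set is simply the percentage of mutants killed by at least one of its tests.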
3 Results

The number of mutants for each class is shown in Table 3. MuJava generates traditional (method-level) and class mutants separately. The traditional operators focus on individual statements, and the class operators focus on connections between classes, especially in inheritance hierarchies. Node looks anomalous because of its small number of mutants: it has only 18 traditional mutants over 77 lines of code. However, the statements are distributed over nine, mostly very small, methods. Most methods are very short (four have only one statement), they use no arithmetic, shift, or logical operators, they have no assignments, and they contain only a few decision statements. Thus, Node has few locations where muJava's mutation operators could be applied.

Results from running the five sets of tests under muJava are shown in Tables 4 through 7. Table 4 shows the number of tests in each test set and the mutation scores on the traditional mutants, in terms of the percentage of mutants killed. The "Total" row gives the sum of the tests for the subject classes and the mutation score across all subject classes. As can be seen, the random tests did better than the TestGen and JUB tests, but not as well as the JCrasher tests. The edge coverage tests, however, kill 24% more mutants than the strongest tool, with fewer tests.

Table 5 shows the same data for the class-level mutants. There is more diversity in the scores from the three tools, and JCrasher killed 12% more class-level mutants than the random tests did. However, the edge coverage tests are still far stronger, killing 21% more mutants than the strongest-performing tool (JCrasher).

Table 6 combines the scores from Tables 4 and 5 for both traditional and class-level mutants. Again, the JUB tests are the weakest; JCrasher, TestGen, and the random tests are fairly close; and the edge coverage tests kill far more mutants, 24% more than the JCrasher tests.

Table 7 summarizes the data over all subject classes for each set of tests. Because the number of tests diverged widely, we asked the question "how efficient is each test generation method?" To approximate efficiency, we computed the number of mutants killed per test. Not surprisingly, the edge coverage tests were at the high end with a score of 6.5. TestGen generated the fewest tests and came out as the second most efficient, at 5.2.
Table 4. Traditional Mutants Killed

                  JCrasher         TestGen          JUB              Edge Coverage    Random
Classes           Tests  % Killed  Tests  % Killed  Tests  % Killed  Tests  % Killed  Tests  % Killed
BoundedStack        21     22.8%     10     23.7%     11     17.4%     17     46.9%     21     22.3%
Inventory           20     68.3%     10     60.4%     11     68.3%     19     60.4%     20     65.3%
Node                16     50.0%     16     50.0%      9     33.3%     14     61.1%     16     50.0%
Queue                9     39.3%      5     34.2%      9     41.0%     10     51.3%      9     15.4%
Recipe              29     78.2%     10     70.3%     15     35.6%     22     75.2%     29     57.4%
TrashAndTakeOut     13     64.4%      2     30.8%      2     30.8%      5     70.2%     13     59.6%
Twelve              27     35.0%      1      0.0%      1      0.0%     12     86.3%     27     35.0%
VendingMachine      14     10.4%      5      9.1%      6      9.1%     10     75.3%     14     10.4%
Total              149     42.1%     59     28.0%     64     24.3%    109     66.2%    149     36.2%
JCrasher and the random tests were the least efficient; they generated a lot of tests without much obvious benefit, which adds a burden on the developers who must evaluate the results of each test.

Figure 3 illustrates the total percent of mutants killed by each test set in a bar chart. The difference between the edge coverage tests and the others is remarkable.

Tables 8 and 9 give a more detailed look at the scores for each mutation operator. The mutation operators are described on the muJava website [10]. No mutants were generated for five traditional operators (AODS, SOR, LOR, LOD, or ASRS) or for most of the class operators (IHI, IHD, IOP, IOR, ISI, ISD, IPC, PNC, PMD, PPD, PCI, PCC, PCD, PRV, OMR, OMD, OAN, EOA, or EOC), so they are not shown. The edge coverage tests had the highest scores for all mutation operators. None of the test sets did particularly well on the arithmetic operator mutants (operators whose names start with the letter 'A').

… generated by JCrasher, and concluded that this is because JCrasher uses invalid values to attempt to "crash" the class, as shown in Figure 4. JUB only generates tests that use 0 for integers and null for objects, and TestGen4J generates "normal" inputs, such as blanks and empty strings. JCrasher, of course, also created many more tests than the other two tools.

Figure 4. A JCrasher-generated test with invalid input values:

    public void test18() throws Throwable
    {
      try
      {
        String s1 = "~!@$$%^&*()_+{}|[]';:/.,<>?`-=";
        Node n2 = new Node();
        n2.disallow (s1);
      }
      catch (Exception e)
      { dispatchException (e); }
    }
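The poor showing on arithmetic and relational operator mutants has a mechanical explanation: such a mutant is killed only by inputs on which the original and mutated expressions differ, and the small "default" values the tools favor often make the two coincide. A hypothetical sketch (not from the study's subject classes) with an ROR-style mutant that changes a < b to a <= b, which only the boundary input a == b can kill:

```java
public class RorSketch {
    static boolean less(int a, int b)       { return a < b;  }  // original
    static boolean lessMutant(int a, int b) { return a <= b; }  // ROR-style mutant

    // The mutant is killed only on inputs where the two versions differ,
    // which is exactly when a == b.
    static boolean kills(int a, int b) {
        return less(a, b) != lessMutant(a, b);
    }

    public static void main(String[] args) {
        System.out.println(kills(0, 1)); // false: typical small "default" values miss it
        System.out.println(kills(1, 1)); // true: only the boundary value kills it
    }
}
```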
Table 7. Summary Data

                              % Killed                 Efficiency
Tool            Tests  Traditional  Class   Total   (Killed / Tests)
JCrasher         149      42.1%     46.4%   42.5%        3.1
TestGen           59      28.0%     37.1%   28.8%        5.2
JUB               64      24.3%     25.6%   24.4%        4.1
Edge Coverage    109      66.2%     67.0%   66.3%        6.5
Random           149      36.2%     34.0%   36.0%        2.6
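The efficiency column in Table 7 can be recomputed from the kill percentages, the 1073-mutant total of Table 3, and the test counts; a quick arithmetic check (a sketch, not part of the study's tooling):

```java
public class EfficiencySketch {
    static final int TOTAL_MUTANTS = 1073;  // from Table 3

    // Mutants killed per test, rounded to one decimal place as in Table 7.
    static double efficiency(double pctKilled, int tests) {
        double killed = TOTAL_MUTANTS * pctKilled / 100.0;
        return Math.round(killed / tests * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        System.out.println(efficiency(42.5, 149)); // JCrasher: 3.1
        System.out.println(efficiency(66.3, 109)); // Edge coverage: 6.5
        System.out.println(efficiency(36.0, 149)); // Random: 2.6
    }
}
```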
[Figure 3. Percent of mutants killed by each test set, as a bar chart: JCrasher 43%, TestGen 28%, JUB 24%, Edge Coverage 66%, Random 36%.]
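The edge coverage criterion used for the strongest hand-generated tests simply requires that every edge of each method's control flow graph be executed by at least one test. For a method with a single if-else, two tests suffice; a minimal sketch (the method is hypothetical, not one of the subject classes):

```java
public class EdgeCoverageSketch {
    // A method whose control flow graph has two edges out of the decision node.
    static String classify(int balance) {
        if (balance < 0) {
            return "overdrawn";   // true edge
        } else {
            return "ok";          // false edge
        }
    }

    public static void main(String[] args) {
        // Two tests achieve edge coverage: one per branch.
        System.out.println(classify(-5)); // covers the true edge
        System.out.println(classify(10)); // covers the false edge
    }
}
```

Choosing the input values for each edge by hand (here -5 and 10) is exactly the "hardest problem of generating test values" that the conclusions note coverage tools do not solve.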
…techniques based on uncaught exceptions or operational models. Specifically, they compared two tools that implement automated test generation: Eclat [15], which uses random generation, and their own tool, Symclat, which uses symbolic generation. The tools also provide test classification based on an operational model and an uncaught exception model. The results showed that the two tools are complementary in revealing faults.

In a similar study on static analysis test tools, Rutar et al. [17] compared five static analysis test tools on five open source projects. Their results showed that none of the five tools strictly subsumes any of the others in fault-finding capability. They proposed a meta-tool to combine and correlate the abilities of these five tools. Wagner et al. [21] presented a case study that applied three static fault-finding tools, as well as code review and manual testing, to several industrial projects. Their study showed that the static tools predominantly found faults different from those found by manual testing, but a subset of the faults found by reviews. They proposed a combination of these three types of techniques.

An early research tool that implemented automated unit testing was Godzilla, which was part of the Mothra mutation suite of tools [5]. Godzilla used symbolic evaluation to automatically generate tests to kill mutants, and a later version incorporated dynamic symbolic evaluation and a dynamic domain reduction procedure to generate tests [12]. A more recent tool that uses very similar techniques is the Daikon invariant detector [6]. It augments the kind of symbolic evaluation that Godzilla used with program invariants, an innovation that makes the test generation process more efficient and scalable. Daikon analyzes values that a program computes when running and reports properties that were true over the observed executions. Eclat, a model-driven random testing tool, uses Daikon to dynamically infer an operational model consisting of a set of likely program invariants [15]. Eclat requires classes to test plus an example of their use, such as a program that uses the classes or a small initial test suite. As Eclat's result may depend on the initial seeds, it was not directly comparable with the other tools in this study.

Another tool that is based on Daikon is Jov [22], which presents an operational violation approach for unit test generation and selection, a black-box approach that does not require specifications. The approach dynamically generates operational abstractions from executions of the existing unit test suite. These operational abstractions guide test generation tools to generate tests that violate them. The approach selects tests that violate operational abstractions for inspection. These tests exercise new behavior that had not yet been exercised by the existing tests. Jov integrates the use of Daikon and Parasoft Jtest [16] (a commercial Java testing tool). Agitar Test Runner is a commercial test tool that was partially based on Daikon and Godzilla, but was unfortunately not available for this study.

Table 8. Mutation Scores per Traditional Mutation Operator

Traditional Operator  Mutants  JCrasher  TestGen   JUB   Edge Coverage  Random
AORB                     56       32%      21%     30%        66%         29%
AORS                     11       46%      27%     36%        55%         27%
AOIU                     66       46%      32%     17%        79%         36%
AOIS                    438       28%      24%     22%        53%         22%
AODU                      1      100%     100%    100%       100%        100%
ROR                     256       61%      25%     17%        79%         57%
COR                      12       33%      25%     25%        58%         33%
COD                       6       33%      33%     17%        50%         33%
COI                       4       75%      75%     50%        75%         75%
LOI                     126       53%      48%     44%        80%         48%
Total                   976       42%      28%     24%        66%         36%

6 Conclusions

This paper compared three free, publicly accessible unit test tools on the basis of their fault-finding abilities. Faults were seeded into Java classes with an automated mutation tool, and the tools' tests were compared with hand-generated random tests and edge coverage tests.

Our finding is that these tools generate tests that are very poor at detecting faults. This can be viewed as a depressing comment on the state of practice. As users' expectations for reliable software continue to grow, and as agile processes and test-driven development continue to gain acceptance throughout the industry, unit testing is becoming increasingly important. Unfortunately, software developers have few choices in high-quality test data generation tools.

Whereas criteria-based testing has dominated the research community for more than two decades, industrial test data generators seldom, if ever, try to generate tests that satisfy test criteria. Tools that evaluate coverage are available, but they do not solve the hardest problem of generating test values. This study has led us to conclude that it is past time for criteria-based test data generation to migrate into tools that developers can use with minimal knowledge of software testing theory.

These tools were compared with only one test criterion, edge coverage on control flow graphs. This is widely known in the research community to be one of the simplest, cheapest, and least effective test criteria. Our anecdotal experience with manually killing mutants indicates that scores of around 40% are trivial to achieve, and 70% is fairly easy to achieve with a small amount of hand analysis of the class's structure. This observation is supported by this study, in which random values reached the 40% level and edge coverage tests reached the 70% level. However, mutation scores of 80% to 90% are often quite hard to reach with hand-generated tests. This should be possible with more stringent criteria such as prime paths, all-uses, or logic-based coverage.

We have also observed that software developers have few
educational opportunities to learn unit testing skills. Despite the fact that testing consumes more than half of the software industry's resources, we are not aware of any universities in the USA that require undergraduate computer science students to take a software testing course. Very few universities do more than teach a lecture or two on testing in a general software engineering survey, and the material that is presented is typically 20 years old. This study has led us to conclude that it is past time for universities to teach more about software testing to undergraduate computer science and software engineering students.

References

[1] Paul Ammann and Jeff Offutt. Introduction to Software Testing. Cambridge University Press, Cambridge, UK, 2008. ISBN 0-52188-038-1.

[2] AutomatedQA. TestComplete. Online, 2008. http://www.automatedqa.com/products/testcomplete/, last access December 2008.

[3] Christoph Csallner and Yannis Smaragdakis. JCrasher: An automatic robustness tester for Java. Software: Practice and Experience, 34:1025–1050, 2004.

[4] Marcelo d'Amorim, Carlos Pacheco, Tao Xie, Darko Marinov, and Michael D. Ernst. An empirical comparison of automated generation and classification techniques for object-oriented unit testing. In Proceedings of the 21st International Conference on Automated Software Engineering (ASE 2006), pages 59–68, Tokyo, Japan, September 2006. ACM / IEEE Computer Society Press.

[5] Richard A. DeMillo and Jeff Offutt. Constraint-based automatic test data generation. IEEE Transactions on Software Engineering, 17(9):900–910, September 1991.

[6] Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Transactions on Software Engineering, 27(2):99–123, February 2001.

[7] Erich Gamma, Richard Helm, Ralph Johnson, and John M. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995. ISBN 0-201-63361-2.

[8] Mats Grindal, Jeff Offutt, and Jonas Mellin. On the testing maturity of software producing organizations. In Testing: Academia & Industry Conference – Practice And Research Techniques (TAIC PART 2006), Windsor, UK, August 2006. IEEE Computer Society Press.

[9] Yu-Seung Ma, Jeff Offutt, and Yong-Rae Kwon. MuJava: An automated class mutation system. Software Testing, Verification, and Reliability, 15(2):97–133, June 2005.

[10] Yu-Seung Ma, Jeff Offutt, and Yong-Rae Kwon. muJava home page. Online, 2005. http://cs.gmu.edu/~offutt/mujava/, http://salmosa.kaist.ac.kr/LAB/MuJava/, last access December 2008.

[11] Manish Maratmu. TestGen4J. Online, 2005. http://developer.spikesource.com/wiki/index.php/Projects:testgen4j, last access December 2008.

[12] Jeff Offutt, Zhenyi Jin, and Jie Pan. The dynamic domain reduction approach to test data generation. Software: Practice and Experience, 29(2):167–193, January 1999.

[13] Jeff Offutt, Ammei Lee, Gregg Rothermel, Roland Untch, and Christian Zapf. An experimental determination of sufficient mutation operators. ACM Transactions on Software Engineering and Methodology, 5(2):99–118, April 1996.

[14] Carlos Pacheco and Michael D. Ernst. Eclat tutorial. Online. http://groups.csail.mit.edu/pag/eclat/manual/tutorial.php, last access December 2008.

[15] Carlos Pacheco and Michael D. Ernst. Eclat: Automatic generation and classification of test inputs. In 19th European Conference on Object-Oriented Programming (ECOOP 2005), Glasgow, Scotland, July 2005.

[16] Parasoft. Jtest. Online, 2008. http://www.parasoft.com/jsp/products/home.jsp?product=Jtest, last access December 2008.

[17] Nick Rutar, Christian B. Almazan, and Jeffrey S. Foster. A comparison of bug finding tools for Java. In Proceedings of the 15th International Symposium on Software Reliability Engineering, pages 245–256, Saint-Malo, Bretagne, France, November 2004. IEEE Computer Society Press.

[18] Ben Smith and Laurie Williams. Killing mutants with MuClipse. Online, 2008. http://agile.csc.ncsu.edu/SEMaterials/tutorials/muclipse/, last access December 2008.

[19] Mark Tyborowski. JUB (JUnit test case Builder). Online, 2002. http://jub.sourceforge.net/, last access December 2008.

[20] Sami Vaaraniemi. The benefits of automated unit testing. Online, 2003. http://www.codeproject.com/KB/architecture/onunittesting.aspx, last access December 2008.

[21] S. Wagner, J. Jurjens, C. Koller, and P. Trischberger. Comparing bug finding tools with reviews and tests. In 17th IFIP TC6/WG 6.1 International Conference on Testing of Communicating Systems, pages 40–55, May 2005.

[22] Tao Xie and David Notkin. Tool-assisted unit-test generation and selection based on operational abstractions. Automated Software Engineering Journal, 13(3):345–371, July 2006.