[Figure 2. Fault-finding effectiveness (%) versus oracle size for the Random, Output-Base, Mutation-Based, and Idealized Mutation-Based selection methods, with the all-outputs oracle and the 5% and 2.5% estimates marked for reference. Panels: (a) DWM 1, Branch Inputs; (b) DWM 1, MC/DC Inputs; (c) DWM 2, Branch Inputs; (d) DWM 2, MC/DC Inputs; (e) Latctl Batch, Branch Inputs; (f) Latctl Batch, MC/DC Inputs; (g) Vertmax Batch, Branch Inputs; (h) Vertmax Batch, MC/DC Inputs.]
Table III
MEDIAN RELATIVE IMPROVEMENT USING MUTATION-BASED SELECTION OVER OUTPUT-BASE SELECTION
The gap between our approach and the idealized mutation-based approach in realistic scenarios is less than the gap between the output-base approach and our approach. Thus we can conclude that while there is clearly room for improvement in oracle data selection methods, our approach appears to often be quite effective in terms of absolute performance.
D. Impact of Coverage Criteria (Q3)
Our final question concerns the impact of varying the coverage criteria—and thus the test inputs—on the relative effectiveness of oracle selection. In this study, we have used two coverage criteria of varying strength. Intuitively, it seems likely that when using stronger test suites (those satisfying MC/DC in this study), the potential for improving the testing process via oracle selection would be less, as the test inputs should do a better job of exercising the code.
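For intuition on the strength difference between the two criteria, consider the small illustrative decision below (our own example, not drawn from the case systems). Branch coverage is satisfied by two tests that drive the whole decision true and false; MC/DC additionally requires each condition to be shown to independently affect the outcome:

    #include <stdbool.h>

    /* Decision under test: (a && b) || c */
    bool decision(bool a, bool b, bool c) {
        return (a && b) || c;
    }

    /*
     * Branch coverage: two tests suffice, e.g.
     *   (a,b,c) = (1,1,0) -> true   and   (0,0,0) -> false
     *
     * MC/DC: each condition must toggle the outcome with the others fixed:
     *   a: (1,1,0)=true vs (0,1,0)=false
     *   b: (1,1,0)=true vs (1,0,0)=false
     *   c: (0,1,1)=true vs (0,1,0)=false
     * giving a minimal set of four tests:
     *   (1,1,0), (0,1,0), (1,0,0), (0,1,1)
     */

MC/DC thus forces more of the decision's internal logic to be exercised, which is why one might expect less headroom for oracle selection under MC/DC suites.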
Precisely quantifying this likelihood is difficult; however, as shown in Figure 2, for each case example the general relationship seems (to our surprise) to be roughly the same. For example, for the DWM 1 system, we can see that despite the overall higher levels of fault finding when using the MC/DC test suites, the general relationships between the output-base baseline approach, our approach, and the idealized approach remain similar. We see a rapid increase in effectiveness for small oracles, followed by a decrease in the relative improvement of our approach versus the output-base baseline as we approach oracles of size 10 (corresponding to an output-only oracle), followed by a gradual increase in said relative improvement. In some cases relative improvements are higher for branch coverage (Latctl Batch) and in others they are higher for MC/DC (Vertmax Batch).

These results indicate that, perhaps more than the test inputs used, characteristics of the system under test are the primary determinant of the relative effectiveness of our approach. Unfortunately, it is unclear exactly what these characteristics are. Testers can, of course, estimate the effectiveness of applying our approach by automating our study, applying essentially the same approach used to provide the user a suggested oracle size⁹. However, this is potentially expensive and provides no insight concerning why our approach is very good (relative to baseline approaches, and in an absolute sense) for some systems and merely equivalent for others. Developing metrics that can allow us to a priori estimate the effectiveness of our approach is an area for potential future work.

⁹ Indeed, automating our study to estimate the effectiveness of our approach is akin to applying mutation testing to estimate test input/oracle data effectiveness.

VI. THREATS TO VALIDITY

External Validity: Our study is limited to four synchronous reactive critical systems. Nevertheless, we believe these systems are representative of the class of systems in which we are interested, and our results are therefore generalizable to other systems in the domain.

We have used Lustre [14] as our implementation language rather than a more common language such as C or C++. However, as noted previously, systems written in Lustre are similar in style to traditional imperative code produced by code generators used in embedded systems development. A simple syntactic transformation suffices to translate Lustre code into the C code that would generally be used.
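To illustrate the kind of translation involved (an illustrative sketch of ours, not code from the case examples), a small Lustre node compiles naturally to a C step function called once per cycle, with the value carried by pre() kept in a state struct between calls; the Lustre source appears in the comment:

    /* Lustre source (illustrative):
     *
     *   node count_resets(reset : bool) returns (count : int);
     *   let
     *     count = if reset then 0 else (0 -> pre(count) + 1);
     *   tel
     */
    #include <stdbool.h>

    typedef struct {
        bool init;       /* true until the first cycle has executed */
        int  pre_count;  /* pre(count): count's value from the previous cycle */
    } count_resets_state;

    int count_resets_step(count_resets_state *s, bool reset) {
        /* count = if reset then 0 else (0 -> pre(count) + 1) */
        int count = reset ? 0 : (s->init ? 0 : s->pre_count + 1);
        s->pre_count = count;
        s->init = false;
        return count;
    }

A caller initializes the state once (init = true) and then invokes the step function on every input cycle, preserving the node's synchronous semantics in imperative code.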
In our study, we have used test cases generated to satisfy two structural coverage criteria. These criteria were selected as they are particularly relevant to the domain of interest. However, there exist many methods of generating test inputs, and it is possible that tests generated according to some other methodology would yield different results. For example, requirements-based black-box tests may be effective at propagating errors to the output, reducing the potential improvement from considering internal state variables. We therefore avoid generalizing our results to other methods of test input selection, and intend to study how our approach generalizes to these methods in future work.
Table IV
MEDIAN RELATIVE IMPROVEMENT USING IDEALIZED MUTATION-BASED SELECTION OVER REALISTIC MUTATION-BASED SELECTION

                            Branch                                      MC/DC
Oracle Size   DWM 1   DWM 2   Vertmax Batch   Latctl Batch   DWM 1   DWM 2   Vertmax Batch   Latctl Batch
     1         3.7%  101.6%       3.6%            4.7%        2.5%   68.0%       2.9%            0.6%
     2         8.0%  152.1%      19.5%            8.2%        2.9%   76.9%       6.4%            1.1%
     3        12.5%  156.4%      15.4%           15.0%        5.9%   94.9%       6.2%            3.0%
     4        14.6%  158.6%      16.1%           13.7%        7.1%   96.0%       7.4%            5.0%
     5        16.2%  152.1%      21.0%           13.6%        7.5%   94.6%       7.1%            5.5%
     6        17.5%  146.9%      21.9%           12.4%        8.3%   93.5%       8.3%            6.4%
     7        19.7%  140.4%      20.9%           17.7%        9.1%   94.3%       9.2%            7.0%
     8        22.3%  139.3%      21.7%           20.9%        8.3%   85.7%      10.2%            7.9%
     9        22.5%  140.8%      19.6%           14.2%        8.8%   84.2%       9.8%            8.8%
    10        20.2%  143.5%      20.2%           16.1%        9.5%   80.8%      10.5%            8.9%
    11        21.0%  147.9%        –               –          9.8%   82.5%        –               –
    12        18.6%  144.2%        –               –          9.8%   81.0%        –               –
    13        19.3%  144.2%        –               –          9.8%   83.9%        –               –
    14        17.0%  139.3%        –               –          9.8%   83.0%        –               –
    15        17.3%     –          –               –          9.7%     –          –               –
    16        17.3%     –          –               –          9.5%     –          –               –
    17        18.0%     –          –               –          9.7%     –          –               –
    18        19.0%     –          –               –         10.1%     –          –               –
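The excerpt does not define the statistic behind these entries, so the following is our reading rather than the paper's stated formula: with E_a and E_b the median fault-finding effectiveness of the two selection methods being compared at a given oracle size,

    \[ \text{relative improvement} = \frac{E_a - E_b}{E_b} \times 100\% \]

Under this reading, an entry of 101.6% means the first method's median fault finding is roughly double the second's; the '–' entries presumably mark oracle sizes beyond those considered for that system.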
The tests used are effectively unit tests for a Lustre module, and the behavior of oracles may change when using tests designed for large system integration. Given that we are interested in testing systems at this level, we believe our experiment represents a sensible testing approach and is applicable to practitioners.
We have generated approximately 250 mutants for each case example, with 125 mutants used for training sets and up to 125 mutants used for evaluation. These values are chosen to yield a reasonable cost for the study. It is possible the number of mutants is too low. Nevertheless, based on past experience, we have found results using less than 250 mutants to be representative [22], [23]. Furthermore, pilot studies showed that the results change little when using more than 100 mutants for training or evaluation sets.
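To make the evaluation half of this protocol concrete (a minimal sketch under assumed data structures; eval_kills, the constants, and the function name are ours, not the paper's), an oracle's fault-finding effectiveness is the percentage of evaluation mutants that at least one selected variable exposes:

    #include <stdbool.h>

    #define NUM_VARS 8    /* candidate oracle variables */
    #define NUM_EVAL 125  /* evaluation mutants, disjoint from the training set */

    /* eval_kills[v][m]: true if checking variable v against its expected
     * values distinguishes evaluation mutant m from the original program
     * on some test input. */
    double fault_finding_effectiveness(bool eval_kills[NUM_VARS][NUM_EVAL],
                                       const int chosen[], int num_chosen) {
        int detected = 0;
        for (int m = 0; m < NUM_EVAL; m++) {
            /* a mutant counts as found if any selected variable exposes it */
            for (int k = 0; k < num_chosen; k++) {
                if (eval_kills[chosen[k]][m]) { detected++; break; }
            }
        }
        return 100.0 * detected / NUM_EVAL;
    }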
Construct Validity: In our study, we measure the fault finding of oracles and test suites over seeded faults, rather than real faults encountered during development of the software. Given that our approach to selecting oracle data is also based on mutation testing, it is possible that using real faults would lead to different results. This is especially likely if the fault model used in mutation testing is significantly different from the faults we encounter in practice. Nevertheless, as mentioned earlier, Andrews et al. have shown that the use of seeded faults leads to conclusions similar to those obtained using real faults in similar fault finding experiments [1].
VII. CONCLUSION AND FUTURE WORK

In this study, we have explored a mutation-based method for supporting oracle creation. Our approach automates the selection of oracle data, a key component of expected value test oracles. Our results indicate that our approach is successful with respect to alternative approaches for selecting oracle data, with improvements up to 145.8% over output-base oracle data selection and improvements in the 10%-30% range relatively common. In cases where our approach is not more effective, it appears to be comparable to the output-base approach. We have also found that our approach performs within an acceptable range from the theoretical maximum.
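The selection step itself is not reproduced in this excerpt; one natural realization, sketched below under assumed data structures (the kill matrix and all names are ours), is a greedy cover of the training mutants in the spirit of Chvatal's set-covering heuristic [5]: each step adds the variable whose expected values would expose the most not-yet-exposed mutants.

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_VARS    8   /* candidate oracle variables (outputs + internal state) */
    #define NUM_MUTANTS 16  /* mutants in the training set */

    /* kills[v][m]: true if checking variable v against its expected values
     * distinguishes training mutant m from the original program on some test. */
    static bool kills[NUM_VARS][NUM_MUTANTS];

    /* Greedily pick `size` variables, each step taking the variable that
     * exposes the most not-yet-exposed training mutants. */
    static void select_oracle_data(int size, int chosen[]) {
        bool exposed[NUM_MUTANTS] = { false };
        bool used[NUM_VARS] = { false };

        for (int k = 0; k < size; k++) {
            int best = 0, best_gain = -1;
            for (int v = 0; v < NUM_VARS; v++) {
                if (used[v]) continue;
                int gain = 0;
                for (int m = 0; m < NUM_MUTANTS; m++)
                    if (kills[v][m] && !exposed[m]) gain++;
                if (gain > best_gain) { best_gain = gain; best = v; }
            }
            used[best] = true;
            chosen[k] = best;
            for (int m = 0; m < NUM_MUTANTS; m++)
                if (kills[best][m]) exposed[m] = true;
        }
    }

    int main(void) {
        /* toy kill matrix standing in for real mutation-analysis results */
        for (int v = 0; v < NUM_VARS; v++)
            for (int m = 0; m < NUM_MUTANTS; m++)
                kills[v][m] = (m % (v + 2)) == 0;

        int chosen[4];
        select_oracle_data(4, chosen);
        printf("selected oracle data (variable indices):");
        for (int k = 0; k < 4; k++) printf(" %d", chosen[k]);
        printf("\n");
        return 0;
    }

Ranking variables this way also yields a natural oracle-size suggestion: one can stop adding variables once the marginal number of newly exposed training mutants falls below a chosen threshold.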
While our results are encouraging, and demonstrate that our approach can be effective using real-world avionics systems, there are a number of questions that we would like to explore in the future, including:

• Estimating Training Set Size: Our evaluation indicates that for our case examples, a reasonable number of mutants are required to produce effective oracle data sets. However, when applying our approach to very large systems, or when using a much larger number of test inputs, we may wish to estimate the needed training set size.

• Oracle Data Selection Without Test Inputs: In our approach, we assume the tester has already generated a set of test inputs before generating oracle data. We may wish to generate oracle data without knowing the test inputs that will ultimately be used. Is oracle data generated with one set of test inputs effective for other, different test inputs?

VIII. ACKNOWLEDGEMENTS

This work has been partially supported by NASA Ames Research Center Cooperative Agreement NNA06CB21A; NSF grants CCF-0916583, CNS-0931931, and CNS-1035715; an NSF graduate research fellowship; and the WCU (World Class University) program under the National Research Foundation of Korea, funded by the Ministry of Education, Science and Technology of Korea (Project No: R31-30007).
REFERENCES

[1] J. Andrews, L. Briand, and Y. Labiche. Is mutation an appropriate tool for testing experiments? In Proc. of the 27th Int'l Conf. on Software Engineering (ICSE), pages 402–411, 2005.

[2] L. Baresi and M. Young. Test oracles. Technical Report CIS-TR-01-02, Dept. of Computer and Information Science, Univ. of Oregon.

[3] L. Briand, M. DiPenta, and Y. Labiche. Assessing and improving state-based class testing: A series of experiments. IEEE Trans. on Software Engineering, 30(11), 2004.

[4] J. Chilenski. An investigation of three forms of the modified condition decision coverage (MCDC) criterion. Technical Report DOT/FAA/AR-01/18, Office of Aviation Research, Washington, D.C., April 2001.

[5] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.

[6] T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 2001.

[7] R. DeMillo, R. Lipton, and F. Sayward. Hints on test data selection: Help for the practicing programmer. IEEE Computer, 11(4):34–41, 1978.

[8] R. Evans and A. Savoia. Differential testing: a new approach to change detection. In Proc. of the 6th Joint European Software Engineering Conf. and Foundations of Software Engineering, pages 549–552. ACM, 2007.

[9] R. Fisher. The Design of Experiments. New York: Hafner, 1935.

[10] G. Fraser and A. Zeller. Mutation-driven generation of unit tests and oracles. In Proc. of the 19th Int'l Symp. on Software Testing and Analysis, pages 147–158. ACM, 2010.

[11] G. Fraser and A. Zeller. Generating parameterized unit tests. In Proc. of the 20th Int'l Symp. on Software Testing and Analysis, pages 147–158. ACM, 2011.

[12] M. Garey and D. Johnson. Computers and Intractability. Freeman, New York, 1979.

[13] A. Gargantini and C. Heitmeyer. Using model checking to generate tests from requirements specifications. Software Engineering Notes, 24(6):146–162, November 1999.

[14] N. Halbwachs. Synchronous Programming of Reactive Systems. Kluwer Academic Press, 1993.

[15] M. P. Heimdahl, G. Devaraj, and R. J. Weber. Specification test coverage adequacy criteria = specification test generation inadequacy criteria? In Proc. of the Eighth IEEE Int'l Symp. on High Assurance Systems Engineering (HASE), Tampa, Florida, March 2004.

[16] Y. Jia and M. Harman. An analysis and survey of the development of mutation testing. IEEE Transactions on Software Engineering, (99):1, 2010.

[17] P. Kvam and B. Vidakovic. Nonparametric Statistics with Applications to Science and Engineering. Wiley-Interscience, 2007.

[18] Mathworks Inc. Simulink product web site. http://www.mathworks.com/products/simulink.

[19] The NuSMV Toolset, 2005. Available at http://nusmv.irst.itc.it/.

[20] C. Pacheco and M. Ernst. Eclat: Automatic generation and classification of test inputs. ECOOP 2005-Object-Oriented Programming, pages 504–527, 2005.

[21] A. Rajan, M. Whalen, and M. Heimdahl. The effect of program and model structure on MC/DC test adequacy coverage. In Proc. of the 30th Int'l Conference on Software Engineering (ICSE), 2008. Available at http://crisys.cs.umn.edu/ICSE08.pdf.

[22] A. Rajan, M. Whalen, and M. Heimdahl. The effect of program and model structure on MC/DC test adequacy coverage. In Proc. of the 30th Int'l Conference on Software Engineering, pages 161–170. ACM New York, NY, USA, 2008.

[23] A. Rajan, M. Whalen, M. Staats, and M. Heimdahl. Requirements coverage as an adequacy measure for conformance testing. In Proc. of the 10th Int'l Conf. on Formal Methods and Software Engineering, pages 86–104. Springer, 2008.

[24] S. Rayadurgam and M. P. Heimdahl. Coverage based test-case generation using model checkers. In Proc. of the 8th IEEE Int'l Conf. and Workshop on the Engineering of Computer Based Systems, pages 83–91. IEEE Computer Society, April 2001.

[25] D. J. Richardson, S. L. Aha, and T. O'Malley. Specification-based test oracles for reactive systems. In Proc. of the 14th Int'l Conference on Software Engineering, pages 105–118. Springer, May 1992.

[26] RTCA. DO-178B: Software Considerations in Airborne Systems and Equipment Certification. RTCA, 1992.

[27] M. Staats, M. Whalen, and M. Heimdahl. New ideas and emerging results track: Better testing through oracle selection. In Proc. of the NIER Track, Int'l Conf. on Software Engineering (ICSE) 2011. ACM New York, NY, USA, 2011.

[28] M. Staats, M. Whalen, and M. Heimdahl. Programs, tests, and oracles: The foundations of testing revisited. In Proc. of the Int'l Conf. on Software Engineering (ICSE) 2011. ACM New York, NY, USA, 2011.

[29] K. Taneja and T. Xie. DiffGen: Automated regression unit-test generation. In Automated Software Engineering, 2008, pages 407–410. IEEE, 2008.

[30] C. Van Eijk. Sequential equivalence checking based on structural similarities. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 19(7):814–819, 2002.

[31] J. Voas and K. Miller. Putting assertions in their place. In Proc. of the 5th Int'l Symposium on Software Reliability Engineering, pages 152–157, 1994.

[32] Q. Xie and A. Memon. Designing and comparing automated test oracles for GUI-based software applications. ACM Trans. on Software Engineering and Methodology (TOSEM), 16(1):4, 2007.