Professional Documents
Culture Documents
a r t i c l e i n f o a b s t r a c t
Article history: Search-based software testing is the application of metaheuristic search techniques to generate software
Received 7 May 2008 tests. The test adequacy criterion is transformed into a fitness function and a set of solutions in the search
Received in revised form 16 December 2008 space are evaluated with respect to the fitness function using a metaheuristic search technique. The
Accepted 18 December 2008
application of metaheuristic search techniques for testing is promising due to the fact that exhaustive
Available online 4 February 2009
testing is infeasible considering the size and complexity of software under test. Search-based software
testing has been applied across the spectrum of test case design methods; this includes white-box (struc-
Keywords:
tural), black-box (functional) and grey-box (combination of structural and functional) testing. In addition,
Systematic review
Non-functional system properties
metaheuristic search techniques have also been applied to test non-functional properties. The overall
Search-based software testing objective of undertaking this systematic review is to examine existing work into non-functional
search-based software testing (NFSBST). We are interested in types of non-functional testing targeted
using metaheuristic search techniques, different fitness functions used in different types of search-based
non-functional testing and challenges in the application of these techniques. The systematic review is
based on a comprehensive set of 35 articles obtained after a multi-stage selection process and have been
published in the time span 1996–2007. The results of the review show that metaheuristic search tech-
niques have been applied for non-functional testing of execution time, quality of service, security, usabil-
ity and safety. A variety of metaheuristic search techniques are found to be applicable for non-functional
testing including simulated annealing, tabu search, genetic algorithms, ant colony methods, grammatical
evolution, genetic programming (and its variants including linear genetic programming) and swarm
intelligence methods. The review reports on different fitness functions used to guide the search for each
of the categories of execution time, safety, usability, quality of service and security; along with a discus-
sion of possible challenges in the application of metaheuristic search techniques.
Ó 2009 Elsevier B.V. All rights reserved.
Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 958
2. Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 958
2.1. Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 958
2.2. Generation of search strategy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 958
2.3. Study selection criteria and procedures for including and excluding primary studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 959
2.4. Study quality assessment and data extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 960
3. Results and synthesis of findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 961
3.1. Execution time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 961
3.2. Quality of service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 963
3.3. Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 964
3.4. Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 965
3.5. Safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 967
4. Discussion and areas for future research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 968
5. Validity threats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973
6. Conclusions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 973
* Corresponding author. Tel.: +46 457 385840; fax: +46 457 27914.
E-mail addresses: wasif.afzal@bth.se (W. Afzal), richard.torkar@bth.se (R. Torkar), robert.feldt@bth.se (R. Feldt).
0950-5849/$ - see front matter Ó 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.infsof.2008.12.005
958 W. Afzal et al. / Information and Software Technology 51 (2009) 957–976
Acknowledgement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 974
Appendix A. Search strings and study quality assessment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 974
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 974
Grey-box testing includes assertion testing and exception exclusion criteria independently. Out of 367 references, the three
condition testing [3]. This exclusion criterion is relaxed to researchers were in agreement on 229 references to exclude, 25
include those studies where a structural test criterion is used to include and 113 required a meeting to reach consensus. In the
to test non-functional properties, e.g. [15]. meeting, the researcher in the minority for a paper tried to con-
– Are not related to the testing of the end product, e.g. [16]. vince others; otherwise the majority decision was taken. This
– Are related to test planning, e.g. [17]. application of detailed exclusion criteria resulted in 60 remaining
– Make use of model checking and formal methods, e.g. references, which were further filtered out by reading full-text. A
[18,19]. final figure of 24 primary studies was reached after excluding sim-
– Report performance of a particular metaheuristic instead of ilar studies that were published in different venues. The 24 pri-
its application to software testing, e.g. [20]. mary studies were complemented with 11 more papers from
– Report on test case prioritization, e.g. [21]. phase 2 of the search strategy (Fig. 1). The fact that we gathered
– Are used for prediction and estimation of software proper- 11 papers from phase 2 of the search strategy indicates that mak-
ties, e.g. [22]. ing a generic search string that would give the entire relevant set of
primary studies from searching only within electronic databases is
The first phase of research resulted in a total of 501 papers. difficult in the field of study under investigation. The terminologies
After eliminating duplicates found by more than one electronic used by various authors differed a lot; both in terms of specifying
database, we were left with 404 papers. Table 1 shows the distribu- the non-functional property and the used metaheuristic. As an
tion of papers before duplicate removal among different sources. example, if we consider the primary study [23], identified using
The exclusion was done using a tollgate approach (Fig. 2). To be- phase 2, we observed that although it does mention using genetic
gin with, a single researcher excluded 37 references out of a total of programming in the title, it does not mention the target non-func-
404 primarily based on title and abstract, which were clearly out of tional property of security. Similarly, in the abstract and key words,
scope and did not relate to the research question. The remaining we do not find words synonymous to testing. Therefore, we believe
367 references were subject to detailed exclusion criteria, which that the phase 2 of the search strategy helped us to gather a more
involved three researchers. First, each researcher applied the representative set of primary studies.
Fig. 2. Multi-step filtering of studies (tollgate approach) and final number of primary studies.
W. Afzal et al. / Information and Software Technology 51 (2009) 957–976 961
not to rank studies according to an overall quality score) but used a Table 2
binary ‘yes’ or ‘no’ scale [24]. Table 15 in Appendix A shows the Distribution of primary studies per non-functional area.
p
application of the study quality assessment criteria where a ( ) Non-functional property Author(s) Year References
indicates ‘yes’ and () indicates ‘no’. Most of our developed criteria Execution time (42.86%) Wegener et al. 1996 [25]
were fulfilled by all of the studies, exceptions being Alander et al. 1997 [26]
[34,35,12,15,54] where evidence of a comparison group was miss- Wegener et al. 1997 [27]
ing, but these quality differences were not found to be largely con- Wegener et al. 1998 [10]
O’Sullivan et al. 1998 [28]
founded with study outcomes. What follows next is the list of Tracey et al. 1998 [29]
developed criteria: Mueller et al. 1998 [30]
Puschner et al. 1998 [31]
– Is the reader able to understand the aims of the research? Pohlheim et al. 1999 [32]
Wegener et al. 2000 [33]
– Is the context of study clearly stated, that includes popula-
Groß et al. 2000 [34]
tion being studied (e.g. academic vs. industrial) and tasks Groß 2001 [35]
to be performed by population (e.g. small scale vs. large Groß 2003 [36]
scale) Briand et al. 2005 [12]
– Was there a comparison or control group? Tlili et al. 2006 [11]
– Are the measures used in the study fully defined [5]? Quality of service (5.71%) Canfora et al. 2005 [37]
– Is there an adequate description of the data collection Di Penta et al. 2007 [38]
15, 54 15,
52 Safety 52 53
53 54
Quality of 37,
37 38 service
38
26,12,
25 26, 10,28, 33, 12 11
Execution 30-36,28,
35 36 29
27 29,30, 32 34
time 11,25,10,
31 27
1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 GA, SA GE LGP GA, TS,SA, SA, TS HC,SA,AC, GA
SA PSO GA HC TS GA
Fig. 3. Distribution of NFSBST research over range of applied metaheuristics and time period (adapted from [55]).
execution times. The fitness function used was the execution time the design of an operational experimental environment for evolu-
of an individual measured in micro seconds. The experimental re- tionary testing with the integration of a commercially available
sults using a simple C function showed that the longest execution timing package.
time of 26.27 ls was found very quickly with GA in less than 20 The aforementioned experimental results identified several lim-
generations. Moreover, a new shortest execution time of 5.27 ls, itations when using evolutionary testing. The instrumentation re-
which was not discovered by statistical and systematic testing, quired for the software under test (SUT), which extends the
was found. This study marks one of the earliest use of GA to test executable programming code by inserting hardware dependent
temporal correctness of real-time systems. counting instructions, is bound to affect the execution times. Also
Alander et al. [26] presented experiments performed in a simu- as a typical search strategy, it is difficult to ensure that the execu-
lator environment to measure response time extremes of protec- tion times generated in the experiments represents global opti-
tion relay software using genetic algorithms. The fitness function mum. More experimentation is also required to determine the
used was the response time of the tested software. The results most appropriate and robust parameters. Lastly, there is a need
showed that GA generated more input cases with longer response for an adequate termination criterion to stop the search process.
times. In Wegener et al. [27], experiments were performed using In Sullivan et al. [28], cluster analysis was used on the popula-
genetic algorithms involving five test objects from different appli- tion from the most recent generation to determine if GA should
cation domains having varying lines of code and integer input terminate. Execution time measured in processor cycles was used
parameters. This time the fitness function used was the execution as a fitness function and a complex algorithm from the domain
time measured in terms of processor cycles rather than seconds. of automotive electronics was used for seven runs of GA. Cluster
The results show that GA consistently outperformed random test- analysis was performed on the final population (generation 399)
ing by finding more extreme times. of each run. With cluster analysis, it was possible to examine
The research community soon realized the benefits of measure- which of the test runs converged to local optima and thus contin-
ment in terms of processor cycles; being more precise and inde- uing with these runs would not yield better results. The results of
pendent of the interrupts from the operating system (e.g. context the study demonstrated the potential of incorporating cluster anal-
switching and paging). Also measurement in terms of processor cy- ysis as a useful termination criterion for evolutionary tests and
cles is deterministic in the sense that it is independent of system suggested appropriate changes in the search strategy to include
load and results in the same execution times for the same set of in- cluster analysis information.
put parameters. However, such a measurement is dependent on Tracey et al. [29] applied simulated annealing (SA) to test four
the compiler and optimizer used, therefore, the processor cycles simple programs for WCET as part of a generalized test data gener-
differ for each platform [27]. ation framework. The fitness function used is the measure of actual
Wegener and Grochtmann [10] continued further with experi- dynamic execution time. The WCET of the programs was already
mentation to compare evolutionary testing (using genetic algo- known and a valid test case was one that exercised a path yielding
rithms) with random testing. The fitness function used was the already known WCET. The results of the experiment showed
duration of execution measured in processor cycles. This time that the use of SA was more effective with larger parameter space.
the range of input parameters was raised to 5000. The results The authors highlighted the need of a detailed comparison of var-
showed that, with a large number of input parameters, evolution- ious optimization techniques to explore WCET and BCET of the
ary testing obtained more extreme execution times with less or SUT. With this goal in mind, Pohlheim and Wegener [32] used an
equal testing effort than random testing. In order to better evaluate extension of genetic algorithms making use of multiple sub-popu-
the application of evolutionary testing, Groß et al. [56] presented lations, each using a different search strategy. The authors name
W. Afzal et al. / Information and Software Technology 51 (2009) 957–976 963
the approach as extended evolutionary algorithms. The duration of situations. The individual measures were combined to form a pre-
execution measured in processor cycles was taken as the fitness diction system that successfully forecasted evolutionary testability
function. The extended evolutionary algorithm was applied on with 90% accuracy.
two test objects. The first test object was the bubble sort algorithm Results from the two studies [34,35] also confirmed that there is
and the results from this experiment was used to find appropriate a relationship between the complexity of a test object and the abil-
evolutionary parameters for the second test object which con- ity of evolutionary algorithm to produce input parameters accord-
tained software modules from a motor control project. The evolu- ing to B/WCET. The results also confirmed the properties (given
tionary algorithm found longer execution times for all the given above) of the test programs that caused most problems for evolu-
modules in comparison with systematic testing. tionary testing. Due to program properties inhibiting evolutionary
As mentioned earlier, it is difficult to ensure that the execution testability [36], pointed out that an ideal testing strategy is a com-
times generated in the experiments represent global optimum. bination of evolutionary testing supported by human knowledge of
Therefore it appears interesting to compare the results of evolu- the test object. The initial population of individuals can benefit from
tionary testing with static analysis to find a bound within which human knowledge to direct the search in those areas of search
WCET and BCET might lie. Mueller et al. [30] presented such a space that are difficult to reach as the fitness function does not pro-
comparison. Both approaches were used in five experiments to vide information to generate such unlikely input combinations.
determine the WCET and BCET of different programs. Three pro- In one of the more recent studies, Tlili et al. [11] used the ap-
grams were from real-time systems while the remaining two proach of seeding an evolutionary algorithm with test data achiev-
were general-purpose algorithms. The fitness function used is ing a high structural coverage and reduction in the amount of
the execution time measured in processor cycles. The results search space by restricting the range of input variables in the initial
showed that methods of static analysis and evolutionary testing population. The fitness function was the measurement of the exe-
bound the actual execution times. For WCET, the estimates of sta- cution time of test data as number of CPU clock ticks. The results
tic analysis provided an upper bound while the measurements of indicated that for almost all the test objects, application of seeding
evolutionary testing yielded a lower bound. Conversely, static and range restriction outperform standard evolutionary real-time
analysis’ estimates provided a lower bound for BCET while evolu- testing with random initial population when measuring long exe-
tionary testing measurements constituted an upper bound. In Pus- cution times. Also with seeding and range restriction, fewer gener-
chner and Nossal [31], genetic algorithms were applied to find ations found the longest execution times. Similar results were
WCET for seven programs and the results were compared with achieved for finding the shortest execution times.
those from random search, static analysis and best effort timings In Briand et al. [12,57], another approach to use genetic algo-
that were researchers’ own efforts to find input data to yield rithms for critical deadlines misses was used. The authors called
WCET. The execution time measured in processor cycles as well the approach stress testing because the system was exercised in
as time units were used as a fitness function for different pro- such a way that some tasks were close to missing a deadline. This
grams. Genetic algorithms found same or longer times than ran- approach can also be called robustness testing but since the basic
dom search. In comparison with best effort timings, genetic objective of the paper was to find the sequence of arrival times
algorithms matched the timings and found a longer time in one of events for aperiodic tasks, which will cause the greatest delays
case, while in comparison with static analysis, the upper bounds in the execution of the target task, we choose to discuss this paper
were not broken but were matched on several occasions. In an- under execution time. The study was restricted to seeding times
other study, Wegener et al. [33] used genetic algorithms to test for aperiodic tasks and the tasks synchronization, since input data
temporal behavior of six time critical tasks in an engine control were accounted for in execution time estimates. In comparison
system, with the fitness function used was execution time mea- with other approaches to evolutionary testing for finding the
sured in processor cycles. Genetic algorithms outperformed both WCET, this approach was different in the sense that test data de-
random search and developer-made tests. sign did not require the implementation of the system and, sec-
In Groß [36], 15 example test programs were used in experi- ondly, did not consider the tasks in isolation. Genetic algorithms
ments to measure the maximal execution times using genetic algo- were used to search for the sequence of arrival times of events
rithms. The fitness function used was the execution time of the test for aperiodic tasks that would cause the greatest delays in execu-
object for a particular input situation measured in microseconds. tion of the target task. The fitness function was expressed in an
The results of evolutionary testing were compared with random exponential form, based on the difference between the deadline
testing and with the performance of an experienced human tester. of an execution and the executions actual completion. A prototype
The results indicated that evolutionary testing outperforms ran- tool called real-time test tool (RTTT) was built to facilitate the exe-
dom testing as random testing could only produce about 85% of cution of runs of genetic algorithm. Two case studies were con-
the maximum execution times found by evolutionary testing. The ducted; one case study consisted of researchers own scenarios
human tester was more successful in 4 out of 15 test programs, while the second consisted of an actual real-time system. The re-
which indicated the presence of properties of test objects that in- sults from the timing diagrams illustrated that RTTT was a useful
hibit evolutionary testability [35] i.e. the ability of an evolutionary tool to stress the system more than the scenarios covered by
algorithm to successfully generate test cases that violates the tim- schedulability theory.
ing specification. A summary of results of applying metaheuristics for testing
Groß et al. [34] presented a prediction model based on com- temporal properties is given in Table 3.
plexity of the test object, which can be used to predict evolutionary
testability. It was found that there were several properties inhibit-
ing evolutionary testability, which included small path domains, 3.2. Quality of service
high-data dependence, large input vectors, and nesting.
Additionally, several source code measures, which map pro- Search-based testing of Quality of Service (QoS) represents a
gram attributes inhibiting evolutionary testability, were also pre- mix of search-based software engineering and service-oriented
sented in Groß [35]. Code annotations were inserted into the test software engineering. Metaheuristic search techniques have been
object’s source code along their shortest and longest algorithmic used for quality of service aware composition and violation of ser-
execution paths. The fitness function was then based on maximal vice level agreements (SLAs) between the integrator and the end
and minimal possible annotation coverage by the generated input user.
964 W. Afzal et al. / Information and Software Technology 51 (2009) 957–976
Table 3
Summary of results applying metaheuristics for testing temporal properties. The last column on the right covers any issues such as constraints, limitations and highlights. (GA is
short for Genetic Algorithm, SA is short for Simulated Annealing while EGA is short for Extended GA.)
In Canfora et al. [37], genetic algorithms were used to determine netic algorithms were used to generate test cases that covered the
the set of service concretizations (i.e. bindings between abstract path and violated the SLA. The fitness function combined a dis-
and concrete services) that lead to QoS constraint satisfaction while tance-based fitness that rewards solutions close to QoS constraint
optimizing the objective function. An abstract service is the feature violation, with a fitness guiding the coverage of target statements.
required in a service orchestration while concrete services repre- The two fitness factors were dynamically weighted. The approach
sent functionally equivalent services realizing the required feature. was applied to two case studies. The first case study was an audio
In a contract with potential users, the service provider can estimate processing workflow containing invocations to four abstract ser-
ranges for the QoS attributes as part of service level agreement vices. The second case study, a service producing charts, applied
(SLA), i.e. a contract between an integrator and end-user for a given the black-box approach with fitness calculated only on the basis
QoS level. QoS attributes consist of non-functional properties such of how close solutions violate QoS constraint. In case of audio
as cost, response time and availability so the fitness function opti- workflow, the genetic algorithm using the proposed fitness func-
mized the QoS attribute chosen by the service integrator. The QoS tion, which combined distance-based fitness with coverage of tar-
attributes of composite services were determined using rules with get statements, outperformed random search. For the service
aggregation function for each workflow construct. The fitness func- producing charts, use of black-box approach successfully violated
tion was designed in a way to maximize some QoS attributes (e.g. the response time constraint, showing the violation of QoS con-
reliability and availability) while minimizing others (e.g. cost and straints for a real service available on the Internet.
response time). Based on the type of penalty factor (static vs. dy- A summary of results of applying metaheuristics for QoS-aware
namic), static fitness function and dynamic fitness functions were composition and violation of SLA is given in Table 4.
proposed. With experiments on 18 invocations of 8 distinct abstract
services, the performance of genetic algorithms was compared with 3.3. Security
integer programming. The results showed that when the number of
concrete services is small, integer programming outperformed ge- A variety of metaheuristic search techniques have been applied
netic algorithms. But as the number of concrete services increased, to detect security vulnerabilities like detecting buffer overflows;
genetic algorithm was able to keep its time performance while inte- including grammatical evolution, linear genetic programming, ge-
ger programming grew exponentially. netic algorithm and particle swarm optimization.
Di Penta et al. [38] used genetic algorithms to generate test data In Kayacik et al. [40], grammatical evolution (GE) was used to
that violated QoS constraints causing SLA violations. The generated discover the characteristics of a successful buffer overflow. The
test data included combinations of inputs and bindings for the ser- example vulnerable application in this case performed a data copy
vice-oriented system. The test data generation process was com- without checking the internal buffer size.
posed of two steps. In the first step, the risky paths for a The exploit was represented by a sample C program that approx-
particular QoS attribute were identified and in the second step, ge- imated the desired return address and assembled the malicious buf-
W. Afzal et al. / Information and Software Technology 51 (2009) 957–976 965
Table 4
Summary of results applying metaheuristics for QoS-aware composition and violation of SLA. The last column on the right covers any issues such as constraints, limitations and
highlights. (GA is short for Genetic Algorithm.)
fer exploit. The malicious buffer contained a shell code, representing 35 days of simulated network traffic. The results showed that ge-
attacker’s arbitrary code, that overwrites the return address to gain netic algorithms outperform all of the swarms with respect to the
control. number of distinct vulnerabilities discovered.
For the attack to be successful, it was important to jump to the Budynek et al. [41] modeled the hacker behavior along with the
first instruction of the shell code or to the sequence of no operation creation of a hacker grammar for exploring hacker scripts using ge-
instructions called NoOP sled. The fitness function represented six netic algorithms. A hacker script contained sequences of UNIX
characteristics of malicious buffer which included existence of the commands issued by the hacker upon logging in to the system.
shell-code, success of the attack, NoOP sled score, back-to-back de- One script was one individual with a single UNIX command acting
sired return addresses, desired return address accuracy and score as a gene. A fitness function was defined based on the efficiency
calculated on NoOP sled size. Three sets of experiments were per- and effectiveness of the hacking scripts i.e. the script fitness value
formed, namely basic grammatical evolution (GE), GE with niching was calculated by number of goals achieved, number of pieces of
(to include population diversity) and GE with niching and NoOP min- evidence discovered by the log analyzer, number of bad commands
imization. The results found were comparable. In Kayacik et al. [43], used by the hacker and the length of the script used by the hacker.
the vulnerable system call was taken to be the UNIX execve com- The results of experiments showed various top-scoring scripts ob-
mand and linear GP was used to evolve variants of an attack. The tained from various runs.
UNIX execve command required the registers EAX, EBX, ECX, EDX Grosso et al. [42] used static analysis and program slicing to
to be correctly configured and the stack to contain the program name identify vulnerable statements and their relationships, that were
to be executed. The fitness function returned a maximum fitness of further explored using genetic algorithms for buffer overflows.
10 if all conditions were satisfied. The experimental results indicated Three different fitness functions were compared. The first one (vul-
that evolved attacks discovered different ways of attaining sub-goals nerable coverage fitness) included weighted values for statement
associated with building buffer overflow attacks and expanding the coverage, vulnerable statement coverage and number of execu-
instruction set provided better results as compared to basic GP. tions of vulnerable statements. The second fitness function (nest-
Kayacik et al. [23] used linear GP for automatic generation of ing fitness) incorporated observed maximum nesting level
mimicry attacks to perform evasion of intrusion detection system corresponding to the current test case while the third fitness func-
(IDS), which in this case was an open source target anomaly detec- tion (buffer boundary fitness) included a term accounting for the
tor called stride. The candidate mimicry attacks were in the form distance from the buffer boundaries. Two programs were used
of system call sequences. The system call sequences consisted of for experimentation. The case in which the expert’s knowledge
most frequently executed instructions from the vulnerable appli- was used to define the initial search space and also for the case
cation, which in this case is traceroute, a tool used to determine having random initial population showed that buffer boundary fit-
the route taken by packets across an IP network. ness outperformed both the vulnerable coverage and the nesting
An acceptable anomaly rate was established for stride. The fitness. This showed that fitness functions using distance from
objective of the attacker therefore was to reduce the anomaly rate the limit of buffers were helpful for deriving genetic algorithm evo-
below this acceptable limit. The study described an attack denoted lution. In DelGrosso et al. [44], the authors improved on the previ-
by successful completion of three steps i.e. open a UNIX password ous basic boundary fitness [42] to propose a dynamic weight
file, write a line and close the file. The fitness function rewarded at- fitness in which the genetic algorithm weights were calculated
tacks that successfully followed the steps and at the same time min- by solving a maximization problem via linear programming. So
imized the anomaly rate. The results showed that the approach was with weights that could be tuned at each genetic algorithm gener-
able to reduce the anomaly rate to 2.97% for the entire attack. ation, fast discovery of buffer overflows could be achieved. The dy-
In Dozier et al. [39], security vulnerabilities in an artificial im- namic weight fitness outperformed the previous basic boundary
mune system (AIS) based IDS were identified using genetic algo- fitness on experiments with two different sets of C applications.
rithms and particle swarm optimization. The study used A summary of results of applying metaheuristics for detecting
GENERITA Red Team (GRT), a system based on evolutionary algo- security vulnerabilities is given in Table 5.
rithms, which performed the vulnerability analysis of the IDS. The
AIS-based IDS communicates with the GRT by receiving red packets 3.4. Usability
in the form of attacks and returns the percentage of the detector set
that failed to detect the red packet. This percentage was the fitness Usability testing in the context of application of metaheuristics
of the red packet. The packets took the form of triplets (ip_ad- is concerned with construction of covering array which is a combi-
dress, port, src) while the AIS maintained a population of detec- natorial object.
tors with different ranges of IP addresses and ports. Matching rules The user is involved in numerous interactions taking place
were applied to match the data triple and a detector. The GRTs used through the user interface. With the number of different features
consisted of a steady-state GA and six variants of particle swarm available and their respective levels, the interactions cannot be
optimization. Experiments were performed using data representing tested exhaustively due to a combinatorial explosion. Interaction
966 W. Afzal et al. / Information and Software Technology 51 (2009) 957–976
Table 5
Summary of results applying metaheuristics for detecting security vulnerabilities. The last column on the right covers any issues such as constraints, limitations and highlights.
(GE is short for Grammatical Evolution, LGP is short for Linear Genetic Programming and PSO is short for Particle Swarm Optimization.)
testing offers savings, in that it aims to cover every combination of With respect to the application of metaheuristics, the fitness
pair-wise (or t-way) interaction at least once [58]. It is interesting function used for constructing a covering array is the number
to see that there are different competing constraints. On one hand, of uncovered t-subsets, so the covering array itself will have a
the objective is high-coverage, while on the other hand the test cost of 0. Since one does not know the size of the test suite a
suite size needs to be small to reduce overall testing cost. Covering priori, therefore, heuristic search techniques apply transforma-
array (CA) needs to be constructed to capture the t-way interac- tions to a fixed size array until constraints are satisfied. The re-
tions. For software systems, each factor (feature or component) sults of implementing SA for handling t-way coverage in fixed
comprises of different levels (options or parameters or values), level cases are provided by Cohen et al. [46]. The results showed
therefore a mixed level covering array (MCA) is proposed. How- that in comparison with greedy search techniques used in test
ever, as compared to CA, there are few results on the upper bound case generator (TCG) [60] and automatic efficient test generator
and construction algorithms for mixed level covering arrays, espe- (AETG) [61], SA improved on the bounds of minimum test cases
cially using heuristic search techniques [46]. Algebraic constructs, in a test suite of strength two, e.g. for MCAðN; 2; 51 38 22 Þ, SA gave
greedy methods and metaheuristic search techniques have been 15 as minimum test cases as compared to 20 and 19 by TCG and
applied to construct covering arrays. Our interest here is to explore AETG, respectively. In case of strength 3 constructions, the SA
the use of metaheuristic search techniques for constructing cover- algorithm did not performe as well as the algebraic construc-
ing arrays. Hoskins et al. provide definitions relevant to covering tions. Therefore, the initial results indicated SA as more effective
array and mixed level covering array [59]: than other approaches for finding smaller sized test suites. On
the other hand, SA took much more execution time as compared
A covering array, CAkðN; t; k; v Þ, is an N k array for which every
to simpler heuristics. Stardom [45] also used SA, GA and tabu
N t sub-array has the property that every t-tuple appears at least
search (TS) for constructing covering arrays. The results indi-
k times. In this application, t is the strength, k is the number of fac-
cated that SA and TS were best in constructing covering arrays.
tors (degree), and v is the number of symbols for each factor
A genetic algorithm turned out to be least effective; taking more
(order). When k is 1, every t-way interaction is covered at least
time and moves to find good covering arrays. Stardom reported
once; this is the case of most interest, and we often omit the sub-
new upper bounds on size of covering array using SA, some of
script k when it is 1. The covering array is optimal if it contains
which were later improved by Cohen et al. [46]. Stardom’s study
the minimum possible number of rows. The size of such a covering
indicated that SA’s main advantage was the capability of execut-
array is the covering array number CANðt; k; v Þ.
ing many moves in a short time; therefore if the search space
A mixed level covering array, MCAkðN; t; k; ðv 1 ; v 2 ; . . . ; v k ÞÞ, is an
was dense, SA quickly located objects. On the other hand, TS
N k array in which, for each column i, there are exactly v i levels;
performed much better when the size of an array’s neighbor-
again every N t sub-array has the property that each possible t-
hood was smaller.
tuple occurs at least k times. Again k is omitted from the notation
Along with the application of metaheuristic techniques for con-
when it is 1. A mixed covering array provides the flexibility to con-
structing covering arrays, there is also evidence of integrated ap-
struct test suites for systems in which components are not
proaches (Table 6). Cohen et al. [46] proposed one such approach
restricted to having the exact same number of levels.
for using algebraic construction along with search techniques. An
To adapt to the practical concerns of software testing, it is desir- example of this integrated approach is given in Cohen and Ling
able that some subset of features have higher interaction coverage. [48] where a new strategy called augmented annealing takes
For example, the overall system might have 100% two-way cover- advantage of computational efficiency of algebraic construction
age, but a subset of features might have 100% three-way coverage. and generality of SA. Specifically, algebraic construction reduced
To this end, in Cohen et al. [46], Cohen et al. propose variable a problem to smaller sub-problems on which SA runs faster. The
strength covering arrays. As with mixed level covering arrays, con- experimental results reported new bounds for some strength three
struction methods and algorithms for variable strength test suites covering arrays e.g. CAð3; 6; 6Þ and CAð3; 10; 10Þ. A hybrid approach
is in its preliminary stages with [46] providing some initial results is also given in Bryce and Colbourn [51] for constructing covering
for test suite sizes constructed using simulated annealing (SA). array. The study focussed on covering as many t-tuples as early
W. Afzal et al. / Information and Software Technology 51 (2009) 957–976 967
Approach used Articles Safety testing is an important component of the testing strategy
Independent application of Cohen et al. [46,47], Stardom [45], Nurmela [49], of safety critical systems where the systems are required to meet
metaheuristics Shiba et al. [50] safety constraints. In terms of metaheuristic search techniques,
Use of an integrated approach Cohen et al. [48], Bryce et al. [51] SA and GA are applied for safety testing.
In Tracey et al. [52], the authors proposed an approach using
GAs and SA to generate test data violating a safety property. This
approach extended the authors’ previous work in developing a
as possible. So rather than minimizing the number of tests to general framework for dynamically generating test data. The vio-
achieve t-way coverage, the initial rate of coverage was the pri- lation of a safety property meant a hazard or a failure condition,
mary concern. The hybrid approach applied a one-test-at-a-time which was initially identified using some form of hazard analysis
greedy algorithm to initialize tests and then applied heuristic technique e.g. functional hazard analysis. The fitness function
search to increase the number of t-tuples in a test. The heuristic used evaluates different branch predicates and evaluates to zero
search techniques applied were hill-climbing, SA, TS and great if the safety property evaluates to false and will be positive other-
flood. With different inputs, SA in general produced the quickest wise. The search stopped when a test data with a zero cost was
rate of coverage with 10 or 100 search iterations. The study also found. The cost function calculation is presented in Table 8. where
concluded that smaller test suites do not relate to greater rate of K represents the penalty which was added for undesirable data
coverage; hence two different and sometimes inconsistent goals [52].
when applied in a real world setting. The paper provided a simple example where either SA or GAs
As mentioned earlier, there is less evidence on construction of can be applied to automatically search for test data violating safety
variable strength arrays. One such study used SA to find variable properties that must hold ‘after’ the execution of the SUT. The same
strength arrays and provided initial bounds [47]. In his work given cost function can also be used to generate test data to violate
using TS, Nurmela improved on previously known upper bounds safety conditions at specific points ‘during’ the execution of the
on the sizes of optimal covering arrays [49]. Experimenting with SUT. In this case, the SUT needed to be instrumented such that
number of factors and number of values for each factor, good the branch predicates were replaced by procedures which served
upper bounds were tabulated for strength-two covering arrays. two purposes of returning the boolean value of the predicate they
In addition, the study improved upper bounds for strength three replaced and adding to the overall cost the contribution made by
covering arrays. The TS algorithm was found to work best for each individual branch predicate that was executed. An example
strength two covering array with number of levels for each factor was given with an original program and the instrumented program
equal to three. According to [49], it was difficult to be sure about with examples of how the fitness function was able to guide the
the upper bounds to be optimal or not because TS being a sto- search.
chastic algorithm could improve upon the new bounds if given
more computing time.
Toshiaki et al. used genetic algorithms (GA) and ant colony algo-
Table 8
rithm (ACA) for constructing covering arrays [50]. The results were Cost function calculation.
compared with AETG, in-parameter order (IPO) [62] algorithm and
SA. SA outperformed their results of using GA and ACA with respect Element Value
to the size of resulting test sets for two-way and three-way testing. Boolean if TRUE then 0 else K
Their results however outperformed AETG for two-way and three- a¼b if abs(a - b)=0 then 0 else abs(a - b) + K
a–b if abs(a - b) – 0 then 0 else K
way testing. It was interesting to find that using GA, the results did a<b if a - b < 0 then 0 else (a - b) + K
not match with those produced by Stardom’s study [45] which ab if a - b 0 then 0 else (a - b) + K
indicated that GA did not perform well in generating covering ar- a>b if b - a < 0 then 0 else (b - a) + K
rays, even though several attempts were made to modify the struc- ab if b - a 0 then 0 else (b - a) + K
a_b min(cost(a), cost(b))
ture of the algorithm.
a^b cost(a) + cost(b)
A summary of results of applying metaheuristics for covering :a Negation propagated over a
array construction is given in Table 7.
Table 7
Summary of results applying metaheuristics for covering array construction. The last column on the right covers any issues such as constraints, limitations and highlights. (TS is
short for Tabu Search, SA is short for Simulated Annealing, HC is short for Hill Climbing, ACA is short for Ant Colony Algorithm and GA is short for Genetic Algorithm.)
The approach presented by Tracey et al. [52] has been extended fitness function, which was defined according to the problem at
by Abdellatif-Kaddour et al. [53] for sequential problems. In hand. For example, in case of a car control system, the fitness func-
Abdellatif-Kaddour et al. [53], SA was used for step-wise construc- tion checked for violations of signal amplitude boundaries. The ap-
tion of test scenarios (progressive exploration of longer test scenar- plied fitness function consisted of two levels, which differentiated
ios) to test safety properties in cyclic real-time control systems. quality between multiple output signals violating the defined
The stepwise construction was required due to the sequential boundaries. An objective value of 1 indicated a severe violation
behavior of the control systems as it was expected that safety while for less severe violations, the closeness of the maximal value
property violation will occur after execution of a particular trajec- to the defined boundary was calculated. The results of the experi-
tory or sequence of data in the input domain. The test strategy was ment performed on a car control system showed that the optimiza-
applied to a steam boiler case study where the target safety prop- tion continually found better values and ultimately a serious
erty was the non-explosion of the boiler. Along with the objective violation was detected.
of violating a safety property, there was a set of dangerous situa- A summary of results of applying metaheuristics for safety test-
tions of interest when exploring progressive evolution towards ing is given in Table 9.
property violation. So the objective was not only violation of a tar-
get property but also to reach a dangerous situation. For the steam 4. Discussion and areas for future research
boiler case study, there could be ten possible safety property viola-
tions and three dangerous situations. The solution space in this The body of knowledge into the use of metaheuristic search
case was divided into several subsets of smaller sizes. So different techniques for verifying the temporal correctness is geared to-
classes of test sequences were independently searched. For each wards real-time embedded systems. For these systems, temporal
class, the objective was defined into sub-objectives corresponding correctness must be verified along with the logical correctness.
to either safety property violation or the achievement of a danger- The fact that there is a lack of support for dynamic testing of
ous situation. The overall cost was the minimum value of the dif- real-time system for temporal correctness caused the research
ferent sub-objective cost functions. The efficiency of using SA community to take advantage of metaheuristic search techniques.
was analyzed in comparison with random sampling. The first It is possible to differentiate the temporal testing research into two
experiment showed that random sampling found test sequences dimensions. One of them focuses on violation of timing constraints
that fulfilled main objective more quickly than the approach using due to input values and most of the temporal testing research fol-
SA. Therefore a revised SA was used in which the acceptance prob- lows this dimension. The other dimension, which is the one taken
ability was adjusted to allow for significant moves in case of no by Briand et al. [12] analyses task architectures and consider seed-
cost improvement. The revised version of SA offered significant ing times of events triggering tasks and tasks’ synchronization, i.e.
improvement over the basic version of SA, while in comparison Briand’s et al. study does not consider tasks in isolation. Both ap-
with random sampling a slight improvement was observed both proaches to temporal verification are complementary.
in terms of total number of iterations and successful search. The re- The performance outcome information from studies related to
sults of the study confirmed the usefulness of stepwise construc- execution time are given in Table 10. It is fairly evident from the
tion of test scenarios, but in terms of efficiency of SA algorithm, table that GA consistently outperforms random and statistical test-
the cost effectiveness as compared with random sampling remains ing in wide variety of situations, producing comparatively longer
questionable. execution times faster and also finding new bounds on BCET. GAs
In Baresel et al. [15,54], an evolutionary testing approach using were also able to perform better than human testers and on occa-
genetic algorithms was presented for structural and functional se- sions where it failed to do so may be attributed to the complexity
quence testing. For complex dynamic system like car control sys- of the test objects inhibiting evolutionary testability. With respect
tems, long input sequences were necessary to simulate these to comparison with static analysis, GA performed comparably well
systems. At the same time, one of the most important aims was and both techniques are shown to bound the actual execution
to generate a compact description for the input sequences, contain- times from opposite ends.
ing as few elements as possible but having enough variety to stim- For execution time, genetic algorithms are used as the metaheu-
ulate the system under test as much as necessary. In order to have ristic in vast majority of cases (14 out of 15 papers), while SA finds
a compact description of input sequences for a car control system, application in one of the studies. The preference of using genetic
the long input sequence was divided into sections. Each section algorithms over SA can be attributed to the very nature of search
had a base signal having signal type, amplitude and length of sec- mechanism inherent to genetic algorithms. Since a genetic algo-
tion as variables. These variables had bounds within which the rithm maintains a population of possible solutions, it has a better
optimization generated solutions. The output sequences generated chance of locating global optimum as compared to SA which pro-
by simulating the car control system were to be evaluated against a ceed one solution at a time. Also due to the fact that temporal
Table 9
Summary of results applying metaheuristics for safety testing. The last column on the right covers any issues such as constraints, limitations and highlights. (GA is short for
Genetic Algorithm and SA is short for Simulated Annealing.)
Table 10
Key evaluation information and outcomes of execution time studies. (GA is short for Genetic Algorithm, EGA is short for Extended GA, while SA is short for Simulated Annealing.)
Article and Method of evaluation Test objects Performance factor evaluated Outcomes of the experiment
metaheuristic
Wegener et al. Comparison with statistical Simple C function Finding WCET/BCET and number The longest execution time was
[25], GA and systematic testing of generations found very fast and a new shortest
execution time was found, not
previously discovered by statistical
and systematic testing
Alander et al. [26], Comparison with random Relay software simulator Finding processing time GA generated input data with longer
GA testing extremes response times
Wegener et al. Comparison with statistical 5 programs with up to 1511 LoC Finding WCET/BCET and the GA found more extreme times
[27], GA and systematic testing and 843 integer input number of tests required although on occasions required more
parameters tests to do so
Wegener and Comparison with random 8 programs with up to 1511 LoC Finding WCET/BCET and the GA always obtained better results as
Grochtmann testing and 5000 input parameters number of tests required compared with random testing
[10], GA
O’Sullivan et al. Comparison with different An algorithm from automotive Convergence analysis of various Cluster analysis turned out to be a
[28], GA termination criteria i.e. electronics termination criteria more powerful termination criterion
limiting the number of which allowed quick location of local
generations, time spent on optima
test and examining the
fitness evolution
Tracey et al. [29], Comparison in terms of size 4 programs (conditional blocks, Finding WCET SA successfully executed a worst case
SA of parameter space to simple loop, binary integer path and was more effective with
search square root and insertion sort) larger, complex parameter space
Pohlheim and Comparison with Modules from a motor control Finding maximum execution GAs found longer execution times for
Wegener [32], systematic testing system times all the given modules
EGA
Mueller and Comparison with static 5 experiments with industrial Finding WCET/BCET Use of evolutionary testing and static
Wegener [30], analysis and reference applications analysis bounded the actual
GA execution times from opposite ends
and were complementary
Puschner and Comparison with static 7 programs with diverse Finding WCET For large input data space, GA
Nossal [31], GA analysis and random testing execution time characteristics outperformed random method. In
comparison with static analysis, the
performance of GA is comparable
Wegener et al. Comparison with execution An engine control system with 6 Finding WCET Longer execution times were found
[33], GA time found with developers time critical tasks with the evolutionary test than with
test the developer tests
Groß [36], GA Random test case 15 example test programs Finding WCET Evolutionary testing generated more
generation and worst-case times as compared with
performance of an random testing. Also for only 4 out of
experienced human tester 15 test objects the human tester was
more successful
Groß [35], GA None 21 test objects of varying input Evolutionary testability The prediction system forecasted
size evolutionary testability with almost
90
Groß et al. [34], None 22 test objects with varying Evolutionary testability Evolutionary testability and
GA input sizes complexity of test objects was found
to be interrelated
Tlili et al. [11], Standard evolutionary real- 12 test objects with varying Finding longer execution times Using range restriction and seeding
EGA time testing with random cyclomatic complexity initial population with data achieving
initial population and no high structural branch coverage,
search space restriction longer execution times were found
for most of the test objects except for
two
Briand et al. [12], None Two case studies, one of Maximizing critical deadline It was possible to identify seeding
GA researchers own scenarios and misses times such that small errors in the
the second consisted of an actual execution time estimates could lead
real time system to missed deadlines
behavior of real-time systems always results in complicated multi- function, mutation and crossover operators and search for robust
dimensional search space with many plateaus and discontinuities, and appropriate parameters. In terms of test objects, the research
genetic algorithms are suitable since they perform well for prob- focuses on properties of test objects inhibiting evolutionary test-
lems involving large number of variables and complex input ability and formulation of complexity measures for predicting evo-
domains. lutionary testability (Fig. 4).
There is another way of analyzing the body of knowledge into The fact that seeding the initial population of an evolutionary
the use of metaheuristics for testing temporal correctness; which algorithm with test data having high structural coverage has had
is in terms of (1) properties associated with the metaheuristic itself better results, it would be interesting to design a fitness function
and (2) properties related to the SUT (test object). In terms of that takes into account structural properties of individuals. Then
metaheuristic, the research focuses on improving multiple issues: it will be possible not only to reward individuals on the basis of
reduction in search space, comparative studies with static analysis, execution time but also on their ability to execute complex parts
selection schemes for initial population of genetic algorithm, eval- of the source code. Similarly, there are different types of structural
uation of suitable termination criterion for search, choice of fitness coverage criteria, which can be used to seed initial populations and
970 W. Afzal et al. / Information and Software Technology 51 (2009) 957–976
might prove helpful in the design of a fitness function that takes tems. In terms of QoS, different attributes are of interest and are
into account such a structural coverage criterion. competing e.g. cost, response time, availability and reliability.
In terms of reliable termination criterion for evolutionary test- These attributes need to be computed for workflow constructs.
ing, use of cluster analysis is found to be useful over other termina- Empirical evidence has to be gathered for a thorough comparison
tion criteria, e.g. number of generations and examination of fitness of genetic algorithms with non-linear integer programming for
evolution. However, to the authors’ knowledge, the use of cluster QoS-aware service composition. Also network configurations and
analysis as a termination criterion is used in only one study [28]. server load, being one of the factors causing SLA violations, is to
Cluster analysis information can be used to change the search be accounted for generating test data violating the SLA.
strategy in a way that escapes local optima and helps exploring Buffer overflow attacks compromises the security of applica-
more feasible areas of the search space. Also the performance of tions. The attacker needs three requirements for a successful ex-
evolutionary algorithms can be made much better by using robust ploit, (i) a vulnerable program within the target system (ii)
parameters. The search for these parameters is still on. Similarly, information of the size of memory reference necessary to cause
variations of existing algorithms e.g. like using extended evolution- the overflow and (iii) the correct placement of a suitable exploit
ary algorithms might give interesting insights into the perfor- to make use of the overflow when it occurs [43]. In order to guide
mance of evolutionary testing. the interpretation of findings, performance outcomes are summa-
We also gathered studies regarding application of metaheuristic rized in Table 12, in addition to Table 5. The range of studies offers
search techniques for quality of service aware composition and variations with respect to the main theme, although all have the
violation of service level agreements (SLAs) between the integrator common goal of addressing security testing. Therefore, we see
and the end user. Genetic algorithms have been applied to tackle studies making use of metaheuristic search techniques to create
the QoS-aware composition problem as well as generation of in- a range of successful attacks to evade common intrusion detection
puts causing SLA violations. Table 11 show that in comparison with systems as well as to identify buffer overflows. For detecting buffer
linear integer programming and random search, genetic algorithms overflows, the attacker’s arbitrary code needs modification to in-
were more successful in meeting QoS constraints. We also infer crease the success chances of creating a malicious buffer. In case
that the testing of service-oriented systems has inherently several of hacker scripts generation, the goals for the hacker script gener-
issues. These include testability problems due to lack of observabil- ation needs to be defined. The use of metaheuristic search tech-
ity of service code and structure, integration testing issues due to niques includes grammatical evolution, linear genetic
the use of late-binding mechanisms, lack of control and involved programming, genetic algorithm and particle swarm optimization.
cost of testing [38]. These issues raises the need for adequate test- Devising a useful fitness function is the focus of majority of the
ing strategies and approaches tailored for service-oriented sys- studies which highlights the difficulty in instrumenting security is-
Table 11
Key evaluation information and outcomes of QoS studies. (GA is short for Genetic Algorithm.)
Article and Method of Test objects Performance factor evaluated Outcomes of the experiment
metaheuristic evaluation
Canfora et al. Linear A workflow containing 18 Convergence times of integer When the number of concrete services available for each abstract
[37], GA integer invocations of 8 distinct programming and GA for the same service is large, GA should be preferred instead of integer
programming abstract services achieved solution programming. On the other hand, whenever the number of concrete
services available is limited, integer programming is to be preferred
Di Penta et al. Random Audio processing workflow Violation of QoS constraint and The new approach outperformed random search and successfully
[38], GA search and a service for chart time required to converge to a violated QoS constraints
generation solution
W. Afzal et al. / Information and Software Technology 51 (2009) 957–976 971
Table 12
Key evaluation information and outcomes of security studies. (GA is short for Genetic Algorithm, GE is short for Grammatical Evolution, while PSO is short for Particle Swarm
Optimization.)
Article and Method of evaluation Test objects Performance factor evaluated Outcomes of the experiment
metaheuristic
Dozier et al. [39], GA Comparison of steady state GA and six 1998 MIT Lincoln Lab Data To discover holes (Type II GA outperformed all of the swarms
and PSO variants of PSO errors/false negatives) with respect to number of distinct
vulnerabilities discovered
Kayacik et al. [40], GE Detection (or not) of each exploit A simple (generic) vulnerable The number of alerts that The results from three variants of GE
(basic, niching and through the Snort misuse detection application performing a data Snort generates when attacks were comparable
niching & NoOP system copy without checking the are executed
minimization) internal buffer size
Budynek et al. [41], Results from a log analyzer Automatic generation of Evidence collection from the Various top scoring scripts were
GA computer hacker scripts (a logs obtained
sequence of Unix commands)
Grosso et al. [42], GA Comparison of three different fitness A white-noise generator and a Modified t-test to compare A fitness function accounting for the
functions namely vulnerable coverage function contained in the ftp fitness values of three fitness distance from the buffer boundaries
fitness, nesting fitness and buffer client functions across the two test outperformed fitness function not
boundary fitness objects using the same factor
Grosso et al. [44], GA The fitness that does not use dynamic Two different sets of C A comparison of fitness The new fitness function outperformed
weighing applications values of competing fitness the comparable ones
cases in terms of number of
generations
Kayacik et al. [43], Evaluation of a new approach Three experiments with Avoidance of attack detection The evolved attacks discovered
Linear GP different data sets by Snort, the network based different ways of attaining sub-goals
intrusion detection system associated with building buffer
overflow attacks
Kayacik et al. [23], Compares two fitness functions Instruction set consisting of Measurement of alarm rate Reduction of anomaly rate from 65%
Linear GP (incremental and concurrent) in most frequently occurring indicating evasion of Stride, to 2:7%
detecting anomaly rate system calls the anomaly host based
detection system
sues as an appropriate search metric. Most of the fitness functions parameter sets, the heuristic algorithms run slowly due to the large
are based on the ability of the attack to fulfill the conditions neces- amount of memory required (to store information). Therefore, exe-
sary for a successful exploit. This has resulted in studies comparing cution time is a known barrier in finding more results using meta-
the performances of different fitness functions rather than compar- heuristic search algorithms. One obvious way to achieve efficient
ing with traditional approaches to security testing. This indicates memory management is to reuse the previously calculated t-com-
that most of the studies into use of metaheuristics for this domain binations by storing them in some form of a temporary memory.
is largely exploratory and no trends are observable that can be gen- Also as mentioned earlier, the size of the test suite is not known
eralized. Due to this we see authors experimenting with simple a priori, many sizes must be tested for obtaining a good bound
and small applications on a limited scale. The scalability of these on a t-way test set of given size.
approaches, with larger data sets and greater number of trials, is It is also worth mentioning that finding optimal parameters for
an interesting area of future research. The work of Kayacik et al. effectively using metaheuristics require many trials, moreover it is
[40,43,23] is notable as they move towards a general framework difficult to be sure whether the upper bounds produced are opti-
for attack generation based on the evolution of system call se- mal or not because further improvements on bounds can take
quences. Also the co-evolution of attacker-detector pairs (as place if given more computing time. In addition to Table 7, we
pointed out by Kayacik et al. [23]) that provides the opportunity summarize key evaluation information and outcomes in Table 13.
to actually preempt new attack behaviors before they are encoun- We gather that a range of metaheuristics have been applied for
tered, formulation of techniques (like static analysis and program constructing covering arrays. This includes SA, TS, GA and ant col-
slicing) to reduce the search space for the evolutionary algorithm, ony optimization. We find SA and TS to be widely applicable search
hybridization of GA and PSO algorithm for effective searching and techniques, particularly SA being applied to generate smaller sized
creation of an interactive tool for vulnerability assessment based test suites. Out of seven primary studies, five report using SA while
on evolutionary computation for the end users provides further fu- three use TS, either being used independently or in combination
ture extensions. with other search techniques and algebraic constructions. We infer
In the area of usability, we found the application of metaheuris- from Table 13, SA consistently performs better than GA, HC, ACA
tic search techniques for constructing covering arrays. Being a and TS in terms of size of resulting test sets. The performance of
qualitative attribute, usability possesses different interpretations. SA can further be improved by integrating it with the use of alge-
However, we classify studies reporting construction of covering ar- braic constructions. Another possibility is to begin with a greedy
rays under usability as they relate to a form of interaction testing algorithm (like TCG) and then make a transition to heuristic search
covering t-way user interactions whereby each test case exposes after meeting a certain condition [46]. GA has been applied in two
different areas of a user interface. The research for construction studies with contradictory results and hence requires further
of covering arrays for software testing have dual focus of finding experimentation. An interesting area is to explore the use of ant
new techniques to produce smaller covering arrays and to produce colony optimization to generalize initial results given by [50]. In
them in reasonable amount of time. As expected, a trade-off must terms of construction of variable strength covering arrays, there
be achieved between computational power and size of resulting is a potential for further research for finding the best approach
test suites. The extent of evidence related to applying metaheuris- for variable strength test suite.
tics for finding better bounds on covering arrays suggest that meta- Safety testing is used to test safety critical systems that have to
heuristics are very successful for smaller parameter sets. For larger satisfy the safety constraints in addition to satisfying the functional
972 W. Afzal et al. / Information and Software Technology 51 (2009) 957–976
Table 13
Key evaluation information and outcomes of usability studies. (GA is short for Genetic Algorithm, TS is short for Tabu Search, SA is short for Simulated Annealing, HC is short for
Hill Climbing, PSO is short for particle swarm optimization while, ACA is short for Ant Colony Algorithm.)
Article and Method of evaluation Test objects Performance factor evaluated Outcomes of the experiment
metaheuristic
Stardom [45], Comparison of SA, TS and Different sizes of covering arrays, Three tests to find the best GA was ineffective at finding quality arrays when
SA, TS and GA CA(13,11:1), CA(9,7:1), CA(7,6:1) arrays possible in the shortest compared with TS and SA. SA in general was found to
GA amount of time be very useful for finding covering arrays of various
sizes while when the size of an arrays neighborhood
was smaller, TS was able to find much better arrays
Cohen et al. Comparison of SA, HC and Different sizes of mixed covering Number of test cases in a test HC and SA improved on bounds given by AETG and
[46], SA greedy methods (AETG, arrays and fixed covering arrays suite and time required to TCG, SA consistently performed well or better than HC
and HC TCG) obtain them
Cohen et al. Presentation of results for a Minimum, maximum and Minimum, maximum and Variable strength covering arrays of different sizes
[47], SA new combinatorial object average sizes of different average sizes of variable
(variable strength covering variable strength covering arrays strength covering arrays after
array) 10 runs of SA
Nurmela [49], Comparison with best Upper bounds on g 2 ðZ qn Þ for small Size of the covering array The search algorithm worked best for t=2 and q=3
TS known upper bounds q and n
Cohen et al. Comparisons with strength Several bounds for strength three Size of strength three covering Combination of combinatorial construction and SA
[48], SA three covering arrays covering arrays array presented new bounds for some strength 3 covering
arrays
Bryce et al. Comparisons among TS, SA Two inputs with factors having Rate of t-tuple coverage SA had the fastest rate of t-tuple coverage
[51], TS, and HC equal number of levels and two
SA, HC inputs having mixed number of
levels
Toshiaki et al. Comparisons with AETG, IPO Covering arrays and mixed Size of resulting test sets and For t=2, GA and ACA performed comparable to AETG.
[50], ACA, and SA algorithms for the covering arrays of strength 2 and amount of time required for For t=3, GA and ACA outperformed AETG. SA
GA cases t ¼ 2 and t ¼ 3 3 generation outperformed GA and ACA with respect to size of the
resulting test sets
specification. We take safety in terms of dangerous conditions, and calibration of algorithmic parameters. This is desirable espe-
which may contribute to an accident. There are two approaches cially when the extensions to the basic approaches of safety test-
for achieving verification of safety properties: dynamic testing ing can be applied to fault injection, testing for exception
and static analysis. Static analysis does not require execution of conditions and testing for safe component reuse and integration
the safety critical system while dynamic testing executes the sys- [52]. In terms of alternate design choices, we see in Abdellatif-
tem in a suitable environment with test data generated to test Kaddour et al. [53] that the efficiency of SA is dependent on the
safety properties. Both static analysis and dynamic testing are initial solution because when no cost improvement is observed,
complementary approaches; in many cases the results of static the search does not allow moves larger than those authorized
analysis are used to give criteria for dynamic testing (as in Tracey by the neighborhood function. Therefore, improvement in the effi-
et al. [52]). The available primary studies discussing testing of ciency of the SA algorithm can be achieved by searching else-
safety properties can be differentiated into two themes. One is where then the neighborhood of the current solution if no cost
the case where generation of separate inputs is discussed to test improvement is made. Further experimentation is also desirable
the safety property while the other case discusses generation of se- in terms of safety testing real world applications, however, the
quence of inputs. instrumentation required for testing of safety conditions is a chal-
The performance outcomes for studies related to safety testing lenge with respect to the scalability of the technique. Also, since
are given in Table 14. The studies show that SA and GAs are ap- the design choices are highly dependent on the nature of the
plied in the context of safety testing, GA being more successfully safety critical system, it is consequently important to include
applied. However, the results suggest a need for further experi- more problem specific knowledge into the representation of solu-
mentation in terms of investigating alternative design choices tions and design of fitness function.
Table 14
Key evaluation information and outcomes of safety studies. (GA is short for Genetic Algorithm, while SA is short for Simulated Annealing.)
Article and Method of evaluation Test objects Performance factor evaluated Outcomes of the experiment
metaheuristic
Tracey et al. Comparison of safety conditions obtained Small size Finding test data that causes an The approach might be useful not only for safety
[52], GA from software fault tree analysis and functions used implementation to violate a safety verification but also for integration with fault
and SA functional specification in the pre- property injection, testing for exception conditions and testing
proof step for safe component reuse
Kaddour et al. Effectiveness in terms of finding Non-explosion Finding the test sequence that Both random search and SA were effective while
[53], SA appropriate test sequences and efficiency in of the steam lead to either an explosion or a random search was more efficient
terms of comparing the speed of SA with boiler dangerous situation
random sampling
Baresel et al. None Dynamic car Violation of defined requirements It was possible to generate real world input sequences
[15], control system for output sequence generated by causing violations of a given safety requirement
Pohlheim the simulation of the dynamic
et al. [54], system
GA
W. Afzal et al. / Information and Software Technology 51 (2009) 957–976 973
years. The results of our systematic review also indicate that the
[54]
p
p
p
p
p
p
p
p
p
current body of knowledge concerning search-based software test-
[15]
ing does not report studies on many of the other non-functional
p
p
p
p
p
p
p
p
p
[53]
properties. On the other hand, there is a need to extend the early
optimistic results of applying NFSBST to larger real world systems,
p
p
p
p
p
p
p
p
p
p
thus moving towards a generalization of results.
Safety
[52]
p
p
p
p
p
p
p
p
p
p
Acknowledgement
[50]
p
p
p
p
p
p
p
p
p
p The authors would like to thank Barbara Ann Kitchenham and
[51]
p
p
p
p
p
p
p
p
p
p
the anonymous referees for their comments on the earlier drafts
[48]
of the paper.
p
p
p
p
p
p
p
p
p
p
[49]
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
p
[34]
p
p
p
p
p
p
p
p
p
References
[33]
p
p
p
p
p
p
p
p
p
p
[2] M. Harman, The current state and future of search based software engineering,
I: Is the research useful for software industry and research community?
F: Does the data collection methods relate to the aims of the research?
www.dur.ac.uk/ebse/>.
B: Is the context of study clearly stated?
p
p
p
p
p
p
p
p
p
p
Organization, 2001.
p
p
p
p
p
p
p
p
p
p
[7] N.E. Fenton, S.L. Pfleeger, Software metrics: A rigorous and practical approach,
[27]
p
p
p
p
p
p
p
p
p
p
1998.
p
p
p
p
p
p
p
p
p
p
[9] D.G. Firesmith, Common concepts underlying safety, security, and survivability
[25]
p
p
p
p
p
p
p
p
p
p
G
A
B
C
E
F
I
J
W. Afzal et al. / Information and Software Technology 51 (2009) 957–976 975
[10] J. Wegener, M. Grochtmann, Verifying timing constraints of real-time systems [35] H.-G. Groß, A prediction system for dynamic optimization-based execution
by means of evolutionary testing, Real-Time Systems 15 (3) (1998) 275– time analysis, in: SEMINAL 01: Proceedings of the First International
298. Workshop on Software Engineering using Metaheuristic Innovative
[11] M. Tlili, S. Wappler, H. Sthamer, Improving evolutionary real-time testing, in: Algorithms, Toronto, Canada, 2001.
GECCO’06: Proceedings of the Eighth Annual Conference on Genetic and [36] H.-G. Groß, An evaluation of dynamic, optimisation-based worst-case
Evolutionary Computation, ACM Press, 2006, pp. 1917–1924. execution time analysis, in: ITPC’03: Proceedings of the International
[12] L.C. Briand, Y. Labiche, M. Shousha, Stress testing real-time systems with Conference on Information Technology: Prospects and Challenges in the 21st
genetic algorithms, in: GECCO’05: Proceedings of the Seventh Annual Century, Kathmandu, Nepal, 2003.
Conference on Genetic and Evolutionary Computation, ACM Press, 2005, pp. [37] G. Canfora, M. Di Penta, R. Esposito, M.L. Villani, An approach for QoS-aware
1021–1028. service composition based on genetic algorithms, in: GECCO’05: Proceedings
[13] B.A. Kitchenham, E. Mendes, G.H. Travassos, Cross versus within-company cost of the 7th Annual Conference on Genetic and Evolutionary Computation, ACM,
estimation studies: A systematic review, IEEE Transactions on Software 2005, pp. 1069–1075.
Engineering 33 (5) (2007) 316–329. [38] M. Di Penta, G. Canfora, G. Esposito, V. Mazza, M. Bruno, Search-based testing
[14] E.K. Burke, G. Kendall (Eds.), Introductory tutorials in optimisation, decision of service level agreements, in: GECCO’07: Proceedings of the Nineth Annual
support and search methodology, Springer, 2005. Conference on Genetic and Evolutionary Computation, ACM Press, 2007, pp.
[15] A. Baresel, H. Pohlheim, S. Sadeghipour, Structural and functional sequence 1090–1097.
test of dynamic and state-based software with evolutionary algorithms, in: [39] G.V. Dozier, D. Brown, J. Hurley, K. Cain, Vulnerability analysis of immunity-
Genetic and Evolutionary Computation – GECCO 2003, Lecture Notes in based intrusion detection systems using evolutionary hackers, GECCO’04:
Computer Science, vol. 2724, Springer, 2003, pp. 2428–2441. Proceedings of the Sixth Annual Conference on Genetic and Evolutionary
[16] Y. Zhan, J.A. Clark, Search based automatic test-data generation at Computation, vol. 3102, Springer, 2004, pp. 263–274.
an architectural level, in: Genetic and Evolutionary Computation – GECCO [40] H.G. Kayacik, A.N. Zincir-Heywood, M. Heywood, Evolving successful stack
2004, Lecture Notes in Computer Science, vol. 3103, Springer, 2004, pp. 1413– overflow attacks for vulnerability testing, in: ACSAC’05: Proceedings of the
1424. 21st Annual Computer Security Applications Conference, IEEE Computer
[17] Y.S. Dai, M. Xie, K.L. Poh, B. Yang, Optimal testing-resource allocation with Society, Washington, DC, USA, 2005, pp. 225–234.
genetic algorithm for modular software systems, Journal of Systems and [41] J. Budynek, E. Bonabeau, B. Shargel, Evolving computer intrusion scripts for
Software 66 (1) (2003) 47–55. vulnerability assessment and log analysis, in: GECCO’05: Proceedings of the
[18] K. Derderian, R.M. Hierons, M. Harman, Q. Guo, Input sequence generation for Seventh Annual Conference on Genetic and Evolutionary Computation, ACM,
testing of communicating finite state machines (CFSMS), in: GECCO’04: New York, NY, USA, 2005, pp. 1905–1912.
Proceedings of the Sixth Annual Conference on Genetic and Evolutionary [42] C.D. Grosso, G. Antoniol, M.D. Penta, P. Galinier, E. Merlo, Improving network
Computation, Lecture Notes in Computer Science, vol. 3103, Springer, 2004, applications security: A new heuristic to generate stress testing data, in:
pp. 1429–1430. GECCO’05: Proceedings of the Seventh Annual Conference on Genetic and
[19] E. Alba, F. Chicano, Finding safety errors with aco, in: GECCO’07: Proceedings Evolutionary Computation, ACM, New York, NY, USA, 2005, pp. 1037–
of the 9th Annual Conference on Genetic and Evolutionary Computation, ACM, 1043.
New York, NY, USA, 2007, pp. 1066–1073. [43] H.G. Kayacik, M. Heywood, A.N. Zincir-Heywood, On evolving buffer overflow
[20] J. Koljonen, M. Mannila, M. Wanne, Testing the performance of a 2D nearest attacks using genetic programming, in: GECCO ’06: Proceedings of the Eighth
point algorithm with genetic algorithm generated gaussian distributions, Annual Conference on Genetic and Evolutionary Computation, ACM, New York,
Expert Systems with Applications: An International Journal 32 (3) (2007) 879– NY, USA, 2006, pp. 1667–1674.
889. [44] C. Del Grosso, G. Antoniol, E. Merlo, P. Galinier, Detecting buffer overflow via
[21] K.R. Walcott, M.L. Soffa, G.M. Kapfhammer, R.S. Roos, Timeaware test suite automatic test input data generation, computers and operations research
prioritization, in: ISSTA’06: Proceedings of the 2006 International Symposium (COR) focused issue on search based software engineering 35 (10), 3125–3143.
on Software Testing and Analysis, ACM, New York, NY, USA, 2006, pp. 1–12. [45] J. Stardom, Metaheuristics and the search for covering and packing arrays,
[22] S. Bouktif, H. Sahraoui, G. Antoniol, Simulated annealing for improving Master’s Thesis, Simon Fraser University, B.C., Canada.
software quality prediction, in: GECCO’06: Proceedings of the 8th Annual [46] M.B. Cohen, P.B. Gibbons, W.B. Mugridge, C.J. Colbourn, Constructing test
Conference on Genetic and Evolutionary Computation, ACM, New York, NY, suites for interaction testing, in: ICSE ’03: Proceedings of the 25th
USA, 2006, pp. 1893–1900. International Conference on Software Engineering, IEEE Computer Society,
[23] H.G. Kayacik, A.N. Zincir-Heywood, M. Heywood, Automatically evading IDS 2003, pp. 38–48.
using GP authored attacks, in: CISDA’07: Proceedings of the IEEE Symposium [47] M.B. Cohen, P.B. Gibbons, W.B. Mugridge, C.J. Colbourn, J.S. Collofello, Variable
on Computational Intelligence in Security and Defense Applications, IEE strength interaction testing of components, in: COMPSAC’03: Proceedings of
Computer Society, New York, NY, USA, 2007, pp. 153–160. the 27th Annual International Conference on Computer Software and
[24] T. Dybå, T. Dingsøyr, G.K. Hanssen, Applying systematic reviews to diverse Applications, IEEE Computer Society, 2003, p. 413.
study types: An experience report, in: ESEM 2007: Proceedings of the First [48] C.J.C.M.B. Cohen, A.C.H. Ling, Constructing strength three covering arrays with
International Symposium on Empirical Software Engineering and augmented annealing, in: ICSE’03: Proceedings of the 25th International
Measurement, 2007, pp. 225–234. Conference on Software Engineering, IEEE Computer Society, 2003, pp. 38–48.
[25] J. Wegener, K. Grimm, M. Grochtmann, H. Sthamer, B. Jones, Systematic testing [49] K.J. Nurmela, Upper bounds for covering arrays by tabu search, Discrete
of real-time systems, in: EuroSTAR’96: Proceedings of the Fourth International Applied Mathematics 138 (1-2) (2004) 143–152.
Conference on Software Testing Analysis and Review, 1996. [50] T. Shiba, T. Tsuchiya, T. Kikuno, Using artificial life techniques to generate test
[26] J.T. Alander, T. Mantere, G. Moghadampour, J. Matila, Searching protection cases for combinatorial testing, in: COMPSAC’04: Proceedings of the 28th
relay response time extremes using genetic algorithm-software quality by Annual International Computer Software and Applications Conference, IEEE
optimization, in: APSCOM’97: Fourth International Conference on Advances in Computer Society, Washington, DC, USA, 2004, pp. 72–77.
Power System Control, Operation and Management, vol. 1, 1997, pp. 95–99. [51] R.C. Bryce, C.J. Colbourn, One-test-at-a-time heuristic search for interaction
[27] J. Wegener, H. Sthamer, B.F. Jones, D.E. Eyres, Testing real-time systems using test suites, in: GECCO’07: Proceedings of the Nineth Annual Conference on
genetic algorithms, Software Quality Control 6 (2) (1997) 127–135. Genetic and Evolutionary Computation, ACM, New York, NY, USA, 2007, pp.
[28] M. O’Sullivan, S. Vössner, J. Wegener, Testing temporal correctness of real-time 1082–1089.
systems, in: EuroSTAR’98: Proceedings of the Sixth International Conference [52] N. Tracey, J. Clark, J. McDermid, K. Mander, Integrating safety analysis with
on Software Testing Analysis and Review, 1998. automatic test data generation for software safety verification, in: Proceedings
[29] N. Tracey, J. Clark, K. Mander, The way forward for unifying dynamic test of 17th International System Safety Conference, System Safety Society, 1999,
case generation: The optimisation-based approach, in: International pp. 128–137.
Workshop on Dependable Computing and Its Applications (DCIA), IFIP, 1998, [53] O. Abdellatif-Kaddour, P. Thévenod-Fosse, H. Waeselynck, Property-oriented
pp. 169–180. testing based on simulated annealing. URL <http://www.laas.fr/francois/SVF/
[30] F. Mueller, J. Wegener, A comparison of static analysis and evolutionary testing seminaires/inputs/02/olfapaper.pdf>.
for the verification of timing constraints, Real-Time Systems 21 (3) (1998) [54] H. Pohlheim, M. Conrad, A. Griep, Evolutionary safety testing of embedded
241–268. control software by automatically generating compact test data sequences, in:
[31] P. Puschner, R. Nossal, Testing the results of static worst-case execution-time SAE 2005 World Congress and Exhibition, 2005.
analysis, in: RTSS’98: Proceedings of the IEEE Real-Time Systems Symposium, [55] W. Afzal, R. Torkar, R. Feldt, A systematic mapping study on non-functional
IEEE Computer Society, Washington, DC, USA, 1998, p. 134. search-based software testing, in: 20th International Conference on Software
[32] H. Pohlheim, J. Wegener, Testing the temporal behavior of real-time software Engineering and Knowledge Engineering (SEKE), 2008.
modules using extended evolutionary algorithms, in: W. Banzhaf, J. Daida, A.E. [56] H.G. Groß, B. Jones, D. Eyres, Evolutionary algorithms for the verification of
Eiben, M.H. Garzon, V. Honavar, M. Jakiela, R.E. Smith (Eds.), Genetic and execution time bounds for real-time software, in: IEE Colloquium on
Evolutionary Computation – GECCO 1999, vol. 2, Orlando, FL, USA, 1999, p. Applicable Modelling, Verification and Analysis Techniques for Real-Time
1795. Systems, 1999, pp. 8/1–8/8.
[33] J. Wegener, P. Pitschinetz, H. Sthamer, Automated testing of real-time tasks, in: [57] L.C. Briand, Y. Labiche, M. Shousha, Using genetic algorithms for early
WAPATV’00: The First International Workshop on Automated Program schedulability analysis and stress testing in real-time systems, Genetic
Analysis, Testing and Verification, Limerick, Ireland, 2000. Programming and Evolvable Machines 7 (2) (2006) 145–170.
[34] H.-G. Groß, B.F. Jones, D.E. Eyres, Structural performance measure of [58] R.C. Bryce, Automatic generation of high coverage usability tests, in: CHI’05:
evolutionary testing applied to worst-case timing of real-time systems, IEE Extended Abstracts on Human Factors in Computing Systems, ACM, New York,
Proceedings – Software 147 (2) (2000) 25–30. NY, USA, 2005, pp. 1108–1109.
976 W. Afzal et al. / Information and Software Technology 51 (2009) 957–976
[59] D. Hoskins, R.C. Turban, C.J. Colbourn, Experimental designs in software Assurance Systems Engineering, IEEE Computer Society, Washington, DC, USA,
engineering: d-optimal designs and covering arrays, in: WISER’04: 1998, pp. 254–261.
Proceedings of the 2004 ACM workshop on Interdisciplinary Software [63] C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell, A. Wesslén,
Engineering Research, ACM, New York, NY, USA, 2004, pp. 55–66. Experimentation in Software Engineering: An Introduction, Kluwer Academic
[60] Y.-W. Tung, W. Aldiwan, Automating test case generation for the new Publishers, Norwell, MA, USA, 2000.
generation mission software system, Aerospace Conference Proceedings, vol. [64] A.C. Schultz, J.J. Grefenstette, K.A.D. Jong, Adaptive testing of controllers for
1, IEEE 1 (2000) 431–437. autonomous vehicles, in: AUV’92: Proceedings of the 1992 Symposium on
[61] D.M. Cohen, S.R. Dalal, M.L. Fredman, G.C. Patton, The AETG system: An Autonomous Underwater Vehicle Technology, 1992, pp. 158–164.
approach to testing based on combinatorial design, IEEE Transactions Software [65] A.C. Schultz, J.J. Grefenstette, K.A.D. Jong, Learning to Break Things: Adaptive
Engineering 23 (7) (1997) 437–444. Testing of Intelligent Controllers, IOP Publishing Ltd., Oxford Press, 1997
[62] Y. Lei, K.-C. Tai, In-parameter-order: A test generation strategy for pairwise (Chapter G3.5).
testing, in: HASE’98: The Third IEEE International Symposium on High-