

REMAP: Using Rule Mining and Multi-Objective Search for Dynamic Test Case Prioritization

Dipesh Pradhan (1,2), Shuai Wang (1), Shaukat Ali (1), Tao Yue (1,2), Marius Liaaen (3)
(1) Simula Research Laboratory, Oslo, Norway
(2) University of Oslo, Oslo, Norway; (3) Cisco Systems, Oslo, Norway
{dipesh, shuai, shaukat, tao}@simula.no, marliaae@cisco.com

Abstract—Test case prioritization (TP) prioritizes test cases into an optimal order for achieving specific criteria (e.g., higher fault detection capability) as early as possible. However, the existing TP techniques usually only produce a static test case order before the execution, without taking runtime test case execution results into account. In this paper, we propose an approach for black-box dynamic TP using rule mining and multi-objective search (named REMAP). REMAP has three key components: 1) Rule Miner, which mines execution relations among test cases from historical execution data; 2) Static Prioritizer, which defines two objectives (i.e., fault detection capability (FDC) and test case reliance score (TRS)) and applies multi-objective search to prioritize test cases statically; and 3) Dynamic Executor and Prioritizer, which executes the statically prioritized test cases and dynamically updates the test case order based on the runtime test case execution results. We empirically evaluated REMAP against random search, greedy based on FDC, greedy based on FDC and TRS, static search-based prioritization, and rule-based prioritization using two industrial and three open source case studies. Results showed that REMAP significantly outperformed the other approaches for 96% of the case studies and managed to achieve on average 18% higher Average Percentage of Faults Detected (APFD).

Keywords—multi-objective optimization; rule mining; dynamic test case prioritization; search; black-box regression testing.

I. INTRODUCTION

While developing modern software systems in industry, regression testing is frequently applied to re-test the modified software to ensure that no new faults are introduced [1-3]. However, regression testing is an expensive maintenance process that can consume up to 80% of the overall testing budget [4, 5]. Test case prioritization (TP) is one of the most widely used approaches to improve regression testing, with the aim to prioritize a set of test cases into an optimal order for achieving certain criteria (e.g., fault detection capability) as early as possible [6-9].

However, the existing TP techniques [7, 10-14] usually produce a list of statically prioritized test cases for execution, i.e., the order of the prioritized test cases will not be updated based on the runtime test case execution results. Based on our collaboration with Cisco Systems [15, 16] focusing on cost-effectively testing video conferencing systems (VCSs), we noticed that there may exist underlying relations among the executions of test cases, which test engineers are not aware of when developing these test cases. For example, when the test case (T1) that tests the amount of free memory left in the VCS after pair-to-pair communication for a certain time (e.g., 5 minutes) fails, the test case (T2) verifying that the speed of the fan in the VCS locks near the speed set by the user always fails as well. This is in spite of the fact that all test cases are supposed to be executed independently of one another. Moreover, we noticed that when T1 is executed as pass, another test case (T3) that checks the CPU load measurement of the VCS also always passes. Thus, we argue that mining such execution relations among test cases based on historical execution data can improve the effectiveness of TP. For instance, if T1 is executed as fail, T2 should be executed as early as possible (e.g., execute T2 right after T1), since T2 has a high chance to fail, i.e., detect a fault. Additionally, if T1 is executed as pass, T3 should be executed later (i.e., the priority of T3 should be decreased), since T3 has a high possibility to pass. Therefore, it is essential to prioritize the test cases in a dynamic manner considering the runtime test case execution results, which has not been properly addressed [14, 17-19].

With such motivation in mind, this work proposes a black-box TP approach named REMAP, which uses rule mining and search to prioritize test cases in the absence of source code. REMAP consists of three key components, i.e., Rule Miner (RM), Static Prioritizer (SP), and Dynamic Executor and Prioritizer (DEP). First, RM defines fail rules and pass rules for representing the execution relations among test cases and mines these rules from the historical execution data. Second, SP defines two objectives, Fault Detection Capability (FDC) and Test Case Reliance Score (TRS), and applies multi-objective search to statically prioritize test cases. FDC measures the rate of failed executions of a test case in a given time, while TRS measures the extent to which the result of a test case can be used to predict the results of other test cases (based on the rules mined by RM), e.g., the execution result of T1 (as fail) can help to predict the execution result of T2 (as fail). Third, DEP executes the statically prioritized test cases obtained from SP and dynamically updates the test case order based on the runtime test case execution results and the fail rules and pass rules from RM.

To empirically evaluate REMAP, we employed a total of five case studies (two industrial ones and three open source ones): 1) two data sets from Cisco related to VCS testing, 2) two open data sets from ABB Robotics for Paint Control [20] and IOF/ROL [20], and 3) the open Google Shared Dataset of Test Suite Results (GSDTSR) [21]. We compared REMAP with 1) three baseline TP approaches, i.e., random search (RS)
[10, 12, 22] and two prioritization approaches based on Greedy [6, 8, 23]: G-1 based on FDC and G-2 based on FDC and TRS, 2) a search-based TP (SBTP) approach [24-26], and 3) a rule-based TP (RBP) approach [27] (using the mined rules from RM, i.e., fail rules and pass rules).

To assess the quality of the prioritized test cases, we use the widely applied evaluation metric Average Percentage of Faults Detected (APFD) [6, 7]. The results showed that the test cases prioritized by REMAP performed significantly better than 1) RS, G-2, SBTP, and RBP for all the five case studies, and 2) G-1 for four case studies. In general, as compared to the five approaches, REMAP managed to achieve on average 18% (i.e., 33.4%, 14.5%, 17.7%, 14.9%, and 9.8%) higher values of APFD for the five case studies.

The key contributions of this paper include:
• We define two types of rules (fail rule and pass rule) for representing execution relations among test cases and apply a rule mining algorithm (i.e., Repeated Incremental Pruning to Produce Error Reduction [28]) to mine these rules;
• We define two objectives (i.e., FDC and TRS) and integrate these objectives into search to statically prioritize test cases;
• We propose a dynamic approach using the mined rules to update the statically prioritized test cases based on the runtime test case execution results;
• We empirically evaluate REMAP against five existing TP approaches using five case studies.

The rest of the paper is organized as follows. Section II motivates our work, followed by a formalization of the TP problem in Section III. Section IV presents REMAP in detail, and Section V details the experiment design, followed by the results in Section VI. Section VII presents the related work, and Section VIII concludes this paper.

II. MOTIVATING EXAMPLE

We have been collaborating with Cisco Systems, Norway for more than nine years with the aim to improve the quality of their video conferencing systems (VCSs) [15, 16, 25]. The VCSs are used to organize high-quality virtual meetings without gathering the participants in one location. These VCSs are used in many organizations such as IBM and Facebook. Each VCS has on average three million lines of embedded C code, and they are developed continuously. Testing needs to be performed each time the software is modified. However, it is not possible to execute all the test cases, since there is only a limited time for execution and each test case requires considerable execution time (e.g., 10 minutes). Thus, the test cases need to be prioritized so that the important test cases (i.e., the test cases most likely to fail) are executed as soon as possible.

Through further investigation of VCS testing, we noticed that a certain number of test cases tend to pass/fail together, i.e., when a test case passes/fails, some other test case(s) passed/failed almost all the time (with more than 90% probability). This is despite the fact that the test cases do not depend upon one another when implemented and are supposed to be executed independently. Moreover, the test engineers from our industrial partner are not aware of these execution relations (i.e., test cases failing/passing together) when implementing and executing test cases. Note that we consider each failed test case to be caused by a separate fault, since the link between actual faults and executed test cases is not explicit in our context; this has been assumed in other literature as well [9, 20, 27].

Fig. 1 depicts a running example to explain our approach, with six real test cases from Cisco along with their execution results within seven test cycles. A test cycle is performed each time the software is modified, where a set of test cases from the original test suite is executed. When a test case is executed, there exist two types of execution results, i.e., pass or fail. Pass implies that the test case did not find any fault(s), while fail denotes that the test case managed to detect a fault(s). From Fig. 1, we can observe that the executions of test cases T1, T2, T3, T4, T5, and T6 failed four, two, three, three, one, and four times, respectively, within the seven test cycles. Moreover, we can observe that there exist certain relations between the execution results of some test cases. For example, when T1 is executed as fail/pass, T3 is always executed as fail/pass and vice versa, when both of them were executed.

Fig. 1. Sample data showing the execution history for six test cases (NE: Not Executed).
Test Case | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7
T1        | Fail    | Pass    | Pass    | Fail    | Pass    | Fail    | Fail
T2        | Pass    | NE      | Pass    | Fail    | Pass    | Fail    | Pass
T3        | Fail    | Pass    | Pass    | Fail    | Pass    | NE      | Fail
T4        | Pass    | Fail    | Pass    | Fail    | Pass    | Fail    | Pass
T5        | Pass    | Fail    | Pass    | Pass    | Pass    | Pass    | Pass
T6        | Fail    | NE      | NE      | Fail    | Fail    | Fail    | NE

Based on such an interesting observation, we can assume that internal relations among test case executions can be extracted from historical execution data, which can then be used to guide prioritizing test cases dynamically. More specifically, when a test case is executed as fail (e.g., T1 in Fig. 1), the corresponding fail test cases (e.g., T3 in Fig. 1) should be executed earlier, as there is a high chance that the corresponding test case(s) is executed as fail. On the other hand, if a test case is executed as pass (e.g., T1 in Fig. 1), the priority of the related pass test case(s) (T3 in Fig. 1) needs to be decreased (i.e., they need to be executed later). We argue that identifying such internal execution relations among test cases can help to facilitate prioritizing test cases dynamically, which is the key motivation of this work.

III. BASIC NOTATIONS AND PROBLEM REPRESENTATION

This section presents the basic notations (Section III.A) and the problem representation (Section III.B).

A. Basic Notations

Notation 1. TS is an original test suite to be executed with n test cases, i.e., TS = {T_1, T_2, ..., T_n}. For instance, in the running example (Fig. 1), TS consists of six test cases.

Notation 2. TC is a set of p test cycles that have been executed, i.e., TC = {tc_1, tc_2, ..., tc_p}, such that one or more test cases are executed in each test cycle. For example, there are seven test cycles in the running example (Fig. 1), i.e., p = 7.

Notation 3. For a test case T_i executed in a test cycle tc_j, T_i has two possible verdicts, i.e., v_ij ∈ {pass, fail}. For example, in Fig. 1, the verdict of T1 is fail, while the verdict
of T2 is pass in tc_1. If a test case has not been executed in a test cycle, it has no verdict. For instance, in Fig. 1, T2 has no verdict (represented by NE) in tc_2. V(T_i) is a function that returns the verdict of a test case T_i.

B. Problem Representation

Let S represent all potential solutions for prioritizing TS to execute, S = {s_1, s_2, ..., s_q}, where q = n!. For instance, in the running example with six test cases (Fig. 1), the total number of solutions is q = 720. Each solution s_j is an order of test cases from TS, s_j = {T_j1, T_j2, ..., T_jn}, where T_ji refers to the test case that will be executed in the i-th position for s_j.

Test Case Prioritization (TP) Problem: The TP problem aims to find a solution s_j = {T_j1, T_j2, ..., T_jn} such that F(T_ji) >= F(T_jk), where i < k and 1 <= i <= n-1, and F(T_ji) is an objective function that represents the fault detection capability of the test case in the i-th position of the solution s_j.

Dynamic TP Problem: The goal of dynamic TP is to dynamically prioritize the test cases in the solution based on the runtime test case execution results so as to detect faults as soon as possible. For a solution s_j = {T_j1, T_j2, ..., T_jn}, the dynamic TP problem aims to update the execution order of the test cases in s_j to obtain a solution s_j' = {T_j1', ..., T_jn'} such that F(s_j') >= F(s_j), where F(s_j') is a function that returns the average percentage of faults detected for s_j'.

IV. APPROACH: REMAP

This section first presents an overview of our approach (named REMAP) for dynamic TP that combines rule mining and search (Section IV.A), followed by details of its three key components (Section IV.B - Section IV.D).

A. Overview of REMAP

Fig. 2 presents an overview of REMAP consisting of three key components: Rule Miner (RM), Static Prioritizer (SP), and Dynamic Executor and Prioritizer (DEP). The core of RM is to mine the historical execution results of test cases and produce a set of execution relations (Rules depicted in Fig. 2) among test cases, e.g., if T1 fails then T2 fails in Section II. Afterwards, SP takes the mined rules and the historical execution results of test cases as input to statically prioritize the test cases for execution. Finally, DEP executes the statically prioritized test cases obtained from SP one at a time and dynamically updates the order of the unexecuted test cases based on the runtime test case execution results (Fig. 2) in order to detect the fault(s) as soon as possible.

Fig. 2. Overview of REMAP (Rule Miner applies RIPPER to the historical execution data to produce fail rules and pass rules; Static Prioritizer applies multi-objective search to the rules and historical data; Dynamic Executor and Prioritizer executes and updates the test case order, and the executed test case results update the database).

B. Rule Miner (RM)

We first define two types of rules, i.e., fail rule and pass rule, for representing the execution relations among test cases. A fail rule specifies an execution relation between two or more test cases where the verdict of a test case (or test cases) is linked to the fail verdict of another test case. The fail rule is of the form:

(V(T_1) AND ... AND V(T_m)) --fail--> (V(T_k) = fail), where T_k not in {T_1, ..., T_m},

where V(T_i) is a function that returns the verdict of a test case T_i (i.e., pass or fail). Note that for a fail rule, the execution relation must hold true for the specific test cases with more than 90% probability (i.e., confidence), as is often used in the literature [29-32]. For example, in Fig. 1, there exists a fail rule between T1 and T3: (V(T1) = fail) --fail--> (V(T3) = fail), i.e., when T1 has the verdict fail (executed as fail), T3 always has the verdict fail, and this rule holds true with 100% probability in the historical execution data (Fig. 1). With fail rules, we aim to prioritize and execute the test cases that are more likely to fail as soon as possible. For instance, for the motivating example in Section II, T2 should be executed as early as possible when T1 is executed as fail.

A pass rule specifies an execution relation between two or more test cases where the verdict of a test case (or test cases) is linked to the pass verdict of another test case. The pass rule is of the form:

(V(T_1) AND ... AND V(T_m)) --pass--> (V(T_k) = pass), where T_k not in {T_1, ..., T_m},

where V(T_i) is a function that returns the verdict of test case T_i (i.e., pass or fail). Similarly, for a pass rule, the execution relation must hold true with more than 90% probability (i.e., confidence) [29-32]. For instance, in Fig. 1, there exists a pass rule between T1 and T3: (V(T1) = pass) --pass--> (V(T3) = pass), i.e., when T1 has the verdict pass (executed as pass), the verdict of T3 is pass, and this rule holds true with 100% probability based on the historical execution data (Fig. 1). With pass rules, we aim to deprioritize the test cases that are likely to pass and execute them as late as possible. For instance, for the motivating example, T3 should be executed as late as possible when T1 is executed as pass.

To mine fail rules and pass rules, our RM employs the Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithm [28]. We used RIPPER since it is one of the most commonly applied rule-based algorithms [33], and it has performed well for addressing different problems (e.g., software defect prediction and feature subset selection) [32, 34-37]. More specifically, RM takes as input a test suite with n test cases and historical test case execution data (a set of test cycles listing the verdicts of all the test cases). After that, RM produces a set of fail rules and pass rules as output. Fig. 3 demonstrates the overall process of RM.

Recall that there are two possible verdicts for each executed test case (Section III.A): pass or fail. For each test case in TS, RM uses the RIPPER algorithm to mine rules based on the historical execution data (i.e., TC) (lines 1-3 in Fig. 3). A detailed description of the RIPPER algorithm can be found in [28, 38]. Afterwards, the mined rules are classified into fail rules or pass rules one at a time (lines 4-8) until all of them are classified. If the verdict of the particular test case is fail, the rule related to this test case is classified as a fail rule (lines 5-6); otherwise, the rule is classified as a pass rule (lines 7 and 8 in Fig. 3).
!"#$ Fault Detection Capability (𝑭𝑫𝑪): 𝐹𝐷𝐶 for a test case is
are mined for 𝑇! using RIPPER: (𝑉(𝑇! ) = 𝑓𝑎𝑖𝑙) (𝑉(𝑇! ) =
!"## defined as the rate of failed execution of a test case in a given
𝑓𝑎𝑖𝑙) as a fail rule and (𝑉(𝑇! ) = 𝑝𝑎𝑠𝑠) (𝑉(𝑇! ) = 𝑝𝑎𝑠𝑠) time period, and it has been frequently applied in the literature
as a pass rule. This process of rule mining and classification is [13, 25]. The 𝐹𝐷𝐶 for a test case is calculated as: 𝐹𝐷𝐶 !" =
repeated for all the test cases in the test suite (lines 1-8 in Fig. !"#$%& !" !"#$% !!!" !! !"#$% ! !"#$%
3), and finally, the mined fail rule and pass rule are returned . 𝐹𝐷𝐶 of 𝑇! is calculated
!"#$%& !" !"#$% !!!" !! !"# !"!#$%!&
(line 9 in Fig. 3). based on the historical execution information of 𝑇! . For
instance, in Fig. 1, the 𝐹𝐷𝐶 for 𝑇! is 0.43 since it found fault
Input: Set of test cycles 𝑇𝐶 = {𝑡𝑐! , 𝑡𝑐! , … , 𝑡𝑐! } , set of test cases
three times out of seven executions. The 𝐹𝐷𝐶 for a solution 𝑠!
𝑇𝑆 = {𝑇! , 𝑇! , … , 𝑇! } ! !!!!!
Output: A set of fail rules and pass rules !!! !"#!" × !
can be calculated as: 𝐹𝐷𝐶 !" = , where, 𝑠!
Begin: !"#$
1 for (𝑇 ∈ 𝑇𝑆) do // for all the test cases represents any solution 𝑗 (e.g., {𝑇! , 𝑇! , 𝑇! , 𝑇! , 𝑇! , 𝑇! } in Fig. 1),
2 𝑅𝑢𝑙𝑒𝑆𝑒𝑡 ⟵ ∅ 𝑚𝑓𝑑𝑐 represents the sum of 𝐹𝐷𝐶 of all the test cases in 𝑠𝑗 , and
3 𝑅𝑢𝑙𝑒𝑆𝑒𝑡 ⟵ 𝑅𝐼𝑃𝑃𝐸𝑅(𝑇, 𝑇𝐶) // mine rules for 𝑇 using 𝑇𝐶 𝑛 is the total number of test cases in 𝑠! . Notice that a higher
4 for (𝑟𝑢𝑙𝑒 𝑖𝑛 𝑅𝑢𝑙𝑒𝑆𝑒𝑡) do // for each mined rule value of 𝐹𝐷𝐶 implies a better solution. The goal is to
5 if (𝑉 𝑇 = 𝑓𝑎𝑖𝑙) then // classify as fail rule maximize the 𝐹𝐷𝐶 of a solution since we aim to execute the
6 𝑓𝑎𝑖𝑙_𝑟𝑢𝑙𝑒 ⟵ 𝑓𝑎𝑖𝑙_𝑟𝑢𝑙𝑒 ∪ 𝑟𝑢𝑙𝑒 test cases that fail (i.e., detect faults) as early as possible.
7 else if (𝑉 𝑇 = 𝑝𝑎𝑠𝑠) then // classify as pass rule
8 𝑝𝑎𝑠𝑠_𝑟𝑢𝑙𝑒 ⟵ 𝑝𝑎𝑠𝑠_𝑟𝑢𝑙𝑒 ∪ 𝑟𝑢𝑙𝑒 Test Case Reliance Score (𝑻𝑹𝑺): The 𝑇𝑅𝑆 for a solution 𝑠!
! !!!!!
9 return (𝑓𝑎𝑖𝑙!"#$ , 𝑝𝑎𝑠𝑠!"#$ ) !!! !"#!" × !
End can be calculated as: 𝑇𝑅𝑆 !" = , where 𝑜𝑡𝑟𝑝𝑠
!"#$%
Fig. 3. Overview of rule miner. represents the sum of 𝑇𝑅𝑆 of all the test cases in 𝑠𝑗 , 𝑛 is the
In this way, a set of fail rules and pass rules can be total number of test cases in 𝑠! , and 𝑇𝑅𝑆!" is the test case
obtained from RM. For instance, in the motivating example reliance score (𝑇𝑅𝑆) for the test case 𝑇! . The 𝑇𝑅𝑆 for a test
(Section II), we can obtain four fail rules and three pass rules case 𝑇! is defined as the number of unique test cases whose
for the six test cases from the seven test cycles (Fig. 1) as results can be predicted by executing 𝑇! using the defined fail
shown in TABLE I. rules and pass rules extracted from RM (Section IV.B). For
TABLE I. FAIL RULE AND PASS RULE FROM THE MOTIVATING EXAMPLE instance in Fig. 1, 𝑇𝑅𝑆 for 𝑇! is 1 since the execution of 𝑇!
Fail Rule Pass Rule
can only be used to predict the execution result of 𝑇! based on
!"#$ !"## the rule number 1 and 5 (TABLE I) while 𝑇𝑅𝑆 for 𝑇! is 0 as
1. (𝑉(𝑇! ) = 𝑓𝑎𝑖𝑙) (𝑉(𝑇! ) = 𝑓𝑎𝑖𝑙) 5. (𝑉(𝑇! ) = 𝑝𝑎𝑠𝑠) (𝑉(𝑇! ) = 𝑝𝑎𝑠𝑠)
!"#$ !"## there is no test case that can be predicted based on the
2. (𝑉(𝑇! ) = 𝑓𝑎𝑖𝑙) (𝑉(𝑇! ) = 𝑓𝑎𝑖𝑙) 6. (𝑉(𝑇! ) = 𝑝𝑎𝑠𝑠) (𝑉(𝑇! ) = 𝑝𝑎𝑠𝑠) execution result of 𝑇! . The goal is to maximize the 𝑇𝑅𝑆 of a
!"#$ !"##
3. (𝑉(𝑇! ) = 𝑓𝑎𝑖𝑙) (𝑉(𝑇! ) = 𝑓𝑎𝑖𝑙) 7. (𝑉(𝑇! ) = 𝑝𝑎𝑠𝑠) (𝑉(𝑇! ) = 𝑝𝑎𝑠𝑠) solution since we aim to execute the test cases that can predict
!"#$
4. (𝑉(𝑇! ) = 𝑓𝑎𝑖𝑙) (𝑉(𝑇! ) = 𝑓𝑎𝑖𝑙) the execution results of other test cases as early as possible.
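To make the confidence criterion concrete, the sketch below checks candidate single-antecedent rules against an execution history such as Fig. 1 and keeps those holding with more than 90% confidence among the cycles where both test cases were executed. This is only an illustration of the rule form, not the RIPPER algorithm used by RM (RIPPER additionally grows and prunes rule sets, so it generally returns fewer rules than this naive check); the function and variable names are ours.

```python
# Naive confidence check for single-antecedent fail/pass rules.
# history[t][c] is 'F', 'P', or None (not executed) for test t in cycle c.

def mine_candidate_rules(history, min_conf=0.9, min_support=2):
    fail_rules, pass_rules = [], []
    for a in history:
        for b in history:
            if a == b:
                continue
            for verdict, bucket in (("F", fail_rules), ("P", pass_rules)):
                # cycles where the antecedent has `verdict` and b was executed
                both = [c for c, v in enumerate(history[a])
                        if v == verdict and history[b][c] is not None]
                if len(both) < min_support:
                    continue
                conf = sum(history[b][c] == verdict for c in both) / len(both)
                if conf > min_conf:
                    bucket.append((a, verdict, b))  # V(a)=verdict => V(b)=verdict
    return fail_rules, pass_rules

# Fig. 1 history (F = fail, P = pass, None = NE).
fig1 = {
    "T1": ["F", "P", "P", "F", "P", "F", "F"],
    "T2": ["P", None, "P", "F", "P", "F", "P"],
    "T3": ["F", "P", "P", "F", "P", None, "F"],
    "T4": ["P", "F", "P", "F", "P", "F", "P"],
    "T5": ["P", "F", "P", "P", "P", "P", "P"],
    "T6": ["F", None, None, "F", "F", "F", None],
}
fail_rules, pass_rules = mine_candidate_rules(fig1)
```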
C. Static Prioritizer (SP)

History-based TP techniques have been widely applied to prioritize test cases based on their historical failure data, i.e., test cases that failed more often should be given higher priorities [12, 13, 20, 39, 40]. For example, T1 and T6 failed the highest number of times (i.e., four times) in Fig. 1, and therefore they should be given the highest priorities for execution out of the six test cases. Moreover, we argue that the execution relations among the test cases should also be taken into account for TP. The test cases whose results can be used to predict the results of more test cases (using fail rules and pass rules) need to be given higher priorities, since their execution results can help predict the results of other test cases. For example, in Fig. 1, the execution result of T1 is related to one test case (T3), while the execution result of T6 is not related to any test case, as shown in TABLE I. Thus, it is ideal to execute T1 earlier than T6, since, based on the execution result of T1, the related test case (i.e., T3) can be prioritized (if T1 fails) or deprioritized (if T1 passes).

Thus, our Static Prioritizer (SP) defines two objectives, Fault Detection Capability (FDC) and Test Case Reliance Score (TRS), and uses multi-objective search to statically prioritize the test cases before execution (as shown in Fig. 2). We formally define the two objectives that guide the search toward finding optimal solutions below.

Fault Detection Capability (FDC): FDC for a test case is defined as the rate of failed executions of a test case in a given time period, and it has been frequently applied in the literature [13, 25]. The FDC of a test case is calculated as

$FDC_{T_i} = \frac{\text{number of times that } T_i \text{ found a fault}}{\text{number of times that } T_i \text{ was executed}}$.

FDC of T_i is calculated based on the historical execution information of T_i. For instance, in Fig. 1, the FDC of T4 is 0.43 since it found a fault three times out of seven executions. The FDC of a solution s_j can be calculated as

$FDC_{s_j} = \frac{\sum_{i=1}^{n} FDC_{T_{ji}} \times (n-i+1)}{mfdc}$,

where s_j represents any solution j (e.g., an order of the six test cases in Fig. 1), mfdc represents the sum of the FDC values of all the test cases in s_j, and n is the total number of test cases in s_j. Notice that a higher value of FDC implies a better solution. The goal is to maximize the FDC of a solution, since we aim to execute the test cases that fail (i.e., detect faults) as early as possible.

Test Case Reliance Score (TRS): The TRS of a solution s_j can be calculated as

$TRS_{s_j} = \frac{\sum_{i=1}^{n} TRS_{T_{ji}} \times (n-i+1)}{otrps}$,

where otrps represents the sum of the TRS values of all the test cases in s_j, n is the total number of test cases in s_j, and TRS_{T_i} is the test case reliance score (TRS) of the test case T_i. The TRS of a test case T_i is defined as the number of unique test cases whose results can be predicted by executing T_i, using the fail rules and pass rules extracted by RM (Section IV.B). For instance, in Fig. 1, the TRS of T1 is 1, since the execution of T1 can only be used to predict the execution result of T3 based on rules number 1 and 5 (TABLE I), while the TRS of T6 is 0, as there is no test case that can be predicted based on the execution result of T6. The goal is to maximize the TRS of a solution, since we aim to execute the test cases that can predict the execution results of other test cases as early as possible.

We further integrated these two objectives into the most widely used multi-objective search algorithm, Non-dominated Sorting Genetic Algorithm II (NSGA-II) [41], which has been widely applied in different contexts [42-45]. Moreover, we chose the widely used tournament selection operator [41] to select the individual solutions with the best fitness for inclusion in the next generation. The single point crossover operator is used to produce offspring solutions from the parent solutions by swapping some test cases between the parent solutions. The swap mutation operator is used to randomly change the order of test cases in the solutions based on a pre-defined mutation probability, e.g., 1/n(t), where n(t) represents the size of the test suite. Note that SP produces a set of non-dominated solutions based on Pareto optimality [46-48] with respect to the defined objectives. Pareto optimality states that a solution s_1 dominates another solution s_2 if s_1 is better than s_2 in at least one objective and not worse than s_2 in all other objectives [42]. Note that we randomly choose one solution from the generated non-dominated solutions as input for the component Dynamic Executor and Prioritizer (DEP), since all the solutions produced by SP have equivalent quality.
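The two objectives can be made concrete with the small sketch below. It computes FDC per test case from an execution history and TRS from a set of mined rules, and scores a candidate order with the position-weighted aggregation given above (our reading of the formulas: weight n-i+1 for position i, normalized by the sum of the per-test-case scores). All names are ours; this is not SP's jMetal implementation.

```python
# Sketch: computing the two SP objectives for a candidate test order.

def fdc(history, t):
    # Fraction of executed cycles in which t failed (FDC_{T_i}).
    runs = [v for v in history[t] if v is not None]
    return sum(v == "F" for v in runs) / len(runs) if runs else 0.0

def trs(rules, t):
    # Number of unique test cases predictable from t via the mined rules.
    # rules: iterable of (antecedent, verdict, consequent) tuples.
    return len({b for a, _, b in rules if a == t})

def weighted_objective(order, score):
    # Earlier positions get weight n-i+1, normalized by the score sum.
    n = len(order)
    total = sum(score(t) for t in order) or 1.0
    return sum(score(t) * (n - i) for i, t in enumerate(order)) / total

def evaluate(order, history, rules):
    return (weighted_objective(order, lambda t: fdc(history, t)),
            weighted_objective(order, lambda t: trs(rules, t)))
```

For the Fig. 1 data, fdc(fig1, "T4") evaluates to 3/7, i.e., the 0.43 quoted above; NSGA-II would then search over permutations maximizing both returned values.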
D. Dynamic Executor and Prioritizer (DEP)

The core of DEP is to execute the statically prioritized test cases obtained from SP (Section IV.C) and dynamically update the test case order using the fail rules and pass rules mined by RM (Section IV.B), as shown in Fig. 2. Thus, the input for DEP is a prioritization solution (i.e., the statically prioritized test cases) from SP and a set of mined rules (i.e., fail rules and pass rules) from RM. The overall process of DEP is presented in Fig. 4, and it is explained using the motivating example (Section II) in Fig. 5.

Input: Solution s_j = {T_j1, ..., T_jn}, pass rules PR, fail rules FR
Output: Dynamically prioritized test cases
Begin:
1   k <- 1
2   AF <- AlwaysFailing(s_j)   // set of test cases that always fail
3   AP <- AlwaysPassing(s_j)   // set of test cases that always pass
4   while (k <= n) and (termination_conditions_not_satisfied) do
5       TE <- Get-test-case-execution(s_j, AF, AP)
6       verdict <- execute(TE)  // execute the test case
7       if (verdict = 'pass') then
8           if (TE in PR) then  // if the test case has a pass rule
9               move the related test cases backward to the end of the solution
10      else if (verdict = 'fail') then
11          if (TE in FR) then  // if the test case has a fail rule
12              move the related test cases forward to execute next
13      s_final <- s_final U TE // dynamically prioritized solution
14      k <- k + 1
15      remove TE from s_j      // remove executed test case from s_j
16  return s_final
End
Fig. 4. Overview of dynamic executor and prioritizer.

Fig. 5. Example of dynamic test case execution and prioritization (initial prioritized solution from SP: {T1, T4, T3, T6, T2, T5}; after T6 is executed: {T6, T1, T4, T3, T2, T5}; after T1 is executed: {T6, T1, T3, T4, T2, T5}; after T3 and T4 are executed: {T6, T1, T3, T4, T5, T2}; after T5 and T2 are executed, the final prioritized solution from DEP is {T6, T1, T3, T4, T5, T2}).

Let us assume a prioritization solution {T1, T4, T3, T6, T2, T5} is obtained from SP for the test cases in the motivating example (Section II), as shown in Fig. 5. Moreover, seven fail rules and pass rules are extracted by RM, as shown in TABLE I. Initially, DEP checks whether there are any test case(s) that always had the verdict fail or pass in the historical execution data. If there exists any such test case(s), DEP adds them to the set AF (if the verdict is always fail) or the set AP (if the verdict is always pass) (lines 2 and 3 in Fig. 4). For example, in Fig. 1, T6 always had the verdict fail in the historical execution data, and thus T6 is added to AF.

Afterwards, DEP identifies the test cases to execute using the algorithm Get-test-case-execution (Algorithm 1). More specifically, Get-test-case-execution uses the statically prioritized solution, the set of always-fail test cases AF, and the set of always-pass test cases AP to find the test case to execute. If there exists any test case in AF (line 1 in Algorithm 1), it is selected for execution (line 2), removed from AF (line 3), and returned (line 12) to DEP (Fig. 4). For instance, initially, T6 is selected for execution from AF in Fig. 5. Afterward, the selected test case (i.e., T6) is executed (line 6 in Fig. 4), and based on the verdict of the executed test case, the pass rule or fail rule is employed (lines 7-12 in Fig. 4). Then, the executed test case is added to the final solution (line 13 in Fig. 4) and removed from the statically prioritized solution (line 15 in Fig. 4). If there exists no related test case(s), the next test case is simply selected for execution (line 5 in Fig. 4).

Algorithm 1: Get-test-case-execution
Input: Solution s_j = {T_j1, ..., T_jn}, set of always-fail test cases AF, set of always-pass test cases AP
Output: A test case for execution
Begin:
1   if (|AF| > 0) then
2       TE <- F(AF_1)           // get the first always-fail test case
3       remove AF_1 from AF     // remove the fail test case
4   else TE <- F(s_j1)          // if there is no test case in AF
5   move_T <- True              // initially test cases can be moved
6   first_T <- TE
7   while (TE in AP and move_T) do
8       move s_j1 to the end of s_j  // move the test case to the end
9       TE <- F(s_j1)
10      if (first_T = TE) then
11          move_T <- False     // do not move test cases anymore
12  return TE
End

To select the next test cases, Algorithm 1 is employed again. If there exists no more test case in AF, the first test case from the statically prioritized solution is selected for execution if it is not present in the set of always-pass test cases AP (lines 4-7 in Algorithm 1). However, if a test case is present in AP, the next test case is selected from the statically prioritized solution (lines 7-11 in Algorithm 1). For example, T1 is executed after T6 in Fig. 5 since it is the first test case of the statically prioritized solution obtained from SP and it is not included in AP. Based on the verdict of the test case (line 6 in Fig. 4), the pass rule or fail rule is employed once more (lines 7-12 in Fig. 4). If the test case has the verdict fail (line 10), it is checked whether it is part of any fail rule (line 11 in Fig. 4). If it is indeed part of a fail rule, the related fail test case(s) is moved to the front of the solution s_j (line 12), e.g., T3 is moved in front of T4 in Fig. 5 since there exists a fail rule between T1 and T3.

Alternatively, if the test case has the verdict pass, it is checked whether it is part of any pass rule (line 7 in Fig. 4). If so, the related test case(s) is moved to the end of the solution. For example, T2 is moved after T5 in Fig. 5 since T4 passed and there is a pass rule between T4 and T2, as shown in TABLE I. This process of executing and moving test cases is repeated until all the test cases are executed (e.g., all six test cases are executed in Fig. 5) or the termination criteria (e.g., a predefined time budget) of the algorithm are met. The final solution lists the order of the test cases that were executed. For example, the final execution order of these six test cases in Fig. 5 is {T6, T1, T3, T4, T5, T2}, with three failing test cases: T6, T1, and T3.
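The loop in Fig. 4 combined with Algorithm 1 can be paraphrased as the sketch below. It is a simplification under our assumptions: rules are single-antecedent (antecedent, verdict, consequent) tuples, the always-pass handling of Algorithm 1 is omitted for brevity, and run_test stands in for the real test execution; none of these names come from REMAP's implementation.

```python
# Sketch of the DEP loop: execute one test at a time, then move rule-related,
# not-yet-executed tests forward (fail rule) or to the end (pass rule).

def dep(order, fail_rules, pass_rules, always_fail, run_test):
    pending = list(order)
    af = [t for t in pending if t in always_fail]   # AF in Fig. 4
    final = []
    while pending:
        t = af.pop(0) if af else pending[0]
        pending.remove(t)
        verdict = run_test(t)                       # 'F' or 'P' at runtime
        related_fail = [b for a, _, b in fail_rules if a == t and b in pending]
        related_pass = [b for a, _, b in pass_rules if a == t and b in pending]
        if verdict == "F":
            # execute likely-failing related tests next
            pending = related_fail + [x for x in pending if x not in related_fail]
        else:
            # push likely-passing related tests to the end
            pending = [x for x in pending if x not in related_pass] + related_pass
        final.append(t)
    return final
```

Running this with the TABLE I rules, AF = {T6}, and the Fig. 5 verdicts reproduces the kind of reordering shown in Fig. 5, with the likely-failing test cases pulled to the front of the remaining order.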
Finally, at the end of the test cycle, the historical execution results of the test cases are updated based on the verdicts of the executed test cases, as shown in Fig. 2. Note that REMAP updates the mined rules after each test cycle rather than after executing each test case, since the mined rules would not change much if updated after every execution, and rule mining is computationally expensive [35, 49].

Note that REMAP includes the Static Prioritizer (i.e., SP in Section IV.C) before the Dynamic Executor and Prioritizer (i.e., DEP in Section IV.D) since we observed that not all test cases can be associated with fail rules and pass rules (Section IV.B). For instance, in the motivating example (Section II), the test case T5 is associated with no fail rules and pass rules (Section IV.B). Thus, as compared with randomly choosing a test case for execution when there is no mined rule to rely on, SP can help to improve the effectiveness of TP.

V. EMPIRICAL EVALUATION DESIGN

In this section, we present the experiment design (TABLE II) with research questions, case studies, evaluation metrics, and statistical tests (Section V.A - Section V.D).

TABLE II. OVERVIEW OF THE EXPERIMENT DESIGN
RQ | Comparison with | Case Study | Evaluation Metric | Statistical Test
1  | RS, Greedy      | CS1 - CS5  | APFD              | Vargha and Delaney A12, Mann-Whitney U test
2  | SSBP            | CS1 - CS5  | APFD              | Vargha and Delaney A12, Mann-Whitney U test
3  | RBP             | CS1 - CS5  | APFD              | Vargha and Delaney A12, Mann-Whitney U test

A. Research Questions

To evaluate the effectiveness of REMAP, we address the following research questions.

RQ1. Sanity check: Is REMAP better than random search (RS) and Greedy for the five case studies? This RQ is defined to check whether our TP problem is non-trivial to solve. We used RS and two types of Greedy algorithms: 1) G-1, which prioritizes the test cases based on the objective FDC (Section IV.C), such that test cases with high FDC are given high priority [50], and 2) G-2, which prioritizes the test cases based on the two objectives FDC and TRS, where the two objectives are combined with equal weight.

RQ2. Is REMAP better than the static search-based TP approach (SSBP), i.e., the static prioritization solutions obtained from SP (Section IV.C)? This RQ is defined to evaluate whether dynamic prioritization (DEP in Section IV.D) can indeed help to improve the effectiveness of TP as compared with TP approaches that do not consider runtime test case execution results.

RQ3. Is REMAP better than the rule-based TP approach (RBP), i.e., TP based only on the fail rules and pass rules from RM (Section IV.B)? Answering this RQ can help us to know whether SP together with DEP helps improve the effectiveness of TP as compared with a rule-based TP approach. Based on [27], we designed the rule-based TP approach (RBP) by applying the RM and DEP components. Note that the difference between RBP and our approach REMAP is that RBP uses the solutions produced by RS (from RQ1) as input rather than a statically prioritized solution from SP.

B. Case Studies

To evaluate REMAP, we selected a total of five case studies: two case studies for two different VCS products from Cisco (i.e., CS1 and CS2), two open source case studies from ABB Robotics for Paint Control (CS3) [20] and IOF/ROL (CS4) [20], and the Google Shared Dataset of Test Suite Results (GSDTSR) (CS5) [21]. For each case study, each test case execution result is linked to a particular test cycle, such that each test cycle is considered an occurrence of regression testing. More specifically, each case study contains historical execution data of the test cases for more than 300 test cycles, as shown in TABLE III. The historical execution data of the test cases are used to mine execution relations using RM (Section IV.B) and to calculate the FDC of each test case (Section IV.C). Note that the open source case studies CS3 - CS5 are publicly available at [51].

TABLE III. OVERVIEW OF THE CASE STUDIES USED FOR MINING RULES
CS  | Data Set          | #Test Cases | #Test Cycles | #Verdicts
CS1 | Cisco Data Set1   | 60          | 8,322        | 296,042
CS2 | Cisco Data Set2   | 624         | 6,302        | 149,039
CS3 | ABB Paint Control | 89          | 351          | 25,497
CS4 | ABB IOF/ROL       | 1,941       | 315          | 32,162
CS5 | Google GSDTSR     | 5,555       | 335          | 1,253,464

Note that if a test case has been executed more than once in a test cycle, we consider its latest verdict in the test cycle as its actual verdict when mining the rules using RM (Section IV.B). This is because, based on our industrial collaboration, we noticed that a test case can be executed more than once in a test cycle for various reasons, e.g., device problems (e.g., the need to restart) or network problems, and the test engineers only take the final verdict of the test case into account when debugging or reporting.
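This preprocessing step amounts to keeping only the last record per (test cycle, test case) pair before mining. A minimal sketch, assuming the raw data is a chronologically ordered list of (cycle_id, test_id, verdict) records (our assumed format, not the case studies' actual schema):

```python
# Keep only the latest verdict of each test case within each test cycle.

def latest_verdicts(executions):
    verdicts = {}                          # (cycle_id, test_id) -> verdict
    for cycle, test, verdict in executions:
        verdicts[(cycle, test)] = verdict  # later records overwrite earlier ones
    return verdicts
```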
RQ1. Sanity check: Is REMAP better than random search only take the final verdict of the test case into account when
(RS) and Greedy for the five case studies? debugging or reporting.
This RQ is defined to check if our TP problem is non-
trivial to solve. We used RS and two types of Greedy C. Experiment Tasks and Evaluation Metric
algorithms: 1) G-1, which prioritizes the test cases based on For the industrial case studies (i.e., CS1 and CS2), we
the objective 𝐹𝐷𝐶 (Section IV.C) such that the test case with dynamically prioritize and execute the test cases in VCSs for
high 𝐹𝐷𝐶 are given high priority [50] and 2) G-2, which the next test cycle (that has not been executed yet, e.g., test
prioritizes the test cases based on two objectives: 𝐹𝐷𝐶 and cycle 6,303 for CS2) and evaluate the six approaches: RS, G-1,
𝑇𝑅𝑆, where the two objectives are combined such that each
G-2, SSBP, RBP, and REMAP. However, for the open source
objective is given equal weight.
case studies (i.e., CS3 - CS5), we do not have access to the
RQ2. Is REMAP better than static search-based TP actual test cases to execute them as mentioned in Section V.B.
approach (SSBP), i.e., static prioritization solutions Thus, we use the historical test execution results without the
obtained from SP (Section IV.C)? latest test cycle for prioritizing the test cases and compare the
This RQ is defined to evaluate if dynamic prioritization performance of different approaches based on the actual test
(DEP in Section IV.D) can indeed help to improve the case execution results from the latest test cycle. For instance,
effectiveness of TP as compared with TP approaches without
for CS5, which has execution results for 336 test cycles [51],
considering runtime test case execution results.
we used historical results from 335 test cycles to prioritize test
RQ3. Is REMAP better than rule-based TP approach cases for the next test cycle (i.e., 336). More specifically, we
(RBP), i.e., TP only based on the fail rules and pass rules look at the execution result of each test case from the latest
from RM (Section IV.B)? test cycle to know if the verdict of a test case is fail or pass,
Answering this RQ can help us to know if SP together and accordingly, we apply the fail rules or pass rules. Note
with DEP can help improve the effectiveness of TP as that such a way of comparison has been applied in the
compared with rule-based TP approach. Based on [27], we literature when it is difficult to execute the test cases at
designed rule-based TP approach (RBP) by applying the RM runtime in practice [20, 27].
and DEP components. Note the difference between RBP and We employ the widely used Average Percentage of Fault
our approach REMAP is that RBP uses the solutions produced Detected (𝐴𝑃𝐹𝐷) metric as the evaluation metric to compare
by RS (from RQ1) as input rather than a statically prioritized the performance of different approaches as is widely used in
solution from SP. the literature. The 𝐴𝑃𝐹𝐷 [6, 7] metric is widely used to
TABLE IV. COMPARISON OF APFD FOR THE FIVE CASE STUDIES USING THE VARGHA AND DELANEY STATISTICS AND MANN-WHITNEY U TEST
RQ | Comparison      | CS1 (A12 / p-value) | CS2 (A12 / p-value) | CS3 (A12 / p-value) | CS4 (A12 / p-value) | CS5 (A12 / p-value)
1  | REMAP vs. RS    | 0.986 / <0.0001 | 0.999 / <0.0001 | 0.869 / <0.0001 | 0.999 / <0.0001 | 1 / <0.0001
1  | REMAP vs. G-1   | 0.993 / <0.0001 | 0.921 / <0.0001 | 1 / <0.0001     | 0.257 / <0.0001 | 1 / <0.0001
1  | REMAP vs. G-2   | 0.993 / <0.0001 | 1 / <0.0001     | 1 / <0.0001     | 0.917 / <0.0001 | 1 / <0.0001
2  | REMAP vs. SSBP  | 0.99 / <0.0001  | 0.951 / <0.0001 | 1 / <0.0001     | 0.967 / <0.0001 | 0.966 / <0.0001
3  | REMAP vs. RBP   | 0.881 / <0.0001 | 0.921 / <0.0001 | 0.784 / <0.0001 | 0.917 / <0.0001 | 0.943 / <0.0001

The APFD metric [6, 7] is widely used to measure the effectiveness of prioritized test cases in terms of detecting faults for a particular test cycle [8, 18, 19, 23]. APFD for a solution s_j can be calculated as

$APFD_{s_j} = 1 - \frac{\sum_{i=1}^{m} TF_i}{n \times m} + \frac{1}{2n}$,

where n denotes the number of test cases in the test suite TS, m denotes the number of faults detected by TS, and TF_i represents the position of the first test case in s_j that detects fault i. Note that the value of the APFD metric is between 0 and 1 (0-100%), and a higher APFD implies a higher fault detection rate.
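The metric is straightforward to compute once the fault-detecting positions are known. The sketch below assumes, as the paper does, that each failing test case corresponds to one fault; the function names are ours, and the worked value uses the Fig. 5 final order, where the first three executed test cases fail.

```python
# APFD for a prioritized order: n tests, m detected faults,
# TF_i = 1-based position of the first test case that detects fault i.

def apfd(order, failing_tests):
    n, m = len(order), len(failing_tests)
    positions = [order.index(t) + 1 for t in failing_tests]
    return 1 - sum(positions) / (n * m) + 1 / (2 * n)

# Fig. 5 final order with T6, T1, T3 failing:
# 1 - (1 + 2 + 3)/(6 * 3) + 1/12 = 0.75
print(apfd(["T6", "T1", "T3", "T4", "T5", "T2"], ["T6", "T1", "T3"]))
```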
D. Statistical Tests and Experiment Settings

1) Statistical Tests: Following the guidelines from [52], the Vargha and Delaney A12 statistic [53] and the Mann-Whitney U test [54] are used to statistically evaluate the results for the three research questions (i.e., RQ1-RQ3), as depicted in TABLE II. The Vargha and Delaney A12 statistic is a non-parametric effect size measure that evaluates the probability of one approach yielding higher values of the evaluation metric than another. The Mann-Whitney U test (p-value) is used to indicate whether the difference between the median values of the two samples is statistically significant. For two approaches A and B, A has a significantly better performance than B if A12 is higher than 0.5 and the p-value is less than 0.05. In our context, we use the Vargha and Delaney comparison without transformation [55], since we are interested in any improvement in the performance of the approach (e.g., REMAP) [42, 56].
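A compact way to read these two procedures is sketched below for two samples of APFD values (e.g., the 30 x 100 solutions per approach described later). It is an illustration only; the A12 implementation follows the standard pairwise-comparison definition, and the p-value comes from SciPy's Mann-Whitney U test, which we assume is acceptable as a stand-in for the tooling actually used in the study.

```python
# Sketch: Vargha-Delaney A12 effect size and Mann-Whitney U test
# for two samples of APFD values.
from itertools import product
from scipy.stats import mannwhitneyu

def a12(sample_a, sample_b):
    # Probability that approach A yields a higher value than B (ties count half).
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x, y in product(sample_a, sample_b))
    return wins / (len(sample_a) * len(sample_b))

def compare(apfd_a, apfd_b):
    _, p_value = mannwhitneyu(apfd_a, apfd_b, alternative="two-sided")
    return a12(apfd_a, apfd_b), p_value
```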
2) Experiment Settings: We used the WEKA data mining software [57] to implement the component RM (Section IV.B). The input data format used by our RM is a file composed of instances that contain test case verdicts, such that each instance in the file represents a test cycle. We employed the widely used Java framework jMetal [58] to implement the component SP (Section IV.C), and the standard settings were applied to configure the search algorithm (i.e., NSGA-II) [58], as is usually recommended [52]. More specifically, the population size is set to 100, the crossover rate is 0.9, the mutation rate is 1/(total number of test cases), and the maximum number of fitness evaluations (i.e., the termination criterion of the algorithm) is set to 50,000. For the case studies, we encoded each test suite to be prioritized in an abstract format (e.g., a JSON file), which contains the key information of the test cases for prioritization, e.g., test case id and FDC. Afterwards, NSGA-II is employed to produce the prioritized test suite in the same abstract format (e.g., a JSON file), which consists of a list of prioritized test cases. Finally, the prioritized test cases are selected from the original test suite using the test case id and put up for execution.

Each approach (except G-1 and G-2) was executed 30 times for each case study to account for the random variation of search algorithms [52], such that each run produced 100 solutions (the population size for SP). Therefore, a total of 3,000 solutions were compared for each approach (except G-1 and G-2) in each case study. Since G-1 and G-2 always produce the same solution [8, 59, 60], they have one value per case study. The experiments were executed on Mac OS Sierra with an Intel Core i7 2.8 GHz processor and 16 GB of RAM.

VI. RESULTS, ANALYSIS, AND OVERALL DISCUSSION

A. RQ1. Sanity Check (REMAP vs. RS, G-1, and G-2)

Recall that RQ1 aims to assess the effectiveness of REMAP as compared to RS, G-1, and G-2 using APFD (TABLE II). TABLE IV reports the results of comparing REMAP with RS, G-1, and G-2 in terms of APFD. Based on the results from TABLE IV, it can be observed that REMAP managed to significantly outperform both RS and G-2 for all the five case studies, and G-1 for four case studies, since the A12 values (the Vargha and Delaney statistic) are higher than 0.75 and the p-values are less than 0.0001.

Moreover, Fig. 6 presents the boxplots of APFD produced by REMAP and RS, with 3,000 values for each case study. Note that G-1 and G-2 are not presented in Fig. 6 since they produce only one value for each case study, as mentioned in Section V.D. TABLE V presents the average APFD produced by REMAP and RS and the actual APFD for G-1 and G-2. Based on Fig. 6 and TABLE V, we can observe that the median APFD produced by REMAP is much higher than RS and G-2 for all the five case studies, and G-1 for four case studies. Moreover, we can observe from TABLE V that on average REMAP achieved a better APFD of 33.4%, 14.5%, and 17.7% as compared to RS, G-1, and G-2, respectively.

TABLE V. APFD FOR DIFFERENT APPROACHES FOR THE FIVE CASE STUDIES
Approach | CS1    | CS2    | CS3    | CS4    | CS5
REMAP    | 76.86% | 76.07% | 73.35% | 93.24% | 94.02%
RS       | 48.54% | 49.82% | 48.66% | 49.69% | 49.7%
G-1      | 68.5%  | 42.1%  | 46.1%  | 95.5%  | 88.7%
G-2      | 62.5%  | 64.9%  | 22.47% | 86.69% | 88.5%
SSBP     | 65.38% | 67.19% | 41.11% | 79.54% | 85.8%
RBP      | 60.77% | 65.56% | 62.18% | 83.21% | 92.85%

Thus, we can answer RQ1: REMAP can significantly outperform RS, G-1, and G-2 for 93.3% of the comparisons (i.e., 14 out of 15) defined in Section V.A. Overall, on average REMAP achieved a better APFD of 21.9% (i.e., 33.4%, 14.5%, and 17.7%) compared to RS, G-1, and G-2.
Fig. 6. Boxplots of APFD produced by the different approaches for the five case studies (A1: REMAP, A2: RS, A3: SSBP, A4: RBP).

B. RQ2 (REMAP vs. SSBP)

This research question aims to assess the effectiveness of REMAP as compared to SSBP using the evaluation metric APFD. Based on the results from TABLE IV, we can observe that REMAP is able to perform significantly better than SSBP for all the five case studies. Additionally, Fig. 6 shows that the median APFD values produced by REMAP are much higher than SSBP for all the five case studies. Moreover, as compared to SSBP, REMAP achieved a better APFD of 14.9% across the five case studies, as shown in TABLE V.

Therefore, we can answer RQ2: REMAP can significantly outperform SSBP for all the five case studies, and on average REMAP had a better APFD of 14.9% as compared to SSBP.

C. RQ3 (REMAP vs. RBP)

Recall that this research question aims to compare the performance of REMAP with RBP for the five case studies using the evaluation metric APFD. It can be observed from TABLE IV that REMAP managed to significantly outperform RBP for all the five case studies. Additionally, from Fig. 6, we can observe that the median APFD values produced by REMAP are always higher than RBP. On average, REMAP managed to achieve a higher mean APFD of 9.8% for the five case studies as compared to RBP, based on the results from TABLE V.

Thus, we can answer RQ3: REMAP manages to significantly outperform RBP for all the five case studies, and on average REMAP achieved a better APFD of 9.8% as compared to RBP.

D. Discussion

For RQ1, REMAP significantly outperformed RS for all the five case studies, which implies that the TP problems are not trivial to solve and require an efficient TP approach. Moreover, REMAP performed better than G-1 for four case studies and G-2 for all the five case studies, which can be explained by the fact that G-1 and G-2 greedily prioritize the best test case one at a time in terms of the defined objectives until the termination conditions are met. However, G-1 and G-2 may get stuck in a local search space and result in sub-optimal solutions [59]. Moreover, REMAP dynamically prioritizes the test cases based on the execution results of the test cases using the mined rules defined in Section IV.B, which helps REMAP to prioritize and execute the failing test cases as soon as possible while executing the test cases less likely to find faults later. On the contrary, G-1 performed better than REMAP for CS4 because all the test cases that failed in CS4 had a high value of FDC (>0.9), and thus G-1 is more efficient in finding them, since it prioritizes the test cases based only on FDC. However, note that a higher FDC does not always imply a better APFD, as shown in TABLE V, where G-1 performs worse than even RS for the case studies CS2 and CS3.

For RQ2, REMAP manages to significantly outperform static search-based prioritization (SSBP) since it is essential to consider the runtime test case execution results in addition to the historical execution data when addressing the TP problem. More specifically, the mined fail rules (Section IV.B) help to prioritize the related test cases likely to fail (to execute them earlier), while the mined pass rules (Section IV.B) assist in deprioritizing the test cases likely to pass (to execute them later).

For RQ3, REMAP significantly outperforms rule-based prioritization (RBP). This can be explained by the fact that RBP has no heuristics to prioritize the initial set of test cases for execution, and certain randomness is introduced when there are no mined rules that can be used to choose the next test cases for execution. As compared with RBP, REMAP uses SP (Section IV.C) to prioritize the test cases and takes them as input for the Dynamic Executor and Prioritizer (DEP). Therefore, the randomness of selecting the test cases for execution can be reduced when no mined rules can be applied.

Moreover, the SP component (Section IV.C) employed multi-objective search to statically prioritize test cases, which took on average 1.8 seconds, 19.6 seconds, 2.3 seconds, 9 minutes, and 25.1 minutes for CS1, CS2, CS3, CS4, and CS5, respectively. The large difference in running time across the five case studies is due to the wide range of test suite sizes (i.e., 60 - 5,555, as shown in TABLE III). Based on the knowledge of VCS testing (Section II), a running time in seconds is acceptable in practice, and thus REMAP is applicable in terms of addressing the test case prioritization problem at Cisco.

Furthermore, Fig. 7 shows the percentage of faults detected when executing the test suite for the five case studies using the five approaches: REMAP, G-1, G-2, SSBP, and RBP. As shown in Fig. 7, REMAP manages to detect the faults faster than the other approaches for almost all the case studies. On average, REMAP only required 21% of the test cases to detect 81% of the faults for the five case studies, while G-1, G-2, SSBP, and RBP took 41%, 37%, 43%, and 36% of the test cases, respectively, to detect the same amount of faults as compared with REMAP.
Fig. 7. Percentage of total faults detected on executing the test suite for the five case studies using the five approaches.

For CS2, the mean APFD produced by G-1 is worse than RS (TABLE V) since the test cases that failed in CS2 had a very low value of FDC (<0.2), i.e., those test cases did not fail many times in the past executions. Additionally, for CS3, the mean APFD obtained by G-1, G-2, and SSBP is worse than RS (as shown in TABLE V) due to the fact that the test cases that failed in CS3 also had a low FDC (<0.5), and one test case that failed in CS3 had an FDC value of 0. Thus, SSBP, G-1, and G-2 are not able to prioritize that failing test case (with FDC 0), and it is executed very late (i.e., after executing more than 85% of the test suite), as shown in Fig. 7. However, REMAP and RBP still managed to find that failing test case by executing less than half of the test suite. This is due to the fact that the mined rules help to deprioritize the test cases likely to pass, and thus they are able to reach that test case faster. Therefore, we can conclude that REMAP is the most effective at detecting more faults by executing fewer test cases.

E. Threats to Validity

The threats to internal validity concern the internal parameters (e.g., algorithm parameters) that might influence the obtained results [61]. In our experiments, internal validity threats might arise due to experimenting with only one set of configuration settings for the algorithm parameters [62]. However, note that these settings are taken from the literature [63]. Moreover, to mitigate the internal validity threat due to the parameter settings of the rule mining algorithm (i.e., RIPPER), we used the default parameter settings that have performed well in the state of the art [34, 36, 57, 64].

Threats to construct validity arise when the measurement metrics do not sufficiently cover the concepts they are supposed to measure [8]. To mitigate this threat, we compared the different approaches using the Average Percentage of Faults Detected (APFD) metric, which has been widely employed in the literature [8, 18, 19, 23]. Another threat to construct validity is the assumption that each failed test case indicates a different failure in the system under test. However, the mapping between faults and test cases is not easily identifiable for either the two industrial case studies or the three open source ones, and thus we assumed that each failed test case is caused by a separate fault. Note that this assumption is made in much of the literature [9, 20, 27]. Moreover, we did not execute the test cases from the three open source case studies since we do not have access to the actual test cases. Therefore, we looked at the latest test cycle to obtain the test case results, as has been done in the existing literature when it is challenging to execute the test cases [20, 27].

The threats to conclusion validity relate to the factors that influence the conclusions drawn from the experiments [65]. In our context, a conclusion validity threat arises due to the use of randomized algorithms, which are responsible for random variation in the produced results. To mitigate this threat, we repeated each experiment 30 times for each case study to reduce the probability that the results were obtained by chance. Moreover, following the guidelines for reporting results of randomized algorithms [52], we employed the Vargha and Delaney A12 statistic as the effect size measure to determine the probability of yielding higher performance by different algorithms, and the Mann-Whitney U test for determining the statistical significance of the results.

The threats to external validity are related to the factors that affect the generalization of the results [61]. To mitigate this threat, we chose five case studies (two industrial ones and three open source ones) to evaluate REMAP empirically. It is also worth mentioning that such threats to external validity are common in empirical studies [42]. Another threat to external validity is due to the selection of a particular search algorithm. To reduce this threat, we selected the most widely used search algorithm (NSGA-II), which has been applied in different contexts [42-44].
VII. RELATED WORK B. Rule Mining for Regression Testing
There exists a large body of research on TP [7, 8, 18, 22, There are only a few works that focus on applying rule
24, 66, 67], and a broad view of the state-of-the-art on TP is mining techniques for TP [73, 74]. The authors in [73, 74]
presented in [17, 68]. The literature has presented different prioritization techniques, such as search-based [8, 24, 66] and linear-programming-based [23, 69, 70] ones. Since our approach REMAP is based on historical execution results and rule mining, we discuss the related work from these two angles.

A. History-Based TP
History-based prioritization techniques prioritize test cases based on their historical execution data, with the aim of executing the test cases most likely to fail first [12]. These techniques can be classified into two categories: static prioritization and dynamic prioritization. Static prioritization produces a static list of test cases that is not changed while the test cases are executed, whereas dynamic prioritization changes the order of the test cases at runtime.

Static TP: Most history-based prioritization techniques produce a static order of test cases [9, 12-14, 39, 71, 72]. For instance, Kim et al. [12] prioritized test cases based on historical execution data, taking the number of previous faults exposed by each test case as the key prioritizing factor. Elbaum et al. [9] used time windows over the test case execution history to identify how recently test cases were executed and detected failures, and prioritized accordingly. Park et al. [71] considered the execution cost of the test cases and the severity of the faults detected in the execution history. Wang et al. [16, 25] defined fault detection capability (𝐹𝐷𝐶) as one of the objectives for TP using multi-objective search. Compared with the above-mentioned works, REMAP differs in at least three ways: 1) REMAP defines two types of rules, fail rules and pass rules (Section IV.B), and mines them from historical test case execution data to support TP; 2) REMAP defines a new objective (i.e., the test case reliance score used by the component SP in Section IV.C) to measure to what extent executing a test case can be used to predict the execution results of other test cases, which is not the case in the existing literature; and 3) REMAP proposes a dynamic method to update the test case order based on the runtime test case execution results.
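To make these rule types concrete, the following Python sketch shows one plausible way to represent mined fail rules and pass rules and to derive a simple reliance-style count from them. The names (ExecutionRule, reliance_score) and the scoring logic are assumptions made for illustration only; they are not the exact data structures or the 𝑇𝑅𝑆 definition given in Sections IV.B and IV.C.

from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class ExecutionRule:
    """Hypothetical representation of a mined execution relation.

    kind == "fail": if `antecedent` fails, `consequent` is expected to fail.
    kind == "pass": if `antecedent` passes, `consequent` is expected to pass.
    """
    kind: str          # "fail" or "pass"
    antecedent: str    # test case whose verdict is observed
    consequent: str    # test case whose verdict is predicted
    confidence: float  # confidence of the mined rule

def reliance_score(test: str, rules: List[ExecutionRule]) -> int:
    """Toy reliance-style count: how many other test cases' verdicts this
    test case helps predict (i.e., it appears as a rule antecedent)."""
    return sum(1 for rule in rules if rule.antecedent == test)

# Hypothetical rules: a failure of TC_A predicts a failure of TC_B,
# and a pass of TC_C predicts a pass of TC_D.
rules = [
    ExecutionRule("fail", antecedent="TC_A", consequent="TC_B", confidence=0.95),
    ExecutionRule("pass", antecedent="TC_C", consequent="TC_D", confidence=0.90),
]
print({t: reliance_score(t, rules) for t in ["TC_A", "TC_B", "TC_C", "TC_D"]})

Intuitively, a test case that appears as the antecedent of many such rules is more informative to execute early, which is the idea the 𝑇𝑅𝑆 objective captures during static prioritization.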
Dynamic TP: Qu et al. [50] used historical execution data to statically prioritize test cases; afterwards, the runtime execution results of the test case(s) are used for dynamic prioritization based on relations among the test cases. To obtain these relations, they [50] group together the test cases that detected the same fault in the historical execution data, such that all the test cases in a group are related. Our work differs from [50] in at least three aspects: 1) REMAP uses rule mining to mine two types of execution rules among test cases (Section IV.B), which is not the case in [50]; 2) a sensitive constant needs to be set up manually to prioritize/deprioritize test cases in [50], whereas REMAP does not require such a setting; and 3) REMAP uses both 𝐹𝐷𝐶 and 𝑇𝑅𝑆 (Section IV.C) for static prioritization, unlike [50] where only 𝐹𝐷𝐶 is considered.
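As a rough illustration of this kind of runtime reordering, the Python sketch below promotes the test cases implied by fail rules when an executed test case fails and demotes those implied by pass rules when it passes. The queue handling and the promote/demote policy are simplifying assumptions for this sketch, not the actual Dynamic Executor and Prioritizer algorithm of REMAP.

from typing import Callable, Dict, List, Tuple

# Maps an observed test case to the test cases whose verdicts it predicts,
# e.g. fail_rules["TC_A"] == ["TC_B"] (hypothetical rule set).
RuleSet = Dict[str, List[str]]

def dynamic_execution(order: List[str],
                      run: Callable[[str], bool],
                      fail_rules: RuleSet,
                      pass_rules: RuleSet) -> List[Tuple[str, bool]]:
    """Execute an initially prioritized order and reorder the remaining
    queue after each verdict: a failure moves the related test cases to
    the front, a pass moves them to the back (assumed policy)."""
    queue = list(order)
    results: List[Tuple[str, bool]] = []
    while queue:
        test = queue.pop(0)
        passed = run(test)  # True = pass, False = fail
        results.append((test, passed))
        related = (pass_rules if passed else fail_rules).get(test, [])
        moved = [t for t in queue if t in related]
        others = [t for t in queue if t not in related]
        queue = (others + moved) if passed else (moved + others)
    return results

# Toy usage with hypothetical verdicts.
verdicts = {"TC_A": False, "TC_B": False, "TC_C": True, "TC_D": True}
print(dynamic_execution(["TC_A", "TC_C", "TC_B", "TC_D"],
                        run=lambda t: verdicts[t],
                        fail_rules={"TC_A": ["TC_B"]},
                        pass_rules={"TC_C": ["TC_D"]}))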
B. Rule Mining-Based TP
Rule mining has also been used to support TP [73, 74]. These studies [73, 74] modeled the system using the Unified Modeling Language (UML) and maintained a historical data store for the system. Whenever the system is changed, association rule mining is used to obtain the frequent pattern of affected nodes, which is then used for TP. Compared with these studies [73, 74], REMAP differs in at least two ways: 1) REMAP uses the execution results of the test cases to obtain fail rules and pass rules; and 2) REMAP dynamically updates the test order based on the test case execution results.

Another work [27] proposed a rule-mining-based technique for improving the effectiveness of the test suite for regression testing. More specifically, the authors use association rule mining on historical execution data to mine the execution relations between smoke test failures (smoke tests refer to a small set of test cases that are executed before the regression test cases) and test case failures. When the smoke tests fail, the related test cases are executed. Compared with this approach, REMAP has at least three key differences: 1) we aim at addressing the test case prioritization problem, while the test case order is not considered in [27]; 2) REMAP uses a search-based test case prioritization component to obtain the static order of the test cases before execution, whereas [27] executes a set of smoke test cases to select the test cases for execution; and 3) REMAP defines two sets of rules (i.e., fail rules and pass rules), while only fail rules are considered in [27].
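To give a flavour of how such execution relations can be obtained from historical verdicts, the Python sketch below counts co-occurring failures between pairs of test cases over past runs and keeps the pairs that exceed assumed support and confidence thresholds. REMAP's Rule Miner (Section IV.B) uses a proper rule-mining algorithm rather than this brute-force pairwise count, and the thresholds here are invented for the example.

from typing import Dict, List, Tuple

# Historical verdicts per run: True = pass, False = fail (hypothetical data).
History = List[Dict[str, bool]]

def mine_fail_rules(history: History,
                    min_support: int = 2,
                    min_confidence: float = 0.8) -> List[Tuple[str, str, float]]:
    """Return (antecedent, consequent, confidence) triples meaning: in past
    runs where `antecedent` failed, `consequent` also failed, filtered by
    the assumed support and confidence thresholds."""
    tests = sorted({t for run in history for t in run})
    rules = []
    for a in tests:
        a_failures = [run for run in history if run.get(a) is False]
        if len(a_failures) < min_support:
            continue
        for b in tests:
            if a == b:
                continue
            both = sum(1 for run in a_failures if run.get(b) is False)
            confidence = both / len(a_failures)
            if both >= min_support and confidence >= min_confidence:
                rules.append((a, b, confidence))
    return rules

history = [
    {"TC_A": False, "TC_B": False, "TC_C": True},
    {"TC_A": False, "TC_B": False, "TC_C": True},
    {"TC_A": True,  "TC_B": True,  "TC_C": False},
]
print(mine_fail_rules(history))  # [('TC_A', 'TC_B', 1.0), ('TC_B', 'TC_A', 1.0)]

Pass rules could be mined analogously by counting co-occurring passes.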
VIII. CONCLUSION
This paper introduces a novel rule mining and search-based dynamic test case prioritization approach (named REMAP) that has three key components (i.e., Rule Miner, Static Prioritizer, and Dynamic Executor and Prioritizer) with the aim of detecting faults earlier. REMAP was evaluated using five case studies: two industrial ones and three open source ones. The results showed that REMAP achieved, on average, an 18% higher Average Percentage of Faults Detected (𝐴𝑃𝐹𝐷) than random search, greedy based on fault detection capability (𝐹𝐷𝐶), greedy based on 𝐹𝐷𝐶 and test case reliance score (𝑇𝑅𝑆), static search-based prioritization, and rule-based prioritization (i.e., 33.4%, 14.5%, 17.7%, 14.9%, and 9.8% higher, respectively).
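For clarity, 𝐴𝑃𝐹𝐷 here follows the standard definition of Rothermel et al. [6]: for a prioritized suite of n test cases detecting m faults, where TF_i denotes the position of the first test case that reveals fault i,

\[
\mathit{APFD} = 1 - \frac{TF_1 + TF_2 + \cdots + TF_m}{n \times m} + \frac{1}{2n}
\]

so higher values indicate that faults are detected earlier in the execution order.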
In the future, we plan to compare REMAP with more test case prioritization approaches (e.g., linear-programming-based ones [23, 69, 70]) on additional case studies to further strengthen the evidence for the applicability of REMAP. Moreover, we want to involve test engineers from our industrial partner to deploy REMAP and assess its effectiveness in real industrial settings.

ACKNOWLEDGEMENT
This research was supported by the Research Council of Norway (RCN) funded Certus SFI. Shuai Wang is also supported by the RFF Hovedstaden funded MBE-CR project. Tao Yue and Shaukat Ali are also supported by the RCN funded Zen-Configurator project, the EU Horizon 2020 funded U-Test project, the RFF Hovedstaden funded MBE-CR project, and the RCN funded MBT4CPS project.
REFERENCES
[1] A. K. Onoma, W.-T. Tsai, M. Poonawala, and H. Suganuma, "Regression testing in an industrial environment," Communications of the ACM, vol. 41, no. 5, pp. 81-86. 1998.
[2] E. Borjesson and R. Feldt, "Automated system testing using visual GUI testing tools: A comparative study in industry," in Proceedings of the Fifth International Conference on Software Testing, Verification and Validation (ICST), pp. 350-359. IEEE, 2012.
[3] G. Rothermel and M. J. Harrold, "A safe, efficient regression test selection technique," ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 6, no. 2, pp. 173-210. 1997.
[4] C. Kaner, "Improving the maintainability of automated test suites," Software QA, vol. 4, no. 4, 1997.
[5] P. K. Chittimalli and M. J. Harrold, "Recomputing coverage information to assist regression testing," IEEE Transactions on Software Engineering, vol. 35, no. 4, pp. 452-469. 2009.
[6] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, "Prioritizing test cases for regression testing," IEEE Transactions on Software Engineering, vol. 27, no. 10, pp. 929-948. 2001.
[7] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, "Test case prioritization: An empirical study," in Proceedings of the International Conference on Software Maintenance, pp. 179-188. IEEE, 1999.
[8] Z. Li, M. Harman, and R. M. Hierons, "Search algorithms for regression test case prioritization," IEEE Transactions on Software Engineering, vol. 33, no. 4, pp. 225-237. IEEE, 2007.
[9] S. Elbaum, G. Rothermel, and J. Penix, "Techniques for improving regression testing in continuous integration development environments," in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 235-245. ACM, 2014.
[10] S. Elbaum, A. G. Malishevsky, and G. Rothermel, "Test case prioritization: A family of empirical studies," IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 159-182. 2002.
[11] W. E. Wong, J. R. Horgan, S. London, and H. Agrawal, "A study of effective regression testing in practice," in Proceedings of the Eighth International Symposium on Software Reliability Engineering, pp. 264-274. IEEE, 1997.
[12] J.-M. Kim and A. Porter, "A history-based test prioritization technique for regression testing in resource constrained environments," in Proceedings of the 24th International Conference on Software Engineering (ICSE), pp. 119-129. IEEE, 2002.
[13] A. Khalilian, M. A. Azgomi, and Y. Fazlalizadeh, "An improved method for test case prioritization by incorporating historical test case data," Science of Computer Programming, vol. 78, no. 1, pp. 93-116. 2012.
[14] T. B. Noor and H. Hemmati, "A similarity-based approach for test case prioritization using historical failure data," in Proceedings of the 26th International Symposium on Software Reliability Engineering (ISSRE), pp. 58-68. IEEE, 2015.
[15] S. Wang, S. Ali, and A. Gotlieb, "Minimizing test suites in software product lines using weight-based genetic algorithms," in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pp. 1493-1500. ACM, 2013.
[16] S. Wang, D. Buchmann, S. Ali, A. Gotlieb, D. Pradhan, and M. Liaaen, "Multi-objective test prioritization in software product line testing: an industrial case study," in Proceedings of the 18th International Software Product Line Conference, pp. 32-41. ACM, 2014.
[17] S. Yoo and M. Harman, "Regression testing minimization, selection and prioritization: a survey," Software Testing, Verification and Reliability, vol. 22, no. 2, pp. 67-120. Wiley Online Library, 2012.
[18] C. Henard, M. Papadakis, M. Harman, Y. Jia, and Y. Le Traon, "Comparing white-box and black-box test prioritization," in Proceedings of the 38th International Conference on Software Engineering (ICSE), pp. 523-534. IEEE, 2016.
[19] Y. Lu, Y. Lou, S. Cheng, L. Zhang, D. Hao, Y. Zhou, et al., "How does regression test prioritization perform in real-world software evolution?," in Proceedings of the 38th International Conference on Software Engineering (ICSE), pp. 535-546. ACM, 2016.
[20] H. Spieker, A. Gotlieb, D. Marijan, and M. Mossige, "Reinforcement learning for automatic test case prioritization and selection in continuous integration," in Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 12-22. ACM, 2017.
[21] S. Elbaum, A. Mclaughlin, and J. Penix. (2014). The Google Dataset of Testing Results. Available: https://code.google.com/p/google-shared-dataset-of-test-suite-results
[22] S. Elbaum, A. Malishevsky, and G. Rothermel, "Incorporating varying test costs and fault severities into test case prioritization," in Proceedings of the 23rd International Conference on Software Engineering, pp. 329-338. IEEE Computer Society, 2001.
[23] D. Hao, L. Zhang, L. Zang, Y. Wang, X. Wu, and T. Xie, "To be optimal or not in test-case prioritization," IEEE Transactions on Software Engineering, vol. 42, no. 5, pp. 490-505. 2016.
[24] Z. Li, Y. Bian, R. Zhao, and J. Cheng, "A fine-grained parallel multi-objective test case prioritization on GPU," in Proceedings of the International Symposium on Search Based Software Engineering, pp. 111-125. Springer, 2013.
[25] S. Wang, S. Ali, T. Yue, Ø. Bakkeli, and M. Liaaen, "Enhancing test case prioritization in an industrial setting with resource awareness and multi-objective search," in Proceedings of the 38th International Conference on Software Engineering Companion, pp. 182-191. 2016.
[26] J. A. Parejo, A. B. Sánchez, S. Segura, A. Ruiz-Cortés, R. E. Lopez-Herrejon, and A. Egyed, "Multi-objective test case prioritization in highly configurable systems: A case study," Journal of Systems and Software, vol. 122, pp. 287-310. 2016.
[27] J. Anderson, S. Salem, and H. Do, "Improving the effectiveness of test suite through mining historical data," in Proceedings of the 11th Working Conference on Mining Software Repositories, pp. 142-151. ACM, 2014.
[28] W. W. Cohen, "Fast effective rule induction," in Proceedings of the Twelfth International Conference on Machine Learning, pp. 115-123. Elsevier, 1995.
[29] W. Lin, S. A. Alvarez, and C. Ruiz, "Efficient adaptive-support association rule mining for recommender systems," Data Mining and Knowledge Discovery, vol. 6, no. 1, pp. 83-105. Springer, 2002.
[30] W. Lin, S. A. Alvarez, and C. Ruiz, "Collaborative recommendation via adaptive association rule mining," Data Mining and Knowledge Discovery, vol. 6, pp. 83-105. 2000.
[31] R. J. Bayardo Jr, "Brute-Force Mining of High-Confidence Classification Rules," in KDD, pp. 123-126. 1997.
[32] W. Shahzad, S. Asad, and M. A. Khan, "Feature subset selection using association rule mining and JRip classifier," International Journal of Physical Sciences, vol. 8, no. 18, pp. 885-896. 2013.
[33] B. Qin, Y. Xia, S. Prabhakar, and Y. Tu, "A rule-based classification algorithm for uncertain data," in Proceedings of the 25th International Conference on Data Engineering (ICDE), pp. 1633-1640. IEEE, 2009.
[34] B. Ma, K. Dejaeger, J. Vanthienen, and B. Baesens, "Software defect prediction based on association rule classification," 2011.
[35] A. Cano, A. Zafra, and S. Ventura, "An interpretable classification rule mining algorithm," Information Sciences, vol. 240, pp. 1-20. 2013.
[36] S. Vijayarani and M. Divya, "An efficient algorithm for generating classification rules," International Journal of Computer Science and Technology, vol. 2, no. 4, 2011.
[37] B. Seerat and U. Qamar, "Rule induction using enhanced RIPPER algorithm for clinical decision support system," in Proceedings of the Sixth International Conference on Intelligent Control and Information Processing (ICICIP), pp. 83-91. IEEE, 2015.
[38] P.-N. Tan, Introduction to data mining: Pearson Education India, 2006.
[39] D. Marijan, A. Gotlieb, and S. Sen, "Test case prioritization for continuous regression testing: An industrial case study," in IEEE International Conference on Software Maintenance (ICSM), pp. 540-543. IEEE, 2013.
[40] D. Pradhan, S. Wang, S. Ali, T. Yue, and M. Liaaen, "STIPI: Using Search to Prioritize Test Cases Based on Multi-objectives Derived from Industrial Practice," in Proceedings of the International Conference on Testing Software and Systems, pp. 172-190. Springer, 2016.
[41] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182-197. IEEE, 2002.
[42] F. Sarro, A. Petrozziello, and M. Harman, "Multi-objective software effort estimation," in Proceedings of the 38th International Conference on Software Engineering, pp. 619-630. ACM, 2016.
[43] A. S. Sayyad, T. Menzies, and H. Ammar, "On the value of user preferences in search-based software engineering: a case study in software product lines," in Proceedings of the 35th International Conference on Software Engineering, pp. 492-501. IEEE, 2013.
[44] J. Sadeghi, S. Sadeghi, and S. T. A. Niaki, "A hybrid vendor managed inventory and redundancy allocation optimization problem in supply chain management: An NSGA-II with tuned parameters," Computers & Operations Research, vol. 41, pp. 53-64. 2014.
[45] D. Pradhan, S. Wang, S. Ali, and T. Yue, "Search-Based Cost-Effective Test Case Selection within a Time Budget: An Empirical Study," in Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1085-1092. ACM, 2016.
[46] N. Srinvas and K. Deb, "Multi-objective function optimization using non-dominated sorting genetic algorithms," Evolutionary Computation, vol. 2, no. 3, pp. 221-248. MIT Press, 1994.
[47] E. Zitzler, K. Deb, and L. Thiele, "Comparison of multiobjective evolutionary algorithms: Empirical results," Evolutionary Computation, vol. 8, no. 2, pp. 173-195. MIT Press, 2000.
[48] A. S. Sayyad and H. Ammar, "Pareto-optimal search-based software engineering (POSBSE): A literature survey," in Proceedings of the 2nd International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering, pp. 21-27. IEEE, 2013.
[49] A. Veloso, W. Meira Jr, and M. J. Zaki, "Lazy associative classification," in Sixth International Conference on Data Mining (ICDM), pp. 645-654. IEEE, 2006.
[50] B. Qu, C. Nie, B. Xu, and X. Zhang, "Test case prioritization for black box testing," in Proceedings of the 31st Annual Computer Software and Applications Conference (COMPSAC), pp. 465-474. IEEE, 2007.
[51] Supplementary material. Available: https://remap-ICST.netlify.com
[52] A. Arcuri and L. Briand, "A practical guide for using statistical tests to assess randomized algorithms in software engineering," in Proceedings of the 33rd International Conference on Software Engineering (ICSE), pp. 1-10. IEEE, 2011.
[53] A. Vargha and H. D. Delaney, "A critique and improvement of the CL common language effect size statistics of McGraw and Wong," Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101-132. 2000.
[54] H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," The Annals of Mathematical Statistics, pp. 50-60. JSTOR, 1947.
[55] G. Neumann, M. Harman, and S. Poulding, "Transformed vargha-delaney effect size," in Proceedings of the International Symposium on Search Based Software Engineering, pp. 318-324. Springer, 2015.
[56] W. Martin, F. Sarro, and M. Harman, "Causal impact analysis for app releases in google play," in Proceedings of the 24th International Symposium on Foundations of Software Engineering, pp. 435-446. ACM, 2016.
[57] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical machine learning tools and techniques: Morgan Kaufmann, 2016.
[58] J. J. Durillo and A. J. Nebro, "jMetal: A Java framework for multi-objective optimization," Advances in Engineering Software, vol. 42, no. 10, pp. 760-771. Elsevier, 2011.
[59] S. Tallam and N. Gupta, "A concept analysis inspired greedy algorithm for test suite minimization," ACM SIGSOFT Software Engineering Notes, vol. 31, no. 1, pp. 35-42. 2006.
[60] P. Baker, M. Harman, K. Steinhofel, and A. Skaliotis, "Search based approaches to component selection and prioritization for the next release problem," in Proceedings of the 22nd IEEE International Conference on Software Maintenance, pp. 176-185. IEEE, 2006.
[61] P. Runeson, M. Host, A. Rainer, and B. Regnell, Case study research in software engineering: Guidelines and examples: John Wiley & Sons, 2012.
[62] M. de Oliveira Barros and A. Neto, "Threats to Validity in Search-based Software Engineering Empirical Studies," UNIRIO-Universidade Federal do Estado do Rio de Janeiro, technical report, vol. 6, 2011.
[63] A. Arcuri and G. Fraser, "On parameter tuning in search based software engineering," in Proceedings of the 3rd International Symposium on Search Based Software Engineering, pp. 33-47. Springer, 2011.
[64] K. Song and K. Lee, "Predictability-based collective class association rule mining," Expert Systems with Applications, vol. 79, pp. 1-7. 2017.
[65] C. Wohlin, P. Runeson, M. Host, M. Ohlsson, B. Regnell, and A. Wesslen, Experimentation in software engineering: An introduction: Kluwer Academic Publishers, 2000.
[66] K. R. Walcott, M. L. Soffa, G. M. Kapfhammer, and R. S. Roos, "Timeaware test suite prioritization," in Proceedings of the International Symposium on Software Testing and Analysis, pp. 1-12. ACM, 2006.
[67] D. Pradhan, S. Wang, S. Ali, T. Yue, and M. Liaaen, "CBGA-ES: A Cluster-Based Genetic Algorithm with Elitist Selection for Supporting Multi-Objective Test Optimization," in IEEE International Conference on Software Testing, Verification and Validation, pp. 367-378. 2017.
[68] C. Catal and D. Mishra, "Test case prioritization: a systematic mapping study," Software Quality Journal, vol. 21, no. 3, pp. 445-478. 2013.
[69] L. Zhang, S.-S. Hou, C. Guo, T. Xie, and H. Mei, "Time-aware test-case prioritization using integer linear programming," in Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, pp. 213-224. ACM, 2009.
[70] S. Mirarab, S. Akhlaghi, and L. Tahvildari, "Size-constrained regression test case selection using multicriteria optimization," IEEE Transactions on Software Engineering, vol. 38, no. 4, pp. 936-956. 2012.
[71] H. Park, H. Ryu, and J. Baik, "Historical value-based approach for cost-cognizant test case prioritization to improve the effectiveness of regression testing," in Proceedings of the Secure System Integration and Reliability Improvement, pp. 39-46. IEEE, 2008.
[72] Y. Fazlalizadeh, A. Khalilian, M. Abdollahi Azgomi, and S. Parsa, "Incorporating historical test case performance data and resource constraints into test case prioritization," Tests and Proofs, pp. 43-57. 2009.
[73] P. Mahali, A. A. Acharya, and D. P. Mohapatra, "Test Case Prioritization Using Association Rule Mining and Business Criticality Test Value," in Computational Intelligence in Data Mining - Volume 2, ed: Springer, 2016, pp. 335-345.
[74] A. A. Acharya, P. Mahali, and D. P. Mohapatra, "Model based test case prioritization using association rule mining," in Computational Intelligence in Data Mining - Volume 3, ed: Springer, 2015, pp. 429-440.