4.1 Introduction
The recent trend of Evidence-based Software Engineering (EBSE) has drawn special attention to comparative analysis, as such analysis forms the basis of all EBSE. With the tremendous growth of research in the field of computer science, it has become essential to keep track of the results of all research pertaining to a specific field of computer science. Recent research has shown that a periodic systematic review of a research area helps to keep an individual up-to-date with the current state of knowledge related to that area [7, 15, 51, 60, 91]. Such reviews also point to weaker areas where the empirical evidence is insufficient and further studies are required.
A comparative analysis of a particular research area is always performed with specific objectives. In the current study, the research publications related to regression test case prioritization published in various journals/magazines or presented in different conferences/workshops have been considered. Only publications written in English have been considered for the analysis; hence, the potential loss of some valuable information published in a language other than English is admitted.
To achieve this objective, the comparative analysis was performed to obtain adequate answers to the research questions of the study.
The main test case prioritization techniques have been described and discussed in the literature survey (Chapter 2). These techniques were further studied as groups of sub-techniques. For example, the coverage-based prioritization technique was studied as total statement coverage, additional statement coverage, total branch coverage, additional branch coverage, total function coverage, additional function coverage, and so on. In the study, 35 different techniques contributing to regression test case prioritization have been identified (Table 4.1). The techniques have been labeled as RTP-n, where RTP stands for Regression Test-case Prioritization and n is a technique number assigned randomly. This label simply serves as an index for a technique and does not reflect any characteristic of that technique. Table 4.1 contains a brief description of these techniques with their respective origins.
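To make the "total" versus "additional" coverage sub-techniques mentioned above concrete, the following is a minimal sketch of both greedy orderings. The test names and coverage sets are invented for illustration and do not come from the study.

```python
# Hypothetical illustration of greedy "total" vs "additional" statement-
# coverage prioritization. Test names and coverage data are invented.

def prioritize_total(coverage):
    """Order tests by the total number of statements each covers (descending)."""
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

def prioritize_additional(coverage):
    """Greedily pick the test covering the most not-yet-covered statements;
    reset the covered set once no remaining test adds new coverage."""
    remaining = dict(coverage)
    covered, order = set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:  # no new coverage possible: reset
            covered = set()
            continue
        order.append(best)
        covered |= remaining.pop(best)
    return order

coverage = {
    "t1": {1, 2, 3, 4},  # statements covered by each test case
    "t2": {1, 2},
    "t3": {5, 6, 7},
    "t4": {4, 5},
}
print(prioritize_total(coverage))       # ['t1', 't3', 't2', 't4']
print(prioritize_additional(coverage))  # t1 first, then t3 (most new statements)
```

The same greedy scheme applies unchanged at branch or function granularity by substituting the covered entities.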
Table 4.1: Test case prioritization techniques
Label Name Brief Description Origin
4.4 Subject programs
(i) Siemens Suite programs are small C programs available at SIR [163]: tcas (airplane collision avoidance system), printtokens (lexical analyzer), printtokens2 (lexical analyzer), schedule (priority scheduler), schedule2 (priority scheduler), replace (pattern recognition and substitution system), and totinfo (computes statistics of given input data).
(ii) SIR UNIX utilities are a set of C programs used as utilities in the UNIX operating system: flex, grep, gzip, sed, vim, and bash. The programs are freely available for download at SIR [163].
(iii) SIR Java programs are small, medium, and large sized Java programs available at SIR [163]. These programs are ant, jmeter, jtopas, xml-security, nanoxml, and siena.
4.5 Software Metrics used in studies
The metrics used for the empirical studies in the test case prioritization area can be divided into two categories: metrics related to the cost reduction of testing, and metrics related to the fault detection ability of a test suite. Many metrics have been proposed that fall into either of these two categories.
(vi) Coverage Cost (CC) [4] determines the percentage of program statements executed by the prioritized test suite in the allotted time frame.
(vii) Testing Importance of Module (TIM) [105] determines the severe-fault proneness of the module tested by a test case.
(viii) Total Severity of Faults Detected (TSFD) [160] is the summation of the severity values of all faults identified for a system.
(ix) Average Severity of Faults Detected (ASFD) [160] for a software requirement is calculated as the ratio of the summation of the severity values of the faults identified for that requirement to the TSFD.
(x) Weighted Percentage of Faults Detected (WPFD) [160] is the ratio of the TSFD to the percentage of the test suite executed.
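The severity-based metrics TSFD and ASFD can be sketched directly from their definitions above. The fault identifiers, severity values, and requirement mapping below are invented for the sketch.

```python
# Sketch of the severity-based metrics TSFD and ASFD.
# Fault names, severities, and the requirement mapping are hypothetical.

fault_severity = {"f1": 8, "f2": 4, "f3": 2, "f4": 2}  # severity per fault
faults_per_requirement = {"R1": ["f1", "f2"], "R2": ["f3", "f4"]}

# TSFD: summation of severity values of all faults identified for the system.
tsfd = sum(fault_severity.values())

# ASFD for a requirement: sum of severities of its faults divided by TSFD.
asfd = {req: sum(fault_severity[f] for f in faults) / tsfd
        for req, faults in faults_per_requirement.items()}

print(tsfd)        # 16
print(asfd["R1"])  # 0.75
```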
The use of these metrics in the area of regression test case prioritization has been shown in Table 4.3. The metric APFD has been widely used in most of the empirical and case studies. It also served as the base for the emergence of other metrics such as APBC and APFDc. Most of the metrics require a subject program with known/seeded fault information. This is the main hindrance to generalizing the results obtained using these metrics, because in a real testing scenario the fault information may not be available in advance.
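Since APFD dominates the studies surveyed, a minimal sketch of its standard formulation may help: APFD = 1 - (TF1 + ... + TFm)/(n*m) + 1/(2n), where n is the number of tests, m the number of faults, and TFi the 1-based position of the first test that exposes fault i. The fault matrix below is invented for illustration.

```python
# Minimal sketch of the APFD metric; the fault-detection data is hypothetical.

def apfd(order, detects):
    """order: prioritized list of test names.
    detects: test name -> set of faults that the test exposes."""
    faults = set().union(*detects.values())
    n, m = len(order), len(faults)
    first = {}  # fault -> position of the first test that exposes it
    for pos, test in enumerate(order, start=1):
        for fault in detects[test]:
            first.setdefault(fault, pos)
    return 1 - sum(first.values()) / (n * m) + 1 / (2 * n)

detects = {"t1": {"f1"}, "t2": {"f1", "f2"}, "t3": {"f3"}}
print(apfd(["t2", "t3", "t1"], detects))  # ≈0.722 (faults exposed early)
print(apfd(["t1", "t3", "t2"], detects))  # ≈0.5   (f2 exposed last)
```

An ordering that exposes all faults early yields a higher APFD, which is exactly what prioritization techniques try to maximize.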
Table 4.3: Metrics used in the studies
S.N. Metrics Publication (%) Input Requirements
1. APBC 6 Source Code with known faults
2. APFD 58 Source Code with known faults
3. APFDc 12 Source Code with known faults
4. APSCTA 10 Source Code
5. ASFD 1 Source Code with known faults
6. CC 6 Source Code
7. NAPFD 3 Source Code with known faults
8. TIM 2 System Design
9. TSFD 1 Source Code with known faults
10. WPFD 1 Source Code with known faults
A study conducted by Andrew et al. [2] found that although mutants serve as a good basis for testing, it is very hard to develop mutants with a proper fault distribution. The locations of actual faults depend on many factors, including programming paradigms and developer attitude [41, 189]. It is difficult to predict where faults will occur.
Some of the techniques have never been compared with the other techniques. Such techniques have been represented as isolated circles. RTP-26, RTP-28, RTP-30, RTP-31, and RTP-35 are examples of such isolated techniques. Some techniques, like RTP-2, RTP-4, and RTP-6, have been involved in the maximum number of comparative studies. Such comparisons indicate the high usability of these techniques for cost reduction or fault detection. In some studies [58, 59, 71, 148, 149, 151], it has been observed that, although technique RTP-3 (optimal) is theoretically the most efficient at maximizing fault detection or code coverage, the large pool of test cases makes it expensive to study all permutation sets of the original test suite to determine the optimal set. A test suite with n test cases has n! permutation sets, a number that grows too quickly for exhaustive evaluation to be practical.
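The factorial growth can be seen directly; the tiny suite below is hypothetical.

```python
# Why brute-forcing the optimal ordering does not scale: an n-test suite
# has n! permutations to evaluate.
import math
from itertools import permutations

for n in (5, 10, 20):
    print(n, math.factorial(n))  # 120, 3628800, 2432902008176640000

# Enumerating all orderings is feasible only for tiny suites:
tiny_suite = ["t1", "t2", "t3"]
print(len(list(permutations(tiny_suite))))  # 6 orderings to evaluate
```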
[Figure: comparison graph of the RTP techniques; edges connect techniques that have been compared empirically, and isolated circles mark techniques never compared with others.]
The results obtained for one subject program may not hold for other programs. The results also showed that for some subject programs, coarse-granularity techniques (md-tot, md-add, etc.) outperform fine-granularity techniques (st-tot, st-add, etc.).
4.7 Analysis
In this comparative study, some results have been identified that are relevant to the parameters of comparison stated in section 4.2. The main findings of this analysis are presented below:
1) It was found that there exist at least 35 different test case prioritization techniques. Except for five techniques, all have been compared through empirical or case studies. The results showed that some of the techniques have been involved in enough studies to make them the most common and usable test case prioritization techniques. The techniques that have not been compared empirically are either emerging techniques or have different domains of application. For example, it is difficult to compare a technique (RTP-30) that requires a software model (finite state machine, UML diagrams) with a technique (RTP-6) that needs the source code of software with known faults.
2) The present study revealed that approximately 60% of the studies used APFD as the metric for comparing the effectiveness of different techniques. Most of the other metrics (APBC, APSC, etc.) use APFD as their base and serve as variants of it. The metric TIM has been used in a few studies and provided encouraging results. The main advantage of this metric is that, unlike APFD and its variants, it does not require the faulty source code of the subject software. This metric can be used to evaluate a technique where only the software design is available as input. Unfortunately, no study in the reviewed literature gave any indication of which metric is most suitable for determining the effectiveness of a testing technique.
3) The subject programs used in the studies were mainly taken from SIR [163]. In general, the size of the subject programs considered in the empirical studies ranges from 100 LOC to approximately 18000 KLOC. More than 80% of the programs used in the literature have been written in three languages: C, C++, and Java. The programs written in these languages were either structured or object-oriented programs. Most of the techniques have been evaluated through the small Siemens Suite C programs (maximum size 726 LOC). Unfortunately, no study concluded whether the evaluation results obtained with small programs carry over to studies with large programs. The empirical evaluations of most of the techniques have been performed through mutants. It is very difficult to generalize the results of such empirical studies, as mutants cannot be as real as actual programs. The results of the present studies indicated that the effectiveness of a testing technique is program specific. A testing technique may be useful for one type of software and not equally effective for another [58]. The current studies did not discuss how component-based software (where source code is not available) could be tested with the available testing techniques. It has been found that web applications [48, 49, 78, 137] and GUI-based applications [20, 84, 112, 113, 114, 115, 116] have rarely been used as subject programs for validating test case prioritization techniques. Similarly, database software has appeared as a topic of interest for regression testing [35, 72, 73, 178] but has never appeared as the SUT for test case prioritization. The study could not draw conclusions about the software-specific suitability of the available testing techniques.
4) The different techniques were compared through experimental or case studies. The results of these comparisons varied with the structure and size of the SUT. Although no direct inference was drawn about the superiority of any of the mentioned techniques, the comparisons hint at the superiority of techniques with 'add' surrogates over techniques with 'tot' surrogates. A further general inference is that the granularity level of a technique plays a significant role in its effectiveness. Fine-granularity techniques (statement level) generally lead to better prioritization results than coarse-granularity techniques (method level), though some studies have shown contradictory results [47, 59] for specific subject programs. No general solution has been put forward that could adequately meet the complexity of the system, changing requirements, and diversity of applications.
5) One of the limitations of the current state of test case prioritization for regression testing is that most of the techniques are evaluated through controlled experimental studies with small subject programs. Most of these programs are mutants with known/seeded faults, injected in such a way that they look real. The current analysis could not find any publication that scaled the results of techniques from controlled experiments to large uncontrolled software. The subject programs used in the literature are mainly coded in C/C++ or Java. The studies do not draw any relationship between the consistency of results from controlled experiments and other types of applications (database, web, GUI, and programs coded in other languages). It has been observed that techniques with feedback (st-add) perform slightly better than the parallel techniques without feedback (st-tot). No technique, in general, has been proved superior to the others, as the performance of a technique was found to be highly software specific. A significant variance in the results of these techniques was found with changes in the type of software and the granularity level.
6) It is difficult to classify the different techniques on the basis of their performance. Techniques that are efficient for one type of software provided imprecise results for other types of software requirements. The techniques have been compared through different metrics; hence, their effectiveness cannot be guaranteed under all conditions. The techniques, in general, can be divided into two broad classes: safe and unsafe testing techniques. A safe testing technique generates a prioritized test suite that has the same fault detection potential as the original test suite; an example of such a technique is retest-all. An unsafe testing technique generally generates a prioritized test suite with lower fault detection capability than the original test suite. The low performance of unsafe testing techniques may be due to the omission of potentially fault-revealing test cases while forming the prioritized test suite.
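The safe/unsafe distinction above can be sketched as a simple check: a (possibly reduced) prioritized suite is safe when it still detects every fault the original suite detects. The fault-exposure data below is invented for illustration.

```python
# Hypothetical check of the "safe" property: the prioritized suite must
# retain the fault-detection potential of the original suite.

def detected(suite, exposes):
    """Union of faults exposed by the tests kept in `suite`."""
    return set().union(*(exposes[t] for t in suite)) if suite else set()

def is_safe(original, prioritized, exposes):
    """True iff the prioritized suite detects every fault
    that the original suite detects."""
    return detected(prioritized, exposes) >= detected(original, exposes)

exposes = {"t1": {"f1"}, "t2": {"f2"}, "t3": {"f1", "f2"}}
print(is_safe(["t1", "t2", "t3"], ["t3"], exposes))  # True: f1 and f2 kept
print(is_safe(["t1", "t2", "t3"], ["t1"], exposes))  # False: f2 lost
```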
4.8 Conclusions
These programs have been written in C/C++ and Java. The studies indicated that the adequacy of a testing technique depends on the SUT.