4.1 Introduction
The recent trend of Evidence-based Software Engineering (EBSE) has drawn special attention to comparative analysis, as such analysis forms the basis of all EBSE. With the tremendous growth of research in the field of computer science, it has become essential to keep track of the results of all research pertaining to a specific field of computer science. Recent research has shown that a periodic systematic review of a research area helps to keep an individual up-to-date with the current state of knowledge related to that area [7, 15, 51, 60, 91]. Such reviews also point to weaker areas where the empirical evidence is insufficient and further studies are required.
A comparative analysis of a particular research area is always performed with specific objectives. In the current study, the research publications related to regression test case prioritization published in various journals/magazines or presented in different conferences/workshops have been considered. Only publications written in English have been considered for the analysis; hence, the potential loss of some valuable information published in a language other than English is admitted.
To achieve this objective, the comparative analysis was performed to obtain adequate answers to the research questions of the study.
The main test case prioritization techniques have been described and discussed in the literature survey (Chapter 2). These techniques were further studied as groups of sub-techniques. For example, the coverage-based prioritization technique was studied as total statement coverage, additional statement coverage, total branch coverage, additional branch coverage, total function coverage, additional function coverage, and so on. In the study, 35 different techniques contributing to regression test case prioritization have been identified (Table 4.1). The techniques have been labeled as RTP-n, where RTP stands for Regression Test-case Prioritization and n is a technique number assigned randomly. This label simply serves as an index for a technique and does not reflect any characteristic of that technique. Table 4.1 contains a brief description of these techniques with their respective origins.
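To make the "total" versus "additional" coverage sub-techniques mentioned above concrete, the following is a minimal sketch of both greedy orderings. The test names and coverage sets are invented for illustration and do not come from the study.

```python
# Hypothetical illustration of greedy "total" vs "additional" statement-
# coverage prioritization. Test names and coverage data are invented.

def prioritize_total(coverage):
    """Order tests by the total number of statements each covers (descending)."""
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

def prioritize_additional(coverage):
    """Greedily pick the test covering the most not-yet-covered statements;
    reset the covered set once no remaining test adds new coverage."""
    remaining = dict(coverage)
    covered, order = set(), []
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:  # no new coverage possible: reset
            covered = set()
            continue
        order.append(best)
        covered |= remaining.pop(best)
    return order

coverage = {
    "t1": {1, 2, 3, 4},  # statements covered by each test case
    "t2": {1, 2},
    "t3": {5, 6, 7},
    "t4": {4, 5},
}
print(prioritize_total(coverage))       # ['t1', 't3', 't2', 't4']
print(prioritize_additional(coverage))  # t1 first, then t3 (most new statements)
```

The same greedy scheme applies unchanged at branch or function granularity by substituting the covered entities.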
Table 4.1: Test case prioritization techniques
Label Name Brief Description Origin
4.4 Subject programs
(i) Siemens Suite programs are small C programs available at SIR [163]: tcas (airplane collision avoidance system), printtokens (lexical analyzer), printtokens2 (lexical analyzer), schedule (priority scheduler), schedule2 (priority scheduler), replace (pattern recognition and substitution system), and totinfo (computes statistics of given input data).
(ii) SIR UNIX utilities are a set of C programs used as utilities in the UNIX operating system: flex, grep, gzip, sed, vim, and bash. The programs are freely available for download at SIR [163].
(iii) SIR Java programs are small, medium, and large sized Java programs available at SIR [163]. These programs are ant, jmeter, jtopas, xml-security, nanoxml, and siena.
4.5 Software Metrics used in studies
The metrics used for the empirical studies in the test case prioritization area can be divided into two categories: metrics related to the cost reduction of testing, and metrics related to the fault detection ability of a test suite. Many metrics have been proposed that fall into either of these two categories.
(vi) Coverage Cost (CC) [4] determines the percentage of program statements executed by the prioritized test suite in the allotted time frame.
(vii) Testing Importance of Module (TIM) [105] determines the severe-fault proneness of the module tested by a test case.
(viii) Total Severity of Faults Detected (TSFD) [160] is the summation of the severity values of all faults identified for a system.
(ix) Average Severity of Faults Detected (ASFD) [160] for a software requirement is calculated as the ratio of the summation of the severity values of the faults identified for that requirement to the TSFD.
(x) Weighted Percentage of Faults Detected (WPFD) [160] is the ratio of the TSFD to the percentage of the test suite executed.
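The severity-based metrics TSFD and ASFD can be sketched directly from their definitions above. The fault identifiers, severity values, and requirement mapping below are invented for the sketch.

```python
# Sketch of the severity-based metrics TSFD and ASFD.
# Fault names, severities, and the requirement mapping are hypothetical.

fault_severity = {"f1": 8, "f2": 4, "f3": 2, "f4": 2}  # severity per fault
faults_per_requirement = {"R1": ["f1", "f2"], "R2": ["f3", "f4"]}

# TSFD: summation of severity values of all faults identified for the system.
tsfd = sum(fault_severity.values())

# ASFD for a requirement: sum of severities of its faults divided by TSFD.
asfd = {req: sum(fault_severity[f] for f in faults) / tsfd
        for req, faults in faults_per_requirement.items()}

print(tsfd)        # 16
print(asfd["R1"])  # 0.75
```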
The use of these metrics in the area of regression test case prioritization has been shown in Table 4.3. The metric APFD has been widely used in most of the empirical and case studies. It also served as the base for the emergence of other metrics such as APBC and APFDc. Most of the metrics require a subject program with known/seeded fault information. This is the main hindrance to generalizing the results obtained using these metrics, because in a real testing scenario the fault information may not be available in advance.
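Since APFD dominates the studies surveyed, a minimal sketch of its standard formulation may help: APFD = 1 - (TF1 + ... + TFm)/(n*m) + 1/(2n), where n is the number of tests, m the number of faults, and TFi the 1-based position of the first test that exposes fault i. The fault matrix below is invented for illustration.

```python
# Minimal sketch of the APFD metric; the fault-detection data is hypothetical.

def apfd(order, detects):
    """order: prioritized list of test names.
    detects: test name -> set of faults that the test exposes."""
    faults = set().union(*detects.values())
    n, m = len(order), len(faults)
    first = {}  # fault -> position of the first test that exposes it
    for pos, test in enumerate(order, start=1):
        for fault in detects[test]:
            first.setdefault(fault, pos)
    return 1 - sum(first.values()) / (n * m) + 1 / (2 * n)

detects = {"t1": {"f1"}, "t2": {"f1", "f2"}, "t3": {"f3"}}
print(apfd(["t2", "t3", "t1"], detects))  # ≈0.722 (faults exposed early)
print(apfd(["t1", "t3", "t2"], detects))  # ≈0.5   (f2 exposed last)
```

An ordering that exposes all faults early yields a higher APFD, which is exactly what prioritization techniques try to maximize.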
Table 4.3: Metrics used in the studies
S.N. Metrics Publication (%) Input Requirements
1. APBC 6 Source Code with known faults
2. APFD 58 Source Code with known faults
3. APFDc 12 Source Code with known faults
4. APSCTA 10 Source Code
5. ASFD 1 Source Code with known faults
6. CC 6 Source Code
7. NAPFD 3 Source Code with known faults
8. TIM 2 System Design
9. TSFD 1 Source Code with known faults
10. WPFD 1 Source Code with known faults
A study conducted by Andrew et al. [2] found that although mutants serve as a good basis for testing, it is very hard to develop mutants with a proper fault distribution. The locations of actual faults depend on many factors, including programming paradigms and developer attitude [41, 189]. It is difficult to predict where faults will occur.
Some of the techniques have never been compared with the other techniques. Such techniques have been represented as isolated circles. RTP-26, RTP-28, RTP-30, RTP-31, and RTP-35 are examples of such isolated techniques. Some techniques, like RTP-2, RTP-4, and RTP-6, have been involved in the maximum number of comparative studies. Such comparisons indicate the high usability of these techniques for cost reduction or fault detection. In some studies [58, 59, 71, 148, 149, 151], it has been observed that, although technique RTP-3 (optimal) is theoretically the most efficient at maximizing fault detection or code coverage, the large pool of test cases makes it expensive to study all permutation sets of the original test suite to determine the optimal set. A test suite with n test cases has n! permutation sets, a number that grows too quickly for exhaustive evaluation to be practical.
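The factorial growth can be seen directly; the tiny suite below is hypothetical.

```python
# Why brute-forcing the optimal ordering does not scale: an n-test suite
# has n! permutations to evaluate.
import math
from itertools import permutations

for n in (5, 10, 20):
    print(n, math.factorial(n))  # 120, 3628800, 2432902008176640000

# Enumerating all orderings is feasible only for tiny suites:
tiny_suite = ["t1", "t2", "t3"]
print(len(list(permutations(tiny_suite))))  # 6 orderings to evaluate
```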
[Figure: comparison graph of the RTP techniques; edges connect techniques that have been compared empirically, and isolated circles mark techniques never compared with others.]
The results obtained for one subject program may not hold for other programs. The results also showed that for some subject programs, coarse-granularity techniques (md-tot, md-add, etc.) outperform fine-granularity techniques (st-tot, st-add, etc.).
4.7 Analysis
In this comparative study, some results have been identified that are relevant to the parameters of comparison stated in section 4.2. The main findings of this analysis are presented below:
1) It was found that there exist at least 35 different test case prioritization techniques. Except for five techniques, all have been compared through empirical or case studies. The results showed that some of the techniques have been involved in enough studies to make them the most common and usable test case prioritization techniques. The techniques that have not been compared empirically are either emerging techniques or have different domains of application. For example, it is difficult to compare a technique (RTP-30) that requires a software model (finite state machine, UML diagrams) with a technique (RTP-6) that needs the source code of software with known faults.
2) The present study revealed that approximately 60% of the studies used APFD as the metric for comparing the effectiveness of different techniques. Most of the other metrics (APBC, APSC, etc.) use APFD as their base and serve as variants of it. The metric TIM has been used in a few studies and provided encouraging results. The main advantage of this metric is that, unlike APFD and its variants, it does not require the faulty source code of the subject software. This metric can be used to evaluate a technique where only the software design is available as input. Unfortunately, no study in the reviewed literature gave any indication of which metric is most suitable for determining the effectiveness of a testing technique.
3) The subject programs used in the studies were mainly taken from SIR [163]. In general, the size of the subject programs considered in the empirical studies ranges from 100 LOC to approximately 18000 KLOC. More than 80% of the programs used in the literature have been written in three languages: C, C++, and Java. The programs written in these languages were either structured or object-oriented programs. Most of the techniques have been evaluated through the small Siemens Suite C programs (maximum size 726 LOC). Unfortunately, no study concluded whether the evaluation results obtained with small programs carry over to studies with large programs. The empirical evaluations of most of the techniques have been performed through mutants. It is very difficult to generalize the results of such empirical studies, as mutants cannot be as real as actual programs. The results of the present studies indicated that the effectiveness of a testing technique is program specific. A testing technique may be useful for one type of software and not equally effective for another [58]. The current studies did not discuss how component-based software (where source code is not available) could be tested with the available testing techniques. It has been found that web applications [48, 49, 78, 137] and GUI-based applications [20, 84, 112, 113, 114, 115, 116] have rarely been used as subject programs for validating test case prioritization techniques. Similarly, database software has appeared as a topic of interest for regression testing [35, 72, 73, 178] but has never appeared as the SUT for test case prioritization. The study could not draw conclusions about the software-specific suitability of the available testing techniques.
4) The different techniques were compared through experimental or case studies. The results of these comparisons varied with the structure and size of the SUT. Although no direct inference was drawn about the superiority of any of the mentioned techniques, the comparisons hint at the superiority of techniques with 'add' surrogates over techniques with 'tot' surrogates. A further general inference is that the granularity level of a technique plays a significant role in its effectiveness. Fine-granularity techniques (statement level) generally lead to better prioritization results than coarse-granularity techniques (method level), though some studies have shown contradictory results [47, 59] for specific subject programs. No general solution has been put forward that could adequately meet the complexity of the system, changing requirements, and diversity of applications.
5) One of the limitations of the current state of test case prioritization for regression testing is that most of the techniques are evaluated through controlled experimental studies with small subject programs. Most of these programs are mutants with known/seeded faults, injected in such a way that they look real. The current analysis could not find any publication that scaled the results of techniques from controlled experiments to large uncontrolled software. The subject programs used in the literature are mainly coded in C/C++ or Java. The studies do not draw any relationship between the consistency of results from controlled experiments and other types of applications (database, web, GUI, and programs coded in other languages). It has been observed that techniques with feedback (st-add) perform slightly better than the parallel techniques without feedback (st-tot). No technique, in general, has been proved superior to the others, as the performance of a technique was found to be highly software specific. A significant variance in the results of these techniques was found with changes in the type of software and the granularity level.
6) It is difficult to classify the different techniques on the basis of their performance. Techniques that are efficient for one type of software provided imprecise results for other types of software requirements. The techniques have been compared through different metrics; hence, their effectiveness cannot be guaranteed under all conditions. The techniques, in general, can be divided into two broad classes: safe and unsafe testing techniques. A safe testing technique generates a prioritized test suite that has the same fault detection potential as the original test suite; an example of such a technique is retest-all. An unsafe testing technique generally generates a prioritized test suite with lower fault detection capability than the original test suite. The low performance of unsafe testing techniques may be due to the omission of potentially fault-revealing test cases while forming the prioritized test suite.
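The safe/unsafe distinction above can be sketched as a simple check: a (possibly reduced) prioritized suite is safe when it still detects every fault the original suite detects. The fault-exposure data below is invented for illustration.

```python
# Hypothetical check of the "safe" property: the prioritized suite must
# retain the fault-detection potential of the original suite.

def detected(suite, exposes):
    """Union of faults exposed by the tests kept in `suite`."""
    return set().union(*(exposes[t] for t in suite)) if suite else set()

def is_safe(original, prioritized, exposes):
    """True iff the prioritized suite detects every fault
    that the original suite detects."""
    return detected(prioritized, exposes) >= detected(original, exposes)

exposes = {"t1": {"f1"}, "t2": {"f2"}, "t3": {"f1", "f2"}}
print(is_safe(["t1", "t2", "t3"], ["t3"], exposes))  # True: f1 and f2 kept
print(is_safe(["t1", "t2", "t3"], ["t1"], exposes))  # False: f2 lost
```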
4.8 Conclusions
These programs have been written in C/C++ and Java. The studies indicated that the adequacy of a testing technique depends on the SUT.