DOI 10.1007/s10796-013-9430-0
Abstract Two important problems which can affect the performance of classification models are high dimensionality (an overabundance of independent features in the dataset) and imbalanced data (a skewed class distribution which creates at least one class with many fewer instances than other classes). To resolve these problems concurrently, we propose an iterative feature selection approach, which repeatedly applies data sampling (in order to address class imbalance) followed by feature selection (in order to address high dimensionality), and finally we perform an aggregation step which combines the ranked feature lists from the separate iterations of sampling. This approach is designed to find a ranked feature list which is particularly effective on the more balanced dataset resulting from sampling while minimizing the risk of losing data through the sampling step and missing important features. To demonstrate this technique, we employ 18 different feature selection algorithms and Random Undersampling with two post-sampling class distributions. We also investigate the use of sampling and feature selection without the iterative step (e.g., using the ranked list from a single iteration, rather than combining the lists from multiple iterations), and compare these results to those from the version which uses iteration. Our study is carried out using three groups of datasets with different levels of class balance, all of which were collected from a real-world software system. All of our experiments use four different learners and one feature subset size. We find that our proposed iterative feature selection approach outperforms the non-iterative approach.

Keywords Iterative feature selection · Software defect prediction · Data sampling · High dimensionality · Class imbalance

T. M. Khoshgoftaar (✉) · A. Napolitano · R. Wald
Empirical Software Engineering Laboratory, Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL 33431, USA
e-mail: khoshgof@fau.edu

A. Napolitano
e-mail: anapoli1@fau.edu

R. Wald
e-mail: rwald1@fau.edu

K. Gao
Eastern Connecticut State University, Willimantic, CT 06226, USA
e-mail: gaok@easternct.edu

1 Introduction

Two major challenges are found across many data mining and machine learning problems: high dimensionality and class imbalance. High dimensionality refers to datasets which have a large number of independent attributes (features), especially relative to the total number of instances. In many high-dimensional datasets, only a small fraction of the features actually provide useful information about the class variable: the rest may be redundant (provide information which is already contained in other features) or irrelevant (provide no useful information whatsoever). By removing these useless features, classification models can be built with much lower computational cost, and often these models will have greater performance than those built using all of the features. In addition, the act of identifying the most important features is often useful in and of itself, as it tells practitioners which features are actually affecting the class variable.
802 Inf Syst Front (2014) 16:801–822
Class imbalance is a separate problem, wherein the class ratio is especially skewed. For example, in a binary-class problem, this may result in one class having far fewer instances than the other class. This is a major challenge because typically, the minority class will also be the positive class (that is, the class of interest). Classification models, on the other hand, are far more likely to correctly classify the majority (negative) class, because a model which improves performance on this class will have greater overall accuracy. For example, in the domain of software defect prediction, there are (ideally) far fewer fault-prone (fp) software modules than not-fault-prone (nfp) modules. The goal is to identify those modules which are likely to contain faults, but, without preprocessing, a model is more likely to correctly identify nfp modules, even if this means incorrectly assigning fp modules to the nfp class. Many solutions have been proposed to address the class imbalance problem, but one popular choice is sampling: changing the dataset to make it more balanced. This can be done either by adding instances to the minority class or by removing instances from the majority class, and by either random or directed approaches, but one of the simpler strategies, Random Undersampling (that is, randomly discarding majority-class instances), has proven especially effective.

Although much work has considered the high-dimensionality and class imbalance problems independently, few have specifically addressed datasets which contain both of these problems, and fewer have proposed algorithms specifically designed to address these problems in tandem. The goal of this paper is to present a new strategy for data preprocessing to resolve both problems. This strategy consists of multiple iterations of Random Undersampling applied to modify the datasets into a chosen degree of class balance, followed by feature selection performed on the reduced dataset. After all iterations have been performed, the ranked feature lists from each are combined, to reduce the influence of the randomness from the undersampling step and ensure only those features which are relevant to a large proportion of the iterations are preserved in the final ranking. This final ranking is used along with the original (non-sampled) dataset to build classification models.

To demonstrate this algorithm, we perform a case study using 18 filter-based feature ranking techniques, along with Random Undersampling as our sampling technique. These 18 feature selection algorithms include six commonly-used algorithms (Chi-Squared, Information Gain, Gain Ratio, Symmetrical Uncertainty, and two versions of ReliefF (Witten et al. 2011)), 11 threshold-based feature selection techniques proposed by our research team (Khoshgoftaar and Gao 2010), and the Signal-to-Noise concept from the domain of electrical and communication engineering (Gonzalez and Woods 2008) which has only recently been applied towards feature ranking (Mishra and Sahu 2011). Two different post-sampling distributions are used with Random Undersampling: 35:65 and 50:50. For this study, we use data from the domain of software quality prediction, a field which seeks to use software metrics collected during the software development process in order to identify which software modules are most likely to contain faults (Lessmann et al. 2008; Song et al. 2011). In particular, three groups of software datasets are used, all collected from a real-world software system.

We use these 18 rankers, two levels of Random Undersampling, and three groups of datasets with two different approaches: both the iterative strategy which applies undersampling and feature selection multiple times and aggregates the resulting ranked lists, and a non-iterative strategy which only performs undersampling and feature selection once. In both cases, we use the resulting feature ranked lists (specifically, the top 8 features from these lists) along with four classification algorithms (Naïve Bayes, Multilayer Perceptron, k-Nearest Neighbors, and Support Vector Machine) to build models on the full (i.e., non-sampled) datasets.

This paper extends our previous work (Khoshgoftaar et al. 2012b) by incorporating a wider range of classifiers and datasets, as well as by performing a more extensive statistical analysis and discussing threats to validity. The results show that our iterative approach outperforms (on average) the non-iterative version for all levels of class imbalance, although the difference between the iterative and non-iterative approach is strongest on the most imbalanced datasets. We also see that this distinction becomes more evident when using Random Undersampling to create a 50:50 class ratio.

The remainder of the paper is organized as follows. Section 2 presents related work. The 18 filter-based feature ranking techniques and our proposed iterative feature selection method, as well as the four classifiers and the associated classification performance metric used in the study, are described in Section 3. A case study using three groups of datasets from a real-world software system is provided in Section 4. The threats to validity are discussed in Section 5. Finally, conclusions and future work are indicated in Section 6.

2 Related work

Feature selection (FS), also known as attribute selection, is a process of selecting some subset of the features which are useful in building a classifier. It deletes as many unnecessary features as possible, leaving only those features that are important to the class attribute. FS techniques can be
extension of the Relief algorithm that can handle noise and multiclass datasets, and is implemented in the WEKA tool¹ (Witten et al. 2011). When the WeightByDistance (weight nearest neighbors by their distance) parameter is set as default (false), the algorithm is referred to as RF; when the parameter is set to 'true,' the algorithm is referred to as RFW.

3.1.2 Threshold-based feature selection

The threshold-based feature selection (TBFS) technique was proposed by our research team and implemented within WEKA (Witten et al. 2011). The procedure is shown in Algorithm 1. Each independent attribute works individually with the class attribute, and that two-attribute dataset is evaluated using different performance metrics. More specifically, the TBFS procedure includes two steps: (1) normalizing the attribute values so that they fall between 0 and 1; and (2) treating those values as the posterior probabilities from which to calculate performance metrics. Note that no classifiers were built during the feature selection process. Analogous to the procedure for calculating rates in a classification setting with a posterior probability, the true positive (TPR), true negative (TNR), false positive (FPR), and false negative (FNR) rates can be calculated at each threshold t ∈ [0, 1] relative to the normalized attribute F^j. Precision PRE(t) is defined as the fraction of the predicted-positive examples which are actually positive. The feature rankers we propose utilize these five rates as described below. The value is computed in both directions: first treating instances above the threshold (t) as positive and below as negative, then treating instances above the threshold as negative and below as positive. The better result is used. Each of the 11 metrics is calculated for each attribute individually, and attributes with higher values for F-Measure, Geometric Mean, Probability Ratio, Power, Area Under the ROC Curve, Area Under the Precision-Recall Curve, Mutual Information, Kolmogorov-Smirnov Statistic, and Odds Ratio and lower values for Gini Index and Deviance are determined to better predict the class attribute. In this manner, the attributes can be ranked from most to least predictive based on each of the 11 metrics.

a. F-Measure (FM). FM is derived from recall (or true positive rate) and precision:

   FM = \max_{t \in [0,1]} \frac{2 \cdot TPR(t) \cdot PRE(t)}{TPR(t) + PRE(t)}

   Recall and precision are calculated at each point along the normalized attribute range of 0 to 1. The maximum F-measure obtained by each attribute represents how strongly that particular attribute relates to the class, according to the F-measure.

b. Odds Ratio (OR). OR is defined as:

   OR = \max_{t \in [0,1]} \frac{TPR(t)(1 - FPR(t))}{(1 - TPR(t)) FPR(t)} = \max_{t \in [0,1]} \frac{TPR(t) \cdot TNR(t)}{FPR(t) \cdot FNR(t)}

   OR is the maximum value of the ratio of the product of correct to incorrect predictions.

c. Power (Pow). Pow is defined as:

   Pow = \max_{t \in [0,1]} \left[ (1 - FPR(t))^k - (1 - TPR(t))^k \right] = \max_{t \in [0,1]} \left[ (TNR(t))^k - (FNR(t))^k \right]

   for some integer k ≥ 1. Note that if k = 1, Power is equivalent to KS (described in Item g). In this work, we use k = 5 as done by Forman (2003).

¹ Waikato Environment for Knowledge Analysis (WEKA) is a popular suite of machine learning software written in Java, developed at the University of Waikato. WEKA is free software available under the GNU General Public License. In this study, all experiments and algorithms were implemented in the WEKA tool.
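As a concrete illustration of the TBFS idea (a sketch for one attribute and one metric, not the WEKA implementation used in this study), the following function normalizes an attribute to [0, 1], sweeps the threshold t, tries both classification directions as described above, and keeps the best F-measure. The function name and data layout are ours, for illustration only.

```python
def tbfs_fmeasure(values, labels, positive="fp"):
    """Score one attribute by TBFS with the F-Measure metric: normalize the
    attribute to [0, 1], treat the normalized values as posterior
    probabilities, sweep the threshold t, and take the best F-measure over
    both classification directions."""
    lo, hi = min(values), max(values)
    norm = [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
    best = 0.0
    for t in sorted(set(norm)):           # candidate thresholds
        for above_is_positive in (True, False):  # both directions
            tp = fp = fn = 0
            for v, y in zip(norm, labels):
                pred_pos = (v >= t) if above_is_positive else (v < t)
                if pred_pos and y == positive:
                    tp += 1
                elif pred_pos:
                    fp += 1
                elif y == positive:
                    fn += 1
            if tp:  # F-measure is 0 when there are no true positives
                tpr = tp / (tp + fn)   # recall, TPR(t)
                pre = tp / (tp + fp)   # precision, PRE(t)
                best = max(best, 2 * tpr * pre / (tpr + pre))
    return best
```

An attribute that perfectly separates the two classes attains the maximum score of 1; the per-attribute scores are then used to rank the features.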
d. Probability Ratio (PR). PR is defined as:

   PR = \max_{t \in [0,1]} \frac{TPR(t)}{FPR(t)}

e. Gini Index (GI). GI was first introduced by Breiman et al. (1984) within the CART algorithm. For a given threshold t, let S_t = \{x \mid F^j(x) > t\} and \bar{S}_t = \{x \mid F^j(x) \le t\}. The Gini index is calculated as:

   GI = \min_{t \in [0,1]} \left\{ \left[ 1 - \left( P^2(TP(t) \mid S_t) + P^2(FP(t) \mid S_t) \right) \right] + \left[ 1 - \left( P^2(TN(t) \mid \bar{S}_t) + P^2(FN(t) \mid \bar{S}_t) \right) \right] \right\}
      = \min_{t \in [0,1]} \left[ 2 PRE(t)(1 - PRE(t)) + 2 NPV(t)(1 - NPV(t)) \right]

   where TP(t) is the number of true positives given threshold t (and similarly for TN(t), FP(t), and FN(t)). NPV, or negative predictive value, represents the percentage of examples predicted to be negative that are actually negative and is very similar to the precision — in fact, it is often thought of as the precision of instances predicted to be in the negative class. The Gini index for the attribute is then the minimum Gini index over all decision thresholds t ∈ [0, 1].

f. Mutual Information (MI). Let c(x) ∈ {P, N} denote the actual class of instance x, and let \hat{c}^t(x) denote the predicted class based on the value of the attribute F^j and a given threshold t. MI computes the criterion with respect to the number of times a feature value and a class co-occur, the feature value occurs without the class, and the class occurs without the feature value. The MI metric is defined as:

   MI = \max_{t \in [0,1]} \sum_{\hat{c}^t \in \{P,N\}} \sum_{c \in \{P,N\}} p(\hat{c}^t, c) \log \frac{p(\hat{c}^t, c)}{p(\hat{c}^t) p(c)}

   where

   p(\hat{c}^t = \alpha, c = \beta) = \frac{\left| \{x \mid (\hat{c}^t(x) = \alpha) \cap (c(x) = \beta)\} \right|}{|P| + |N|},
   p(\hat{c}^t = \alpha) = \frac{\left| \{x \mid \hat{c}^t(x) = \alpha\} \right|}{|P| + |N|},
   p(c = \alpha) = \frac{\left| \{x \mid c(x) = \alpha\} \right|}{|P| + |N|},
   \alpha, \beta \in \{P, N\}.

g. Kolmogorov-Smirnov Statistic (KS). KS measures the maximum difference between the cumulative distribution functions of examples in each class based on the normalized attribute F^j. The distribution function F_c(t) for a class c is estimated by the proportion of examples x from class c with F^j(x) \le t, t ∈ [0, 1]. In a two-class setting with c ∈ {P, N}, KS is computed as

   KS = \max_{t \in [0,1]} \left| F_P(t) - F_N(t) \right|.

   The larger the KS value, the better the attribute is able to separate the two classes, and hence the more significant the attribute is. The range of KS is between 0 and 1.

h. Deviance (Dev). Dev is the minimum residual sum of squares based on a threshold t. It measures the sum of the squared errors from the mean class given a partitioning of the space based on the threshold t. As it represents the total error found in the partitioning, lower values are preferred.

i. Geometric Mean (GM). GM is the square root of the product of the true positive rate and true negative rate. GM ranges from 0 to 1, and an attribute that is perfectly correlated to the class provides a value of 1. GM is a useful performance measure since it is inclined to maximize the true positive rate and the true negative rate while keeping them relatively balanced. GM is calculated at each value of the normalized attribute range, and the maximum value of GM is used as a measure of attribute strength.

j. Area Under the ROC Curve (AUC). Receiver Operating Characteristic, or ROC, curves graph the true positive rate on the y-axis versus the false positive rate on the x-axis. The resulting curve illustrates the trade-off between true positive rate and false positive rate. In this study, ROC curves are generated by varying the decision threshold t (between 0 and 1) used to transform the normalized attribute values into a predicted class. AUC is used to provide a single numerical metric for comparing the predictive power of each attribute.

k. Area Under the Precision-Recall Curve (PRC). PRC is a single-value measure that originated from the area of information retrieval. A precision-recall curve is generated by varying the decision threshold t from 0 to 1 and plotting the recall (y-axis) and precision (x-axis) at each point in a similar manner to the ROC curve. The area under the PRC ranges from 0 to 1, and an attribute with more predictive power results in an area under the PRC closer to 1.

3.1.3 Signal-to-noise ratio technique

Signal-to-Noise ratio (S2N) (Goh et al. 2004) is a simple univariate ranking technique which defines how well a feature discriminates between two classes in a two-class problem. S2N, for a given feature, separates the means of the two classes relative to the sum of their standard deviations. The equation to calculate S2N is

   S2N = \frac{\mu_P - \mu_N}{\sigma_P + \sigma_N}

where \mu_P and \mu_N are the mean values of a particular attribute for the samples from class P and class N, and
\sigma_P and \sigma_N are the corresponding standard deviations. The larger the S2N value, the more relevant the feature is to the class attribute.

3.2 The iterative feature selection approach

The proposed method is designed to deal with feature selection for imbalanced data. Algorithm 2 presents the procedure of this approach. It consists of two basic steps:

1. Using the Random Undersampling (RUS) technique to balance data. RUS creates the balanced data by randomly removing examples from the majority class. In this work, we study two post-sampling proportions: 35:65 and 50:50, meaning the ratio between the minority (fp) examples and majority (nfp) examples is 35:65 and 50:50, respectively, after sampling.

2. Applying a filter-based feature ranking technique to the sampled data and ranking all the features according to their predictive powers (scores). We investigate 18 filter-based feature ranking techniques.

In order to alleviate the biased results generated due to the sampling process, we repeat the two steps k times (k = 10 in this study) and aggregate the k rankings using the mean (average). Finally, the best set of attributes is selected from the original data to form the training dataset.

3.3 Classifiers

The software defect prediction models are built using four different classification algorithms: Naïve Bayes (NB) (Witten et al. 2011), Multilayer Perceptron (MLP) (Haykin 1999), k-Nearest Neighbors (KNN) (Witten et al. 2011), and Support Vector Machine (SVM) (Cristianini and Shawe-Taylor 2000). These learners were selected for two key reasons: (1) they do not have a built-in feature selection capability, and (2) they are commonly used in both the software engineering and the data mining domains (Lessmann et al. 2008; Jiang et al. 2009; Menzies et al. 2007). We employ the WEKA tool (Witten et al. 2011) to implement these classifiers. Parameter settings are subject to the data explored. Unless stated otherwise, we use the default parameter settings as specified in WEKA.

3.3.1 Naïve Bayes

To determine the classification of an instance, one method is to use a probability model in which the features that were chosen by the feature rankers are used as the conditions for the probability of the sample being a member of the class. A basic probability model would look like p(C \mid F_1, \ldots, F_n), where F_i is the value of each feature used and C is the class of the instance. This model is known as the posterior, and we assign the instance to the class for which it has the largest posterior (Souza et al. 2005).

Unfortunately, it is quite difficult to determine the posterior directly. Thus, it is necessary to use Bayes's rule, which states that the posterior equals the prior multiplied by the likelihood, divided by the evidence:

   p(C \mid F_1, \ldots, F_n) = \frac{p(C) \, p(F_1, \ldots, F_n \mid C)}{p(F_1, \ldots, F_n)}

In practice, the formula above can be simplified by certain assumptions. The evidence, p(F_1, \ldots, F_n), is always constant for a specific dataset and therefore can be ignored for the purposes of classification. The likelihood, p(F_1, \ldots, F_n \mid C), can be simplified to \prod_i p(F_i \mid C) due to the naive assumption that all of the features are conditionally independent of all of the other features. This naïve assumption, together with the removal of the evidence term, yields the Naïve Bayes classifier:

   p(C \mid F_1, \ldots, F_n) \propto p(C) \prod_i p(F_i \mid C)
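Putting Sections 3.1.3 and 3.2 together, the iterative approach can be sketched as follows: repeat (undersample, then score every feature) k times, then aggregate the k score lists by their mean. The sketch below uses the S2N ranker as the example scorer; `undersample_fn` stands in for Random Undersampling and is a hypothetical caller-supplied helper, so this is an outline of the procedure rather than our exact implementation.

```python
from statistics import mean, pstdev

def s2n(column, labels, pos="fp"):
    """Signal-to-Noise ratio of one attribute (Section 3.1.3):
    (mu_P - mu_N) / (sigma_P + sigma_N)."""
    p = [v for v, y in zip(column, labels) if y == pos]
    n = [v for v, y in zip(column, labels) if y != pos]
    return (mean(p) - mean(n)) / (pstdev(p) + pstdev(n))

def iterative_ranking(X, y, undersample_fn, score_fn=s2n, k=10, n_select=8):
    """X: list of instances (rows), y: class labels.
    Repeat k times: (1) Random Undersampling via undersample_fn,
    (2) score every feature on the sampled data; then aggregate the
    k score lists by their mean and return the top n_select feature
    indices. The selected features are afterwards taken from the
    ORIGINAL (non-sampled) data to form the training dataset."""
    n_features = len(X[0])
    sums = [0.0] * n_features
    for it in range(k):
        Xs, ys = undersample_fn(X, y, seed=it)   # step 1: RUS
        for j in range(n_features):
            sums[j] += score_fn([row[j] for row in Xs], ys)  # step 2: rank
    ranked = sorted(range(n_features), key=lambda j: sums[j] / k, reverse=True)
    return ranked[:n_select]
```

Aggregating the mean score over the k sampled datasets reduces the influence of any single random undersampling, which is the motivation given for the iterative design in Section 3.2.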
3.3.2 Multilayer perceptron

A multilayer perceptron (MLP) is a type of artificial neural network. Artificial neural networks consist of nodes which are arranged in sets called layers. Each node in a layer has a connection coming from every node in the layer before it and to every node in the layer after it. Each node takes the weighted sum of all of the input nodes. Along with the weighted sums, an activation function is also applied. The application of the activation function to the result of the weighted sum allows for a more clearly defined result by further separating the instances in the two classes from each other. Neural networks are well known for being robust to redundant features. However, neural networks sometimes have problems with overfitting (Haykin 1999).

3.3.3 K-nearest neighbors

The k-Nearest Neighbors, or KNN, learner is an example of an instance-based and lazy learning algorithm. Instance-based algorithms use only the training data, without creating statistics on which to base their hypotheses. The KNN learner does this by calculating the distance of the test sample from every training instance, and the predicted class is derived from the k nearest neighbors.

In the KNN learner, when we get a test sample we would like to classify, we tabulate the classes for each of the k closest training samples (we used a k of five for our experiment) and we determine the weight of each neighbor as 1/distance, where distance is the distance from the test sample. After the classes and weights are tabulated, we add all of the weights from the neighbors of the positive class together and all of the weights of the negative class together. The prediction will be the class with the largest cumulative weight (Souza et al. 2005).

The KNN learner can use any metric that is appropriate to calculate the distance between the samples. The standard metric used in KNN is Euclidean distance, defined as

   d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

3.3.4 Support vector machine

One of the most efficient ways to classify between two classes is to assume that both classes are linearly separated from each other. This assumption allows us to use a discriminant to split the instances into the two classes before looking at the distribution between the classes. A linear discriminant uses the formula g(x \mid w, \omega_0) = w^T x + \omega_0. In the case of the linear discriminant, we only need to learn the weight vector, w, and the bias, \omega_0. One aspect that must be addressed is that there can be multiple discriminants that correctly classify the two classes. The support vector machine, or SVM, is a linear discriminant classifier which assumes that the best discriminant maximizes the distance between the two classes, as measured from the discriminant to the samples of both classes (Liu 2009).

3.4 Classification performance metric

In this study, we use the Area Under the ROC (receiver operating characteristic) curve (i.e., AUC) to evaluate classification models. The ROC curve graphs the true positive rate versus the false positive rate (the positive class is synonymous with the minority class). Traditional performance metrics for classifier evaluation consider only the default decision threshold of 0.5; ROC curves illustrate the performance across all decision thresholds. A classifier that provides a large area under the curve is preferable to a classifier with a smaller area under the curve. A perfect classifier provides an AUC that equals 1. AUC is one of the most widely used single numeric measures that provides a general idea of the predictive potential of a classifier. It has also been shown that AUC is of lower variance and is more reliable than other performance metrics such as precision, recall, and F-measure (Jiang et al. 2009).
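The AUC described in Section 3.4 can be computed without explicitly tracing the ROC curve, via the rank-sum (Mann-Whitney) formulation: AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. A minimal sketch of that computation (ties handled by midranks; an illustration, not the WEKA implementation we use):

```python
def auc(scores, labels, positive="fp"):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U)
    formulation: rank all instances by score (ascending, ties get
    midranks), then AUC = (R_pos - n_pos(n_pos + 1)/2) / (n_pos * n_neg),
    where R_pos is the sum of the positive instances' ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = midrank
        i = j + 1
    pos_ranks = [ranks[i] for i, y in enumerate(labels) if y == positive]
    n_pos, n_neg = len(pos_ranks), len(labels) - len(pos_ranks)
    return (sum(pos_ranks) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A model whose scores perfectly separate the two classes attains an AUC of 1, matching the description of a perfect classifier above.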
In our case study, we modify the original data by: (1) removing all non-numeric attributes, including the package names, and (2) converting the post-release defects attribute to a binary class attribute, where fault-prone (fp) is the minority class and not-fault-prone (nfp) is the majority class. A program module's membership in a given class is determined by a post-release defects threshold, thd. A program module (package) with thd or more post-release defects is labeled as fp, while those with fewer than thd defects are labeled as nfp.

For a given system release we consider three values for thd: for releases 2.0 and 3.0, thd = {10, 5, 3}, and for release 2.1, thd = {5, 4, 2}. This results in three groups of datasets, as shown in Table 1. A different set of thresholds is chosen for release 2.1 because we wanted to maintain relatively similar class distributions for the three datasets in a given group. Table 1 presents key details about the different datasets used in our study. All datasets contain 209 software attributes, which include 208 independent (predictor) attributes and one dependent attribute (the fp or nfp label). The three groups of datasets exhibit different class distributions with respect to the fp and nfp modules, where Eclipse 1 is relatively the most imbalanced and Eclipse 3 is the least imbalanced.

4.2 Experiment design

In the experiments, we applied 18 filter-based feature ranking techniques and two levels of sampling to the datasets. The main purpose of the experiment is to compare the performance of the classification models when using two different feature selection strategies.

• Strategy 1: Data sampling followed by feature ranking (denoted 'No Iteration').
• Strategy 2: An iterative process of data sampling followed by feature ranking (denoted 'Iteration').

The process of the two strategies is shown in Fig. 1, where the upper part represents Strategy 1 and the lower part represents Strategy 2.

The number of features that are selected for modeling needs to be determined in advance. We choose the ⌈log₂ n⌉ features that have the highest scores, where n is the number of independent features in the original dataset. For the three groups of Eclipse datasets ⌈log₂ n⌉ = 8, where n = 208. We select ⌈log₂ n⌉ attributes because (1) the related literature does not provide guidance on the appropriate number of features to select; and (2) one of our recent empirical studies (Khoshgoftaar et al. 2007) recommended using ⌈log₂ n⌉ features when employing WEKA to build random forests learners for binary classification in general and imbalanced datasets in particular. Although we use different learners here, a preliminary study showed that ⌈log₂ n⌉ is still appropriate for various learners.

After feature selection, we applied four learners (NB, MLP, KNN, and SVM) to the training datasets with the selected features, and we used AUC to evaluate the performance of the classification models.

4.3 Results & analysis

All the results are reported in Tables 2, 3, and 4, each representing the results for each group of the datasets using NB, MLP, KNN, or SVM. The values presented in the tables show the average AUC for every classification model, computed over the ten runs of five-fold cross-validation and across that particular group of datasets. Cross-validation is a technique for estimating the performance of a predictive model. In this study, for each of the five folds, one fold is used as the test data while the other four folds are used as training data. All the preprocessing steps (feature selection and data sampling) are done on the training dataset. The processed training data is then used to build the classification model, and the resulting model is applied to the test fold. This cross-validation is repeated five times (the folds), with each fold used exactly once as the test data. The five results from the five folds can then be averaged to produce a single estimate.

An unpaired two-tailed t-test was used for each paired comparison between the two strategies (iterative and non-iterative processes). The t-test examines the null hypothesis that the population means of two independent group samples are equal against the alternative hypothesis that the population means are different. The p-values are provided for each pair of comparisons in the tables. The significance level is set to 0.05; when the p-value is less than 0.05, the two group means (between 'Iteration' and 'No Iteration') are significantly different from one another. For example, for Eclipse 1 with NB (Table 2), when using RF (Index 4) to rank features and RUS35 to sample the data, the result demonstrates that the iterative process significantly outperformed the non-iterative process because the p-value (0.03) is less than the specified cutoff 0.05. For the AUC ranker (Index 16) with the RUS35 sampler, the iterative process is better than the non-iterative process (p = 0.4), but the difference is not statistically significant. As another example, for the same datasets and same learner, when using the PRC ranker (Index 17) with RUS35, the classification performance for the two strategies is equal or very similar. This can also be confirmed by the p-value, which is 0.91. For each paired comparison between 'Iteration' and 'No Iteration,' one can always find which one performs better at 0.05 < p < 0.90 (marked with bold) or significantly better at p ≤ 0.05 (marked with bold) than the other, or whether
they have the same or similar performance, with p ≥ 0.90 (marked with underline).

The results demonstrate that the iterative method outperformed the non-iterative approach for most cases on all three groups of datasets. Table 5 shows the summary of the comparisons between 'Iteration' and 'No Iteration' over all 18 filters for a given sampler and a given learner on each group of datasets. The value in each cell represents the number of better cases (p < 0.90) for 'No Iteration' or 'Iteration,' or of equal cases (p ≥ 0.90). When RUS35 acts as the sampler, the iterative process performed better than the non-iterative process for 142 out of 216 cases (65.7 %) overall, worse than the non-iterative process for 35 out of 216 cases (16.2 %), and equal to the non-iterative process for the remaining 18.1 % of cases. When RUS50 was used as the sampler, the iterative process outperformed the non-iterative process for 85.6 % of cases. The phenomenon that the iterative feature selection technique is better than the non-iterative approach becomes more obvious when using the 50:50 post-sampling proportion and when feature selection is applied to more severely imbalanced datasets like Eclipse 1 and Eclipse 2.

We also conducted a four-way ANalysis Of VAriance (ANOVA) F-test on the classification performance for each group of the datasets separately, to examine whether the performance difference (better/worse) is statistically significant or not. Four factors are designed as follows: Factor A represents the four learners, Factor B represents the 18 rankers, Factor C represents the two samplers, and Factor D represents the two strategies we investigate in this study. The pairwise interaction effects are also considered in the ANOVA test. The ANOVA model can be used to test the hypothesis that the AUC means for the main factors and/or for the interactions are equal against the alternative hypothesis that at least one mean is different. If the alternative hypothesis is accepted, multiple comparisons can be used to determine which of the means are significantly different from the others. Table 6 shows the ANOVA results for each group of datasets. The p-value is less than the cutoff 0.05 for Factors A (learner), B (ranker), and D (strategy) in each table, meaning that for these main factors the alternative hypothesis is accepted; namely, at least two group means are significantly different from each other. On the other hand, the p-value is greater than 0.05 for Factor C for the Eclipse 1 and Eclipse 3 datasets, meaning that the two samplers (RUS35 and RUS50) are not significantly different from each other for these two groups of datasets. In addition, some p-values of the pairwise interaction effects are greater than 0.05, while others are less than 0.05. A small p-value implies that the pairwise interaction significantly affects classification performance in that group of datasets. In other words, changing the value of one factor will significantly influence the effect of the other factor, and vice versa.

We further carried out a multiple comparison test on each main factor and the interactions (B×D and C×D) with Tukey's Honestly Significant Difference criterion. Since
sents four learners, Factor B represents 18 rankers, Factor this study is more interested in comparing the performances
C represents two samplers, and Factor D represents the two of the two different feature selection strategies, we only
Index Ranker  RUS35: No Iteration / Iteration / p-value  RUS50: No Iteration / Iteration / p-value
(a) NB
1 CS 0.8569 0.8606 0.647 0.8471 0.8542 0.440
2 GR 0.8406 0.8478 0.303 0.8460 0.8553 0.254
3 IG 0.8611 0.8636 0.780 0.8476 0.8589 0.242
4 RF 0.8518 0.8681 0.030 0.8488 0.8858 0.000
5 RFW 0.8067 0.8389 0.000 0.8126 0.8603 0.000
6 SU 0.8590 0.8612 0.774 0.8458 0.8610 0.099
7 FM 0.8491 0.8469 0.777 0.8526 0.8580 0.499
8 OR 0.8640 0.8659 0.807 0.8480 0.8619 0.154
9 Pow 0.8425 0.8338 0.394 0.8266 0.8353 0.433
10 PR 0.8432 0.8172 0.013 0.8243 0.8304 0.610
11 GI 0.8455 0.8398 0.497 0.8451 0.8697 0.018
12 MI 0.8503 0.8537 0.691 0.8397 0.8514 0.183
13 KS 0.8466 0.8496 0.711 0.8408 0.8492 0.271
14 Dev 0.8556 0.8541 0.851 0.8428 0.8521 0.277
15 GM 0.8409 0.8478 0.407 0.8422 0.8447 0.760
16 AUC 0.8684 0.8745 0.400 0.8600 0.8717 0.121
17 PRC 0.8616 0.8607 0.910 0.8546 0.8602 0.556
18 S2N 0.8805 0.8888 0.309 0.8707 0.8886 0.035
(b) MLP
1 CS 0.8604 0.8692 0.498 0.8609 0.8700 0.494
2 GR 0.8405 0.8569 0.156 0.8551 0.8782 0.035
3 IG 0.8691 0.8738 0.699 0.8594 0.8717 0.340
4 RF 0.8333 0.8263 0.522 0.8352 0.8432 0.369
5 RFW 0.8289 0.8426 0.181 0.8330 0.8604 0.001
6 SU 0.8630 0.8707 0.474 0.8562 0.8799 0.050
7 FM 0.8480 0.8560 0.605 0.8631 0.8609 0.877
8 OR 0.8668 0.8676 0.948 0.8631 0.8754 0.356
9 Pow 0.8427 0.8417 0.925 0.8331 0.8436 0.381
10 PR 0.8408 0.8221 0.122 0.8300 0.8343 0.709
11 GI 0.8489 0.8437 0.629 0.8584 0.8780 0.100
12 MI 0.8580 0.8577 0.985 0.8541 0.8626 0.556
13 KS 0.8532 0.8609 0.622 0.8467 0.8558 0.537
14 Dev 0.8602 0.8631 0.809 0.8625 0.8677 0.695
15 GM 0.8462 0.8569 0.536 0.8482 0.8543 0.664
16 AUC 0.8818 0.8823 0.950 0.8701 0.8819 0.208
17 PRC 0.8720 0.8738 0.865 0.8590 0.8752 0.137
18 S2N 0.8675 0.8691 0.825 0.8652 0.8625 0.715
(c) KNN
1 CS 0.8916 0.8967 0.545 0.8841 0.8919 0.416
2 GR 0.8731 0.8902 0.028 0.8751 0.8940 0.024
3 IG 0.8926 0.9001 0.328 0.8799 0.8966 0.078
4 RF 0.8701 0.8883 0.009 0.8771 0.8931 0.009
5 RFW 0.7918 0.8353 0.000 0.8199 0.8797 0.000
6 SU 0.8925 0.8991 0.387 0.8806 0.8982 0.035
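Every model behind these tables uses the same feature-subset size, ⌈log2 n⌉ of the n = 208 original metrics. A minimal sketch of that computation (the function name is ours, not the paper's):

```python
import math

def subset_size(n_features):
    """Feature-subset size used in the study: log2(n), rounded up."""
    return math.ceil(math.log2(n_features))

print(subset_size(208))  # 8 of the 208 Eclipse metrics are retained
```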
Table 2 (continued)
present the pairwise interactions that involve the two strategies (Factor D). Note that for all the ANOVA and multiple comparison tests, the significance level was set to 0.05. Figures 2, 3 and 4 show the multiple comparisons for every group of datasets, each with six subfigures presenting the results for Factors A, B, C, D, B×D, and C×D, respectively. The figures display each group mean as a symbol (◦) with its 95 % confidence interval drawn as a line around the symbol. Two means are significantly different if their intervals are disjoint, and not significantly different if their intervals overlap. The assumptions for constructing the ANOVA models were validated. From these figures we can see the following points:

• Among the four learners, SVM always demonstrated significantly better performance than the other classifiers. MLP and KNN showed intermediate performance, and NB always resulted in the worst performance.
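The interval test just described (two group means differ significantly when their 95 % confidence intervals are disjoint) can be sketched as follows. This is a normal-approximation sketch over toy AUC samples; the paper's intervals come from the fitted ANOVA model, so the code is illustrative only:

```python
from statistics import NormalDist, mean, stdev

def ci95(values):
    # Normal-approximation 95 % confidence interval for a group mean.
    z = NormalDist().inv_cdf(0.975)  # ~1.96
    half = z * stdev(values) / len(values) ** 0.5
    return mean(values) - half, mean(values) + half

def significantly_different(a, b):
    # Disjoint intervals -> significant; overlapping -> not.
    lo_a, hi_a = ci95(a)
    lo_b, hi_b = ci95(b)
    return hi_a < lo_b or hi_b < lo_a

# Toy AUC samples: an SVM-like group clearly above an NB-like group.
svm = [0.90, 0.91, 0.89, 0.90, 0.91]
nb = [0.80, 0.81, 0.79, 0.80, 0.81]
print(significantly_different(svm, nb))  # True: intervals are disjoint
```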
Index Ranker  RUS35: No Iteration / Iteration / p-value  RUS50: No Iteration / Iteration / p-value
(a) NB
1 CS 0.8516 0.8552 0.640 0.8486 0.8533 0.557
2 GR 0.8314 0.8517 0.029 0.8367 0.8478 0.192
3 IG 0.8538 0.8547 0.906 0.8499 0.8526 0.743
4 RF 0.8496 0.8645 0.087 0.8518 0.8737 0.006
5 RFW 0.8399 0.8623 0.038 0.8351 0.8699 0.000
6 SU 0.8506 0.8540 0.653 0.8484 0.8534 0.543
7 FM 0.8460 0.8520 0.494 0.8469 0.8504 0.674
8 OR 0.8432 0.8494 0.478 0.8461 0.8526 0.449
9 Pow 0.8326 0.8432 0.347 0.8308 0.8402 0.382
10 PR 0.8317 0.8319 0.983 0.8287 0.8347 0.586
11 GI 0.8307 0.8362 0.636 0.8430 0.8550 0.168
12 MI 0.8506 0.8543 0.644 0.8445 0.8506 0.471
13 KS 0.8448 0.8506 0.505 0.8448 0.8462 0.876
14 Dev 0.8470 0.8517 0.589 0.8456 0.8511 0.511
15 GM 0.8436 0.8489 0.553 0.8422 0.8480 0.509
16 AUC 0.8530 0.8530 0.999 0.8475 0.8526 0.538
17 PRC 0.8532 0.8526 0.937 0.8479 0.8506 0.742
18 S2N 0.8786 0.8806 0.764 0.8769 0.8797 0.686
(b) MLP
1 CS 0.8867 0.8930 0.452 0.8865 0.8931 0.418
2 GR 0.8747 0.8836 0.320 0.8760 0.8873 0.192
3 IG 0.8909 0.8926 0.823 0.8857 0.8907 0.514
4 RF 0.8661 0.8712 0.604 0.8706 0.8741 0.697
5 RFW 0.8514 0.8699 0.106 0.8632 0.8764 0.103
6 SU 0.8890 0.8925 0.650 0.8889 0.8923 0.665
7 FM 0.8835 0.8852 0.848 0.8857 0.8867 0.912
8 OR 0.8821 0.8916 0.269 0.8912 0.8919 0.920
9 Pow 0.8711 0.8809 0.360 0.8714 0.8771 0.573
10 PR 0.8638 0.8619 0.882 0.8705 0.8725 0.861
11 GI 0.8631 0.8611 0.868 0.8869 0.8877 0.925
12 MI 0.8888 0.8927 0.604 0.8877 0.8876 0.989
13 KS 0.8845 0.8851 0.943 0.8825 0.8850 0.790
14 Dev 0.8817 0.8892 0.379 0.8852 0.8860 0.921
15 GM 0.8835 0.8835 0.995 0.8811 0.8846 0.703
16 AUC 0.8914 0.8932 0.820 0.8850 0.8955 0.201
17 PRC 0.8867 0.8901 0.669 0.8878 0.8939 0.441
18 S2N 0.8870 0.8837 0.723 0.8860 0.8807 0.590
(c) KNN
1 CS 0.8923 0.8929 0.925 0.8861 0.8906 0.495
2 GR 0.8687 0.8863 0.042 0.8747 0.8826 0.283
3 IG 0.8908 0.8916 0.896 0.8852 0.8890 0.569
4 RF 0.8354 0.8485 0.138 0.8569 0.8709 0.046
5 RFW 0.8159 0.8395 0.091 0.8393 0.8734 0.001
6 SU 0.8909 0.8912 0.964 0.8848 0.8886 0.589
Table 3 (continued)
• Among the 18 filter-based feature ranking techniques, AUC, PRC, and S2N performed either significantly better than, or better than, most other filters. CS, IG, SU, FM, OR, MI, KS, and Dev demonstrated similar performance to one another and also showed relatively better classification behavior than the remaining filters. Some filters, such as RFW and PR, always displayed low performance, while others, like Pow, presented inconsistent performance across the different groups of datasets.
• Between the two samplers, RUS50 performed significantly better than RUS35 for Eclipse 2, but they showed very similar behavior for Eclipse 1 and Eclipse 3. This is consistent with the results obtained from the ANOVA.
• Between the two strategies, 'Iteration' always showed significantly better behavior than 'No Iteration.' This is especially evident for the more severely imbalanced datasets (Eclipse 1 and Eclipse 2).
• There are 36 groups for interaction B×D; these are formed by each of the 18 feature ranking techniques combined with the two feature selection strategies. The group means demonstrate that for Eclipse 1 and Eclipse
Index Ranker  RUS35: No Iteration / Iteration / p-value  RUS50: No Iteration / Iteration / p-value
(a) NB
1 CS 0.7887 0.7867 0.771 0.7830 0.7857 0.711
2 GR 0.7714 0.7715 0.987 0.7668 0.7753 0.338
3 IG 0.7913 0.7876 0.610 0.7842 0.7872 0.671
4 RF 0.8015 0.8046 0.579 0.8032 0.8079 0.529
5 RFW 0.8150 0.8140 0.893 0.8044 0.8065 0.788
6 SU 0.7844 0.7853 0.897 0.7801 0.7844 0.566
7 FM 0.7815 0.7842 0.707 0.7783 0.7806 0.755
8 OR 0.7915 0.7893 0.766 0.7809 0.7853 0.556
9 Pow 0.7895 0.7894 0.998 0.7858 0.7884 0.746
10 PR 0.7718 0.7694 0.844 0.7706 0.7687 0.878
11 GI 0.7701 0.7695 0.962 0.7715 0.7746 0.733
12 MI 0.7809 0.7844 0.624 0.7811 0.7818 0.926
13 KS 0.7824 0.7855 0.644 0.7823 0.7834 0.874
14 Dev 0.7822 0.7859 0.615 0.7820 0.7817 0.965
15 GM 0.7837 0.7859 0.725 0.7832 0.7844 0.854
16 AUC 0.7862 0.7860 0.977 0.7830 0.7856 0.713
17 PRC 0.7857 0.7865 0.908 0.7834 0.7858 0.720
18 S2N 0.8241 0.8289 0.631 0.8228 0.8282 0.597
(b) MLP
1 CS 0.8547 0.8558 0.899 0.8568 0.8539 0.728
2 GR 0.8292 0.8338 0.725 0.8326 0.8362 0.736
3 IG 0.8560 0.8551 0.912 0.8584 0.8568 0.842
4 RF 0.8343 0.8335 0.920 0.8467 0.8488 0.813
5 RFW 0.8472 0.8459 0.877 0.8449 0.8511 0.504
6 SU 0.8576 0.8543 0.663 0.8537 0.8523 0.862
7 FM 0.8531 0.8566 0.682 0.8556 0.8545 0.902
8 OR 0.8564 0.8551 0.860 0.8476 0.8477 0.988
9 Pow 0.8542 0.8537 0.959 0.8497 0.8539 0.659
10 PR 0.8312 0.8226 0.459 0.8295 0.8272 0.838
11 GI 0.8296 0.8220 0.524 0.8369 0.8266 0.308
12 MI 0.8513 0.8545 0.717 0.8578 0.8495 0.364
13 KS 0.8524 0.8538 0.873 0.8509 0.8516 0.936
14 Dev 0.8506 0.8541 0.694 0.8541 0.8520 0.822
15 GM 0.8529 0.8537 0.924 0.8505 0.8559 0.562
16 AUC 0.8579 0.8596 0.812 0.8551 0.8576 0.746
17 PRC 0.8552 0.8569 0.841 0.8534 0.8564 0.735
18 S2N 0.8629 0.8658 0.751 0.8608 0.8657 0.575
(c) KNN
1 CS 0.8569 0.8614 0.598 0.8531 0.8562 0.732
2 GR 0.8403 0.8389 0.897 0.8122 0.8126 0.971
3 IG 0.8574 0.8601 0.742 0.8547 0.8558 0.905
4 RF 0.7684 0.7718 0.469 0.7896 0.8058 0.012
5 RFW 0.7761 0.7871 0.102 0.7882 0.8070 0.004
6 SU 0.8539 0.8561 0.781 0.8517 0.8579 0.484
Table 4 (continued)
2, in addition to the two main factors, the classification performance was greatly influenced by their interactions. For example, for Eclipse 1, PR significantly outperformed RFW under 'No Iteration,' but this pattern was not observed when the 'Iteration' strategy was adopted (see Fig. 2). For Eclipse 3, the interaction factor did not play a significant role in the performance.
• For the interaction effect between sampler and strategy, there are four different combinations (groups): each of the two sampling approaches paired with each of the two feature selection strategies. The results demonstrate that for Eclipse 2, RUS50 performed better than RUS35 for both strategies, and 'Iteration' performed better than 'No Iteration' for both post-sampling class ratios. This is also reflected in the ANOVA output, where p is 0.648. For Eclipse 1, RUS50 along with 'No Iteration' showed the worst performance among the four combinations. Ranking the remaining three from worst to best classification performance gives RUS35-NI (RUS35 along with 'No Iteration'), RUS35-I (RUS35 along with 'Iteration'), and RUS50-I. The interaction factor for Eclipse 3 showed similar behavior to Eclipse 1.
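That worst-to-best ordering of the four sampler-strategy combinations is just a sort on mean AUC per group. The AUC values below are toy numbers chosen to mirror the Eclipse 1 ordering reported above, not the paper's measurements:

```python
from statistics import mean

# Toy AUCs per (sampler, strategy) combination, mirroring Eclipse 1:
# RUS50-NI worst, then RUS35-NI, RUS35-I, and RUS50-I best.
runs = {
    ("RUS50", "No Iteration"): [0.861, 0.863, 0.862],
    ("RUS35", "No Iteration"): [0.866, 0.868, 0.867],
    ("RUS35", "Iteration"): [0.871, 0.873, 0.872],
    ("RUS50", "Iteration"): [0.876, 0.878, 0.877],
}
worst_to_best = sorted(runs, key=lambda combo: mean(runs[combo]))
print(worst_to_best[0], worst_to_best[-1])
# ('RUS50', 'No Iteration') ('RUS50', 'Iteration')
```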
Table 5 Performance comparison between iteration and no iteration
Eclipse data  Learner  RUS35: No Iteration / Iteration / equal  RUS50: No Iteration / Iteration / equal
1 NB 5 12 1 0 18 0
27.8 % 66.7 % 5.6 % 0.0 % 100.0 % 0.0 %
MLP 3 11 4 2 16 0
16.7 % 61.1 % 22.2 % 11.1 % 88.9 % 0.0 %
KNN 1 16 1 0 18 0
5.6 % 88.9 % 5.6 % 0.0 % 100.0 % 0.0 %
SVM 4 12 2 0 18 0
22.2 % 66.7 % 11.1 % 0.0 % 100.0 % 0.0 %
Total 13 51 8 2 70 0
18.1 % 70.8 % 11.1 % 2.8 % 97.2 % 0.0 %
2 NB 0 14 4 0 18 0
0.0 % 77.8 % 22.2 % 0.0 % 100.0 % 0.0 %
MLP 3 13 2 1 12 5
16.7 % 72.2 % 11.1 % 5.6 % 66.7 % 27.8 %
KNN 1 13 4 1 17 0
5.6 % 72.2 % 22.2 % 5.6 % 94.4 % 0.0 %
SVM 1 14 3 1 16 1
5.6 % 77.8 % 16.7 % 5.6 % 88.9 % 5.6 %
Total 5 54 13 3 63 6
6.9 % 75.0 % 18.1 % 4.2 % 87.5 % 8.3 %
3 NB 5 8 5 1 15 2
27.8 % 44.4 % 27.8 % 5.6 % 83.3 % 11.1 %
MLP 5 9 4 7 8 3
27.8 % 50.0 % 22.2 % 38.9 % 44.4 % 16.7 %
KNN 3 12 3 1 14 3
16.7 % 66.7 % 16.7 % 5.6 % 77.8 % 16.7 %
SVM 4 8 6 2 15 1
22.2 % 44.4 % 33.3 % 11.1 % 83.3 % 5.6 %
Total 17 37 18 11 52 9
23.6 % 51.4 % 25.0 % 15.3 % 72.2 % 12.5 %
Total 35 142 39 16 185 15
16.2 % 65.7 % 18.1 % 7.4 % 85.6 % 6.9 %
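The counts in Table 5 follow a simple binning rule over the 18 paired comparisons per learner: a case with p ≥ 0.90 counts as 'equal,' and otherwise the case goes to whichever strategy had the higher mean AUC. A sketch of that tally (toy inputs; the function name is ours):

```python
def tally_table5(cases, similar_at=0.90):
    """Bin paired comparisons the way Table 5 does.

    `cases` is a list of (p_value, winner) pairs, where `winner` is
    'Iteration' or 'No Iteration' (the strategy with the higher mean AUC).
    """
    counts = {"No Iteration": 0, "Iteration": 0, "equal": 0}
    for p, winner in cases:
        if p >= similar_at:
            counts["equal"] += 1  # p >= 0.90: same or similar performance
        else:
            counts[winner] += 1  # p < 0.90: a "better" case for the winner
    return counts

# Three toy comparisons: one clear win each way, one near-tie.
print(tally_table5([(0.030, "Iteration"),
                    (0.910, "Iteration"),
                    (0.450, "No Iteration")]))
# {'No Iteration': 1, 'Iteration': 1, 'equal': 1}
```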
Fig. 2 Eclipse 1: multiple comparison. [Figure: six subfigures of group means (◦) with 95 % confidence intervals for Factor A (NB, MLP, KNN, SVM), Factor B (CS, GR, IG, RF, RFW, SU, FM, OR, Pow, PR, GI, MI, KS, Dev, GM, AUC, PRC, S2N), Factor C (RUS35, RUS50), Factor D (No Iteration, Iteration), and the interactions B×D and C×D.]
Fig. 3 Eclipse 2: multiple comparison. [Figure: six subfigures of group means (◦) with 95 % confidence intervals for Factors A (learner), B (ranker), C (sampler), D (strategy), and the interactions B×D and C×D.]
Fig. 4 Eclipse 3: multiple comparison. [Figure: six subfigures of group means (◦) with 95 % confidence intervals for Factors A (learner), B (ranker), C (sampler), D (strategy), and the interactions B×D and C×D.]
Table 6 ANOVA results for each group of datasets
Source  Sum Sq.  d.f.  Mean Sq.  F  p-value
(a) Eclipse 1
A (Learner) 2.7831 3 0.9277 691.66 0.000
B (Ranker) 1.2479 17 0.0734 54.73 0.000
C (Sampler) 0.0004 1 0.0004 0.31 0.576
D (Strategy) 0.1579 1 0.1579 117.76 0.000
A×B 0.2551 51 0.0050 3.73 0.000
A×C 0.0046 3 0.0015 1.13 0.333
A×D 0.0102 3 0.0034 2.53 0.055
B×C 0.1264 17 0.0074 5.54 0.000
B×D 0.1403 17 0.0083 6.15 0.000
C×D 0.0423 1 0.0423 31.55 0.000
Error 11.4342 8525 0.0013
Total 16.2024 8639
(b) Eclipse 2
A (Learner) 4.3408 3 1.4469 1537.50 0.000
B (Ranker) 0.5406 17 0.0318 33.79 0.000
C (Sampler) 0.0060 1 0.0060 6.41 0.011
D (Strategy) 0.0589 1 0.0589 62.63 0.000
A×B 0.3166 51 0.0062 6.60 0.000
A×C 0.0022 3 0.0007 0.77 0.509
A×D 0.0038 3 0.0013 1.35 0.256
B×C 0.0694 17 0.0041 4.34 0.000
B×D 0.0684 17 0.0040 4.27 0.000
C×D 0.0002 1 0.0002 0.21 0.648
Error 8.0228 8525 0.0009
Total 13.4297 8639
(c) Eclipse 3
A (Learner) 9.5467 3 3.1822 3162.27 0.000
B (Ranker) 0.8582 17 0.0505 50.16 0.000
C (Sampler) 0.0009 1 0.0009 0.91 0.340
D (Strategy) 0.0062 1 0.0062 6.12 0.013
A×B 0.9947 51 0.0195 19.38 0.000
A×C 0.0020 3 0.0007 0.67 0.570
A×D 0.0042 3 0.0014 1.39 0.245
B×C 0.0415 17 0.0024 2.43 0.001
B×D 0.0096 17 0.0006 0.56 0.923
C×D 0.0010 1 0.0010 1.02 0.312
Error 8.5788 8525 0.0010
Total 20.0438 8639
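Each row of Table 6 is the standard ANOVA decomposition: a sum of squares, its degrees of freedom, the mean square (Sum Sq. / d.f.), and F as the ratio of the factor's mean square to the error mean square. A minimal one-factor version of that arithmetic, on toy AUC values (the paper's model is four-way with interactions, so this is illustrative only):

```python
# One-factor ANOVA decomposition by hand (toy AUC values).
groups = {
    "NB": [0.84, 0.85, 0.86],
    "SVM": [0.90, 0.91, 0.92],
}
n_total = sum(len(g) for g in groups.values())
grand = sum(v for g in groups.values() for v in g) / n_total

# Between-group ("factor") and within-group ("error") sums of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                 for g in groups.values())
ss_within = sum((v - sum(g) / len(g)) ** 2
                for g in groups.values() for v in g)

df_between = len(groups) - 1  # the "d.f." column for the factor
df_within = n_total - len(groups)  # error degrees of freedom
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(f_stat, 2))  # 54.0
```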
paper are based upon the metrics and defect data obtained from nine datasets of a software project with three separate releases. The use of three groups of datasets, each with a different class ratio, strengthens the generalization of our empirical results. The same analysis applied to another software system, especially one from a different application domain, may provide different results, a likely threat in all empirical software engineering research. However, a software quality practitioner would appreciate the process of developing a useful defect predictor in the presence of the class imbalance problem and/or when there are a large number of software metrics to work with. The proposed strategy of employing an iterative feature selection process of data sampling followed by feature ranking, finally aggregating the results generated during the iterative process, can be extended to any software system, especially high-assurance software and/or settings with a large collection of software metrics for building defect predictors.
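The strategy just described reads directly as an algorithm: repeat (undersample, rank), then aggregate the per-iteration ranked lists and keep the top ⌈log2 n⌉ features. The sketch below uses mean-rank aggregation and a toy mean-difference filter as stand-ins for the paper's aggregator and its 18 rankers, so every name and scoring choice here is illustrative, not the authors' implementation:

```python
import math
import random

def undersample(X, y, minority_pct=0.50, seed=None):
    """Random Undersampling: keep all minority (class 1) instances and
    randomly drop majority instances to hit the post-sampling ratio."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    keep = rng.sample(neg, int(len(pos) * (1 - minority_pct) / minority_pct))
    idx = pos + keep
    return [X[i] for i in idx], [y[i] for i in idx]

def rank_features(X, y):
    """Toy filter: score each feature by |mean(class 1) - mean(class 0)|
    and return feature indices best-first (stand-in for the 18 rankers)."""
    def mean(vals):
        return sum(vals) / len(vals)
    scores = []
    for f in range(len(X[0])):
        pos = [row[f] for row, label in zip(X, y) if label == 1]
        neg = [row[f] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return sorted(range(len(X[0])), key=lambda f: -scores[f])

def iterative_feature_selection(X, y, iterations=10, minority_pct=0.50):
    """Sample, rank, repeat; aggregate by mean rank; keep top ceil(log2 n)."""
    n = len(X[0])
    rank_sum = [0] * n
    for it in range(iterations):
        Xs, ys = undersample(X, y, minority_pct, seed=it)
        for position, feat in enumerate(rank_features(Xs, ys)):
            rank_sum[feat] += position
    final = sorted(range(n), key=lambda f: rank_sum[f])
    return final[:math.ceil(math.log2(n))]

# Toy demo: 100 modules, 16 metrics, 20 % defective; metric 0 is informative.
rng = random.Random(0)
y = [1] * 20 + [0] * 80
X = [[(label if f == 0 else 0) + rng.random() for f in range(16)]
     for label in y]
selected = iterative_feature_selection(X, y, iterations=5)
print(len(selected), 0 in selected)  # 4 True
```

With n = 208 metrics this keeps 8 features, matching the study; summing per-iteration rank positions is one of several reasonable aggregation rules.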
Moreover, as all our final conclusions are based on ten runs of five-fold cross-validation and statistical tests for significance, our findings are grounded in sound methods. Finally, it is observed that the type of classifier used for software quality prediction greatly affected the prediction results in this case study. Hence, software quality practitioners should not rule out considering other classification algorithms when applying the proposed approach. Also, the selection of specific parameter settings for a given classifier is likely to cause the classifier to analyze the training data differently.

Threats to internal validity for a software engineering experiment are unaccounted-for influences that may affect case study results. In the context of this study, poor fault-proneness estimates can be caused by a wide variety of factors, including measurement errors while collecting and recording software metrics, modeling errors due to the unskilled use of software applications, errors in model selection during the modeling process, and the presence of outliers and noise in the training dataset. Measurement errors are inherent to the data collection effort of a given software project. In our study, a common model-building and model-evaluation approach is used for all combinations of data sampling technique, feature selection method, and classifier. Moreover, the experiments and statistical analysis were performed by only one skilled person in order to keep modeling errors to a minimum.

A software engineering domain expert was consulted to set the number of software metrics to select from the original set after ranking the software metrics. This number may be different for another project. However, based on our extensive prior work in software quality estimation and quantitative software engineering, we are confident that the selection of ⌈log2 n⌉ metrics in the attribute subset is large enough to capture the quality-based characteristics of most software projects. We have found very often that only a handful of project-specific metrics are relevant for defect prediction, and that in many cases, very few metrics (among the available set) are selected by the learner in the final software quality prediction model.

6 Conclusion

Two major challenges found in a wide range of data mining problems (such as software quality prediction) are high dimensionality and class imbalance. In this study, we propose an iterative feature ranking approach which addresses both of these problems by applying Random Undersampling multiple times, using feature selection to create a separate ranked list from each iteration, and then applying rank aggregation to produce one final feature list. To demonstrate and test this technique, we use 18 filter-based feature ranking techniques and two levels of post-sampling class ratio (35:65 and 50:50), along with three groups of software-quality datasets, each with three separate releases. We build our models using four different classifiers (Naïve Bayes, Multilayer Perceptron, k-Nearest Neighbor, and Support Vector Machine). We also compare the results of the iterative approach with those of a non-iterative approach using only a single run of undersampling followed by feature selection, to demonstrate the importance of the iteration. All feature selection chooses ⌈log2 n⌉ software metrics, where n = 208 is the number of metrics in the original dataset, and thus 8 features are selected.

Our results demonstrate that the iterative approach gives greater performance than the non-iterative approach, especially when using 50:50 as the post-sampling class ratio and especially on the most imbalanced datasets. We also find that the SVM learner always produces better classification results than the other learners, and that the AUC (Area Under the ROC Curve), PRC (Area Under the Precision-Recall Curve), and S2N (Signal-to-Noise) rankers produce better classification results than the other rankers. To validate these results, we use ANOVA analysis and Tukey's Honestly Significant Difference criterion for multiple comparisons, and we find that the iterative vs. non-iterative and SVM vs. other-classifier results are statistically significant at the 95 % confidence level. This gives us the confidence to say that the iterative approach's resilience to random variations in the feature subset does lead to higher classification performance.

Future work will consider additional datasets, both from within the domain of software quality prediction and from other domains. In addition, sampling techniques other than Random Undersampling will be considered.

References

Boetticher, G., Menzies, T., Ostrand, T. (2007). Promise repository of empirical software engineering data. [Online]. Available: http://promisedata.org/.
Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). Classification and regression trees. Boca Raton: Chapman and Hall/CRC Press.
Chen, Z., Menzies, T., Port, D., Boehm, B. (2005). Finding the right data for software cost modeling. IEEE Software, 22(6), 38–46.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods, 2nd edn. Cambridge: Cambridge University Press.
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3, 1289–1305.
Gao, K., Khoshgoftaar, T.M., Seliya, N. (2012). Predicting high-risk program modules by selecting the right software measurements. Software Quality Journal, 20(1), 3–42.
Goh, L., Song, Q., Kasabov, N. (2004). A novel feature selection method to improve classification of gene expression data. In Proceedings of the second conference on Asia-Pacific bioinformatics (pp. 161–166). Dunedin.
Gonzalez, R.C., & Woods, R.E. (2008). Digital image processing, 3rd edn. New Jersey: Prentice Hall.
Haykin, S. (1999). Neural networks: a comprehensive foundation, 2nd edn. New Jersey: Prentice Hall International, Inc.
Jeffery, I.B., Higgins, D.G., Culhane, A.C. (2006). Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics, 7(359).
Jiang, Y., Lin, J., Cukic, B., Menzies, T. (2009). Variance analysis in software fault prediction models. In Proceedings of the 20th IEEE international symposium on software reliability engineering (pp. 99–108). Bangalore-Mysore.
Jong, K., Marchiori, E., Sebag, M., van der Vaart, A. (2004). Feature selection in proteomic pattern data with support vector machines. In Proceedings of the 2004 IEEE symposium on computational intelligence in bioinformatics and computational biology.
Kamal, A.H., Zhu, X., Pandya, A.S., Hsu, S., Shoaib, M. (2009). The impact of gene selection on imbalanced microarray expression data. In Proceedings of the 1st international conference on bioinformatics and computational biology; lecture notes in bioinformatics (Vol. 5462, pp. 259–269). New Orleans.
Khoshgoftaar, T.M., & Gao, K. (2010). A novel software metric selection technique using the area under ROC curves. In Proceedings of the 22nd international conference on software engineering and knowledge engineering (pp. 203–208). San Francisco.
Khoshgoftaar, T.M., Golawala, M., Van Hulse, J. (2007). An empirical study of learning from imbalanced data using random forest. In Proceedings of the 19th IEEE international conference on tools with artificial intelligence (Vol. 2, pp. 310–317). Washington, DC.
Khoshgoftaar, T.M., Gao, K., Bullard, L.A. (2012a). A comparative study of filter-based and wrapper-based feature ranking techniques for software quality modeling. International Journal of Reliability, Quality and Safety Engineering, 18(4), 341–364.
Khoshgoftaar, T.M., Gao, K., Napolitano, A. (2012b). Exploring an iterative feature selection technique for highly imbalanced data sets. In Information reuse and integration (IRI), 2012 IEEE 13th international conference on (pp. 101–108).
Kira, K., & Rendell, L.A. (1992). A practical approach to feature selection. In Proceedings of the 9th international workshop on machine learning (pp. 249–256).
Lessmann, S., Baesens, B., Mues, C., Pietsch, S. (2008). Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Transactions on Software Engineering, 34(4), 485–496.
Liu, T.-Y. (2009). EasyEnsemble and feature selection for imbalance data sets. In Proceedings of the 2009 international joint conference on bioinformatics, systems biology and intelligent computing (pp. 517–520). Washington, DC: IEEE Computer Society.
Liu, H., Motoda, H., Setiono, R., Zhao, Z. (2010). Feature selection: an ever evolving frontier in data mining. In Proceedings of the fourth international workshop on feature selection in data mining (pp. 4–13). Hyderabad.
Menzies, T., Greenwald, J., Frank, A. (2007). Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1), 2–13.
Mishra, D., & Sahu, B. (2011). Feature selection for cancer classification: a signal-to-noise ratio approach. International Journal of Scientific & Engineering Research, 2(4).
Rodriguez, D., Ruiz, R., Cuadrado-Gallego, J., Aguilar-Ruiz, J. (2007). Detecting fault modules applying feature selection to classifiers. In Proceedings of the 8th IEEE international conference on information reuse and integration (pp. 667–672). Las Vegas.
Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A. (2010). RUSBoost: a hybrid approach to alleviate class imbalance. IEEE Transactions on Systems, Man & Cybernetics: Part A: Systems and Humans, 40(1), 185–197.
Song, Q., Jia, Z., Shepperd, M., Ying, S., Liu, J. (2011). A general software defect-proneness prediction framework. IEEE Transactions on Software Engineering, 37(3), 356–370.
Souza, J., Japkowicz, N., Matwin, S. (2005). StochFS: a framework for combining feature selection outcomes through a stochastic process. In Knowledge discovery in databases: PKDD 2005 (Vol. 3721, pp. 667–674).
Votta, L.G., & Porter, A.A. (1995). Experimental software engineering: a report on the state of the art. In Proceedings of the 17th international conference on software engineering (pp. 277–279). Seattle: IEEE Computer Society.
Witten, I.H., Frank, E., Hall, M.A. (2011). Data mining: practical machine learning tools and techniques, 3rd edn. Burlington: Morgan Kaufmann.
Wohlin, C., Runeson, P., Host, M., Ohlsson, M.C., Regnell, B., Wesslen, A. (2012). Experimentation in software engineering. Heidelberg/New York: Springer.
Zimmermann, T., Premraj, R., Zeller, A. (2007). Predicting defects for Eclipse. In Proceedings of the 29th international conference on software engineering workshops (p. 76). Washington, DC: IEEE Computer Society.

Taghi M. Khoshgoftaar is a professor of the Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, and the Director of the Data Mining and Machine Learning Laboratory and the Empirical Software Engineering Laboratory. His research interests are in big data analytics, data mining and machine learning, health informatics and bioinformatics, and software engineering. He has published more than 500 refereed journal and conference papers in these areas. He was the conference chair of the IEEE International Conference on Machine Learning and Applications (ICMLA 2012). He is the workshop chair of the IEEE IRI Health Informatics workshop (2013). Also, he is the Editor-in-Chief of the Big Data journal.

Kehan Gao is an associate professor at the Department of Mathematics and Computer Science, Eastern Connecticut State University. Her research interests include software engineering, software metrics, software reliability and quality engineering, computer performance modeling, computational intelligence, and data mining and machine learning. She is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society.

Amri Napolitano is a research associate at the Department of Computer and Electrical Engineering and Computer Science (CEECS), Florida Atlantic University. He received the Ph.D. and M.S. degrees in Computer Science from Florida Atlantic University in 2009 and 2006, respectively, and the B.S. degree in Computer and Information Science from the University of Florida in 2004. His research interests include data mining and machine learning, evolutionary computation, and artificial intelligence.

Randall Wald is a research associate at the Department of Computer and Electrical Engineering and Computer Science (CEECS), Florida Atlantic University. He studies different challenges in data mining and machine learning as they apply to data from many different application domains, including machine condition monitoring, bioinformatics, and social network mining.