Art:10.1007/s10796 013 9430 0 PDF

Inf Syst Front (2014) 16:801–822
DOI 10.1007/s10796-013-9430-0
A comparative study of iterative and non-iterative feature

selection techniques for software defect prediction
Taghi M. Khoshgoftaar · Kehan Gao ·
Amri Napolitano · Randall Wald
Published online: 27 April 2013

© Springer Science+Business Media New York 2013
Abstract Two important problems which can affect the use of sampling and feature selection without the itera-
the performance of classification models are high- tive step (e.g., using the ranked list from a single iteration,
dimensionality (an overabundance of independent features rather than combining the lists from multiple iterations),
in the dataset) and imbalanced data (a skewed class dis- and compare these results to those from the version which
tribution which creates at least one class with many fewer uses iteration. Our study is carried out using three groups of
instances than other classes). To resolve these problems datasets with different levels of class balance, all of which
concurrently, we propose an iterative feature selection were collected from a real-world software system. All of our
approach, which repeated applies data sampling (in order experiments use four different learners and one feature sub-
to address class imbalance) followed by feature selection set size. We find that our proposed iterative feature selection
(in order to address high-dimensionality), and finally we approach outperforms the non-iterative approach.
perform an aggregation step which combines the ranked
feature lists from the separate iterations of sampling. This Keywords Iterative feature selection · Software defect
approach is designed to find a ranked feature list which is prediction · Date sampling · High dimensionality · Class
particularly effective on the more balanced dataset result- imbalance
ing from sampling while minimizing the risk of losing data
through the sampling step and missing important features.
To demonstrate this technique, we employ 18 different fea- 1 Introduction
ture selection algorithms and Random Undersampling with
two post-sampling class distributions. We also investigate Two major challenges are found across many data min-
ing and machine learning problems: high dimensionality
and class imbalance. High dimensionality refers to datasets
which have a large number of independent attributes (fea-
T. M. Khoshgoftaar () · A. Napolitano · R. Wald tures), especially relative to the total number of instances. In
Empirical Software Engineering Laboratory, many high-dimensional datasets, only a small fraction of the
Department of Computer and Electrical, Engineering and
Computer Science, Florida Atlantic University, Boca Raton,
features actually provide useful information about the class
FL 33431, USA variable: the rest may be redundant (provide information
e-mail: khoshgof@fau.edu which is already contained in other features) or irrelevant
A. Napolitano (provide no useful information whatsoever). By removing
e-mail: anapoli1@fau.edu these useless features, classification models can be built
R. Wald with much lower computational cost, and often these mod-
e-mail: rwald1@fau.edu els will have greater performance than those built using all
of the features. In addition, the act of identifying the most
K. Gao
Eastern Connecticut State University
important features is often useful in and of itself, as it tells
Willimantic, CT 06226, USA practitioners which features are actually affecting the class
e-mail: gaok@easternct.edu variable.
802 Inf Syst Front (2014) 16:801–822
Class imbalance is a separate problem, wherein the class engineering (Gonzalez and Woods 2008) which has only
ratio is especially skewed. For example, in a binary-class recently been applied towards feature ranking (Mishra and
problem, this may result in one class having far fewer Sahu 2011). Two different post-sampling distributions are
instances than the other class. This is a major challenge used with Random Undersampling: 35:65 and 50:50. For
because typically, the minority class will also be the positive this study, we use data from the domain of software quality
class (that is, the class of interest). Classification models, on prediction, a field which seeks to use software metrics col-
the other hand, are far more likely to correctly classify the lected during the software development process in order to
majority (negative) class, because a model which improves identify which software modules are most likely to contain
performance on this class will have greater overall accuracy. faults (Lessmann et al. 2008; Song et al. 2011). In particu-
For example, in the domain of software defect prediction, lar, three groups of software datasets are used, all collected
there are (ideally) far fewer fault-prone (fp) software mod- from a real-world software system.
ules than not-fault-prone (nfp) modules. The goal is to iden- We use these 18 rankers, two levels of Random Under-
tify those modules which are likely to contain faults, but, sampling, and three groups of datasets with two different
without preprocessing, a model is more likely to correctly approaches: both the iterative strategy which applies under-
identify nfp modules, even if this means incorrectly assign- sampling and feature selection multiple times and aggre-
ing fp modules to the nfp class. Many solutions have been gates the resulting ranked lists, and a non-iterative strategy
proposed to address the class imbalance problem, but one which only performs undersampling and feature selection
popular choice is sampling: changing the dataset to make it once. In both cases, we use the resulting feature ranked lists
more balanced. This can be done either by adding instances (specifically, the top 8 features from these lists) along with
to the minority class or by removing instances from the four classification algorithms (Naı̈ve Bayes, Multilayer Per-
majority class, and by either random or directed approaches, ceptron, k-Nearest Neighbors, and Support Vector Machine)
but one of the simpler strategies, Random Undersampling to build models on the full (i.e., non-sampled) datasets.
(that is, randomly discarding majority-class instances), has This paper extends our previous work (Khoshgoftaar
proven especially effective. et al. 2012b) by incorporating a wider range of classifiers
Although much work has considered the high- and datasets, as well as by performing a more extensive
dimensionality and class imbalance problems indepen- statistical analysis and discussing threats to validity. The
dently, few have specifically addressed datasets which results show that our iterative approach outperforms (on
contain both of these problems, and fewer have proposed average) the non-iterative version for all levels of class
algorithms specifically designed to address these prob- imbalance, although the difference between the iterative and
lems in tandem. The goal of this paper is to present a new non-iterative approach is strongest on the most imbalanced
strategy for data preprocessing to resolve both problems. datasets. We also see that this distinction becomes more evi-
This strategy consists of multiple iterations of Random dent when using Random Undersampling to create a 50:50
Undersampling applied to modify the datasets into a chosen class ratio.
degree of class balance, followed by feature selection per- The remainder of the paper is organized as follows.
formed on the reduced dataset. After all iterations have been Section 2 presents related work. The 18 filter-based fea-
performed, the ranked feature lists from each are combined, ture ranking techniques and our proposed iterative feature
to reduce the influence of the randomness from the under- selection method, as well as the four classifiers and the
sampling step and ensure only those features which are associated classification performance metric used in the
relevant to a large proportion of the iterations are preserved study are described in Section 3. A case study using three
in the final ranking. This final ranking is used along with groups of datasets from a real-world software system is pro-
the original (non-sampled) dataset to build classification vided in Section 4. The threats to validity are discussed in
models. Section 5. Finally, conclusions and future work are indicated
To demonstrate this algorithm, we perform a case study in Section 6.
using 18 filter-based feature ranking techniques, along
with Random Undersampling as our sampling tech-
nique. These 18 feature selection algorithms include 2 Related work
six commonly-used algorithms (Chi-Squared, Information
Gain, Gain Ratio, Symmetrical Uncertainty, and two ver- Feature selection (FS), also known as attribute selection, is
sions of ReliefF (Witten et al. 2011)), 11 threshold-based a process of selecting some subset of the features which are
feature selection techniques proposed by our research team useful in building a classifier. It deletes as many unneces-
(Khoshgoftaar and Gao 2010), and the Signal-to-Noise con- sary features as possible, leaving only those features that
cept from the domain of electrical and communication are important to the class attribute. FS techniques can be
Inf Syst Front (2014) 16:801–822 803
divided into wrappers, filters, and embedded categories (Liu 3 Methodology

et al. 2010). Wrappers use a search algorithm to search
through the space of possible features, evaluate each subset 3.1 Filter-based feature ranking techniques
through a learning algorithm, and determine which ones are
finally selected in building a classifier. Filters use a simpler The procedure of feature ranking is to score each feature
statistical measure or some intrinsic characteristic to eval- according to a particular method, allowing the selection of
uate each subset or individual feature rather than using a the best features.
learning algorithm. Embedded approaches, like the wrapper
methods, are also linked to a learning algorithm; however 3.1.1 Standard techniques
this link is much stronger compared to wrappers, as FS
is included in the classifier construction in this case. Fea- The six commonly used filter-based feature ranking tech-
ture selection can also be categorized as ranking or subset niques examined in this work include (Witten et al. 2011):
selection (Liu et al. 2010). Feature ranking scores the Chi-Squared (CS), Information Gain (IG), Gain Ratio (GR),
attributes based on their individual predictive power, while two types of ReliefF (RF and RFW), and Symmetrical
subset selection selects subset of attributes that collec- Uncertainty (SU).
tively have good prediction capability. In this study, the FS The Chi-Squared, χ 2 , test is used to examine the distri-
techniques used belong to the filter-based feature ranking bution of the class as it relates to the values of the given
category. feature. The null hypothesis is that there is no correlation,
Numerous variations of FS have been employed in a i.e., each value is as likely to have instances in any one class
range of fields. Jong et al. (2004) introduced methods as any other class. Given the null hypothesis, the χ 2 statistic
for FS based on support vector machines. Jeffery et al. measures how far away the actual value is from the expected
(2006) compared some of the most commonly used feature value:
selection methods in identifying differentially expressed r nc
(Oi,j − Ei,j )2
genes in microarray data. In the context of text mining, χ2 =
Ei,j
Forman (2003) investigated multiple filter-based feature i=1 j =1
ranking techniques. Rodriguez et al. (2007) applied fea- where r is the number of different values of the feature, nc
ture subset selection with three filter-based models and is the number of classes (in this work, nc = 2), Oi,j is the
two wrapper-based models to five software engineering observed number of instances with value i which are in class
datasets. j, and Ei,j is the expected number of instances with value
Class imbalance, which appears in various domains i and class j. The larger this χ 2 statistic, the more likely it
(Kamal et al. 2009), is another significant problem in data is that the distribution of values and classes are dependent,
mining. One effective method for alleviating the adverse i.e., the feature is relevant to the class.
effect of skewed class distribution is sampling (Seiffert et al. Information Gain (IG) is the information provided about
2010). While considerable work has been done for feature the target class attribute Y, given the value of another
selection and data sampling separately, research on inves- attribute X. IG measures the decrease of the weighted aver-
tigating both together started recently. Chen et al. (2005) age impurity of the partitions compared to the impurity of
have studied data row pruning (data sampling) and data the complete set of data. A drawback of IG is that it tends to
column pruning (feature selection) in the context of soft- prefer attributes with a larger number of possible values, i.e.,
ware cost/effort estimation. However, the data sampling in if one attribute has a larger number of values, it will appear
their study was not specific for the class imbalance prob- to gain more information than those with fewer values, even
lem, and also the classification models were not for binary if it is actually no more informative. One strategy to solve
classification problems. this problem is to use the Gain Ratio (GR), which penalizes
Our research group recently studied various feature multiple-valued attributes. Symmetrical Uncertainty (SU) is
selection techniques, including filter-based and wrapper- another way to overcome the problem of IG’s bias toward
based methods (Khoshgoftaar et al. 2012a; Gao et al. 2012), attributes with more values, doing so by dividing by the sum
and applied them to a variety of software datasets. The of the entropies of X and Y.
results demonstrate that the performances of the classifi- Relief is an instance-based feature ranking technique
cation models were maintained or even improved when introduced by Kira and Rendell (1992). It measures the
over 85 % of the features were eliminated from the origi- importance of features by considering how much their val-
nal datasets. Also, filter-based ranking techniques not only ues change when comparing a randomly chosen instance
showed better classification performance but also had a with its nearest hit (an instance from the same class) and
lower computational cost. its nearest miss (one from a different class). ReliefF is an
804 Inf Syst Front (2014) 16:801–822
extension of the Relief algorithm that can handle noise and as negative and below as positive. The better result is used.
multiclass datasets, and is implemented in the WEKA tool1 Each of the 11 metrics are calculated for each attribute indi-
(Witten et al. 2011). When the WeightByDistance vidually, and attributes with higher values for F-Measure,
(weight nearest neighbors by their distance) parameter is set Geometric Mean, Probability Ratio, Power, Area Under
as default (false), the algorithm is referred to as RF; when the the ROC Curve, Area Under the Precision-Recall Curve,
parameter is set to ‘true,’ the algorithm is referred to as RFW. Mutual Information, Kolmogorov-Smirnov Statistic, and
Odds Ratio and lower values for Gini Index and Deviance
3.1.2 Threshold-based feature selection are determined to better predict the class attribute. In this
manner, the attributes can be ranked from most to least
The threshold-based feature selection (TBFS) technique predictive based on each of the 11 metrics.
was proposed by our research team and implemented within
a. F-Measure (FM). FM is derived from recall (or true
WEKA (Witten et al. 2011). The procedure is shown in Algo-
positive rate) and precision.
rithm 1. Each independent attribute works individually with
the class attribute, and that two-attribute dataset is evalu- 2 × T P R(t) × P RE(t)
FM = max .
ated using different performance metrics. More specifically, t ∈[0,1] T P R(t) + P RE(t)
the TBFS procedure includes two steps: (1) normalizing Recall and precision are calculated at each point along
the attribute values so that they fall between 0 and 1; the normalized attribute range of 0 to 1. The maximum
and (2) treating those values as the posterior probabilities F-measure obtained by each attribute represents how
from which to calculate performance metrics. Note that no strongly that particular attribute relates to the class,
classifiers were built during the feature selection process. according to the F-measure.
Analogous to the procedure for calculating rates in a clas- b. Odds Ratio (OR). OR is defined as:
sification setting with a posterior probability, the true pos- T P R(t)(1 − F P R(t))
itive (T P R), true negative (T N R), false positive (F P R), OR = max
t ∈[0,1] (1 − T P R(t))F P R(t)
and false negative (F N R) rates can be calculated at each
j . T P R(t) T N R(t)
threshold t ∈ [0, 1] relative to the normalized attribute F = max
t ∈[0,1] F P R(t) F N R(t)
Precision P RE(t) is defined as the fraction of the predicted-
positive examples which are actually positive. The fea-
ture rankers we propose utilize these five rates as descri- OR is the maximum value of the ratio of the product of
bed below. The value is computed in both directions: first correct to incorrect predictions.
treating instances above the threshold (t) as positive and be- c. Power (Pow). Pow is defined as:

low as negative, then treating instances above the threshold Pow = max (1 − F P R(t))k − (1 − T P R(t))k
t ∈[0,1]

1 Waikato
= max (T N R(t))k − (F N R(t))k
Environment for Knowledge Analysis ( WEKA) is a popular t ∈[0,1]
suite of machine learning software written in Java, developed at the
for some integer k ≥ 1. Note that if k = 1, Power is
University of Waikato. WEKA is free software available under the GNU
General Public License. In this study, all experiments and algorithms equivalent to KS (described in Item g). In this work, we
were implemented in the WEKA tool. use k = 5 as done by Forman (2003).
Inf Syst Front (2014) 16:801–822 805
d. Probability Ratio (PR). PR is defined as: The larger the KS value, the better the attribute is able
T P R(t) to separate the two classes, and hence the more signifi-
PR = max cant the attribute is. The range of KS is between 0 and 1.
t ∈[0,1] F P R(t)
h. Deviance (Dev). Dev is the minimum residual sum of
e. Gini Index (GI). GI was first introduced by Breiman squares based on a threshold t. It measures the sum
et al. (1984) within the CART algorithm. For a given of the squared errors from the mean class given a par-
j (x) > t} and S̄t = {x |
threshold t, let St = {x | F titioning of the space based on the threshold t. As
(x) ≤ t}. The Gini index is calculated as:
F j it represents the total error found in the partitioning,
lower values are preferred.
GI = min 1 − P 2 (T P (t) | St ) + P 2 (F P (t) | St )
t∈[0,1] i. Geometric Mean (GM). GM is the square root of the

+ 1 − P 2 (T N (t) | S̄t ) + P 2 (F N (t) | S̄t ) product of the true positive rate and true negative rate.
GM ranges from 0 to 1, and an attribute that is perfectly
= min [2P RE(t)(1 − P RE(t)) + 2N P V (t)(1 − N P V (t))].
t∈[0,1] correlated to the class provides a value of 1. GM is a
useful performance measure since it is inclined to max-
where T P (t) is the number of true positives given imize the true positive rate and the true negative rate
threshold t (and similarly for T N (t), F P (t), and while keeping them relatively balanced. GM is calcu-
F N (t)). N P V or negative predictive value represents lated at each value of the normalized attribute range,
the percentage of examples predicted to be negative and the maximum value of GM is used as a measure of
that are actually negative and is very similar to the pre- attribute strength.
cision — in fact, it is often thought of as the precision j. Area Under the ROC Curve (AUC). Receiver Oper-
of instances predicted to be in the negative class. The ating Characteristic, or ROC, curves graph true positive
Gini index for the attribute is then the minimum Gini rate on the y-axis versus the false positive rate on
index at all decision thresholds t ∈ [0, 1]. the x-axis. The resulting curve illustrates the trade-off
f. Mutual Information (MI). Let c(x) ∈ {P , N } denote between true positive rate and false positive rate. In this
the actual class of instance x, and let c t (x) denote study, ROC curves are generated by varying the deci-
the predicted class based on the value of the attribute sion threshold t (between 0 and 1) used to transform
F j and a given threshold t. MI computes the criterion the normalized attribute values into a predicted class.
with respect to the number of times a feature value and AUC is used to provide a single numerical metric for
a class co-occur, the feature value occurs without the comparing the predictive power of each attribute.
class, and the class occurs without the feature value. k. Area Under the Precision-Recall Curve (PRC). PRC
The MI metric is defined as: is a single-value measure that originated from the area
p(ĉt , c) of information retrieval. A precision-recall curve is
MI = max p(ĉt , c) log generated by varying the decision threshold t from 0 to
t ∈[0,1] t p(ĉt )p(c)
ĉ ∈{P ,N } c∈{P ,N } 1 and plotting the recall (y-axis) and precision (x-axis)
where at each point in a similar manner to the ROC curve. The
| {x | (ĉt (x) = α) ∩ (c(x) = β)} | area under the PRC ranges from 0 to 1, and an attribute
p(ĉt = α, c = β) = , with more predictive power results in an area under the
|P |+|N |
| {x | ĉt (x) = α} |
PRC closer to 1.
p(ĉt = α) = ,
|P |+|N |
| {x | c(x) = α} | 3.1.3 Signal-to-noise ratio technique
p(c = α) = ,
|P |+|N |
α, β ∈ {P , N}. Signal-to-Noise ratio (S2N) (Goh et al. 2004) is a sim-
ple univariate ranking technique which defines how well
a feature discriminates between two classes in a two class
g. Kolmogorov-Smirnov Statistic (KS). KS measures problem. S2N, for a given feature, separates the means
the maximum difference between the cumulative dis- of the two classes relative to the sum of their standard
tribution functions of examples in each class based on deviations. The equation to calculate S2N is
the normalized attribute F j . The distribution function
Fc (t) for a class c is estimated by the proportion of μP − μN
examples x from class c with F j (x) ≤ t, t ∈ [0, 1]. In S2N =
σP + σN
a two class setting with c ∈ {P , N }, KS is computed as
KS = max |FP (t) − FN (t)|. where μP and μN are the mean values of a particular
t ∈[0,1] attribute for the samples from class P and class N, and
806 Inf Syst Front (2014) 16:801–822
Algorithm 2: Iterative Feature Selection Algorithm

input :
1. Dataset with features = 1 ,...,m ;
2. Each instance x D is assigned to one of two classes c(x ) f p, nf p ;
3. Filter-based feature ranking technique CS, GR, IG, RF, RFW, SU, FM, OR, Pow, PR, GI, MI, KS, Dev, GM, AUC,
PRC, S2N ;
4. Data sampling technique: RUS35, RUS50 ;
5. A predefined threshold: number (percentage) of the features to be selected.
output:
Selected features.
for i = 1 do
Use to balance and get the balanced data i ;
Employ to rank features on , and get new rankings ( ), = 1 ,..., ;
Create feature ranking by combining the k different rankings ( )| = 1 , . . . , j with mean (average).
Select features according to feature ranking and a predefined threshold.
σP and σN are the corresponding standard deviations. The built-in feature selection capability, and (2) they are com-
larger the S2N value, the more relevant the feature is to the monly used in both the software engineering and the data
class attribute. mining domains (Lessmann et al. 2008; Jiang et al. 2009;
Menzies et al. 2007). We employ the WEKA tool (Witten
3.2 The iterative feature selection approach et al. 2011) to implement these classifiers. Parameter set-
tings are subject to the data explored. Unless stated, we use
The proposed method is designed to deal with feature default parameter settings as specified in WEKA.
selection for imbalanced data. Algorithm 2 presents the
procedure of this approach. It consists of two basic steps: 3.3.1 Naı̈ve Bayes
1. Using the Random Undersampling (RUS) technique to
To determine the classification of an instance, one method
balance data. RUS creates the balanced data by ran-
is to use a probability model in which the features that were
domly removing examples from the majority class. In
chosen by the feature rankers are used as the conditions for
this work, we study two post-sampling proportions:
the probability of the sample being a member of the class.
35:65 and 50:50, meaning the ratio between the minor-
A basic probability model would look like p(C|F1 , . . . , Fn)
ity (fp) examples and majority (nfp) examples is 35:65
where Fi is the value of each feature used and C is the class
and 50:50, respectively, after sampling.
of the instance. This model is known as the posterior, and we
2. Applying a filter-based feature ranking technique to the
assign the instance to the class for which it has the largest
sampled data and ranking all the features according
posterior (Souza et al. 2005).
to their predictive powers (scores). We investigate 18
Unfortunately, it is quite difficult to determine the pos-
filter-based feature ranking techniques.
terior directly. Thus, it is necessary to use Bayes’s rule
In order to alleviate the biased results generated due to which states that the posterior equals the ratio of the prior
the sampling process, we repeat the two steps k times multiplied by the likelihood over the evidence, or
(k = 10 in this study) and aggregate k rankings using p(C)p(F1 , . . . , Fn |C)
the mean (average). Finally, the best set of attributes p(C|F1 , . . . , Fn ) =
p(F1 , . . . , Fn)
is selected from the original data to form the training
dataset. In reality, the formula above can be simplified by certain
assumptions. The evidence, p(F1 , . . . , Fn), is always con-
3.3 Classifiers stant for the specific dataset and therefore can be ignored
for the purposes of classification. The likelihood formula,

The software defect prediction models are built using p(F1 , . . . , Fn |C), can be simplified to i p(Fi |C) due to
four different classification algorithms, including Naı̈ve the naive assumption that all of the features are condition-
Bayes (NB) (Witten et al. 2011), Multilayer Percep- ally independent of all of the other features. This naı̈ve
tron (MLP) (Haykin 1999), k-Nearest Neighbors (KNN) assumption with the removal of the evidence parameter
(Witten et al. 2011), and Support Vector Machine (SVM) creates the Naı̈ve Bayes classifier.

(Cristianini and Shawe-Taylor 2000). These learners were p(C|F1 , . . . , Fn ) = p(C) p(Fi |C)
selected for two key reasons: (1) they do not have a i
Inf Syst Front (2014) 16:801–822 807
3.3.2 Multilayer perceptron weight vector, w, and the bias, ω0 . One aspect that must
be addressed is that there can be multiple discriminants
A multilayer perceptron (MLP) is a type of artificial neural that correctly classify the two classes. The support vector
network. Artificial neural networks consist of nodes which machine, or SVM, is a linear discriminant classifier which
are arranged in sets called layers. Each node in a layer has assumes that the best discriminant maximizes the distance
a connection coming from every node in the layer before between the two classes. This is measured in the distance
it and to every node in the layer after it. Each node takes from the discriminant to the samples of both classes (Liu
the weighted sum of all of the input nodes. Along with the 2009).
weighted sums, an activation function is also applied. The
application of the activation function to the result of the 3.4 Classification performance metric
weighted sum allows for a more clearly defined result by
further separating the instances in the two classes from each In this study, we use the Area Under the ROC (receiver
other. Neural networks are well known for being robust to operating characteristic) curve (i.e., AUC) to evaluate clas-
redundant features. However, neural networks sometimes sification models. The ROC curve graphs true positive rates
have problems with overfitting (Haykin 1999). versus the false positive rates (the positive class is syn-
onymous with the minority class). Traditional performance
3.3.3 K-nearest neighbors metrics for classifier evaluation consider only the default
decision threshold of 0.5. ROC curves illustrate the per-
The k-Nearest Neighbors, or KNN, learner is an example formance across all decision thresholds. A classifier that
of an instance-based and lazy learning algorithm. Instance provides a large area under the curve is preferable over a
based algorithms use only the training data without creat- classifier with a smaller area under the curve. A perfect clas-
ing statistics on which to base their hypotheses The KNN sifier provides an AUC that equals 1. AUC is one of the most
learner does this by calculating the distance of the test sam- widely used single numeric measures that provides a gen-
ple from every training instance, and the predicted class is eral idea of the predictive potential of the classifier. It has
derived from the k nearest neighbors. also been shown that AUC is of lower variance and is more
In the KNN learner, when we get a test sample we would reliable than other performance metrics such as precision,
like to classify, we tabulate the classes for each of the k clos- recall, and F-measure (Jiang et al. 2009).
est training samples (we used a k of five for our experiment)
and we determine the weight of each neighbor by taking
a measurement of dist1ance where distance is the distance 4 A case study
from the test sample. After the classes and weights are tab-
ulated, we add all of the weights from the neighbors of the 4.1 Datasets
positive class together and all of the weights of the nega-
tive class together. The prediction will be the class with the The case study data is obtained from the publicly avail-
largest cumulative weight (Souza et al. 2005). able PROMISE software project data repository (Boetticher
The KNN learner can use any metric that is appro- et al. 2007). We study the software measurement datasets for
priate to calculate the distance between the samples. The the Java-based Eclipse project (Zimmermann et al. 2007).
standard metric used in KNN is Euclidean Distance, The software metrics and defect data are aggregated at the
defined as software packages level; hence, a program module is a
Java package in Eclipse. We consider three releases of the

n Eclipse system, where the releases are denoted as 2.0, 2.1,

d(x, y) = (xi − yi )2 and 3.0 (Zimmermann et al. 2007).
i=1 Each system release contains the following information
(Zimmermann et al. 2007): name of the package for which
3.3.4 Support vector machine the metrics are collected (name); number of defects reported
6 months prior to release (pre-release defects); number
One of the most efficient ways to classify between two of defects reported 6 months after release (post-release
classes is to assume that both classes are linearly separated defects); a set of complexity metrics computed for classes or
from each other. This assumption allows us to use a dis- methods and aggregated by using average, maximum, and
criminant to split the instances into the two classes before total at the package level (complexity metrics); and structure
looking at the distribution between the classes. A linear dis- of abstract syntax tree(s) of the package consisting of the
criminant uses the formula g(x|w, ω0 ) = wT x + ω0 . In the node size, type, and frequency (structure of abstract syntax
case of the linear discriminant, we only need to learn the tree(s)).
808 Inf Syst Front (2014) 16:801–822
In our case study, we modify the original data by: (1) Although we use different learners here, a preliminary
removing all non-numeric attributes, including the package study showed that log2 n is still appropriate for various
names, and (2) converting the post-release defects attribute learners.
to a binary class attribute, where fault-prone (fp) is the After feature selection, we applied four learners (NB,
minority class and not-fault-prone (nfp) is the majority MLP, KNN, and SVM) to the training datasets with the
class. A program module’s membership in a given class selected features, and we used AUC to evaluate the perfor-
is determined by a post-release defects threshold, thd. A mance of the classification models.
program module (package) with thd or more post-release
defects is labeled as fp, while those with fewer than thd 4.3 Results & analysis
defects are labeled as nfp.
For a given system release we consider three values for All the results are reported in Tables 2, 3, and 4, each
thd: for releases 2.0 and 3.0, thd = {10, 5, 3}, and for representing the results for each group of the datasets using
release 2.1, thd = {5, 4, 2}. This results in three groups of NB, MLP, KNN, or SVM. The values presented in the tables
datasets, as shown in Table 1. A different set of thresholds show the average AUC for every classification model con-
is chosen for release 2.1 because we wanted to maintain rel- structed over the ten runs of five-fold cross-validation and
atively similar class distributions for the three datasets in across that particular group of datasets. Cross-validation is
a given group. Table 1 presents key details about the dif- a technique for estimating the performance of a predictive
ferent datasets used in our study. All datasets contain 209 model. In this study, for each of the five folds one fold is
software attributes, which include 208 independent (predic- used as the test data while the other four folds are used as
tor) attributes and one dependent attribute (fp or nfp label). training data. All the preprocessing steps (feature selection
The three groups of datasets exhibit different class distribu- and data sampling) are done on the training dataset. The pro-
tions with respect to the fp and nfp modules, where Eclipse cessed training data is then used to build the classification
1 is relatively the most imbalanced and Eclipse 3 is the least model and the resulting model is applied to the test fold.
imbalanced. This cross-validation is repeated five times (the folds), with
each fold used exactly once as the test data. The five results
4.2 Experiment design from the five folds then can be averaged to produce a single
estimation.
In the experiments, we applied 18 filter-based feature rank- An unpaired two tailed t-test was used for each paired
ing techniques and two levels of sampling to the datasets. comparison between two strategies (iterative and non-
The main purpose of the experiment is to compare the iterative processes). The t-test examines the null hypothesis
performance of the classification models when using two that the population means related to two independent group
different feature selection strategies. samples are equal against the alternative hypothesis that the
population means are different. The p-values are provided
• Strategy 1: Data sampling followed by feature ranking for each pair of comparisons in the tables. The significance
(denoted ‘No Iteration’). level is set to 0.05; when the p-value is less than 0.05, the
• Strategy 2: An iterative process of data sampling fol- two group means (between‘Iteration’ and ‘No Iteration’)
lowed by feature ranking (denoted ‘Iteration’) are significantly different from one another. For example,
for Eclipse 1 with NB (Table 2), when using RF (Index 4)
The process of the two strategies are shown in Fig. 1, to rank features and RUS35 to sample the data, the result
where the upper part represents strategy 1 and the lower part demonstrates that the iterative process significantly outper-
represents strategy 2. formed the non-iterative process because the p-value (0.03)
The number of features that are selected for modeling is less than the specified cutoff 0.05. For the AUC ranker
needs to be determined in advance. We choose log2 n (Index 16) with RUS35 sampler, the iterative process is
features that have the highest scores, where n is the num- better than the non-iterative process (p = 0.4), but the dif-
ber of the independent features in the original dataset. ference is not statistically significant. As another example,
For the three groups of Eclipse datasets log2 n = 8, for the same datasets and same learner, when using the PRC
where n = 208. We select log2 n attributes, because ranker (Index 17) with RUS35, the classification perfor-
1) related literature does not provide guidance on the mance for the two strategies are equal or very similar. This
appropriate number of features to select; and 2) one of can also be confirmed by the p-value, which is 0.91. For
our recent empirical studies (Khoshgoftaar et al. 2007) each paired comparison between ‘Iteration’ and ‘No Iter-
recommended using log2 n features when employing ation,’ one can always find which one performs better at
WEKA to build random forests learners for binary classi- 0.05 < p < 0.90 (marked with bold) or significantly better
fication in general and imbalanced datasets in particular. at p ≤ 0.05 (marked with bold) than the other or whether
Inf Syst Front (2014) 16:801–822 809
Table 1 Data characteristics

Dataset Rel. thd #Attr. #Inst. #fp %fp #nfp %nfp
Eclipse 1 2.0 10 209 377 23 6.1 354 93.9

2.1 5 209 434 34 7.8 400 92.2
3.0 10 209 661 41 6.2 620 93.8
Eclipse 2 2.0 5 209 377 52 13.8 325 86.2
2.1 4 209 434 50 11.5 384 88.5
3.0 5 209 661 98 14.8 563 85.2
Eclipse 3 2.0 3 209 377 101 26.8 276 73.2
2.1 2 209 434 125 28.8 309 71.2
3.0 3 209 661 157 23.8 504 76.2
they have the same or similar performance with p ≥ 0.90 strategies we investigate in this study. The pairwise inter-
(marked with underline). action effects are also considered in the ANOVA test. The
The results demonstrate that the iterative method out- ANOVA model can be used to test the hypothesis that the
performed the non-iterative approach for most cases on all AUC for the main factors and/or for the interactions are
three groups of datasets. Table 5 shows the summary of the equal against the alternative hypothesis that at least one
comparisons between ‘Iteration’ and ‘No Iteration’ over all mean is different. If the alternative hypothesis is accepted,
the 18 filters for a given sampler and a given learner on multiple comparisons can be used to determine which of the
each group of datasets. The value in each cell represents means are significantly different from the others. Table 6
the number of better cases (p < 0.90) for ‘No Iteration’ shows the ANOVA results for each group of datasets. The
or ‘Iteration’ or equal cases (p ≥ 0.90). When RUS35 acts p-value is less than the cutoff 0.05 for Factor A (learner),
as the sampler, the iterative process performed better than B (ranker), and D (strategy) on each table, meaning that
the non-iterative process for 142 out of 216 cases (65.7 %) for these main factors, the alternate hypothesis is accepted,
overall, worse than the non-iterative process for 35 out of namely, at least two group means are significantly differ-
216 cases (16.2 %), and equal to the non-iterative process ent from each other. On the other hand, the p-value is
for the remaining 18.1 % of cases. When RUS50 was used greater than 0.05 for Factor C for the Eclipse 1 and Eclipse
as the sampler, the iterative process outperformed the non- 3 datasets, meaning that the two samplers (RUS35 and
iterative process for 85.6 % of cases. The phenomenon that RUS50) are not significantly different from each other for
the iterative feature selection technique is better than the these two groups of datasets. In addition, some p-values
non-iterative approach becomes more obvious when using of the pairwise interaction effects are greater than 0.05,
the 50:50 post-sampling proportion and when feature selec- while others are less than 0.05. The small p-value implies
tion is applied to more severely imbalanced datasets like that the pairwise interactions significantly affect classifica-
Eclipse 1 and Eclipse 2. tion performance in each group of datasets. In other words,
We also conducted a four-way ANalysis Of VAriance changing the value of one factor will significantly influence
(ANOVA) F test on the classification performance for each the value of the other factor, and vice verse.
group of the datasets separately to examine if the perfor- We further carried out a multiple comparison test on
mance difference (better/worse) is statistically significant or each main factor and the interactions (B×D and C×D) with
not. Four factors are designed as follows: Factor A repre- Tukey’s Honestly Significant Difference criterion. Since
sents four learners, Factor B represents 18 rankers, Factor this study is more interested in comparing the performances
C represents two samplers, and Factor D represents the two of the two different feature selection strategies, we only
Fig. 1 Feature selection strategies

810 Inf Syst Front (2014) 16:801–822
Table 2 Classification performance for Eclipse 1
Index Ranker RUS35 RUS50
Iterative? t-test Iterative? t-test

p p
no yes no yes
(a) NB
1 CS 0.8569 0.8606 0.647 0.8471 0.8542 0.440
2 GR 0.8406 0.8478 0.303 0.8460 0.8553 0.254
3 IG 0.8611 0.8636 0.780 0.8476 0.8589 0.242
4 RF 0.8518 0.8681 0.030 0.8488 0.8858 0.000
5 RFW 0.8067 0.8389 0.000 0.8126 0.8603 0.000
6 SU 0.8590 0.8612 0.774 0.8458 0.8610 0.099
7 FM 0.8491 0.8469 0.777 0.8526 0.8580 0.499
8 OR 0.8640 0.8659 0.807 0.8480 0.8619 0.154
9 Pow 0.8425 0.8338 0.394 0.8266 0.8353 0.433
10 PR 0.8432 0.8172 0.013 0.8243 0.8304 0.610
11 GI 0.8455 0.8398 0.497 0.8451 0.8697 0.018
12 MI 0.8503 0.8537 0.691 0.8397 0.8514 0.183
13 KS 0.8466 0.8496 0.711 0.8408 0.8492 0.271
14 Dev 0.8556 0.8541 0.851 0.8428 0.8521 0.277
15 GM 0.8409 0.8478 0.407 0.8422 0.8447 0.760
16 AUC 0.8684 0.8745 0.400 0.8600 0.8717 0.121
17 PRC 0.8616 0.8607 0.910 0.8546 0.8602 0.556
18 S2N 0.8805 0.8888 0.309 0.8707 0.8886 0.035
(b) MLP
1 CS 0.8604 0.8692 0.498 0.8609 0.8700 0.494
2 GR 0.8405 0.8569 0.156 0.8551 0.8782 0.035
3 IF 0.8691 0.8738 0.699 0.8594 0.8717 0.340
4 RF 0.8333 0.8263 0.522 0.8352 0.8432 0.369
5 RFW 0.8289 0.8426 0.181 0.8330 0.8604 0.001
6 SU 0.8630 0.8707 0.474 0.8562 0.8799 0.050
7 FM 0.8480 0.8560 0.605 0.8631 0.8609 0.877
8 OR 0.8668 0.8676 0.948 0.8631 0.8754 0.356
9 Pow 0.8427 0.8417 0.925 0.8331 0.8436 0.381
10 PR 0.8408 0.8221 0.122 0.8300 0.8343 0.709
11 GI 0.8489 0.8437 0.629 0.8584 0.8780 0.100
12 MI 0.8580 0.8577 0.985 0.8541 0.8626 0.556
13 KS 0.8532 0.8609 0.622 0.8467 0.8558 0.537
14 Dev 0.8602 0.8631 0.809 0.8625 0.8677 0.695
15 GM 0.8462 0.8569 0.536 0.8482 0.8543 0.664
16 AUC 0.8818 0.8823 0.950 0.8701 0.8819 0.208
17 PRC 0.8720 0.8738 0.865 0.8590 0.8752 0.137
18 S2N 0.8675 0.8691 0.825 0.8652 0.8625 0.715
(c) KNN
1 CS 0.8916 0.8967 0.545 0.8841 0.8919 0.416
2 GR 0.8731 0.8902 0.028 0.8751 0.8940 0.024
3 IF 0.8926 0.9001 0.328 0.8799 0.8966 0.078
4 RF 0.8701 0.8883 0.009 0.8771 0.8931 0.009
5 RFW 0.7918 0.8353 0.000 0.8199 0.8797 0.000
6 SU 0.8925 0.8991 0.387 0.8806 0.8982 0.035
Inf Syst Front (2014) 16:801–822 811
Table 2 (continued)

p p
no yes no yes
7 FM 0.8835 0.8848 0.892 0.8787 0.8887 0.279

8 OR 0.8883 0.9000 0.181 0.8807 0.8960 0.099
9 Pow 0.8659 0.8662 0.976 0.8542 0.8746 0.020
10 PR 0.8636 0.8524 0.178 0.8527 0.8678 0.102
11 GI 0.8681 0.8749 0.318 0.8786 0.9043 0.002
12 MI 0.8877 0.8903 0.781 0.8802 0.8880 0.386
13 KS 0.8826 0.8879 0.538 0.8717 0.8858 0.138
14 Dev 0.8890 0.8932 0.631 0.8810 0.8892 0.372
15 GM 0.8790 0.8838 0.586 0.8727 0.8822 0.279
16 AUC 0.9024 0.9093 0.130 0.8915 0.9061 0.013
17 PRC 0.8969 0.9032 0.421 0.8842 0.9029 0.020
18 S2N 0.8922 0.8934 0.809 0.8896 0.8918 0.625
(d) SVM
1 CS 0.9043 0.9061 0.825 0.8958 0.9038 0.448
2 GR 0.8882 0.8985 0.175 0.8946 0.9077 0.090
3 IG 0.9063 0.9085 0.788 0.8965 0.9076 0.198
4 RF 0.8721 0.8855 0.109 0.8768 0.8957 0.028
5 RFW 0.8468 0.8769 0.005 0.8569 0.8923 0.000
6 SU 0.9045 0.9087 0.570 0.8943 0.9098 0.086
7 FM 0.8953 0.9000 0.631 0.8998 0.9028 0.757
8 OR 0.9065 0.9055 0.894 0.8996 0.9065 0.465
9 Pow 0.8868 0.8824 0.613 0.8794 0.8852 0.492
10 PR 0.8847 0.8693 0.055 0.8752 0.8787 0.692
11 GI 0.8881 0.8883 0.983 0.8940 0.9146 0.013
12 MI 0.9000 0.9018 0.847 0.8928 0.9006 0.452
13 KS 0.8966 0.8983 0.871 0.8913 0.8990 0.455
14 Dev 0.9023 0.9022 0.996 0.8963 0.9020 0.565
15 GM 0.8931 0.8958 0.801 0.8912 0.8957 0.683
16 AUC 0.9118 0.9147 0.636 0.9075 0.9140 0.311
17 PRC 0.9048 0.9071 0.762 0.9012 0.9064 0.508
18 S2N 0.8985 0.8975 0.870 0.8957 0.8966 0.884
valuerepresents significantly better case (p ≤ 0.05)

value represents better case (0.05 < p < 0.90)
value represents equal case (p ≥ 0.90)
present the pairwise interactions that involve the two strate- different if their intervals are disjoint, and are not signifi-
gies (Factor D). Note that for all the ANOVA and multiple cantly different if their intervals overlap. The assumptions
comparison tests, the significance level was set to 0.05. for constructing ANOVA models were validated. From these
Figures 2, 3 and 4 show the multiple comparisons on every figures we can see the following points:
group of datasets, each with six subfigures representing the
result for Factors A, B, C, D, B×D, and C×D, respec- • Among the four learners, SVM always demon-
tively. The figures display graphs with each group mean strated significantly better performance than the other
represented by a symbol (◦) and 95 % confidence interval classifiers. MLP and KNN performed averagely. NB
as a line around the symbol. Two means are significantly always resulted in the worst performance.
812 Inf Syst Front (2014) 16:801–822

p p
no yes no yes
(a) NB
1 CS 0.8516 0.8552 0.640 0.8486 0.8533 0.557
2 GR 0.8314 0.8517 0.029 0.8367 0.8478 0.192
3 IG 0.8538 0.8547 0.906 0.8499 0.8526 0.743
4 RF 0.8496 0.8645 0.087 0.8518 0.8737 0.006
5 RFW 0.8399 0.8623 0.038 0.8351 0.8699 0.000
6 SU 0.8506 0.8540 0.653 0.8484 0.8534 0.543
7 FM 0.8460 0.8520 0.494 0.8469 0.8504 0.674
8 OR 0.8432 0.8494 0.478 0.8461 0.8526 0.449
9 Pow 0.8326 0.8432 0.347 0.8308 0.8402 0.382
10 PR 0.8317 0.8319 0.983 0.8287 0.8347 0.586
11 GI 0.8307 0.8362 0.636 0.8430 0.8550 0.168
12 MI 0.8506 0.8543 0.644 0.8445 0.8506 0.471
13 KS 0.8448 0.8506 0.505 0.8448 0.8462 0.876
14 Dev 0.8470 0.8517 0.589 0.8456 0.8511 0.511
15 GM 0.8436 0.8489 0.553 0.8422 0.8480 0.509
16 AUC 0.8530 0.8530 0.999 0.8475 0.8526 0.538
17 PRC 0.8532 0.8526 0.937 0.8479 0.8506 0.742
18 S2N 0.8786 0.8806 0.764 0.8769 0.8797 0.686
(b) MLP
1 CS 0.8867 0.8930 0.452 0.8865 0.8931 0.418
2 GR 0.8747 0.8836 0.320 0.8760 0.8873 0.192
3 IF 0.8909 0.8926 0.823 0.8857 0.8907 0.514
4 RF 0.8661 0.8712 0.604 0.8706 0.8741 0.697
5 RFW 0.8514 0.8699 0.106 0.8632 0.8764 0.103
6 SU 0.8890 0.8925 0.650 0.8889 0.8923 0.665
7 FM 0.8835 0.8852 0.848 0.8857 0.8867 0.912
8 OR 0.8821 0.8916 0.269 0.8912 0.8919 0.920
9 Pow 0.8711 0.8809 0.360 0.8714 0.8771 0.573
10 PR 0.8638 0.8619 0.882 0.8705 0.8725 0.861
11 GI 0.8631 0.8611 0.868 0.8869 0.8877 0.925
12 MI 0.8888 0.8927 0.604 0.8877 0.8876 0.989
13 KS 0.8845 0.8851 0.943 0.8825 0.8850 0.790
14 Dev 0.8817 0.8892 0.379 0.8852 0.8860 0.921
15 GM 0.8835 0.8835 0.995 0.8811 0.8846 0.703
16 AUC 0.8914 0.8932 0.820 0.8850 0.8955 0.201
17 PRC 0.8867 0.8901 0.669 0.8878 0.8939 0.441
18 S2N 0.8870 0.8837 0.723 0.8860 0.8807 0.590
(c) KNN
1 CS 0.8923 0.8929 0.925 0.8861 0.8906 0.495
2 GR 0.8687 0.8863 0.042 0.8747 0.8826 0.283
3 IF 0.8908 0.8916 0.896 0.8852 0.8890 0.569
4 RF 0.8354 0.8485 0.138 0.8569 0.8709 0.046
5 RFW 0.8159 0.8395 0.091 0.8393 0.8734 0.001
6 SU 0.8909 0.8912 0.964 0.8848 0.8886 0.589
Inf Syst Front (2014) 16:801–822 813
Table 3 (continued)

p p
no yes no yes
7 FM 0.8846 0.8873 0.711 0.8825 0.8886 0.412

8 OR 0.8825 0.8867 0.535 0.8806 0.8840 0.649
9 Pow 0.8707 0.8789 0.388 0.8680 0.8787 0.232
10 PR 0.8665 0.8642 0.841 0.8660 0.8726 0.470
11 GI 0.8675 0.8699 0.828 0.8846 0.8860 0.829
12 MI 0.8874 0.8891 0.808 0.8837 0.8874 0.613
13 KS 0.8849 0.8877 0.700 0.8830 0.8842 0.888
14 Dev 0.8866 0.8900 0.614 0.8862 0.8882 0.772
15 GM 0.8847 0.8854 0.930 0.8822 0.8861 0.624
16 AUC 0.8924 0.8929 0.935 0.8898 0.8926 0.669
17 PRC 0.8926 0.8941 0.798 0.8876 0.8929 0.438
18 S2N 0.8837 0.8870 0.507 0.8857 0.8840 0.756
(d) SVM
1 CS 0.9176 0.9207 0.511 0.9185 0.9206 0.647
2 GR 0.9051 0.9177 0.032 0.9077 0.9172 0.058
3 IG 0.9191 0.9201 0.820 0.9185 0.9210 0.580
4 RF 0.8890 0.9042 0.096 0.9024 0.9073 0.400
5 RFW 0.8757 0.8997 0.057 0.8818 0.9043 0.011
6 SU 0.9199 0.9198 0.990 0.9197 0.9201 0.919
7 FM 0.9135 0.9177 0.399 0.9165 0.9176 0.830
8 OR 0.9169 0.9181 0.819 0.9199 0.9209 0.827
9 Pow 0.9065 0.9116 0.488 0.9034 0.9115 0.255
10 PR 0.9010 0.8999 0.904 0.9013 0.9040 0.741
11 GI 0.9001 0.9015 0.875 0.9169 0.9201 0.515
12 MI 0.9177 0.9192 0.758 0.9161 0.9175 0.789
13 KS 0.9137 0.9179 0.404 0.9145 0.9154 0.858
14 Dev 0.9139 0.9187 0.363 0.9158 0.9173 0.770
15 GM 0.9127 0.9153 0.617 0.9139 0.9151 0.826
16 AUC 0.9181 0.9198 0.714 0.9164 0.9204 0.395
17 PRC 0.9199 0.9200 0.982 0.9182 0.9194 0.791
18 S2N 0.9125 0.9102 0.694 0.9118 0.9099 0.734
• Among the 18 filter-based feature ranking techniques, showed very similar behavior for Eclipse 1 and Eclipse
AUC, PRC, and S2N performed either significantly bet- 3. This is consistent with the results obtained from
ter or better than most other filters. CS, IG, SU, FM, ANOVA.
OR, MI, KS, and Dev demonstrated similar perfor- • Between the two strategies, ‘Iteration’ always showed
mance and also showed relatively better classification significantly better behavior than ‘No Iteration.’ This
behavior than the other filters. Some filters, such as is especially evident for more severely imbalanced data
RFW and PR, always displayed low performance, while sets (Eclipse 1 and Eclipse 2).
other filters like Pow presented inconsistent perfor- • There are 36 groups for interaction B×D; these are
mance with respect to different groups of datasets. formed by each of 18 feature ranking techniques being
• Between the two samplers, RUS50 performed signif- combined with two feature selection strategies. The
icantly better than RUS35 for Eclipse 2, but they group means demonstrate that for Eclipse 1 and Eclipse
814 Inf Syst Front (2014) 16:801–822

p p
no yes no yes
(a) NB
1 CS 0.7887 0.7867 0.771 0.7830 0.7857 0.711
2 GR 0.7714 0.7715 0.987 0.7668 0.7753 0.338
3 IF 0.7913 0.7876 0.610 0.7842 0.7872 0.671
4 RF 0.8015 0.8046 0.579 0.8032 0.8079 0.529
5 RFW 0.8150 0.8140 0.893 0.8044 0.8065 0.788
6 SU 0.7844 0.7853 0.897 0.7801 0.7844 0.566
7 FM 0.7815 0.7842 0.707 0.7783 0.7806 0.755
8 OR 0.7915 0.7893 0.766 0.7809 0.7853 0.556
9 Pow 0.7895 0.7894 0.998 0.7858 0.7884 0.746
10 PR 0.7718 0.7694 0.844 0.7706 0.7687 0.878
11 GI 0.7701 0.7695 0.962 0.7715 0.7746 0.733
12 MI 0.7809 0.7844 0.624 0.7811 0.7818 0.926
13 KS 0.7824 0.7855 0.644 0.7823 0.7834 0.874
14 Dev 0.7822 0.7859 0.615 0.7820 0.7817 0.965
15 GM 0.7837 0.7859 0.725 0.7832 0.7844 0.854
16 AUC 0.7862 0.7860 0.977 0.7830 0.7856 0.713
17 PRC 0.7857 0.7865 0.908 0.7834 0.7858 0.720
18 S2N 0.8241 0.8289 0.631 0.8228 0.8282 0.597
(b) MLP
1 CS 0.8547 0.8558 0.899 0.8568 0.8539 0.728
2 GR 0.8292 0.8338 0.725 0.8326 0.8362 0.736
3 IF 0.8560 0.8551 0.912 0.8584 0.8568 0.842
4 RF 0.8343 0.8335 0.920 0.8467 0.8488 0.813
5 RFW 0.8472 0.8459 0.877 0.8449 0.8511 0.504
6 SU 0.8576 0.8543 0.663 0.8537 0.8523 0.862
7 FM 0.8531 0.8566 0.682 0.8556 0.8545 0.902
8 OR 0.8564 0.8551 0.860 0.8476 0.8477 0.988
9 Pow 0.8542 0.8537 0.959 0.8497 0.8539 0.659
10 PR 0.8312 0.8226 0.459 0.8295 0.8272 0.838
11 GI 0.8296 0.8220 0.524 0.8369 0.8266 0.308
12 MI 0.8513 0.8545 0.717 0.8578 0.8495 0.364
13 KS 0.8524 0.8538 0.873 0.8509 0.8516 0.936
14 Dev 0.8506 0.8541 0.694 0.8541 0.8520 0.822
15 GM 0.8529 0.8537 0.924 0.8505 0.8559 0.562
16 AUC 0.8579 0.8596 0.812 0.8551 0.8576 0.746
17 PRC 0.8552 0.8569 0.841 0.8534 0.8564 0.735
18 S2N 0.8629 0.8658 0.751 0.8608 0.8657 0.575
(c) KNN
1 CS 0.8569 0.8614 0.598 0.8531 0.8562 0.732
2 GR 0.8403 0.8389 0.897 0.8122 0.8126 0.971
3 IF 0.8574 0.8601 0.742 0.8547 0.8558 0.905
4 RF 0.7684 0.7718 0.469 0.7896 0.8058 0.012
5 RFW 0.7761 0.7871 0.102 0.7882 0.8070 0.004
6 SU 0.8539 0.8561 0.781 0.8517 0.8579 0.484
Inf Syst Front (2014) 16:801–822 815
Table 4 (continued)

p p
no yes no yes
7 FM 0.8510 0.8561 0.585 0.8497 0.8562 0.499

8 OR 0.8568 0.8580 0.863 0.8453 0.8509 0.488
9 Pow 0.8565 0.8588 0.794 0.8504 0.8585 0.351
10 PR 0.8238 0.8207 0.824 0.8218 0.8231 0.921
11 GI 0.8257 0.8213 0.745 0.8265 0.8250 0.880
12 MI 0.8486 0.8564 0.403 0.8505 0.8528 0.825
13 KS 0.8509 0.8584 0.422 0.8505 0.8544 0.693
14 Dev 0.8488 0.8557 0.461 0.8501 0.8533 0.750
15 GM 0.8499 0.8589 0.351 0.8497 0.8549 0.605
16 AUC 0.8591 0.8586 0.952 0.8567 0.8579 0.896
17 PRC 0.8605 0.8612 0.934 0.8556 0.8573 0.851
18 S2N 0.8436 0.8432 0.966 0.8403 0.8419 0.848
(d) SVM
1 CS 0.8861 0.8852 0.853 0.8847 0.8852 0.922
2 GR 0.8702 0.8725 0.808 0.8650 0.8714 0.449
3 IF 0.8870 0.8860 0.819 0.8849 0.8862 0.804
4 RF 0.8598 0.8605 0.904 0.8706 0.8735 0.667
5 RFW 0.8732 0.8730 0.977 0.8728 0.8751 0.745
6 SU 0.8849 0.8857 0.871 0.8821 0.8858 0.495
7 FM 0.8834 0.8856 0.687 0.8824 0.8848 0.687
8 OR 0.8865 0.8861 0.923 0.8789 0.8820 0.547
9 Pow 0.8845 0.8848 0.956 0.8808 0.8848 0.457
10 PR 0.8619 0.8569 0.620 0.8615 0.8598 0.858
11 GI 0.8611 0.8567 0.667 0.8661 0.8624 0.624
12 MI 0.8825 0.8851 0.648 0.8815 0.8827 0.852
13 KS 0.8833 0.8852 0.724 0.8818 0.8835 0.784
14 Dev 0.8819 0.8840 0.717 0.8818 0.8832 0.816
15 GM 0.8841 0.8856 0.769 0.8814 0.8842 0.636
16 AUC 0.8865 0.8867 0.950 0.8846 0.8864 0.715
17 PRC 0.8852 0.8850 0.981 0.8834 0.8853 0.736
18 S2N 0.8835 0.8854 0.766 0.8838 0.8860 0.721
2, in addition to the two main factors, the classifi- 2, RUS50 performed better than RUS35 for both strate-
cation performance was greatly influenced by their gies and ‘Iteration’ performed better than ‘No Iteration’
interactions. For example, for Eclipse 1, PR signifi- for both post-sampling cost ratios. This is also reflected
cantly outperformed RFW for ‘No Iteration,’ but this by the output of ANOVA, where p is 0.648. For Eclipse
pattern was not observed when the ‘Iteration’ strat- 1, RUS50 along with ‘No Iteration’ showed the worst
egy was adopted (see Fig. 2). For Eclipse 3, the performance among four combinations. If we rank
interaction factor did not play a significant role on the the remaining three in terms of their classification
performance. performance from worst to best, they are RUS35-NI
• For the interaction effect between sampler and strategy, (RUS35 along with ‘No Iteration’), RUS35-I (RUS35
there are four different combinations (groups), each of along with ‘Iteration’), and RUS50-I. The interac-
two sampling approaches along with two feature selection factor for Eclipse 3 showed similar behavior to
tion strategies. The results demonstrate that for Eclipse Eclipse 1.
816 Inf Syst Front (2014) 16:801–822
Table 5 Performance
comparison between iteration Eclipse data Learner RUS35 RUS50
and no iteration
No Iteration Iteration equal No Iteration Iteration equal
1 NB 5 12 1 0 18 0
27.8 % 66.7 % 5.6 % 0.0 % 100.0 % 0.0 %
MLP 3 11 4 2 16 0
16.7 % 61.1 % 22.2 % 11.1 % 88.9 % 0.0 %
KNN 1 16 1 0 18 0
5.6 % 88.9 % 5.6 % 0.0 % 100.0 % 0.0 %
SVM 4 12 2 0 18 0
22.2 % 66.7 % 11.1 % 0.0 % 100.0 % 0.0 %
Total 13 51 8 2 70 0
18.1 % 70.8 % 11.1 % 2.8 % 97.2 % 0.0 %
2 NB 0 14 4 0 18 0
0.0 % 77.8 % 22.2 % 0.0 % 100.0 % 0.0 %
MLP 3 13 2 1 12 5
16.7 % 72.2 % 11.1 % 5.6 % 66.7 % 27.8 %
KNN 1 13 4 1 17 0
5.6 % 72.2 % 22.2 % 5.6 % 94.4 % 0.0 %
SVM 1 14 3 1 16 1
5.6 % 77.8 % 16.7 % 5.6 % 88.9 % 5.6 %
Total 5 54 13 3 63 6
6.9 % 75.0 % 18.1 % 4.2 % 87.5 % 8.3 %
3 NB 5 8 5 1 15 2
27.8 % 44.4 % 27.8 % 5.6 % 83.3 % 11.1 %
MLP 5 9 4 7 8 3
27.8 % 50.0 % 22.2 % 38.9 % 44.4 % 16.7 %
KNN 3 12 3 1 14 3
16.7 % 66.7 % 16.7 % 5.6 % 77.8 % 16.7 %
SVM 4 8 6 2 15 1
22.2 % 44.4 % 33.3 % 11.1 % 83.3 % 5.6 %
Total 17 37 18 11 52 9
23.6 % 51.4 % 25.0 % 15.3 % 72.2 % 12.5 %
Total 35 142 39 16 185 15
16.2 % 65.7 % 18.1 % 7.4 % 85.6 % 6.9 %
5 Threats to validity by students; and (4) developed in an industry/government

organization setting, and not in a laboratory.
A typical software development project is very human The software systems that are used in our case study
intensive, which can affect many areas of the development were developed by professionals in a large software devel-
process, including software quality and defect occurrence. opment organization using an established software develop-
Therefore, software engineering research that utilizes con- ment process and management practices. The software was
trolled experiments for evaluating the usefulness of empir- developed to address real-world problems. We note that our
ical models is not practical. The case study presented is an case study fulfills all of the above criteria specified by the
empirical software engineering effort, for which the soft- software engineering community. The rest of this section
ware engineering community demands that its subject has discusses threats to external validity and threats to internal
the following characteristics (Wohlin et al. 2012; Votta and validity.
Porter 1995): (1) developed by a group, and not by an Threats to external validity in an empirical software engi-
individual; (2) be as large as industry size projects, and neering effort are conditions that limit generalization of case
not a toy problem; (3) developed by professionals, and not study results. The analysis and conclusion presented in this
Inf Syst Front (2014) 16:801–822 817
a b
CS
GR
IG
NB RF
RFW
SU
FM
MLP OR
Pow
PR
KNN GI
MI
KS
Dev
SVM GM
AUC
PRC
S2N
0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91
Factor A - learner Factor B - ranker

c d
RUS35 No Iteration
RUS50 Iteration
0.8705 0.871 0.8715 0.872 0.8725 0.873 0.864 0.866 0.868 0.87 0.872 0.874 0.876 0.878 0.88
Factor C - sampler Factor D - strategy

e f
CS−NI
GR−NI
IG−NI
RF−NI
RFW−NI
SU−NI
FM−NI
OR−NI RUS35−NI
Pow−NI
PR−NI
GI−NI
MI−NI
KS−NI
Dev−NI
GM−NI RUS50−NI
AUC−NI
PRC−NI
S2N−NI
CS−I
GR−I
IG−I
RF−I RUS35−I
RFW−I
SU−I
FM−I
OR−I
Pow−I
PR−I
GI−I
MI−I RUS50−I
KS−I
Dev−I
GM−I
AUC−I
PRC−I
S2N−I
0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.864 0.866 0.868 0.87 0.872 0.874 0.876 0.878 0.88 0.882
Factor B D Factor C D
Fig. 2 Eclipse1: multiple comparison
818 Inf Syst Front (2014) 16:801–822
a b
CS
GR
IG
NB RF
RFW
SU
FM
MLP OR
Pow
PR
KNN GI
MI
KS
Dev
SVM GM
AUC
PRC
S2N
0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.85 0.86 0.87 0.88 0.89 0.9

c d
RUS35 No Iteration
RUS50 Iteration
0.8795 0.88 0.8805 0.881 0.8815 0.882 0.8825 0.883 0.876 0.878 0.88 0.882 0.884 0.886 0.888
Factor C - sampler Factor D - strategy

e f
CS−NI
GR−NI
IG−NI
RF−NI
RFW−NI
SU−NI
FM−NI
OR−NI RUS35−NI
Pow−NI
PR−NI
GI−NI
MI−NI
KS−NI
Dev−NI
GM−NI RUS50−NI
AUC−NI
PRC−NI
S2N−NI
CS−I
GR−I
IG−I
RF−I RUS35−I
RFW−I
SU−I
FM−I
OR−I
Pow−I
PR−I
GI−I
MI−I RUS50−I
KS−I
Dev−I
GM−I
AUC−I
PRC−I
S2N−I
0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.876 0.878 0.88 0.882 0.884 0.886 0.888
Inf Syst Front (2014) 16:801–822 819
a b
CS
GR
IG
NB RF
RFW
SU
FM
MLP OR
Pow
PR
KNN GI
MI
KS
Dev
SVM GM
AUC
PRC
S2N
0.78 0.8 0.82 0.84 0.86 0.88 0.9 0.81 0.82 0.83 0.84 0.85 0.86 0.87

c d
RUS35 No Iteration
RUS50 Iteration
0.838 0.8385 0.839 0.8395 0.84 0.8405 0.8375 0.838 0.8385 0.839 0.8395 0.84 0.8405 0.841
Factor C - sampler Factor D - strat egy

e f
CS−NI
GR−NI
IG−NI
RF−NI
RFW−NI
SU−NI
FM−NI RUS35−NI
OR−NI
Pow−NI
PR−NI
GI−NI
MI−NI
KS−NI
Dev−NI
GM−NI RUS50−NI
AUC−NI
PRC−NI
S2N−NI
CS−I
GR−I
IG−I
RF−I RUS35−I
RFW−I
SU−I
FM−I
OR−I
Pow−I
PR−I
GI−I RUS50−I
MI−I
KS−I
Dev−I
GM−I
AUC−I
PRC−I
S2N−I
0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.836 0.837 0.838 0.839 0.84 0.841 0.842 0.843
820 Inf Syst Front (2014) 16:801–822
Table 6 Four-way ANOVA

for eclipse datasets Source Sum Sq. d.f. Mean Sq. F p-value
(a) Eclipse 1
A (Learner) 2.7831 3 0.9277 691.66 0.000
B (Ranker) 1.2479 17 0.0734 54.73 0.000
C (Sampler) 0.0004 1 0.0004 0.31 0.576
D (Strategy) 0.1579 1 0.1579 117.76 0.000
A×B 0.2551 51 0.0050 3.73 0.000
A×C 0.0046 3 0.0015 1.13 0.333
A×D 0.0102 3 0.0034 2.53 0.055
B×C 0.1264 17 0.0074 5.54 0.000
B×D 0.1403 17 0.0083 6.15 0.000
C×D 0.0423 1 0.0423 31.55 0.000
Error 11.4342 8525 0.0013
Total 16.2024 8639
(b) Eclipse 2
A (Learner) 4.3408 3 1.4469 1537.50 0.000
B (Ranker) 0.5406 17 0.0318 33.79 0.000
C (Sampler) 0.0060 1 0.0060 6.41 0.011
D (Strategy) 0.0589 1 0.0589 62.63 0.000
A×B 0.3166 51 0.0062 6.60 0.000
A×C 0.0022 3 0.0007 0.77 0.509
A×D 0.0038 3 0.0013 1.35 0.256
B×C 0.0694 17 0.0041 4.34 0.000
B×D 0.0684 17 0.0040 4.27 0.000
C×D 0.0002 1 0.0002 0.21 0.648
Error 8.0228 8525 0.0009
Total 13.4297 8639
(c) Eclipse 3
A (Learner) 9.5467 3 3.1822 3162.27 0.000
B (Ranker) 0.8582 17 0.0505 50.16 0.000
C (Sampler) 0.0009 1 0.0009 0.91 0.340
D (Strategy) 0.0062 1 0.0062 6.12 0.013
A×B 0.9947 51 0.0195 19.38 0.000
A×C 0.0020 3 0.0007 0.67 0.570
A×D 0.0042 3 0.0014 1.39 0.245
B×C 0.0415 17 0.0024 2.43 0.001
B×D 0.0096 17 0.0006 0.56 0.923
C×D 0.0010 1 0.0010 1.02 0.312
Error 8.5788 8525 0.0010
Total 20.0438 8639
paper are based upon the metrics and defect data obtained process of developing a useful defect predictor in the pres-
from nine datasets of a software project with three sep- ence of the problem of class imbalance and/or when there
arate releases. The use of three groups of datasets, each are a large number of software metrics to work with. The
with a different class ratio strengthens the generalization of proposed strategy of employing an iterative feature selection
our empirical results. The same (as our approach) analy- process of data sampling followed by feature ranking which
sis for another software system, especially from a different finally aggregates the results generated during the iterative
application domain, may provide different results – a likely process can be extended to any software system, especially
threat in all empirical software engineering research. How- high-assurance software and/or when there is a large col-
ever, a software quality practitioner would appreciate the lection of software metrics for building defect predictors.
Inf Syst Front (2014) 16:801–822 821
Moreover, as all our final conclusions are based on ten runs feature ranking techniques and two levels of post-sampling
of five-fold cross validation and statistical tests for signifi- class ratio (35:65 and 50:50) along with three groups of
cance, our findings are grounded in using sound methods. software-quality datasets, each with three separate releases.
Finally, it is observed that the type of classifier used for We build our models using four different classifiers (Naı̈ve
software quality prediction greatly affected the prediction Bayes, Multilayer Perceptron, k-Nearest Neighbor, and Sup-
results for the case study. Hence, software quality practi- port Vector Machine). We also compare the results from the
tioners should not rule out considering other classification iterative approach with a non-iterative approach using only
algorithms when applying the proposed approach. Also, the a single run of undersampling followed by feature selec-
selection of specific parameter settings for a given classifier tion, to demonstrate the importance of the iteration. All
is likely to cause the classifier to analyze the training data feature selection chooses log2 n software metrics, where
differently. n = 208 is the number found in the original dataset, and
Threats to internal validity for a software engineering thus 8 features are selected.
experiment are unaccounted for influences that may affect Our results demonstrate that the iterative approach gives
case study results. In the context of this study, poor fault- greater performance than the non-iterative approach, espe-
proneness estimates can be caused by a wide variety of cially when using 50:50 as the post-sampling class ratio and
factors, including measurement errors while collecting and especially on the most imbalanced datasets. We also find
recording software metrics, modeling errors due to the that the SVM learner always produces better classification
unskilled use of software applications, errors in model- results than the other learners, and that the AUC (Area
selection during the modeling process, and the presence Under the ROC Curve), PRC (Area Under the Precision-
of outliers and noise in the training dataset. Measurement Recall Curve), and S2N (Signal-to-Noise) rankers produce
errors are inherent to the data collection effort of a given better classification results than other rankers. To vali-
software project. In our study, a common model-building date these results, we use ANOVA analysis and Tukey’s
and model-evaluation approach that is used for all combina- Honestly Significant Difference criterion for multiple com-
tions of data sampling technique, feature selection method, parisons, and we find that the iterative vs. non-iterative and
and classifier has been adopted. Moreover, the experiments SVM vs. other classifiers results are statistically significant
and statistical analysis were performed by only one skilled at a 95 % confidence level. This gives us the confidence to
person in order to keep modeling errors to a minimum. say that the iterative approach’s resilience to random varia-
A software engineering domain expert is consulted to tions in the feature subset does lead to higher classification
set the number of software metrics to select from the orig- performance.
inal set after ranking the software metrics. This number Future work will consider additional datasets both from
may be different for another project. However, based on within the domain of software quality prediction and from
our extensive prior work in software quality estimation other domains. In addition, sampling techniques other than
and quantitative software engineering, we are confident the Random Undersampling will be considered.
selection of log2 n metrics in the attribute subset is large
enough to capture the quality-based characteristics of most
software projects. We have found very often that only a References
handful of project-specific metrics are relevant for defect
prediction, and that in many cases, very few metrics (among Boetticher, G., Menzies, T., Ostrand, T. (2007). Promise reposi-
tory of empirical software engineering data. [Online]. Available:
the available set) are selected by the learner in the final http://promisedata.org/.
software quality prediction model. Breiman, L., Friedman, J., Olshen, R., Stone, C. (1984). Classification
and regression trees. Boca Raton: Chapman and Hall/CRC Press.
Chen, Z., Menzies, T., Port, D., Boehm, B. (2005). Finding the right
data for software cost modeling. IEEE Software, 22(6), 38–46.
6 Conclusion Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support
vector machines and other kernel-based learning methods, 2nd
Two major challenges which are found in a wide range of edn. Cambridge: Cambridge University Press.
data mining problems (such as software quality prediction) Forman, G. (2003). An extensive empirical study of feature selec-
tion metrics for text classification. Journal of Machine Learning
are high dimensionality and class imbalance. In this study, Research, 3, 1289–1305.
we propose an iterative feature ranking approach which Gao, K., Khoshgoftaar, T.M., Seliya, N. (2012). Predicting high-risk
addresses both of these problems, by applying Random program modules by selecting the right software measurements.
Undersampling multiple times and using feature selection Software Quality Journal, 20(1), 3–42.
Goh, L., Song, Q., Kasabov, N. (2004). A novel feature selection
to create separate ranked lists from each iteration, followed method to improve classification of gene expression data. In Pro-
by rank aggregation to produce one final feature list. To ceedings of the second conference on Asia-Pacific bioinformatics
demonstrate and test this technique, we use 18 filter-based (pp. 161–166). Dunedin.
822 Inf Syst Front (2014) 16:801–822
Gonzalez, R.C., & Woods, R.E. (2008). Digital image processing, 3rd Song, Q., Jia, Z., Shepperd, M., Ying, S., Liu, J. (2011). A general soft-
edn. New Jersey: Prentice Hall. ware defect-proneness prediction framework. IEEE Transactions
Haykin, S. (1999). Neural networks: a comprehensive foundation, 2nd On Software Engineering, 37(3), 356–370.
edn. New Jersey: Prentice Hall Interational, Inc. Souza, J., Japkowicz, N., Matwin, S. (2005). Stochfs: a framework for
Jeffery, I.B., Higgins, D.G., Culhane, A.C. (2006). Comparison and combining feature selection outcomes through a stochastic pro-
evaluation of methods for generating differentially expressed gene cess. In Knowledge discovery in databases: PKDD 2005 (Vol.
lists from microarray data. BMC Bioinformatics, 7(359). 3721, pp. 667–674).
Jiang, Y., Lin, J., Cukic, B., Menzies, T. (2009). Variance analysis in Votta, L.G., & Porter, A.A. (1995). Experimental software engineer-
software fault prediction models. In Proceedings of the 20th IEEE ing: A report on the state of the art. In Proceedings of the 17th.
international symposium on software reliability engineering (pp. International conference on software engineering (pp. 277–279).
99–108). Bangalore-Mysore. Seattle: IEEE Computer Society.
Jong, K., Marchiori, E., Sebag, M., van der Vaart, A. (2004). Feature Witten, I.H., Frank, E., Hall, M.A. (2011). Data mining: practi-
selection in proteomic pattern data with support vector machines. cal machine learning tools and techniques, 3rd edn. Burlington:
In Proceedings of the 2004 IEEE symposium on computational Morgan Kaufmann.
intelligence in bioinformatics and computational biology. Wohlin, C., Runeson, P., Host, M., Ohlsson, M.C., Regnell, B.,
Kamal, A.H., Zhu, X., Pandya, A.S., Hsu, S., Shoaib, M. (2009). Wesslen, A. (2012). Experimentation in software engineering.
The impact of gene selection on imbalanced microarray expres- Heidelberg/New York: Springer.
sion data. In Proceedings of the 1st international conference Zimmermann, T., Premraj, R., Zeller, A. (2007). Predicting defects for
on bioinformatics and computational biology; lecture notes in eclipse. In Proceedings of the 29th international conference on
bioinformatics (Vol. 5462, pp. 259–269). New Orleans. software engineering workshops (p. 76). Washington, DC: IEEE
Khoshgoftaar, T.M., & Gao, K. (2010). A novel software metric selec- Computer Society.
tion technique using the area under roc curves. In Proceedings
of the 22nd international conference on software engineering and
knowledge engineering (pp. 203–208). San Francisco.
Khoshgoftaar, T.M., Golawala, M., Van Hulse, J. (2007). An empirical Taghi M. Khoshgoftaar, is a professor of the Department of Com-
study of learning from imbalanced data using random forest. In puter and Electrical Engineering and Computer Science, Florida
Proceedings of the 19th IEEE international conference on tools Atlantic University and the Director of the Data Mining and Machine
with artificial intelligence (Vol. 2, pp. 310–317). Washington, DC. Learning Laboratory, and Empirical Software Engineering Labora-
Khoshgoftaar, T.M., Gao, K., Bullard, L.A. (2012a). A comparative tory. His research interests are in big data analytics, data mining and
study of filter-based and wrapper-based feature ranking techniques machine learning, health informatics and bioinformatics, and software
for software quality modeling. International Journal of Reliability, engineering. He has published more than 500 refereed journal and
Quality and Safety Engineering, 18(4), 341–364. conference papers in these areas. He was the conference chair of the
Khoshgoftaar, T.M., Gao, K., Napolitano, A. (2012b). Exploring an IEEE International Conference on Machine Learning and Applications
iterative feature selection technique for highly imbalanced data (ICMLA 2012). He is the workshop chair of the IEEE IRI Health Infor-
sets. In Information Reuse and Integration (IRI), 2012 IEEE 13th matics workshop (2013). Also, he is the Editor-in Chief of the Big Data
international conference on (pp. 101–108). journal.
Kira, K., & Rendell, L.A. (1992). A practical approach to fea-
ture selection. In Proceedings of 9th international workshop on
machine learning (pp. 249–256). Kehan Gao, is an associate professor at the Department of Mathe-
Lessmann, S., Baesens, B., Mues, C., Pietsch, S. (2008). Bench- matics and Computer Science, Eastern Connecticut State University.
marking classification models for software defect prediction: a Her research interests include software engineering, software metrics,
proposed framework and novel findings. IEEE Transactions on software reliability and quality engineering, computer performance
Software Engineering, 34(4), 485–496. modeling, computational intelligence, and data mining and machine
Liu, T.-Y. (2009). Easyensemble and feature selection for imbalance Learning. She is a member of the IEEE, IEEE Computer Society, and
data sets. In Proceedings of the 2009 internationalc joint confer- IEEE Reliability Society.
ence on bioinformatics, systems biology and intelligent computing
(pp. 517–520). Washington, DC: IEEE Computer Society.
Liu, H., Motoda, H., Setiono, R., Zhao, Z. (2010). Feature selection: an
ever evolving frontier in data mining. In Proceedings of the fourth Amri Napolitano, is a research associate at the Department of Com-
international workshop on feature selection in data mining (pp. puter and Electrical Engineering and Computer Science (CEECS),
4–13). Hyderabad. Florida Atlantic University He received the Ph.D. and M.S. degrees
Menzies, T., Greenwald, J., Frank, A. (2007). Data mining static in Computer Science from Florida Atlantic University in 2009 and
code attributes to learn defect predictors. IEEE Transactions on 2006 respectively and the B.S. degree in Computer and Information
Software Engineering, 33(1), 2–13. Science from the University of Florida in 2004. His research interests
Mishra, D., & Sahu, B. (2011). Feature selection for cancer classifi- include data mining and machine learning, evolutionary computation,
cation: a signal-to-noise ratio approach. International Journal of and artificial intelligence.
Scientific & Engineering Research, 2(4).
Rodriguez, D., Ruiz, R., Cuadrado-Gallego, J., Aguilar-Ruiz, J.
(2007). Detecting fault modules applying feature selection to clas-
sifiers. In Proceedings of 8th IEEE international conference on Randall Wald, is a research associate at the Department of Computer
information reuse and integration (pp. 667–672). Las Vegas. and Electrical Engineering and Computer Science (CEECS), Florida
Seiffert, C., Khoshgoftaar, T.M., Van Hulse, J., Napolitano, A. (2010). Atlantic University. He studies different challenges in data mining and
Rusboost: a hybrid approach to alleviate class imbalance. IEEE machine learning as they apply to data from many different application
Transactions on Systems, Man & Cybernetics: Part A: Systems and domains, including machine condition monitoring, bioinformatics, and
Humans, 40(1), 185–197. social network mining.

Art:10.1007/s10796 013 9430 0 PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Art:10.1007/s10796 013 9430 0 PDF

Uploaded by

Copyright:

Available Formats

Inf Syst Front (2014) 16:801–822

A comparative study of iterative and non-iterative feature

Published online: 27 April 2013

divided into wrappers, filters, and embedded categories (Liu 3 Methodology

Algorithm 2: Iterative Feature Selection Algorithm

Table 1 Data characteristics

Eclipse 1 2.0 10 209 377 23 6.1 354 93.9

Fig. 1 Feature selection strategies

Table 2 Classification performance for Eclipse 1

Index Ranker RUS35 RUS50

Iterative? t-test Iterative? t-test

Index Ranker RUS35 RUS50

Iterative? t-test Iterative? t-test

7 FM 0.8835 0.8848 0.892 0.8787 0.8887 0.279

valuerepresents significantly better case (p ≤ 0.05)

Table 3 Classification performance for Eclipse 2

Index Ranker RUS35 RUS50

Iterative? t-test Iterative? t-test

Index Ranker RUS35 RUS50

Iterative? t-test Iterative? t-test

7 FM 0.8846 0.8873 0.711 0.8825 0.8886 0.412

Table 4 Classification performance for Eclipse 3

Index Ranker RUS35 RUS50

Iterative? t-test Iterative? t-test

Index Ranker RUS35 RUS50

Iterative? t-test Iterative? t-test

7 FM 0.8510 0.8561 0.585 0.8497 0.8562 0.499

5 Threats to validity by students; and (4) developed in an industry/government

Factor A - learner Factor B - ranker

Factor C - sampler Factor D - strategy

Factor A - learner Factor B - ranker

Factor C - sampler Factor D - strategy

Factor A - learner Factor B - ranker

Factor C - sampler Factor D - strat egy

Table 6 Four-way ANOVA

You might also like