
Revisiting Unsupervised Learning for Defect Prediction

Wei Fu, Tim Menzies


Computer Science, NC State University, USA
wfu@ncsu.edu, tim.menzies@gmail.com

ABSTRACT

Collecting quality data from software projects can be time-consuming and expensive. Hence, some researchers explore "unsupervised" approaches to quality prediction that do not require labelled data. An alternative technique is to use "supervised" approaches that learn models from project data labelled with, say, "defective" or "not-defective". Most researchers use these supervised models since, it is argued, they can exploit more knowledge of the projects.

At FSE'16, Yang et al. reported startling results in which unsupervised defect predictors outperformed supervised predictors for effort-aware just-in-time defect prediction. If confirmed, these results would lead to a dramatic simplification of a seemingly complex task (data mining) that is widely explored in the software engineering literature.

This paper repeats and refutes those results as follows. (1) There is much variability in the efficacy of the Yang et al. predictors, so even with their approach, some supervised data is required to prune weaker predictors away. (2) Their findings were grouped across N projects. When we repeat their analysis on a project-by-project basis, supervised predictors are seen to work better.

Even though this paper rejects the specific conclusions of Yang et al., we still endorse their general goal. In our experiments, supervised predictors did not perform outstandingly better than unsupervised ones for effort-aware just-in-time defect prediction. Hence, there may indeed be some combination of unsupervised learners that achieves performance comparable to supervised ones. We therefore encourage others to work in this promising area.

KEYWORDS
Data analytics for software engineering, software repository mining, empirical studies, defect prediction

ACM Reference format:
Wei Fu, Tim Menzies. 2017. Revisiting Unsupervised Learning for Defect Prediction. In Proceedings of the 2017 11th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Paderborn, Germany, September 4-8, 2017 (ESEC/FSE'17), 12 pages.
DOI: 10.1145/3106237.3106257
1 INTRODUCTION

This paper repeats and refutes recent results from Yang et al. [54] published at FSE'16. The task explored by Yang et al. was effort-aware just-in-time (JIT) software defect prediction. JIT defect predictors are built at the code-change level and can be used to conduct defect prediction right before developers commit the current change. Yang et al. report an unsupervised software quality prediction method that achieved better results than standard supervised methods. We repeated their study since, if their results were confirmed, this would imply that decades of research into defect prediction [7, 8, 10, 13, 16, 18, 19, 21, 25, 26, 29, 30, 35, 40, 41, 43, 51, 54] had needlessly complicated an inherently simple task.

The standard method for software defect prediction is learning from labelled data. In this approach, the historical log of known defects is learned by a data miner. Note that this approach requires waiting until a historical log of defects is available; i.e. until after the code has been used for a while. Another approach, explored by Yang et al., uses general background knowledge to sort the code, then inspects the code in that sorted order. In their study, they assumed that more defects can be found faster by first looking over all the "smaller" modules (an idea initially proposed by Koru et al. [28]). After exploring various ways of defining "smaller", they report that their approach finds more defects, sooner, than supervised methods. These results are highly remarkable:

• This approach does not require access to labelled data; i.e. it can be applied as soon as the code is written.
• It is extremely simple: no data pre-processing, no data mining, just simple sorting.

Because of the remarkable nature of these results, this paper takes a second look at the Yang et al. results. We ask three questions:

RQ1: Do all unsupervised predictors perform better than supervised predictors?

The reason we ask this question is that, if the answer is "yes", then we can simply select any unsupervised predictor built from the change metrics, as Yang et al. suggest, without using any supervised data; if the answer is "no", then we must apply some technique to select the best predictors and remove the worst ones. However, our results show that, when projects are explored separately, the majority of the unsupervised predictors learned by Yang et al. perform worse than supervised predictors.

The results for RQ1 suggest that, after building multiple predictors using unsupervised methods, it is necessary to prune the worst predictors so that only the better ones are used for future prediction. However, with the Yang et al. approach, there is no way to tell which unsupervised predictors will perform better without access to the labels of the testing data. To test that speculation, we built a new learner, OneWay, that uses supervised training data to remove all but one of the Yang et al. predictors. Using this learner, we asked:

RQ2: Is it beneficial to use supervised data to prune away all but one of the Yang et al. predictors?

Our results showed that OneWay nearly always outperforms the unsupervised predictors found by Yang et al. The success of OneWay leads to one last question:

RQ3: Does OneWay perform better than more complex standard supervised learners?

Such standard supervised learners include Random Forests, Linear Regression, J48 and IBk (these learners were selected based on prior results by [13, 21, 30, 35]).

We find that in terms of Recall and Popt (the metric preferred by Yang et al.), OneWay performed better than the standard supervised predictors. Yet, measured in terms of Precision, there was no advantage to OneWay.

From the above, we reach the opposite conclusion to Yang et al.; i.e., there are clear advantages to using supervised approaches over unsupervised ones. We explain the difference between our results and their results as follows:

• Yang et al. reported results averaged across all projects;
• We offer a more detailed analysis on a project-by-project basis.

The rest of this paper is organized as follows. Section 2 is a commentary on the Yang et al. study and the implications of this paper. Section 3 describes the background and related work on defect prediction. Section 4 explains the effort-aware JIT defect prediction methods investigated in this study. Section 5 describes the experimental settings of our study, including the research questions that motivate our study, the data sets and the experimental design. Section 6 presents the results. Section 7 discusses the threats to the validity of our study. Section 8 presents the conclusion and future work.

Note one terminological convention: in the following, we treat "predictors" and "learners" as synonyms.
2 SCIENCE IN THE 21st CENTURY

While this paper is specific to effort-aware JIT defect prediction and the Yang et al. result, at another level this paper is also about science in the 21st century.

In 2017, the software analytics community has the tools, data sets and experience to explore a much wider range of options. There are practical problems in exploring all those possibilities; specifically, there are too many options. For example, in section 2.5 of [27], Kocaguneli et al. list 12,000+ different ways of estimation by analogy. We have had some recent successes with exploring this space of options [7], but only after the total space of options was reduced by some initial study to a manageable set of possibilities. Hence, what is needed are initial studies that rule out methods that are generally unpromising (e.g. this paper) before we apply a second-level hyper-parameter optimization study to the reduced set of options.

Another aspect of 21st century science highlighted by this paper is the nature of repeatability. While this paper disagrees with the conclusions of Yang et al., it is important to stress that their paper is an excellent example of good science that should be emulated in future work.

Firstly, they tried something new. There are many papers in the SE literature about defect prediction. However, compared to most of those, the Yang et al. paper is bold and stunningly original.

Secondly, they made all their work freely available. Using the "R" code they placed online, we could reproduce their result, including all their graphical output, in a matter of days. Further, using that code as a starting point, we could rapidly conduct the extensive experimentation that leads to this paper. This is an excellent example of the value of open science.

Thirdly, while we assert their answers were wrong, the question they asked is important and should be treated as an open and urgent issue by the software analytics community. In our experiments, supervised predictors performed better than unsupervised ones, but not outstandingly better. Hence, there may indeed be some combination of unsupervised learners that achieves performance comparable to supervised ones. Therefore, even though we reject the specific conclusions of Yang et al., we strongly endorse the question they asked and encourage others to work in this area.

3 BACKGROUND AND RELATED WORK

3.1 Defect Prediction

As soon as people started programming, it became apparent that programming was an inherently buggy process. As recalled by Maurice Wilkes [52], speaking of his programming experiences from the early 1950s: "It was on one of my journeys between the EDSAC room and the punching equipment that 'hesitating at the angles of stairs' the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs."

It took decades to gather the experience required to quantify the size/defect relationship. In 1971, Fumio Akiyama [2] described the first known "size" law, saying the number of defects D was a function of the number of LOC, where D = 4.86 + 0.018 * LOC.

In 1976, Thomas McCabe argued that the number of LOC was less important than the complexity of that code [33]. He argued that code is more likely to be defective when his "cyclomatic complexity" measure is over 10. Later work used data miners to build defect predictors that proposed thresholds on multiple measures [35].

Subsequent research showed that software bugs are not distributed evenly across a system. Rather, they seem to clump in small corners of the code. For example, Hamill et al. [15] report studies with (a) the GNU C++ compiler where half of the files were never implicated in issue reports while 10% of the files were mentioned in half of the issues. Also, Ostrand et al. [44] studied (b) AT&T data and reported that 80% of the bugs reside in 20% of the files. Similar "80-20" results have been observed in (c) NASA systems [15] as well as (d) open-source software [28] and (e) software from Turkey [37].

Given this skewed distribution of bugs, a cost-effective quality assurance approach is to sample across a software system, then focus on regions reporting some bugs. Software defect predictors built from data miners are one way to implement such a sampling policy. While their conclusions are never 100% correct, they can be used to suggest where to focus more expensive methods such as elaborate manual review of source code [49], symbolic execution checking [46], etc. For example, Misirli et al. [37] report studies where the guidance offered by defect predictors:

• reduced the effort required for software inspections in some Turkish software companies by 72%;
• while, at the same time, still being able to find the 25% of the files that contain 88% of the defects.

Not only do static code defect predictors perform well compared to manual methods, they are also competitive with certain automatic methods. In a recent study at ICSE'14, Rahman et al. [47] compared (a) the static code analysis tools FindBugs, Jlint, and Pmd and (b) static code defect predictors (which they called "statistical defect prediction") built using logistic regression. They found no significant differences in the cost-effectiveness of these approaches. Given this equivalence, it is significant to note that static code defect prediction can be quickly adapted to new languages by building lightweight parsers that extract static code metrics. The same is not true for static code analyzers; these need extensive modification before they can be used on new languages.

To build such defect predictors, we measure the complexity of software projects using the McCabe metrics, Halstead's effort metrics and the CK object-oriented code metrics [5, 14, 20, 33] at a coarse granularity, such as the file or package level. With the collected data instances, along with the corresponding labels (defective or non-defective), we can build defect prediction models using supervised learners such as Decision Tree, Random Forests, SVM, Naive Bayes and Logistic Regression [13, 22-24, 30, 35]. After that, such a trained defect predictor can be applied to predict the defects of future projects.
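As a concrete illustration of this standard pipeline, the sketch below trains one of the learners listed above (Random Forests, via scikit-learn) on labelled file-level metrics and applies it to a later release. It is only a minimal sketch under stated assumptions: the file names and column names (e.g. loc, cyclomatic_complexity, defective) are hypothetical and are not taken from any data set used in this paper.

# Minimal sketch of the standard supervised approach: learn a defect predictor
# from labelled static code metrics, then apply it to a future release.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("release_n_metrics.csv")          # hypothetical labelled release
test = pd.read_csv("release_n_plus_1_metrics.csv")    # hypothetical future release
features = ["loc", "cyclomatic_complexity", "halstead_effort", "cbo", "rfc"]

model = RandomForestClassifier(n_estimators=100, random_state=1)
model.fit(train[features], train["defective"])        # labels: 1 = defective, 0 = clean

test["predicted_defective"] = model.predict(test[features])
print(test[["file", "predicted_defective"]].head())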
3.2 Just-In-Time Defect Prediction

Traditional defect prediction has some drawbacks, such as prediction at a coarse granularity and starting at a very late stage of the software development cycle [21]. In the JIT defect prediction paradigm, by contrast, the defect predictors are built at the code-change level, which can easily help developers narrow down the code for inspection, and JIT defect prediction can be conducted right before developers commit the current change. This makes JIT defect prediction a more practical method for practitioners to carry out.

Mockus et al. [38] conducted the first study to predict software failures on a telecommunication software project, 5ESS, by using logistic regression on data sets consisting of change metrics of the project. Kim et al. [25] further evaluated the effectiveness of change metrics on open source projects. In their study, they proposed to apply support vector machines to build a defect predictor based on software change metrics, where on average they achieved 78% accuracy and 60% recall. Since training data might not be available when building the defect predictor, Fukushima et al. [9] introduced the cross-project paradigm into JIT defect prediction. Their results showed that using data from other projects to build a JIT defect predictor is feasible.

Most of the research into defect prediction does not consider the effort (i.e., the time/labor required to inspect the total number of files/code predicted as defective) required to inspect the code predicted to be defective. Exceptions to this rule include the work of Arisholm and Briand [3], Koru et al. [28] and Kamei et al. [21]. Kamei et al. [21] conducted a large-scale study on the effectiveness of JIT defect prediction, where they claimed that, using 20% of the effort required to inspect all changes, their modified linear regression model (EALR) could detect 35% of defect-introducing changes. Inspired by Menzies et al.'s ManualUp model (i.e., small modules are inspected first) [36], Yang et al. [54] proposed to build 12 unsupervised defect predictors by sorting the reciprocal values of 12 different change metrics on each testing data set in descending order. They reported that, with 20% effort, many unsupervised predictors perform better than state-of-the-art supervised predictors.

4 METHOD

4.1 Unsupervised Predictors

In this section, we describe the effort-aware just-in-time unsupervised defect predictors proposed by Yang et al. [54], which serve as a baseline method in this study. As described by Yang et al. [54], their simple unsupervised defect predictors are built on the change metrics shown in Table 1. These 14 change metrics can be divided into 5 dimensions [21]:

• Diffusion: NS, ND, NF and Entropy.
• Size: LA, LD and LT.
• Purpose: FIX.
• History: NDEV, AGE and NUC.
• Experience: EXP, REXP and SEXP.

Table 1: Change metrics used in our data sets.

Metric | Description
NS | Number of modified subsystems [38].
ND | Number of modified directories [38].
NF | Number of modified files [42].
Entropy | Distribution of the modified code across each file [6, 16].
LA | Lines of code added [41].
LD | Lines of code deleted [41].
LT | Lines of code in a file before the current change [28].
FIX | Whether or not the change is a defect fix [11, 55].
NDEV | Number of developers that changed the modified files [32].
AGE | The average time interval between the last and the current change [10].
NUC | The number of unique changes to the modified files [6, 16].
EXP | The developer experience in terms of number of changes [38].
REXP | Recent developer experience [38].
SEXP | Developer experience on a subsystem [38].

The diffusion dimension characterizes how a change is distributed at different levels of granularity. As discussed by Kamei et al. [21], a highly distributed change is harder to keep track of and more likely to introduce defects. The size dimension characterizes the size of a change, and it is believed that software size is related to defect proneness [28, 41]. Yin et al. [55] report that the bug-fixing process can itself introduce new bugs; therefore, the FIX metric can be used as a defect indicator. The history dimension includes historical information about the change, which has been shown to be a good defect indicator [32]. For example, Matsumoto et al. [32] find that files previously touched by many developers are likely to contain more defects. The experience dimension describes the experience of the programmers making the current change, because Mockus et al. [38] show that more experienced developers are less likely to introduce a defect. More details about these metrics can be found in Kamei et al.'s study [21].

In Yang et al.'s study, for each change metric M of the testing data, they build an unsupervised predictor that ranks all the changes based on the corresponding value of 1/M(c) in descending order, where M(c) is the value of the selected change metric for each change c. Therefore, changes with smaller change metric values are ranked higher. In all, for each project, Yang et al. define 12 simple unsupervised predictors (LA and LD are excluded, as in Yang et al. [54]).
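To illustrate the scheme just described, here is a minimal Python sketch (not Yang et al.'s original R code) of a single-metric unsupervised predictor: it ranks the changes of a test set by 1/M(c) and reports how many defect-introducing changes are caught within 20% of the total inspection effort. The column names, and the use of code churn as the effort proxy, are assumptions.

# Sketch of a Yang et al.-style unsupervised predictor: rank changes by the
# reciprocal of one change metric, then inspect within a 20% effort budget.
import pandas as pd

def unsupervised_rank(changes: pd.DataFrame, metric: str) -> pd.DataFrame:
    ranked = changes.copy()
    # Smaller metric values get larger scores, so they are inspected first.
    ranked["score"] = 1.0 / (ranked[metric] + 1e-6)   # guard against division by zero
    return ranked.sort_values("score", ascending=False)

def recall_at_20_effort(ranked: pd.DataFrame,
                        effort_col: str = "churn",    # assumed effort proxy
                        label_col: str = "defective") -> float:
    budget = 0.2 * ranked[effort_col].sum()
    inspected = ranked[ranked[effort_col].cumsum() <= budget]
    return inspected[label_col].sum() / max(ranked[label_col].sum(), 1)

# Usage (assuming columns "LT", "churn" and "defective" exist):
# print(recall_at_20_effort(unsupervised_rank(test_changes, "LT")))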

4.2 Supervised Predictors

To further evaluate the unsupervised predictors, we selected several supervised predictors that were already used in Yang et al.'s work. As reported in both Yang et al.'s [54] and Kamei et al.'s [21] work, EALR outperforms all other supervised predictors for effort-aware JIT defect prediction. EALR is a modified linear regression model [21]: it predicts Y(x)/Effort(x) instead of predicting Y(x), where Y(x) indicates whether the change x is a defect or not (1 or 0) and Effort(x) represents the effort required to inspect the change. Note that we build EALR in the same way as Kamei et al. [21].
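The following is a rough sketch of the EALR idea as described above, not Kamei et al.'s exact implementation: a linear regression is fit to the defect-density target Y(x)/Effort(x), and test changes are then ranked by the predicted density. Approximating Effort(x) by code churn is an assumption here.

# Sketch of the EALR idea: regress defect density Y(x)/Effort(x) on the change
# metrics, then rank test changes by predicted density (most dense first).
import numpy as np
from sklearn.linear_model import LinearRegression

def train_ealr(X_train, y_train, effort_train):
    density = np.asarray(y_train) / np.maximum(effort_train, 1)   # Y(x) / Effort(x)
    return LinearRegression().fit(X_train, density)

def rank_by_ealr(model, X_test):
    predicted_density = model.predict(X_test)
    return np.argsort(-predicted_density)   # indices of test changes, highest density first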

In the defect prediction literature, the IBk (k-nearest neighbor), J48 and Random Forests methods are simple yet widely used defect learners and have been shown to perform, if not best, quite well for defect prediction [13, 30, 35, 50]. These three learners are also used in Yang et al.'s study. For these supervised predictors, Y(x) was used as the dependent variable. For the KNN method, we set K = 8, following Yang et al. [54].

4.3 OneWay Learner

Based on our preliminary experimental results shown in the following section, for the six projects investigated by Yang et al., some of the 12 unsupervised predictors do perform worse than supervised predictors, and no single predictor consistently works best on all project data. This means we cannot simply say which unsupervised predictor will work for a new project before predicting on the testing data. In this case, we need a technique to select the proper metric with which to build the defect predictor.

We propose the OneWay learner, a supervised predictor built on the implications of Yang et al.'s simple unsupervised predictors. The pseudocode for OneWay is shown in Algorithm 1. In the following description, we use parenthesized line numbers to refer to lines in the pseudocode.

The general idea of OneWay is to use supervised training data to remove all but one of the Yang et al. predictors and then apply this trained learner to the testing data. Specifically, OneWay first builds simple unsupervised predictors from each metric on the training data (line 4), then evaluates each of those learners in terms of evaluation metrics (line 5) such as Popt, Recall, Precision and F1. After that, if a desired evaluation goal is set, the metric which performs best on that goal is returned as the best metric; otherwise, the metric with the highest mean score over all evaluation metrics is returned (line 9). (In this study, we use the latter option.) Finally, a simple predictor is built using only that best metric (line 10). Therefore, OneWay builds only one predictor for each project, using local supervised data, instead of applying 12 predictors directly to the testing data as Yang et al. [54] do.

Algorithm 1 Pseudocode for OneWay
Input: data_train, data_test, eval_goal in {F1, Popt, Recall, Precision, ...}
Output: result
1: function OneWay(data_train, data_test, eval_goal)
2:   all_scores <- NULL
3:   for metric in data_train do
4:     learner <- buildUnsupervisedLearner(data_train, metric)
5:     scores <- evaluate(learner)
6:     // scores include all evaluation goals, e.g., Popt, F1, ...
7:     all_scores.append(scores)
8:   end for
9:   best_metric <- pruneFeature(all_scores, eval_goal)
10:  result <- buildUnsupervisedLearner(data_test, best_metric)
11:  return result
12: end function
13: function pruneFeature(all_scores, eval_goal)
14:  if eval_goal == NULL then
15:    mean_scores <- getMeanScoresForEachMetric(all_scores)
16:    best_metric <- getMetric(max(mean_scores))
17:    return best_metric
18:  else
19:    best_metric <- getMetric(max(all_scores[eval_goal]))
20:    return best_metric
21:  end if
22: end function
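The sketch below restates Algorithm 1 in Python, reusing the helper functions from the earlier sketches; the EVALUATORS mapping from goal names to scoring functions (e.g. recall_at_20_effort) is an assumed convenience, and this is not the exact implementation released with this paper.

# Sketch of OneWay: score each single-metric predictor on the TRAINING data,
# keep the metric with the highest mean score, and apply only that one to the test data.
import pandas as pd

METRICS = ["NS", "ND", "NF", "Entropy", "LT", "FIX",
           "NDEV", "AGE", "EXP", "REXP", "SEXP", "NUC"]

def one_way(train: pd.DataFrame, test: pd.DataFrame, eval_goal: str = None):
    all_scores = {}
    for metric in METRICS:
        ranked = unsupervised_rank(train, metric)       # build on training data only
        all_scores[metric] = {goal: fn(ranked) for goal, fn in EVALUATORS.items()}
    if eval_goal is None:                               # default: best mean score over all goals
        best = max(all_scores, key=lambda m: sum(all_scores[m].values()) / len(EVALUATORS))
    else:                                               # or best on one chosen goal
        best = max(all_scores, key=lambda m: all_scores[m][eval_goal])
    return best, unsupervised_rank(test, best)          # deploy the single surviving predictor

# EVALUATORS is assumed to map goal names ("recall", "popt", ...) to functions
# such as recall_at_20_effort from the earlier sketch.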
5 EXPERIMENTAL SETTINGS

5.1 Research Questions

Using the above methods, we explore three questions:

• Do all unsupervised predictors perform better than supervised predictors?
• Is it beneficial to use supervised data to prune away all but one of the Yang et al. unsupervised predictors?
• Does OneWay perform better than more complex standard supervised predictors?

When reading the results from Yang et al. [54], we found that they aggregate the performance scores of each learner over the six projects, which might miss some information about how learners perform on each project. Are these unsupervised predictors working consistently across all the project data? If not, what does the picture look like? Therefore, in RQ1, we report results for each project separately.

Another observation is that, even though Yang et al. [54] propose that simple unsupervised predictors could work better than supervised predictors for effort-aware JIT defect prediction, one missing aspect of their report is how to select the most promising metric with which to build a defect predictor. This is not an issue when all unsupervised predictors perform well but, as we shall see, this is not the case. As demonstrated below, given M unsupervised predictors, only a small subset can be recommended. Therefore it is vital to have some mechanism by which we can down-select from the M models to the L << M that are useful. Based on this observation, we propose a new method, OneWay, which is the missing link in Yang et al.'s study [54] and the missing final step they do not explore. Therefore, in RQ2 and RQ3, we evaluate how well our proposed OneWay method performs compared to the unsupervised predictors and the supervised predictors.

Considering our goals and questions, we reproduce Yang et al.'s results and report them for each project to answer RQ1. For RQ2 and RQ3, we implement our OneWay method and compare it with the unsupervised and supervised predictors on different projects in terms of various evaluation metrics.

5.2 Data Sets

In this study, we conduct our experiments using the same data sets as Yang et al. [54]: six well-known open source projects, Bugzilla, Columba, Eclipse JDT, Eclipse Platform, Mozilla and PostgreSQL. These data sets are shared by Kamei et al. [21]. The statistics of the data sets are listed in Table 2, which shows that all six data sets cover at least 4 years of historical information; the longest one is PostgreSQL, which includes 15 years of data.

Table 2: Statistics of the studied data sets.

Project | Period | Total Change | % of Defects | Avg LOC per Change | # Modified Files per Change
Bugzilla | 08/1998 - 12/2006 | 4620 | 36% | 37.5 | 2.3
Platform | 05/2001 - 12/2007 | 64250 | 14% | 72.2 | 4.3
Mozilla | 01/2000 - 12/2006 | 98275 | 5% | 106.5 | 5.3
JDT | 05/2001 - 12/2007 | 35386 | 14% | 71.4 | 4.3
Columba | 11/2002 - 07/2006 | 4455 | 31% | 149.4 | 6.2
PostgreSQL | 07/1996 - 05/2010 | 20431 | 25% | 101.3 | 4.5

The total changes for these six data sets range from 4455 to 98275, which is sufficient for us to conduct an empirical study. In this study, if a change introduces one or more defects then it is considered a defect-introducing change. The percentage of defect-introducing changes ranges from 5% to 36%. All the data and code used in this paper are available online (https://github.com/WeiFoo/RevisitUnsupervised).

5.3 Experimental Design

The following principle guides the design of these experiments: whenever there is a choice between methods, data, etc., we will always prefer the techniques used in Yang et al. [54]. By applying this principle, we ensure that our experimental setup is the same as that of Yang et al. [54]. This increases the validity of our comparisons with that prior work.

When applying data mining algorithms to build predictive models, one important principle is not to test on the data used in training. To avoid that, we used the time-wise-cross-validation method that is also used by Yang et al. [54]. The important aspect of the following experiment is that it ensures that all testing data were created after the training data. Firstly, we sort all the changes in each project based on the commit date. Then all the changes that were submitted in the same month are grouped together. For a given project data set that covers N months of history in total, when building a defect predictor, we consider a sliding window of size 6:

• The first two consecutive months of data in the sliding window, month i and month i+1, are used as the training data to build the supervised predictors and the OneWay learner.
• The last two months of data in the sliding window, month i+4 and month i+5, which are two months later than the training data, are used as the testing data to test the supervised predictors, the OneWay learner and the unsupervised predictors.

After one experiment, the window slides forward by one month of data. By using this method, each training and testing data set contains two months of data, which will include sufficient positive and negative instances for the supervised predictors to learn. For any project that includes N months of data, we can perform N - 5 different experiments to evaluate our learners (when N is greater than 5). For all the unsupervised predictors, only the testing data is used to build the model and evaluate the performance.
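The following sketch shows one way to generate these time-wise train/test pairs (the commit_month column is an assumption): each window uses months i and i+1 for training and months i+4 and i+5 for testing, giving N - 5 experiments for N months of data.

# Sketch of the time-wise validation scheme: a sliding window of six months,
# training on the first two months and testing on the last two.
import pandas as pd

def time_wise_splits(changes: pd.DataFrame, month_col: str = "commit_month"):
    months = sorted(changes[month_col].unique())
    for i in range(len(months) - 5):                    # N months -> N - 5 experiments
        train = changes[changes[month_col].isin(months[i:i + 2])]
        test = changes[changes[month_col].isin(months[i + 4:i + 6])]
        yield train, test

# for train, test in time_wise_splits(project_changes):
#     ...fit the supervised predictors and OneWay on train, score all learners on test...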
To statistically compare the differences between OneWay and the supervised and unsupervised predictors, we use the Wilcoxon signed-rank test to compare the performance scores of the learners in this study, the same as Yang et al. [54]. To control the false discovery rate, the Benjamini-Hochberg (BH) adjusted p-value is used to test whether two distributions are statistically significantly different at the 0.05 level [4, 54]. To measure the effect size of the performance scores of OneWay versus the supervised/unsupervised predictors, we compute Cliff's δ, a non-parametric effect size measure [48]. As Romano et al. suggest, we evaluate the magnitude of the effect size as follows: negligible (|δ| < 0.147), small (0.147 ≤ |δ| < 0.33), medium (0.33 ≤ |δ| < 0.474), and large (0.474 ≤ |δ|) [48].
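A minimal sketch of this statistical machinery follows; it uses scipy's Wilcoxon signed-rank test and hand-rolled Benjamini-Hochberg and Cliff's δ helpers, and may differ in detail from the scripts used for this paper.

# Sketch of the statistical comparison: paired Wilcoxon signed-rank tests,
# Benjamini-Hochberg correction, and Cliff's delta for effect size.
from scipy.stats import wilcoxon

def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])   # indices sorted by p-value
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):                         # walk from largest to smallest p
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

# Usage on paired score lists (one score per time-wise experiment):
# stat, p = wilcoxon(oneway_scores, other_scores)
# better = bh_adjust([p])[0] < 0.05 and abs(cliffs_delta(oneway_scores, other_scores)) >= 0.147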

5.4 Evaluation Measures

For effort-aware JIT defect prediction, in addition to evaluating how correctly learners predict a defect-introducing change, we have to take into account the effort required to inspect the predicted changes. Ostrand et al. [45] report that, for a given project, 20% of the files contain on average 80% of all defects in the project. Although there is nothing magical about the number 20%, it has been used as a cutoff value for the effort allotted to defect inspection when evaluating defect learners [21, 34, 39, 54]. That is, given 20% of the effort, how many defects can be detected by the learner? To be consistent with Yang et al., in this study we restrict the effort to 20% of the total effort.

To evaluate the performance of the effort-aware JIT defect prediction learners in our study, we used the following four metrics: Precision, Recall, F1 and Popt, which are widely used in the defect prediction literature [21, 35, 36, 39, 54, 56]:

Precision = True Positives / (True Positives + False Positives)
Recall = True Positives / (True Positives + False Negatives)
F1 = (2 * Precision * Recall) / (Precision + Recall)

where Precision denotes the percentage of actually defective changes among all the predicted changes and Recall is the percentage of predicted defective changes among all actually defective changes. F1 is a measure that combines both Precision and Recall; it is their harmonic mean.

[Figure 1: Example of an effort-based cumulative lift chart [54].]

The last evaluation metric used in this study is Popt, which is defined as 1 - Δopt, where Δopt is the area between the effort (code-churn-based) cumulative lift charts of the optimal model and the prediction model (as shown in Figure 1). In this chart, the x-axis is the percentage of required effort to inspect the changes and the y-axis is the percentage of defect-introducing changes found among the selected changes. In the optimal model, all the changes are sorted by their actual defect density in descending order, while for the predicted model, all the changes are sorted by their predicted value in descending order.

According to Kamei et al. and Xu et al. [21, 39, 54], Popt can be normalized as follows:

Popt(m) = 1 - (S(optimal) - S(m)) / (S(optimal) - S(worst))

where S(optimal), S(m) and S(worst) represent the areas under the curves of the optimal model, the predicted model, and the worst model, respectively. Note that the worst model is built by sorting all the changes according to their actual defect density in ascending order. A learner performs better than a random predictor only if its Popt is greater than 0.5.

Note that, following the practice of Yang et al. [54], we measure Precision, Recall, F1 and Popt at the effort = 20% point. In this study, in addition to Popt and ACC (i.e., Recall), which are used in Yang et al.'s work [54], we include the Precision and F1 measures; they provide more insight, from very different perspectives, into all the learners evaluated in this study, as will be shown in the next section.
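To make these measures concrete, here is a minimal sketch that computes Precision, Recall and F1 at the 20% effort cutoff, plus the normalized Popt from the cumulative lift-chart areas; treating code churn as the effort of a change is an assumption, and the exact scripts behind this paper may differ.

# Sketch of the effort-aware measures: Precision/Recall/F1 at 20% effort and a
# normalized Popt computed from areas under effort-based cumulative lift charts.
import numpy as np

def lift_area(effort, labels, order):
    """Area under the cumulative %defects-found vs. %effort curve for one ordering."""
    e = np.asarray(effort, dtype=float)[order]
    d = np.asarray(labels, dtype=float)[order]
    x = np.cumsum(e) / e.sum()              # cumulative fraction of inspection effort
    y = np.cumsum(d) / max(d.sum(), 1.0)    # cumulative fraction of defects found
    return np.trapz(y, x)

def popt(effort, labels, scores):
    predicted = np.argsort(-np.asarray(scores))                   # model's ranking
    density = np.asarray(labels, dtype=float) / np.maximum(effort, 1)
    optimal, worst = np.argsort(-density), np.argsort(density)    # best and worst possible rankings
    s_m, s_opt, s_wst = (lift_area(effort, labels, o) for o in (predicted, optimal, worst))
    return 1 - (s_opt - s_m) / (s_opt - s_wst)

def scores_at_20_effort(effort, labels, scores):
    order = np.argsort(-np.asarray(scores))
    e, d = np.asarray(effort, dtype=float)[order], np.asarray(labels, dtype=float)[order]
    flagged = np.cumsum(e) <= 0.2 * e.sum()       # changes inspected within the 20% budget
    tp = d[flagged].sum()
    recall = tp / max(d.sum(), 1.0)
    precision = tp / max(flagged.sum(), 1.0)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1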
6 EMPIRICAL RESULTS

In this section, we present the experimental results to investigate how the simple unsupervised predictors work in practice and to evaluate the performance of the proposed method, OneWay, compared with the supervised and unsupervised predictors.

Before we start, we need a sanity check to see whether we can fully reproduce Yang et al.'s results. Yang et al. [54] provide the median values of Popt and Recall for the EALR model and the best two unsupervised models, LT and AGE, from the time-wise cross evaluation experiment. Therefore, we use those numbers to check our results.

Table 3: Comparison in Popt: Yang's method (A) vs. our implementation (B).

Project | EALR (A) | EALR (B) | LT (A) | LT (B) | AGE (A) | AGE (B)
Bugzilla | 0.59 | 0.59 | 0.72 | 0.72 | 0.67 | 0.67
Platform | 0.58 | 0.58 | 0.72 | 0.72 | 0.71 | 0.71
Mozilla | 0.50 | 0.50 | 0.65 | 0.65 | 0.64 | 0.64
JDT | 0.59 | 0.59 | 0.71 | 0.71 | 0.68 | 0.69
Columba | 0.62 | 0.62 | 0.73 | 0.73 | 0.79 | 0.79
PostgreSQL | 0.60 | 0.60 | 0.74 | 0.74 | 0.73 | 0.73
Average | 0.58 | 0.58 | 0.71 | 0.71 | 0.70 | 0.70

Table 4: Comparison in Recall: Yang's method (A) vs. our implementation (B).

Project | EALR (A) | EALR (B) | LT (A) | LT (B) | AGE (A) | AGE (B)
Bugzilla | 0.29 | 0.30 | 0.45 | 0.45 | 0.38 | 0.38
Platform | 0.31 | 0.30 | 0.43 | 0.43 | 0.43 | 0.43
Mozilla | 0.18 | 0.18 | 0.36 | 0.36 | 0.28 | 0.28
JDT | 0.32 | 0.34 | 0.45 | 0.45 | 0.41 | 0.41
Columba | 0.40 | 0.42 | 0.44 | 0.44 | 0.57 | 0.57
PostgreSQL | 0.36 | 0.36 | 0.43 | 0.43 | 0.43 | 0.43
Average | 0.31 | 0.32 | 0.43 | 0.43 | 0.41 | 0.41

As shown in Table 3 and Table 4, for the unsupervised predictors LT and AGE, we get exactly the same performance scores on all projects in terms of Recall and Popt. This is reasonable because the unsupervised predictors are very straightforward and easy to implement. For the supervised predictor, EALR, the two implementations show no differences in Popt, while the maximum difference in Recall is only 0.02. Since the differences are quite small, we believe that our implementation reflects the details of the EALR and unsupervised learners in Yang et al. [54].

For the other supervised predictors used in this study, J48, IBk and Random Forests, we use the same algorithms from the Weka package [12] and set the same parameters as used in Yang et al. [54].

RQ1: Do all unsupervised predictors perform better than supervised predictors?

To answer this question, we build four supervised predictors and twelve unsupervised predictors on the six project data sets using the incremental learning method described in Section 5.3.

Figure 2 shows boxplots of Recall, Popt, F1 and Precision for the supervised and unsupervised predictors on all data sets. For each predictor, the boxplot shows the 25th percentile, median and 75th percentile values for one data set. The horizontal dashed lines indicate the median of the best supervised predictor, which helps visualize the median differences between the unsupervised and supervised predictors.

The colors of the boxes in Figure 2 indicate the significance of the difference between learners:

• Blue indicates that the corresponding unsupervised predictor is significantly better than the best supervised predictor according to the Wilcoxon signed-rank test, where the BH-corrected p-value is less than 0.05 and the magnitude of the difference between the two learners is NOT trivial according to Cliff's delta, i.e., |δ| ≥ 0.147.
• Black indicates that the corresponding unsupervised predictor is not significantly better than the best supervised predictor, or that the magnitude of the difference between the two learners is trivial, i.e., |δ| < 0.147.
• Red indicates that the corresponding unsupervised predictor is significantly worse than the best supervised predictor and the magnitude of the difference between the two learners is NOT trivial.

From Figure 2, we can clearly see that not all unsupervised predictors perform statistically better than the best supervised predictor across all evaluation metrics. Specifically, for Recall, on the one hand, only 2/12, 3/12, 6/12, 2/12, 3/12 and 2/12 of the unsupervised predictors perform statistically better than the best supervised predictor on the six data sets, respectively. On the other hand, 6/12, 6/12, 4/12, 6/12, 5/12 and 6/12 of the unsupervised predictors perform statistically worse than the best supervised predictor on the six data sets, respectively. This indicates that:

• About 50% of the unsupervised predictors perform worse than the best supervised predictor on any data set;
• Without any prior knowledge, we cannot know which unsupervised predictor(s) will work adequately on the testing data.

[Figure 2: Performance comparisons between supervised and unsupervised predictors over six projects (from top to bottom: Bugzilla, Platform, Mozilla, JDT, Columba, PostgreSQL). Each panel shows boxplots of Recall, Popt, F1 and Precision for the supervised predictors (EALR, RF, J48, IBk) and the unsupervised predictors (NS, ND, NF, Entropy, LT, FIX, NDEV, AGE, EXP, REXP, SEXP, NUC).]

Note that the above two points for Recall also hold for Popt. For F1, we see that only LT on Bugzilla and AGE on PostgreSQL perform statistically better than the best supervised predictor. Other than that, no unsupervised predictor performs better on any data set. Furthermore, and surprisingly, no unsupervised predictor works significantly better than the best supervised predictor on any data set in terms of Precision. As we can see, Random Forests performs well on all six data sets. This suggests that unsupervised predictors have very low precision for effort-aware defect prediction and cannot be deployed in any business situation where precision is critical.

Overall, for a given data set, no specific unsupervised predictor works better than the best supervised predictor across all evaluation metrics. For a given measure, most unsupervised predictors did not perform better across all data sets. In summary:

Not all unsupervised predictors perform better than supervised predictors for each project and for different evaluation measures.

Note the implications of this finding: some extra knowledge is required to prune the weaker unsupervised models, such as the knowledge that can come from labelled data. Hence, we must conclude the opposite of Yang et al.; i.e., some supervised labelled data must be applied before we can reliably deploy unsupervised defect predictors on testing data.

RQ2: Is it beneficial to use supervised data to prune away all but one of the Yang et al. predictors?

To answer this question, we compare the OneWay learner with all twelve unsupervised predictors. All these predictors are tested on the six project data sets using the same experimental scheme as in RQ1.

Figure 3 shows boxplots of the performance distributions of the unsupervised predictors and the proposed OneWay learner on the six data sets across the four evaluation measures. The horizontal dashed line denotes the median value of OneWay. Note that in Figures 3 and 4, blue means the learner is statistically better than OneWay, red means worse, and black means no difference.

[Figure 3: Performance comparisons between the proposed OneWay learner and unsupervised predictors over six projects (from top to bottom: Bugzilla, Platform, Mozilla, JDT, Columba, PostgreSQL). Each panel shows boxplots of Recall, Popt, F1 and Precision for the unsupervised predictors (NS, ND, NF, Entropy, LT, FIX, NDEV, AGE, EXP, REXP, SEXP, NUC) and OneWay.]

As we can see, for Recall, only one unsupervised predictor, LT, outperforms OneWay, and only on 4/6 data sets. Meanwhile, OneWay significantly outperforms 9/12, 9/12, 9/12, 10/12, 8/12 and 10/12 of the unsupervised predictors on the six data sets, respectively. This observation indicates that OneWay works significantly better than almost all learners on all six data sets in terms of Recall.

Similarly, we observe that only the LT predictor works better than OneWay, on 3/6 data sets, in terms of Popt, and AGE outperforms OneWay only on the Platform data set. For the remaining experiments, OneWay performs better than all the other predictors (on average, 9 out of 12 predictors).

In addition, according to F1, only three unsupervised predictors, EXP/REXP/SEXP, perform better than OneWay, and only on the Mozilla data set, while the LT predictor performs just as well as OneWay (and has no advantage over it). We note that similar findings hold for the Precision measure.

Table 5 provides the median values of the best unsupervised predictor compared with OneWay for each evaluation measure on all data sets. Note that, in practice, we cannot know which unsupervised predictor is the best of the 12 built by Yang et al.'s method before we have access to the labels of the testing data. In other words, to aid our analysis, the best unsupervised predictors in Table 5 were selected by referring to the true labels of the testing data, which are not available in practice. In that table, for each evaluation measure, a number in a green cell indicates that the best unsupervised predictor has a large advantage over OneWay according to Cliff's δ; similarly, a yellow cell means a medium advantage and a gray cell a small advantage.

From Table 5, we observe that, of the 24 experiments over all evaluation measures, none of these best unsupervised predictors outperforms OneWay with a large advantage according to Cliff's δ. Specifically, according to Recall and Popt, even though the best unsupervised predictor, LT, outperforms OneWay on four and three data sets respectively, all of these advantages are small. Meanwhile, REXP and EXP have a medium improvement over OneWay on one and two data sets for F1 and Precision, respectively.

[Figure 4: Performance comparisons between the proposed OneWay learner and supervised predictors over six projects (from top to bottom: Bugzilla, Platform, Mozilla, JDT, Columba, PostgreSQL). Each panel shows boxplots of Recall, Popt, F1 and Precision for the supervised predictors (EALR, RF, J48, IBk) and OneWay.]

Table 5: Best unsupervised predictor (A) vs. OneWay (B). A colored cell indicates the effect size: green for large; yellow for medium; gray for small.

Project | Recall A (LT) | Recall B | Popt A (LT) | Popt B | F1 A (REXP) | F1 B | Precision A (EXP) | Precision B
Bugzilla | 0.45 | 0.36 | 0.72 | 0.65 | 0.33 | 0.35 | 0.40 | 0.39
Platform | 0.43 | 0.41 | 0.72 | 0.69 | 0.16 | 0.17 | 0.14 | 0.11
Mozilla | 0.36 | 0.33 | 0.65 | 0.62 | 0.10 | 0.08 | 0.06 | 0.04
JDT | 0.45 | 0.42 | 0.71 | 0.70 | 0.18 | 0.18 | 0.15 | 0.12
Columba | 0.44 | 0.56 | 0.73 | 0.76 | 0.23 | 0.32 | 0.24 | 0.25
PostgreSQL | 0.43 | 0.44 | 0.74 | 0.74 | 0.24 | 0.29 | 0.25 | 0.23
Average | 0.43 | 0.42 | 0.71 | 0.69 | 0.21 | 0.23 | 0.21 | 0.19

Table 6: Best supervised predictor (A) vs. OneWay (B). A colored cell indicates the effect size: green for large; yellow for medium; gray for small.

Project | Recall A (EALR) | Recall B | Popt A (EALR) | Popt B | F1 A (IBk) | F1 B | Precision A (RF) | Precision B
Bugzilla | 0.30 | 0.36 | 0.59 | 0.65 | 0.30 | 0.35 | 0.59 | 0.39
Platform | 0.30 | 0.41 | 0.58 | 0.69 | 0.23 | 0.17 | 0.38 | 0.11
Mozilla | 0.18 | 0.33 | 0.50 | 0.62 | 0.18 | 0.08 | 0.27 | 0.04
JDT | 0.34 | 0.42 | 0.59 | 0.70 | 0.22 | 0.18 | 0.31 | 0.12
Columba | 0.42 | 0.56 | 0.62 | 0.76 | 0.24 | 0.32 | 0.50 | 0.25
PostgreSQL | 0.36 | 0.44 | 0.60 | 0.74 | 0.21 | 0.29 | 0.69 | 0.23
Average | 0.32 | 0.42 | 0.58 | 0.69 | 0.23 | 0.23 | 0.46 | 0.19

In terms of the average scores, the maximum magnitude of the difference between the best unsupervised learner and OneWay is 0.02. In other words, OneWay is comparable with the best unsupervised predictors on all data sets for all evaluation measures, even though the best unsupervised predictors cannot be known before testing.

Overall, we find that (1) no single unsupervised predictor significantly outperforms OneWay on all data sets for a given evaluation measure; and (2) mostly, OneWay works as well as the best unsupervised predictor and has significantly better performance than almost all unsupervised predictors on all data sets for all evaluation measures. Therefore, the above results suggest:

As a simple supervised predictor, OneWay has competitive performance and it performs better than most unsupervised predictors for effort-aware JIT defect prediction.

Note the implications of this finding: the supervised learning utilized in OneWay can significantly outperform the unsupervised models.

RQ3: Does OneWay perform better than more complex standard supervised predictors?

To answer this question, we compare the OneWay learner with four supervised predictors: EALR, Random Forests, J48 and IBk. EALR is considered to be the state-of-the-art learner for effort-aware JIT defect prediction [21, 54] and the other three learners have been widely used in the defect prediction literature over the past years [7, 9, 13, 21, 30, 50]. We evaluate all these learners on the six project data sets using the same experimental scheme as in RQ1.

From Figure 4, we have the following observations. Firstly, the performance of OneWay is significantly better than that of all four supervised predictors in terms of Recall and Popt on all six data sets. Also, EALR works better than Random Forests, J48 and IBk, which is consistent with Kamei et al.'s finding [21].

Secondly, according to F1, Random Forests and IBk perform slightly better than OneWay on two out of six data sets. In most cases, OneWay has a performance similar to these supervised predictors and there is not much difference between them.

However, when reading the Precision scores, we find that, in most cases, the supervised learners perform significantly better than OneWay. Specifically, Random Forests, J48 and IBk outperform OneWay on all data sets, and EALR is better on three data sets. This finding is consistent with the observation in RQ1 that all unsupervised predictors perform worse than supervised predictors for Precision.

From Table 6, we make the following observations. First of all, in terms of Recall and Popt, the maximum differences in median values between EALR and OneWay are 0.15 and 0.14, respectively, which are 83% and 23% improvements over 0.18 and 0.60 on the Mozilla and PostgreSQL data sets. For both measures, OneWay improves the average scores by 0.10 and 0.11, which are 31% and 19% improvements over EALR. Secondly, according to F1, IBk outperforms OneWay on three data sets with a large, medium and small advantage, respectively; the largest difference in medians is 0.1. Finally, as discussed before, the best supervised predictor for Precision, Random Forests, has a very large advantage over OneWay on all data sets; the largest difference is 0.46, on the PostgreSQL data set.

Overall, according to the above analysis, we conclude that:

OneWay performs significantly better than all four supervised learners in terms of Recall and Popt; it performs just as well as the other learners for F1. As for Precision, the other supervised predictors outperform OneWay.

Note the implications of this finding: simple tools like OneWay perform adequately, but for all-around performance, more sophisticated learners are recommended.

As to when to use OneWay or supervised predictors like Random Forests, that is an open question. According to the "No Free Lunch Theorems" [53], no method is always best, and we show that unsupervised predictors are often worse on a project-by-project basis. So "best" predictor selection is a matter of local assessment, requiring labelled training data (an issue ignored by Yang et al.).

7 THREATS TO VALIDITY

Internal Validity. Internal validity concerns uncontrolled aspects that may affect the experimental results. One threat to internal validity is how well our implementation of the unsupervised predictors represents Yang et al.'s method. To mitigate this threat, based on Yang et al.'s "R" code, we strictly followed the approach described in Yang et al.'s work and tested our implementation on the same data sets as Yang et al. [54]. By comparing the performance scores, we find that our implementation generates the same results. Therefore, we believe we avoid this threat.

External Validity. External validity concerns the possibility of generalizing our results. Our observations and conclusions from this study may not generalize to other software projects. In this study, we use six widely used open source software projects as the subjects. As all these software projects are written in Java, we cannot guarantee that our findings directly generalize to other projects, specifically to software implemented in other programming languages. Therefore, future work might include verifying our findings on other software projects.

In this work, we used the data sets from [21, 54], in which a total of 14 change metrics were extracted from the software projects. We build and test the OneWay learner on those metrics as well. However, there might be other metrics, not measured in these data sets, that work well as indicators for defect prediction: for example, when the change was committed (e.g., morning, afternoon or evening), or the functionality of the files modified in the change (e.g., core functionality or not). Such new metrics, which are not explored in this study, might improve the performance of our OneWay learner.

8 CONCLUSION AND FUTURE WORK

This paper replicated and refuted Yang et al.'s results [54] on unsupervised predictors for effort-aware just-in-time defect prediction. Not all unsupervised predictors work better than supervised predictors (on all six data sets, for different evaluation measures). This suggests that we cannot randomly pick an unsupervised predictor to perform effort-aware JIT defect prediction. Rather, it is necessary to use supervised methods to pick the best models before deploying them to a project. For that task, supervised predictors like OneWay are useful for automatically selecting the potentially best model.

In the above, OneWay performed very well for Recall, Popt and F1. Hence, it must be asked: "Is defect prediction inherently simple? And does it need anything other than OneWay?". In this context, it is useful to recall that OneWay's results for Precision were not competitive. Hence we say that, if learners are to be deployed in domains where precision is critical, then OneWay is too simple.

This study opens a new research direction of applying simple supervised techniques to perform defect prediction. As shown in this study as well as in Yang et al.'s work [54], instead of using traditional machine learning algorithms like J48 and Random Forests, simply sorting data according to one metric can be a good defect predictor, at least for effort-aware just-in-time defect prediction. Therefore, we recommend that future defect prediction research focus more on simple techniques.

For future work, we plan to extend this study to other software projects, especially those developed in other programming languages. After that, we plan to investigate new change metrics to see if they help improve OneWay's performance.

9 ADDENDUM

As this paper was going to press, we learned of new papers that update the Yang et al. study: Liu et al. at EMSE'17 [31] and Huang et al. at ICSME'17 [17]. We thank these authors for the courtesy of sharing pre-prints of those new results. We also thank them for using concepts from a pre-print of our paper in their work (our pre-print was posted to arxiv.org on March 1, 2017; we hope more researchers will use tools like arxiv.org to speed along the pace of software research). Regretfully, we have yet to return those favors: due to deadline pressure, we have not been able to confirm their results.

As to technical specifics, Liu et al. use a single churn measure (the sum of the number of lines added and deleted) to build an unsupervised predictor that does remarkably better than OneWay and EALR (where the latter could access all the variables). While this result is currently unconfirmed, it could well have "raised the bar" for unsupervised defect prediction. Clearly, more experiments are needed in this area. For example, when comparing the Liu et al. methods to OneWay and standard supervised learners, we could (a) give all learners access to the churn variable; (b) apply the Yang transform of 1/M(c) to all variables prior to learning; (c) use more elaborate supervised methods including synthetic minority over-sampling [1] and automatic hyper-parameter optimization [7].

ACKNOWLEDGEMENTS

The work is partially funded by NSF award #1302169.

REFERENCES
[1] Amritanshu Agrawal and Tim Menzies. "Better data" is better than "better data miners" (benefits of tuning SMOTE for defect prediction). CoRR, abs/1705.03697, 2017.
[2] Fumio Akiyama. An example of software system debugging. In IFIP Congress (1), volume 71, pages 353–359, 1971.
[3] Erik Arisholm and Lionel C Briand. Predicting fault-prone components in a Java legacy system. In Proceedings of the 2006 ACM/IEEE International Symposium on Empirical Software Engineering, pages 8–17. ACM, 2006.
[4] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289–300, 1995.
[5] Shyam R Chidamber and Chris F Kemerer. A metrics suite for object oriented design. IEEE Transactions on Software Engineering, 20(6):476–493, 1994.
[6] Marco D'Ambros, Michele Lanza, and Romain Robbes. An extensive comparison of bug prediction approaches. In 2010 7th IEEE Working Conference on Mining Software Repositories, pages 31–41. IEEE, 2010.
[7] Wei Fu, Tim Menzies, and Xipeng Shen. Tuning for software analytics: Is it really necessary? Information and Software Technology, 76:135–146, 2016.
[8] Wei Fu, Vivek Nair, and Tim Menzies. Why is differential evolution better than grid search for tuning defect predictors? arXiv preprint arXiv:1609.02613, 2016.
[9] Takafumi Fukushima, Yasutaka Kamei, Shane McIntosh, Kazuhiro Yamashita, and Naoyasu Ubayashi. An empirical study of just-in-time defect prediction using cross-project models. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 172–181. ACM, 2014.
[10] Todd L Graves, Alan F Karr, James S Marron, and Harvey Siy. Predicting fault incidence using software change history. IEEE Transactions on Software Engineering, 26(7):653–661, 2000.
[11] Philip J Guo, Thomas Zimmermann, Nachiappan Nagappan, and Brendan Murphy. Characterizing and predicting which bugs get fixed: an empirical study of Microsoft Windows. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, pages 495–504. ACM, 2010.
[12] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18, 2009.
[13] Tracy Hall, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. A systematic literature review on fault prediction performance in software engineering. IEEE Transactions on Software Engineering, 38(6):1276–1304, 2012.
[14] Maurice Howard Halstead. Elements of software science, volume 7. Elsevier, New York, 1977.
[15] Maggie Hamill and Katerina Goseva-Popstojanova. Common trends in software fault and failure data. IEEE Transactions on Software Engineering, 35(4):484–496, 2009.
[16] Ahmed E Hassan. Predicting faults using the complexity of code changes. In Proceedings of the 31st International Conference on Software Engineering, pages 78–88. IEEE, 2009.
[17] Qiao Huang, Xin Xia, and David Lo. Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction. In 2017 IEEE International Conference on Software Maintenance and Evolution. IEEE, 2017.
[18] Tian Jiang, Lin Tan, and Sunghun Kim. Personalized defect prediction. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, pages 279–289. IEEE, 2013.
[19] Xiao-Yuan Jing, Shi Ying, Zhi-Wu Zhang, Shan-Shan Wu, and Jin Liu. Dictionary learning based software defect prediction. In Proceedings of the 36th International Conference on Software Engineering, pages 414–423. ACM, 2014.
[20] Dennis Kafura and Geereddy R. Reddy. The use of software complexity metrics in software maintenance. IEEE Transactions on Software Engineering, (3):335–343, 1987.
[21] Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi. A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering, 39(6):757–773, 2013.
[22] Taghi M Khoshgoftaar and Edward B Allen. Modeling software quality with classification trees. Recent Advances in Reliability and Quality Engineering, 2:247, 2001.
[23] Taghi M Khoshgoftaar and Naeem Seliya. Software quality classification modeling using the SPRINT decision tree algorithm. International Journal on Artificial Intelligence Tools, 12(03):207–225, 2003.
[24] Taghi M Khoshgoftaar, Xiaojing Yuan, and Edward B Allen. Balancing misclassification rates in classification-tree models of software quality. Empirical Software Engineering, 5(4):313–330, 2000.
[25] Sunghun Kim, E James Whitehead Jr, and Yi Zhang. Classifying software changes: Clean or buggy? IEEE Transactions on Software Engineering, 34(2):181–196, 2008.
[26] Sunghun Kim, Thomas Zimmermann, E James Whitehead Jr, and Andreas Zeller. Predicting faults from cached history. In Proceedings of the 29th International Conference on Software Engineering, pages 489–498. IEEE, 2007.
[27] Ekrem Kocaguneli, Tim Menzies, Ayse Bener, and Jacky W Keung. Exploiting the essential assumptions of analogy-based effort estimation. IEEE Transactions on Software Engineering, 38(2):425–438, 2012.
[28] A Güneş Koru, Dongsong Zhang, Khaled El Emam, and Hongfang Liu. An investigation into the functional form of the size-defect relationship for software modules. IEEE Transactions on Software Engineering, 35(2):293–304, 2009.
[29] Taek Lee, Jaechang Nam, DongGyun Han, Sunghun Kim, and Hoh Peter In. Micro interaction metrics for defect prediction. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 311–321. ACM, 2011.
[30] Stefan Lessmann, Bart Baesens, Christophe Mues, and Swantje Pietsch. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34(4):485–496, 2008.
[31] Jinping Liu, Yuming Zhou, Yibiao Yang, Hongmin Lu, and Baowen Xu. Code churn: A neglected metric in effort-aware just-in-time defect prediction. In Empirical Software Engineering and Measurement, 2017 ACM/IEEE International Symposium on. IEEE, 2017.
[32] Shinsuke Matsumoto, Yasutaka Kamei, Akito Monden, Ken-ichi Matsumoto, and Masahide Nakamura. An analysis of developer metrics for fault prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering, page 18. ACM, 2010.
[33] Thomas J McCabe. A complexity measure. IEEE Transactions on Software Engineering, (4):308–320, 1976.
[34] Thilo Mende and Rainer Koschke. Effort-aware defect prediction models. In 2010 14th European Conference on Software Maintenance and Reengineering, pages 107–116. IEEE, 2010.
[35] Tim Menzies, Jeremy Greenwald, and Art Frank. Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1), 2007.
[36] Tim Menzies, Zach Milton, Burak Turhan, Bojan Cukic, Yue Jiang, and Ayşe Bener. Defect prediction from static code features: current results, limitations, new approaches. Automated Software Engineering, 17(4):375–407, 2010.
[37] Ayse Tosun Misirli, Ayse Bener, and Resat Kale. AI-based software defect predictors: Applications and benefits in a case study. AI Magazine, 32(2):57–68, 2011.
[38] Audris Mockus and David M Weiss. Predicting risk of software changes. Bell Labs Technical Journal, 5(2):169–180, 2000.
[39] Akito Monden, Takuma Hayashi, Shoji Shinoda, Kumiko Shirai, Junichi Yoshida, Mike Barker, and Kenichi Matsumoto. Assessing the cost effectiveness of fault prediction in acceptance testing. IEEE Transactions on Software Engineering, 39(10):1345–1357, 2013.
[40] Raimund Moser, Witold Pedrycz, and Giancarlo Succi. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proceedings of the 30th International Conference on Software Engineering, pages 181–190. ACM, 2008.
[41] Nachiappan Nagappan and Thomas Ball. Use of relative code churn measures to predict system defect density. In Proceedings of the 27th International Conference on Software Engineering, pages 284–292. IEEE, 2005.
[42] Nachiappan Nagappan, Thomas Ball, and Andreas Zeller. Mining metrics to predict component failures. In Proceedings of the 28th International Conference on Software Engineering, pages 452–461. ACM, 2006.
[43] Jaechang Nam, Sinno Jialin Pan, and Sunghun Kim. Transfer defect learning. In Proceedings of the 2013 International Conference on Software Engineering, pages 382–391. IEEE, 2013.
[44] Thomas J Ostrand, Elaine J Weyuker, and Robert M Bell. Where the bugs are. In ACM SIGSOFT Software Engineering Notes, volume 29, pages 86–96. ACM, 2004.
[45] Thomas J Ostrand, Elaine J Weyuker, and Robert M Bell. Predicting the location and number of faults in large software systems. IEEE Transactions on Software Engineering, 31(4):340–355, 2005.
[46] Corina S Pasareanu, Peter C Mehlitz, David H Bushnell, Karen Gundy-Burlet, Michael Lowry, Suzette Person, and Mark Pape. Combining unit-level symbolic execution and system-level concrete execution for testing NASA software. In Proceedings of the 2008 International Symposium on Software Testing and Analysis, pages 15–26. ACM, 2008.
[47] Foyzur Rahman, Sameer Khatri, Earl T Barr, and Premkumar Devanbu. Comparing static bug finders and statistical prediction. In Proceedings of the 36th International Conference on Software Engineering, pages 424–434. ACM, 2014.
[48] Jeanine Romano, Jeffrey D Kromrey, Jesse Coraggio, Jeff Skowronek, and Linda Devine. Exploring methods for evaluating group differences on the NSSE and other surveys: Are the t-test and Cohen's d indices the most appropriate choices? In Annual Meeting of the Southern Association for Institutional Research, 2006.
[49] Forrest Shull, Ioana Rus, and Victor Basili. Improving software inspections by using reading techniques. In Proceedings of the 23rd International Conference on Software Engineering, pages 726–727. IEEE, 2001.
[50] Burak Turhan, Tim Menzies, Ayşe B Bener, and Justin Di Stefano. On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5):540–578, 2009.
[51] Song Wang, Taiyue Liu, and Lin Tan. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering, pages 297–308. ACM, 2016.
[52] Maurice V Wilkes. Memoirs of a computer pioneer. Cambridge, Mass., London, 1985.
[53] David H Wolpert. The supervised learning no-free-lunch theorems. In Soft Computing and Industry, pages 25–42. Springer, 2002.
[54] Yibiao Yang, Yuming Zhou, Jinping Liu, Yangyang Zhao, Hongmin Lu, Lei Xu, Baowen Xu, and Hareton Leung. Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 157–168. ACM, 2016.
[55] Zuoning Yin, Ding Yuan, Yuanyuan Zhou, Shankar Pasupathy, and Lakshmi Bairavasundaram. How do fixes become bugs? In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, pages 26–36. ACM, 2011.
[56] Thomas Zimmermann, Rahul Premraj, and Andreas Zeller. Predicting defects for Eclipse. In Proceedings of the Third International Workshop on Predictor Models in Software Engineering, page 9. IEEE Computer Society, 2007.