
Comparative Analysis of Machine Learning Techniques for Fault Prediction

Prabhpahul Singh
Department of Computer Science
Jaypee Institute of Information Technology
NOIDA, India
pahulronaldo@gmail.com

Ruchika Malhotra
Discipline of Software Engineering
Department of Computer Science & Engineering
Delhi Technological University
Delhi-110042, India
ruchikamalhotra@dce.edu

Abstract— The quality of software can be improved by determining the faulty portions of the software in the initial phases of software development. There are various machine learning techniques in the literature that can be used to create fault prediction models using object-oriented metrics. These models allow developers and software practitioners to predict faulty classes and concentrate the constrained resources on testing these weaker portions of the software. In this work we analyze and assess the predictive capability of six machine learning techniques. The results are validated using seven open source software systems.

Keywords— Fault Prediction; Machine Learning Techniques; Object-Oriented Metrics; Software Quality

I. INTRODUCTION

The prime aim of the software industry is to develop effective, quality software products which fulfill their requirements and satisfy customers. However, in order to do so, existing faults in the software should be removed as early as possible. A fault which is detected in the early phases of the software life cycle can be corrected more easily than a fault which is found in the later phases of the software development lifecycle [1]. It has been ascertained that the cost to correct a fault increases exponentially if it is detected in the later phases [1]. Thus, various researchers have rigorously developed and evaluated several software fault prediction models, which are capable of early detection of faults. Remedial actions can be effectively taken by software developers to modify the software product by removal of these faults. Moreover, software managers will also be able to effectively plan resource usage by assigning more resources to fault-prone components of a software system. These steps would ensure better quality software products at optimum costs.

While developing software fault prediction models, researchers have evaluated a wide range of software metrics. These metrics include process metrics, procedural metrics or Object-Oriented (OO) metrics [2]. Recent reviews of software fault prediction studies have ascertained OO metrics to be widely used in this domain [2-3]. Therefore, this study uses the Chidamber and Kemerer (CK) metrics suite [4], a popularly used OO metrics suite for developing defect prediction models [2-3]. The CK metrics suite contains metrics which represent various structural properties of an OO software system such as its cohesion, size, reusability etc. In order to develop fault prediction models, we also require fault data from previous versions of the software product for training. Thus, we develop software fault prediction models using previous defect data and OO metrics to identify components of a software system which are prone to faults in the upcoming releases of the software.

A key factor while developing software fault prediction models is the use of an efficient modeling technique. These modeling techniques are classification algorithms which learn from the provided historical fault data of the software and identify faulty classes in new versions on the basis of their learning. Traditionally, statistical techniques such as logistic regression (LR) were used for developing software prediction models. However, various fault prediction studies in the literature have advocated the use of Machine Learning (ML) techniques for this task [2,5]. ML techniques are capable of extracting worthwhile information from complex or difficult problem scenarios in less time [2]. Therefore, this study analyzes the applicability of these algorithms in the software fault prediction domain. Furthermore, the results of the ML algorithms are compared with those of the traditional statistical technique, LR. It may also be observed that the results obtained by ML algorithms vary on different datasets. Thus, it is important to validate them using datasets from different domains to confirm their effectiveness. Therefore, this study analyzes the capability of six ML algorithms, viz. Adaboost (AB), Bagging (BG), Decision Tree (J48), LogitBoost (LB), Naïve Bayes (NB) and Random Forest (RF), for developing software fault prediction models. Furthermore, the study compares the results of these ML algorithms with LR for developing defect prediction models. The comparison is performed statistically using the Friedman test. This study explores the following research questions (RQs):

RQ1: What is the effectiveness of ML algorithms (AB, BG, J48, LB, NB and RF) for developing prediction models which determine faulty classes in a software system?

RQ2: What is the comparative performance of ML algorithms (AB, BG, J48, LB, NB and RF) with the statistical technique LR for developing prediction models which determine faulty classes in a software system?

The above-mentioned research questions are answered by analyzing the performance of software fault prediction models on seven open-source datasets.
The performance of the fault prediction models is assessed using four measures, viz. precision, recall, f-measure and the Area Under the Receiver Operating Characteristic Curve (AUC). Though precision and recall are traditional performance measures, the use of AUC is supported by various studies [2,6] as it is a robust performance measure. The results point out the RF method as the best for developing software fault prediction models.

Section II of this study gives a brief overview of related literature. The research background and the various ML algorithms are discussed in Sections III and IV respectively. Section V describes the study's results, followed by threats to validity in Section VI. Future work and conclusions are mentioned in Section VII.

II. RELATED LITERATURE

Extensive research has been conducted in the domain of software defect prediction. A number of review studies have extensively evaluated the studies conducted in this domain on various parameters. A review study by Radjenović et al. [7] assessed 106 studies from 1991 to 2011 to evaluate the relevance of various software metrics for developing defect prediction models. According to their survey, object-oriented metrics have been widely used in literature studies as compared to process metrics or traditional metrics extracted from source code. A review by Catal and Diri [5] of 74 fault prediction studies ascertained that the use of publicly available datasets has increased significantly over the years. Moreover, the review confirmed ML techniques as popular choices for developing defect prediction models. This finding was supported by a recent review conducted by Malhotra [2]. She also confirmed the need for more studies which assess the comparative performance of ML techniques with statistical techniques in the domain of software defect prediction. Certain recent reviews have also investigated studies which use search-based techniques for developing software defect prediction models [8-9]. A review by Afzal and Torkar [8] assessed genetic programming for developing defect prediction models and another one by Malhotra et al. [9] analyzed the use of several search-based techniques for their effectiveness in this domain.

It may be noted that a wide range of techniques has been evaluated for software defect prediction, which includes statistical, ML and the recently explored search-based algorithms. Though statistical techniques have been found effective in this domain [10-12], the use of ML algorithms has yielded improved results [2]. Dejaeger et al. [13] assessed the use of several Bayesian network classifiers along with statistical and other common ML algorithms for developing defect prediction models. Several other studies too, such as the ones by Gyimothy et al. [14], Pai and Dugan [15], Vandecruys et al. [16] and Chen et al. [17], evaluated the use of both statistical and ML algorithms. A recent study by Tantithamthavorn et al. [18] used 12 model validation techniques with a statistical and two ML algorithms on 18 datasets.

As pointed out by Lessmann et al. [19], there were few studies in the literature which statistically assess the effectiveness of the developed defect prediction models. However, some of the recent key studies which use ML algorithms in this domain have conducted rigorous statistical analysis to determine their effectiveness. Malhotra [20] evaluated the efficiency of 18 ML algorithms for developing defect prediction models using the Friedman and post-hoc Nemenyi tests on several Android datasets by developing both within-project and inter-release validation models. A study by Arar and Ayan [21] also assessed the applicability of 3 ML and 3 search-based algorithms on four NASA datasets using the Friedman test. Similarly, a study by Harman et al. [22] evaluated the capability of a hybridized technique, i.e. Genetic Algorithms with Support Vector Machine, on Hadoop datasets using the Wilcoxon test. De Carvalho et al. [6] also evaluated the use of Multi-objective Particle Swarm Optimization along with several ML algorithms and assessed its capability using the Wilcoxon test in the domain of software defect prediction. Zhou et al. [23], Arisholm et al. [24], Okutan et al. [25] and Canfora et al. [26] also used statistical analysis to determine the effectiveness of ML algorithms in this domain.

III. RESEARCH BACKGROUND

The various software metrics used in the study, along with the dependent variable fault-proneness, are discussed in this section. The details of the various datasets used in the study along with the performance measures are also provided in this section.

A. OO Metrics and Fault-proneness

Fault-proneness is defined as the likelihood that a specific class would contain faulty code in the upcoming releases of the software product. It is binary in nature with the values "fault-prone" or "not fault-prone".

The metrics belonging to the CK metrics suite are used as predictors in this study. Another common measure of size known as Lines of Code (LOC), which counts the number of lines of source code in a class, is also used as an independent variable. The CK metrics suite contains two measures of reusability, viz. Number of Children (NOC), which represents the number of direct subclasses, and Depth of Inheritance Tree (DIT), which represents the position of the class in the inheritance hierarchy. The Coupling Between Objects (CBO) metric counts the number of classes coupled to a specific class. Also, another coupling measure, Response For a Class (RFC), represents the number of methods which can respond to a class's message. The cohesive nature of a class is addressed by the instance variables shared between its methods, which is captured by the Lack of Cohesion among Methods (LCOM) metric. Weighted Methods per Class (WMC) estimates the number of methods in a specific class.
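To make these metrics concrete, the small illustrative class below (a toy example, not drawn from any of the studied datasets) indicates what each measure captures; exact values depend on the conventions of the collection tool, here CKJM [29], so the comments only state the uncontroversial counts.

import java.util.ArrayList;
import java.util.List;

class Account {                         // NOC = 0: no class in this example extends Account
    private double balance;
    private List<String> history = new ArrayList<>();

    void deposit(double amount) {       // deposit() and getBalance() are the methods WMC counts
        balance += amount;              // (some tools also count the implicit constructor)
        history.add("deposit " + amount);
    }

    double getBalance() {               // both methods use the field balance, which keeps the
        return balance;                  // LCOM value of Account low
    }
}

// CBO reflects the distinct classes Account is coupled to (List, ArrayList, String),
// RFC additionally counts the external methods it can invoke (e.g. List.add), and
// DIT is Account's depth below java.lang.Object in the inheritance hierarchy.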
B. Datasets Description

In order to develop software fault prediction models, we use seven open source data sets in this study. All seven data sets are developed using Java. The description of these data sets in terms of the number of total classes and the number of faulty classes is given in Table I. Some of the data sets are extracted from the PROMISE repository [27], while others are collected using a data extraction tool developed by students of Delhi Technological University, named the Defect Collection and Reporting System (DCRS) tool [28]. The DCRS tool is capable of extracting data from projects which use Git as a version control repository. The tool computes OO metrics with the aid of the CKJM tool [29].

TABLE I. DATA SET DETAILS

Dataset Name     Number of data points   Number of classes with faults
Bsf              75                      55
Click            403                     85
Zuzel            30                      13
Xerces           441                     71
Ivy              614                     600
Log4j            351                     23
Wspomaganiepi    19                      12

C. Performance Measures

The study uses four performance measures for evaluating the developed software fault prediction models. The definitions of these performance measures are as follows:

• Recall (Rec.): It is the ratio of correctly predicted faulty classes amongst the actually present faulty classes. It is also known as sensitivity.

• Precision (Prec.): It is the ratio of correctly identified faulty classes amongst the predicted faulty classes.

• F-measure (FM): It is computed as the harmonic mean of precision and recall.

• AUC: The AUC is a stable performance measure as it simultaneously accounts for both recall and the percentage of correctly predicted non-faulty classes (specificity). It achieves an optimum cut-off which balances both recall and specificity.

IV. MACHINE LEARNING ALGORITHMS

In this section, we briefly discuss the various ML techniques used in the study. We used WEKA as a simulation tool for the investigated ML algorithms. The default parameter settings of the WEKA tool were used for each ML technique.

Boosting is the process of aggregating various weak classifiers in order to construct a strong classifier. Thus, boosting is an ensemble of various classifiers where each classifier is trained using slightly different training data. The AB algorithm uses boosting with a distribution of weights on the training data to yield effective results [30]. WEKA uses a seed value of 1, a weight threshold of 100 and decision stump as the base classifier for AB. The LB algorithm uses AB as an additive model along with the cost function provided by LR [30]. The default parameter settings for the LB technique in the WEKA tool include a likelihood threshold of -1.79, decision stump as the base classifier, a weight threshold of 100, a shrinkage of 1.0, 10 iterations, and a seed value of 1.

J48 is a decision tree algorithm which uses normalized information gain as the criterion for splitting [30]. For each of the predictor variables, the information gain is computed and the attribute with the highest information gain is designated as the root node. This process is performed recursively. The default parameter settings in WEKA were a confidence factor of 0.25, three folds, a seed value of 1 and a pruned tree.

NB is a Bayesian method which develops a classifier based on probability. Each of the predictor variables is assumed to be independent.

Bagging (BG) is another type of ensemble learning method which creates bootstrap samples from the original data by repeatedly sampling the dataset. The sampling is done in accordance with a uniform probability distribution. It may be noted that the size of a bootstrap sample is exactly the same as that of the original data. The parameters used by the WEKA tool for the BG technique include 100% as the bag size, REP tree as the base classifier, a seed value of 1 and ten iterations.

RF is also an ensemble learner which consists of several decision trees. The individual trees of the forest are constructed with various subsets of the training data. However, these subsets are constructed randomly and with replacement. The outcome of the forest is determined as the mode of the outputs generated by the individual constituent trees. A forest of 100 trees, a seed value of 1 and a maximum depth of 0 (unlimited) are used as parameter settings by the WEKA tool for RF.

V. RESULTS AND ANALYSIS

This section describes the results of the study along with answers to each of the investigated RQs.

A. RQ1: What is the effectiveness of ML algorithms (AB, BG, J48, LB, NB and RF) for developing prediction models which determine faulty classes in a software system?

The fault prediction models in the study are developed using ten-fold cross validation [31]. This strategy divides the given data set into ten disjoint subsets. A model is developed by providing nine of these subsets as the training data. The tenth remaining subset is used for validating the developed model. This process is repeated until the prediction values on each of the ten subsets are obtained, i.e. till all the subsets are used exactly once for the purpose of validation. As discussed in Section III.C, we evaluate the developed software fault prediction models using recall, precision, FM and AUC. Table II states the precision and recall values obtained by the developed software fault prediction models on all the datasets. The precision values of all the models ranged from 0.736 to 0.975, indicating effective fault prediction models.
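Concretely, the first three measures follow the usual confusion-matrix definitions, Precision = TP / (TP + FP), Recall = TP / (TP + FN) and FM = 2 × Prec × Rec / (Prec + Rec), while AUC is computed from the full receiver operating characteristic curve. The sketch below shows one way the setup of Sections IV and V.A could be reproduced with the WEKA Java API; it is a minimal illustration rather than the authors' original scripts, and the file name bsf.arff, the position of the class attribute and the index of the "fault-prone" label are assumptions about how a dataset might be prepared.

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.meta.Bagging;
import weka.classifiers.meta.LogitBoost;
import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FaultPredictionExperiment {
    public static void main(String[] args) throws Exception {
        // Assumed ARFF file: CK metrics and LOC as predictors, binary fault label last.
        Instances data = new DataSource("bsf.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // The six ML techniques and LR, all in WEKA's default configuration.
        Classifier[] models = { new AdaBoostM1(), new Bagging(), new J48(),
                                new LogitBoost(), new NaiveBayes(),
                                new RandomForest(), new Logistic() };
        String[] names = { "AB", "BG", "J48", "LB", "NB", "RF", "LR" };

        int faultProne = 0;  // assumed index of the "fault-prone" class value
        for (int i = 0; i < models.length; i++) {
            Evaluation eval = new Evaluation(data);
            // Ten-fold cross-validation with a fixed random seed, as in the study.
            eval.crossValidateModel(models[i], data, 10, new Random(1));
            System.out.printf("%-3s  Prec=%.3f  Rec=%.3f  FM=%.3f  AUC=%.3f%n",
                    names[i],
                    eval.precision(faultProne), eval.recall(faultProne),
                    eval.fMeasure(faultProne), eval.areaUnderROC(faultProne));
        }
    }
}

Note that WEKA's built-in defaults can differ slightly between releases (for example, the number of trees grown by RandomForest), so the parameter values quoted in Section IV should be checked against the WEKA version actually used.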
TABLE II. PRECISION AND RECALL VALUES

Technique   Bsf            Click          wspomaganiepi  Ivy            Log4j          Xerces         Zuzel
            Prec.  Rec.    Prec.  Rec.    Prec.  Rec.    Prec.  Rec.    Prec.  Rec.    Prec.  Rec.    Prec.  Rec.
AB          0.844  0.838   0.781  0.808   0.831  0.833   0.955  0.977   0.924  0.940   0.770  0.836   0.793  0.793
BG          0.818  0.811   0.757  0.796   0.774  0.778   0.955  0.977   0.878  0.937   0.790  0.834   0.758  0.759
J48         0.834  0.824   0.766  0.801   0.831  0.833   0.955  0.977   0.912  0.934   0.788  0.836   0.758  0.759
LR          0.794  0.797   0.774  0.803   0.778  0.778   0.955  0.977   0.943  0.949   0.736  0.832   0.758  0.759
LB          0.822  0.824   0.787  0.811   0.889  0.889   0.955  0.977   0.917  0.937   0.799  0.839   0.793  0.793
NB          0.741  0.595   0.773  0.803   0.808  0.778   0.975  0.579   0.911  0.906   0.792  0.820   0.828  0.828
RF          0.849  0.851   0.752  0.766   0.889  0.889   0.955  0.971   0.919  0.934   0.792  0.820   0.758  0.759

TABLE III. F-MEASURE AND AUC VALUES

Technique   Bsf            Click          wspomaganiepi  Ivy            Log4j          Xerces         Zuzel
            FM     AUC     FM     AUC     FM     AUC     FM     AUC     FM     AUC     FM     AUC     FM     AUC
AB          0.840  0.854   0.769  0.669   0.829  0.819   0.966  0.789   0.925  0.704   0.772  0.747   0.793  0.796
BG          0.814  0.829   0.753  0.741   0.764  0.813   0.966  0.799   0.907  0.603   0.796  0.795   0.757  0.776
J48         0.828  0.758   0.754  0.556   0.829  0.708   0.966  0.413   0.918  0.603   0.789  0.613   0.757  0.750
LR          0.796  0.801   0.751  0.682   0.778  0.752   0.966  0.797   0.934  0.628   0.766  0.583   0.757  0.773
LB          0.823  0.860   0.769  0.718   0.889  0.868   0.966  0.841   0.920  0.619   0.801  0.764   0.793  0.805
NB          0.615  0.807   0.772  0.673   0.784  0.852   0.712  0.842   0.909  0.604   0.803  0.795   0.827  0.788
RF          0.850  0.847   0.758  0.765   0.889  0.826   0.963  0.849   0.924  0.666   0.803  0.795   0.757  0.815
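As a consistency check between the two tables, each F-measure entry in Table III is simply the harmonic mean of the corresponding precision and recall values in Table II; for example, for AB on Bsf, 2 × 0.844 × 0.838 / (0.844 + 0.838) ≈ 0.841, which agrees with the tabulated 0.840 up to rounding.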

Similarly, the recall values in the majority of the cases ranged from 0.759 to 0.977. Only in two cases were the recall values below 0.60. Thus, the investigated ML algorithms along with the LR technique developed effective fault prediction models with acceptable precision and recall values. Table III states the FM and AUC values obtained by the developed fault prediction models. The FM values ranged from 0.615 to 0.966 for all the developed defect prediction models. Similarly, the AUC values in the majority of the cases ranged from 0.6 to 0.9. Only in three specific cases were the AUC values less than 0.6. In fact, in most of the cases, the AUC values were greater than 0.7. This indicates the effectiveness of the ML techniques and the statistical technique LR for developing software fault prediction models. The average values of f-measure and AUC obtained by all the techniques over all the datasets are depicted in Figure 1.

Fig. 1. Average AUC and FM values of techniques

According to the figure, the average AUC value obtained by only the J48 technique was less than that of the LR technique. In all other cases the average AUC values obtained by the ML techniques were greater than that of the LR technique. The average FM values obtained by the ML techniques were also comparable to that of the LR technique.

The ranges of AUC values for the ML techniques over all the investigated data sets were 0.669-0.854 for AB, 0.603-0.829 for BG, 0.413-0.758 for J48, 0.619-0.868 for LB, 0.604-0.852 for NB and 0.666-0.849 for RF. The range of AUC values for models developed using the LR technique was 0.583-0.801, which was lower than that of most of the ML techniques. This indicates the superiority of the ML techniques.

B. RQ2: What is the comparative performance of ML algorithms (AB, BG, J48, LB, NB and RF) with the statistical technique LR for developing prediction models which determine faulty classes in a software system?

In order to compare the performance of the six ML algorithms with the statistical technique LR, we use the Friedman statistical test on the AUC values. This is because literature studies advocate the use of AUC as a stable performance measure. The test was conducted at a cut-off of 0.05. We evaluated the following hypotheses:

Null Hypothesis (H0): The ML techniques (AB, BG, J48, LB, NB and RF) have similar capability when compared with the LR technique for developing software fault prediction models, when assessed using AUC values.

Alternate Hypothesis (H1): The ML techniques (AB, BG, J48, LB, NB and RF) have better capability when compared with the LR technique for developing software fault prediction models, when assessed using AUC values.
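For reference, the Friedman test applied here is the standard non-parametric procedure for comparing k techniques over N datasets: on each dataset the techniques are ranked by their AUC (rank 1 being the best), the mean rank R_j of every technique is computed across the datasets, and the test statistic

\chi_F^2 = \frac{12N}{k(k+1)} \Big[ \sum_{j=1}^{k} R_j^2 - \frac{k(k+1)^2}{4} \Big]

is referred to a chi-square distribution with k - 1 degrees of freedom. With k = 7 techniques and N = 7 datasets this yields the six degrees of freedom reported below, and the mean ranks R_j are the values listed in Table IV; substituting them gives a statistic in close agreement with the reported chi-square value of 24.698 (small differences arise from rounding of the tabulated ranks and from tie handling).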
As already discussed in RQ1, most of the ML techniques obtained better average AUC values than the LR technique. This observation supports the use of ML techniques in the domain of software fault prediction. Table IV states the mean ranks obtained by the different techniques after application of the Friedman test. It should be noted that the technique obtaining the lowest rank is the best. The Friedman results were significant with a p-value of less than 0.001, a chi-square value of 24.698 and six degrees of freedom. Therefore, we accept the H1 hypothesis, which indicates the superiority of the ML algorithms in the domain of software fault prediction. According to Table IV, the best rank was obtained by the RF technique, followed by the LB and NB techniques. The LR technique was poorer than most of the ML techniques except J48.

TABLE IV. FRIEDMAN RESULTS

Technique   Mean Rank using Friedman
RF          1.86
LB          2.57
NB          3.57
AB          3.86
BG          4.07
LR          5.29
J48         6.79

VI. THREATS TO VALIDITY

This section mentions the various threats to the validity of the study.

The threat to conclusion validity is addressed by validating the study's hypothesis with the help of the Friedman test at a cut-off of 0.05. Since the Friedman test is non-parametric in nature, we do not need to validate any assumptions on the underlying data before the application of the test. Also, the use of the ten-fold cross validation method reduces validation bias [6], thereby strengthening the obtained conclusions. The use of AUC as a performance measure has been advocated by various studies as it is a stable performance measure [2,6] which is capable of impartially assessing models developed using imbalanced data. This step also enhances the conclusion validity of the study. Internal validity concerns itself with the "causal" effect of OO metrics on defect-proneness. However, studies which can determine this effect are difficult to conduct in practice as such studies require highly controlled experimental environments. As determining the "causal" effect was not the aim of our study, we did not address this threat. The OO metrics used in the study have been used extensively for developing defect prediction models in the literature [2,3]. Thus, the use of these metrics is valid as they are accurate representatives of the "constructs" they represent, reducing the threat to construct validity. As the study evaluates open source datasets, they can be easily used by other researchers for replication and further experimentation. This increases the external validity of the study. It may also be noted that the evaluated datasets are representative of several domains, thus increasing the generalizability of the obtained results. However, all the investigated datasets are developed in the Java language, thus limiting the applicability of the results to datasets developed in Java.

VII. CONCLUSIONS AND FUTURE WORK

This study analyzed the effectiveness of six ML techniques, viz. Random Forest, LogitBoost, Naïve Bayes, Adaboost, Bagging and J48, for developing software fault prediction models, along with the statistical technique logistic regression (LR). The performance of the developed models was assessed using precision, recall, AUC and F-measure. The developed models were also compared using the Friedman test to determine whether ML techniques are better for developing software fault prediction models as compared to the ones developed using the LR technique. The conclusions can be summarized as follows:

• The ML techniques are effective in the domain of software fault prediction, as the majority of the AUC values ranged from 0.6 to 0.9 and the f-measure values ranged from 0.615 to 0.966.

• The Friedman results indicate that the Random Forest technique is the best ML technique amongst the investigated techniques. Furthermore, the models developed using the LR technique are poorer than those of most ML techniques except the J48 technique.

The results can help researchers in choosing an efficient technique for developing software fault prediction models. Also, they can be used for optimization and effective allocation of a project's resources. In future, we would like to investigate search-based techniques for developing efficient fault prediction models.

REFERENCES

[1] Y. Singh and R. Malhotra, Object-Oriented Software Engineering. PHI Learning Pvt. Ltd., New Delhi, 2012.
[2] R. Malhotra, "A systematic review of machine learning techniques for software fault prediction," Applied Soft Computing, vol. 27, pp. 504-518, 2015.
[3] D. Radjenović, M. Heričko, R. Torkar and A. Živkovič, "Software fault prediction metrics: A systematic literature review," Information and Software Technology, vol. 55, no. 8, pp. 1397-1418, 2013.
[4] S. Chidamber and C. Kemerer, "A metrics suite for object-oriented design," IEEE Transactions on Software Engineering, vol. 20, no. 6, pp. 476-493, 1994.
[5] C. Catal and B. Diri, "A systematic review of software fault prediction studies," Expert Systems with Applications, vol. 36, no. 4, pp. 7346-7354, 2009.
[6] A. B. De Carvalho, A. Pozo and S. R. Vergilio, "A symbolic fault-prediction model based on multiobjective particle swarm optimization," Journal of Systems and Software, vol. 83, no. 5, pp. 868-882, 2015.
[7] D. Radjenović, M. Heričko, R. Torkar and A. Živkovič, "Software fault prediction metrics: A systematic literature review," Information and Software Technology, vol. 55, no. 8, pp. 1397-1418, 2013.
[8] W. Afzal and R. Torkar, "On the application of genetic programming for software engineering predictive modeling: A systematic review," Expert Systems with Applications, vol. 38, no. 9, pp. 11984-11997, 2011.
[9] R. Malhotra, M. Khanna and R. R. Raje, "On the application of search-based techniques for software engineering predictive modeling: A systematic review and future directions," Swarm and Evolutionary Computation, vol. 32, pp. 85-109, 2017.
[10] L. C. Briand, J. Wüst, J. W. Daly and D. V. Porter, "Exploring the relationships between design measures and software quality in object-oriented systems," Journal of Systems and Software, vol. 51, no. 3, pp. 245-273, 2000.
[11] H. M. Olague, L. H. Etzkorn, S. Gholston and S. Quattlebaum, "Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes," IEEE Transactions on Software Engineering, vol. 33, no. 6, 2007.
[12] K. El Emam, S. Benlarbi, N. Goel and S. N. Rai, "The confounding effect of class size on the validity of object-oriented metrics," IEEE Transactions on Software Engineering, vol. 27, no. 7, pp. 630-650, 2001.
[13] K. Dejaeger, T. Verbraken and B. Baesens, "Toward comprehensible software fault prediction models using Bayesian network classifiers," IEEE Transactions on Software Engineering, vol. 39, no. 2, pp. 237-257, 2013.
[14] T. Gyimothy, R. Ferenc and I. Siket, "Empirical validation of object-oriented metrics on open source software for fault prediction," IEEE Transactions on Software Engineering, vol. 31, no. 10, pp. 897-910, 2005.
[15] G. J. Pai and J. B. Dugan, "Empirical analysis of software fault content and fault proneness using Bayesian methods," IEEE Transactions on Software Engineering, vol. 33, no. 10, pp. 675-686, 2007.
[16] O. Vandecruys, D. Martens, B. Baesens, C. Mues, M. De Backer and R. Haesen, "Mining software repositories for comprehensible software fault prediction models," Journal of Systems and Software, vol. 81, no. 5, pp. 823-839, 2008.
[17] N. Chen, S. C. Hoi and X. Xiao, "Software process evaluation: a machine learning framework with application to defect management process," Empirical Software Engineering, vol. 19, no. 6, pp. 1531-1564, 2014.
[18] C. Tantithamthavorn, S. McIntosh, A. E. Hassan and K. Matsumoto, "An empirical comparison of model validation techniques for defect prediction models," IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 1-18, 2017.
[19] S. Lessmann, B. Baesens, C. Mues and S. Pietsch, "Benchmarking classification models for software defect prediction: A proposed framework and novel findings," IEEE Transactions on Software Engineering, vol. 34, no. 4, pp. 485-496, 2008.
[20] R. Malhotra, "An empirical framework for defect prediction using machine learning techniques with Android software," Applied Soft Computing, vol. 49, pp. 1034-1050.
[21] O. F. Arar and K. Ayan, "Software defect prediction using cost-sensitive neural network," Applied Soft Computing, vol. 33, pp. 263-277, 2015.
[22] M. Harman, S. Islam, Y. Jia, L. L. Minku, F. Sarro and K. Srivisut, "Less is more: Temporal fault predictive performance over multiple Hadoop releases," in International Symposium on Search Based Software Engineering, pp. 240-246, 2014.
[23] Y. Zhou, B. Xu and H. Leung, "On the ability of complexity metrics to predict fault-prone classes in object-oriented systems," Journal of Systems and Software, vol. 83, no. 4, pp. 660-674, 2010.
[24] E. Arisholm, L. C. Briand and E. B. Johannessen, "A systematic and comprehensive investigation of methods to build and evaluate fault prediction models," Journal of Systems and Software, vol. 83, no. 1, pp. 2-17, 2010.
[25] A. Okutan and O. T. Yıldız, "Software defect prediction using Bayesian networks," Empirical Software Engineering, vol. 19, no. 1, pp. 154-181, 2014.
[26] G. Canfora, A. De Lucia, M. Di Penta, R. Oliveto, A. Panichella and S. Panichella, "Multi-objective cross-project defect prediction," in Software Testing, Verification and Validation (ICST), 2013 IEEE Sixth International Conference, pp. 252-261, 2013.
[27] PROMISE data repository, https://code.google.com/p/promisedata/
[28] R. Malhotra, K. Nagpal, P. Upmanyu and N. Pritam, "Defect Collection and Reporting System for Git based Open Source Software," in Proceedings of the International Conference on Data Mining and Intelligent Computing, pp. 1-7, 2014.
[29] CKJM tool, http://www.dmst.aueb.gr/dds/sw/ckjm/doc/index.html
[30] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, USA, 2005.
[31] M. Stone, "Cross-validatory choice and assessment of statistical predictions," Journal of the Royal Statistical Society, Series B, vol. 36, pp. 111-147, 1974.
