Predicting Project Delivery Rates Using The Naive-Bayes Classifier

JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE
J. Softw. Maint. Evol.: Res. Pract. 2002; 14:161–179 (DOI: 10.1002/smr.250)
Research
Predicting project delivery rates

using the Naive–Bayes classifier
B. Stewart∗,†
School of Computing and Information Technology, University of Western Sydney, Australia
SUMMARY
The importance of accurate estimation of software development effort is well recognized in software
engineering. In recent years, machine learning approaches have been studied as possible alternatives to
more traditional software cost estimation methods. The objective of this paper is to investigate the utility
of the machine learning algorithm known as the Naive–Bayes classifier for estimating software project
effort. We present empirical experiments with the Benchmark 6 data set from the International Software
Benchmarking Standards Group to estimate project delivery rates and compare the performance of the
Naive–Bayes approach to two other machine learning methods—model trees and neural networks. A project
delivery rate is defined as the number of effort hours per function point. The approach described is general
and can be used to analyse not only software development data but also data on software maintenance and
other types of software engineering. The paper demonstrates that the Naive–Bayes classifier has a potential
to be used as an alternative machine learning tool for software development effort estimation. Copyright 
2002 John Wiley & Sons, Ltd.
KEY WORDS : software effort estimation; Bayesian networks; machine learning; model trees; neural networks
1. INTRODUCTION
Accurate estimation of software development effort is crucial to the success of software development
projects. A project’s budget, planning, control, and management throughout the entire software
development lifecycle depend on reliable cost estimates. During the past 30 years estimation of
software development effort has received a significant amount of attention in software engineering
research. Many different estimation models have been developed, ranging from heuristic rule-of-
thumb approaches to formal mathematical models. The formal models can be grouped into two
broad categories: parametric models, and machine learning models. Parametric models represent the
∗ Correspondence to: Dr B. Stewart, School of Computing and Information Technology, University of Western Sydney,
Campbelltown Campus, Locked Bag 1797, Penrith South DC, NSW 1797, Australia.
† E-mail: b.stewart@uws.edu.au
Received 17 September 2001

Copyright  2002 John Wiley & Sons, Ltd. Revised 29 January 2002
162 B. STEWART
development effort as a parametrized function of predetermined cost factors, also referred to as metrics,
attributes, or cost drivers. The development effort of a new project is estimated by substituting for the
cost factors the actual project values. Model parameters are determined by calibration to historical data
on past projects. Some of the most well known parametric models are the COCOMO (Constructive
Cost Model) developed by Boehm [1,2], Albrecht’s function points [3], and the SLIM model developed
by Putnam [4].
During the past decade there have been a number of research studies published in the literature on
the use of machine learning techniques for estimating software development effort. The methods used
include decision trees [5–8], neural networks [7–9], and reasoning by analogy [10]. Machine learning
methods construct a model from a database of past projects which is then used to predict the software
development effort for new projects. In this paper we examine the use of another machine learning
algorithm for software development effort estimation—the Naive–Bayes classifier—and compare its
performance to two other methods—model trees and artificial neural networks. The Naive–Bayes
classifier is a well-known machine learning algorithm that has proved to have excellent classification
performance on small data sets. A brief overview of this algorithm and how it can be used for software
development effort estimation is given in Section 2. More detailed information on Naive–Bayes and
other Bayesian classifiers can be found in [11].
Software engineering data sets often contain many variables, some of which are only weakly related
to the variable of interest such as software effort. It is usually necessary to pre-process the data set in
various ways and select a subset of variables that show strong relationships to the variable for which
predictions are to be made. In our experimental work we used the mutual information measure to
select subsets of variables to include in the Naive–Bayes classifier. The mutual information measure
is a statistical measure that indicates the strength of association between a pair of random variables.
This measure has been used for finding significant relationships in data by several other researchers
[11–13].
We carried out empirical experiments using the data set Benchmark Release 6 from the International
Software Benchmarking Standards Group (ISBSG) [14]. The projects in the data set are sized in
terms of function points rather than lines of code. The data set is provided with a report describing
the variables and presenting a statistical analysis of the cost factors affecting project delivery rates.
A project delivery rate is defined as the number of hours per function point. Due to the nature of the
data set, we have focused on estimating project delivery rates rather than total project effort.
The paper makes two main contributions: (1) it demonstrates that the Naive–Bayes classifier has
the potential to be used as an additional technique for the prediction of software development and
maintenance effort; and (2) it shows that the mutual information measure is a useful measure for
selecting significant variables for the construction of Naive–Bayes classifiers. The approach described
in the paper is general and could be used to estimate values of any variables of interest provided that
sufficient historical data are available.
The remainder of the paper is organized as follows: Section 2 describes general Bayesian network
classifiers and the Naive–Bayes classifier as a special case of the Bayesian network classifier, Section 3
overviews model trees as an alternative method for project delivery rate estimation, Section 4
introduces neural networks, Section 5 introduces the concept of mutual information measure and its use
for selecting significant variables, Section 6 describes our experimental work, and Section 7 concludes
the paper and outlines our future work.
Copyright  2002 John Wiley & Sons, Ltd. J. Softw. Maint. Evol.: Res. Pract. 2002; 14:161–179
PREDICTING PROJECT DELIVERY RATES USING THE NAIVE–BAYES CLASSIFIER 163
2. BAYESIAN NETWORK CLASSIFIERS
A fundamental problem in machine learning, data analysis, and pattern recognition is classification of
observed instances into predetermined categories or classes. For example, software projects could be
classified into categories according to their project delivery rate values. In this case the categories would
be subranges (intervals) of project delivery rate values. Classification would determine the interval for
a new project on the basis of the observed values of its remaining characteristics.
In machine learning, classification algorithms are grouped into two broad groups—supervised and
unsupervised classifiers. In supervised classifiers the categories into which the cases are to be assigned
must have been established prior to classification. In unsupervised classifiers no predetermined
categories are used, the necessary categories are determined by the algorithm itself. In this paper we
use the term classification to represent supervised classification, unless stated otherwise.
A classification task requires a database of cases from which is constructed a classification model
such as the Naive–Bayes model. This database is referred to as a training set and contains measurement
data on cases observed in the past together with their actual categories. The variable whose values are
the case categories (classes) is referred to as the class variable. The classification model derived from
the training data is then used to predict the categories of instances whose category is unknown. The
classification problem has been widely studied in statistics and artificial intelligence (AI) and a variety
of different classification methods have been developed. Some of the most popular approaches used
in AI include decision trees [15,16], neural networks [17], nearest-neighbour classifiers [18], genetic
algorithms [19] and more recently Bayesian network classifiers [11].
2.1. Bayesian networks
General Bayesian network classifiers are known as Bayesian networks, belief networks or causal
probabilistic networks. The theoretical concepts of Bayesian networks were invented by Judea Pearl in
the 1980s and are described in his pioneering book Probabilistic Reasoning in Intelligent Systems [13].
During the past decade Bayesian networks have gained popularity in AI as a means of representing
and reasoning with uncertain knowledge. Examples of practical applications include decision support,
safety and risk evaluation, control systems, and data mining [20]. In the software engineering field,
Bayesian networks have been used by Fenton [21] for software quality prediction. A wealth of articles
on this area of research can be found on the Agena Web site [21].
The state-of-the-art research papers on Bayesian networks are published in the proceedings of
the Annual Conference on Uncertainty in AI [22]. Theoretical principles of Bayesian networks are
described in several books, for example [13,23–26].
A Bayesian network consists of two components: (1) a directed acyclic graph representing the
structure of an application domain, and (2) conditional probability distributions associated with the
vertices in the graph.
The vertices of the graph represent the domain variables and the directed edges the relationships
between the variables. With every vertex is associated a table of conditional probabilities of the
vertex given each state of its parents. We denote a conditional probability table using the notation
P (xi |par(Xi )), where lower case xi denotes values of the corresponding random variable Xi and
par(Xi ) denotes a state of the parents of Xi . The graph together with the conditional probability
tables define the joint probability distribution contained in the data. Using the probabilistic chain rule,
164 B. STEWART
B C
Figure 1. A Bayesian network.
the joint distribution can be written in the product form P (x1 , x2 , . . . , xn ) =

ni=1 P (xi |par(Xi )),
where n is the number of vertices in the graph. An example of a simple Bayesian network is given in
Figure 1. The corresponding joint probability distribution can be written in the form P (a, b, c, d) =
P (a)P (b|a)P (c|a, b)P (d|b, c).
General unrestricted Bayesian networks may be regarded as unsupervised classifiers in the sense
that there is no specific variable designated as the class variable. In a Bayesian network all variables
are treated in the same way and any one can be regarded as the class variable. Classification in
a Bayesian network classifier involves performing probabilistic inference on the Bayesian network
using one of the available probabilistic inference algorithms, for example the algorithm of Lauritzen
and Spiegelhalter [27]. Probabilistic inference computes for each vertex in the graph the posterior
probability distribution P (xi |evidence), where xi represents the values of the variable Xi and evidence
represents a set of observed values of the remaining variables.
2.2. Naive–Bayes classifier
The Naive–Bayes classifier is a special case of Bayesian network classifier, derived by assuming that
the variables are conditionally independent given the class variable. Unlike the general Bayesian
network classifier, Naive–Bayes is a supervised classifier because one of the variables must be
designated as the class variable. The graphical structure of Naive–Bayes is represented by a tree
in which the class variable is the root and the remaining variables are the leaves. Directed edges
connect the root to the leaves. It is assumed that each variable is conditionally independent of the
remaining variables given the class variable. Classification in Naive–Bayes computes the posterior
probability distribution of the class variable given observed values of the remaining variables,
P (c|x1 , x2 , . . . , xn ). Unlike in a general Bayesian network, in Naive–Bayes the posterior probability
distribution P (c|x1 , x2 , . . . , xn ) can be computed efficiently from Bayes theorem:
P (c, x1 , x2 , . . . , xn )
P (c|x1 , x2 , . . . , xn ) = (1)
P (x1 , x2 , . . . , xn )
where
P (c, x1 , x2, . . . , xn ) = P (c)P (x1 |c)P (x2 |c) . . . P (xn |c) (2)
X Y
Figure 2. A Naive–Bayes network.
and
P (x1 , x2 , . . . , xn ) = P (c, x1 , x2 , . . . , xn ) (3)
c
The Naive–Bayes classifier is relatively simple to implement, efficient, robust with respect to noisy
or missing data, and performs surprisingly well in many domains. For small data sets it frequently
outperforms even more sophisticated state-of-the-art decision tree classifiers [16]. Some comparative
empirical studies are reported in [11].
Example. We illustrate the computations performed by the Naive–Bayes algorithm by means of a

simple example illustrated in Figure 2.
The graph in Figure 2 shows the structure of the Naive–Bayes classifier in which the variable C is
the class variable. Using the chain rule, the joint probability distribution corresponding to the graph
can be written in the form P (c, x, y) = P (c)P (x|c)P (y|c).
For simplicity, we assume that the variables C, X, and Y are binary and take on the values of 0 and 1.
We also assume that the probability distributions P (c), P (x|c), and P (y|c) have been computed from
the training data and have the values given in Tables I–III.
The Naive–Bayes classifier computes the conditional probability distribution P (c|x = x0 , y = y0 )
for some observed values x0 and y0 of the variables X and Y , respectively. Suppose that the observed
values are x0 = 0 and y0 = 1. Then the Naive–Bayes algorithm computes the conditional probabilities
P (c = 0|x = 0, y = 1) and P (c = 1|x = 0, y = 1) as follows:
P (c = 0)P (x = 0|c = 0)P (y = 1|c = 0)
P (c = 0|x = 0, y = 1) = (4)
sum
P (c = 1)P (x = 0|c = 1)P (y = 1|c = 1)
P (c = 1|x = 0, y = 1) = (5)
sum
sum = P (c = 0)P (x = 0|c = 0)P (y = 1|c = 0) + P (c = 1)P (x = 0|c = 1)P (y = 1|c = 1) (6)
Substituting the values from Tables I–III yields the distribution in Table IV.
The observed sample (x0 = 0, y0 = 1) will be classified into the category corresponding to the
larger probability value, in this case c = 1.
166 B. STEWART
Table I. The probability distribution of the variable C.
c P (c)
0 0.2
1 0.8
Table II. The conditional probability distribution P (x|c).
x c P (x|c)
0 0 0.1
0 1 0.3
1 0 0.9
1 1 0.7
Table III. The conditional probability distribution P (y|c).
y c P (y|c)
0 0 0.2
0 1 0.6
1 0 0.8
1 1 0.4
Table IV. The resulting conditional distribution P (c|x = 0, y = 1).
c x y P (c|x, y)
0 0 1 0.143
1 0 1 0.857
2.3. Estimating project delivery rates using the Naive–Bayes classifier
The estimation of project delivery rate value of a new project involves a prediction of the project
delivery rate on the basis of the observed values of the remaining project characteristics (variables).
The Naive–Bayes algorithm is a classification algorithm which computes the conditional probability
distribution of the class variable given the observed values of the remaining variables. We assume
that the class variable has discrete values, referred to as categories or classes. Naive–Bayes gives a
probability value for each category of the class variable indicating how likely it is that the observed
instance belongs to that category. For example, assuming that the project delivery rate variable has
been discretized into several intervals, Naive–Bayes can compute the probability that a given project’s
project delivery rate falls within each of these intervals. However, it cannot predict the actual value of
the project delivery rate.
To use Naive–Bayes for estimating project delivery rates, we have to adapt the algorithm for
prediction tasks. The goal is to predict the value of the class variable for a given observed instance,
rather than to classify that instance into one of several possible categories.
The value of the class variable (project delivery rate) may be approximated by its expected value
using the formula below.

i=n
E(pdr|instance) = mid i ∗ P (pdri |instance) (7)
i=1
The formula assumes that the project delivery rate variable was discretized into n intervals,
mid i denotes the mid point of the ith interval, computed as mid i = (loweri + upperi )/2, and
P (pdri |instance) denotes the conditional probability of the ith interval given the observed instance.
The midpoint of each interval approximates the interval’s project delivery rate value. The expected
value of the project delivery rate is computed as a weighted sum of the project delivery rates of the
individual intervals. The weights in the formula are the conditional probabilities P (pdri |instance)
computed by Naive–Bayes.
The limitation of this approach is that the resulting expected value may be affected significantly
by the width of the intervals. If too few intervals are chosen, midi values may be too far apart and
may poorly approximate the actual project delivery rate values. If too many intervals are chosen and
the training sample size is too small, the computed conditional probabilities P (pdri |instance) may be
inaccurate. This problem could be reduced by increasing the size of the training data set.
To assess the predictive performance of Naive–Bayes relatively to other methods, we used two other
prediction methods to estimate project delivery rates—model trees and neural networks.
3. ESTIMATING PROJECT DELIVERY RATES USING MODEL TREES
Model trees are a type of decision trees designed for prediction of numeric quantities rather than for
classification. There are two basic types of decision trees—classification trees and regression trees.
Classification trees are used for predicting classes (categories of the class variable) of test instances
while regression trees are used for predicting numeric values of the class variable. In classification
trees the class variable must have discrete values while in regression trees it must have numeric values.
Both classification and regression trees are constructed in the same way by recursive partitioning of the
training data set. Many different tree building algorithms have been developed, with two of the most
widely used systems being CART [15] and C4.5 [16]. The main difference between alternative tree
learning algorithms is in the way in which they determine the splitting variables on which to partition
the data.
Regression trees were first introduced by Breiman et al. [15]. In such trees the leaf nodes contain
a predicted value of the class variable which is computed by averaging all the training set values that
reach that leaf. To use a regression tree to predict the value of the class variable for a given test instance,
168 B. STEWART
the tree is traversed from the root towards the leaves until some leaf node has been reached. With each
internal node in the tree is associated a condition that determines which branch is to be followed next.
When a node is visited the condition is evaluated using the values of the test instance. Depending on
the outcome of the condition a branch is selected to the next node in the next lower level of the tree.
This process continues until a leaf node has been reached. The value at the leaf is the predicted value
for the given test instance.
Model trees were introduced by Quinlan [28] in 1992. They extend the idea of regression trees by
combining regression trees with regression equations. The leaf nodes in a model tree contain linear
regression models rather than single numeric values. To use a model tree to predict the project delivery
rate for a given project, the tree is traversed from the root to a leaf in the normal way and when a leaf
is reached the regression model is evaluated to obtain a raw predicted value. Rather than using the raw
value directly a smoothing formula is applied to the raw value to compute a smoothed predicted value.
Smoothing yields more accurate predictions.
For our experiments with model trees we used the M5 algorithm from the Weka machine learning
package developed at the University of Waikato [29]. We adapted the system to also compute the mean
magnitude of relative error and the PRED(x) measures. An excellent description of this package is
given in [30] together with the theoretical details of decision trees and many other machine learning
techniques.
4. ESTIMATING PROJECT DELIVERY RATES USING NEURAL NETWORKS
Neural networks are powerful pattern recognition tools that can be applied to solving prediction,
classification and clustering problems. Some of the basic ideas of neural networks originated in
research in neurophysiology in the 1940s. However, the most significant developments of neural
networks occurred during the 1980s after John Hopfield invented the back-propagation algorithm for
training neural networks in 1982. Since then they have become one of the most important tools for
pattern recognition and machine learning. A detailed description of neural network concepts is beyond
the scope of this article. We give only a brief outline of the feed forward neural network architecture
that we used in our experiments.
A feed forward neural network consists of nodes organized in layers and links connecting the nodes
in adjacent layers. The most common configuration is comprised of an input layer, one or more hidden
layers, and an output layer. The nodes in the input layer correspond to network inputs, and the nodes
in the output layer to network outputs. The number of hidden layers and the number of hidden nodes
are chosen arbitrarily. In general, increasing the number of hidden nodes increases the power of the
network to recognize patterns but may also lead to over-fitting. Over-fitting occurs when the weights
become too specialized to the training data and as a result the network’s ability to predict new samples
is reduced. To prevent over-fitting, for small data sets the number of hidden nodes should be relatively
small. The nodes in each layer are connected to the nodes in each adjacent layer by links. Figure 3
illustrates the basic architecture of the feed forward neural networks that we used in our experiments.
The network contains an input layer, a single hidden layer, and an output layer containing a single
node. The nodes in the input layer correspond to the variables in the data set, the hidden layer contains
hidden nodes, and the single node in the output layer represents the project delivery rate variable. In our
empirical experiments we used five hidden nodes and the number of input nodes varied in different
Input 1
Input 2 Output
Input 3
Input Layer Hidden Layer Output Layer
Figure 3. A feed forward neural network.
experiments. With each link i–j connecting the nodes i and j is associated a numerical weight wij ,
and with each node in the hidden and output layers is associated an activation function which computes
the node’s output. The most widely used activation function is a sigmoid function of the form
1
(8)
1 + exp(−x)

where x = i wij Inpij is the weighted sum of inputs into a node j .
All network inputs must be numeric values. If a data set contains non-numeric variables, these have
to be transformed into a numeric form. To achieve better performance all the variables are usually
scaled between 0 and 1.
Using a neural network for predicting project delivery rates involves two steps: (1) training the
network using a training data set; and (2) applying the network to predict the value of the project
delivery rate for a given project. During the training stage the weights of the links are learned from
the training data by means of the back-propagation algorithm. The weights are initially set to small
randomly generated values and are then iteratively updated. The data set is repeatedly traversed and
during each traversal the back-propagation algorithm updates the weights so that the total prediction
error is minimized. Mathematical details of the back-propagation algorithm are beyond the scope of
this article, we refer the reader to [17] for details.
After the network has been trained it can be used for prediction. To predict the project delivery rate
for a given project, the values of the project’s variables are entered as inputs into the nodes in the
input layer. Input layer nodes only copy their input values to their output values. The output values
from the input nodes become inputs into the hidden layer nodes. Each node in the hidden layer is
connected to every input node in the input layer. Hidden nodes compute their output by multiplying
the outputs from the input layer nodes by the corresponding link weights, summing the products, and
applying the sigmoid activation function to produce output. The single output node in the output layer
is connected to each hidden node in the hidden layer. The output from the output node is computed by
multiplying the outputs from the hidden nodes by the corresponding weights, adding up the products,
170 B. STEWART
and applying the sigmoid activation function. The value returned by the output node is the resulting
project delivery rate.
5. MUTUAL INFORMATION MEASURE
A database of empirical data may contain a large number of variables some of which may not
be relevant to the given classification task. The presence of redundant variables may reduce the
performance of some classification algorithms in terms of execution speed and accuracy. The execution
speed of most classification algorithms is inversely related to the number of variables in the model.
The more variables there are the slower the speed. Furthermore, redundant variables may affect
classification or prediction accuracy by causing over-fitting. The model will contain too many
parameters relative to the amount of training data available, and as a consequence will compute
parameter values that are too specialized to the training data and have poor generalization capabilities.
Decision tree models resolve this problem by pruning the decision trees [16].
In the experiments reported in this paper we also investigated the effects of reducing the number of
variables on the performance of the Naive–Bayes classifier in the context of predicting project delivery
rates. We performed comparative experiments with a full set of variables and two smaller subsets of the
original variables. To select a subset of variables that have a significant effect on the class variable we
used the mutual information measure. A variety of other approaches have been proposed in statistics
and AI for variable selection [31]. Our choice was motivated by the fact that the mutual information
measure has been used successfully by several researchers in AI for identifying strong relationships in
data [11] and that it can be computed relatively efficiently in O(n2 ) time, where n is the number of
variables in the data set. The mutual information measure is given by the formula
P (xi , xj )
I (Xi , Xj ) = P (xi , xj ) log (9)
i,j
P (xi )P (xj )
where Xi , Xj are random variables, P (xi , xj ) denotes the joint probability distribution of Xi and Xj ,
and P (xi ), P (xj ) denote marginal distributions.
6. EXPERIMENTAL WORK
The goal of our experimental work was to assess the feasibility of the Naive–Bayes classifier for
prediction of project delivery rates. We carried out experiments using the data disk ‘The Benchmark
Release 6’ purchased from ISBSG [14]. The data disk includes a report presenting a statistical analysis
of the factors affecting project delivery rates (PDR), where a project delivery rate is defined as the
number of hours per function point. Due to the nature of this database, we focused on estimating
project delivery rates rather than summary effort values.
To compare the predictive performance of Naive–Bayes to other prediction algorithms, we carried
out the same experiments using two alternative methods—model trees and artificial neural networks.
The results of these experiments are presented in Tables V and VI.
6.1. Data description
The data set contains data on 789 projects from 20 countries, drawn from many different industries
and business areas. Most of the projects are less than five years old. The data set contains 55 variables,
including discrete and numeric variables. Some projects do not provide values for all the variables.
6.2. Data preparation
The format of the data in the data disk differed from the format required by our software and hence pre-
processing was required. This included transformation of some variables into several simpler variables,
removal of some variables, and addition of the project delivery rate variable. Project delivery rate values
were computed by dividing the summary effort in hours by the number of function points.
Due to significant differences in project delivery rates in different industry types, we extracted from
the data set the data for seven types of organizations for which there was an adequate number of
projects and performed experiments with these smaller data sets. We used the following organization
types: (1) Banking; (2) Communication; (3) Electricity/gas/water; (4) Financial/property/business;
(5) Insurance; (6) Manufacturing; and (7) Public administration.
For each data set it was necessary to transform the numeric variables into discrete variables since
our current implementation of the Naive–Bayes classifier is designed to handle only discrete-valued
variables. We used equal-interval discretization for this purpose. All numeric variables were discretized
into 10 intervals. The pre-processing operations resulted in data sets of 65 variables (64 input variables
plus the class variable). Additionally, for each data set two smaller subsets of variables (16 and 8,
respectively) were selected using a combination of mutual information measure and our own judgment.
For the experiments with model trees and neural networks we used the same data sets as for
Naive–Bayes except that the project delivery rate variable was not discretized. For the neural network
experiments, all the symbolic variables were transformed to numeric variables and all the variables
scaled between 0 and 1.
6.3. Accuracy measures
To assess the accuracy of predicted values of project delivery rate, we used the mean magnitude of
relative error (MMRE) and the PRED(x) measures that are commonly used in software cost estimation
research. The MMRE measure is defined as the average of the magnitudes of relative errors (MRE) of
the n samples selected from the data set:

actuali − predictedi
MREi = (10)
actuali
1i=n
MMRE = MREi (11)
n i=1
A problem with the MMRE is that its value can be strongly influenced by a few very large MRE
values. For this reason, the accuracy is also usually measured in terms of the PRED(x) measure defined
as the fraction or percentage of the samples with MRE ≤ x. For example, PRED(0.25) is the fraction of
172
Table V. The results of experiments using 10-fold cross-validation.

B. STEWART
PRED(0.25) PRED(0.50) PRED(0.75) MMRE

Experiment NB MT NN NB MT NN NB MT NN NB MT NN
Banking 1 0.41 0.40 0.29 0.68 0.66 0.53 0.82 0.75 0.69 0.52 0.67 0.96
Banking 2 0.52 0.41 0.32 0.72 0.63 0.58 0.87 0.78 0.73 0.41 0.66 0.62
Copyright  2002 John Wiley & Sons, Ltd.

Banking 3 0.46 0.41 0.36 0.71 0.67 0.64 0.86 0.78 0.77 0.42 0.63 0.65
Communication 1 0.35 0.29 0.38 0.55 0.59 0.53 0.75 0.68 0.55 0.62 0.91 0.92
Communication 2 0.28 0.29 0.23 0.48 0.44 0.53 0.70 0.66 0.65 0.84 0.99 0.94
Communication 3 0.35 0.32 0.23 0.53 0.49 0.48 0.82 0.66 0.60 0.55 0.99 0.90
Electricity/gas/water 1 0.33 0.19 0.15 0.60 0.36 0.30 0.77 0.62 0.65 1.01 1.34 1.98
Financial/property/business 1 0.17 0.32 0.19 0.43 0.54 0.34 0.66 0.71 0.49 1.36 1.22 1.86
Insurance 1 0.34 0.31 0.19 0.61 0.58 0.44 0.80 0.73 0.61 0.61 0.77 0.89
Insurance 2 0.41 0.35 0.28 0.63 0.58 0.58 0.76 0.72 0.74 0.64 0.75 0.69
Insurance 3 0.35 0.33 0.34 0.65 0.61 0.56 0.78 0.74 0.68 0.65 0.76 0.78
Manufacturing 1 0.38 0.43 0.38 0.63 0.71 0.62 0.80 0.80 0.80 0.65 0.63 0.64
Manufacturing 2 0.38 0.32 0.45 0.60 0.64 0.60 0.80 0.77 0.73 0.63 0.68 0.59
Manufacturing 3 0.30 0.25 0.47 0.55 0.52 0.70 0.73 0.71 0.90 0.76 0.79 0.36
Public administration 1 0.16 .18 0.22 0.40 0.36 0.38 0.66 0.53 0.65 1.14 1.87 1.46
Public administration 2 0.20 0.15 0.15 0.44 0.35 0.30 0.67 0.55 0.52 1.30 2.53 1.98
J. Softw. Maint. Evol.: Res. Pract. 2002; 14:161–179

Table VI. The results of experiments using training data.
PRED(0.25) PRED(0.50) PRED(0.75) MMRE

Experiment NB MT NN NB MT NN NB MT NN NB MT NN
Banking 1 0.89 0.51 0.96 0.96 0.76 0.98 0.98 0.85 0.99 0.13 0.49 0.09
Banking 2 0.73 0.56 0.77 0.89 0.75 0.92 0.93 0.81 0.95 0.23 0.46 0.20
Copyright  2002 John Wiley & Sons, Ltd.

Banking 3 0.68 0.46 0.60 0.85 0.76 0.87 0.93 0.84 0.92 0.29 0.51 0.29
Communication 1 0.78 0.32 0.93 0.87 0.76 0.99 0.97 0.85 1.00 0.20 0.58 0.07
Communication 2 0.67 0.37 0.74 0.79 0.61 0.88 0.91 0.66 0.91 0.29 0.79 0.29
Communication 3 0.62 0.54 0.60 0.76 0.76 0.79 0.90 0.85 0.86 0.32 0.59 0.42
Insurance 1 0.72 0.47 0.81 0.84 0.68 0.93 0.89 0.80 0.95 0.30 0.56 0.17
Insurance 2 0.49 0.42 0.59 0.68 0.72 0.79 0.78 0.79 0.84 0.53 0.60 0.43
Insurance 3 0.50 0.43 0.51 0.69 0.69 0.75 0.78 0.78 0.83 0.53 0.66 0.52
Manufacturing 1 0.93 0.75 0.96 0.96 0.89 0.99 0.99 0.93 1.00 0.09 0.22 0.06
Manufacturing 2 0.79 0.55 0.89 0.93 0.82 0.96 0.96 0.89 0.98 0.18 0.35 0.12
Manufacturing 3 0.82 0.48 0.86 0.91 0.73 0.95 0.94 0.86 0.98 0.19 0.40 0.14
PREDICTING PROJECT DELIVERY RATES USING THE NAIVE–BAYES CLASSIFIER
J. Softw. Maint. Evol.: Res. Pract. 2002; 14:161–179

173
174 B. STEWART
the samples with MRE ≤ 0.25. To provide more detailed information, in this paper we also presented
the values for PRED(0.50) and PRED(0.75).
According to software engineering literature [32], a formal cost model is considered to function
well if its PRED(0.25) is greater than 0.75. Some researchers consider an MMRE of less than 0.25
to be good, while Boehm [1] suggested that the MMRE should be 0.10 or less. Most algorithmic and
machine learning models reported in the literature have been unable to satisfy these criteria for a variety
of projects and usually require a careful calibration for specific organizations.
6.4. Cross-validation
In machine learning a popular method for evaluating accuracy of classification algorithms is cross-
validation. The basic idea of this method is to divide the original data set into a predetermined number v
of subsets of as nearly equal size as possible and then repeatedly train and test the classifier. In each
iteration one of the subsets is selected as the test set and the remaining subsets are included in the
training set. Hence testing is performed on previously unseen data. For example, in 10-fold cross-
validation the data set is split into 10 nearly equal subsets and the train/test cycle is repeated 10 times,
each time using a different subset as the test set and the remainder of the data as the training set.
The prediction accuracy is computed by averaging the accuracies over the 10 iterations.
6.5. Summary of our approach
The major steps of our approach to estimating project delivery rates using the Naive–Bayes classifier
are summarized below.
Step 1. Pre-process the data. This may involve deletion, addition, transformation, and discretization of
variables. The project delivery rate variable must be discretized into discrete values.
Step 2. Create training data sets and test data sets. For cross-validation this is done by splitting the
original data set into a predefined number of subsets of equal size as described in Section 6.4.
For experiments with training data the whole data set is used as both training set and test set.
Step 3. Construct the Naive–Bayes model for this data set.
Step 4. Train the Naive–Bayes model using a training data set. This step computes the required
conditional probabilities.
Step 5. Use the model to predict project delivery rate for the projects in the associated test set as
explained in Section 2.3, and compute the accuracy measures.
Step 6. For cross-validation repeat the steps 4–6 until all the training sets have been processed.
Step 7. Compute the resulting accuracy measures by averaging the values of the measures over all test
sets.
6.6. Experimental results
This section describes the experimental results obtained using the data sets created from the Benchmark
Release 6 data disk. For each data set we performed three experiments using 10-fold cross-validation
as described in Section 6.4, and three experiments in which the models were tested on the training data
from which they were derived. The experiments labelled 1, 2, and 3 differed in the number of input
Table VII. The data sets used for the experiments.
Data set Size The seven variables used in experiment 3

Banking 91 (1) Primary programming language, (2) Project scope, (3) How
methodology acquired, (4) Used methodology, (5) Reference table
approach, (6) Upper CASE tool used, (7) Application type
Communication 41 (1) Primary programming language, (2) Application type, (3) FP
standards, (4) Business area type, (5) Recording method, (6) Upper
CASE tool used, (7) Development platform
Electricity/gas/water 47 (1) Primary programming language, (2) FP standards, (3) Max-
imum team size, (4) How methodology acquired, (5) Reference
table approach, (6) Business area type, (7) Resource level
Financial/property/business 78 (1) Primary programming language, (2) Maximum team size,
(3) FP standards, (4) Application type, (5) Business area type,
(6) Development platform, (7) Language type
Insurance 81 (1) Primary programming language, (2) Application type,
(3) Language type, (4) Business area type, (5) User base locations,
(6) Lower CASE tool with code generally used, (7) FP standards
Manufacturing 44 (1) Primary programming language, (2) FP standards, (3) Business
area type, (4) Application type, (5) Lower CASE tool with code
generally used, (6) Upper CASE tool used, (7) Language type
Public administration 70 (1) Primary programming language, (2) Business area type,
(3) Application type, (4) FP standards, (5) User base locations,
(6) Maximum team size, (7) User base business units
variables included in the model. They used 64, 15, and seven input variables respectively, plus the class
variable. A brief description of the data sets is given in Table VII. The results of the cross-validation
experiments are presented in Table VI and the results of the experiments with training data in Table VII.
In experiments 2 and 3 with each data set we used a reduced number of variables by excluding those
variables for which the values would be difficult to estimate at the start of a project and including only
the variables for which the values would be known at the beginning of a project. We selected the subsets
of variables by using a combination of the mutual information measure and our own judgment. For each
variable Xi we computed the value of the mutual information measure with the project delivery rate
variable, I (Xi , PDR), and then selected a subset of variables from those with the largest I (Xi , PDR)
values. Selecting only the variables with the largest values of the mutual information measure could
also include in the model variables for which values could not be obtained in early stages of the project
or which are strongly correlated with project delivery rate. For example, summary effort and project
elapsed time were used to compute the project delivery rate variable and hence are strongly correlated
to it. The variables such as work effort for planning, testing, and implementation would be difficult to
estimate at the start of a project and therefore were not included in the models. In experiment 2 we used
15 variables and in experiment 3 seven variables, plus the project delivery rate variable. For illustration,
for each data set the seven variables used in experiment 3 are listed in Table VII.
176 B. STEWART
Table VIII. Best performance in 21 10-fold cross-validation experiments.
Measure Naive–Bayes Model tree Neural network

PRED(0.25) 12 5 5
PRED(0.50) 13 6 2
PRED(0.75) 17 2 4
MMRE 15 2 4
Table IX. Best performance in 21 experiments on training data.
Measure Naive–Bayes Model tree Neural network

PRED(0.25) 5 0 17
PRED(0.50) 3 0 18
PRED(0.75) 3 0 19
MMRE 3 0 20
The results in Table V were obtained using 10-fold cross-validation as described in Section 6.4, and
hence were obtained from the unseen data not used to train the models. The best results are highlighted
in bold font.
For each of the three models, Naive–Bayes (NB), model tree (MT), and neural network (NN) the
results show a wide variation in predictive accuracy, depending on the data set used. In terms of
the MMRE values, the best predictions were obtained for the banking, insurance, communication, and
manufacturing data sets. The results for electricity/gas/water, financial/property/business systems,
and public administration data sets were much poorer. Table VIII shows the number of times each
of the three models achieved the best results on the accuracy measures in 21 experiments.
The results in Table VIII show that the Naive–Bayes models produced superior results to model
trees and neural networks on most cross-validation experiments. It can also be seen from Table V that
Naive–Bayes gave best overall results on five data sets out of seven.
The results in Table VI were obtained by classifying the training data sets. They indicate how well
the three types of models can predict project delivery rates for the projects from which they were
constructed. Similarly as in Table V, the results for electricity/gas/water, financial/property/business
systems, and public administration data sets are significantly worse than for the remaining data sets.
Table IX summarizes how many times each model achieved the best performance.
From Table IX as well as from Table VI it can be seen that neural networks achieved superior
performance in almost all experiments on training data. We used five hidden nodes in each neural
network model. Neural networks are prone to over-fitting and very high accuracy on training data may
indicate over-fitting. In some cases this can be alleviated by reducing the number of hidden nodes.
We conducted experiments, not reported here, with smaller numbers of hidden units (three and four)
but in general five hidden units gave better results on cross-validation. Model trees gave results that
were only slightly better than for cross-validation experiments. The reason for this could be that the
M5 algorithm uses pruning to eliminate redundant variables and constructs a pruned tree containing
only the most significant variables. This prevents over-fitting.
Comparing the results of experiments 1, 2, and 3 for each data set in Table V shows that the
reduction of the number of variables did not affect greatly the values of the accuracy measures. In some
experiments the values were better and in some slightly worse. The results indicate that the mutual
information measure is a useful technique for reducing the number of variables in data sets.
One of the reasons for the wide variation in performance of all three models on different data
sets may be the composition of the individual data sets. The projects in the banking data set were
predominantly from the banking business area and were mostly management information systems.
The insurance, communication, and manufacturing data sets were comprised of projects from a wider
range of business areas and application types. The three worst performing data sets contained projects
from the broadest range of organizations. Poorer results for the more heterogenous data sets may
be partly due to wide differences in software development practices of different organization types
and to the variations in resource requirements of different types of applications. Neither of the three
approaches was able to capture these differences in the models constructed from relatively small
data sets.
Another reason for relatively poor performance of all three machine learning approaches may be
that the training sample sizes were too small. Although the sample sizes were not widely different
for the different organization types, more heterogenous data sets would require larger training sets to
enable the models to encode the necessary relationships. It is also possible that some of the important
variables affecting project delivery rates in such organizations have not been included in the Benchmark
Release 6 data disk. The performance of all machine learning algorithms is affected by the size of the
training set. In general, the accuracy tends to improve with increasing size of the training set.
7. CONCLUSION AND FUTURE WORK
In this paper we investigated the feasibility of the machine learning algorithm called Naive–Bayes
classifier for estimating project delivery rates. The basic idea of this approach is to construct a
Naive–Bayes classifier from a historical database of past projects and then use it to predict project
delivery rates for new projects. To compare the performance of Naive–Bayes to other machine learning
methods we conducted the same experiments with two alternative methods—model trees and artificial
neural networks. Experimental results obtained with data from the Benchmark Release 6 data disk
from ISBSG [14] are presented in Tables V and VI. The results of cross-validation experiments in
Table V show that from the three approaches studied Naive–Bayes produced the best values of MMRE
and PRED(x) measures on most of the cross-validation experiments. The results in Table VI show that
the neural network approach gave the best results on training data experiments. The results of cross-
validation experiments indicate how well each model can then predict project delivery rate for unseen
projects and hence are of greater practical significance than the results on training data. However, none
of the models achieved cross-validation results that would be acceptable for practical estimation of
project delivery rates. As mentioned in Section 6.3, for satisfactory performance PRED(0.25) should
be greater than 0.75 and MMRE less than 0.25.
178 B. STEWART
Although the predictive performance achieved in our experiments is far from satisfactory for
practical applications, it needs to be stressed that the data sets used were relatively small and contained
a wide variety of projects from different countries, business areas, and application types. It is likely
that if the data were local to a particular organization or a group of similar organizations the results
would be significantly better. So far none of the machine learning approaches to the software estimation
problem reported in the literature have achieved completely satisfactory results [6–8,10].
It is possible that in the future, when data collection becomes a more widely practiced activity in
software organizations and large databases of past cases have been accumulated, machine learning
algorithms will yield better results and will play an important part in software cost estimation.
Until now the most popular machine learning algorithms used in software engineering research have
been decision tree learners and neural networks. To our knowledge, no studies have been reported
on using the Naive–Bayes classifier. Our paper attempts to fill this gap. When compared to other
machine learning approaches, Naive–Bayes has several advantages. For example, unlike decision trees
Naive–Bayes is not sensitive to noise in the data. This is an important consideration in software
engineering where databases often contain imprecise and incomplete information. Naive–Bayes is
less affected by the problem of over-fitting than neural networks. Over-fitting occurs when the model
parameters become highly specialized to the training data and have poor generalization capabilities
for new samples. The main disadvantage of our current implementation is the necessity to discretize
numeric variables into discrete values. The prediction results are sensitive to the number of intervals
chosen. In the future we plan to investigate the use of Naive–Bayes implementations that can handle
continuous variables.
Our empirical experiments have shown that the Naive–Bayes classifier is a valuable tool for analysis
of software engineering data which can be used as an alternative to other more widely used approaches
such as decision trees and neural networks. The results of our experiments have also shown that
selecting model variables on the basis of the mutual information measure is a useful technique that
can improve efficiency of the model without reducing its accuracy.
Our future research will focus on applying the Naive–Bayes classifier to a variety of software
engineering databases and comparing its performance to other machine learning approaches. We will
also investigate the feasibility of using more complex Bayesian network classifiers for analysis of
software engineering data.
REFERENCES
1. Boehm B. Software Engineering Economics. Prentice-Hall: Englewood Cliffs NJ, 1981.

2. http://sunset.usc.edu/research/COCOMOII/index.html.
3. Albrecht AJ, Gaffney JE. Software function, source lines of code, and development effort prediction. IEEE Transactions
in Software Engineering 1983; 9(6):639–648.
4. Putnam LH. A general empirical solution to the macro software sizing and estimating problem. IEEE Transactions in
Software Engineering 1978; 4(4):345–361.
5. Briand L, Basili V, Thomas W. A pattern recognition approach for software engineering data analysis. IEEE Transactions
in Software Engineering 1992; 18(11):931–942.
6. Briand LC, El Emam K, Surmann D, Wieczorek I, Maxwell KD. An assessment and comparison of common software cost
estimation modelling techniques. Proceedings of the 1999 International Conference on Software Engineering (ICSE’99).
ACM Press: New York NY, 1999; 313–322.
7. Jorgensen M. Experience with the accuracy of software maintenance task effort prediction models. IEEE Transactions in
8. Srinivasan K, Fisher D. Machine learning approaches to estimating software development effort. IEEE Transactions in
9. Shin M, Goel AL. Empirical data modelling in software engineering using radial basis function. IEEE Transactions in
10. Shepperd M, Schofield C. Estimating software project effort using analogies. IEEE Transactions in Software Engineering
1997; 23(12):736–743.
11. Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Machine Learning 1997; 29(2–3):131–163.
12. Chow CK, Liu CN. Approximating discrete probability distributions with dependence trees. IEEE Transactions On
Information Theory 1968; 14:462–467.
13. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann: San Mateo
CA, 1988.
14. International Software Benchmarking Standards Group, 2002. http://www.isbsg.org.au [27 March 2002].
15. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees. Wadsworth: Belmont CA, 1984.
16. Quinlan JR. C4.5:Programs for Machine Learning. Morgan Kaufmann: San Mateo CA, 1993.
17. Bishop CM. Neural Networks for Pattern Recognition. Oxford University Press: New York NY, 1995.
18. Cover TM, Hart PE. Nearest neighbour pattern classification. IEEE Transactions On Information Theory 1967; 13(1):21–
27.
19. Goldberg DE. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley: Reading MA, 1989.
20. Hugin Expert, Aalborg, Denmark, 2001. http://www.hugin.com/cases/ [27 March 2002].
21. Agena, London, UK, 2001. http://www.agena.co.uk/ [27 March 2002].
22. Association for Uncertainty in Artificial Intelligence, 2001. http://www.auai.org/ [27 March 2002].
23. Neapolitan RE. Probabilistic Reasoning in Expert Systems: Theory and Algorithms. John Wiley: New York NY, 1990.
24. Castillo E, Gutierrez JM, Hadi AS. Expert Systems and Probabilistic Network Models. Springer: New York NY, 1997.
25. Cowell RG, Dawid AP, Lauritzen SL, Spiegelhalter DJ. Probabilistic Networks and Expert Systems. Springer: New York
NY, 1999.
26. Jensen FV. An Introduction to Bayesian Networks. UCL Press: London, 1996.
27. Lauritzen SL, Spiegelhalter DJ. Local computations with probabilities on graphical structures and their applications to
expert systems (with discussion). Journal of the Royal Statistical Society Series B 1988; 50:157–224.
28. Quinlan JR. Learning with continuous classes. Proceedings of the Fifth Australian Joint Conference on Artificial
Intelligence, Adams N, Sterling L (eds.). World Scientific: Singapore, 1992; 343–348.
29. WEKA, The University of Waikato, New Zealand, 2001. http://www.cs.waikato.ac.nz/ml/weka [27 March 2002].
30. Witten IH, Frank E. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan
Kaufmann: San Francisco CA, 2000.
31. Motoda H, Liu H. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer: Dordrecht, 1998.
32. Pfleeger SL. Software Engineering Theory and Practice. Prentice-Hall: Upper Saddle River NJ, 1998.
AUTHOR’S BIOGRAPHY
Bozena Stewart is a Lecturer in Computing in the School of Computing and Information

Technology at the University of Western Sydney, Australia. Her research interests are
primarily in the fields of artificial intelligence and software engineering. Her current
research includes Bayesian networks, machine learning, software cost estimation, and
data mining. She teaches object-oriented programming, data structures and algorithms,
software engineering, and artificial intelligence. She holds a PhD degree in computer
science from the University of Technology, Sydney.

Predicting Project Delivery Rates Using The Naive-Bayes Classifier

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Predicting Project Delivery Rates Using The Naive-Bayes Classifier

Uploaded by

Copyright:

Available Formats

JOURNAL OF SOFTWARE MAINTENANCE AND EVOLUTION: RESEARCH AND PRACTICE

J. Softw. Maint. Evol.: Res. Pract. 2002; 14:161–179 (DOI: 10.1002/smr.250)

Predicting project delivery rates

School of Computing and Information Technology, University of Western Sydney, Australia

Received 17 September 2001

2. BAYESIAN NETWORK CLASSIFIERS

2.1. Bayesian networks

Figure 1. A Bayesian network.

the joint distribution can be written in the product form P (x1 , x2 , . . . , xn ) =

2.2. Naive–Bayes classifier

Figure 2. A Naive–Bayes network.

Example. We illustrate the computations performed by the Naive–Bayes algorithm by means of a

Table I. The probability distribution of the variable C.

Table II. The conditional probability distribution P (x|c).

Table III. The conditional probability distribution P (y|c).

Table IV. The resulting conditional distribution P (c|x = 0, y = 1).

2.3. Estimating project delivery rates using the Naive–Bayes classifier

3. ESTIMATING PROJECT DELIVERY RATES USING MODEL TREES

4. ESTIMATING PROJECT DELIVERY RATES USING NEURAL NETWORKS

Input Layer Hidden Layer Output Layer

Figure 3. A feed forward neural network.

5. MUTUAL INFORMATION MEASURE

6.1. Data description

6.2. Data preparation

6.3. Accuracy measures

Table V. The results of experiments using 10-fold cross-validation.

PRED(0.25) PRED(0.50) PRED(0.75) MMRE

Copyright  2002 John Wiley & Sons, Ltd.

J. Softw. Maint. Evol.: Res. Pract. 2002; 14:161–179

PRED(0.25) PRED(0.50) PRED(0.75) MMRE

Copyright  2002 John Wiley & Sons, Ltd.

J. Softw. Maint. Evol.: Res. Pract. 2002; 14:161–179

6.5. Summary of our approach

6.6. Experimental results

Table VII. The data sets used for the experiments.

Data set Size The seven variables used in experiment 3

Table VIII. Best performance in 21 10-fold cross-validation experiments.

Measure Naive–Bayes Model tree Neural network

Table IX. Best performance in 21 experiments on training data.

Measure Naive–Bayes Model tree Neural network

7. CONCLUSION AND FUTURE WORK

1. Boehm B. Software Engineering Economics. Prentice-Hall: Englewood Cliffs NJ, 1981.

Bozena Stewart is a Lecturer in Computing in the School of Computing and Information

You might also like