
Software Qual J (2011) 19:537–552

DOI 10.1007/s11219-010-9112-9

A comparative study for estimating software development effort intervals

Ayşe Bakır • Burak Turhan • Ayşe Bener

Published online: 9 September 2010


© Springer Science+Business Media, LLC 2010

Abstract Software cost/effort estimation is still an open challenge. Many researchers
have proposed various methods that usually focus on point estimates. Until today, software
cost estimation has been treated as a regression problem. However, in order to prevent
overestimates and underestimates, it is more practical to predict the interval of estimations
instead of the exact values. In this paper, we propose an approach that converts cost
estimation into a classification problem and that classifies new software projects in one of
the effort classes, each of which corresponds to an effort interval. Our approach integrates
cluster analysis with classification methods. Cluster analysis is used to determine effort
intervals while different classification algorithms are used to find corresponding effort
classes. The proposed approach is applied to seven public datasets. Our experimental
results show that the hit rates obtained for effort estimation are around 90–100%, which is
much higher than those obtained by related studies. Furthermore, in terms of point esti-
mation, our results are comparable to those in the literature although a simple mean/median
is used for estimation. Finally, the dynamic generation of effort intervals is the most
distinctive part of our study, and it results in time and effort gain for project managers
through the removal of human intervention.

Keywords Software effort estimation · Interval prediction · Classification · Cluster analysis · Machine learning

A. Bakır (✉)
Department of Computer Engineering, Boğaziçi University, 34342 Bebek, Istanbul, Turkey
e-mail: ayse.bakir@boun.edu.tr

B. Turhan
Department of Information Processing Science, University of Oulu, 90014 Oulu, Finland
e-mail: burak.turhan@oulu.fi

A. Bener
Ted Rogers School of Information Technology Management, Ryerson University,
Toronto M5B 2K3, Canada
e-mail: ayse.bener@ryerson.ca


1 Introduction

As software becomes more important in many domains, the focus on its overall quality in
terms of technical product quality and process quality also increases. As a result, software
is blamed for business failures and the increased cost of business in many industries (Lum
et al. 2003). The underestimation of software effort causes cost overruns that lead to cost
cutting. Cost cutting means that some of the life cycle activities either can be skipped or
cannot be completed as originally planned. This causes a drop in software product quality.
To avoid the cost/quality death spiral, accurate cost estimates are vital (Menzies and Hihn
2006).
Software cost estimation is one of the critical steps in the software development life
cycle (Boehm 1981; Leung and Fan 2001). It is the process of predicting the effort required
to develop a software project. Such predictions assist project managers when they make
important decisions such as bidding for a new project, planning and allocating resources.
Inaccurate cost estimations may cause project managers to make wrong decisions. As
Leung and Fan state, underestimations may result in approving projects that would exceed
their budgets and schedules (Leung and Fan 2001). Overestimations, on the other hand,
may result in rejecting other useful projects and wasting resources.
Point estimates are generally used for project staffing and scheduling (Sentas et al.
2005). However, managers may easily make wrong decisions if they rely only on point
estimates and the associated error margins generated by cost estimation methods. Although
most methods proposed in the literature produce point estimates, Stamelos and Angelis
state that producing interval estimates is safer (Stamelos and Angelis 2001). They
emphasize that point estimates have a high impact on project managers, causing them to
make wrong decisions, since they include a high level of uncertainty as a result of unclear
requirements and their implications in the project. Interval estimates may be used for
predicting the cost of any current project in terms of completed ones. In addition, while
bidding for a new project, an interval estimate can easily be converted to a point estimate
by evaluating the values that fall into the same interval.
Up to now, interval estimation has consisted of finding either the confidence intervals
for point estimates or the posterior probabilities of predefined intervals and then fitting
regression-based methods to these intervals (Angelis and Stamelos 2000; Jorgensen 2002;
Sentas et al. 2003, 2005; Stamelos and Angelis 2001; Stamelos et al. 2003). However, none
of these approaches addresses the problem of cost estimation as a pure classification
problem. In this paper, we aim to convert cost estimation into a classification problem by
using interval estimation as a tool. The proposed approach integrates classification methods
with cluster analysis, which, to the best of our knowledge, is applied for the first time in the
software engineering domain. In addition, by using cluster analysis, effort classes are
determined dynamically instead of using manually predefined intervals. The approach uses
historical data of completed projects including their effort values.
The proposed approach includes three main phases: (1) clustering effort data so that
each cluster contains similar projects; (2) labeling each cluster with a class number and
determining the effort intervals for each cluster; and (3) classifying new projects to one of
the effort classes. We used various datasets to validate our approach, and our results
revealed much higher estimation accuracies than those in the literature. According to our
experimental study, we obtained higher hit rates for effort estimation. We also obtained
point estimates with simple approaches such as mean/median regression, and our perfor-
mance has been comparable to that reported in the literature.


The rest of the paper is organized as follows: Sect. 2 discusses related work from the
literature. Section 3 describes the proposed approach in detail, while Sect. 4 presents the
experiments conducted. Section 5 comprises a presentation of the results and discussions.
Finally, conclusions and future work are presented in Sect. 6.

2 Related work

Previous work on software cost estimation mostly produced point estimates by using
regression methods (Baskeles et al. 2007; Boetticher 2001; Briand et al. 1992; Draper and
Smith 1981; Miyazaki et al. 1994; Shepperd and Schofield 1997; Srinivasan and Fisher
1995; Tadayon 2005). According to Boehm, the two most popular regression methods are
ordinary least square regression (OLS) and robust regression (Boehm et al. 2000). OLS is a
general linear model that uses least squares, whereas robust regression is the improved
version of OLS (Draper and Smith 1981; Miyazaki et al. 1994). Besides regression, various
machine learning methods are used for cost estimation. For example, back-propagation
multilayer perceptrons and support vector machines (SVM) have been used for effort
estimation in Baskeles et al. (2007) and Boetticher (2001), and Briand et al. (1992)
introduce a cost estimation method based on optimized set reduction (Baskeles et al. 2007;
Boetticher 2001; Briand et al. 1992). Other methods for point estimation include estimation
by analogy and neural networks. In Shepperd and Schofield (1997), high accuracies are
obtained by using analogy with prediction models, whereas in Tadayon (2005) and
Shepperd and Schofield (1997), a significant improvement is made on large datasets
through the use of an adaptive neural network model (Shepperd and Schofield 1997;
Tadayon 2005).
Fewer studies focus on interval estimation. They can be grouped into two main cate-
gories: (1) those that produce confidence intervals for point estimates and (2) those that
produce probabilities of predefined intervals. In category 1, interval estimates are gener-
ated during the estimation process, whereas in category 2, intervals are predefined before
the estimation process.
The first study that has empirically evaluated effort prediction interval models in the
literature is Angelis and Stamelos (2000). It compares the effort prediction intervals
derived from a bootstrap-based model with the prediction intervals derived from regres-
sion-based effort estimation models. However, the said study displays a confusion of
terms, and a critique was consequently made by Jorgensen (2002) to clarify the
ambiguity (Jorgensen and Teigen 2002). In another study, an interval estimation method
based on expert judgment is proposed (Jorgensen 2003). Statistical simulation techniques
for calculating confidence intervals for project portfolios are presented in Stamelos and
Angelis (2001).
Two important studies for category 2 are Sentas et al. (2005), in which ordinal
regression is used to model the probabilities of both effort and productivity intervals, and
Sentas et al. (2003), which uses multinomial logistic regression for modeling productivity
intervals (Sentas et al. 2003, 2005). Both studies also include point estimate results of the
proposed models. Also, in Sentas et al. (2003), predefined intervals of productivity are used
in a Bayesian belief network to support expert opinion (Sentas et al. 2003). An empirical
comparison of the models that produce point estimates and predefined interval estimates is
given in Bibi et al. (2004). Firstly, in contrast to these studies, effort intervals are not
predefined manually in this paper. Instead, they are determined by clustering analysis.
Secondly, instead of using regression-based methods, we use classification algorithms that


originate from the machine learning domain. Thirdly, point estimates can still be derived
from these intervals as we will show in the following sections.
NASA’s Software Engineering Laboratory also specified some guidelines for the esti-
mation of effort prediction intervals (NASA 1990). However, these guidelines may limit
the external validity of the results since they do not necessarily reflect the characteristics
of projects in other organizations.
Clustering analysis is not a new concept in the software cost estimation domain. Lee
et al. integrate clustering with neural networks in order to estimate the development cost
(Lee et al. 1998). They have found similar projects with clustering and used them to train
the network. In Gallego et al. (2007), the cost data are clustered, and then different
regression models are fitted to each cluster. Similar to these studies, we also use cluster
analysis here for grouping similar projects. The difference of our research in comparison
to these studies is that we combine clustering with classification methods for effort
estimation.

3 The approach

There are three main steps in our approach: (1) grouping similar projects together by
cluster analysis; (2) determining the effort intervals for each cluster and specifying the
effort classes; and (3) classifying new projects into one of the effort classes. The
assumption behind applying cluster analysis to effort data is that similar projects have
similar development effort. The class-labeled clusters then become the input data for the
classification algorithm, which converts cost estimation into a classification process.

3.1 Cluster analysis

Cluster analysis is a technique for grouping data by finding similar structures within it. In
the software cost estimation domain, clustering corresponds to grouping projects into clusters
based on their attributes. Similar projects are assigned to the same cluster, whereas dis-
similar projects belong to different clusters.
In this study, we use an incremental clustering algorithm called leader cluster (Alpaydin
2004) for cluster analysis. In this algorithm, the number of clusters is not predefined;
instead, the clusters are generated incrementally. Since one of our main objectives is to
generate the effort intervals dynamically, this algorithm is selected to group similar soft-
ware projects. Other clustering techniques that generate the clusters dynamically can also
be used, but this is out of the scope of this work. The pseudocode of the leader cluster
algorithm is given in Fig. 1 (Bakar et al. 2005).
In order to determine the similarity between two projects, Euclidean distance is used. It
is a widely preferred distance metric for software engineering datasets (Lee et al. 1998).
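To make this step concrete, the following is a minimal Python sketch of an incremental (leader-style) clustering pass over normalized project vectors, following the pseudocode shown in Fig. 1 below. The threshold value, the function name and the toy data are illustrative assumptions, not the authors' actual implementation.

import numpy as np

def leader_cluster(projects, threshold):
    """Incrementally group project vectors: a new item joins the nearest
    existing cluster if its Euclidean distance to that center is below
    `threshold`; otherwise it starts a new cluster. Centers are running
    means of their members."""
    centers, members = [], []
    for x in projects:
        if centers:
            # Euclidean distance from the new item to every cluster center
            dists = [np.linalg.norm(x - c) for c in centers]
            nearest = int(np.argmin(dists))
        if not centers or dists[nearest] >= threshold:
            centers.append(x.copy())            # start a new cluster
            members.append([x])
        else:
            members[nearest].append(x)          # join the nearest cluster
            centers[nearest] = np.mean(members[nearest], axis=0)  # recompute center
    return centers, members

# Toy usage: five projects described by two normalized attributes
data = np.array([[0.1, 0.2], [0.15, 0.25], [0.8, 0.9], [0.82, 0.88], [0.5, 0.5]])
centers, members = leader_cluster(data, threshold=0.3)
print(len(centers), "clusters")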

3.2 Effort classes

1. Assign the first data item to the first cluster.
2. Consider the next data item:
   Find the distances between the new item and the existing cluster centers.
   If (distance < threshold)
   {
       Assign this item to the nearest cluster
       Recompute the value for that cluster center
   }
   Else
   {
       Assign it to a new cluster
   }
3. Repeat step 2 until the total squared error is small enough.

Fig. 1 Pseudocode for leader cluster algorithm (Bakar et al. 2005)

After the clusters and their centers are determined, the effort intervals are calculated for
each cluster. In order to specify the effort intervals and classes, firstly, the minimum and
maximum values of the efforts of the projects residing in the same cluster are found.
Secondly, these minimum and maximum values are selected as the lower and upper bounds
of the interval that will represent that cluster. Finally, each cluster is given a class label,
which will be used for classifying new projects.
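A hedged sketch of this interval and labeling step; the data layout (clusters as lists of attribute/effort pairs) is a hypothetical convenience, not the paper's data format.

def effort_classes(clusters):
    """Map each cluster index (class label) to its effort interval:
    (minimum effort, maximum effort) over the cluster's member projects."""
    intervals = {}
    for label, cluster in enumerate(clusters):
        efforts = [effort for _, effort in cluster]
        intervals[label] = (min(efforts), max(efforts))
    return intervals

# toy clusters of (attributes, effort) pairs, in the spirit of Table 1
clusters = [
    [(None, 278), (None, 1181), (None, 1248)],
    [(None, 18), (None, 42), (None, 60)],
]
print(effort_classes(clusters))   # {0: (278, 1248), 1: (18, 60)}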

3.3 Classification

The class of a new project is estimated by using the class-labeled data generated in the
previous step. The resulting class corresponds to the effort interval that contains the effort
value of the new project.
We use three different classification algorithms for this step: one is parametric (linear
discrimination) and the others are non-parametric (k-nearest neighbor and decision tree).
These three algorithms are chosen to show how our approach performs with algorithms
of different complexities. Linear discrimination is the simplest, whereas the decision tree is
the most complex one. k-nearest neighbor has moderate complexity, depending on the size
of the training set.

3.3.1 Linear discrimination

Linear discrimination (LD) is a discriminant-based approach that tries to fit a model
directly for the discriminant between the class regions, without first estimating the like-
lihoods or posteriors (Alpaydin 2004). It assumes that the projects of a class are linearly
separable from the projects of other classes and requires no knowledge of the densities
inside the class regions. The linear discriminant function is:

g_i(\mathbf{x} \mid \mathbf{w}_i, w_{i0}) = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} \qquad (1)

where g_i is the model, \mathbf{w}_i and w_{i0} are the model parameters and \mathbf{x} is the software project
with d attributes. It is used to separate two or more classes.
Learning involves the optimization of the model parameters to maximize the classifi-
cation accuracy on a given set of projects. Because of its simplicity and comprehensibility,
linear discrimination is frequently used before trying a more complicated model.

3.3.2 k-nearest neighbor

The k-nearest neighbor (k-NN) algorithm is a simple but also powerful learning method
that is particularly suited for classification problems.


k-NN assumes that all projects correspond to points in the n-dimensional Euclidean
space R^n, where n is the number of project attributes. The algorithm’s output is the
class that has the most examples among the k nearest neighbors of the input project. The
neighbors are found by calculating the Euclidean distance from each project to the input
project.
The selection of k is very important. It is generally set as an odd number to minimize
ties as confusion generally appears between any two neighboring classes (Alpaydin 2004).
Although the algorithm is easy to implement, the amount of computation increases as the
training set grows in size.

3.3.3 Decision tree

Decision trees (DT) are hierarchical data structures that are based on a divide-and-conquer
strategy (Quinlan 1993). They can be used for both classification and regression and
require no assumptions concerning the data. In the case of classification, they are called
classification trees.
The nodes of a classification tree correspond to the attributes that best split data into
disjoint groups, while the leaves correspond to the average effort of that split. The quality
of the split is determined by an impurity measure. The tree is constructed by partitioning
the data recursively until no further partitioning is possible while choosing the split that
minimizes the impurity at every occasion (Alpaydin 2004).
Concerning the estimation of software effort, the effort of the new project can be
determined by traversing the tree from top to bottom along the appropriate paths.
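The paper reports using MATLAB implementations of these three classifiers. As a language-neutral illustration only, the sketch below fits scikit-learn equivalents on random stand-in data; the data, labels and parameter choices are assumptions and do not reproduce the study's setup.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((40, 5))                # project attributes (stand-in data)
y = rng.integers(0, 3, size=40)        # effort-class labels produced by clustering

classifiers = {
    "LD": LinearDiscriminantAnalysis(),            # parametric linear discriminant
    "k-NN": KNeighborsClassifier(n_neighbors=1),   # the study selects the nearest neighbor
    "DT": DecisionTreeClassifier(random_state=0),  # impurity-based classification tree
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, "training accuracy:", clf.score(X, y))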

4 Experimental study

Our purpose in this study is to convert the effort estimation problem into a classification
problem that includes the following phases: (1) clustering the effort data; (2) labeling each
cluster with a class number and determining the effort intervals for each cluster; and (3)
classifying the new projects. In addition, the point estimation performance of the approach
is tested by taking either the mean or the median of the effort values of the projects
included in the estimated class.
In this section, details about the validation of our approach on a number of datasets will
be given. MATLAB is used as a tool for all the analyses stated in this study.

4.1 Dataset description

In our experiments, data from two different sources are used: the Promise Data Repository
and the Software Engineering Research Laboratory (SoftLab) Repository (Boetticher et al.
2007; SoftLab 2009). Seven datasets are used in this study. Four of them, which are
cocomonasa_v1, coc81, desharnais_1_1 and nasa93, are taken from the Promise Data
Repository. The others, which are sdr05, sdr06 and sdr07, are taken from the SoftLab
(2009) Repository. These datasets contain data from different local software companies in
Turkey, which are collected by using the COCOMO II Data Collection Questionnaire
(Boehm 1999).
The datasets include a number of nominal attributes and two real-valued attributes:
Lines of Code and Actual Effort. An exemplary dataset is given in Table 1. Each row in
Table 1 corresponds to a different project. These projects are represented by the nominal
attributes from the COCOMO II model along with their size in terms of LOC and the actual
effort spent for completing the projects.
We have used several datasets in the same format as provided in Table 1 in order to
validate our approach on a wide range of effort estimation data and to generalize our
results as much as possible. A list of all the datasets used in this study is given in Table 2.

Table 1 An example dataset

Project  Nominal attributes (as defined in COCOMO II)                              LOC    Effort

P1   1.00,1.08,1.30,1.00,1.00,0.87,1.00,0.86,1.00,0.70,1.21,1.00,0.91,1.00,1.08    70     278
P2   1.40,1.08,1.15,1.30,1.21,1.00,1.00,0.71,0.82,0.70,1.00,0.95,0.91,0.91,1.08    227    1,181
P3   1.00,1.08,1.15,1.30,1.06,0.87,1.07,0.86,1.00,0.86,1.10,0.95,0.91,1.00,1.08    177.9  1,248
P4   1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08    115.8  480
P5   1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08    29.5   120
P6   1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08    19.7   60
P7   1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08    66.6   300
P8   1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08    5.5    18
P9   1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08    10.4   50
P10  1.15,0.94,1.15,1.00,1.00,0.87,0.87,1.00,1.00,1.00,1.00,0.95,0.91,1.00,1.08    14     60
P11  1.00,1.00,1.15,1.11,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00    16     114
P12  1.15,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00    6.5    42
P13  1.00,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00    13     60
P14  1.00,1.00,1.15,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00    8      42

Table 2 An overview of the datasets

Data source       Dataset name                          # of Projects

Promise           cocomonasa_v1                         60
                  coc81                                 63
                  desharnais_1_1 (updated version)      77
                  nasa93                                93
SoftLab (2009)    sdr05                                 25
                  sdr06                                 24
                  sdr07                                 40

4.2 Design

Before applying any method, all of the datasets are normalized in order to remove the
scaling effects on different dimensions. By using min–max normalization, project attribute
values are converted into the [0…1] interval (Shalabi and Shaaban 2006). After normal-
ization, the need for a dimension reduction technique to extract the relevant features arises.
In this paper, principal component analysis (PCA) is used (Alpaydin 2004). The main
purpose of PCA is to reduce the dimensions of the dataset so that it can still be efficiently
represented without losing much information. Specifically, PCA seeks dimensions in
which the variances are maximized. By applying PCA to each cluster after clustering, the
model shown in Fig. 2 is developed. Our aim in applying PCA separately to each cluster is
to extract separate features for each cluster so that we can obtain better results for both
classification and point estimation. The dataset given in Table 1 is used as an example in
Fig. 2 to show how our cost data are processed.

Fig. 2 Our proposed model (Data → min–max normalization → normalized data → leader cluster → on each cluster: PCA, find effort intervals → 10 × 10 cross-validation with k-NN, LD, DT → calculate MMRE/MdMRE/PRED)

In Fig. 2, the projects in the cost dataset are illustrated as P1…P14. After the dataset is
normalized, projects are shown as P1′…P14′. The four clusters generated are named as C1,
C2, C3 and C4, which correspond to effort interval classes. As described earlier, the lower
and upper bounds for an effort interval class are determined dynamically by the minimum
and the maximum effort values of the projects that reside in the corresponding cluster.
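As a rough illustration of the normalization and per-cluster PCA steps, the sketch below uses scikit-learn rather than the authors' MATLAB code; the random data and cluster assignments are placeholders, and the 0.90 variance threshold mirrors the setting reported in Sect. 4.3.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((30, 16))                     # COCOMO-style attributes + LOC (Effort excluded)
X_norm = MinMaxScaler().fit_transform(X)     # min-max normalization into [0, 1]

cluster_ids = rng.integers(0, 3, size=len(X_norm))   # assumed output of the leader-cluster step

reduced = {}
for c in np.unique(cluster_ids):
    members = X_norm[cluster_ids == c]
    # keep enough principal components to explain 90% of each cluster's variance
    pca = PCA(n_components=0.90)
    reduced[c] = pca.fit_transform(members)
    print("cluster", c, "->", reduced[c].shape[1], "components")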

4.3 Model

Normalized effort estimation data are given as input to this model. Firstly, the leader cluster
algorithm is applied to the normalized data to obtain project groups. Here, we selected the
number of clusters that minimizes the total squared error while keeping the distance below
the defined threshold value. The optimum value for the number of clusters is found by
testing all possibilities and calculating the total squared error. Secondly, with PCA, each
cluster’s dimensions are reduced individually by using their own covariance matrices (the
proportion of variance is set to 0.90). The aim here is to prevent data loss within the
clusters. PCA is applied to the entire data except the Effort column, which is the value that
we want to estimate. Thirdly, each cluster is assigned a class label, and the effort intervals
for each of them are determined. As stated in Sect. 3.2, minimum and maximum values are
selected as the interval bounds. Then, the effort data containing the projects with corre-
sponding class labels are given to each of the classification algorithms described in Sect. 3.
For the k-nearest neighbor algorithm, the nearest neighbor (i.e. k = 1) is selected. For linear
discrimination and decision tree algorithms, the built-in MATLAB implementations are
used. Since separate training and test sets do not exist, the classification process is per-
formed in a 10 × 10 cross-validation loop. The data are shuffled 10 times into random order
and then divided into 10 bins in the cross-validation loop. The training set is built from nine
of the bins, and the remaining bin is used as the validation set. Classification algorithms are
first trained on the training set, and then, estimations and error calculations are made on the
validation set. The errors are collected during 100 cross-validation iterations, and then
MMRE, MdMRE and PRED values are calculated. Since we have three classification
methods, we have three sets of measurements.
In addition, point estimates are calculated at the classification stage in order to deter-
mine our point estimation performance. For this process, we decided to use the mean and
the median as our point estimators since they have been used by other studies in the
literature. For example, Sentas et al. (2003) represent each interval by a single
representative value: the mean point or the median point. At the classification
step, when the correct effort class is estimated, the mean and median of the effort values of
the projects belonging to that class are calculated.
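Putting the pieces together, the following hedged sketch mimics the evaluation loop described above — repeated 10 × 10 cross-validation, 1-nearest-neighbor classification into effort classes, and a median-based point estimate — on synthetic stand-in data rather than the actual project datasets; the measures themselves are defined in Sect. 4.4.

import numpy as np
from sklearn.model_selection import RepeatedKFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = rng.random((50, 4))                    # reduced project attributes (stand-in data)
y_cls = rng.integers(0, 3, size=50)        # effort-class labels from clustering
y_eff = rng.uniform(10, 1000, size=50)     # actual effort values

hits, mres = [], []
cv = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)   # 10 x 10 cross-validation
for train, test in cv.split(X):
    clf = KNeighborsClassifier(n_neighbors=1).fit(X[train], y_cls[train])
    pred_cls = clf.predict(X[test])
    hits.extend(pred_cls == y_cls[test])
    for cls, actual in zip(pred_cls, y_eff[test]):
        # point estimate: median effort of the training projects in the predicted class
        point = np.median(y_eff[train][y_cls[train] == cls])
        mres.append(abs(point - actual) / actual)

print("hit rate:", np.mean(hits))
print("MMRE:", np.mean(mres), "MdMRE:", np.median(mres))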

4.4 Accuracy measures

Although our aim is to convert cost estimation to a classification problem, we want to give
the point estimate results of the proposed approach in order to make a comparison with
other studies. Thus, we have employed two types of accuracy measures in our experimental
study: (1) misclassification rate for classification and (2) MeanMRE (MMRE), Median-
MRE and PRED (25) for point estimates.

4.4.1 Misclassification rate

The misclassification rate is simply the proportion of the number of misclassified software
projects in a test set to the total number of projects to be classified in the same test set. It is
calculated for each classification algorithm in each model. The formula for calculating the
misclassification rate is as follows:
MR = \frac{1}{N_t} \sum_{n=1}^{N_t} \begin{cases} 1 & \text{if } y \neq y' \\ 0 & \text{otherwise} \end{cases} \qquad (2)

where N_t is the total number of projects to be classified in the test set, y' is the estimated
effort class and y is the actual effort class.
The misclassification rate can be thought of as the complement of the hit rate that has been
mentioned in interval prediction studies; thus, our results are still comparable to those
studies:

100\% = \text{Misclassification Rate} + \text{Hit Rate} \qquad (3)
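For completeness, a tiny check of Eqs. (2) and (3) on hypothetical arrays of actual and estimated effort classes:

import numpy as np

y_true = np.array([0, 1, 2, 1, 0])   # actual effort classes (hypothetical)
y_pred = np.array([0, 1, 1, 1, 0])   # estimated effort classes (hypothetical)

mr = np.mean(y_pred != y_true)       # misclassification rate, Eq. (2)
hit_rate = 1.0 - mr                  # Eq. (3): 100% = misclassification rate + hit rate
print(mr, hit_rate)                  # 0.2 0.8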


4.4.2 MMRE, MedianMRE and PRED (25)

These are the measures that are calculated from the relative error and the difference
between the actual and the estimated value.
The magnitude of relative error (MRE) is calculated by the following formula:
MRE = |predicted − actual| / actual \qquad (4)
The mean magnitude of relative error (MMRE) is the mean of the MRE values, and
MedianMRE (MdMRE) is the median of the MRE values. Prediction at level N, or
PRED(N), is used to examine the cumulative frequency of MRE for a specific error level.
For T estimations, the formula is as follows:

PRED(N) = \frac{100}{T} \sum_{i=1}^{T} \begin{cases} 1 & \text{if } MRE_i \le N/100 \\ 0 & \text{otherwise} \end{cases} \qquad (5)

In this study, we take the desired error level as N = 25. PRED (25) is preferred over
MMRE and MdMRE in terms of evaluating the stability and robustness of the estimations
(Conte et al. 1986; Stensrud et al. 2003). In order to say that a model performs well, the
MMRE and MdMRE values should be low and the PRED (25) values should be high.
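The point-estimation measures are straightforward to compute; a small illustrative helper, where the array names and example values are assumptions:

import numpy as np

def error_measures(predicted, actual, level=25):
    """Return (MMRE, MdMRE, PRED(level)) for arrays of effort estimates."""
    mre = np.abs(predicted - actual) / actual         # Eq. (4)
    pred_at = 100.0 * np.mean(mre <= level / 100.0)   # Eq. (5)
    return np.mean(mre), np.median(mre), pred_at

print(error_measures(np.array([120.0, 300.0]), np.array([100.0, 280.0])))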

4.5 Scope and limitations

In this paper, we address the cost estimation problem as a classification problem and
propose an approach that integrates classification methods with cluster analysis. This
approach uses historical cost data and different machine learning techniques in order to
make predictions. Although our main aim is to predict effort intervals, we also demonstrate
that point estimates can be achieved through our approach as well. Therefore, the scope of
our work is relevant for practitioners who employ cost estimation practices.
One of the limitations of our approach is that we test only one clustering method in order to
obtain the effort classes. Other clustering techniques that create dynamic clusters can also be
used instead of the leader cluster. As a second limitation, we obtain point estimates through
simple approaches such as the mean/median regression. Regression-based models can be
used to increase the point estimation performance. However, our aim is not to demonstrate the
superiority of one algorithm over the others; instead, we provide an implementation of our
ideas using public datasets in order to demonstrate the applicability of our approach.
We address the threats to the validity of our work under three categories: (1) internal
validity, (2) external validity and (3) construct validity.
Internal validity fundamentally questions to what extent the cause–effect relationship
between dependent and independent variables exists. For addressing the threats to the
internal validity of our results, we used seven datasets and applied 10 × 10 cross-validation
to overcome the ordering effects.
External validity, i.e. the generalizability, of results addresses the extent to which the
findings of a particular study are applicable outside the specifications of that study. To
ensure the generalizability of our results, we paid extra attention to include as many
datasets coming from various resources as possible and used seven datasets from two
different sources in our study. Our datasets contain a wide diversity of projects in terms of
their sources, their domains and the time period during which they were developed.
Datasets composed of software development projects from different organizations around
the world are used to generalize our results.


Construct validity (i.e. face validity) assures that we are measuring what we actually
intended to measure. We use in our research MR, MMRE, MdMRE and PRED (25) for
measuring and comparing the performance of the model. The majority of effort estimation
studies use estimation-error-based measures for measuring and comparing the performance
of different methods. We also used error-based measures in our study since they are a
practical option for the majority of researchers. Moreover, using error-based methods
enables our study to be benchmarked with previous effort estimation research.

5 Results and discussions

The proposed approach is applied to and validated on all of the seven datasets. The results
are given in terms of accuracy measures mentioned in Sect. 4.
The effort clusters created for each dataset are given in Table 3. In order to show the
clustering efficiency, the minimum and maximum numbers of projects assigned to a cluster
are also given.
The classification results for effort interval estimation are given in Fig. 3. k-NN and LD
perform similarly for coc81, desharnais_1_1, nasa93 and sdr05. They both give a mis-
classification rate of 0% for coc81 and sdr05. For cocomonasa_v1 and sdr06, k-NN out-
performs the others, whereas LD is the best one for sdr07. In total, the proposed model
gives a misclassification rate of 0% for five of the seven datasets in the best case and 17%
in the worst case.

Table 3 Effort clusters for each dataset

Dataset          # of Clusters    # of Projects (Min)    # of Projects (Max)

coc81            4                2                      44
cocomonasa_v1    5                3                      36
desharnais_1_1   9                2                      21
nasa93           6                3                      44
sdr05            3                3                      16
sdr06            3                2                      12
sdr07            4                6                      16

Fig. 3 Effort misclassification rates for each dataset


Table 4 Comparison of the results

                 Hit rate (%) (Min)    Hit rate (%) (Max)

Sentas et al.    60.38                 79.24
Our model        97                    100

The outcomes concerning effort interval estimation yield some important results.
Considering classifiers, k-NN is the best performing one and LD follows it with a slight
difference, whereas DT is the worst performing one.
Since our main aim is effort interval classification, we focus on the misclassification rate
to measure how good our classification performance is. The misclassification rates are 0%
for most cases and around 17% in the worst case. There are not many studies in the
literature that investigate the effort interval classification. The most recent study on this
topic is Sentas et al.’s study, in which ordinal regression is used to model the probabilities
of both effort and productivity intervals (Sentas et al. 2005). In the said study, hit rates of
around 70% are obtained for productivity interval estimation on the coc81 dataset. In our
study, however, the hit rates for all datasets are between 90 and 100%. The main reason for
this is that we use similar projects in order to predict the project cost. This is achieved
through clustering the projects according to their attributes. Furthermore, the intervals in
the above-mentioned study are manually predefined, whereas we dynamically create them
by clustering. In Table 4, we compare our results with those of Sentas et al.
We also analyzed our results in terms of point estimation. We used a simple approach
based on using the means and medians of the intervals for point estimation. We should
once again note that our main aim is to determine the effort intervals. However, we also
show how our results can be easily converted to point estimates and can produce com-
parable results to previous ones.
In Table 5, we present point estimation results in terms of the three measures mentioned
in the previous section. Point estimates are determined by taking either the mean or the
median of the effort values of the projects. In terms of the point estimation performance,
k-NN and LD perform nearly the same and better than DT for all datasets. The performance
of all classifiers improves for all measures when the median is used for point estimation.
Especially for MMRE and MdMRE measures, the improvement is obvious. MMRE and
MdMRE results decrease to 13%, and PRED results increase to 86% for some datasets. Note
that a PRED value of 86% means that 86% of all estimations have a relative error of at
most 25%, which shows the stability and robustness of the model we propose.
Combining clustering with classification methods has helped us to achieve favorable
results by eliminating the effects of unrelated data. Our experimental results show that we
achieved much higher hit rates than those of previous studies. Although we simply use the
mean and the median of the effort interval values, the point estimation results are also
comparable to those in the literature. If a different model is fitted to each interval sepa-
rately, it is expected that our estimation results will further improve.

6 Conclusions and future work

Although various methods have been proposed within the scope of the literature, in this
paper, we handle the cost estimation problem in a different manner. We treat cost esti-
mation as a classification problem rather than a regression problem and propose an


Table 5 Point estimation results (%)


Dataset Classifier Using the mean of projects Using the median of projects

MMRE MdMRE PRED MMRE MdMRE PRED

coc81 LD 189 183 33 131 131 33.6
k-NN 189 183 33 131 131 33.6
DT 192 190 29.6 134 131 30.2
cocomonasa_v1 LD 69 45 42.2 51 32 54.8
k-NN 69 45 42 51 32 54.6
DT 76 50 26.8 58 40 39.4
desharnais_1_1 LD 13 12 84.14 13 12 86.42
k-NN 13 12 84.14 13 12 86.71
DT 16 15 79 15 15 81.85
nasa93 LD 70 52 55.5 52 40 57.7
k-NN 69 52 55.5 52 40 57.7
DT 72 52 51.2 55 41 53.4
sdr05 LD 45 28 45.5 37 26 52
k-NN 45 28 45.5 37 26 52
DT 59 44 28.5 52 38 35
sdr06 LD 31 31 50.5 25 23 67
k-NN 30 31 50.5 24 23 67
DT 34 36 44.5 27 25 61
sdr07 LD 14 14 84.66 14 14 79.6
k-NN 14 13 81.33 14 14 76.3
DT 14 13 81.33 14 14 76.3

approach that classifies new software projects into one of the dynamically created effort
classes, with each corresponding to an effort interval. Predicting intervals instead of exact
values is a more practical way to prevent overestimation and underestimation.
This approach integrates classification methods with cluster analysis, which is, to
the best of our knowledge, performed for the first time in the software engineering domain.
In contrast to previous studies, the intervals are not predefined but dynamically created
through clustering.
The proposed approach is validated on seven datasets taken from public repositories,
and the results are presented in terms of widely used performance measures. These results
point out the three important advantages our approach offers:
1. We obtain much higher effort estimation hit rates (around 90–100%) in comparison to
other studies in the literature.
2. For point estimation results, we can see that the MdMRE, MMRE and PRED (25)
values are comparable to those in the literature for most of the datasets although we
use simple methods such as mean and median regression.
3. Effort intervals are generated dynamically according to historical data. This method
removes the need for project managers to specify effort intervals manually and hence
prevents the waste of time and effort.
Future work includes the use of different clustering techniques to find effort classes and
to fit probabilistic models to the intervals. Also, regression-based models can be used for


point estimation instead of taking the mean and the median of interval values, which would
enhance the point estimation performance.

Acknowledgments This research is supported in part by Tubitak under grant number EEEAG108E014.

References

Alpaydin, E. (2004). Introduction to machine learning. Cambridge: The MIT Press.


Angelis, L., & Stamelos, I. (2000). A simulation tool for efficient analogy based cost estimation. Journal of
Empirical Software Engineering, 5(1), 35–68.
Bakar, Z. A., Deris, M. M., & Alhadi, A. C. (2005). Performance analysis of partitional and incremental
clustering, Seminar Nasional Aplikasi Teknologi Informasi (SNATI).
Baskeles, B., Turhan, B., & Bener, A. (2007). Software effort estimation using machine learning methods. In
Proceedings of the 22nd international symposium on computer and information sciences (ISCIS 2007),
Ankara, Turkey, pp. 126–131.
Bibi, S., Stamelos, I., & Angelis, L. (2004). Software cost prediction with predefined interval estimates. In
First Software Measurement European Forum, Rome, Italy, January 2004.
Boehm, B. W. (1981). Software engineering economics. Advances in computer science and technology
series. Upper Saddle River, NJ: Prentice Hall PTR.
Boehm, B. W. (1999). COCOMO II and COQUALMO Data Collection Questionnaire. University of
Southern California, Version 2.2.
Boehm, B., Abts, C., & Chulani, S. (2000). Software development cost estimation approaches—A survey.
Annals of Software Engineering.
Boetticher, G. D. (2001). Using machine learning to predict project effort: empirical case studies in data-
starved domains. In First international workshop on model-based requirements engineering,
pp. 17–24.
Boetticher, G., Menzies, T., & Ostrand, T. (2007). PROMISE repository of empirical software engineering
data. West Virginia University, Department of Computer Science. http://www.promisedata.org/
repository.
Briand, L. C., Basili, V. R., & Thomas, W. M. (1992). A pattern recognition approach for software
engineering data analysis. IEEE Transactions on Software Engineering, 18(11), 931–942.
Conte, S. D., Dunsmore, H. E., & Shen, V. Y. (1986). Software engineering metrics and models. Menlo
Park, CA: Benjamin-Cummings.
Draper, N., & Smith, H. (1981). Applied regression analysis. London: Wiley.
Gallego, J. J. C., Rodriguez, D., Sicilia, M. A., Rubio, M. G., & Crespo, A. G. (2007). Software project
effort estimation based on multiple parametric models generated through data clustering. Journal of
Computer Science and Technology, 22(3), 371–378.
Jorgensen, M. (2002). Comments on ‘a simulation tool for efficient analogy based cost estimation’.
Empirical Software Engineering, 7, 375–376.
Jorgensen, M. (2003). An effort prediction interval approach based on the empirical distribution of previous
estimation accuracy. Information and Software Technology, 45, 123–126.
Jorgensen, M., & Teigen, K. H. (2002). Uncertainty intervals versus interval uncertainty: An alternative
method for eliciting effort prediction intervals in software development projects. In International
conference on project management (ProMAC), Singapore, pp. 343–352.
Lee, A., Cheng, C. H., & Balakrishnan, J. (1998). Software development cost estimation: Integrating neural
network with cluster analysis. Information and Management, 34, 1–9.
Leung, H., & Fan, Z. (2001). Software cost estimation. Handbook of software engineering and knowledge
engineering. ftp://cs.pitt.edu/chang/handbook/42b.pdf.
Lum, K., Bramble, M., Hihn, J., Hackney, J., Khorrami, M., & Monson, E. (2003). Handbook for software
cost estimation. NASA Jet Propulsion Laboratory, JPL D-26303.
Menzies, T., & Hihn, J. (2006). Evidence-based cost estimation for better-quality software. IEEE Software,
23(4), 64–66.
Miyazaki, Y., Terakado, M., Ozaki, K., & Nozaki, H. (1994). Robust regression for developing software
estimation models. Journal of Systems and Software, 1, 3–16.
NASA. (1990). Manager’s handbook for software development. Goddard Space Flight Center, Greenbelt,
MD, NASA Software Engineering Laboratory.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.


Sentas, P., Angelis, L., & Stamelos, I. (2003). Multinominal logistic regression applied on software pro-
ductivity prediction. In 9th Panhellenic conference in informatics, Thessaloniki.
Sentas, P., Angelis, L., Stamelos, I., & Bleris, G. (2005). Software productivity and effort prediction with
ordinal regression. Information and Software Technology, 47, 17–29.
Shalabi, L. A., & Shaaban, Z. (2006). Normalization as a preprocessing engine for data mining and the
approach of preference matrix. In IEEE proceedings of the international conference on dependability
of computer systems (DEPCOS-RELCOMEX’06).
Shepperd, M., & Schofield, M. (1997). Estimating software project effort using analogies. IEEE Transac-
tions on Software Engineering, 23(12), 736–743.
SoftLab. (2009). Software research laboratory, Department of Computer Engineering, Bogazici University.
http://www.softlab.boun.edu.tr.
Srinivasan, K., & Fisher, D. (1995). Machine learning approaches to estimating software development
effort. IEEE Transactions on Software Engineering, 21(2), 126–137.
Stamelos, I., & Angelis, L. (2001). Managing uncertainty in project portfolio cost estimation. Information
and Software Technology, 43(13), 759–768.
Stamelos, I., Angelis, L., Dimou, P., & Sakellaris, E. (2003). On the use of bayesian belief networks for the
prediction of software productivity. Information and Software Technology, 45, 51–60.
Stensrud, E., Foss, T., Kitchenham, B., & Myrtveit, I. (2003). A further empirical investigation of the
relationship between MRE and project size. Empirical Software Engineering.
Tadayon, N. (2005). Neural network approach for software cost estimation. International Conference on
Information Technology: Coding and Computing, 2, 815–818.

Author Biographies

Ayşe Bakır received her MSc degree in computer engineering from Bogazici University in 2008 and her BSc degree in computer engineering from Gebze Institute of Technology in 2006. Her research interests include software quality modeling and software cost estimation.

Burak Turhan received his PhD in computer engineering from Bogazici University. After his postdoctoral studies at the National Research Council of Canada, he joined the Department of Information Processing Science at the University of Oulu. His research interests include empirical studies on software quality, cost/defect prediction models, test-driven development and the evaluation of new approaches for software development.


Ayşe B. Bener is an associate professor in the Ted Rogers School of Information Technology Management. Prior to joining Ryerson, Dr. Bener was a faculty member and Vice Chair in the Department of Computer Engineering at Boğaziçi University. Her research interests are software defect prediction, process improvement and software economics. Bener has a PhD in information systems from the London School of Economics. She is a member of the IEEE, the IEEE Computer Society and the ACM.

