Chris Lokan
School of Information Technology and Electrical Engineering
UNSW@ADFA, Australian Defence Force Academy
Canberra ACT 2600, Australia
c.lokan@adfa.edu.au
Abstract

When estimation models are derived from existing data, they are commonly evaluated using statistics such as mean magnitude of relative error. But when the models are derived in the first place, it is usually by optimizing something else — typically, as in statistical regression, by minimizing the sum of squared deviations. How do estimation models for typical software engineering data fare, on various common accuracy statistics, if they are derived using other "fitness functions"? In this study, estimation models are built using a variety of fitness functions, and evaluated using a wide range of accuracy statistics. We find that models based on minimizing actual errors generally out-perform models based on minimizing relative errors. Given the nature of software engineering data sets, minimizing the sum of absolute deviations seems an effective compromise.

Keywords: effort estimation, genetic programming, accuracy statistics, fitness functions.

1. Introduction

Building an estimation model from existing data involves finding a line of best fit for that data. This involves minimizing some error criterion, or "fitness function". One model is better ("fitter") than another if it yields a lower value for the chosen fitness function. For example, the error criterion used in ordinary least squares regression is the sum of squared deviations.

Though models are commonly built by minimizing squared errors, they are not generally evaluated that way. Numerous accuracy statistics have been proposed. In the software engineering literature it is common to evaluate an estimation model in terms of its mean magnitude of relative error (MMRE), or the fraction of estimates within a given percentage of the actual value (Pred(l)).

This leads to the question: how do estimation models for typical software engineering data fare, on various common accuracy statistics, when they are derived using different fitness functions, e.g. minimizing the sum of absolute deviations, minimizing MMRE, maximizing Pred(0.25)?

Models that minimize squared deviations naturally give greatest weight to large systems, where the scope for large estimation errors is greatest. Models that minimize relative errors naturally give greatest weight to the most common types of systems. If large systems were also the most common, these two approaches would coincide. But software engineering data sets are generally skewed towards smaller systems, so there is a trade-off to be evaluated.

In this study, estimation models are built using a variety of fitness functions, and evaluated using a wide range of accuracy statistics. Data is sourced from the ISBSG data set. Genetic programming is used to derive the models, because it can work with arbitrary fitness functions.

The aim is to observe how the models compare across a range of fitness functions, and thus to gain some understanding of which fitness functions are generally most effective with software engineering data.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 gives an outline of genetic programming (GP) and its use in software estimation, and explains why it is a suitable vehicle for this study. Section 4 describes the accuracy statistics used here to evaluate models, and the fitness functions used to derive the models. Section 5 describes the research method used. Results are presented and discussed in Section 6. Section 7 discusses threats to the validity of this work. Conclusions and comments on possible future work are given in Section 8.

2. Related work

There are two types of work that are related to this study: investigations of the use of GP for software estimation, and studies concerning what might be termed the "infrastructure" of software estimation models.

Investigations of GP for software estimation are described in section 3.2, following a description of GP.
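The trade-off between squared, absolute, and relative errors can be made concrete with a toy sketch. The following is not the paper's GP setup: the data, the one-parameter model effort ≈ a × size, and the grid search are all invented for illustration. It fits the same model under three different fitness functions and compares the resulting slopes.

```python
def fit_slope(sizes, efforts, loss):
    """Grid-search the slope a in effort ~ a * size that minimizes `loss`."""
    best_a, best_loss = 0.0, float("inf")
    for i in range(1, 2001):
        a = i * 0.01                      # candidate slopes 0.01 .. 20.00
        est = [a * s for s in sizes]
        cur = loss(est, efforts)
        if cur < best_loss:
            best_a, best_loss = a, cur
    return best_a

# Invented, skewed data: many small projects (effort/size near 10)
# and one large project with a much lower ratio (5).
sizes   = [10, 12, 15, 20, 25, 30, 40, 60, 80, 200]
efforts = [100, 115, 155, 200, 240, 310, 380, 630, 800, 1000]

sse = lambda est, act: sum((e - a) ** 2 for e, a in zip(est, act))
lad = lambda est, act: sum(abs(e - a) for e, a in zip(est, act))
mre = lambda est, act: sum(abs((e - a) / a) for e, a in zip(est, act)) / len(act)

for name, loss in (("SSE", sse), ("LAD", lad), ("MRE", mre)):
    print(name, fit_slope(sizes, efforts, loss))
```

On this deliberately skewed data, minimizing squared error pulls the slope toward the single large project's ratio, minimizing mean relative error tracks the many small projects, and minimizing absolute deviations lands between the two, mirroring the trade-off described above.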
MSE: error = (estimate − actual)²

LAD: error = abs(estimate − actual)

MRE: error = abs((estimate − actual) / actual)

MER: error = abs((estimate − actual) / estimate)

z is trickier, since its optimum value is 1, not zero. Computing and subtracting 1 does not work, since negative values are possible and minimization will do the wrong thing. abs(z − 1) is no help, as that is MRE. As an approximation, the following function was used:

Z: error = max(estimate, actual) / min(estimate, actual)

This is always at least 1, and it makes sense to minimize it: the closer it is to 1, the closer the estimate and actual values are to each other.

Pred(l) is a blunt instrument. It is insensitive to the degree of inaccuracy of estimates that fall outside the desired range. Rather than simply optimizing Pred(l) without regard to the inaccurate estimates, functions were investigated that would focus first on maximizing Pred(l), but would resolve ties by considering a second criterion: MSE, LAD, or MRE. As noted above, the GP software used here assumes that small fitness values are better than large values: the fitness function needs to be minimized, not maximized. Thus maximizing Pred(l) needs to be re-cast as minimizing the number of projects for which the absolute relative error is not ≤ l. This is done by assigning a very high penalty (10¹⁰) to estimates where the error exceeds l. The penalty value overwhelms the second criterion; hence the first priority in optimization is to minimize the number of estimates that receive that penalty. Estimation equations that have the same Pred(l) value are distinguished by the second criterion.

The following functions were used:

Pred/LAD: error = abs(diff) + 10¹⁰ if MRE > l
          error = abs(diff)        if MRE ≤ l

Pred/MSE: error = diff² + 10¹⁰ if MRE > l
          error = diff²          if MRE ≤ l

The analysis presented in this paper was based on software projects from Release 8 of the ISBSG data set¹. Release 8 had data on 2,027 projects, with over 50 data attributes recorded for each project. These attributes capture the nature of the project itself, the development environment and techniques used, and strengths and weaknesses perceived in individual projects. Most are known early in a project, which is important for estimation.

Not all of the 2,027 projects were appropriate for inclusion in our analysis:

- Projects were excluded if they were not assigned a high data quality rating (A or B) by ISBSG.
- Projects were excluded if their size was measured in anything except IFPUG function points.
- Projects were excluded if their effort values covered more than just the development team.
- Projects were excluded if they had any missing data on four key ratio-scaled attributes: size, effort, team size, and duration.
- Projects whose normalized effort differed from recorded effort by more than 50% were excluded (normalized effort is ISBSG's estimate of effort over an entire project, if reported effort only covers some phases of the life cycle).
- Projects were excluded if their size was below about 50 function points, or their effort below 100 hours. The experience of practitioners is that projects below these sizes are too variable to be included in data sets for effort estimation models.
- Projects were excluded if they were identified as outliers, or as high leverage points, on one or more of the ratio-scaled attributes. The rationale for this was to avoid the accuracy statistics being unduly influenced by a small number of extreme values. Outliers were identified using box plots; leverage points by the influence.measures function in the statistical language R.

¹ www.isbsg.org
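The fitness functions listed above can be written out directly. The sketch below is my own illustration rather than the author's GP implementation; it assumes MSE and LAD are summed over projects and the relative measures averaged, assumes positive estimates and actuals, and uses l = 0.25 and the 10¹⁰ penalty described in the text.

```python
PENALTY = 1e10  # overwhelms the secondary criterion, per the text

def mse(est, act):
    return sum((e - a) ** 2 for e, a in zip(est, act))

def lad(est, act):
    return sum(abs(e - a) for e, a in zip(est, act))

def mre(est, act):
    return sum(abs((e - a) / a) for e, a in zip(est, act)) / len(act)

def mer(est, act):
    return sum(abs((e - a) / e) for e, a in zip(est, act)) / len(act)

def z(est, act):
    # Always at least 1; the closer to 1, the closer estimate and actual are.
    return sum(max(e, a) / min(e, a) for e, a in zip(est, act)) / len(act)

def pred_lad(est, act, l=0.25):
    # Estimates outside the Pred(l) band incur the huge penalty, so the first
    # priority is minimizing how many miss the band; LAD breaks ties.
    return sum(abs(e - a) + (PENALTY if abs((e - a) / a) > l else 0.0)
               for e, a in zip(est, act))

def pred_mse(est, act, l=0.25):
    # Same penalty scheme, with squared error as the tie-breaker.
    return sum((e - a) ** 2 + (PENALTY if abs((e - a) / a) > l else 0.0)
               for e, a in zip(est, act))
```

For example, with estimates [110, 200] against actuals [100, 100], the second estimate misses the Pred(0.25) band, so pred_lad and pred_mse each exceed 10¹⁰ while the plain criteria stay small.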