

What Should You Optimize When Building an Estimation Model?

Chris Lokan
School of Information Technology and Electrical Engineering
UNSW@ADFA, Australian Defence Force Academy
Canberra ACT 2600, Australia
c.lokan@adfa.edu.au

Abstract

When estimation models are derived from existing data, they are commonly evaluated using statistics such as mean magnitude of relative error. But when the models are derived in the first place, it is usually by optimizing something else — typically, as in statistical regression, by minimizing the sum of squared deviations. How do estimation models for typical software engineering data fare, on various common accuracy statistics, if they are derived using other "fitness functions"? In this study, estimation models are built using a variety of fitness functions, and evaluated using a wide range of accuracy statistics. We find that models based on minimizing actual errors generally out-perform models based on minimizing relative errors. Given the nature of software engineering data sets, minimizing the sum of absolute deviations seems an effective compromise.

Keywords: effort estimation, genetic programming, accuracy statistics, fitness functions.

1. Introduction

Building an estimation model from existing data involves finding a line of best fit for that data. This involves minimizing some error criterion, or "fitness function". One model is better ("fitter") than another if it yields a lower value for the chosen fitness function. For example, the error criterion used in ordinary least squares regression is the sum of squared deviations.

Though models are commonly built by minimizing squared errors, they are not generally evaluated that way. Numerous accuracy statistics have been proposed. In the software engineering literature it is common to evaluate an estimation model in terms of its mean magnitude of relative error (MMRE), or the fraction of estimates within a given percentage of the actual value (Pred(l)).

This leads to the question: how do estimation models for typical software engineering data fare, on various common accuracy statistics, when they are derived using different fitness functions — e.g. minimizing the sum of absolute deviations, minimizing MMRE, or maximizing Pred(0.25)?

Models that minimize squared deviations naturally give greatest weight to large systems, where the scope for large estimation errors is greatest. Models that minimize relative errors naturally give greatest weight to the most common types of systems. If large systems were also the most common, these two approaches would coincide. But software engineering data sets are generally skewed towards smaller systems, so there is a trade-off to be evaluated.

In this study, estimation models are built using a variety of fitness functions, and evaluated using a wide range of accuracy statistics. Data is sourced from the ISBSG data set. Genetic programming is used to derive the models, because it can work with arbitrary fitness functions.

The aim is to observe how the models compare across a range of fitness functions, and thus to gain some understanding of which fitness functions are generally most effective with software engineering data.

The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 gives an outline of genetic programming (GP) and its use in software estimation, and explains why it is a suitable vehicle for this study. Section 4 describes the accuracy statistics used here to evaluate models, and the fitness functions used to derive the models. Section 5 describes the research method used. Results are presented and discussed in Section 6. Section 7 discusses threats to the validity of this work. Conclusions and comments on possible future work are given in Section 8.

2. Related work

There are two types of work that are related to this study: investigations of the use of GP for software estimation, and studies concerning what might be termed the "infrastructure" of software estimation models.

Investigations of GP for software estimation are described in Section 3.2, following a description of GP.

The last few years have seen several studies related to the development and evaluation of software estimation models. Issues investigated include data imputation methods to handle missing data [3]; the number of repetitions needed to draw reasonable inferences [8]; the use of simulated data sets to investigate how different analysis methods influence the results [13, 16]; and what different accuracy statistics actually measure [9].

MMRE has received particular attention, since it is so commonly used as an accuracy statistic. Miyazaki noted that MMRE is lower for models that underestimate [12]. Kitchenham et al studied MMRE theoretically to identify what it really means, and noted that it is a measure of the spread of z = (estimate/actual) [9], rather than a measure of prediction accuracy. Foss et al used an example and simulations to show that MMRE can lead to counter-intuitive results [6], and concluded that it is unsafe to use MMRE as an accuracy statistic.

Some authors have commented on different fitness functions. It has been noted that the choice of fitness function affects the results [2, 11], and that the choice of fitness function will bias the search towards similar accuracy measures [11]. Dolado [5] observed that correlation coefficient is not effective as a fitness function, and preferred mean squared error; this is the only direct reference I am aware of concerning the relative merit of different functions.

3. Genetic programming

3.1. Outline of GP

Genetic programming is a technique for tackling optimization problems. A brief outline is presented here; for more information see [10].

The idea is based on the theory of evolution ("survival of the fittest"). Genetic operations on chromosomes lead to fitter individuals, which are more likely to survive. Over time, the population as a whole improves.

In this context, a "chromosome" is a symbolic expression that calculates one output value (an estimate of software project effort) from the values of some input parameters (measurements of various project characteristics). A population of individuals means a collection of different estimation functions, some of which are "fitter" (give better estimates) than others.

The GP process starts with the random generation of some estimation functions. Probably none of them is much good, but some will be fitter than others. They provide a starting point for evolution.

The fitness of an individual is evaluated using a "fitness function". A fitter individual is closer to the optimal solution — in this case, one that gives more accurate estimates of effort. The choice of fitness function is one of many control parameters to be set when preparing for a GP run.

A new generation is formed from an existing generation through the application of three genetic operations. Reproduction involves identifying fitter individuals from the current population, for inclusion in the next population. There are several ways in which they might be selected; tournament selection was used here. Crossover involves taking two chromosomes, snipping out a sub-expression from each one, and interchanging them. Mutation involves making a random change somewhere within a chromosome.

The aim is that over many generations, functions will evolve that estimate effort well. The process continues until the fitness of the best individual converges, or a specified number of generations is reached. The fittest individual in the final generation is taken as the best solution for that run. The whole process is normally repeated several times, using different seed values for the pseudo-random number generator.

Key issues in GP include defining the fitness function, defining the forms that the chromosomes can take, and maintaining enough diversity in the population to allow eventual convergence to a global optimum rather than settling on a local optimum. The first of these is the point of investigation here; the second is handled here by defining a grammar for chromosomes; the last depends largely on the settings of control parameters.
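To make the cycle concrete, the sketch below shows a toy version of this loop in Python. It is illustrative only: the study itself used the grammar-guided DCTG-GP system described later, and all names and settings here are invented for the example.

```python
import operator
import random

OPS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def random_tree(depth=3):
    # A chromosome is a nested tuple ('op', left, right), the input 'x',
    # or a numeric constant.
    if depth == 0 or random.random() < 0.3:
        return random.choice(['x', round(random.uniform(-10, 10), 2)])
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == 'x' else tree

def fitness(tree, data):
    # Fitness function: LAD here, but any error criterion can be swapped in.
    return sum(abs(evaluate(tree, x) - y) for x, y in data)

def tournament(pop, fits, k=3):
    # Reproduction: the fittest of k randomly chosen individuals survives.
    return pop[min(random.sample(range(len(pop)), k), key=lambda i: fits[i])]

def paths(tree, path=()):
    yield path
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, path + (i,))

def subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def graft(tree, path, new):
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = graft(parts[path[0]], path[1:], new)
    return tuple(parts)

def crossover(a, b):
    # Snip a random sub-expression out of b and splice it into a random point of a.
    return graft(a, random.choice(list(paths(a))),
                 subtree(b, random.choice(list(paths(b)))))

def mutate(tree):
    # Replace a random sub-expression with a freshly generated one.
    return graft(tree, random.choice(list(paths(tree))), random_tree(2))

def gp_run(data, pop_size=200, generations=50):
    pop = [random_tree() for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(t, data) for t in pop]
        nxt = []
        while len(nxt) < pop_size:
            r = random.random()
            if r < 0.1:
                nxt.append(crossover(tournament(pop, fits), tournament(pop, fits)))
            elif r < 0.2:
                nxt.append(mutate(tournament(pop, fits)))
            else:
                nxt.append(tournament(pop, fits))  # plain reproduction
        pop = nxt
    return min(pop, key=lambda t: fitness(t, data))
```

A run returns the fittest expression found; repeating runs with different seeds corresponds to the repeated-runs procedure described above.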
3.2. Application of GP in software estimation

Dolado [5] was the first to investigate the suitability of GP as a method for deriving software estimation models. He analyzed several small published data sets, finding models to estimate effort from a single independent variable, project size. Mean squared error was used as the fitness function, and MMRE and Pred(0.25) as the accuracy statistics.

Burgess and Lefley [2] studied Desharnais' data set of 81 projects, in which 9 independent variables were available for use. They minimized MMRE as a fitness function, and used several accuracy statistics. They found that GP performed well compared to other methods, and that it merited further investigation, particularly to explore the effects of various parameter settings.

Shan et al [15] studied 423 projects, from Release 7 of the ISBSG data set. MSE was used as both fitness function and accuracy statistic. In a small experiment, using separate training and testing sets, they found that GP out-performed OLS regression. They used Grammar Guided Genetic Programming [17], experimenting with different grammars to describe the allowed forms of the chromosomes. Writing a grammar supplies both a way to represent syntactical constraints on the chromosomes, and a way to incorporate background knowledge to guide the search process.

Lefley and Shepperd [11] studied the "Finnish data set" of 407 projects. Their fitness function was root average mean squared error. They considered several accuracy statistics. They used a holdout sample to evaluate the accuracy of their models, with chronological splitting as the basis for forming the training and evaluation sets: projects up to a given date formed the training set, and projects after that date the testing set.

Though these studies vary in data set size, independent variables, fitness function, accuracy statistics, and validation procedures, they all indicate that GP has promise as a method for finding estimation models.

A particular advantage of GP is its ability to explore a vast search space of possible equations [5]. Another advantage is that no assumptions need to be made about the distributions of the data; GP can make its own decisions about dealing with outliers and the selection of relevant attributes [2].

A challenge with GP is the need to determine appropriate settings for numerous control parameters; this can take considerable experimentation and may have a significant effect on the results [2]. The fitness function is one of those parameters. Once sensible settings are known for the other parameters, it is easy to experiment with various fitness functions — hence the adoption of GP in this research.

The point here is not to assess the suitability of GP as a vehicle for building estimation models, but rather to use it as a mechanism for creating models that are optimized towards different things.

4. Fitness functions and accuracy statistics

This section describes the accuracy statistics used to evaluate models in this study, and the fitness functions used to build the models.

4.1. Accuracy statistics

These accuracy statistics were used to evaluate estimation models in this study:

R²: The square of the coefficient of correlation between estimated and actual values.

MSE: Mean squared error — used as an accuracy statistic by [15].

StDev: Standard deviation of error. Recommended by [6] as an accuracy statistic; almost identical to root mean squared error, which was used as an accuracy statistic in [11].

MMRE: Mean magnitude of relative error. Widely used as an accuracy statistic since its description in [4]. Median MRE is also reported here.

MMER: Mean magnitude of error relative to the estimate. Proposed in [9], and argued to be intuitively preferable to MMRE, since at the time of estimation one wants to know what errors to expect relative to the estimate. Median MER is also reported here.

Mean z: Kitchenham et al [9] recommended that accuracy should be measured in terms of z = (estimate/actual). It is asymmetric and favours models that minimize overestimates (which could be a problem, since overestimates are usually less serious than underestimates). The full distribution of z should be considered when comparing prediction systems. Median z is also reported here, as well as the mean of 1/z = (actual/estimate).

Pred(l): The proportion of estimates that are within a given percentage of the actual value. Widely used as an accuracy statistic since its description in [4], particularly in conjunction with MMRE. Pred(0.25) and Pred(0.50) are reported here.

Means and medians of absolute errors are also reported.

Foss et al [6] recommend LSD and RSD as accuracy statistics. These are intended as estimates of the standard deviation of the error term. Neither could be used here. LSD involves logarithms of estimated values, so it cannot be computed for any model that generates negative estimates; several such models emerge in this study. RSD can only be used if estimates are based on a single predictor variable.
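For reference, the main statistics above can be stated compactly. Writing $y_i$ for the actual effort of project $i$, $\hat{y}_i$ for its estimated effort, and $n$ for the number of projects (notation introduced here for convenience):

$$\mathrm{MRE}_i = \frac{|y_i - \hat{y}_i|}{y_i}, \qquad \mathrm{MMRE} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{MRE}_i, \qquad \mathrm{MER}_i = \frac{|y_i - \hat{y}_i|}{\hat{y}_i},$$

$$z_i = \frac{\hat{y}_i}{y_i}, \qquad \mathrm{Pred}(l) = \frac{1}{n}\,\bigl|\{\, i : \mathrm{MRE}_i \le l \,\}\bigr|.$$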
4.2. Fitness functions

The fitness functions chosen for study were selected either because they are commonly used in building estimation models, or because they are commonly used in evaluating estimation models.

The functions chosen were:

• MSE: since it underpins ordinary regression, and is related to R² and StDev.

• LAD (least absolute deviation): an obvious alternative to ordinary regression, proposed centuries ago but overshadowed by least-squares regression because it was harder to compute and lacked theoretical underpinnings — both since addressed [1].

• MRE: because it is so common as an accuracy statistic, and has also been used elsewhere as a fitness function [2].

• MER and Z: since they are proposed as better alternatives to MRE.

• Pred(l): since it is so widely used as an accuracy statistic.

Writing a fitness function consists of specifying how the error term is calculated for a single data point, given its known and estimated values. The GP software used here always tries to minimize the overall fitness value, so the fitness function needs to be written so that smaller values are better.

Functions are easily written for differences between estimated and actual values:

MSE: error = (estimate - actual)^2

LAD: error = abs(estimate - actual)

MRE: error = abs((estimate - actual) / actual)

MER: error = abs((estimate - actual) / estimate)

Z is trickier, since its optimum value is 1, not zero. Computing z and subtracting 1 does not work, since negative values are possible and minimization will do the wrong thing. abs(z - 1) is no help, as that is MRE. As an approximation, the following function was used:

Z: error = max(estimate, actual) / min(estimate, actual)

This is always at least 1, and it makes sense to minimize it: the closer it is to 1, the closer the estimate and actual values are to each other.

Pred(l) is a blunt instrument. It is insensitive to the degree of inaccuracy of estimates that fall outside the desired range. Rather than simply optimizing Pred(l) without regard to the inaccurate estimates, functions were investigated that would focus first on maximizing Pred(l), but would resolve ties by considering a second criterion: MSE, LAD, or MRE.

As noted above, the GP software used here assumes that small fitness values are better than large values: the fitness function needs to be minimized, not maximized. Thus maximizing Pred(l) needs to be re-cast as minimizing the number of projects for which the absolute relative error is not ≤ l. This is done by assigning a very high penalty (10^10) to estimates where the error exceeds l. The penalty value overwhelms the second criterion; hence the first priority in optimization is to minimize the number of estimates that receive that penalty. Estimation equations that have the same Pred(l) value are distinguished by the second criterion.

The following functions were used:

Pred/LAD: error = abs(diff) + 10^10 if MRE > l
          error = abs(diff)         if MRE ≤ l

Pred/MSE: error = diff^2 + 10^10 if MRE > l
          error = diff^2          if MRE ≤ l

Pred/MRE: error = MRE + 10^10 if MRE > l
          error = MRE          if MRE ≤ l

where diff = estimate - actual and MRE = abs(diff)/actual.
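The same error terms are easy to state as code. The sketch below is a hypothetical Python rendering of the definitions above (the actual system computed them inside DCTG-GP):

```python
PENALTY = 1e10  # large penalty assigned to estimates outside the Pred(l) band

def err_mse(est, act):
    return (est - act) ** 2                  # MSE: squared deviation

def err_lad(est, act):
    return abs(est - act)                    # LAD: absolute deviation

def err_mre(est, act):
    return abs(est - act) / act              # MRE: error relative to actual

def err_mer(est, act):
    return abs(est - act) / est              # MER: error relative to estimate

def err_z(est, act):
    # Approximation to z: always >= 1, and equal to 1 for a perfect estimate,
    # so minimizing it pushes estimates towards the actual values.
    return max(est, act) / min(est, act)

def err_pred(est, act, l=0.25, second=err_lad):
    # Pred(l)/X: a huge penalty when the relative error exceeds l, plus a
    # second criterion (LAD, MSE or MRE) to break ties between models that
    # achieve the same Pred(l).
    return second(est, act) + (PENALTY if err_mre(est, act) > l else 0.0)
```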
5. Research method

5.1. Data set description

The analysis presented in this paper was based on software projects from Release 8 of the ISBSG data set (www.isbsg.org).

Release 8 had data on 2,027 projects, with over 50 data attributes recorded for each project. These attributes capture the nature of the project itself, the development environment and techniques used, and strengths and weaknesses perceived in individual projects. Most are known early in a project, which is important for estimation.

Not all of the 2,027 projects were appropriate for inclusion in our analysis; they were screened as follows (a code sketch of this screening follows the list):

• Projects were excluded if they were not assigned a high data quality rating (A or B) by ISBSG.

• Projects were excluded if their size was measured in anything except IFPUG function points.

• Projects were excluded if their effort values covered more than just the development team.

• Projects were excluded if they had any missing data on four key ratio-scaled attributes: size, effort, team size, and duration.

• Projects were excluded if their normalized effort differed from recorded effort by more than 50% (normalized effort is ISBSG's estimate of effort over an entire project, if reported effort only covers some phases of the life cycle).

• Projects were excluded if their size was below about 50 function points, or their effort below 100 hours. The experience of practitioners is that projects below these sizes are too variable to be included in data sets for effort estimation models.

• Projects were excluded if they were identified as outliers, or as high leverage points, on one or more of the ratio-scaled attributes. The rationale for this was to avoid the accuracy statistics being unduly influenced by a small number of extreme values. Outliers were identified using box plots; leverage points by the influence.measures function in the statistical language R.
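In code, the screening might look something like the following sketch. The column names here are hypothetical — ISBSG uses its own field names — and the outlier and leverage screening (done in R in this study) is omitted:

```python
import pandas as pd

df = pd.read_csv("isbsg_release8.csv")  # hypothetical extract of the data set

df = df[df["data_quality"].isin(["A", "B"])]        # high quality ratings only
df = df[df["count_approach"] == "IFPUG"]            # IFPUG function points only
df = df[df["effort_scope"] == "development team"]   # development-team effort only
df = df.dropna(subset=["size", "effort", "team_size", "duration"])
df = df[(df["normalized_effort"] - df["effort"]).abs() <= 0.5 * df["effort"]]
df = df[(df["size"] >= 50) & (df["effort"] >= 100)]  # drop very small projects
```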

Variable   Scale    Description
Effort     Ratio    Normalized project effort in person hours
Size       Ratio    Application size in IFPUG adjusted function points
Team size  Ratio    Maximum team size
Duration   Ratio    Project duration in elapsed calendar months
OrgType    Nominal  Organization type (eg Banking, Insurance, Manufacturing)
BusArea    Nominal  Business area to which the application contributes
AppType    Nominal  Application type (eg MIS)
DevType    Nominal  Type of development (eg New development, enhancement)
Platform   Nominal  Mainframe, midrange, PC
LangType   Nominal  Language type used (eg 3GL)
Language   Nominal  Specific main language used
Year       Ordinal  Year of project completion

Table 1. Variables used in this study

Variable   Min  Median  Mean  Max    StDev
Size       46   233     385   3120   435
Effort     107  1943    3506  22050  4115
Team size  1    5       6.85  43     6.20
Duration   1    4       7.58  30     5.02

Table 2. Summary of ratio-scaled variables

634 projects remained for study. These had high data quality ratings, consistent and comparable definitions, and were not at the extreme high or low ends of the range.

Twelve variables were retained for each project. They are described in Table 1. Normalized project effort is the dependent variable. The independent variables are the other three ratio-scale variables (size, team size, duration) and eight categorical variables (year of implementation, though a number from 1989 to 2002, is treated as an ordinal-scale value rather than a number on which arithmetic might be performed). The independent variables are those believed likely to have some explanatory value for project effort, and which did not have too many missing values.

Intuitively, one would expect some relationships to exist between size, team size, and duration. This could be a problem, because if "independent" variables are strongly correlated they should not be included together in statistical estimations. However, in this data set the correlations between the independent variables are not strong enough to be a problem (the highest is 0.38, between size and duration); they are all correlated much more strongly with the dependent variable. Thus it is reasonable to treat them as independent.

Summary statistics for the numeric variables are presented in Table 2.

Raw numeric values are used, not log-transformed values. None of the variables is normally distributed, as can be seen in Table 2, where the mean values are all well above the medians. This would be a problem for ordinary statistical regression, but is not a problem here since GP makes no assumptions about the distributions of the data.

5.2. Experimental design

The experiment took the form of a complete block design [7, 18]. There is one factor, with 8 different treatments: the fitness function, with the 8 functions being compared. There are 634 objects: the 634 projects. Each treatment is applied to each of the 634 objects; that is, an estimation model obtained by optimizing each of the fitness functions is applied to each one of the projects. Thirty repetitions are made for each treatment.

Applying every treatment to every object is inappropriate if there is any question of learning, whereby participants learn during earlier treatments and this affects later treatments. There is no issue of that here, when the objects are inanimate — software development projects. Randomization is not important in the experiment design for this reason.

Applying every treatment to every object, instead of using separate training and testing samples or a cross-validation process, would also be inappropriate if absolute estimates were sought for the performance of each treatment. The estimated accuracy would be too high. But in this situation we seek relative comparisons, and the same situation applies for every treatment, so the complete block design seems appropriate.

5.3. Experimental method

The GP software used was DCTG-GP [14]. Each separate GP run generated one estimation model, based on one fitness function.

The parameters used for each run are given in Table 3. The parameter values were chosen on the advice of experienced colleagues; where possible they match the settings used in [2, 11].

Parameter                   Value
Population size             1000
Number of generations       500
Number of runs              30
Maximum initial tree depth  5
Maximum tree depth          9
Probability of crossover    0.1
Probability of mutation     0.1
Tournament size             3

Table 3. Parameters used for GP system

Thirty repetitions were made for each of the 8 treatments. In other words, thirty GP runs were made with each of the 8 fitness functions, making 240 runs in total.

For each of the eight fitness functions, the estimated effort for each project was calculated as the average of the 30 individual estimates for that project from the 30 different estimation models built using that fitness function. Eight vectors result, each of 634 elements, containing the overall estimated effort for each project from each fitness function.

Each vector was compared with the vector of actual effort values, and a variety of accuracy statistics calculated; this aggregation and comparison step is sketched below.

Using the individual estimation models, analysis of variance was used to identify significant differences between the accuracy from each fitness function on each accuracy statistic.
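A compact sketch of the aggregation and comparison, using synthetic stand-in numbers rather than the real GP outputs (the array shapes and names are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
actual = rng.lognormal(7.5, 1.0, size=634)  # stand-in for the actual effort vector

# Stand-in for the 30 runs per fitness function: one 30 x 634 array each.
estimates = {f: actual * rng.lognormal(0.0, s, size=(30, 634))
             for f, s in [("MSE", 0.6), ("LAD", 0.5)]}

# One overall estimate per project: the mean of that project's 30 run estimates.
overall = {f: runs.mean(axis=0) for f, runs in estimates.items()}

for f, est in overall.items():
    mre = np.abs(est - actual) / actual
    print(f, f"MMRE={mre.mean():.3f}", f"Pred(0.25)={(mre <= 0.25).mean():.3f}")
```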

6. Results and discussion

6.1. Average performance

Table 4 summarizes the actual and estimated effort values. Table 5 presents the accuracy obtained with each fitness function on each accuracy statistic.

From Table 4 we see that the distribution of effort estimates from each fitness function broadly reflects the distribution of actual effort.

MER and MRE are effectively reciprocals, so inevitably they work in opposition. When MRE is used as the fitness function, MMRE is best and MMER worst, and vice versa. MRE tends to underestimate significantly (as noted by [2] and [12]), and MER to overestimate significantly. This is at least partly an artefact of their definitions. One would hope that the goal of minimizing MER is achieved by minimizing the numerator (the difference between the estimated and actual values). But the denominator is also relevant. Underestimates produce smaller denominators than overestimates, so MER penalizes underestimates and favours overestimates. In the case of MRE, MRE is capped at 1 if the estimate is too low, but is unbounded for an overestimate, so MRE is biased towards underestimates. (For example, with an actual effort of 1000 hours, an estimate of 500 gives MRE 0.5 but MER 1.0, while an estimate of 2000 gives MRE 1.0 but MER 0.5 — MER prefers the overestimate despite its larger absolute error.)

The summary values for the Pred(l) fitness functions look similar to those from the other fitness functions, and the performance of those functions appears quite reasonable in Table 5, but this is more good luck than genuine accuracy. The generated estimation models include some that grossly underestimate effort for the 63% or so of projects that do not fall within the 25% target, and some that grossly overestimate effort. It happens that the number of underestimates roughly balances the number of overestimates. Attempting to optimize Pred(l) produces unstable results.

Table 5 shows some expected behaviour: R², MSE and standard deviation of error are all best when MSE is the fitness function. MMRE is best when MRE is the fitness function. MMER is best when MER is the fitness function. Mean z is best when an approximation to z is the fitness function.

This does not apply to the fitness functions based on Pred(0.25). Other fitness functions, particularly LAD and Z, do better. This may be because Pred(l) is a step function rather than a continuous function, which makes it harder to evolve an optimum solution. The output from the individual GP runs suggests that convergence on a local optimum often happened early, and the system was unable to move away from it to a global optimum. Combining this with the observation above that results from the Pred(l) fitness functions were unstable, it seems that Pred(l) should not be used as a fitness function.

The first row of Table 5 supports Dolado's observation that correlation coefficient is not an effective fitness function [5]. There is very little difference between any of the R² values, even when MSE, StDev, and absolute errors vary by large amounts.

There is nothing to choose between any of the fitness functions when accuracy is measured in terms of Median MER and Median MRE (except for MER penalizing MRE and vice versa).

There is little to choose between the fitness functions when Pred(0.25) and Pred(0.50) are used to measure accuracy. No model does well, and all achieve roughly the same low level of poor performance.

LAD and Z do well across the board. They do almost as well as MSE on accuracy measures based on squared errors. They have the best absolute errors, and their z values are closest to 1.0. Their values for MRE are not as good as for the function designed to minimize MRE, but they do notably better than the MSE fitness function.

It appears that improvements to MSE occur at the expense of MRE, and vice versa. This is not surprising in software engineering data, where the most common projects are not the large projects.

Comparing the absolute fitness functions (MSE and LAD) with the relative fitness functions (MER and MRE), on almost every accuracy statistic the absolute fitness functions perform as well or better. The exceptions are the accuracy statistics for which the relative fitness functions were specifically designed. This suggests that absolute fitness functions should be preferred to relative fitness functions — a nice outcome, given that regression using MSE and LAD is easily done with statistical tools nowadays.

Another explanation could be merely that GP is better at minimizing direct values than ratios. There is certainly more variability in the models built with MRE and MER, and more frequent evidence of premature convergence to the wrong optimum, so perhaps the relative fitness functions are being judged unfairly. More experimentation is needed to understand GP's behaviour with these fitness functions.

Accuracy statistic  MSE    LAD    MRE    MER    Z      P(.25)/MSE  P(.25)/LAD  P(.25)/MRE
Adjusted R²         0.681  0.679  0.653  0.650  0.662  0.655       0.651       0.657
MSE (x 10^6)        5.388  5.602  8.243  8.808  6.040  5.584       6.134       5.846
StDev of error      2323   2369   2873   2936   2460   2421        2479        2420
Mean abs error      1413   1368   1596   1694   1385   1410        1435        1416
Median abs error    752    640    708    870    620    691         698         711
Mean z              1.489  1.287  0.863  1.753  1.215  1.424       1.493       1.432
Median z            1.157  1.019  0.684  1.344  0.956  1.117       1.154       1.123
Mean 1/z            0.976  1.115  1.676  0.817  1.174  1.012       0.969       1.011
Mean MRE            0.702  0.580  0.466  0.877  0.545  0.655       0.702       0.665
Median MRE          0.354  0.343  0.406  0.422  0.347  0.353       0.352       0.354
Mean MER            0.431  0.470  0.832  0.401  0.495  0.438       0.423       0.439
Median MER          0.352  0.351  0.532  0.356  0.356  0.352       0.345       0.348
Pred(0.25)          0.383  0.371  0.282  0.341  0.361  0.363       0.371       0.360
Pred(0.50)          0.610  0.663  0.640  0.555  0.686  0.633       0.612       0.631

Table 5. Accuracy statistics

               MSE    LAD    MRE    MER    Z      P(.25)/MSE  P(.25)/LAD  P(.25)/MRE  Actual effort
Minimum        308    190    178    318    223    261         323         347         107
1st quartile   1253   1080   681    1476   1017   1253        1299        1282        917
Median         2298   1987   1352   2659   1894   2220        2253        2194        1943
3rd quartile   4426   3962   2783   5126   3734   4191        4258        4101        4115
Maximum        25220  24120  19510  22990  24560  24050       28740       23760       22050
Mean           3530   3159   2250   4347   3103   3364        3599        3337        3506
StDev          3369   3136   2428   4730   3052   3208        3810        3174        4115

Table 4. Effort estimates (hours) with each fitness function

6.2. Statistical significance

The observations above apply to the overall average model found from the 30 repetitions for each fitness function. To test the statistical significance of the differences, analysis of variance was used. For example, MMRE was computed separately for 150 estimation models (models built with Pred(l) were excluded from this stage of the analysis), and ANOVA was used to test the hypothesis that fitness function has no effect on MMRE; this approach was taken for all 14 accuracy statistics. In every case the null hypothesis was decisively rejected (p < 10^-11). A sketch of this test appears at the end of this subsection.

Multiple comparisons were then made to identify which fitness functions produced results that differed significantly (0.05 was chosen as the level of significance). The results are summarized in Table 6. In each row, the five fitness functions are listed in order of performance, from best to worst; functions whose performance is not significantly different at the 0.05 level are bracketed together.

As noted in the observations based on overall averages, the performance of the models found with MRE and MER as fitness functions is universally worst, except on the accuracy statistics for which they were specifically designed. MRE and MER may have their place as accuracy statistics, but not as fitness functions. It appears clear that fitness functions based on MSE, LAD and Z perform better across a wide range of accuracy statistics.

Considering only MSE, LAD, and Z, look again at Table 6. The rows are ordered in decreasing order of how much difference there is between the performance of the three fitness functions. For mean MRE the difference between the average performance of Z (best of the 3) and MSE (worst of the 3) is 0.16. In a range from 0.55 to 0.71, that is a variation of 22%. At the other end of the table the difference between the average performance of LAD and MSE (best) and Z (worst) represents a variation of about 4%.

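As a concrete illustration of the ANOVA step above: a minimal sketch using scipy's one-way ANOVA, with synthetic per-model MMRE values whose group means loosely echo the Mean MRE row of Table 5 (the spread and random seed are invented for the example):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# 30 models per fitness function; means follow the Mean MRE row of Table 5.
groups = [rng.normal(mu, 0.05, size=30)
          for mu in (0.70, 0.58, 0.47, 0.88, 0.55)]  # MSE, LAD, MRE, MER, Z

stat, p = f_oneway(*groups)
print(f"F = {stat:.1f}, p = {p:.2e}")  # a tiny p-value rejects 'no effect'
```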
Accuracy statistic  Performance of fitness functions (best to worst)
Mean MRE            MRE, Z, LAD, MSE, MER
Median abs error    Z, LAD, MRE, MSE, MER
Mean z              Z, LAD, MSE, MRE, MER
Mean MER            MER, MSE, LAD, Z, MRE
Median z            LAD, Z, MSE, MRE, MER
Mean 1/z            MSE, LAD, Z, MER, MRE
MSE (x 10^6)        MSE, LAD, Z, MRE, MER
Pred(0.50)          Z, LAD, MRE, MSE, MER
StDev of error      MSE, LAD, Z, MRE, MER
Median MRE          MSE, LAD, Z, MRE, MER
Adjusted R²         LAD, MSE, MRE, Z, MER
Mean abs error      LAD, MSE, Z, MRE, MER
Median MER          LAD, MSE, Z, MER, MRE
Pred(0.25)          LAD, MSE, Z, MER, MRE

Table 6. Performance of fitness functions

This ordering suggests two things. One is a rough indication of which accuracy statistics provide real discrimination. The variation is over 10% down to MSE, a bit below 10% for Pred(0.5) and StDev, and 5% or less for the rest.

The other is that Z looks like the best of MSE, LAD and Z on accuracy statistics with large variation, and the worst of the three on accuracy statistics with small variation. In other words, when it outperforms other fitness functions it is by a large margin, but when it loses it is not by much. This implies that Z might be the best fitness function to use for building estimation models, as well as being argued [9] to be the best statistic to use for assessing models.

6.3. Practical implications

It is all very well to suspect that certain fitness functions are better than others, but you need to be able to apply that knowledge in practice to build useable estimation models.

One approach is to use GP as outlined here to generate a collection of estimation models from past data, using your preferred fitness function. These models can be applied to a new project, giving a set of estimates on which an overall estimate can be based. You would have no way of knowing whether the particular collection of models happened to be a very good, or very bad, or quite typical set. However, inspection of the output from GP runs based on MSE, LAD, and Z showed the results to be very stable, so these fitness functions should give trustworthy behaviour. The best and worst performed models built using these fitness functions were very close to the average on most accuracy statistics (in fact, in many cases the average of 30 models performs better than the best single model, as errors cancel each other out). GP runs based on MRE, MER and Pred(l) were less stable.

Any fitness function can be used like this if a GP system is available. If one is not, some algorithm is required enabling an estimation model to be built directly instead of by search. In that light, MSE and LAD are more useful than Z, since software is readily available to perform regressions based on MSE and LAD.

The nature of software engineering data sets is that there are usually many more small projects than large ones. MSE gives greater weight to large projects, where there is more scope for large estimation errors. By not squaring the error term, LAD gives less weight to large projects and relatively more weight to smaller projects. Perhaps this makes it more suitable than MSE for building estimation models from software engineering data sets, and explains its better performance on most accuracy statistics. On the same basis, generalized linear models, such as Huber's glm in R and S-Plus, may be worth investigating.
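This weighting effect is easy to demonstrate. In the sketch below (synthetic data, with the LAD fit approximated by iteratively reweighted least squares), the least-squares line is pulled towards the few large projects, while the least-absolute-deviations line stays with the bulk of small ones:

```python
import numpy as np

rng = np.random.default_rng(2)
size = rng.lognormal(5.5, 0.9, 500)               # many small projects, a few large
effort = 8 * size * rng.lognormal(0.0, 0.6, 500)  # noisy, right-skewed relation

X = np.column_stack([np.ones_like(size), size])
beta_ols = np.linalg.lstsq(X, effort, rcond=None)[0]

beta_lad = beta_ols.copy()
for _ in range(50):  # IRLS: reweight by 1/|residual| to approximate the L1 fit
    w = np.sqrt(1.0 / np.maximum(np.abs(effort - X @ beta_lad), 1e-8))
    beta_lad = np.linalg.lstsq(X * w[:, None], effort * w, rcond=None)[0]

print("OLS slope:", beta_ols[1], "  LAD slope:", beta_lad[1])
```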
ror term, LAD gives less weight to large projects and rela-
tively more weight to smaller projects. Perhaps this makes
it more suitable than MSE for building estimation models
three fitness functions. For mean MRE the difference be- from software engineering data sets, and explains its better
tween the average performance of Z (best of the 3) and MSE performance on most accuracy statistics. On the same ba-
(worst of the 3) is 0.16. In a range from 0.55 to 0.71, that sis, generalized linear models, such as Huber’s glm in R
is a variation of 22%. At the other end of the table the dif- and S-Plus, may be worth investigating.
ference between the average performance of LAD and MSE Naturally, though, each user needs to decide which ac-
(best) and Z (worst) represents a variation of about 4%. curacy statistics matter most to them in their particular sit-
This ordering suggests two things. One is a rough indi- uation, which might lead to another fitness function being
cation of which accuracy statistics provide real discrimina- preferred.
tion. The variation is over 10% down to MSE, a bit below
10% for Pred(0.5) and StDev, and 5% or less for the rest. 6.4. Not surprising?
The other is that Z looks like the best of MSE, LAD and
Z on accuracy statistics with large variation, and the worst In one sense it is not surprising that LAD outperforms
of the three on accuracy statistics with small variation. In MSE as a fitness function on heavily skewed data. Statis-
other words, when it outperforms other fitness functions it ticians note that “The criterion of minimizing the sum of
is by a large margin, but when it loses it is not by much. absolute deviations is preferable to that of least squares in
This implies that Z might be the best fitness function to use the presence of large disturbances (outliers) or . . . heavy
for building estimation models, as well as being argued [9] tails” [19]. The reason is that confidence intervals for the
to be the best statistic to use for assessing models. regression coefficients are smaller.
That reason is not directly relevant to a software esti-
6.3. Practical implications mator, who cares really about the performance of the equa-
tion on various accuracy statistics. We have provided ex-
It is all very well to suspect that certain fitness functions perimental evidence that the theoretical advantage of LAD
are better than others, but you need to be able to apply that translates to better performance on most of the accuracy
knowledge in practice to build useable estimation models. statistics commonly used in software estimation.
One approach is to use GP as outlined here to generate a
collection of estimation models from past data, using your 7. Threats to validity
preferred fitness function. These models can be applied to
a new project, giving a set of estimates on which an overall ISBSG data is a convenience sample, not necessarily rep-
estimate can be based. You would have no way of knowing resentative of software engineering data in general. This
whether the particular collection of models happened to be may mean these findings do not generalize. But it is typical
a very good, or very bad, or quite typical set. However, of software engineering data in its skewed nature, and it is
inspection of the output from GP runs based on MSE, LAD, that aspect of the data that is most important here, so that
and Z showed the results to be very stable, so these fitness risk should not be a problem.
functions should give trustworthy behaviour. The best and A greater problem could be the reliability of some of the
worst performed models built using these fitness functions data values themselves. In particular, normalized effort is

A greater problem could be the reliability of some of the data values themselves. In particular, normalized effort is an estimate of full life-cycle effort, for projects that do not include all phases. Where normalisation is done, it is an estimate. Where it is not done, it may not mean that all phases are covered: it might mean that no information about phases was available. Thus there is some uncertainty in the effort values used. One could choose to use only projects for which full phase data was available in the first place; I chose not to do that because it leaves a much smaller data set. Projects whose effort was changed the most were excluded from the data set; perhaps a smaller threshold should be used to improve the reliability of the effort data. This can only be determined by further experimentation.

Different settings of the GP control parameters can influence behaviour. Other settings might lead to different models being found. However, it is likely that this would affect speed of convergence, rather than accuracy. 500 generations were used here; the accuracy gain after about 150-200 generations was generally small, so I believe that convergence is reasonable and the risk here is low. Sometimes, particularly with the relative fitness functions, it seemed that GP was unable to escape from a local optimum. This behaviour may be improved with more experimentation with control parameters.

GP was used here to find non-linear models, using untransformed data. But the usual approach to regression with software engineering data is to use log-transforms to obtain normally distributed data, and then to derive linear models. We may see different behaviour if the fitness functions studied here were applied to log-transformed data. This is an area for future work.

Using the same sample (the complete data set) for building and evaluating the models could affect the reliability of the accuracy statistics. It is possible that models are over-fitted to the data. In particular, this could affect Section 6.2, which drew inferences from the actual differences observed between model performances. But all 8 treatments should be equally affected if that is the case, so this should not invalidate the results.

8. Conclusions

Building a model from data involves minimizing some criterion (fitness function). When building estimation models from software engineering data sets, this has traditionally been done by least-squares regression, which minimizes MSE — even though MSE is unrelated to most of the accuracy statistics used to assess the models.

We have investigated the performance of models built using a variety of fitness functions, across a range of commonly used accuracy statistics. The models were evolved using genetic programming, using data from the ISBSG data set. Untransformed data was used directly.

Fitness functions based on Pred(l) should not be used. They are too unstable. Pred(l) is insensitive to how bad the errors are in estimates that fall outside the target accuracy, so they run the risk of generating hugely inaccurate estimates.

Fitness functions based directly on differences between estimates and actual values, rather than on relative errors, appear to perform better across a range of accuracy statistics. The exception is a fitness function based on an approximation to z, the ratio of estimated to actual effort; in many respects it was the best-performed fitness function of all. Its disadvantage is that no algorithm is available to generate directly an estimation model that optimizes this statistic.

This leaves MSE and LAD as two fitness functions that can be easily used in practice and perform well across several accuracy statistics. In software engineering data sets, where small projects generally outnumber large ones, this research suggests that LAD generally out-performs MSE.

One direction for future work is to extend this work to look at log-transformed data, as is common with software engineering models. Another is to use separate subsets for building and evaluating models, to get more reliable estimates of actual differences between their performances.

Acknowledgements

I would like to thank ISBSG for the use of its data for this research. Bob McKay and Yin Shan helped with writing the grammar, setting up the GP system and setting control parameters. The anonymous reviewers provided valuable comments and suggestions.

References

[1] D. Birkes and Y. Dodge. Alternative Methods of Regression. Wiley, 1993.

[2] C. Burgess and M. Lefley. Can genetic programming improve software effort estimation? A comparative evaluation. Information and Software Technology, 43(14):863-873, December 2001.

[3] M. Cartwright, M. Shepperd, and Q. Song. Dealing with missing software project data. In Proceedings of 9th International Software Metrics Symposium, pages 154-165. IEEE, September 2003.

[4] S. Conte, H. Dunsmore, and V. Shen. Software Engineering Metrics and Models. Benjamin-Cummings, 1986.

[5] J. Dolado. On the problem of the software cost function. Information and Software Technology, 43(1):61-72, January 2001.

[6] T. Foss, E. Stensrud, B. Kitchenham, and I. Myrtveit. A simulation study of the model evaluation criterion MMRE. IEEE Transactions on Software Engineering, 29(11):985-995, November 2003.

[7] N. Juristo and A. Moreno. Basics of Software Engineering Experimentation. Kluwer, 2001.

[8] C. Kirsopp and M. Shepperd. Making inferences with small
numbers of training sets. IEE Proceedings – Software,
149(5):123–130, October 2002.
[9] B. Kitchenham, L. Pickard, and S. MacDonell. What accu-
racy statistics really measure. IEE Proceedings – Software,
148(3):81–85, June 2001.
[10] J. Koza. Genetic Programming: On the Programming of
Computers by Means of Natural Selection. MIT Press, Cam-
bridge MA, USA, 1992.
[11] M. Lefley and M. Shepperd. Using genetic programming
to improve software effort estimation based on general data
sets. In Proceedings of GECCO 2003, volume 2724 of
LNCS, pages 2477–2487. Springer-Verlag, September 2003.
[12] Y. Miyazaki, M. Terakado, K. Ozaki, and H. Nozaki. Ro-
bust regression for developing software estimation models.
Journal of Systems and Software, 27(1):3–16, October 1994.
[13] L. Pickard, B. Kitchenham, and S. Linkman. Using sim-
ulated data sets to compare data analysis techniques used
for software cost modelling. IEE Proceedings – Software,
148(6):165–174, December 2001.
[14] B. Ross. Logic-based genetic programming with definite
clause translation grammars. New Generation Computing,
19(4):313–337, 2001.
[15] Y. Shan, R. McKay, C. Lokan, and D. Essam. Software
project effort estimation using genetic programming. In
Proceedings of ICCCAS’02 (International Conference on
Communications, Circuits and Systems), pages 1108–1112,
Chengdu, China, July 2002.
[16] M. Shepperd and G. Kadoda. Using simulation to evaluate
prediction techniques. In Proceedings of 7th International
Software Metrics Symposium, pages 349–358. IEEE, 2001.
[17] P. Whigham. Grammatically-based genetic programming.
In J. Rosca, editor, Proceedings of the Workshop on Ge-
netic Programming: From Theory to Real-World Applica-
tions, pages 33–41, Tahoe City, California, USA, July 1995.
[18] C. Wohlin, P. Runeson, M. Höst, M. Ohlsson, B. Regnell,
and A. Wesslén. Experimentation in Software Engineering:
an Introduction. Kluwer, 2000.
[19] S. Zanakis and J. Rustagi. Introduction to contributions in
regression and correlations. In S. Zanakis and J. Rustagi, ed-
itors, Optimization in Statistics, pages 7–9. North-Holland,
1982.
