PRED MMRE Study ESEM2008 PDF
Daniel Port, University of Hawai'i System
and hopefully expanded. Our approach, however, is not specific to any particular dataset or model.

The current work aims to provide a better understanding of MMRE and PRED as estimates of model accuracy and to increase confidence in their application. We would like to make it clear upfront that the current work is an empirical study for mainly this purpose and makes no claim that either is a superior criterion to other measures. If we were to take a position, it would be to advocate the use of the more standard mean squared error (MSE) [2] or a maximum likelihood estimator [22] over any of the criteria suggested in the literature. Neither appears to have been considered to date; however, this is not the focus of the current work and these will not be discussed further in this paper.

This paper is organized as follows: a discussion of related works; then a description of models and datasets used; definitions for the criteria (MMRE, PRED) and some of their analytic characteristics as statistical estimators; use of standard error for these accuracy estimators; research methodology; replicated studies of MMRE extended to PRED; and lastly a discussion of empirical characteristics of MMRE and PRED. We purposely have not provided a conclusion, as the intent is to provide a deeper and more detailed understanding of MMRE and PRED, and not to criticize or advocate their use.

2. RELATED WORK
The available resources and literature on cost estimation research can be overwhelming. There exist a relatively large number of empirically based estimation methods. Non-model-based methods (e.g. "expert judgment") usually do not play an important role in the empirical literature. Generally such methods do not output point estimation data applicable to accuracy criteria (there are some research efforts that are the exception). Still, they are widely practiced intuitive methods used frequently in organizations where a model-based approach would be too cumbersome or sufficient model-calibration data is unavailable. Model-based methods can be split into generic-model-based methods (e.g. COCOMO, SLIM, etc.) and domain-specific model-generation methods such as CART or Stepwise ANOVA.

Besides the variety of cost estimation methods, there is a large diversity of studies on the topic - some on evaluation of cost estimation in different contexts, some assessing current practices in the software industry, others focusing on calibration of cost estimation models. See the Encyclopedia of Software Engineering [29] for an overview of cost estimation techniques as well as cost estimation studies. Also [18,30] list current studies on software cost estimation.

As mentioned in the introduction, unlike with MMRE, no detailed study could be found on the nature and efficacy of PRED as a software cost estimation model criterion. In spite of this, PRED is a frequently used criterion, as is evidenced by summations of model performances in [3].

3. COCOMO, DATASETS USED
In this study, we will use COCOMO, the Constructive Cost Model [11] since, unlike other models such as PriceS, SLIM, and SEERSEM, it is an open model with substantial published project data sets. All details for COCOMO are published in the text "Software Engineering Economics" [11]. There are several versions of the COCOMO model, the two most prevalent being COCOMO I and COCOMO II. The one we use here (COCOMO I) was chosen based on the publicly available COCOMO data within the PROMISE repository [16]. Here we study variations of the classic COCOMO I model to exemplify our points and methods, and to enable straightforward duplication and verification of our results and claims. However, our methods are not limited to such models, and it will be evident that COCOMO I is fully exchangeable with other, and perhaps better, cost estimation models. The intent here is to define a set of experiments and examples that others may replicate in order to refute or improve on our results and methods. The particular datasets and cost models used here are simply a convenience.

Boehm's Post-Architecture version of COCOMO I:

    effort = a * (∏_{j=1}^{15} EM_j^{a_j}) * (size)^b * ω    (1)

Here, the EM_j are "effort multiplier" parameters whose values are chosen based on a project's characteristics, and a_j, a, b are domain-specific "calibration" coefficients, either given as specified by Boehm in [11] or determined statistically (generally via ordinary least squares regression) using historical project data. The dependent variable size, expressed either in KSLOC (thousand source lines of code) or in FP (function points), is estimated directly or computed from a function point analysis. The model error ω is a random variable with distribution D (not generally Normal). Model accuracy measures are estimating one or more parameters of D.

Table 1 shows the six COCOMO I model variations used in this work and their brief descriptions.

Table 1. COCOMO I model variations used in study

    Model            a,b    EMj          aj
    (A) ln_LSR_CAT   CLSR   categorical  CLSR
    (B) aSb          given  none         none
    (C) given_EM     given  given        none
    (D) ln_LSR_aSb   OLS    none         none
    (E) ln_LSR_EM    OLS    given        OLS
    (F) LSR_a+bS     OLS    none         none

The table entries are interpreted in the following way:

OLS: Ordinary Least Squares regression was used with the given project data set to determine the parameter values.
CLSR: Categorical Least Squares regression was used with the given project data set to determine the parameter values.
Given: The values of these parameters are given in [3] and not derived statistically from the data set.
Categorical: The values of these parameters are considered non-numerical, non-ordinal categories (e.g. the implied order of the "L" "N" "H" "XH" values for effort multipliers is ignored).
None: The parameters are not used in this model.

Models (A) - (E) use the functional form of the general COCOMO I model given in equation (1) above; however, model (F) uses a simple linear a+b*(size) form. When there is a "ln_" in the model name, the applicable project data was transformed by taking the natural logarithm (ln) for the analysis. All values were back-transformed when used in the model and model calculations (e.g. calculating MMRE's).

An example reading of the table for model (B) states that it is the general COCOMO I model without using any effort multipliers
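For concreteness, equation (1) can be evaluated directly once the coefficients and effort-multiplier ratings are fixed. The values below are invented for illustration (they are not Boehm's published calibrations), and the per-multiplier exponents a_j and the error term ω are omitted for simplicity:

```python
import math

def cocomo_effort(a, b, size_ksloc, effort_multipliers):
    # effort = a * (product of EM_j) * size^b -- equation (1), with the
    # per-multiplier exponents a_j and the random error omega left out.
    em_product = math.prod(effort_multipliers)
    return a * em_product * size_ksloc ** b

# Hypothetical calibration: a=2.8, b=1.05, two non-nominal multipliers.
effort = cocomo_effort(2.8, 1.05, 50.0, [1.15, 0.91])
print(round(effort, 1))
```

Calibration in the "ln_" variants amounts to fitting the logarithm of this equation, which is linear in ln a, b, and the a_j.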
(and hence no calibration coefficients), and the values of the parameters a, b are taken from the values given in [3]. No regression is performed on the data set.

The historical data used for estimating the coefficients are taken from the COCOMO81 [25], COCOMONASA [26], NASA93 [27] and Desharnais [28] PROMISE repository data sets. We note that in the course of this work we contributed numerous corrections and clarifications for these data sets. Simulated data is based on [9] and [2] and we direct the reader to these sources for details on its applicability and construction.

4. ACCURACY ESTIMATORS
The field of cost estimation research suffers a lack of clarity about the interpretation of model evaluation criteria. In particular, for model accuracy, various indicators of accuracy - both relative and absolute - have been introduced throughout the cost estimation literature, for example mean squared error (MSE), absolute residuals (AR) or balanced residual error (BRE). Our literature review indicated that the most commonly used, by far, are the "mean magnitude relative error" or MMRE, and "percentage relative error deviation within x" or PRED(x). Of these two, the MMRE is the most widely used. Both are based on the same basic unit-less value of magnitude of relative error (MRE), which is defined as

    MRE_i = |y_i − ŷ_i| / y_i ,    (2)

where y_i is the actual effort and ŷ_i is the estimated effort for project i. It is argued that MRE is useful because it does not over-penalize large projects and it is unit-less (i.e. scale independent). MMRE is defined as the sample average of the MRE's:

    MMRE = (1/N) ∑_{i=1}^{N} MRE_i    (3)

To be more precise we should label (3) as MMRE_N to indicate that it is a sample statistic on N data points. To be consistent with customary usage we will drop the subscript when there is no confusion as to the number of data points used. Conte et al. [14] consider MMRE ≤ .25 an acceptable level of performance for effort prediction models.

PRED(x) [15] is defined as the average fraction of the MRE's off by no more than x:

    PRED(x) = (1/N) ∑_{i=1}^{N} 1{MRE_i ≤ x}    (4)

where the indicator 1{MRE_i ≤ x} is 1 if MRE_i ≤ x and 0 otherwise. Typically PRED(.25) is used, but some studies also look at PRED(.3) with little difference in results. Generally PRED(.3) ≥ .75 is considered an acceptable model accuracy. There is some concern about what constitutes an appropriate value of x for PRED(x): clearly, the larger x is, the less information and confidence we have in an accuracy estimate. However interesting this question, we are interested in comparisons of PRED with MMRE and not the specific application of these measures. Note that, inversely to MMRE, high PRED values are desirable. This is easily reversed to match MMRE by simply switching the 0-1 values in (4) if desired. This inverse relationship should be kept in mind when viewing our side-by-side comparisons.

Although MMRE and PRED are still today the de facto standards for cost model accuracy measurement, they don't specifically measure accuracy. In fact, technically they are "estimators" of a function of the parameters related to the distribution of the MRE values. This in turn is presumably related to the error distribution of the model. As such, we will frequently refer to these as "accuracy indicators" rather than measures when it is more appropriate. Several studies have noted that MRE distributions are essentially related to the simpler distribution of the values

    z_i = ŷ_i / y_i ,    (5)

which are clearly related to the distribution of the error residuals ε_i = y_i − ŷ_i, but in a non-trivial way [17].

Kitchenham et al. report that MMRE and PRED are directly related to measures of the spread and the kurtosis of the distribution of z_i values [17], a fact of uncertain utility, but notable. A useful fact that follows easily from the weak law of large numbers [22] is that both MMRE and PRED are consistent estimators (i.e. they converge in probability to some parameter of the distribution) [22]. This provides a meaningful and precise interpretation of "accuracy measure" when they are viewed as estimators for the error distribution D for a cost model and dataset. This, however, does not say anything about how good they are as estimators (e.g. bias, uniform convergence, rate of convergence, MSE, variance, etc.). Indeed, there is substantial research that addresses the quality of these estimators (such as [9] and [10]), although not expressed or analyzed in the more standard statistical framework we use here.

There has been a degree of debate regarding the efficacy of MMRE and, to a lesser extent, PRED as accuracy measures, yet one thing is clear - they are both statistics for a sample of the MRE's, and not the entire population, and therefore they are subject to standard error (SE). The SE must be accounted for if one is to have confidence in their application.

5. STANDARD ERROR OF ACCURACY ESTIMATORS
One of Boehm's original motivations for creating COCOMO was to increase the confidence managers have when estimating software projects. Curiously, despite this original motivation for COCOMO, very little has been reported on the confidence in accuracy measures for COCOMO estimates. This has led to a surprising number of contradictory results in the theory and practice of cost estimation.

The primary concern is that COCOMO models (and more generally all cost estimation models) are "calibrated" with a relatively small amount of data (which is frequently biased or "sanitized"). Various measures such as PRED(.25)=.5 and MMRE=.35 are plainly presented stating just how "good" one should feel about the model's accuracy and predictive capabilities. The reality is that these values only reflect the model accuracy for the data they were calibrated on. There is a serious question about our "confidence" in these measures for predicted values. Providing a standard error for these measures, and a clearer understanding of what this implies (e.g. how much is bad?), is key to addressing the confidence question. For example, if we understand the standard error from the calibration
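Equations (2)-(4) are straightforward to compute directly. As a minimal Python sketch (the effort values below are hypothetical, chosen only to exercise the definitions):

```python
# MRE, MMRE, and PRED(x) as defined in equations (2)-(4).

def mre(actual, estimated):
    # MRE_i = |y_i - yhat_i| / y_i  -- unit-less relative error
    return abs(actual - estimated) / actual

def mmre(actuals, estimates):
    # MMRE = sample average of the MRE's (equation 3)
    mres = [mre(y, yh) for y, yh in zip(actuals, estimates)]
    return sum(mres) / len(mres)

def pred(x, actuals, estimates):
    # PRED(x) = fraction of projects with MRE_i <= x (equation 4)
    mres = [mre(y, yh) for y, yh in zip(actuals, estimates)]
    return sum(1 for m in mres if m <= x) / len(mres)

actual_effort = [120.0, 60.0, 250.0, 33.0]      # hypothetical person-months
estimated_effort = [100.0, 66.0, 180.0, 40.0]
print(round(mmre(actual_effort, estimated_effort), 3))
print(pred(0.25, actual_effort, estimated_effort))
```

Note the inverse orientation discussed above: the model looks better as MMRE decreases but as PRED(x) increases.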
data (i.e. the sample population), we can generate appropriate confidence intervals of these measures for the "true" population of values being predicted. For example, for a COCOMO calibration that has PRED(.25)=.5 for the sample data, one might state with 95% confidence that the value of the unknown parameter of the population error distribution being estimated lies within the interval .38 < PRED(.25) < .83.

Generally the standard error for an estimator is difficult to compute analytically. However, bootstrapping [7, 8] is a well-known, well-accepted, and straightforward approach to approximating the standard error of an estimator. Briefly, bootstrapping is a "computer intensive" technique, similar to Monte-Carlo simulation, that re-samples the data with replacement to "reconstruct" the general population distribution. The bootstrap is distinguished from the jackknife, used to detect outliers, and cross-validation (or "holdouts"), used to make sure that results have predictive validity.

We use bootstrapping in various capacities to understand the standard error of PRED and MMRE for the COCOMO I model using the three PROMISE data sets indicated previously. Our preliminary investigations listed in Table 2 show that the standard errors for MMRE and PRED(.25) for various COCOMO I models are significant, and clearly worthy of more detailed study.

Table 2. Overview of datasets with model (C)

    Data Set    size  MMRE  SE   PRED(.25)  SE   95% Confidence
    NASA93      93    .6    .14  .48        .05  .37 ≤ MMRE ≤ .94; .38 ≤ PRED ≤ .6
    COCOMO81    63    .37   .04  .37        .06  .31 ≤ MMRE ≤ .45; .25 ≤ PRED ≤ .49
    COCOMONASA  60    .25   .03  .65        .06  .2 ≤ MMRE ≤ .32; .55 ≤ PRED ≤ .78

6. RESEARCH METHOD
As mentioned previously, we used four different PROMISE datasets to give a good overview of the effects of SE. We calculated MMRE, PRED(.25) and PRED(.30) for the models (A)-(E) on the COCOMO81, COCOMONASA and NASA93 datasets. The same accuracy indicators were calculated using models (E) and (F) for the Desharnais dataset, using both adjusted and raw function points as in [17]. The reason we chose fewer models for this dataset is that the Desharnais data does not have COCOMO I effort multipliers.

We aim to obtain the standard error for MMRE and PRED for the various models (A) - (F) and four PROMISE data sets. The parameters of the z-distribution (related to the error distribution, see Section 4) are unknown, and it is not known to be normally distributed. As such, the variances of MMRE and PRED as estimators, which are needed for standard error, are difficult to obtain analytically. A well-established method for obtaining approximations for the standard error of estimators is to use bootstrapping [7,8].

Standard confidence intervals are also difficult to compute for all but the sample mean of normally distributed data, so here too we resort to bootstrapping. A notable concern in bootstrapping confidence intervals is the effect of non-normally distributed data. In particular, confidence intervals for highly skewed distributions are poorly approximated with the basic bootstrapping method. We discuss the distributional characteristics in a later section; we found that the BC-percentile, or "bias corrected", method has been shown effective in approximating confidence intervals for the type of distributions we are concerned with. For each calculation we chose 15,000 bootstrap iterations (well beyond the suggested number) using the Excel Add-In poptools [20]. Our bootstrapping results have also been replicated using other bootstrapping Excel Add-Ins, manual calculations, and some custom-developed bootstrapping software that performs bootstrap iterations to a desired precision rather than using the arbitrarily chosen 15,000 iterations. All results were seen to be consistent, some of which we now describe.

7. MMRE AND PRED STUDIES
In this section we provide 3 empirical studies that replicate or extend existing studies. We have performed a dozen such studies and these are just a few representative examples.

Study 1: model selection results

A popular cost estimation research area is model selection. This commonly involves advocating methods and criteria for choosing a particular estimation model format, calibration method, or use of calibration data (e.g. "pruning", stratification, etc.), or a combination thereof (see [18]). Model selection research results often appear to be contradictory across different data sets. Validation methods such as "holdout" experiments, while they may seem intuitively reasonable, are difficult to justify formally. Many model selection research results compare COCOMO models and calibration approaches to (presumably better) alternatives. To illustrate how standard error can be used to obtain more confident results, we choose to study a number of variations of COCOMO I itself. This study replicates results from several other studies (at least in part), or is analogous enough to indicate how the approach could be used with alternative models and calibration methods, including analogy models [19], COSEEKMO [3], and simulated project data approaches [9].

Figure 1 visualizes the performance of MMRE with respect to the PROMISE datasets. Each graph on a diagram shows the estimator location (MMRE) within its 95%-confidence interval and is labeled with the model name and MMRE value. Remarkably, the standard errors for the MMRE's for the same model vary greatly over different datasets. This perhaps in part explains some of the inconsistent results in the literature when different data sets were used.

While the precise probabilities for the likelihood of one value being greater than another were not calculated, it is clear enough to see that the more overlap two confidence intervals have, the less one is able to say "in confidence" that one value is greater (or less) than another. This informal statement can be made more precise by considering the Vysochanskiï-Petunin inequality [21], a refinement of Chebyshëv's inequality that places an upper bound on the probability of how far a random variable can be from its mean with respect to its variance. Such tools are essential for reliably understanding dispersion measures such as the coefficient of variation (such as used in [3]) and presumably MMRE and PRED. For our purposes here, the visual amount of
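The bootstrap procedure described above is easy to sketch in a few lines: re-sample the MRE's with replacement, recompute the statistic on each re-sample, and take the standard deviation of the re-sampled statistics as the SE estimate. A minimal Python version, using a synthetic MRE sample and far fewer iterations than the 15,000 used for the reported results (poptools is an Excel Add-In; this is an illustration of the method, not the actual tooling):

```python
import random
import statistics

def bootstrap_se(values, statistic, iterations=2000, seed=42):
    # Re-sample with replacement; the standard deviation of the statistic
    # over the re-samples approximates its standard error.
    rng = random.Random(seed)
    stats = []
    for _ in range(iterations):
        resample = [rng.choice(values) for _ in values]
        stats.append(statistic(resample))
    return statistics.stdev(stats)

# Synthetic MRE sample (hypothetical values with a long right tail).
mres = [0.05, 0.12, 0.18, 0.22, 0.25, 0.31, 0.40, 0.55, 0.80, 1.30]
mmre = sum(mres) / len(mres)
se_mmre = bootstrap_se(mres, lambda s: sum(s) / len(s))
se_pred = bootstrap_se(mres, lambda s: sum(1 for m in s if m <= 0.25) / len(s))
print(round(mmre, 3), round(se_mmre, 3), round(se_pred, 3))
```

This plain re-sampling yields the SE only; the reported confidence intervals additionally use the BC-percentile correction for the skewed distributions noted above.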
overlap of confidence intervals will suffice to illustrate the effects of standard error on MMRE and PRED.

Performance ranking (where lower MMREs have higher rank) is not consistent over all four models, as also indicated in Table 3 […] points, whereas the 95%-confidence interval for model (D) on the […] result, a more confident performance ranking accounting for this might be something like that listed in Table 4, where two models in the same rank cannot be distinguished from one another in terms of the model accuracy criterion.

Table 3. Model ranking based on MMRE, not accounting for Standard Error.

    Rank  COCOMO81  COCOMONASA  NASA93
    1.    A         A           A
    2.    E         E           E
    3.    C         C           C
    4.    B         D           B
    5.    D         B           D

Table 4. Model ranking based on MMRE, accounting for Standard Error at 95% confidence level.

    Rank  COCOMO81  COCOMONASA  NASA93
    1.    A         A           A, B, C, D, E
    2.    C, E      E           -
    3.    B, D      B, C, D     -

[Figure 1. MMRE bootstrapped 95% confidence intervals. Recoverable values - COCOMO81: A (ln_LSR_CAT) 0.11666 (SE 0.01102), B (aSb) 0.77834 (SE 0.09621), C (given_EM) 0.37220 (SE 0.0366), D (ln_LSR_aSb) 0.31779 (SE 0.04521), E (ln_LSR_EM) 0.13396 (SE 0.01497); NASA93: A 0.20007 (SE 0.03991), B 0.64610 (SE 0.12126), C 0.59497 (SE 0.014117), D 0.65149 (SE 0.09823), E 0.41679 (SE 0.08551); one further value, 0.25392 (SE 0.03091), could not be assigned to a panel.]

Looking at Figure 1, one might be excited about model (A), as it appears to be consistently better than the other models and with high confidence (i.e. the confidence intervals do not overlap). However, this illustrates a fallacy of using purely statistical results as a basis for model selection. By allowing our model parameters to vary unconstrained, we are indeed able to calibrate a model that fits the data very well. However, a quick look at the parameter coefficients generated for this model reveals a number of absurd effort parameter relationships. For example, model (A) applied to NASA93 with CLSR estimated the parameter values for the "required complexity" CPLX effort multiplier (see [3] for details) to be L=-0.483, N=0.989, H=0.677, and VH=-0.745. Generally it is believed that higher required complexity requires higher effort. This obviously runs contrary to such beliefs (which other studies have empirically validated). Data enthusiasts might counter this objection with "perhaps this actually describes the true nature of required complexity", in that it is not ordinal. But then looking at the values estimated for model (A) applied to COCOMONASA, where L=-0.66, N=0.659, H=0.653, and VH=7.24, contradicts this. The enthusiast may counter again by stating that perhaps the true nature of required complexity varies with respect to the kind of projects each data set represents.

Fair enough, but closer inspection of NASA93 and COCOMONASA would reveal that the kinds of projects in both are very similar. In fact, a large number of the projects in NASA93 are from the COCOMONASA dataset. Surely similar data sets should have similarly behaving required complexity within the same category value. The reasonable conclusion here is that model (A) is not a realistic effort model for the data sets despite its statistical performance. There is no confidence that the model is accurate for predictions (i.e. data outside the calibration set).

So what can be concluded with confidence based on the MMRE results? The intervals for COCOMO81 in Figure 1 indicate that model (E) is significantly better than model (D). Hence for the COCOMO81 data, we can be confident that adding effort multipliers improves MMRE. The same result holds for the other data sets except NASA93. This provides reasonable confidence that in general adding effort multipliers will indeed improve MMRE.

Figure 2 compares PRED results for the same models and datasets as Figure 1. Unlike MMRE, the PRED rankings are consistent over all datasets (i.e. A > E > C > B > D, where ">" means "higher PRED rank than"). Also, note that the confidence intervals vary less in size. This, from a perspective that is easily accessed and justified, supports a variety of assertions made in the literature that claim PRED is more consistent and "robust" than MMRE. In fact, as we will illustrate in a later section, unlike MMRE, PRED is not dependent on the variance of the MRE's. Thus PRED is immune to large variances from outliers in the data (i.e. it is more robust). This property explains why the MMRE confidence intervals for models (E) and (D) overlap for NASA93, but not for the other datasets. At the 95% confidence level we cannot claim that the PRED rankings are significant, except for model (A), which can be thrown out for the same reasons discussed previously. However, none of the PRED confidence intervals for (E) and (D) overlap, so we can be confident that effort multipliers improve accuracy in OLS-
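The interval-overlap reasoning behind Table 4 can be mechanized. The sketch below uses the COCOMO81 MMRE/SE pairs from Figure 1, but with a simple normal approximation (estimate ± 1.96·SE) in place of the BC-percentile bootstrap intervals used for the reported results:

```python
# Group models into ranks where overlapping 95% confidence intervals
# mean two models cannot be distinguished (cf. Table 4). Intervals here
# are a crude normal approximation, not the BC-percentile bootstrap
# intervals used in the study. Values: COCOMO81 panel of Figure 1.
models = {"A": (0.11666, 0.01102), "B": (0.77834, 0.09621),
          "C": (0.37220, 0.03660), "D": (0.31779, 0.04521),
          "E": (0.13396, 0.01497)}

def interval(est, se, z=1.96):
    return (est - z * se, est + z * se)

def overlaps(i1, i2):
    return i1[0] <= i2[1] and i2[0] <= i1[1]

# Sort by MMRE (lower is better) and merge adjacent overlapping intervals.
ranked = sorted(models, key=lambda m: models[m][0])
groups = [[ranked[0]]]
for name in ranked[1:]:
    if overlaps(interval(*models[name]), interval(*models[groups[-1][-1]])):
        groups[-1].append(name)
    else:
        groups.append([name])
print(groups)
```

With this cruder approximation the COCOMO81 grouping comes out as {A, E}, {D, C}, {B}, slightly different from Table 4's {A}, {C, E}, {B, D} - a reminder that the choice of interval method itself affects which models can be separated.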
[Figure 2. PRED(.3) bootstrapped 95% confidence intervals. Recoverable values - COCOMO81: A (ln_LSR_CAT) 0.98413 (SE 0.01556), B (aSb) 0.26984 (SE 0.05606), C (given_EM) 0.42857 (SE 0.06234), D (ln_LSR_aSb) 0.22222 (SE 0.05198), E (ln_LSR_EM) 0.58730 (SE 0.06191); COCOMONASA: A 0.95000 (SE 0.02835), B 0.65000 (SE 0.06198), C 0.71667 (SE 0.05827), D 0.55000 (SE 0.06410), E 0.86667 (SE 0.04378); NASA93: A 0.83871 (SE 0.03859), B 0.48387 (SE 0.05160), C 0.54839 (SE 0.05201), D 0.39785 (SE 0.05106), E 0.62366 (SE 0.05051).]

calibrated COCOMO I models. This result is consistent with the MMRE results above and further strengthens our confidence in it.

Study 2: Consistency of evaluation with multiple criteria

Figure 3 presents MMRE and PRED confidence intervals for two models using function points (FP) in the Desharnais data.

Tables 5 and 6 indicate the ranking results for the linear (F) and non-linear (D) models based on MMRE and PRED(.25). In [17] it is suggested that the fact that MMRE and PRED(.25) present inconsistent rankings for selecting models (D) and (F) is evidence that they are measuring two different aspects of the error distribution. This may indeed be true; however, we suggest that if SE is taken into account, as it is in Table 6, we cannot be confident in this assertion. In Figure 3 one can see that there is a substantial overlap of the 95% confidence intervals. Hence, to have confidence in the assertion made in [17], one would need a great deal more data to reduce the SE, or use an alternative approach that is not subject to SE. The question of how much more data would be needed is taken up in Section 8 with a simplified, yet analogous, example. However, from the methods presented there, we estimate that at least 4808 project data points would be needed to achieve 95% confidence that model (F) has greater PRED(.25) than model (D), and the number for MMRE would be much higher.

Table 5. Model ranking not accounting for Standard Error (Desharnais, FP adj)

    Rank  MMRE ranking  PRED(.25) ranking
    1.    D             F
    2.    F             D

Table 6. Model ranking accounting for Standard Error at 95% confidence level (Desharnais, FP adj)

    Rank  MMRE ranking  PRED(.25) ranking
    1.    D, F          D, F

[Figure 3. MMRE and PRED(.25) bootstrapped 95% confidence intervals for the Desharnais data. Recoverable values - MMRE: D (ln_LSR_aSb) FP adj 0.57689 (SE 0.09240), F (LSR_aSb) FP adj 0.65335 (SE 0.11438), D FP raw 0.59920 (SE 0.10459), F FP raw 0.69727 (SE 0.12988); PRED(.25): D FP adj 0.37037 (SE 0.05350), F FP adj 0.43210 (SE 0.05480), D FP raw 0.37037 (SE 0.05342), F FP raw 0.41975 (SE 0.05500).]

Study 3: Simulation study of PRED

Our final study replicates the cost estimation simulation results in [9], which suggest that MMRE is an unreliable criterion for selecting among competing prediction models. The evidence presented for this was in observing the frequency, over 1000 trials, with which MMRE would select a "true" model over four other models deliberately constructed to either overestimate or underestimate 30 simulated Desharnais-like effort and size data points. See [9] for further details of this investigation and the construction and justification of the simulated data. While numerous alternative criteria were also investigated, curiously PRED was left out. We would like to note that there are numerous errors in [9] and that some of the premises for which a "true" model is deemed "best" are debatable and deserving of further careful investigation. This notwithstanding, here we extend the results in [9] by including PRED and also accounting for SE in the results.

An OLS regression was performed on the Desharnais data set to determine parameters for a model of type (D). These parameters were then assumed to represent the parameters for the whole population rather than just the Desharnais sample. The model with these parameters, now of type (B), is called the "true" model, as it is the "best" fit to the population data (this is one area where we have concern about this investigation). The simulated data set was generated by creating 30 normally distributed values with mean .3 and standard deviation .6 (see [9] for why these were chosen) and then calculating effort from the true model assuming these are the residuals for this model. Size values were simply generated as 50·i for i = 1, 2, …, 30. The competing models were of type (B) with differing values of the a and b parameters. Model(28) has parameters selected to severely underestimate the simulated data. Model(29)'s parameters were selected to also underestimate, but only moderately. Model(30) severely overestimates, while Model(31) only moderately overestimates.

1000 simulated data sets were generated and MMRE and PRED(x) were computed for each set. One model is "selected"
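The simulation setup above can be sketched as follows. The "true" model coefficients and the under/over-estimation factors below are invented for illustration (the actual values in [9] come from an OLS fit to the Desharnais data), and the N(.3, .6) residuals are applied multiplicatively in log space, which is one plausible reading of the construction in [9]:

```python
import math
import random

rng = random.Random(0)

# Hypothetical "true" model of type (B): effort = a * size**b.
A_TRUE, B_TRUE = 2.0, 1.1
true_model = lambda s: A_TRUE * s ** B_TRUE

# Four competing type-(B) models: severe/moderate under- and
# over-estimators (scale factors are illustrative only).
competitors = {
    "under_severe": lambda s: 0.5 * true_model(s),
    "under_moderate": lambda s: 0.8 * true_model(s),
    "over_severe": lambda s: 2.0 * true_model(s),
    "over_moderate": lambda s: 1.25 * true_model(s),
}

def mmre(actuals, estimates):
    return sum(abs(y - yh) / y for y, yh in zip(actuals, estimates)) / len(actuals)

def pred(x, actuals, estimates):
    return sum(1 for y, yh in zip(actuals, estimates)
               if abs(y - yh) / y <= x) / len(actuals)

sizes = [50 * i for i in range(1, 31)]   # 50*i for i = 1..30
wins_mmre = wins_pred = 0
TRIALS = 1000
for _ in range(TRIALS):
    # Simulated "actual" efforts: true model with N(.3, .6) residuals,
    # applied multiplicatively in log space (an assumption, see lead-in).
    actuals = [true_model(s) * math.exp(rng.gauss(0.3, 0.6)) for s in sizes]
    candidates = {"true": true_model, **competitors}
    by_mmre = min(candidates,
                  key=lambda m: mmre(actuals, [candidates[m](s) for s in sizes]))
    by_pred = max(candidates,
                  key=lambda m: pred(0.3, actuals, [candidates[m](s) for s in sizes]))
    wins_mmre += by_mmre == "true"
    wins_pred += by_pred == "true"
print(wins_mmre / TRIALS, wins_pred / TRIALS)
```

The two printed frequencies show how often each criterion selects the "true" model; under these (invented) parameters the non-zero residual mean makes the over-estimators competitive, which is exactly the kind of selection behavior the study examines.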
over another if it has a "better" value - e.g. lower MMRE or higher PRED(x). Our replicated and extended results are given in Tables 7 […]

[…] curious what effect it might have on model selection. In Figure 4 the left plot shows the frequency of selections for the four models as x ranges from 0 to 1.5 (in increments of .1), whereas the plot on the right is when the 95% confidence interval was used. As the tolerance level is increased, we see that the underestimating models are selected more frequently and […]

[Figure 5. Histogram of bootstrapped MMRE and log-transformed MMRE for model (A), NASA93 dataset. Recoverable summary statistics - kurtosis 0.46 (raw) and -0.33 (log-transformed).]

For our example, we assume PRED for C is less than PRED for E, and so the intervals will not overlap when […]

[Figure residue - summary statistics: average 0.22, median 0.22, mode 0.21, skewness 0.16, kurtosis -0.08; Model B (aSb) PRED30: 0.65000, SE: 0.06198; Model C (given_EM) PRED30: 0.71667, SE: 0.05827.]

[…] confidence intervals for models C and E are unlikely to overlap, by considering where the PRED's are not within z_0.05 ≈ 1.645 SE for each of the model's respective standard errors, where z_0.05 is […]
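The separation condition sketched above - the intervals fail to overlap when each PRED lies more than z_0.05 ≈ 1.645 standard errors from the other - can be checked directly; here against the model (B) and (C) PRED(.3) values from the COCOMONASA panel of Figure 2:

```python
Z_05 = 1.645  # one-sided 5% normal critical value

def separated(p_low, se_low, p_high, se_high, z=Z_05):
    # True when the z*SE bands around the two PRED values do not touch:
    # the lower bound of the higher PRED exceeds the upper bound of the
    # lower PRED.
    return p_high - z * se_high > p_low + z * se_low

# Model (B): PRED(.3) = 0.65000, SE = 0.06198
# Model (C): PRED(.3) = 0.71667, SE = 0.05827
print(separated(0.65000, 0.06198, 0.71667, 0.05827))
```

For this pair the bands overlap (the check prints False), so (B) and (C) cannot be distinguished at this confidence level; a larger gap or smaller SE's would be needed.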
data sets and simulation data, and have observed no significant differences with respect to the analysis for the example presented.

Figure 8 shows a sample of four runs of randomly selecting a subset of size n = 1, 2, …, 93 from NASA93, calculating the MMRE and PRED(.25) for this subset, and bootstrapping to estimate their respective SE's. From the distributional results discussed above we could approximate the SE without bootstrapping, but we find it more confidence-inspiring to use bootstrapping and then verify with a non-bootstrap approximation.

We now present the experimental results as a series of "expected" and "observed" behaviors as the data size increases:

Case n = 1

Expect: MMRE will be the MRE for the single data point, with SE = 0 as there is no variance in the sample data. PRED will be 0 if MRE > .25 or 1 otherwise, with SE = 0 for similar reasons.

Observed: Exactly as expected. This serves as a basic verification that the experiment is correctly configured.

Case n = 93

Expect: All runs should have the same MMRE and PRED values and respective SE's, as all data is being used and there is no random variation from selecting subsets.

Observed: Exactly as expected. This serves as another basic verification that the experiment is correctly configured. We note that the bootstrapping introduces less than a .01 error in approximating the SE for MMRE and a .004 error in approximating the SE for PRED(.25).

Case n → 93

Expect: MMRE and PRED are both consistent estimators for non-negative random variables, hence we expect that their respective values will uniformly converge (i.e., the convergence does not depend on the particular value converged to) in probability to the distribution parameters they estimate. This also implies that the SE's should uniformly converge to 0. MMRE is an average of continuous random variables (approximately log-normal), and we would not expect a large deviation in the rate of convergence, especially as the data set gets large. In contrast, we anticipate a good deal of bounded variation for PRED(.25): each indicator 1(MRE ≤ .25) has expected value P(MRE ≤ .25), so this is essentially a series of Bernoulli trials (i.e., flipping a weighted coin) in which at each n the PRED may increase by (1 − PRED_{n−1})/n or decrease by PRED_{n−1}/n, where PRED_{n−1} is the PRED(.25) value at n − 1 data points; hence the variations are tightly bounded by the previous values. Since each increase or decrease is divided by n, the magnitude of these variations decreases as n increases. Hence we may see a few rare long runs up or down, but generally a repeated series of short increases followed by short decreases that become smaller and smaller as n approaches 93.

From the fact that MRE is an absolute value, the SE for MMRE depends on P(z > 1)P(z ≤ 1), where the probability is taken from the distribution of z (see Section 4). For large enough n this value should be fairly constant when approximated with an empirically derived distribution (as we are doing with bootstrapping). More importantly, however, the SE also depends on the variance of z. This implies that for large point variations in the data, perhaps due to outliers, we would expect large increases in an otherwise decreasing SE. Note that MMRE is the average of n MRE's, so its sample variance will be divided by n^2 and hence the SE will be divided by n. The PRED SE should be quite well behaved, as it depends only on P(MRE ≤ .25)P(MRE > .25) and not on the variance of the MRE's as with MMRE. As stated previously, we expect the approximate values for these probabilities, as derived from the empirical distribution used by the bootstrap, to stabilize rapidly, and therefore we would expect the PRED SE to generally decrease on the order of 1/n.
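The SE computations described above (MMRE and PRED(.25) over a sample of MREs, with SE's estimated by bootstrapping) can be sketched with a plain bootstrap. The MRE values below are hypothetical stand-ins, not the NASA93 data:

```python
import random
import statistics

def bootstrap_se(values, stat, n_boot=2000, seed=0):
    """Standard error of `stat`, estimated by resampling `values` with
    replacement and taking the stdev of the bootstrap replicates."""
    rng = random.Random(seed)
    reps = [stat([rng.choice(values) for _ in values]) for _ in range(n_boot)]
    return statistics.stdev(reps)

def mmre(mres):
    # MMRE: the mean of the magnitude-of-relative-error values.
    return sum(mres) / len(mres)

def pred25(mres):
    # PRED(.25): the fraction of MREs within tolerance, i.e. a mean of
    # Bernoulli indicators 1(MRE <= .25).
    return sum(m <= 0.25 for m in mres) / len(mres)

# Hypothetical MREs; the 1.70 value plays the role of an outlier, which
# inflates the MMRE SE but barely affects the PRED(.25) SE.
mres = [0.05, 0.12, 0.30, 0.22, 0.08, 0.45, 0.19, 1.70, 0.28, 0.10]
print(mmre(mres), bootstrap_se(mres, mmre))
print(pred25(mres), bootstrap_se(mres, pred25))
```

Dropping the 1.70 point and re-running should shrink the bootstrapped MMRE SE sharply while leaving the PRED(.25) SE largely unchanged, mirroring the outlier sensitivity discussed in this section.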
Observed: The experimental runs have the characteristics described above, with a few additional notable items. First, contrary to popular belief in the cost estimation folklore, we see that in general more data does not imply better accuracy values. Even near the full data set there is the possibility of both decreasing and increasing MMRE and PRED values as data size increases. All we can say with confidence is that the criteria will eventually converge, and a bit about how much variability to expect along the way. That we cannot predict, for any given set of data, that more data will improve the MMRE or PRED is an absolutely crucial fact about the use of these as accuracy criteria. It indicates that it is imperative to always consider the SE of these accuracy measures when using them, to avoid erroneous results due to sampling phenomena. We believe that this is a major source of inconsistent cost estimation research results, a few of which we have exemplified and resolved in Section 7.

[Figure 8 panels: MMRE vs. data size; MMRE SE vs. data size; PRED(.25) vs. data size; PRED(.25) SE vs. data size]

Figure 8. MMRE and PRED(.25) performance with respect to increasing data set size with model (C), NASA93

Another notable observation is that we clearly see, from large variations in the MMRE runs in Figure 8, that there are at least 3 significant outliers in the NASA93 data. Hence, as has been frequently claimed in the literature (and expected as above), MMRE is indeed very sensitive to outliers. We see that even very near the full 93 data points an outlier can cause a radical variation in the MMRE. This by itself does not necessarily make MMRE unreliable as an accuracy criterion. What does make it
unreliable is how outliers affect the MMRE SE. A reliable estimator must have a strictly decreasing SE with increasing sample size. Given that an outlier may radically increase the MMRE SE at any time, we cannot reliably estimate it for a given data size, and thus we are uncertain about the true value of the error parameter estimated by MMRE.

In contrast, PRED is a reliable estimator. As expected, and observed, its SE is decreasing. Note that the small increases seen in the PRED SE runs in Figure 8 are the result of approximation errors and random variation from bootstrapping. Because the sample MMRE's are non-discrete, the MMRE SE is less affected by these kinds of errors. We also observe that the PRED SE tends to stabilize quickly, and that after 10 data points all runs more or less converge on the same trajectory. Thus we can reliably estimate the SE and have greater confidence in results based on PRED, so long as we account for this SE. In this sense, PRED is "robust" as an estimator. In discussing how much data is needed to reliably calibrate a COCOMO model, Barry Boehm is noted for suggesting that 10 data points are sufficient. Our experiments here provide some support for this heuristic, given the stability of the PRED SE after 10 points.

Finally, we observe that MMRE's sensitivity to outliers can be put to practical use as an outlier "detector" for pruning rogue estimation data points. We expanded our experiment to indicate which point was added whenever the MMRE SE increased by more than a given tolerance (in this experiment we used .025). We subsequently removed that point from the data set and re-ran the experiment. This resulted in a data set whose MMRE SE appeared reliable. We did not carefully investigate the pruned data to determine whether they could in fact be classified as outliers.

9. REFERENCES
[1] K. Moløkken, M. Jørgensen, "A Review of Surveys on Software Effort Estimation," International Symposium on Empirical Software Engineering, Rome, Italy, 2003.
[2] I. Myrtveit, E. Stensrud, M. Shepperd, "Reliability and Validity in Comparative Studies of Software Prediction Models," IEEE Transactions on Software Engineering, Vol. 31, No. 5, 2005.
[3] T. Menzies, Z. Chen, J. Hihn, K. Lum, "Selecting Best Practices for Effort Estimation," IEEE Transactions on Software Engineering, Vol. 32, No. 11, 2006.
[4] M. Shepperd, "Evaluating Software Project Prediction Systems," 11th IEEE International Software Metrics Symposium, Como, Italy, 2005.
[5] T. Menzies, D. Port, Z. Chen, J. Hihn, S. Stukes, "Validation Methods for Calibrating Software Effort Models," Proceedings of the 27th International Conference on Software Engineering, 2005.
[6] I. Wieczorek, M. Ruhe, "How valuable is company-specific data compared to multi-company data for software cost estimation?," Proceedings of the Eighth IEEE Symposium on Software Metrics (METRICS 02), 2002.
[7] C. Mooney, R. Duval, "Bootstrapping: A Nonparametric Approach to Statistical Inference," Sage Publications, 1st edition, 1993.
[8] B. Efron, "Bootstrap methods: Another look at the jackknife," The Annals of Statistics, Vol. 7, pp. 1-26, 1979.
[9] T. Foss, E. Stensrud, B. Kitchenham, I. Myrtveit, "A Simulation Study of the Model Evaluation Criterion MMRE," IEEE Transactions on Software Engineering, Vol. 29, No. 11, 2003.
[10] M. Shepperd, G. Kadoda, "Using Simulation to Evaluate Prediction Techniques," Proc. Fifth Int'l Software Metrics Symposium, 2001.
[11] B. Boehm, "Software Engineering Economics," Prentice Hall, 1981.
[12] J.M. Desharnais, "Analyse statistique de la productivité des projets informatiques à partir de la technique des points de fonction" (Statistical analysis of the productivity of software projects using the function point technique), master's thesis, Univ. of Montreal, 1989.
[13] M. Jørgensen, "How Much Does a Vacation Cost? or What is a Software Cost Estimate?," ACM SIGSOFT Software Engineering Notes, Vol. 28, No. 6, p. 5, 2003.
[14] S. D. Conte, H. E. Dunsmore, V. Y. Shen, "Software Engineering Metrics and Models," Benjamin-Cummings Publishing, 1986.
[15] M. Jørgensen, "Experience with the accuracy of software maintenance task effort prediction models," IEEE Transactions on Software Engineering, Vol. 21, No. 8, 1995.
[16] G. Boetticher, T. Menzies, T. Ostrand, "The PROMISE Repository of Empirical Software Engineering Data," 2007. http://promisedata.org/repository
[17] B. Kitchenham, L. Pickard, S. MacDonell, M. Shepperd, "What accuracy statistics really measure," IEE Proceedings - Software, Vol. 148, No. 3, 2001.
[18] M. Jørgensen, M. Shepperd, "A Systematic Review of Software Development Cost Estimation Studies," IEEE Transactions on Software Engineering, Vol. 33, No. 1, 2007.
[19] M. Shepperd, C. Schofield, "Estimating Software Project Effort Using Analogies," IEEE Transactions on Software Engineering, Vol. 23, No. 11, 1997.
[20] G. Hood, PopTools, http://www.cse.csiro.au/poptools, accessed 01/19/2008.
[21] "Vysochanskii-Petunin inequality," http://en.wikipedia.org/wiki/Vysochanskii-Petunin_inequality, accessed 01/19/2008.
[22] R. Larsen, M. Marx, "An Introduction to Mathematical Statistics and Its Applications," Second Edition, Prentice Hall, 1986.
[23] L. Briand, T. Langley, I. Wieczorek, "A replicated assessment and comparison of common software cost modeling techniques," Proceedings of the 22nd International Conference on Software Engineering, Limerick, Ireland, 2000, pp. 377-386.
[24] K. Lum, J. Hihn, T. Menzies, "Studies in Software Cost Model Behavior: Do We Really Understand Cost Model Performance?," Proceedings of the ISPA International Conference 2006, Seattle, WA.
[25] COCOMO81 dataset, http://promisedata.org/repository/#coc81, accessed 12/29/2007.
[26] COCOMONASA dataset, http://promisedata.org/repository/#cocomonasa_v1, accessed 01/19/2008.
[27] NASA93 dataset, http://promisedata.org/repository/#nasa93, accessed 12/29/2007.
[28] Desharnais dataset, http://promisedata.org/repository/#desharnais, accessed 12/29/2007.
[29] L. C. Briand, I. Wieczorek, "Resource Estimation in Software Engineering," Encyclopedia of Software Engineering, pp. 1160-1196, Wiley-Interscience Publishing, 2001.
[30] BESTweb - Better Estimation of Software Tasks, http://www.simula.no/~simula/se/bestweb/, 2007.
[31] L. Briand, K. El Emam, D. Surmann, I. Wieczorek, K. Maxwell, "An Assessment and Comparison of Common Software Cost Estimation Modeling Techniques," Proceedings of the 21st International Conference on Software Engineering, Los Angeles, California, USA, 1999.
[32] B. Kitchenham, E. Mendes, G. Travassos, "Cross- vs. Within-Company Cost Estimation Studies: A Systematic Review," IEEE Transactions on Software Engineering, Vol. 33, No. 5, 2007.
[33] C. Mair, M. Shepperd, M. Jørgensen, "An Analysis of Data Sets Used to Train and Validate Cost Prediction Systems," International Conference on Software Engineering, St. Louis, Missouri, USA, 2005.
[34] C. E. Land, "Confidence intervals for linear functions of the normal mean and variance," Annals of Mathematical Statistics, Vol. 42, pp. 1187-1205, 1971.