Comparative Studies of the Model Evaluation Criterions
MMRE and PRED in Software Cost Estimation Research
Dan Port
University of Hawai'i at Manoa
2404 Maile Way, E303
Honolulu, Hawaii, USA 96822
+1 808 956 7494
dport@hawaii.edu

Marcel Korte
University of Applied Sciences and Arts Dortmund
Emil-Figge-Str. 42
44227 Dortmund, Germany
+49 231 755 6709
marcel.korte@stud.fh-dortmund.de

ABSTRACT
Software cost model research results depend on model accuracy criteria such as MMRE and PRED. Despite criticism, MMRE has emerged as the de facto standard criterion. Many alternatives have been proposed and studied; surprisingly, however, PRED, the second most popular criterion, has not been extensively studied. This work attempts to fill this gap in the literature and expand the understanding and use of evaluation criterions in general. The majority of this work is empirically based, applying MMRE and PRED to a number of COCOMO model variations with respect to a simulated data set and four publicly available cost estimation data sets. We replicate a number of results based on MMRE and extend them to PRED. We study qualities of MMRE and PRED as sample estimator statistics for parameters of a cost model error distribution. Standard error is used to ensure greater confidence in replicated and new results based on sample data.

Categories and Subject Descriptors
D.2.9 [Cost Estimation]: Empirical study of MMRE and PRED as software cost estimation model evaluation criterion.

General Terms
Management, Measurement, Performance, Reliability, Standardization, Theory, Verification, Accuracy.

Keywords
Cost Estimation, Cost Model, Standard Error, Confidence, MMRE, PRED, Model Selection, Parameters, Calibration, Bootstrapping, Confidence Interval

1. INTRODUCTION
Over the last 20+ years there have been a large number of cost estimation research efforts. Improving estimation accuracy and methods to support better estimates has been a major focus of researchers and is also of ample interest to practitioners. Generally the development, validation, and comparison of such research are empirical, applying model evaluation criterions (e.g. model accuracy measures) to cost and effort datasets that have either been collected for this reason or are publicly available (e.g. the PROMISE repository [16]). Model criterions are also referred to as "accuracy measures" or "accuracy indicators", and the two most prevalent are the mean magnitude relative error (MMRE) and percentage relative error deviation (PRED) [14]. MMRE has been empirically studied and criticized in a number of works such as [2,9,17]. A dizzying array of alternative criterions has been suggested. Surprisingly, few works have directly studied PRED or even included it as an alternative for comparison.

Therefore one objective of this work is to remedy this by providing an empirical comparative study of MMRE and PRED through side by side analysis of the two criterions, and by replicating a few key empirical studies of MMRE and expanding them to include PRED. Another objective is to advocate the appropriate use of standard error (SE) when discussing model criterions based on sample project data. This addresses a serious problem noted by various studies such as [2,3,4] that show inconclusive or contradictory results when using model accuracy criterions to compare models. For example, in an ongoing discussion in the literature about local calibration, Menzies et al. [5] reference eight different studies supporting local calibration, whereas Wieczorek et al. [6] cannot find any significant difference in the performance between local and global estimation models. Briand and Kitchenham [31,32] also state inconclusive results in this matter. Such inconsistencies promote a palpable lack of confidence in software cost estimation research results and, consequently, a lack of confidence in software cost estimation accuracy criteria [1,17]. By accounting for SE, we facilitate greater confidence in our results and help resolve some of the aforementioned inconsistencies.

The research results mentioned above are based on a variety of datasets, and [33] gives an overview of some of these datasets. For these datasets, some works have raised the question "How much and what quality of data is needed to obtain significant results?" [5]. This may seem to be a difficult question to answer, and few works have attempted to address it in a meaningfully quantitative way. In this work we will see that analyzing a criterion's SE as a function of the size of a dataset is a natural and meaningful way to address this question.
The statements and analysis in this work are based on four publicly available datasets: COCOMO81 [25], COCOMONASA [26], NASA93 [27] and the Desharnais dataset [28], and a simulation of the latter as specified in [9] and [2]. We use these because they are well-known, straightforward, and publicly available, so that our results may be readily replicated, verified and hopefully expanded. Our approach, however, is not specific to any particular dataset or model.

The current work aims to provide a better understanding of MMRE and PRED as estimates of model accuracy and to increase confidence in their application. We would like to make it clear upfront that the current work is an empirical study for mainly this purpose and makes no claim that either is a superior criterion to other measures. If we were to take a position, it would be to advocate the use of the more standard mean squared error (MSE) [2] or a maximum likelihood estimator [22] over any of the criterions suggested in the literature. Neither appears to have been considered to date; however, this is not the focus of the current work and these will not be discussed further in this paper.

This paper is organized as follows: a discussion of related works; then a description of models and datasets used; definitions for the criterions (MMRE, PRED) and some of their analytic characteristics as statistical estimators; use of standard error for these accuracy estimators; research methodology; replicated studies of MMRE extended to PRED; and lastly a discussion of empirical characteristics of MMRE and PRED. We purposely have not provided a conclusion, as the intent is to provide a deeper and more detailed understanding of MMRE and PRED, and not to criticize or advocate their use.

2. RELATED WORK
The available resources and literature on cost estimation research can be overwhelming. There exist a relatively large number of empirically based estimation methods. Non-model-based methods (e.g. "expert judgment") usually do not play an important role in the empirical literature. Generally such methods do not output point estimation data applicable to accuracy criterions (there are some research efforts that are the exception). Still, they are widely practiced intuitive methods used frequently in organizations where a model-based approach would be too cumbersome or sufficient model-calibration data is unavailable. Model-based methods can be split into generic-model-based methods (e.g. COCOMO, SLIM, etc.) and domain-specific model-generation methods such as CART or Stepwise ANOVA.

Besides the variety of cost estimation methods, there is a large diversity of studies on the topic - some on evaluation of cost estimation in different contexts, some assessing current practices in the software industry, others focusing on calibration of cost estimation models. See the Encyclopedia of Software Engineering [29] for an overview of cost estimation techniques as well as cost estimation studies. Also, [18,30] list current studies on software cost estimation.

As mentioned in the introduction, unlike with MMRE, no detailed study could be found on the nature and efficacy of PRED as a software cost estimation model criterion. In spite of this, PRED is a frequently used criterion, as is evidenced by summations of model performances in [3].

3. COCOMO, DATASETS USED
In this study we will use COCOMO, the Constructive Cost Model [11], since, unlike other models such as PriceS, SLIM, and SEERSEM, it is an open model with substantial published project data sets. All details for COCOMO are published in the text "Software Engineering Economics" [11]. There are several versions of the COCOMO model, the two most prevalent being COCOMO I and COCOMO II. The one we use here (COCOMO I) was chosen based on the publicly available COCOMO data within the PROMISE repository [16]. Here we study variations of the classic COCOMO I model to exemplify our points and methods, and to enable straightforward duplication and verification of our results and claims. However, our methods are not limited to such models, and it will be evident that COCOMO I is fully exchangeable with other, and perhaps better, cost estimation models. The intent here is to define a set of experiments and examples that others may replicate in order to refute or improve on our results and methods. The particular datasets and cost models used here are simply a convenience.

Boehm's Post-Architecture version of COCOMO I:

    effort = a * ( ∏_{j=1}^{15} EM_j^{a_j} ) * (size)^b * ω    (1)

Here, EM_j are "effort multiplier" parameters whose values are chosen based on a project's characteristics, and a_j, a, b are domain-specific "calibration" coefficients, either given as specified by Boehm in [11] or determined statistically (generally via ordinary least squares regression) using historical project data. The dependent variable size, expressed either as KSLOC (thousand source lines of code) or in FP (function points), is estimated directly or computed from a function point analysis. The model error ω is a random variable with distribution D (not generally Normal). Model accuracy measures are estimating one or more parameters of D.

Table 1 shows the six COCOMO I model variations used in this work and their brief descriptions.

Table 1. COCOMO I model variations used in study
    Model              a,b     EM_j         a_j
    (A) ln_LSR_CAT     CLSR    categorical  CLSR
    (B) aSb            given   none         none
    (C) given_EM       given   given        none
    (D) ln_LSR_aSb     OLS     none         none
    (E) ln_LSR_EM      OLS     given        OLS
    (F) LSR_a+bS       OLS     none         none

The table entries are interpreted in the following way:
OLS: Ordinary Least Squares regression was used with the given project data set to determine the parameter values.
CLSR: Categorical Least Squares regression was used with the given project data set to determine the parameter values.
Given: The values of these parameters are given in [3] and not derived statistically from the data set.
Categorical: The values of these parameters are considered non-numerical, non-ordinal categories (e.g. the implied order of the "L" "N" "H" "XH" values for effort multipliers is ignored).
None: The parameters are not used in this model.

Models (A) - (E) use the same functional form as the general COCOMO I model given in equation (1) above; however, model (F) uses a simple linear a + b*(size) form. When there is a "ln_" in the model name, the applicable project data was transformed by taking the natural logarithm (ln) for the analysis. All values were back transformed when used in the model and model calculations (e.g. calculating MMRE's).
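To make equation (1) concrete, the following sketch (ours, in Python; the paper itself provides no code) evaluates the effort formula for a single project. The calibration values in the usage line are illustrative placeholders, not Boehm's published coefficients, and the error term ω is omitted since only the point estimate is needed when computing MRE's.

    def cocomo_effort(size_ksloc, a, b, effort_multipliers, a_j=None):
        """Point-estimate effort per equation (1), omitting the error term.
        effort_multipliers: the 15 EM_j values chosen for the project.
        a_j: optional per-multiplier calibration exponents (models (A)/(E) in
        Table 1 estimate these; taking them as 1 uses the EM values as given)."""
        if a_j is None:
            a_j = [1.0] * len(effort_multipliers)
        em_product = 1.0
        for em, exponent in zip(effort_multipliers, a_j):
            em_product *= em ** exponent
        return a * em_product * size_ksloc ** b

    # Illustrative call with placeholder calibration values (all EM_j nominal = 1.0):
    print(cocomo_effort(50, a=3.0, b=1.12, effort_multipliers=[1.0] * 15))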

An example reading of the table for model (B) states that it is the general COCOMO I model without using any effort multipliers (and hence no calibration coefficients), and the values of the parameters a, b are taken from the values given in [3]. No regression is performed on the data set.

The historical data used for estimating the coefficients are taken from the COCOMO81 [25], COCOMONASA [26], NASA93 [27] and Desharnais [28] PROMISE repository data sets. We note that in the course of this work we contributed numerous corrections and clarifications for these data sets. Simulated data is based on [9] and [2], and we direct the reader to these sources for details on its applicability and construction.

4. ACCURACY ESTIMATORS
The field of cost estimation research suffers a lack of clarity about the interpretation of model evaluation criterions. In particular, for model accuracy, various indicators of accuracy - both relative and absolute - have been introduced throughout the cost estimation literature, for example mean squared error (MSE), absolute residuals (AR) or balanced residual error (BRE). Our literature review indicated that the most commonly used, by far, are the "mean magnitude relative error" or MMRE, and "percentage relative error deviation within x" or PRED(x). Of these two, the MMRE is the most widely used. Both are based on the same basic unit-less value of magnitude relative error (MRE), which is defined as

    MRE_i = |y_i − ŷ_i| / y_i    (2)

where y_i is the actual effort and ŷ_i is the estimated effort for project i. It is argued that MRE is useful because it does not over penalize large projects and it is unit-less (i.e. scale independent). MMRE is defined as the sample average of the MRE's:

    MMRE = (1/N) ∑_{i=1}^{N} MRE_i    (3)

To be more precise we should label (3) as MMRE_N to indicate that it is a sample statistic on N data points. To be consistent with customary usage we will drop the subscript when there is no confusion as to the number of data points used. Conte et al. [14] consider MMRE ≤ .25 as an acceptable level of performance for effort prediction models.

PRED(x) [15] is defined as the average fraction of the MRE's that are off by no more than x:

    PRED(x) = (1/N) ∑_{i=1}^{N} 1(MRE_i ≤ x)    (4)

where 1(MRE_i ≤ x) is 1 if MRE_i ≤ x and 0 otherwise. Typically PRED(.25) is used, but some studies also look at PRED(.3) with little difference in results. Generally PRED(.3) ≥ .75 is considered an acceptable model accuracy. There is some concern about what constitutes an appropriate value of x for PRED(x). Clearly the larger x is, the less information and confidence we have in an accuracy estimate. However interesting this question, we are interested in comparisons of PRED with MMRE and not the specific application of these measures. Note that, inverse to MMRE, high PRED values are desirable. This is easily reversed to match MMRE by simply switching the 0-1 values in (4) if desired. This inverse relationship should be kept in mind when viewing our side by side comparisons.

Although MMRE and PRED are still today the de facto standards for cost model accuracy measurement, they don't specifically measure accuracy. In fact, technically they are "estimators" of a function of the parameters related to the distribution of the MRE values. This in turn is presumably related to the error distribution of the model. As such, we will frequently refer to these as "accuracy indicators" rather than measures when it is more appropriate. Several studies have noted that MRE distributions are essentially related to the simpler distribution of the values

    z_i = ŷ_i / y_i    (5)

which are clearly related to the distribution of the error residuals ε_i = y_i − ŷ_i, but in a non-trivial way [17].

Kitchenham et al. report that MMRE and PRED are directly related to measures of the spread and the kurtosis of the distribution of z_i values [17], a fact of uncertain utility, but notable. A useful fact that follows easily from the weak law of large numbers [22] is that both MMRE and PRED are consistent estimators (i.e. they converge in probability to some parameter of the distribution) [22]. This provides a meaningful and precise interpretation of "accuracy measure" when they are viewed as estimators for the error distribution D for a cost model and dataset. This, however, does not say anything about how good they are as estimators (e.g. bias, uniform convergence, rate of convergence, MSE, variance, etc.). Indeed, there is substantial research that addresses the quality of these estimators (such as [9] and [10]), although not expressed or analyzed in the more standard statistical framework we use here.

There has been a degree of debate regarding the efficacy of MMRE, and to a lesser extent PRED, as accuracy measures, yet one thing is clear - they are both statistics for a sample of the MRE's, and not the entire population, and therefore they are subject to standard error (SE). The SE must be accounted for if one is to have confidence in their application.
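The following is a small sketch (ours, not from the paper) that computes the MRE, MMRE and PRED(x) of equations (2)-(4) directly from paired actual and estimated efforts; the later sketches in this paper reuse these helpers. The example efforts are made up purely for illustration.

    def mre(actual, estimated):
        # Magnitude of relative error for one project, equation (2)
        return abs(actual - estimated) / actual

    def mmre(actuals, estimates):
        # Mean MRE over the N projects in the sample, equation (3)
        values = [mre(y, y_hat) for y, y_hat in zip(actuals, estimates)]
        return sum(values) / len(values)

    def pred(actuals, estimates, x=0.25):
        # Fraction of projects whose MRE is within the tolerance x, equation (4)
        values = [mre(y, y_hat) for y, y_hat in zip(actuals, estimates)]
        return sum(1 for v in values if v <= x) / len(values)

    # Example with made-up efforts (person-months):
    actual = [120.0, 45.0, 300.0, 80.0]
    estimated = [100.0, 50.0, 390.0, 78.0]
    print(mmre(actual, estimated), pred(actual, estimated, 0.25))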
5. STANDARD ERROR OF ACCURACY ESTIMATORS
One of Boehm's original motivations for creating COCOMO was to increase the confidence managers have when estimating software projects. Curiously, despite this original motivation for COCOMO, very little has been reported on the confidence in accuracy measures for COCOMO estimates. This has led to a surprising number of contradictory results in the theory and practice of cost estimation.

The primary concern is that COCOMO models (and more generally all cost estimation models) are "calibrated" with a relatively small amount of data (which is frequently biased or "sanitized"). Various measures such as PRED(.25)=.5 and MMRE=.35 are plainly presented stating just how "good" one should feel about the model's accuracy and predictive capabilities. The reality is that these values only reflect the model accuracy for the data they were calibrated on. There is a serious question about our "confidence" in these measures for predicted values. Providing a standard error for these measures, and a clearer understanding of what it implies (e.g. how much is bad?), is key to addressing the confidence question.

For example, if we understand the standard error from calibration data (i.e. the sample population), we can generate appropriate confidence intervals of these measures for the "true" population of values being predicted. For instance, for a COCOMO calibration that has PRED(.25)=.5 for the sample data, one might state with 95% confidence that the value of the unknown parameter for the population error distribution that is being estimated lies within the interval .38 < PRED(.25) < .83.

Generally the standard error for an estimator is difficult to compute analytically. However, bootstrapping [7,8] is a well-known, well-accepted, and straightforward approach to approximating the standard error of an estimator. Briefly, bootstrapping is a "computer intensive" technique, similar to Monte-Carlo simulation, that re-samples the data with replacement to "reconstruct" the general population distribution. The bootstrap is distinguished from the jackknife, used to detect outliers, and cross-validation (or "holdouts"), used to make sure that results have predictive validity.

We use bootstrapping in various capacities to understand the standard error of PRED and MMRE for the COCOMO I model using the three PROMISE data sets indicated previously. Our preliminary investigations listed in Table 2 show that the standard errors for MMRE and PRED(.25) for various COCOMO I models are significant, and clearly worthy of more detailed study.

Table 2. Overview of datasets with model (C)
    Data Set     size  MMRE  SE   PRED(.25)  SE   95% Confidence
    NASA93       93    .6    .14  .48        .05  .37 ≤ MMRE ≤ .94
                                                   .38 ≤ PRED ≤ .6
    COCOMO81     63    .37   .04  .37        .06  .31 ≤ MMRE ≤ .45
                                                   .25 ≤ PRED ≤ .49
    COCOMONASA   60    .25   .03  .65        .06  .2 ≤ MMRE ≤ .32
                                                   .55 ≤ PRED ≤ .78

6. RESEARCH METHOD
As mentioned previously, we used four different PROMISE datasets to give a good overview of the effects of SE. We calculated MMRE, PRED(.25) and PRED(.30) for the models (A)-(E) on the COCOMO81, COCOMONASA and NASA93 datasets. The same accuracy indicators were calculated using models (E) and (F) for the Desharnais dataset, using adjusted and also raw function points as in [17]. The reason we chose fewer models for this dataset is that the Desharnais data does not have COCOMO I effort multipliers.

We aim to obtain the standard error for MMRE and PRED for the various models (A) - (F) and four PROMISE data sets. The parameters of the z-distribution (related to the error distribution, see Section 4) are unknown, and it is not known to be normally distributed. As such, the variances of MMRE and PRED as estimators, which are needed for standard error, are difficult to obtain analytically. A well-established method for obtaining approximations for the standard error of estimators is to use bootstrapping [7,8].

Standard confidence intervals are also difficult to compute for all but the sample mean of normally distributed data, so here too we resort to bootstrapping. A notable concern in bootstrapping confidence intervals is the effect of non-normally distributed data. In particular, confidence intervals for highly skewed distributions are poorly approximated with the basic bootstrapping method. We discuss the distributional characteristics in a later section; the BC-percentile, or "bias corrected", method has been shown effective in approximating confidence intervals for the type of distributions we are concerned with. For each calculation we chose 15,000 bootstrap iterations (well beyond the suggested number) using the Excel Add-In poptools [20]. Our bootstrapping results have also been replicated using another bootstrapping Excel Add-In, manual calculations, and some custom developed bootstrapping software that performs bootstrap iterations to a desired precision rather than using the arbitrarily chosen 15,000 iterations. All results were seen to be consistent, some of which we now describe.
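A minimal sketch of the bootstrap just described, reusing the mmre/pred helpers from Section 4 (our code, not the paper's tooling): it resamples projects with replacement and reports the standard error and a plain percentile interval. The paper's own results use the BC-percentile (bias corrected) correction via the poptools add-in, which this simplified version does not implement.

    import random

    def bootstrap(stat, actuals, estimates, iterations=15000, alpha=0.05, seed=1):
        """Approximate the SE and a percentile confidence interval of an accuracy
        indicator (e.g. mmre or pred) by resampling projects with replacement."""
        rng = random.Random(seed)
        pairs = list(zip(actuals, estimates))
        values = []
        for _ in range(iterations):
            sample = [rng.choice(pairs) for _ in pairs]
            ys, y_hats = zip(*sample)
            values.append(stat(ys, y_hats))
        values.sort()
        n = len(values)
        mean = sum(values) / n
        se = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
        lo = values[int((alpha / 2) * (n - 1))]
        hi = values[int((1 - alpha / 2) * (n - 1))]
        return se, (lo, hi)

    # Example use: se_mmre, ci_mmre = bootstrap(mmre, actual, estimated)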
7. MMRE AND PRED STUDIES
In this section we provide three empirical studies that replicate or extend existing studies. We have performed a dozen such studies and these are just a few representative examples.

Study 1: model selection results
A popular cost estimation research area is model selection. This commonly involves advocating methods and criteria for choosing a particular estimation model form, calibration method, or use of calibration data (e.g. "pruning", stratification, etc.), or a combination thereof (see [18]). Model selection research results often appear to be contradictory across different data sets. Validation methods such as "holdout" experiments, while they may seem intuitively reasonable, are difficult to justify formally. Many model selection research results compare COCOMO models and calibration approaches to (presumably better) alternatives. To illustrate how standard error can be used to obtain more confident results, we choose to study a number of variations of COCOMO I itself. This study replicates results from several other studies (at least in part), or is analogous enough to indicate how the approach could be used with alternative models and calibration methods, including analogy models [19], COSEEKMO [3], and simulated project data approaches [9].

Figure 1 visualizes the performance of MMRE with respect to the PROMISE datasets. Each graph on a diagram shows the estimator location (MMRE) within its 95%-confidence interval and is labeled with the model name and MMRE value. Remarkably, the standard errors for the MMRE's for the same model vary greatly over different datasets. This perhaps in part explains some of the inconsistent results in the literature when different data sets were used.

While the precise probabilities for the likelihood of one value being greater than another were not calculated, it is clear enough to see that the more overlap two confidence intervals have, the less one is able to say "in confidence" that one value is greater (or less) than another. This informal statement can be made more precise by considering the Vysochanskiï-Petunin inequality [21], a refinement of Chebyshëv's inequality that places an upper bound on the probability of how far a random variable can be from its mean with respect to its variance. Such tools are essential for reliably understanding dispersion measures such as the coefficient of variation (such as used in [3]) and presumably MMRE and PRED. For our purposes here, the visual amount of overlap of confidence intervals will suffice to illustrate the effects of standard error on MMRE and PRED.
Performance ranking (where lower MMREs have higher rank) is not consistent over the datasets, as also indicated in Table 3 (i.e. ranks 4 and 5 are switched for COCOMONASA). Furthermore, the standard error intervals vary greatly in size: on the COCOMO NASA dataset the biggest interval ranges 19 points, whereas the 95%-confidence interval for model (D) on the COCOMO81 dataset has a range of more than 66 points. This indicates that MMRE is sensitive to both the data and the particular model used. That is, some models may provide more confident accuracy results for a given data set than others. As a result, a more confident performance ranking accounting for this might be something like that listed in Table 4, where two models in the same rank cannot be distinguished from one another in terms of the model accuracy criterion.

Table 3. Model ranking based on MMRE, not accounting for Standard Error.
       COCOMO81   COCOMONASA   NASA93
    1. A          A            A
    2. E          E            E
    3. C          C            C
    4. B          D            B
    5. D          B            D

Table 4. Model ranking based on MMRE, accounting for Standard Error at 95% confidence level.
       COCOMO81   COCOMONASA   NASA93
    1. A          A            A, B, C, D, E
    2. C, E       E            -
    3. B, D       B, C, D      -

Figure 1. MMRE bootstrapped 95% confidence intervals. MMRE (SE) per model:
    COCOMO81:     (A) 0.11666 (0.01102)  (B) 0.77834 (0.09621)  (C) 0.37220 (0.0366)  (D) 0.99729 (0.14374)  (E) 0.32531 (0.04212)
    COCOMO NASA:  (A) 0.06921 (0.01250)  (B) 0.33915 (0.04906)  (C) 0.25392 (0.03091)  (D) 0.31779 (0.04521)  (E) 0.13396 (0.01497)
    NASA93:       (A) 0.20007 (0.03991)  (B) 0.64610 (0.12126)  (C) 0.59497 (0.14117)  (D) 0.65149 (0.09823)  (E) 0.41679 (0.08551)
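One simple way to derive rankings like those in Table 4 from bootstrapped intervals is sketched below (our illustration, not the paper's procedure): models are sorted by their intervals and grouped into the same rank whenever an interval overlaps the group formed so far. The intervals in the example call are rough placeholders in the spirit of the COCOMO81 panel of Figure 1, not the exact bootstrapped intervals.

    def rank_with_ties(intervals):
        """Group models into ranks; models whose confidence intervals overlap share
        a rank. `intervals` maps a model name to a (low, high) interval for a
        lower-is-better criterion such as MMRE."""
        ordered = sorted(intervals.items(), key=lambda kv: kv[1][0])
        ranks, current, current_high = [], [], None
        for name, (low, high) in ordered:
            if current and low > current_high:
                ranks.append(current)
                current, current_high = [], None
            current.append(name)
            current_high = high if current_high is None else max(current_high, high)
        if current:
            ranks.append(current)
        return ranks

    # Illustrative use with made-up intervals:
    print(rank_with_ties({"A": (0.10, 0.13), "E": (0.29, 0.37), "C": (0.30, 0.45),
                          "B": (0.59, 0.97), "D": (0.71, 1.28)}))
    # -> [['A'], ['E', 'C'], ['B', 'D']]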
Looking at Figure 1 one might be excited about model (A), as it appears to be consistently better than the other models and with high confidence (i.e. the confidence intervals do not overlap). However, this illustrates a fallacy of using purely statistical results as a basis for model selection. By allowing our model parameters to vary unconstrained, we are indeed able to calibrate a model that fits the data very well. However, a quick look at the parameter coefficients generated for this model reveals a number of absurd effort parameter relationships. For example, model (A) applied to NASA93 with CLSR estimated the parameter values for the "required complexity" CPLX effort multiplier (see [3] for details) to be L=-0.483, N=0.989, H=0.677, and VH=-0.745. Generally it is believed that higher required complexity requires higher effort. This obviously runs contrary to such beliefs (which other studies have empirically validated). Data enthusiasts might counter this objection with "perhaps this actually describes the true nature of required complexity", in that it is not ordinal. But then looking at the values estimated for model (A) applied to COCOMONASA, where L=-0.66, N=0.659, H=0.653, and VH=7.24, contradicts this. The enthusiast may counter again by stating that perhaps the true nature of required complexity varies with respect to the kind of projects each data set represents.

Fair enough, but closer inspection of NASA93 and COCOMO NASA would reveal that the kinds of projects in both are very similar. In fact, a large number of the projects in NASA93 are from the COCOMONASA dataset. Surely similar data sets should have similarly behaving required complexity within the same category value. The reasonable conclusion here is that model (A) is not a realistic effort model for the data sets despite its statistical performance. There is no confidence that the model is accurate for predictions (i.e. data outside the calibration set).

So what can be concluded with confidence based on the MMRE results? The intervals for COCOMO81 in Figure 1 indicate that model (E) is significantly better than model (D). Hence for COCOMO81 data, we can be confident that adding effort multipliers improves MMRE. The same result holds for the other data sets except NASA93. This provides reasonable confidence that in general adding effort multipliers will indeed improve MMRE.

Figure 2 compares PRED results for the same models and datasets as Figure 1. Unlike MMRE, the PRED rankings are consistent over all datasets (i.e. A > E > C > B > D, where ">" means "higher PRED rank than"). Also, note that the confidence intervals vary less in size. This, from a perspective that is easily accessed and justified, supports a variety of assertions made in the literature that claim PRED is more consistent and "robust" than MMRE. In fact, as we will illustrate in a later section, unlike MMRE, PRED is not dependent on the variance of the MRE's. Thus PRED is immune to large variances from outliers in the data (i.e. it is more robust). This property explains why the MMRE confidence intervals for models (E) and (D) overlap for NASA93, but not for the other datasets. At the 95% confidence level we cannot claim that the PRED rankings are significant except for model (A), which can be thrown out for the same reasons discussed previously. However, none of the PRED confidence intervals for (E) and (D) overlap, so we can be confident that effort multipliers improve accuracy in OLS calibrated COCOMO I models. This result is consistent with the MMRE results above and further strengthens our confidence in it.

Figure 2. PRED(.3) bootstrapped 95% confidence intervals. PRED(.3) (SE) per model:
    COCOMO81:     (A) 0.98413 (0.01556)  (B) 0.26984 (0.05606)  (C) 0.42857 (0.06234)  (D) 0.22222 (0.05198)  (E) 0.58730 (0.06191)
    COCOMO NASA:  (A) 0.95000 (0.02835)  (B) 0.65000 (0.06198)  (C) 0.71667 (0.05827)  (D) 0.55000 (0.06410)  (E) 0.86667 (0.04378)
    NASA93:       (A) 0.83871 (0.03859)  (B) 0.48387 (0.05160)  (C) 0.54839 (0.05201)  (D) 0.39785 (0.05106)  (E) 0.62366 (0.05051)

Study 2: Consistency of evaluation with multiple criterions
Figure 3 presents MMRE and PRED confidence intervals for two models using function points (FP) in the Desharnais data. Tables 5 and 6 indicate the ranking results for the linear (F) and non-linear (D) models based on MMRE and PRED(.25). In [17] it is suggested that the fact that MMRE and PRED(.25) present inconsistent rankings for selecting models (D) and (F) is evidence that they are measuring two different aspects of the error distribution. This may indeed be true; however, we suggest that if SE is taken into account, as it is in Table 6, we cannot be confident in this assertion. In Figure 3 one can see that there is a substantial overlap of the 95% confidence intervals. Hence, to have confidence in the assertion made in [17], one would need a great deal more data to reduce the SE, or use an alternative approach that is not subject to SE. The question of how much more data would be needed is taken up in Section 8 with a simplified, yet analogous example. However, from the methods presented there, we estimate that at least 4808 project data points would be needed to achieve 95% confidence that Model (F) has greater PRED(.25) than Model (D), and the number for MMRE would be much higher.

Table 5. Model ranking not accounting for Standard Error (Desharnais, FP adj)
       MMRE ranking   PRED(.25) ranking
    1. D              F
    2. F              D

Table 6. Model ranking accounting for Standard Error at 95% confidence level (Desharnais, FP adj)
       MMRE ranking   PRED(.25) ranking
    1. D, F           D, F

Figure 3. MMRE and PRED 95% confidence intervals (Desharnais). Value (SE):
    MMRE:      (D) FP adj 0.57689 (0.09240)  (F) FP adj 0.65335 (0.11438)  (D) FP raw 0.59920 (0.10459)  (F) FP raw 0.69727 (0.12988)
    PRED(.25): (D) FP adj 0.37037 (0.05350)  (F) FP adj 0.43210 (0.05480)  (D) FP raw 0.37037 (0.05342)  (F) FP raw 0.41975 (0.05500)

Study 3: Simulation study of PRED
Our final study replicates the cost estimation simulation results in [9], which suggest that MMRE is an unreliable criterion for selecting among competing prediction models. The evidence presented for this was in observing the frequency, over 1000 trials, with which MMRE would select a "true" model over four other models deliberately constructed to either overestimate or underestimate 30 simulated Desharnais-like effort and size data points. See [9] for further details of this investigation and the construction and justification of the simulated data. While numerous alternative criterions were also investigated, curiously PRED was left out. We would like to note that there are numerous errors in [9] and that some of the premises for which a "true" model is deemed "best" are debatable and deserving of further careful investigation. This notwithstanding, here we extend the results in [9] by including PRED and also accounting for SE in the results.

An OLS regression was performed on the Desharnais data set to determine parameters for a model of type (D). These parameters were then assumed to represent the parameters for the whole population rather than just the Desharnais sample. The model with these parameters, now of type (B), is called the "true" model as it is the "best" fit to the population data (this is one area where we have concern about this investigation). The simulated data set was generated by creating 30 normally distributed values with mean .3 and standard deviation .6 (see [9] for why these were chosen) and then calculating effort from the true model assuming these are the residuals for this model. Size values were simply generated as 50·i for i = 1,2,…,30. The competing models were of type (B) with differing values of the a and b parameters. Model(28) has parameters selected to severely underestimate the simulated data. Model(29)'s parameters were selected to also underestimate, but only moderately. Model(30) severely overestimates, while Model(31) only moderately overestimates.

1000 simulated data sets were generated and MMRE and PRED(x) were computed for each set. One model is "selected" over another if it has a "better" value, e.g. lower MMRE or higher PRED(x). Our replicated and extended results are given in Table 7.
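A sketch of the Study 3 set-up in Python (ours; the calibration values and the over/under-estimation scale factors below are placeholders, since [9] derives its parameters from the Desharnais data and we do not reproduce them here). It counts how often each deliberately mis-calibrated competitor is preferred to the "true" model under MMRE and PRED(.25), reusing the mmre/pred helpers sketched in Section 4.

    import math, random

    def selection_counts(trials=1000, n=30, a_true=10.0, b_true=0.9, seed=1):
        """Effort is generated from a 'true' a*size^b model with N(0.3, 0.6)
        residuals applied in log space, as described above. Competitors here
        differ from the true model only by a scale factor on a, a simplification
        of the four competing models of [9]; the scale values are placeholders."""
        rng = random.Random(seed)
        competitors = {"Model(28) severe under": 0.4, "Model(29) moderate under": 0.8,
                       "Model(30) severe over": 2.5, "Model(31) moderate over": 1.25}
        wins_mmre = {name: 0 for name in competitors}
        wins_pred = {name: 0 for name in competitors}
        sizes = [50 * i for i in range(1, n + 1)]
        for _ in range(trials):
            actuals = [a_true * s ** b_true * math.exp(rng.gauss(0.3, 0.6)) for s in sizes]
            true_estimates = [a_true * s ** b_true for s in sizes]
            for name, scale in competitors.items():
                estimates = [scale * e for e in true_estimates]
                if mmre(actuals, estimates) < mmre(actuals, true_estimates):
                    wins_mmre[name] += 1
                if pred(actuals, estimates, 0.25) > pred(actuals, true_estimates, 0.25):
                    wins_pred[name] += 1
        return wins_mmre, wins_pred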

From Table 7 we find that PRED(.25) performs about as well as the best criterion described in [9] (the standard deviation of the MRE's) and that our MMRE results were consistent with those listed in [9]. We did not use bootstrapping for the 95% confidence intervals as they could be approximated directly. Given that all the models are of type (B), and we know the residuals are normally distributed, the sample MMRE's will be approximately log-normal and the sample PRED's will be approximately normal (see Section 8 for more discussion on this). Both MMRE and PRED are the sample means of their respective distributions from the 1000 trials (not of MRE's), and so the Cox method [34] provides a good approximation for the MMRE interval, while the standard normal interval [22] will work well for PRED.
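For reference, a sketch of how the two interval types just mentioned can be computed (our code; the Cox approximation is stated here in its commonly given form attributed to [34], not necessarily the exact computation used in the paper). The Cox method applies to the mean of log-normally distributed values such as the sample MMRE's over the trials, and the usual normal interval applies to the sample PRED's.

    import math

    def cox_lognormal_mean_ci(samples, z=1.96):
        """Cox-method confidence interval for the mean of log-normally distributed
        values (e.g. the sample MMRE's from the simulation trials)."""
        logs = [math.log(v) for v in samples]
        n = len(logs)
        ybar = sum(logs) / n
        s2 = sum((y - ybar) ** 2 for y in logs) / (n - 1)
        half = z * math.sqrt(s2 / n + s2 ** 2 / (2 * (n - 1)))
        centre = ybar + s2 / 2
        return math.exp(centre - half), math.exp(centre + half)

    def normal_mean_ci(samples, z=1.96):
        # Standard normal-approximation interval, used for the sample PRED's.
        n = len(samples)
        mean = sum(samples) / n
        se = math.sqrt(sum((v - mean) ** 2 for v in samples) / (n - 1)) / math.sqrt(n)
        return mean - z * se, mean + z * se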
Table 7. Number of times competing model selected over true model using MMRE and PRED(.25) criterions
                       Model(28)  Model(29)  Model(30)  Model(31)
    MMRE               974        1000       0          0
    PRED(.25)          369        488        110        231
    MMRE @ 95%         568        38         0          0
    PRED(.25) @ 95%    17         0          2          0

When applying the 95% confidence intervals we find good evidence that MMRE indeed seems to select Model(28) over the true model in over 50% of the trials, which is well outside the approximately 5% random variation level. There is insufficient evidence to support MMRE improperly selecting Model(29). Remarkably, we see no evidence that PRED(.25) is an unreliable model selection criterion. In fact, it appears to perform quite well in this capacity.

As an additional validation, we repeated the trials increasing to 500 simulated data points. In this case we expect smaller SE's and hence stronger evidence at the 95% confidence level. We see in Table 8 a dramatic strengthening of evidence that MMRE improperly selects Model(28) - at least with respect to the discussion in [9] - since at the 95% confidence level it did so for 100% of the 1000 trials. We see that PRED(.25) still behaves remarkably well at the 95% confidence level but curiously seems to select Model(29) frequently without confidence interval consideration. We have no explanation for this currently.

Table 8. Number of times competing model selected over true model using MMRE and PRED(.25) criterions, 500 pts
                       Model(28)  Model(29)  Model(30)  Model(31)
    MMRE               1000       1000       0          0
    MMRE @ 95%         1000       123        0          0
    PRED(.25)          268        766        0          3
    PRED(.25) @ 95%    6          14         0          0

Since PRED(x) has the tolerance level parameter x, we were curious what effect it might have on model selection. In Figure 4 the left plot shows the frequency of selections for the four models as x ranges from 0 to 1.5 (in increments of .1), whereas the plot on the right is when the 95% confidence interval was used. As the tolerance level is increased, we see that the underestimating models are selected more frequently and the overestimating models less frequently, perhaps indicating that PRED(x), like MMRE, favors models that underestimate. However, there is little confidence in this assertion given that at the 95% level we see that PRED(x) rapidly begins to select Model(28) - the most severe underestimate - only after x is above a ludicrous level of .7, and even then, the maximum is about 40% of the trials at PRED(1) (note the different scales).

Figure 4. PRED(x) selection frequencies vs. PRED level x (left: raw selection frequencies; right: selections at the 95% confidence level).

8. EMPIRICAL CHARACTERISTICS
Here we consider two important characteristics of PRED and MMRE - their sample distributions and the behavior of SE with respect to data set size. Example applications of these characteristics are also presented. The results are generally not replications or extensions of previous studies, but were chosen for their interest as well as ease of validation.

Distributional characteristics of PRED and MMRE
The left side of Figure 5 is an example "reconstructed" distribution of the bootstrapped MMRE's from use of model (A) on the NASA93 data set. It is, in theory, asymptotically normally distributed [7], yet clearly it is not normally distributed for the relatively small data set used, as is evident by looking at the closest fitting normal curve that is superimposed on the histogram in Figure 5. Other non-normality tests such as skewness and kurtosis, and a normal p-plot, are consistent with this finding (note that we would not expect a mode given that MMRE is not discrete). Our empirical investigations on the other models and data sets, including the simulated data set, reveal that in general the MRE's are log-normally distributed, and while unimodal, they are skewed to the left. In addition to some analytic results that support this belief, it is evidenced empirically by considering the log-transformed distribution displayed on the right of Figure 5, which looks decidedly normally distributed.

A complete analytical characterization of the distribution of the MRE's, and consequently of the sample MMRE, for COCOMO models is complex, but empirically it is clear that assuming normality is unjustified and will lead to inconsistent results.

Figure 5. Histogram of bootstrapped MMRE and log-transformed MMRE for model (A), NASA93 dataset.
    Bootstrapped MMRE:      Average 0.20, Median 0.20, Skewness 0.70, Kurtosis 0.46
    Log-transformed MMRE:   Average -1.63, Median -1.63, Skewness 0.19, Kurtosis -0.33
In contrast, PRED as seen in equation (4) is based on the sum of indicator values for the MRE's, which has a more tractable binomial distribution [22]. Even though the resulting distribution for sample PRED estimators is discrete (for a data set of size n there are only 2n possible sums of 0-1 indicator values, and thus only that many possible PRED values), it is approximated very well by a normal distribution, as can be seen in Figure 6. This is due to the classic normal approximation to the binomial [22], which makes PRED analytically easier to work with than MMRE, as we now show in the next consideration.

Figure 6. Histogram of bootstrapped PRED(.3) for model (D), COCOMO81 dataset.
    Average 0.22, Median 0.22, Mode 0.21, Skewness 0.16, Kurtosis -0.08

Application: how much data for significance?
Thus far we have only looked at 95%-confidence intervals. However, especially from a model selection point of view, it is interesting to ask: "How confident can I be that my chosen model will be significantly more accurate?" Therefore we chose a typical model selection example, COCOMO NASA in Figure 2, and decreased the significance level until the intervals no longer overlap. In Figure 7, we found that this occurs at a confidence level of 32% or lower, meaning that 68% of the time the actual value of the parameter for the error distribution may lie outside of the interval, where we cannot be certain how it compares to other values. A problem with this analysis is that it should only be applied to pairs of models, not all at the same time. When two models are essentially the same, the confidence must be reduced greatly to avoid interval overlap, yet there may be another model outside of these that would not overlap at a much higher level of confidence. Nonetheless, it illustrates an approach to obtaining a confidence level in choosing between two models.

Figure 7. Bootstrapped PRED(.3) non-overlapping 32% confidence intervals, COCOMO NASA. PRED(.3) (SE) per model:
    (A) 0.95000 (0.02835)  (B) 0.65000 (0.06198)  (C) 0.71667 (0.05827)  (D) 0.55000 (0.06410)  (E) 0.86667 (0.04378)

Alternatively we might ask the question: "How much data is needed to get significant results?" Because MMRE and PRED are consistent estimators, we know that the standard error must decrease as the number of data points increases. MMRE has a complicated relationship between standard error and data size; however, PRED is more tractable and it can be shown that

    SE_PRED(x) ≈ SD(1(MRE ≤ x)) / √N    (6)

(≈ means approximately equal), where SE stands for standard error and SD is the sample standard deviation of the indicator values of the MRE's (i.e. 1 for the MRE's less than or equal to x, 0 otherwise) for N data points. Since the bootstrap distributions for PRED are approximately normal (or Student t-distributed for small N), we can estimate N such that the 95% confidence intervals for models C and E are unlikely to overlap by considering where the PRED's are not within z_0.05 ≈ 1.645 SE for each of the models' respective standard errors, where z_0.05 is the 95% percentile of the standard normal.

For our example, we assume PRED for C is less than PRED for E, and so the intervals will not overlap when

    PRED_C + 1.645·SE_PRED_C < PRED_E − 1.645·SE_PRED_E    (7)

Substituting (6) and solving for N gives:

    N > ( 1.645·(SD(1(MRE_C ≤ x)) + SD(1(MRE_E ≤ x))) / (PRED_E(x) − PRED_C(x)) )²    (8)

Applying equation (8) to our previous example, the COCOMO NASA data set with PRED(.3), we find N > 76 if 95% confidence is desired that PRED_E > PRED_C, i.e. to have confidence that local calibration is PRED-superior over using a general model. To match Figure 7, where no intervals overlap, we need at least as much data as the maximum for any pair (again, keep in mind that there may be pairs that require less data). For COCOMO NASA this turns out to be Models (B) and (C), and (8) suggests N > 756. Given that the data set has only 60 points, we do not have a sufficient amount of data to conclude in confidence the rankings in Table 3, but we do have a reasonable estimate of how much more data we might need to collect to get to such a conclusion.

The method presented is somewhat crude, but simple and informative because we can easily visualize its meaning - the PRED(.3) intervals do not overlap. Strictly speaking we cannot say "the intervals do not overlap 95% of the time" without resorting to more refined methods such as iterating a one-sided, two-sample un-pooled t-test; however, such methods are more complex and are not without their own interpretation challenges (e.g. the normality assumption is valid for large samples, but t-tests are oriented to small samples). There are also more refined methods for approximating two-sample or multiple-sample confidence intervals that may provide tighter size estimates.
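Equation (8) is simple enough to evaluate directly; the sketch below (ours) does so, and the worked call uses the PRED(.3) values for models (C) and (E) on COCOMO NASA from Figure 2, with the indicator standard deviations approximated by sqrt(p(1-p)), which lands close to the paper's N > 76.

    def required_n(pred_c, pred_e, sd_c, sd_e, z=1.645):
        # Equation (8): smallest N beyond which the one-sided 1.645*SE bands of the
        # two PRED estimates no longer overlap (pred_e assumed larger than pred_c).
        return (z * (sd_c + sd_e) / (pred_e - pred_c)) ** 2

    # Worked check with the COCOMO NASA PRED(.3) values from Figure 2:
    p_c, p_e = 0.71667, 0.86667
    sd_c = (p_c * (1 - p_c)) ** 0.5   # ~0.45, SD of the 0-1 indicator 1(MRE <= .3)
    sd_e = (p_e * (1 - p_e)) ** 0.5   # ~0.34
    print(required_n(p_c, p_e, sd_c, sd_e))   # ~75, on the order of the paper's N > 76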
PRED and MMRE Estimator Performance
We now focus our empirical microscope on the question of how well, as estimators, MMRE and PRED perform with respect to data set size. The primary purpose of the experiments here is to compare expected versus observed behaviors of MMRE and PRED when increasing the sample size (i.e. more project data points). In doing so, we illuminate a number of striking differences between the two criterions.

The experiments were performed using Model (C) with the NASA93 data set. This was selected as both model and data are reasonably representative. Neither is highly specialized or "tuned and pruned" for specific results. We note in advance that we have performed the same experiments repeatedly with all the data sets and simulation data, and have observed no significant differences with respect to the analysis for the example presented.
Figure 8 shows a sample of four runs of randomly selecting a subset of size n = 1, 2, …, 93 from NASA93, calculating the MMRE and PRED(.25) for this subset, and bootstrapping to estimate their respective SE's. From the distributional results discussed above we could approximate the SE without bootstrapping, but we find it more confidence-inspiring to use bootstrapping and then verify with a non-bootstrap approximation.
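A sketch of the experiment behind Figure 8 (ours), reusing the mmre/pred helpers from Section 4 and the bootstrap sketch from Section 6: for each run and each subset size n it records the criterion values and their bootstrapped SE's. The inputs are the per-project actual efforts and the model (C) estimated efforts.

    import random

    def accuracy_vs_datasize(actuals, estimates, runs=4, boot_iters=1000, seed=1):
        """For each run, draw random subsets of size n = 1..N (without replacement),
        compute MMRE and PRED(.25) on each subset, and bootstrap their SE's."""
        rng = random.Random(seed)
        pairs = list(zip(actuals, estimates))
        results = []  # rows of (run, n, mmre, mmre_se, pred25, pred25_se)
        for run in range(runs):
            for n in range(1, len(pairs) + 1):
                subset = rng.sample(pairs, n)
                ys, y_hats = zip(*subset)
                mmre_se, _ = bootstrap(mmre, ys, y_hats, iterations=boot_iters)
                pred_se, _ = bootstrap(pred, ys, y_hats, iterations=boot_iters)
                results.append((run, n, mmre(ys, y_hats), mmre_se,
                                pred(ys, y_hats, 0.25), pred_se))
        return results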
We now present the experimental results as a series of "expected" and "observed" behaviors as the data size increases:

Case n=1
Expect: MMRE will be the MRE for the single data point, with SE = 0 as there is no variance in the sample data. PRED will be 0 if MRE > .25 or 1 otherwise, with SE = 0 for similar reasons.
Observed: Exactly as expected. This serves as a basic verification that the experiment is correctly configured.

Case n=93
Expect: All runs should have the same MMRE and PRED values and respective SE's, as all data is being used and there is no random variation from selecting subsets.
Observed: Exactly as expected. This serves as another basic verification that the experiment is correctly configured. We note that the bootstrapping introduces less than a .01 error in approximating the SE for MMRE and .004 in approximating the SE for PRED(.25).

Case n→93
Expect: MMRE and PRED are both consistent estimators for non-negative random variables, hence we expect that their respective values will uniformly converge (i.e. not depending on the particular value converged to) in probability to the distribution parameters they estimate. This also implies that the SE's should uniformly converge to 0. MMRE is an average of continuous random variables (approximately log-normal) and we would not expect a large deviation in the rate of convergence, especially as the data set gets large. In contrast with this, we anticipate a lot of bounded variation for PRED(.25), since each indicator 1(MRE ≤ .25) has expected value P(MRE ≤ .25), and so this is essentially a series of Bernoulli trials (i.e. flipping a weighted coin) where at each n the PRED may increase by (1 − PRED_{n−1})/n or decrease by PRED_{n−1}/n, where PRED_{n−1} is the PRED(.25) value at n−1 data points; hence the variations are tightly bounded by the previous values. Since each increase or decrease is divided by n, the magnitude of these variations decreases as n increases. Hence we may see a few rare long runs up or down, but generally a repeated series of short increases followed by short decreases that are smaller and smaller as n approaches 93.

From the fact that MRE is an absolute value, the SE for MMRE is dependent on P(z > 1)P(z ≤ 1), where the probability is from the distribution of z (see Section 4). For large enough n this value should be fairly constant when approximating it with an empirically derived distribution (as we are doing with bootstrapping). More importantly, however, the SE is also dependent on the variance of z. This implies that for large point variations in the data, perhaps due to outliers, we would expect large increases in an otherwise decreasing SE. Note that MMRE is the average of n MRE's, so its sample variance will be divided by n² and hence the SE will be divided by n. The PRED SE should be quite well behaved as it only depends on P(MRE ≤ .25)P(MRE > .25) and not the variance of the MRE's, as with MMRE. As stated previously, we expect the approximate values for these probabilities as derived from the empirical distribution used by the bootstrap to rapidly stabilize, and therefore we would expect the PRED SE to generally decrease on the order of 1/n.

Observed: The experimental runs have the characteristics described above, with a few additional notable items. First is that, contrary to popular belief in the cost estimation folklore, we see that in general more data does not imply better accuracy values. Even in our analysis of expected behavior above, we anticipated the possibility of both decreasing and increasing MMRE and PRED values as data size increases. All we can say with confidence is that the criterions will eventually converge, and a bit about how much variability to expect along the way. That we cannot predict, for any given set of data, that more data will improve the MMRE or PRED is an absolutely crucial fact about the use of these as accuracy criteria. This indicates that it is imperative to always consider the SE of these accuracy measures when using them, to avoid erroneous results due to sampling phenomena. We believe that this is a major source of inconsistent cost estimation research results, a few of which we have exemplified and resolved in Section 7.

Another notable observation is that we clearly see from large variations in the MMRE runs in Figure 8 that there are at least 3 significant outliers in the NASA93 data. Hence, as has been frequently claimed in the literature (and expected as above), MMRE is indeed very sensitive to outliers. We see that even very near the full 93 data points an outlier can cause a radical variation in the MMRE. This by itself does not necessarily make MMRE unreliable as an accuracy criterion. What does make it unreliable is how outliers affect the MMRE SE.

Figure 8. MMRE and PRED(.25) performance with respect to increasing data set size with model (C), NASA93. Panels: MMRE vs data size, MMRE SE vs data size, PRED(.25) vs data size, PRED(.25) SE vs data size.
A reliable estimator must have a strictly decreasing SE with increasing sample size. Given that an outlier may radically increase the MMRE SE at any time, we cannot reliably estimate it for a given data size, and thus we are uncertain about the true value of the error parameter estimated by MMRE.

In contrast, PRED is a reliable estimator. As expected, and observed, its SE is decreasing. Note that the small increases seen in the PRED SE runs in Figure 8 are the result of approximation errors and random variation from bootstrapping. Because the sample MMRE's are non-discrete, the SE is less affected by these kinds of errors. We also observe that the PRED SE tends to stabilize quickly, and that after 10 data points all runs more or less converge on the same trajectory. Thus we can reliably estimate the SE and have greater confidence in results based on PRED, so long as we account for this SE. In this sense, PRED is "robust" as an estimator. In talking about how much data is needed to reliably calibrate a COCOMO model, Barry Boehm is noted for suggesting that 10 data points is sufficient. Our experiments here provide some support for this heuristic given the stability of the PRED SE after 10 points.

Finally, we observe that MMRE's sensitivity to outliers can be put to practical use as an outlier "detector" for pruning rogue estimation data points. We expanded our experiment to indicate what point was added whenever the MMRE SE increased more than a given tolerance (in this experiment we used .025). We subsequently removed that point from the data set and re-ran the experiment. This resulted in a data set whose MMRE SE appeared reliable. We did not carefully investigate the pruned data to determine if the points in fact could be classified as outliers.

9. REFERENCES
[1] K. Moløkken, M. Jørgensen, "A Review of Surveys on Software Effort Estimation," International Symposium on Empirical Software Engineering, Rome, Italy, 2003
[2] I. Myrtveit, E. Stensrud, M. Shepperd, "Reliability and Validity in Comparative Studies of Software Prediction Models," IEEE Transactions on Software Engineering, Vol. 31, No. 5, 2005
[3] T. Menzies, Z. Chen, J. Hihn, K. Lum, "Selecting Best Practices for Effort Estimation," IEEE Transactions on Software Engineering, Vol. 32, No. 11, 2006
[4] M. Shepperd, "Evaluating Software Project Prediction Systems," 11th IEEE International Software Metrics Symposium, Como, Italy, 2005
[5] T. Menzies, D. Port, Z. Chen, J. Hihn, S. Stukes, "Validation Methods for Calibrating Software Effort Models," Proceedings of the 27th International Conference on Software Engineering, 2005
[6] I. Wieczorek, M. Ruhe, "How valuable is company-specific data compared to multi-company data for software cost estimation?," Proceedings of the Eighth IEEE Symposium on Software Metrics (METRICS 02), 2002
[7] C. Mooney, R. Duval, Bootstrapping: A Nonparametric Approach to Statistical Inference, Sage Publications, 1st edition, 1993
[8] B. Efron, "Bootstrap methods: Another look at the jackknife," The Annals of Statistics, 7, 1-26, 1979
[9] T. Foss, E. Stensrud, B. Kitchenham, I. Myrtveit, "A Simulation Study of the Model Evaluation Criterion MMRE," IEEE Transactions on Software Engineering, Vol. 29, No. 11, 2003
[10] M. Shepperd, G. Kadoda, "Using Simulation to Evaluate Prediction Techniques," Proc. Fifth Int'l Software Metrics Symp., 2001
[11] B. Boehm, Software Engineering Economics, Prentice Hall, 1981
[12] J.M. Desharnais, "Analyse statistique de la productivité des projets informatiques à partir de la technique des points de fonction," master's thesis, Univ. of Montreal, 1989
[13] M. Jørgensen, "How Much Does a Vacation Cost? or What is a Software Cost Estimate?," ACM SIGSOFT Software Engineering Notes, Vol. 28, No. 6, p. 5, 2003
[14] S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models, Benjamin-Cummings Publishing, 1986
[15] M. Jørgensen, "Experience with the accuracy of software maintenance task effort prediction models," IEEE Transactions on Software Engineering, Vol. 21, No. 8, 1995
[16] G. Boetticher, T. Menzies, and T. Ostrand, The PROMISE Repository of Empirical Software Engineering Data, 2007. http://promisedata.org/repository
[17] B. Kitchenham, L. Pickard, S. MacDonell, M. Shepperd, "What accuracy statistics really measure," IEE Proceedings - Software, Vol. 148, No. 3, 2001
[18] M. Jørgensen, M. Shepperd, "A Systematic Review of Software Development Cost Estimation Studies," IEEE Trans. Software Eng., Vol. 33, No. 1, 2007
[19] M. Shepperd, C. Schofield, "Estimating Software Effort using Analogies," IEEE Transactions on Software Engineering, 1997
[20] G. Hood, PopTools, http://www.cse.csiro.au/poptools, accessed 01/19/2008
[21] Vysochanskiï-Petunin inequality, http://en.wikipedia.org/wiki/Vysochanskii-Petunin_inequality, accessed 01/19/2008
[22] R. Larsen, M. Marx, An Introduction to Mathematical Statistics and its Applications, Second Edition, Prentice Hall, 1986
[23] L. Briand, T. Langley, and I. Wieczorek, "A replicated assessment and comparison of common software cost modeling techniques," Proceedings of the 22nd International Conference on Software Engineering, Limerick, Ireland, 2000, pp. 377-386
[24] K. Lum, J. Hihn, T. Menzies, "Studies in Software Cost Model Behavior: Do We Really Understand Cost Model Performance?," Proceedings of the ISPA International Conference 2006, Seattle, WA
[25] COCOMO81 dataset, http://promisedata.org/repository/#coc81, accessed 12/29/2007
[26] COCOMONASA dataset, http://promisedata.org/repository/#cocomonasa_v1, accessed 01/19/2008
[27] NASA93 dataset, http://promisedata.org/repository/#nasa93, accessed 12/29/2007
[28] Desharnais dataset, http://promisedata.org/repository/#desharnais, accessed 12/29/2007
[29] L. C. Briand, I. Wieczorek, "Resource Estimation in Software Engineering," Encyclopedia of Software Engineering, pp. 1160-1196, Wiley-Interscience Publishing, 2001
[30] BESTweb - Better Estimation of Software Tasks, http://www.simula.no/~simula/se/bestweb/, 2007
[31] L. Briand, K. El Emam, D. Surmann, I. Wieczorek, K. Maxwell, "An Assessment and Comparison of Common Software Cost Estimation Modeling Techniques," Proceedings of the 21st International Conference on Software Engineering, Los Angeles, California, United States, 1999
[32] B. Kitchenham, E. Mendes, G. Travassos, "Cross- vs. Within-Company Cost Estimation Studies: A Systematic Review," IEEE Transactions on Software Engineering, Vol. 33, No. 5, 2007
[33] C. Mair, M. Shepperd, M. Jørgensen, "An Analysis of Data Sets Used to Train and Validate Cost Prediction Systems," International Conference on Software Engineering, St. Louis, Missouri, USA, 2005
[34] C. E. Land, "Confidence intervals for linear functions of the normal mean and variance," Annals of Mathematical Statistics, 42, 1187-1205, 1971