
Ecological Modelling 176 (2004) 349–358

Model validation using equivalence tests


Andrew P. Robinson a,∗ , Robert E. Froese b,1
a Department of Forest Resources, University of Idaho, P.O. Box 441133, Moscow, Idaho, ID 83843, USA
b School of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI, USA
Received 23 April 2003; received in revised form 28 November 2003; accepted 17 January 2004

Abstract
Model validation that is based on statistical inference seeks to construct a statistical comparison of model predictions against
measurements of the target process. Previously, such validation has commonly used the hypothesis of no difference as the null
hypothesis; that is, the null hypothesis is that the model is acceptable. This is unsatisfactory because, under this approach, tests
are more likely to validate a model if they have low power. Here we suggest the use of tests of equivalence, which use the
hypothesis of dissimilarity as the null hypothesis; that is, the null hypothesis is that the model is unacceptable. Thus, they flip
the burden of proof back onto the model. We demonstrate the application of equivalence testing to model validation using an
empirical forest growth model and an extensive database of field measurements. Finally, we provide some simple power analyses
to guide future model validation exercises.
© 2004 Elsevier B.V. All rights reserved.
Keywords: FIA; FVS; Process model; Empirical model; Statistical model; Mathematical model

1. Introduction

Validation is a central aspect of the responsible application of models to scientific and managerial problems. The importance of validation to those who construct and use models is well recognized (Caswell, 1976; Gentil and Blake, 1981; Reynolds et al., 1981; Mayer and Butler, 1993; Oreskes et al., 1994; Rykiel, 1996; Loehle, 1997; Vanclay and Skovsgaard, 1997; Robinson and Ek, 2000). However, there is little consensus on the best way to proceed. This is at least in part due to the variety of models, model applications, and potential tests. The options are manifold, but the guidelines are few.

Statistical hypothesis testing, as distinguished from graphical or descriptive techniques, offers a framework that is particularly attractive for model validation. A test would compare a sample of observations taken from the target population against a sample of predictions taken from the model. The validity of the model is then assessed by examining the accuracy of model predictions. Such tests have numerous advantages: they provide an objective and quantifiable metric, they are amenable to reduction to a binary outcome, and therefore permit computation of error probability rates, and they accommodate sample-based uncertainty into the test result.

Not surprisingly, a number of statistical tools have been applied to validation problems. For example, Freese (1960) introduced an accuracy test based on the standard χ2 test. Ottosson and Håkanson (1997)

∗ Corresponding author. Tel.: +1-208-885-7115; fax: +1-208-885-6226.
E-mail addresses: andrewr@uidaho.edu (A.P. Robinson), froese@mtu.edu (R.E. Froese).
1 Tel.: +1-906-487-2723; fax: +1-906-487-2915.

0304-3800/$ – see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.ecolmodel.2004.01.013
used R2 and compared it with the so-called highest-possible R2, which are predictions from common units (parallel time-compatible sets). Jans-Hammermeister and McGill (1997) used an F-statistic-based lack of fit test. Landsberg et al. (2003) used R2 and relative mean bias. Bartelink (1998) graphed field data and predictions with confidence intervals. Finally, Alewell and Manderscheid (1998) used R2 and normalized mean absolute error (NMAE).

It is recognized, however, that the traditional application of hypothesis tests is inappropriate for model validation (Mayer and Butler, 1993; Loehle, 1997). This is because the usual null hypothesis is that there is no difference between the means, for example, of the populations. This is tantamount to saying that the model meets the accuracy standard. The null hypothesis would be of no difference between the mean of the model predictions and the mean of the process that it is supposed to mimic. This seems inappropriate, not to say unscientific, because the burden of proof rests on the alternative hypothesis, which is here that the model is not acceptable. This test strategy is also unsatisfactory because failure to reject the null hypothesis could be due to the model being acceptable, but it can also be interpreted as the user merely having chosen a test with low power.

Therefore, in model validation, the null hypothesis should be that the model is not valid (Loehle, 1997). The null hypothesis would be that the model does not meet the accuracy standard, and the alternative hypothesis would be that it does. There is no reason that such a test cannot be constructed: hypothesis testing is essentially symmetric, in that one establishes a rejection region that reflects one's interest, be that interest to test similarity or difference. Put another way, the burden of proof can be shifted.

The basic problem, that the burden of proof seems to be invested in the wrong side of the hypothesis, is parallel to a similar situation in biostatistics. The approval of generic pharmaceutical drugs requires evidence of bioequivalence with established drugs in both the United States and the European Community (Berger and Hsu, 1996). The regulations do not require that the generic drug be better than the accepted drug, but merely that it be as good as the accepted drug, by one or more of a number of accepted metrics of drug quality. Here again, a null hypothesis that the mean responses of the drugs are equivalent would be inappropriate. Tests of bioequivalence have been developed in response to this dilemma, and we propose that such tests can also be usefully applied to model validation in ecology, or more generally.

This notion has been introduced in the forestry literature, outside of the larger discussion of equivalence testing. Reynolds (1984) and Gregoire and Reynolds (1988) discussed the way the hypotheses were constructed for the test developed by Freese (1960) as it is traditionally applied. This test is defined in terms of an accuracy requirement:

P(|D| ≤ e) ≥ 1 − α    (1)

where D is the error in model prediction and e is the specified allowable error. If the probability of an error exceeding the allowable limit is less than some critical value, then the model is deemed to have passed the accuracy test. Under normality assumptions, this requirement may be reduced to a statistical null hypothesis (Gregoire and Reynolds, 1988):

σ2 ≤ e2/χ2(1−α)    (2)

where σ2 is the variance of the prediction errors. This is a test of model accuracy, in that in order to pass, a model must have an acceptably low combination of bias and imprecision. Power studies of Freese's test may be found in Wang and LeMay (1996) and Wang et al. (1996).

Reynolds (1984) acknowledged that all that can be concluded if the null hypothesis is not rejected is that there is insufficient evidence to do so; this is not the same as concluding that the model is valid. He proposed that the test be reversed, so that the null hypothesis would be that the model does not meet the accuracy standard, thus shifting the burden of proof onto the model. Gregoire and Reynolds (1988) called these alternatives "liberal" and "conservative" tests, and noted that it is possible that the liberal test favours the conclusion that the model is valid while the conservative test favours the opposite.

The Freese (1960) test illustrates the need to distinguish between the objective of the test and the specific statistical hypotheses employed. Reynolds (1984) notes that the translation of an error bound (Eq. (1)) into a variance bound (Eq. (2)) does not distinguish between systematic and stochastic error
components. If the systematic error component is large enough, then no matter how small the stochastic component is, the error criterion may be impossible to meet. Hence, Freese's test has been presented in the context where systematic error is expected to be estimated and corrected for during testing (Freese, 1960; Reynolds, 1984; Gregoire and Reynolds, 1988). Gregoire and Reynolds (1988) distinguish between three cases of application of the methodology: when systematic error is known to be zero, when a correction for systematic error is anticipated, and where a test for significant systematic error will be undertaken before a bias correction is made.

Therefore, care must be taken to select a metric of model performance and structure the test so that useful conclusions may be drawn. For example, some model users may be principally concerned about systematic error, or bias, such as managers concerned with large scale averages rather than specific cases. For these users, tests of means to identify equivalence may be most useful. Others are more concerned with testing questions of whether the precision of model predictions meets a necessary standard, and may be happy to simply quantify systematic error and correct for it.

We briefly review the potential utility of equivalence testing to model validation, and specifically for testing the equivalence of models evaluated in terms of mean prediction error, or bias. We then demonstrate these techniques using an empirical forest growth model and an extensive database of tree increment core measures. Finally, we produce power curves that summarize the sample sizes that would be necessary to detect nominated levels of similarity.

1.1. Equivalence testing

The problem of testing for equivalence is not unique to model validation. As noted above, testing for bioequivalence has similar properties (Berger and Hsu, 1996). The same issues appear in the psychological literature (Rogers et al., 1993). A formal treatment of the problem and its suite of solutions can be found in Wellek (2003).

The first step is to determine a metric of performance, which depends on the nature of the research questions being examined. Equivalence tests can be constructed to compare experimental treatments, or to compare test data to reference values. Research hypotheses involving precision, bias, response rates, survival times or goodness-of-fit may be constructed (Wellek, 2003). For illustration, we pose a simple premise: a model is not valid if model predictions are biased. We conduct a simple experiment using as a metric the difference between model predictions and observations of real data:

xdi = xi − x̂i    (3)

where xi is measured on the unit and x̂i is predicted for the unit, from the model. We will use x̄d, the sample mean of these differences, as the basis for testing. We summarize the relevant elements for model validation using a two one-sided t-test, TOST, as an example here. Other tests are the paired t-test for equivalence, which we demonstrate using data below, and the signed rank sum statistic (Wellek, 2003).

The key innovation that equivalence testing relies upon is the subjective choice of a region within which differences between test and reference data are considered negligible. For example, one might nominate a region of indifference as being ±25% of the standard deviation. This region is implemented as an interval around the nominated metric. That is, we might say that if the absolute value of the mean of the differences is less than 25% of the standard deviation, then it is negligible. Although this introduces a measure of subjectivity, that is not a great disadvantage. In any case, selection of the size of the test is still subjective. Furthermore, if one disagrees with the indifference criterion, it is straightforward to reverse-engineer the test, as long as sufficient information has been provided. We shall later indicate what information is required.

Having established the region of indifference, it is then enough to determine whether or not a special confidence interval for the metric, e.g. the mean of the differences, is contained completely within the region (Fig. 1). If the region of indifference completely encompasses the confidence interval, then the two populations are deemed significantly similar. If not, then the null hypothesis of difference is not rejected. In the case of the TOST, the confidence interval is calculated as two one-sided confidence intervals of size α. Note that, although this is algebraically identical to a two-sided confidence interval of size 2 × α, this is a coincidence, and is not true in higher dimensions (Berger and Hsu, 1996). Also, this is not necessarily the best test to use in any given situation, but it does provide a handy example as a basis for explanation.

Fig. 1. A demonstration of the two one-sided t-test (TOST) equivalence test. Our null hypothesis is that the mean of the population is different from 0, and our alternative hypothesis is that the mean is not different from 0. The test is simply to check whether the two one-sided α-level confidence intervals of the parameter are contained within the rejection region (−ε, +ε). In case (a), the test of dissimilarity is rejected. In cases (b) and (c) the test of dissimilarity is not rejected, i.e. we lack sufficient evidence to conclude that the mean is 0. Note that in case (c) the conjecture is probably true but the data are too variable to be certain.

Fig. 1 shows three of the possible outcomes: firstly, that the populations are the same and the sample size is sufficiently large to detect it; secondly, that the populations differ; and thirdly, that the populations are indeed identical, but the sample size is insufficient to detect that. This figure suggests that the power of the test will increase with sample size, which is another benefit of this approach to the test.

This brief overview captures the essential elements of equivalence testing: establishment of a region of indifference and comparison of that region with a sample-based confidence interval. We now demonstrate these techniques using an empirical forest growth model and an extensive database of tree increment core measures, and a more powerful test for equivalence: the paired t-test.
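The interval-inclusion recipe just described is straightforward to implement. The sketch below is ours, not code from the paper (whose analyses were run in R): it is written in Python, it substitutes the large-sample normal critical value 1.645 (α = 0.05) for the exact t quantile, and the function name and arguments are illustrative.

```python
import math
import statistics

def tost_equivalent(diffs, epsilon, z_alpha=1.645):
    """TOST by interval inclusion, large-sample form: declare equivalence
    if both one-sided alpha-level confidence bounds for the mean of the
    differences lie strictly inside the indifference region (-epsilon, +epsilon).
    z_alpha = 1.645 corresponds to alpha = 0.05."""
    n = len(diffs)
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    lower, upper = mean - z_alpha * se, mean + z_alpha * se
    return -epsilon < lower and upper < epsilon

# Differences tightly clustered about zero pass a generous criterion ...
print(tost_equivalent([0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.02], 0.5))   # True
# ... but not a strict one
print(tost_equivalent([0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.02], 0.05))  # False
```

For the small samples where the TOST is usually of interest, the 1.645 should be replaced by the α-quantile of the t distribution with n − 1 degrees of freedom.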
2. Materials and methods

We now demonstrate a hypothesis test for validation using FVS, a well established non-spatial tree-level forest growth model, and an extensive database of tree increment core measurements from the USDA Forest Inventory and Analysis (FIA) group. Again, we examine the model under the premise that the model is not valid if model predictions are biased. All data analyses were performed in the statistical environment R (Ihaka and Gentleman, 1996).

Our goal was only to demonstrate these test procedures; we are pursuing a more wide-ranging validation of the model separately. For this exercise we used only grand fir (Abies grandis (Dougl. ex D. Don) Lindl.), one of the most productive mid-elevation species in the state (Burns and Honkala, 1990; Brown and Chojnacky, 1996; O'Laughlin, 2002). There were thus 3393 trees for the comparison of diameter growth, and 549 plots for the comparison of volume growth, in the database.

2.1. Model: FVS

The forest vegetation simulator (FVS) is a non-spatial, tree-level, forest growth model and framework, with about 20 geographical variants (Wykoff et al., 1982; Teck et al., 1997; Robinson and Monserud, 2003). The variant that we used covers northern Idaho and western Montana, and originated as Prognosis (Stage, 1973). FVS is used by industry, academia and government, for purposes including inventory updating, developing silvicultural prescriptions, assessing value, and as a teaching tool (Teck et al., 1997).

We confined the validation exercise to the diameter growth sub-model. FVS is referred to as "diameter driven", because the principal growth sub-model is for tree diameter, and diameter increment is either directly or indirectly used as a predictor in other sub-models. The current version of the diameter growth model is as follows (Wykoff, 1990):

log(dds) = HAB + LOC + b1 cos(ASP)SL + b2 sin(ASP)SL + b3 SL + b4 SL2 + b5 EL + b6 EL2 + b7 HAB (CCF/100) + b8 log(DBH) + b9 CR + b10 CR2 + b11 (BAL/100) + b12 BAL/log(DBH + 1) + b13 LOC × DBH2    (4)

where dds is the increment of squared diameter, HAB a model intercept representing habitat type, LOC an intercept representing nearest National Forest, ASP the direction faced by the slope in degrees, SL the percentage ground angle deviation from horizontal, EL the elevation above sea-level, CCF the crown competition factor, a measure of competition (Krajicek et al., 1961), CR the live crown ratio, and BAL is the stand basal area in trees larger than the tree of interest. This model was fit separately to each species (Wykoff et al., 1982). For grand fir, tree heights were predicted from diameter outside bark measured at 1.37 m (4 ft 6 in.) above ground (dbh) using Eq. (5), and volumes were then predicted using Eq. (6). Conversions to metric units were necessary. These models are documented in Wykoff et al. (1982).

h(dbh) = 0.3048 × exp(5.00233 − 8.19365/((dbh/2.54) + 1)) + 1.37    (5)

V(dbh, h(dbh)) = 0.00234 × (dbh/2.54)2 × (h(dbh)/0.3048) × 0.028317    (6)

Trees with dbh of less than 22.86 cm (9 in.) were ignored for the volume computations, as they are not considered to hold any merchantable volume. Such volume functions are predicated on a particular product; these are based on the volume of sawlogs, which typically have a minimum length of 2.44 m (8 ft) and a minimum small-end diameter of 20.32 cm (8 in.). These volumes were summed within species and plot, and converted to a per-hectare basis. Plots that did not contain grand fir were ignored.

2.2. Data: FIA

Forest inventory and analysis (FIA) data were collected on a systematic grid across all forest conditions in Idaho and Montana by the USDA Forest Service (Bechtold and Zarnoch, 1999). Data from two sample designs were used because the FIA sample
design changed during the study period. The dataset comprised 2756 unique field locations, of which 2287 (83%) were from the old design and 469 (17%) were from the new design. The design details that are relevant to this exercise are summarized below.

In the old design, field locations were established systematically on a 5000 m grid, with each field location representing 2500 ha (about 6200 ac). A cluster of 5, 7, or 10 variable-radius point samples was established at each field location for trees greater than 12.7 cm dbh (5.0 in.). Point sample trees were selected using probability proportional to size via a constant basal area factor of 9.183 m2/ha (40 ft2/ac). At each point and for each tree of all species in the sample, the following attributes were recorded: dbh, height, and crown length, among other variables. Ten-year inside bark radial increment was measured for a subsample that included two trees, by species, for each 5.1 cm diameter class 7.6 cm and larger. Slope, aspect, elevation, habitat type and geographic location were recorded for each plot cluster.

In 1998, FIA began the transition to a new inventory design. The regional 5000 m grid used previously was replaced by a hexagonal grid, with each hexagon covering approximately 2430 ha (6000 ac). The FIA plot location associated with each hexagon was the existing FIA plot closest to the center, if one existed; otherwise a new plot location close to the center was established. The field location layout was a cluster of four 0.0169 ha (1/24 ac) fixed-area subplots. At each subplot, trees of all species greater than 12.7 cm dbh were measured for dbh, height, and crown length, among other variables. As with the previous design, 10-year inside bark radial increment was measured for a subsample that included two trees, by species, in 5.1 cm diameter classes 7.6 cm and larger.

In either case, diameter growth rates for unsampled trees were imputed from those of similar species, diameter class, and either plot, or habitat class, as available. Subtraction of the actual or imputed diameter growth from the current diameters gave us a picture of the likely stand conditions 10 years before the measurement. We then projected these measurements forward 10 years, using FVS. This is referred to as back-dating (Wykoff et al., 1982). We only did these projections for the trees for which we actually had measurements. Thus the imputed diameters were only used to represent stand conditions at the first time period for the purposes of predicting the other trees. The projections were corrected to reflect outside bark measurements using ratios of inside bark to outside bark diameters from Wykoff et al. (1982). Tree volumes for each time period were then estimated using Eqs. (5) and (6).

An important caveat is that we have measurements only for trees that existed on the plot at the current time, and information on recent mortality, that is, trees that died within the growth measurement period. We do not have information on harvest removals. It is possible that other trees existed 10 years before, and affected growing conditions for some time, and have since disappeared.

2.3. Equivalence testing

We focus on two metrics. Firstly, we compare the accuracy of diameter projections by comparing the diameter increment predictions that arise from Eq. (4) against tree-level measures from the FIA database. This can be thought of as a data-based validation of the model, in that it tests for unbiasedness (Rykiel, 1996). Secondly, we compare plot-level predictions of volume increment against plot-level observations of volume increment. Our goal is not to represent this as a true test of volume projection, as we are still using only the diameter core data. Instead we use this as an ad hoc transformation of the diameter measures into a form that will be more useful to model users: firstly, to a per hectare value estimated at the plot level, and secondly, with due emphasis on the larger trees, as errors in diameter projection will be more or less important to model utility depending on the level of diameter. Therefore, this second test can be thought of as a validation of the model utility.

The testing proceeded as follows; see Wellek (2003) for derivations. We elected a paired t-test of equivalence, which is more powerful than the TOST described above, but requires the assumption of normality (Wellek, 2003). Firstly, we examined quantile–quantile plots of the differences, which were acceptably normal (Gaussian) in distribution (not shown here). We then nominated a series of relative and absolute criteria for each metric. The null and alternative hypotheses in each case were:

H0 : µp − µm ≠ 0    (7)
H1 : µp − µm = 0    (8)

where p refers to predicted and m refers to measured. Three criteria were expressed relative to the sample standard deviation of the differences, sp−m:

(1) 10%;
(2) 25%; and
(3) 50%.

The latter pair correspond to strict and liberal choices according to guidelines in Wellek (2003). Then, for diameter, we also used 0.25 and 0.50 cm/decade, and for volume we used 1 and 5 m3/ha/decade, as absolute cutoffs.

Note that the relative and absolute tests are not intrinsically different; they merely involve different approaches to setting the desired limits. We include both because they facilitate interpretation of the results. Ordinarily one would choose one such criterion, but our goal was to demonstrate the equivalence test rather than to perform such a test rigorously. We calculated the:

• mean (x̄d);
• standard deviation (sxd); and
• standard error (sx̄d = sxd/√n) of the growth projection errors.

Here, n is the number of data pairs. Those criteria in absolute units were then converted relative to the standard deviation for the purposes of testing. Thus, each criterion (ε) was expressed relative to the standard deviation regardless of whether it was originally chosen in absolute or relative terms.

We next calculated the non-centrality parameter, ψ2 = n × ε2. Then the cutoff C̃α;n−1(ε) for a test of size α is the square root of the α-quantile of the non-central F distribution with degrees of freedom ν1 = 1 and ν2 = n − 1, and non-centrality parameter ψ2, as calculated above. We finally calculated the t-value corresponding to the observed mean and standard deviation by

td = x̄d/sx̄d    (9)

and compared it with the cutoff. If the absolute value of the t-value was lower than the cutoff then we rejected the null hypothesis of dissimilarity. We calculated power using the following relation from Wellek (2003):

β̃α;n−1(ε) = 2Ft(C̃α;n−1(ε)) − 1    (10)

where Ft is the cumulative distribution function of the central t distribution with n − 1 degrees of freedom, that is, the distribution of td when the true mean difference is zero.

This same relation (Eq. (10)) was then used to generate power curves (see Fig. 2). The family of curves was generated by establishing regions of indifference relative to the standard deviation of the sample, and calculating the probabilities of rejecting the null hypothesis of dissimilarity under a range of sample sizes at α = 0.01.

Fig. 2. A family of power curves. The null hypothesis is that the means are different, and the alternative hypothesis is that the means are identical. Each curve shows the probability of detecting significant similarity when the populations are identical, as a function of the sample size, at size α = 0.01. The curves are labeled with the ratio of the nominated interval of indifference to the standard deviation of the sample data. Thus, the 1/2 curve corresponds to the case where the standard deviation is twice as large as the interval within which differences are considered negligible. A horizontal dark gray line marks α = 0.01.

Finally, we used Eq. (10) again to calculate a profile of the possible tests for our diameter dataset. This provides a snapshot of the power functions for different test sizes as a function of the half-length of the nominated region of indifference. We refer to the half-length as these regions are here assumed to be symmetrical, although of course they need not be, and are most commonly referenced as being ±ε, which leads to the label: a half-length of ε. This graph summarizes the performance of the test relative to not only the stated test parameters but also other plausible ones.
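The calculations of Eqs. (9) and (10) can be sketched in code. The fragment below is our illustrative Python paraphrase, not the authors' R code: it replaces the exact critical value (the square root of the α-quantile of the non-central F distribution) with the large-ψ approximation ψ − z1−α, which is adequate for sample sizes like those used here but fails when ψ is small, and the function name and signature are ours.

```python
import math

def paired_equivalence_test(mean_d, sd_d, n, eps_rel, z_alpha=2.326):
    """Paired t-test of equivalence (after Wellek, 2003), large-sample sketch.
    eps_rel is the half-length of the indifference region expressed in units
    of the standard deviation of the differences; z_alpha = 2.326 gives a
    test of size alpha = 0.01. Returns (t, cutoff, reject_dissimilarity)."""
    t = mean_d / (sd_d / math.sqrt(n))      # Eq. (9)
    psi = math.sqrt(n) * eps_rel            # square root of the non-centrality
    cutoff = max(0.0, psi - z_alpha)        # approximate critical value
    return t, cutoff, abs(t) < cutoff       # reject H0 if |t| is below the cutoff

# Diameter errors (mean -0.19 cm/decade, s.d. 1.9, n = 3393):
# dissimilarity is rejected at 25% of the s.d. but not at 10%.
print(paired_equivalence_test(-0.19, 1.9, 3393, 0.25))
print(paired_equivalence_test(-0.19, 1.9, 3393, 0.10))
```

Applied to the summary statistics reported below in Table 1, the approximate cutoffs agree closely with the tabulated values.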

Table 1
Statistical summary of the equivalence tests for diameter and volume prediction error, for A. grandis
Metric Criterion Dissimilarity Mean S.D. Epsilon Cutoff t statistic Power

Diameter (cm) 0.25 Not rejected −0.19 1.9 0.13 5.33 −5.72 1.00
Diameter (cm) 0.50 Rejected −0.19 1.9 0.26 12.96 −5.72 1.00
Diameter (%) 10 Not rejected −0.19 1.9 0.10 3.50 −5.72 1.00
Diameter (%) 25 Rejected −0.19 1.9 0.25 12.21 −5.72 1.00
Diameter (%) 50 Rejected −0.19 1.9 0.50 27.77 −5.72 1.00
Volume (m3 /ha) 1 Not rejected 2.47 10.6 0.09 0.14 5.47 0.11
Volume (m3 /ha) 5 Rejected 2.47 10.6 0.47 8.69 5.47 1.00
Volume (%) 10 Not rejected 2.47 10.6 0.10 0.19 5.47 0.15
Volume (%) 25 Not rejected 2.47 10.6 0.25 3.52 5.47 1.00
Volume (%) 50 Rejected 2.47 10.6 0.50 9.30 5.47 1.00
All tests were at α = 0.01. The test is performed by comparing the absolute values of the t statistic against the cutoff. If the t statistic is
higher than the cutoff then the hypothesis of significant difference is not rejected. The mean and standard deviation of the diameters are
in cm/decade, and the mean and standard deviation of the volumes are in m3 /ha/decade.
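The sample-size reading taken from Fig. 2 in the results can be recovered analytically. Under the same large-sample approximation, the power to declare equivalence when the populations are identical is roughly 2Φ(√n·ε − z1−α) − 1, which inverts to a minimum sample size. This is our illustrative Python sketch, not the authors' R code, and the function names are ours.

```python
import math

def norm_ppf(p: float) -> float:
    """Standard normal quantile, by bisection on erf (standard library only)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def n_for_equivalence_power(eps_rel, power=0.8, alpha=0.01):
    """Smallest n giving approximately the stated power to reject the null
    hypothesis of dissimilarity when the populations are in fact identical.
    eps_rel is the indifference half-length in units of the s.d."""
    z = norm_ppf(1.0 - alpha) + norm_ppf((1.0 + power) / 2.0)
    return math.ceil((z / eps_rel) ** 2)

# For the 1/10 curve (epsilon = 0.10), about 1300 pairs are needed for 80%
# power, i.e. well over double the 549 plots in the volume comparison.
print(n_for_equivalence_power(0.10))
```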

3. Results

The results of the validation exercise are summarized in Table 1. For 10-year diameter increment, the null hypothesis of dissimilarity is not rejected at 0.25 cm/decade, but is rejected at 0.5 cm/decade; it is also rejected at 25% and 50%, but not at 10%, of the standard deviation. The power is very high in each case, never dropping below 1 to two decimal places. For 10-year volume increment, the null hypothesis of dissimilarity is not rejected at 1 m3/ha/decade, but it is rejected at 5 m3/ha/decade. Likewise, it is rejected at 50% of the standard deviation, but not at 10 or 25%. Again, the power is high except for the test against 1 m3/ha/decade and the test against 10%, but this is a very stringent level, so the result is not surprising.

The power curves are presented in Fig. 2. They show the probability of correctly rejecting the null hypothesis of dissimilarity as a function of the sample size, at size α = 0.01. That is, if the populations are identical, this is the probability of rejecting the null hypothesis of dissimilarity. The curves are labeled with the ratio of the nominated interval within which differences are considered negligible to the standard deviation of the sample data. Referring to Table 1, the comparison for volume at 10% corresponds to an ε of 0.10, which corresponds to the 1/10 curve on Fig. 2. This suggests that a sample size at least double that currently used would be required to obtain decent power for the test.

The profile curve is presented in Fig. 3. It shows clearly that despite the large sample size that we have available for the volume comparison, there is some difference between the power that arises from the tests using either size. Furthermore, the test is very powerful for all regions of indifference with half-length greater than 8 m3/ha/decade.

Fig. 3. A profile plot of power curves. Each curve shows the probability of detecting similarity as a function of the region of indifference for the volume growth prediction test, if the populations are identical. The curves are labeled with the size of the test.

We also calculated a few of the statistics that have been used for model validation. The R2 statistic between predicted and observed was 0.375. The NMAE was 5.33 and the relative bias was 47.8. Finally, Freese's test, which can be reconfigured to be similar to a test of equivalence (Reynolds, 1984; Gregoire
and Reynolds, 1988) did not reject the null hypothesis of dissimilarity at 0.25 and 0.5 cm/decade (Freese, 1960). However, this test combines bias and imprecision and would not be expected to necessarily give the same result as the equivalence test.

4. Discussion

The model seems reasonable as a scientific statement about the likely decadal growth of grand fir diameter. The null hypothesis of dissimilarity was rejected at a level considered stringent in other fields, and only failed to be rejected in the extreme case of 10%. This is an encouraging result. Likewise, the test of dissimilarity was rejected at 0.5 cm/decade for diameter growth, which is a good return compared with the average decadal diameter growth rate of 3.67 cm/decade for grand fir from the FIA database.

However, in terms of practical utility, the model did not fare so well. The null hypothesis of dissimilarity was not rejected at the stringent and extreme levels, and was rejected at the liberal level. This is despite the considerable power of the test for our hypothesis (see Fig. 3). In terms of the absolute criteria, the test was rejected at 20 m³/ha/decade but failed to be rejected at 10 m³/ha/decade. This is a less satisfactory outcome for model users. Table 1 shows that the bias is estimated at 2.47 m³/ha/decade.

The estimated volume growth for grand fir from the FIA data was about 27.9 m³/ha/decade. The contrast in the results noted above is a consequence of two things. Firstly, the volumes and diameters will give a different weight to errors. Volume is allometrically related to diameter, with the power being somewhere between 2 and 3 (2.615 for grand fir in our data), so prediction errors for diameter will be magnified in large trees. As noted above, trees with dbh of less than 22.86 cm (9 in.) were ignored for the volume computations, as they are not considered to hold any merchantable sawlog volume. Secondly, we had a six-times larger sample size for our diameter predictions. A more rigorous test would account for the hierarchical structure through a mixed-effects model.

We compared our results for 10-year diameter growth to those of other statistics that have been reported during exercises of model validation. The R² statistic between predicted and observed was 0.375. The authors that had used it did not provide any cutoff of what would be acceptable, preferring to use it as a metric (Ottosson and Håkanson, 1997; Alewell and Manderscheid, 1998; Landsberg et al., 2003). The NMAE was 5.33 (Alewell and Manderscheid, 1998). The relative bias was 47.8 (Landsberg et al., 2003). Again, no cutoffs were presented, suggesting that the values should be interpreted as summary statistics.

Each of the tests was quick and easy to apply, none being particularly more difficult or time-consuming than any other, once the data were cleaned and organized. The advantage of the equivalence tests and Freese's test is that test criteria can be established in terms of meaningful constraints, i.e. constraints that are measured in the same units as the data.

As noted earlier, we subjectively chose the region of indifference to be applied to the errors in growth projections. It is reasonable to expect that different indifference regions might arise from different applications. In order to facilitate reverse-engineering of this test, only the following quantities are needed:

• mean of the differences;
• standard deviation of the differences; and
• the sample size.

5. Conclusion

We have introduced and demonstrated tests of equivalence as tools for model validation. These tests flip the burden of proof on to the modeller. They do not follow the goodness of fit approach, where the null hypothesis is similarity. Instead, similarity is the alternative hypothesis, and the null hypothesis is dissimilarity. These tests formalize the suggestions of Loehle (1997) and locate them in a statistical framework, within which such considerations as power and size can be considered.

Acknowledgements

The data used for this exercise were kindly supplied by the Interior West Forest Inventory and Analysis Region in Ogden, UT. The authors are grateful to Professor Peter Dalgaard for assistance in computing the quantiles of a non-central F distribution. Support for
this research was from the University of Idaho Forest Biometrics Lab and the University of Idaho College of Natural Resources.

References

Alewell, C., Manderscheid, B., 1998. Use of objective criteria for the assessment of biogeochemical ecosystem models. Ecol. Modelling 107, 213–224.
Bartelink, H.H., 1998. Radiation interception by forest trees: a simulation study on effects of stand density and foliage clustering on absorption and transmission. Ecol. Modelling 105, 213–225.
Bechtold, W.A., Zarnoch, S.J., 1999. Field methods and data processing techniques associated with mapped inventory plots. In: Aguirre-Bravo, C., Franco, C.R. (Eds.), North American Science Symposium: Toward a Unified Framework for Inventorying and Monitoring Forest Ecosystem Resources. Guadalajara, Mexico (November 2–6, 1998). Proceedings RMRS-P-12. U.S. Department of Agriculture, Forest Service, Fort Collins, CO, pp. 421–424.
Berger, R.L., Hsu, J.C., 1996. Bioequivalence trials, intersection–union tests and equivalence confidence sets. Stat. Sci. 11 (4), 283–319.
Brown, M.J., Chojnacky, D.C., 1996. Idaho's forests, 1991. Resource Bulletin INT-RB-88, USDA Forest Service.
Burns, R.M., Honkala, B.H., 1990. Silvics of North America. 1. Conifers. Agriculture Handbook 654, USDA Forest Service.
Caswell, H., 1976. The validation problem. In: Patten, B. (Ed.), Systems Analysis and Simulation in Ecology, vol. 4. Academic Press, New York, pp. 313–325.
Freese, F., 1960. Testing accuracy. Forest Sci. 6 (2), 139–145.
Gentil, S., Blake, G., 1981. Validation of complex ecosystem models. Ecol. Modelling 14, 21–38.
Gregoire, T.G., Reynolds Jr., M.R., 1988. Accuracy testing and estimation alternatives. Forest Sci. 34 (2), 302–320.
Ihaka, R., Gentleman, R.C., 1996. R: A language for data analysis and graphics. J. Comput. Graphical Stat. 5 (3), 299–314.
Jans-Hammermeister, D.C., McGill, W.B., 1997. Evaluation of three simulation models used to describe plant residue decomposition in soil. Ecol. Modelling 104, 1–13.
Krajicek, J.E., Brinkman, K.A., Gingrich, S.F., 1961. Crown competition—a measure of density. Forest Sci. 7, 35–42.
Landsberg, J.J., Waring, R.H., Coops, N.C., 2003. Performance of the forest productivity model 3-PG applied to a wide range of forest types. Forest Ecol. Manage. 172, 199–214.
Loehle, C., 1997. A hypothesis testing framework for evaluating ecosystem model performance. Ecol. Modelling 97, 153–165.
Mayer, D.G., Butler, D.G., 1993. Statistical validation. Ecol. Modelling 68, 21–32.
O'Laughlin, J., 2002. Idaho forest health conditions—2002 update. Contribution 958, Idaho Forest, Wildlife and Range Experiment Station.
Oreskes, N., Shrader-Frechette, K., Belitz, K., 1994. Verification, validation, and confirmation of numerical models in the earth sciences. Science 263, 641–646.
Ottosson, F., Håkanson, L., 1997. Presentation and analysis of a model simulating the pH response of lake liming. Ecol. Modelling 105, 89–111.
Reynolds Jr., M.R., 1984. Estimating the error in model predictions. Forest Sci. 30 (2), 454–469.
Reynolds Jr., M.R., Burkhart, H.E., Daniels, R.F., 1981. Procedures for statistical validation of stochastic simulation models. Forest Sci. 27 (2), 349–364.
Robinson, A.P., Ek, A.R., 2000. The consequences of hierarchy for modelling in forest ecosystems. Can. J. Forest Res. 30 (12), 1837–1846.
Robinson, A.P., Monserud, R.A., 2003. Criteria for comparing the adaptability of forest growth models. Forest Ecol. Manage. 172 (1), 53–67.
Rogers, J.L., Howard, K.I., Vessey, J.T., 1993. Using significance tests to evaluate equivalence between two experimental groups. Psychol. Bull. 113 (3), 553–565.
Rykiel, E.J., 1996. Testing ecological models—the meaning of validation. Ecol. Modelling 90 (3), 229–244.
Stage, A.R., 1973. Prognosis model for stand development. RP-INT-137, USDA Forest Service.
Teck, R., Moeur, M., Adams, J., 1997. Proceedings of the Forest Vegetation Simulator Conference. GTR-INT-373, USDA Forest Service.
Vanclay, J.K., Skovsgaard, J.P., 1997. Evaluating forest growth models. Ecol. Modelling 98 (1), 1–12.
Wang, Y., LeMay, V.M., 1996. Sequential accuracy testing plans for the applicability of existing tree volume equations. Can. J. Forest Res. 26, 525–536.
Wang, Y., LeMay, V.M., Marshall, P.L., 1996. Relative efficiency and reliability of parametric and nonparametric sequential accuracy testing plans. Can. J. Forest Res. 26, 1724–1730.
Wellek, S., 2003. Testing Statistical Hypotheses of Equivalence. Chapman & Hall, London.
Wykoff, W.R., 1990. A basal area increment model for individual conifers in the northern Rocky Mountains. Forest Sci. 36, 1077–1104.
Wykoff, W.R., Crookston, N.L., Stage, A.R., 1982. User's guide to the stand prognosis model. GTR-INT-133, USDA Forest Service.