To cite this article: Dennis A. Ahlburg (1995) Simple versus complex models: Evaluation,
accuracy, and combining, Mathematical Population Studies: An International Journal of
Mathematical Demography, 5:3, 281-290, DOI: 10.1080/08898489509525406
Mathematical Population Studies, 1995, Vol. 5(3), pp. 281-290
© 1995 OPA (Overseas Publishers Association) Amsterdam B.V. Published under license by Gordon and Breach Science Publishers SA.
Reprints available directly from the publisher. Photocopying permitted by license only. Printed in Malaysia.

SIMPLE VERSUS COMPLEX MODELS: EVALUATION, ACCURACY, AND COMBINING
DENNIS A. AHLBURG
Industrial Relations Center and Center for Population Analysis and Policy,
University of Minnesota
and
Program on Population, East-West Center, 1777 East-West Road,
Honolulu, HI 96848, USA
This paper argues that it is premature to decide whether simple forecasting models in demography are
more (or less) accurate than complex models and whether causal models are more (or less) accurate than
noncausal models. It is also too early to say under what conditions one type of model can outperform
another. The paper also questions the wisdom of searching for a single best model or approach. It
suggests that combining forecasts may improve accuracy.
assumed paths of fertility, mortality, and migration, then they are complex/causal
models, according to Smith. This seems to have been the case until fairly recently.
Now that the Bureau uses statistical time series analysis for the future age-specific
fertility rates (Long, this collection), the position is a little less clear. Economic-
demographic models, represented in this collection by Warren Sanderson's paper,
are integrated (or linked) models with both demographic and economic detail. They
are distinct from purely demographic models (trend, time series, cohort-component)
that have no causal economic inputs, and from economic models in which the
demographic component is exogenous. They are thus causal and (generally) complex. Structural
economic models with exogenous demographics are complex and causal in their
economic structure but not in their demographic structure.
What this typology means is that when addressing the question of whether a sim-
ple model is more accurate than a complex model, one should do so within a class
Carbone and Armstrong (1982): users are concerned about the cost and time re-
quired to make the forecast, the ease of use and implementation, and the ease
of interpretation. While these attributes are certainly desirable, they should not be
overemphasized. As pointed out by Newbold and Bos (1994: 523), it is important
that the forecasts be generated through some intellectually plausible mechanism and
it is legitimate to remain skeptical when apparently good forecasts are generated by
an implausible mechanism (such as Andrei Rogers's pundit monkeys). However,
there may be a tradeoff between accuracy and these criteria. I think "assumption
drag" (see John Long's paper) can result from an undue focus on face validity, legitimacy,
and transparency and has contributed to the inaccuracy of the US Bureau
of the Census forecasts and its missing turning points in demographic series. A bal-
ance needs to be struck between the cost of generating and using a forecast and
the benefits of more accurate forecasts. As Newbold and Bos (1994: 523) note: "one
can derive benefit from driving an automobile without fully understanding exactly
how it works, though some general understanding is certainly useful." If we adopt
the use of a wider set of criteria in evaluating a population forecast, then we are
concerned with whether the marginal benefits of a change in methodology in terms
of increased accuracy outweigh the increased marginal cost from any decrease in
face validity, internal consistency, parsimony, or other criteria.
3 Avoiding statistical bias is also a desirable property. See Holden, Peel, and Thompson 1990.
example, the cost of errors is linear, then an absolute error cost function prevails
and an error measure such as mean absolute error (MAE) is appropriate. If the loss
function (cost of errors) is linear in percentages (rather than in absolute errors),
then MAPE is an appropriate error measure. If big errors are very costly, then a
quadratic loss function, that is, one that weights larger errors more heavily, should
be used. Mean square error (MSE) is one candidate; another is root mean square
error (RMSE), but, despite being the most widely used error measure, RMSE is
inappropriate for comparisons across models and time horizons (Chatfield, 1988;
Fildes and Makridakis, 1988; Armstrong and Collopy, 1992; Fildes, 1992) because
it is highly unreliable. For example, Chatfield (1988) showed that a comparison of
methods over 1001 series was dominated by the RMSE of about three series. Note
that one of the most famous results in population forecasting, that of Keyfitz (1981)
on the size of forecast errors from cohort-component models, is based on the use of RMSE across forecasts and
forecast horizons.4
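To keep the correspondence between loss functions and error measures concrete, the measures just discussed can be sketched as follows (my own illustration; the series and forecast are invented, only the definitions come from the text):

```python
# Illustrative sketch: error measures matching different loss functions.
import math

def mae(actual, forecast):
    # Mean absolute error: matches a loss that is linear in absolute errors.
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    # Mean absolute percentage error: matches a loss linear in percentages.
    return 100.0 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    # Root mean square error: quadratic loss, weighting big errors heavily.
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

# Hypothetical population series (millions) and a forecast of it.
actual = [100.0, 103.0, 107.0, 112.0]
forecast = [101.0, 102.0, 109.0, 108.0]

print(mae(actual, forecast))   # average miss in absolute units: 2.0
print(mape(actual, forecast))  # average miss in percent
print(rmse(actual, forecast))  # larger than MAE here: the 4-unit miss dominates
```

Because the single 4-unit error is squared, RMSE exceeds MAE on these data, which is exactly the sense in which a quadratic loss "weights larger errors more heavily."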
As Granger (1980) and Newbold and Bos (1994) note, it is difficult if not im-
possible to get users to specify a loss function. For the Bureau of the Census with
thousands, if not millions of users, it is impossible. Consequently most forecast-
ers who consciously choose a loss function assume a quadratic loss function. Such
an assumption also implies that costs are symmetric which may not be the case.
Asymmetric loss functions can be specified but require special treatment in fore-
cast evaluation and also some modification of the usual forecasting methodologies
(Newbold and Bos 1994: 525).
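To make the symmetry point concrete, here is a small sketch contrasting quadratic loss with one standard asymmetric form, the linex loss; the paper does not name a specific asymmetric function, so the choice of linex is mine:

```python
# Illustrative sketch: quadratic loss treats over- and under-forecasts
# symmetrically; an asymmetric (linex) loss does not.
# Convention here: error = forecast - actual.
import math

def quadratic_loss(error):
    return error ** 2

def linex_loss(error, a=1.0):
    # Linex loss: exp(a*e) - a*e - 1, minimized at e = 0.
    # For a > 0, over-forecasts are penalized more heavily than
    # under-forecasts of the same size.
    return math.exp(a * error) - a * error - 1.0

print(quadratic_loss(2.0), quadratic_loss(-2.0))  # equal costs
print(linex_loss(2.0), linex_loss(-2.0))          # over-forecast costs more
```

A forecaster minimizing such a loss will deliberately shade forecasts downward, which is why asymmetric losses require "some modification of the usual forecasting methodologies."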
There are cases though where an exclusive focus on loss functions is not war-
ranted. The most common case is that where turning points are of particular im-
portance. Here a method that correctly predicts the turning point is preferable to
a method that has a lower average error across the whole forecast period. Turning
points are important in demography but, at least at the national level, occur in-
frequently. However, because of assumption drag, such missed turning points also
influence the projections made after the turning points. Thus it is not clear how
much weight should be given to the ability to predict turning points at the cost of
inferior average accuracy. Andrei Rogers calls for some measure of "degree of diffi-
culty" by which to weight performance. This is a good idea. An alternative involving
the use of relative accuracy measures is discussed below.
Another case is where the outcome is not independent of the forecast. Here a
forecast may be ex-ante accurate but ex-post inaccurate because the ex-ante forecast
leads to action that changes the outcome. This is less likely to be the case in national
and state forecasts but may be relevant in certain small area forecasts.
4 Fildes and Makridakis (1988: 549) show that mean square error loss is sensible where a homogeneous
set of series is being evaluated and where the squared forecast errors have associated with them a
distribution of costs, specifically, constant unit costs independent of the variability of the series.
RMSE performs poorly on two other criteria for choosing error measures: reliability and validity. MAPE
scores reasonably well on these. Reliability is concerned with the extent to which an error measure
produces the same accuracy rankings for a set of methods when it is applied to different samples from a set
of time series. Construct validity is concerned with whether measures measure what they should be measuring,
that is, accuracy. Rankings of methods were found to differ by accuracy measure used (Armstrong and
Collopy, 1992: 73).
Finally, an accuracy measure may be consistent with the user's loss function but
may have properties that make it undesirable to use. This is illustrated in some re-
cent work on the choice of a measure of forecast accuracy when comparing models
across series. This is the typical procedure in sub-national forecast comparisons,
as well as in Keyfitz's work using population forecasts from different countries and
across forecast horizons. A key problem that arises is that accuracy measures should
be unit-free measures, otherwise series with large numbers might dominate compar-
isons. RMSE, often used in demography, is not unit-free. MAPE, another common
measure, is unit-free. Sometimes comparisons are made over series with different
amounts of change (discussed at length by Rogers). Using a relative measure that
compares the forecast from the model against those of another model is appropri-
ate. Such a measure is the Relative Absolute Error (RAE) which divides the abso-
lute forecast error by the corresponding error for the random walk. To summarize
across series the geometric mean of the RAE (the GMRAE) is taken.
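The relative measures just described can be sketched as follows; the three series and the random-walk benchmark are invented for illustration, and only the arithmetic follows the text:

```python
# Illustrative sketch: Relative Absolute Error (RAE) against a random-walk
# benchmark, summarized by its geometric mean (GMRAE) or median (MdRAE).
import math
import statistics

def rae(actual, forecast, last_observed):
    # The random walk forecasts "no change", so its absolute error is
    # |actual - last observed value|.
    return abs(actual - forecast) / abs(actual - last_observed)

# Three hypothetical series: (actual, model forecast, last observed value).
cases = [(110.0, 108.0, 100.0), (55.0, 54.0, 50.0), (22.0, 23.5, 20.0)]
raes = [rae(a, f, base) for a, f, base in cases]

gmrae = math.exp(sum(math.log(r) for r in raes) / len(raes))
mdrae = statistics.median(raes)

print(raes)    # per-series relative errors: below 1 beats the random walk
print(gmrae)   # geometric-mean summary across series
print(mdrae)   # median summary, robust to outlying series
```

Because each RAE is a ratio, the summary is unit-free, which is the property the text argues RMSE lacks for cross-series comparisons.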
If outliers are a concern they can be guarded against by the use of medians (Arm-
strong and Collopy 1992). To select among forecasting methods, they suggest using
the median RAE (MdRAE) when using a small number of time series and the
median absolute percentage error (MdAPE) when comparing across many series.
Fildes (1992) recommends the use of relative geometric RMSE because it has de-
sirable statistical properties and a simple interpretation as measuring the average
error of one method compared to another. Theil's U is another attractive measure.
The gist of these papers is to use an error measure with desirable properties and
if there is a clear loss function underlying the application, use an error measure
consistent with it. If the loss function is not known, the forecaster may want to
consider a number of error measures within the appropriate set discussed by Arm-
strong and Collopy and Fildes. Unfortunately, the error measures generally used in
demography are not those that have been shown to have desirable properties nor is
the match between loss function and error measure widespread.
5 Bodkin et al. (1991: 531) also recommend combining forecasts for high-frequency intervals (e.g. monthly)
from time-series models with forecasts of lower frequency (quarterly) from structural macroeconometric
models.
has observed "[forecasters] usually cannot anticipate the likely occasions that are in
prospect." Consider the following information from economic forecasting: the rank
correlation between the 1971 and 1972 accuracy of forecasts from 12 US econometric
models that varied substantially in complexity was -0.3
(McLaughlin, 1973). The average rank correlations for the relative accuracy of short-range
forecasts for five British macromodels on seven variables were even smaller
(Armstrong, 1985: 22). Wolf (1987) ranked 15 leading US macroeconomic forecast-
ers on the accuracy of four variables (real GNP growth, unemployment, the three-
month Treasury Bill rate, and inflation) for 1983 through 1986. The rank correlation
of accuracy over the three years was 0.168 and not significant. That is, a model's rel-
ative accuracy in one period is not a good guide to its accuracy in a future period.6
In a review of demographic comparisons McMillen and Long (1987) found that
time series models dominate economic-demographic models for short horizons but
not for longer horizons. Research on the accuracy of small-area forecasts has failed
to find a method that is superior to others for places with different rates of growth.
It is doubtful, therefore, whether a forecasting strategy that searches for the best
single model is a good strategy for forecasting even if one model is thought likely to
outperform others.
A large volume of empirical evidence in the general forecasting literature, includ-
ing that from demographic series (but not forecast by demographers) suggests that
the combination of forecasts leads to smaller forecast errors in practice (Newbold
and Bos 1994:495; Holden, Peel, and Thompson 1990: Chapter 3; Armstrong 1985).
To the best of my knowledge there has been only one study carried out in demogra-
phy on combined forecasts. In a study of forecasting the population of census tracts,
Smith and Shahidullah (1993) found that a forecast based on the simple average of
the forecasts from all extrapolation techniques was about as accurate as the single
most accurate method (which was not known ex-ante), although it was not as accu-
rate as "a combination of forecasts using only those techniques found to perform
particularly well for each type of place". As Smith and Shahidullah are aware, such
weighting schemes should be tested by using data not used in the construction of the
weights. Ex-ante forecast comparisons are best but ex-post comparisons for other
areas or time periods can also be useful.
A number of considerations have been suggested to help in deciding whether to
combine forecasts. First, combining forecasts is most likely to yield gains in accu-
racy when the forecasts combined are from different methodologies because they
are capturing different information sets or different specifications (Holden, Peel,
and Thompson 1990: 42, 96-97). Second, the series to be included can be chosen
on the basis of past ex-ante forecast errors. These should be unbiased, random,
and small. Experience in other fields suggests that the total number of forecasts
to be included is likely to be small, most likely between three and six. Third, even
though it can be shown theoretically that an optimal weighting scheme exists, such
an optimal weighting scheme changes over time if the covariances of the forecast
errors vary over time. Consequently, an empirically based scheme may be useful.
Fortunately, research on the combination of forecasts shows that a simple weighted
6 I am setting aside here the problems mentioned above with using certain measures of accuracy across
series and forecast horizons.
average of forecasts often works well relative to more complex combinations (see
Clemen, 1989, for an exhaustive survey).7
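The combining schemes discussed above can be sketched minimally as follows. The two "methods", their past errors, and the current forecasts are all invented; the inverse-squared-error weighting is one empirically based scheme of the kind the text mentions:

```python
# Illustrative sketch: combining two forecasts by a simple average and by
# weights based on the inverse of each method's past sum of squared errors.

def simple_average(forecasts):
    # Equal-weight combination: often hard to beat in practice.
    return sum(forecasts) / len(forecasts)

def inverse_sse_weights(past_errors_by_method):
    # Weight each method by 1 / (sum of its past squared errors), normalized.
    inv = [1.0 / sum(e * e for e in errs) for errs in past_errors_by_method]
    total = sum(inv)
    return [w / total for w in inv]

# Two hypothetical methods' past forecast errors and current forecasts.
past_errors = [[1.0, -1.0, 2.0], [3.0, -2.0, 4.0]]   # method A, method B
current = [105.0, 111.0]

weights = inverse_sse_weights(past_errors)
combined = sum(w * f for w, f in zip(weights, current))
print(simple_average(current))  # equal-weight combination: 108.0
print(weights)                  # method A, with smaller past errors, gets more weight
print(combined)                 # weighted combination, pulled toward method A
```

If the error covariances shift over time, these weights shift with the estimation window, which is why the text notes that an "optimal" scheme is a moving target.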
Most of the "rules of thumb" we have on combining forecasts have come from
a narrow set of extrapolative techniques. Thus, it is possible that new approaches
to combining may yield more efficient strategies for combining forecasts (see Arm-
strong and Collopy, 1993; Diebold, 1989; and Holden, Peel, and Thompson 1990).
The Armstrong-Collopy approach, which uses causal knowledge about the series
to be forecast to determine weights for combining extrapolative forecasts, shows
promise for application in demography, especially in sub-state forecasts where ex-
trapolative models are common. For example, Isserman (1977) and Smith (1987)
attempt to reduce forecast error by using domain knowledge in a structured way,
although not as formally and completely as Armstrong and Collopy. There is no
reason to believe that the approach could not be used to combine these extrapola-
tive models with more complex noncausal models or with causal models.
CONCLUSION
In my view, it is too early to say whether simple models are in general more accurate
than complex models or whether causal models are more accurate than noncausal
models. Research to date in demography has not established the clear superiority
of one type of approach over another. Furthermore, the papers by Rogers and by
McNown, Rogers, and Little show that some of the accepted results of population
forecasting have nothing to do with the accuracy of complex models relative to that
of simple models. The contradictory findings of Stoto and Keyfitz on simple models
versus complex models probably reflects the different lengths of the base periods
they used to construct the simple growth rate extrapolation (Rogers). McNown,
Rogers, and Little show that a comparison of the accuracy of simple and complex
extrapolation models (time series models) would most likely be determined by the
choice of the sample period for computing the historical rates of change of fertil-
ity. Forecasts based on information from the past five years would show persistent
increases in fertility, whereas those based on the last thirty years show dramatic
declines in fertility. They show that both the point estimates and the confidence
intervals are affected. In addition, most of the comparisons we have in demogra-
phy are based on fit to the historical data or ex-post errors. It has been shown that
ex-ante not ex-post comparisons are the appropriate basis for comparing forecast
accuracy (Pant and Starbuck 1990, and Armstrong 1985: 241-242).
I think that it is also too early to say under what conditions simple models can
outperform complex models. While it is quite possible that different models may be
more accurate for different periods of change, forecast horizons, and for different
types of series, there have been too few careful comparisons for any general conclu-
sions to be drawn. I disagree with Andrei Rogers's conclusion that "complex mod-
7 Newbold and Bos (1994: 510-511) found that weights based on the inverse of the sum of squared forecast
errors perform well compared with regression-based weights. Regression-based weights, such as
those mentioned by Smith and Shahidullah (1993: 13-14), perform well when a small number of the
forecasts are clearly inferior. However, if this is the case it may be best to exclude these forecasts.
els have outperformed simple models in times with relatively stable demographic
trends, when the degree of difficulty has been relatively low, and have been outper-
formed by simple models in times of significant unexpected shifts in such trends,
when the degree of difficulty has been relatively high." The RMSEs provided by
John Long (Table 1) do not support these conclusions. If we take the period of
the late-1950s and early-1960s and the period 1975-1985 as being relatively stable,8
then the simple constant growth model outperforms the cohort-component model
in the latter period. While it is true that the cohort-component model outperforms
the simple model in the earlier period for five-year-ahead forecasts, it does not do
so for both 20-year forecasts. If we were to include the 1955 forecast, the picture
is even murkier. Rogers also concludes that "simple models have outperformed
complex models at major turning points in U.S. demographic trends." Again Long's
error statistics do not support this conclusion. Assuming the major turning points to
be those in fertility in 1957 and 1976, the cohort component model outperforms the
simple model just after the 1957 peak (for the five year but not the 20 year forecast)
but not after the 1976 trough in fertility. It is of course possible that Rogers's con-
clusions may hold under different simple models. One area where we may have a
"general rule" is in forecasting age-specific populations. Here it does seem that the
cohort component model's use of information on the age structure helps as claimed
by McNown, Rogers, and Little (this collection). But note that the results in Long's
Table 2 show that this is not universally true: although all of the Census Bureau's
cohort component projections are more accurate than a simple constant growth
model for the population aged 15-19, in two of the seven forecasts for the popula-
tion 60-64 years, the simple model outperforms the cohort component model.
It is not clear that our search for the single best model or approach makes sense
because, as I have argued above and is illustrated in the papers here: 1) the fore-
caster rarely has enough information to assert with any great conviction that a par-
ticular model is superior to all others, and 2) even when a particular forecast ap-
pears to be superior, it does not necessarily follow that the other forecasts contain
no useful information.9 The extensive literature on combining forecasts suggests that
forecast accuracy can be improved, often greatly, by combining the forecasts from
different models. The one study in demography that combines forecasts finds this to
be the case.
Where, then, do we need further research? I think the following areas would
repay further work:
1. I agree with Andrei Rogers that we need forecasting competitions of alter-
native models. These have been very helpful in economic forecasting and in busi-
8 These periods were identified by Andrei Rogers (personal communication). They accord well with periods
of relatively lower decade change in the growth rate of population and with lower annual net change
in population per thousand.
9 The usefulness of the other forecasts can be established by regressing the actual value of the forecast
variable on a constant and the predicted values of the variable from the different models. If a model's
forecast contains all the information in another model's forecast and some additional information, then its
forecast should be significant in this regression, and the other models' forecasts should not. If both forecasts
contain independent information, then both should be significant. If neither contains useful information, then
neither should be significant. See Fair and Shiller (1990) and Fair (1993). One assumes that "usefulness"
established in this fashion is robust over time.
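As a sketch of this encompassing regression, one can simulate a case where the first forecast is informative and the second is pure noise (my simulation, not an example from Fair and Shiller; NumPy is assumed available):

```python
# Illustrative sketch of the encompassing regression described above:
# regress the actual values on a constant and two competing forecasts.
# Data are simulated; forecast f1 tracks the truth, f2 is noise.
import numpy as np

rng = np.random.default_rng(0)
n = 200
truth = np.cumsum(rng.normal(1.0, 1.0, n))   # a drifting target series
f1 = truth + rng.normal(0.0, 0.5, n)         # informative forecast
f2 = rng.normal(0.0, 5.0, n)                 # uninformative forecast

X = np.column_stack([np.ones(n), f1, f2])    # constant + both forecasts
coef, *_ = np.linalg.lstsq(X, truth, rcond=None)
print(coef)  # weight on f1 should be near 1, on f2 near 0
```

In a real application one would also examine significance, not just point estimates, but the pattern of coefficients already shows which forecast carries the information.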
and the potential replacement of mortality experts by the time series approaches
of Lee and Carter and Rogers and McNown. Recent advances in forecasting have
suggested a procedure for structuring the domain knowledge of experts to enhance
forecast accuracy. The conditions under which this approach is useful seem to fit
nicely with the traditions of the cohort component method. They are a) the fore-
caster has expert knowledge, b) the trend of the target series is affected by more
than one important causal force, c) the series can be decomposed such that the
separate causal forces can be specified for at least one of the components, and d)
it is expected that the components can be forecast more accurately than the target
values (Collopy and Armstrong 1994).
4. Uncertainty. I think the papers in this collection have made a great contri-
bution to our understanding of uncertainty in population forecasts. One thing that
needs to be done is to educate the users of population forecasts on measures of
uncertainty beyond "high" and "low" scenarios. In particular, we need to consider
the implications of Keyfitz's (1981) observation, re-emphasized in the paper by Mc-
Nown, Rogers, and Little, that demographic forecasts beyond 20 to 30 years ahead
convey little information.
REFERENCES
Ahlburg, D. A. (1982) How accurate are the U.S. Bureau of the Census' projections of total live births?
Journal of Forecasting 1: 365-374.
Ahlburg, D. A. (1987) Population forecasting. In S. Makridakis and S. Wheelwright (eds.), The Handbook
of Forecasting: A Manager's Guide, second edition, 135-149. New York: Wiley.
Ahlburg, D. A. (1990) A Comparison of the Ex-ante Forecasts of U.S. Births From an Economic-Demo-
graphic Model and the Bureau of the Census, paper presented at the Annual Meeting of the Population
Association of America, Toronto, Ontario.
Ahlburg, D. A., McPherson, M., and Schapiro, M. O. (1993) Incorporating enrollment forecasts into
projections of Pell Program Costs: A study of feasibility and effectiveness. Report for the Department
of Education (Washington, DC), September.
Armstrong, J. S. (1985) Long-Range Forecasting, second edition. New York: Wiley.
Armstrong, J. S., and Collopy, F. (1992) Error measures for generalizing about forecasting methods:
Empirical comparisons. International Journal of Forecasting 8: 69-80.
Armstrong, J. S., and Collopy, F. (1993) Causal forces: Structuring knowledge for time-series extrapola-
tion. Journal of Forecasting 12: 103-115.
Beaumont, P., and Isserman, A. (1987) Comment. Journal of the American Statistical Association 82:
1004-1009.
Bodkin, R., Klein, L. R., and Marwah, K. (1991) A History of Macroeconometric Modelbuilding. Brook-
field, VT: E. Elgar.
Carbone, R., and Armstrong, J. S. (1982) Evaluation of extrapolative forecasting methods: Results of a
survey of academicians and practitioners. Journal of Forecasting 1: 215-217.
Chatfield, C. (1988) Apples, oranges, and mean square error. International Journal of Forecasting 4: 515-
518.
Clemen, R. T. (1989) Combining forecasts: A review and annotated bibliography. International Journal
of Forecasting 5: 559-584.
Collopy, F., and Armstrong, J. S. (1992) Rule-based forecasting: Development and validation of an expert
systems approach to combining time series extrapolations. Management Science 38.
Collopy, F., and Armstrong, J. S. (1994) Decomposition of time series by causal forces: A decision
process for structuring forecasting problems. Working paper.
Diebold, F. X. (1989) Forecast combining and encompassing: Reconciling two divergent literatures. Inter-
national Journal of Forecasting 5: 589-592.
Fair, R. C. (1993) Testing macroeconometric models. American Economic Review 83: 287-293.
Fair, R. C., and Shiller, R. J. (1990) Comparing information in forecasts from econometric models.