International Journal of Forecasting 8 (1992) 233-241


Top-down or bottom-up: Aggregate versus disaggregate extrapolations
Byron J. Dangerfield and John S. Morris
College of Business and Economics, University of Idaho, Moscow, ID 83843, USA

Abstract: Two approaches have been suggested for forecasting items in a product line. The top-down
(TD) approach uses an aggregate forecast model to develop a summary forecast, which is then allocated
to individual items on the basis of their historical relative frequency. The bottom-up (BU) approach
employs an individual forecast model for each of the items in the family. The present study compares
these two approaches by using over 15,000 aggregate series constructed by combining individual series
from the M-competition database. The effects of correlation between individual items and the relative
frequency of individual items in the family are examined. In most situations, BU forecasting of family
items produces more accurate forecasts.

Keywords: Family forecasts, Top-down forecasts, Aggregate forecasts, M-competition.

Correspondence to: B.J. Dangerfield, College of Business and Economics, University of Idaho,
Moscow, ID 83843, USA. Tel: (208) 885-6478; Fax: (208) 885-8939.

1. Introduction

Many organizations find it necessary to forecast individual items that make up a family or
group classification. For example, a personal computer manufacturer may offer two models
that make up its lap-top PC family. Two general approaches have been suggested for
developing forecasts for individual models or items. One approach might be referred to as a
top-down (TD) strategy, since a single forecast model is developed to forecast an aggregate or
family total, which is then distributed to the individual items in the family based upon their
historical proportion of the family total. The other approach might be labelled a bottom-up
(BU) strategy, since multiple forecast models based upon the individual item series are used
to develop item forecasts.

2. Previous research

A number of TD allocation methods have been suggested, all of which rely on using historical
proportions for individual items. Brown (1962) suggested a vector smoothing method that
updates the item proportions each period before the aggregate forecast is distributed. Initial
item proportions are calculated as the ratio of the item's cumulative demand to the family's
cumulative demand for some historical interval; the item proportions are then updated
through an exponential smoothing formula based on the prior period's item proportion and its
most recent actual historical proportion of family demand. Item forecasts are developed by
applying smoothed proportions to the aggregate or family forecast. Cohen (1966) developed a
blending method for forecasting item demand; a forecast for the group is blended with the
average demand for each item. Hausman and Sides (1973) used perhaps the simplest allocation
scheme in their approach, which consisted of apportioning the aggregate forecast based on the
year-to-date percentage of total demand for each item.
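
The proportion-updating scheme attributed to Brown (1962) above can be sketched as follows.
This is only an illustration of the description given in this section, not Brown's exact
formulation: the function names, the data layout, and the smoothing constant `alpha` are
assumptions introduced here.

```python
def initial_proportions(history):
    """Item shares from cumulative demand over a historical interval.

    `history` is a list of periods; each period is a list of item demands.
    """
    n_items = len(history[0])
    item_totals = [sum(period[i] for period in history) for i in range(n_items)]
    family_total = sum(item_totals)
    return [total / family_total for total in item_totals]


def update_proportions(previous, actual_demands, alpha=0.2):
    """Exponentially smooth each item's share toward its latest actual share.

    `alpha` is a hypothetical smoothing constant, not a value from the paper.
    """
    family_demand = sum(actual_demands)
    actual_shares = [d / family_demand for d in actual_demands]
    # Both inputs sum to 1, so the convex combination also sums to 1.
    return [alpha * a + (1 - alpha) * p for a, p in zip(actual_shares, previous)]


def allocate(aggregate_forecast, proportions):
    """Distribute a family forecast to items using the smoothed proportions."""
    return [p * aggregate_forecast for p in proportions]
```

A typical use would compute `initial_proportions` once, then call `update_proportions` each
period before passing the result to `allocate`.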


Support for the TD approach is usually based on the statistical fact that the variance of the
aggregate demand is equal to the sum of the variances of independent item demands. For
example, McLeavey and Narasimhan (1985), in their discussion of multi-item forecasting
techniques, conclude that one can generate a more accurate forecast for a group of items than
for individual items in the group (p. 67). Fogarty and Hoffmann (1983) echo this sentiment in
their discussion of this variance relationship: "This equation means that if we simply add
together the forecasts for the individual items, the variance will be quite large. Hence, it is
usually better to forecast total demand directly than to sum component forecasts" (p. 81). Of
course, the item demands may or may not be independent, and it is not clear that the TD
allocation methods will improve forecast accuracy for individual items even if the
independence assumption is correct.

Several authors have pointed out the weaknesses of TD. Theil (1984) described conditions
under which bias is introduced when aggregating from microeconomic to macroeconomic
relations. Edwards and Orcutt (1969) and Orcutt, Watts and Edwards (1968) argued that
information losses resulting from aggregated data could be substantial, and hence they
generally supported bottom-up approaches. Zellner (1969) agreed, noting that aggregating
data involves an important loss of information. Aigner and Goldfeld (1974) addressed the
impact of measurement error for independent variables and found no unequivocal superiority
for the models constructed from aggregate data.

There has been only limited empirical testing of the two approaches, but the evidence so far
supports the BU approach. Dunn, William and Spiney (1971) found that forecasts aggregated
from lower-level modeling worked best in forecasting demand for telephones. Time series
models were developed for each of nine local outlets. The models varied but included
exponential smoothing and autoregressive integrated moving average (ARIMA) models. The
authors measured error by using mean absolute deviation (MAD), mean squared error (MSE),
and a scaled-error criterion. Summing the forecasts from these local models proved more
accurate than forecasting from aggregate data. Dangerfield and Morris (1988) examined the
relative performance of TD and BU approaches in forecasting the returns of individual
species of fish that make up the total run of anadromous fish in the Columbia River drainage.
They developed exponential smoothing models for the total run, as well as for the individual
species runs, and found that forecasts for the individual species were more accurate when
separate exponential smoothing models for each species (bottom-up) were used. Similar
results have been obtained using econometric methods to forecast earnings. Kinney (1971)
found that disaggregating earnings data by market segments resulted in more accurate
forecasts than when firm-level data were used. Collins (1976) compared segmented
econometric models with aggregate models for a group of 96 firms. The segmented models
using disaggregated data produced more accurate forecasts for both sales and profit.

Schwartzkopf, Tersine and Morris (1988) discussed the relative merits of TD and BU
strategies and concluded that the relative performance of these two approaches depends upon
three sources of error: estimation precision (variability of the estimate around the predicted
value), bias (deviation of the mean of the estimate from the true value), and outlier influence
(sensitivity to bad data). Using an analytical model of mean squared error for a two-item
family, they showed that item correlation affected the precision and bias error components in
different ways. In TD forecasting, negative item correlation reduced variability in the
aggregate series, but increased item model bias. They also found that item proportions in the
aggregate series could have an effect on the relative performance of the two techniques.
Similar item proportions increased the effects of model bias and outlier influence when the
TD approach was used and, therefore, favored BU forecasting. However, no specific guidance
was given for selecting between the two approaches.

The purpose of this research was to examine the relative performance of the two different
methods (TD or BU) using exponential smoothing models and simple two-item families. The
two approaches were tested on over 15,000 aggregate time series, each consisting of two
individual item series with different correlations and item proportions. In addition to observing
the overall performance of the two approaches, we hoped to observe differences due to the
correlation and item proportion effects described by Schwartzkopf, Tersine and Morris (1988).

3. Methodology

The two forecasting approaches were compared using both carefully specified forecast models
and models whose parameters were randomly specified. The results of the random model
specification were used to examine the effect of model specification on the relative
performance of the two methods. Next, the results using the carefully specified models were
grouped into three classes of item correlation (high negative, low, high positive) and three
classes of item one's proportion of the aggregate series (low, medium, high). A detailed
discussion of the experimental variables, along with any common factors and assumptions, is
given below.

3.1. Time series

The M-competition data [Makridakis et al. (1982)] were selected to test the accuracy of the
two approaches; these data consist of 1,001 time series that are classified as either micro or
macro data. Each time series has two subsections: a specification subsection containing data
points used to specify the model, and a holdout subsection to test the accuracy of the model.
In addition to the time series observations, the M-competition data include a set of seasonal
indices for each series.

A subset of 192 monthly series was chosen from the 1,001 series. Monthly series were selected
since the literature favoring the TD methodology is largely found in the
production/operations management field, where short-term forecasts are the norm. The 192
series comprised the set of series from number 395 to 586. These 192 series were further
reduced to 178 because we wished to consider only series that had at least 48 data points in
the specification subset. A specification subsample of the most recent 48 observations was
used to develop forecast models for the TD and BU approaches for the carefully specified
models. Only the last four years of data were used from each series so that all specification
subsets had the same length. This 48-point restriction, while arbitrary, allowed equitable
specification for all series as well as uniformity in the aggregation of separate series. The
holdout or test sample consisted of 18 observations for the monthly series.

The series were then combined to create an aggregate family composed of two item series. A
total of 15,753 aggregate time series was constructed using all possible unique combinations
of pairs of the 178 series selected. Thus, each of the constructed aggregate or family series
was composed of two of the 178 individual item series. These pairs of item series varied with
respect to correlation (-0.96 < r < 1.0), to seasonal and trend patterns, and to their relative
proportion of the aggregate series. Exhibits 1 and 2 show the relative frequency distributions
of families by item correlation and item proportion, both of which have been rounded to the
nearest tenth.

[Exhibit 1. Frequencies of correlation coefficient. Horizontal axis: correlation coefficient, from -1 to 1.]

[Exhibit 2. Frequencies of item 1 proportion. Horizontal axis: proportion of item 1, from 0 to 1.]

3.2. Forecast models

To compare the performance of the two forecasting methodologies, exponential smoothing
(ES) models were developed for both the TD and BU approaches. We chose to use Winters'
(1960) triple smoothing approach since it is commonly found in production/operations
management texts [e.g. Tersine (1985), Vollmann, Berry and Whybark (1988)] and has
performed well relative to other forecasting models in tests such as the M-competition. This
model has three smoothing constants (a, b, and c) to smooth level, trend, and seasonal indices
and is described in Makridakis, Wheelwright and McGee (1983). Two approaches were used
to select smoothing constants.

3.3. Best fit model specification

Models were formulated for each of the aggregate series for the TD method as well as for
each of the item series used in the BU approach using the 48-point specification subset.
Smoothing constants were selected after a grid search over both the family series (TD) and
each of the individual item series (BU). The initial trend and level for this search process
were estimated by fitting a regression line to each of the aggregate and individual series. The
seasonal indices supplied in the M-competition data were used as initial indices for item
series in the constant search procedure. Initial aggregate series indices were calculated as
proportionally weighted averages of these item indices. The combination of constants that
minimized mean absolute percentage error (MAPE) was selected for use in the respective
models. Makridakis, Wheelwright and McGee (1983, pp. 45-47) suggest that a relative error
criterion such as MAPE is more appropriate than mean squared error (MSE) in model
specification due to the potential problem of overfitting, which is equivalent to including
randomness as part of the generating process.
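
The constant search described above might look roughly like the sketch below. It is not the
authors' code. Several details are assumptions introduced here: the multiplicative form of the
seasonal update, the 0.1 to 0.9 grid, scoring by one-step-ahead fitted values, and all function
names. Only the regression-based starting level and trend, the supplied seasonal indices, and
the MAPE criterion come from the text.

```python
import itertools


def mape(actual, forecast):
    """Mean absolute percentage error in percent; assumes no zero actuals."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def linear_fit(y):
    """Least-squares slope and intercept of y against t = 0, 1, 2, ..."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def winters_one_step(y, season_len, a, b, c, level, trend, seasonals):
    """One-step-ahead fitted values from multiplicative triple smoothing (assumed form)."""
    seasonals = list(seasonals)  # copy so the caller's indices are left untouched
    fitted = []
    for t, obs in enumerate(y):
        s = seasonals[t % season_len]
        fitted.append((level + trend) * s)  # forecast made before observing obs
        new_level = a * obs / s + (1 - a) * (level + trend)
        trend = b * (new_level - level) + (1 - b) * trend
        seasonals[t % season_len] = c * obs / new_level + (1 - c) * s
        level = new_level
    return fitted


def grid_search(y, season_len, seasonals,
                grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return (best MAPE, a, b, c) over all combinations of constants in `grid`."""
    slope, intercept = linear_fit(y)
    level0 = intercept - slope  # level one period before the first observation
    best = None
    for a, b, c in itertools.product(grid, repeat=3):
        fitted = winters_one_step(y, season_len, a, b, c, level0, slope, seasonals)
        err = mape(y, fitted)
        if best is None or err < best[0]:
            best = (err, a, b, c)
    return best
```

In a TD run the function would be called once on the family series; in a BU run it would be
called once per item series, each with its own seasonal indices.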

The best fit smoothing constants were then used to generate an initial forecast for the
18-period holdout sample. This was accomplished by recursive application of Winters' model
to the 48-period specification subset using the best fit smoothing constants. The smoothed
level, trend and seasonal indices from the specification subset were then used to generate an
initial forecast for the test sample.

3.4. Randomly selected smoothing constants

We also investigated the effect of random model specification to test the robustness of the
two approaches to poorly specified models. Random values for each of the three smoothing
constants were selected for both the aggregate series used in TD and each of the time series
used in BU. Initial trend, level, and seasonal indices for the holdout sample were generated in
the same manner as for the best fit models.

3.5. Item forecasts

Item forecasts using BU were generated by applying the individually specified ES models to
the holdout sample. Item forecasts for the TD approach were developed by multiplying the
aggregate forecast by an allocation parameter, $p_i$, which was equal to item i's fractional
share of the aggregate demand during the specification period. The specific item forecast
model is given below:

$f_{i,t} = p_i F_t$    (1)

where $f_{i,t}$ is the forecast for item i in period t, $F_t$ is the forecast for the aggregate
series in period t, and $p_i$ is the fraction that item i represented in the aggregate series
during the specification period. This allocation procedure is consistent with those discussed
in the literature and observed in practice. While other TD allocation methods exist, all
involve the calculation of an allocation parameter based upon an item's share of the aggregate
demand over some period of time. Neither the proportionality factors nor the smoothing
constants were updated during the actual forecasting competition with the holdout sample.

3.6. Performance measures

We assumed that the objective of the firm was to forecast as accurately as possible each of
the two individual item demands. Armstrong (1985) points out that there is no universally
accepted measure of accuracy. However, Armstrong and Collopy (1992) used M-competition
series to test a number of error measures for reliability, construct validity, outlier protection,
and sensitivity, and concluded that MSE should not be used for generalizing about the level
of accuracy of alternative forecasting methods because of its low reliability. Therefore, our
analysis used MAPE for comparing the TD and BU methods. We do, however, report the
results using MSE in Appendix A and discuss them below in Section 5, since MSE is a
commonly used measure.

Comparison of the results of the two forecasting approaches was complicated by the fact that
the aggregate forecasts generated only one result (i.e. only one MAPE), while the
disaggregate forecasts generated two. To allow direct comparison of the methods, we
constructed a summary error measure consisting of the average of the two item MAPEs.

Next, the natural log of the ratio of the error measures for the two approaches was computed,
i.e. ln(top-down MAPE/bottom-up MAPE). If this log relative error is positive, BU is more
accurate than TD; if it is negative, TD is more accurate. The natural log was used to eliminate
the potential bias in interpretation that can result when summary statistics are computed
using simple ratios. Because of the limited range of simple ratios (in this case they cannot be
less than zero), the frequency distribution of ratios is positively skewed [Alexander and
Francis (1986, p. 145)]. For example, consider the results from the hypothetical test of the two
approaches given in Exhibit 3. The overall average error ratio for the two series tested is 1.25
((2 + 0.5)/2) for the simple ratio, indicating inferior overall performance for TD even though
there is really no difference in performance between TD and BU. BU is twice as good as TD
in series A, while the reverse is true for series B. When the natural log of the ratios is used,
the overall average of this statistic is zero. Thus, the average log relative error measure
provides an unbiased summary statistic which can be used to evaluate performance.

Exhibit 3

Series   TD error   BU error   Ratio (TD/BU)   ln(TD/BU)
A        2          1          2.0              0.69
B        1          2          0.5             -0.69

4. Results and discussion

In most situations, BU forecasting of family items produced more accurate forecasts. Exhibit
4, part (a), compares the two methods using both best fit models and randomly specified
models. BU was preferable in 74% of the time series tested when best fit ES models were
used and in 73% of the time series when models were randomly specified. In addition, the
average log relative error was positive for both the best fit and the randomly specified
models, indicating better overall MAPE performance for BU forecasting. The average log
relative error of 0.29 in the best fit models and 0.30 in the randomly specified models
translates into a 34-35% higher MAPE, on average, for TD models. These results indicate
that the fit of the forecast model does not have a substantial effect on the relative
performance of TD and BU forecasting.
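
The evaluation steps of Sections 3.5 and 3.6 can be illustrated with a short sketch. It
reproduces the Exhibit 3 arithmetic and the conversion of an average log relative error into a
percentage. The function names are assumptions introduced here. Note that exponentiating a
mean log ratio gives the geometric mean of the ratios, which is how the 34-35% figure is
read in this sketch.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error in percent; assumes no zero actuals."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def td_item_forecasts(aggregate_forecasts, share):
    """Top-down allocation for one item: f_{i,t} = p_i * F_t."""
    return [share * f_t for f_t in aggregate_forecasts]


def summary_mape(actuals_by_item, forecasts_by_item):
    """Average of the item MAPEs, the summary error measure for a family."""
    item_mapes = [mape(a, f) for a, f in zip(actuals_by_item, forecasts_by_item)]
    return sum(item_mapes) / len(item_mapes)


def log_relative_error(td_error, bu_error):
    """ln(TD error / BU error): positive favors BU, negative favors TD."""
    return math.log(td_error / bu_error)


def percent_higher(mean_log_ratio):
    """Percentage by which TD error exceeds BU error, from a mean log ratio."""
    return 100.0 * (math.exp(mean_log_ratio) - 1.0)


# Exhibit 3: (TD error, BU error) for hypothetical series A and B.
exhibit3 = [(2.0, 1.0), (1.0, 2.0)]
mean_ratio = sum(td / bu for td, bu in exhibit3) / len(exhibit3)
mean_log = sum(log_relative_error(td, bu) for td, bu in exhibit3) / len(exhibit3)
```

Here `mean_ratio` is 1.25, which wrongly suggests that TD is worse, while `mean_log` is zero
up to rounding, which matches the symmetric outcome. `percent_higher(0.29)` and
`percent_higher(0.30)` round to 34 and 35.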

(a) Overall results

Model specification                    Percent when      Percent when      Percent when       Average error
                                       top-down better   no difference     bottom-up better   (mean ln(TD/BU))
                                       (ln(TD/BU) < 0)   (ln(TD/BU) = 0)   (ln(TD/BU) > 0)

Best fit smoothing constants           26                0                 74                 0.29
Randomly selected smoothing constants  27                0                 73                 0.30

(b) Results by item 1 proportion

Item 1 proportion                      Percent when      Percent when      Percent when       Average error
                                       top-down better   no difference     bottom-up better   (mean ln(TD/BU))
                                       (ln(TD/BU) < 0)   (ln(TD/BU) = 0)   (ln(TD/BU) > 0)

Low (0 < $p_1$ < 0.35)                 31                0                 69                 0.26
Medium (0.35 < $p_1$ < 0.65)           35                0                 65                 0.21
High (0.65 < $p_1$ < 1.0)              19                0                 81                 0.36

(c) Results by item correlation

Item correlation                       Percent when      Percent when      Percent when       Average error
                                       top-down better   no difference     bottom-up better   (mean ln(TD/BU))
                                       (ln(TD/BU) < 0)   (ln(TD/BU) = 0)   (ln(TD/BU) > 0)

High negative (-1.0 < r < -0.25)       18                0                 82                 0.50
Low correlation (-0.25 < r < 0.25)     27                0                 73                 0.26
High positive (0.25 < r < 1.0)         30                0                 70                 0.24

Exhibit 4. TD vs. BU performance using log relative MAPE


Effects of item proportion and item correlation in best fit models

In Exhibit 4, part (b), the results from the best fit models were separated into three
categories based on the relative proportion (low, medium, and high) of item one ($p_1$) in the
group demand. BU forecasting was consistently superior to TD regardless of the relative
distribution of demands for the two items that made up a family group. However, we did not
observe as strong a symmetry as expected in the low and high categories of item 1
proportion. One might expect that these categories should not only produce the same choice
of forecasting method, but also do so about the same percentage of the time, all other things
being equal. Moreover, these results are not consistent with those predicted by Schwartzkopf,
Tersine and Morris (1988). Their model of TD forecast error predicted minimum total error
when one item dominates the family series; however, they used MSE as their accuracy
measure.

The influence of item series correlation is considered in Exhibit 4, part (c). The time series
were separated into three categories of item dependence: high positive correlation, low
correlation, and high negative correlation. BU forecasting performed better for all three
categories of correlation, resulting in lower MAPE in 70-82% of the series. Schwartzkopf,
Tersine and Morris (1988) describe two opposing effects with respect to negative item
correlation: model difference error, which favors BU forecasting, and reduced variability in
the aggregate series, which favors TD forecasting. The results indicate that the effect of
model differences outweighs the improved stability of the aggregate series, since BU
forecasting is preferred 82% of the time for these families. The results for positively
correlated item series are as expected; BU forecasting should produce better results owing to
the increased variability in the aggregate series.

The interaction between item distribution ($p_1$) and item series correlation (r) is illustrated
in Exhibit 5. This exhibit plots the average log relative MAPE errors for groups of time series
with the same rounded item series correlation for three different levels of $p_1$. The three
MAPE plots indicate superior BU performance across all values of r for each of the three
ranges of $p_1$, although the log relative ratios are particularly high for item series that have
a strong inverse relationship. The results for extreme correlation values (i.e. 1.0 or -1.0)
should be viewed with caution, however, owing to the low frequencies in these categories.
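
The variability argument used above for negatively and positively correlated items rests on
the standard variance identity for a sum of two series. The symbols $Y_1$, $Y_2$, $\sigma_1$
and $\sigma_2$ are notation introduced here for the two item demands and their standard
deviations; r is the item correlation as used in the text.

```latex
\operatorname{Var}(Y_1 + Y_2) = \sigma_1^2 + \sigma_2^2 + 2\, r\, \sigma_1 \sigma_2
```

With r = 0 this reduces to the sum of the item variances cited in Section 2. With r < 0 the
aggregate series is less variable than that sum, and with r > 0 it is more variable.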



[Exhibit 5. Average log relative MAPE by rounded item correlation (r, from -1 to 1), plotted for three ranges of item 1 proportion: 0 < p < 0.35, 0.35 < p < 0.65, and 0.65 < p < 1.]


5. Limitations

The above results apply to families of two items. In addition, this research provides guidance
on the more general question of whether an individual item series should be forecasted
separately from a family series. The two-item series, in this situation, would consist of an
individual item series and another series that consists of the sum of the remaining family of
items. By applying this logic recursively, one can continue to decompose the remaining
aggregate series until all item forecast decisions have been made.

Other performance measures could and, in fact, did produce different results. Results using
MSE as the accuracy criterion are reported in Appendix A. Using MSE, BU forecasting
produced better forecasts in only 34-49% of the more than 15,000 cases tested. However,
given the unreliability of this measure reported by Armstrong and Collopy (1992), these
results must be viewed with caution. The results using MSE are, however, consistent with the
effects expected by Schwartzkopf, Tersine and Morris (1988). Items with similar proportions
produced better BU forecasts.

In addition, care should be taken in interpreting the categorical results for item proportion
and correlation. While care was taken to ensure that the item proportion categories had
comparable densities, the distribution of series within categories was not consistent. Item
correlation categories were formed on the basis of the statistical meaning of correlation and
as a result did not contain comparable densities of the time series tested.

Winters' exponential smoothing model was the only model tested. While it has performed
well in previous studies, the external validity of these results could be extended by testing
other time series models as well as causal forecasting models. Other methods for
apportioning the family forecast to individual items should also be examined.

Finally, this research does not address other important factors such as cost and data accuracy.

6. Summary and conclusions

Bottom-up forecasting resulted in more accurate forecasts for nearly three out of four series
tested regardless of model fit; the result was more pronounced when items were strongly
negatively correlated and/or when one item dominated the aggregate series. We found no
combination of item correlation and/or proportion where top-down forecasting produced a
lower total MAPE than forecasts developed using individual ES models.

Appendix A. TD vs. BU performance using log relative MSE

(a) Overall results

Model specification                    Percent when      Percent when      Percent when       Average error
                                       top-down better   no difference     bottom-up better   (mean ln(TD/BU))
                                       (ln(TD/BU) < 0)   (ln(TD/BU) = 0)   (ln(TD/BU) > 0)

Best fit smoothing constants           63                3                 34                  0.02
Randomly selected smoothing constants  51                0                 49                  0.03

(b) Results by item 1 proportion

Item 1 proportion                      Percent when      Percent when      Percent when       Average error
                                       top-down better   no difference     bottom-up better   (mean ln(TD/BU))
                                       (ln(TD/BU) < 0)   (ln(TD/BU) = 0)   (ln(TD/BU) > 0)

Low (0 < $p_1$ < 0.35)                 65                3                 32                 -0.02
Medium (0.35 < $p_1$ < 0.65)           46                0                 54                  0.19
High (0.65 < $p_1$ < 1.0)              67                3                 30                 -0.01

(c) Results by item correlation

Item correlation                       Percent when      Percent when      Percent when       Average error
                                       top-down better   no difference     bottom-up better   (mean ln(TD/BU))
                                       (ln(TD/BU) < 0)   (ln(TD/BU) = 0)   (ln(TD/BU) > 0)

High negative (-1.0 < r < -0.25)       59                2                 39                  0.13
Low correlation (-0.25 < r < 0.25)     65                3                 32                 -0.01
High positive (0.25 < r < 1.0)         62                3                 35                  0.01

