Statistical Forecasting
Warren Gilchrist
Head of Department of Mathematics and Statistics,
Sheffield City Polytechnic
A Wiley-Interscience Publication
1. Time-series analysis. 2. Prediction theory.
I. Title.
QA280.G54 519.5'4 76-13504
From the dawn of recorded history, and probably before, man has
sought to forecast the future. Indeed, the ability to foresee the
consequences of actions and events is one of the defining properties of
‘mind’. Where the phenomenon being forecast was a purely physical
one, such as the occurrence of midsummer’s day or of an eclipse, man
was able from very early times to obtain very accurate forecasts. Initially
such forecasts were derived on a purely empirical basis. Methods were
found that worked without any basic understanding as to why they
worked. Later, as greater understanding of the phenomena developed,
theoretical models were constructed which enabled forecasts to be
obtained on a more reasoned basis. During the last century interest
focused on a number of phenomena, such as economic variation and
sunspot cycles, where
(a) there were available series of observations taken over a period of
time, called time-series, upon which forecasts could be based;
(b) purely mathematical rules were found to be inadequate to describe
the phenomena since they involved features of a chance nature.
Over the last fifty years approaches to this type of problem have been
developed which seek to allow for the influence of chance and for the
unknown complexity of these situations. Thus, forecasting methods
were developed which were essentially statistical in nature. These were
based on using statistical techniques to fit models of a wide variety of
types to past time-series data and on forecasting the future from these
models.
The aim of this book is to provide the reader with an understanding
of the methods and practice of statistical forecasting. The structure of
the book is based on the structure of the activities involved in the
practice of statistical forecasting. Figure P.1 summarizes this structure.
Part I of the book provides a general introduction to some of the
basic concepts and terms in statistical forecasting. Part II deals with the
development of statistical forecasting methods applicable to each of a
variety of different situations. The approach to classifying this part into
chapters has been to classify the situations; e.g. one chapter considers
Acknowledgements
W. G. GILCHRIST
Contents
Notation xiii
PART I PRELIMINARIES
Chapter 2. Models 12
2.1 Scientific forecasting 12
2.2 Basic models 17
2.3 Models of global and local validity 19
2.4 Forecasting models and formulae 21
Index 305
Notation

x_t        an observation at time t
x̂_t        a forecast, or estimate, of x_t
x̂_{t,h}    a forecast made at time t of x_{t+h}, i.e. with lead time h
x̂_t        the one-step-ahead forecast, abbreviated from x̂_{t,1}
e_t        the forecast error x_t − x̂_t, often x_t − x̂_{t−1,1}
e_{t,h}    the lead time h forecast error x_{t+h} − x̂_{t,h}
μ_t        the underlying mean at time t, i.e. E(x_t)
β_t        the underlying trend at time t
Part I
Preliminaries
Chapter 1
An introduction to forecasting
forecasts may be wrong, but they must be made. In the past most
answers to questions of the above type have been based on unconscious
or semiconscious forecasts. In these forecasts it was frequently assumed
that the future would be just like the recent past. The growing
interest in many aspects of forecasting is based on the belief that
conscious and careful thought cannot fail to help improve our forecasting
skill and thus our ability to get answers to the above types of question.
It may be in some cases that a thorough scientific forecasting study
might not lead to much better forecasts than the old ‘by guess and by
gosh’ methods. Even so, such a study will enable those involved to have
a better understanding of their situation and thus improve their control
of it.
There are many methods of statistical forecasting, none of which can
be put forward as providing all the answers to everybody’s forecasting
problems. The techniques discussed in this book are chosen because
they have found the widest application. The emphasis of the book,
however, is not on specific techniques but on general methods and
attitudes towards forecasting. In this way it is hoped that the reader
will be better able to apply statistical forecasting methods to his unique
problems. New techniques and applications of forecasting are being
published every month, and thus if this book is regarded simply as a
compendium of forecasting techniques it would very soon become
obsolete. If, however, the reader treats it as a text on general methods
and attitudes in statistical forecasting, then he should be able to
incorporate the new developments in this rapidly expanding field into
his basic framework of knowledge.
Example 1.1
Most political policy-makers are concerned with forecasting the
future as it would be if no new actions were taken. They then consider
Example 1.2
H. T. Davis (1941) points out that had the mass of the sun not been
so great, relative to the planets, statistical methods would have had to
be used to investigate the physical laws of gravitation and to predict
future movements of the planets. As things are, the gravitational pull of
the sun on a planet is so powerful that the planet’s orbit is almost the
same as it would be if the other planets did not exist. From
observations on this orbit the structure of the relationship between sun
and planet can be inferred and the inverse square law of gravitation
deduced as the model that fits most of what is observed. The great
stability of the situation enables very accurate forecasts of the future
positions of the planets to be made. Had the mass of the sun been
smaller, its effects would not have dominated those of the other planets
and much more complicated orbits would have occurred. The conse-
quence of this complication would have been that, though the inverse
square law would still be the basic law, it would have been much harder
to find it from the observed data. Further, forecasting the future
positions of the planets would have been much more difficult, as the
simplicity and stability of orbit we have would not have occurred.
Example 1.3
Table 1.1 Orders (in 000s) by week number

Week number  1  2  3  4  5  6  7  8  9 10 11 12
(a)          6  6  6  6  6  6  6  6  6  6  6  6
(b)          6  6  6  6  9  6  6
(c)          6  6  6  6  9  6  6  6  6  9  6  6
(d)          7  6  6  4  7  6  8  6  5  6  7  5
(e)          8  8  9  8 12 12 15 14 14 16 18 17
[Figure 1.1 Frequency distribution of orders (in 000s)]
Consider, finally, the set of data in line (e) of Table 1.1 which is
plotted in Figure 1.2. It is clear from the graph that the data lie in a
‘chance’ fashion about a line. This might occur if, added to the source
of the orders in data (d), an additional customer ordered 1,000 units
in week 1, 2,000 units in week 2, and so on. Thus in mathematical
terms his orders in week n are 1,000n. It is thus clear that the structure
of the data has both a mathematical and a statistical aspect. To say that
such a structure is stable, we imply that both the upward trend and the
statistical variation about that trend are stable. The forecast for
Example 1.4
Table 1.2 gives two years’ data and Figure 1.3 gives a plot of this set of
data. At first sight these data look unstructured. A study of the situation
in which it was obtained suggests that it might be composed of an
upward trend, some sort of seasonal variation and a chance variation, or
as it is usually called, a random variation. Table 1.3 shows a breakdown
of the data into these three elements, together with a table of the
frequencies of the random variation, which gives the same sort of
information as Figure 1.1. Thus the data have a structure composed
of both mathematical and statistical elements. It is clearly seen from
Table 1.2

Month    1  2  3  4  5  6  7  8  9 10 11 12
Year 1  33 34 30 36 26 22 14 25 19 20 38 37
Year 2  42 49 42 33 37 34 23 40 23 31 44 40
Table 1.3

         Trend           Seasonal   Random variation   Total
Month    Year 1  Year 2  additions  Year 1  Year 2     Year 1  Year 2
1        21      33       10         2      -1         33      42
2        22      34       12         0       3         34      49
3        23      35       10        -3      -3         30      42
4        24      36        8        -1     -11         31      33
5        25      37        0         1       0         26      37
6        26      38       -6         2       2         22      34
7        27      39      -16         3       0         14      23
8        28      40      -10         7      10         25      40
9        29      41       -9        -1      -9         19      23
10       30      42       -7        -3      -4         20      31
11       31      43        2         5      -1         38      44
12       32      44        6        -1     -10         37      40
this example that finding the underlying structure for a set of data can
be a difficult task. It is certainly not obvious that the data of Table 1.2
were obtained in the fashion shown.
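The construction of Table 1.3 is easy to mimic in code. The sketch below is a hypothetical illustration: only the trend (rising by 1 each month from 21) and the seasonal additions come from the table; the random-noise range and the seed are assumptions.

```python
import random

# Seasonal additions as in Table 1.3, repeating each year
seasonal = [10, 12, 10, 8, 0, -6, -16, -10, -9, -7, 2, 6]

random.seed(1)
series = []
for month in range(24):                    # two years of monthly data
    trend = 21 + month                     # trend rises by 1 each month
    season = seasonal[month % 12]          # seasonal component
    noise = random.randint(-5, 5)          # chance (random) variation
    series.append(trend + season + noise)  # observed value = sum of the three
print(series)
```

Reversing the process, recovering the three components from the observed series alone, is precisely the difficult task the example describes.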
It is also clear that, given enough effort and ingenuity, one could
create some such structure for any set of data. If, however, we are going
to assume stability of the structure for forecasting purposes, it is clear
that the structure that is used must correspond to some real structure in
the situation. The mathematical and statistical equations used in
forecasting must make sense in terms of the situation in which they are
used. The search for a structure for use in forecasting is thus not simply a
mathematical or statistical task but requires a broad knowledge of the
nature and background of the particular forecasting problem being
investigated.
We have seen from the above examples that the basic requirement for
forecasting is the existence of a stable structure. This structure may be
of purely mathematical form or purely statistical form or, most
frequently, a mixture of both. So far these structures have been dealt
with in a purely descriptive fashion; to proceed further we must be
more precise and rigorous in the description of the structures.
Mathematical and statistical models of forecasting situations provide
such a rigorous description. It is an introduction to the study of such
models that provides the subject matter for the next chapter.
References
Chisholm, R. K., and Whitaker, G. R. (1971). Forecasting Methods. R. D. Irwin,
Homewood, Illinois.
Davis, H. T. (1941). Analysis of Economic Time Series. Principia Press.
Morrell, J. (1972). Management Decisions and the Role of Forecasting. Penguin,
Harmondsworth.
Robinson, C. (1971). Business Forecasting: An Economic Approach. Thomas
Nelson and Sons, London.
Chapter 2
Models
The aim here is to pick out from the possibly vast amount of
information obtained at the first step that which is regarded as most
relevant. This is then reduced to a bare minimum. For example, in a
sales forecasting situation the record of past sales for the firm would
probably provide this minimum information. In deciding which items
x_t = 6,000   (t = 1, 2, 3, . . .)

For data (c) x_t becomes 9,000 in weeks 5, 10, 15, . . . , and so we write

x_t = 6,000   (t = 1, 2, 3, 4, 6, 7, 8, 9, 11, . . .)
x_t = 9,000   (t = 5, 10, 15, 20, . . .)
For data (d) we found that the orders differed from 6,000 by a value
[Figure 2.1 Distribution of the difference in orders from 6,000 (in 000s): probabilities 0.1, 0.2, 0.4, 0.2, 0.1 at −2, −1, 0, 1, 2]
x_t = 6,000 + 1,000t + e_t
In our example we have taken the data as the starting point and then
constructed the model. The illustration of the tickets in the hat suggests
that it is always possible to start with a model and construct from it
data having that structure. The purely mathematical part of the model
presents little problem, but generating values for et can get quite
complicated. A literal hat would do the job for the distribution in
Figure 2.1. For more complicated distributions we need more elaborate
equipment, most usefully a computer. There are two reasons why it is
useful to be able to generate forged data having the properties of any
given model. The first is that it gives us experience of the behaviour to
is still no guarantee that this model will continue to give good forecasts
in the future. The chances of a structure remaining stable decrease, the
longer the period of time considered. It follows that the doubt about
the reliability of our forecasts increases, the greater the period into the
future over which we extrapolate our model. The possible failure of our
basic underlying assumption that we have correctly identified a stable
structure means that the forecaster can never quite throw out his
crystal ball.
x_t = α + βx_{t−1} + e_t
to allow for the chance variation of the real world, we obtain a very
different situation. As xt is influenced by the previous x, it must also be
influenced by previous random variables e. Thus xt is influenced by the
whole past history of the sequence of es. Such a model is an example of
a stochastic model. Such models form the base for a large number of
forecasting methods. Many forecasting methods are based on either a
deterministic or a stochastic model. We will tend to look at them
separately, but clearly there are situations where one would seek to
combine these types of model.
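A minimal simulation of such a stochastic model makes the point concrete; the values of α, β and the noise standard deviation below are assumed purely for illustration.

```python
import random

random.seed(0)
alpha, beta, sigma = 2.0, 0.7, 1.0   # illustrative parameter values
x = [alpha / (1 - beta)]             # start the series at the process mean
for _ in range(500):
    e = random.gauss(0, sigma)           # today's random shock
    x.append(alpha + beta * x[-1] + e)   # x_t depends on x_{t-1}, and hence
                                         # on the whole past history of shocks
mean = sum(x) / len(x)
print(round(mean, 2))   # hovers near alpha / (1 - beta) = 6.67
```

Because each x_t feeds into the next, every past shock e leaves a geometrically fading trace in the current observation, which is the defining feature of the stochastic model.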
Models (b), (c) and (d) all have the form
x_t = α + β( )
The parameters thus occur singly in separate terms of the model, the
terms being added together. The influence of the parameters and the
effects of changing them can thus be considered independently for each
parameter. Such models are said to be linear in the parameters.
Problems of estimating parameters and forecasting tend to be simplest
for linear models. Other models, such as the growth curve in (f) in
Table 2.1, do not have this property. Here the model is non-linear in
the parameters; the effect of changing β in (f) will depend on the value
taken by α. In studying such growth curves we will be faced with more
difficult problems of estimation and forecasting.
underlying structure of the data. The model for such a situation will be
called a global model. In the second situation, the structure is believed
to be stable in the short run but not necessarily in the long run. Slow
changes in structure may occur in our data which will not affect us too
adversely for forecasting over short periods of time. When examining
forecasting at an advanced level, attempts are usually made to
incorporate any slow changes in structure into a more complex model.
It is, however, often sufficient to treat such data as coming from, what
we will term, a model of local validity, or a local model.
There is no difference in the mathematical or statistical formulation
of these two types of model, global and local. The difference lies in the
way in which we make use of the models. Thus when we use a local
constant mean model we shall forecast the future on the assumption
that the mean in the near future will be the same as the current mean.
We will not be willing, however, to assume that the current mean is the
same as that in the past when we first obtained our data. We will, in
future chapters, discuss models from two viewpoints: firstly, from the
global viewpoint, with the model assumed to be a proper description of
the underlying situation; secondly, from a local viewpoint, with the
model regarded as an approximation valid only locally in time. There
are a number of different situations that would lead to the use of
models as approximations only. These are
People who have studied statistics will be familiar with the ideas of
statistical models for data, such as those just referred to. It is therefore
important to draw a distinction between the use of models in
forecasting and the more usual statistical uses. In fact, we might refer to
the models in section 2.3 as forecasting models. Though they are
statistical or mathematical models in form, their required relation to
the data is different in forecasting from their relation in ordinary
statistics. In statistical model building we are just as interested in how
the model describes the first observations as in how it describes the
most recent observations. In forecast model building it is the future
that interests us, and as long as the model is good at giving forecasts we
do not mind how badly it fits the data of the dim and distant past.
Thus we will often feel justified in using a model that obviously does
not fit the entire set of data. Provided it gives a good fit to the most
recent data, it is a potentially good forecasting model. Putting this a
different way, our attitude to models in forecasting means that we are
as happy with a local model as with a model of global validity, as long
as we get good forecasts.
Much of this book will be devoted to describing how to obtain
formulae for forecasting future values in particular forecasting models.
We will call such formulae forecasting formulae. For example, given a
model of the form
x_t = μ + e_t

where μ is a constant mean and e_t is a purely random normal sequence,
and given past data x_1, x_2, . . . , x_t, we wish to forecast the future value
x_{t+k}. Denote by x̂_{t,k} the forecast made at time t of the value of x at k
time units into the future, referred to as a forecast with lead time k. An
appropriate forecasting formula for x_{t+k} is

x̂_{t,k} = (x_1 + x_2 + . . . + x_t)/t
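As a sketch, this forecasting formula is just the sample mean of all the data so far; the data values below are illustrative, not from the book.

```python
def forecast(past, k=1):
    """Forecast x_{t+k} for the constant mean model: the mean of all data.
    Under this model the same value is forecast for every lead time k."""
    return sum(past) / len(past)

data = [7, 3, 9, 6, 8, 7]        # illustrative observations x_1, ..., x_6
print(round(forecast(data), 3))  # (7+3+9+6+8+7)/6 = 6.667
```

Note that k never appears in the body: with a constant mean, the forecast for next week and for next year are the same number.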
Forecasting criteria
Problem A
Problem B
Problem C
Problem D
His enthusiastic member of staff advocates his pet method as the best
method for this type of model. How does the sales manager test the
success of this method for his model?
     Data   Forecasts        Errors
t    x_t    x̂_A     x̂_B     e_A     e_B
1    6      —       —
2    6      6       6        0       0
3    6      6       6        0       0
4    6      6       6        0       0
5    6      6       6        0       0
6    6      6       6        0       0
7    7      6       6        1.00    1.00
8    5      6.5     6.10    -1.50   -1.10
9    4      6       5.99    -2.00   -1.99
10   8      4.5     5.79     3.50    2.21
11   6      6       6.01     0.00   -0.01
12   7      7       6.01     0.00    0.99
13   7      6.5     6.11     0.50    0.89
14   5      7       6.20    -2.00   -1.20
15   10     6       6.08     4.00    3.92
16   9      7.5     6.47     1.50    2.53
17   11     9.5     6.72     1.50    4.28
18   12     10      7.15     2.00    4.85
19   10     11.5    7.64    -1.50    2.36
20   9      11      7.87    -2.00    1.13
21   8      9.5     7.99    -1.50    0.01
22   10     8.5     7.99     1.50    2.01
23   11     9       8.19     2.00    2.81
24   10     10.5    8.47    -0.50    1.53
25   9      10.5    8.62    -1.50    0.38
t        x_t   x̂_A     x̂_B      e_A    e_B     1.5 − e_A   10 − e_B
1        6     —       —
2        6     6       6        0      0       0           0
3        6     6       6        0      0       0           0
4        6     6       6        0      0       0           0
5        6     6       6        0      0       0           0
6        6     6       6        0      0       0           0
7        7     6       6        1.0    1.0     0.5         9.00
8        8     6.5     6.10     1.5    1.90    0           8.10
9        9     7.5     6.29     1.5    2.71    0           7.29
10       10    8.5     6.56     1.5    3.44    0           6.56
11       11    9.5     6.90     1.5    4.10    0           5.90
12       12    10.5    7.31     1.5    4.69    0           5.31
13       13    11.5    7.78     1.5    5.22    0           4.78
14       14    12.5    8.30     1.5    5.70    0           4.30
15       15    13.5    8.87     1.5    6.13    0           3.87
16       16    14.5    9.49     1.5    6.51    0           3.49
17       17    15.5    10.14    1.5    6.86    0           3.14
18       18    16.5    10.82    1.5    7.18    0           2.82
19       19    17.5    11.54    1.5    7.46    0           2.54
20       20    18.5    12.29    1.5    7.71    0           2.29
21       21    19.5    13.06    1.5    7.94    0           2.06
22       22    20.5    13.85    1.5    8.15    0           1.85
23       23    21.5    14.67    1.5    8.33    0           1.67
24       24    22.5    15.50    1.5    8.50    0           1.50
25       25    23.5    16.35    1.5    8.65    0           1.35
Large t  t     t−1.5   t−10     1.5    10.00   0           0.00
too much of the random variation in the data, as method A does, nor
that it lags too far behind any new feature, as method B does. When
dealing with testing out forecasting formulae, both these aspects must
be studied. It must be recognized, however, that if a method is to be
sensitive enough to adjust quickly to major changes in the structure of
the data, it will also be sensitive to the random variations in the data.
Conversely, if we wish to reduce the response to random variation,
giving forecasts that are ‘smoother’ than the data, we will also prevent
the forecast from responding quickly to changes in the structure of the
data.
It is of value to classify the features that interest us when trying out
forecasting formulae on data (and also on models). Three main headings
can be used. These are as follows.
Thus a positive error means that the value that occurs is larger than the
forecast value. It should be noted that some writers define the error the
other way round, as xt — xt. An examination of the errors in Table 3.4
shows that the errors from method A show a greater spread than those
of method B, but the average error is smaller. If we examine the average
error in the data of Table 3.2 for the trending region, this average will
be positive, 1.5 for method A, indicating the lag of the forecast behind
the trend. Hence a study of the errors requires at least a study of
(a) their average (mean) value and (b) their spread. The mean error ē for
a set of errors e_1, e_2, . . . , e_n is defined simply as

ē = (1/n)(e_1 + e_2 + . . . + e_n)
[(b) Summary table for Methods A and B]
mean error. Where this average error becomes large, perhaps due to
failure to follow accurately some systematic structure in the data, the
forecasting formula becomes suspect and needs adjusting or replacing
by another which is suitable for this particular type of structure. A
systematic deviation of the forecast errors from zero is referred to as a
‘bias’ in the forecasts. The mean on its own is not enough, since it may
well be close to zero while the actual errors are very large, positive and
negative errors tending to cancel each other out. To eliminate the sign
for the purpose of measuring spread, one may either just ignore it and
consider the absolute errors, denoted by |e_t|, or remove it by squaring.
The first approach measures spread by the mean absolute error (MAE),
MAE = (1/n) Σ_{i=1}^{n} |e_i|
sometimes also called the mean deviation of error. The second approach
uses the mean square error (MSE),
MSE = (1/n) Σ_{i=1}^{n} e_i²
Table 3.4 gives the calculation of these quantities. It is seen that both
quantities indicate quite clearly that method B gives errors having a
smaller spread than method A.
To make the units of the mean square error the same as those of the
mean error and the mean absolute error, its square root must be taken
to give the root mean square error (RMSE). The choice of whether the
mean absolute error or the root mean square error is used is based on
practical considerations of the ease of calculation and the use to be
made of the calculated measures. It should also be noted that the root
mean square error places greater emphasis on the large errors than does
the mean absolute error.
If a bias is found to exist, it may be advisable to measure the spread
of the errors about their average rather than about the zero value. Thus
the mean square error would be replaced by the ‘sample variance’
defined by
s² = (1/n) Σ_{i=1}^{n} (e_i − ē)²
A little calculation will show that this is related to e and the mean
square error by
s² = MSE − ē²
The square root of this is called the sample standard deviation, s.
In many of the computer programmes available commercially and in
much of the literature on forecasting, the root mean square error is
used as the main criterion of forecast success. It is seen that

MSE = ē² + s²
Thus the mean square error is the sum of two contributory factors, one
measuring any bias there might be in the errors and the other the
variability of the errors about the mean error. An examination of the
mean square error without taking e into account may thus be
misleading or at least not as informative as a separate study of e and s2
or, equivalently, e and MSE.
In the same way that the mean square error can be modified to give
s², we may modify the mean absolute error by replacing the errors by
their deviations from the mean error, to obtain the mean absolute
deviation of errors (MADE),

MADE = (1/n) Σ_{i=1}^{n} |e_i − ē|
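The error measures of this section can be collected into one short routine. The sketch below uses made-up error values for illustration; its final assertion checks the identity MSE = ē² + s² discussed in the text.

```python
def error_measures(errors):
    n = len(errors)
    mean_e = sum(errors) / n                           # mean error (bias)
    mae  = sum(abs(e) for e in errors) / n             # mean absolute error
    mse  = sum(e * e for e in errors) / n              # mean square error
    rmse = mse ** 0.5                                  # root mean square error
    s2   = sum((e - mean_e) ** 2 for e in errors) / n  # sample variance
    made = sum(abs(e - mean_e) for e in errors) / n    # mean abs. deviation
    return mean_e, mae, mse, rmse, s2, made

errors = [1.0, -1.5, -2.0, 3.5, 0.0]         # illustrative forecast errors
mean_e, mae, mse, rmse, s2, made = error_measures(errors)
assert abs(mse - (mean_e ** 2 + s2)) < 1e-9  # MSE = (mean error)^2 + s^2
print(mean_e, mae, mse)
```

Running the two spread measures side by side on real errors quickly shows the point made above: the squared terms in MSE and RMSE weight one large error far more heavily than MAE does.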
We are now in a position to say that from a statistical point of view the
types of steady-state behaviour for which the average error will be close
to zero. In particular, if an oscillation occurs in the data which is
followed by the forecasting formula with some lag, positive and
negative errors will occur in about equal proportions and thus a small
average error will occur; none the less, there is in this situation a clear
steady-state feature. The best and simplest method of examining such
features is to plot a chart of the errors against time. Chapter 15 gives
examples of such plots.
When dealing with a set of data we are very limited with what can be
done in investigating steady-state features, and hence again we must
turn to the use of models. Here we would use both models chosen to
describe the existing data and also models that describe features that
could well occur in the future structure. The analysis of the steady-state
effects, when the forecasting formula is applied to a model, is thus the
crucial test of the formula. The existence of steady-state errors is
obviously a disadvantage to any forecasting formula and it is necessary
to design formulae so that they are unlikely to occur. Where their
existence is acceptable, as a sacrifice made in order to obtain other
advantages, the effects should be kept as small as possible.
i.e. we simply use the last observation to forecast the next. This is the
simplest formula to use here for illustrative purposes. It corresponds to
the intuitive forecasts made by the type of individual who uses only his
own experience and feel for the situation as the basis for forecasting
but unfortunately has only a short memory. Having stated the model
and the forecasting formula, we need only to bring the two together to
obtain an expression for the forecast error. Thus,
e_{t+1} = x_{t+1} − x̂_{t+1}
        = x_{t+1} − x_t

Hence,

e_{t+1} = β + ε_{t+1} − ε_t
Using the properties of expectations given above, the expectation,
which is the average of the population of the errors, is
E(e_{t+1}) = E(β + ε_{t+1} − ε_t)
           = E(β) + E(ε_{t+1}) − E(ε_t)
           = β
since E(ε_t) = 0 for all t from the definition. Thus the forecast is
biased in the sense that the population average of the errors is β, so that
on average the forecasts lag behind the observations by an amount β.
This is a steady-state lag which remains the same over all time. To
illustrate a transient phenomenon, consider what happens if at some time
T there is a sudden change in the situation which makes α become
α + δ, this new value becoming permanent after time T. Since we have
shown that e_t does not depend on the value of α, the expression we
obtained for e_t will remain unaltered for times before and after T. The
value of e_T will, however, be different and we will have
e_T = x_T − x̂_T
    = x_T − x_{T−1}
    = {α + δ + βT + ε_T} − {α + β(T − 1) + ε_{T−1}}
    = β + δ + ε_T − ε_{T−1}
Thus the bias is as before, except at time T when the change in α
produces a transient effect in the form of a sudden increase by δ which
dies away immediately.
Let us now look at the spread of the errors. As we know the errors to
be biased, it is more useful to look at the variance than the mean square
error. This variance is the population quantity, from the model, which
corresponds to the sample variance s2, discussed in section 3.3. From
the above results we now have
Var(e_{t+1}) = Var(β + ε_{t+1} − ε_t)
             = Var(ε_{t+1}) + Var(ε_t)
             = 2σ²
using the fact that ε_t and ε_{t+1} are independent with variance σ², and
letting b = β and a = −1 in the formula for Var(aε + b). We see then
that the variance of the forecast errors is twice the variance of the
individual observations. This is a large error variance and suggests that our
forecasting formula would not be of much practical use for data from
this model. The mean square forecasting error is
E(e²) = 2σ² + β²
from the relation in Table 3.5 between the mean square error, variance
and bias.
We have now seen that by substitution of the model in the
forecasting formula we may investigate the main steady-state, transient
and statistical properties of a forecasting situation.
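These theoretical results are easy to check by simulation. The sketch below (slope and noise values assumed for illustration) generates the forecast errors e_{t+1} = β + ε_{t+1} − ε_t directly and confirms a mean near β and a variance near 2σ².

```python
import random

random.seed(2)
beta, sigma = 2.0, 2.0    # assumed slope and noise standard deviation
N = 20000
eps = [random.gauss(0, sigma) for _ in range(N + 1)]
# Error of the 'last observation' forecast on the model x_t = a + bt + e_t:
# e_{t+1} = x_{t+1} - x_t = beta + eps_{t+1} - eps_t   (the constant cancels)
errors = [beta + eps[t + 1] - eps[t] for t in range(N)]
mean_e = sum(errors) / N
var_e = sum((e - mean_e) ** 2 for e in errors) / N
print(round(mean_e, 2), round(var_e, 2))  # near beta = 2 and 2*sigma^2 = 8
```

The simulated mean reproduces the steady-state bias β, and the simulated variance reproduces the factor of two that makes this naive formula so imprecise.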
x_t = α + βt + ε_t      α = 50
                        β = 2
                        σ = 2, the standard deviation required
η_t ~ N(0, 1), obtained from tables of random normal deviates

[Table columns: t, α, βt, η_t, ε_t = ση_t, x_t = α + βt + ε_t]
simulating data from a trend model. The table shows one run of the
simulation using one set of εs. Sets of random normal deviates, as such
quantities from a normal distribution N(0, 1) are called, may be
obtained from tables (e.g. Rand Corporation, 1955) or generated by
standard instructions on most computers. The simulation of data with a
seasonal variation was illustrated in Table 1.3.
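On a computer, the simulation described here can be sketched as follows, using the α = 50, β = 2, σ = 2 of the text, with Python's random.gauss standing in for a table of random normal deviates.

```python
import random

random.seed(3)
alpha, beta, sigma = 50, 2, 2     # the parameter values quoted in the text
rows = []
for t in range(1, 11):
    eta = random.gauss(0, 1)      # random normal deviate, N(0, 1)
    eps = sigma * eta             # scale to the required standard deviation
    x = alpha + beta * t + eps    # one simulated observation
    rows.append((t, round(x, 2)))
print(rows)
```

Each run with a different seed gives a fresh set of "forged" data from the same model, which is exactly what is needed for trying out forecasting formulae.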
References
Hogg, R. V., and Craig, A. T. (1970). Introduction to Mathematical Statistics.
Macmillan, New York.
Mood, A. M., and Graybill, F. A. (1963). Introduction to the Theory of Statistics.
McGraw-Hill, New York.
Rand Corporation (1955). A Million Random Digits with 100,000 Normal Deviates.
The Free Press, Glencoe, Illinois.
Part II
In the next eight chapters we will examine a range of models that form
the basis of most practical forecasting. The models discussed are
The constant mean model Chapter 4
The linear trend model Chapter 5
Regression models Chapter 6
Stochastic models Chapter 7
Seasonal models Chapter 8
Growth curves Chapter 9
Probabilistic models Chapter 10
Multivariate models Chapter 11
For each class of model we will examine various ways of obtaining
forecasts. In so doing we will develop a range of methods, principles
and approaches that have wide applicability in all types of forecasting;
these are discussed more generally in Chapter 12.
Chapter 4
Thus to forecast x_{t+k} we shall have to assign values to both μ and e_{t+k}.
The random variable e_{t+k} is by definition independent of the available
information contained in x_1, . . . , x_t and hence we cannot give a
forecast of its future value, save to note that its expected value will be
zero. So we simply forecast e_{t+k} by the value zero. The quantity μ is,
for our global model, a constant over all time, and therefore an estimate
of μ at the present time will also estimate its future value. The most
natural estimate of the mean μ is the sample mean (average) based on all
available data. Combining this with the zero forecast value for e_{t+k}
gives as our forecast the expression

x̂_{t,k} = (x_1 + x_2 + . . . + x_t)/t
can express the forecasting formula in yet another form. The basis of
this third form is the use of the forecast error. Denote the forecast error
at time t by et, where
e_t = x_t − x̂_{t−1,1}
t   x_t   x_t/t   x̄_{t−1}   (t−1)x̄_{t−1}/t   x̄_t
1   7     7.000   0          0                7.000
2   3     1.500   7.000      3.500            5.000
3   9     3.000   5.000      3.333            6.333
4   6     1.500   6.333      4.750            6.250
5   8     1.600   6.250      5.000            6.600
6   7     1.167   6.600      5.500            6.667
[Table columns: t, x_t, x̄_{t−1}, e_t, e_t/t, x̄_t]
x̄_t = Σ_{i=1}^{t} x_i / t
    = Σ_{i=1}^{t} (μ + e_i)/t

Hence,

x̄_t = μ + Σ_{i=1}^{t} e_i / t
Taking expectations of both sides and using the facts that E(μ) = μ and
E(e_i) = 0, we have

E(x̄_t) = μ
Thus the forecast is unbiased. The forecast error is
e_{t,k} = x_{t+k} − x̂_{t,k}
        = e_{t+k} − Σ_{i=1}^{t} e_i / t
As E(e_{t,k}) = 0 for all k, the forecast is thus again proved unbiased for
all future values.
Finding the variance of both sides of the above expression for x̄_t and
using the facts that Var(μ) = 0 and Var(e_i) = σ² for all i, together with
the independence of the e values, gives

Var(x̄_t) = Var(Σ_{i=1}^{t} e_i / t)
         = Σ_{i=1}^{t} Var(e_i)/t²

Hence,

Var(x̄_t) = σ²/t
Since the mean error is zero, as the forecast is unbiased, it follows that
the mean square error is the same as the variance of error. Hence,
MSE = Var(x_{t+k} − x̄_t)

But x_{t+k} and x̄_t are independent of each other and so

MSE = σ² + σ²/t
Thus as t gets large the forecast estimates μ with high precision and the
mean square error approaches σ², which is due simply to our inability
to forecast the random variable e_{t+k}.
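A small simulation (all numeric values below are assumptions chosen for illustration) confirms that the mean square error of the sample-mean forecast comes out close to σ² + σ²/t.

```python
import random

random.seed(4)
mu, sigma, t, runs = 100.0, 5.0, 20, 40000   # illustrative values
sq_errors = []
for _ in range(runs):
    data = [mu + random.gauss(0, sigma) for _ in range(t)]
    xbar = sum(data) / t                     # sample-mean forecast
    future = mu + random.gauss(0, sigma)     # the value being forecast
    sq_errors.append((future - xbar) ** 2)
mse = sum(sq_errors) / runs
print(round(mse, 2))   # theory: sigma^2 * (1 + 1/t) = 25 * 1.05 = 26.25
```

With t = 20 the σ²/t term is already small: most of the mean square error is the irreducible σ² of the unforecastable future random term.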
Knowledge of the fact that the forecast is unbiased with error
variance (mean square error)

σ_e² = σ²(1 + 1/t)

enables us to make a much more useful forecast than the bare statement
that the forecast is such and such a number. In particular, we can make
statements about the probable range of values to be taken by the future
observation. The most common form for such statements is called a
confidence interval. We will outline the basic idea, but for details the
reader is referred to any standard statistical text, e.g. Hogg and Craig
(1970) and Mood and Graybill (1963).
Let us start by assuming that the forecast errors are normally
distributed with zero mean and variance σ_e². It follows from the
properties of the normal distribution that the probability of the
forecast error lying in the interval −1.96σ_e to +1.96σ_e is 0.95. In
general, the probabilities associated with ranges of values of a random
variable are given by the areas under their distribution curves. In the
normal error distribution (see Figure 4.1) these are determined by the
spread of the curve as measured by oe. If we require different
probabilities, then the 1.96 is replaced by some other appropriate
[Figure 4.1 The normal error distribution, showing the interval containing the observed x on 95% of forecasts]
Since

e_{t+1} = x_{t+1} − x̂_t   and   σ_e² = σ²(t + 1)/t

this may be simply rewritten as

Prob{ −1.96 √((t+1)/t) σ < x_{t+1} − x̂_t < 1.96 √((t+1)/t) σ } = 0.95
Notice that, assuming σ is known, the terms forming the limits of the
inequality are all known and can be calculated as two numbers. These
we call the lower confidence limit (LCL) and the upper confidence
limit (ULC). Thus,
/9
LCL = 245 - 1.96 / - x 10 = 224.2
V 3
/9
UCL - 245 + 1.96 / - x 10 = 265.8
V 8
Thus the 95 per cent confidence limits for the forecast are 224.2 and
265.8. The interval 224.2 to 265.8 is called the 95 per cent confidence
interval, or prediction interval.
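This calculation is easily mechanized. The short Python sketch below reproduces the book's figures; the function name and the caller-supplied multiplier (1.96 when σ is known, a t-table value otherwise) are conventions of this illustration, not of the text:

```python
import math

def prediction_interval(forecast, sigma, t, multiplier=1.96):
    """Prediction limits for the next observation under the constant
    mean model: forecast +/- multiplier * sigma * sqrt((t + 1)/t).
    Use multiplier 1.96 for known sigma (normal distribution), or the
    appropriate t-distribution value when sigma is estimated."""
    half_width = multiplier * sigma * math.sqrt((t + 1) / t)
    return forecast - half_width, forecast + half_width

# The worked example: forecast 245, sigma = 10, t = 8 observations
lcl, ucl = prediction_interval(245, 10, 8)             # -> (224.2, 265.8)
lcl_t, ucl_t = prediction_interval(245, 10, 8, 2.365)  # sigma estimated, 7 d.f.
```

With the estimated σ the interval widens to (219.9, 270.1), matching the recalculation later in the text.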
Notice that, though for ease of notation we have talked about
forecasting x_{t+1}, we could have talked of forecasting x_{t+k} without any
change in the calculations. Similarly, we have treated the observations
x_1, . . . , x_t as though they were made at equally spaced times called 1,
2, . . . , t. The observations could have been made at any times, say T_1,
T_2, . . . , T_t, without altering our results. The reason for this is that in
this model the basis of the forecast is the estimate of μ, which is
assumed constant for all time. Thus it does not matter when the
forecasts are made for or when the observations were taken. In all cases
we get the same results and, in particular, the same confidence interval.
We will see in later sections that this is not true in general.
In the previous discussion it has been assumed that σ is known,
which is obviously very rarely true. When any parameter is not known,
it must be estimated from the observations. To estimate σ², the
variance of x, we use a simple modification of the sample variance of
the data, which was defined in section 3.2. Here we define our estimate
as
σ̂² = (1/(t − 1)) Σ_{i=1}^t (x_i − x̄)²
where x̄ is the mean of the available data, which is x̄_t for the model of
this section. If this estimate is used in place of σ² in the calculation of
confidence limits, the value 1.96, or whatever was used from the tables,
has to be replaced by the corresponding value from the table of the
t-distribution. The reason is that when the constant σ² is replaced by
the random variable σ̂², which depends on the data, the normal
distribution form of Figure 4.1 is no longer applicable and a
distribution called a t-distribution is used instead (e.g. Fisher and Yates,
1938). If in our example the value of σ̂² was 100 and the value of
t − 1, called the degrees of freedom, was 7, then looking up the 95 per
cent value for 7 degrees of freedom in tables of the t-distribution gives
the number 2.365 in place of 1.96. Recalculating the confidence limits
gives
gives
UCL = 245 + 2.365 √(9/8) × 10 = 270.1
Thus the prediction interval is wider, which simply reflects the fact
that, when σ is not known, we have less knowledge of the situation and
cannot give as short an interval as before. In the above calculation we
have obtained the error standard deviation σ_e from that of the data, σ.
As we will usually measure the errors directly it is more natural, and
reliable, to estimate σ_e directly from the observed errors. We would
thus use
σ̂_e² = (1/t) Σ_{i=1}^t e_i²
Table 4.3 Global constant mean model

Data: x_1, . . . , x_t
Minimum MSE forecast of x_{t+k}: x̂_{t,k} = x̄_t, independent of lead time k
Explicit form: x̄_t = (1/t) Σ_{i=1}^t x_i
Recurrence form: x̄_t = x̄_{t−1} + (1/t)(x_t − x̄_{t−1})
Forecast bias: 0
95% prediction interval, σ² known: x̄_t − 1.96σ√((t + 1)/t) to x̄_t + 1.96σ√((t + 1)/t)
Month   1  2  3  4  5  6  7  8  9  10  11  12
Sales   7  5  6  4  5  4  5  3  4   7   6   5

First set:   7 5 6 4 5    Average 5.4
Second set:  5 6 4 5 4    Average 4.8
etc.
Eighth set:  3 4 7 6 5    Average 5.0

Moving averages: 5.4  4.8  4.8  4.2  4.2  4.6  5.0  5.0
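The moving averages above can be reproduced with a few lines of Python (the function name is illustrative):

```python
def moving_averages(data, span):
    """Successive averages of `span` consecutive observations."""
    return [sum(data[i:i + span]) / span for i in range(len(data) - span + 1)]

sales = [7, 5, 6, 4, 5, 4, 5, 3, 4, 7, 6, 5]
print(moving_averages(sales, 5))
# [5.4, 4.8, 4.8, 4.2, 4.2, 4.6, 5.0, 5.0]
```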
(1/4)(x_{t−3} + x_{t−2} + x_{t−1} + x_t)
we might use
(1/10)(x_{t−3} + 2x_{t−2} + 3x_{t−1} + 4x_t)
If every observation were equal to 7, then this 'weighted' moving average will also have the value 7. Though
this weighted moving average estimates p locally, in the sense that it
takes most notice of current information, it suffers from two
disadvantages. Firstly, it only uses the four latest observations and
ignores the rest. Secondly, it does not possess a very simple recurrence
form. A set of weights for a weighted moving average which overcomes both these disadvantages is given by the exponential weights a^r, where a is a constant with value 0 < a < 1. The general formula for the
exponentially weighted moving average is
x̄_t = ( Σ_{r=0}^{t−1} a^r x_{t−r} ) / ( Σ_{r=0}^{t−1} a^r )
If we let
S_t = Σ_{r=0}^{t−1} a^r x_{t−r}
and
W_t = Σ_{r=0}^{t−1} a^r
then x̄_t = S_t/W_t.
Week   1  2  3  4
Sales  7  5  6  4

Week 1: weight 1.0; weighted sum 1.0 × 7 = 7; sum of weights 1.0; x̄_1 = 7/1.0 = 7
Then
S_t = a S_{t−1} + x_t
W_t = a W_{t−1} + 1
x̄_t = S_t/W_t
with S_0 = 0, W_0 = 0.
Thus both numerator and denominator may be evaluated using very
simple recurrence formulae. This is illustrated in Table 4.5(b). In
moving from one stage to the next of this calculation, only the last
values of St and Wt need to be recorded.
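The recurrence calculation of Table 4.5(b) can be sketched as follows; only S and W are carried from step to step, as just noted (the function name is illustrative):

```python
def exact_exponential_smoothing(data, a):
    """Exact exponentially weighted moving average via the recurrences
    S_t = a*S_{t-1} + x_t, W_t = a*W_{t-1} + 1, xbar_t = S_t/W_t,
    starting from S_0 = W_0 = 0.  Only S and W need be stored."""
    s = w = 0.0
    xbars = []
    for x in data:
        s = a * s + x
        w = a * w + 1.0
        xbars.append(s / w)
    return xbars

# First value is 7/1.0 = 7, matching Table 4.5
print(exact_exponential_smoothing([7, 5, 6, 4], 0.8))
```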
One way of looking at the introduction of the weights ar is that we
now look at our data through a fog. When a is small the fog is thick, so
that we can only see with any clarity the data that are close to us in
time. Putting a = 1 corresponds to a clear day in which we can see all
the data clearly. Technically, the association of weights with variables
to decrease their importance is called discounting and we will refer to a
as the discounting factor.
Let us now consider two other approximate ways of expressing our
forecasting formula. If a < 1 and if t is so large that a^t is negligible, then
the value of W_t can be shown to tend towards a constant limit of
(1 − a)^{−1} and thus only S_t needs to be stored. In this case the formula
becomes
x̄_t = (1 − a) Σ_{r=0}^{t−1} a^r x_{t−r}
This can be rewritten in two ways that are very convenient for
calculation purposes:
Thus
x̄_t = (1 − a)x_t + a x̄_{t−1}
This recurrence relation, as that for the ordinary average in section 4.1,
relates the new forecast to the old forecast and the new observation.
The values a and 1 — a give the proportion of weight attached to the
old forecast and new observation, respectively. In words, we may thus write
new forecast = a × (old forecast) + (1 − a) × (new observation)
Hence
x̄_t = x̄_{t−1} + (1 − a)e_t
where e_t = x_t − x̄_{t−1} is the last one-step-ahead forecast error.
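The recurrence form and this error-correction form are algebraically the same. A quick numerical check, under an arbitrary assumed starting value:

```python
def smooth_recurrence(data, a, start):
    """Limiting form: xbar_t = (1 - a)*x_t + a*xbar_{t-1}."""
    xbar = start
    out = []
    for x in data:
        xbar = (1 - a) * x + a * xbar
        out.append(xbar)
    return out

def smooth_error_correction(data, a, start):
    """Equivalent form: xbar_t = xbar_{t-1} + (1 - a)*e_t, where
    e_t = x_t - xbar_{t-1} is the one-step-ahead forecast error."""
    xbar = start
    out = []
    for x in data:
        e = x - xbar                 # forecast error
        xbar = xbar + (1 - a) * e
        out.append(xbar)
    return out

data = [7, 5, 6, 4, 5, 4]
for u, v in zip(smooth_recurrence(data, 0.8, 5.0),
                smooth_error_correction(data, 0.8, 5.0)):
    assert abs(u - v) < 1e-9         # identical up to rounding error
```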
Thus the new forecast can be obtained from the old by adding a
fraction (1 — a) of the last error made in our forecasting. The only
difference between this and the similar expression for the global mean
model is that the fraction is a constant independent of t rather than a
fraction which gets smaller as t increases. Intuitively, this means that
the forecast is always ‘constantly’ alert for changes in the situation
which would reveal themselves through the forecast errors. This is
exactly the feature that is required if we are to forecast a constant
mean model which is valid locally rather than globally. The method has
been introduced to deal with a model approximating to the global
constant mean model
x_t = μ + e_t
over small localities of time, but for which the mean μ wanders in some
unspecified fashion. Many theoretical models have been devised that
give, in effect, wandering means. It has been shown that for many of
these the exponentially weighted moving average gives a good forecast.
Clearly, if we can be more precise about how the mean wanders, we
should be able to improve on this forecast. We will discuss suitable
methods in later chapters. Experience shows, however, that this
method, often called exponential smoothing, provides a very robust and
widely applicable technique.
W_t = 1 + a + a² + . . . + a^{t−1}
    = (1 − a^t)/(1 − a)
Hence
E(S_t) = μW_t + 0
and
E(x̄_t) = μ
Thus exponential smoothing leads to an unbiased forecast if the mean is
genuinely constant. Further,
Var(S_t) = Σ_{r=0}^{t−1} a^{2r} σ² = σ²(1 − a^{2t})/(1 − a²)
But
Var(x̄_t) = Var(S_t)/W_t²
so that
Var(x̄_t) = σ² {(1 − a)/(1 + a)} {(1 + a^t)/(1 − a^t)}
For t large this becomes
Var(x̄_t) ≈ σ²(1 − a)/(1 + a)

Table 4.6 Comparison of Var(x̄_t), in units of σ²

                               Constant mean        Wandering mean
                               t = 10    t = ∞      t = 10    t = ∞
x̄_t = average of x's           0.10      0          1.27      ∞
x̄_t = exponential smoothing    0.14      0.11       0.19      0.12
As we saw in the previous section, the best forecasting formula for this model
is the ordinary average, which corresponds to the exact form of
exponential smoothing with a = 1. By comparing values of Var(x̄) for
various values of a, including a = 1, we can see how much we lose by
using our exponential weights when we are in fact dealing with a global
constant mean model. Conversely, if we consider some model for a
wandering mean process and repeat the comparison, we can see the
advantage of introducing the weights ar. Table 4.6 shows such a
comparison for a particular wandering mean process. Notice from the
table how little is lost, by way of having a larger variance, when a = 0.9,
say, is used instead of a = 1.0 in the global constant mean model,
and conversely how much bigger the variance is if a = 1.0 is used when
some a < 1.0 should have been used. It would appear from this example
that unless we can be absolutely certain that the constant mean model
has applied in the past, and will continue to apply in the future, then it
will be safer to use the weighted mean rather than the ordinary mean.
In the previous section we discussed the construction of confidence
limits for the future observation. It might appear reasonable to use the
above expressions for variance to construct such limits, and indeed we
could. The limits, however, would refer to the permanent model; the
limits for a local model would, in general, be different. The natural
approach here is to estimate σ_e directly from the observed forecast
errors using the estimator σ̂_e given in the last section. The prediction
intervals can then be derived using the same approach as that discussed
in that section.
The forecasting formula of exponential smoothing involves the
‘discounting factor’ a, which is a ‘forecasting parameter’ as distinct
from a parameter of the model. The value of a is normally chosen so as
to minimize the mean square error of forecasts over a trial period. This
is discussed in more detail in Chapter 16. Table 4.7 gives a summary of
the main formulae for this section. It should be noted that much of the
literature on exponential smoothing uses α = 1 − a as the basic
parameter. This is often referred to as the ‘smoothing constant’.
References
Brown, R. G. (1959). Statistical Forecasting for Inventory Control. McGraw-Hill,
New York.
Davis, H. T. (1941). The Analysis of Time Series. Cowles Commission, Yale.
Fisher, R. A. and Yates, F. (1938). Statistical Tables. Oliver and Boyd, Edinburgh.
Hogg, R. V. and Craig, A. T. (1970). Introduction to Mathematical Statistics.
Macmillan, New York.
Mood, A. M. and Graybill, F. A. (1963). Introduction to the Theory of Statistics.
McGraw-Hill, New York.
Winters, P. R. (1960). Forecasting sales by exponentially weighted moving averages.
Man. Sci., 6, 324—342.
Chapter 5
x_12 = α + β × 12 + e_12

Time  1   2   3   4   5   6   7   8   9  10  11
x_t  31  34  37  36  42  39  42  45  47  47  52
Refer again to Figure 5.2 which shows a fitted line and the data. To
decide on how to fit a line, we must first decide which criteria the
fitted line should satisfy. The intuitive criterion is simply that the line
should be as close as possible to the observations. That is to say, that
the residuals e_1, . . . , e_t should be small. By analogy with the criteria
for forecasting developed in previous chapters, an obvious criterion
would be that the mean square error of the e_i should be as small as
possible. Since t is here fixed, this requirement is that the quantity S,
where
S = Σ_{i=1}^t e_i²
is as small as possible. This criterion is referred to as the least squares
criterion. From the definition we can see that if x_i is the observation
made at time i then
e_i = x_i − α − βi
Thus
S = Σ_{i=1}^t (x_i − α − βi)²
The least squares line is the line for which α̂ and β̂ make S a minimum.
A direct application of calculus (setting ∂S/∂α = 0 and ∂S/∂β = 0) gives
two simultaneous equations for α̂ and β̂. These are
Σ_{i=1}^t x_i = α̂ t + β̂ Σ_{i=1}^t i
Σ_{i=1}^t i x_i = α̂ Σ_{i=1}^t i + β̂ Σ_{i=1}^t i²
Fitted line: x̂ = 41.09 + 1.882(i − 6) = 29.80 + 1.88i
Estimated variance: σ̂² = Σ e_i²/9 = 1.592
Forecast of x_12 = 41.09 + 1.882 × 6 = 52.38
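The normal equations are easily solved in mean-centred form. The sketch below, run on the data as printed above, reproduces the mean level 41.09; the slope it yields is near 1.864 rather than the quoted 1.882, presumably a transcription artefact in the printed data or rounding in the original:

```python
def fit_trend(x):
    """Least squares fit of x_i = alpha + beta*i, i = 1..t, solving
    the normal equations in mean-centred form."""
    t = len(x)
    times = range(1, t + 1)
    ibar = sum(times) / t
    xbar = sum(x) / t
    beta = sum((i - ibar) * (xi - xbar) for i, xi in zip(times, x)) / \
           sum((i - ibar) ** 2 for i in times)
    alpha = xbar - beta * ibar       # value of the line at time 0
    return alpha, beta

x = [31, 34, 37, 36, 42, 39, 42, 45, 47, 47, 52]
alpha, beta = fit_trend(x)
# The fitted line passes through the data mean 41.09 at the mean time i = 6
```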
e_T = x_T − x̂_T
    = (α + βT + e_T) − (α̂ + β̂T)
    = (α − α̂) + (β − β̂)T + e_T
M.S.E. = E(e_T²)
       = Var(e_T)
       = Var(α̂) + T² Var(β̂) + σ²
Thus
M.S.E. = σ²( 1/t + T²/Σ_{i=1}^t (i − ī)² + 1 )
It will be seen that as T, the time from the average time of the
observations to the time being forecast, increases so does the mean
square error.
sophisticated models, but often a local linear trend will provide a simple
and robust model as the basis for forecasting. The terms in the more
sophisticated model might well describe the past more accurately than
our local trend model, but we may have doubts about their continuing
stability in the future. It will often seem in practice that the linear
trend is the only stable aspect of the situation that one can reasonably
expect to continue into the future, at least for a short time. This
distinction in attitude to past and future must always be borne in mind
by the forecaster. What he regards as a good model will often be
different from that of the person whose only aim is to obtain a good
model for a given set of data from the past. We shall consider in this
section three approaches to forecasting with this model.
where r is time measured into the past. By its definition and the fact
that the slope is β, it is clear from Figure 5.4(a) that adjacent values of
μ_t are related by
μ_t = μ_{t−1} + β
Hence an estimate of μ_t can be found from the two previous estimates
μ̂_{t−1} and β̂_{t−1} by simply adding β̂_{t−1} to μ̂_{t−1}, see Figure 5.4(b). This
gives the estimate of the mean at time t based on past data. The latest
observation x_t also provides an estimate of μ_t. Hence if we use the ideas
in the exponential smoothing recurrence equation and make the new
estimate μ̂_t a weighted mean of the latest data and the estimate based
on using the previous estimates, we arrive at
μ̂_t = (1 − a)x_t + a(μ̂_{t−1} + β̂_{t−1})
To make use of this equation, we need a way of calculating the
sequence of estimates β̂_{t−1}. This may be done by applying the idea
used in getting the above recurrence relation to finding a recurrence
relation for the β̂_t values. The new value β̂_t will be a weighted mean of
the old estimate β̂_{t−1} and an estimate based on the most recent data.
To estimate the slope using the most recent data, we might use
x_t − x_{t−1}, since from the model this can be written as
x_t − x_{t−1} = (μ_t + e_t) − (μ_{t−1} + e_{t−1})
            = β + e_t − e_{t−1}
An alternative estimate of the slope, based on recent data, is μ̂_t − μ̂_{t−1}, since for the model
β = μ_t − μ_{t−1}
By replacing the unknown parameters μ_t and μ_{t−1} by our estimates of
them, we see that μ̂_t − μ̂_{t−1} gives an estimate of β and includes
implicitly the influence of the latest observation x_t. It is, however, not
as strongly influenced by x_t as the previously suggested estimate. The
method of updating the estimate of slope is thus based on the
recurrence relation
β̂_t = (1 − b)(μ̂_t − μ̂_{t−1}) + b β̂_{t−1}
Note that we do not have to use the same weighting constant as for μ̂_t,
so a new constant b is introduced.
The practical application of these formulae requires the estimates of β
and μ to be updated alternately using the recurrence formulae above,
each updated estimate being used in the next calculation of the other
parameter. The calculation is illustrated in Table 5.5. We now have a
way of finding estimates of the slope and current mean at the time of
(a) Formulae:
μ̂_t = (1 − a)x_t + a(μ̂_{t−1} + β̂_{t−1})
β̂_t = (1 − b)(μ̂_t − μ̂_{t−1}) + b β̂_{t−1}
x̂_{t,2} = μ̂_t + 2β̂_t
a = 0.8, b = 0.9
Initial values assumed: μ̂ = 42.9, β̂ = −0.15

(b) Calculation (columns: x_t, μ̂_{t−1} + β̂_{t−1}, μ̂_t, μ̂_t − μ̂_{t−1}, β̂_{t−1}, β̂_t, x̂_{t+2}; initial row: μ̂_0 = 42.900, β̂_0 = −0.15)
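The alternating update can be sketched as follows, using the settings of Table 5.5 and, as assumed illustrative data, the x_t series that appears in the later discounted least squares example:

```python
def holt(data, a, b, mu0, beta0):
    """Holt's two-parameter method.  a discounts the level update,
    b the slope update; mu0, beta0 are assumed starting values."""
    mu, beta = mu0, beta0
    forecasts = []
    for x in data:
        mu_prev = mu
        mu = (1 - a) * x + a * (mu + beta)            # update level
        beta = (1 - b) * (mu - mu_prev) + b * beta    # then update slope
        forecasts.append(mu + 2 * beta)               # lead-2 forecast
    return forecasts

f = holt([42.6, 42.5, 42.4, 42.1, 41.8, 41.5], 0.8, 0.9, 42.9, -0.15)
```

With these settings the first step gives μ̂ = 42.72, β̂ = −0.153 and a lead-2 forecast of 42.414.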
the most recent observation. The estimated equation of the line at time
t is
x̂_{t,h} = μ̂_t + β̂_t h
by any of the methods of section 5.1. β̂_0 is then taken as the slope of
this line and μ̂_0 as its value at the time one unit before the initial
observation is obtained.
If there is an initial set of data available, it is wise to use it to obtain
some idea as to the best values for the forecasting parameters a and b.
For example, if we take the mean square error as the criterion, then we
can evaluate the mean square error for a range of different values of a
and b. Since all methods produce poor forecasts when there is very
little data, the forecasts obtained over the first few observations should
be ignored in calculating the mean square error. A discussion of
methods of choosing parameter values is given in section 15.3.
In Chapter 4 we noted that the recurrence form of forecasting
formula could also be written in terms of the forecast errors. This can
also be done here. The one-step-ahead forecast error is
e_t = x_t − (μ̂_{t−1} + β̂_{t−1})
Substituting the value of x_t given by this definition in the recurrence
relation for μ̂_t gives
μ̂_t = μ̂_{t−1} + β̂_{t−1} + (1 − a)e_t
and for the slope
β̂_t = β̂_{t−1} + c e_t
where c = (1 − b)(1 − a). Thus we can update both the mean and slope
estimates by adding simple multiples of e_t.
S = Σ_{i=1}^t e_i²
where the e values were called the residuals. In doing this we attach as
much importance to residual e_1, for our very first observation, as to the
residual e_t for our latest observation. When we are dealing with data
believed to come from a local model and our aim is to get a good fit to
the recent data for forecasting purposes, then this equality of
importance may not be very sensible. It may be more reasonable to
introduce a weighting, or discounting, into the sum of residuals squared
in exactly the same way that we introduced it into the ordinary mean
in justifying use of the weighted mean of exact exponential smoothing.
Various types of weighting or discounting could be introduced but, as
in the constant mean model, it is found that the use of exponential
weights leads to particularly simple calculations. We thus replace the
sum of residuals squared by a weighted sum in which e_{t−r}² has a
weight of a^r (0 < a < 1) associated with it. The most recent observation
gives a residual squared with weight one. The far-past residuals, having
small values of a^r (for a < 1), contribute little to the total. As the
magnitude of a controls the extent to which the past is discounted, it is
often called the 'discounting factor'. Thus our least squares criterion is
replaced by
S = Σ_{r=0}^{t−1} a^r e_{t−r}²
where
e_{t−r} = x_{t−r} − (μ̂_t − β̂_t r)
The normal equations are now
μ̂_t Σ_{r=0}^{t−1} a^r − β̂_t Σ_{r=0}^{t−1} r a^r = Σ_{r=0}^{t−1} a^r x_{t−r}
μ̂_t Σ_{r=0}^{t−1} r a^r − β̂_t Σ_{r=0}^{t−1} r² a^r = Σ_{r=0}^{t−1} r a^r x_{t−r}
It will be seen that the form of these equations is the same as that for
the normal equations for the undiscounted case, as it must be, for
this corresponds to putting a = 1. The sign of β̂_t has changed since we
are now measuring the time, r, as positive into the past. The other
difference is that all the sums are now 'discounted' sums:
Σ_{i=1}^t 1 (= t) is now Σ_{r=0}^{t−1} a^r,
and so on for each term. To get the full value of introducing the
weights a, we require evenly spaced data, so we shall assume in what
follows that observations occur evenly at times 1, 2, 3, ... . The
occurrence of uneven intervals can be dealt with fairly easily once the
principles have been grasped.
The forecast of x_{t+h} is, as before,
x̂_{t,h} = μ̂_t + β̂_t h
The discounted sums may be evaluated by simple recurrence relations:

Sum                           Symbol   Recurrence relation                Initial value
Σ_{r=0}^{t−1} a^r              W_t      W_t = 1 + a W_{t−1}                W_1 = 1
Σ_{r=0}^{t−1} r a^r            A_t      A_t = a(A_{t−1} + W_{t−1})         A_1 = 0
Σ_{r=0}^{t−1} r² a^r           B_t      B_t = A_t + a(A_{t−1} + B_{t−1})   B_1 = 0
Σ_{r=0}^{t−1} a^r x_{t−r}      Y_t      Y_t = x_t + a Y_{t−1}              Y_1 = x_1
Σ_{r=0}^{t−1} r a^r x_{t−r}    Z_t      Z_t = a(Y_{t−1} + Z_{t−1})         Z_1 = 0

Limiting values for large t:     a = 0.8    a = 0.9
W = 1/(1 − a)                    5          10
A = a/(1 − a)²                   20         90
B = a(1 + a)/(1 − a)³            180        1440
1/C = a/(1 − a)⁴                 500        6300
C                                0.002      0.000158
Worked example (a = 0.8; columns W_t, A_t, B_t, x_t, Y_t, Z_t, C_t, μ̂_t, β̂_t, x̂_{t+2}):

W_t    A_t    B_t     x_t   Y_t      Z_t      C_t      μ̂_t     β̂_t     x̂_{t+2}
1.800  0.800  0.800   42.6  76.84    34.24    1.25     42.6    −0.2    42.200
2.440  2.080  3.328   42.5  103.972  88.864   0.2626   42.487  −0.149  42.189
2.952  3.616  7.942   42.4  125.578  154.269  0.09644  42.386  −0.127  42.132
3.362  5.254  14.501  42.1  142.562  223.877  0.04730  42.146  −0.173  41.800
3.689  6.893  22.697  41.8  155.850  293.151  0.02761  41.874  −0.198  41.478
3.951  8.466  32.138  41.5  166.180  359.201  0.01808  41.578  −0.223  41.132
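The whole calculation can be sketched as below. The starting value Y_1 = 42.8 is inferred from the printed table (a back-forecast of the series), and the estimates solve the discounted normal equations with C = 1/(W_t B_t − A_t²):

```python
def discounted_ls(data, a, y1):
    """Local linear trend by discounted least squares, via the
    recurrence relations for the discounted sums W, A, B, Y, Z.
    y1 is the assumed initial value of the discounted data sum Y_1."""
    W, A, B = 1.0, 0.0, 0.0          # sums at t = 1
    Y, Z = y1, 0.0
    out = []
    for x in data:
        W_old, A_old, B_old, Y_old, Z_old = W, A, B, Y, Z
        W = 1 + a * W_old
        A = a * (A_old + W_old)
        B = A + a * (A_old + B_old)
        Y = x + a * Y_old
        Z = a * (Y_old + Z_old)
        C = 1 / (W * B - A * A)
        mu = C * (B * Y - A * Z)     # solves the discounted normal equations
        beta = C * (A * Y - W * Z)
        out.append((mu, beta, mu + 2 * beta))
    return out

rows = discounted_ls([42.6, 42.5, 42.4, 42.1, 41.8, 41.5], 0.8, 42.8)
```

The first step reproduces the first row of the table: μ̂ = 42.6, β̂ = −0.2, lead-2 forecast 42.2.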
E(x̄_t) = Σ a^r (μ_t − βr) / Σ a^r
       = μ_t − β(Σ r a^r / Σ a^r)
       = μ_t − {a/(1 − a)}β
for large t, using the limiting results given in the table above. Thus the lag is
{a/(1 − a)}β.
A reasonable way to obtain an estimate of the current mean is to
adjust x t by adding an estimate of this lag. One way of doing this is to
make use of what has been termed double exponential smoothing (see
Brown, 1959, 1963). Since the exponentially weighted moving average
has the effect of smoothing out much of the fluctuation in the data x_1,
x_2, x_3, . . . , x_t, the sequence x̄_1, x̄_2, x̄_3, . . . , x̄_t is said to be the result
of exponentially smoothing the first sequence. Suppose that we now
apply the formula of exponential smoothing to this new sequence.
This we could denote by
x̄_t^(2) = (1 − a)x̄_t + a x̄_{t−1}^(2)
The new sequence x̄_1^(2), x̄_2^(2), x̄_3^(2), . . . , x̄_t^(2) is said to have been
obtained by the process of 'double exponential smoothing' or second-order
exponential smoothing of the data. If x_t exactly follows the trend
line, then x̄_t lies on a trend line a distance l below, when β is positive.
Similarly x̄_t^(2) will be a distance l below x̄_t. Thus we would have
x_t = x̄_t + l
x̄_t = x̄_t^(2) + l
By subtracting we get
x_t = 2x̄_t − x̄_t^(2)
and so the current mean may be estimated by
μ̂_t = 2x̄_t − x̄_t^(2)
To estimate the slope, we use the facts that for the exact trend
l = {a/(1 − a)}β
and
l = x̄_t − x̄_t^(2)
Hence the slope may be estimated by
β̂_t = {(1 − a)/a}(x̄_t − x̄_t^(2))
x̄_0 = μ̂_0 − {a/(1 − a)}β̂_0
and
x̄_0^(2) = μ̂_0 − {2a/(1 − a)}β̂_0
give the correct starting values to use. If we can at best make rough
guesses for μ_0 and β_0, it is advisable to use the exact method as given
by the normal equations, at least to obtain the first few forecasts.
Table 5.7 provides an example of the method.
As with Holt’s method, we may express the forecasting formulae in
terms of the one-step-ahead forecasting errors. The derivation of these
formulae is an exercise in algebra which we shall leave to the more
mathematically inclined readers. The final equations in error correction
form are
μ̂_t = μ̂_{t−1} + β̂_{t−1} + (1 − a²)e_t
β̂_t = β̂_{t−1} + (1 − a)²e_t
A comparison of the above error correction forms with those of Holt’s
method shows that the double smoothing method is just a special case
of Holt’s method. In fact, if we introduce subscripts h to denote the
parameters in Holt’s method, then the two sets of equations are
identical if we write
a_h = a²
(a) Formulae
x̄_t = (1 − a)x_t + a x̄_{t−1}
x̄_t^(2) = (1 − a)x̄_t + a x̄_{t−1}^(2)
μ̂_t = 2x̄_t − x̄_t^(2)
β̂_t = {(1 − a)/a}(x̄_t − x̄_t^(2))
a = 0.8, μ̂_0 = 42.9, β̂_0 = −0.15, assumed initial values giving
x̄_0 = 43.5, x̄_0^(2) = 44.1
(b) Calculation columns: x_t, x̄_t, x̄_t^(2), μ̂_t, β̂_t, x̂_{t+2}
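A sketch of the double smoothing calculation with the settings of Table 5.7; the data series is assumed to be the same one used in the earlier worked examples:

```python
def double_smoothing(data, a, mu0, beta0):
    """Brown's double exponential smoothing.  Starting values for the
    two smoothed series follow the text:
    xbar_0 = mu0 - (a/(1-a))*beta0, xbar2_0 = mu0 - (2a/(1-a))*beta0."""
    xbar = mu0 - a / (1 - a) * beta0
    xbar2 = mu0 - 2 * a / (1 - a) * beta0
    out = []
    for x in data:
        xbar = (1 - a) * x + a * xbar         # first smoothing
        xbar2 = (1 - a) * xbar + a * xbar2    # second smoothing
        mu = 2 * xbar - xbar2
        beta = (1 - a) / a * (xbar - xbar2)
        out.append((mu, beta, mu + 2 * beta))
    return out

# a = 0.8, mu0 = 42.9, beta0 = -0.15 give xbar_0 = 43.5, xbar2_0 = 44.1
rows = double_smoothing([42.6, 42.5, 42.4, 42.1, 41.8, 41.5],
                        0.8, 42.9, -0.15)
```

The first step gives μ̂ = 42.696 and β̂ = −0.156.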
and
c_h = (1 − a)², i.e. c_h = (1 − √a_h)²
5.2.4 Conclusions

                                     Holt's    Discounted       Double
                                     method    least squares    smoothing
Number of forecasting parameters       2            1               1
Initial estimates of μ̂_t and
β̂_t required                          Yes          No              Yes
The reason for discussing all three methods at some length was not
just to cover the main methods. The three methods we have discussed
illustrate three main approaches to forecasting local models, which are
complexity of the model used. Further, the more parameters used the
more observations are needed to obtain forecasts. Even with global
models a lot of data are required to give a reasonable error variance.
Cowden (1963) tabulates the type of information required to assess the
error variances. By way of brief illustration, if, as usual, the e values are
independent with variance σ², then for, say, eleven observations the
one-step-ahead forecast error variance is 1.647σ² for a linear trend
model, 2.098σ² for a quadratic model, 2.771σ² for a cubic model and
3.908σ² for a quartic model. Clearly, in a less stable situation with a
model fitted by, say, discounted least squares the error variances will be
even higher in practice.
References
Brenner, J. L., D’Esopo, D. A., and Fowler, A. G. (1968). Difference equations in
forecasting formulae. Man. Sci., 15, No. 3.
Brown, R. G. (1959). Statistical Forecasting for Inventory Control McGraw-Hill,
New York.
Brown, R. G. (1963). Smoothing; Forecasting and Prediction of Discrete Time
Series. Prentice-Hall, Englewood Cliffs, New Jersey.
Coutie, G. A., Davis, O. L., Hassall, C. H., Miller, D. W. G. P., and Morrell, A. J. H.
(1966). Short Term Forecasting. ICI Monograph No. 2. Oliver and Boyd,
Edinburgh.
Cowden, D. J. (1963). The perils of polynomials. Man. Sci., 9, No. 4, 546—550.
Croxton, F. E., Cowden, D. J., and Klein, S. (1968). Applied General Statistics.
Pitman.
Gilchrist, W. G. (1967). Methods of estimation involving discounting. J. Roy.
Statist. Soc. (B), 29, 355—369.
Holt, C. C. (1957). Forecasting Seasonals and Trends by Exponentially Weighted
Moving Averages. Carnegie Institute of Technology, Pittsburgh, Pennsylvania.
Chapter 6
Regression models
6.1 Introduction
S = f(I, P, A, . . .)
where f( ) is some function to be found. Our ability to construct such
models that provide good forecasts is a measure of our real under-
standing of the situation. Clearly, in constructing models for this type
of forecasting a good knowledge of the underlying situation is essential.
There are a number of books that deal particularly with this area of
forecasting, e.g. those by Robinson (1971) and Spencer, Clark and
Hoguet (1961).
Within the scope of this book we are concerned only with some of
the statistical aspects of the use of such models. There are an infinity of
possible models that might be used, so we must clearly limit our
discussion to a simple type as an introduction to the subject. In this
chapter we will consider the problem of forecasting with the most
commonly used model, namely that based on the use of linear
relationships. By way of example, suppose that in the situation
described above we finally obtained the model
y_t = α + βx_t + e_t
If we can use the data to estimate the unknown parameters, α and β,
by estimates a and b, then we have as a fitted model the line
y = a + bx
These forms of model are called linear regression models and we talk
about fitting a regression line of y on x. If we wish to forecast y_{t+h}, the
natural thing to do is to use
ŷ_{t+h} = a + b x_{t+h}
This is fine, provided that we know the future value x_{t+h}. There are
situations where this is the case, as, for example, where x_{t+h} is the
number of selling points of a product, in which case x_{t+h} is a value
under our control. In general, however, x_{t+h} may be as much a
quantity requiring forecasting as y_{t+h}. If x̂_{t+h} is a forecast of x_{t+h},
then
ŷ_{t+h} = a + b x̂_{t+h}
will provide a forecast of y_{t+h}. This forecast may be no better than that
obtained by looking at past values of yt and using, say, a local trend
model. It does, however, provide another forecast based on some
additional information. This may be usefully combined with the trend
forecast using the methods to be discussed in Chapter 17.
It is sometimes the case that past experience shows that xt+h leads
to a better forecast of the y variable than we could obtain from past
values of the y variable. In this case, using the regression model provides
a good way of forecasting yt+h. As an alternative procedure worth
considering, we could examine as a basic model
y_t = α′ + β′x̂_t + e_t
where x̂_t is the forecast of x_t. We would then fit this to past values of y
and x̂ to obtain a fitted line
y = a′ + b′x̂
and hence the forecast
ŷ_{t+h} = a′ + b′x̂_{t+h}
In a different type of situation a change in a variable x may not
produce an effect on y for some time, h. For example, sales of spares y
are likely to lag behind sales x of the item for which they are spares. In
such a situation we might have a regression model of the form
y_t = α + βx_{t−h} + e_t
where x_{t−h} is referred to as a lagged variable. We can now forecast y_{t+h}
by fitting and extrapolating to get
ŷ_{t+h} = a + b x_t
where x_t is the latest known value of x.
Clearly, it is more common to have the variable y depending on
more than one other variable, e.g. ice cream sales may depend on both
temperature and sunshine. If we denote these variables by x_{1,t},
x_{2,t}, . . . , x_{k,t}, then we may write our general model as
y_t = Σ_{i=0}^k β_i x_{i,t} + e_t
There is a great deal of literature on this topic (see, for example, Draper
and Smith, 1966). In nearly all this literature the models are regarded as
being, in theory, global models, though in practice the users of the
models often ignore data from the far past as being irrelevant. From the
forecasting viewpoint it seems more reasonable to assume that the
models are best treated as local models. Thus rather than use least
squares we would use discounted least squares with a discounting factor
a^r. If it so happens that we are wrong here, then in choosing a to get the
best forecasts we find, approximately, that a = 1. We would thus obtain
the same forecasts as the user of ordinary least squares fitting.
To indicate the type of result we obtain by applying discounted least
squares, consider the data in terms of the fitted model
y_t = Σ_{i=0}^k b_i x_{i,t} + e_t
where e_t is the error in the fitted regression, the residual. The regression
coefficients b_i are the discounted least squares coefficients when they
are chosen to minimize
S = Σ_{r=0}^{t−1} a^r e_{t−r}²
The normal equations then involve discounted sums of products of the form
Σ_{r=0}^{t−1} a^r z_{t−r} v_{t−r}
Notice that in this simple case the variance of the error increases as the
square of the distance of x_{t+h} from the origin, which acts as a pivot. In
the general case the exponentially weighted mean of the x values acts as
the pivot. It is the uncertainty in the estimate of slope that generates
the element Var(b)x²_{t+h} in the error variance.
[Figure 6.1 Plot of the data and regression lines of Table 6.1]
If x_{t+h} has to be forecast by x̂_{t+h}, then
M.S.E. = Var(b x̂_{t+h}) + σ_e²
       = E(b²)E(x̂²_{t+h}) − β²x²_{t+h} + σ_e²
       = {Var(b) + β²}{Var(x̂_{t+h}) + x²_{t+h}} − β²x²_{t+h} + σ_e²
       = Var(b)x²_{t+h} + σ_e² + Var(x̂_{t+h}){Var(b) + β²}
assuming that b and x̂_{t+h} are independent and x̂_{t+h} is an unbiased
forecast of x_{t+h}. Thus the expected mean square error is increased by
an additional amount depending not only on Var(x̂_{t+h}) but also on
Var(b) and β². As usual, we need to investigate the magnitude of the
mean square error directly via a study of the observed errors when the
method is applied to past data. Theil (1966) gives a general study of
forecast errors in regression models.
R² = 1 − Σ_{i=1}^t e_i² / Σ_{i=1}^t (y_i − ȳ)²
where
ȳ = Σ_{i=1}^t y_i / t
This is the simplest measure of the ability of our model to ‘explain’ the
data. If our model fitted perfectly, the e values, the residuals, would all
be zero so R² would be one. If it did not really 'explain' the data at all,
the variability of the e values would be much the same as that of the y
values. In this case Σe_i² would be approximately the same as
Σ(y_i − ȳ)², so that R² would be approximately zero. Most of the
methods above make some use of the magnitude of R2 in selecting the
regressor variables. If we now say that interest is concentrated on the
most recent information, then it is reasonable to introduce discounting
into the above statistics. Thus if our usual exponential weights are used,
we would replace
Σ_{i=1}^t e_i² by Σ_{r=0}^{t−1} a^r e_{t−r}²
and
Σ_{i=1}^t (y_i − ȳ)² by Σ_{r=0}^{t−1} a^r (y_{t−r} − ȳ)²
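Both the ordinary and the discounted versions of R² are simple to compute. In this sketch the mean ȳ is left undiscounted, as in the formulae above; the function names are illustrative:

```python
def r_squared(y, residuals):
    """Ordinary R^2: 1 - (sum of squared residuals)/(sum of squared
    deviations of y about its mean)."""
    ybar = sum(y) / len(y)
    return 1 - sum(e * e for e in residuals) / sum((v - ybar) ** 2 for v in y)

def discounted_r_squared(y, residuals, a):
    """Discounted version: both sums weighted by a**r, with r measured
    into the past, so the fit to recent data dominates the measure."""
    ybar = sum(y) / len(y)
    num = sum(a ** r * e * e for r, e in enumerate(reversed(residuals)))
    den = sum(a ** r * (v - ybar) ** 2 for r, v in enumerate(reversed(y)))
    return 1 - num / den
```

A perfect fit (all residuals zero) gives 1 in both versions; residuals as variable as the data themselves give a value near zero.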
series were used; one, Tt, was the seasonal normal temperature, the
other, ATt, was the deviation of the actual temperature from this
at time t. In the event it was found that Tt was effectively allowed
for by the seasonal terms in the model and ATt remained as a
highly significant term.
(e) Suppose that variables x_{1,t} and x_{2,t} are thought to be relevant for
forecasting the y_t series. It does not necessarily follow that x_1 and
x_2 should appear in the regression model as
y_t = β_1 x_{1,t} + β_2 x_{2,t} + e_t
It could be, for example, that the best model for forecasting is
x_t = y_t + A_t + B_t
Such equations do not lend themselves to simple direct solution by
least squares. Much of the literature of econometric models is
concerned with the least squares solution of models of such involved
types. We will not study this topic further here other than to give, in
Figure 6.2, a visual presentation of such equations that is of practical
use. Assuming that the parameters are estimated, we still have the
quantities A_t and B_t to deal with. In some situations we will be asking
questions about what x_{t+1} will be if A_{t+1} and B_{t+1} take on various
different values. These would be referred to as 'conditional' forecasts.
If, however, we have to use some forecasts of A_{t+1} and B_{t+1} to obtain
our forecast of x_{t+1}, we would have an 'unconditional' forecast. A
practical aspect of most econometric models is the large number of
equations involved. The paper by Ball and Burns (1968) gives an
example of such a set of equations for a relatively small model, together
with a discussion of the forecasting aspects of such models.
A particular problem occurs in economic forecasting, and in general
in forecasting with regression models, that the forecaster needs to be
aware of. The problem is usually referred to as the problem of
multicollinearity. In essence the problem is this: if in the model
y_t = β_0 + β_1 x_{1,t} + β_2 x_{2,t} + e_t
it so happens that x_{2,t} = γ x_{1,t} for all t, γ being a constant, then the
normal equations will not give a solution. The reason is clear if we
substitute to get
y_t = β_0 + η x_{1,t} + e_t
where η = β_1 + γβ_2. The equation for y_t really only has two parameters,
but yet we were originally trying to solve it for three. In practice we
will not get such an exact relation between the two variables, but we do
often get very high correlations. Though in these cases we can solve the
normal equations, the variances of the estimators we get can become
unacceptably high. One way out of this dilemma is to try to replace
x_1 and x_2 by a new single variable that combines their effects. For
example, in a regression relating sales to temperature and sunshine a
variable termed ‘weather’ could be introduced which combined both
factors in an acceptable fashion. It is never very clear in practice just
how large the errors caused by multicollinearity effects will be. The
safest approach is to use the fitted regression model on new data and
examine the forecast errors.
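The inflation of estimator variance under near-collinearity is easy to see by simulation. The sketch below, a hypothetical illustration and not an example from the text, repeatedly fits y_t = β_0 + β_1 x_{1t} + β_2 x_{2t} + e_t, controlling the correlation between the two regressors, and compares the spread of the estimates of β_1.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_spread(rho, n_sims=300, n=50):
    """Standard deviation of the estimated beta_1 when corr(x1, x2) ~ rho."""
    estimates = []
    for _ in range(n_sims):
        x1 = rng.normal(size=n)
        # x2 correlated with x1 at roughly the level rho
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
        y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        estimates.append(beta[1])
    return float(np.std(estimates))

# The spread of beta_1 is much larger when the regressors nearly coincide.
print(slope_spread(0.0), slope_spread(0.99))
```

With correlation 0.99 the variance of the estimator is inflated by roughly a factor 1/(1 − 0.99²), which is why combining the two variables into one, as suggested above, can be preferable.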
References
Allen, D. M. (1971). Mean square error of prediction as a criterion for selecting
variables. Technometrics, 13, 469–475.
Ball, R. J. and Burns, T. (1968). An econometric approach to short-run analysis of
the UK economy 1965–66. Op. Res. Quarterly, 19, 225–256.
Beale, E. M. L. (1970). Note on procedures of variable selection in multiple
regression. Technometrics, 12, 909–914.
Draper, N. R. and Smith, H. (1966). Applied Regression Analysis. John Wiley and
Sons, New York.
Farmer, E. D. (1964). A method of prediction for non-stationary processes.
Proceedings of the Second Congress of the Intern. Fed. Auto. Control. Butterworths,
London.
Ivakhnenko, A. G. and Lapa, V. G. (1967). Cybernetics and Forecasting
Techniques. Elsevier, New York.
Robinson, C. (1971). Business Forecasting. Thomas Nelson and Sons, London.
Singleton, P. (1971). Multiple Regression Methods in Forecasting. M.Phil. thesis,
London University.
Spencer, M. H., Clark, C. and Hoguet, P. W. (1961). Business and Economic
Forecasting: an Econometric Approach. Irwin, Homewood, Illinois.
Theil, H. (1965). Applied Economic Forecasting. North-Holland Publishing Co.,
Amsterdam.
Woy, J. B. (1965). Business Trends and Forecasting: Information Sources. Gale
Research Co., Detroit.
Chapter 7
Stochastic models
7.1 Introduction
We consider in this chapter the use of stochastic models in forecasting,
that is to say, the use of models in which the random element plays the
dominant part in determining the structure of the model. In previous
models the chance element e_t was simply an 'error' added separately
at each time moment to a strictly deterministic function. Appendix A
brings together some of the terminology and definitions of stochastic
theory as relevant to forecasting. The reader should, however, be able
to follow most of the results of this chapter at an intuitive level.
Consider now the model
x_t = φx_{t-1} + e_t
Table 7.1 Simulation of the autoregressive process x_t = φx_{t-1} + e_t, with φ = 0.8 and x_0 = 1.0

t     φx_{t-1}    e_t       x_t
0                           1.0
1      0.8       -0.465     0.335
2      0.268     -2.120    -1.852
3     -1.482     -2.748    -4.230
4     -3.384      0.308    -3.076
5     -2.461     -1.178    -3.639
6     -2.911      0.063    -2.848
7     -2.278      0.377    -1.901
8     -1.521     -1.412    -2.933
9     -2.346      1.322    -1.024
10    -0.820      0.204    -0.616
Table 7.2 Simulation of the moving average process x_t = e_t - θe_{t-1}, with θ = 0.5

t     e_t       θe_{t-1}    x_t = e_t - θe_{t-1}
0     1.563
1     1.085     0.7815      0.303
2    -0.345     0.5425     -0.887
3    -0.592    -0.1725     -0.419
4     0.399    -0.2960      0.798
5     0.568     0.1995      0.368
6    -0.377     0.2840     -0.661
7    -0.697    -0.1885     -0.508
8    -2.457    -0.3485     -2.108
9     0.410    -1.2285      1.638
10   -0.225     0.205      -0.430
By repeated substitution in the autoregressive model we may write
x_t = e_t + φe_{t-1} + φ²e_{t-2} + ...
As another example of a stochastic process consider the model
x_t = e_t - θe_{t-1}
The sequence of x values in this model can be imagined as being a
weighted moving average (or rather moving sum), with weights (1, -θ),
of a sequence of e. This in fact is an example of what is called a moving
average process. Table 7.2 shows an example of this type of process,
again illustrated by showing a simulation.
It is seen that both autoregressive and moving average models can be
expressed in terms of a weighted moving sum (either infinite or finite)
of the sequence of e.
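The two simulations of Tables 7.1 and 7.2 are easy to reproduce. The sketch below, written on the assumption that the tables used φ = 0.8, θ = 0.5 and x_0 = 1.0, regenerates the first table rows from the tabulated e_t values.

```python
def simulate_ar1(phi, e, x0):
    """Autoregressive process x_t = phi * x_{t-1} + e_t (as in Table 7.1)."""
    x, prev = [], x0
    for e_t in e:
        prev = phi * prev + e_t
        x.append(prev)
    return x

def simulate_ma1(theta, e):
    """Moving average process x_t = e_t - theta * e_{t-1} (as in Table 7.2)."""
    return [e[t] - theta * e[t - 1] for t in range(1, len(e))]

# First steps of Table 7.1 (phi = 0.8, x_0 = 1.0) and Table 7.2 (theta = 0.5);
# the tables show the same values rounded to three places.
print(simulate_ar1(0.8, [-0.465, -2.120], 1.0))
print(simulate_ma1(0.5, [1.563, 1.085, -0.345]))
```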
x_3 = e_3 + φe_2 + φ²e_1
A system for which we can carry out the above multiplications and
additions is called a linear system and the models so defined are called
linear models. Table 7.3(b) shows how the output sequence for this
particular system is obtained from the input sequence for t = 1, 2,
3, . . . . It will be seen that for t tending to infinity we obtain for the
output sequence the simple autoregressive model.
We have described a physical system in this example. There are many
situations in business and economics where similar phenomena may be
expected. The e_t quantity is the total effect of the unpredictable
factors in the situation at time t. The quantity e_t will not only affect x_t
but will also influence x_{t+1}, x_{t+2}, etc., though the influence will reduce
as time passes, a reduction in our example governed by the parameter φ.
Consider now the problem in our example of forecasting x4 and x5
from the preceding data. The output at times 4 and 5 will depend on
two types of input. Firstly, it will depend on the inputs at times 1, 2
and 3, whose effects we have already observed and, secondly, on the
future inputs e4 and e5. We cannot attempt to forecast any particular
numerical values for these inputs since they are independent of any
existing information. The best we can do is to use their expected values,
which as before we will take as zero. Thus we have to replace the
known structure of Table 7.3(b) with a similar table, Table 7.3(c),
which allows for our ignorance of future events. Thus our forecasts of
x4 and x5 will be
x̂_4 = φe_3 + φ²e_2 + φ³e_1
and
x̂_5 = φ²e_3 + φ³e_2 + φ⁴e_1
In terms of the observations x_1, x_2, x_3 these will be
x̂_4 = φx_3
Table 7.3 Inputs and outputs of a simple linear system

(a) Impulse response
Time     1    2    3    4    5    6
Input    1    0    0    0    0    0
Output   1    φ    φ²   φ³   φ⁴   φ⁵

(b) General input
Input times t    1      2      3      4      5

Output time T                                       Total output at time T
1                e_1                                e_1
2                φe_1   e_2                         e_2 + φe_1
3                φ²e_1  φe_2   e_3                  e_3 + φe_2 + φ²e_1
4                φ³e_1  φ²e_2  φe_3   e_4           e_4 + φe_3 + φ²e_2 + φ³e_1
5                φ⁴e_1  φ³e_2  φ²e_3  φe_4  e_5     e_5 + φe_4 + φ²e_3 + φ³e_2 + φ⁴e_1

(c) Forecasts made at time 3, the future inputs e_4 and e_5 replaced by zero
4                φ³e_1  φ²e_2  φe_3   0             x̂_4 = φe_3 + φ²e_2 + φ³e_1
5                φ⁴e_1  φ³e_2  φ²e_3  0     0       x̂_5 = φ²e_3 + φ³e_2 + φ⁴e_1
and
x̂_5 = φ²x_3
The forecast errors are
ê_4 = e_4
and
ê_5 = e_5 + φe_4
as will be seen by comparing parts (b) and (c) of Table 7.3. The error
variances, which here equal the mean square error as the expectations
are zero, are thus
E(ê_4²) = σ²
E(ê_5²) = (1 + φ²)σ²
again using the independence of the e values.
It should be clear from the above special case that for the
autoregressive model the forecast with lead time h is
x̂_{t+h} = φ^h e_t + φ^{h+1} e_{t-1} + φ^{h+2} e_{t-2} + ...
which is
x̂_{t+h} = φ^h x_t
We may also derive this formula intuitively from the basic form of the
model. The future observation xt+ 1 can be written as
x_{t+1} = φx_t + e_{t+1}
At time t, x_{t+1} is not known but can be forecast by x̂_{t+1}; e_{t+2} again is
forecast as zero. Thus
x̂_{t+2} = φx̂_{t+1}
and so
x̂_{t+2} = φ²x_t
Thus, in general,
x̂_{t+h} = φ^h x_t
The forecast error is
ê_{t+h} = e_{t+h} + φe_{t+h-1} + ... + φ^{h-1}e_{t+1}
since the error consists of the part of the model of x_{t+h} involving the
future unpredictable e. The error has zero expectation so that the
forecast is unbiased. The error variance is given by
Var(ê_{t+h}) = σ²(1 + φ² + ... + φ^{2h-2})
and this can be used to calculate the necessary confidence limits for the
forecast. If o2 is not known, we must consider how to estimate it. This
can be done by noting that if we forecast only one step ahead, h = 1,
then
ê_{t+1} = e_{t+1}
If we go back through our past data and use the forecasting formula
x̂_{t+1} = φx_t
then the set of forecast errors obtained will in fact be the actual
random variables, e. The variance of the e values is thus σ² and an
estimate of σ² is provided by
σ̂² = (1/t) Σ_{i=1}^{t} e_i²
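Applied to the data of Table 7.1 with φ = 0.8, this estimate can be computed directly; the one-step errors recovered from the x series are, apart from rounding in the table, the simulated e_t themselves. A minimal sketch:

```python
def sigma2_estimate(x, phi):
    """sigma^2 estimated from the one-step errors e_i = x_i - phi * x_{i-1}."""
    errors = [x[i] - phi * x[i - 1] for i in range(1, len(x))]
    return sum(e * e for e in errors) / len(errors)

# x_0, ..., x_10 from Table 7.1
x = [1.0, 0.335, -1.852, -4.230, -3.076, -3.639,
     -2.848, -1.901, -2.933, -1.024, -0.616]
print(round(x[1] - 0.8 * x[0], 3))        # recovers e_1 = -0.465
print(round(sigma2_estimate(x, 0.8), 2))  # estimate of sigma^2
```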
[Table 7.4 The data of Table 7.1 laid out, together with its lagged values x_{t-1}, for the fitting of φ.]
used for this purpose. The value of φ is thus chosen to minimize the
sum of squared residuals, i.e.
Σ_{i=2}^{t} e_i² = Σ_{i=2}^{t} (x_i - φx_{i-1})²
giving
φ̂ = Σ_{i=2}^{t} x_i x_{i-1} / Σ_{i=2}^{t} x²_{i-1}
If we use discounted least squares with discounting factor a, the estimate based on data up to time t is
φ̂_t = Σ_{r=0}^{t-2} a^r x_{t-r} x_{t-r-1} / Σ_{r=0}^{t-2} a^r x²_{t-r-1} = N_t / D_t
The numerator and denominator can be updated as each new observation arrives, since
Σ_{r=0}^{t-1} a^r x_{t+1-r} x_{t-r} = x_{t+1} x_t + a Σ_{r=0}^{t-2} a^r x_{t-r} x_{t-r-1}
i.e.
N_{t+1} = x_{t+1} x_t + aN_t
and similarly
Σ_{r=0}^{t-1} a^r x²_{t-r} = x_t² + a Σ_{r=0}^{t-2} a^r x²_{t-r-1}
so
D_{t+1} = x_t² + aD_t
Table 7.5 illustrates the use of this method on the same data as that
in Table 7.4. It will be seen that the initial estimates of 0f are wild, but
soon settle down to vary about the true value used in the simulation.
Similarly, the forecast errors soon become fairly close to the e values
used in the simulations of Table 7.1. We have taken the discounting
factor a as 0.8 to illustrate the method, but clearly we are dealing with
simulated data which we therefore know to come from a globally valid
Table 7.5 Forecasting the data of Table 7.4: φ estimated using discounted least squares, a = 0.8

x_t       x_{t-1}    N_t      D_t      φ̂_t
 0.335
-1.852     0.335    -0.620    0.112   -5.528
-4.230    -1.852     7.338    3.520    2.085
(and so on for the remaining observations, with columns for the forecast φ̂_t x_t and its error)
model. Thus a = 1 would be the correct value to use; it would give the
best estimator of φ and the best forecasts.
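The updating formulae make the discounted estimate cheap to carry along with the data. The sketch below, an illustration of the recursion rather than a general-purpose routine, reproduces the N_t and D_t columns of Table 7.5 for the data of Table 7.1.

```python
def discounted_phi(x, a):
    """phi estimated by discounted least squares via the recursions
    N_{t+1} = x_{t+1} x_t + a N_t  and  D_{t+1} = x_t^2 + a D_t."""
    N = D = 0.0
    estimates = []
    for t in range(1, len(x)):
        N = x[t] * x[t - 1] + a * N
        D = x[t - 1] ** 2 + a * D
        estimates.append(N / D)
    return estimates

x = [0.335, -1.852, -4.230, -3.076, -3.639,
     -2.848, -1.901, -2.933, -1.024, -0.616]
# The first estimates are wild, then they settle down near the true value.
print([round(p, 3) for p in discounted_phi(x, 0.8)[:3]])
```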
In this section we have concentrated on developing some intuitive
ideas about forecasting stochastic processes. To progress further, we
need a method for justifying these intuitive methods.
The approach is to choose the forecast x̂_{t,h} of x_{t+h} so that its mean
square error is as small as possible. The basic result that we will use is that the x̂_{t,h}
which minimizes the mean square error is the expected value of the
future x_{t+h}, given that x_1, x_2, ..., x_t are already known. This is
referred to as the 'conditional expectation' and is denoted by
E(x_{t+h} | x_1, x_2, ..., x_t)
This result is derived in Appendix B. To see how to use this result, let
us look again at the moving average model of the last section, namely
x_t = e_t - θe_{t-1}
For a forecast one step ahead the conditional expectation gives
x̂_t = E(x_{t+1} | x_t, x_{t-1}, ...)
    = E(x_{t+1} | e_t, e_{t-1}, ...)
    = E(e_{t+1} - θe_t | e_t, e_{t-1}, ...)
    = E(e_{t+1}) - θe_t
The future e_{t+1} has zero expectation and E(e_t | e_t) is the expectation
of a constant, e_t. Thus x̂_t = -θe_t and we obtain a formula for the
forecast in terms of e_t. To obtain a more practical forecast, we first
examine the prediction error
ê_{t+1} = x_{t+1} - x̂_t
which is simply e_{t+1}, so that each observed one-step error reproduces
the corresponding e. The forecast
x̂_t = -θe_t
is thus also expressible in the recurrence form
x̂_t = -θ(x_t - x̂_{t-1})
If we seek to express x̂_t directly in terms of the data, we may
repeatedly substitute for x̂_{t-1}, x̂_{t-2}, ... in the last expression,
giving
x̂_t = -θx_t - θ²x_{t-1} - θ³x_{t-2} - ...
It follows from this last expression that to make sense of our forecast
we must require that θ lies in the interval (-1, 1). Finally, as ê_t = e_t the
minimum mean square error is in fact o2. Table 7.6 shows a set of data
and forecasts for this model, in which e0 has been taken as zero to
provide a starting value.
Suppose now that we wish to forecast two time units into the future.
Table 7.6 Forecasting the data of Table 7.2 with θ = 0.5

t     x_t      x̂_{t-1}    e_t      x̂_t
0                                   0
1     0.303    0           0.303   -0.1515
2    -0.887   -0.151      -0.736    0.368
3    -0.419    0.368      -0.787    0.393
4     0.798    0.393       0.405   -0.202
5     0.368   -0.202       0.570   -0.285
6    -0.661   -0.285      -0.376    0.188
7    -0.508    0.188      -0.696    0.348
8    -2.108    0.348      -2.456    1.228
9     1.638    1.228       0.410   -0.205
10   -0.430   -0.205      -0.225    0.112
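The recurrence x̂_t = -θ(x_t - x̂_{t-1}) makes these forecasts a one-line update. A sketch reproducing the first rows of Table 7.6 (θ = 0.5, starting value zero):

```python
def ma1_forecasts(x, theta, start=0.0):
    """One-step forecasts for x_t = e_t - theta * e_{t-1}:
    e_t = x_t - xhat_{t-1}, then xhat_t = -theta * e_t."""
    xhat, rows = start, []
    for x_t in x:
        e_t = x_t - xhat          # observed one-step forecast error
        xhat = -theta * e_t       # forecast of the next observation
        rows.append((e_t, xhat))
    return rows

for e_t, xhat in ma1_forecasts([0.303, -0.887, -0.419], 0.5):
    print(round(e_t, 4), round(xhat, 4))
```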
For the general moving average process of order q,
x_t = e_t - θ_1 e_{t-1} - ... - θ_q e_{t-q}
the same argument gives
x̂_{t,1} = -θ_1 e_t - θ_2 e_{t-1} - ... - θ_q e_{t+1-q}
and
x̂_{t,2} = -θ_2 e_t - θ_3 e_{t-1} - ... - θ_q e_{t+2-q}
where the e values are still the one-step-ahead forecast errors. Comparing
the forms of x̂_{t,1} and x̂_{t,2}, it is seen that if we are forecasting both
x_{t+1} and x_{t+2} at each stage we can simply update our forecast using
x̂_{t+1,1} = -θ_1 e_{t+1} + x̂_{t,2}
As there are only future e values on the right-hand side, the conditional
expectation of x_{t+q+k} is simply zero. Thus we have
x̂_{t,q+k} = 0   (k ≥ 1)
The forecast error for forecasting h (h ≤ q) steps into the future is
ê_{t+h} = x_{t+h} - x̂_{t,h}
       = e_{t+h} - θ_1 e_{t+h-1} - ... - θ_{h-1} e_{t+1}
i.e. it equals the 'future' part of the model for x_{t+h}. This has zero
expectation and variance
σ²_{t,h} = σ²(1 + θ_1² + ... + θ²_{h-1})
The general autoregressive process is
x_t = φ_1 x_{t-1} + φ_2 x_{t-2} + ... + φ_p x_{t-p} + e_t
where φ_1, φ_2, ..., φ_p are constants and the model is denoted AR(p).
To investigate the forecasting of such a process we start with the
simplest case, the first-order autoregressive process, which we have
already studied in some detail in section 7.1. The model is
x_t = φ_1 x_{t-1} + e_t
We have already shown in section 7.2 that the minimum mean square
error forecast of x_{t+1} is
x̂_{t,1} = φ_1 x_t
The forecast error is
ê_{t+1} = x_{t+1} - x̂_{t,1}
       = (φ_1 x_t + e_{t+1}) - φ_1 x_t
       = e_{t+1}
So the forecast error is the future e_{t+1}, and the mean square error is
thus σ².
To forecast two steps into the future, we obtained in section 7.1
x̂_{t,2} = φ_1² x_t
As a somewhat different way of getting this result we could write
x_{t+2} = φ_1 x_{t+1} + e_{t+2} = φ_1² x_t + φ_1 e_{t+1} + e_{t+2}
so that the forecast error is ê_{t+2} = φ_1 e_{t+1} + e_{t+2}. So
Var(ê_{t+2}) = σ²_{t,2} = (φ_1² + 1)σ²
Hence, assuming normality, the 95 per cent prediction interval for x_{t+2}
is
φ_1² x_t ± 1.96 √((φ_1² + 1)σ²)
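A small helper, sketched under the assumption of normal errors with known σ², gives the point forecast and the 95 per cent limits for any lead time:

```python
import math

def ar1_interval(x_t, phi, sigma2, h):
    """95% prediction interval for x_{t+h} under x_t = phi x_{t-1} + e_t.
    Error variance: sigma^2 (1 + phi^2 + ... + phi^(2h-2))."""
    point = phi ** h * x_t
    var = sigma2 * sum(phi ** (2 * k) for k in range(h))
    half_width = 1.96 * math.sqrt(var)
    return point - half_width, point + half_width

print(ar1_interval(1.0, 0.8, 1.0, 1))   # 0.8 with half-width 1.96
print(ar1_interval(1.0, 0.8, 1.0, 2))   # 0.64 with half-width 1.96 sqrt(1.64)
```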
For the general pth-order autoregressive process we have
x_{t+1} = φ_1 x_t + φ_2 x_{t-1} + ... + φ_p x_{t+1-p} + e_{t+1}
and so, as
x̂_{t,1} = E(x_{t+1} | x_t, ..., x_1)
we have
x̂_{t,1} = φ_1 x_t + φ_2 x_{t-1} + ... + φ_p x_{t+1-p}
Similarly
x̂_{t,2} = E(φ_1 x_{t+1} + φ_2 x_t + φ_3 x_{t-1} + ... + φ_p x_{t+2-p} + e_{t+2} | x_t, ..., x_1)
so
x̂_{t,2} = φ_1 x̂_{t,1} + φ_2 x_t + φ_3 x_{t-1} + ... + φ_p x_{t+2-p}
By a similar argument
x̂_{t,3} = φ_1 x̂_{t,2} + φ_2 x̂_{t,1} + φ_3 x_t + ... + φ_p x_{t+3-p}
Thus in predicting a future value x_{t+h} we first write down the model
for x_{t+h} and delete the term e_{t+h}. When an x in the model refers to a
time beyond t, it is replaced by its forecast. For the errors, with p = 2 for
example, forecasting three steps ahead gives
ê_{t+3} = e_{t+3} + φ_1 ê_{t+2} + φ_2 ê_{t+1}
and so
ê_{t+3} = e_{t+3} + φ_1 e_{t+2} + φ_1² e_{t+1} + φ_2 e_{t+1}
while for two steps ahead
σ²_{t,2} = σ² + φ_1² Var(ê_{t+1}) = σ²(1 + φ_1²)
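The rule just described (write down the model for x_{t+h}, delete the future e terms and replace any future x by its forecast) is naturally recursive. A minimal sketch for a general AR(p), assuming at least p observations are available:

```python
def ar_forecasts(x, phis, H):
    """Forecasts x_{t,1}, ..., x_{t,H} for an AR(p) with coefficients
    phis = [phi_1, ..., phi_p], given the observed series x."""
    hist = list(x)                       # known values, then forecasts
    for _ in range(H):
        # future e's are set to zero; future x's use their own forecasts
        nxt = sum(phi * hist[-i - 1] for i, phi in enumerate(phis))
        hist.append(nxt)
    return hist[len(x):]

print(ar_forecasts([1.0], [0.8], 3))              # AR(1): phi^h decay
print(ar_forecasts([1.0, 2.0], [0.5, -0.3], 2))   # an AR(2) example
```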
A model may contain both autoregressive and moving average terms;
consider, for example,
x_t = φx_{t-1} + e_t - θe_{t-1}
Here the one-step forecast is
x̂_t = φx_t - θe_t
which, using e_t = x_t - x̂_{t-1}, may be written
x̂_t = (φ - θ)x_t + θx̂_{t-1}
If φ = 1 and 0 < θ < 1, this is identical to the forecasting formula of
exponential smoothing; thus exponential smoothing is the minimum
mean square error forecasting method for the model
x_t = x_{t-1} + e_t - θe_{t-1}
Here each value of x_t equals the previous value x_{t-1} plus a random
term from a simple moving average stochastic process.
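The equivalence is easy to check numerically: with φ = 1 the recurrence x̂_t = (1 - θ)x_t + θx̂_{t-1} is exactly exponential smoothing with smoothing constant 1 - θ. A sketch:

```python
def smooth_forecast(x, theta, start=0.0):
    """xhat_t = (1 - theta) x_t + theta xhat_{t-1}: exponential smoothing,
    the minimum MSE forecast for x_t = x_{t-1} + e_t - theta e_{t-1}."""
    xhat = start
    for x_t in x:
        xhat = (1.0 - theta) * x_t + theta * xhat
    return xhat

# On a level series the forecast converges to the level.
print(round(smooth_forecast([5.0] * 50, 0.6), 6))   # 5.0
```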
Consider now a prediction for a lead time of two in the original
model
x̂_{t,2} = E(φx_{t+1} + e_{t+2} - θe_{t+1} | x_t, x_{t-1}, ...)
       = φE(x_{t+1} | x_t, x_{t-1}, ...)
       = φx̂_{t,1}
This same calculation obviously applies to any lead time, k say. Thus
x̂_{t,k} = φ^{k-1} x̂_{t,1}
(b) x_{t+2} = 0.5x_{t+1} - 0.3x_t + e_{t+2} - 0.8e_{t+1}
so
x̂_{t,2} = 0.5x̂_{t,1} - 0.3x_t
(c) x_{t+h} = 0.5x_{t+h-1} - 0.3x_{t+h-2} + e_{t+h} - 0.8e_{t+h-1}   (h ≥ 3)
so
x̂_{t,h} = 0.5x̂_{t,h-1} - 0.3x̂_{t,h-2}
Using the extreme simplicity of these relations, we may also derive
expressions for the forecast error variance for different lead times. Thus
subtracting the forecast formulae from the models gives
(a) ê_{t+1} = e_{t+1}, for a lead time of one,
so
σ²_{t,1} = σ²
(b) ê_{t+2} = -0.3e_{t+1} + e_{t+2}
so
σ²_{t,2} = 1.09σ²
(c) For a lead time of h
ê_{t+h} = 0.5ê_{t+h-1} - 0.3ê_{t+h-2} + e_{t+h} - 0.8e_{t+h-1}
Here ê_{t+h-1} is not independent of ê_{t+h-2} and so we cannot
simply relate σ²_{t,h} to σ²_{t,h-1} and σ²_{t,h-2}. By repeated use of this
relation we may express ê_{t+3}, ê_{t+4}, etc., in terms of the e values and hence find σ²_{t,3}, σ²_{t,4}, etc.
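One systematic way of carrying out this repeated substitution, not spelled out in the text, is to express ê_{t+h} as a weighted sum of the future e values; the weights obey the same recurrence as the model. A sketch for the example x_t = 0.5x_{t-1} - 0.3x_{t-2} + e_t - 0.8e_{t-1}:

```python
def psi_weights(phi1, phi2, theta1, H):
    """Weights expressing the forecast error in terms of future e's:
    psi_0 = 1, psi_1 = phi1 - theta1, psi_j = phi1 psi_{j-1} + phi2 psi_{j-2}."""
    psi = [1.0, phi1 - theta1]
    for _ in range(2, H):
        psi.append(phi1 * psi[-1] + phi2 * psi[-2])
    return psi[:H]

def error_variance(phi1, phi2, theta1, h, sigma2=1.0):
    # sigma_{t,h}^2 = sigma^2 (psi_0^2 + ... + psi_{h-1}^2)
    return sigma2 * sum(p * p for p in psi_weights(phi1, phi2, theta1, h))

print(error_variance(0.5, -0.3, 0.8, 1))   # 1.0, i.e. sigma^2
print(error_variance(0.5, -0.3, 0.8, 2))   # 1.09, agreeing with (b) above
```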
(i) If E(x_t) = μ, for all time, then all the previous stochastic models
can be straightforwardly reformulated with (x_t - μ), (x_{t-1} - μ), etc.,
replacing x_t, x_{t-1}, .... Estimation of μ is not very difficult, given a
good quantity of data.
(ii) If a constant trend occurs with an autoregressive model, it can be
allowed for by the addition of a constant term in the regression form.
Thus, for example,
x_t = φ_1 x_{t-1} + e_t
would include a slope β if it was modified to
x_t = β + φ_1 x_{t-1} + e_t
x_t = μ_t + e_t
where μ_t is the mean of the process at time t. We dealt with this model in section 4.2 by assuming that μ_t wandered
in an unknown, undefined, fashion. We now formulate a model that
describes how μ_t changes. If there is no trend, we use
μ_t = μ_{t-1} + γ_t
where γ_t is a further random element, independent of the e_t.
References
Anderson, O. D. (1975). Time Series Analysis and Forecasting: the Box–Jenkins
Approach. Butterworths, London.
Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis, Forecasting and
Control. Holden-Day, Inc., San Francisco.
Grunwald, H. (1965). The correlation theory for stationary stochastic processes
applied to exponential smoothing. Statistica Neerlandica, 19, 129–138.
Harrison, P. J. (1967). Exponential smoothing and short-term sales forecasting.
Manag. Sci., 13, No. 11, 821–842.
Harrison, P. J. and Stevens, C. F. (1971). A Bayesian approach to short-term
forecasting. Op. Res. Quarterly, 22, No. 4, 341–362.
Nelson, C. R. (1973). Applied Time Series Analysis for Managerial Forecasting.
Holden-Day, Inc., San Francisco.
Chapter 8
Seasonal models
separately:
(a) If applied to random data the moving average M reduces the
variance but leaves the mean unaltered. Thus it can be used to
smooth out the random element in data. Unfortunately, a side
effect of this smoothing is a high autocorrelation between values of
the moving averages. Thus if K = 12, the correlation between
adjacent values is 11/12, between next but one values is 10/12, and
so on. High autocorrelations such as these show themselves as a
smooth wandering movement in the moving averages, which can be
misinterpreted as a genuine oscillation in the series. Appendix A
gives a brief discussion of the idea of autocorrelation.
A further aspect of this effect is that the smooth ‘oscillation’
obtained can be emphasized even more if the sequence of moving
averages is again smoothed using another moving average. There is,
in fact, a theorem due to Slutsky (1937) that shows that, under
certain conditions, repeated application of moving averages can
eventually lead to a sequence that follows a pure sine wave. This
result is the exact opposite of what one is usually trying to do
when using moving averages. However, this is a ‘long run’ result and
is mentioned simply as a caution and a reminder that one needs to
study one’s methods carefully before assuming that all is well. In
circumstances where one is in doubt about effects such as the
above, it is advisable to try out one’s methods on simulated data
which have known properties and then examine the results.
(b) As is intuitively reasonable, if M is applied to a perfect trend
T = a + βt, the value taken by MT is the value of T at the centre of
the moving average. Thus, apart from losing the end values, M leaves
trend data unaltered. If our trend has superimposed error, the same
procedure is still reasonable; M simply smooths out this error.
There is, however, another valuable way of looking at this. Suppose
we fitted the model a + βt by least squares to just the K
observations used in M. The value of the fitted line at the central
point will in fact be the moving average. So the moving average
operation corresponds to fitting the trend to a moving section of
data and finding the mid-value of the fitted curve. This can clearly
be extended by fitting not just linear trends but quadratics, cubics,
etc. For example, if we fitted either a quadratic or a cubic to a
moving section of five observations the fitted value at the middle
would be
(1/35)(-3x'_{-2} + 12x'_{-1} + 17x'_0 + 12x'_1 - 3x'_2)
[Figure 8.1 Analysing seasonal data. (a) Plot of four years of quarterly data: 6, 11, 17, 10; 4, 10, 18, 6; 1, 9, 17, 5; 7, 10, 16, 7. (b) Tier chart of the same data by quarter.]
(a) This is a simple plot of the data. Joining the points by line
segments often helps to clarify the seasonal pattern.
(b) A tier chart helps to clarify the seasonal pattern and the extent of
its variation from year to year.
(c) A centred moving average will show up any major trends. In the
example there is no clear trend and the magnitude of the changes
is small, suggesting that the level of the series is remaining fairly
steady. The very nature of the moving average will produce a fairly
smooth curve with a natural wandering movement even when the
underlying mean is constant, as in this example.
(d) A simple seasonal ratio, detailed calculations for which are given in
later sections, is plotted against time. Any change in amplitude or
in the form of the seasonal pattern, such as illustrated in Figure
8.1(c) and (d), will show up in this plot.
(e) These seasonal ratios are averaged to give the estimated pattern of
the seasonal variation.
(f) The irregular component is estimated by calculating the residuals.
We have done this here by subtracting from the data the smoothed
values, found by multiplying the central moving average by the
rough seasonal ratio from (e). The plot of these rough residuals
against time will indicate if the error variance is constant; with
enough data a histogram of the residual distribution could be
produced.
(g) If our rough analysis has failed to cope with the seasonal variation,
then the residuals will still contain a seasonal element. This will
show up when the residuals are plotted on a tier chart. No such
structure is evident from (g).
where μ_ij is the 'seasonal mean' for season j in period i. If the seasonal
mean is the same or approximately the same from period to period, we
could write this as
x_ij = μ_j + e_ij   (j = 1, 2, ..., r; i = 1, 2, ..., t)
For each season j this corresponds exactly to the constant mean model
of Chapter 4. Hence we can use the formulae of that chapter to provide
estimates of μ_j, here denoted by μ̂_{t,j}. Assuming a global model, we have
μ̂_{t,j} = Σ_{i=1}^{t} x_ij / t = x̄_{.j}
The layout of the data and the calculation of μ̂_{t,j} for the global model
is illustrated in Table 8.2. The forecast of x_{t+1,j} is simply μ̂_{t,j}.
Table 8.2 Layout of seasonal data

            Season j
            1       2       ...     r
Period 1    x_11    x_12    ...     x_1r
       2    x_21    x_22    ...     x_2r
       ...
       t    x_t1    x_t2    ...     x_tr
Totals      X_.1    X_.2    ...     X_.r
μ̂_{t,j}     x̄_.1    x̄_.2    ...     x̄_.r
It is clear that the forecasts of each juj are based only on data from
that period, so we are really doing separate forecasting exercises using
the constant mean model for r different sets of data. For the global
model this is obviously the correct thing to do.
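For the global model the whole calculation is a set of column averages. A sketch using the quarterly data of Figure 8.1:

```python
def seasonal_means(data):
    """mu_hat_{t,j}: the mean of each season's column; data is a list of
    periods (years), each a list of r seasonal observations."""
    t = len(data)
    r = len(data[0])
    return [sum(period[j] for period in data) / t for j in range(r)]

data = [[6, 11, 17, 10], [4, 10, 18, 6], [1, 9, 17, 5], [7, 10, 16, 7]]
# These means are also the forecasts for the four quarters of year 5.
print(seasonal_means(data))   # [4.5, 10.0, 17.0, 7.0]
```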
Suppose that we wish to refer to the seasonal means μ_j as being
relative to a constant mean μ for the whole time of observation. We
could do this in two particularly simple ways, either by using an
additive form
μ_j = μ + φ_j
or a multiplicative form
μ_j = μθ_j
So in our notation the trend term T is now the constant mean μ, and the
seasonal factors are standardized by the conditions
Σ_{j=1}^{r} φ_j = 0 and Σ_{j=1}^{r} θ_j = r
With these conditions the average over any period i is
x̄_{i.} = μ + (1/r) Σ_{j=1}^{r} φ_j + ē_{i.} = μ + ē_{i.}
so μ is the underlying mean for the average of the data from any
period i.
Let us assume for simplicity that there are exactly tr observations, as
illustrated in Table 8.3.
[Table 8.3 Example of the analysis of seasonal data: the quarterly observations laid out by period and season, with their period averages and season averages.]
μ̂_t = x̄_{..} = Σ_{i=1}^{t} x̄_{i.} / t
which is the 'grand average' of all our data. If instead of averaging over
seasons we average over periods for each season, we obtain μ̂_{t,j} once
again. In terms of our two forms of model this is
μ̂_{t,j} = μ + φ_j + ē_{.j}
and
μ̂_{t,j} = μθ_j + ē_{.j}
where ē_{.j} denotes the average of e_ij over i. It is now natural to estimate
φ_j and θ_j by
φ̂_j = μ̂_{t,j} - μ̂_t
and
θ̂_j = μ̂_{t,j} / μ̂_t
We thus have reasonable ways of breaking our estimates of μ_j into
components μ̂ and φ̂_j or μ̂ and θ̂_j. A little calculation will show that
Σ_{j=1}^{r} φ̂_j = 0
and
Σ_{j=1}^{r} θ̂_j = r
so the multiplicative seasonal indices are numbers above and below one,
indicating deviations above and below the mean. The forecast of the
value of any further observation x_{T,j} in season j is
x̂_{T,j} = μ̂_t + φ̂_j
or, equivalently,
x̂_{T,j} = μ̂_t θ̂_j
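Continuing the same quarterly example, the additive and multiplicative indices come straight from the season and grand averages; the sketch below also checks the two standardizing conditions:

```python
def seasonal_indices(data):
    """phi_hat_j = mu_hat_j - mu_hat and theta_hat_j = mu_hat_j / mu_hat,
    so that the phi's sum to 0 and the theta's sum to r."""
    t, r = len(data), len(data[0])
    mu_j = [sum(p[j] for p in data) / t for j in range(r)]   # season means
    mu = sum(mu_j) / r                                       # grand average
    return mu, [m - mu for m in mu_j], [m / mu for m in mu_j]

data = [[6, 11, 17, 10], [4, 10, 18, 6], [1, 9, 17, 5], [7, 10, 16, 7]]
mu, phis, thetas = seasonal_indices(data)
print(mu, [round(p, 3) for p in phis], [round(q, 3) for q in thetas])
```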
The structure of the above calculation is presented in Table 8.4. If
the assumption of exactly tr observations does not hold, it is clearly no
use basing a period average on less than r seasons' data, since not all
values of φ_j or θ_j will be included. In this case μ̂_t could be based on an
average of the complete periods only.
[Table 8.4 The structure of the calculation: data for seasons 1, ..., r and periods 1, ..., t, with exponential weights attached to the periods; the column averages or weighted averages give μ̂_{t,1}, μ̂_{t,2}, ..., μ̂_{t,r} and μ̂_t.]
For a local model the simple averages are replaced by exponentially weighted averages with weights a^{t-i}, i = 1, ..., t.
This form has the advantage that if we define φ̂_j and θ̂_j as before, by
φ̂_{t,j} = μ̂_{t,j} - μ̂_t
and
θ̂_{t,j} = μ̂_{t,j} / μ̂_t
then all the results of section 8.2.1 still hold for the local model. We
also have, for large t, the recurrence relations
μ̂_{t,j} = (1 - a)x_{t,j} + aμ̂_{t-1,j}
and
μ̂_t = (1 - a)x̄_{t.} + aμ̂_{t-1}
The above equations provide an 'exact' type of solution together with a
recurrence form for large t.
Let us consider now the estimation of the parameters ju, 0y and 07- in
a purely intuitive fashion and with the object of expressing our
estimates in the form of recurrence relations, such that jlt can be
updated more frequently than once a year. We thus want something in
the form
New estimate = constant x new information + constant x old estimate
e_{t,j} = x_{t,j} - (μ̂_{(t,j-1)} + φ̂_{t-1,j})
or
e_{t,j} = x_{t,j} - μ̂_{(t,j-1)} θ̂_{t-1,j}
Then we may rewrite both the recurrence equations as
μ̂_{(t,j)} = μ̂_{(t,j-1)} + (1 - b)e_{t,j}
[Figure: the updating cycle, in which a new observation x_{t,j+1} gives the error from the previous forecast, which is in turn used to update the mean estimate.]
seasonal φ adds the same amount on to μ̂_t, whatever its value, and the
multiplicative situation where an increase in μ̂_t will increase the total
variation in the seasonal pattern represented by θ_j μ_t.
In the above we have assumed that the seasonal element, S = 6 or
S = (j), wanders in a slow and irregular fashion. There may, however, be
systematic change taking place in S. This possibility should be
examined by looking at the variation of S with time.
Applied to the trend values a moving average spanning a whole period gives
M(T_ij) = T_ij
Hence
M(x_ij) ≈ T_ij
and the sequence of values given by the moving averages approximately
follows the underlying trend at the centres of the moving averages. If
the model is multiplicative, we will get M(T_ij S_j), which does not
conveniently simplify unless T_ij is nearly constant over the time
spanned by the moving average. We might in this case convert to an
additive model by taking logs of the data and then converting back the
estimated value of T_ij obtained.
Before we can fit a trend curve we first need to get rid, at least
approximately, of the seasonal component. This can be done most
simply by using a period average in which the seasonal component is
summed out, as it is in the moving average. These averages can now be
fitted by an appropriate mathematical curve using the method of least
squares or any other appropriate method. We have already discussed
fitting straight lines in Chapter 5 and we will discuss fitting non-linear
curves in a later chapter. However, there is a complication that requires
a brief mention, and that is that the period averages do not lie on the
same curve as the original data. For example, if monthly data showing
slope β are used to get the averages for a sequence of years, these yearly
averages will increase by 12β from year to year. However, if the
original data, the averaged data and fitted lines are plotted together,
there should be no confusion in obtaining the correct estimates.
Both the above methods are open to criticism and the choice
between them depends somewhat on how appropriate these criticisms
are in one’s own particular circumstances. The main aspects to be
considered are as follows.
The main criticism of these curves is that they are empirical. They do
not have a specified mathematical structure. This fact has two
unfortunate consequences. Firstly, as we do not have a structure
defined, we cannot easily extrapolate into the future to obtain our
forecasts. Secondly, the lack of a model makes almost impossible an
examination of how well the method has determined the past trend.
The method really provides one definition of what is meant by the
trend.
Ŝ_ij = x_ij - T̂_ij
     = S_j + I_ij
To estimate S_j we must now reduce the effect of I_ij by averaging
the above quantities over i,
Ŝ_j = average over i of Ŝ_ij   (j = 1, ..., r)
The sequence of steps is thus:
average of the five observations. Now we have seen before that the
trend line goes through the mean of the observations, so the value of
the fitted line at the time corresponding to x'_0 is
(x'_{-2} + x'_{-1} + x'_0 + x'_1 + x'_2)/5
often denoted by
(1/5)[1, 1, 1]
making use of the symmetry and listing only the coefficients of x'. If
we had fitted either a quadratic or a cubic, the value of the fitted curve
at the midpoint would have been
(1/35)[-3, 12, 17]
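These weights can be checked directly: fit a cubic by least squares to five unit impulses in turn and read off the fitted value at the centre. The sketch below, using numpy's polynomial fitting, recovers the (1/35)[-3, 12, 17] weights.

```python
import numpy as np

t = np.arange(-2, 3)                       # five equally spaced times
weights = []
for i in range(5):
    y = np.zeros(5)
    y[i] = 1.0                             # unit impulse at position i
    coef = np.polyfit(t, y, 3)             # least squares cubic fit
    weights.append(np.polyval(coef, 0.0))  # fitted value at the centre

print(np.round(np.array(weights) * 35, 3))   # -3, 12, 17, 12, -3
```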
empirical trend curve over past data which has no natural way of
extrapolating into the future. It does, however, provide a means of
analysing past data as a preliminary to forecasting.
The natural formula for obtaining the Ŝ_ij is that given at the end of the
previous section, namely
Ŝ_{i,j} = bŜ_{i-1,j} + (1 - b)S̃_{i,j}
where the simplest form for S̃_{i,j} is
S̃_ij = x_ij - T̂_ij, for the additive model
The methods of (a) above and of the previous sections can be used
on an initial set of data to provide the values from which to start the
recurrence relations. The sequence of the recurrence relation is given in
Figure 8.4(a).
The forecast of a future value x_{r,s} will be
x̂_{r,s} = T̂_ij + Ŝ_is (+ model) or T̂_ij Ŝ_is (× model)
where T̂_ij is the latest trend estimate and Ŝ_is is the latest seasonal factor
for season s. If a study of past data can give good starting values for the
recurrence relations, it is advisable to keep the two forecasting
parameters a and b quite close to 1.0, otherwise such a set of equations
x̂_{r,s} = (T̂_ij + hβ̂_ij) + Ŝ_is
The sequence of calculations here is given in Figure 8.4(b) and is often
called the Holt–Winters method.
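A minimal sketch of the additive version follows, with crude starting values taken from the first year and illustrative smoothing constants (the text's a and b correspond to one minus the constants used here). It is an outline of the scheme, not a tuned implementation.

```python
def holt_winters_additive(x, r, alpha=0.2, beta=0.1, gamma=0.3):
    """Additive Holt-Winters: level, slope and r seasonal factors, each
    updated as 'constant * new information + constant * old estimate'."""
    level = sum(x[:r]) / r                 # crude starting level
    slope = 0.0
    season = [x[j] - level for j in range(r)]
    for t in range(r, len(x)):
        s = season[t % r]
        prev_level = level
        level = alpha * (x[t] - s) + (1 - alpha) * (level + slope)
        slope = beta * (level - prev_level) + (1 - beta) * slope
        season[t % r] = gamma * (x[t] - level) + (1 - gamma) * s
    # forecast h steps ahead: (trend + h * slope) + seasonal factor
    return [(level + h * slope) + season[(len(x) + h - 1) % r]
            for h in range(1, r + 1)]

x = [6, 11, 17, 10, 4, 10, 18, 6, 1, 9, 17, 5, 7, 10, 16, 7]
print([round(f, 2) for f in holt_winters_additive(x, 4)])
```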
A slight modification of this procedure is based on the observation
that x'_ij, the seasonally adjusted observation, is based on the use of the
last seasonal factor Ŝ_{i-1,j}. Having gone once through the sequence of
calculations in Figure 8.4(a) and (b) we can use the new value to
revise x'_ij and iterate the calculation to obtain what are, intuitively,
better values for Ŝ_ij and T̂_ij.
The literature on seasonal methods gives extensive study to ways of
cycling round these various calculations seeking at each stage to
improve these estimates. Durbin and Murphy (1975) and Harrison
(1965) discuss aspects of these calculations in detail. Basically, the
modifications are concerned with:
(a) reducing the effects of extreme observations, possibly caused by
known factors;
(b) adjusting the seasonal estimates so that they satisfy the conditions
that ΣS = 0 or ΣS = 12;
(c) introducing more sophisticated forms of moving average to deal
with particular features of the data.
where each of the three parts can be modelled in some sensible way.
For example, the trend might be modelled by a polynomial in t (see,
u_k = (1/6) Σ_{j=1}^{12} S_j cos(2πkj/12)   (k = 1, ..., 6)
and
v_k = (1/6) Σ_{j=1}^{12} S_j sin(2πkj/12)   (k = 1, ..., 5)
the term v_6 vanishing identically. Thus we have converted from one set of twelve indices with the
additive condition
Σ_{j=1}^{12} S_j = 0
to another equivalent set of eleven regression coefficients. Thus at this
level there is no great merit in introducing such a Fourier model. The
advantage of these models comes from the practical observation that
one usually obtains a good fit without having to use all twelve terms in
the regression model. Any pair of sine and cosine terms, say
u_k cos(2πkj/12) + v_k sin(2πkj/12)
corresponds to a single harmonic, with amplitude a_k = √(u_k² + v_k²) say.
It may be observed, for example, that a_1 and a_2 are large but a_3, ..., a_6
are relatively small. The procedure then is to use the model
S_j = Σ_{k=1}^{2} (u_k cos(2πkj/12) + v_k sin(2πkj/12))
containing only the significant harmonics.
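The coefficients u_k and v_k are straightforward sums. The sketch below, assuming twelve monthly indices, checks them on a pattern that is a pure first harmonic, for which only u_1 should survive:

```python
import math

def fourier_coeffs(S):
    """u_k and v_k for twelve monthly seasonal indices S_1, ..., S_12."""
    u = [sum(S[j - 1] * math.cos(2 * math.pi * k * j / 12)
             for j in range(1, 13)) / 6 for k in range(1, 7)]
    v = [sum(S[j - 1] * math.sin(2 * math.pi * k * j / 12)
             for j in range(1, 13)) / 6 for k in range(1, 6)]
    return u, v

# Indices following a pure first harmonic (cosine) pattern.
S = [math.cos(2 * math.pi * j / 12) for j in range(1, 13)]
u, v = fourier_coeffs(S)
print([round(c, 3) for c in u])   # only u_1 is non-zero
```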
they are recalculated to allow for more data. One natural extension is
to make u and v linear functions of time; alternatively, the whole
seasonal expression can be multiplied by a time-varying term. The work
by Nettheim (1964) and Duval (1966) deals with fitting these forms of
model. In such applications we usually have a linear regression model
and so the classical approach is to fit by least squares. We have already
seen that discounted least squares with exponential weights a^r reduces
to ordinary least squares when a = 1. We will therefore look at the
fitting by discounted least squares rather than the special case of
ordinary least squares. If we seek to reduce our problem to a linear
regression form of model, it is natural to include the trend in the model
and not just the seasonal, as we did above. Examples of linear models of
use in seasonal forecasting are
x_t = β_0 + β_1 t + β_2 sin(2πt/n) + β_3 cos(2πt/n) + e_t
x_t = β_0 + (β_2 + γ_2 t) sin(2πt/n) + (β_3 + γ_3 t) cos(2πt/n) + e_t
x_t = β_0 + β_2 sin(2πt/n) + β_3 cos(2πt/n) + β_4 sin(4πt/n) + β_5 cos(4πt/n) + e_t
In the first example an additive linear trend is included with the simple
seasonal. This has clear advantages in terms of forecasting when the
situation is fairly stable. If, however, this is not the case and
independent changes can occur in the trend levels and the seasonal
patterns of the data, then the previous methods of separately handling
T and S are likely to give better results. When we fit by discounted least
squares the smoothing constant a is clearly applied to the whole model,
and having different smoothing constants for the trend and seasonal
parts is clearly ruled out. In the second example the amplitude and
phase of the simple seasonal pattern change with time. In the last
example a more complicated seasonal pattern is allowed for by
including more Fourier terms. As a means of indicating some of the
procedures used in forecasting from such models, let us consider in a
little more detail how we could locally fit a simple linear model with a
trend and seasonal component. Suppose the fitted model takes the
form
    x_t = b_0 + b_1 t + b_2 sin(2πt/n) + b_3 cos(2πt/n) + e_t

where e_t is the residual and b_0, b_1, b_2 and b_3 are discounted least
squares coefficients chosen to minimize

    S = Σ_{r=0}^{t−1} a^r e²_{t−r}
This is simply a linear regression model with 1, t, sin(2πt/n) and
cos(2πt/n) acting as regressor variables. As usual the values of b_0, b_1, b_2 and
b_3 are obtained by solving the set of normal equations. With a slight
alteration of the notation of section 6.2, the four normal equations involve quantities such as

    S(c, x) = Σ_{r=0}^{t−1} a^r x_{t−r} cos(2π(t − r)/n)
In practice we have a large, but straightforward, exercise in calculation
to evaluate all the coefficients, S( ), and to solve the normal equations.
The forecast is obtained as usual by setting the future error term at zero
and substituting the future time t + h in the fitted model. Hence

    x̂_{t+h} = b_0 + b_1(t + h) + b_2 sin(2π(t + h)/n) + b_3 cos(2π(t + h)/n)

In fitting the linear trend we originally wrote the model as

    x_t = α + βt

but then rewrote the model as

    x_{t−r} = α_t − βr

so that α_t always referred to the mean of the trend line at the time
of the latest observation. This requires the updating of the meaning
of the parameters. Here, for example,

    α_{t+1} = α_t + β
This has the advantage that the forecast for a lead time h made at
time t + 1 always takes the form

    x̂_{t+1+h} = b_{2,t+1} sin(2πh/n) + b_{3,t+1} cos(2πh/n)

(writing only the seasonal terms). If, however, we write down the same value using t as the origin, it
takes the form

    b_{2,t} sin(2π(h + 1)/n) + b_{3,t} cos(2π(h + 1)/n)

Using the same formulae as in (a), this last expression can be put in
terms of sin(2πh/n) and cos(2πh/n). Equating the coefficients of sin(2πh/n)
and cos(2πh/n) in the two expressions gives

    b_{2,t+1} = b_{2,t} cos(2π/n) − b_{3,t} sin(2π/n)

and

    b_{3,t+1} = b_{2,t} sin(2π/n) + b_{3,t} cos(2π/n)
With the origin taken at the latest observation we may write

    S_{t+1}(z, v) = Σ_{r=0}^{t+1} a^r z_{−r} v_{−r}

For all the S terms on the left-hand side of the normal equations
the quantities z and v are sine or cosine functions that are
numerically less than one, or are simple powers of t; hence for large
t the added term, which is at worst of magnitude a^{t+1}(t + 1)(t + 1), becomes smaller and smaller
as t increases. It can in fact be shown that for large t all S terms
on the left-hand side approach constant values. As these values do
not depend on the data, they can be worked out as a set of
numerical values S_∞(z, v) as a preliminary to the forecasting
exercise.
Notice that this approach exactly parallels the methods discussed
for fitting a linear trend model. In that case discounted least
squares gave a set of normal equations which for large t led to the
simplification of the recurrence relations of the double exponential
smoothing method. If the values S_∞( ) are used in the normal
equations, from the start of the forecasting method, we obtain a
method that approximates the discounted least squares solution in
the same sense that the double exponential smoothing method does
in the linear trend situation. Further, if explicit solutions for the
fitted constants are obtained by solving the normal equations, then
these solutions, though rather complicated, can be expressed in
recurrence form. They can also, as has happened before, be
expressed in error-correction form. Suppose we denote the one-step-ahead error by e_{t+1}. Then

    e_{t+1} = x_{t+1} − b_{2,t} sin(2π/n) − b_{3,t} cos(2π/n)
Then using the results of (b) we can show that

    b_{2,t+1} = b_{2,t} cos(2π/n) − b_{3,t} sin(2π/n) + k_2 e_{t+1}

    b_{3,t+1} = b_{2,t} sin(2π/n) + b_{3,t} cos(2π/n) + k_3 e_{t+1}
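These error-correction recurrences can be sketched directly; the smoothing constants k_2, k_3 used below are hypothetical placeholders for the values that the discounted least squares solution would supply:

```python
import math

def seasonal_update(b2, b3, x_next, n, k2, k3):
    """One error-correction update of the seasonal coefficients.

    Rotate (b2, b3) through one seasonal step of 2*pi/n, then correct
    both by a multiple of the one-step-ahead error.  k2, k3 are
    assumed smoothing constants."""
    c, s = math.cos(2 * math.pi / n), math.sin(2 * math.pi / n)
    forecast = b2 * s + b3 * c          # one-step-ahead seasonal forecast
    e = x_next - forecast               # one-step-ahead error
    b2_new = b2 * c - b3 * s + k2 * e
    b3_new = b2 * s + b3 * c + k3 * e
    return b2_new, b3_new, e
```

When the error is zero the update is a pure rotation, so the seasonal amplitude √(b2² + b3²) is preserved, as the equations require.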
This can now be used to obtain forecasts using the method discussed
for stochastic models. This model provides a very elegant form of
seasonal model using only two parameters to describe a fairly complex
situation. A somewhat similar approach is used by Grether and Nerlove
(1970), who use mixed autoregressive–moving average models independently for the trend–cycle and the seasonal components of a classical
additive seasonal model. A detailed discussion of the methods for
identifying and fitting such models is beyond the scope of this book,
but is given in the book by Box and Jenkins (1970).
An alternative approach that leads to similar forecasting formulae to
the above is based on using a regression of xt on recent values and
values a year or so ago. Thus the model might look like
    x_t = b_0 + b_1 x_{t−1} + b_2 x_{t−12} + b_3 x_{t−13} + e_t
The actual terms used in the model can be determined by the usual
methods for selecting variables. A brief discussion of this approach is
given in section 15.4.
model is, however, far from easy; see, e.g., Chatfield and Prothero (1973) and
Box and Jenkins (1970). Providing good forecasts may not, however, be
the only consideration. The estimates of current trends, seasonal
indexes and the deseasonalized data provided by the seasonal index
methods and Fourier methods, but not directly by the stochastic
methods, are often important supplementary information for decision
making in seasonal situations.
References
Box, G. E. P. and Jenkins, G. M. (1970). Time Series Analysis and Forecasting.
Butterworths, London.
Box, G. E. P., and Jenkins, G. M. (1973). Some comments on a paper by Chatfield
and Prothero and on a review by Kendall. J. Roy. Statist. Soc. A, 136, 337–345.
Brown, R. G. (1963). Smoothing, Forecasting and Prediction of Discrete Time
Series. Prentice-Hall, Englewood Cliffs, N.J.
Brown, R. G. (1967). Decision Rules for Inventory Management. Holt Rinehart and
Winston.
Chatfield, C. and Prothero, D. L. (1973a). Box—Jenkins seasonal forecasting. J.
Roy. Statist. Soc. A. 136, 295—336.
Chatfield, C. and Prothero, D. L. (1973b). A reply to some comments by Box and
Jenkins. J. Roy. Statist. Soc. A. 136, 345—352.
Dauten, C. A. (1961). Business Cycles and Forecasting. South-Western Publishing
Co., Cincinnati, Ohio.
Davis, H. T. (1941). The Analysis of Time Series. Cowles Commission, Yale.
Durbin, J. and Murphy, M. J. (1975). Seasonal adjustment based on a mixed
additive—multiplicative model. J. Roy. Statist. Soc. A. 138, 385—410.
Duvall, R. M. (1966). Time series analysis by modified least-squares techniques.
JASA, 61, 152-165.
Grether, D. M. and Nerlove, M. (1970). Some properties of optimal seasonal
adjustment. Econometrica, 38, 682—703.
Groff, G. K. (1973). Empirical comparison of models for short range forecasting.
Management Science, 20, 22—30.
Harrison, P. J. (1965). Short-term sales forecasting. J. Roy. Statist. Soc. C, 14,
102-139.
Jorgenson, D. W. (1967). Seasonal adjustment of data for econometric analysis.
JASA, 62,137-140.
Kendall, M. G. (1973). Time Series. Griffin Co. Ltd., London.
Kendall, M. G. and Stuart, A. (1966). The Advanced Theory of Statistics, Vol. 3.
Griffin, London.
Lovell, M. C. (1963). Seasonal adjustment of economic time series and multiple
regression analysis. JASA, 58, 993—1010.
Nettheim, N. F. (1964). Fourier methods for evolving seasonals. JASA, 60,
492-502.
Reid, D. J. (1972). A comparison of forecasting techniques in economic time-series.
In M. J. Branson and others (Eds.), Forecasting in Action. Op. Res. Soc. and
Soc. for Long Range Planning, London.
Slutsky, E. (1937). The summation of random causes as the source of cyclic
processes. Econometrica, 5, 105–146.
Chapter 9
Growth curves
    log 100 + 2 log(105/100)

    log 100 + 3 log(105/100)

etc. Thus the log sales increase by a constant amount, log(105/100),
each year. We could write the general model for such sales as
    x_t = a r^t

or

    x_t = a e^{βt}
called the exponential growth curve, and
Figure 9.2 Plot of data and log data for exponential growth
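The straight-line behaviour of the logged data suggests the obvious fitting device: regress log x_t on t by least squares. A minimal sketch, using hypothetical sales growing at 5 per cent a year:

```python
import math

# Fit the exponential growth curve x_t = a*exp(b*t) by taking logs, so
# that log x_t = log a + b*t is a straight line fitted by least squares.
sales = [100 * 1.05 ** t for t in range(10)]   # hypothetical data
ts = list(range(10))
logs = [math.log(x) for x in sales]

tbar = sum(ts) / len(ts)
ybar = sum(logs) / len(logs)
b = (sum((t - tbar) * (y - ybar) for t, y in zip(ts, logs))
     / sum((t - tbar) ** 2 for t in ts))       # slope = growth rate
a = math.exp(ybar - b * tbar)                  # intercept gives a
```

For noise-free data the fit recovers b = log 1.05 and a = 100 exactly, mirroring the constant increment in log sales noted above.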
only sell for a certain period, e.g. a year in the case of fashion
clothes. The monthly sales for such an item might follow a curve,
such as the logarithmic parabola in Figure 9.1, or, alternatively, we
might try to model the total sales up to any given month by one
of the asymptotic growth curves.
    slope/x_t = β + 2γt
and for the Gompertz curve
    slope/x_t = βγ e^{−γt}

Table 9.1 Growth curves
the right one for the model. A detailed study of the use of slope
characteristics is given by Gregg, Hassell and Richardson (1964).
The method has two practical difficulties whose importance depends
very much on the smoothness of the data available. There is first the
problem of measuring the slope at different times. This can be done
approximately by taking first differences and smoothing these with a
moving average. Alternatively, one can smooth first and then take
differences or find the slope of a fitted trend line. Neither approach is
very satisfactory. The second problem comes from the fact that the
method depends on an eye comparison of different plots to see which
looks most like a straight line. One can, however, be misled in this,
since the vertical scales are all in different units. Within the forecasting
context it is best to follow a preliminary study of the data, using one of
the above two methods, by a study of the quality of forecasts that the
most appropriate models produce.
An important property of the growth curves of Table 9.1 is that by
suitable choice of transformation they can be transformed either to a
linear model or to a simple modified exponential. The nature of these
transformations is displayed in Table 9.2. As long as we are investi-
gating deterministic models, these transformations can be quite useful
in helping us to inspect the data to decide what model is appropriate;
we could also use the linear form as the basis of extrapolation. The
same is also true if there is a very small random component in the
model. If, however, there is a large random element superimposed on a
There is a large literature dealing with the fitting of growth curves (e.g.
Harrison and Pearce, 1972; Lewandowski, 1974). Rather than attempt
the rather daunting task of giving a survey of the subject, the aim of
this section is to indicate by examples a number of approaches that can
be adopted.
i.e.

    x_t = α e^{βt + ε_t}
Now we normally regard the form α e^{βt} as giving the expected curve
with random deviations from it. Unfortunately, the expectation of x_t in
the above model is α e^{βt + σ²/2}, where σ² is the variance of ε_t, and not α e^{βt} as is implicitly assumed by
the above method. Thus even if we obtained a perfect fit and forecast
x_{t+h} by

    x̂_{t+h} = α e^{β(t+h)}

the forecast error would have expectation

    E(e_{t+h}) = E(x_{t+h} − x̂_{t+h}) = α e^{β(t+h)} (e^{σ²/2} − 1)
This bias increases with a2 and also increases as one climbs up the
growth curve. We will discuss in Chapter 16 methods to adjust for
systematic bias in forecasts. For the moment it is sufficient to draw
attention to the problem of the possible introduction of bias when
transformations are used.
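The size of this transformation bias is easy to see by simulation. A minimal sketch (all parameter values hypothetical):

```python
import math
import random

# If x_t = a*exp(b*t + eps) with eps ~ N(0, s^2), the mean of x_t is
# a*exp(b*t)*exp(s^2/2), not a*exp(b*t): the log-transform fit is biased.
random.seed(1)
a, b, s, t = 2.0, 0.1, 0.8, 5          # hypothetical parameter values
draws = [a * math.exp(b * t + random.gauss(0, s)) for _ in range(100_000)]
mean = sum(draws) / len(draws)

naive = a * math.exp(b * t)            # what the log-transform fit forecasts
corrected = naive * math.exp(s * s / 2)  # bias-adjusted expectation
```

The simulated mean agrees with the corrected value and exceeds the naive forecast, and the gap grows with σ², as the text observes.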
    x_t = α e^{βt} + e_t
    S = Σ_t e_t² = Σ_t (x_t − α e^{βt})²

Differentiating with respect to α and β gives the normal equations

    Σ_t x_t e^{βt} − α Σ_t e^{2βt} = 0

    Σ_t t x_t e^{βt} − α Σ_t t e^{2βt} = 0
The intuitive logic of this is that if the variance is large, less notice
should be taken of a large residual. One way of obtaining such an estimated
variance is to model the variance as a function of level of the curve. For
example, one method uses
Variance is proportional to (level)k
where the level of the curve is found using a weighted moving average
of the data and k is some power determined to give the best fit (see
Harrison and Pearce, 1972).
The normal equations for this can be solved simply for δ̂ and α̂, but the
β term is not separable. If, as before, we can get an estimate of β, then
the normal equations can be used to provide estimates of δ and α.
Suppose we have 3n observations and we consider the model values

    x_1 = δ + α e^{β} + e_1
    x_2 = δ + α e^{2β} + e_2
    ...
    x_{3n} = δ + α e^{3nβ} + e_{3n}

Averaging these in three successive groups of n, with group means x̄_1, x̄_2 and x̄_3, and taking differences leads to

    e^{nβ̂} = (x̄_3 − x̄_2)/(x̄_2 − x̄_1)

This can be used with the equations above or the normal equations to
obtain estimators of δ and α. Gregg, Hassell and Richardson (1964) give
details of these methods and tables to simplify the calculations.
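The three-group device can be sketched directly. In this noise-free check (hypothetical values; d, a and b play the roles of δ, α and β) the ratio of group-mean differences isolates e^{nb} exactly:

```python
import math

def three_group_fit(x):
    """Estimate b in x_t = d + a*exp(b*t) from 3n observations by the
    group-averages method: the delta term cancels in the differences of
    the three group means, leaving exp(n*b) as their ratio."""
    n = len(x) // 3
    m1 = sum(x[:n]) / n
    m2 = sum(x[n:2 * n]) / n
    m3 = sum(x[2 * n:3 * n]) / n
    enb = (m3 - m2) / (m2 - m1)   # estimate of exp(n*b)
    return math.log(enb) / n

# Noise-free hypothetical data: d = 5, a = 2, b = 0.1, 12 observations
data = [5 + 2 * math.exp(0.1 * t) for t in range(1, 13)]
b_hat = three_group_fit(data)
```

With noisy data the same ratio would be applied to the group means, followed by solution for δ and α as described above.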
The form of the slope equation can suggest other ways of obtaining
linear equations. For example, in the case of the logistic curve we can
remove the e^{−γt} term by using the basic equation of the model and
obtain

    (1/x_t) slope_t = γ − (γ/α) x_t
So if we define

    ū = (1/t) Σ_{i=1}^{t} u_i
(a) For the structure we need to identify not only the type of growth
curve to use but also the way the random variation is involved in
the structure. For example, an exponential growth might be of the
form
    x_t = α e^{βt} + e_t
or
    x_t = α e^{βt + e_t}

    x̂_{t+h} = α̂ e^{β̂(t+h)}
    x_t = δ(1 − e^{−βt})
which is the special case of the simple modified exponential curve
illustrated in Figure 9.1.
References
Barham, R. H., and Drane, W. (1972). An algorithm for least squares estimation of
non-linear parameters when some of the parameters are linear. Technometrics,
14, 757-766.
Draper, N. R., and Smith, H. (1966). Applied Regression Analysis. John Wiley and
Sons, New York.
Gregg, J. V., Hassell, C. H., and Richardson, J. T. (1964). Mathematical Trend
Curves: An Aid to Forecasting. ICI Monograph, Oliver and Boyd, Edinburgh.
Harrison, P. J., and Pearce, S. F. (1972). The use of trend curves as an aid to market
forecasting. Ind. Mark. Manage., 2,149—170.
Lewandowski, R. (1974). Prognose und Informations Systeme. W. de Gruyter,
Berlin.
Marquardt, D. W. (1963). An algorithm for least squares estimation of non-linear
parameters. J. Soc. Indust. Appl. Math., 11, 431–441.
Chapter 10
Probabilistic models
10.1 Introduction
In previous chapters interest has centred on forecasting some variable
xt. In a number of situations the item of central interest is an event or a
number of possible events. The model used then concentrates on the
probabilities of these events occurring. The forecasting problem in such
situations is that of estimating the future values of these probabilities.
For example, the prices of certain foodstuffs tend to be either stable,
with the basic price remaining constant from day to day, or unstable,
with price changes taking place. In obtaining price forecasts we are
faced with forecasting the probabilities of the two alternative events,
‘prices stable’ or ‘prices unstable’, at a future time. Another situation
which faces us more directly with probability considerations is where
we are forecasting a discrete variable that may only take on a limited
range of values. For example one might require to forecast, in a
changing situation, the number st of vehicles out of a small fleet of five
that would be in use at a future time. Such a problem requires a
‘forecast’ of the future probabilities of demands occurring for use of a
vehicle. The aim of this chapter is to discuss briefly some approaches to
the forecasting of probabilities.
    Σ_{i=1}^{t} (residual)_i²

was replaced by

    Σ_{r=0}^{t−1} w_r (residual)²_{t−r}
where the residual was the error between the observed value and the
fitted value and the weights wr were chosen to give more emphasis to
recent data than past data. In fitting probability models the method of
least squares is often replaced by a method called the method of
maximum likelihood. The basis of this method is first to write down
the probability of the events that were observed. This probability is a
function of the model parameters and is called the likelihood. Next we
choose the model parameters so that the events actually observed have
the highest possible probability of occurrence. Thus if
the probability of observing the discrete variable s_t at time t is P_t(s_t; θ),
where θ is a parameter, the likelihood, i.e. the probability of observing the data
s_1, ..., s_t, is

    ℓ(θ) = P_1(s_1; θ) P_2(s_2; θ) ... P_t(s_t; θ)

and

    L_w(θ) = Σ_{r=0}^{t−1} w_r log P_{t−r}(s_{t−r}; θ)
(a) Theory

Time   State (S stable, U unstable)   Probability   Contribution to L_w(p)
t      S                              p             log p
t−1    S                              p             a log p
t−2    S                              p             a² log p
t−3    U                              1 − p         a³ log(1 − p)
t−4    S                              p             a⁴ log p
t−5    U                              1 − p         a⁵ log(1 − p)
t−6    U                              1 − p         a⁶ log(1 − p)
t−7    S                              p             a⁷ log p
t−8    U                              1 − p         a⁸ log(1 − p)
t−9    U                              1 − p         a⁹ log(1 − p)

Total: L_w(p); the estimate p̂_t maximizes L_w(p)
(b)
                           a = 1            a = 0.9
Time   State   δ_{t−r}    a^r    a^r δ     a^r     a^r δ
t      S       1          1      1         1.000   1.000
t−1    S       1          1      1         0.900   0.900
t−2    S       1          1      1         0.810   0.810
t−3    U       0          1      0         0.729   0.000
t−4    S       1          1      1         0.656   0.656
t−5    U       0          1      0         0.590   0.000
t−6    U       0          1      0         0.531   0.000
t−7    S       1          1      1         0.478   0.478
t−8    U       0          1      0         0.430   0.000
t−9    U       0          1      0         0.387   0.000
    p̂_t = (1 − a) Σ_{r=0}^{∞} a^r δ_{t−r}

or, in recurrence form,

    p̂_t = (1 − a) δ_t + a p̂_{t−1}
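The recurrence is trivial to apply; a minimal sketch run over the state sequence of Table 10.1 (the starting value 0.5 is an arbitrary assumption):

```python
def update_probability(p_prev, stable, a=0.9):
    """One step of p_t = (1 - a)*delta_t + a*p_{t-1}, where delta_t is 1
    if the state at time t is S (stable) and 0 if it is U (unstable)."""
    delta = 1.0 if stable else 0.0
    return (1 - a) * delta + a * p_prev

# States from Table 10.1, oldest first (t-9 ... t): U,U,S,U,U,S,U,S,S,S
states = [False, False, True, False, False, True, False, True, True, True]
p = 0.5   # arbitrary starting value
for s in states:
    p = update_probability(p, s, a=0.9)
```

Unwinding the loop reproduces the discounted sum (1 − a) Σ a^r δ_{t−r}, plus a term a^10 × 0.5 from the starting value, which dies away as more data arrive.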
[Figure: local estimates of the probability, with a = 0.99, after 30, 45 and 60 observations]
                          State in second week
                            S        U
State in first week   S     35       18
                      U     11       36

                          State in second week
                            S        U
State in first week   S     p₁       1 − p₁
                      U     1 − p₂   p₂
various seasons of the year and so p₁ and p₂ are best estimated locally
using the discounted method of the previous section. Figure 10.3 shows
the local estimates of p₁ and p₂ over a period of time for some price
data. Notice that we only get information about p₁ when a transition is
made from state S to S or U; thus only one of the curves varies at any
one time. To illustrate how to use such transition probabilities, let us
suppose that we know that we are in the stable state this week. We will
be in state S next week with estimated probability p₁ and state U with
probability 1 − p₁. The estimated probability of being in state S two
weeks hence is that of going through the sequence of states S,S,S or
through S,U,S, which is p₁ × p₁ + (1 − p₁) × (1 − p₂).
Before leaving Markov chains we consider a further forecasting
problem that is similar in structure to the above. By way of example,
Table 10.3 shows some information about a steady market in which
four brands are competing. Part (a) gives the current market share of
the four products. Part (b) gives the transition probabilities of product
buyers between the brands. Thus buyers of brands A and B are
completely loyal, but a person who buys brands D in year t has only a
probability of 0.48 of buying it in year t + 1. Part (c) shows how this
information can be used; to forecast the brand shares in year t + 1. We
assume that because brand A held 50 per cent of the market in year t
and had totally loyal customers it will hold an initial 50 per cent of the
market. To this we must add the fraction 0.02 of the 10 per cent of the
market that used to buy C, but which switches to A, and the fraction
0.22 of D brand’s 10 per cent of the market, giving a total of 52.4 per
cent. A basic problem here is that of estimating the probabilities of the
transitions. One way of doing this, discussed by Chorofas (1965), is to
carry out the reverse of the above calculations over a period of years for
which market shares are known in successive years.
Table 10.3

(a) Current market shares (%)

Product    A     B     C     D
Share      50    30    10    10

(b) Transition probabilities

            To
From     A      B      C      D      Total
A        1.0    0      0      0      1.0
B        0      1.0    0      0      1.0
C        0.02   0.03   0.95   0      1.0
D        0.22   0.30   0      0.48   1.0

(c) Forecast shares for year t + 1

         Year t             Contributions to
         share (%)      A       B       C       D
A        50             50      0       0       0
B        30             0       30      0       0
C        10             0.20    0.30    9.50    0
D        10             2.20    3.00    0       4.80

Totals                  52.40   33.30   9.50    4.80
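The calculation in part (c) is a single matrix multiplication of the shares by the transition probabilities, which can be sketched as:

```python
# Next year's brand shares are this year's shares redistributed by the
# transition probabilities, as in Table 10.3.
shares = {"A": 50.0, "B": 30.0, "C": 10.0, "D": 10.0}
transition = {   # row: brand bought in year t; column: brand in year t+1
    "A": {"A": 1.00, "B": 0.00, "C": 0.00, "D": 0.00},
    "B": {"A": 0.00, "B": 1.00, "C": 0.00, "D": 0.00},
    "C": {"A": 0.02, "B": 0.03, "C": 0.95, "D": 0.00},
    "D": {"A": 0.22, "B": 0.30, "C": 0.00, "D": 0.48},
}

def next_shares(shares, transition):
    brands = shares.keys()
    return {b: sum(shares[f] * transition[f][b] for f in brands)
            for b in brands}

year_t1 = next_shares(shares, transition)
```

Applying the function once gives brand A its 52.4 per cent, as worked in the text; applying it repeatedly would project the shares further ahead, on the assumption that the transition probabilities stay fixed.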
    x_t = μ_t + e_t,    μ_t = μ_{t−1} + γ_t

The other corresponded to the unstable price situation, U, which in fact
was modelled by

    x_t = x_{t−1} + e_t
In the stable state S, the e_t and γ_t values come from distributions with
zero mean and variances σ²_e and σ²_γ, respectively. If a very large value of
e_t occurs, then this shows as a transient in x_t. If a large value of γ_t
occurs, μ_t will jump to a new level. Future means will start to wander
from this new level; thus a major step change will have occurred in the
mean level. The occurrence of such large values of e_t and γ_t can be
modelled by assuming the existence of two further states, U_e and U_γ.
In one, U_e, the variance of e_t is substantially larger than σ²_e, thus
allowing the possibility of the transient, large e_t. Similarly the other,
U_γ, gives a large variance to γ_t, thus modelling the step-change. Clearly
the system will only rarely be in states U_e and U_γ. The probabilities of
the transitions S → U_e and S → U_γ will thus be very small, and those for
U_e → S and U_γ → S correspondingly high. Harrison and Stevens,
using Bayesian methods, investigate a particular model of this type,
which includes also a state for trend changes, U_δ. In their examples the
variances involved in U_e and U_γ are of the order of 100 times larger
than σ²_e and σ²_γ; thus the occurrence of states U_e or U_γ represents
significant events in the progress of the data.
The value of using the state approach here is not that the forecasts
make some allowance for future changes in state. This allowance must
inevitably be small since the changes are relatively rare. The real value is
that the model, as it is fitted to the data, detects past changes in state.
This automatically improves the quality of the forecasts. Thus if a
transient is detected, it will be treated as a transient and future
forecasts will largely ignore it. Many other forecasting approaches will
treat the transient as a respectable observation and it will lead to biased
forecasts. Similarly, if a step-change is detected, the new forecasts made
will be based on the levels of pt after the step and will tend to ignore
data from before it. Other forecasting approaches will often take some
time for the estimated level to move over to the region of the new level.
The model incorporating the possibilities of transient and step-change
states thus deals more effectively with data in which such features can
occur. For technical details the reader is referred to the reference
already given.
    I = − Σ_{i=1}^{k} p_i log p_i

Similarly, the expectation of the information in our forecasts p̂_i is

    Î = − Σ_{i=1}^{k} p_i log p̂_i

and the difference is

    D = Î − I = Σ_{i=1}^{k} p_i log (p_i / p̂_i)

The smaller this quantity the better the forecasts, perfect forecasts
giving D = 0. Similarly, two different forecasts p̂_i and p̃_i can be
compared using D̂ − D̃, i.e.

    D̂ − D̃ = Î − Ĩ = Σ_{i=1}^{k} p_i log (p̃_i / p̂_i)
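The divergence D is a one-line calculation; a minimal sketch (the probability values are hypothetical):

```python
import math

def divergence(p, p_hat):
    """D = sum_i p_i * log(p_i / p_hat_i); smaller is better, and
    D = 0 when the forecast probabilities equal the true ones.
    Terms with p_i = 0 contribute nothing and are skipped."""
    return sum(pi * math.log(pi / qi)
               for pi, qi in zip(p, p_hat) if pi > 0)

p_true = [0.5, 0.3, 0.2]                      # hypothetical true probabilities
perfect = divergence(p_true, [0.5, 0.3, 0.2])  # perfect forecast
poor = divergence(p_true, [0.1, 0.1, 0.8])     # badly placed forecast
```

Any misplacement of probability makes D strictly positive, so the quantity ranks competing sets of probability forecasts directly.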
The discounted information in a sequence of forecasts is thus

    − Σ_{r=0}^{t−1} w_r log p_{t−r}

where p_{t−r} is the forecast probability of the event that actually occurred at time t − r.
It has already been noted that if we knew the future and could assign
p_i = 1 to the appropriate future event E_i, then the information content
of this perfect forecast would be zero. Thus we are aiming to produce
forecasts with information at a minimum. This suggests that we choose
the parameters in the formulae for the probabilities to minimize the
discounted information. Referring back to section 10.2, this is just the
same as maximizing the log discounted likelihood, which is

    Σ_{r=0}^{t−1} w_r log p_{t−r}
This was the method used to obtain our forecasts in that section. Thus
our brief discussion of information has given further support for the
methods of obtaining probability forecasts in section 10.2.
    Ĉ_i = Σ_{j=1}^{k} C_{ij} p̂_j
Table 10.4

Forecast        Events occurring                         Estimated
event           E_1   ...   E_j   ...   E_k    Decision  expected cost
E_1             C_11  ...   C_1j  ...   C_1k   d_1       Ĉ_1
...
E_i             C_i1  ...   C_ij  ...   C_ik   d_i       Ĉ_i
...
E_k             C_k1  ...   C_kj  ...   C_kk   d_k       Ĉ_k

Forecast
probabilities   p̂_1   ...   p̂_j   ...   p̂_k
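The expected-cost rule can be sketched as follows; the cost matrix and forecast probabilities are hypothetical, with constant off-diagonal costs as in the remark below, so the cheapest decision should match the most probable event:

```python
# For each decision d_i compute C_i = sum_j C[i][j] * p_hat[j] and pick
# the decision with the smallest estimated expected cost.
costs = [
    [0, 4, 4],   # decision d1: cost if events E1, E2, E3 occur
    [4, 0, 4],   # decision d2
    [4, 4, 0],   # decision d3
]
p_hat = [0.2, 0.5, 0.3]   # hypothetical forecast probabilities

expected = [sum(c * p for c, p in zip(row, p_hat)) for row in costs]
best = min(range(len(expected)), key=expected.__getitem__)  # cheapest decision
```

Here E_2 is the most probable event and d_2 duly has the smallest expected cost, illustrating the constant-cost case discussed next.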
Using a layout such as that shown in Table 10.4, the value of Ĉ_i can
be calculated for the k possible decisions. It is then often a reasonable
procedure to choose the decision d_i corresponding to the smallest
estimated cost Ĉ_min. If we wish to compare two different sets of
forecasts, p̂ and p̃, one could compare the corresponding sets of
estimated expected costs Ĉ and C̃. One would certainly be interested to
see whether they both lead to the same decision. Suppose, for example,
there are only two possible decisions, d_1 and d_2; then it is clear that, on
the basis of the table, when p̂_1 is less than some value p_0, which
depends on the costs, we will reach decision d_2, and conversely when
p̂_1 > p_0. Thus any forecast will lead to the same decision provided it
lies on one particular side of p_0. Consider another example of the use
of these expected cost tables. Suppose that the costs associated with
decision d_i when E_j occurs are constant, C say, for all i, j (i ≠ j). If, also,
the costs associated with the correct decision (i = j) are all at some
constant value less than C, then the best decision is the one that
corresponds to the most probable event as indicated by p̂. Thus if p̂_3 is
the largest probability, d_3 is the decision made.
References
Brown, R. G. (1963). Smoothing, Forecasting and Prediction of Discrete Time
Series. Prentice-Hall, Englewood Cliffs, New Jersey.
Chorofas, D. N. (1965). Systems and Simulation. Academic Press, New York.
Harrison, P. J. and Stevens, C. F. (1971). A Bayesian approach to short-term
forecasting. Op. Res. Quarterly, 22, No. 4, 341—362.
Theil, H. (1965). Applied Economic Forecasting. North-Holland Publishing Co.,
Amsterdam.
Chapter 11
Multivariate models
11.1 Introduction
    I = E(e₁²) + E(e₂²) + ... + E(e_n²)
    x̂_t = x̂_{t−1} + b e_t

    x_t = μ_t + e_t

    μ_t = μ_{t−1} + η_t
If x_t, μ_t, e_t and η_t are now regarded as vectors, this describes a
multivariate process for which multivariate exponential smoothing
gives the optimum mean square error predictor; see Jones (1966)
and Kalman and Bucy (1961). The parameters b in this situation
depend on the variance–covariance properties of the e and η
sequences. Jones gives a fairly direct way of choosing appropriate
values of these parameters.
    x_t = φ₁ x_{t−1} + e_t
Let η_t be the random variable for the y process and suppose that the
present x_t and y_t depend on the past values of both the x and y
sequences. Then we have

    x_t = φ₁₁ x_{t−1} + φ₁₂ y_{t−1} + e_t

    y_t = φ₂₁ x_{t−1} + φ₂₂ y_{t−1} + η_t
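Forecasting from this bivariate autoregression follows the usual rule of setting the future errors to zero; a minimal sketch with hypothetical coefficient values:

```python
# One-step forecast from the bivariate AR model
#   x_t = phi11*x_{t-1} + phi12*y_{t-1} + e_t
#   y_t = phi21*x_{t-1} + phi22*y_{t-1} + eta_t
# with e_t and eta_t set to zero.  Coefficients are hypothetical.
phi = [[0.6, 0.2],
       [0.1, 0.5]]

def forecast_one_step(x_prev, y_prev):
    x_hat = phi[0][0] * x_prev + phi[0][1] * y_prev
    y_hat = phi[1][0] * x_prev + phi[1][1] * y_prev
    return x_hat, y_hat

x_hat, y_hat = forecast_one_step(10.0, 4.0)
```

Iterating the function on its own output gives forecasts for longer lead times, exactly as in the univariate autoregressive case.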
    x_t = e_t − θ e_{t−1}
and note that the latest e and 77 in the autoregressive model only
    x_t = e_t − θ₁₁ e_{t−1} − θ₁₂ η_{t−1}

    y_t = η_t − θ₂₁ e_{t−1} − θ₂₂ η_{t−1}
In generalizing both the above models we can increase the number of x
or e, or y or η, terms in the expressions for both x_t and y_t. There are
thus four numbers required to define the order of the process. If we
introduce vector and matrix notation, putting

    x_t = (x_t, y_t)′,    e_t = (e_t, η_t)′

    Φ = ( φ₁₁  φ₁₂ ),    Θ = ( θ₁₁  θ₁₂ )
        ( φ₂₁  φ₂₂ )         ( θ₂₁  θ₂₂ )

the two models become

    x_t = Φ x_{t−1} + e_t

and

    x_t = e_t − Θ e_{t−1}

For the mixed model

    x_t = x_{t−1} + e_t − Θ e_{t−1}
all the calculations given for the univariate model at the end of
section 7.5 follow through. This gives as optimum predictor
References
Hannan, E. J. (1970). Multiple Time Series. John Wiley and Sons, New York.
Jones, R. H. (1966). Exponential smoothing for multivariate time series. J. Roy.
Statist. Soc. B, 28, 241—251.
Kalman, R. E., and Bucy, R. S. (1961). New results in linear filtering and prediction
theory. J. Basic Eng., 83 D, 95–108.
Robinson, E. (1967). Multichannel Time Series Analysis with Digital Computer
Programs. Holden-Day Inc., San Francisco.
Chapter 12
The approach of model building has been the underlying feature of the
methods discussed in previous chapters. Though we devoted Chapter 2
to a discussion of models, there are a number of implications from the
topics studied in later chapters that need comment. In any real
situation it is unlikely that we can only talk about the model for that
situation. We are usually in a position to reasonably use a number of
different models. The truth in a situation is invariabliy too complex for
anyone to grasp, assuming even that it exists. Different models will pick
up different features of ‘the truth’. We may model our data by a trend
Table 12.1 Forecasting methods
Here the question is whether there is any structure in the data at all.
Kendall (1973) gives a number of different tests of randomness. Table
12.3 describes and illustrates one such test.
1. Definition
A turning point is an observation that is greater than both its neighbours or
less than both its neighbours.
2. Examples
+ is a turning point.

[Point plots of illustrative sequences with turning points marked +; panel (c) shows a trending sequence]
3. Formulae

n = number of observations
p = number of turning points
μ_p = expected number of turning points = ⅔(n − 2) for a random sequence
σ_p = standard deviation of p = √{(16n − 29)/90}
u = (p − μ_p)/σ_p = standardized value of p
4. Test
If data are from a random sequence u has, approximately, a standard
normal distribution. Thus reject hypothesis of randomness if | u | > 1.96
for 5% significance test.
5. Example

Data             9.1  8.3  7.2  10.3  9.5  10.4  10.5  10.1  9.7  10.6
Turning points             +    +     +          +           +

n = 10, p = 5, μ_p = 5.33, σ_p = 1.21, u = −0.28

The observed u is near the centre of the standard normal distribution, so
the data are consistent with the hypothesis of randomness.
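The test in Table 12.3 can be sketched in a few lines (ties between neighbours are ignored, as in the definition above); the data are those of the worked example:

```python
import math

def turning_point_test(x):
    """Turning-point test of randomness: count observations greater than
    both neighbours or less than both, and standardize the count."""
    n = len(x)
    p = sum(1 for i in range(1, n - 1)
            if (x[i] > x[i - 1] and x[i] > x[i + 1])
            or (x[i] < x[i - 1] and x[i] < x[i + 1]))
    mu = 2 * (n - 2) / 3                    # expected count, random sequence
    sigma = math.sqrt((16 * n - 29) / 90)
    return p, (p - mu) / sigma

data = [9.1, 8.3, 7.2, 10.3, 9.5, 10.4, 10.5, 10.1, 9.7, 10.6]
p, u = turning_point_test(data)
```

Running it reproduces p = 5 and u ≈ −0.28, well inside ±1.96, so randomness is not rejected.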
(a)

1. Formulae (see Chapter 5 for detailed notation)

n = number of observations
β̂ = least squares estimator of slope β
σ̂² = estimated variance = Σe²/(n − 2)
σ̂²_β = estimated variance of β̂ = σ̂²/Σt′²
t = (β̂ − β)/σ̂_β = standardized value of β̂

2. Test

If there is no underlying trend in the situation, β = 0. The statistic t = β̂/σ̂_β
can then be tested using a t-test.

3. Example (from Table 5.2)

n = 11, β̂ = 1.882, σ̂² = 2.546, Σt′² = 110, σ̂²_β = 0.023, t = 1.882/√0.023 = 12.4, which is
clearly highly significant, i.e. there is a clear trend in the data.
(b)
1. Formulae
n = number of observations
P = number of pairs of observations, not necessarily adjacent, in which
the later one is greater than the earlier one
lip = expected value of P = \ n(n — 1) for random sequence
4P f +1, steady upwards trend
r= “1; r =
n(n — 1) T, steady downwards trend
or = standard deviation of r = ^J{2(2n + 5)/9n(n — 1)}
2. Test
If the data are from a non-trending random sequence, r/σr has, approximately,
a standard normal distribution. Thus reject the hypothesis of non-trending
randomness if |r/σr| > 1.96 for a 5% significance test.
3. Example
Data: 9.1 8.3 7.2 10.3 9.5 10.4 10.5 10.1 9.7 10.6
Number of points to the left with a lower value: 0 0 0 3 3 5 6 4 4 9
P = 34, r = 0.51, σr = 0.25, r/σr = 2.05
The observed value of r/σr is just significant at the 5% significance level; so
there is some evidence of a trend, but more data would be useful before
reaching a firm conclusion.
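The same data can be run through the pairs test in a few lines; a Python sketch for illustration only (the names are ours):

```python
import math

def pairs_trend_test(data):
    """Trend test based on P = number of pairs (i < j) with data[j] > data[i].

    r = 4P/{n(n-1)} - 1 is standardized by sigma_r = sqrt{2(2n+5)/9n(n-1)}.
    """
    n = len(data)
    P = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if data[j] > data[i]
    )
    r = 4.0 * P / (n * (n - 1)) - 1.0
    sigma_r = math.sqrt(2.0 * (2 * n + 5) / (9.0 * n * (n - 1)))
    return P, r, r / sigma_r

data = [9.1, 8.3, 7.2, 10.3, 9.5, 10.4, 10.5, 10.1, 9.7, 10.6]
P, r, z = pairs_trend_test(data)
# P = 34, r about 0.51, r/sigma_r about 2.06: just significant at the 5% level
```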
The general layout of the data is

                        Season j
                  1      2     . . .    r
Period 1         x11    x12    . . .   x1r
       2         x21    x22    . . .   x2r
       . . .
       t         xt1    xt2    . . .   xtr
Totals           S1     S2     . . .   Sr      ΣSj
Sum of squares   SS1    SS2    . . .   SSr     ΣSSj
(e.g. SS1 = x11² + x21² + . . . + xt1²)

In the variance analysis the total sum of squares, corrected for the grand
total, has N − 1 degrees of freedom, where N = tr.

Example (quarterly data over five years):

                  Quarter
              1     2     3     4
Year 1      -10     2     9    -6
     2       -3    -1     4     0
     3       -4     5     1    -3
     4        2     2    -6    -8
     5       -2     4     7     4
Totals      -17    12    15   -13     (grand total -3)

Total sum of squares = 491 − (−3)²/20 = 490.55
Between-season sum of squares = 165.4 − (−3)²/20 = 164.95

Variance analysis
Source              Sum of squares    Degrees of freedom
Between seasons         164.95                 3
Residual                325.60                16
Total                   490.55                19
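The sums of squares above can be verified directly; a short Python sketch (the variable names are ours, not from the original):

```python
data = [
    [-10,  2,  9, -6],
    [ -3, -1,  4,  0],
    [ -4,  5,  1, -3],
    [  2,  2, -6, -8],
    [ -2,  4,  7,  4],
]  # 5 years (periods) by 4 quarters (seasons)

N = sum(len(row) for row in data)            # 20 observations
grand = sum(x for row in data for x in row)  # grand total: -3
correction = grand ** 2 / N                  # (-3)**2 / 20 = 0.45

# Total sum of squares, corrected for the grand total
total_ss = sum(x * x for row in data for x in row) - correction   # 490.55

# Between-season sum of squares: sum of (season total)^2 / periods
t = len(data)
season_totals = [sum(row[j] for row in data) for j in range(4)]
between_ss = sum(S * S for S in season_totals) / t - correction   # 164.95

residual_ss = total_ss - between_ss                               # 325.60
```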
Note that for the same data the test of Table 12.3 was unable to detect
non-randomness.
allow for the bias that will almost certainly occur in the forecasts of
some members of the group. This occurs because of the way in which
the forecast will influence their part of the organization; a high forecast
will make life harder for production men but possibly more lucrative to
salesmen. Over a period of time the magnitudes of these biases could be
assessed and these forecasts adjusted accordingly.
Other methods seek to get the group to work as a team and arrive at
a genuine consensus. This can be done by circulating initial forecasts,
each with some indication of the reasoning behind its choice, on the
basis of which each member revises his forecast. This is usually done
anonymously. The feedback produced by the study and revision of
forecasts reduces the spread of the forecasts. This is particularly so if
those well above or below the average have to produce detailed
arguments for keeping to their forecasts, should they wish to do so. Part of
the value of this type of exercise is the investigation of relevant factors
as seen from a number of different viewpoints. It is possible in this
technique to introduce something corresponding to the weights in the
previous method. In the initial circular, previous forecasts and actuals
could be given for each of the forecasters. Thus people would begin to
form a clearer assessment of their own and other people’s biases and
tendencies.
A major problem occurs with judgement forecasts that is not present
to the same extent with forecasts based on models. With a statistical
forecast, assuming stability, one can evaluate standard errors for the
future observations, given the forecasts. If the same forecaster makes
repeated forecasts on the same basis of information, then there will
exist a set of errors with probably a clear statistical pattern. Where we
are dealing with consensus forecasts using possibly different people and
different lines of argument each time, it is likely that the errors over a
sequence of forecasts will not show so much stability. If this is the case,
then it is difficult to assess limits for the forecast. One could try to
form a consensus opinion of the range in which the future observation
will lie with given probabilities. Alternatively, the revised, or original,
sets of individual forecasts could be treated as a set of random
observations on a distribution with the consensus forecast as mean. This
distribution could then be used to provide an error distribution.
Experience would have to be built up to see how close the distribution
of errors in forecasts corresponded to the distribution of forecasts made
by individuals.
References
Bassie, V. L. (1958). Economic Forecasting. McGraw-Hill, New York.
Kendall, M. G. (1973). Time Series. Griffin & Co., London.
Part III
Chapter 13
Data
13.1 Introduction
The quality of a forecaster’s results cannot be better than the quality of
his data. It is therefore worthwhile to devote a short chapter to looking
more carefully at the forecaster’s data. In section 13.2 we will look at
the sources of his data and in section 13.3 we will examine the quality
of his data. The ‘quality’ of data is almost impossible to measure or
define, so we will avoid the issue by simply looking at situations where
the ‘quality’ of the data is clearly low and seeing what might be done
about it.
We will also look at some of the many ways in which data are
adjusted for use in forecasting.
13.2 Sources of data
The sources of data depend on the type of variable being forecast. The
sources for forecasting sales of furniture will be very different from
those required for forecasting the number of school places needed for
ten-year-olds in a town. There is no hope, nor need, to give a study
covering all applications. Instead, we will concentrate on forecasting in
the business area of application, as this is the most common and as it
also illustrates a number of more general aspects.
In Chapter 2 we made a distinction between internal and external
data, i.e. data obtained within the forecaster’s own firm and data
obtained from outside. Let us look at internal data first. It is essential
to the forecaster’s work that, as part of the firm’s procedures,
appropriate data are collected and recorded. He should look very
carefully at how data are initially obtained, combined and circulated so
that he can obtain data of the highest quality possible as early as
possible. Where the forecaster is starting from scratch and trying to get
together past data, a lot depends on knowing the firm and its staff very
is of great importance when they are used to forecast series that have
not themselves been investigated.
External data for business forecasting come under roughly two
headings:
(a) data for similar products, or generally for the same industry;
(b) data concerning the general business environment and economy of
the country.
13.3 The quality of data
Let us turn now to the quality of the data used in forecasting and list
some of the things that can be wrong with such data.
(a) The data may be unavailable or late. For example, in regression
methods substitute variables have to be used where an intuitively
important regressor variable is either not known or is not published
until after the deadline for producing the forecast. The substitute
We must also have a set of figures giving the correct annual total, so
that
8 × w1 × 4 + 4 × w2 × 5 = 52
Solving these two simultaneous equations gives
w1 = 13/12 and w2 = 13/15
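These weights simply rescale a 4-week or 5-week month to an average month of 52/12 weeks, which can be checked exactly; a Python sketch in which the month pattern and the raw figures are invented for illustration:

```python
from fractions import Fraction

weeks_per_average_month = Fraction(52, 12)
w1 = weeks_per_average_month / 4   # weight for a 4-week month: 13/12
w2 = weeks_per_average_month / 5   # weight for a 5-week month: 13/15

# The weighted weeks still make up a full year of 52 weeks:
assert 8 * w1 * 4 + 4 * w2 * 5 == 52

# Adjust raw monthly figures before they enter the forecasting formula.
# The 8 x 4-week + 4 x 5-week pattern and the sales figures are made up.
month_weeks = [4, 5, 4, 4, 5, 4, 4, 5, 4, 4, 5, 4]
raw = [120, 155, 118, 122, 150, 119, 121, 148, 117, 123, 152, 120]
adjusted = [x * (w1 if w == 4 else w2) for x, w in zip(raw, month_weeks)]
```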
Having multiplied each actual monthly figure by w1 or w2 and
substituted the adjusted figure in the forecasting formula, we must
r̂k = ar̂k−1 + (1 − a)rk
E(r̂k) = μ − cα
E(r̂k−1) = μ − αT − cα
The expected number of items arising in the time T will be the average
rate times T, i.e. (μ − αT/2)T. Thus the expected value of rk is
E(rk) = μ − αT/2
Solving gives a = (2c − T)/(2c + T). Thus at each use of the recurrence
relation a is recalculated from T. The choice of the best a is replaced by
the choice of the best c.
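A one-line helper shows how a responds to the interval T; the values of c and T below are purely illustrative:

```python
def smoothing_constant(c, T):
    """a = (2c - T)/(2c + T), recalculated at each use from the interval T."""
    return (2.0 * c - T) / (2.0 * c + T)

# With c fixed, a falls as the gap T between observations grows, so older
# information is discounted more heavily after a long gap.
a_short = smoothing_constant(c=10.0, T=1.0)   # 19/21, about 0.905
a_long = smoothing_constant(c=10.0, T=5.0)    # 15/25 = 0.6
```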
A final comment on the consideration of time scales is that even if
one can obtain data at regular intervals it is not necessarily the best
thing to do. As an example of this point, consider some attempts that
have been made to model share prices (Mandelbrot and Taylor, 1967).
The obvious time scaling is to take one unit as one day’s trading, so that
the xt are the closing prices on the Stock Exchange. However, some
days are slow days for a given share, with very little buying and selling.
On other days as much stock is bought or sold as in the average week.
On these latter days it is as though time speeded up for that particular
stock. It is reasonable to try to scale t so that a uniform rate of buying
and selling occurs. This is done by observing xt at times called t = 0, 1,
2, . . . , which correspond to the moments at which, say, £0, £10,000,
£20,000, £30,000 worth of stock has been traded, provided this
information is available. This keeps a constant amount of trading,
rather than a constant amount of physical time, between successive
observations xt.
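A rough sketch of this rescaling in Python; the function and the trading figures are invented for illustration:

```python
def sample_by_traded_value(prices, traded, step=10_000):
    """Record the price each time another `step` of cumulative value trades.

    prices[i] and traded[i] are the closing price and the value traded in
    period i; the result is the series observed in 'trading time'.
    """
    samples = [prices[0]]          # the observation at t = 0
    cumulative, next_mark = 0.0, float(step)
    for price, value in zip(prices, traded):
        cumulative += value
        while cumulative >= next_mark:
            samples.append(price)  # another `step` worth has been traded
            next_mark += step
    return samples

prices = [100.0, 101.0, 99.0]
traded = [25_000.0, 2_000.0, 14_000.0]
samples = sample_by_traded_value(prices, traded)
# a heavy first day yields several samples at 100.0, the quiet day none
```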
In many sets of data there are certain effects that are known to
influence the raw data; examples are holiday periods, strikes and
abnormal weather conditions.
Ways of dealing with these vary greatly. Some are pure intuition, where,
for example, a guessed amount is added to allow for holiday lost sales.
Others are fairly sophisticated methods, where, for example, parameters
are introduced to correspond to the known factors, which are
estimated from the data and then used to adjust the data. The latter
methods are clearly preferable if one is dealing with repeated
occurrences. In unique situations caused by, say, a strike, one is forced
to make a guestimate on whatever grounds seem reasonable. As an
example of the more mathematical approach, a model for ice-cream
sales was expressed as:
Xjk = βtj + τj + δTjk + ejk
where Xjk is the sales in week j of year k, tj is the temperature norm
for that week, τj is a seasonal term and Tjk is the deviation of that
week's temperature from the norm. A set of past data was used to
estimate the parameters. A figure for adjusted sales for normal
temperatures was then found by subtracting the estimated effect of
Tjk, namely δ̂Tjk, from Xjk. Another common example is the use of
seasonally adjusted data for regressor variables in regression forecasting.
One general comment here is that though such adjusted data are often
useful at the data analysis stage, they may not be so useful for
forecasting. To see why this is so, note that the process works in three
stages:
(a) A model for the data, involving the effects of interest, is fitted.
(b) The data is adjusted by eliminating these effects.
(c) The adjusted data is used with an appropriate forecasting model.
These three stages can usually be combined by incorporating the model
for the data, with the effects, in the general forecasting model.
Experience indicates that this more general model will usually provide
better forecasts than the three-stage approach.
yt = log xt
yt = xt − xt−1
The first of these transformations is commonly applied to reduce
percentage changes to absolute changes, to reduce a growth model to a
linear model or to change from a multiplicative to an additive seasonal
model. The forecast of xt+1 is obtained as exp(ŷt+1). As mentioned in the
discussion of growth curves, an unbiased forecast of yt+1 will lead to a
14.1 Introduction
We have discussed thus far a fairly limited number of forecasting
methods. The practicalities of forecasting often pose problems that are
not adequately dealt with by these methods. The situation may change
so rapidly that even our local model-building approach will not work. A
‘fracture’ may occur in which a model parameter suddenly changes its
value. The situation may be highly complex and involve variables other
than time. It may require the use of different approaches at different
stages in the problem. It may be that such new problems will require
new models and new methods. However, experience suggests that it is
worthwhile to look at the methods that we already know and seek to
modify and to extend them to meet the new situations. The aim of this
chapter is to introduce some of the ways in which this is possible.
[Table: an illustrative series with the corresponding values of an adaptive smoothing constant, falling from 0.80 to 0.75 over time.]
x̂t = x̂t−1 + γ(et)et
The smoothing constant is thus made a function of the latest error et.
By choosing the form of the function γ(et) to be like that in Figure
14.1, it is possible for the formula to behave like ordinary exponential
smoothing with a high smoothing constant when et is small. When,
however, large errors occur, the value of γ(et) increases and these errors
have a consequent large effect on the estimated mean. A danger here, of
course, is overreaction and a consequent instability. The original work
of de Bruyn allows for this by using a more elaborate form of the error
correction approach. An alternative is to make the smoothing constant
depend on the whole sequence of past errors, though obviously with
emphasis on the latest values. We thus need to devise a function that
either increases or decreases when the sizes of errors tend to get larger.
One such function, that will be discussed further in Chapter 16, is
defined by
St = smoothed error/smoothed absolute error
The smoothed error gives a measure of bias in the errors. The smoothed
absolute error gives a measure of variability with which to compare the
bias. It can easily be shown that 0 ≤ |St| ≤ 1 and it is clear that |St| will
equal one when all the errors have the same sign. The smoothing
constant can thus be set equal to 1 − |St| or to some multiple of this,
possibly with a restriction to limit |St| below some maximum value.
The restriction is sometimes introduced to stop the smoothing constant
getting unrealistically small. Thus one might have
γt = 1 − c|St| for |St| ≤ Smax
γt = 1 − cSmax for |St| > Smax
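The whole scheme can be sketched in a few lines of Python; b, c and Smax below are illustrative choices, not values from the text:

```python
def adaptive_smoothing(xs, b=0.8, c=1.0, s_max=0.9):
    """Exponential smoothing whose constant falls when errors become biased.

    S_t = (smoothed error)/(smoothed absolute error) lies in [-1, 1]; the
    smoothing constant used is a_t = 1 - c*min(|S_t|, s_max), so a run of
    same-signed errors drives a_t down and the latest observations dominate.
    A sketch of the idea only, not the author's exact scheme.
    """
    level = xs[0]
    se, sae = 0.0, 0.0            # smoothed error, smoothed absolute error
    for x in xs[1:]:
        e = x - level
        se = b * se + (1 - b) * e
        sae = b * sae + (1 - b) * abs(e)
        s = abs(se) / sae if sae > 0 else 0.0
        a = 1 - c * min(s, s_max)
        level = (1 - a) * x + a * level
    return level
```

Fed a series with a sudden step change, the tracking signal saturates, the constant drops to its floor, and the estimate locks on to the new mean within a few observations.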
x̂t = (1 − a)xt + ax̂t−1

xt = μt + Wi + αdt + et
where α is a parameter, that we will assume known, measuring the
increase in sales per degree deviation in temperature. In practice α
would probably vary according to the time of year. As usual et
represents the random variation in the situation. Suppose at day t we
have estimates of μt and Wi, denoted by μ̂t and Ŵi, and also a weather
forecast of dt+1, denoted by d̂t+1. The forecast of xt+1 is then
x̂t+1 = μ̂t + Ŵi+1 + αd̂t+1
When xt+1 and dt+1 have been observed, they may then be used to
update μ̂t and Ŵi+1 using recurrence relations. Consider μt first. From
the model it is seen that if we have last week's estimate of Wi+1,
denoted by Ŵi+1,old, a new estimate of μ is given by
μ̂t+1 = (1 − a)(xt+1 − Ŵi+1,old − αdt+1) + aμ̂t
a being the smoothing constant. This formula, as with all our previous
recurrence forms, enables us to re-estimate μt in a very simple fashion,
and also, being of exponential recurrence form, it will tend to follow
local changes in μt. The same approach may now be used again to
update any of the day of the week effects. Thus from the model,
remembering that μ̂t+1 is now known, the quantity
xt+1 − μ̂t+1 − αdt+1
ât = axt + (1 − a)(ât−1 + b̂t−1)
γ̂t = c(b̂t − b̂t−1) + (1 − c)γ̂t−1
with the forecasting formula
x̂t+h = ât + hb̂t + ½h(h + 1)γ̂t
A further step is obviously to try to go directly to step (c), or rather
to go directly to a forecasting formula without reference to model
parameters at all. One approach is to specify the form of forecasting
formula and study its properties. For example, Ward (1963) shows that
the recurrence relation
x̂t+1 = μ̂t + β̂t
μ̂t = μ̂t−1 + β̂t−1 + (1 − a)et
β̂t = β̂t−1 + cet
where et is the one-step-ahead forecast error and a and c are forecasting
parameters. In section 7.6(b) (iii) we noted that these results could be
obtained by using a recurrence formulation for the trend model itself,
namely
xt = μt + et
μt = μt−1 + βt−1 + γt
βt = βt−1 + δt
the quantities et, γt and δt being independent random variables. It is
found that many of the structures discussed in previous chapters can be
described by models having such a recurrence form. For models of this
form it can be shown that the least squares and discounted least squares
estimators of the model parameters can be expressed in error correction
forms, analogous to the models themselves. Having illustrated this for
the linear trend model, let us give a sketch of this formulation for two
other structures from previous chapters, namely for seasonal and
growth curve structures.
(a) For the seasonal index models of section 9.2.2 a recursive model of
the additive form would be
xt = μt + φt,i + et
μt = μt−1 + γt
with error correction estimator
μ̂t = μ̂t−1 + aet
(b) For a growth curve structure a corresponding recursive model is
xt = μt + et
μt = θ1μt−1 + θ2βt−1 + δt
βt = βt−1
with error correction estimator
μ̂t = μ̂t−1 + γet
As the criterion for the 'best' forecast we will use the minimum
mean square error. Thus, forecasting xt+k by bxt, we wish to minimize
S = E{(xt+k − bxt)²}, which gives
b = C(k)/σ²
Smin = σ² − C(k)²/σ² = σ²(1 − ρ(k)²)
where ρ(k) is the autocorrelation coefficient of lag k. It will be seen
from this that as the autocovariance increases the mean square
error is reduced.
If we take the general linear forecasting formula and carry out
the same calculations as in the example above, we obtain a set of r
simultaneous equations, the normal equations, for the unknown
parameters b1, . . . , br:
Σj bjC(i − j) = C(k + i − 1),  i = 1, 2, . . . , r
where the sum is over j = 1, . . . , r and C(−i) = C(i).
This assumes, as in the example, that E(x) = 0 over time, and that
the autocovariances C(i) depend only on the time intervals i
between the x values and not on the actual time. The study of the
above equations for practical use is a major subject, see Wiener
(1949), Whittle (1965) and Yaglom (1962), and we limit ourselves
to making the following points:
(i) Given knowledge of C(0), C(l), etc., there are many computer
programs that can solve such sets of equations. The symmetry
of the equations helps to make this a relatively fast process
unless r is large.
(ii) In practice we must estimate the autocovariances. For large
values of r this cannot be done very precisely. If an
examination of the estimates of C(0), C(l), etc., indicates that
they have some structure, e.g. C(k) = pk, as for one of the
models discussed in Chapter 7, then this can be used and the
problem greatly simplified. It is, however, a move away from a
purely empirical approach.
(iii) Robinson (1967) suggests ways of solving the equations by
using first only one past observation, r = 1, as in our example,
then two, three, etc. These methods are not only efficient but
may indicate when we have sufficient past values to obtain
reasonable forecasts.
(iv) If E(xt) = μ, for all time, instead of zero, we simply subtract μ
(or our estimate of it) from every x. The forecast of xt+h will
then be our linear forecast obtained from the adjusted
observations plus μ, i.e.
Forecast = μ + b1(x1 − μ) + . . . + bt(xt − μ)
= kμ + x̂t
where x̂t is the linear forecast using unadjusted observations
and
k = 1 − b1 − . . . − bt
(v) The above method and normal equations generalize in a natural
fashion to deal with the analogous multivariate forecasting
problem (see Robinson, 1967).
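Given empirical estimates of the autocovariances, the normal equations can be solved directly; a rough sketch in Python (the function names are ours, and the AR(1) series used to exercise it is simulated):

```python
import numpy as np

def autocovariances(x, max_lag):
    """Sample autocovariances C(0), C(1), ..., C(max_lag) about the mean."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    return np.array([x[: n - j] @ x[j:] / n for j in range(max_lag + 1)])

def linear_predictor(x, r, k=1):
    """Solve sum_j b_j C(i-j) = C(k+i-1), i = 1..r, for b_1, ..., b_r.

    The forecast of x_{t+k} is then b_1 x_t + b_2 x_{t-1} + ... + b_r x_{t-r+1}.
    """
    C = autocovariances(x, r + k - 1)
    A = np.array([[C[abs(i - j)] for j in range(r)] for i in range(r)])
    rhs = np.array([C[k + i] for i in range(r)])
    return np.linalg.solve(A, rhs)

# Exercise on a simulated AR(1) series x_t = 0.8 x_{t-1} + noise:
rng = np.random.default_rng(0)
x = [0.0]
for _ in range(5000):
    x.append(0.8 * x[-1] + rng.standard_normal())
b = linear_predictor(x, r=1, k=1)   # b[0] estimates C(1)/C(0), near 0.8 here
```

For r = 1 this reduces to b = C(k)/σ², as in the example above; for larger r the symmetric system is solved in one call.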
(b) An alternative approach to this problem follows from the simple
observation that the form of the forecast corresponds to a
regression on past observations. The linear forecasting formula gives
the fitted value x̂t+k corresponding to the observed value xt+k. The
regression coefficients b1, b2, . . . , br (plus a constant term b0) can
be chosen by least squares or discounted least squares using the
normal equations of Chapter 6. This enables one to make use of
standard regression programmes that use stepwise regression pro-
cedures to select the significantly non-zero values of b to use. We
would put the values xt+k as the dependent variable and xt,
xt−1, . . . , xt−r+1 as the regressor variables. This approach is
simpler than that of (a) and has proved to be effective in practice
(e.g. Newbold and Granger, 1974). A variation on this, discussed by
Wheelwright and Makridakis (1973), uses a recursive method to
find the b values instead of the normal equations.
15.1 Introduction
It would make life much easier if we could say that the forecasting
formula on page X is the best to use and all the others are of historical
interest. Unfortunately we cannot. One might say that surely knowing
the model the theoretician can derive the best forecasting formula. It is
true that he can, theoretically, derive the best forecasting formula given
some criterion, but in practice there are many snags with this. The
model we use will probably not be exactly right and even if it was
good enough for the past it may not be too good now. The model may
only describe certain aspects of the actual situation; other aspects will
show up in the forecasting errors. Even if all is well we need to have a
forecasting formula that responds well to minor changes that might
occur in the structure of the data. In general, then, we can rarely
choose, for a practical situation, the appropriate method without a
great deal of analysis of the data and a comparison of various possible
methods. The aim of this chapter is to discuss ways of carrying out this
analysis and comparison.
It is possible to approach the comparison of methods in a purely
mechanical fashion; we simply calculate the mean square error for a set
of methods and/or a set of values of the forecasting parameters and
choose the best. However, for effective forecasting the objectives are to
seek the best methods and to understand their properties. It might be,
for example, that a more careful study of the results of the trials would
have led not to choosing that method with the smallest mean square
error but to developing an entirely different method. At the heart of
our study is the forecast error. Most, though not all, of the methods
developed in this chapter will be aimed at analysing the forecast errors
to see if they show any structure of their own. If the forecast errors do
show a clear structure then, as we will see in Chapter 17, we may use
this knowledge to obtain better forecasts or make better use of the
forecasts we have.
221
[Figure 15.2: forecast–observation diagram for changes; points in the four quadrants correspond to overestimating an increase, underestimating a decrease, overestimating a decrease, and forecasting an increase when a decrease actually occurs.]
r(x, x̂) = Σ(x̂i − x̂̄)(xi − x̄)/√{Σ(x̂i − x̂̄)² Σ(xi − x̄)²}
If we had perfect forecasts, x̂i = xi, this would give r(x, x̂) = 1. Thus a
large value of r is an indication of a good forecast. Notice that m and r
are directly related.
denote the forecast change in sales. The quantities zt and ẑt can now be
plotted on a forecast–observation diagram as in (c) above. The
interpretation of the points obtained is as indicated in Figure 15.2(c).
If points lie in the upper left or lower right quadrants, the inference is
that a 'turning point' error has been made; zt and ẑt have opposite signs.
In the other two quadrants the errors are in terms of magnitude of
change but not of direction. A detailed discussion of this type of plot,
together with many examples, is given by Theil (1958).
[Figures: histograms of the forecast errors for lead times h = 0, 1, 2, 3, 4, and a scatter plot of the forecast errors against the forecast values.]
forecasting errors will tend to be greater when the seasonal peaks occur.
This will show up as a greater spread in the errors for large x.
[Figure: (a) direct plot of the errors of two forecasting methods against time over 15 days, Method 1 marked o and Method 2 marked x, with Mondays indicated; (b) the errors of Method 1 plotted against the errors of Method 2.]
rk = correlation between pairs of forecast errors (ei, ei+k)
If we have a good forecasting method, it is reasonable to require that
the forecast errors are independent of each other. This implies that the
expected values of r1, r2, . . . are all zero. To check this we calculate r1,
r2, . . . and plot these as shown in Figure 15.10. Such a plot is called a
[Figure 15.10: correlograms of forecast errors, with the autocorrelation plotted against the lag; (a) a good result: small autocorrelations at all lags.]
[Table 15.2: root mean square error for three methods, I, II and III, at a range of lead times.]
(a) Tabulation
The most obvious step is to tabulate the root mean square error or
the appropriate criteria for the problem. Table 15.2 compares the root
mean square error for three methods for different lead times. Notice
how the root mean square error increases with lead time. A natural step
following this tabulation is to seek to model this increase relating the
root mean square error to the lead time. In Table 15.2 a model giving a
linear relation would not be far out.
[Figure 15.11: the local root mean square error plotted against time.]
Having looked at the local root mean square error, it is natural to see
how its fluctuations are related to the separate local fluctuation in the
bias, E(e), and the standard deviation of error, √Var(e). Figure 15.12(a)
and (b) show such local plots for the same data as that used in
Figure 15.11. It is clear from these that the fluctuation in the root
mean square error is produced almost totally by that in the standard
deviation. The bias is very small and its local fluctuations have little
effect on the root mean square error. Had we been in a different
situation and had large local biases occurring, then we could improve
our forecasts by making some adjustment to remove the bias. We will
examine this possibility in Chapter 17.
Table: mean square error as a function of the smoothing constant a for three sets of data

a        1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1
Set 1     39    40    42    45    49    55     —     —     —     —
Set 2     86    85    87    92    97   103     —     —     —     —
Set 3     46    42    39    36    33    32    31    30    31    32

[Table: mean square error tabulated against the order p = 1, 2, . . . , 6.]
Though there are problems if the errors are not normally distributed,
the basic approach is the same. In Williams and Goodman (1971),
referred to above, the magnitudes of the errors, |e|, were found to give
a good fit to a gamma distribution. This distribution was fitted to the
set of errors and prediction intervals obtained. The observed propor-
tions of forecasts, other than those used to fit the model, that fell
within the intervals was satisfactorily close to the probabilities as fixed
from the gamma distribution. An alternative approach, that makes no
assumptions about the distribution, is based on simply ordering the
errors. Suppose, for example, we were concerned only with estimating a
value for e that had only a probability of 0.1 of being exceeded. In this
case the largest of ten observed errors, or the second largest of twenty
observed errors, etc., would provide a rough estimate based on the
observed errors alone. To use estimates of this form we need to be
fairly confident that the data used to obtain these estimates come from
a stable error distribution and do not contain any outliers.
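A crude sketch of the ordering approach in Python; the error values and the helper name are invented for illustration:

```python
def empirical_limit(errors, p=0.1):
    """Estimate the |error| value exceeded with probability p, by ordering.

    With ten errors the largest is a rough estimate of the 0.1 point;
    with twenty, the second largest, and so on. A crude estimate that
    assumes a stable error distribution free of outliers.
    """
    abs_errors = sorted((abs(e) for e in errors), reverse=True)
    k = max(int(round(p * len(abs_errors))), 1)
    return abs_errors[k - 1]

errors = [1.2, -0.4, 0.8, -2.1, 0.3, 1.7, -0.9, 0.5, -1.1, 0.2]
limit = empirical_limit(errors, p=0.1)   # largest of the ten: 2.1
```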
244
could be done with some effort if all the forecast errors were
independent of each other. However, in practice this is unlikely to be
the case. In the above situation there is very little experience of
forecasting and the costs involved are sufficiently high to discourage
much purely experimental work. A natural reaction is to simply ignore
the error analysis side of forecasting and hope that the forecast values
will be reasonably good. There is clearly no hope of giving confidence
intervals for the forecast of st. While this may all be true it is still
nonetheless possible to investigate the forecasting method and say quite
a lot about the behaviour of the forecasts. On line 1 of Table 15.5 we
give the forecast values of all the variables. On line 2 the values of ht−1,
at−1 and lt−1 have all been set at what in the practicalities of the
situation are thought to be the most extreme values in terms of making
nt large. These are perhaps our guesses at the 95 per cent confidence
limits. The values of pt and rt are left alone. On line 3 we take the line
1 value of ht but let pt and rt take on low values. The sizes of the
schools involved suggest only small drops in p and r are likely and also
puts clear upper values on them. On line 4 we take these low values
with the high value of nt of line 2. By this type of exercise we can begin
to appreciate the effects on the final answer of variations in the
forecasts. If one is willing to make experimental assumptions about the
distributional behaviour of the forecasts, e.g. that ht−1, at−1 and lt−1 are
independent with means as on line 1 and standard deviations 100, 50
and 10, respectively, then we can go even further. Clearly, ht will have
mean 2,350 and standard deviation
√(100² + 50² + 10²) = √12,600 ≈ 112
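The calculation generalizes to any number of independent components; a trivial sketch:

```python
import math

def combined_sd(sds):
    """Standard deviation of a sum of independent components."""
    return math.sqrt(sum(s * s for s in sds))

# The school-places illustration: independent components with
# standard deviations 100, 50 and 10.
sd = combined_sd([100.0, 50.0, 10.0])   # sqrt(12600), about 112.2
```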
References
Davis, R., and Huitson, A. (1967). A sales forecasting comparison. The Statistician, 17, 269–237.
Godolphin, E. J., and Harrison, P. J. (1975). Equivalence theorems for polynomial-projecting predictors. J. Roy. Statist. Soc. B, 37, 205–215.
Granger, C. W. J., and Newbold, P. (1973). Some comments on the evaluation of economic forecasts. Applied Economics, 5, 35–47.
Groff, G. K. (1973). Empirical comparison of models for short range forecasting. Management Science, 20, 22–30.
Marchant, L. J., and Hockley, D. J. (1970). A comparison of two forecasting techniques. The Statistician, 20, 35–64.
McKenzie, E. (1974). A comparison of standard forecasting systems with the Box–Jenkins approach. The Statistician, 23, 107–116.
Newbold, P., and Granger, C. W. J. (1974). Experience with forecasting univariate time series and the combination of forecasts. J. Roy. Statist. Soc. A, 137, 131–146.
Theil, H. (1958). Economic Forecasts and Policy. North-Holland Publishing Co., Amsterdam.
Theil, H. (1965). Applied Economic Forecasting. North-Holland Publishing Co., Amsterdam.
Wagle, B., Rappoport, J. Q. G. H., and Downes, V. A. (1968). A program for short-term sales forecasting. The Statistician, 18, 141–145.
Williams, W. H., and Goodman, M. L. (1971). Constructing empirical confidence limits for economic forecasts. J. Amer. Statist. Assoc., 66, 752–755.
Chapter 16
Forecast control
[Figure: control chart of forecast errors plotted against time, with upper and lower control limits marked.]
where |et| denotes the absolute (signless) value of et. The smoothing
constant is the same as that used for ēt, for reasons that will become
apparent later. It can be shown that for a normal distribution
E(|et|) = √(2/π)σe
and Brown (1959) shows that for the normal, exponential, rectangular
and triangular distributions this relation holds to a good approximation.
The tracking signal is then
Tt = ēt/MADt
To obtain the control limits for Tt we note that, using the 95 per cent
control limits and the relations derived above, the situation is assumed
to be in a state of equilibrium when
If T falls outside these limits then there is a suggestion that all is not
well. We do not put it more strongly than this, since 5 per cent of
observations will by chance fall outside these limits even when all is
well. One assumption behind this calculation is that MADt gives a good
approximation to σe. This is only reasonable when t is large and b is
close to one.
A problem associated with the use of this tracking signal is that both
ē0 and MAD0, the initial values in the recurrence relations, need
specifying before the data e1, e2, . . . become available. A reasonable value
for ē0 would be zero and for MAD0 would be √(2/π)σ0, where σ0
If
x̂t = (1 − γ)xt + γx̂t−1
then the variance of x̂t is
σx̂² = σx²(1 − γ)/(1 + γ)
assuming independent observations with locally constant mean and
variance σx². Hence the error variance is given by
Var(et) = Var(xt − x̂t−1)
= σx² + σx²(1 − γ)/(1 + γ)
= 2σx²/(1 + γ)
By carrying out this analysis we have not gained any information but
have simply replaced the requirement for a guess at MAD0 by a guess at
σx², which is usually more within one's intuitive grasp of the situation.
If by comparing the variable being forecast with other similar variables
we estimate σx by σ̂x, then the initial value for MAD is
MAD0 = 2σ̂x/√{π(1 + γ)}
A better solution to this problem is to defer the commencement of the
scheme until enough actual data are available to provide an estimate of
the mean deviation of errors and to use this value, i.e.
MAD0 = Σ|e|/(number of observations available)
An exact form of the tracking signal can be written directly in terms of
the available data:
T′t = Σbr et−r / Σbr|et−r| = St(e)/St(|e|)
where the sums run over r = 0, 1, . . . , t − 1 and St( ) represents the
exponentially weighted sum, e.g.
St(z) = Σbr zt−r
The advantage of this form for the tracking signal is that there is no
requirement for starting values in recurrence relations; T′_t depends only
on the data available and not on our estimates. Strangely enough,
though this is the 'exact' form it is easier to calculate, since quantities
like S_t(z) have a very simple recurrence form, namely

S_t(z) = z_t + bS_{t−1}(z)

The starting value for this recurrence relation is simply S₀(z) = 0, so that
S₁(z) = z₁. Obviously if t is large then T′_t and T_t are equivalent.
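As a minimal sketch of the recurrence just described (the function name and interface are illustrative, not from the text), the exact tracking signal can be computed as:

```python
def tracking_signal(errors, b):
    """Exact tracking signal T'_t = S_t(e)/S_t(|e|), using the
    recurrence S_t(z) = z_t + b*S_{t-1}(z) with S_0(z) = 0."""
    s_e = 0.0      # S_t(e): weighted sum of errors
    s_abs = 0.0    # S_t(|e|): weighted sum of absolute errors
    signals = []
    for e in errors:
        s_e = e + b * s_e
        s_abs = abs(e) + b * s_abs
        signals.append(s_e / s_abs)  # lies in [-1, 1] by construction
    return signals
```

Note that a first error of exactly zero would make S₁(|e|) zero; in practice the signal would only be read once a few errors have accumulated.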
Let us examine the properties of T_t and T′_t in a little more detail.
Suppose that the process was completely out of control from the start
so that all the errors had the same sign. If the sign was positive there
would be no difference between e₁, e₂, . . . , e_t and |e₁|, |e₂|, . . . ,
|e_t|, and hence between ē_t and MAD_t, so T_t = 1. If the sign was
negative then e_i = −|e_i|, i = 1, . . . , t, and so T_t = −1. As these are the
two extreme possibilities it is clear that T_t, and equivalently T′_t, must lie
in the interval (−1, 1). This useful condition is the result of using b as
the smoothing constant for both ē_t and MAD_t.
In the derivation of the control limits it was pointed out that ē_t was
assumed to be normal and MAD_t was treated as a constant; this implies
the assumption that T is normal, which cannot strictly be the case with T
lying between finite limits. The validity of a normal approximation for T
must be checked, and it is probably easiest to do this by simulation. Thus we
may generate a long sequence of random variables e having the sort of
property we expect from our actual situation when it is stable. The
sequence of T values calculated from the e values is then examined and
a cumulative frequency polygon constructed. From this we can read off
values of T which will only be exceeded with any required probability.
A discussion of the distribution of T is given in Brown (1963).
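The simulation check described above might be sketched as follows (pure Python, standard library only; all names are illustrative):

```python
import random

def t_quantile(b, n_steps, n_reps, prob, seed=0):
    """Monte Carlo approximation to the value of the tracking signal
    exceeded with probability `prob` when the process is stable
    (independent zero-mean normal errors)."""
    rng = random.Random(seed)
    finals = []
    for _ in range(n_reps):
        s_e = s_abs = 0.0
        for _ in range(n_steps):
            e = rng.gauss(0.0, 1.0)
            s_e = e + b * s_e            # S_t(e)
            s_abs = abs(e) + b * s_abs   # S_t(|e|)
        finals.append(s_e / s_abs)
    finals.sort()
    return finals[int((1.0 - prob) * len(finals))]
```

Reading off, say, the 5 per cent point gives a control limit for T that does not rely on the doubtful normal approximation.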
It is difficult to detect small changes in the mean error using the type of
control chart discussed in section 16.1. One way of improving the
sensitivity is to base the control chart on the averages of several errors
instead of on the individual errors. If we could only guess where a
change in mean had occurred, then the average of all the errors
obtained since the change would be most sensitive in indicating that the
change had occurred. Unfortunately, we do not know when the change
occurred, if it did, so we do not know how many observations to
include in our average. One way of avoiding this problem is to have a
whole battery of control procedures based on the averages of all past
sets of errors, one using just the last error, one using the average of the
last two, one the average of the last three, and so on. Each of these
averages will be the most sensitive available for detecting a sudden
change in the underlying mean error, provided the change in mean
occurred immediately before the earliest error used in that particular
average. If we have say 100 observations, this process effectively
involves the construction of 100 control procedures. This is obviously
impractical but the basic idea does lead to some very useful methods of
control. The first modification required to obtain a practical method is
to replace the average errors by the sums of errors. These are easier to
calculate and are easily related to each other. Thus the quantities used
as the basis for the procedure are
S₁ = e_t
S₂ = e_t + e_{t−1}
S₃ = e_t + e_{t−1} + e_{t−2}
. . .

These sums are compared with control limits L₁, L₂, . . . , L_t, where

L_t = Wσ(t + h)
where σ is the standard deviation of the errors and W and h are the
parameters to be chosen. To choose these parameters, simulation
methods can be used. Thus we generate two sets of data from a normal
distribution, or whatever distribution is appropriate, with the given σ
and means zero and d, where d is a change in the mean of a magnitude
that should be detected fairly rapidly, e.g. d = 1.5σ. These data are
then used in the control scheme for a range of values of W and h. The
measure of control of clearest meaning here is the average run length
(ARL), which is the average number of observations after a change in
mean before the control system indicates a possible lack of control. If a
real set of data giving forecasts with average error close to zero is
available, these data and the data with d added to each observation can
be used in place of the simulated sets of data. Results from such a
simulation are given in the ICI monograph referred to above. By way of
example, suppose W = 0.4 and h = 3.4. Then when the situation is
under control the average run length is 9, so that with monthly data we
will unnecessarily review the situation once in 9 months. When,
however, a change in mean of magnitude 1.5σ occurs, a lack of control
will be indicated after about 2 observations.
The above describes the principle of the method, but for practical
operation a method can be devised by which only four quantities need
be recorded to use in the scheme. This surprising point is made in the
paper by Harrison and Davies (1964). The essence of the scheme is that
in looking for points which might cross the control limits one need only
look at the sum which is closest to the limits. A recurrence relation is
devised which enables this closest point to be identified at each stage,
without having to investigate points which are nowhere near the limits.
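The backward-sums principle can be sketched directly (a simple linear limit L_i = Wσ(i + h) is assumed, as in the text; the Harrison and Davies four-quantity recurrence itself is not reproduced here):

```python
def cusum_check(errors, sigma, W, h):
    """At the latest time point, form the backward sums
    S_1 = e_t, S_2 = e_t + e_{t-1}, ... and flag a possible change
    in mean if any |S_i| exceeds the limit L_i = W*sigma*(i + h)."""
    s = 0.0
    for i, e in enumerate(reversed(errors), start=1):
        s += e
        if abs(s) > W * sigma * (i + h):
            return True    # some sum has crossed its limit
    return False
```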
If the successive errors are independent, this has an expected value of 2σ².
The error variance σ² can be estimated by

s² = (1/n) Σ_{t=1}^{n} (e_t − ē)²

or, using discounted data,

s²_t = Σ_{r=0}^{t−1} b^r (e_{t−r} − ē_{t−r})² / Σ_{r=0}^{t−1} b^r
References
Brown, R. G. (1959). Statistical Forecasting for Inventory Control. McGraw-Hill,
New York.
Brown, R. G. (1963). Smoothing, Forecasting and Prediction of Discrete Time
Series. Prentice-Hall. Englewood Cliffs, New Jersey.
Coutie, G. A., Davies, O. L., Hassall, C. H., Miller, D. W. G. P. and Morrell, A. J. H.
(1966). Short Term Forecasting. ICI Monograph No. 2. Oliver and Boyd,
Edinburgh.
Harrison, P. J. and Davies, O. L. (1964). The use of cusum techniques for the control
of routine forecasts of product demand. Operations Research, 12, 325–333.
Trigg, D. W. (1964). Monitoring a forecasting system. Op. Res. Quarterly, 15,
271-274.
Chapter 17
Two-stage forecasting
17.1 Introduction
There are three basic aspects of applied forecasting: first, that of
analysis of past data and the construction of models for this; second,
the analysis of the use to which the forecast is to be put and again the
appropriate model building; and third, the bringing together of the two
to obtain the best forecasting method for the given application. Having
said this we must acknowledge that this is the ideal. At the opposite
extreme to the ideal, data are poured into one end of a computer
manufacturer’s forecasting program and whatever comes out at the other
end is used as the forecast. There are several factors that tend to
encourage people to take an approach that is less than ideal.
[Figure 17.1 (a) Ideal forecasting: data feed a Stage I forecast, based on
past statistical information; external information, other forecasts and
specific criteria then feed a Stage II adjustment, which produces the
final forecast.]
z_{t+h} = e_{t+h} − ê_{t+h}

by substituting from the above equation. It is clear that we will obtain
reduced forecast errors provided the forecast ê_{t+h} is a reasonable
forecast of e_{t+h}. There are usually two features that we would like to
improve by the use of ê_{t+h}. Firstly, we would like to reduce any bias
that there may be in the stage I forecasts. We can do this completely if
the forecast ê_{t+h} is such that

E(ê_{t+h}) = E(e_{t+h})

for then E(z_{t+h}) = 0. Secondly, we will want to reduce the forecast
mean square error, or variance if the forecasts are unbiased. Consider
the latter case. Denote the variances of z_{t+h}, e_{t+h} and ê_{t+h} by σ²_z, σ²
and k²σ², respectively, and the correlation between e_{t+h} and ê_{t+h} by
ρ. Then

σ²_z = σ² + k²σ² − 2σ·kσ·ρ = σ²(1 + k² − 2kρ)
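A step worth making explicit: from this variance expression, the stage II correction only helps if the variability of the error forecast, k, is small relative to its correlation ρ with the true error.

```latex
\sigma_z^2=\sigma^2\bigl(1+k^2-2k\rho\bigr)<\sigma^2
\;\Longleftrightarrow\; k^2<2k\rho
\;\Longleftrightarrow\; k<2\rho \qquad (k>0).
% Differentiating with respect to k:
% \frac{d\sigma_z^2}{dk}=2\sigma^2(k-\rho)=0 \;\Rightarrow\; k=\rho,
% giving the best attainable \sigma_z^2=\sigma^2(1-\rho^2).
```

So the error forecast must be both well correlated with the true error and not too variable; at best, with k = ρ, the error variance is reduced by the factor 1 − ρ².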
Example 1
The method discussed here was developed to provide forecasts of the
number of telephone calls arriving at a switchboard during different
half-hourly periods during a working week. The total time required to
answer these calls was also forecast. The data showed very large
systematic variations during the day and also between the days of the
week. There were also odd days where, for unforeseeable reasons, the
load was very heavy. The usual way to forecast such a situation might
be to fit some trigonometric model to the data, perhaps with the use of
some discounting procedure. Alternatively, some modified form of the
classical methods for seasonal adjustment could have been used.
However, both these methods would have required considerable
computation. The system required had to be used by someone without
any mathematical ability, and it had to give an answer by hand with
something like a minute’s effort. It had also to be very sensitive to pick
out the odd day with extra high or low loads. In addition it was
necessary for the system to follow changing patterns in the distribution
of load within the week. Such changes occurred sometimes during
different seasons of the year. Within this context any attempt to find
the best models or forecasting formulae with smallest mean square error
would have inevitably led to methods that were too complicated to
work. Instead, attention was focused on modifying the simplest
methods available.
The method devised was based on the technique of exponential
smoothing in its simplest form. As this technique was not appropriate
to data with periodic form, it was applied in two stages. An advantage
of the method was that the preliminary stage I forecast could be made a
week ahead and the final adjustment made on the actual day an hour in
advance of the half-hour being forecast. The method used in stage I was
to forecast separately for each of the 85 half-hourly periods in the
working week, using for each period only data from that period in
previous weeks. Thus exponential smoothing was applied to 85 separate
series. The nomogram given in Chapter 4 was used to simplify the
practice.
This method gave adequate forecasts most of the time. There were,
however, occasional days when the load was particularly high. This
produced a systematic positive bias in the forecast errors on such a day.
This bias was forecast one hour in advance by exponentially smoothing
the forecast errors from stage I as they arose during the day. Thus if
x̂₁₀ was the stage I forecast, made the previous week, of the tenth
half-hour in the day and ē₈ was the exponentially smoothed error
obtained at the eighth half-hour, then at that time a revised forecast
would be

ŷ₁₀ = x̂₁₀ + ē₈
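The revision rule of this example can be sketched as follows (function and argument names are illustrative; the smoothing convention matches the form x̄_t = (1 − γ)x_t + γx̄_{t−1} used earlier in the book):

```python
def revised_forecast(stage1, errors_so_far, gamma):
    """Stage II revision: exponentially smooth the stage I errors
    observed so far today and add the smoothed bias to the stage I
    forecast made the previous week."""
    e_bar = 0.0
    for e in errors_so_far:
        e_bar = (1 - gamma) * e + gamma * e_bar
    return stage1 + e_bar
```

On a normal day the smoothed errors stay near zero and the stage I forecast passes through almost unchanged; on an unusually heavy day the accumulating positive errors pull all later forecasts for that day upwards.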
Example 2
A large firm produced at the end of each year a plan of sales for each
week of the following year. This plan was based on fitting a linear
regression model to log sales data and then making various adjustments
for holidays, etc. A study of several years' data indicated that this was a
satisfactory model. The use of log sales in the model suggested that
errors be defined as percentage deviations from plan. Thus if p_t is the
planned sales for week t, x_t the actual sales and d_t the percentage
deviation, then by definition

x_t = p_t(1 + d_t/100)
A study of past values of d_t indicated that in any year there were runs
of deviations above and below zero and that there was an overall bias
away from zero. If we manage to forecast the future value, d_{t+h}, by
d̂_{t+h}, then it is reasonable that a revised plan will be given by

y_{t+h} = p_{t+h}(1 + d̂_{t+h}/100)

[Table: worked comparison, columns (1)–(6), of actual sales, the
original plan, the deviations and the revised plan; the mean absolute
error was reduced from 17.82 to 12.76.]
β̂ = Σx̂x / Σx̂²

Thus the second-stage forecast is

ŷ = β̂x̂
Example 1

Stage II forecast = x̂,      if x̂ ≤ x_max
                  = x_max,  if x̂ > x_max

A similar situation may arise where a lower limit x_min is set, so that

Stage II forecast = x̂,      if x̂ ≥ x_min
                  = x_min,  if x̂ < x_min

or, alternatively, limits might be put at both upper and lower ends of
the expected range of the forecast value. These are shown graphically in
Figure 17.5(a) and (b).
Example 2

Stage II forecast = x̂,                    if x̂ ≤ x_max
                  = θx̂ + (1 − θ)x_max,    if x̂ > x_max

The parameter θ (0 ≤ θ ≤ 1) would be made small if strong reliance was
put on the expert opinion rather than the forecasting method. If the
converse, θ would be put close to 1.
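Both styles of limit adjustment can be sketched in a few lines (names are illustrative):

```python
def clamp(x_hat, x_min=None, x_max=None):
    """Hard limits (Figure 17.5 style): truncate the stage I
    forecast at x_min and/or x_max."""
    if x_max is not None and x_hat > x_max:
        return x_max
    if x_min is not None and x_hat < x_min:
        return x_min
    return x_hat

def shrink_above(x_hat, x_max, theta):
    """Partial reliance: above x_max, weight the statistical
    forecast by theta and the expert limit by (1 - theta)."""
    if x_hat <= x_max:
        return x_hat
    return theta * x_hat + (1 - theta) * x_max
```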
Example 3
the most natural way of combining them. Thus the new forecast is
y = kx₁ + (1 − k)x₂
The new forecast y will be unbiased and will have a minimum mean
square error for suitably chosen k. If we consider the forecast errors of
the three forecasts, it is clear that
e_y = ke₁ + (1 − k)e₂
using an obvious notation. If the forecast errors of x1 and x2 are
independent of each other, then E(e²_y) is a minimum when

k = E(e₂²) / {E(e₁²) + E(e₂²)}
(see Bates and Granger, 1969). In practice we would replace this by

k = Σe₂² / (Σe₁² + Σe₂²)
[Table: observations, Forecast 1, Forecast 2, and the two combined
forecasts.]
locally made forecasts would be moderately good, but for longer lead
times the p per cent, would be the best figure to use. It was also
assumed a priori that growth rates for all regions should be positive.
Using these assumptions the central authority adjusted all the regional
forecasts. Denoting the regional forecasts of the percentage annual
growth for a lead time of h years by xh and the central authorities
combined forecast by yh , the formula for the adjustment was
. fP if xh < 0
h
\ ahxh + (1 — ah )p if xh > 0
where a is some constant in the range (0,1). It is seen that this formula
is similar to those above with k replaced by ah . The effect of this form
for k is that for large lead times yh tends towards the value p. This
adjustment procedure, with values of a based on trials with past
forecasts, produced very large reductions in the average of the mean
square errors over all the regions.
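The combination rules of this section reduce to a few lines (a sketch; the weight uses the mean square errors directly, as in the Bates and Granger formula above, and the names are illustrative):

```python
def combine(x1, x2, mse1, mse2):
    """Combine two unbiased forecasts with independent errors:
    k = mse2/(mse1 + mse2), y = k*x1 + (1 - k)*x2.
    Returns the combined forecast and the weight k."""
    k = mse2 / (mse1 + mse2)
    return k * x1 + (1 - k) * x2, k
```

The better forecast (smaller mean square error) automatically receives the larger weight.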
Example 1
y_t = x̂_t + α

[Figure: asymmetric cost functions of the forecast error, with zero cost
between −r and s, and the shift α applied to the stage I forecast; the
stage I and stage II error distributions are shown.]
(a − b)φ((α − r)/σ) + (b − c)φ(α/σ) + (c − d)φ((α + s)/σ) = 0

where φ(u) is the ordinate of the standard normal curve. It will be seen
that if the cost function is symmetrical, with a = d, b = c and r = s, the
solution of the equation is α = 0, as we would expect. For the
non-symmetrical case with b = c = 0 the equation becomes

φ((α − r)/σ) / φ((α + s)/σ) = d/a
Range of u                        Probability
u < (α − r)/σ                     Φ((α − r)/σ)
(α − r)/σ < u < α/σ               Φ(α/σ) − Φ((α − r)/σ)
α/σ < u < (α + s)/σ               Φ((α + s)/σ) − Φ(α/σ)
(α + s)/σ < u                     1 − Φ((α + s)/σ)
Example 2
From the distribution of the stage I error we can find the distribution
of x_t conditional on x̂_t. For example, x_t might be distributed as N(x̂_t, σ²).
Thus we can find P as a function of both y_t and x̂_t. We may then choose
y_t to minimize P. If we denote the distribution of x_t given x̂_t by f(x_t | x̂_t),
this exercise gives

f(by_t | x̂_t) / f(ay_t | x̂_t) = a/b

as the relation giving minimum P. As an example, if the stage I forecast
a = 1.1, b = 0.9

x̂_t     σ²     y_t        x̂_t     σ²     y_t
10      10     16.20      100     1      100.10
20      10     24.15      100     4      101.58
50      10     51.93      100     5      110.24
100     10     100.99     100     50     208.38
500     10     500.20     100     100    366.76

a = 2.0, b = 0.5

x̂_t     σ²      y_t
100     1       80.00
100     50      80.46
100     200     81.80
100     1000    88.37
100     2000    95.48
100     4000    107.50
errors are normally distributed with zero mean and variance σ², the
left-hand side of the above equation becomes
References
Bates, J. M. and Granger, C. W. J. (1969). The combination of forecasts. Op. Res.
Quarterly, 20, 451–468.
Chapter 18
Problems in practice
18.1 Introduction
The aim of this chapter is to discuss some problems that arise when
forecasting is used in practice. In this discussion we will not attempt
anything like a full coverage of the field. Instead, we will examine a
number of selected examples. The examples have been chosen to
illustrate one main theme. This theme is that it is incorrect to regard
forecasting as an exercise that can be done on its own without reference
to the use to which we put the forecasts. All too frequently this is
exactly what is done. Once a forecast has been produced, it is regarded as a
number that can be used for any situation where a forecast is required.
We might say that in such a case the forecast and application are
'decoupled’ from each other. Such decoupling often occurs where a
firm has a ‘forecasting program’ on its computer used simply to
generate forecasts which are then given to anyone who needs them.
There are several possible sources of trouble in adopting this type of
approach to forecasting. These are discussed and illustrated in this
chapter. In summary, incorrect decoupling of forecast and application
can lead to:
x̄_t = (1 − a) Σ_{r=0}^{∞} a^r x_{t−r}    (1)
on the previous week's order P_{t−1} minus the amount drawn x_t. Hence

σ²_s = σ² + 4 Var(x̄_{t−1})
[Table: worked example over weeks t = 0 to 6.]

D_t = S₀ − S_t    (8)

D_t = x_t − 2x̄_{t−1}

Substituting the model for x_t in x̄_t gives
Figure 18.2 shows a rough sketch of this type of curve. It is seen that
there is a maximum depletion Dm in stock after a time tm (to the
nearest integer), but that after this the stock in fact starts to increase
and the system overcompensates for the trend. In theory this looks bad,
but in practice the type of trend we are considering here is simply a
local phenomenon. If it appears to have become a permanent part of the
situation, the forecasting formula would be changed to a trend-
correcting form. The intuitive explanation of the form of Dt is that
initially in the transient state, the forecast does not follow the trend, so
the orders do not keep up with the demand. After a time the forecast
increases at the same rate as the trend but with a fixed lag behind it.
Though in this steady state the forecasts are lower than the sales, the
occurrence in the ordering rule of equation (5) of the term
2(x̄_t − x̄_{t−1}), which has expectation 2β, means that the orders will on
average be for more than next week's sales and thus, in the steady state,
there will be a continual increase in stocks. The calculation of D_m from
D_t is best done by evaluating D₁, D₂, etc., until the largest value is
found. If there were no random variation in sales and we could
reasonably guarantee that no local trends would have a slope greater
than β, then a safety stock greater than D_m would ensure that no 'stock
out’ is likely to occur.
If we now allow for both random variation and the possibility of a
trend, we see that it is reasonable to set the safety stock at 2σ_s + D_m.
Referring to equations (7) and (9) it is clear that both these quantities
depend on the choice of a: if a is increased then σ_s decreases and D_m
increases, and vice versa. Thus there is, in fact, a value of a for which
the safety stock is a minimum. This value, for example, is approximately
0.95 for σ/β = 10. We thus have here an example where a forecasting
parameter is chosen, not with reference to getting best forecasts in any
mean square error sense but with reference to the particular application
in which the forecast is used.
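The depletion curve D_t can be explored numerically. The sketch below assumes the ordering rule P_t = x̄_t + 2(x̄_t − x̄_{t−1}), the smoothing form x̄_t = ax̄_{t−1} + (1 − a)x_t, a one-week delivery delay, and a noise-free demand ramp of slope beta; these dynamics are an illustration, not the book's exact worked example:

```python
def max_depletion(beta, a, n=200):
    """Track the stock deviation when demand starts rising with
    slope beta: each week last week's order arrives and this
    week's demand is withdrawn. Returns D_m, the largest depletion."""
    level = 100.0
    xbar_prev = xbar = level
    order = level          # steady-state order before the trend starts
    stock = 0.0            # deviation from the initial stock level
    worst = 0.0
    for t in range(1, n + 1):
        demand = level + beta * t
        stock += order - demand                    # receipts less demand
        xbar_prev, xbar = xbar, a * xbar + (1 - a) * demand
        order = xbar + 2 * (xbar - xbar_prev)
        worst = min(worst, stock)
    return -worst
```

Evaluating this over a grid of a values is one way of locating the smoothing constant that minimizes the required safety stock, in the spirit of the Trigg and Pitts (1963) reference.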
P_t = x̄_t + 2(x̄_t − x̄_{t−1})

x̄_t = ax̄_{t−1} + (1 − a)x_t

we then have

x̄_t = cx_t

P_t = x̄_t + 2c(x_t − x_{t−1})
Thus if x is in fact increasing P_t > x̄_t, irrespective of c, and if x is
decreasing P_t < x̄_t. The consequence of this is that if sales are
increasing the manufacturer will overestimate, in his forecasts, the
total amount to be sold ultimately to the public. If sales are
decreasing he will underestimate. The amount of overestimation or
underestimation will increase with the rate of change of x. An
example of this type of behaviour is given in the paper by Kuehn
and Day (1963), where the term 'acceleration effect' is used to
describe the phenomenon.
[Figure: effects of the ordering rule: (a) randomness amplified;
(b) trends exaggerated; (c) oscillations amplified; (d) change delayed.]
Model
Prob(D = 1) = 0.6
Prob(D = 2) = 0.4
The expected cost if C = 1 is 10 × 0.6 + 10 × 0.4 = 10,
and if C = 2 is 5 × 0.6 + 20 × 0.4 = 11.
Table 18.2 Maximum loads for Prob(delay) ≤ 0.01. Units are 100 seconds of
operator contact per half-hour for the Erlang C model

Number of operators M    1     2     3     4      5      6      7      8
Maximum load θ_M         0.1   2.6   7.7   14.4   22.5   31.5   41.2   51.7

Number of operators M    9     10    15    20     25     30     40     50
Maximum load θ_M         62    73    133   197    265    335    479    627
[Table 18.3: probabilities of staffing errors for forecast accuracies
r = C/σ = 0.5, 0.75, 1.0.]
of the table. In the type of application in which this problem arose this
corresponds to quite a good forecast, and yet it gives a probability of
0.24 of being understaffed by one. With larger forecast variability
considerably worse errors are seen to occur.
As the probabilities of being understaffed and overstaffed are
equal, one might think that no great damage has been done.
Unfortunately, overstaffing improves the system very little whereas
understaffing has disastrous effects on the quality of service given. For
example, for the case where we scheduled six operators and had a load
of 35 units, which required seven operators, the delay probability is
0.03. If the load had been 50 units, this would go up to 0.07. Thus
though the probabilities of error in the numbers of operators are
balanced between understaffing and overstaffing, the effect on the
probability of delay is to increase it above the value of 0.01 used in the
original table.
The fact that the load is forecast rather than known completely
changes the meaning of the tables. Table 18.4 gives approximate values
for the actual delay probabilities when forecast values of load are used
in Table 18.2. The basis for the derivation of Table 18.4 was the use of
the probabilities of Table 18.3 together with information from
queueing theory on the effects of under- and overstaffing. It is clear
from Table 18.4 that if one has a small load for which two or three
operators are adequate, then large forecasting errors can lead to very
poor service. One way to improve service is obviously to improve the
forecasts. Whether or not this can be done, the obvious answer to the
difficulties that are produced by the use of forecasts in the tables is to
use the forecasts in a different way to the way one would use the true
load if it were known. One simple method of doing this would be to
use, in the tables, not the forecast but an adjusted forecast. The
adjusted forecast would be larger than the original forecast, thus cutting
down on the probability of understaffing and putting up the probability
of overstaffing; this in turn will decrease the actual probability
of delays occurring.
We have discussed above a situation where a forecast, θ̂, is used to
choose between a number of alternatives, M = 1, M = 2, etc. The basic
objective is to use the forecast to make the best decision M in terms of
some criterion, here the delay. We have discussed this in some detail as it
is an example of a very common type of forecasting situation.
In summary, the lesson of this example is that we cannot treat
forecasts in the same way as we would treat the true values. The effect
of the random variation in forecasts can, however, be studied and
appropriate adjustments made once its effects are understood.
Before leaving this example, it is perhaps worth noting the possibility
of extending to other applications the use of the type of information
shown in Figure 18.5. Suppose, for example, that a study of a forecast
Table 18.4 Approximate actual probabilities of delay when using the tables

Number of operators      r = C/σ: 0.5    0.75    1.0
[table entries not recovered]
the forecaster but because of the nature of the situation being dealt
with. Let us briefly consider some examples of these effects.
(a) When a new product is launched, sales figures may show a nice
steady increase. One is tempted then to forecast a continuing
increase. There have been cases where this has been done only to
find a very sharp levelling off at a later stage. The explanation has
been that there has been a potential market for say 1,000 sales per
week, but when the item was first introduced the production
capacity was only geared to 500 a week. The production capacity
has then been increased steadily. Thus until the 1,000 items/week
figure was reached the sales equalled the production and forecasts
based on the trend were naturally correct.
(b) The use of forecasts in setting salesmen’s targets produces some
interesting phenomena. Suppose the forecasts were made a target
and no incentive was provided for passing the target, though one
was provided for reaching the target. In this case one would expect
the salesman’s sales to equal the target and hence the forecast.
Alternatively, if there was a fixed pay up to the target and good
bonuses for exceeding the target, then there is a fair chance that
the forecast will be exceeded. In essence one really needs to
distinguish in this situation between a forecast of the natural
market, the target given the salesmen and the forecast of the future
sales, allowing for the influence of the target and the pay structure.
(c) In (b) we have a situation where the value forecast actually
influences the future sales. Another example would be where the
sales staff takes one look at the disastrous forecast of sales and
takes drastic action to change the situation. In this case any
resulting forecast error is not a measure of bad forecasting but
partly of the success or otherwise of the actions taken.
(d) An interesting example is given by Forrester (1961) which, though
based there on a simulation experiment, certainly happens in
practice. This is the situation where an initial use of a seasonal
Throughout this book we have used the concepts of global and local
models (e.g. section 2.3). These ideas emerge from the consideration of
stability as underlying the approach of statistical forecasting. If we can
assume that we have correctly identified the model, that it is stable or
that, if it does undergo a sudden change, we can identify and use the
new model rapidly, then a global use of our model is clearly the best
approach. If, however, we judge the situation to be subject to slow
evolution in time, either on the basis of past data or knowledge of the
environment in which the data will arise, then we can regard the model
as an approximation that is fitted locally for forecasting purposes (e.g.
Chapters 4 and 5).
Having decided on the model and the approach to its use, the
method of obtaining a forecast is usually fairly clear, as we have
illustrated in Chapters 4 to 11 and 14. A general discussion of methods
was given in Chapter 12. Often with experience we will seek to extend
and modify our initial methods (section 14.2 and Chapters 17 and 18).
References
Forrester, J. W. (1961). Industrial Dynamics. MIT Press, Cambridge, Mass.
Kuehn, A. A. and Day, R. L. (1963). The acceleration effect in forecasting
industrial shipments. Journal of Marketing, 27, 25–28.
Leicester, C. S. (1963). The cause of the shoe trade cycle. British Boot and Shoe
Institute, 11, 307–316.
Morse, P. M. (1957). Queues, Inventories and Maintenance, 1967 edn. John Wiley
and Sons, New York.
Thorneycroft, W. T., Greener, J. W. and Patrick, H. A. (1968). Investment decisions
under uncertainty and variability. Op. Res. Quarterly, 19, 143–160.
Trigg, D. W. and Pitts, E. (1963). The optimal choice of the smoothing constant in
an inventory control system. Op. Res. Quarterly, 13, 287–298.
Appendix A
The aim of this appendix is to define some of the terms that have been
used in previous chapters, particularly Chapter 7. It also seeks to
provide some additional ideas that will be of use when some of the
more advanced references are consulted.
A discrete stochastic process is a process that generates a sequence of
observations, {x_t}, in time. Figure A.1 represents a particular sequence
from such a process, often called a 'realization' of the process. The
probability density function of x_i at a specific time i is given by

f(x_i)dx_i = Prob{x_i < X_i < x_i + dx_i}
i.e. it is the probability that the sequence passes through the small
window, dx_i, in an imaginary wall built at time i. Similarly, if we
imagine two such windows at times i and j, we can obtain a joint
probability density, f(x_i, x_j), as being defined from the probability of
the process passing through both windows, dx_i and dx_j, as in Figure
A.1(b). If we know that at time i the sequence has the value x_i, then
the probability of it passing through the window dx_j is f(x_j | x_i)dx_j; this
is the conditional probability.
If f(x_j | x_i) does not depend on the value x_i, we say that the two
random variables x_i and x_j are independent (Figure A.1c). If all
variables {x_t} at all times are independent of each other, then the whole
process is called an independent stochastic process or sometimes 'white
noise'. The sequence {e_t} used throughout this book is formally defined
as such a process.
If all the distributions of one, two or more variables depend only on
the relative positions of the 'walls', and thus are not affected by any
change in time origin, then the process is said to be stationary. The
global use of the models in Chapter 7 essentially assumed that the
processes being dealt with were stationary. The local use of the models
293
294
is based on the belief that often real processes are only locally
stationary and are non-stationary over long periods of time.
We can define expectations for the above distributions as in Chapter
3, Table 3.5. Thus we can define:
E(x_j | x_i) as giving the mean value of the process at time j, given
that it took the value x_i at time i

and the autocovariance

Cov(x_i, x_j) = E[{x_i − E(x_i)}{x_j − E(x_j)}]

This quantity measures the relationship between x_i and x_j. To see the
way in which it does this we need to modify it slightly by defining the
autocorrelation function of x_i and x_j by

ρ(x_i, x_j) = Cov(x_i, x_j)/√(Var(x_i) Var(x_j))
This quantity takes the value 1 if x_i = x_j, i.e. if x_i and x_j are perfectly
linearly related, the value −1 if x_i = −x_j, and the value 0 if x_i and x_j are
independent of each other. Thus ρ(x_i, x_j) is a measure of linear
relationship obtained by standardizing the autocovariance. Notice that

Cov(x_i, x_i) = Var(x_i)
If the process is stationary, Cov(x_i, x_j) and ρ(x_i, x_j) will depend only on
the time difference j − i and not on the actual times. Thus putting
j = i + k we can simplify our notation for a stationary process:

ρ(x_i, x_{i+k}) = ρ(k)
Cov(x_i, x_{i+k}) = C(k)
Var(x_i) = C(0)

so,

ρ(k) = C(k)/C(0)
The plot of p(k) against k is called the correlogram.
By way of example, consider the first-order autoregressive process

x_t = φx_{t−1} + e_t    (|φ| < 1)

If this is re-expressed as

x_t = e_t + φe_{t−1} + φ²e_{t−2} + . . .
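From the moving-average form the correlogram of this process is ρ(k) = φ^k, and a sample correlogram can be checked against it (a sketch using the standard library only; all names are illustrative):

```python
import random

def simulate_ar1(phi, n, seed=1):
    """Generate a realization of x_t = phi*x_{t-1} + e_t with
    independent standard normal e_t."""
    rng = random.Random(seed)
    x, xs = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs

def correlogram(xs, max_lag):
    """Sample autocorrelations rho(k) = C(k)/C(0), with C(k) the
    sample autocovariance at lag k."""
    n = len(xs)
    mean = sum(xs) / n
    def c(k):
        return sum((xs[t] - mean) * (xs[t + k] - mean)
                   for t in range(n - k)) / n
    c0 = c(0)
    return [c(k) / c0 for k in range(max_lag + 1)]
```

For a long realization with φ = 0.7 the sample values settle close to 1, 0.7, 0.49, . . ., illustrating the geometric decay of the theoretical correlogram.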
We here show that the forecast with the minimum mean square error is
simply the expected value of the future x_{t+k}, given that we already
know the values x₁, x₂, . . . , x_t. This conditional expectation we
denote by

E(x_{t+k} | x₁, . . . , x_t)
Let us start by asking what constant x̂_{t,k} we would use to forecast x_{t+k}
to obtain the minimum mean square error. To find this we simply
expand the mean square error and choose x̂_{t,k} to give the minimum;
thus

MSE = E(x_{t+k} − x̂_{t,k})²
    = E(x²_{t+k}) − 2x̂_{t,k}E(x_{t+k}) + x̂²_{t,k}

Differentiating with respect to x̂_{t,k} and equating to zero gives

x̂_{t,k} = E(x_{t+k})    (1)

and the mean square error for this x̂_{t,k} is

MSE = E{x_{t+k} − E(x_{t+k})}² = Var(x_{t+k})

We thus obtain the best constant forecast by taking x̂_{t,k} equal to the
expected value of the future x_{t+k}.
To obtain a realistic forecast of x_{t+k} we must obviously use our data
x₁, . . . , x_t, which for now we regard as a sequence of random
variables. Suppose we write the forecast as

x̂_{t,k} = x̂_{t,k}(x₁, . . . , x_t)

We now want to find the function x̂_{t,k}(x₁, . . . , x_t) which gives the
minimum mean square forecast error, so we want to minimize
E(·) = E{E(· | z)}

Thus we first hold z constant and find the expectation with respect to
y. This will give an answer which depends on z. We now find the
expectation of this quantity with respect to z. Applying this idea to
finding the mean square error we can rewrite the mean square error as

MSE = E[E{(x_{t+k} − x̂_{t,k}(x₁, . . . , x_t))² | x₁, . . . , x_t}]

The inner expectation is the mean square error of a forecast that, given
x₁, . . . , x_t, is a constant, and so by the result above it is minimized by
choosing

x̂_{t,k} = E{x_{t+k} | x}
This choice must obviously minimize the total mean square error
expression, as it is only the inner expectation that relates the past,
x₁, . . . , x_t, to the future, x_{t+k}. We thus arrive at the conclusion that
the minimum mean square error forecast of a future observation, x_{t+k},
is the expectation of that observation given the data, x₁, . . . , x_t,
available at present, i.e.

x̂_{t,k}(x₁, . . . , x_t) = E{x_{t+k} | x₁, . . . , x_t}

Again using the results obtained for the constant forecast, it is clear
that the minimum mean square error obtained using x̂_{t,k}(x₁, . . . , x_t)
is

MSE = E{Var(x_{t+k} | x₁, . . . , x_t)}
That x̂_{t,k}(x₁, . . . , x_t) is unbiased follows from studying the error, which
is

e_{t+k} = x_{t+k} − x̂_{t,k}(x₁, . . . , x_t)

so

E(e_{t+k} | x) = 0

But

E(e_{t+k}) = E{E(e_{t+k} | x)}

and so

E(e_{t+k}) = 0
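The conclusion can be illustrated numerically for a case where the conditional expectation is known: for the first-order autoregressive process of Appendix A, E(x_{t+k} | x₁, . . . , x_t) = φ^k x_t. The sketch below compares its mean square error with that of a rival forecast (illustrative code, standard library only):

```python
import random

def compare_forecasts(phi, k, n, seed=2):
    """Simulate an AR(1) series and compare the mean square error of
    the conditional-expectation forecast phi**k * x_t with the naive
    forecast x_t, both predicting x_{t+k}."""
    rng = random.Random(seed)
    x, xs = 0.0, []
    for _ in range(n + k):
        x = phi * x + rng.gauss(0.0, 1.0)
        xs.append(x)
    mse_cond = sum((xs[t + k] - phi ** k * xs[t]) ** 2
                   for t in range(n)) / n
    mse_naive = sum((xs[t + k] - xs[t]) ** 2 for t in range(n)) / n
    return mse_cond, mse_naive
```

Any alternative forecast tried in place of the naive one will likewise do no better, on average, than the conditional expectation.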
and

W_t = aW_{t−1} + 1,    W₀ = 1
W̄_i = aW̄_{i+1} + 1,    W̄_t = 1

Then the local estimate of μ_T is

μ̂_T = (S_T + S̄_T − x_T)/(W_T + W̄_T − 1)
Appendix D

S = Σ_{r=0}^{t−1} a^r e²_{t−r}

and we require b₀, b₁ and b₂ to be the values that minimize this with a
as the fixed discounting factor (0 < a < 1). Substituting for e gives

∂S/∂b₀ = −2 Σ_{r=0}^{t−1} a^r (y_{t−r} − b₀ − b₁x_{1,t−r} − b₂x_{2,t−r}) = 0

∂S/∂b₁ = −2 Σ_{r=0}^{t−1} a^r x_{1,t−r}(y_{t−r} − b₀ − b₁x_{1,t−r} − b₂x_{2,t−r}) = 0

∂S/∂b₂ = −2 Σ_{r=0}^{t−1} a^r x_{2,t−r}(y_{t−r} − b₀ − b₁x_{1,t−r} − b₂x_{2,t−r}) = 0

Sorting these terms out and dropping the summation limits for clarity
gives
b₀Σa^r + b₁Σa^r x_{1,t−r} + b₂Σa^r x_{2,t−r} = Σa^r y_{t−r}

b₀Σa^r x_{1,t−r} + b₁Σa^r x²_{1,t−r} + b₂Σa^r x_{1,t−r}x_{2,t−r} = Σa^r x_{1,t−r}y_{t−r}

b₀Σa^r x_{2,t−r} + b₁Σa^r x_{1,t−r}x_{2,t−r} + b₂Σa^r x²_{2,t−r} = Σa^r x_{2,t−r}y_{t−r}

These are the normal equations discussed in section 6.2. Notice that the
three equations are equivalent to

Σa^r e_{t−r} = 0
Σa^r x_{1,t−r}e_{t−r} = 0
Σa^r x_{2,t−r}e_{t−r} = 0
The normal equations used in section 5.2.3 are obtained by a simple
change in notation:
(a) b₀ becomes the estimate, μ̂_t, of the current mean.
(b) b₁ becomes the current estimate of slope, β̂_t, by letting x_{1,t−r} be
simply −r, the time into the past.
(c) x₂ and b₂ are set at zero as there are only two parameters and two
equations.
(d) We use x as the dependent variable instead of the y here.
We thus obtain

μ̂_t Σa^r − β̂_t Σa^r r = Σa^r x_{t−r}
μ̂_t Σa^r r − β̂_t Σa^r r² = Σa^r r x_{t−r}
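These two equations are linear in μ̂_t and β̂_t and are easily solved numerically (a sketch; the function name is illustrative):

```python
def discounted_level_slope(xs, a):
    """Solve the discounted normal equations for the local mean mu_t
    and slope beta_t; xs[-1] is the newest observation x_t and
    xs[-1-r] is x_{t-r}, fitted by x_{t-r} = mu_t - beta_t * r."""
    t = len(xs)
    s0 = s1 = s2 = sx = srx = 0.0
    for r in range(t):
        w = a ** r
        x = xs[-1 - r]
        s0 += w; s1 += w * r; s2 += w * r * r
        sx += w * x; srx += w * r * x
    # mu*s0 - beta*s1 = sx ;  mu*s1 - beta*s2 = srx
    det = s1 * s1 - s0 * s2
    beta = (s0 * srx - s1 * sx) / det
    mu = (sx + beta * s1) / s0
    return mu, beta
```

For data lying exactly on a line the discounting factor is immaterial and the fit is exact; with noisy data, smaller a discounts the past more heavily and makes the local fit more responsive.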
Reference
Gilchrist, W. G. (1967). Methods of estimation using discounting. J. Roy. Statist.
Soc. B, 29, 355–369.