3. $E[(y_{t_1} - \mu)(y_{t_2} - \mu)] = \gamma_{t_2 - t_1}\quad \forall\ t_1, t_2$
Univariate Time-series Models (cont’d)
• So if the process is covariance stationary, all the variances are the same and all
the covariances depend on the difference between t1 and t2. The moments
$E[(y_t - E(y_t))(y_{t-s} - E(y_{t-s}))] = \gamma_s,\ s = 0, 1, 2, \ldots$
are known as the covariance function.
• The covariances, $\gamma_s$, are known as autocovariances.
• We can also test the joint hypothesis that all m of the $\tau_k$ correlation coefficients are simultaneously equal to zero using the Q-statistic developed by Box and Pierce:
$Q = T\sum_{k=1}^{m}\hat{\tau}_k^2$
where T = sample size, m = maximum lag length
• The Q-statistic is asymptotically distributed as a $\chi_m^2$.
• However, the Box–Pierce test has poor small sample properties, so a variant has been developed, called the Ljung–Box statistic:
$Q^* = T(T+2)\sum_{k=1}^{m}\frac{\hat{\tau}_k^2}{T-k} \sim \chi_m^2$
• This statistic is very useful as a portmanteau (general) test of linear dependence
in time-series.
An ACF Example
• Question:
Suppose that a researcher had estimated the first 5 autocorrelation coefficients
using a series of length 100 observations, and found them to be (from 1 to 5):
0.207, -0.013, 0.086, 0.005, -0.022.
Test each of the individual coefficients for significance, and use both the Box-Pierce and Ljung-Box tests to establish whether they are jointly significant.
• Solution:
A coefficient would be significant if it lies outside (-0.196,+0.196) at the 5%
level, so only the first autocorrelation coefficient is significant.
Q=5.09 and Q*=5.26
Compared with a tabulated $\chi^2(5) = 11.1$ at the 5% level, so the 5 coefficients are jointly insignificant.
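As a quick check of these numbers, the Q and Q* statistics can be reproduced directly from the five autocorrelation coefficients. The sketch below (Python, assuming numpy and scipy are available) is illustrative only:

```python
import numpy as np
from scipy.stats import chi2

T = 100
tau = np.array([0.207, -0.013, 0.086, 0.005, -0.022])
k = np.arange(1, len(tau) + 1)

bound = 1.96 / np.sqrt(T)                        # +/- 0.196 significance bound
q_bp = T * np.sum(tau ** 2)                      # Box-Pierce: ~5.09
q_lb = T * (T + 2) * np.sum(tau ** 2 / (T - k))  # Ljung-Box: ~5.26
print(bound, q_bp, q_lb, chi2.ppf(0.95, len(tau)))  # 5% critical value ~11.07
```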
Var(Xt) = E[Xt − E(Xt)][Xt − E(Xt)]
but E(Xt) = 0, so
Var(Xt) = E[(Xt)(Xt)]
$= E[(u_t + \theta_1 u_{t-1} + \theta_2 u_{t-2})(u_t + \theta_1 u_{t-1} + \theta_2 u_{t-2})]$
$= E[u_t^2 + \theta_1^2 u_{t-1}^2 + \theta_2^2 u_{t-2}^2 + \text{cross-products}]$
So Var(Xt) $= \gamma_0 = E[u_t^2 + \theta_1^2 u_{t-1}^2 + \theta_2^2 u_{t-2}^2]$
$= \sigma^2 + \theta_1^2\sigma^2 + \theta_2^2\sigma^2$
$= (1 + \theta_1^2 + \theta_2^2)\sigma^2$
(The first autocovariance, obtained in the same way, is $\gamma_1 = (\theta_1 + \theta_1\theta_2)\sigma^2$.)
$\gamma_2 = E[X_t - E(X_t)][X_{t-2} - E(X_{t-2})]$
$= E[X_t X_{t-2}]$
$= E[(u_t + \theta_1 u_{t-1} + \theta_2 u_{t-2})(u_{t-2} + \theta_1 u_{t-3} + \theta_2 u_{t-4})]$
$= E[\theta_2 u_{t-2}^2]$
$= \theta_2\sigma^2$
$\gamma_3 = E[X_t - E(X_t)][X_{t-3} - E(X_{t-3})]$
$= E[X_t X_{t-3}]$
$= E[(u_t + \theta_1 u_{t-1} + \theta_2 u_{t-2})(u_{t-3} + \theta_1 u_{t-4} + \theta_2 u_{t-5})]$
$= 0$
So $\gamma_s = 0$ for s > 2.
(iii) For $\theta_1 = -0.5$ and $\theta_2 = 0.25$, substituting these into the formulae above gives $\tau_1 = -0.476$, $\tau_2 = 0.190$.
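The values quoted in part (iii) can be verified numerically; the following sketch (Python/numpy, illustrative only) evaluates the formulae above and cross-checks them against sample autocorrelations from a long simulated MA(2) series:

```python
import numpy as np

theta1, theta2 = -0.5, 0.25
gamma0 = 1 + theta1**2 + theta2**2          # in units of sigma^2
gamma1 = theta1 + theta1 * theta2
gamma2 = theta2
print(gamma1 / gamma0, gamma2 / gamma0)     # approx -0.476 and 0.190

rng = np.random.default_rng(0)
u = rng.standard_normal(200_000)
x = u[2:] + theta1 * u[1:-1] + theta2 * u[:-2]
for s in (1, 2, 3):
    print(s, np.corrcoef(x[s:], x[:-s])[0, 1])  # tau_3 should be near zero
```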
ACF Plot
[Figure: acf plot for the MA(2) example, showing the autocorrelation coefficients at lags 0–6.]
• or $y_t = \mu + \sum_{i=1}^{p}\phi_i L^i y_t + u_t$
• This states that any stationary series can be decomposed into the sum of two unrelated processes, a purely deterministic part and a purely stochastic part, which will be an MA($\infty$),
where
$\psi(L) = (1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p)^{-1}$
Var(yt) = E[yt − E(yt)][yt − E(yt)]
but E(yt) = 0, since we are setting the constant term to zero.
Var(yt) = E[(yt)(yt)]
$= E[(u_t + \phi_1 u_{t-1} + \phi_1^2 u_{t-2} + \ldots)(u_t + \phi_1 u_{t-1} + \phi_1^2 u_{t-2} + \ldots)]$
$= E[u_t^2 + \phi_1^2 u_{t-1}^2 + \phi_1^4 u_{t-2}^2 + \ldots + \text{cross-products}]$
$= \sigma_u^2(1 + \phi_1^2 + \phi_1^4 + \ldots)$
$= \frac{\sigma_u^2}{(1 - \phi_1^2)}$
Solution (cont’d)
(iii) Turning now to calculating the acf, first calculate the autocovariances:
$\gamma_1 = \text{Cov}(y_t, y_{t-1}) = E[y_t - E(y_t)][y_{t-1} - E(y_{t-1})]$
Since the constant term has been set to zero, E(yt) = 0 and E(yt−1) = 0, so
$\gamma_1 = E[y_t y_{t-1}]$
$\gamma_1 = E[(u_t + \phi_1 u_{t-1} + \phi_1^2 u_{t-2} + \ldots)(u_{t-1} + \phi_1 u_{t-2} + \phi_1^2 u_{t-3} + \ldots)]$
$= E[\phi_1 u_{t-1}^2 + \phi_1^3 u_{t-2}^2 + \ldots + \text{cross-products}]$
$= \phi_1\sigma^2 + \phi_1^3\sigma^2 + \phi_1^5\sigma^2 + \ldots$
$= \frac{\phi_1\sigma^2}{(1 - \phi_1^2)}$
Similarly,
$\gamma_2 = \frac{\phi_1^2\sigma^2}{(1 - \phi_1^2)}$
• If these steps were repeated for $\gamma_3$, the following expression would be obtained
$\gamma_3 = \frac{\phi_1^3\sigma^2}{(1 - \phi_1^2)}$
and for any lag s, the autocovariance would be given by
$\gamma_s = \frac{\phi_1^s\sigma^2}{(1 - \phi_1^2)}$
The acf can now be obtained by dividing the covariances by the variance:
$\tau_0 = \frac{\gamma_0}{\gamma_0} = 1$
$\tau_1 = \frac{\gamma_1}{\gamma_0} = \frac{\phi_1\sigma^2/(1 - \phi_1^2)}{\sigma^2/(1 - \phi_1^2)} = \phi_1$
$\tau_2 = \frac{\gamma_2}{\gamma_0} = \frac{\phi_1^2\sigma^2/(1 - \phi_1^2)}{\sigma^2/(1 - \phi_1^2)} = \phi_1^2$
$\tau_3 = \phi_1^3$
…
$\tau_s = \phi_1^s$
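A short simulation makes the geometric decay of the AR(1) acf concrete. This is a minimal sketch (Python/numpy, with an assumed value of phi1 = 0.5), not part of the original example:

```python
import numpy as np

phi1 = 0.5
rng = np.random.default_rng(1)
T = 100_000
y = np.zeros(T)
for t in range(1, T):                        # y_t = phi1 * y_{t-1} + u_t
    y[t] = phi1 * y[t - 1] + rng.standard_normal()

print(y.var(), 1 / (1 - phi1**2))            # variance ~ sigma^2 / (1 - phi1^2)
for s in (1, 2, 3):
    print(s, np.corrcoef(y[s:], y[:-s])[0, 1], phi1**s)   # sample acf ~ phi1^s
```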
The Partial Autocorrelation Function (denoted $\tau_{kk}$)
• So $\tau_{kk}$ measures the correlation between yt and yt−k after removing the effects of yt−k+1, yt−k+2, …, yt−1.
• The pacf is useful for telling the difference between an AR process and an
ARMA process.
• In the case of an AR(p), there are direct connections between yt and yt−s only for $s \le p$.
• In the case of an MA(q), this can be written as an AR($\infty$), so there are direct connections between yt and all its previous values.
where $\phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p$
or $y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \ldots + \phi_p y_{t-p} + \theta_1 u_{t-1} + \theta_2 u_{t-2} + \ldots + \theta_q u_{t-q} + u_t$
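A sketch of how the acf and pacf are inspected in practice, assuming Python with statsmodels; for an AR(p) the pacf should cut off after lag p while the acf decays geometrically (the reverse pattern holds for an MA(q)):

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

rng = np.random.default_rng(2)
y = np.zeros(5000)
for t in range(1, 5000):
    y[t] = 0.6 * y[t - 1] + rng.standard_normal()   # AR(1) with phi = 0.6

print(acf(y, nlags=5))     # geometric decay: roughly 0.6, 0.36, 0.22, ...
print(pacf(y, nlags=5))    # spike at lag 1, approximately zero thereafter
```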
[Figures: sample acf and pacf plots (lags 1–10) for several simulated ARMA processes, illustrating the patterns used to distinguish AR, MA and ARMA models.]
• Box and Jenkins (1970) were the first to approach the task of estimating an
ARMA model in a systematic manner. There are 3 steps to their approach:
1. Identification
2. Estimation
3. Model diagnostic checking
Step 1:
- Involves determining the order of the model.
- Use of graphical procedures
- A better procedure is now available
Step 2:
- Estimation of the parameters
- Can be done using least squares or maximum likelihood depending
on the model.
Step 3:
- Model checking
• Reasons:
- variance of estimators is inversely proportional to the number of degrees of
freedom.
- models which are profligate (over-parameterised) might be inclined to fit to sample-specific features of the data
• This gives motivation for using information criteria, which embody 2 factors
- a term which is a function of the RSS
- some penalty for adding extra parameters
• The object is to choose the number of parameters which minimises the
information criterion.
Information Criteria for Model Selection
• The information criteria vary according to how stiff the penalty term is.
• The three most popular criteria are Akaike’s (1974) information criterion
(AIC), Schwarz’s (1978) Bayesian information criterion (SBIC), and the
Hannan-Quinn criterion (HQIC).
$\text{AIC} = \ln(\hat{\sigma}^2) + \frac{2k}{T}$
$\text{SBIC} = \ln(\hat{\sigma}^2) + \frac{k}{T}\ln T$
$\text{HQIC} = \ln(\hat{\sigma}^2) + \frac{2k}{T}\ln(\ln(T))$
where k = p + q + 1, T = sample size. So we min. IC s.t. $p \le \bar{p}$, $q \le \bar{q}$
SBIC embodies a stiffer penalty term than AIC.
• Which IC should be preferred if they suggest different model orders?
– SBIC is strongly consistent (but inefficient).
– AIC is not consistent, and will typically pick “bigger” models.
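A minimal sketch of how the criteria defined above could be computed over a small grid of ARMA orders, assuming Python with statsmodels and a return series y already loaded (the function name and the grid are illustrative, not from the original):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def info_criteria(y, p, q):
    """AIC/SBIC/HQIC as defined above: ln(sigma_hat^2) + penalty, with k = p + q + 1."""
    res = ARIMA(y, order=(p, 0, q)).fit()
    T, k = len(y), p + q + 1
    sigma2_hat = np.mean(res.resid ** 2)
    aic = np.log(sigma2_hat) + 2 * k / T
    sbic = np.log(sigma2_hat) + k * np.log(T) / T
    hqic = np.log(sigma2_hat) + 2 * k * np.log(np.log(T)) / T
    return aic, sbic, hqic

# Choose the (p, q) combination that minimises the preferred criterion, e.g. SBIC:
# best = min(((p, q) for p in range(4) for q in range(4)),
#            key=lambda pq: info_criteria(y, *pq)[1])
```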
ARIMA Models
• Since the weights lie between zero and one, the effect of each observation declines exponentially as we move another observation forward in time.
• Forecasting = prediction.
• An important test of the adequacy of a model.
e.g.
- Forecasting tomorrow’s return on a particular share
- Forecasting the price of a house given its characteristics
- Forecasting the riskiness of a portfolio over the next year
- Forecasting the volatility of bond returns
• The distinction between the two types is somewhat blurred (e.g., VARs).
In-Sample Versus Out-of-Sample
• Say we have some data - e.g. monthly FTSE returns for 120 months:
1990M1 – 1999M12. We could use all of it to build the model, or keep
some observations back:
• A good test of the model since we have not used the information from
1999M1 onwards when we estimated the model parameters.
How to produce forecasts
• Structural models
e.g. $y = X\beta + u$
$y_t = \beta_1 + \beta_2 x_{2t} + \cdots + \beta_k x_{kt} + u_t$
To forecast y, we require the conditional expectation of its future value:
$E(y_t) = \beta_1 + \beta_2 E(x_{2t}) + \cdots + \beta_k E(x_{kt})$
But what are $E(x_{2t})$ etc.? We could use $\bar{x}_2$, so
$E(y_t) = \beta_1 + \beta_2\bar{x}_2 + \cdots + \beta_k\bar{x}_k$
$= \bar{y}$ !!
• Time-series Models
The current value of a series, yt, is modelled as a function only of its previous
values and the current value of an error term (and possibly previous values of
the error term).
• Models include:
• simple unweighted averages
• exponentially weighted averages
• ARIMA models
• Non-linear models – e.g. threshold models, GARCH, bilinear models, etc.
$f_{t,4} = E(y_{t+4}\mid\Omega_t) = \ldots$
•For example, say we predict that tomorrow’s return on the FTSE will be 0.2, but
the outcome is actually -0.4. Is this accurate? Define ft,s as the forecast made at
time t for s steps ahead (i.e. the forecast made for time t+s), and yt+s as the
realised value of y at time t+s.
• Some of the most popular criteria for assessing the accuracy of time-series
forecasting techniques are:
$\text{MSE} = \frac{1}{N}\sum_{t=1}^{N}(y_{t+s} - f_{t,s})^2$
MAE is given by $\text{MAE} = \frac{1}{N}\sum_{t=1}^{N}\left|y_{t+s} - f_{t,s}\right|$
Mean absolute percentage error: $\text{MAPE} = \frac{100}{N}\sum_{t=1}^{N}\left|\frac{y_{t+s} - f_{t,s}}{y_{t+s}}\right|$
How can we test whether a forecast is accurate or not?
(cont’d)
• It has, however, also been shown (Gerlow et al., 1993) that the accuracy of forecasts according to traditional statistical criteria is not related to trading profitability.
% correct sign predictions $= \frac{1}{N}\sum_{t=1}^{N} z_{t+s}$
where $z_{t+s} = 1$ if $(y_{t+s}\cdot f_{t,s}) > 0$ and $z_{t+s} = 0$ otherwise.
• Given the following forecast and actual values, calculate the MSE, MAE and
percentage of correct sign predictions:
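The exercise's table of values is not reproduced here, but the four accuracy measures can be computed with a small helper like the sketch below (Python/numpy; the example numbers at the bottom are made up purely for illustration):

```python
import numpy as np

def forecast_accuracy(actual, forecast):
    """MSE, MAE, MAPE and % of correct sign predictions, as defined above."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    mse = np.mean((actual - forecast) ** 2)
    mae = np.mean(np.abs(actual - forecast))
    mape = 100 * np.mean(np.abs((actual - forecast) / actual))
    sign_pct = 100 * np.mean((actual * forecast) > 0)
    return mse, mae, mape, sign_pct

print(forecast_accuracy(actual=[0.5, -0.2, 0.3], forecast=[0.4, 0.1, 0.2]))
```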
Multivariate models
• All the models we have looked at thus far have been single equations models of
the form y = X + u
• All of the variables contained in the X matrix are assumed to be EXOGENOUS.
• y is an ENDOGENOUS variable.
• Assuming that the market always clears, and dropping the time subscripts
for simplicity
$Q = \alpha + \beta P + \gamma S + u$   (4)
$Q = \lambda + \mu P + \kappa T + v$   (5)
• The point is that price and quantity are determined simultaneously (price affects quantity and quantity affects price).
• Solving for Q,
$\alpha + \beta P + \gamma S + u = \lambda + \mu P + \kappa T + v$   (6)
• Solving for P,
$\frac{Q - \alpha - \gamma S - u}{\beta} = \frac{Q - \lambda - \kappa T - v}{\mu}$   (7)
• Rearranging (6),
$\beta P - \mu P = (\lambda - \alpha) + \kappa T - \gamma S + (v - u)$
$(\beta - \mu)P = (\lambda - \alpha) + \kappa T - \gamma S + (v - u)$
$P = \frac{\lambda - \alpha}{\beta - \mu} + \frac{\kappa}{\beta - \mu}T - \frac{\gamma}{\beta - \mu}S + \frac{v - u}{\beta - \mu}$   (8)
• Similarly, rearranging (7),
$Q = \frac{\beta\lambda - \alpha\mu}{\beta - \mu} + \frac{\beta\kappa}{\beta - \mu}T - \frac{\gamma\mu}{\beta - \mu}S + \frac{\beta v - \mu u}{\beta - \mu}$   (9)
• (8) and (9) are the reduced form equations for P and Q.
• But what would happen if we had estimated equations (4) and (5), i.e. the
structural form equations, separately using OLS?
• It is clear from (8) that P is related to the errors in (4) and (5) - i.e. it is
stochastic.
$P = \pi_{10} + \pi_{11}T + \pi_{12}S + \varepsilon_1$   (10)
$Q = \pi_{20} + \pi_{21}T + \pi_{22}S + \varepsilon_2$   (11)
• We CAN estimate equations (10) & (11) using OLS since all the RHS
variables are exogenous.
• But ... we probably don’t care what the values of the $\pi$ coefficients are; what we wanted were the original parameters in the structural equations – $\alpha, \beta, \gamma, \lambda, \mu, \kappa$.
1. An equation is unidentified
· like (12) or (13)
· we cannot get the structural coefficients from the reduced form estimates
2. An equation is exactly identified
· a unique set of structural coefficients can be obtained from the reduced form estimates
3. An equation is over-identified
· Example given later
· More than one set of structural coefficients could be obtained from the reduced form.
What Determines whether an Equation is Identified
or not? (cont’d)
• If more than G − 1 are absent, it is over-identified. If fewer than G − 1 are absent, it is not identified.
Example
• In the following system of equations, the Y’s are endogenous, while the X’s
are exogenous. Determine whether each equation is over-, under-, or just-
identified.
$Y_1 = \alpha_0 + \alpha_1 Y_2 + \alpha_3 Y_3 + \alpha_4 X_1 + \alpha_5 X_2 + u_1$
$Y_2 = \beta_0 + \beta_1 Y_3 + \beta_2 X_1 + u_2$        (14)-(16)
$Y_3 = \gamma_0 + \gamma_1 Y_2 + u_3$
Simultaneous Equations Bias (cont’d)
Solution
G = 3;
If # excluded variables = 2, the eqn is just identified
If # excluded variables > 2, the eqn is over-identified
If # excluded variables < 2, the eqn is not identified
3. Run the regression (14) again, but now also including the fitted values $\hat{Y}_2$, $\hat{Y}_3$ as additional regressors:
• Assume that the error terms are not correlated with each other. Can we estimate
the equations individually using OLS?
Stage 1:
• Obtain and estimate the reduced form equations using OLS. Save the
fitted values for the dependent variables.
Stage 2:
• Estimate the structural equations, but replace any RHS endogenous
variables with their stage 1 fitted values.
Estimation of Systems
Using Two-Stage Least Squares (cont’d)
Stage 1:
• Estimate the reduced form equations (17)-(19) individually by OLS and obtain the fitted values, $\hat{Y}_1$, $\hat{Y}_2$, $\hat{Y}_3$.
Stage 2:
• Replace the RHS endogenous variables with their stage 1 estimated values:
$Y_1 = \alpha_0 + \alpha_1\hat{Y}_2 + \alpha_3\hat{Y}_3 + \alpha_4 X_1 + \alpha_5 X_2 + u_1$
$Y_2 = \beta_0 + \beta_1\hat{Y}_3 + \beta_2 X_1 + u_2$        (24)-(26)
$Y_3 = \gamma_0 + \gamma_1\hat{Y}_2 + u_3$
• Now $\hat{Y}_2$ and $\hat{Y}_3$ will not be correlated with u1, $\hat{Y}_3$ will not be correlated with u2, and $\hat{Y}_2$ will not be correlated with u3.
Estimation of Systems
Using Two-Stage Least Squares (cont’d)
• Recall that the reason we cannot use OLS directly on the structural
equations is that the endogenous variables are correlated with the errors.
• One solution to this would be not to use Y2 or Y3 , but rather to use some
other variables instead.
• We want these other variables to be (highly) correlated with Y2 and Y3, but
not correlated with the errors - they are called INSTRUMENTS.
• Obtain the fitted values from (27) & (28), $\hat{Y}_2$ and $\hat{Y}_3$, and replace Y2 and Y3 with these in the structural equation.
• If the instruments are the variables in the reduced form equations, then
IV is equivalent to 2SLS.
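As a sketch of the mechanics described above (Python with statsmodels; the data-generating process is hypothetical and only serves to create an endogenous regressor):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
u1, u2 = rng.standard_normal(n), rng.standard_normal(n)
y2 = 1 + 0.5 * x1 + 0.5 * x2 + u2          # RHS endogenous variable
y1 = 2 + 0.8 * y2 + x1 + u1 + 0.5 * u2     # y2 is correlated with the composite error

# Stage 1: regress the RHS endogenous variable on all the exogenous variables
stage1 = sm.OLS(y2, sm.add_constant(np.column_stack([x1, x2]))).fit()

# Stage 2: replace y2 with its fitted values in the structural equation
stage2 = sm.OLS(y1, sm.add_constant(np.column_stack([stage1.fittedvalues, x1]))).fit()
print(stage2.params)   # note: proper 2SLS standard errors need a correction vs plain OLS
```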
• How Might the Option Price / Trading Volume and the Bid / Ask Spread
be Related?
Consider 3 possibilities:
1. Market makers equalise spreads across options.
2. The spread might be a constant proportion of the option value.
3. Market makers might equalise marginal costs across options irrespective
of trading volume.
Market Making Costs
• The S&P 100 Index has been traded on the CBOE since 1983 on a
continuous open-outcry auction basis.
• The average bid & ask prices are calculated for each option during the
time 2:00pm – 2:15pm Central Standard time.
• The following are then dropped from the sample for that day:
1. Any options that do not have bid / ask quotes reported during the ¼ hour.
2. Any options with fewer than 10 trades during the day.
• The option price is defined as the average of the bid & the ask.
where PRi & CRi are the squared deltas of the options
• Equations (1) & (2) and then separately (3) & (4) are estimated using
2SLS.
Results 1
             Coeff. 0    Coeff. 1    Coeff. 2    Coeff. 3    Coeff. 4    Coeff. 5    Adj. R2
Equation 1   0.08362     0.06114     0.01679     0.00902     -0.00228    -0.15378    0.688
             (16.80)     (8.63)      (15.49)     (14.01)     (-12.31)    (-12.52)
Equation 2   -3.8542     46.592      -0.12412    0.00406     0.00866                 0.618
             (-10.50)    (30.49)     (-6.01)     (14.43)     (4.76)
Note: t-ratios in parentheses. Source: George and Longstaff (1993). Reprinted with the permission of
the School of Business Administration, University of Washington.
• Adjusted R2 ≈ 60%
• The coefficients with subscript 1 in each equation measure the effect of the spread size on trading activity, etc.
Calls and Puts as Substitutes
• The paper argues that calls and puts might be viewed as substitutes
since they are all written on the same underlying.
• So call trading activity might depend on the put spread and put trading
activity might depend on the call spread.
• The authors argue that in the second part of the paper, they did indeed find
evidence of substitutability between calls & puts.
Comments
- No diagnostics.
- Why do the CL and PL equations not contain the CR and PR variables?
- The authors could have tested for endogeneity of CBA and CL.
- Why are the squared terms in maturity and moneyness only in the
liquidity regressions?
- Wrong sign on the squared deltas.
Vector Autoregressive Models
• A VAR is in a sense a systems regression model, i.e. there is more than one dependent variable.
$y_t = \beta_0 + \beta_1 y_{t-1} + u_t$
where $y_t$, $\beta_0$ and $u_t$ are each $g\times1$ and $\beta_1$ is $g\times g$.
Vector Autoregressive Models:
Notation and Concepts (cont’d)
• This model can be extended to the case where there are k lags of each
variable in each equation:
• We can also extend this to the case where the model includes first
difference terms and cointegrating relationships (a VECM).
Cross-Equation Restrictions
In the spirit of (unrestricted) VAR modelling, each equation should have
the same lag length
Suppose that a bivariate VAR(8) estimated using quarterly data has 8 lags
of the two variables in each equation, and we want to examine a restriction
that the coefficients on lags 5 through 8 are jointly zero. This can be done
using a likelihood ratio test
Denote the variance–covariance matrix of residuals (given by $\hat{u}\hat{u}'/T$) as $\hat{\Sigma}$.
The likelihood ratio test for this joint hypothesis is given by
$LR = T\left[\log\left|\hat{\Sigma}_r\right| - \log\left|\hat{\Sigma}_u\right|\right]$
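A rough sketch of this LR test, assuming Python with statsmodels and a (T × 2) array y of the two quarterly series already loaded; for simplicity the fact that the restricted and unrestricted VARs use slightly different effective samples is ignored here:

```python
import numpy as np
from statsmodels.tsa.api import VAR

def var_lag_lr_test(y, p_unrestricted=8, p_restricted=4):
    """LR = T[log|Sigma_r| - log|Sigma_u|]; chi-squared with g^2*(p_u - p_r) d.o.f."""
    T = len(y)
    sigma_u = np.cov(VAR(y).fit(p_unrestricted).resid, rowvar=False, bias=True)
    sigma_r = np.cov(VAR(y).fit(p_restricted).resid, rowvar=False, bias=True)
    lr = T * (np.log(np.linalg.det(sigma_r)) - np.log(np.linalg.det(sigma_u)))
    dof = y.shape[1] ** 2 * (p_unrestricted - p_restricted)
    return lr, dof
```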
Choosing the Optimal Lag Length for a VAR
(cont’d)
• We can take the contemporaneous terms over to the LHS and write
$\begin{pmatrix}1 & \alpha_{12}\\ \alpha_{21} & 1\end{pmatrix}\begin{pmatrix}y_{1t}\\ y_{2t}\end{pmatrix} = \begin{pmatrix}\beta_{10}\\ \beta_{20}\end{pmatrix} + \begin{pmatrix}\beta_{11} & \beta_{12}\\ \beta_{21} & \beta_{22}\end{pmatrix}\begin{pmatrix}y_{1,t-1}\\ y_{2,t-1}\end{pmatrix} + \begin{pmatrix}u_{1t}\\ u_{2t}\end{pmatrix}$
or
$A y_t = \beta_0 + \beta_1 y_{t-1} + u_t$
• This is the VAR written in primitive form; pre-multiplying through by $A^{-1}$ gives the standard form VAR, which we can estimate using OLS.
Block Significance and Causality Tests
• This is done by determining how much of the s-step-ahead forecast error variance for each variable is explained by innovations to each explanatory variable (s = 1, 2, …).
Lags of Variable
Dependent variable SIR DIVY SPREAD UNEM UNINFL PROPRES
SIR 0.0000 0.0091 0.0242 0.0327 0.2126 0.0000
DIVY 0.5025 0.0000 0.6212 0.4217 0.5654 0.4033
SPREAD 0.2779 0.1328 0.0000 0.4372 0.6563 0.0007
UNEM 0.3410 0.3026 0.1151 0.0000 0.0758 0.2765
UNINFL 0.3057 0.5146 0.3420 0.4793 0.0004 0.3885
PROPRES 0.5537 0.1614 0.5537 0.8922 0.7222 0.0000
(Variance decomposition: each pair of columns gives the percentage of the forecast error variance explained by innovations in SIR, DIVY, SPREAD, UNEM, UNINFL and PROPRES respectively, under two orderings, I and II.)
Months ahead I II I II I II I II I II I II
1 0.0 0.8 0.0 38.2 0.0 9.1 0.0 0.7 0.0 0.2 100.0 51.0
2 0.2 0.8 0.2 35.1 0.2 12.3 0.4 1.4 1.6 2.9 97.5 47.5
3 3.8 2.5 0.4 29.4 0.2 17.8 1.0 1.5 2.3 3.0 92.3 45.8
4 3.7 2.1 5.3 22.3 1.4 18.5 1.6 1.1 4.8 4.4 83.3 51.5
12 2.8 3.1 15.5 8.7 15.3 19.5 3.3 5.1 17.0 13.5 46.1 50.0
24 8.2 6.3 6.8 3.9 38.0 36.2 5.5 14.7 18.1 16.9 23.4 22.0
[Figure: plot of impulse responses over 1–24 months ahead.]
• If the variables in the regression model are not stationary, then it can
be proved that the standard assumptions for asymptotic analysis will
not be valid. In other words, the usual “t-ratios” will not follow a t-
distribution, so we cannot validly undertake hypothesis tests about the
regression parameters.
$y_t = \alpha + \beta t + u_t$   (2)
• Note that the model (1) could be generalised to the case where yt is an explosive process:
$y_t = \mu + \phi y_{t-1} + u_t$
where $\phi > 1$.
• We have 3 cases:
1. $\phi < 1 \Rightarrow \phi^T \to 0$ as $T \to \infty$
So the shocks to the system gradually die away.
2. $\phi = 1 \Rightarrow \phi^T = 1\ \forall\ T$
So shocks persist in the system and never die away. We obtain:
$y_t = y_0 + \sum u_t$ as $T \to \infty$
So just an infinite sum of past shocks plus some starting value of y0.
3. $\phi > 1$. Now given shocks become more influential as time goes on, since if $\phi > 1$, $\phi^3 > \phi^2 > \phi$ etc.
Detrending a Stochastically Non-stationary Series
[Figures: simulated series illustrating stochastic non-stationarity – a random walk, a random walk with drift, and AR(1) processes generated with phi = 1, phi = 0.8 and phi = 0.]
Definition
If a non-stationary series, yt, must be differenced d times before it becomes stationary, then it is said to be integrated of order d. We write yt ~ I(d).
So if yt ~ I(d) then $\Delta^d y_t$ ~ I(0).
An I(0) series is a stationary series.
An I(1) series contains one unit root,
e.g. $y_t = y_{t-1} + u_t$
Characteristics of I(0), I(1) and I(2) Series
• An I(2) series contains two unit roots and so would require differencing
twice to induce stationarity.
• I(1) and I(2) series can wander a long way from their mean value and
cross this mean value rarely.
• The majority of economic and financial series contain a single unit root,
although some are stationary and consumer prices have been argued to
have 2 unit roots.
• The early and pioneering work on testing for a unit root in time series
was done by Dickey and Fuller (Dickey and Fuller 1979, Fuller 1976).
The basic objective of the test is to test the null hypothesis that $\phi = 1$ in:
$y_t = \phi y_{t-1} + u_t$
against the one-sided alternative $\phi < 1$. So we have
H0: series contains a unit root
vs. H1: series is stationary.
• We can write
$\Delta y_t = u_t$
where $\Delta y_t = y_t - y_{t-1}$, and the alternatives may be expressed as
$\Delta y_t = \psi y_{t-1} + \mu + \lambda t + u_t$
with $\mu = \lambda = 0$ in case i), and $\lambda = 0$ in case ii), and $\psi = \phi - 1$. In each case, the tests are based on the t-ratio on the yt−1 term in the estimated regression of $\Delta y_t$ on yt−1, plus a constant in case ii) and a constant and trend in case iii). The test statistics are defined as
test statistic $= \frac{\hat{\psi}}{\widehat{SE}(\hat{\psi})}$
• The test statistic does not follow the usual t-distribution under the null,
since the null is one of non-stationarity, but rather follows a non-standard
distribution. Critical values are derived from Monte Carlo experiments in,
for example, Fuller (1976). Relevant examples of the distribution are
shown in table 4.1 below
Critical Values for the DF Test
The null hypothesis of a unit root is rejected in favour of the stationary alternative
in each case if the test statistic is more negative than the critical value.
• The tests above are only valid if ut is white noise. In particular, ut will be
autocorrelated if there was autocorrelation in the dependent variable of the
regression (yt) which we have not modelled. The solution is to “augment”
the test using p lags of the dependent variable. The alternative model in case (i) is now written:
$\Delta y_t = \psi y_{t-1} + \sum_{i=1}^{p}\alpha_i \Delta y_{t-i} + u_t$
• The same critical values from the DF tables are used as before. A problem
now arises in determining the optimal number of lags of the dependent
variable.
There are 2 ways
- use the frequency of the data to decide
- use information criteria
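A sketch of running the (augmented) DF test with the lag length chosen by an information criterion, assuming Python with statsmodels (the simulated random walk is purely illustrative):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
y = np.cumsum(rng.standard_normal(500))     # a random walk, so H0 should not be rejected

stat, pvalue, usedlag, nobs, crit, icbest = adfuller(y, regression="c", autolag="AIC")
print(stat, crit)   # fail to reject the unit root null unless stat is more negative than the CV
```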
Testing for Higher Orders of Integration
• The tests usually give the same conclusions as the ADF tests, and the
calculation of the test statistics is complex.
• The main criticism is that the power of the tests is low if the process is
stationary but with a root close to the non-stationary boundary.
e.g. the tests are poor at deciding if
$\phi = 1$ or $\phi = 0.95$,
especially with small sample sizes.
• One way to get around this is to use a stationarity test as well as the
unit root tests we have looked at.
So that by default under the null the data will appear stationary.
• Thus we can compare the results of these tests with the ADF/PP
procedure to see if we obtain the same conclusion.
• A Comparison
ADF / PP KPSS
H0: yt I(1) H0: yt I(0)
H1: yt I(0) H1: yt I(1)
• 4 possible outcomes
• The equation for the third (most general) version of the test is
• More seriously, Christiano (1992) has argued that the critical values
employed with the test will presume the break date to be chosen
exogenously
• But most researchers will select a break point based on an examination
of the data
• Thus the asymptotic theory assumed will no longer hold
• Banerjee et al. (1992) and Zivot and Andrews (1992) introduce an
approach to testing for unit roots in the presence of structural change
that allows the break date to be selected endogenously.
• Their methods are based on recursive, rolling and sequential tests.
• For the recursive and rolling tests, Banerjee et al. propose four
specifications.
1. The standard Dickey-Fuller test on the whole sample
2. The ADF test conducted repeatedly on the sub-samples and the minimal
DF statistic is obtained
3. The maximal DF statistic obtained from the sub-samples
4. The difference between the maximal and minimal statistics
• For the sequential test, the whole sample is used each time with the
following regression being run
• The test is run repeatedly for different values of Tb over as much of the
data as possible (a ‘trimmed sample’)
• This excludes the first few and the last few observations
• Clearly it is τt(tused) that allows for the break, which can either be in the level (where τt(tused) = 1 if t > tused and 0 otherwise), or in the deterministic trend (where τt(tused) = t − tused if t > tused and 0 otherwise).
• This technique is very similar to that of Zivot and Andrews, except that
Perron’s is more flexible since it allows for a break under both the null
and alternative hypotheses
• The findings for the recursive tests are that the unit root null should
not be rejected at the 10% level for any of the maturities examined
• For the sequential tests, the results are slightly more mixed with the
break in trend model not rejecting the null hypothesis, while it is
rejected for the short, 7-day and the 1-month rates when a structural
break is allowed for in the mean
• The weight of evidence indicates that short term interest rates are best
viewed as unit root processes that have a structural break in their level
around the time of ‘Black Wednesday’ (16 September 1992) when the
UK dropped out of the European Exchange Rate Mechanism
• The longer term rates, on the other hand, are I(1) processes with no
breaks
• In most cases, if we combine two variables which are I(1), then the
combination will also be I(1).
Let $z_t = \sum_{i=1}^{k}\alpha_i X_{i,t}$   (1)
Then $z_t \sim I(\max d_i)$
• Rearranging (1), we can write
$X_{1,t} = \sum_{i=2}^{k}\beta_i X_{i,t} + z_t'$
where $\beta_i = -\frac{\alpha_i}{\alpha_1}$, $z_t' = \frac{z_t}{\alpha_1}$, $i = 2, \ldots, k$
• But the disturbances would have some very undesirable properties: zt´ is
not stationary and is autocorrelated if all of the Xi are I(1).
• We want to ensure that the disturbances are I(0). Under what circumstances
will this be the case?
Definition of Cointegration (Engle & Granger, 1987)
• Many time series are non-stationary but “move together” over time.
• If variables are cointegrated, it means that a linear combination of them will
be stationary.
• There may be up to r linearly independent cointegrating relationships (where $r \le k - 1$), also known as cointegrating vectors. r is also known as the cointegrating rank of zt.
• A cointegrating relationship may also be seen as a long term relationship.
• The problem with this approach is that pure first difference models have no
long run solution.
e.g. Consider yt and xt both I(1).
The model we may want to estimate is
$\Delta y_t = \beta\Delta x_t + u_t$
But this collapses to nothing in the long run.
• One way to get around this problem is to use both first difference and levels terms, e.g.
$\Delta y_t = \beta_1\Delta x_t + \beta_2(y_{t-1} - \gamma x_{t-1}) + u_t$   (1)
• $y_{t-1} - \gamma x_{t-1}$ is known as the error correction term.
• ut should be I(0) if the variables yt, x2t, ... xkt are cointegrated.
• Engle and Granger (1987) have tabulated a new set of critical values
and hence the test is known as the Engle Granger (E.G.) test.
• We can also use the Durbin Watson test statistic or the Phillips Perron
approach to test for non-stationarity of ut .
• What are the null and alternative hypotheses for a test on the residuals
of a potentially cointegrating regression?
H0 : unit root in cointegrating regression’s residuals
H1 : residuals from cointegrating regression are stationary
• There are (at least) 3 methods we could use: Engle Granger, Engle and Yoo,
and Johansen.
• The Engle Granger 2 Step Method
This is a single equation technique which is conducted as follows:
Step 1:
- Make sure that all the individual variables are I(1).
- Then estimate the cointegrating regression using OLS.
- Save the residuals of the cointegrating regression, $\hat{u}_t$.
- Test these residuals to ensure that they are I(0).
Step 2:
- Use the step 1 residuals as one variable in the error correction model, e.g.
$\Delta y_t = \beta_1\Delta x_t + \beta_2\hat{u}_{t-1} + u_t$
where $\hat{u}_{t-1} = y_{t-1} - \hat{\tau}x_{t-1}$
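The two steps can be sketched as follows (Python with statsmodels; the cointegrated x and y series are simulated purely for illustration, and remember that Engle–Granger rather than standard DF critical values apply to the residual-based test):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(5)
x = np.cumsum(rng.standard_normal(1000))          # I(1)
y = 0.5 + 0.8 * x + rng.standard_normal(1000)     # cointegrated with x

# Step 1: cointegrating regression and residual-based unit root test
step1 = sm.OLS(y, sm.add_constant(x)).fit()
u_hat = step1.resid
print(adfuller(u_hat)[0])                          # compare with EG critical values

# Step 2: error correction model using the lagged step-1 residuals
dy, dx = np.diff(y), np.diff(x)
step2 = sm.OLS(dy, sm.add_constant(np.column_stack([dx, u_hat[:-1]]))).fit()
print(step2.params)
```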
An Example of a Model for Non-stationary
Variables: Lead-Lag Relationships between Spot
and Futures Prices
Background
• We expect changes in the spot price of a financial asset and its corresponding
futures price to be perfectly contemporaneously correlated and not to be
cross-autocorrelated.
• We can test this idea by modelling the lead-lag relationship between the two.
• Tse (1995): 1055 daily observations on NSA stock index and stock
index futures values from December 1988 - April 1993.
$F_t^* = S_t e^{(r-d)(T-t)}$
where Ft* is the fair futures price, St is the spot price, r is a
continuously compounded risk-free rate of interest, d is the
continuously compounded yield in terms of dividends derived from the
stock index until the futures contract matures, and (T-t) is the time to
maturity of the futures contract. Taking logarithms of both sides of
equation above gives
$f_t^* = s_t + (r - d)(T - t)$
Dickey–Fuller statistics for log-price data: Futures −0.1329; Spot −0.7335
• Conclusion: log Ft and log St are not stationary, but $\Delta$log Ft and $\Delta$log St are stationary.
• But a model containing only first differences has no long run relationship.
• The solution is to see if there exists a cointegrating relationship between ft
and st which would mean that we can validly include levels terms in this
framework.
Cointegrating Regression
Coefficient estimates: intercept = 0.1345; slope = 0.9834
DF test on residuals, $\hat{z}_t$: test statistic = −14.7303
• One of the problems with the EG 2-step method is that we cannot make
any inferences about the actual cointegrating regression.
• The Engle & Yoo (EY) 3-step procedure takes its first two steps from EG.
• The most important problem with both these techniques is that in the
general case above, where we have more than two variables which may be
cointegrated, there could be more than one cointegrating relationship.
• So, in the case where we just had y and x, then r can only be one or
zero.
• And if there are others, how do we know how many there are or
whether we have found the “best”?
• Let $\Pi$ denote a g×g square matrix, c a g×1 non-zero vector, and $\lambda$ a set of scalars such that $\Pi c = \lambda c$,
and hence
$(\Pi - \lambda I_g)c = 0$
where Ig is an identity matrix.
Review of Matrix Algebra (cont’d)
• For example, let $\Pi$ be the 2×2 matrix $\begin{pmatrix}5 & 1\\ 2 & 4\end{pmatrix}$. Then
$\left|\Pi - \lambda I_g\right| = \left|\begin{pmatrix}5 & 1\\ 2 & 4\end{pmatrix} - \lambda\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix}\right| = 0$
$\left|\begin{matrix}5-\lambda & 1\\ 2 & 4-\lambda\end{matrix}\right| = (5-\lambda)(4-\lambda) - 2 = \lambda^2 - 9\lambda + 18$
Setting this to zero gives the eigenvalues (characteristic roots) $\lambda = 6$ and $\lambda = 3$.
Review of Matrix Algebra (cont’d)
• The rank of a matrix is equal to the order of the largest square matrix we
can obtain from which has a non-zero determinant.
• The test for cointegration between the y’s is calculated by looking at the
rank of the matrix via its eigenvalues. (To prove this requires some
technical intermediate steps).
$\lambda_{trace}(r) = -T\sum_{i=r+1}^{g}\ln(1-\hat{\lambda}_i)$
and
$\lambda_{max}(r, r+1) = -T\ln(1-\hat{\lambda}_{r+1})$
where $\hat{\lambda}_i$ is the estimated value for the ith ordered eigenvalue from the $\Pi$ matrix.
• $\lambda_{trace}$ tests the null that the number of cointegrating vectors is less than or equal to r against an unspecified alternative.
$\lambda_{trace} = 0$ when all the $\lambda_i = 0$, so it is a joint test.
• $\lambda_{max}$ tests the null that the number of cointegrating vectors is r against an alternative of r + 1.
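In practice the eigenvalues and the two test statistics are produced by standard routines; a minimal sketch, assuming Python with statsmodels and a (T × g) array of the levels of the candidate variables already loaded:

```python
from statsmodels.tsa.vector_ar.vecm import coint_johansen

def johansen_summary(levels, det_order=0, k_ar_diff=1):
    """Trace and max-eigenvalue statistics with their critical values."""
    res = coint_johansen(levels, det_order, k_ar_diff)
    return res.lr1, res.cvt, res.lr2, res.cvm   # trace stats/CVs, max-eig stats/CVs
```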
$\begin{pmatrix}\beta_{11}\\ \beta_{12}\\ \beta_{13}\\ \beta_{14}\end{pmatrix}'\begin{pmatrix}y_1\\ y_2\\ y_3\\ y_4\end{pmatrix}_{t-k}$ or $\beta_{11}y_1 + \beta_{12}y_2 + \beta_{13}y_3 + \beta_{14}y_4$
Johansen Critical Values
• Johansen & Juselius (1990) provide critical values for the 2 statistics.
The distribution of the test statistics is non-standard. The critical values
depend on:
1. the value of g-r, the number of non-stationary components
2. whether a constant and/or trend are included in the regressions.
• If the test statistic is greater than the critical value from Johansen’s
tables, reject the null hypothesis that there are r cointegrating vectors in
favour of the alternative that there are more than r.
H0: r = 0 vs H1: 0 < r ≤ g
H0: r = 1 vs H1: 1 < r ≤ g
H0: r = 2 vs H1: 2 < r ≤ g
... ... ...
H0: r = g-1 vs H1: r = g
• But how does this correspond to a test of the rank of the matrix?
• r is the rank of .
• cannot be of full rank (g) since this would correspond to the original
yt being stationary.
• If $\Pi$ has zero rank, then by analogy to the univariate case, $\Delta y_t$ depends only on $\Delta y_{t-j}$ and not on yt−1, so that there is no long run relationship between the elements of yt−1. Hence there is no cointegration.
• For 1 < rank () < g , there are multiple cointegrating vectors.
• After this point, if the restriction is not severe, then the cointegrating
vectors will not change much upon imposing the restriction.
• Does the PPP relationship hold for the US/Italian exchange rate–price
system?
Data:
• Daily closing observations on redemption yields on government bonds for
4 bond markets: US, UK, West Germany, Japan.
• For cointegration, a necessary but not sufficient condition is that the yields
are nonstationary. All 4 yields series are I(1).
• The paper then goes on to estimate a VAR for the first differences of the
yields, which is of the form
$\Delta X_t = \sum_{i=1}^{k}\Gamma_i\Delta X_{t-i} + \epsilon_t$
where
$\Delta X_t = \begin{pmatrix}\Delta X(US)_t\\ \Delta X(UK)_t\\ \Delta X(WG)_t\\ \Delta X(JAP)_t\end{pmatrix}$, $\Gamma_i = \begin{pmatrix}\gamma_{11i} & \gamma_{12i} & \gamma_{13i} & \gamma_{14i}\\ \gamma_{21i} & \gamma_{22i} & \gamma_{23i} & \gamma_{24i}\\ \gamma_{31i} & \gamma_{32i} & \gamma_{33i} & \gamma_{34i}\\ \gamma_{41i} & \gamma_{42i} & \gamma_{43i} & \gamma_{44i}\end{pmatrix}$ and $\epsilon_t = \begin{pmatrix}\epsilon_{1t}\\ \epsilon_{2t}\\ \epsilon_{3t}\\ \epsilon_{4t}\end{pmatrix}$
• They set k = 8.
Source: Mills and Mills (1991). Reprinted with the permission of Blackwell Publishers.
Source: Mills and Mills (1991). Reprinted with the permission of Blackwell Publishers.
Impulse Responses for VAR of
International Bond Yields
Response of UK to innovations in
Days after shock US UK Germany Japan
0 0.19 0.97 0.00 0.00
1 0.16 0.07 0.01 -0.06
2 -0.01 -0.01 -0.05 0.09
3 0.06 0.04 0.06 0.05
4 0.05 -0.01 0.02 0.07
10 0.01 0.01 -0.04 -0.01
20 0.00 0.00 -0.01 0.00
Source: Mills and Mills (1991). Reprinted with the permission of Blackwell Publishers.
• Models with nonlinear g(•) are “non-linear in mean”, while those with
nonlinear 2(•) are “non-linear in variance”.
• Here the dependent variable is the residual series and the independent
variables are the squares, cubes, …, of the fitted values.
• Many other non-linearity tests are available - e.g., the BDS and
bispectrum test
• BDS is a pure hypothesis test. That is, it has as its null hypothesis that
the data are pure noise (completely random)
• It has been argued to have power to detect a variety of departures from randomness – linear or non-linear stochastic processes, deterministic chaos, etc.
• The BDS test follows a standard normal distribution under the null
• The test can also be used as a model diagnostic on the residuals to ‘see
what is left’
• If the proposed model is adequate, the standardised residuals should be
white noise.
Chaos Theory
• Neural networks are not very popular in finance and suffer from
several problems:
– The coefficient estimates from neural networks do not have any real
theoretical interpretation
– Virtually no diagnostic or specification tests are available for estimated
models
– They can provide excellent fits in-sample to a given set of ‘training’ data,
but typically provide poor out-of-sample forecast accuracy
– This usually arises from the tendency of neural networks to fit closely to
sample-specific data features and ‘noise’, and so they cannot ‘generalise’
– The non-linear estimation of neural network models can be cumbersome
and computationally time-intensive, particularly, for example, if the model
must be estimated repeatedly when rolling through a sample.
• Modelling and forecasting stock market volatility has been the subject
of vast empirical and theoretical investigation
• There are a number of motivations for this line of inquiry:
– Volatility is one of the most important concepts in finance
– Volatility, as measured by the standard deviation or variance of returns, is
often used as a crude measure of the total risk of financial assets
– Many value-at-risk (VaR) models for measuring market risk require the
estimation or forecast of a volatility parameter
– The volatility of stock market prices also enters directly into the Black–
Scholes formula for deriving the prices of traded options
• We will now examine several volatility models.
• Is the variance of the errors likely to be constant over time? Not for
financial data.
Autoregressive Conditionally Heteroscedastic
(ARCH) Models
• So use a model which does not assume that the variance is constant.
• Recall the definition of the variance of ut:
$\sigma_t^2 = \text{Var}(u_t\mid u_{t-1}, u_{t-2}, \ldots) = E[(u_t - E(u_t))^2\mid u_{t-1}, u_{t-2}, \ldots]$
• Instead of calling the variance $\sigma_t^2$, in the literature it is usually called ht, so the model is
$y_t = \beta_1 + \beta_2 x_{2t} + \ldots + \beta_k x_{kt} + u_t$,   $u_t \sim N(0, h_t)$
where $h_t = \alpha_0 + \alpha_1 u_{t-1}^2 + \alpha_2 u_{t-2}^2 + \ldots + \alpha_q u_{t-q}^2$
• Another way of writing the model is $u_t = v_t\sigma_t$, $\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2$, $v_t \sim N(0, 1)$
• The two are different ways of expressing exactly the same model. The
first form is easier to understand while the second form is required for
simulating from an ARCH model, for example.
1. First, run any postulated linear regression of the form given in the equation
above, e.g. yt = 1 + 2x2t + ... + kxkt + ut
saving the residuals, ût .
2. Then square the residuals, and regress them on q own lags to test for ARCH
of order q, i.e. run the regression
$\hat{u}_t^2 = \gamma_0 + \gamma_1\hat{u}_{t-1}^2 + \gamma_2\hat{u}_{t-2}^2 + \ldots + \gamma_q\hat{u}_{t-q}^2 + v_t$
where vt is iid.
3. Obtain R2 from this regression. The test statistic is defined as TR2 (the number of observations multiplied by the coefficient of determination) and is distributed as a $\chi^2(q)$.
4. If the value of the test statistic is greater than the critical value from the $\chi^2(q)$ distribution, then reject the null hypothesis of no ARCH effects.
• Note that the ARCH test is also sometimes applied directly to returns
instead of the residuals from Stage 1 above.
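The regression-based test can be sketched directly from its definition (Python with numpy/statsmodels; the residual series is assumed to come from a previously estimated mean equation):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def arch_lm_test(resid, q=5):
    """Engle's ARCH test: regress squared residuals on q own lags; TR^2 ~ chi2(q) under H0."""
    u2 = np.asarray(resid, float) ** 2
    T = len(u2)
    y = u2[q:]
    X = np.column_stack([u2[q - i:T - i] for i in range(1, q + 1)])
    r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared
    return len(y) * r2, chi2.ppf(0.95, q)   # reject the no-ARCH null if TR^2 > critical value
```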
• How do we decide on q?
• The required value of q might be very large
• Non-negativity constraints might be violated.
– When we estimate an ARCH model, we require i >0 i=1,2,...,q
(since variance cannot be negative)
• Since the model is no longer of the usual linear form, we cannot use
OLS.
• The method works by finding the most likely values of the parameters
given the actual data.
1. Specify the appropriate equations for the mean and the variance - e.g. an
AR(1)- GARCH(1,1) model:
$y_t = \mu + \phi y_{t-1} + u_t$,   $u_t \sim N(0, \sigma_t^2)$
$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta\sigma_{t-1}^2$
2. Specify the log-likelihood function to maximise:
$L = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\log(\sigma_t^2) - \frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - \mu - \phi y_{t-1})^2}{\sigma_t^2}$
3. The computer will maximise the function and give parameter values and
their standard errors
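The three steps can be put together in a small numerical routine. The sketch below (Python with numpy/scipy; the starting values, bounds and variance initialisation are illustrative choices) maximises the AR(1)–GARCH(1,1) log-likelihood above by minimising its negative:

```python
import numpy as np
from scipy.optimize import minimize

def neg_llf(params, y):
    """Negative log-likelihood for an AR(1) mean with GARCH(1,1) errors (normal)."""
    mu, phi, a0, a1, beta = params
    u = y[1:] - mu - phi * y[:-1]
    sig2 = np.empty(len(u))
    sig2[0] = np.var(u)                                  # initialise the variance recursion
    for t in range(1, len(u)):
        sig2[t] = a0 + a1 * u[t - 1] ** 2 + beta * sig2[t - 1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sig2) + u ** 2 / sig2)

# y: vector of returns (assumed already loaded)
# start = [0.0, 0.0, 0.05, 0.05, 0.90]
# res = minimize(neg_llf, start, args=(y,), method="L-BFGS-B",
#                bounds=[(None, None), (-0.99, 0.99), (1e-8, None), (0, 1), (0, 1)])
```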
• Then the joint pdf for all the y’s can be expressed as a product of the individual density functions
$f(y_1, y_2, \ldots, y_T\mid \beta_1 + \beta_2 X_t, \sigma^2) = f(y_1\mid \beta_1 + \beta_2 X_1, \sigma^2)\,f(y_2\mid \beta_1 + \beta_2 X_2, \sigma^2)\cdots f(y_T\mid \beta_1 + \beta_2 X_T, \sigma^2)$   (2)
$= \prod_{t=1}^{T} f(y_t\mid \beta_1 + \beta_2 X_t, \sigma^2)$
$LF(\beta_1, \beta_2, \sigma^2) = \frac{1}{\sigma^T(\sqrt{2\pi})^T}\exp\left\{-\frac{1}{2}\sum_{t=1}^{T}\frac{(y_t - \beta_1 - \beta_2 x_t)^2}{\sigma^2}\right\}$   (4)
• We want to differentiate (4) w.r.t. $\beta_1$, $\beta_2$ and $\sigma^2$, but (4) is a product containing T terms.
• From (6), $\sum_t(y_t - \hat{\beta}_1 - \hat{\beta}_2 x_t) = 0$
$\sum_t y_t - T\hat{\beta}_1 - \hat{\beta}_2\sum_t x_t = 0$
$\frac{1}{T}\sum_t y_t - \hat{\beta}_1 - \hat{\beta}_2\frac{1}{T}\sum_t x_t = 0$
$\hat{\beta}_1 = \bar{y} - \hat{\beta}_2\bar{x}$   (9)
Parameter Estimation using Maximum Likelihood
(cont’d)
• From (7), $\sum_t(y_t - \hat{\beta}_1 - \hat{\beta}_2 x_t)x_t = 0$
$\sum_t y_t x_t - \hat{\beta}_1\sum_t x_t - \hat{\beta}_2\sum_t x_t^2 = 0$
$\sum_t y_t x_t - (\bar{y} - \hat{\beta}_2\bar{x})\sum_t x_t - \hat{\beta}_2\sum_t x_t^2 = 0$
$\hat{\beta}_2\sum_t x_t^2 = \sum_t y_t x_t - T\bar{x}\bar{y} + \hat{\beta}_2 T\bar{x}^2$
$\hat{\beta}_2 = \frac{\sum_t y_t x_t - T\bar{x}\bar{y}}{\sum_t x_t^2 - T\bar{x}^2}$   (10)
• From (8), $-\frac{T}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}\sum_t(y_t - \hat{\beta}_1 - \hat{\beta}_2 x_t)^2 = 0$
Parameter Estimation using Maximum Likelihood
(cont’d)
• Rearranging, $\hat{\sigma}^2 = \frac{1}{T}\sum_t(y_t - \hat{\beta}_1 - \hat{\beta}_2 x_t)^2 = \frac{1}{T}\sum_t\hat{u}_t^2$   (11)
• Unfortunately, the LLF for a model with time-varying variances cannot be
maximised analytically, except in the simplest of cases. So a numerical
procedure is used to maximise the log-likelihood function. A potential
problem: local optima or multimodalities in the likelihood surface.
• The way we do the optimisation is:
1. Set up LLF.
2. Use regression to get initial guesses for the mean parameters.
3. Choose some initial guesses for the conditional variance parameters.
4. Specify a convergence criterion - either by criterion or by value.
Non-Normality and Maximum Likelihood (ML)
• The sample counterpart is $\hat{v}_t = \hat{u}_t / \hat{\sigma}_t$
• Are the v̂t normal? Typically v̂t are still leptokurtic, although less so than
the ût . Is this a problem? Not really, as we can use the ML with a robust
variance/covariance estimator. ML with robust standard errors is called Quasi-
Maximum Likelihood or QML.
Extensions to the Basic GARCH Model
$\log(\sigma_t^2) = \omega + \beta\log(\sigma_{t-1}^2) + \gamma\frac{u_{t-1}}{\sqrt{\sigma_{t-1}^2}} + \alpha\left[\frac{|u_{t-1}|}{\sqrt{\sigma_{t-1}^2}} - \sqrt{\frac{2}{\pi}}\right]$
The news impact curve plots the next period volatility (ht) that would arise from various
positive and negative values of ut-1, given an estimated model.
News Impact Curves for S&P 500 Returns using Coefficients from GARCH and GJR
Model Estimates:
[Figure: news impact curves for the GARCH and GJR models, plotting the value of the conditional variance against the value of the lagged shock (from −1 to +1).]
GARCH-in-Mean
• GARCH can model the volatility clustering effect since the conditional
variance is autoregressive. Such models can be used to forecast volatility.
• Let $\sigma_{1,T}^{2f}$ be the one-step-ahead forecast for $\sigma^2$ made at time T. This is easy to calculate since, at time T, the values of all the terms on the RHS are known.
• $\sigma_{1,T}^{2f}$ would be obtained by taking the conditional expectation of the conditional variance equation.
• Given $\sigma_{1,T}^{2f}$, how is $\sigma_{2,T}^{2f}$, the 2-step-ahead forecast for $\sigma^2$ made at time T, calculated?
$\sigma_{2,T}^{2f} = \alpha_0 + (\alpha_1 + \beta)\sigma_{1,T}^{2f}$
• By similar arguments, the 3-step-ahead forecast will be given by
$\sigma_{3,T}^{2f} = E_T(\alpha_0 + \alpha_1 u_{T+2}^2 + \beta\sigma_{T+2}^2) = \alpha_0 + (\alpha_1 + \beta)\sigma_{2,T}^{2f}$
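The recursion above translates directly into code; a minimal sketch with made-up parameter values:

```python
def garch_variance_forecasts(alpha0, alpha1, beta, u_T, sigma2_T, horizon=10):
    """s-step-ahead conditional variance forecasts from a GARCH(1,1):
    f[1] = alpha0 + alpha1*u_T**2 + beta*sigma2_T, then f[s] = alpha0 + (alpha1+beta)*f[s-1]."""
    f = [alpha0 + alpha1 * u_T ** 2 + beta * sigma2_T]
    for _ in range(horizon - 1):
        f.append(alpha0 + (alpha1 + beta) * f[-1])
    return f

print(garch_variance_forecasts(0.05, 0.08, 0.90, u_T=0.01, sigma2_T=0.02, horizon=5))
```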
1. Option pricing
2. Conditional betas
$\beta_{i,t} = \frac{\sigma_{im,t}}{\sigma_{m,t}^2}$
3. Dynamic hedge ratios
• What if the standard deviations and correlation are changing over time?
Use $h_t = p_t\,\frac{\sigma_{s,t}}{\sigma_{F,t}}$
Testing Non-linear Restrictions or
Testing Hypotheses about Non-linear Models
• Usual t- and F-tests are still valid in non-linear models, but they are
not flexible enough.
$y_t = \mu + \phi y_{t-1} + u_t$,   $u_t \sim N(0, \sigma_t^2)$
$\sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta\sigma_{t-1}^2$
• We estimate the model imposing the restriction and observe the maximised
LLF falls to 64.54. Can we accept the restriction?
• LR = -2(64.54-66.85) = 4.62.
• The test follows a $\chi^2(1)$, for which the 5% critical value is 3.84, so we reject the null (the restriction is rejected).
• Denoting the maximised value of the LLF by unconstrained ML as $L(\hat{\theta})$ and the constrained optimum as $L(\tilde{\theta})$, we can illustrate the three testing procedures in the following diagram:
Comparison of Testing Procedures under Maximum Likelihood: Diagrammatic Representation
[Figure: the log-likelihood function plotted against θ, with the unrestricted maximum $L(\hat{\theta})$ marked at point A and the restricted value $L(\tilde{\theta})$ at point B.]
• We know that at the unrestricted MLE, $L(\hat{\theta})$, the slope of the curve is zero.
• But is it “significantly steep” at $L(\tilde{\theta})$?
• Purpose
• To consider the out of sample forecasting performance of GARCH and
EGARCH Models for predicting stock index volatility.
• Data
• Weekly closing prices (Wednesday to Wednesday, and Friday to Friday)
for the S&P100 Index option and the underlying 11 March 83 - 31 Dec. 89
or $\ln(h_t) = \alpha_0 + \beta_1\ln(h_{t-1}) + \theta_1\frac{u_{t-1}}{\sqrt{h_{t-1}}} + \gamma_1\left[\frac{|u_{t-1}|}{\sqrt{h_{t-1}}} - \left(\frac{2}{\pi}\right)^{1/2}\right]$   (3)
where
RMt denotes the return on the market portfolio
RFt denotes the risk-free rate
ht denotes the conditional variance from the GARCH-type models while
t2 denotes the implied variance from option prices.
The Models (cont’d)
$\ln(h_t) = \alpha_0 + \beta_1\ln(h_{t-1}) + \theta_1\frac{u_{t-1}}{\sqrt{h_{t-1}}} + \gamma_1\left[\frac{|u_{t-1}|}{\sqrt{h_{t-1}}} - \left(\frac{2}{\pi}\right)^{1/2}\right] + \delta\ln(\sigma_{t-1}^2)$   (5)
• If this second set of restrictions holds, then (4) & (5) collapse to
$h_t = \alpha_0 + \delta\sigma_{t-1}^2$   (4′)
$R_{Mt} - R_{Ft} = \lambda_0 + \lambda_1 h_t + u_t$   (8.78)
$h_t = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta_1 h_{t-1}$   (8.79)
$h_t = \alpha_0 + \alpha_1 u_{t-1}^2 + \beta_1 h_{t-1} + \delta\sigma_{t-1}^2$   (8.80)
$h_t = \alpha_0 + \delta\sigma_{t-1}^2$   (8.81)

Variance        λ0        λ1        α0×10-4   α1        β1        δ         Log-L      χ2
specification
(8.79)          0.0072    0.071     5.428     0.093     0.854     -         767.321    17.77
                (0.005)   (0.01)    (1.65)    (0.84)    (8.17)
(8.80)          0.0015    0.043     2.065     0.266     -0.068    0.318     776.204    -
                (0.028)   (0.02)    (2.98)    (1.17)    (-0.59)   (3.00)
(8.81)          0.0056    -0.184    0.993     -         -         0.581     764.394    23.62
                (0.001)   (-0.001)  (1.50)                        (2.94)

Notes: t-ratios in parentheses; Log-L denotes the maximised value of the log-likelihood function in each case. χ2 denotes the value of the test statistic, which follows a χ2(1) in the case of (8.80) restricted to (8.79), and a χ2(2) in the case of (8.80) restricted to (8.81). Source: Day and Lewis (1992). Reprinted with the permission of Elsevier Science.
• But the models do not represent a true test of the predictive ability of
IV.
• There are 729 data points. They use the first 410 to estimate the
models, and then make a 1-step ahead forecast of the following week’s
volatility.
where $\sigma_{t+1}^2$ is the “actual” value of volatility, and $\sigma_{ft}^2$ is the value forecasted for it.
• The relatively small number of such options that exist limits the
circumstances in which implied covariances can be calculated
• EWMA models also cannot allow for the observed mean reversion in
the volatilities or covariances of asset returns that is particularly
prevalent at lower frequencies.
• In the case of the VECH, the conditional variances and covariances would
each depend upon lagged values of all of the variances and covariances
and on lags of the squares of both error terms and their cross products.
• In matrix form, it would be written
$\text{VECH}(H_t) = C + A\,\text{VECH}(\Xi_{t-1}\Xi_{t-1}') + B\,\text{VECH}(H_{t-1})$ with $\Xi_t\mid\psi_{t-1} \sim N(0, H_t)$
• Writing out all of the elements gives the 3 equations as
$h_{11t} = c_{11} + a_{11}u_{1,t-1}^2 + a_{12}u_{2,t-1}^2 + a_{13}u_{1,t-1}u_{2,t-1} + b_{11}h_{11,t-1} + b_{12}h_{22,t-1} + b_{13}h_{12,t-1}$
$h_{22t} = c_{21} + a_{21}u_{1,t-1}^2 + a_{22}u_{2,t-1}^2 + a_{23}u_{1,t-1}u_{2,t-1} + b_{21}h_{11,t-1} + b_{22}h_{22,t-1} + b_{23}h_{12,t-1}$
$h_{12t} = c_{31} + a_{31}u_{1,t-1}^2 + a_{32}u_{2,t-1}^2 + a_{33}u_{1,t-1}u_{2,t-1} + b_{31}h_{11,t-1} + b_{32}h_{22,t-1} + b_{33}h_{12,t-1}$
• Such a model would be hard to estimate. The diagonal VECH is much
simpler and is specified, in the 2 variable case, as follows:
$h_{11t} = \alpha_0 + \alpha_1 u_{1,t-1}^2 + \alpha_2 h_{11,t-1}$
$h_{22t} = \beta_0 + \beta_1 u_{2,t-1}^2 + \beta_2 h_{22,t-1}$
$h_{12t} = \gamma_0 + \gamma_1 u_{1,t-1}u_{2,t-1} + \gamma_2 h_{12,t-1}$
BEKK and Model Estimation for M-GARCH
• Neither the VECH nor the diagonal VECH ensure a positive definite
variance-covariance matrix.
• An alternative approach is the BEKK model (Engle & Kroner, 1995).
• The BEKK Model uses a Quadratic form for the parameter matrices to
ensure a positive definite variance / covariance matrix Ht.
• In matrix form, the BEKK model is $H_t = W'W + A'H_{t-1}A + B'\Xi_{t-1}\Xi_{t-1}'B$
• Model estimation for all classes of multivariate GARCH model is
again performed using maximum likelihood with the following LLF:
$\ell(\theta) = -\frac{TN}{2}\log(2\pi) - \frac{1}{2}\sum_{t=1}^{T}\left(\log|H_t| + \Xi_t' H_t^{-1}\Xi_t\right)$
where N is the number of variables in the system (assumed 2 above),
is a vector containing all of the parameters, and T is the number of obs.
• The off-diagonal elements of Ht, hij,t (i ≠ j), are defined indirectly via the correlations, denoted ρij:
where:
• S is the unconditional correlation matrix of the vector of standardised
residuals (from the first stage estimation), ut = Dt−1ϵt
• ι is a vector of ones
• Qt is an N × N symmetric positive definite variance-covariance matrix
• ◦ denotes the Hadamard or element-by-element matrix multiplication
procedure
• This specification for the intercept term simplifies estimation and
reduces the number of parameters.
The DCC Model – A Possible Specification
• The model may be estimated in a single stage using ML although this will
be difficult. So Engle advocates a two-stage procedure where each variable
in the system is first modelled separately as a univariate GARCH
• A joint log-likelihood function for this stage could be constructed, which
would simply be the sum (over N) of all of the log-likelihoods for the
individual GARCH models
• In the second stage, the conditional likelihood is maximised with respect
to any unknown parameters in the correlation matrix
• The log-likelihood function for the second stage estimation will be of the
form
• where θ1 and θ2 denote the parameters to be estimated in the 1st and 2nd
stages, respectively.
Asymmetric Multivariate GARCH
In Sample
            Unhedged    Naïve Hedge   Symmetric Time-     Asymmetric Time-
            (β = 0)     (β = 1)       Varying Hedge       Varying Hedge
                                      βt = hFC,t/hF,t     βt = hFC,t/hF,t
Return      0.0389      -0.0003       0.0061              0.0060
            {2.3713}    {-0.0351}     {0.9562}            {0.9580}
Variance    0.8286      0.1718        0.1240              0.1211

Out of Sample
            Unhedged    Naïve Hedge   Symmetric Time-     Asymmetric Time-
            (β = 0)     (β = 1)       Varying Hedge       Varying Hedge
                                      βt = hFC,t/hF,t     βt = hFC,t/hF,t
Return      0.0819      -0.0004       0.0120              0.0140
            {1.4958}    {0.0216}      {0.7761}            {0.9083}
Variance    1.4972      0.1696        0.1186              0.1188
Plot of the OHR from Multivariate GARCH
[Figure: time series plot of the optimal hedge ratio (OHR) from the symmetric and asymmetric BEKK models over the sample, ranging roughly between 0.65 and 0.95.]
Conclusions
- The OHR is time-varying and less than 1.
- The M-GARCH OHR provides a better hedge, both in-sample and out-of-sample.
- There is no role for asymmetries in calculating the OHR.
Panel Data
• Panel data, also known as longitudinal data, have both time series and cross-
sectional dimensions.
• They arise when we measure the same collection of people or objects over a
period of time.
• Econometrically, the setup is
$y_{it} = \alpha + \beta x_{it} + u_{it}$
where yit is the dependent variable, $\alpha$ is the intercept term, $\beta$ is a k×1 vector of parameters to be estimated on the explanatory variables, xit; t = 1, …, T; i = 1, …, N.
• The simplest way to deal with this data would be to estimate a single, pooled
regression on all the observations together.
• But pooling the data assumes that there is no heterogeneity – i.e. the same
relationship holds for all the data.
The Advantages of using Panel Data
There are a number of advantages from using a full panel technique when a panel
of data is available.
• We can address a broader range of issues and tackle more complex problems
with panel data than would be possible with pure time series or pure cross-
sectional data alone.
• One approach to making more full use of the structure of the data would be
to use the SUR framework initially proposed by Zellner (1962). This has
been used widely in finance where the requirement is to model several
closely related variables over time.
• A SUR is so-called because the dependent variables may seem unrelated
across the equations at first sight, but a more careful consideration would
allow us to conclude that they are in fact related after all.
• Under the SUR approach, one would allow for the contemporaneous
relationships between the error terms in the equations by using a generalised
least squares (GLS) technique.
• The idea behind SUR is essentially to transform the model so that the error
terms become uncorrelated. If the correlations between the error terms in the
individual equations had been zero in the first place, then SUR on the system
of equations would have been equivalent to running separate OLS
regressions on each equation.
Fixed and Random Effects Panel Estimators
• There are two main classes of panel techniques: the fixed effects estimator
and the random effects estimator.
• The fixed effects model for some variable yit may be written
$y_{it} = \alpha + \beta x_{it} + \mu_i + v_{it}$
• We can think of $\mu_i$ as encapsulating all of the variables that affect yit cross-sectionally but do not vary over time – for example, the sector that a firm operates in, a person's gender, or the country where a bank has its headquarters, etc. Thus we would capture the heterogeneity that is encapsulated in $\mu_i$ by a method that allows for different intercepts for each cross-sectional unit.
• This model could be estimated using dummy variables, which would be termed the least squares dummy variable (LSDV) approach:
$y_{it} = \beta x_{it} + \mu_1 D1_i + \mu_2 D2_i + \cdots + \mu_N DN_i + v_{it}$
where D1i is a dummy variable that takes the value 1 for all observations on the first entity (e.g., the first firm) in the sample and zero otherwise, D2i is a dummy variable that takes the value 1 for all observations on the second entity (e.g., the second firm) and zero otherwise, and so on.
• The LSDV can be seen as just a standard regression model and therefore it
can be estimated using OLS.
• Now the model given by the equation above has N+k parameters to estimate.
In order to avoid the necessity to estimate so many dummy variable
parameters, a transformation, known as the within transformation, is used to
simplify matters.
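A sketch of the within transformation, assuming Python with pandas/statsmodels and a panel DataFrame df with columns 'entity', 'y' and 'x' (all names hypothetical); the slope estimate matches the LSDV slope, although the reported standard errors would need a degrees-of-freedom correction:

```python
import pandas as pd
import statsmodels.api as sm

def within_estimator(df):
    """Entity-demean y and x, then run OLS on the demeaned data (fixed effects slopes)."""
    demeaned = df[["y", "x"]] - df.groupby("entity")[["y", "x"]].transform("mean")
    return sm.OLS(demeaned["y"], demeaned[["x"]]).fit().params

# Equivalent LSDV regression with one dummy per entity:
# sm.OLS(df["y"], pd.get_dummies(df["entity"], dtype=float).join(df[["x"]])).fit()
```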
The Within Transformation
• Time-variation in the intercept terms can be allowed for in exactly the same
way as with entity fixed effects. That is, a least squares dummy variable
model could be estimated
$y_{it} = \beta x_{it} + \lambda_1 D1_t + \lambda_2 D2_t + \cdots + \lambda_T DT_t + v_{it}$
where D1t, for example, denotes a dummy variable that takes the value 1 for
the first time period and zero elsewhere, and so on.
• The only difference is that now, the dummy variables capture time variation
rather than cross-sectional variation. Similarly, in order to avoid estimating a
model containing all T dummies, a within transformation can be conducted to
subtract away the cross-sectional averages from each observation
• Finally, it is possible to allow for both entity fixed effects and time fixed
effects within the same model. Such a model would be termed a two-way
error component model, and the LSDV equivalent model would contain both
cross-sectional and time dummies
Investigating Banking Competition with a Fixed Effects
Model
• The UK banking sector is relatively concentrated and apparently extremely
profitable.
• It has been argued that competitive forces are not sufficiently strong and that
there are barriers to entry into the market.
• A study by Matthews, Murinde and Zhao (2007) investigates competitive
conditions in UK banking between 1980 and 2004 using the Panzar-Rosse
approach.
• The model posits that if the market is contestable, entry to and exit from the
market will be easy (even if the concentration of market share among firms is
high), so that prices will be set equal to marginal costs.
• The technique used to examine this conjecture is to derive testable
restrictions upon the firm's reduced form revenue equation.
• The model also includes several variables that capture time-varying bank-
specific effects on revenues and costs, and these are: RISKASS, the ratio of
provisions to total assets; ASSET is bank size, as measured by total assets; BR
is the ratio of the bank's number of branches to the total number of branches
for all banks.
• Finally, GROWTHt is the rate of growth of GDP, which obviously varies
over time but is constant across banks at a given point in time; $\mu_i$ is a bank-specific fixed effect and vit is an idiosyncratic disturbance term. The contestability parameter, H, is given as $H = \beta_1 + \beta_2 + \beta_3$.
• Unfortunately, the Panzar-Rosse approach is only valid when applied to a
banking market in long-run equilibrium. Hence the authors also conduct a
test for this, which centres on the regression
$\ln ROA_{it} = \alpha_0' + \beta_1'\ln PL_{it} + \beta_2'\ln PK_{it} + \beta_3'\ln PF_{it} + \gamma_1'\ln RISKASS_{it} + \gamma_2'\ln ASSET_{it} + \gamma_3'\ln BR_{it} + \delta_1' GROWTH_t + \mu_i + w_{it}$
Methodology (Cont’d)
• The explanatory variables for the equilibrium test regression are identical to
those of the contestability regression but the dependent variable is now the
log of the return on assets (lnROA).
• Equilibrium is argued to exist in the market if $\beta_1' + \beta_2' + \beta_3' = 0$.
• Matthews et al. employ a fixed effects panel data model which allows for
differing intercepts across the banks, but assumes that these effects are fixed
over time.
• The fixed effects approach is a sensible one given the data analysed here
since there is an unusually large number of years (25) compared with the
number of banks (12), resulting in a total of 219 bank-years (observations).
• The data employed in the study are obtained from banks' annual reports and
the Annual Abstract of Banking Statistics from the British Bankers
Association. The analysis is conducted for the whole sample period, 1980-
2004, and for two sub-samples, 1980-1991 and 1992-2004.
• The null hypothesis that the bank fixed effects are jointly zero (H0: $\mu_i$ = 0) is
rejected at the 1% significance level for the full sample and for the second
sub-sample but not at all for the first sub-sample.
• Overall, however, this indicates the usefulness of the fixed effects panel
model that allows for bank heterogeneity.
• The main focus of interest in the table on the previous slide is the equilibrium
test, and this shows slight evidence of disequilibrium (E is significantly
different from zero at the 10% level) for the whole sample, but not for either
of the individual sub-samples.
• Thus the conclusion is that the market appears to be sufficiently in a state of
equilibrium that it is valid to continue to investigate the extent of competition
using the Panzar-Rosse methodology. The results of this are presented on the
following slide.
• The value of the contestability parameter, H, which is the sum of the input
elasticities, falls in value from 0.78 in the first sub-sample to 0.46 in the
second, suggesting that the degree of competition in UK retail banking
weakened over the period.
• However, the results in the two rows above show that the null hypotheses that H = 0 and H = 1 can both be rejected at the 1% significance level for both sub-samples, indicating that the market is best characterised by monopolistic competition.
• As for the equilibrium regressions, the null hypothesis that the fixed effects dummies (ηi) are jointly zero is strongly rejected, vindicating the use of the fixed effects panel approach and suggesting that the base levels of the dependent variables differ across banks.
• Finally, the additional bank control variables all appear to have intuitively
appealing signs. For example, the risk assets variable has a positive sign, so
that higher risks lead to higher revenue per unit of total assets; the asset
variable has a negative sign, and is statistically significant at the 5% level or
below in all three periods, suggesting that smaller banks are more profitable.
The Random Effects Model
• Unlike the fixed effects model, there are no dummy variables to capture the
heterogeneity (variation) in the cross-sectional dimension.
• Instead, this occurs via the εi terms.
• Note that this framework requires the assumptions that the new cross-sectional error term, εi, has zero mean, is independent of the individual observation error term vit, has constant variance, and is independent of the explanatory variables.
• The parameters (α and the β vector) are estimated consistently but inefficiently by OLS, and the conventional formulae would have to be modified as a result of the cross-correlations between error terms for a given cross-sectional unit at different points in time.
• Instead, a generalised least squares (GLS) procedure is usually used. The
transformation involved in this GLS procedure is to subtract a weighted mean
of the yit over time (i.e. part of the mean rather than the whole mean, as was
the case for fixed effects estimation).
• Define the ‘quasi-demeaned’ data as yit* = yit − θȳi, and similarly for xit.
• θ will be a function of the variance of the observation error term, σv², and of the variance of the entity-specific error term, σε²:

    θ = 1 − σv / √(Tσε² + σv²)
• This transformation will be precisely that required to ensure that there are no
cross-correlations in the error terms, but fortunately it should automatically
be implemented by standard software packages.
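• As an illustration, the following minimal Python sketch (simulated data, illustrative variable names) applies the quasi-demeaning transformation by hand and then estimates the transformed equation by OLS; in practice the variance components are estimated rather than known, and the whole procedure is automated by standard packages:

```python
# Minimal sketch of the random effects (quasi-demeaning) GLS transformation,
# assuming a balanced panel. Data and variance components are simulated
# purely for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
N, T = 50, 10
entity = np.repeat(np.arange(N), T)
eps_i = rng.normal(scale=1.0, size=N)[entity]        # entity-specific random effect
x = rng.normal(size=N * T)
y = 2.0 + 0.5 * x + eps_i + rng.normal(scale=0.5, size=N * T)
df = pd.DataFrame({"entity": entity, "y": y, "x": x})

# Variance components (taken from the simulation here; normally estimated)
sigma2_v, sigma2_eps = 0.25, 1.0
theta = 1 - np.sqrt(sigma2_v / (T * sigma2_eps + sigma2_v))

# Quasi-demean: subtract theta times the entity mean from each variable
means = df.groupby("entity")[["y", "x"]].transform("mean")
y_star = df["y"] - theta * means["y"]
x_star = df["x"] - theta * means["x"]
const_star = 1 - theta                               # the intercept is transformed too

X_star = np.column_stack([np.full(len(df), const_star), x_star])
res = sm.OLS(y_star, X_star).fit()
print(res.params)                                    # roughly (2.0, 0.5)
```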
• Just as for the fixed effects model, with random effects, it is also
conceptually no more difficult to allow for time variation than it is to allow
for cross-sectional variation.
• In the case of time-variation, a time period-specific error term is included and
again, a two-way model could be envisaged to allow the intercepts to vary
both cross-sectionally and over time.
Fixed or Random Effects?
• It is often said that the random effects model is more appropriate when the
entities in the sample can be thought of as having been randomly selected
from the population, but a fixed effect model is more plausible when the
entities in the sample effectively constitute the entire population.
• Also, since there are fewer parameters to be estimated with the random
effects model (no dummy variables or within transform to perform), and
therefore degrees of freedom are saved, the random effects model should
produce more efficient estimation than the fixed effects approach.
• However, the random effects approach has a major drawback which arises from the fact that it is valid only when the composite error term ωit is uncorrelated with all of the explanatory variables.
Fixed or Random Effects? (Cont’d)
• This assumption is more stringent than the corresponding one in the fixed effects case, because with random effects we require both εi and vit to be independent of all of the xit.
• This can also be viewed as a consideration of whether any unobserved
omitted variables (that were allowed for by having different intercepts for
each entity) are uncorrelated with the included explanatory variables. If they
are uncorrelated, a random effects approach can be used; otherwise the fixed
effects model is preferable.
• A test for whether this assumption is valid for the random effects estimator is
based on a slightly more complex version of the Hausman test.
• If the assumption does not hold, the parameter estimates will be biased and
inconsistent.
• To see how this arises, suppose that we have only one explanatory variable, x2it, that varies positively with yit, and also with the error term, ωit. The estimator will ascribe all of any increase in y to x when in reality some of it arises from the error term, resulting in biased coefficients.
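• A minimal sketch of how the Hausman statistic itself can be computed is given below; it assumes (purely for illustration) that the fixed and random effects slope estimates and their covariance matrices have already been obtained from some panel estimation routine, and the numbers shown are invented:

```python
# Hypothetical sketch of the Hausman test statistic, assuming the fixed and
# random effects slope estimates (b_fe, b_re) and their covariance matrices
# (V_fe, V_re) are already available from a panel estimation routine.
import numpy as np
from scipy import stats

def hausman(b_fe, b_re, V_fe, V_re):
    """H = (b_fe - b_re)' [V_fe - V_re]^{-1} (b_fe - b_re), asymptotically chi-squared(k)."""
    d = b_fe - b_re
    H = float(d @ np.linalg.inv(V_fe - V_re) @ d)
    p_value = 1 - stats.chi2.cdf(H, df=len(d))
    return H, p_value

# Illustrative numbers only
b_fe = np.array([0.52, -0.11])
b_re = np.array([0.48, -0.09])
V_fe = np.array([[0.0040, 0.0005], [0.0005, 0.0030]])
V_re = np.array([[0.0030, 0.0004], [0.0004, 0.0025]])
H, p = hausman(b_fe, b_re, V_fe, V_re)
print(f"Hausman statistic = {H:.2f}, p-value = {p:.3f}")
# A small p-value suggests the random effects orthogonality assumption fails,
# favouring the fixed effects specification.
```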
• There may also be differences in policies for credit provision dependent upon
the nature of the formation of the subsidiary abroad – i.e. whether the
subsidiary's existence results from a take-over of a domestic bank or from the
formation of an entirely new startup operation (a ‘greenfield investment’).
• The data cover the period 1993-2000 and are obtained from BankScope.
• These are: weakness of the parent bank, defined as loan loss provisions made by the parent bank; solvency, the ratio of equity to total assets; liquidity, the ratio of liquid assets to total assets; size, the ratio of total bank assets to total banking assets in the given country; profitability, the return on assets; and efficiency, the net interest margin.
• α and the β's are parameters (or vectors of parameters in the cases of β4 and β5), εi is the unobserved random effect that varies across banks but not over time, and vit is an idiosyncratic error term.
• de Haas and van Lelyveld discuss the various techniques that could be
employed to estimate such a model.
• OLS is considered to be inappropriate since it does not allow for differences
in average credit market growth rates at the bank level.
• A model allowing for entity-specific effects (i.e. a fixed effects model that
effectively allowed for a different intercept for each bank) is ruled out on the
grounds that there are many more banks than time periods and thus too many
parameters would be required to be estimated.
• They also argue that these bank-specific effects are not of interest to the
problem at hand, which leads them to select the random effects panel model.
• This essentially allows for a different error structure for each bank. A Hausman test is conducted, and shows that the random effects model is valid since the bank-specific effects εi are found “in most cases not to be significantly correlated with the explanatory variables.”
• The main result is that during times of banking disasters, domestic banks
significantly reduce their credit growth rates (i.e. the parameter estimate on
the crisis variable is negative for domestic banks), while the parameter is
close to zero and not significant for foreign banks.
• This indicates that, as the authors expected, when foreign banks have fewer
viable lending opportunities in their own countries and hence a lower
opportunity cost for the loanable funds, they may switch their resources to
the host country.
• Lending rates, both at home and in the host country, have little impact on
credit market share growth.
Analysis of Results (Cont’d)
• Overall, both home-related (‘push’) and host-related (‘pull’) factors are found
to be important in explaining foreign bank credit growth.
• Dickey-Fuller and Phillips-Perron unit root tests have low power, especially
for modest sample sizes
• It is believed that more powerful versions of the tests can be employed when
time-series and cross-sectional information is combined – as a result of the
increase in sample size
• We could increase the number of observations by increasing the sample period, but these data may not be available, or may have structural breaks
• A valid application of the test statistics is much more complex for panels than for single series
• Two important issues to consider:
– The design and interpretation of the null and alternative hypotheses needs careful
thought
– There may be a problem of cross-sectional dependence in the errors across the
unit root testing regressions
• Early studies that assumed cross-sectional independence are sometimes
known as ‘first generation’ panel unit root tests
The MADF Test
• We could run separate unit root test regressions for each series but estimate them as a system using Zellner's seemingly unrelated regression (SUR) approach, which we might term the multivariate ADF (MADF) test
• This method can only be employed if T >> N, and Taylor and Sarno (1998)
provide an early application to tests for purchasing power parity
• However, that technique is now rarely used, researchers preferring instead to
make use of the full panel structure
• A key consideration is the dimensions of the panel – is the situation that T is
large or that N is large or both? If T is large and N small, the MADF
approach can be used
• But as Breitung and Pesaran (2008) note, in such a situation one may
question whether it is worthwhile to adopt a panel approach at all, since for
sufficiently large T, separate ADF tests ought to be reliable enough to render
the panel approach hardly worth the additional complexity.
• Levin, Lin and Chu (2002) – LLC – develop a test based on an equation of the form

    Δyit = αi + θt + δit + ρyi,t−1 + Σk φik Δyi,t−k + uit
• The model is very general since it allows for both entity-specific and time-
specific effects through αi and θt respectively as well as separate
deterministic trends in each series through δit, and the lag structure to mop
up autocorrelation in Δy
• One of the reasons that unit root testing is more complex in the panel
framework in practice is due to the plethora of ‘nuisance parameters’ in the
equation which are necessary to allow for the fixed effects (i.e. αi, θt, δit)
• These nuisance parameters will affect the asymptotic distribution of the test
statistics and hence LLC propose that two auxiliary regressions are run to
remove their impacts
• Breitung (2000) develops a modified version of the LLC test which does
not include the deterministic terms and which standardises the residuals
from the auxiliary regression in a more sophisticated fashion.
• Under the LLC and Breitung approaches, only evidence against the non-
stationary null in one series is required before the joint null will be rejected
• Breitung and Pesaran (2008) suggest that the appropriate conclusion when
the null is rejected is that ‘a significant proportion of the cross-sectional
units are stationary’
• Especially in the context of large N, this might not be very helpful since no
information is provided on how many of the N series are stationary
• This difficulty led Im, Pesaran and Shin (2003) – IPS – to propose an
alternative approach where the null and alternative hypotheses are now H0:
ρi = 0 ∀ i and H1: ρi < 0, i = 1, 2, . . . , N1; ρi = 0, i = N1 + 1, N1 + 2, . . . , N
• So the null hypothesis still specifies all series in the panel as nonstationary,
but under the alternative, a proportion of the series (N1/N) are stationary,
and the remaining proportion ((N − N1)/N) are nonstationary
• No restriction that all of the ρi are identical is imposed
• The statistic in this case is constructed by conducting separate unit root
tests for each series in the panel, calculating the ADF t-statistic for each one
in the standard fashion, and then taking their cross-sectional average
• This average is then transformed into a standard normal variate under the
null hypothesis of a unit root in all the series
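• The following minimal Python sketch (simulated data) illustrates the core of the IPS construction: an ADF t-statistic for each series and their cross-sectional average. The final standardisation into a normal variate uses expected values and variances of the ADF t-statistic tabulated by IPS, which are omitted here:

```python
# Minimal sketch of the core of the IPS statistic: run an ADF test on each
# series in the panel and average the resulting t-statistics (t-bar).
# Data are simulated for illustration (N random walks, i.e. unit roots).
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(2)
N, T = 10, 100
panel = np.cumsum(rng.normal(size=(N, T)), axis=1)

t_stats = [adfuller(series, regression="c", autolag="AIC")[0] for series in panel]
t_bar = np.mean(t_stats)
print(f"Cross-sectional average ADF t-statistic (t-bar) = {t_bar:.3f}")
```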
• While IPS’s heterogeneous panel unit root tests are superior to the
homogeneous case when N is modest relative to T, they may not be
sufficiently powerful when N is large and T is small, in which case the LLC
approach may be preferable.
The Maddala and Wu (1999) and Choi (2001) Tests
• Maddala and Wu (1999) and Choi (2001) developed a slight variant on the
IPS approach based on an idea dating back to Fisher (1932)
• Unit root tests are conducted separately on each series in the panel and the
p-values associated with the test statistics are then combined
• If we call these p-values pvi, i = 1, 2, . . . , N, then under the null hypothesis of a unit root in each series, pvi will be distributed uniformly over the [0,1] interval and hence, for given N as T → ∞, the statistic λ = −2 Σi ln(pvi) will be asymptotically distributed as a χ2 with 2N degrees of freedom
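• A minimal Python sketch of this Fisher-type combination of p-values is given below (simulated data); the combined statistic is compared with a χ2 distribution with 2N degrees of freedom:

```python
# Minimal sketch of the Fisher-type (Maddala-Wu) panel unit root test:
# combine the p-values from individual ADF tests. Under the joint unit root
# null the statistic is asymptotically chi-squared with 2N degrees of freedom.
# Data are simulated for illustration (N independent random walks).
import numpy as np
from scipy import stats
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)
N, T = 8, 150
panel = np.cumsum(rng.normal(size=(N, T)), axis=1)

p_values = [adfuller(series, regression="c", autolag="AIC")[1] for series in panel]
fisher_stat = -2.0 * np.sum(np.log(p_values))
p_value = 1 - stats.chi2.cdf(fisher_stat, df=2 * N)
print(f"Fisher statistic = {fisher_stat:.2f}, p-value = {p_value:.3f}")
```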
• Testing for cointegration in panels is complex since one must consider the
possibility of cointegration across groups of variables (what we might term
‘cross-sectional cointegration’) as well as within the groups
• Most of the work so far has relied upon a generalisation of the single
equation methods of the Engle-Granger type following the pioneering work
by Pedroni (1999, 2004)
• For a set of M variables yit and xm,i,t that are individually integrated of order one and thought to be cointegrated, the model is

    yit = αi + δit + β1i x1,i,t + β2i x2,i,t + . . . + βMi xM,i,t + uit
• The residuals from this regression are then subjected to separate Dickey-
Fuller or augmented Dickey-Fuller type regressions for each group
• The null hypothesis is that the residuals from all of the test regressions are
unit root processes and therefore that there is no cointegration.
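• The following minimal Python sketch (simulated data) illustrates the spirit of this residual-based approach: a group-by-group cointegrating regression followed by an ADF-type test on its residuals. The actual Pedroni statistics pool these group results using tabulated adjustment terms that are omitted here:

```python
# Minimal sketch of a residual-based (Engle-Granger style) panel cointegration
# check in the spirit of Pedroni: estimate the potentially cointegrating
# regression for each group, then apply an ADF-type test to the residuals.
# Data are simulated for illustration (each y is cointegrated with its x).
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)
N, T = 6, 200
residual_t_stats = []
for i in range(N):
    x = np.cumsum(rng.normal(size=T))            # I(1) regressor
    y = 1.0 + 0.8 * x + rng.normal(size=T)       # cointegrated with x
    res = sm.OLS(y, sm.add_constant(x)).fit()    # group-specific regression
    residual_t_stats.append(adfuller(res.resid)[0])

print("ADF t-statistics on the group residuals:", np.round(residual_t_stats, 2))
```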
• To what extent are economic growth and the sophistication of the country’s
financial markets linked?
• Excessive government regulations may impede the development of the
financial markets and consequently economic growth will be slower
• On the other hand, if economic agents are able to borrow at reasonable rates
of interest or raise funding easily on the capital markets, this can increase
the viability of real investment opportunities
• Given that long time-series are typically unavailable for developing
economies, traditional unit root and cointegration tests that examine the link
between these two variables suffer from low power
• This provides a strong motivation for the use of panel techniques, as in the
study by Christopoulos and Tsionas (2004)
• The results are much stronger than for the single series unit root tests and show conclusively that all four series are non-stationary in levels but stationary in differences
• The LLC approach is used along with the Harris-Tzavalis technique, which
is broadly the same as LLC but has slightly different correction factors in
the limiting distribution
• These techniques are based on a unit root test on the residuals from the
potentially cointegrating regression
• Christopoulos and Tsionas investigate the use of panel cointegration tests with fixed effects, and with both fixed effects and a deterministic trend in the test regressions
• These are applied to the regressions with y and, separately, with F as the dependent variable
• The results quite strongly demonstrate that when the dependent variable is
output, the LLC approach rejects the null hypothesis of a unit root in the
potentially cointegrating regression residuals when fixed effects only are
included in the test regression, but not when a trend is also included.