
Addis Ababa University

College of Business and Economics


Department of Economics
Econ 654: Research Methods & Applied Econometrics

1. Univariate Time Series Analysis & Forecasting

Fantu Guta Chemrie (PhD)


1. Univariate Time Series Analysis and Forecasting
1.1 Expectations, Stationarity, Ergodicity & WN Process
1.2 Autoregressive and Moving Average Processes
1.3 Use of Lag operator
1.4 Invertibility of Lag Polynomials
1.5 Problem of Common Roots of Lag Polynomials
1.6 Autocovariance Generating Function
1.7 Trend Stationarity
1.8 Testing for Omitted Serial Correlation
1.9 Stationarity and Unit Roots
1.10 Estimation of ARMA Models
1.11 Model Selection and Diagnostic Checking
1.12 Predicting with ARMA models



1. Univariate Time Series Analysis and Forecasting
1.1 Expectations, Stationarity, Ergodicity and White Noise Process

A time series y_t is a process observed in sequence over time, t = 1, . . . , T.
To indicate the dependence on time, we adopt new notation, using the subscript t to denote the individual observation and T to denote the number of observations.
Because of the sequential nature of time series, we expect that y_t and y_{t-1} are not independent, so the classical assumptions of linear regression models are not valid.
We can separate time series into two categories: univariate (y_t ∈ R is a scalar) and multivariate (y_t ∈ R^n is vector-valued).
The primary models for univariate time series are autoregressions (ARs); the primary models for multivariate time series are vector autoregressions (VARs).
Suppose we observed a sample of size T of some random variable Y_t: {y_t}_{t=1}^T.
The observed sample represents T particular numbers, but these T numbers are only one possible outcome of the underlying stochastic process¹ that generated the data.
¹A stochastic process can be defined as a function on T × Ω, that is, X : T × Ω → R such that X : (t, ω) → X(t, ω) = X_t(ω). For a given t ∈ T, X_t(·) : Ω → R is a random variable. For a given ω ∈ Ω, X(·, ω) : T → R is a function mapping the index set T to R. Such a function is called a sample function (a sample path, a realization or a trajectory).
Indeed, even if we were to imagine having observed the process {y_t}_{t=-∞}^{∞}, this infinite sequence would still be viewed as a single realization from a time series process.
Imagine we have N computers generating sequences {y_t^{(1)}}_{t=-∞}^{∞}, {y_t^{(2)}}_{t=-∞}^{∞}, . . . , {y_t^{(N)}}_{t=-∞}^{∞}, and consider selecting the observation associated with date t from each sequence: {y_t^{(1)}, y_t^{(2)}, . . . , y_t^{(N)}}.
This would be described as a sample of N realizations of the random variable Y_t.
This random variable has some density function, denoted by f_{Y_t}(y_t), which is the unconditional density of Y_t.
The expectation of the t-th observation of the time series refers to the mean of this probability distribution, provided it exists:

    E(Y_t) = ∫_{-∞}^{∞} y_t f_{Y_t}(y_t) dy_t        (1.1)


We might view this as the probability limit of the ensemble average:

    E(Y_t) = plim_{N→∞} (1/N) ∑_{i=1}^{N} y_t^{(i)}        (1.2)

Sometimes for emphasis the expectation E(Y_t) is called the unconditional mean of Y_t, denoted by µ_t.
Note that this notation allows the general possibility that the mean can be a function of the date of the observation t.
The variance of the random variable Y_t (denoted by γ_{0t}) is similarly defined as:

    γ_{0t} = E(Y_t − µ_t)² = ∫_{-∞}^{∞} (y_t − µ_t)² f_{Y_t}(y_t) dy_t        (1.3)

The autocovariance at lag k of Y_t (denoted by γ_{kt}) is defined as:

    γ_{kt} = E[(Y_t − µ_t)(Y_{t−k} − µ_{t−k})]        (1.4)
           = ∫_{-∞}^{∞} · · · ∫_{-∞}^{∞} (y_t − µ_t)(y_{t−k} − µ_{t−k}) f_{Y_t, Y_{t−1}, ..., Y_{t−k}}(y_t, y_{t−1}, . . . , y_{t−k}) dy_t dy_{t−1} · · · dy_{t−k}

Again we may also think of the autocovariance at lag k as the probability limit of an ensemble average:

    γ_{kt} = plim_{N→∞} (1/N) ∑_{i=1}^{N} (y_t^{(i)} − µ_t)(y_{t−k}^{(i)} − µ_{t−k})        (1.5)

Stationarity and Ergodicity

Definition
1.1.1 {y_t} is covariance (weakly) stationary if E(y_t) = µ is independent of t, and cov(y_t, y_{t−k}) = γ_k is independent of t for all k. γ_k is the autocovariance function at lag k.
      ρ_k = γ_k/γ_0 = corr(y_t, y_{t−k}) is the autocorrelation function at lag k.
1.1.2 {y_t} is strictly stationary if the joint distribution of (y_t, . . . , y_{t−k}) is the same as the joint distribution of (y_{t+τ}, . . . , y_{t+τ−k}) for all possible τ.

We have viewed expectations of a time series in terms of ensemble averages such as (1.2) and (1.5).
These definitions may seem a bit contrived, since usually all one has available is a single realization of size T from the process, which may be denoted by {y_1^{(1)}, y_2^{(1)}, . . . , y_T^{(1)}}.
From these observations, we would calculate the sample mean ȳ.
This, of course, is not an ensemble average but rather a time average:

    ȳ = (1/T) ∑_{t=1}^{T} y_t^{(1)}        (1.6)

Whether time averages such as (1.6) eventually converge to the ensemble average for a stationary process has to do with ergodicity.

Definition
1.1.3 A stationary time series is ergodic if γ_k → 0 as k → ∞.

The following two theorems are essential to the analysis of stationary time series. Their proofs are rather difficult, however.
Theorem
1.1.1 If y_t is strictly stationary and ergodic and x_t = f(y_t, y_{t−1}, . . .) is a random variable, then x_t is strictly stationary and ergodic.
1.1.2 (Ergodic Theorem). If y_t is strictly stationary and ergodic and E(|y_t|) < ∞, then as T → ∞, (1/T) ∑_{t=1}^{T} y_t →p E(y_t).

This allows us to consistently estimate parameters
using time-series moments:
The sample mean: µ̂ = (1/T) ∑_{t=1}^{T} y_t.
The sample autocovariance: γ̂_k = (1/T) ∑_{t=1}^{T} (y_t − µ̂)(y_{t−k} − µ̂).
The sample autocorrelation: ρ̂_k = γ̂_k / γ̂_0.
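As a small illustration (not part of the slides), these three estimators can be computed directly with numpy; starting the autocovariance sum at t = k + 1 is an implementation detail assumed here:

```python
import numpy as np

def sample_mean(y):
    """mu_hat = (1/T) * sum of y_t."""
    return y.mean()

def sample_autocovariance(y, k):
    """gamma_hat_k = (1/T) * sum_{t=k+1}^{T} (y_t - mu_hat)(y_{t-k} - mu_hat)."""
    T = len(y)
    mu = y.mean()
    return np.sum((y[k:] - mu) * (y[:T - k] - mu)) / T

def sample_autocorrelation(y, k):
    """rho_hat_k = gamma_hat_k / gamma_hat_0."""
    return sample_autocovariance(y, k) / sample_autocovariance(y, 0)

# usage on an arbitrary series
rng = np.random.default_rng(0)
y = rng.normal(size=500)
print(sample_mean(y), sample_autocorrelation(y, 1))
```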
Theorem
1.1.3 If y_t is strictly stationary and ergodic and E(y_t²) < ∞, then as T → ∞:
1). µ̂ →p µ.
2). γ̂_k →p γ_k.
3). ρ̂_k →p ρ_k.

Proof.
Proof of Theorem 1.1.3. Part (1) is a direct consequence of the Ergodic Theorem.
For part (2), note that

    γ̂_k = (1/T) ∑_{t=1}^{T} y_t y_{t−k} − (1/T) ∑_{t=1}^{T} y_t µ̂ − (1/T) ∑_{t=1}^{T} y_{t−k} µ̂ + µ̂²

By Theorem 1.1.1 above, the sequence y_t y_{t−k} is strictly stationary and ergodic, and it has a finite mean by the assumption that E(y_t²) < ∞.
Thus an application of the Ergodic Theorem yields:

    (1/T) ∑_{t=1}^{T} y_t y_{t−k} →p E(y_t y_{t−k})

Hence,

    γ̂_k →p E(y_t y_{t−k}) − µ² − µ² + µ² = E(y_t y_{t−k}) − µ² = γ_k
A covariance stationary process is said to be ergodic for the mean if ȳ converges in probability to E(y_t) as T → ∞.
Alternatively, if the autocovariances of a covariance-stationary process satisfy ∑_{j=1}^{∞} |γ_j| < ∞, then {y_t} is ergodic for the mean.
L^p can be defined as the set of random variables X with E(|X|^p) < ∞ (convergence in mean of order p).
When p = 2, the random variables X satisfying E(X²) < ∞ are said to be square integrable.
The set of square integrable real random variables X is a normed linear space with norm ‖X‖ = [E(X²)]^{1/2}.

Definition
1.1.4 Two random variables X and Y such that E(X²) < ∞ and E(Y²) < ∞ are said to be orthogonal with respect to L² if E(XY) = 0.

Definition
1.1.5 A sequence of variables x_n satisfying E(x_n²) < ∞ converges to a variable x with respect to L² (or converges in mean square) if ‖x_n − x‖ → 0 as n → ∞.

The function γ : k → γ_k, k an integer, is called the autocovariance function. This function is
i). Even: that is, for all k, γ_{−k} = γ_k;
ii). Nonnegative definite, since

    ∑_{j=1}^{n} ∑_{k=1}^{n} φ_j φ_k γ(t_j − t_k) = var(∑_{j=1}^{n} φ_j x_{t_j}) ≥ 0,

for all positive integers n, all real numbers φ_j and φ_k, and all integers t_j.

Theorem
1.1.4 If x_t is a stationary process, and if (φ_i, i an integer) forms a sequence of absolutely summable real numbers with ∑_{i=−∞}^{∞} |φ_i| < ∞, then the new variable y_t = ∑_{i=−∞}^{∞} φ_i x_{t−i} defines a new stationary process.

Proof.
The series formed by φ_i x_{t−i} is convergent in mean square since

    ∑_{i=−∞}^{∞} ‖φ_i x_{t−i}‖ = ∑_{i=−∞}^{∞} |φ_i| ‖x_{t−i}‖ = (γ_0 + µ²)^{1/2} ∑_{i=−∞}^{∞} |φ_i| < ∞

The expression ∑_{i=−∞}^{∞} φ_i x_{t−i} is an element of L², so y_t so defined is square integrable.
The moments of {y_t} can be written as

    E(y_t) = E(∑_{i=−∞}^{∞} φ_i x_{t−i}) = ∑_{i=−∞}^{∞} φ_i E(x_{t−i}) = µ_x ∑_{i=−∞}^{∞} φ_i = µ_y

    cov(y_t, y_{t+k}) = cov(∑_{i=−∞}^{∞} φ_i x_{t−i}, ∑_{j=−∞}^{∞} φ_j x_{t+k−j})
                      = ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} φ_i φ_j cov(x_{t−i}, x_{t+k−j})
                      = ∑_{i=−∞}^{∞} ∑_{j=−∞}^{∞} φ_i φ_j γ_x(k + i − j) = γ_y(k)

Both the mean and the covariances are independent of t.
White Noise Process:

The building block for all the processes considered in this unit is a sequence {e_t}_{t=−∞}^{∞} whose elements have mean zero and variance σ²:

    E(e_t) = 0,   E(e_t²) = σ²        (1.7)

and for which the e's are uncorrelated across time:

    E(e_t e_τ) = 0 for t ≠ τ        (1.8)

A process satisfying (1.7) and (1.8) is described as a white noise process.
One may on occasion wish to replace (1.8) with the slightly stronger condition that the e's are independent across time:

    e_t and e_τ are independent for t ≠ τ        (1.9)

A process satisfying (1.7) and (1.9) is called an independent white noise process.
Finally, if (1.7) and (1.9) hold along with

    e_t ~ N(0, σ²)        (1.10)

then we have the Gaussian white noise process.

1.2 Autoregressive and Moving Average Processes

In time series, the series {. . . , y_1, y_2, . . . , y_T, . . .} are jointly random.
We consider the conditional expectation E(y_t | F_{t−1}), where F_{t−1} = {y_{t−1}, y_{t−2}, . . .} is the history of the series or the information set at time t − 1.
An autoregressive (AR) model specifies that only a finite number of past lags matter:

    E(y_t | F_{t−1}) = E(y_t | y_{t−1}, y_{t−2}, . . . , y_{t−k})

A linear AR model (the most common type used in practice) specifies linearity:

    E(y_t | F_{t−1}) = ρ_0 + ρ_1 y_{t−1} + ρ_2 y_{t−2} + · · · + ρ_k y_{t−k}

Letting e_t = y_t − E(y_t | F_{t−1}), we have the AR model

    y_t = ρ_0 + ρ_1 y_{t−1} + ρ_2 y_{t−2} + · · · + ρ_k y_{t−k} + e_t
    E(e_t | F_{t−1}) = 0

The last property defines a special time-series process.

Definition
1.2.1 e_t is a martingale difference sequence (MDS) if E(e_t | F_{t−1}) = 0.
Regression errors are naturally MDS.
Some time-series processes may be a MDS as a
consequence of optimizing behavior.

Example
Some versions of the life-cycle hypothesis imply that
either changes in consumption or consumption
growth rates should be a MDS.
Most asset pricing models imply that asset returns
should be the sum of a constant plus a MDS.

The MDS property for the regression error plays the same role in a time-series regression as does the conditional mean-zero property for the regression error in a cross-section regression.
In fact, it is even more important in the time-series context, as it is difficult to derive distribution theories without this property.
A useful property of an MDS is that e_t is uncorrelated with any function of the lagged information F_{t−1}. Thus for k > 0, E(e_t | F_{t−k}) = 0.
1.2.1 Stationary AR(1) Process

A first-order autoregressive process, denoted AR(1), satisfies the following difference equation:

    y_t = α + φ y_{t−1} + e_t        (1.2.1)

where {e_t} is a white noise sequence.
If |φ| ≥ 1, the consequences of the e's for y accumulate rather than die out over time, and hence there does not exist a covariance stationary process for y_t with finite variance that satisfies (1.2.1).
In the case where |φ| < 1, there is a covariance stationary process for y_t satisfying (1.2.1).
By repeated backward substitution, (1.2.1) can be expressed as

    y_t = α/(1 − φ) + e_t + φ e_{t−1} + φ² e_{t−2} + · · ·        (1.2.2)

This can be viewed as an MA(∞) process.
When |φ| < 1, ∑_{j=0}^{∞} |φ|^j = 1/(1 − |φ|).
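A brief numpy sketch (my own illustration, with arbitrary parameter values) of the backward substitution in (1.2.2): the AR(1) recursion and the truncated MA(∞) sum with weights φ^j give essentially the same value when |φ| < 1.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, phi, T = 1.0, 0.8, 200
e = rng.normal(size=T + 500)          # extra draws so the recursion "forgets" y_0

# AR(1) recursion: y_t = alpha + phi*y_{t-1} + e_t
y = np.zeros(T + 500)
for t in range(1, T + 500):
    y[t] = alpha + phi * y[t - 1] + e[t]

# truncated MA(infinity) form (1.2.2): y_t ~ alpha/(1-phi) + sum_j phi^j e_{t-j}
J = 200
t = T + 499
ma_approx = alpha / (1 - phi) + sum(phi**j * e[t - j] for j in range(J))
print(y[t], ma_approx)                # nearly identical when |phi| < 1
```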
The remainder of this discussion of the first-order AR process assumes that |φ| < 1.
This ensures that the MA(∞) representation exists and that the AR(1) process is ergodic for the mean.

Theorem
1.2.1 If |φ| < 1 then y_t is strictly stationary and ergodic.

Taking expectations of both sides of (1.2.2) yields the mean of the AR(1) process:

    E(y_t) = µ = α/(1 − φ)        (1.2.3)

The variance of a stationary AR(1) process is

    γ_0 = E(y_t − µ)² = E(e_t + φ e_{t−1} + φ² e_{t−2} + · · ·)²
        = (1 + φ² + φ⁴ + φ⁶ + · · ·) σ² = σ²/(1 − φ²)        (1.2.4)

The covariance at lag k of a stationary AR(1) process is

    γ_k = E[(y_t − µ)(y_{t−k} − µ)]
        = E[(e_t + φ e_{t−1} + · · · + φ^k e_{t−k} + φ^{k+1} e_{t−k−1} + · · ·)(e_{t−k} + φ e_{t−k−1} + φ² e_{t−k−2} + · · ·)]
        = (φ^k + φ^{k+2} + φ^{k+4} + · · ·) σ² = φ^k σ²/(1 − φ²)        (1.2.5)

From (1.2.4) and (1.2.5) it follows that the autocorrelation function at lag k of a stationary AR(1) is

    ρ_k = γ_k/γ_0 = φ^k        (1.2.6)
The moments for a stationary AR(1) process were derived above by viewing it as an MA(∞) process.
The second way to arrive at the same results is to assume that the process is covariance stationary and calculate the moments directly from (1.2.1).
Taking expectations of both sides of (1.2.1),

    E(y_t) = α + φ E(y_{t−1}) + E(e_t)        (1.2.7)

Assuming that the process is covariance stationary,

    E(y_t) = E(y_{t−1}) = µ        (1.2.8)

Substituting (1.2.8) into (1.2.7),

    µ = α + φµ + 0  ⟹  µ = α/(1 − φ)        (1.2.9)

To find the second moment of y_t in an analogous manner, use (1.2.9) to rewrite (1.2.1) as

    y_t = µ(1 − φ) + φ y_{t−1} + e_t
    (y_t − µ) = φ(y_{t−1} − µ) + e_t        (1.2.10)

Now square both sides of (1.2.10) and take expectations:
    E(y_t − µ)² = φ² E(y_{t−1} − µ)² + 2φ E[(y_{t−1} − µ) e_t] + E(e_t²)        (1.2.11)

Recall from (1.2.2) that (y_{t−1} − µ) is a linear function of e_{t−1}, e_{t−2}, . . ., i.e.,

    (y_{t−1} − µ) = e_{t−1} + φ e_{t−2} + φ² e_{t−3} + · · ·

But e_t is uncorrelated with e_{t−1}, e_{t−2}, . . ., so e_t must be uncorrelated with (y_{t−1} − µ).
Hence, the second term on the right hand side of (1.2.11) is zero:

    E[(y_{t−1} − µ) e_t] = 0        (1.2.12)

Again, assuming covariance stationarity, we have

    E(y_t − µ)² = E(y_{t−1} − µ)² = γ_0        (1.2.13)

Substituting (1.2.12) and (1.2.13) into (1.2.11),

    γ_0 = φ² γ_0 + σ²  ⟹  γ_0 = σ²/(1 − φ²)

Similarly, we could multiply both sides of (1.2.10) by (y_{t−k} − µ) and take expectations:

    E[(y_t − µ)(y_{t−k} − µ)] = φ E[(y_{t−1} − µ)(y_{t−k} − µ)] + E[(y_{t−k} − µ) e_t]        (1.2.14)


But the term (y_{t−k} − µ) is a linear function of e_{t−k}, e_{t−k−1}, . . ., which, for k > 0, is uncorrelated with e_t.
Thus, for k > 0, the last term on the right hand side of (1.2.14) is zero.
Hence, for k > 0, (1.2.14) becomes:

    γ_k = φ γ_{k−1}  ⟹  γ_k = φ^k γ_0        (1.2.15)

The data for the simulated AR(1) processes with parameter φ equal to 0.5 and 0.9 are depicted in
the following figures, combined with their autocorrelation functions.
All series are standardized to have zero mean and unit variance.
If we compare the AR series with φ = 0.5 and φ = 0.9, it appears that the latter process is smoother, that is, it has a higher degree of persistence.
The autocorrelation functions show an exponential decay in both cases, although it takes large lags for the ACF of the φ = 0.9 series to become close to zero.
For example, after 15 periods, the effect of a shock is still 0.9^15 ≈ 0.21 of its original effect.
For the φ = 0.5 series, the effect at lag 15 is virtually zero.
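The comparison can be reproduced with a minimal simulation sketch such as the following (illustrative only; it does not generate the figures on the slides):

```python
import numpy as np

def simulate_ar1(phi, T, burn=200, seed=0):
    """Simulate y_t = phi*y_{t-1} + e_t with Gaussian white noise e_t."""
    rng = np.random.default_rng(seed)
    e = rng.normal(size=T + burn)
    y = np.zeros(T + burn)
    for t in range(1, T + burn):
        y[t] = phi * y[t - 1] + e[t]
    return y[burn:]                      # drop burn-in so the start-up value is forgotten

def acf(y, max_lag):
    """Sample autocorrelations rho_hat_k = gamma_hat_k / gamma_hat_0."""
    y = y - y.mean()
    g0 = np.mean(y * y)
    return np.array([np.mean(y[k:] * y[:len(y) - k]) / g0 for k in range(max_lag + 1)])

for phi in (0.5, 0.9):
    y = simulate_ar1(phi, T=2000)
    print(f"phi = {phi}: sample ACF at lag 15 = {acf(y, 15)[-1]:.3f}, "
          f"theoretical phi^15 = {phi**15:.3f}")
```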

[Figures: simulated AR(1) series with φ = 0.5 and φ = 0.9, together with their autocorrelation functions.]
1.2.2 The Second-Order Autoregressive Process

A second-order autoregression, denoted AR(2), satisfies:

    y_t = α + φ_1 y_{t−1} + φ_2 y_{t−2} + e_t        (1.2.16)

Or, in lag operator notation,

    (1 − φ_1 L − φ_2 L²) y_t = α + e_t        (1.2.17)

The difference equation (1.2.16) is stable provided that the roots of

    1 − φ_1 z − φ_2 z² = 0        (1.2.18)

lie outside the unit circle.
When this condition is satisfied, the AR(2) process turns out to be covariance stationary, and the inverse of the AR operator in (1.2.17) is given by

    ψ(L) = (1 − φ_1 L − φ_2 L²)^{−1} = ψ_0 + ψ_1 L + ψ_2 L² + · · ·        (1.2.19)

The values of the ψ_j's can be found from the fact that

    (1 − φ_1 L − φ_2 L²)(ψ_0 + ψ_1 L + ψ_2 L² + ψ_3 L³ + · · ·) = 1

From this it follows that

    ψ_0 = 1,  ψ_1 = φ_1,  ψ_j = φ_1 ψ_{j−1} + φ_2 ψ_{j−2} for j ≥ 2

Multiplying both sides of (1.2.17) by ψ(L) gives

    y_t = ψ(L) α + ψ(L) e_t        (1.2.20)

One can easily show that

    ψ(L) α = α/(1 − φ_1 − φ_2)        (1.2.21)

and

    ∑_{j=0}^{∞} |ψ_j| < ∞        (1.2.22)

Proof.
Taking expectation of both sides of (1.2.16) yields

E (yt ) = α + φ1 E (yt 1) + φ2 E (yt 2 )

Assuming that the process is covariance stationary,


we obtain

F. Guta (CoBE) Econ 654 September, 2017 49 / 296


Proof.
α
E (yt ) = µ =
1 φ1 φ2
Taking expectation of both sides of (1.2.20) yields

E (yt ) = ψ (L) α

The preceding two expression imply that

α
ψ (L) α = = ψ ( 1) α
1 φ1 φ2

This proves the result in (1.2.21) for a covariance


F. Guta (CoBE) Econ 654 September, 2017 50 / 296
Proof.
stationary second-order autoregressive process.

Proof of (1.2.22) proceeds as follows:


Factorizing a second-order polynomial in the lag
operator,

1 1
1 φ1 L φ2 L2 = [(1 λ1 L) (1
λ2 L)]
c1 c2
= +
(1 λ1 L) (1 λ2 L)
λ1 λ2
Where c1 = and c2 = .
λ1 λ2 λ2 λ1
F. Guta (CoBE) Econ 654 September, 2017 51 / 296
Proof.
Assuming that the eigenvalues are inside the unit
circle or the roots of the reverse characteristic
polynomial lie outside the unit circle, one would
express the second order polynomial in the lag
operator as
∞ ∞
= c1 ∑ (λ1 L) + c2 ∑ (λ2 L)j
2 1 j
1 φ1 L φ2 L
j =0 j =0

= ∑ c1 λj1 + c2 λj2 Lj
j =0

F. Guta (CoBE) Econ 654 September, 2017 52 / 296


Proof.
From this last expression it follows that
ψj = c1 λj1 + c2 λj2 .

Hence,

∞ ∞ ∞ ∞
∑ ψj = ∑ c1 λj1 + c2 λj2 jc1 j ∑ jλ1 jj + jc2 j ∑ jλ2 jj
j =0 j =0 j =0 j =0
jc1 j jc2 j
= + <∞
1 j λ1 j 1 j λ2 j

This proves the expression in (1.2.22).

F. Guta (CoBE) Econ 654 September, 2017 53 / 296


To find the second moments, write (1.2.16) as

    y_t = µ(1 − φ_1 − φ_2) + φ_1 y_{t−1} + φ_2 y_{t−2} + e_t

or

    (y_t − µ) = φ_1 (y_{t−1} − µ) + φ_2 (y_{t−2} − µ) + e_t        (1.2.23)

Multiplying both sides of (1.2.23) by (y_{t−k} − µ) and taking expectations produces

    γ_k = φ_1 γ_{k−1} + φ_2 γ_{k−2} for k = 1, 2, . . .        (1.2.24)

Hence, the autocovariances follow the same second-order difference equation as does the process for y_t.
An AR(2) process is covariance stationary if the roots of the second-order polynomial in the lag operator lie outside the unit circle.
If both roots are real and lie outside the unit circle, the autocovariance function γ_k is the sum of two decaying exponential functions of k.
When both roots are complex with modulus outside the unit circle, the autocovariance function γ_k is a damped sinusoidal function.
The autocorrelations are found by dividing both sides of (1.2.24) by γ_0:

    ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2} for k = 1, 2, . . .        (1.2.25)

In particular, setting k = 1 yields

    ρ_1 = φ_1 + φ_2 ρ_1,  or  ρ_1 = φ_1/(1 − φ_2)        (1.2.26)

For k = 2,

    ρ_2 = φ_1 ρ_1 + φ_2        (1.2.27)

The variance of a covariance stationary second-order AR can be found by multiplying both sides of (1.2.23) by (y_t − µ) and taking expectations:

    E(y_t − µ)² = φ_1 E[(y_{t−1} − µ)(y_t − µ)] + φ_2 E[(y_{t−2} − µ)(y_t − µ)] + E[e_t (y_t − µ)]

or

    γ_0 = φ_1 γ_1 + φ_2 γ_2 + σ²        (1.2.28)

Equation (1.2.28) can be written as:

    γ_0 = φ_1 ρ_1 γ_0 + φ_2 ρ_2 γ_0 + σ²        (1.2.29)

Substituting (1.2.26) and (1.2.27) into (1.2.29) gives

    γ_0 = (1 − φ_2) σ² / {(1 + φ_2)[(1 − φ_2)² − φ_1²]}
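A quick numeric check (the values φ_1 = 0.5, φ_2 = 0.3, σ² = 1 are my own illustrative choices) of formulas (1.2.26)-(1.2.29) against a long simulated AR(2):

```python
import numpy as np

phi1, phi2, sigma2 = 0.5, 0.3, 1.0

# theoretical moments from (1.2.26)-(1.2.29)
rho1 = phi1 / (1 - phi2)
rho2 = phi1 * rho1 + phi2
gamma0 = (1 - phi2) * sigma2 / ((1 + phi2) * ((1 - phi2) ** 2 - phi1 ** 2))

# long simulation of y_t = phi1*y_{t-1} + phi2*y_{t-2} + e_t
rng = np.random.default_rng(0)
T = 200_000
e = rng.normal(scale=np.sqrt(sigma2), size=T)
y = np.zeros(T)
for t in range(2, T):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + e[t]

yc = y[1000:] - y[1000:].mean()          # drop burn-in
print("gamma0:", gamma0, "vs", yc.var())
print("rho1:  ", rho1, "vs", np.mean(yc[1:] * yc[:-1]) / yc.var())
print("rho2:  ", rho2, "vs", np.mean(yc[2:] * yc[:-2]) / yc.var())
```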


1.2.3 The pth-order Autoregressive Process

A pth-order autoregression, denoted AR(p), satisfies

    y_t = α + φ_1 y_{t−1} + φ_2 y_{t−2} + · · · + φ_p y_{t−p} + e_t        (1.2.30)

Provided that the roots of

    1 − φ_1 z − φ_2 z² − · · · − φ_p z^p = 0        (1.2.31)

all lie outside the unit circle, one can easily verify that a covariance stationary representation of the form

    y_t = µ + ψ(L) e_t        (1.2.32)

exists, where

    ψ(L) = (1 − φ_1 L − φ_2 L² − · · · − φ_p L^p)^{−1} and ∑_{j=0}^{∞} |ψ_j| < ∞

Theorem
1.3.1 The AR(p) process is strictly stationary and ergodic if and only if |λ_k| < 1 (the eigenvalues of the companion matrix) for all k.
Assuming that the stationarity condition is satisfied, one alternative way to find the mean is to take expectations of (1.2.30):

    µ = α/(1 − φ_1 − φ_2 − · · · − φ_p)        (1.2.33)

Using (1.2.33), one can easily rewrite equation (1.2.30) as

    y_t − µ = φ_1 (y_{t−1} − µ) + φ_2 (y_{t−2} − µ) + · · · + φ_p (y_{t−p} − µ) + e_t        (1.2.34)

One can easily find the autocovariances by multiplying both sides of (1.2.34) by (y_{t−k} − µ) and taking expectations:

    γ_k = φ_1 γ_{k−1} + φ_2 γ_{k−2} + · · · + φ_p γ_{k−p}   for k = 1, 2, . . .
    γ_0 = φ_1 γ_1 + φ_2 γ_2 + · · · + φ_p γ_p + σ²          for k = 0        (1.2.35)

Using the fact that γ_{−k} = γ_k, the system of equations in (1.2.35) can be solved for γ_0, γ_1, . . . , γ_{p−1} as functions of σ², φ_1, φ_2, . . . , φ_p.
It can be shown that the (p × 1) vector (γ_0, γ_1, . . . , γ_{p−1})′ is given by the first p elements of the first column of the (p² × p²) matrix σ² [I_{p²} − (F ⊗ F)]^{−1}, where F is the (p × p) companion matrix defined as

    F = [ φ_1  φ_2  · · ·  φ_{p−1}  φ_p ]
        [  1    0   · · ·     0      0  ]
        [  0    1   · · ·     0      0  ]
        [  ⋮    ⋮    ⋱        ⋮      ⋮  ]
        [  0    0   · · ·     1      0  ]

and ⊗ indicates the Kronecker product.

Dividing (1.2.35) by γ_0 produces the Yule-Walker equations:

    ρ_k = φ_1 ρ_{k−1} + φ_2 ρ_{k−2} + · · · + φ_p ρ_{k−p} for k = 1, 2, . . .        (1.2.36)

Hence, the autocovariances and the autocorrelations follow the same pth-order difference equation as does the process (1.2.30) itself.
For distinct roots, their solutions take the form

    γ_k = g_1 λ_1^k + g_2 λ_2^k + · · · + g_p λ_p^k        (1.2.37)

where the eigenvalues (λ_1, λ_2, . . . , λ_p) are the solutions to

    λ^p − φ_1 λ^{p−1} − φ_2 λ^{p−2} − · · · − φ_p = 0
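As an illustration (a sketch following the companion-matrix claim above, with function names of my own), the autocovariances of a stationary AR(p) can be computed numerically from σ²[I_{p²} − (F ⊗ F)]^{−1} and then extended with the recursion in (1.2.35):

```python
import numpy as np

def ar_autocovariances(phi, sigma2=1.0, n_lags=10):
    """Autocovariances of a stationary AR(p): vec(Gamma) = sigma2*(I - F kron F)^{-1} e_1,
    where Gamma is the p x p variance matrix of (y_t, ..., y_{t-p+1}) and F is the
    companion matrix; higher lags follow the recursion gamma_k = sum_j phi_j gamma_{k-j}."""
    phi = np.asarray(phi, dtype=float)
    p = len(phi)
    F = np.zeros((p, p))
    F[0, :] = phi
    F[1:, :-1] = np.eye(p - 1)
    vecSigma = np.linalg.solve(np.eye(p * p) - np.kron(F, F),
                               sigma2 * np.eye(p * p)[:, 0])
    Gamma = vecSigma.reshape(p, p)
    gam = list(Gamma[0, :])              # gamma_0, ..., gamma_{p-1}
    for k in range(p, n_lags + 1):       # recursion (1.2.35) for k >= p
        gam.append(phi @ np.array(gam[-1:-p - 1:-1]))
    return np.array(gam[:n_lags + 1])

gam = ar_autocovariances([0.5, 0.3], sigma2=1.0, n_lags=5)
print(gam[0], gam / gam[0])              # gamma_0 and the autocorrelations rho_k
```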

1.2.4 Stationary MA(1) process

Let {e_t} be a white noise process and consider the process

    y_t = µ + e_t + θ e_{t−1}        (1.2.38)

where µ and θ could be any constants.
This time series process is called a first-order moving average process, denoted MA(1).
The term "moving average" comes from the fact that y_t is constructed from a weighted sum of the two most recent values of e.
The expectation of y_t is given by

    E(y_t) = µ + E(e_t) + θ E(e_{t−1}) = µ        (1.2.39)

We used the symbol µ for the constant term in (1.2.38) in anticipation of the result that this constant term turns out to be the mean of the process.
The variance of y_t is:

    E(y_t − µ)² = E(e_t + θ e_{t−1})²
                = E(e_t²) + 2θ E(e_t e_{t−1}) + θ² E(e_{t−1}²)
                = (1 + θ²) σ²        (1.2.40)


The autocovariance at lag 1 is

    E[(y_t − µ)(y_{t−1} − µ)] = E[(e_t + θ e_{t−1})(e_{t−1} + θ e_{t−2})] = θ σ²        (1.2.41)

Autocovariances at lags larger than one are all zero:

    E[(y_t − µ)(y_{t−k} − µ)] = E[(e_t + θ e_{t−1})(e_{t−k} + θ e_{t−k−1})] = 0 for k > 1        (1.2.42)

Since the mean and autocovariances are not functions of time, an MA(1) process is covariance stationary regardless of the magnitude of θ.
Furthermore, an MA(1) process satisfies the condition that

    ∑_{k=0}^{∞} |γ_k| = (1 + θ²) σ² + |θ| σ² < ∞

Hence, if {e_t} is a Gaussian white noise process, then the MA(1) process in (1.2.38) is ergodic for all moments.
The autocorrelation at lag k of a covariance-stationary process (denoted ρ_k) is defined as its autocovariance at lag k divided by the variance:

    ρ_k = γ_k/γ_0        (1.2.43)

Notice also that the autocorrelation at lag 0 is equal to unity for any covariance-stationary process by definition.
From (1.2.40) and (1.2.41), the first-order autocorrelation for an MA(1) process is given by:

    ρ_1 = θ σ²/[(1 + θ²) σ²] = θ/(1 + θ²)        (1.2.44)

Higher order autocorrelations are all zero.
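A short simulation check of (1.2.44) (illustrative only, with an arbitrary θ of my own choosing):

```python
import numpy as np

theta, T = 0.6, 100_000
rng = np.random.default_rng(0)
e = rng.normal(size=T + 1)
y = e[1:] + theta * e[:-1]               # y_t = e_t + theta*e_{t-1} (mu = 0)

yc = y - y.mean()
rho = [np.mean(yc[k:] * yc[:len(yc) - k]) / yc.var() for k in range(4)]
print("sample ACF:", np.round(rho, 3))
print("theory    :", [1.0, round(theta / (1 + theta**2), 3), 0.0, 0.0])
```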

1.2.5 Stationary MA(q) process

A qth -order moving average process, denoted by

MA(q ), is characterized by

yt = µ + et + θ 1 et 1 + θ 2 et 2 + + θ q et q (1.2.45)

where fet g is a white noise process and


(θ 1 , θ 2 , . . . , θ q ) could be any real numbers.

The mean of the process in (1.2.45) is given by:


F. Guta (CoBE) Econ 654 September, 2017 73 / 296
E ( yt ) = µ + E ( e t ) + θ 1 E ( e t 1 ) + θ2 E ( et 2) + + θ q E ( et q)

=µ (1.2.46)

The variance of an MA(q ) process is:

γ0 = E ( yt µ )2 = E ( et + θ 1 et 1 + θ 2 et 2 + + θ q et q)
2

= 1 + θ 21 + θ 22 + + θ 2q σ2 (1.2.47)

For k = 1, 2, . . . , q,

γk = E f(et + θ 1 et 1 + θ 2 et 2 + + θ q et q) (1.2.48)

et k + θ 1 et k 1 + θ 2 et k 2 + + θ q et k q

F. Guta (CoBE) Econ 654 September, 2017 74 / 296


= E θ k e2t k + θ k +1 θ 1 e2t k 1 + θ k +2 θ 2 e2t k 2 + + θq θq 2
k et q

Terms involving e’s at di¤erent dates have been


dropped because their product has expectation zero,
and θ 0 is de…ned to be unity.
For k > q, there are no e’s with common dates in
the de…nition of γk , and so the expectation is zero.
Hence,
8
>
> 2
< ( θ k + θ k +1 θ 1 + θ k +2 θ 2 + + θq θq k)σ for k = 1, 2, . . . , q
γk = (1.2.49)
>
>
: 0 for k > q

For any values of (θ 1 , θ 2 , . . . , θ q ), the MA(q )


F. Guta (CoBE) Econ 654 September, 2017 75 / 296
process is thus covariance-stationary as

∑k =0 jγk j < ∞ and furthermore, for Gaussian et
the MA(q ) process is covariance-stationary and also
ergodic for all moments.
The autocorrelation function is zero after q lags.

1.2.6 The Infinite-Order Moving Average Process

The MA(q) process can be written as

    y_t = µ + ∑_{j=0}^{q} ψ_j e_{t−j} with ψ_0 = 1

Consider the process that results as q → ∞:

    y_t = µ + ∑_{j=0}^{∞} ψ_j e_{t−j} = µ + ψ_0 e_t + ψ_1 e_{t−1} + ψ_2 e_{t−2} + · · · with ψ_0 = 1        (1.2.50)

This is described as an MA(∞) process.
The infinite sequence in (1.2.50) generates a well defined covariance-stationary process provided that

    ∑_{j=0}^{∞} ψ_j² < ∞        (1.2.51)

It is often convenient to work with a slightly stronger condition than (1.2.51), given by:

    ∑_{j=0}^{∞} |ψ_j| < ∞        (1.2.52)

A sequence of numbers {ψ_j}_{j=0}^{∞} satisfying (1.2.51) is said to be square summable, whereas a sequence satisfying (1.2.52) is said to be absolutely summable.
Absolute summability implies square summability, but the converse does not hold.
Proof.
Proof of the result in (1.2.51):
We need to show that square-summability of a
moving average coe¢ cients implies that the
F. Guta (CoBE) Econ 654 September, 2017 78 / 296
Proof.
MA(∞) representation in (1.2.50) generates a
mean square convergent random variable.

For a stochastic process such as (1.2.50), the


question is whether ∑Tj=0 ψj et j converges in mean
square to some random variable yt as T " ∞.
To prove this we use the Cauchy Criterion for a
stochastic process which states that ∑j∞=0 ψj et j

converges if and only if, for any ε > 0, there exists a


suitably large integer N such that for any integer
F. Guta (CoBE) Econ 654 September, 2017 79 / 296
Proof.
M>N
h i2
∑ ∑
M N
E j =0
ψj et j j =0
ψj et j <ε (1.2.53)

The expression on the left hand side of (1.2.53) can


written

h i2
∑ j =N +1 ψ j e t
M
E j = ψ2M + ψ2M 1 + + ψ2N +1 σ2
h i
∑j =0 ψ2j ∑j =0 ψ2j
M N
= σ2 (1.2.54)

F. Guta (CoBE) Econ 654 September, 2017 80 / 296


Proof.
But if ∑j∞=0 ψ2j converges as required by (1.2.51),
then by the Cauchy Criterion the right hand side of
(1.2.54) may be made as small as desired by choice
of a suitably large N.
Hence, the in…nite series in (1.2.50) converges in
mean square provided that (1.2.51) is satis…ed.
Proof of the result that absolute summability
implies square summability:
Absolute summability of the moving average
F. Guta (CoBE) Econ 654 September, 2017 81 / 296
Proof.
coe¢ cients implies square summability.
n o∞
Suppose that ψj is absolutely summable.
j =0
Then there exists an N < ∞ such that ψj < 1 for
all j N. Then
∞ N 1 ∞ N 1 ∞
∑ ψ2j = ∑ ψ2j + ∑ ψ2j < ∑ ψ2j + ∑ ψj
j =0 j =0 j =N j =0 j =N

But ∑Nj =01 ψ2j is …nite, since N is …nite, and


n o

∑j =N ψj is …nite since ψj is absolutely
F. Guta (CoBE) Econ 654 September, 2017 82 / 296
Proof.
summable.

Hence, ∑j∞=0 ψ2j < ∞ , establishing the result that


(1.2.52) implies (1.2.51).

The mean and variance of an MA(∞) process with


absolutely summable coe¢ cients can be calculated
from a simple extrapolation of the results for an
MA(q ) process:
F. Guta (CoBE) Econ 654 September, 2017 83 / 296
E (yt ) = lim E (µ + ψ0 et + ψ1 et 1 + ψ2 et 2 + + ψT et T) (1.2.55)
T !∞

γ0 = E (yt µ )2

2
= lim E (ψ0 et + ψ1 et 1 + ψ2 et 2 + + ψT et T)
T !∞

= lim ψ20 + ψ21 + ψ22 + + ψ2T σ2 (1.2.56)


T !∞

γk = E (yt µ) (yt k µ) (1.2.57)

= ψ k ψ 0 + ψ k +1 ψ 1 + ψ k +2 ψ 2 + ψ k +3 ψ 3 + σ2

F. Guta (CoBE) Econ 654 September, 2017 84 / 296


Moreover, an MA(∞) process with absolutely
summable coe¢ cients has absolutely summable
autocovariance:

∑ j γk j < ∞ (1.2.58)
k =0

Hence an MA(∞) process satisfying (1.2.52) is


ergodic for the mean. Proof of this result is as
follows:
Proof.
Write (1.2.57) as γk = σ2 ∑j∞=0 ψj +k ψj . Then
F. Guta (CoBE) Econ 654 September, 2017 85 / 296
Proof.
jγk j = σ2 ∑j∞=0 ψj +k ψj σ2 ∑j∞=0 ψj +k ψj

Hence,

∞ ∞ ∞ ∞ ∞
∑ j γk j σ2 ∑∑ ψ j +k ψ j = σ 2 ∑ ψj ∑ ψ j +k
k =0 k =0 j =0 j =0 k =0

But there exists an M < ∞ such that


∞ ∞
∑j =0 ψj < M, and therefore ∑k =0 ψj +k < M for
j = 0, 1, 2, . . . . Therefore,
∞ ∞
∑ k =0 j γ k j < σ 2 ∑ j =0 ψ j M < σ2 M 2 < ∞.
F. Guta (CoBE) Econ 654 September, 2017 86 / 296
Proof.
Hence (1.2.51) holds and the process is ergodic for
the mean.

If the e’s are Gaussian, then the process is ergodic


for all moments.

1.2.7 Stationary ARMA(p,q) Process

An ARMA(p, q) process includes both autoregressive and moving average terms, defined as:

    y_t = α + φ_1 y_{t−1} + · · · + φ_p y_{t−p} + e_t + θ_1 e_{t−1} + · · · + θ_q e_{t−q}        (1.2.59)

Or, in lag operator form,

    (1 − φ_1 L − φ_2 L² − · · · − φ_p L^p) y_t = α + (1 + θ_1 L + θ_2 L² + · · · + θ_q L^q) e_t        (1.2.60)

Provided that the roots of

    1 − φ_1 z − φ_2 z² − · · · − φ_p z^p = 0        (1.2.61)

all lie outside the unit circle, both sides of (1.2.60) can be divided by (1 − φ_1 L − φ_2 L² − · · · − φ_p L^p) to obtain y_t = µ + ψ(L) e_t, where

    ψ(L) = (1 + θ_1 L + θ_2 L² + · · · + θ_q L^q)/(1 − φ_1 L − φ_2 L² − · · · − φ_p L^p),
    ∑_{j=0}^{∞} |ψ_j| < ∞ and µ = α/(1 − φ_1 − φ_2 − · · · − φ_p)

Hence, stationarity of an ARMA(p, q) process depends entirely on the autoregressive parameters φ_1, φ_2, . . . , φ_p and not on the moving average parameters (θ_1, θ_2, . . . , θ_q).

It is often convenient to write the ARMA process

(1.2.59) in terms of deviations from the mean:

yt µ = φ1 (yt 1 µ) + + φp (yt p µ)

+ et + θ 1 et 1 + + θ q et q (1.2.62)

Autocovariances are found by multiplying both sides


of (1.2.62) by (yt k µ) and taking expectations.
F. Guta (CoBE) Econ 654 September, 2017 90 / 296
For k > q, the resulting equations take the form

γk = φ1 γk 1 + φ2 γk 2 + + φp γk p (1.2.63)

for k = q + 1, q + 2, . . .

Hence, after q lags the autocovariance function γk


(and the autocorrelation function ρk ) follow the
pth -order di¤erence equation governed by the
autoregressive parameters.
Note that (1.2.63) does not hold for k q, owing
F. Guta (CoBE) Econ 654 September, 2017 91 / 296
to the correlation between θ k et k and yt k.

Therefore, an ARMA(p, q ) process will have more


complicated autocovariances for lags 1 through q
than would the corresponding AR (p ) process.
For k > q with distinct autoregressive roots, the
autocovariances will be given by:

γk = h1 λk1 + h2 λk2 + + hp λkp (1.2.64)

This takes the same form as the autocovariances for


an AR (p ) process, though because the initial
F. Guta (CoBE) Econ 654 September, 2017 92 / 296
conditions γ0 , γ1 , . . . , γq di¤er for the ARMA and
AR processes, the parameters hk in (1.2.64) will not
be the same as the parameters gk in (1.2.37).
There is a potential for overparameterization with
ARMA processes.
Consider, for instance, a simple white noise process

yt = et (1.2.65)

Suppose we multiply both sides of (1.2.65) by


(1 ρL):
F. Guta (CoBE) Econ 654 September, 2017 93 / 296
(1 ρL) yt = (1 ρL) et (1.2.66)

If (1.2.65) is a valid parameterization, then so is


(1.2.66) for any value of ρ. Hence, (1.2.66) might
be described as an ARMA(1, 1) process with
φ1 = ρ and θ 1 = ρ.

It is important to avoid such a parameterization.

Since any value of ρ in (1.2.66) describes the data


equally well, we will obviously get into trouble trying

F. Guta (CoBE) Econ 654 September, 2017 94 / 296


to estimate the parameter ρ in (1.2.66) by
maximum likelihood.
If we are using an ARMA(1, 1) model in which θ 1 is
close to φ1 , then the data might better be
modeled as simple white noise process.
A related overparameterization can arise with an
ARMA(p, q ) model.
Consider factoring the lag polynomial in (1.2.60) as
p q
∏ (1 λk L) (yt µ) = ∏ (1 η k L)et (1.2.67)
k =1 k =1
F. Guta (CoBE) Econ 654 September, 2017 95 / 296
Assume jλi j < 1 for all i, so that the process is
covariance-stationary.

If the autoregressive operator

1 φ1 L φ2 L2 φp Lp and the moving average

operator 1 + θ 1 L + θ 2 L2 + + θ q Lq have any roots in

common, say λi = η j for some i and j, then both

sides of (1.2.67) can be divided by (1 λi L):


p q
∏ (1 λk L) (yt µ) = ∏ (1 η k L) et (1.2.68)
k =1, k 6=i k =1, k 6=j
F. Guta (CoBE) Econ 654 September, 2017 96 / 296
Or,

p 1 q 1
1 φ1 L φp 1L ( yt µ) = 1 + θ 1 L + + θq 1L et

where
p
1 φ1 L φp 1L
p 1
= ∏ (1 λk L)
k =1, k 6=i
q
1 + θ1 L + + θq 1L
q 1
= ∏ (1 η k L)
k =1, k 6=j

The stationary ARMA(p, q ) process satisfying


(1.2.60) is identical to the stationary
F. Guta (CoBE) Econ 654 September, 2017 97 / 296
ARMA(p 1, q 1) process satisfying (1.2.68).
Wold's Decomposition Theorem
All of the covariance-stationary processes considered in this section can be written in the form

    y_t = µ + ∑_{j=0}^{∞} ψ_j e_{t−j}        (1.2.69)

where e_t is the white noise error one would make in forecasting y_t as a linear function of lagged y, and where ∑_{j=0}^{∞} ψ_j² < ∞ with ψ_0 = 1.
Proposition (Wold's Decomposition). Any zero-mean covariance-stationary process y_t can be represented in the form

    y_t = ∑_{j=0}^{∞} ψ_j e_{t−j} + κ_t        (1.2.70)

where ψ_0 = 1 and ∑_{j=0}^{∞} ψ_j² < ∞.
The term e_t is white noise and represents the error made in forecasting y_t on the basis of a linear function of lagged y:

    e_t = y_t − E(y_t | y_{t−1}, y_{t−2}, . . .)        (1.2.71)
F. Guta (CoBE) Econ 654 September, 2017 99 / 296
The value of κ t is uncorrelated with et j for any j,
though κ t can be predicted arbitrarily well from a
linear function of past values of y:

κ t = E ( κ t j yt 1 , yt 2 , . . . )

The term κ t is called the linearly deterministic


component of yt , while ∑j∞=0 ψj et j is called the
linearly indeterministic component.
If κ t = 0, then the process is called purely linearly
indeterministic.
F. Guta (CoBE) Econ 654 September, 2017 100 / 296
This proposition was …rst proved by Wold (1938).
The proposition relies on stable second moments of
y but makes no use of higher moments.
It thus describes only optimal linear forecasts of y.
Finding the Wold representation requires …tting an
in…nite number of parameters fψ1 , ψ2 , . . .g to the
data.
With a …nite number of observations on
fy1 , y2 , . . . , yT g this will never be possible.
F. Guta (CoBE) Econ 654 September, 2017 101 / 296
As a practical matter, we therefore, need to make
some additional assumptions about the nature of
f ψ1 , ψ2 , . . . g.
A typical assumption is that ψ (L) can be expressed
as the ratio of two …nite-order polynomials:

θ (L) 1 + θ 1 L + θ 2 L2 + + θ q Lq
∑ ψj Lj = φ (L)
=
1 φ1 L φ2 L2 φp Lp
j =0

F. Guta (CoBE) Econ 654 September, 2017 102 / 296


1.3 Use of Lag operator

An algebraic construct which is useful for the


analysis of autoregressive models is the lag operator.

De…nition
1.3.1 The lag operator L satis…es Lyt = yt 1.

De…ning L2 = L.L, we see that


L2 yt = Lyt 1 = yt 2 . In general, Lk yt = yt k.

The AR (1) model can be written in the format


F. Guta (CoBE) Econ 654 September, 2017 103 / 296
yt = α + φyt 1 + et

Or (1 φL) yt = α + et
The operator Φ (L) = 1 φL is a polynomial in the
lag operator L.
We say that the root of the polynomial is 1/φ,
since Φ (z ) = 0 when z = 1/φ.
We call Φ (L) the autoregressive polynomial of yt .
From Theorem 1.2.1, an AR (1) is stationary i¤
jφj < 1.
F. Guta (CoBE) Econ 654 September, 2017 104 / 296
Note that an equivalent way to say this is that an
AR (1) is stationary i¤ the root of the autoregressive
polynomial lies outside the unit circle or larger than
one (in absolute value).
We can write a general AR (p ) model as

Φ (L) yt = et

where Φ (L) is a polynomial of order p in the lag


operator L, usually referred to as a lag polynomial,
given by
F. Guta (CoBE) Econ 654 September, 2017 105 / 296
Φ (L) = 1 φ1 L φ2 L2 φp Lp
We can interpret the lag polynomial as a …lter that,
if applied to a time series, produces a new series.
So the …lter Φ (L) applied to an AR (p ) process
produces a white noise process et .
It is relatively easy to manipulate lag polynomials.
Example
Transforming a series by two such polynomials one after the

other is the same as transforming the series once by a

polynomial that is the product of the two original ones.


F. Guta (CoBE) Econ 654 September, 2017 106 / 296
This way we can de…ne the inverse of a …lter, which
is naturally given by the inverse of the polynomial.
Thus the inverse of Φ (L), denoted by Φ 1
(L), is
de…ned so as to satisfy Φ 1
(L) Φ (L) = 1.
If Φ (L) is a …nite-order polynomial in L, its inverse
will be one of in…nite order.

For the AR (1) case we …nd



(1 φ L) 1
= ∑ φj Lj provided that jφj < 1 (1.3.1)
j =0

F. Guta (CoBE) Econ 654 September, 2017 107 / 296


This is similar to the result that the in…nite sum
∞ 1
∑j =0 φj equals to (1 φ) if jφj < 1, while it
does not converge for jφj 1.
In general, the inverse of a polynomial Φ (L) exists
if it satis…es certain conditions on its parameters, in
which case we call Φ (L) invertible.
For example the AR (1) model can be written as
1 1
(1 φL) (1 φL) yt = (1 φL) et . Or
∞ ∞
yt = ∑ φ L et = ∑ φj et
j j
j (1.3.2)
j =0 j =0
F. Guta (CoBE) Econ 654 September, 2017 108 / 296
Under appropriate conditions, the converse is also
possible and we can write a moving average model
in autoregressive form.
Using the lag operator, we can write the MA(1)
process as
yt = (1 + θL) et

and the general MA(q ) process as

yt = θ (L) et

where θ (L) = 1 + θ 1 L + θ 2 L2 + + θ q Lq .
F. Guta (CoBE) Econ 654 September, 2017 109 / 296
1
Now, if θ (L) exists, we can write

1
θ (L) yt = et

which in general, will be an AR model with in…nite


order.

For the MA(1) case, we use,


(1 + θ L) 1
= ∑( θ )j Lj , provided that jθ j < 1 (1.3.3)
j =0

Consequently, an MA(1) model can be written as


F. Guta (CoBE) Econ 654 September, 2017 110 / 296

yt = θ ∑j =0 ( θ ) yt
j
j 1 + et (1.3.4)

A necessary condition for the in…nite AR


representation (AR (∞)) to exist is that the MA
polynomial is invertible, which, in the MA(1) case,
requires that jθ j < 1.

Particularly for making predictions conditional upon


an observed past, the AR representations are very
convenient.

F. Guta (CoBE) Econ 654 September, 2017 111 / 296


The MA representations are often convenient to
determine variances and covariances.
For a more parsimonious representation, we may
want to work with an ARMA model that contains
both an autoregressive and moving average part.
The general ARMA model can be written as

Φ (L) yt = θ (L) et

which (if the AR lag polynomial is invertible) can be


F. Guta (CoBE) Econ 654 September, 2017 112 / 296
written in MA(∞) representation as

yt = Φ 1
(L) θ (L) et

or (if the MA lag polynomial is invertible) can be


written in AR (∞) form as

θ 1
(L) Φ (L) yt = et

Both Φ 1
(L) θ (L) and θ 1
(L) Φ (L) are lag
polynomials of in…nite lag length, with restrictions
on the coe¢ cients.
F. Guta (CoBE) Econ 654 September, 2017 113 / 296
1.4 Invertibility of Lag Polynomials

As we have seen before, the …rst order lag


polynomial 1 φL is invertible if jφj < 1.
This condition can be generalized to higher-order
lag polynomials.
Consider the case of a second-order lag polynomial,
given by 1 φ1 L φ2 L2 .
Generally we can …nd values λ1 and λ2 such that
the polynomial can be written as
F. Guta (CoBE) Econ 654 September, 2017 114 / 296
1 φ1 L φ2 L2 = (1 λ1 L) (1 λ2 L) (1.4.1)

It can easily be veri…ed that λ1 and λ2 can be


solved from λ1 + λ2 = φ1 and λ1 λ2 = φ2 .
The conditions for invertibility of the second-order
polynomial are just the conditions that both the
…rst-order polynomials (1 λ1 L) and (1 λ2 L) are
invertible.
Thus, the requirement for invertibility is that both
jλ1 j < 1 and jλ2 j < 1.
F. Guta (CoBE) Econ 654 September, 2017 115 / 296
These requirements can be formulated in terms of
the so-called reverse characteristic equation.

(1 λ1 z ) (1 λ2 z ) = 0 (1.4.2)

This equation has two solutions, z1 and z2 , referred


to as the reverse characteristic roots.
The requirement jλi j < 1 corresponds to jzi j > 1.
If any solution satis…es jzi j 1, the corresponding
polynomial is noninvertible.
F. Guta (CoBE) Econ 654 September, 2017 116 / 296
A solution that equals unity is referred to as a unit
root.
The presence of a unit root in the lag polynomial
Φ (L) can be detected relatively easily, without
solving the characteristic equation, by nothing that
the polynomial Φ (z ) evaluated at z = 1 is zero if
p
∑j =1 φj = 1.
Thus, the presence of a …rst unit root can be
veri…ed by checking whether the polynomial
coe¢ cients sum to one.
F. Guta (CoBE) Econ 654 September, 2017 117 / 296
If the sum exceeds one, the polynomial is not
invertible.
As an example, consider the following AR(2) model

    y_t = 1.2 y_{t−1} − 0.32 y_{t−2} + e_t        (1.4.3)

The AR(2) model in equation (1.4.3) can be written as

    (1 − 0.8L)(1 − 0.4L) y_t = e_t        (1.4.4)

with reverse characteristic equation

    1 − 1.2z + 0.32z² = (1 − 0.8z)(1 − 0.4z) = 0        (1.4.5)

The solutions (reverse characteristic roots) are z_1 = 1/0.8 = 1.25 and z_2 = 1/0.4 = 2.5, which are both larger than one.
Consequently, the AR polynomial in (1.4.3) is invertible.
Note that the following AR(1) model

    y_t = 1.2 y_{t−1} + e_t        (1.4.6)

describes a noninvertible AR process.
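Numerically, the reverse characteristic roots can be checked with numpy.roots; the helper below is my own illustrative sketch:

```python
import numpy as np

def ar_roots_outside_unit_circle(phi):
    """Roots of 1 - phi_1 z - ... - phi_p z^p; the AR polynomial is invertible
    (the process stationary) iff all roots lie outside the unit circle."""
    coefs = np.r_[-np.asarray(phi)[::-1], 1.0]   # highest power first: -phi_p, ..., -phi_1, 1
    roots = np.roots(coefs)
    return roots, np.all(np.abs(roots) > 1)

print(ar_roots_outside_unit_circle([1.2, -0.32]))   # (1.4.3): roots 1.25 and 2.5 -> invertible
print(ar_roots_outside_unit_circle([1.2]))          # (1.4.6): root 1/1.2 < 1 -> noninvertible
```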

Invertibility of a lag polynomial is very important for


several reasons.

For moving average models, or more generally,


models with moving average component,
invertibility of the MA is important for estimation
and prediction.

F. Guta (CoBE) Econ 654 September, 2017 120 / 296


For models with an autoregressive part, the AR
polynomial is invertible if and only if the process is
stationary.

1.5 Problem of Common Roots of Lag Polynomials

Decomposing the moving average and


autoregressive polynomials into products of linear
functions in L also shows the problem of common
roots or cancelling roots.
This means that the AR and MA parts of the model
F. Guta (CoBE) Econ 654 September, 2017 121 / 296
have roots that are identical and the corresponding
linear functions in L cancel out.

To illustrate this, consider the model described by

    (1 − φ_1 L − φ_2 L²) y_t = (1 + θL) e_t        (1.5.1)

We can write this as

    (1 − λ_1 L)(1 − λ_2 L) y_t = (1 + θL) e_t        (1.5.2)

Now, if θ = −λ_1, we can divide both sides by (1 + θL) = (1 − λ_1 L) to obtain

    (1 − λ_2 L) y_t = e_t        (1.5.3)

which describes exactly the same process as (1.5.2).
Thus, in the case of one cancelling root, an ARMA(p, q) model can be written equivalently as an ARMA(p − 1, q − 1) model.
As an example, consider the following ARMA(2, 1) model

    y_t = y_{t−1} − 0.25 y_{t−2} + e_t − 0.5 e_{t−1}        (1.5.4)

which can be rewritten as

    (1 − 0.5L)² y_t = (1 − 0.5L) e_t        (1.5.5)

This reduces to an AR(1) model given by

    (1 − 0.5L) y_t = e_t        (1.5.6)

which describes the same process as (1.5.4).
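To see the cancelling root of (1.5.4) numerically, a small sketch (illustration only) computing the roots of its AR and MA polynomials:

```python
import numpy as np

# ARMA(2,1) from (1.5.4): (1 - L + 0.25 L^2) y_t = (1 - 0.5 L) e_t
ar_poly = [0.25, -1.0, 1.0]      # coefficients of 0.25 z^2 - z + 1 (highest power first)
ma_poly = [-0.5, 1.0]            # coefficients of -0.5 z + 1

print("AR roots:", np.roots(ar_poly))   # both equal 2.0, i.e. 1/0.5 (a double root)
print("MA roots:", np.roots(ma_poly))   # 2.0 as well: a common (cancelling) root
```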

The problem of common roots illustrates why it may


be problematic, in practice, to estimate an ARMA
model with an AR and an MA part of a high order.

F. Guta (CoBE) Econ 654 September, 2017 124 / 296


The reason is that identi…cation and estimation are
hard if roots of the MA and AR polynomials are
almost identical.
In this case, a simpli…ed ARMA(p 1, q 1)
model will yield an almost equivalent representation.

1.6 Autocovariance Generating Function

For each covariance-stationary process yt we


calculated the sequence of autocovariances

γj j= ∞
.
F. Guta (CoBE) Econ 654 September, 2017 125 / 296
If this sequence is absolutely summable, then one
way of summarizing the autocovariances is through
a scalar-valued function called the
autocovariance-generating function:

gy (z ) = ∑ γj z j (1.6.1)
j= ∞

The argument of this function is taken to be a


complex scalar.
Of particular interest as an argument for the
autocovariance-generating function is any value of z
F. Guta (CoBE) Econ 654 September, 2017 126 / 296
that lies on the complex unit circle,

z = e i ω = cos ω i sin ω
p
where i = 1 and ω is the radian angle that z
makes with the real axis.
If the autocovariance-generating function is

evaluated at z = e and divided by 2π, the
resulting function of ω,
1 1 ∞
2π j =∑∞ j
i ωj
Sy (ω ) = gy (z ) = γe

F. Guta (CoBE) Econ 654 September, 2017 127 / 296
is called the population spectrum of y.

For a process with absolutely summable


autocovariances, the function Sy (ω ) exits and can
be used to calculate all of the autocovariances.
This means that if two di¤erent processes share the
same autocovariance-generating function, then the
two processes exhibit the identical sequence of
autocovariances.
As an example of calculating an autocovariance-
F. Guta (CoBE) Econ 654 September, 2017 128 / 296
generating function, consider the MA(1) process.
From equations (1.2.40) to (1.2.42), its
autocovariance-generating function is
h i
gy (z ) = θσ2 z 1
+ 1 + θ 2 σ2 z 0 + θσ2 z 1 = σ2 θ z 1
+ 1 + θ2 + θz

This expression could alternatively be written as

gy (z ) = σ2 (1 + θz ) 1 + θz 1
(1.6.2)

The autocovariance generating function for MA(q )


might be calculated as
F. Guta (CoBE) Econ 654 September, 2017 129 / 296
g y (z ) = σ 2 1 + θ 1 z + θ 2 z 2 + + θq z q 1 + θ1 z 1 + θ2 z 2 + + θq z q (1.6.3 )

This conjecture can be veri…ed by carrying out the


multiplication in (1.6.3) and collecting terms by
power of z:

1 + θ1 z + θ2 z 2 + + θq z q 1 + θ1 z 1
+ θ2 z 2
+ + θq z q

= θ q z q + (θ q 1 + θq θ1 ) z q 1
+ (θ q 2 + θq 1 θ1 + θq θ2 ) z q 2
+

+ (θ 1 + θ 2 θ 1 + θ 3 θ 2 + + θq θq 1) z
1
+ 1 + θ 21 + θ 22 + + θ 2q z 0

1 q
+ (θ 1 + θ 2 θ 1 + θ 3 θ 2 + + θq θq 1) z + + θq z (1.6.4)

F. Guta (CoBE) Econ 654 September, 2017 130 / 296


Comparison of (1.6.4) with (1.2.47) or (1.2.49)
con…rms that the coe¢ cient on z k in (1.6.3) is
indeed the k th autocovariance.
This method for …nding gy (z ) extends to the
MA(∞) case. If

yt = µ + θ (L) et (1.6.5)

with

θ (L) = 1 + θ 1 L + θ 2 L2 + (1.6.6)
F. Guta (CoBE) Econ 654 September, 2017 131 / 296
and

∑j =0 jθ j j < ∞, (1.6.7)

then
gy (z ) = σ2 θ (z ) θ z 1
(1.6.8)

For example, the stationary AR (1) process can be


written as
1
yt µ = (1 φL) et

which is in the form of (1.6.5) with


1
θ (L) = (1 φL) .
F. Guta (CoBE) Econ 654 September, 2017 132 / 296
The autocovariance-generating function for AR (1)
process could therefore be calculated from
σ2
gy (z ) = (1.6.9)
(1 φz 1 ) φz ) (1
To verify this claim, expand out the terms in
(1.6.9):
σ2
= σ 2 1 + φz + φ2 z 2 + 1 + φz 1
+ φ2 z 2
+
(1 φz ) (1 φz 1 )

From which the coe¢ cient on z k is


2 k k +1 k +2 2 σ 2 φk
σ φ +φ φ+φ φ + =
1 φ2
F. Guta (CoBE) Econ 654 September, 2017 133 / 296
This indeed yields the k th autocovariance as earlier
calculated in equation (1.2.5).
The autocovariance-generating function for a
stationary ARMA(p, q ) process can be written as:
σ2 1 + θ 1 z + θ 2 z 2 + + θq z q 1 + θ1 z 1 + θ2 z 2 + + θq z q
g y (z ) = (1.6.10)
1 φ1 z φ2 z2 φp zp 1 φ1 z 1 φ2 z 2 φp z p

Filters

Sometimes the data are …ltered, or treated in a


particular way before they are analyzed, and we
would like to summarize the e¤ects of this
F. Guta (CoBE) Econ 654 September, 2017 134 / 296
treatment on the autocovariances.
This calculation is particularly simple using the
autocovariance-generating function.
For example, suppose that the original data yt were
generated from an MA(1) process,

yt = (1 + θL) et (1.6.11)

with autocovariance-generating function given by


(1.6.2).
Let’s say that the data as actually analyzed, xt ,
F. Guta (CoBE) Econ 654 September, 2017 135 / 296
represent the change in yt over its value the
previous period:

xt = yt yt 1 = (1 L) yt (1.6.12)

Substituting (1.6.11) into (1.6.12), the observed


data can be characterized as the following MA(2)
process,

xt = (1 L) (1 + θ L) et = 1 + ( θ 1) L θ L2 et

= 1 + θ 1 L + θ 2 L2 et (1.6.13)
F. Guta (CoBE) Econ 654 September, 2017 136 / 296
with θ 1 = (θ 1) and θ 2 = θ.

The autocovariance-generating function of the

observed xt series can be calculated by direct

application of (1.6.3):

gx (z ) = σ2 1 + θ 1 z + θ 2 z 2 1 + θ1z 1
+ θ2z 2
(1.6.14)

It is often instructive, however, to keep the


polynomial (1 + θ 1 z + θ 2 z 2 ) in its factored form of
the …rst line of (1.6.13),
F. Guta (CoBE) Econ 654 September, 2017 137 / 296
(1 + θ 1 z + θ 2 z 2 ) = (1 z ) (1 + θz ) ,
in which case (1.6.14) could be written as

gx (z ) = σ2 (1 z ) (1 + θz ) 1 z 1
1 + θz 1

1
= (1 z) 1 z gy (z ) (1.6.15)

Of course, (1.6.14) and (1.6.15) represent the


identical function of z, and which way we choose to
write it is simply a matter of convenience.
Applying the …lter (1 L) to yt thus results in
multiplying its autocovariance-generating function
F. Guta (CoBE) Econ 654 September, 2017 138 / 296
1
by (1 z ) (1 z ).
This principle readily generalizes.
Suppose that the original data series fyt g satis…es
(1.6.5) through (1.6.7). Let’s say the data are
…ltered according to

xt = h (L) yt (1.6.16)

with h (L) = ∑j∞= ∞ hj Lj and ∑j∞= ∞ jhj j < ∞.


Substituting (1.6.5) into (1.6.16) , the observed
data xt are then generated by
F. Guta (CoBE) Econ 654 September, 2017 139 / 296
xt = h (1) µ + h (L) θ (L) et = µ + θ (L) et .

where µ = h (1) µ and θ (L) = h (L) θ (L).

The sequence of coe¢ cients associated with the



compound operator θ j j= ∞
turns out to be
absolutely summable and the
autocovariance-generating function of xt can
accordingly be calculated as

gx (z ) = σ 2 θ (z ) θ z 1
= σ 2 h (z ) θ (z ) h z 1
θ z 1

1
= h (z ) h z gy (z ) (1.6.17)
F. Guta (CoBE) Econ 654 September, 2017 140 / 296
Applying the …lter h(L) to a series thus results in
multiplying its autocovariance-generating function
1
by h (z ) h (z ).

1.7 Trend Stationarity

    y_t = µ_0 + µ_1 t + S_t        (1.7.1)
    S_t = ρ_1 S_{t−1} + ρ_2 S_{t−2} + · · · + ρ_k S_{t−k} + e_t        (1.7.2)

Or,

    y_t = µ_0 + µ_1 t + ρ_1 S_{t−1} + ρ_2 S_{t−2} + · · · + ρ_k S_{t−k} + e_t        (1.7.3)

There are two essentially equivalent ways to estimate the autoregressive parameters (ρ_1, ρ_2, . . . , ρ_k).
You can estimate (1.7.3) by OLS.
You can estimate (1.7.1)-(1.7.2) sequentially by OLS.
That is, first estimate (1.7.1), get the residual Ŝ_t, and then perform regression (1.7.2) replacing S_t with Ŝ_t.
This procedure is sometimes called Detrending.
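A minimal sketch of this two-step detrending procedure in numpy, applied to simulated data with illustrative parameter values of my own: regress y_t on (1, t), then regress the residual on its lag.

```python
import numpy as np

# simulate a trend-stationary series (1.7.1)-(1.7.2) with an AR(1) deviation S_t
rng = np.random.default_rng(0)
T, mu0, mu1, rho = 300, 2.0, 0.05, 0.7
e = rng.normal(size=T)
S = np.zeros(T)
for t in range(1, T):
    S[t] = rho * S[t - 1] + e[t]
t_idx = np.arange(T)
y = mu0 + mu1 * t_idx + S

# Step 1: OLS of y on (1, t) to estimate the trend, giving residuals S_hat
X = np.column_stack([np.ones(T), t_idx])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
S_hat = y - X @ beta

# Step 2: OLS of S_hat on its own lag to estimate rho
rho_hat = np.linalg.lstsq(S_hat[:-1, None], S_hat[1:], rcond=None)[0][0]
print(beta, rho_hat)
```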

Seasonal Effects

There are three popular methods to deal with seasonal data.

1). Include dummy variables for each season. This presumes that "seasonality" does not change over the sample.
2). Use "seasonally adjusted" data. The seasonal factor is typically estimated by a two-sided weighted average of the data for that season in neighboring years.
    Thus the seasonally adjusted data is a "filtered" series.
    This is a flexible approach which can extract a wide range of seasonal factors.
    The seasonal adjustment, however, also alters the time-series correlations of the data.
3). First apply a seasonal differencing operator.
    If s is the number of seasons (typically s = 4 or s = 12),

        ∆_s y_t = y_t − y_{t−s},

    the season-to-season change.

The series ∆_s y_t is clearly free of seasonality.
But the long-run trend is also eliminated, and perhaps this was of relevance.

1.8 Testing for Omitted Serial Correlation

For simplicity, let the null hypothesis be an AR(1):

    y_t = α + ρ y_{t−1} + u_t        (1.8.1)

We are interested in the question of whether the error u_t is serially correlated.
We model this as an AR(1):

    u_t = θ u_{t−1} + e_t        (1.8.2)

with e_t an MDS.
The hypothesis of no omitted serial correlation is

    H_0: θ = 0 vs H_1: θ ≠ 0

We want to test H_0 against H_1.
To combine (1.8.1) and (1.8.2), we take (1.8.1) and lag the equation once:

    y_{t−1} = α + ρ y_{t−2} + u_{t−1}

We then multiply this by θ and subtract from (1.8.1), to find

    y_t − θ y_{t−1} = α − αθ + ρ y_{t−1} − θρ y_{t−2} + u_t − θ u_{t−1},

or

    y_t = α(1 − θ) + (ρ + θ) y_{t−1} − θρ y_{t−2} + e_t = AR(2)

Thus under H_0, y_t is an AR(1), and under H_1 it is an AR(2).
H_0 may be expressed as the restriction that the coefficient on y_{t−2} is zero.
An appropriate test of H_0 against H_1 is therefore a Wald test that the coefficient on y_{t−2} is zero (a simple exclusion test).
In general, if the null hypothesis is that y_t is an AR(k), and the alternative is that the error is an AR(m), this is the same as saying that under the alternative y_t is an AR(k + m), and this is equivalent to the restriction that the coefficients on y_{t−k−1}, y_{t−k−2}, . . . , y_{t−k−m} are jointly zero.
An appropriate test is the Wald test of this restriction.
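A hedged sketch of this exclusion test on simulated data (the parameter values are my own): estimate the AR(2) by OLS and inspect the t statistic on y_{t−2}, whose square is the Wald statistic for this single restriction.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients with conventional (homoskedastic) standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

# illustrative data: an AR(1) under the null
rng = np.random.default_rng(0)
T, rho = 500, 0.6
y = np.zeros(T)
for t in range(1, T):
    y[t] = 1.0 + rho * y[t - 1] + rng.normal()

# fit the alternative AR(2) and test the exclusion of y_{t-2}
Y = y[2:]
X = np.column_stack([np.ones(T - 2), y[1:-1], y[:-2]])   # const, y_{t-1}, y_{t-2}
beta, se = ols(X, Y)
t_stat = beta[2] / se[2]        # t statistic for H0: coefficient on y_{t-2} is zero
print(beta, t_stat)
```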

1.9 Stationarity and Unit Roots

Stationarity of a stochastic process requires that the


variances and autocovariances are …nite and
independent of time.
F. Guta (CoBE) Econ 654 September, 2017 149 / 296
It is easily veri…ed that …nite-order MA processes
are stationary by construction as they correspond to
a weighted sum of white noise processes.
However, this result breaks down if we allow the MA
coe¢ cients to vary over time, as in

yt = et + g (t ) et 1 (1.9.1)

where g (t ) is some deterministic function of t.


Now, we have
2
E yt2 = σ2 + g (t ) σ2
F. Guta (CoBE) Econ 654 September, 2017 150 / 296
which is not independent of t. Consequently, the
process in (1.9.1) is non-stationary.
Stationarity of AR or ARMA processes is less trivial.
Consider, for example, the following AR(1) process

    y_t = φ y_{t−1} + e_t,        (1.9.2)

with φ = 1.
Taking variances on both sides gives var(y_t) = var(y_{t−1}) + σ², which has no solution for the variance of the process consistent with stationarity, unless σ² = 0, in which case an infinity of solutions exists.
The process in (1.9.2) is a first-order autoregressive process with a unit root (φ = 1), usually referred to as a random walk without drift.
The unconditional variance of y_t does not exist, i.e., it is infinite, and the process is non-stationary.
In fact, for any value of φ with |φ| ≥ 1, (1.9.2) describes a non-stationary process.
F. Guta (CoBE) Econ 654 September, 2017 152 / 296
We can formalize the above result as follows.
The AR(1) process is stationary if and only if the polynomial 1 − φL is invertible, i.e., if the root of the reverse characteristic equation 1 − φz = 0 lies outside the unit circle.
This result is straightforwardly generalized to
arbitrary ARMA models.
The ARMA(p, q ) model

ψ (L) yt = θ (L) et (1.9.3)


F. Guta (CoBE) Econ 654 September, 2017 153 / 296
corresponds to a stationary process if and only if the
solutions z1 , . . . , zp to ψ (z ) = 0 lie outside the unit
circle, that is, when the AR polynomial is invertible.
For example, the ARMA(2, 1) process given by

yt = 1.2yt−1 − 0.2yt−2 + et − 0.5et−1 (1.9.4)

is non-stationary because z = 1 is a solution to


1 1.2z + 0.2z 2 = 0.
A special case that is of particular interest arises
when one root is exactly equal to one, while the
F. Guta (CoBE) Econ 654 September, 2017 154 / 296
other roots are larger than one.
If this arises, we can write the process for yt as

ψ*(L) (1 − L) yt = ψ*(L) ∆yt = θ(L) et (1.9.5)
where ψ*(L) is an invertible polynomial in L of order p − 1, and ∆ = 1 − L is the first-difference operator.
Because the roots of the AR polynomial are the solutions to ψ*(z)(1 − z) = 0, there is one solution z = 1, or in other words a single unit root.
F. Guta (CoBE) Econ 654 September, 2017 155 / 296
Equation (1.9.5) thus shows that ∆yt can be
described as a stationary ARMA model if the
process for yt has one unit root.
Consequently, we can eliminate the non-stationarity
by transforming the series into …rst-di¤erences.
Writing the process in (1.9.4) as

(1 0.2L) (1 L) yt = (1 0.5L) et

shows that ∆yt is described as a stationary


ARMA(1, 1) process given by
F. Guta (CoBE) Econ 654 September, 2017 156 / 296
∆yt = 0.2∆yt 1 + et 0.5et 1

A series that becomes stationary after


…rst-di¤erencing is said to be integrated of order
one, denoted by I (1).
If ∆yt is described by stationary ARMA(p, q ) model,
we say that yt is described by an autoregressive
integrated moving average (ARIMA) model of order
p, 1, q, or in short an ARIMA(p, 1, q ) model.
First di¤erencing quite often transforms a
F. Guta (CoBE) Econ 654 September, 2017 157 / 296
non-stationary series into a stationary series.

In particular this may be the case for aggregate


economic series or their natural logarithms.
In some cases, taking …rst-di¤erences is insu¢ cient
to produce stationary series and another di¤erencing
step is required.
In this case the stationary series is given by:

∆ (∆yt ) = ∆2 yt = ∆yt ∆yt 1

If the series must be di¤erenced twice before it


F. Guta (CoBE) Econ 654 September, 2017 158 / 296
becomes stationary, then it is said to be integrated
of order two, denoted by I (2), and it must have two
unit roots.
Thus, a series yt is I (2) if ∆yt is non-stationary but
∆2 yt is stationary.
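A short numpy sketch with artificial series built only for the example: one round of differencing turns an I(1) series into a stationary one, while an I(2) series needs two rounds.

import numpy as np

e = np.random.normal(size=500)
y1 = np.cumsum(e)              # artificial I(1) series: a random walk
y2 = np.cumsum(np.cumsum(e))   # artificial I(2) series

d1 = np.diff(y2)               # Delta y_t: still non-stationary (itself a random walk)
d2 = np.diff(y2, n=2)          # Delta^2 y_t: stationary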
In general, the main di¤erence between I (0) and
I (1) processes can be summarized as follows.
An I (0) series ‡uctuates around its mean with a
…nite variance that does not depend on time, while
an I (1) series wanders widely.
F. Guta (CoBE) Econ 654 September, 2017 159 / 296
Typically, an I (0) series is mean reverting, as there
is a tendency in the long run to return to its mean.
Furthermore, an I (0) series has a limited memory of
its past behavior (implying that the e¤ects of a
particular random innovation are only transitory).
However, an I (1) series has in…nitely long memory
(implying that an innovation will permanently a¤ect
the process).
These aspects becomes clear from the
F. Guta (CoBE) Econ 654 September, 2017 160 / 296
autocorrelation functions: for an I (0) series the
autocorrelation function declines rapidly as the lag
length increases, while for the I (1) process the
estimated autocorrelation coe¢ cients decay to zero
very slowly.

The last property makes the presence of a unit root


an interesting question from an economic point of
view.
In models with unit roots, shocks (which may be
F. Guta (CoBE) Econ 654 September, 2017 161 / 296
due to policy interventions) have persistent e¤ects
that last forever, while, in the case of stationary
models, shocks can only have a temporary e¤ect.

The long-run e¤ect of a shock is not necessarily of


the same magnitude as the short-run e¤ect.
The fact that the autocorrelations of a stationary
series die out rapidly may help in determining the
number of times di¤erencing is needed to achieve
stationarity.
F. Guta (CoBE) Econ 654 September, 2017 162 / 296
In addition, a number of formal unit root tests have been proposed in the literature.

1.9.1 Testing for Unit Roots in a AR(1) Model

Consider the AR (1) process

yt = µ + ρyt 1 + et (1.9.6)

where ρ = 1 corresponds to a unit root.

It seems obvious to use the estimate ρ̂ for ρ from an OLS procedure (which is consistent, irrespective of the true value of ρ) and the corresponding standard
error to test the null hypothesis of a unit root.

However, as shown in the seminal paper of Dickey


and Fuller (1979), under the null hypothesis that
ρ = 1 the standard t-ratio does not have a
t-distribution, not even asymptotically.
The reason for this is the non-stationarity of the
process invalidating standard results on the
distribution of the OLS estimator b
ρ.
F. Guta (CoBE) Econ 654 September, 2017 164 / 296
To test for the presence of a unit root in AR (1) the
null and the alternative hypothesis are given by

H0 : ρ = 1
H1 : ρ < 1

It is possible to use the standard t-statistic to test the above hypothesis, where the t-statistic is given by
τ̂µ = (ρ̂ − 1) / s.e.(ρ̂) (1.9.7)
where s.e.(ρ̂) denotes the usual OLS standard
error. Critical Values, however, have to be taken
from the appropriate distribution, which under the
null hypothesis of non-stationarity is nonstandard.

In particular, the distribution is skewed to the left


(with a long left-hand tail) so that critical values are
smaller than those for (the normal approximation
of) the t-distribution.
Using a 5% signi…cance level in a one-tailed test of
H0 : ρ = 1 (a unit root) against H1 : ρ < 1
F. Guta (CoBE) Econ 654 September, 2017 166 / 296
(stationary), the correct critical value in large samples is −2.86 rather than −1.65 for the normal approximation.
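A hedged numpy sketch of τ̂µ in (1.9.7), computed from the equivalent regression of ∆yt on a constant and yt−1 (packaged implementations such as statsmodels' adfuller exist; the hand-rolled version only shows what is being calculated). The resulting t-ratio is compared with −2.86, not −1.65.

import numpy as np

def df_tau_mu(y):
    # OLS of Delta y_t on [1, y_{t-1}]; return the t-ratio on y_{t-1}
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    b = np.linalg.lstsq(X, dy, rcond=None)[0]
    e = dy - X @ b
    s2 = e @ e / (len(dy) - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return b[1] / se            # reject a unit root at 5% if this is below -2.86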

Usually, a slightly more convenient regression


procedure is used, in which case, the model is
written as:

∆yt = µ + (ρ 1) yt 1 + et (1.9.8)

from which the τ̂µ-statistic for H0 : ρ − 1 = 0 is identical to the τ̂µ above.
F. Guta (CoBE) Econ 654 September, 2017 167 / 296
The reason for this is that the least squares method
is invariant to linear transformations of the model.
Under the null hypothesis of a unit root the above
model turns out to be

∆yt = µ + et (1.9.9)

which is known as a random walk with drift, where


µ is the drift parameter.
In the model for the level variable yt , µ corresponds
to a linear time trend as (1.9.9) implies that
F. Guta (CoBE) Econ 654 September, 2017 168 / 296
E (∆yt ) = µ.

Hence for a given initial value y0 , E (yt ) = y0 + µt.


This shows that the interpretation of the intercept
term in (1.9.9) depends on the presence of a unit
root.
In the stationary case, µ re‡ects the nonzero mean
of the series, while in the unit root case it re‡ects a
deterministic trend in yt .
Because in the latter case …rst-di¤erencing produces
F. Guta (CoBE) Econ 654 September, 2017 169 / 296
a stationary time series, the process for yt is referred
to as di¤erence stationary.
It is also possible that non-stationarity is caused by
the presence of a deterministic time trend in the
process, rather than by the presence of a unit root.
This happens when the AR (1) model is extended to

yt = µ + γt + ρyt 1 + et (1.9.10)

with jρj < 1 and γ 6= 0.


In this case we have a non-stationary process
F. Guta (CoBE) Econ 654 September, 2017 170 / 296
because of the linear trend γt.

This non-stationarity can be removed by regressing


yt upon a constant and t, and then considering the
residuals of this regression, or by simply including t
as an additional variable in the model.
The process for yt in this case is referred to as being
trend stationary.
Non-stationary process may thus be characterized by
the presence of a deterministic trend or a stochastic
F. Guta (CoBE) Econ 654 September, 2017 171 / 296
trend implied by the presence of a unit root or both.
It is possible to test whether yt follows a random
walk against the alternative that it follows the trend
stationary process as in (1.9.10).
This can be tested by running the regression

∆yt = µ + γt + (ρ 1) yt 1 + et (1.9.11)

The null hypothesis one would like to test is that


the process is a random walk given by
H0 : µ = γ = ρ 1 = 0.
F. Guta (CoBE) Econ 654 September, 2017 172 / 296
Instead of testing this joint hypothesis, it is quite
common to use the t-ratio on ρ̂ − 1, denoted by τ̂τ ,
assuming that the other restrictions in the null
hypothesis are satis…ed.
Although the null hypothesis is still the same as in
the previous unit root test, the testing regression is
di¤erent and thus we have di¤erent distribution of
the test statistic.
It should be noted that if the unit root hypothesis

F. Guta (CoBE) Econ 654 September, 2017 173 / 296


ρ 1 = 0 is rejected, we cannot conclude that the
process for yt is likely to be stationary.

Under the alternative hypothesis γ may be nonzero


so that the process for yt is not stationary (but only
trend stationary).
If a graphical inspection of the series indicates a
clear positive or negative trend, it is most
appropriate to perform the Dickey-Fuller test with a
trend.
F. Guta (CoBE) Econ 654 September, 2017 174 / 296
This implies that the alternative hypothesis allows
the process to exhibit a linear deterministic trend.
If we are unable to reject the presence of a unit
root, it does not necessarily mean that it is true.
It could just be that there is insu¢ cient information
in the data to reject it.
Kwiatkowski, Phillips, Schmidt and Shin (1992)
propose an alternative test where stationarity is the
null hypothesis and the existence of a unit root is
the alternative.
F. Guta (CoBE) Econ 654 September, 2017 175 / 296
This test is usually referred to as the KPSS test.
The basic idea is that a time series is decomposed
into the sum of a deterministic time trend, a
random walk and a stationary error term (typically
not a white noise).
The null hypothesis (of trend stationarity) speci…es
that the variance of the random walk component is
zero.
The test is actually a Lagrange multiplier test, and
F. Guta (CoBE) Econ 654 September, 2017 176 / 296
computation of the test statistic is fairly simple.

First run an auxiliary regression of yt upon an


intercept and a time trend t.
Next, save the OLS residuals et and compute the partial sums St = ∑_{s=1}^{t} es for all t.
Then the test statistic is given by
KPSS = T⁻² ∑_{t=1}^{T} St² / σ̂² (1.9.12)
where σ̂² is an estimator of the error variance.
F. Guta (CoBE) Econ 654 September, 2017 177 / 296
This latter estimator σ̂² may involve corrections for
autocorrelations based on the Newey-West formula.
The asymptotic distribution is nonstandard, and
Kwiatkowski, Phillips, Schmidt and Shin (1992)
report a 5% critical value of 0.146.
If the null hypothesis is stationary rather than trend
stationary, the trend term should be dropped from
the auxiliary regression.
The test statistic is computed in the same fashion,
F. Guta (CoBE) Econ 654 September, 2017 178 / 296
but the 5% critical value is 0.463.
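A compact numpy sketch of the KPSS computation just described; for simplicity σ̂² is taken here as the plain residual variance, without the Newey-West correction mentioned above.

import numpy as np

def kpss_stat(y, trend=True):
    # auxiliary regression on an intercept (and a time trend), partial sums of the
    # residuals, and the statistic (1.9.12)
    T = len(y)
    X = np.column_stack([np.ones(T), np.arange(T)]) if trend else np.ones((T, 1))
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    S = np.cumsum(e)
    sigma2 = e @ e / T
    return np.sum(S ** 2) / (T ** 2 * sigma2)
# compare with 0.146 (trend-stationarity null) or 0.463 (level-stationarity null) at 5%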

1.9.2 Testing for Unit Roots in Higher-order AR Models

A test for a unit root in higher-order AR processes


can easily be obtained by extending the
Dickey-Fuller test procedures.
The general strategy is that lagged differences, such as ∆yt−1, ∆yt−2, . . . , are included in the regression, such that its error term corresponds to white noise. This leads to the so-called augmented
Dickey-Fuller tests (ADF tests), for which the same
asymptotic critical values hold as those of the
Dickey-Fuller tests.
Consider the AR (k ) model

ρ(L) yt = µ + et (1.9.13)
where ρ(L) = 1 − ρ1 L − ρ2 L² − · · · − ρk L^k

As we discussed before, yt has a unit root when


ρ (1) = 0, or ρ1 + ρ2 + + ρk = 1.
F. Guta (CoBE) Econ 654 September, 2017 180 / 296
In this case, yt is non-stationary.
The ergodic theorem and MDS CLT do not apply,
and test statistics are asymptotically non-normal.
A helpful way to write the above equation is using
the so-called Dickey-Fuller (DF) reparametrization:

∆yt = µ + α0 yt−1 + α1 ∆yt−1 + · · · + αk−1 ∆yt−(k−1) + et (1.9.14)

These models are equivalent linear transformations


of one another.
The DF parameterization is convenient because the
F. Guta (CoBE) Econ 654 September, 2017 181 / 296
parameter α0 summarizes the information about the unit root, since ρ(1) = −α0 .

To see this, observe that the lag polynomial for yt computed from (1.9.14) is
(1 − L) − α0 L − α1 (L − L²) − · · · − αk−1 (L^{k−1} − L^k)
But this must equal ρ(L), as the models are equivalent.
Thus
ρ(1) = (1 − 1) − α0 − α1 (1 − 1) − · · · − αk−1 (1 − 1) = −α0

Hence, the hypothesis of a unit root in yt can be


stated as
H0 : α0 = 0

Note that the model is stationary if α0 < 0.


So the natural alternative is

H1 : α0 < 0

Under H0 , the model for yt is
∆yt = µ + α1 ∆yt−1 + · · · + αk−1 ∆yt−(k−1) + et
which is an AR(k − 1) in the first-difference ∆yt .

Thus if yt has a (single) unit root, then ∆yt is a


stationary AR process.
Because of this property, we say that if yt is
non-stationary but ∆d yt is stationary, then yt is
“integrated of order d”, or I (d ).
Thus a time series with unit root is I (1).
Since α0 is the parameter of a linear regression, the
F. Guta (CoBE) Econ 654 September, 2017 184 / 296
natural test statistic is the t-statistic for H0 from
OLS estimation of (1.9.14).

Indeed, this is the most popular unit root test, and


is called the Augmented Dickey-Fuller (ADF) test
for a unit root.
It would seem natural to assess the signi…cance of
the ADF statistic using the normal table.
However, under H0 , yt is non-stationary, so
conventional normal asymptotics are invalid.
F. Guta (CoBE) Econ 654 September, 2017 185 / 296
An alternative asymptotic framework has been
developed to deal with non-stationary data.
We do not have the time to develop this theory in
detail, but simply assert the main results.
Theorem 1.9.1 (Dickey-Fuller Theorem): Assume α0 = 0. As T → ∞,
T α̂0 →d (1 − α1 − α2 − · · · − αk−1 ) ρµ
ADF = α̂0 / se(α̂0 ) →d τµ
The limiting distributions ρµ and τ µ are non-normal.
They are skewed to the left, and have negative
means.
The first result states that α̂0 converges to its true value (of zero) at rate T , rather than the conventional rate of T^{1/2} .
This is called a “super-consistent” rate of
convergence.

F. Guta (CoBE) Econ 654 September, 2017 187 / 296


The second result states that the t-statistic for b
α0
converges to a limiting distribution which is
non-normal, but does not depend on the parameters.
This distribution has been extensively tabulated,
and may be used for testing the hypothesis H0 .

Note: The standard error se (b


α0 ) is the conventional
(“homoscedastic”) standard error.

But the theorem does not require an assumption of


homoscedasticity.
F. Guta (CoBE) Econ 654 September, 2017 188 / 296
Thus the Dickey-Fuller test is robust to
heteroscedasticity.
Since the alternative hypothesis is one-sided, the
ADF test rejects H0 in favour of H1 when ADF < c,
where c is the critical value from the ADF table.
If the test rejects H0 , this means that the evidence
points to yt being stationary.
If the test does not reject H0 , a common conclusion
is that the data suggests that yt is non-stationary.
F. Guta (CoBE) Econ 654 September, 2017 189 / 296
However, this is not really a correct conclusion.
All we can say is that there is insu¢ cient evidence
to conclude whether the data are stationary or not.
We have described the test for the setting with an
intercept.
Another popular setting includes as well a linear
time trend. This model is

∆yt = µ1 + µ2 t + α0 yt 1 + α1 ∆yt 1 + + αk 1 ∆ y t (k 1 ) + et (1.9.15)

This is natural when the alternative hypothesis is


F. Guta (CoBE) Econ 654 September, 2017 190 / 296
that the series is stationary about a linear time
trend.

If the series has a linear trend (e.g. GDP, Stock


Prices), then the series itself is non- stationary, but
it may be stationary around the linear time trend.
In this context, it is a waste of time to …t an AR
model to the level of the series without a time
trend, as the AR model cannot conceivably describe
this data.
F. Guta (CoBE) Econ 654 September, 2017 191 / 296
The natural solution is to include a time trend in
the …tted OLS equation.
When conducting the ADF test, this means that it
is computed as the t-ratio for α0 from OLS
estimation of (1.9.15). If a time trend is included,
the test procedure is the same, but di¤erent critical
values are required.
The ADF test has a di¤erent distribution when the
time trend has been included, and a di¤erent table
should be consulted.
F. Guta (CoBE) Econ 654 September, 2017 192 / 296
If too many lags are included, this will somewhat
reduce the power of the tests, but, if too few lags
are included, the asymptotic distributions from the
table are simply invalid (because of autocorrelation
in the residuals), and the tests may lead to seriously
biased conclusions.
It is possible to use the model selection criterion or
statistical signi…cance of the additional variables to
select the lag length in the ADF tests.
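The sketch below (numpy, for illustration only) runs the ADF regression (1.9.14) for a given number of lagged differences and returns the t-ratio on yt−1 together with an AIC value, so that the lag length can be chosen as just described.

import numpy as np

def adf_stat(y, k):
    # Delta y_t on [1, y_{t-1}, Delta y_{t-1}, ..., Delta y_{t-k+1}] as in (1.9.14)
    dy = np.diff(y)
    rows, target = [], []
    for t in range(k - 1, len(dy)):
        rows.append([1.0, y[t]] + [dy[t - j] for j in range(1, k)])
        target.append(dy[t])
    X, z = np.array(rows), np.array(target)
    b = np.linalg.lstsq(X, z, rcond=None)[0]
    e = z - X @ b
    s2 = e @ e / (len(z) - X.shape[1])
    tstat = b[1] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    aic = np.log(e @ e / len(z)) + 2 * X.shape[1] / len(z)
    return tstat, aic   # compare tstat with the ADF table; choose k with the smallest aic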

F. Guta (CoBE) Econ 654 September, 2017 193 / 296


A regression of the form (1.9.14) can also be used
to test for a unit root in a general (invertible)
ARMA model.
Said and Dickey (1984) argue that when,
theoretically, one lets the number of lags in the
regression grow with the sample size, the same
asymptotic distributions hold and the ADF tests are
also valid for an ARMA model.
The argument essentially is that an ARMA model

F. Guta (CoBE) Econ 654 September, 2017 194 / 296


(with invertible MA polynomial) can be written as
an in…nite order autoregressive process.

This explains why, when testing for unit roots,


people usually do not worry about MA components.
Phillips and Perron (1988) have suggested an
alternative to the augmented Dickey-Fuller tests.
Instead of adding additional lags in the regressions
to obtain an error term that has no autocorrelation,
they stick to the original Dickey-Fuller regressions
F. Guta (CoBE) Econ 654 September, 2017 195 / 296
but make nonparametric adjustments to the
Dickey-Fuller test statistics to take into account of
potential autocorrelation pattern in the errors.

Phillips-Perron tests are nonparametric in nature


and applicable to a wide class of weakly dependent
and heterogeneously distributed innovations.
If the ADF test does not reject the null hypothesis
of one unit root, the presence of a second unit root
may be tested by estimating the regression of ∆2 yt
F. Guta (CoBE) Econ 654 September, 2017 196 / 296
on ∆yt−1, ∆²yt−1, . . . , ∆²yt−p+1 , and comparing the t-ratio of the coefficient on ∆yt−1 with the appropriate critical value.
Alternatively, the presence of two unit roots may be tested jointly by estimating the regression of ∆²yt on yt−1, ∆yt−1, ∆²yt−1, . . . , ∆²yt−p+1 , and computing the usual F-statistic for testing the joint significance of yt−1 and ∆yt−1 .
This test statistic, under the null hypothesis of a
F. Guta (CoBE) Econ 654 September, 2017 197 / 296
double unit root, has a distribution that is not the
usual F -distribution.

Critical values of this distribution are tabulated by


Hasza and Fuller (1979).
A stochastic process may be non-stationary for
other reasons than the presence of unit roots.
A linear deterministic trend is one example and
structural breaks in the series may also mimic the
presence of unit root non-stationarity.
F. Guta (CoBE) Econ 654 September, 2017 198 / 296
Without going into details, it may be mentioned
that the recent literature on unit roots also includes
discussions on stochastic unit roots, seasonal unit
roots, fractional integration and panel data unit
roots.
A stochastic unit root implies that a process is
characterized by a root that is not constant but
stochastic and vary around unity.
Such a process can be stationary for some periods

F. Guta (CoBE) Econ 654 September, 2017 199 / 296


and explosive for others (see Granger and Swanson,
1977).

A seasonal unit root arises if a series becomes


stationary after seasonal di¤erencing.
Fractional integration starts from the idea that a
series may be integrated of order d, where d is not
an integer.
If d ≥ 1/2, the process is non-stationary and said to be fractionally integrated.
F. Guta (CoBE) Econ 654 September, 2017 200 / 296
Finally, panel data unit root tests involve tests for
unit roots in multiple series, for example GDP in
ten di¤erent countries.

1.10 Estimation of ARMA Models

Suppose we know that the data series,y1 , y2 , . . . , yT


, is generated by an ARMA process of order p, q.
Depending upon the speci…cation of the model, and
the distributional assumptions we are willing to

F. Guta (CoBE) Econ 654 September, 2017 201 / 296


make, we can estimate the unknown parameters by
ordinary or nonlinear least squares or by maximum
likelihood.

1.10.1 Least Squares

The least squares approach chooses the model


parameters such that the residual sum of squares is
minimal.
This is particularly easy for models in autoregressive
form.
F. Guta (CoBE) Econ 654 September, 2017 202 / 296
Consider the AR (p) model

yt = α + φ1 yt 1 + + φp yt p + et (1.10.1)

where et is a white noise error term that is uncorrelated with anything dated t − 1 or before.
Consequently, we have
E(yt−j et ) = 0 for j = 1, 2, . . . , p

that is, error terms and explanatory variables are


contemporaneously uncorrelated and OLS applied
F. Guta (CoBE) Econ 654 September, 2017 203 / 296
to (1.10.1) provides consistent estimates.
Estimation of an autoregressive model is thus not
di¤erent from a linear regression model with lagged
dependent variables.
Estimation of the AR(p) process
Let Xt = (1, yt−1 , yt−2 , . . . , yt−p )′ and β = (α, φ1 , φ2 , . . . , φp )′.
Then the AR(p) model can be written as yt = Xt′ β + et .
The OLS estimator of the model is then given by:
β̂ = (X′X)⁻¹ X′y
To study the properties of β̂, it is helpful to define the process ut = Xt et .
Note that ut is an MDS, since
E(ut | Ft−1 ) = E(Xt et | Ft−1 ) = Xt E(et | Ft−1 ) = 0

By Theorem 1.1.1, it is also strictly stationary and


ergodic. Hence,
F. Guta (CoBE) Econ 654 September, 2017 205 / 296
(1/T) ∑_{t=1}^{T} Xt et = (1/T) ∑_{t=1}^{T} ut →p E(ut ) = 0 (1.10.2)
The vector Xt is strictly stationary and ergodic, and by Theorem 1.1.1, so is Xt Xt′ .
Therefore by the Ergodic Theorem,
(1/T) ∑_{t=1}^{T} Xt Xt′ →p E(Xt Xt′ ) = Q
Combined with (1.10.2) and the continuous mapping theorem, we see that
β̂ = β + [(1/T) ∑_{t=1}^{T} Xt Xt′ ]⁻¹ [(1/T) ∑_{t=1}^{T} Xt et ] →p β + Q⁻¹ · 0 = β
F. Guta (CoBE) Econ 654 September, 2017 206 / 296
We have shown the following:
Theorem 1.10.1: If the AR(p) process is strictly stationary and ergodic and E(yt²) < ∞, then β̂ →p β as T → ∞.

Asymptotic Distribution
Theorem 1.10.2 (MDS CLT): If ut is a strictly stationary and ergodic MDS and E(ut ut′ ) = Ω < ∞, then as T → ∞,
(1/√T) ∑_{t=1}^{T} ut →d N(0, Ω).
Since Xt et is an MDS, we can apply Theorem 1.10.2 to see that
(1/√T) ∑_{t=1}^{T} Xt et →d N(0, Ω), where Ω = E(Xt Xt′ et²)

Theorem 1.10.3: If the AR(p) process yt is strictly stationary and ergodic and E(yt⁴) < ∞, then as T → ∞,
√T (β̂ − β) →d N(0, Q⁻¹ Ω Q⁻¹)

F. Guta (CoBE) Econ 654 September, 2017 208 / 296


This is identical in form to the asymptotic
distribution of OLS estimator in cross-section
regression.
In particular, the asymptotic covariance matrix is
estimated just as in the cross-section case.
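A hedged numpy sketch of this OLS estimator together with the covariance matrix Q⁻¹ΩQ⁻¹ of Theorem 1.10.3, where Ω is estimated by the average of the outer products Xt Xt′ êt², just as in the cross-section case.

import numpy as np

def ar_ols(y, p):
    # OLS of y_t on [1, y_{t-1}, ..., y_{t-p}] with the sandwich covariance estimate
    X = np.array([[1.0] + [y[t - j] for j in range(1, p + 1)] for t in range(p, len(y))])
    z = np.asarray(y[p:], dtype=float)
    beta = np.linalg.lstsq(X, z, rcond=None)[0]
    e = z - X @ beta
    n = len(z)
    Q = X.T @ X / n
    Omega = (X * e[:, None]).T @ (X * e[:, None]) / n
    Qinv = np.linalg.inv(Q)
    V = Qinv @ Omega @ Qinv / n          # estimated asymptotic covariance of beta-hat
    return beta, np.sqrt(np.diag(V))     # coefficient estimates and standard errors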

Estimation of MA Process

For moving average models, estimation is somewhat


more complicated.
Suppose that we have an MA(1) model
F. Guta (CoBE) Econ 654 September, 2017 209 / 296
yt = α + et + θet−1
Because et−1 is not observed, we cannot apply regression here.
In theory, ordinary least squares would minimize
S(θ, α) = ∑_{t=2}^{T} (yt − α − θet−1 )²

A possible solution arises if we write et−1 in this expression as a function of observed yt 's.
This is only possible if the MA polynomial is invertible.
In this case we can use
et−1 = ∑_{j=0}^{∞} (−θ)^j (yt−1−j − α)
and write
S(θ, α) = ∑_{t=2}^{T} [yt − α − θ ∑_{j=0}^{∞} (−θ)^j (yt−1−j − α)]²
In practice, yt is not observed for t = 0, −1, . . . , so we have to cut off the infinite sum in the above expression to obtain an approximate sum of squares
S̃(θ, α) = ∑_{t=2}^{T} [yt − α − θ ∑_{j=0}^{t−2} (−θ)^j (yt−1−j − α)]² (1.10.3)

Because, asymptotically, if T goes to infinity, the difference between S(θ, α) and S̃(θ, α) disappears, minimizing (1.10.3) with respect to α and θ gives consistent estimators α̂ and θ̂.
Unfortunately, (1.10.3) is a higher-order polynomial in θ and thus has very many local minima.
Therefore, minimizing (1.10.3) analytically is complicated.
However, as we know that −1 < θ < 1, a grid search can be performed.
F. Guta (CoBE) Econ 654 September, 2017 212 / 296
The resulting nonlinear least squares estimators for
α and θ are consistent and asymptotically normal.
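A sketch of such a grid search in Python/numpy. The cut-off sum in (1.10.3) is computed here through the recursion et = yt − α − θet−1 with e0 = 0, which is algebraically equivalent; α is simply fixed at the sample mean to keep the example short, whereas in practice it would be re-optimized as well.

import numpy as np

def ma1_css(y, alpha, theta):
    # conditional sum of squares: e_t = y_t - alpha - theta * e_{t-1}, starting from e_0 = 0
    e_prev, s = 0.0, 0.0
    for yt in y:
        e = yt - alpha - theta * e_prev
        s += e * e
        e_prev = e
    return s

def ma1_grid_search(y, n_grid=199):
    # evaluate the sum of squares over a grid of theta in (-1, 1) and pick the minimum
    alpha = float(np.mean(y))
    grid = np.linspace(-0.99, 0.99, n_grid)
    ss = [ma1_css(y, alpha, th) for th in grid]
    return alpha, float(grid[int(np.argmin(ss))])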
1.10.2 Maximum Likelihood Estimation
Consider an ARMA model of the form
yt = α + φ1 yt−1 + · · · + φp yt−p + et + θ1 et−1 + · · · + θq et−q (1.10.4)
with et white noise:
E(et ) = 0 (1.10.5)
E(et eτ ) = σ² if t = τ, and 0 otherwise (1.10.6)
F. Guta (CoBE) Econ 654 September, 2017 213 / 296
This section explores how to estimate the values of
α, φ1 , . . . , φp , θ 1 , . . . , θ q , σ2 on the basis of
observations on y.
The primary principle on which estimation will be
based is maximum likelihood.
Let θ = (α, φ1 , . . . , φp , θ1 , . . . , θq , σ²)′ denote the
vector of population parameters.
Suppose we have observed a sample of size T
(y1 , . . . , yT ).
F. Guta (CoBE) Econ 654 September, 2017 214 / 296
The approach will be to calculate the probability
density

fyT ,...,y1 (y1 , . . . , yT ; θ) (1.10.7)

which might be viewed as the probability of having


observed this particular sample.
The maximum likelihood estimate (MLE ) of θ is
the value for which this sample is most likely to
have been observed; that is, it is the value of θ that
maximizes (1.10.7).
F. Guta (CoBE) Econ 654 September, 2017 215 / 296
This approach requires specifying a particular
distribution for the white noise process et .
Typically we will assume that et is Gaussian white
noise:
et i.i.d.N 0, σ2 (1.10.8)

This assumption is strong, however, the estimates


of θ that results from it will often turn out to be a
sensible for non-Gaussian processes as well.
Finding the MLE conceptually involves two steps.
F. Guta (CoBE) Econ 654 September, 2017 216 / 296
First, the likelihood function (1.10.7) must be
calculated.
Second, values of θ must be found that maximize
this function.
The Likelihood Function for a Gaussian AR(1)
Process
A Gaussian AR (1) process takes the form

Yt = α + φYt 1 + et (1.10.9)

with et i.i.d.N (0, σ2 ).


F. Guta (CoBE) Econ 654 September, 2017 217 / 296
For this case, the vector of parameters to be estimated consists of θ = (α, φ, σ²)′.
Consider the probability distribution of Y1 , the first observation in the sample.
From equations (1.2.3) and (1.2.4) this is a random variable with mean E(Y1 ) = µ = α/(1 − φ) and variance E(Y1 − µ)² = σ²/(1 − φ²).
Since {et }_{t=−∞}^{∞} is Gaussian, Y1 is also Gaussian.

F. Guta (CoBE) Econ 654 September, 2017 218 / 296


Hence, the density of the first observation takes the form
fY1 (y1 ; θ) = fY1 (y1 ; α, φ, σ²) (1.10.10)
= [1/√(2πσ²/(1 − φ²))] exp{ −[y1 − α/(1 − φ)]² / [2σ²/(1 − φ²)] }

Next consider the distribution of the second


observation Y2 conditional on observing Y1 = y1 .
From (1.10.9),
Y2 = α + φY1 + e2 (1.10.11)
Conditional on Y1 = y1 , Y2 | Y1 = y1 ∼ N(α + φy1 , σ²). Hence,
fY2|Y1 (y2 | y1 ; θ) = [1/√(2πσ²)] exp{ −(y2 − α − φy1 )² / (2σ²) } (1.10.12)

The joint density of observations 1 and 2 is then


just the product of (1.10.10) and (1.10.12):

fY2 ,Y1 (y2 , y1 ; θ) = f Y2 jY1 ( y2 j y1 ; θ) fY1 (y1 ; θ)

Similarly, distribution of the third observation


conditional on the …rst two is
F. Guta (CoBE) Econ 654 September, 2017 220 / 296
fY3|Y2,Y1 (y3 | y2 , y1 ; θ) = [1/√(2πσ²)] exp{ −(y3 − α − φy2 )² / (2σ²) }

from which,

fY3 ,Y2 ,Y1 (y3 , y2 , y1 ; θ) = f Y3 jY2 ,Y1 ( y3 j y2 , y1 ; θ)

f Y2 jY1 ( y2 j y1 ; θ) fY1 (y1 ; θ)

In general, the values of Y1 , Y2 , . . . , Yt−1 matter for Yt only through the value of Yt−1 , and the density of observation t conditional on the preceding t − 1 observations is given by
fYt|Yt−1,...,Y1 (yt | yt−1 , . . . , y1 ; θ) = fYt|Yt−1 (yt | yt−1 ; θ) (1.10.13)
= [1/√(2πσ²)] exp{ −(yt − α − φyt−1 )² / (2σ²) }

The joint density of the first t observations is then
fYt,...,Y1 (yt , . . . , y1 ; θ) = fYt|Yt−1 (yt | yt−1 ; θ) · fYt−1,...,Y1 (yt−1 , . . . , y1 ; θ) (1.10.14)
The likelihood of the complete sample can thus be calculated as
fYT,...,Y1 (yT , . . . , y1 ; θ) = fY1 (y1 ; θ) ∏_{t=2}^{T} fYt|Yt−1 (yt | yt−1 ; θ) (1.10.15)

F. Guta (CoBE) Econ 654 September, 2017 222 / 296


The log likelihood function (denoted by ℓ(θ)) can be found by taking the log of (1.10.15):
ℓ(θ) = ln fY1 (y1 ; θ) + ∑_{t=2}^{T} ln fYt|Yt−1 (yt | yt−1 ; θ) (1.10.16)
Substituting (1.10.10) and (1.10.13) into (1.10.16), the log likelihood for a sample of size T from a Gaussian AR(1) process is seen to be
ℓ(θ) = −(1/2) ln 2π − (1/2) ln[σ²/(1 − φ²)] − {y1 − α/(1 − φ)}² / [2σ²/(1 − φ²)]
− [(T − 1)/2] ln 2π − [(T − 1)/2] ln σ² − [1/(2σ²)] ∑_{t=2}^{T} (yt − α − φyt−1 )² (1.10.17)
F. Guta (CoBE) Econ 654 September, 2017 223 / 296
The MLE b
θ is the value for which (1.10.17) is
maximized.
In principle, this requires di¤erentiating (1.10.17)
and setting the result equal to zero.
In practice, when an attempt is made to carry this
out, the result is a system of nonlinear equations in
θ and (y1 , . . . , yT ) for which there is no simple
solution for θ in terms of (y1 , . . . , yT ).
Maximization of (1.10.17) thus requires iterative or
numerical procedures.
F. Guta (CoBE) Econ 654 September, 2017 224 / 296
Conditional ML Estimates of AR(1) Process
An alternative to numerical optimization of the exact likelihood function is to regard the value of y1 as deterministic and maximize the likelihood conditioned on the first observation,
fYT,...,Y2|Y1 (yT , . . . , y2 | y1 ; θ) = ∏_{t=2}^{T} fYt|Yt−1 (yt | yt−1 ; θ) (1.10.18)
the objective function then being to maximize
ln fYT,...,Y2|Y1 (yT , . . . , y2 | y1 ; θ) = −[(T − 1)/2] ln 2π − [(T − 1)/2] ln σ² − [1/(2σ²)] ∑_{t=2}^{T} (yt − α − φyt−1 )² (1.10.19)
F. Guta (CoBE) Econ 654 September, 2017 225 / 296
Maximization of (1.10.19) with respect to α and φ

is equivalent to minimization of

∑_{t=2}^{T} (yt − α − φyt−1 )² (1.10.20)

which is achieved by OLS regression of yt on a


constant and its own lagged value.

The conditional maximum likelihood estimates of α


and φ are therefore given by
F. Guta (CoBE) Econ 654 September, 2017 226 / 296
(α̂, φ̂)′ = [ T − 1 , ∑_{t=2}^{T} yt−1 ; ∑_{t=2}^{T} yt−1 , ∑_{t=2}^{T} yt−1² ]⁻¹ ( ∑_{t=2}^{T} yt , ∑_{t=2}^{T} yt−1 yt )′
The conditional maximum likelihood estimate of the innovation variance is found by differentiating (1.10.19) with respect to σ² and setting the result equal to zero:
−(T − 1)/(2σ̂²) + [1/(2σ̂⁴)] ∑_{t=2}^{T} (yt − α̂ − φ̂yt−1 )² = 0
or σ̂² = [1/(T − 1)] ∑_{t=2}^{T} (yt − α̂ − φ̂yt−1 )² .
In contrast to the exact maximum likelihood
F. Guta (CoBE) Econ 654 September, 2017 227 / 296
estimates, the conditional maximum likelihood
estimates are thus trivial to compute.

Moreover, if the sample size T is su¢ ciently large,


the …rst observation makes a negligible contribution
to the total likelihood.
The exact MLE and the conditional MLE turn out to have the same large-sample distribution, provided that |φ| < 1.
When jφj > 1, the conditional MLE continues to
F. Guta (CoBE) Econ 654 September, 2017 228 / 296
provide consistent estimates, whereas maximization
of (1.10.17) does not.

This is because (1.10.17) is derived from (1.10.10),


which does not actually describe the density of Y1
when jφj > 1.
For these reasons, in most applications the
parameters of an autoregression are estimated by
OLS (conditional ML) rather than exact maximum
likelihood.
F. Guta (CoBE) Econ 654 September, 2017 229 / 296
Conditional ML Estimates of AR(p) Process
A Gaussian AR (p ) process takes the form

Yt = α + φ1 Yt 1 + + φp Yt p + et (1.10.21)

with et i.i.d.N (0, σ2 ).


In this case , the vector of parameters to be
0
estimated consists of θ = α, φ1 , . . . , φp , σ2 .
Maximization of the exact log likelihood for AR (p )
process must be accomplished numerically.
In contrast, the log of the likelihood conditional on
F. Guta (CoBE) Econ 654 September, 2017 230 / 296
the first p observations assumes the simple form
ln fYT,...,Yp+1|Yp,...,Y1 (yT , . . . , yp+1 | yp , . . . , y1 ; θ)
= −[(T − p)/2] ln 2π − [(T − p)/2] ln σ² − [1/(2σ²)] ∑_{t=p+1}^{T} (Yt − α − φ1 Yt−1 − · · · − φp Yt−p )² (1.10.22)
The values of α, φ1 , . . . , φp that maximize (1.10.22) are the same as those that minimize
∑_{t=p+1}^{T} (Yt − α − φ1 Yt−1 − · · · − φp Yt−p )² (1.10.23)
F. Guta (CoBE) Econ 654 September, 2017 231 / 296
Thus, the conditional maximum likelihood estimates
of these parameters can be obtained from an OLS
regression of yt on a constant and p of its own
lagged values.

The conditional maximum likelihood estimate of σ² turns out to be
σ̂² = [1/(T − p)] ∑_{t=p+1}^{T} (Yt − α̂ − φ̂1 Yt−1 − · · · − φ̂p Yt−p )²

The exact maximum likelihood estimates and the


F. Guta (CoBE) Econ 654 September, 2017 232 / 296
conditional maximum likelihood estimates again
have the same large sample distributions.
Conditional LF for a Gaussian MA(1) Process

Calculation of the likelihood function for a moving


average process is simple if we condition on initial
values for the e‘s .
Consider the Gaussian MA(1) process

Yt = µ + et + θet 1 (1.10.24)

with et i.i.d.N (0, σ2 ).


F. Guta (CoBE) Econ 654 September, 2017 233 / 296
Let θ ≡ (µ, θ, σ²)′ denote the vector of parameters to be estimated.
If the value of et−1 were known with certainty, then
Yt | et−1 ∼ N(µ + θet−1 , σ²)
or
fYt|et−1 (yt | et−1 ; θ) = [1/(σ√2π)] exp{ −(yt − µ − θet−1 )² / (2σ²) } (1.10.25)

Suppose that we knew for certain that e0 = 0.


F. Guta (CoBE) Econ 654 September, 2017 234 / 296
Then Y1 | e0 = 0 ∼ N(µ, σ²).
Moreover, given observation of y1 , the value of e1 is then known with certainty as well: e1 = y1 − µ.
Allowing application of (1.10.25) again:
fY2|Y1,e0=0 (y2 | y1 , e0 = 0; θ) = [1/(σ√2π)] exp{ −(y2 − µ − θe1 )² / (2σ²) }
Since e1 is known with certainty, e2 can be calculated from e2 = y2 − µ − θe1 .
Proceeding in this fashion, it is clear that given
knowledge that e0 = 0, the full sequence
F. Guta (CoBE) Econ 654 September, 2017 235 / 296
fe1 , e2 , . . . , eT g can be calculated from
fy1 , y2 , . . . , yT g by iterating on

et = yt − µ − θet−1 (1.10.26)

for t = 1, 2, . . . , T , starting from e0 = 0.

The conditional density of the t-th observation can then be calculated from (1.10.25) as
fYt|Yt−1,...,Y1,e0=0 (yt | yt−1 , . . . , y1 , e0 = 0; θ) (1.10.27)
= fYt|et−1 (yt | et−1 ; θ) = [1/(σ√2π)] exp{ −et² / (2σ²) }

F. Guta (CoBE) Econ 654 September, 2017 236 / 296


The sample likelihood would then be the product of these individual densities:
fYT,...,Y1|e0=0 (yT , . . . , y1 | e0 = 0; θ) = fY1|e0=0 (y1 | e0 = 0; θ) ∏_{t=2}^{T} fYt|Yt−1,...,Y1,e0=0 (yt | yt−1 , . . . , y1 , e0 = 0; θ)
The conditional log likelihood is
ℓ(θ) = ln fYT,...,Y1|e0=0 (yT , . . . , y1 | e0 = 0; θ)
= −(T/2) ln 2π − (T/2) ln σ² − [1/(2σ²)] ∑_{t=1}^{T} et² (1.10.28)

F. Guta (CoBE) Econ 654 September, 2017 237 / 296


For a particular numerical value of θ, we thus
calculate the sequence of e‘s implied by the data
from (1.10.26).
The conditional log likelihood function (1.10.28) is
then a function of the sum of squares of these e‘s.
The log likelihood is a complicated function of µ
and θ, so that an analytical expression for the ML
estimates of µ and θ is not readily calculated.
Hence, even the conditional MLEs for MA(1)
F. Guta (CoBE) Econ 654 September, 2017 238 / 296
process must be found by numerical optimization.
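For a given parameter value the objective is cheap to evaluate; a short numpy sketch of (1.10.26) and (1.10.28) that a numerical optimizer could then maximize over (µ, θ, σ²):

import numpy as np

def ma1_conditional_loglik(y, mu, theta, sigma2):
    # build e_1, ..., e_T by the recursion (1.10.26) with e_0 = 0,
    # then evaluate the conditional log likelihood (1.10.28)
    e, e_prev = np.empty(len(y)), 0.0
    for t, yt in enumerate(y):
        e[t] = yt - mu - theta * e_prev
        e_prev = e[t]
    T = len(y)
    return -T / 2 * np.log(2 * np.pi) - T / 2 * np.log(sigma2) - (e @ e) / (2 * sigma2)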
Iteration on (1.10.26) from an arbitrary value of e0 will result in
et = (yt − µ) − θ(yt−1 − µ) + θ²(yt−2 − µ) − · · · + (−1)^{t−1} θ^{t−1} (y1 − µ) + (−1)^t θ^t e0

If jθ j is substantially less than unity, the e¤ect of


imposing e0 = 0 will quickly die out and the
conditional likelihood (1.10.27) will give a good
approximation to the unconditional likelihood for a
F. Guta (CoBE) Econ 654 September, 2017 239 / 296
reasonably large sample size.

By contrast, if jθ j > 1, the consequence of


imposing e0 = 0 accumulate over time.
The conditional approach is not reasonable in this
case.
If numerical optimization of (1.10.28) results in a
value of θ that exceeds one in absolute value, the
results must be discarded.
The numerical optimization should be attempted
F. Guta (CoBE) Econ 654 September, 2017 240 / 296
again with the reciprocal of b
θ used as a starting
value for the numerical search procedure.
Conditional LF for a Gaussian MA(q) Process

For the MA(q ) process

Yt = µ + et + θ 1 et 1 + + θ q et q (1.10.29)

A simple approach is to condition on the assumption


that the first q values for e were all zero:
e0 = e−1 = · · · = e−q+1 = 0 (1.10.30)
F. Guta (CoBE) Econ 654 September, 2017 241 / 296
From these starting values we can iterate on
et = Yt − µ − θ1 et−1 − · · · − θq et−q (1.10.31)
for t = 1, 2, . . . , T .
Let e0 denote the (q × 1) vector (e0 , e−1 , . . . , e−q+1 )′.
The conditional log likelihood is then
ℓ(θ) = ln fYT,...,Y1|e0=0 (yT , . . . , y1 | e0 = 0; θ)
= −(T/2) ln 2π − (T/2) ln σ² − [1/(2σ²)] ∑_{t=1}^{T} et² (1.10.32)
F. Guta (CoBE) Econ 654 September, 2017 242 / 296
where θ = (µ, θ1 , . . . , θq , σ²)′.
Again, expression (1.10.32) is useful only if all values of z for which 1 + θ1 z + θ2 z² + · · · + θq z^q = 0 lie outside the unit circle.
Conditional LF for a Gaussian ARMA(p,q) Process

A Gaussian ARMA(p, q ) process takes the form

Yt = α + φ1 Yt 1 + + φp Yt p

+ et + θ 1 et 1 + + θ q et q (1.10.33)
F. Guta (CoBE) Econ 654 September, 2017 243 / 296
where et i.i.d.N (0, σ2 ).

The objective is to estimate the vector of population parameters θ ≡ (α, φ1 , . . . , φp , θ1 , . . . , θq , σ²)′.
The approximation to the likelihood for an
autoregression is conditioned on the initial values of
the y’s.
The approximation to the likelihood function for a
moving average process conditioned on initial values
for the e‘s.
F. Guta (CoBE) Econ 654 September, 2017 244 / 296
A common approach to the likelihood function for
an ARMA(p, q ) process conditions on both y’s and
e‘s.
Taking initial values for y0 ≡ (y0 , y−1 , . . . , y−p+1 )′ and e0 ≡ (e0 , e−1 , . . . , e−q+1 )′ as given, the sequence {e1 , e2 , . . . , eT } can be calculated from {y1 , y2 , . . . , yT } by iterating on
et = Yt − α − φ1 Yt−1 − · · · − φp Yt−p − θ1 et−1 − · · · − θq et−q (1.10.34)
F. Guta (CoBE) Econ 654 September, 2017 245 / 296
for t = 1, 2, . . . , T .

The conditional log likelihood is then
ℓ(θ) = ln fYT,...,Y1|Y0,e0 (yT , . . . , y1 | y0 , e0 )
= −(T/2) ln 2π − (T/2) ln σ² − [1/(2σ²)] ∑_{t=1}^{T} et² (1.10.35)

One option is to set initial y’s and e’s equal to their


expected values.
That is, set ys = α/(1 − φ1 − · · · − φp ) for s = 0, −1, . . . , −p + 1, set es = 0 for s = 0, −1, . . . , −q + 1, and then proceed with the iteration in (1.10.34) for t = 1, 2, . . . , T .

Alternatively, Box and Jenkins (1976) recommended


setting e’s to zero but y’s equal to their actual
values.
Thus, the iteration on (1.10.34) is started at date
t = p + 1 with y1 , y2 , . . . , yp set to the observed
values and ep = ep 1 = = ep q +1 = 0.
Then the conditional log likelihood is calculated as
F. Guta (CoBE) Econ 654 September, 2017 247 / 296
ℓ(θ) = ln f (yT , . . . , yp+1 | yp , . . . , y1 , ep = · · · = ep−q+1 = 0; θ)
= −[(T − p)/2] ln 2π − [(T − p)/2] ln σ² − [1/(2σ²)] ∑_{t=p+1}^{T} et²

As in the case for the moving average processes,


these approximations should be used only if all
values of z satisfying

1 + θ1z + θ2z 2 + + θq z q = 0

lie outside the unit circle.

1.11 Model Selection and Diagnostic Checking


F. Guta (CoBE) Econ 654 September, 2017 248 / 296
Most of the time there are no economic reasons to
choose a particular speci…cation of the model.
Consequently, to a large extent the data will
determine which time series model is appropriate.
Before estimating any model, it is common to
estimate autocorrelation and partial autocorrelation
coe¢ cients directly from the data.
Often it gives some idea about which models might
be appropriate.
F. Guta (CoBE) Econ 654 September, 2017 249 / 296
After one or more models are estimated, their
quality can be judged by checking whether the
residuals are more or less white noise, and by
comparing them with alternative speci…cations.
These comparisons can be based on statistical
signi…cance tests or the use of particular model
selection criteria.

The Autocorrelation Function

The autocorrelation function (ACF ) describes the


F. Guta (CoBE) Econ 654 September, 2017 250 / 296
correlation between yt and its lag yt−k as a function of k.
The k-th order autocorrelation coefficient is defined as
ρ̂k = [1/(T − k)] ∑_{t=k+1}^{T} (yt − ȳ)(yt−k − ȳ) / [(1/T) ∑_{t=1}^{T} (yt − ȳ)²] (1.11.1)
where ȳ = (1/T) ∑_{t=1}^{T} yt denotes the sample average.
Alternatively, it can be estimated by regressing yt
on a constant and yt k which will give a slightly
di¤erent estimator as the summation in the
F. Guta (CoBE) Econ 654 September, 2017 251 / 296
numerator and denominator will be over the same
set of observations.

It will usually not be the case that ρ̂k is zero for an MA model of order q < k.
But we can use ρ̂k to test the hypothesis that ρk = 0 using the result that, asymptotically,
√T (ρ̂k − ρk ) →d N(0, νk )
where νk = 1 + 2ρ1² + 2ρ2² + · · · + 2ρq² if q < k.


F. Guta (CoBE) Econ 654 September, 2017 252 / 296
Testing MA(k − 1) versus MA(k) is done by testing ρk = 0 and comparing the test statistic
√T ρ̂k / √(1 + 2ρ̂1² + 2ρ̂2² + · · · + 2ρ̂²k−1 ) (1.11.2)

with critical values from the standard normal


distribution.

Typically, two-standard error bounds for ρ̂k based on the estimated variance 1 + 2ρ̂1² + 2ρ̂2² + · · · + 2ρ̂²k−1

are graphically displayed in the plot of the sample


F. Guta (CoBE) Econ 654 September, 2017 253 / 296
autocorrelation function.

The order of a moving average can in this way be


determined from an inspection of the sample ACF .
At least it will give us a reasonable value for q to
start with, and diagnostic checking should indicate
whether it is appropriate or not.
For autoregressive models the ACF is less helpful.
For the AR(1) model we have seen that the autocorrelation coefficients do not cut off at a finite lag length.
Instead they go to zero exponentially, corresponding to ρk = φ^k .
For higher-order AR models, the autocorrelation
function is more complex.
An alternative source of information that is helpful
is provided by the partial autocorrelation function.

The Partial Autocorrelation Function

F. Guta (CoBE) Econ 654 September, 2017 255 / 296


The k-th order sample partial autocorrelation coefficient is defined as an estimate for φk in an AR(k) model.
We denote this by φ̂k . So, estimating
yt = α + φ1 yt−1 + φ2 yt−2 + · · · + φk yt−k + et
yields φ̂k , the estimated coefficient for φk in the AR(k) model.
The partial autocorrelation coefficient measures the additional correlation between yt and yt−k after controlling for the effects of the other regressors yt−1 , . . . , yt−k+1 .

Obviously if the true model is an AR (p ) process,


then estimating an AR (k ) model by OLS gives
consistent estimators for the model parameters if
k p.
Consequently, we have

bk = 0 if k > p
plim φ (1.11.3)

Moreover, it can be shown that the asymptotic


F. Guta (CoBE) Econ 654 September, 2017 257 / 296
distribution is normal, i.e.,
√T (φ̂k − 0) →d N(0, 1) if k > p (1.11.4)

Consequently, the partial autocorrelation coe¢ cients


can be used to determine the order of AR process.
Testing an AR (k 1) model versus and AR (k )
model implies testing the null hypothesis that
φk = 0.
Under the null hypothesis that the model is
F. Guta (CoBE) Econ 654 September, 2017 258 / 296
AR(k − 1), the appropriate standard error of φ̂k based on (1.11.4) is 1/√T , so that φk = 0 is rejected if |√T φ̂k | > 1.96.

This way one can look at the PACF and test for
each lag whether the partial autocorrelation
coe¢ cient is zero.
For a genuine AR (p ) model the partial
autocorrelation will be close to zero after the pth
lag.
F. Guta (CoBE) Econ 654 September, 2017 259 / 296
For a moving average model it can be shown that
the partial autocorrelations do not have a cut-o¤
point but tail o¤ to zero just like the
autocorrelations in an autoregressive model.

In summary the AR (p ) process is described by:

an ACF that is in…nite in extent (it tails o¤).


a PACF that is (close to) zero for lags larger than p.

For an MA(q ) process we have:

an ACF that is (close to) 0 for lags larger than q.


F. Guta (CoBE) Econ 654 September, 2017 260 / 296
a PACF that is in…nite in extent (tails o¤).
In the absence of any of these two situations, a
combined ARMA model may provide a parsimonious
representation of the data.

Diagnostic Checking

As a last step in model-building cycle, some checks


on the model adequacy are required.
Possibilities are doing a residual analysis and
over…tting the speci…ed model.
F. Guta (CoBE) Econ 654 September, 2017 261 / 296
For example, if an ARMA(p, q ) model is chosen, we
could also estimate an ARMA(p + 1, q ) and an
ARMA(p, q + 1) models and test the signi…cance of
the additional parameters.
A residual analysis is usually based on the fact that
the residuals of an adequate model should
approximately be white noise.
A plot of the residuals can be a useful tool in
checking for outliers.

F. Guta (CoBE) Econ 654 September, 2017 262 / 296


Moreover, the estimated residual autocorrelations
are usually examined.
For a white noise series the autocorrelations are
zero.
Therefore the significance of the residual autocorrelations is often checked by comparing them with approximate two-standard error bounds ±2/√T .
To check the overall acceptability of the residual
autocorrelations, the Ljung-Box (1978)
portmanteau test statistic
F. Guta (CoBE) Econ 654 September, 2017 263 / 296
Q = T (T + 2) ∑_{k=1}^{K} (T − k)⁻¹ rk² (1.11.5)
is often used, where rk are the estimated autocorrelation coefficients of the residuals êt , and
K is the number chosen by the researcher.

Values of Q for di¤erent K may be computed in the


residual analysis.
For an ARMA(p, q ) process, the statistic is
approximately Chi-squared distributed with
K p q (where K > p + q) degrees of freedom
F. Guta (CoBE) Econ 654 September, 2017 264 / 296
under the null hypothesis that the ARMA(p, q ) is
correctly speci…ed.
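A numpy sketch of the Q statistic in (1.11.5), computed from a vector of model residuals; the value is compared with a chi-squared critical value with K − p − q degrees of freedom.

import numpy as np

def ljung_box_q(resid, K):
    e = np.asarray(resid, dtype=float)
    e = e - e.mean()
    T = len(e)
    denom = e @ e
    # residual autocorrelations r_1, ..., r_K
    r = np.array([(e[k:] @ e[:-k]) / denom for k in range(1, K + 1)])
    return T * (T + 2) * np.sum(r ** 2 / (T - np.arange(1, K + 1)))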
If the model is rejected at this stage, the
model-building cycle has to be repeated.
Criteria for Model Selection
Economic theory does not provide any guidance to
the appropriate choice of models, some additional
criteria can be used to choose from alternative
models that are acceptable from statistical point of
view.
F. Guta (CoBE) Econ 654 September, 2017 265 / 296
What is the appropriate choice of p in practice?
This is a problem of model selection.
One approach to model selection is to choose p
based on a Wald test.
Another is to minimize the AIC or BIC information
criterion assuming constant is included in the
AR (p ) model, e.g.
AIC(p) = ln σ̂²(p) + 2(p + 1)/T (1.11.6)
where σ̂²(p) is the estimated residual variance
F. Guta (CoBE) Econ 654 September, 2017 266 / 296
from an AR (p ).
One ambiguity in de…ning the AIC criterion is that
the sample available for estimation changes as p
changes.
If you increase p, you need more initial conditions.
This can induce strange behaviour into the AIC .
The best remedy is to fix an upper value p̄, reserve the first p̄ observations as initial conditions, and then calculate the models AR(1), AR(2), . . . , AR(p̄) on this unified sample.
F. Guta (CoBE) Econ 654 September, 2017 267 / 296
Alternatively one can use the BIC given by

BIC(p) = ln σ̂²(p) + [(p + 1)/T ] ln T (1.11.7)
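A sketch of this order-selection step in numpy: every AR(p), p = 1, . . . , p_max, is fitted on the unified sample that reserves the first p_max observations, and (1.11.6) and (1.11.7) are computed for each order.

import numpy as np

def ar_order_selection(y, p_max):
    # AIC and BIC for AR(1), ..., AR(p_max), all estimated on the same unified sample
    T = len(y)
    crit = {}
    for p in range(1, p_max + 1):
        X = np.array([[1.0] + [y[t - j] for j in range(1, p + 1)]
                      for t in range(p_max, T)])
        z = np.asarray(y[p_max:], dtype=float)
        b = np.linalg.lstsq(X, z, rcond=None)[0]
        e = z - X @ b
        n = len(z)
        sigma2 = e @ e / n
        crit[p] = (np.log(sigma2) + 2 * (p + 1) / n,          # AIC, eq. (1.11.6)
                   np.log(sigma2) + (p + 1) * np.log(n) / n)  # BIC, eq. (1.11.7)
    return crit   # pick the order with the smallest AIC or BIC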

If one is to choose between alternative ARMA(p, q )

models, the AIC and BIC information criteria are

given by

AIC(p + q) = ln σ̂²(p + q) + 2(p + q + 1)/T
BIC(p + q) = ln σ̂²(p + q) + [(p + q + 1)/T ] ln T
F. Guta (CoBE) Econ 654 September, 2017 268 / 296
Both criteria are likelihood based and represent a
di¤erent trade-o¤ between ‘…t’, as measured by the
loglikelihood value, and ‘parsimony’, as measured
by the number of free parameters, p + q + 1,
(assuming the models include a constant).
Usually the model with the smallest AIC or BIC
value is preferred, although one can choose to
deviate from this if the di¤erences in criterion values
are small for a subset of the models.

F. Guta (CoBE) Econ 654 September, 2017 269 / 296


While the two criterion di¤er in the trade-o¤
between …t and parsimony, the BIC criterion can be
preferred because it has the property that it will
almost surely select the true model, if T ! ∞,
provided that the true model is in the class of
ARMA(p, q ) models for relatively small values of p
and q.
The AIC criterion tends to result asymptotically in
overparameterized models.

F. Guta (CoBE) Econ 654 September, 2017 270 / 296


1.12 Predicting with ARMA models

The main goal of building a time series model is


predicting the future path of economic variables.
ARMA models usually perform quite well in
prediction and often outperform more complicated
structural models.
However, ARMA models do not provide any
economic insight in one’s predictions and are unable
to forecast under alternative economic scenarios.
F. Guta (CoBE) Econ 654 September, 2017 271 / 296
The Optimal Predictor

Suppose we are interested in predicting YT +h at


time T , the value of YT h-periods ahead.
A predictor for YT +h will based on an information
set, denoted by FT , that contains the information
that is available and potentially used at the time of
making the forecast.
Ideally it contains all the information that is
observed and known at time T .
F. Guta (CoBE) Econ 654 September, 2017 272 / 296
In univariate time series modeling we will usually
assume that the information set at any point t in
time contains the value of Yt and all its lags. Thus
we have

FT = {Y−∞ , . . . , YT−1 , YT } (1.12.1)

b T +hjT (the predictor for


In general, the predictor Y
YT +h as of time T ) is a function of the information
set FT .
Our criterion for choosing a predictor from the
F. Guta (CoBE) Econ 654 September, 2017 273 / 296
many possible ones is to minimize the expected quadratic prediction error
E[(YT+h − ŶT+h|T )² | FT ] (1.12.2)
where E( · | FT ) denotes the conditional expectation given the information set FT .

It can be shown that the best predictor for YT +h


given the information set at time T is the
conditional mean of YT +h given the information set.
F. Guta (CoBE) Econ 654 September, 2017 274 / 296
We denote this optimal predictor as
ŶT+h|T ≡ E(YT+h | FT ) (1.12.3)

The conditional expectation of YT +h given an


information set FT0 , where FT0 is a subset of ,FT is
b T +hjT based on FT .
at best as good as Y
In line with this intuition, it holds that, the more
information one uses to determine the predictor, the
better the predictor will be.

F. Guta (CoBE) Econ 654 September, 2017 275 / 296


For example, E ( YT +h j YT , YT 1,... ) will usually be
a better predictor than E ( YT +h j YT ) or E (YT +h )
(an empty information set).
To simplify things, assume that the parameters of
the ARMA model for YT are known.
In practice, one would simply replace the unknown
parameters by their consistent estimates.
Now, how do we determine these conditional
expectations when YT follows an ARMA process?
F. Guta (CoBE) Econ 654 September, 2017 276 / 296
To simplify the notation, consider forecasting yT+h , noting that ŶT+h|T = µ + ŷT+h|T .
As a first example, consider an AR(1) process where
yT+1 = φyT + eT+1
Consequently,
ŷT+1|T = E(yT+1 | yT , yT−1 , . . .) = φyT + E(eT+1 | yT , yT−1 , . . .) = φyT (1.12.4)
F. Guta (CoBE) Econ 654 September, 2017 277 / 296
To predict two periods ahead (h = 2), we write
yT+2 = φyT+1 + eT+2
from which it follows
ŷT+2|T = E(yT+2 | yT , . . .) = φE(yT+1 | yT , . . .) + E(eT+2 | yT , . . .) = φ²yT (1.12.5)
In general, we obtain ŷT+h|T = φ^h yT .


Thus, the last observed value yT contains all the
F. Guta (CoBE) Econ 654 September, 2017 278 / 296
information to determine the predictor for any
future value.

When h is large, the predictor ŷT+h|T converges to 0 (the unconditional expectation of yt ), provided that |φ| < 1.
With a nonzero mean, the best predictor for YT+h is directly obtained as ŶT+h|T = µ + ŷT+h|T = µ + φ^h (YT − µ).
Note that this differs from φ^h YT .
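In code the AR(1) predictor is a one-liner; a small numpy sketch with illustrative parameter values, showing the reversion of the forecast towards the unconditional mean µ as the horizon h grows:

import numpy as np

def ar1_forecast(y_T, mu, phi, h):
    # optimal h-step-ahead predictor: mu + phi**h * (Y_T - mu)
    return mu + phi ** h * (y_T - mu)

mu, phi, y_T = 2.0, 0.8, 3.5                               # illustrative values
path = [ar1_forecast(y_T, mu, phi, h) for h in range(1, 11)]
# the sequence approaches mu = 2.0 as h increases, since |phi| < 1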
F. Guta (CoBE) Econ 654 September, 2017 279 / 296
As a second example, consider an MA(1) process where
yt = et + θet−1
Then we have
E(yT+1 | yT , . . .) = θE(eT | yT , . . .) = θeT
where, implicitly, we assumed that eT is observed (contained in the information set FT ).
Assuming that the MA process is invertible, we can write eT = ∑_{j=0}^{∞} (−θ)^j yT−j and determine the
F. Guta (CoBE) Econ 654 September, 2017 280 / 296
one-period ahead predictor as
ŷT+1|T = θ ∑_{j=0}^{∞} (−θ)^j yT−j (1.12.6)

Predicting two periods ahead gives
ŷT+2|T = E(yT+2 | yT , yT−1 , . . .) = E(eT+2 | yT , . . .) + θE(eT+1 | yT , . . .) = 0 (1.12.7)

which shows that the MA(1) model is


uninformative for predicting two periods ahead: the
best predictor is simply the (unconditional)
F. Guta (CoBE) Econ 654 September, 2017 281 / 296
expected value of yt , normalized at zero.

This also follows from the autocorrelation function


of the process, because the ACF is zero after one
lag.
That is, the ‘memory’of the process is only one
period.

For the general ARMA(p, q ) model

yt = φ1 yt 1 + + φp yt p + et + θ 1 et 1 + + θ q et q

F. Guta (CoBE) Econ 654 September, 2017 282 / 296


We can derive the following recursive formula to determine the optimal predictors:
ŷT+h|T = φ1 ŷT+h−1|T + · · · + φp ŷT+h−p|T + êT+h|T + θ1 êT+h−1|T + · · · + θq êT+h−q|T (1.12.8)
where êT+k|T is the optimal predictor for eT+k at time T , and
ŷT+k|T = yT+k if k ≤ 0
êT+k|T = 0 if k > 0
êT+k|T = eT+k if k ≤ 0
where the latter innovation can be solved from the autoregressive representation of the model.

For this we have used the fact that the process is


stationary and invertible, in which case the
information set fyT , yT 1 , . . .g is equivalent to
f eT , eT 1 , . . . g.

That is, if all et ’s are known from ∞ to T , then


all yt ’s are known from ∞ to T , and vice versa.
F. Guta (CoBE) Econ 654 September, 2017 284 / 296
To illustrate this, consider an ARMA(1, 1) model that takes the form
yt = φyt−1 + et + θet−1
The optimal predictor for yT+1 is then given by
ŷT+1|T = φyT + êT+1|T + θeT = φyT + θeT
Assuming invertibility,
yt − φyt−1 = (1 + θL) et
can be written as
et = (1 + θL)⁻¹ (yt − φyt−1 ) = ∑_{j=0}^{∞} (−θ)^j (yt−j − φyt−1−j )
We can write for the one-period-ahead predictor
ŷT+1|T = φyT + θ ∑_{j=0}^{∞} (−θ)^j (yT−j − φyT−j−1 ) (1.12.9)
Predicting two periods ahead gives
ŷT+2|T = φŷT+1|T + êT+2|T + θêT+1|T = φŷT+1|T (1.12.10)

Note that this does not equal to φ2 yT .

Prediction Accuracy
F. Guta (CoBE) Econ 654 September, 2017 286 / 296
In addition to prediction, it is important to know how accurate this prediction is.
To judge forecasting precision, we define the prediction error as YT+h − ŶT+h|T = yT+h − ŷT+h|T and the expected quadratic prediction error as
Ch = E(yT+h − ŷT+h|T )² = var(yT+h | FT ) (1.12.11)

F. Guta (CoBE) Econ 654 September, 2017 287 / 296


where the latter step follows from the fact that

ybT +hjT = E ( yT +h j FT )

Determining Ch , corresponding to the variance of


the h period-ahead prediction error, is relatively
easy with the moving average representation.
To start with the simplest case, consider an MA(1)
model.
Then we have
F. Guta (CoBE) Econ 654 September, 2017 288 / 296
C1 = var(yT+1 | yT , yT−1 , . . .) = var(eT+1 + θeT | eT , eT−1 , . . .) = var(eT+1 ) = σ²

Alternatively, we explicitly solve for the predictor,


which is ybT +1jT = θeT , and determine the variance
of yT +1 ybT +1jT = eT +1 , which gives the same
result.
For the two-period-ahead predictor we have
C2 = var(yT+2 | yT , yT−1 , . . .) = var(eT+2 + θeT+1 | eT , eT−1 , . . .) = (1 + θ²) σ²
F. Guta (CoBE) Econ 654 September, 2017 289 / 296
As one would expect, the accuracy of the prediction
decreases if we predict further into the future.
It will not, however, increase any further if h is
increased beyond 2.
This becomes clear if we compare the expected
quadratic prediction error with that of a simple
unconditional predictor:

ybT +hjT = E (yT +h ) = E (eT +h + θeT +h 1 ) = 0

For this predictor we have


F. Guta (CoBE) Econ 654 September, 2017 290 / 296
Ch = E(yT+h − 0)² = var(yT+h ) = var(eT+h + θeT+h−1 ) = (1 + θ²) σ²

Consequently, this gives an upper bound on the


inaccuracy of the predictors.
The MA(1) model thus gives more efficient predictors only if one predicts one period ahead.
More general ARMA models, however, will also yield efficiency gains for predictions further ahead.
Suppose the general model is ARMA(p, q), which we can write as an MA(∞) model, with αj coefficients to be determined:
yt = ∑_{j=0}^{∞} αj et−j with α0 = 1

The h-period-ahead predictor (in terms of the et 's) is given by
ŷT+h|T = E(yT+h | yT , yT−1 , . . .) = ∑_{j=0}^{∞} αj E(eT+h−j | eT , eT−1 , . . .) = ∑_{j=h}^{∞} αj eT+h−j

F. Guta (CoBE) Econ 654 September, 2017 292 / 296


such that
yT+h − ŷT+h|T = ∑_{j=0}^{h−1} αj eT+h−j
Consequently, we have
E(yT+h − ŷT+h|T )² = σ² ∑_{j=0}^{h−1} αj² (1.12.12)

This shows how the variance of the forecast errors


can easily be determined from the coe¢ cients of the
moving average representation of the model.
F. Guta (CoBE) Econ 654 September, 2017 293 / 296
Recall that, for the computation of the predictor,
the autoregressive representation was most
convenient.
As an illustration, consider the AR(1) model, where αj = φ^j .
The expected quadratic prediction errors are given by
C1 = σ², C2 = σ²(1 + φ²), C3 = σ²(1 + φ² + φ⁴), etc.

F. Guta (CoBE) Econ 654 September, 2017 294 / 296


For h = ∞, we have
C∞ = σ²(1 + φ² + φ⁴ + · · ·) = σ²/(1 − φ²)
which is the unconditional variance of yt .
Consequently, the informational value contained in
AR (1) process slowly decays over time.
In the long run the predictor equals the
unconditional predictor, being the mean of the yt
series.
In practical cases, the parameters in ARMA models
F. Guta (CoBE) Econ 654 September, 2017 295 / 296
are unknown, and replaced by their estimated
values.

This introduces additional uncertainty in the


predictors.
Usually, however, this uncertainty is ignored.
The motivation is that the additional variance that
arises because of estimation error disappears
asymptotically as the sample size becomes
su¢ ciently large.
F. Guta (CoBE) Econ 654 September, 2017 296 / 296
