Autoregressive models
Moving average models
Andrew Lesniewski
Baruch College
New York
Fall 2019
Outline
1 Basic concepts
2 Autoregressive models
3 Moving average models
Time series
A time series is a sequence of data points Xt indexed by a discrete set of (ordered)
dates t, where −∞ < t < ∞.
Each Xt can be a simple number or a complex multi-dimensional object (vector,
matrix, higher dimensional array, or more general structure).
We will be assuming that the times t are equally spaced throughout, and denote
the time increment by h (e.g. second, day, month). Unless specified otherwise,
we will be choosing the units of time so that h = 1.
Typically, time series exhibit significant irregularities, which may have their origin
either in the nature of the underlying quantity or imprecision in observation (or
both).
Examples of time series commonly encountered in finance include:
(i) prices,
(ii) returns,
(iii) index levels,
(iv) trading volumes,
(v) open interests,
(vi) macroeconomic data (inflation, new payrolls, unemployment, GDP,
housing prices, . . . )
Time series
For modeling purposes, we assume that the elements of a time series are
random variables on some underlying probability space.
Time series analysis is a set of mathematical methodologies for analyzing
observed time series, whose purpose is to extract useful characteristics of the
data.
These methodologies fall into two broad categories:
(i) non-parametric, where the stochastic law of the time series is not explicitly
specified;
(ii) parametric, where the stochastic law of the time series is assumed to be
given by a model with a finite (and preferably tractable) number of
parameters.
The results of time series analysis are used for various purposes such as
(i) data interpretation,
(ii) forecasting,
(iii) smoothing,
(iv) back filling, ...
We begin with stationary time series.
A time series (model) is stationary if, for any times t_1 < \ldots < t_k and any τ, the
joint probability distribution of (X_{t_1+τ}, \ldots, X_{t_k+τ}) is identical to the joint
probability distribution of (X_{t_1}, \ldots, X_{t_k}).
In other words, the joint probability distribution of (Xt1 , . . . , Xtk ) remains the
same if each observation time ti is shifted by the same amount (time translation
invariance).
For a stationary time series, the expected value E(Xt ) is independent of t and is
called the (ensemble) mean of Xt . We will denote its value by µ.
A stationary time series model is ergodic if

\lim_{T \to \infty} \frac{1}{T} \sum_{1 \le k \le T} X_{t+k} = \mu, (1)

i.e. the time average of a single realization converges to the ensemble mean.
The autocovariance of a time series is \Gamma_{s,t} = \mathrm{Cov}(X_s, X_t), and its autocorrelation function (ACF) is

R_{s,t} = \frac{\mathrm{Cov}(X_s, X_t)}{\sqrt{\mathrm{Var}(X_s)\,\mathrm{Var}(X_t)}}. (3)
For covariance-stationary time series, Rs,t = Rs−t,0 , i.e. the ACF is a function of
the difference s − t only.
We will write \Gamma_t = \Gamma_{t,0} and R_t = R_{t,0}, and note that

R_t = \frac{\Gamma_t}{\Gamma_0}. (4)
Note that µ, Γ, and R are usually unknown, and are estimated from sample data.
The estimated sample mean \hat{\mu}, autocovariance \hat{\Gamma}, and autocorrelation \hat{R} are
calculated as follows.
Consider a finite sample x_1, \ldots, x_T. Then

\hat{\mu} = \frac{1}{T} \sum_{t=1}^{T} x_t,

\hat{\Gamma}_t =
\begin{cases}
\frac{1}{T} \sum_{j=t+1}^{T} (x_j - \hat{\mu})(x_{j-t} - \hat{\mu}), & \text{for } t = 0, 1, \ldots, T-1, \\
\hat{\Gamma}_{-t}, & \text{for } t = -1, \ldots, -(T-1),
\end{cases} (5)

\hat{R}_t = \frac{\hat{\Gamma}_t}{\hat{\Gamma}_0}.
These quantities are called the sample mean, sample autocovariance, and
sample ACF, respectively.
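As an illustration, here is a minimal Python sketch (not from the original notes; the helper names are our own) of the estimators (5), assuming x is a one-dimensional array holding the observations x_1, \ldots, x_T:

import numpy as np

def sample_acovf(x, max_lag):
    # Sample autocovariance (5): Gamma_hat_t = (1/T) * sum_{j=t+1}^{T} (x_j - mu_hat)(x_{j-t} - mu_hat)
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()
    return np.array([np.sum(xc[t:] * xc[:T - t]) / T for t in range(max_lag + 1)])

def sample_acf(x, max_lag):
    # Sample ACF: R_hat_t = Gamma_hat_t / Gamma_hat_0
    gamma = sample_acovf(x, max_lag)
    return gamma / gamma[0]

For example, sample_acf(x, 20) returns \hat{R}_0, \ldots, \hat{R}_{20}.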
Usually, \hat{R}_t is a biased estimator of R_t, with the bias going to zero as 1/T for
T → ∞.
Notice that this method allows us to compute up to T − 1 estimated sample
autocorrelations.
One can use the above estimators to test the hypothesis H_0 : R_t = 0 versus
H_a : R_t \ne 0.
The relevant t-stat is

r = \frac{\hat{R}_t}{\sqrt{\frac{1}{T}\left(1 + 2\sum_{i=1}^{t-1} \hat{R}_i^2\right)}}.
Another test, the Portmanteau test, allows us to test jointly for the presence of
several autocorrelations, i.e. H_0 : R_1 = \ldots = R_k = 0, versus H_a : R_i \ne 0, for
some 1 ≤ i ≤ k.
The relevant test statistic is defined as

Q^*(k) = T \sum_{i=1}^{k} \hat{R}_i^2.
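A minimal sketch of computing Q^*(k) in Python (our own helper; under H_0, Q^*(k) is approximately chi-square distributed with k degrees of freedom for large T):

import numpy as np

def portmanteau_q(x, k):
    # Q*(k) = T * sum_{i=1}^{k} R_hat_i^2, with R_hat computed as in (5)
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()
    gamma = np.array([np.sum(xc[i:] * xc[:T - i]) / T for i in range(k + 1)])
    r = gamma / gamma[0]
    return T * np.sum(r[1:] ** 2)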
A time series can often be decomposed as

X_t = p_t + m_t + \varepsilon_t, (6)
where the three components on the RHS have the following meaning:
pt is a periodic function called the seasonality,
mt is a slowly varying process called the trend,
εt is a stochastic component called the error or disturbance.
Classic linear time series models fall into three broad categories:
autoregressive,
moving average,
integrated,
and their combinations.
White noise
The source of randomness in the models discussed in these lectures is white
noise. It is a process specified as follows:
X_t = \varepsilon_t, (7)

where

E(\varepsilon_t) = 0,
\mathrm{Cov}(\varepsilon_s, \varepsilon_t) =
\begin{cases}
\sigma^2, & \text{if } s = t, \\
0, & \text{otherwise}.
\end{cases} (8)
A simple example of a non-stationary time series is the linear trend model

X_t = at + b + \varepsilon_t, \quad a \ne 0, (9)

where \varepsilon_t is white noise.
The first class of models that we consider is the class of autoregressive models AR(p).
Their key characteristic is that the current observation is directly correlated with
the p lagged observations.
The simplest among them is AR(1), the autoregressive model with a single lag.
The model is specified as follows:
Xt = α + βXt−1 + εt . (10)
A special case is the random walk,

X_t = X_{t-1} + \varepsilon_t,

in which the current value of X is the previous value plus a “white noise”
disturbance.
Assuming that the process is stationary and taking expectations of both sides of (10), we find that the mean satisfies

\mu = \alpha + \beta\mu.

This equation has a solution iff \beta \ne 1 (except for the random walk case
corresponding to \alpha = 0, \beta = 1). In this case,

\mu = \frac{\alpha}{1 - \beta}. (11)
Using (11), we can rewrite (10) as

X_t - \mu = \beta(X_{t-1} - \mu) + \varepsilon_t. (12)
Notice that the two terms on the RHS of this equation are independent of each
other.
Computing the variance of both sides of (12), we find that

\Gamma_0 = \beta^2 \Gamma_0 + \sigma^2,

and so

\Gamma_0 = \frac{\sigma^2}{1 - \beta^2}. (13)
Since Γ0 > 0, this equation implies that |β| < 1.
Multiplying (12) by X_{t-1} - \mu and taking expectations, we find that \Gamma_1 = \beta\Gamma_0. Iterating, we find that

\Gamma_k = \beta^k \Gamma_0. (14)
The AR(1) with |β| < 1 has a natural interpretation that can be gleaned from the
following “explicit” representation of Xt . Namely, iterating (10) we find that:
X_t = \alpha + \beta X_{t-1} + \varepsilon_t
    = \alpha(1 + \beta) + \beta^2 X_{t-2} + \varepsilon_t + \beta\varepsilon_{t-1}
    = \ldots (15)
    = \alpha(1 + \beta + \ldots + \beta^{L-1}) + \beta^L X_{t-L} + \varepsilon_t + \beta\varepsilon_{t-1} + \ldots + \beta^{L-1}\varepsilon_{t-L+1}
    = \mu(1 - \beta^L) + \beta^L X_{t-L} + \sqrt{\Gamma_0(1 - \beta^{2L})}\,\xi_t,

where \xi_t has mean zero and unit variance.
In other words, the AR(1) model describes a mean reverting time series. After a
large number of observations, X_t is (approximately) equal to its mean value \mu plus a
Gaussian noise with variance \Gamma_0.
The rate of convergence to this limit is given by |β|: the smaller this value, the
faster Xt reaches its limit behavior.
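A small simulation sketch (our own, with arbitrarily chosen parameter values) illustrating this mean reversion: the sample mean approaches \mu = \alpha/(1-\beta) and the lag one sample autocorrelation approaches \beta:

import numpy as np

np.random.seed(0)
alpha, beta, sigma, T = 1.0, 0.6, 0.5, 100000
eps = np.random.normal(0.0, sigma, T)   # Gaussian white noise
x = np.empty(T)
x[0] = alpha / (1.0 - beta)             # start at the stationary mean
for t in range(1, T):
    x[t] = alpha + beta * x[t - 1] + eps[t]

xc = x - x.mean()
print(x.mean(), alpha / (1.0 - beta))                    # sample mean vs mu
print(np.sum(xc[1:] * xc[:-1]) / np.sum(xc ** 2), beta)  # lag one sample ACF vs beta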
The next question is: given a set of observations, how do we determine the
values of the parameters α, β, and σ in (10)?
A standard approach is the method of maximum likelihood (MLE). For a sample y = (y_1, \ldots, y_N) of independent observations with probability density p(\,\cdot\,|\theta), the likelihood function is

L(\theta|y) = \prod_{i=1}^{N} p(y_i|\theta). (18)
The value θ∗ that maximizes L(θ|y) serves as the best fit between the model
specification and the data.
It is usually more convenient to consider the log likelihood function (LLF)
− log L(θ|y). Then, θ∗ is the value at which the LLF attains its minimum.
As an illustration, consider a sample y = (y1 , . . . , yN ) drawn from the normal
distribution N(µ, σ 2 ). Its likelihood function is given by
L(\theta|y) = (2\pi\sigma^2)^{-N/2} \prod_{i=1}^{N} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right), (19)

and so

-\log L(\theta|y) = \frac{N}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \mu)^2 + \mathrm{const}. (20)
Taking the \mu and \sigma derivatives and setting them to 0, we readily find that the
MLE estimates of \mu and \sigma are

\mu^* = \frac{1}{N}\sum_{i=1}^{N} y_i,

(\sigma^*)^2 = \frac{1}{N}\sum_{i=1}^{N}(y_i - \mu^*)^2, (21)

respectively.
Note that, while µ∗ is unbiased, the estimator σ ∗ is biased (N in the denominator
above, rather than the usual N − 1).
The fact that the MLE estimator of a parameter is biased is a common
occurrence. One can show, however, that MLE estimators are consistent, i.e. in
the limit N → ∞ they converge to the true parameter value.
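A quick numerical check of (21) (our own sketch, with arbitrary parameter choices): the MLE estimates coincide with the sample mean and the biased (denominator N) sample variance:

import numpy as np

np.random.seed(0)
y = np.random.normal(2.0, 3.0, 1000)        # sample from N(mu = 2, sigma = 3)
mu_star = y.mean()                          # (1/N) * sum of y_i
sigma2_star = np.mean((y - mu_star) ** 2)   # N (not N - 1) in the denominator
print(mu_star, sigma2_star, y.var(ddof=0))  # y.var(ddof=0) reproduces sigma2_star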
Going forward, we will use the notation θb rather than θ∗ for the MLE estimators.
where

\bar{x} = \frac{1}{T}\sum_{t=0}^{T-1} x_t, \qquad \bar{x}_+ = \frac{1}{T}\sum_{t=0}^{T-1} x_{t+1}. (27)
The exact MLE method also accounts for the likelihood of the initial observation x_0.
Since x_0 \sim N(\mu, \Gamma_0),

p(x_0|\theta) = \sqrt{\frac{1-\beta^2}{2\pi\sigma^2}}\, \exp\left(-\frac{(x_0 - \alpha/(1-\beta))^2}{2\sigma^2/(1-\beta^2)}\right). (28)

The conditional densities of the subsequent observations are

p(x_t|x_{t-1}, \ldots, x_1, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left(-\frac{(x_t - \alpha - \beta x_{t-1})^2}{2\sigma^2}\right). (29)
The joint probability density of the sample is thus

p(x_0, x_1, \ldots, x_T|\theta) = p(x_0|\theta)\prod_{t=1}^{T} p(x_t|x_{t-1}, \ldots, x_1, \theta). (30)
-\log L(\theta|x) = \frac{1}{2}\log\frac{\sigma^2}{1-\beta^2} + \frac{T}{2}\log\sigma^2 + \frac{(x_0 - \alpha/(1-\beta))^2}{2\sigma^2/(1-\beta^2)} + \frac{1}{2\sigma^2}\sum_{t=1}^{T}(x_t - \alpha - \beta x_{t-1})^2 + \mathrm{const}. (31)
Unlike the conditional case, the minimum of the exact LLF cannot be calculated
in closed form, and the calculation has to be done by means of a numerical
search.
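A minimal sketch of such a numerical search (our own illustration, assuming scipy is available): simulate an AR(1) path and minimize the exact LLF (31) with a derivative-free optimizer. The simulated data, starting point, and parameterization of \sigma are arbitrary choices:

import numpy as np
from scipy.optimize import minimize

np.random.seed(0)
x = np.empty(500)
x[0] = 2.0
for t in range(1, 500):                      # simulate AR(1) with alpha = 1, beta = 0.5, sigma = 0.3
    x[t] = 1.0 + 0.5 * x[t - 1] + np.random.normal(0.0, 0.3)

def exact_neg_llf(params, x):
    # Exact LLF (31) for AR(1); sigma is parameterized via its logarithm to keep it positive
    alpha, beta, log_sigma = params
    if abs(beta) >= 1.0:
        return np.inf                        # outside the stationarity region
    sigma2 = np.exp(2.0 * log_sigma)
    T = len(x) - 1                           # observations are x_0, x_1, ..., x_T
    resid = x[1:] - alpha - beta * x[:-1]
    llf = 0.5 * np.log(sigma2 / (1.0 - beta ** 2)) + 0.5 * T * np.log(sigma2)
    llf += (x[0] - alpha / (1.0 - beta)) ** 2 / (2.0 * sigma2 / (1.0 - beta ** 2))
    llf += np.sum(resid ** 2) / (2.0 * sigma2)
    return llf

res = minimize(exact_neg_llf, x0=[0.0, 0.0, 0.0], args=(x,), method='Nelder-Mead')
alpha_hat, beta_hat, sigma_hat = res.x[0], res.x[1], np.exp(res.x[2])
print(alpha_hat, beta_hat, sigma_hat)        # should be close to 1.0, 0.5, 0.3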
For the AR(2) model, X_t = \alpha + \beta_1 X_{t-1} + \beta_2 X_{t-2} + \varepsilon_t, the mean is

\mu = \frac{\alpha}{1 - \beta_1 - \beta_2}. (33)
The autocorrelations satisfy the Yule-Walker equation

R_k = \beta_1 R_{k-1} + \beta_2 R_{k-2}, (36)

for k = 1, 2.
This equation allows us to calculate the ACF of AR(2) explicitly.
Namely, plugging in k = 1 and remembering that R_{-1} = R_1 yields R_1 = \beta_1 + \beta_2 R_1, or

R_1 = \frac{\beta_1}{1 - \beta_2}. (37)

Plugging in k = 2 yields R_2 = \beta_1 R_1 + \beta_2, or

R_2 = \beta_2 + \frac{\beta_1^2}{1 - \beta_2}. (38)
The variance of the AR(2) process is given by

\Gamma_0 = \frac{(1 - \beta_2)\sigma^2}{(1 + \beta_2)\left((1 - \beta_2)^2 - \beta_1^2\right)}. (40)
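A small simulation check of (37), (38), and (40) (our own sketch; the AR(2) coefficients are arbitrary but satisfy the stationarity conditions):

import numpy as np

np.random.seed(0)
beta1, beta2, sigma, T = 0.5, 0.3, 1.0, 200000
eps = np.random.normal(0.0, sigma, T)
x = np.zeros(T)
for t in range(2, T):
    x[t] = beta1 * x[t - 1] + beta2 * x[t - 2] + eps[t]   # alpha = 0, so mu = 0

R1 = beta1 / (1.0 - beta2)                                 # (37)
R2 = beta2 + beta1 ** 2 / (1.0 - beta2)                    # (38)
Gamma0 = (1.0 - beta2) * sigma ** 2 / ((1.0 + beta2) * ((1.0 - beta2) ** 2 - beta1 ** 2))  # (40)

xc = x - x.mean()
print(np.sum(xc[1:] * xc[:-1]) / np.sum(xc ** 2), R1)      # sample vs theoretical R_1
print(np.sum(xc[2:] * xc[:-2]) / np.sum(xc ** 2), R2)      # sample vs theoretical R_2
print(xc.var(), Gamma0)                                     # sample vs theoretical Gamma_0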
The lag operator L is defined by L X_t = X_{t-1}; in other words, the lag operator shifts the time index back by one unit.
Applying the lag operator k times shifts the time index by k units:

L^k X_t = X_{t-k}. (42)

More generally, we consider polynomials in the lag operator:

\psi(L) = \psi_0 + \psi_1 L + \ldots + \psi_n L^n. (43)
For instance, the AR(2) model can be written in the form

\psi(L) X_t = \alpha + \varepsilon_t, (44)

where \psi(z) = 1 - \beta_1 z - \beta_2 z^2.
Solving this equation amounts to finding the inverse \psi(L)^{-1} of \psi(L):

X_t = \frac{\alpha}{\psi(1)} + \psi(L)^{-1}\varepsilon_t. (45)
We look for the inverse in the form of a series

\psi(L)^{-1} = \sum_{j=0}^{\infty} \gamma_j L^j, (46)

with

\sum_{j=0}^{\infty} |\gamma_j| < \infty. (47)
The resulting process is covariance stationary, with

E(X_t) = \frac{\alpha}{\psi(1)}, (49)

and

\mathrm{Cov}(X_t, X_{t+k}) = \sigma^2 \sum_{j=0}^{\infty} \gamma_j \gamma_{j+k}, \quad \text{for } k \ge 0. (50)
For the AR(1) model, \psi(z) = 1 - \beta z, and

(1 - \beta L)^{-1} = \sum_{j=0}^{\infty} \beta^j L^j. (51)

Condition (47) then holds as long as |\beta| < 1. Another way of saying this is that the
root z_1 = 1/\beta of 1 - \beta z lies outside of the unit circle.
In general, let z_1, \ldots, z_n denote the roots of the polynomial \psi(z). Then

\psi(L) = c \prod_{j=1}^{n} (1 - z_j^{-1} L), (52)

where c is the constant c = (-1)^n \psi_n \prod_{j=1}^{n} z_j.
If each of the roots zj (they may be complex) lies outside of the unit circle, i.e.
|zj−1 | < 1, then we can invert ψ(L) by applying (51) to each factor in (52).
It is not hard to verify that the convergence criterion (47) is then satisfied, and thus the time
series is stationary.
We can summarize these arguments by stating that a time series model given by
the lag form equation (44) is covariance stationary if the roots of the polynomial
ψ(z) lie outside of the unit circle.
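A minimal sketch of this stationarity check (our own helper): form the coefficients of \psi(z) = 1 - \beta_1 z - \ldots - \beta_p z^p and verify with numpy.roots that all roots lie outside the unit circle. numpy.roots expects coefficients ordered from the highest power down:

import numpy as np

def is_covariance_stationary(betas):
    # betas = [beta_1, ..., beta_p] for psi(z) = 1 - beta_1 z - ... - beta_p z^p
    coeffs = np.r_[[-b for b in betas[::-1]], 1.0]   # [-beta_p, ..., -beta_1, 1]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_covariance_stationary([0.5, 0.3]))   # True: both roots lie outside the unit circle
print(is_covariance_stationary([0.5, 0.6]))   # False: one root lies inside the unit circle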
For the general AR(p) model, X_t = \alpha + \beta_1 X_{t-1} + \ldots + \beta_p X_{t-p} + \varepsilon_t, the mean is

\mu = \frac{\alpha}{1 - \beta_1 - \ldots - \beta_p}. (54)
R_k = \beta_1 R_{k-1} + \ldots + \beta_p R_{k-p}, (57)

for k = 1, \ldots, p.
Note that the autocorrelations satisfy essentially the same equation as the
process defining Xt .
The ACF values R_k can be found as the solution of the Yule-Walker equations, and are
expressed in terms of the roots of the characteristic polynomial.
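A minimal sketch of using the Yule-Walker equations for estimation (our own helper, assuming scipy is available): plug the sample ACF values into (57) and solve the resulting linear system for \beta_1, \ldots, \beta_p:

import numpy as np
from scipy.linalg import toeplitz

def yule_walker_ar(x, p):
    # Solve R_k = beta_1 R_{k-1} + ... + beta_p R_{k-p}, k = 1, ..., p, with sample ACF values
    x = np.asarray(x, dtype=float)
    T = len(x)
    xc = x - x.mean()
    gamma = np.array([np.sum(xc[k:] * xc[:T - k]) / T for k in range(p + 1)])
    r = gamma / gamma[0]
    A = toeplitz(r[:p])               # symmetric matrix with entries R_{|i-j|}
    return np.linalg.solve(A, r[1:p + 1])

# Example: betas_hat = yule_walker_ar(x, p=2) for an observed series x.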
This is in contrast with picking the model whose optimized LLF is the lowest: this
may be the result of overfitting. The AIC criterion penalizes the number of
parameters, and thus discourages overfitting.
Another popular information criterion is the Bayesian information criterion (a.k.a.
the Schwarz criterion), defined as BIC = -2\log\hat{L} + k\log T, where k is the number of
estimated parameters and T is the sample size.
The simplest moving average model is MA(1), specified as follows:

X_t = \mu + \varepsilon_t + \theta\varepsilon_{t-1}, (60)

where \varepsilon_t is white noise.
Its lag one autocorrelation is

R_1 = \frac{\theta}{1 + \theta^2}. (63)
The conditional MLE method sets \varepsilon_0 = 0 and computes the remaining disturbances recursively via \varepsilon_t = x_t - \mu - \theta\varepsilon_{t-1}. The conditional LLF is then

-\log L(\theta|x, \varepsilon_0 = 0) = \frac{T}{2}\log\sigma^2 + \frac{1}{2\sigma^2}\sum_{t=1}^{T}\varepsilon_t^2 + \mathrm{const}. (67)
The exact MLE method uses the fact that, for Gaussian white noise, x = (x_1, \ldots, x_T) is jointly normal with mean vector \mu = (\mu, \ldots, \mu) and covariance matrix \Omega:

p(x|\theta) = \frac{1}{(2\pi)^{T/2}\det(\Omega)^{1/2}}\, \exp\left(-\frac{1}{2}(x - \mu)^T \Omega^{-1}(x - \mu)\right), (68)

and thus

-\log L(\theta|x) = \frac{1}{2}\log\det(\Omega) + \frac{1}{2}(x - \mu)^T \Omega^{-1}(x - \mu) + \mathrm{const}. (69)
\Omega = \sigma^2
\begin{pmatrix}
1+\theta^2 & \theta & 0 & \ldots & 0 \\
\theta & 1+\theta^2 & \theta & \ldots & 0 \\
0 & \theta & 1+\theta^2 & \ldots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \ldots & 1+\theta^2
\end{pmatrix}. (70)
The numerics of minimizing (69) can be handled either by (i) a clever triangular
factorization of Ω, or by the Kalman filter method (we will discuss Kalman filters
later in this course).
Unlike the conditional MLE method, the exact method does not suffer from
instabilities if |θ| ≥ 1.
Here is the Python code snippet implementing the MLE for MA(1) using
statsmodels:
# MLE estimate of the MA(1) parameters with statsmodels
import numpy as np
from statsmodels.tsa.arima_model import ARMA

model = ARMA(x, order=(0, 1)).fit(method='mle')   # order = (p, q) = (0, 1), i.e. MA(1)
muMLE = model.params[0]          # constant term, i.e. the estimated mean mu
thetaMLE = model.params[1]       # estimated MA coefficient theta
sigmaMLE = np.std(model.resid)   # estimate of sigma from the residuals
ARMA(p, q) model
The ARMA(p, q) model combines the autoregressive and moving average specifications:

\psi(L) X_t = \alpha + \varphi(L)\varepsilon_t,

where

\psi(z) = 1 - \beta_1 z - \ldots - \beta_p z^p,
\varphi(z) = 1 + \theta_1 z + \ldots + \theta_q z^q. (76)

This process is covariance stationary if the roots of \psi lie outside of the unit
circle.
ARMA(p, q) model
In this case, we can write the model in the form
Xt = µ + γ(L)εt , (77)
where µ = α/ψ(1), and γ(L) = ψ(L)−1 ϕ(L). Explicitly, γ(L) is an infinite series:
\gamma(L) = \sum_{j=0}^{\infty} \gamma_j L^j, (78)

with

\sum_{j=0}^{\infty} |\gamma_j|^2 < \infty. (79)
This form of the model specification is called the moving average form.
The parameters of ARMA models are estimated by means of the MLE method. The
complexity of the computation required to minimize the LLF increases with the
number of parameters.
Information criteria, such as AIC or BIC, remain useful quantitative guides for
model selection.
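A minimal sketch of AIC-based order selection (our own illustration, written against the same 2019-era statsmodels ARMA interface as the MA(1) snippet above; the simulated data and the candidate order ranges are arbitrary):

import numpy as np
from statsmodels.tsa.arima_model import ARMA

np.random.seed(0)
T = 500
eps = np.random.normal(0.0, 1.0, T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + eps[t] + 0.3 * eps[t - 1]   # simulate an ARMA(1, 1) path

best = None
for p in range(3):
    for q in range(3):
        try:
            fit = ARMA(x, order=(p, q)).fit(method='mle', disp=0)
        except Exception:
            continue                                     # skip orders that fail to converge
        if best is None or fit.aic < best[0]:
            best = (fit.aic, p, q)
print(best)                                              # smallest AIC and the selected (p, q)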
Forecasting
Suppose that we have observed X_1, \ldots, X_t, and wish to forecast the next value X_{t+1}.
We denote the forecast by X^*_{t+1|1:t}, and choose it so as to minimize the mean squared error

E\left[\left(X_{t+1} - X^*_{t+1|1:t}\right)^2\right]. (80)

We claim that X^*_{t+1|1:t} is, indeed, given by the conditional expected value:

X^*_{t+1|1:t} = E_t(X_{t+1}). (81)
As a result,

X^*_{t+k|1:t} = E_t(X_{t+k}). (83)
Later we will generalize this method to time series models with more complex
structure.
For the AR(1) model, the one period forecast is

X^*_{t+1|1:t} = E_t(X_{t+1}) = E_t(\alpha + \beta X_t + \varepsilon_{t+1}) = \alpha + \beta X_t. (84)
The forecast error is εt+1 , and so the variance of the forecast error is σ 2 .
Likewise, the single period forecast in an AR(p) model is

X^*_{t+1|1:t} = \alpha + \beta_1 X_t + \ldots + \beta_p X_{t-p+1}, (85)

with forecast error \varepsilon_{t+1}, and the variance of the forecast error is \sigma^2.
The two period forecast in the AR(1) model is

X^*_{t+2|1:t} = E_t(X_{t+2}) = E_t(\alpha + \beta X_{t+1} + \varepsilon_{t+2}) = (1 + \beta)\alpha + \beta^2 X_t. (86)
The error of the two period forecast is εt+2 + βεt+1 ; its variance is (1 + β 2 )σ 2 .
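A tiny numerical illustration of the AR(1) forecast formulas (84) and (86) (our own sketch; parameter values and the last observation are arbitrary):

alpha, beta, sigma = 1.0, 0.6, 0.5
x_t = 3.0                                              # last observed value X_t

forecast_1 = alpha + beta * x_t                        # one period forecast (84)
forecast_2 = (1.0 + beta) * alpha + beta ** 2 * x_t    # two period forecast (86)
var_1 = sigma ** 2                                     # variance of the one period forecast error
var_2 = (1.0 + beta ** 2) * sigma ** 2                 # variance of the two period forecast error
print(forecast_1, var_1, forecast_2, var_2)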
A one period forecast in an MA(1) model is
X^*_{t+1|1:t} = E_t(X_{t+1}) = E_t(\mu + \varepsilon_{t+1} + \theta\varepsilon_t) = \mu + \theta\varepsilon_t. (87)