y_t = Z_t α_t + ε_t, α_{t+1} = T_t α_t + R_t η_t, (1)

with p × 1 state vector α_t and where the system matrices Z_t, T_t and R_t are fixed but may depend on known functions of parameter vector ψ. The disturbance vectors ε_t and η_t are mutually and serially independent and distributed by

ε_t ~ NID(0, σ²H_t), η_t ~ NID(0, σ²Q_t), (2)
where σ² is a scaling factor and variance matrices H_t and Q_t are fixed but may depend on ψ as well. The state space model specification is completed with the initial state vector modelled by

α_1 = a + Aδ + Cη_0, η_0 ~ N(0, σ²Q_0), (3)

where vector a and matrices A, C and Q_0 are fixed system variables of appropriate dimensions.
The random vector η_0 is independent of the other disturbances. The k × 1 vector of coefficients δ can be treated in two ways: (i) as a fixed and unknown vector; (ii) as a diffuse random vector, distributed by N(0, σ²Ω) where Ω⁻¹ → 0. The initial state constant a is for known effects, the coefficient vector δ is for unknown regression effects and for initial effects in nonstationary processes, while the random vector η_0 is for the exact initialisation of stationary processes. Since η_0 is a random vector with a properly defined variance matrix, we are not interested in case (ii) with Ω as a regular variance matrix and therefore we assume always that Ω⁻¹ → 0 and E(δ) = 0 without loss of generality. Finally, the (possibly time-varying) system matrices are fixed and known functions of the vector of nuisance parameters ψ. Textbook treatments of state space time series models are, amongst others, given by Anderson and Moore (1979), Harvey (1989) and Durbin and Koopman (2001).
The state space model (1) can be represented as a linear regression model. In particular,
we can consider the formulation
y = c + Xδ + u, u ~ N(0, σ²Σ). (4)
The equivalence of (4) with the state space model is obtained by defining

y = (y_1', . . . , y_T')', (c, X) = ZW(a, A), Z = diag(Z_1, . . . , Z_T), W = (I, T_1', (T_2 T_1)', . . . , (T_{T−1} · · · T_1)')', (5)

with u collecting the disturbance vectors ε_t and η_t such that Var(u) = σ²Σ; the dimension of X is n × k. As system matrices may depend on ψ, the explanatory variable matrix X = X(ψ) and covariance matrix Σ = Σ(ψ) may also depend on ψ.
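As a quick numerical sketch (not from the paper, and assuming the reconstruction of (5) as (c, X) = ZW(a, A) with Z = diag(Z_1, . . . , Z_T) and W stacking the transition products), one can check that a disturbance-free realisation of (1) and (3) satisfies y = c + Xδ:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, k = 3, 2, 2
Zt = [rng.normal(size=p) for _ in range(T)]           # Z_t (one observation per period)
Tt = [rng.normal(size=(p, p)) for _ in range(T - 1)]  # T_1, ..., T_{T-1}
a = rng.normal(size=p)                                # known part of alpha_1
A = rng.normal(size=(p, k))                           # loading of delta in alpha_1
delta = rng.normal(size=k)

# W = (I, T_1', (T_2 T_1)', ...)' stacks the transition products
blocks, Phi = [np.eye(p)], np.eye(p)
for t in range(T - 1):
    Phi = Tt[t] @ Phi
    blocks.append(Phi)
W = np.vstack(blocks)
Z = np.zeros((T, T * p))                              # Z = diag(Z_1, ..., Z_T)
for t in range(T):
    Z[t, t * p:(t + 1) * p] = Zt[t]
c, X = Z @ W @ a, Z @ W @ A                           # (c, X) = Z W (a, A)

# disturbance-free recursion (1): y_t = Z_t alpha_t, alpha_{t+1} = T_t alpha_t
alpha, y = a + A @ delta, np.zeros(T)
for t in range(T):
    y[t] = Zt[t] @ alpha
    if t < T - 1:
        alpha = Tt[t] @ alpha
print(np.allclose(y, c + X @ delta))                  # True
```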
2.1 Profile likelihood function
In terms of the linear regression model (4) with a fixed and unknown δ, the likelihood function is denoted by L = exp{ℓ(y; δ, σ², ψ)} and the scaled loglikelihood function is given by

−2 log L = −2ℓ(y; δ, σ², ψ) = n log 2π + n log σ² + log|Σ| + σ⁻²(y − c − Xδ)'Σ⁻¹(y − c − Xδ). (6)
Analytical expressions for the maximum likelihood estimators for δ and σ² can be obtained and are given by the generalized least squares expressions

δ̂ = (X'Σ⁻¹X)⁻¹X'Σ⁻¹(y − c), σ̂² = n⁻¹RSS, RSS = (y − c)'Σ⁻¹M_X(y − c), (7)

where M_X = I − X(X'Σ⁻¹X)⁻¹X'Σ⁻¹. The loglikelihood function (6) at the maximized location of δ = δ̂ is given by

−2 log L_P = −2ℓ(y; δ̂, σ², ψ) = n log 2π + n log σ² + log|Σ| + σ⁻²RSS, (8)

and is defined as the profile loglikelihood function. We obtain the concentrated profile loglikelihood function by replacing σ² by its maximum likelihood estimator σ̂² = RSS / n, that is

−2 log L_P^c = −2ℓ(y; δ̂, σ̂², ψ) = n log 2π + n log RSS − n log n + log|Σ| + n. (9)
2.2 Diffuse likelihood function
In terms of the linear regression model with a random vector δ ~ N(0, σ²Ω), the loglikelihood function is given by

ℓ(y; σ², ψ) = ℓ(y|δ; σ², ψ) + ℓ(δ; σ², ψ) − ℓ(δ|y; σ², ψ), (10)

where ℓ(y|δ; σ², ψ) = ℓ(y; δ, σ², ψ) is given in (6) while ℓ(δ; σ², ψ) = ℓ(δ; σ²) with

−2ℓ(δ; σ²) = k log 2π + k log σ² + log|Ω| + σ⁻²δ'Ω⁻¹δ.
The density implied by ℓ(δ|y; σ², ψ) is obtained as follows. Since E(y) = c + XE(δ) = c, Var(y) = σ²(XΩX' + Σ), E(δ) = 0, Var(δ) = σ²Ω and E(yδ') = σ²XΩ, we obtain

E(δ|y) = E(δy')Var(y)⁻¹[y − E(y)]
= ΩX'(XΩX' + Σ)⁻¹(y − c)
= (Ω⁻¹ + X'Σ⁻¹X)⁻¹X'Σ⁻¹(y − c),

Var(δ|y) = Var(δ) − E(δy')Var(y)⁻¹E(yδ')
= σ²Ω − σ²ΩX'(XΩX' + Σ)⁻¹XΩ
= σ²(Ω⁻¹ + X'Σ⁻¹X)⁻¹,
where we have suppressed the dependence on σ² and ψ. These results follow from a matrix inversion lemma and some minor manipulations. The last term on the right-hand side of (10) becomes

−2ℓ(δ|y; σ², ψ) = k log 2π + k log σ² − log|Ω⁻¹ + X'Σ⁻¹X| + σ⁻²δ'(Ω⁻¹ + X'Σ⁻¹X)δ
+ σ⁻²(y − c)'Σ⁻¹X(Ω⁻¹ + X'Σ⁻¹X)⁻¹X'Σ⁻¹(y − c) − 2σ⁻²(y − c)'Σ⁻¹Xδ.
By re-arranging the different terms of the loglikelihood function (10), we obtain

−2ℓ(y; σ², ψ) = n log 2π + n log σ² + log|Σ| + log|Ω| + log|Ω⁻¹ + X'Σ⁻¹X|
+ σ⁻²(y − c)'[Σ⁻¹ − Σ⁻¹X(Ω⁻¹ + X'Σ⁻¹X)⁻¹X'Σ⁻¹](y − c).
The diffuse loglikelihood function log L_D is defined as

ℓ_D(y; σ², ψ) = lim_{Ω⁻¹ → 0} { ℓ(y; σ², ψ) + ½ log|Ω| }, (11)

from which it follows that

−2 log L_D = −2ℓ_D(y; σ², ψ) = n log 2π + n log σ² + log|Σ| + log|X'Σ⁻¹X| + σ⁻²RSS, (12)

which is equivalent to (8) apart from the term log|X'Σ⁻¹X|. This result is due to De Jong (1991). The loglikelihood function (12) at the maximized location of σ² = σ̂² is given by

−2 log L_D^c = −2ℓ_D(y; σ̂², ψ) = n log 2π + n log RSS − n log n + log|Σ| + log|X'Σ⁻¹X| + n, (13)

which is equivalent to (9) apart from the term log|X'Σ⁻¹X|.
The definition of the diffuse loglikelihood function (11) may be regarded as somewhat ad hoc. For example, an alternative suggestion is to define the diffuse loglikelihood function as

ℓ_D(y; σ², ψ) = lim_{Ω⁻¹ → 0} { ℓ(y; σ², ψ) + ½ log|2πσ²Ω| }, (14)

see De Jong and Chu-Chun-Lin (1994). In light of definition (14), the likelihood functions (12) and (13) remain the same but with n replaced by m = n − k. The alternative definition in (14) becomes relevant in the discussion of the marginal likelihood function in the next subsection.
2.3 Marginal likelihood function
The concept of marginal likelihood was introduced by Kalbfleisch and Sprott (1970). The marginal likelihood function for model (4) is defined as the likelihood function that is invariant to the regression coefficient vector δ. Many contributions in the statistics literature have developed the concept of marginal likelihood further and have investigated this approach in more detail, for example, see Patterson and Thompson (1971), Harville (1974), King (1980), Smyth and Verbyla (1996), and Rahman and King (1997). In particular, McCullagh and Nelder (1989) consider the marginal likelihood function for the generalized linear model. The marginal likelihood function has also been adopted for the inference of nuisance parameters in time series models, for example, see Levenbach (1972), Cooper and Thompson (1977) and Tunnicliffe-Wilson (1989). In the linear model y = c + Xδ + u where u ~ N(0, σ²Σ) with X = X(ψ) and Σ = Σ(ψ), the marginal likelihood function is based on a transformed data vector y* = A'y that does not depend on δ. The transformation matrix A has dimension n × m with m = n − k, is of full column rank and is subject to A'X = 0. The loglikelihood function of y* is given by
−2ℓ(y*; σ², ψ) = m log 2π + m log σ² + log|A'ΣA| + σ⁻²(y − c)'A(A'ΣA)⁻¹A'(y − c), (15)

since A'X = 0. The equalities

(ΣA, X)'A(A'ΣA)⁻¹A' = (A, 0)', (ΣA, X)'Σ⁻¹M_X = (A, 0)',

imply that A(A'ΣA)⁻¹A' = Σ⁻¹M_X. Furthermore, since
|Σ| |A'A| |X'X| = |(A, X)'Σ(A, X)|
= |A'ΣA| |X'ΣX − X'ΣA(A'ΣA)⁻¹A'ΣX|
= |A'ΣA| |X'X(X'Σ⁻¹X)⁻¹X'X|
= |A'ΣA| |X'X|² |X'Σ⁻¹X|⁻¹,

the determinantal term in the density is |A'ΣA| = |Σ| |A'A| |X'X|⁻¹ |X'Σ⁻¹X|. Following Harville (1974) we normalize matrix A such that A'A = I_m and |A'A| = 1. The marginal likelihood function with respect to ψ is based on the density of y* = A'y,

−2ℓ(y*; σ², ψ) = m log 2π + m log σ² + log|Σ| + log|X'Σ⁻¹X| − log|X'X| + σ⁻²RSS. (16)
The marginal likelihood (16) is equivalent to (12) apart from the term −log|X'X| and n replaced by m. When the diffuse likelihood function is defined as in (14), the marginal likelihood only differs by the term −log|X'X|.
The variance scalar σ² can also be concentrated out from the marginal likelihood function. The marginal likelihood evaluated at the maximized value of σ² is given by

−2 log L_M^c = −2ℓ(y*; σ̂², ψ) = m log 2π + m log RSS − m log m + log|Σ| + log|X'Σ⁻¹X| − log|X'X| + m, (17)

and is equivalent to (13) apart from the term −log|X'X| and n replaced by m. Expressions (16) and (17) are new and convenient for our purposes below.
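The determinantal identity and expression (16) can be verified numerically (a sketch with arbitrary Σ and X, taking c = 0 and σ² = 1; the orthonormal A is built here from the SVD of X, one of several valid choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 7, 2
X = rng.normal(size=(n, k))
L_ = rng.normal(size=(n, n))
Sigma = L_ @ L_.T + n * np.eye(n)    # arbitrary positive definite Sigma
y = rng.normal(size=n)               # take c = 0 and sigma^2 = 1 for simplicity

# A: orthonormal basis of the null space of X' (A'X = 0, A'A = I_m)
U = np.linalg.svd(X, full_matrices=True)[0]
A = U[:, k:]
m = n - k

Si = np.linalg.inv(Sigma)
Sxx = X.T @ Si @ X
M = np.eye(n) - X @ np.linalg.solve(Sxx, X.T @ Si)   # M_X
RSS = y @ Si @ M @ y

# determinantal identity: |A' Sigma A| = |Sigma| |X' Sigma^{-1} X| / |X'X|
lhs = np.linalg.slogdet(A.T @ Sigma @ A)[1]
rhs = (np.linalg.slogdet(Sigma)[1] + np.linalg.slogdet(Sxx)[1]
       - np.linalg.slogdet(X.T @ X)[1])
print(np.isclose(lhs, rhs))          # True

# (15) equals (16): direct -2 loglikelihood of y* = A'y matches the RSS form
direct = (m * np.log(2 * np.pi) + lhs
          + y @ A @ np.linalg.solve(A.T @ Sigma @ A, A.T @ y))
via16 = m * np.log(2 * np.pi) + rhs + RSS
print(np.isclose(direct, via16))     # True
```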
2.4 Discussion of likelihood functions
The close resemblance of the diffuse and marginal likelihoods has been discussed by Shephard (1993) and Kuo (1999). Their marginal likelihood function does not have the term −log|X'X| in (16) and their marginal and diffuse likelihood functions are therefore proportional. They also argue that the marginal likelihood function is based on the density of a random variable and therefore the score function has zero expectation. Given that the difference between the profile and marginal likelihoods is the term log|X'Σ⁻¹X| − log|X'X|, this argument applies to the marginal likelihood based on the orthonormal transformation y* = A'y with A'X = 0 and A'A = I as shown in the previous subsection. The orthonormal transformation does not depend on ψ in linear models and therefore can be used for the inference of ψ. In other words, the term −log|X'X| in (16) and (17) cannot be ignored.
In case the regression model (4) implies a time series model in the state space form (1), matrix X and its dependence on ψ should be considered carefully. For stationary time series models without regression effects, this issue does not arise as δ is not present. When regression effects are present or the model includes nonstationary processes, coefficient vector δ is present and the dependence of covariate matrix X on ψ must be taken into account. The use of the marginal likelihood function is recommended for this class of linear time series models.
3 Evaluation of likelihood functions
The Kalman filter effectively carries out the prediction error decomposition for time series models in the state space representation (1), see Schweppe (1965) and Harvey (1989). The prediction error decomposition is based on

p(y) = p(y_1, . . . , y_T) = p(y_1) ∏_{t=2}^{T} p(y_t | Y_{t−1}),
where Y_t = {y_1, . . . , y_t}. The prediction error v_t = y_t − E(y_t | Y_{t−1}), with its variance matrix σ²F_t = Var(y_t | Y_{t−1}) = Var(v_t), is serially uncorrelated when the model is correctly specified. This implies that Var(v) = σ²F is block-diagonal with prediction error vector v = (v_1', . . . , v_T')' and associated variance matrix F = diag(F_1, . . . , F_T). The Kalman filter therefore carries out the Cholesky decomposition Σ = L⁻¹FL⁻¹', or F = LΣL'. The Kalman filter recursions are given by

v_t = y_t − Z_t a_t, F_t = Z_t P_t Z_t' + H_t,
K_t = T_t P_t Z_t' F_t⁻¹,
a_{t+1} = T_t a_t + K_t v_t, P_{t+1} = T_t P_t T_t' − K_t F_t K_t' + R_t Q_t R_t',
(18)
for t = 1, . . . , T and with a_1 = a and P_1 = CQ_0C'. Since v = L(y − c) and |L| = 1, the loglikelihood function (6) with δ = 0 can be expressed as

−2ℓ(y; 0, σ², ψ) = n log 2π + n log σ² + log(|L⁻¹| |F| |L⁻¹'|) + σ⁻²(y − c)'L'F⁻¹L(y − c)
= n log 2π + n log σ² + log|F| + σ⁻²v'F⁻¹v
= n log 2π + n log σ² + ∑_{t=1}^{T} log|F_t| + σ⁻² ∑_{t=1}^{T} v_t'F_t⁻¹v_t.
It follows that the Kalman filter can evaluate the likelihood function (6) with δ = 0 in a computationally efficient way.
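As an illustration (a sketch, not code from the paper), the Kalman filter recursions (18) reduce to scalar updates for the local level model y_t = μ_t + ε_t, μ_{t+1} = μ_t + η_t, a special case of (1) with Z_t = T_t = R_t = 1; the function returns the exact Gaussian loglikelihood via the prediction error decomposition, here with a known initial variance p1 rather than the exact initialisation of (3):

```python
import numpy as np

def kalman_loglik(y, sigma2_eps, sigma2_eta, a1=0.0, p1=10.0):
    """Gaussian loglikelihood of the local level model via the prediction
    error decomposition, using the recursions in (18) with Z_t = T_t = 1."""
    loglik, a, p = 0.0, a1, p1
    for yt in y:
        v = yt - a                        # v_t = y_t - Z_t a_t
        f = p + sigma2_eps                # F_t = Z_t P_t Z_t' + H_t
        k = p / f                         # K_t = T_t P_t Z_t' F_t^{-1}
        loglik -= 0.5 * (np.log(2 * np.pi) + np.log(f) + v * v / f)
        a = a + k * v                     # a_{t+1} = T_t a_t + K_t v_t
        p = p - k * f * k + sigma2_eta    # P_{t+1} = P_t - K_t F_t K_t' + Q_t
    return loglik
```

For a nonstationary initial level, p1 is often set to a large value as a crude numerical approximation of a diffuse initial condition.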
3.1 Evaluation of profile likelihood
The evaluation of the profile likelihood functions (8) and (9) focuses on

log|Σ| = log(|L⁻¹| |F| |L⁻¹'|) = log|F|,
(y − c)'Σ⁻¹M_X(y − c) = (y − c)'L'F⁻¹M̄_X L(y − c) = v'F⁻¹M̄_X v,

where

M̄_X = LM_X L⁻¹ = I − LX(X'L'F⁻¹LX)⁻¹X'L'F⁻¹ = I − V(V'F⁻¹V)⁻¹V'F⁻¹,
with V = LX. It follows that

RSS = q − s'S⁻¹s, where q = v'F⁻¹v, s = V'F⁻¹v, S = V'F⁻¹V. (19)

We note that q ≡ (y − c)'Σ⁻¹(y − c), s ≡ X'Σ⁻¹(y − c) and S ≡ X'Σ⁻¹X. Given that the Kalman filter evaluates the block elements of v = L(y − c) recursively, the columns of matrix V = LX = L(X_1, . . . , X_k), where X_i is the i-th column of X for i = 1, . . . , k, can be evaluated simultaneously and recursively in the following way
simultaneously and recursively in the following way
V_t = X_t − Z_t A_t, A_{t+1} = T_t A_t + K_t V_t, (20)

with A_1 = 0, where X_t denotes the t-th block row of X and V = (V_1', . . . , V_T')'. Further, we have

q = ∑_{t=1}^{T} v_t'F_t⁻¹v_t, s = ∑_{t=1}^{T} V_t'F_t⁻¹v_t, S = ∑_{t=1}^{T} V_t'F_t⁻¹V_t.
The Kalman filter with the additional recursion (20) is referred to as the diffuse Kalman filter and is developed by De Jong (1991).
The likelihood function (6), for any δ, and the profile loglikelihood functions log L_P and log L_P^c can be expressed by

−2 log L = n log 2π + n log σ² + log|F| + σ⁻²(v − Vδ)'F⁻¹(v − Vδ),
−2 log L_P = n log 2π + n log σ² + log|F| + σ⁻²(q − s'S⁻¹s),
−2 log L_P^c = n log 2π + n log(q − s'S⁻¹s) − n log n + log|F| + n,

which can be evaluated by the diffuse Kalman filter in a computationally efficient way.
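A minimal sketch of the recursions (18)–(20) for the local level model with α_1 = δ (so a = 0, A = 1, C = 0 in (3), giving X = (1, . . . , 1)'; illustrative code, not from the paper):

```python
import numpy as np

def diffuse_kalman_filter(y, sigma2_eps, sigma2_eta):
    """Diffuse Kalman filter for the local level model with alpha_1 = delta:
    returns q, s, S of (19) and log|F|."""
    a, A_t, p = 0.0, 0.0, 0.0           # a_1 = 0, A_1 = 0, P_1 = 0
    q = s = S = logdetF = 0.0
    for yt in y:
        x = 1.0                         # X_t, the t-th block row of X
        v = yt - a                      # v_t = y_t - Z_t a_t
        V = x - A_t                     # V_t = X_t - Z_t A_t, see (20)
        f = p + sigma2_eps              # F_t = Z_t P_t Z_t' + H_t
        k = p / f                       # K_t = T_t P_t Z_t' F_t^{-1}
        q += v * v / f                  # q = sum v_t' F_t^{-1} v_t
        s += V * v / f                  # s = sum V_t' F_t^{-1} v_t
        S += V * V / f                  # S = sum V_t' F_t^{-1} V_t
        logdetF += np.log(f)
        a += k * v                      # a_{t+1} = T_t a_t + K_t v_t
        A_t += k * V                    # A_{t+1} = T_t A_t + K_t V_t
        p = p - k * f * k + sigma2_eta  # P_{t+1}
    return q, s, S, logdetF
```

With σ² = 1, the concentrated profile loglikelihood then follows directly from these outputs as −2 log L_P^c = n log 2π + n log(q − s²/S) − n log n + log|F| + n.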
3.2 Evaluation of diffuse likelihood
The diffuse loglikelihood functions (12) and (13) are evaluated by

−2 log L_D = m log 2π + m log σ² + log|F| + log|S| + σ⁻²(q − s'S⁻¹s),
−2 log L_D^c = m log 2π + m log(q − s'S⁻¹s) − m log m + log|F| + log|S| + m,

respectively. Here we have replaced n by m and in effect have adopted definition (14) for the diffuse likelihood function. All terms can be evaluated by the diffuse Kalman filter.
3.3 Evaluation of marginal likelihood
The marginal loglikelihood differs from the diffuse loglikelihood by the term ½ log|X'X|. It follows from the design of X in (5), implied by the state space model (1), that the k × k matrix S* = X'X can be evaluated by the recursion

V*_t = Z_t A*_t, A*_{t+1} = T_t A*_t, t = 1, . . . , T, (21)

with A*_1 = A and S* = ∑_{t=1}^{T} V*_t'V*_t. The marginal loglikelihood functions are given by

−2 log L_M = m log 2π + m log σ² + log|F| + log|S| − log|S*| + σ⁻²(q − s'S⁻¹s),
−2 log L_M^c = m log 2π + m log(q − s'S⁻¹s) − m log m + log|F| + log|S| − log|S*| + m,

and are evaluated by the diffuse Kalman filter together with the additional recursion (21).
4 Illustrations
In this section we explore the differences between estimation based on the profile, diffuse and marginal likelihood functions. The diffuse and marginal likelihood functions have score functions with zero expectations since they are based on a random variable (the transformed data vector). The profile likelihood function, in contrast, does not have this property. The non-zero expectation of the score for the profile likelihood leads to a bias in the estimation of ψ. Shephard and Harvey (1990), Shephard (1993) and Kuo (1999) have investigated this in more detail in the context of estimating the signal-to-noise ratio of the stochastic trend model y_t = μ_t + ε_t with trend μ_t as the random walk process μ_{t+1} = μ_t + η_t and signal-to-noise ratio q = var(η_t) / var(ε_t). Based on a set of Monte Carlo studies, it is found that estimation of the signal-to-noise ratio q based on the profile likelihood leads to many zero estimates while the underlying data generating process used a strictly positive q value. Estimation based on the diffuse/marginal likelihood function reduces this bias substantially. In this section we confirm these findings and review the consequences for stationary, nonstationary and multivariate time series models. Furthermore, we argue that in cases of interest the marginal likelihood function (16) should be used rather than the profile or diffuse likelihood functions for parameter estimation. Since we focus on differences between likelihood functions, we present them explicitly in Table 1.
4.1 Stationary time series models
The state space form of a linear stationary time series model without regression effects has a state vector depending only on stationary processes and with initial condition (3) given by
Table 1: Differences between the loglikelihood functions. The loglikelihood functions log L_P, log L_D and log L_M refer to (8), (12) and (16), respectively, while log L_D* refers to the diffuse loglikelihood function as defined by (14), which is equal to (12) with n replaced by m. Matrices S and S* are defined below (20) and (21), respectively. The lower triangular part of the table represents the differences of the loglikelihood functions. The upper triangular part reports the differences in the data vector dimensions.

                 −2 log L_P          −2 log L_D    −2 log L_D*   −2 log L_M
−2 log L_P       0                   0             n − m         n − m
−2 log L_D       log|S|              0             n − m         n − m
−2 log L_D*      log|S|              0             0             0
−2 log L_M       log|S| − log|S*|    −log|S*|      −log|S*|      0
α_1 = a + Cη_0, that is, δ = 0. As a result, the matrix X is non-existent and the profile, marginal and diffuse likelihood functions are equivalent. In case the stationary time series model contains linear regression effects, the vector δ ≠ 0 in (3) represents the regression coefficients in the model. The resulting matrix X in (5) is exogenous and does not depend on ψ. The profile likelihood does not have the term log|S| = log|X'Σ⁻¹X| while only the marginal likelihood function has the term log|S*| = log|X'X|. Since |X'X| does not depend on ψ, this term does not affect the estimation of ψ.
4.2 Nonstationary time series models

In case of an autoregressive process u_t with autoregressive coefficient ρ, the initial condition can be specified as

u_1 = δ for ρ = 1, u_1 ~ N{0, σ² / (1 − ρ²)} for |ρ| < 1, (22)

with ρ as an unknown scalar. The specification of the initial condition (22) is coherent as the variance of u_1 goes to infinity for ρ → 1. The core of this problem is that the profile likelihood degenerates in the unit root. The marginal likelihood is well-defined for −1 < ρ ≤ 1 whereas the profile likelihood is zero when ρ = 1. Francke and de Vos (2007) show that unit root tests
based on the marginal likelihood ratio outperform other well-known tests in the literature. This
result holds specically for small samples.
4.3 Multivariate nonstationary time series models
The generality of the state space framework allows different state space representations of the same time series model. The likelihood value should not depend on the particular state space formulation that is used. However, we will show that the diffuse likelihood function can depend on the chosen formulation while the profile and marginal likelihood functions do not.
A convenient illustration is given in the context of multivariate time series models. Consider a model with random walk trends from which some trends are possibly common to all series. The N × 1 vector of observations y_t is then modelled by

y_t = μ + Λβ_t + ε_t, β_{t+1} = β_t + η_t, η_t ~ NID(0, I_r), (23)
for t = 1, . . . , T, where β_t is an r × 1 vector of independent random walks with r < N and μ is an N × 1 fixed unknown vector for which the first r elements are zero, μ = (0, . . . , 0, μ_{r+1}, . . . , μ_N)'. The N × r matrix of factor loadings Λ has unknown fixed elements which are collected in the parameter vector ψ. The properties of disturbance vector ε_t are not relevant for this illustration but ε_t is assumed Gaussian and independent of η_s for t, s = 1, . . . , T.
A valid state space formulation (1) of model (23) can be based on the N × 1 state vector α_t = (β_t', μ_{r+1}, . . . , μ_N)' with

Z_t = [ Λ_1  0 ; Λ_2  I_{N−r} ], T_t = I_N, R_t = [ I_r ; 0 ], Q_t = I_r, (24)

where Λ_1 consists of the first r rows of Λ and Λ_2 of the remaining N − r rows of Λ. Given
the nonstationary process for β_t, all initial values in α_t at t = 1 are treated as unknown coefficients and collected in vector δ of (3). The initial state condition for this time series model is therefore given by (3) with a = 0, A = I_N and C = 0. As a result, we have matrix X = (Z_1', . . . , Z_T')' in (5) that depends on ψ and therefore X = X(ψ). For this state space
formulation, the marginal and diffuse likelihood functions are different. It is easily shown that |S*| = |X'X| = |T Z_t'Z_t| = T^N |Λ_1|², with Z_t as defined in (24). A second state space representation of model (23) can, for instance, be based on the N × 1 state vector α_t = μ + Λβ_t, for which Z_t = I_N so that X does not depend on ψ. Denoting the matrices S* of the two representations by S*_1 and S*_2, we have S*_1 = Z'S*_2 Z with Z = Z_t as defined in the first state space representation (24). The determinantal terms |S*_1| and |S*_2| therefore differ by the term |Λ_1|², which depends on ψ. The transformation matrix A, subject to A'X = 0, does not depend on ψ (as required) when X does not depend on ψ. In cases where X depends on ψ in a linear way, matrix A still does not depend on ψ.
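The determinant claim for this formulation can be checked numerically (an illustrative sketch with arbitrary loadings Λ):

```python
import numpy as np

rng = np.random.default_rng(3)
N, r, T = 4, 2, 5
Lam = rng.normal(size=(N, r))               # loading matrix Lambda
Z = np.block([[Lam[:r], np.zeros((r, N - r))],
              [Lam[r:], np.eye(N - r)]])    # Z_t as in (24)
X = np.vstack([Z] * T)                      # X = (Z', ..., Z')' since T_t = I_N

lhs = np.linalg.det(X.T @ X)                # |S*| = |X'X|
rhs = T ** N * np.linalg.det(Lam[:r]) ** 2  # T^N |Lambda_1|^2
print(np.isclose(lhs, rhs))                 # True
```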
Figure 1: Marginal and diffuse loglikelihood functions for a bivariate version of the model (23), represented by the state space form (24), as functions of σ_ε = √Var(ε_t). The true value of σ_ε is 0.25.
To illustrate that the diffuse likelihood function may be inappropriate, we consider state space representation (24) for model (23) with N = 2 and r = 1. We simulate T = 100 observations from the bivariate common trend model (23) with μ = (0, 1)', Λ = (1, 0.1)', Var(η_t) = 1 and Var(ε_t) = 0.25² I_2. Figure 1 presents the marginal and diffuse loglikelihoods as functions of σ_ε = √Var(ε_t). The diffuse likelihood is clearly not proportional to the marginal likelihood while the maximum of the latter is in the neighborhood of the true value σ_ε = 0.25. The diffuse and marginal loglikelihood functions for the second state space representation are proportional to the marginal loglikelihood as depicted in Figure 1.
5 Conclusion
In this paper we have argued for the preference of the marginal likelihood function over the profile and diffuse likelihood functions when we estimate parameters in time series models with nonstationary components and unknown regression effects. In many cases, the diffuse and marginal likelihood functions are proportional to each other. However, in cases where the implied data transformation for the diffuse likelihood function depends on parameters, estimation based on the diffuse likelihood function will lead to unreliable results. For these cases, the marginal likelihood as defined by Harville (1974) and adapted for state space models in this paper should be considered since the implied data transformation does not depend on parameters in linear models.
ACKNOWLEDGEMENTS
We would like to thank Jacques J.F. Commandeur and Borus Jungbacker for their comments
on an earlier version of this paper. All errors are our own.
References
Anderson, B. D. O. and J. B. Moore (1979). Optimal Filtering. Englewood Cliffs: Prentice-Hall.

Ansley, C. F. and R. Kohn (1985). Estimation, filtering and smoothing in state space models with incompletely specified initial conditions. Annals of Statistics 13, 1286–1316.

Ansley, C. F. and R. Kohn (1990). Filtering and smoothing in state space models with partially diffuse initial conditions. Journal of Time Series Analysis 11, 275–293.

Cooper, D. M. and R. Thompson (1977). A note on the estimation of the parameters of the autoregressive-moving average process. Biometrika 64, 625–628.

De Jong, P. (1988). The likelihood for a state-space model. Biometrika 75, 165–169.

De Jong, P. (1991). The diffuse Kalman filter. The Annals of Statistics 19, 1073–1083.

De Jong, P. and S. Chu-Chun-Lin (1994). Stationary and non-stationary state space models. Journal of Time Series Analysis 15, 151–166.

Durbin, J. and S. J. Koopman (2001). Time Series Analysis by State Space Methods. Oxford University Press, Oxford.

Francke, M. K. and A. F. de Vos (2007). Marginal likelihood and unit roots. Journal of Econometrics 137, 708–728.

Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge.

Harville, D. A. (1974). Bayesian inference for variance components using only error contrasts. Biometrika 61, 383–385.

Kalbfleisch, J. D. and D. A. Sprott (1970). Application of likelihood methods to models involving large numbers of parameters. Journal of the Royal Statistical Society B 32, 175–208.

King, M. L. (1980). Robust tests for spherical symmetry and their application to least squares regression. The Annals of Statistics 8, 1265–1271.

Koopman, S. J. (1997). Exact initial Kalman filtering and smoothing for nonstationary time series models. Journal of the American Statistical Association 92, 1630–1638.

Kuo, B. S. (1999). Asymptotics of ML estimator for regression models with a stochastic trend component. Econometric Theory 15, 24–29.

Levenbach, H. (1972). Estimation of autoregressive parameters from a marginal likelihood function. Biometrika 59, 61–71.

McCullagh, P. and J. A. Nelder (1989). Generalized Linear Models (2nd ed.). London: Chapman & Hall.

Patterson, H. D. and R. Thompson (1971). Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545–554.

Rahman, S. and M. L. King (1997). Marginal-likelihood score-based tests of regression disturbances in the presence of nuisance parameters. Journal of Econometrics 82, 81–106.

Rosenberg, B. (1973). Random coefficients models: the analysis of a cross-section of time series by stochastically convergent parameter regression. Annals of Economic and Social Measurement 2, 399–428.

Schweppe, F. (1965). Evaluation of likelihood functions for Gaussian signals. IEEE Transactions on Information Theory 11, 61–70.

Shephard, N. (1993). Maximum likelihood estimation of regression models with stochastic trend components. Journal of the American Statistical Association 88, 590–595.

Shephard, N. and A. C. Harvey (1990). On the probability of estimating a deterministic component in the local level model. Journal of Time Series Analysis 11, 339–347.

Smyth, G. K. and A. P. Verbyla (1996). A conditional likelihood approach to REML in generalized linear models. Journal of the Royal Statistical Society B 58, 565–572.

Tunnicliffe-Wilson, G. (1989). On the use of marginal likelihood in time series model estimation. Journal of the Royal Statistical Society B 51, 15–27.