
Compound Markov Mixture Models with

Applications in Finance
John Geweke, University of Iowa
Giovanni Amisano, University of Brescia
October 2003

Abstract
PRELIMINARY AND INCOMPLETE. COMMENTS WELCOME. This
paper generalizes the Markov finite mixture of normals model, by permitting
each component of the mixture to be itself a mixture of normal distributions.
The resulting model provides a flexible and elegant characterization of station-
ary multivariate time series that incorporates serial correlation, persistence in
conditional variance and higher conditional moments, and an arbitrarily good
($L_1$) approximation to unconditional distributions. Constraints may also be im-
posed that eliminate serial correlation while retaining serial dependence. The
paper illustrates the model using foreign exchange returns.

1. Introduction
Portfolio allocation in the presence of risk aversion and limited information motivates
asset pricing models. With the observation that volatility and the structure of con-
ditional variance are time varying, accompanied by the emergence of formal models
accommodating these features, econometrics now supports portfolio decisions on a
day-to-day basis. But whereas portfolio allocation problems are inherently multivari-
ate, most of the econometrics has been univariate. While univariate models are a first
step, there is an urgent need to move on to multivariate modeling of the time-varying
distribution of asset returns. The econometrics literature has begun to address this
need in the past few years.

1.1. MGARCH models


The generalized autoregressive conditional heteroscedasticity (GARCH) family of
models introduced in [2] has provided the most popular framework for modeling the
well established persistence in volatility observed in most asset returns. The general
form of the univariate model is

$$y_t = \beta'x_t + \varepsilon_t, \quad \varepsilon_t \sim N(0, h_t), \qquad (1.1)$$
$$h_t = \sum_{i=1}^{p}\gamma_i h_{t-i} + \sum_{i=1}^{q}\delta_i\varepsilon_{t-i}^2, \qquad (1.2)$$

in which yt is the asset return at time t and xt is a vector of deterministic time series,
such as an intercept or indicators for days of the week. In most applications p = q = 1
has proven adequate. Maximum likelihood estimation of the model is straightforward,
and it accounts well for much persistence in volatility. Normality of εt may be replaced
by alternative assumptions (like the Student-t distribution) permitting even greater
leptokurtosis, and these alternatives often provide a superior fit.
Portfolio allocation problems demand consideration of more than one time series
simultaneously. The obvious multivariate extension of (1.1) to an n × 1 vector of asset
returns yt is
$$y_t = B'x_t + \varepsilon_t, \quad \varepsilon_t \sim N(0, H_t), \qquad (1.3)$$
but just how to model evolution in Ht is not so clear. The original multivariate gen-
eralized autoregressive conditional heteroscedasticity (MGARCH) model, developed
in [3], specified

$$\mathrm{vech}(H_t) = \Gamma\,\mathrm{vech}(H_{t-1}) + \Delta\,\mathrm{vech}(\varepsilon_{t-1}\varepsilon_{t-1}'), \qquad (1.4)$$

in which vech(A) vectorizes the lower triangle of a symmetric matrix A. This gener-
alization suffers from two practical problems: the size of Γ and ∆ grows profligately
with n, and conditions for stationarity are awkward to impose.
A more parsimonious formulation addressing these problems is [7]

$$H_t = A + \Gamma H_{t-1}\Gamma' + \Delta\varepsilon_{t-1}\varepsilon_{t-1}'\Delta'. \qquad (1.5)$$

For n = 10, (1.4) has over 5,000 parameters, whereas (1.5) has 210, and stationarity
conditions in (1.5) are more manageable than in (1.4), though still far from trivial.
Both models share two limitations, one general and the other important in ap-
plications to asset returns. The general problem is that MGARCH models are not
closed with respect to marginalization [18]. If yjt is a constituent of the time series
models (1.3)-(1.4) or (1.3)-(1.5), then it does not, in general, obey (1.1)-(1.2). The
same is true of any set of n∗ linear combinations of yt so long as n∗ < n.
The problem specific to asset returns is that both conditional and unconditional
distributions of εt in MGARCH (and, for that matter, εt in GARCH) are symmetric,
whereas skewness, and the related phenomenon of leverage, are well documented in
the empirical asset return literature. It also arises in more general models that exhibit
locally normal behavior but with quadratic variation [1].

1.2. Normal mixture models


In a normal mixture model, the $n \times 1$ vector of observables $y_t$ occupies one of $m$ discrete states, denoted by the latent variable $s_t$ taking on the values $1, \ldots, m$. If $s_t = i$, then
$$y_t \mid (x_t, s_t = i) \sim N(B'x_t + \phi_i, \Sigma_i). \qquad (1.6)$$

The process $\{s_t\}$ is i.i.d. multinomial with $P(s_t = i) = p_i$ $(i = 1, \ldots, m)$ and $\sum_{i=1}^{m}p_i = 1$. Writing $y_t = B'x_t + \varepsilon_t$, we have
$$p(\varepsilon_t) = (2\pi)^{-n/2}\sum_{i=1}^{m}p_i\,|\Sigma_i|^{-1/2}\exp\left(-\varepsilon_t'\Sigma_i^{-1}\varepsilon_t/2\right). \qquad (1.7)$$

Note that this family of models is closed with respect to marginalization: if zt = Ayt
and A is g × n of rank g, then zt obeys a normal mixture model that can be derived
trivially from (1.6).
This formulation provides considerable flexibility in the distribution of $\varepsilon_t$. In fact, the density function (1.7) can be made to approximate any continuous p.d.f. arbitrarily well in $L_1$. This result is established in [8] and [9].
From [8] it is clear that the approximations are valid even with the same variance,
Σ = Σi , across states: it is the variation in the φi that makes the approximation result
go through. Allowing variation in the Σi can, however, greatly reduce the number of
states that are required to achieve the same quality of approximation. In our work we
have found that the form
$$y_t \mid (x_t, s_t = i) \sim N\left(B'x_t + \phi_i,\ h_i^{-1}H^{-1}\right) \qquad (1.8)$$

provides a good balance between the goals of approximating the distribution well,
working with a tidy parametric form, and creating stable and reliable numerical al-
gorithms for inference. ([14] provides some detail on this point.) In (1.8) the variance
matrices Σi of the normal mixture are proportional across states, and we replace
variance with the inverse counterpart, precision.
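As a concrete illustration, the sketch below (Python with NumPy and SciPy; the numerical values are hypothetical, not from the paper) evaluates the mixture density (1.7) for the case (1.8) in which the variance matrices are proportional across states.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(eps, p, phi, h, H):
    """Evaluate the normal mixture density (1.7) at eps, with state probabilities p,
    state means phi[i], and variances h[i]^{-1} H^{-1} as in (1.8)."""
    Hinv = np.linalg.inv(H)
    return sum(p_i * multivariate_normal.pdf(eps, mean=phi_i, cov=Hinv / h_i)
               for p_i, phi_i, h_i in zip(p, phi, h))

# Example: a two-component mixture in n = 2 dimensions (illustrative values only).
p   = np.array([0.6, 0.4])
phi = [np.array([-1.0, 0.0]), np.array([1.5, 0.0])]   # weighted means sum to zero, as in (1.9)
h   = np.array([1.0, 0.5])
H   = np.eye(2)
print(mixture_density(np.zeros(2), p, phi, h, H))
```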
It is advantageous to include an intercept term in xt or in the space spanned by
the xt , for the purpose of comparing models. For example, in using posterior odds
ratios to compare MGARCH and Markov mixture models–a project not included in
this paper–it is useful to include intercepts in all models, each with the same prior,
in order to keep the comparisons on an equal footing. We then achieve identification
in (1.8) by requiring
$$E(y_t - B'x_t) = 0, \qquad (1.9)$$
which is equivalent to $\sum_{i=1}^{m}\phi_ip_i = 0$. Denoting $p' = (p_1, \ldots, p_m)$ and $\Phi' = [\phi_1 \cdots \phi_m]$, this is equivalent to $p'\Phi = 0'$. The form $h_i^{-1}H^{-1}$ for the variance in (1.8) is chosen to facilitate comparison with a conventional multivariate normal model $y_t \sim N\left(B'x_t, H^{-1}\right)$. The same prior distribution for $H$ may be used in both models.
The parameters hi and H remain identified only up to a factor of proportionality in
(1.8). We return to a detailed consideration of these points in Sections 3.1 and 4.1.
The normal mixture model, and indeed all of the mixture models considered in this
paper, are characterized by a deeper identification problem, which often goes under
the moniker of labelling. It refers to the fact that any permutation of state indices
and parameters will leave the implied distribution of {yt } unaffected. If one wishes
to learn about properties of the individual states, then this problem is profound:
see, for example [4] and the example of separating clusters of galaxies that has been
carried through this literature. In applications to financial data these properties are

irrelevant. The mixture model is simply a convenient way of generalizing a multivariate
density — in much the same way that nonparametric methods do, but without their
usual curse of dimensionality. We return in greater detail to the ability of mixtures to
approximate densities in Section 2. The labelling question, while posing no problem
for identification in applications to finance, is nevertheless an important technical problem in inference, and this issue will be addressed in Section 4.2.

1.3. Markov normal mixture models


The Markov normal mixture model maintains the same structure except that the
latent states st are not independent. Instead

$$P[s_t = j \mid s_{t-1} = i,\ s_{t-u}\ (u > 1)] = p_{ij} \quad (i = 1, \ldots, m;\ j = 1, \ldots, m). \qquad (1.10)$$

Thus $s_t$ evolves as a first order, $m$-state Markov chain. The chain is completely characterized by the $m \times m$ Markov transition matrix $P = [p_{ij}]$. In view of the fact that $\sum_{j=1}^{m}p_{ij} = 1$ $(i = 1, \ldots, m)$, at least one of the eigenvalues of $P$ is 1; and since $P(s_{t+u} = j \mid s_t = i) = [P^u]_{ij}$, no eigenvalue can exceed one in modulus. If all other $m - 1$ eigenvalues have modulus strictly less than one, then there is a unique vector $\pi$ with the properties $\pi'P = \pi'$ and $\pi'e_m = 1$. (Here, and throughout, $e_n$ denotes an $n \times 1$ vector of units.) It can be shown that the elements of $\pi$ are real and nonnegative. The Markov normal mixture model permits serial correlation and persistence in higher moments, and is closed with respect to marginalization.
A transition matrix that has two or more real eigenvalues of unit modulus is
reducible: that is, it is impossible to move between certain states no matter how
many periods elapse. If there are complex eigenvalues of modulus unity they must
occur in complex pairs, and such a transition matrix renders the process st strictly
periodic. Simple examples of reducible and periodic $P$ are
$$P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & .7 & .3 \\ 0 & .4 & .6 \end{bmatrix} \quad \text{and} \quad P = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix},$$

respectively. Neither property is reasonable for the latent state st in these models,
and the prior distributions introduced in Section 3.1 assign probability zero to such
transition matrices P.
Given (1.10), if $P(s_t = i) = p_{ti}$ and $p_t = (p_{t1}, \ldots, p_{tm})'$, then $p_{t+u}' = p_t'P^u$. If $p_1 = \pi$, then $p_t = \pi$ for all $t > 0$; $\pi$ is the invariant distribution of $s_t$. In the special case $P = e_m\pi'$, the states $s_t$ are serially independent. In all that follows we shall assume that $y_t - B'x_t$ is stationary, and therefore $P(s_1) = \pi$. Condition (1.9) is then equivalent to $\sum_{i=1}^{m}\phi_i\pi_i = 0$, or $\pi'\Phi = 0'$.
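Numerically, the invariant distribution is the normalized left eigenvector of $P$ associated with the unit eigenvalue. A minimal sketch (Python/NumPy; the transition matrix values are illustrative only):

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution pi of an irreducible, aperiodic transition matrix P:
    the solution of pi'P = pi' with elements summing to one."""
    w, v = np.linalg.eig(P.T)                     # left eigenvectors of P
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

P = np.array([[0.9, 0.1],
              [0.3, 0.7]])                        # illustrative values
pi = stationary_distribution(P)
print(pi)                                         # [0.75 0.25]
print(np.allclose(pi @ P, pi))                    # invariance check: True
P_indep = np.outer(np.ones(2), pi)                # serial independence: P = e_m pi'
```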
The Markov normal mixture model shares with MGARCH the undesirable asso-
ciation of absence of serial correlation with symmetry of the unconditional density
$p(y_t - B'x_t)$ and the conditional densities
$$p\left[(y_{t+u} - B'x_{t+u}) \mid (y_{t-j} - B'x_{t-j})\ (j \ge 0)\right].$$

(A precise statement of this relationship is given below in Section 2.2.) This limitation
prompts us to take up an extension of this model that permits asymmetric distrib-
utions in the absence of serial correlation and remains closed under marginalization.
These models have a number of interesting population properties, also discussed in
Section 2. Section 3 introduces a simple but flexible class of prior distributions from
which implications of the model for moments and other aspects of time series are easy
to simulate. The posterior distribution is derived in Section 4, along with Markov
chain Monte Carlo algorithms for drawing from the posterior distribution and approx-
imating the marginal likelihood. The power of formal Bayesian methods to discrimi-
nate among alternative structures is examined by means of some experiments whose
results are reported in Section 5. An illustrative application to foreign exchange
data is provided in Section 6. Findings there affirm the absence of serial correlation, the presence of asymmetry, the existence of persistence in higher moments, and the fact that normal, normal mixture, Markov normal mixture, and compound Markov normal mixture models provide successively better characterizations of the multivariate foreign exchange returns.

2. The model
In a compound Markov normal mixture, the distributions in some, or all, of the states are not normal but simple mixtures of normals. As a simple example, suppose that there are two states. In the first state the distribution is a simple mixture of $N(-.5, 1)$ $(p = 2/3)$ and $N(1, .5^2)$ $(p = 1/3)$. In the second state the distribution is a simple mixture of $N(.8, .7^2)$ $(p = .5)$ and $N(-.8, .3^2)$ $(p = .5)$. The transition matrix is
$$P = \begin{bmatrix} .6 & .4 \\ .3 & .7 \end{bmatrix}.$$
This is equivalent to a Markov normal mixture having four states, component distributions $N(-.5, 1)$, $N(1, .5^2)$, $N(.8, .7^2)$, $N(-.8, .3^2)$, and transition matrix
$$P^* = \begin{bmatrix} .4 & .2 & .2 & .2 \\ .4 & .2 & .2 & .2 \\ .2 & .1 & .35 & .35 \\ .2 & .1 & .35 & .35 \end{bmatrix}. \qquad (2.1)$$
Some additional terminology and notation is useful in working with compound
Markov normal mixtures. Let each state st have a persistent component st1 and a
transitory component st2 , so that st = (st1 , st2 ). There are m1 persistent states,
and corresponding to each persistent state j there are m2 transitory states. In the
transition between time t − 1 and time t, first the persistent state st1 is chosen, and
then the transitory state st2 is chosen. The choice of st1 depends only on st−1,1 , with
P (st1 = j | st−1,1 = i) = pij . The choice of st2 depends only on st1 : P (st2 = k | st1 = j) =
ρjk . Thus the choice of transitory states is a simple multinomial, just as in the simple
normal mixture model.
The compound Markov normal mixture model may be regarded as an $m_1m_2$-state Markov normal mixture model, with a special structure for the $m_1m_2 \times m_1m_2$ transition matrix $P^*$, reflected in (2.1): if $[(i-1)/m_2] = [(k-1)/m_2]$ then $p^*_{ij} = p^*_{kj}$ $(j = 1, \ldots, m_1m_2)$ and $p^*_{ji}/p^*_{jk} = p^*_{\ell i}/p^*_{\ell k}$ $(j, \ell = 1, \ldots, m_1m_2)$. This restriction on $P^*$, which occurs with probability zero under any smooth prior for $P^*$, permits asymmetry in the absence of serial correlation: the corresponding constraint on $\phi$ is $\sum_{i=1}^{m_2}\phi_{km_2+i}\,p^*_{j,km_2+i} = 0$ $(k = 0, \ldots, m_1 - 1)$, which will be true for all $j$ if it is true of any $j = 1, \ldots, m_1m_2$. A more illuminating and compact interpretation of the compound Markov normal mixture model is as a Markov mixture of the normal mixture models described in Section 1.2.
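The mapping from $(P, R)$, with $R$ the matrix of transitory-state probabilities introduced below, to the expanded transition matrix $P^*$ is a one-line computation. The following sketch (Python/NumPy) reproduces (2.1) from the two-state example above.

```python
import numpy as np

# Compound-state transition matrix P* implied by P and R:
# P*[(i,j),(k,l)] = p_ik * rho_kl, compound states ordered (1,1),(1,2),(2,1),(2,2).
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])
R = np.array([[2/3, 1/3],
              [0.5, 0.5]])
m1, m2 = R.shape

P_star = (np.repeat(P, m2, axis=0)[:, :, None] * R[None, :, :]).reshape(m1 * m2, m1 * m2)
print(P_star)
# [[0.4  0.2  0.2  0.2 ]
#  [0.4  0.2  0.2  0.2 ]
#  [0.2  0.1  0.35 0.35]
#  [0.2  0.1  0.35 0.35]]   matches (2.1)
```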

2.1. Exposition
For each of t = 1, . . . , T periods there are three kinds of variables in the compound
Markov normal mixture model. The k × 1 vector xt denotes deterministic variables
for period t, such as an intercept or indicators for days of the week. The latent 1 × 2
vector st = (st1 , st2 ) denotes the persistent and transitory states in period t. The
observable $n \times 1$ vector $y_t$ denotes the asset returns of interest. The three sets of variables can be expressed compactly in the $T \times k$ matrix $X = [x_1, \ldots, x_T]'$, the $T \times 2$ matrix $s = [s_1', \ldots, s_T']'$, and the $T \times n$ matrix $Y = [y_1, \ldots, y_T]'$.
The latent states $s_t$ evolve independently of the deterministic variables $x_t$. Let $s^1 = (s_{11}, \ldots, s_{T1})'$. Then
$$p(s^1 \mid X) = \pi_{s_{11}}\prod_{t=2}^{T}p_{s_{t-1,1}s_{t1}} = \pi_{s_{11}}\prod_{i=1}^{m_1}\prod_{j=1}^{m_1}p_{ij}^{T_{ij}}, \qquad (2.2)$$
where $T_{ij}$ is the number of transitions from persistent state $i$ to $j$ in $s^1$. The $m_1 \times m_1$ Markov transition matrix $P$ is irreducible and aperiodic, and $\pi = (\pi_1, \ldots, \pi_{m_1})'$ is the unique stationary distribution of $\{s_{t1}\}$. Let $s^2 = (s_{12}, \ldots, s_{T2})'$ denote all $T$ transitory states. Then
$$p(s^2 \mid s^1, X) = \prod_{t=1}^{T}\rho_{s_{t1}s_{t2}} = \prod_{i=1}^{m_1}\prod_{j=1}^{m_2}\rho_{ij}^{U_{ij}}, \qquad (2.3)$$
where $U_{ij}$ is the number of occurrences of $s_t = (i, j)$ $(t = 1, \ldots, T)$.


The observables $y_t$ depend on the latent states $s_t$ and the deterministic variables $x_t$. If $s_t = (i, j)$ then
$$y_t = B'x_t + \phi_i + \psi_{ij} + \varepsilon_t; \qquad \varepsilon_t \sim N\left[0, (h_ih_{ij}H)^{-1}\right]. \qquad (2.4)$$
Conditional on $(x_t, s_t)$ $(t = 1, \ldots, T)$ the $y_t$ are independent. From (2.4) one expression for this distribution is
$$p(Y \mid s, X) = (2\pi)^{-Tn/2}|H|^{T/2}\prod_{i=1}^{m_1}h_i^{T_in/2}\prod_{i=1}^{m_1}\prod_{j=1}^{m_2}h_{ij}^{U_{ij}n/2}\exp\left[-\sum_{i=1}^{m_1}\sum_{j=1}^{m_2}h_ih_{ij}\sum_{t:\,s_t=(i,j)}\varepsilon_t'H\varepsilon_t/2\right], \qquad (2.5)$$
where $T_i$ is the number of occurrences of $s_{t1} = i$ in $s^1$ $(t = 1, \ldots, T)$.
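For concreteness, a minimal simulator of the data density (2.2)-(2.4) follows (Python/NumPy; the function signature and array conventions are ours, not the paper's).

```python
import numpy as np

def simulate_cmnm(T, P, R, Phi, Psi, h, h2, H, B=None, X=None, rng=None):
    """Simulate T observations from the compound Markov normal mixture (2.2)-(2.4).
    Phi is m1 x n, Psi is a list of m1 arrays of shape m2 x n, h is (m1,),
    h2 is (m1, m2), H is the n x n precision matrix.  A sketch only; the
    regression term B'x_t is omitted when B or X is None."""
    rng = rng or np.random.default_rng(0)
    m1, m2 = R.shape
    n = H.shape[0]
    Sigma = np.linalg.inv(H)
    # stationary distribution of the persistent chain
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))]); pi /= pi.sum()
    s1 = np.empty(T, dtype=int); s2 = np.empty(T, dtype=int); Y = np.empty((T, n))
    s1[0] = rng.choice(m1, p=pi)
    for t in range(T):
        if t > 0:
            s1[t] = rng.choice(m1, p=P[s1[t - 1]])
        s2[t] = rng.choice(m2, p=R[s1[t]])
        mean = Phi[s1[t]] + Psi[s1[t]][s2[t]]
        if B is not None and X is not None:
            mean = mean + B.T @ X[t]
        cov = Sigma / (h[s1[t]] * h2[s1[t], s2[t]])
        Y[t] = rng.multivariate_normal(mean, cov)
    return Y, s1, s2
```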
Some alternative expressions of (2.4) and (2.5) are useful subsequently. Define the $n \times m_1$ matrix $\Phi' = [\phi_1, \ldots, \phi_{m_1}]$, the $n \times m_2$ matrices $\Psi_j' = [\psi_{j1}, \ldots, \psi_{jm_2}]$ $(j = 1, \ldots, m_1)$, and the $n \times m_1m_2$ matrix $\Psi' = [\Psi_1', \ldots, \Psi_{m_1}']$. Let the $m_2 \times 1$ vectors $\rho_j$ denote the transitory state probabilities $\rho_j = (\rho_{j1}, \ldots, \rho_{jm_2})'$ within each persistent state $j = 1, \ldots, m_1$, and define the $m_1 \times m_2$ matrix $R = [\rho_1, \ldots, \rho_{m_1}]'$. For $t = 1, \ldots, T$ let $z_t^1$ be an $m_1 \times 1$ vector of dichotomous variables in which $z_{tj}^1 = 1$ if $s_{t1} = j$ and $z_{tj}^1 = 0$ otherwise. Similarly define the $m_2 \times 1$ vector $z_t^2$ with $z_{tj}^2 = 1$ if $s_{t2} = j$ and $z_{tj}^2 = 0$ if not. Let $z_t = z_t^1 \otimes z_t^2$ $(t = 1, \ldots, T)$. Then from (2.4),

$$y_t = B'x_t + \Phi'z_t^1 + \Psi'z_t + \varepsilon_t \quad (t = 1, \ldots, T). \qquad (2.6)$$
It will be necessary to take explicit account of the linear restrictions imposed on $\Phi$ and $\Psi$. The unconditional mean of the persistent states is 0, which is equivalent to $\Phi'\pi = 0$. Let the $m_1 \times (m_1 - 1)$ matrix $C_0$ be the orthonormal complement of $\pi$: that is, $\pi'C_0 = 0'$ and $C_0'C_0 = I_{m_1-1}$. Define $\tilde\Phi = C_0'\Phi$ and note that $\Phi = C_0\tilde\Phi$ because $\pi'\Phi = 0'$.

The unconditional mean of the transitory states within each persistent state is 0, which is equivalent to $\Psi_j'\rho_j = 0$ $(j = 1, \ldots, m_1)$. Let $C_j$ be an $m_2 \times (m_2 - 1)$ orthonormal complement of $\rho_j$, define $\tilde\Psi_j = C_j'\Psi_j$, and note that $\Psi_j = C_j\tilde\Psi_j$ $(j = 1, \ldots, m_1)$. Construct the $m_1m_2 \times m_1(m_2-1)$ block diagonal matrix $C = \mathrm{Blockdiag}[C_1, \ldots, C_{m_1}]$ and the $n \times m_1(m_2-1)$ matrix $\tilde\Psi' = [\tilde\Psi_1', \ldots, \tilde\Psi_{m_1}']$. Then $\Psi = C\tilde\Psi$, and substituting in (2.6),
$$y_t = B'x_t + \tilde\Phi'C_0'z_t^1 + \tilde\Psi'C'z_t + \varepsilon_t. \qquad (2.7)$$
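A convenient way to construct the complements $C_0$ and $C_j$ numerically is the full QR decomposition; any orthonormal complement will do. A minimal sketch (Python/NumPy; the vector below is illustrative):

```python
import numpy as np

def orthonormal_complement(v):
    """Return an m x (m-1) matrix C with v'C = 0' and C'C = I_{m-1},
    as required for C_0 (complement of pi) and C_j (complement of rho_j)."""
    v = np.asarray(v, dtype=float).reshape(-1, 1)
    Q, _ = np.linalg.qr(v, mode="complete")   # first column of Q spans v
    return Q[:, 1:]

pi = np.array([0.75, 0.15, 0.10])             # illustrative stationary distribution
C0 = orthonormal_complement(pi)
print(np.allclose(pi @ C0, 0.0), np.allclose(C0.T @ C0, np.eye(2)))   # True True
```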
Define the $nk \times 1$ vector $\beta = \mathrm{vec}(B')$, the $n(m_1-1) \times 1$ vector $\tilde\phi = \mathrm{vec}(\tilde\Phi')$, and the $nm_1(m_2-1) \times 1$ vector $\tilde\psi = \mathrm{vec}(\tilde\Psi')$. Then (2.7) becomes¹
$$y_t = (x_t' \otimes I_n)\beta + (z_t^{1\prime}C_0 \otimes I_n)\tilde\phi + (z_t'C \otimes I_n)\tilde\psi + \varepsilon_t. \qquad (2.8)$$
This expression has the form $y_t = W_t'\gamma + \varepsilon_t$, in which the $n(k + m_1m_2 - 1) \times 1$ vector $\gamma = (\beta', \tilde\phi', \tilde\psi')'$ and the $n \times n(k + m_1m_2 - 1)$ matrix $W_t'$ is
$$W_t' = \left[\,x_t' \otimes I_n \quad z_t^{1\prime}C_0 \otimes I_n \quad z_t'C \otimes I_n\,\right]. \qquad (2.9)$$
Thus conditional on the latent states $s_t$ (equivalently $z_t^1$ and $z_t^2$) $(t = 1, \ldots, T)$, and given the restrictions on the state means, (2.4) is a linear regression model with highly structured heteroscedasticity. If we take $\delta_t = h_{s_{t1}}h_{s_{t1}s_{t2}}$, then
$$p(Y \mid s, X) = (2\pi)^{-Tn/2}|H|^{T/2}\prod_{t=1}^{T}\delta_t^{n/2}\exp\left[-\sum_{t=1}^{T}\delta_t\varepsilon_t'H\varepsilon_t/2\right]$$
$$= (2\pi)^{-Tn/2}|H|^{T/2}\prod_{t=1}^{T}\delta_t^{n/2}\exp\left[-\sum_{t=1}^{T}\delta_t(y_t - W_t'\gamma)'H(y_t - W_t'\gamma)/2\right]. \qquad (2.10)$$

¹We make use, here and elsewhere, of the fact that for any $k \times m$ matrix $A$ and $m \times n$ matrix $B$, $\mathrm{vec}(AB) = (I_n \otimes A)\mathrm{vec}(B) = (B' \otimes I_k)\mathrm{vec}(A)$.

It will sometimes be of interest to impose the constraint that the asset returns yt
are serially uncorrelated.

Theorem 1. Conditional on xt (t = 1, . . . , T ), the observables yt (t = 1, . . . , T ) are


serially uncorrelated if Φ = 0. Suppose further that P is irreducible and aperiodic
and its eigenvalues are distinct. Then the observables yt are serially uncorrelated if
and only if Φ = 0.

Proof. Using the methods of [19] for the univariate Markov normal mixture model,
$$\mathrm{cov}(y_t, y_{t-s} \mid x_1, \ldots, x_T) = \Phi'(P - e_{m_1}\pi')^s\,\Phi. \qquad (2.11)$$
If the eigenvalues of $P$ are distinct then $P$ is diagonalizable and it has spectral decomposition $P = Q^{-1}\Lambda Q$. We may then write
$$P - e_{m_1}\pi' = Q^{-1}(\Lambda - I_{m_1})Q. \qquad (2.12)$$
Since $Q$ is nonsingular, it suffices to show that $Q\Phi = 0$. The first row of $Q\Phi$ is $q_1'\Phi \propto \pi'\Phi = 0'$ by construction. Substitute (2.12) in (2.11) and define $\tilde\Lambda = \mathrm{diag}(0, \lambda_2 - 1, \ldots, \lambda_{m_1} - 1)$. Then absence of serial correlation is equivalent to
$$\Phi'Q^{-1}\tilde\Lambda^sQ\Phi = 0 \quad (s = 1, 2, \ldots). \qquad (2.13)$$

Since the eigenvalues are distinct, for each $i$ there exist coefficients $a_{2i}, \ldots, a_{m_1i}$ such that
$$\sum_{j=2}^{m_1}a_{ji}\,\tilde\lambda_\ell^{\,j} = 0 \quad (\ell = 2, \ldots, m_1;\ \ell \neq i), \qquad \sum_{j=2}^{m_1}a_{ji}\,\tilde\lambda_i^{\,j} \neq 0,$$
where $\tilde\lambda_\ell = \lambda_\ell - 1$. Hence $\Phi'Q^{-1}D_iQ\Phi = 0$, where $D_i$ has all elements 0 except for a unit in the $i$'th row and column. Summing over $i = 2, \ldots, m_1$, and remembering $q_1'\Phi = 0'$, we obtain $\Phi'\Phi = 0$, so $\Phi = 0$.
If $P$ is diagonalizable then it has spectral decomposition $P = Q^{-1}\Lambda Q$ and we may write $P - e_{m_1}\pi' = Q^{-1}(\Lambda - I_{m_1})Q$. If $P$ is irreducible and aperiodic then only one diagonal element of $\Lambda - I_{m_1}$ (the first, without loss of generality) is zero. The first row of $Q$ is proportional to $\pi'$, and $\pi'$ is linearly independent of the remaining rows in $Q$. Hence $Q\Phi = 0$.
Thus all that is required to impose absence of serial correlation is to omit the term $z_t^{1\prime}C_0 \otimes I_n$ from $W_t'$ and $\tilde\phi$ from $\gamma$. Henceforth we refer to $\Phi = 0$ as Case I. In Case II this constraint is not imposed. The indicator variable $\zeta = 0$ will denote Case I, and $\zeta = 1$ Case II.

2.2. Linear representations for Markov normal mixture processes
With a sufficiently large number of states, m1 and m2 , the compound Markov normal
mixture model is quite flexible and can accommodate many kinds of persistence in
higher moments, even in the absence of serial correlation. The flexibility in the repre-
sentation of parameters is chiefly a function of the Markov transition matrix P. Here
we develop the relationship between P and linear representations for higher moments.
The results pertain to a superset of compound Markov normal mixture models.
We are concerned with the moments of $\varepsilon_t$ in $y_t = B'x_t + \varepsilon_t$, and for consistency of notation it is simpler to take $B = 0$ and deal directly with $y_t$. Recall that $y_t$ is $n \times 1$, $y_t = (y_{t1}, \ldots, y_{tn})'$. For any integer $h > 0$, define
$$y_t^h = \left(y_{t1}^h, \ldots, y_{tn}^h\right)'$$
and for any $p > 0$ the $pn \times 1$ vector of mixed powers
$$z_t^{(p)} = \left(y_t^{1\prime}, \ldots, y_t^{p\prime}\right)'.$$
Our characterization will depend on the moments
$$\mu_j^h = E\left(y_t^h \mid s_t = j\right) \quad (j = 1, \ldots, m;\ h = 1, \ldots, p).$$
Take the $pn \times 1$ vectors of moments
$$\mu_j^{(p)} = E\left(z_t^{(p)} \mid s_t = j\right) = \left(\mu_j^{1\prime}, \ldots, \mu_j^{p\prime}\right)' \quad (j = 1, \ldots, m)$$
and arrange them in the $pn \times m$ matrix
$$M^{(p)} = \left[\mu_1^{(p)}, \ldots, \mu_m^{(p)}\right]. \qquad (2.14)$$
We shall also have some use for the second central moments of $z_t^{(p)}$,
$$R_j^{(p)} = E\left\{\left(z_t^{(p)} - \mu_j^{(p)}\right)\left(z_t^{(p)} - \mu_j^{(p)}\right)' \mid s_t = j\right\} \quad (j = 1, \ldots, m). \qquad (2.15)$$

In the context of the compound Markov normal mixture model, $m$ corresponds to $m_1$, the number of persistent states. The mixtures over the $m_2$ transitory states provide a richer variance structure than can be achieved in the Markov normal mixture model. The representation (2.14)-(2.15) also applies to models in which the distribution of $y_t$ in state $j$ is not a mixture of normals, so long as the moments in $R_j^{(p)}$ all exist. (These moments always exist in the compound Markov normal mixture model.) We shall refer to all such models as Markov mixture models.
Theorem 2. Suppose that in a Markov mixture model with $m$ states and irreducible and aperiodic transition matrix $P$, the stationary distribution is $\pi$. Suppose further that the moment matrices $R_j^{(p)}$ $(j = 1, \ldots, m)$ are all finite. Then the unconditional mean of $z_t^{(p)}$ is
$$E\left(z_t^{(p)}\right) = \mu^{*(p)} = M^{(p)}\pi.$$
The instantaneous variance matrix is
$$\Gamma_0^{(p)} = E\left(z_t^{(p)} - \mu^{*(p)}\right)\left(z_t^{(p)} - \mu^{*(p)}\right)' \qquad (2.16)$$
$$= \sum_{j=1}^{m}\pi_j\left(R_j^{(p)} + \mu_j^{(p)}\mu_j^{(p)\prime}\right) - \mu^{*(p)}\mu^{*(p)\prime}. \qquad (2.17)$$
The dynamic covariance matrices are
$$\Gamma_u^{(p)} = E\left(z_t^{(p)} - \mu^{*(p)}\right)\left(z_{t-u}^{(p)} - \mu^{*(p)}\right)' = M^{(p)}B_u'\Pi M^{(p)\prime} \quad (u = 1, 2, 3, \ldots), \qquad (2.18)$$
where $B = P - e_m\pi'$, $B_u = P^u - e_m\pi'$, and $\Pi = \mathrm{diag}(\pi_1, \ldots, \pi_m)$.
Proof. See Appendix 8.1.
This result highlights some important restrictions on the dynamics of a multiple time series introduced by the parsimony of a Markov mixture model. In particular,
$$E\left[y_t^h - E\left(y_t^h\right)\right]\left[y_{t-u}^h - E\left(y_t^h\right)\right]' \qquad (2.19)$$
is an $n \times n$ matrix of rank no more than $m - 1$. However the null space of (2.19) varies with both $u$ and $h$; in general, all higher moments of $y_t$ will display serial persistence. The geometric decay in $u$ evident in (2.18) is that of an autoregressive process of finite order. However, this pattern does not extend to (2.17), which suggests that $z_t^{(p)}$ might be represented as the sum of such a process and a serially uncorrelated process that is uncorrelated with it.
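The formulas (2.16)-(2.18) translate directly into code. The sketch below (Python/NumPy; argument names and shapes are our own conventions) returns the unconditional mean, $\Gamma_0^{(p)}$, and $\Gamma_u^{(p)}$ for a set of lags.

```python
import numpy as np

def markov_mixture_covariances(P, M, R_list, lags):
    """Unconditional mean and autocovariances of z_t^(p) via (2.16)-(2.18).
    M is pn x m with columns mu_j^(p); R_list holds the m matrices R_j^(p)."""
    m = P.shape[0]
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmin(np.abs(w - 1.0))]); pi /= pi.sum()
    Pi = np.diag(pi)
    mu_star = M @ pi
    Gamma0 = sum(pi[j] * (R_list[j] + np.outer(M[:, j], M[:, j])) for j in range(m)) \
             - np.outer(mu_star, mu_star)
    Gammas = {}
    for u in lags:
        B_u = np.linalg.matrix_power(P, u) - np.outer(np.ones(m), pi)
        Gammas[u] = M @ B_u.T @ Pi @ M.T                      # (2.18)
    return mu_star, Gamma0, Gammas
```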
Theorem 3. Suppose that the Markov transition matrix $P$ has spectral decomposition $P = Q^{-1}\Lambda Q$ with
$$Q = [q_1, q_2, \ldots, q_m]' = [\pi, q_2, \ldots, q_m]'$$
and
$$Q^{-1} = \left[q^1, q^2, \ldots, q^m\right] = \left[e_m, q^2, \ldots, q^m\right].$$
Let $\lambda_1, \ldots, \lambda_r$ be the distinct eigenvalues in the open unit interval associated with at least one column of $Q'$ not contained in the null space of $M^{(p)}$. Then $z_t^{(p)}$ can be decomposed as the sum of two vector processes,
$$z_t^{(p)} = v_t^{(p)} + \eta_t^{(p)}.$$
The process $\eta_t^{(p)}$ is uncorrelated with $v_{t+u}^{(p)}$ $(u = 0, \pm 1, \pm 2, \ldots)$ and is itself serially uncorrelated, with $\mathrm{var}\left(\eta_t^{(p)}\right) = \sum_{j=1}^{m}\pi_jR_j^{(p)}$. The process $v_t^{(p)}$ has vector autoregressive representation
$$v_t^{(p)} = \sum_{u=1}^{r}\alpha_uv_{t-u}^{(p)} + \omega_t^{(p)}$$
in which the coefficients $\alpha_u$ $(u = 1, \ldots, r)$ are scalars. The roots of the generating polynomial $1 - \sum_{u=1}^{r}\alpha_uz^u$ are $\lambda_1, \ldots, \lambda_r$.

Proof. See Appendix 8.2.
It follows from Theorem 3 that $z_t^{(p)}$ has a VARMA$(r, r)$ representation
$$\left(I_{np} - \sum_{u=1}^{r}\alpha_uI_{np}L^u\right)z_t^{(p)} = \left(I_{np} - \sum_{u=1}^{r}B_u^{(p)}L^u\right)\nu_t^{(p)}. \qquad (2.20)$$
The special structure of the autoregressive coefficients in (2.20) assures that every subvector of $z_t^{(p)}$ follows a VARMA$(r, r)$ process as well. This is consistent with the fact that Markov mixture models are closed under marginalization.

2.3. Moments
There has been substantial attention in the literature to the moments of state transi-
tion models; see in particular the recent contribution [21] on the expression of moments
of Markov normal mixture models. While the ultimate focus of our work is on financial
decision making, population moments provide a useful intermediary for summarizing
properties of any time series model, and corresponding sample moments can capture
salient properties of observed returns. Comparing inferred population moments in the
context of the compound Markov normal mixture model with corresponding sample
moments provides one means of assessing goodness of fit.
Closed form expressions for population moments in our model rapidly become
cumbersome. Rather than take this approach, which was used in [21], we provide an
algorithm mapping the parameters of the model to any population moment. Then we
introduce convenient summary moments for multivariate time series.
Normal distributions are central to the model. With this in mind, for $z \sim N(0, 1)$ define $\zeta_j = E(z^j)$; $\zeta_0 = 1$, $\zeta_2 = 1$, $\zeta_4 = 3$, $\zeta_6 = 15$, etc. For $j$ even, $\zeta_j = \prod_{i=0}^{j/2}|2i - 1|$, and for $j$ odd, $\zeta_j = 0$.
Suppose $\varepsilon \sim N(0, 1)$, $\eta \sim N(0, 1)$ and $E(\varepsilon\eta) = \rho$. Define $\lambda_{p,q}(\rho) = E(\varepsilon^p\eta^q)$. We have
$$\eta = \rho\varepsilon + \left(1 - \rho^2\right)^{1/2}\theta,$$
where $\theta \sim N(0, 1)$ and $\varepsilon$ and $\theta$ are independent. Hence for positive $p$ and $q$,
\begin{align*}
E(\varepsilon^p\eta^q) &= E\left\{\varepsilon^p\left[\rho\varepsilon + \left(1 - \rho^2\right)^{1/2}\theta\right]^q\right\} \\
&= E\left[\varepsilon^p\sum_{i=0}^{q}\binom{q}{i}\rho^i\varepsilon^i\left(1 - \rho^2\right)^{(q-i)/2}\theta^{q-i}\right] \\
&= \sum_{i=0}^{q}\binom{q}{i}\rho^i\left(1 - \rho^2\right)^{(q-i)/2}E\left(\varepsilon^{p+i}\right)E\left(\theta^{q-i}\right) \\
&= \sum_{i=0}^{q}\binom{q}{i}\rho^i\left(1 - \rho^2\right)^{(q-i)/2}\zeta_{p+i}\zeta_{q-i} \\
&= \left(1 - \rho^2\right)^{q/2}\sum_{i=0}^{q}\binom{q}{i}\left[\frac{\rho}{\left(1 - \rho^2\right)^{1/2}}\right]^i\zeta_{p+i}\zeta_{q-i}.
\end{align*}
Observe that if $\rho = 0$ then $\lambda_{p,q}(0) = \zeta_p\zeta_q$.
Now suppose $x \sim N\left(\mu_x, \sigma_x^2\right)$, $y \sim N\left(\mu_y, \sigma_y^2\right)$, $\mathrm{corr}(x, y) = \rho$. Define
$$\mu_{p,q}\left(\mu_x, \sigma_x, \mu_y, \sigma_y, \rho\right) = E(x^py^q).$$
Since $x = \mu_x + \sigma_x\varepsilon$, $y = \mu_y + \sigma_y\eta$, where $\varepsilon$ and $\eta$ are standard normal and $E(\varepsilon\eta) = \rho$, expanding
$$x^p = \sum_{i=0}^{p}\binom{p}{i}\mu_x^i\sigma_x^{p-i}\varepsilon^{p-i} = \sigma_x^p\sum_{i=0}^{p}\binom{p}{i}\left(\frac{\mu_x}{\sigma_x}\right)^i\varepsilon^{p-i},$$
$$y^q = \sum_{j=0}^{q}\binom{q}{j}\mu_y^j\sigma_y^{q-j}\eta^{q-j} = \sigma_y^q\sum_{j=0}^{q}\binom{q}{j}\left(\frac{\mu_y}{\sigma_y}\right)^j\eta^{q-j},$$
we have
$$E(x^py^q) = \sigma_x^p\sigma_y^q\sum_{i=0}^{p}\sum_{j=0}^{q}\binom{p}{i}\binom{q}{j}\left(\frac{\mu_x}{\sigma_x}\right)^i\left(\frac{\mu_y}{\sigma_y}\right)^j\lambda_{p-i,q-j}(\rho).$$
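These recursions are short to implement. The sketch below (Python; the function names are ours) computes $\zeta_j$, $\lambda_{p,q}(\rho)$ and $\mu_{p,q}$ exactly as derived above.

```python
import math

def zeta(j):
    """E[z^j] for z ~ N(0,1): (j-1)!! for even j, 0 for odd j."""
    if j % 2 == 1:
        return 0.0
    return float(math.prod(range(1, j, 2))) if j > 0 else 1.0

def lam(p, q, rho):
    """lambda_{p,q}(rho) = E[eps^p eta^q] for standard normals with correlation rho."""
    return sum(math.comb(q, i) * rho**i * (1 - rho**2)**((q - i) / 2)
               * zeta(p + i) * zeta(q - i) for i in range(q + 1))

def mu_pq(p, q, mx, sx, my, sy, rho):
    """mu_{p,q} = E[x^p y^q] for correlated normals x ~ N(mx, sx^2), y ~ N(my, sy^2)."""
    return (sx**p * sy**q *
            sum(math.comb(p, i) * math.comb(q, j)
                * (mx / sx)**i * (my / sy)**j * lam(p - i, q - j, rho)
                for i in range(p + 1) for j in range(q + 1)))

print(mu_pq(1, 1, 0.1, 2.0, -0.2, 1.5, 0.4))   # equals rho*sx*sy + mx*my = 1.18
```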

In the compound Markov normal mixture model, for the linear combination $a'y_t$, let
\begin{align*}
\mu_{a,(i,j)} &= E[a'y_t \mid s_t = (i, j)], \\
\sigma^2_{a,(i,j)} &= \mathrm{var}[a'y_t \mid s_t = (i, j)], \\
\rho_{a,b} &= \mathrm{corr}(a'y_t, b'y_t).
\end{align*}
Then
$$E\left[(a'y_t)^p(b'y_t)^q\right] = \sum_{i=1}^{m_1}\sum_{j=1}^{m_2}\pi_i\rho_{ij}\,\mu_{p,q}\left(\mu_{a,(i,j)}, \sigma_{a,(i,j)}, \mu_{b,(i,j)}, \sigma_{b,(i,j)}, \rho_{a,b}\right)$$
and
$$E\left[(a'y_t)^p(b'y_{t+s})^q\right] = \sum_{i=1}^{m_1}\sum_{j=1}^{m_2}\sum_{i'=1}^{m_1}\sum_{j'=1}^{m_2}\pi_i\rho_{ij}\left(P^s\right)_{i,i'}\rho_{i',j'}\,\mu_{p,q}\left(\mu_{a,(i,j)}, \sigma_{a,(i,j)}, \mu_{b,(i',j')}, \sigma_{b,(i',j')}, 0\right). \qquad (2.21)$$
In terms of our original notation,
$$\mu_{a,(i,j)} = a'\left(\phi_i + \psi_{ij}\right), \qquad \sigma^2_{a,(i,j)} = h_i^{-1}h_{ij}^{-1}a'H^{-1}a,$$
and $\rho_{a,b} = a'H^{-1}b\,/\left(a'H^{-1}a \cdot b'H^{-1}b\right)^{1/2}$.

Given the ability to form (2.21) from the parameters of the model, we consider some summary descriptive moments. For the most part, these moments map multivariate moments into analogues of univariate moments.
Recall that $\varepsilon_t = y_t - B'x_t$ and $E(\varepsilon_t) = 0$ by construction. Again, for consistency of notation it is useful to take $B = 0$ and deal directly with $y_t$. Let $\Sigma = \mathrm{var}(y_t)$. The $u$'th autocorrelation coefficient is
$$c_u = E\left(y_t'\Sigma^{-1}y_{t+u}\right)/n.$$
The coefficient of excess kurtosis is
$$k_0 = E\left\{\left(y_t'\Sigma^{-1}y_t\right)^2 - n(n-1)\right\}/n - 3.$$
When $y_t \overset{i.i.d.}{\sim} N(0, \Sigma)$, $k_0 = 0$. The $u$'th autokurtosis coefficient is
$$k_u = E\left[y_t'\Sigma^{-1}y_t\,y_{t-u}'\Sigma^{-1}y_{t-u}\right]/n^2 - 1.$$
This function captures persistence in volatility. Again, if $y_t \overset{i.i.d.}{\sim} N(0, \Sigma)$, then $k_u = 0$.
The coefficient of skewness is
$$s_0 = \sum_{i=1}^{n}E\left(y_{ti}^3\right)\,/\,n\sigma_{ii}^{3/2},$$
and the $u$'th autoskewness coefficient is
$$s_u = \sum_{i=1}^{n}E\left(y_{ti}^2y_{t+u,i}\right)\,/\,n\sigma_{ii}^{3/2}.$$

The functions cu and ku are symmetric about u = 0 and invariant to linear trans-
formations of yt . The functions su are symmetric about u = 0 if m1 ≤ 2 and otherwise
generally are not. The functions su are not invariant to linear transformations of yt :
indeed this cannot possibly be the case for any measure of asymmetry of a density
about its mean.
Corresponding sample moments are defined in the obvious way: for example, the sample counterpart of $k_u$ is constructed by first computing the least squares residuals $e_t$ of a regression of $y_t$ on $x_t$, then $\hat\Sigma = T^{-1}\sum_{t=1}^{T}e_te_t'$, and then forming $\sum_{t=u+1}^{T}e_t'\hat\Sigma^{-1}e_t\,e_{t-u}'\hat\Sigma^{-1}e_{t-u}\,/\,Tn^2 - 1$.
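A direct transcription of these sample counterparts follows (Python/NumPy; a sketch under the convention that the rows of E hold the least squares residuals $e_t$).

```python
import numpy as np

def sample_summary_moments(E, u):
    """Sample counterparts of c_u, k_u and s_u from a T x n matrix of
    least squares residuals E (a sketch of the definitions in Section 2.3)."""
    T, n = E.shape
    Sigma = E.T @ E / T
    Sinv = np.linalg.inv(Sigma)
    q = np.einsum("ti,ij,tj->t", E, Sinv, E)                      # e_t' Sigma^{-1} e_t
    c_u = np.mean(np.einsum("ti,ij,tj->t", E[u:], Sinv, E[:T - u])) / n
    k_u = np.mean(q[u:] * q[:T - u]) / n**2 - 1.0
    sd = np.sqrt(np.diag(Sigma))
    s_u = np.mean(np.sum(E[:T - u]**2 * E[u:] / sd**3, axis=1)) / n
    return c_u, k_u, s_u
```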

3. Priors and predictives


To exploit the full power of Bayesian methods and provide a fully consistent theory of
inference, we use proper priors in all of our work. Given a proper prior distribution,
the model itself provides a predictive distribution for what data might be observed.
This is then the basis for formal comparisons across models using posterior odds
ratios: the odds ratio favors the model with a higher density at the observed data point $(y_t^o)_{t=1}^{T}$. Like many procedures that are formally and internally consistent, this one has some compelling practical advantages.
First, it is easier to deduce the predictive density of the model, which involves draws from the prior followed by data simulation and construction of functions of interest of the data, than it is to construct and test the posterior simulator introduced in Section 4.
Second, it is no more difficult to construct posterior odds ratios for models that
are non-nested (and perhaps quite different in a variety of ways) than it is in the
case of a pair of models that are simply nested. In doing so, the predictive densities
of both models can be used to ensure that the prior distributions of the models lead
to comparable prior distributions for moments or other functions of interest. (This
exercise may also reveal that a model is incapable of reproducing salient features of
the data — and in that case one may avoid the time and effort of developing a posterior
simulator for that model.)
Finally, in the case of normal mixture models, the theory of maximum likelihood
estimation breaks down entirely. The heuristic for this breakdown is that a state
may be constructed with low probability, a perfect fit to one or a small number of
observations, and a vanishing variance, resulting in a pole in the likelihood function.
In fact this can be done in an astronomical number of ways for asset return time
series, panel data, and other large data sets used by economists. This observation
was first made in a related context by Kiefer and Wolfowitz [17], and further detail
is provided in Appendix F of [13].

3.1. Prior distributions


We employ a conditionally conjugate family of prior distributions, as follows:
$$\beta = \mathrm{vec}(B') \sim N\left(\underline\beta, H_\beta^{-1}\right); \qquad (3.1)$$
$$p_i \sim \mathrm{Beta}_{m_1}(r_1, \ldots, r_1) \quad (\text{independent};\ i = 1, \ldots, m_1);$$
$$\rho_i \sim \mathrm{Beta}_{m_2}(r_2, \ldots, r_2) \quad (\text{independent};\ i = 1, \ldots, m_1);$$
$$H \sim W\left(S^{-1}, \nu\right); \qquad (3.2)$$
$$s_1^2h_j \sim \chi^2(\nu_1) \quad (\text{independent};\ j = 1, \ldots, m_1);$$
$$s_2^2h_{ij} \sim \chi^2(\nu_2) \quad (\text{independent};\ i = 1, \ldots, m_1;\ j = 1, \ldots, m_2);$$
$$\tilde\phi \mid H \sim N\left(0,\ h_\phi^{-1}I_{m_1-1} \otimes H^{-1}\right);$$
$$\tilde\psi_j \mid (h_j, H) \sim N\left(0,\ \left(h_\psi h_j\right)^{-1}I_{m_2-1} \otimes H^{-1}\right) \quad (\text{independent};\ j = 1, \ldots, m_1).$$

In the case of $\tilde\phi$, precision is scaled relative to the precision in $H$. It is this relative precision that is important in establishing the set of densities for $y_t$ that is reasonable under the prior. For example, when $h_\phi$ is large the probability of a bimodal or multimodal distribution is small, but as $h_\phi \to 0$ this probability approaches 1. The scaling for the precision of the $\tilde\psi_j$ reflects similar considerations. These priors are invariant with respect to the particular choices of $C_0, C_1, \ldots, C_{m_1}$. In the case of $\phi$, consider the two priors
$$\phi \sim N(0, I_{m_1} \otimes V) \ \text{ s.t. } \ \pi'\Phi = 0' \Leftrightarrow (\pi' \otimes I_n)\phi = 0; \qquad (3.3)$$
$$\tilde\phi \sim N(0, I_{m_1-1} \otimes V). \qquad (3.4)$$

Consider $d'\phi$ and set $d = ([C_0\ \ \pi] \otimes I_n)d^* = (C_0 \otimes I_n)d_1^* + (\pi \otimes I_n)d_2^*$. From (3.3),
\begin{align*}
d'\phi &= d_1^{*\prime}(C_0' \otimes I_n)\phi + d_2^{*\prime}(\pi' \otimes I_n)\phi \\
&= d_1^{*\prime}(C_0' \otimes I_n)\phi \\
&\sim N\left(0,\ d_1^{*\prime}(C_0'C_0 \otimes V)d_1^*\right) = N\left(0,\ d_1^{*\prime}(I_{m_1-1} \otimes V)d_1^*\right).
\end{align*}
But (3.4) implies
\begin{align*}
d'\phi &= \left[d_1^{*\prime}(C_0' \otimes I_n) + d_2^{*\prime}(\pi' \otimes I_n)\right](C_0 \otimes I_n)\tilde\phi \\
&= d_1^{*\prime}(C_0'C_0 \otimes I_n)\tilde\phi \\
&= d_1^{*\prime}\tilde\phi \sim N\left(0,\ d_1^{*\prime}(I_{m_1-1} \otimes V)d_1^*\right).
\end{align*}
Exactly the same argument applies to the $\psi_j$.


Observe that the prior distributions for B (3.1) and H (3.2) are required in the
simple, i.i.d. Gaussian model. Clear counterparts to these parameters exist in other
models as well. For example, B is present in the MGARCH models discussed in
Section 1.1, and the counterpart to H in (1.5) is A. Subjective priors for these
parameters must take account of the scale of the data. There are only eight other hyperparameters of the prior distributions to be chosen by the investigator: $r_1$, $r_2$, $s_1^2$, $\nu_1$, $s_2^2$, $\nu_2$, $h_\phi$, and $h_\psi$. In view of the interaction of the parameters $h_i$, $h_{ij}$, and $H$ in the model, little if anything is lost by setting $s_1^2 = \nu_1$ and $s_2^2 = \nu_2$, leaving only a half-dozen parameters particular to the compound Markov normal mixture model.

3.2. Unconditional predictives


In general, any model with a proper prior distribution for its parameter vector $\theta \in \Theta$ implies a predictive distribution for the vector of observables $y$ addressed by the model:
$$p(y) = \int_\Theta p(\theta)\,p(y \mid \theta)\,d\theta. \qquad (3.5)$$
Indeed, the adequacy of such a model, in comparison with others, is ultimately indicated by $p(y^o)$, where $y^o$ is the observed value of $y$ (the data).
The integral in (3.5) is essential for determining $p(y^o)$, and this can be technically challenging. However, simulation from (3.5) is nearly trivial. It amounts to first drawing $\theta^{(m)}$ from the prior distribution $p(\theta)$ (Section 3.1), then drawing $y^{(m)}$ from the distribution $p\left(y \mid \theta^{(m)}\right)$ (the model set forth in Section 2.1), and repeating for $m = 1, 2, 3, \ldots$. Observe that the prior distribution in the first step is usually simple, as is the case here, and that the second step is just a straightforward simulation of the model.
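In code, this prior-predictive exercise is a short loop (Python/NumPy sketch; draw_from_prior and simulate are hypothetical stand-ins for the prior of Section 3.1 and the data simulator of Section 2.1, and moment_fn computes whichever sample moment from Section 2.3 is of interest).

```python
import numpy as np

def predictive_sample_moments(M, T, draw_from_prior, simulate, moment_fn, rng=None):
    """Quantiles of a sample moment under the unconditional predictive (3.5)."""
    rng = rng or np.random.default_rng(0)
    draws = []
    for _ in range(M):
        theta = draw_from_prior(rng)          # theta^(m) ~ p(theta)
        y = simulate(T, theta, rng)           # y^(m) ~ p(y | theta^(m))
        draws.append(moment_fn(y))            # e.g. skewness, c_1, s_1, k_1
    return np.quantile(np.asarray(draws), [0.05, 0.25, 0.50, 0.75, 0.95], axis=0)
```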
The unconditional predictive distribution serves at least two quite useful purposes.
First, it can be used to determine whether the distribution of observables, in conjunction with a stated prior, is capable of reproducing observed moments or other features of the data. If, after experimentation with a variety of priors, it is found that this is not possible, one may decide, as a matter of research strategy, not to proceed to the formal and more tedious steps of setting up an algorithm for formal posterior
inference. If the data distribution and prior do capture these moments, then the
unconditional predictive distribution can be used to exclude priors inconsistent with
moments or other “stylized facts” regarding the observables the model is designed to
address.
The second important use of the unconditional predictive distribution is in model comparison. Recall that the posterior odds ratio in favor of model $A_1$, versus model $A_2$, is
$$\frac{p(A_1 \mid y^o)}{p(A_2 \mid y^o)} = \frac{p(A_1)\,p(y^o \mid A_1)}{p(A_2)\,p(y^o \mid A_2)} = \frac{p(A_1)\int_{\Theta_1}p(\theta_1 \mid A_1)\,p(y^o \mid \theta_1)\,d\theta_1}{p(A_2)\int_{\Theta_2}p(\theta_2 \mid A_2)\,p(y^o \mid \theta_2)\,d\theta_2}. \qquad (3.6)$$
The critical components of the posterior odds ratio are the marginal likelihoods $p(y^o \mid A_i)$. If these values are available for a sequence of models $i = 1, 2, \ldots$ then all combinations of posterior odds ratios can be formed. Using the marginal likelihoods it is also possible to marginalize across models, as discussed in Section 4 of [11]. The ratio (3.6) can be affected strongly by the choice of the prior distributions $p(\theta_1 \mid A_1)$ and $p(\theta_2 \mid A_2)$. One may put two models "on the same footing" by making sure that for each of a set of moments or other features $m(\theta_j)$ common to the two models, $p[m(\theta_1) \mid A_1]$ is at least of the same order of magnitude as $p[m(\theta_2) \mid A_2]$. (We will use this approach in future research comparing the compound Markov normal mixture model with alternatives.)
We used the unconditional predictive distribution in the first way in choosing benchmark priors for our subsequent application to foreign exchange returns. (See Section 6, which provides a description of the data set.) After a little experimentation, we settled on the prior hyperparameters $r_1 = r_2 = 1$, "flat priors" for $P$ and $R$; $s_1^2 = \nu_1 = s_2^2 = \nu_2 = 2$, exponential priors with mean 1 for the precision terms $h_i$ $(i = 1, \ldots, m_1)$ and $h_{ij}$ $(i = 1, \ldots, m_1;\ j = 1, \ldots, m_2)$; and relative standard deviations $h_\phi^{-1/2} = h_\psi^{-1/2} = 2$ in the prior distributions of the state mean parameter vectors $\phi_i$ and $\psi_{ij}$.

Table 1
A Benchmark Prior
n = 1, m1 = 2, m2 = 2
Quantile   Skewness   Kurtosis    c1       s1       k1
.05        -1.261     -0.765     -0.195   -0.176   -0.493
.25        -0.424     -0.143     -0.038   -0.051   -0.116
.50         0.004      1.070     -0.002    0.003   -0.016
.75         0.398      3.232      0.028    0.062    0.068
.95         1.221     11.956      0.119   -0.180    0.522
Returns:
Pound      -0.144      2.462      0.003   -0.018    0.487
Mark       -0.081      2.389     -0.003   -0.081    0.813
Yen        -0.309      3.904      0.021   -0.140    0.721

Table 2
A Benchmark Prior
n = 3, m1 = 2, m2 = 2
Quantile         Skewness   Kurtosis    c1       s1       k1
.05              -0.761     -0.166     -0.114   -0.097   -0.420
.25              -0.248      0.729     -0.031   -0.034   -0.074
.50              -0.020      2.097     -0.003   -0.001   -0.008
.75               0.204      5.311      0.024    0.032    0.049
.95               0.645     18.322      0.104    0.102    0.496
Returns:
Pound-Mark-Yen   -0.215      3.638      0.017   -0.115    0.418

The implied predictive distributions for some of the sample moments discussed
in Section 2.3, along with observed sample moments, are provided in Tables 1 and
2. These tables provide quantiles of the predictive distributions of sample moments
in a sample of size T = 1,006. That is, corresponding to each draw from the prior a
sample of size 1,006 was simulated, and then the sample moments were computed.
The tables provide five quantiles for each of these sample moments, along with the
observed sample moments for some of the foreign exchange series. We emphasize that
what is being predicted by the model, however, is observed sample moments rather
than population moments.
The results indicate that the predictive easily accommodates sample moments,
although the autokurtosis coefficient k1 is near the high end of the distribution. This
procedure can only address moments one at a time: there is no evidence, one way
or the other here, that the model can account for the five moments simultaneously.
The unconditional predictive density provides a check on the reasonableness of a prior
distribution. It is no substitute for full inference, to which we turn in Section 4.

4. Inference
It is clear that analytical approaches to inference in this model are impossible. We
address this task using a Markov chain Monte Carlo (MCMC) algorithm. In many
respects this algorithm is standard, although it is complex compared to most other
MCMC algorithms. We first outline the algorithm. Following this we turn to the bearing on the MCMC algorithm of the labelling issues described in Section 1.2, and to the computation of marginal likelihoods for model comparison. We also briefly describe some diagnostic tests with substantial power against errors in the MCMC algorithm, the code, and even the derivation of the likelihood function.

4.1. The MCMC algorithm


The model set forth in Section 2.1 may be regarded as having four components: the observed deterministic variables $X$; asset returns $Y$; the unobserved latent states $s$; and the parameter vectors $\theta_1 = \mathrm{vech}(H)$, $\theta_2 = (h_1, \ldots, h_{m_1})'$, $\theta_3 = (h_{11}, \ldots, h_{m_1m_2})'$, $\theta_4 = \mathrm{vec}(P')$, $\theta_5 = \mathrm{vec}(R')$, and $\theta_6 = \gamma$. From Sections 2.1 and 3.1,
$$p\left(H, \{h_i\}, \{h_{ij}\}, P, R, \gamma, s^1, s^2 \mid X, Y\right) \propto p(H)\,p(\{h_i\})\,p(\{h_{ij}\})\,p(P)\,p(R)\,p(\gamma \mid \{h_i\}, H)\,p(s^1 \mid P)\,p(s^2 \mid s^1, R)\,p(Y \mid X, s, H, \{h_i\}, \{h_{ij}\}, \gamma).$$

The MCMC algorithm used is fundamentally a Gibbs sampling algorithm. At each step $m$, each $\theta_i^{(m)}$ is drawn in succession from its posterior distribution conditional on $\theta_j^{(m)}$ $(j < i)$, $\theta_j^{(m-1)}$ $(j > i)$, $s^{(m-1)}$, $Y$ and $X$. Then $s^{(m)}$ is drawn from its posterior distribution conditional on $\theta_1^{(m)}, \ldots, \theta_6^{(m)}$, $Y$ and $X$. The algorithm has the important property that given any values of $\theta^{(m-1)}$ and $s^{(m-1)}$, any subset $\Theta^*$ of the parameter space with positive probability, and any $s^*$,
$$P\left[\theta^{(m)} \in \Theta^*, s^{(m)} = s^* \mid \theta^{(m-1)}, s^{(m-1)}, Y, X\right] > 0.$$
The Markov chain is therefore ergodic [20] and the unique invariant distribution of the Markov chain is the posterior distribution.
The conditional posterior distributions for each of θ1 , . . . , θ6 and s are fully de-
tailed in Appendix 8.3. Since the prior distribution for H is Wishart and the kernel
of (2.5) in H is also Wishart, the conditional posterior distribution of H is Wishart.
The priors for $h_i$ $(i = 1, \ldots, m_1)$ are independent gamma, (2.5) is a product of independent gamma density kernels in the $h_i$, and so the $h_i$ have independent conditional posterior gamma distributions; similarly for the $h_{ij}$ $(i = 1, \ldots, m_1;\ j = 1, \ldots, m_2)$.
The situations for P and R are a little more complex. The rows of each matrix
have conditionally independent multivariate beta distributions, and their kernels in
p (Y, s | X), (2.2) and (2.3), are also multivariate beta. However P also enters the
data density in (2.10), because C0 (an orthonormal complement of π, which is a
function of P) is part of Wt (2.9). Thus the conditional posterior density of any
row of P, say pi , departs from a multivariate beta density by a multiplicative factor

which cannot be written in closed form. To overcome this difficulty we employ a Metropolis-within-Gibbs step (see [6] or [11]). At iteration $m$, in the draw for row $i$ of $P$, a candidate $p_i^*$ is selected from the multivariate beta distribution indicated by the prior density for $p_i$ in conjunction with (2.2). This candidate is accepted with probability
$$\min\left\{1,\ \frac{p(p_i^* \mid \omega)\,/\,B(p_i^* \mid \omega)}{p\left(p_i^{(m-1)} \mid \omega\right)\,/\,B\left(p_i^{(m-1)} \mid \omega\right)}\right\},$$
where
$$\omega = \left\{\theta_1^{(m)}, \theta_2^{(m)}, \theta_3^{(m)}, p_j^{(m)}\ (j < i), p_j^{(m-1)}\ (j > i), \theta_5^{(m-1)}, \theta_6^{(m-1)}, s^{(m-1)}, Y, X\right\}$$
and $B(\cdot)$ is the beta density kernel that arises in the product of (2.2) and the prior density of $P$. If $p_i^*$ is accepted then $p_i^{(m)} = p_i^*$ and if not then $p_i^{(m)} = p_i^{(m-1)}$. Full details are given in Appendix 8.3. A similar strategy is employed for each row of $R$.
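A sketch of this Metropolis-within-Gibbs update for one row of $P$ follows (Python/NumPy). The Dirichlet proposal combines the Beta prior $(r_1, \ldots, r_1)$ with the transition counts $T_{ij}$ of (2.2); the function log_extra is a hypothetical stand-in for the log of the remaining factor of the conditional posterior (the dependence of (2.10) on $P$ through $C_0$, together with the $\pi_{s_{11}}$ term of (2.2)), which is model-specific and not shown here.

```python
import numpy as np

def mh_update_P_row(P, i, T_counts, r1, log_extra, rng):
    """One Metropolis-within-Gibbs update of row i of P (a sketch).
    The candidate is an independence draw from the Dirichlet implied by the
    prior and the counts; log_extra(P) is the user-supplied log of the factor
    of the conditional posterior not absorbed by that kernel (assumption)."""
    P_cand = P.copy()
    P_cand[i] = rng.dirichlet(r1 + T_counts[i])
    log_ratio = log_extra(P_cand) - log_extra(P)      # ratio p(.)/B(.) at new vs old
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return P_cand
    return P
```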
The priors for $B$, $\tilde\Phi$ and $\tilde\Psi$, and the kernel of (2.10) in $\gamma$, are all Gaussian, leading to the Gaussian conditional posterior distribution for $\gamma$ detailed in Appendix 8.3.
The final step of the MCMC algorithm is the draw of the $T \times 2$ matrix of latent states from its distribution conditional on the parameters $\theta$ and the observed $X$ and $Y$. Recall that $s = [s^1\ \ s^2]$. The conditional distribution of $s^2$ is that of independent multinomials, with
$$p(s_{t2} = j \mid s_{t1} = i, y_t, X, \theta) \propto \rho_{ij}\,p(y_t \mid s_{t1} = i, s_{t2} = j, X, \theta),$$
where
$$p(y_t \mid s_{t1} = i, s_{t2} = j, X, \theta) = (2\pi)^{-n/2}\left|\delta_tH\right|^{1/2}\exp\left\{-\delta_t\left(y_t - B'x_t - \phi_i - \psi_{ij}\right)'H\left(y_t - B'x_t - \phi_i - \psi_{ij}\right)/2\right\} = d_{tij} \qquad (4.1)$$
and $\delta_t = h_ih_{ij}$. In this notation, $p(y_t \mid s_{t1} = i, X, \theta) = \sum_{j=1}^{m_2}\rho_{ij}d_{tij} = d_{ti}$. With these conditional probabilities in hand, the algorithm in [5] is used to draw $s^1$ from its joint posterior conditional distribution, marginalizing $s^2$: $p(s^1 \mid Y, X, \theta)$. With this draw in hand, $s^2$ is drawn from (4.1) $(t = 1, \ldots, T)$.
We initialize the MCMC algorithm by first drawing the parameters θ from their
prior distribution, and then taking s from the conditional distribution just described.

4.2. Labelling
An important computational issue arises in approximating the marginal likelihood using the modification of the Gelfand-Dey [10] procedure described in [11]. It comes from the Gaussian distribution used to approximate and truncate the posterior. To the extent one actually has state-switching in the MCMC chain, this approximation may be poor.
To deal with this computational issue, and to provide some cosmetic stability in states over the chain, we re-order the persistent states so that $\pi_i \ge \pi_j$ for all $i < j$. We re-order the transitory states so that $\rho_{ij} \ge \rho_{ik}$ for all $j < k$ $(i = 1, \ldots, m_1)$. These re-orderings can be represented in the $m_1 \times m_1$ row permutation matrix $J_0$ and the $m_2 \times m_2$ row permutation matrices $J_i$ $(i = 1, \ldots, m_1)$. We record $\Phi^* = J_0\Phi$, $P^* = J_0PJ_0'$, and $\left(h_1^*, \ldots, h_{m_1}^*\right) = (h_1, \ldots, h_{m_1})J_0'$. The re-ordered $R$, $\psi$ and $h_{ij}$'s are $R^* = J_0R = \left[\rho_1^*, \ldots, \rho_{m_1}^*\right]'$,
$$\Psi^* = (J_0 \otimes I_{m_2})\begin{bmatrix}J_1 & & \\ & \ddots & \\ & & J_{m_1}\end{bmatrix}\begin{bmatrix}\Psi_1 \\ \vdots \\ \Psi_{m_1}\end{bmatrix} = \begin{bmatrix}\Psi_1^* \\ \vdots \\ \Psi_{m_1}^*\end{bmatrix},$$
and
$$h^* = (J_0 \otimes I_{m_2})\begin{bmatrix}J_1 & & \\ & \ddots & \\ & & J_{m_1}\end{bmatrix}h,$$
where
$$h^* = \left(h_{11}^*, \ldots, h_{1m_2}^*, \ldots, h_{m_11}^*, \ldots, h_{m_1m_2}^*\right)', \qquad h = \left(h_{11}, \ldots, h_{1m_2}, \ldots, h_{m_11}, \ldots, h_{m_1m_2}\right)'.$$
We record $\rho_i^*$ $(i = 1, \ldots, m_1)$, $\mathrm{vec}(\Psi_i^*)$ $(i = 1, \ldots, m_1)$ and $h^*$.


(More to be added to this section)

4.3. Marginal Likelihoods


With a good approximation of the marginal likelihood p (yo ) in any model one can
make formal comparisons with other models by means of posterior odds ratios (3.6).
In the compound Markov normal mixture model this computation is greatly simplified
by the fact that it is possible to evaluate p (yo | θ) directly, eliminating the T ×2 state
matrix. Referring to (4.1),
$$p(y_t \mid s_{t1} = j, \theta) = d_{tj} \quad (j = 1, \ldots, m_1;\ t = 1, \ldots, T).$$

With these values in hand, p (yo | θ) can be computed as a by-product of the draws
of s1 in Chib’s algorithm [5].
Thus it is straightforward to evaluate the data density. The procedure of Gelfand and Dey [10] is based on the fact that for any probability density $f(\theta)$ whose support $\Theta^*$ is a subset of $\Theta$, the support of $p(\theta)$,
$$\int_{\Theta^*}\frac{f(\theta)}{p(\theta)\,p(y^o \mid \theta)}\,p(\theta \mid y^o)\,d\theta = \left[p(y^o)\right]^{-1}.$$
Since the MCMC algorithm in Section 4.1 produces draws $\left\{\theta^{(m)}\right\}$ from $p(\theta \mid y^o)$,
$$M^{-1}\sum_{m=1}^{M}\frac{f\left(\theta^{(m)}\right)}{p\left(\theta^{(m)}\right)p\left(y^o \mid \theta^{(m)}\right)} \stackrel{a.s.}{\longrightarrow} \left[p(y^o)\right]^{-1}.$$

Procedures described in [11] are used to construct $f(\theta)$ from the MCMC output $\left\{\theta^{(m)}\right\}$ in such a way that this approximation is computationally efficient. Appendix 8.4 provides technical details associated with this procedure.
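A sketch of the resulting estimator follows (Python with NumPy and SciPy). The tuning of $f(\theta)$ as a truncated normal fitted to the posterior draws is in the spirit of [11]; details such as the truncation probability are illustrative choices, not the paper's.

```python
import numpy as np
from scipy.stats import chi2

def gelfand_dey_log_ml(theta_draws, log_prior, log_lik, trunc_prob=0.95):
    """Modified Gelfand-Dey estimate of log p(y^o) from posterior draws (a sketch).
    theta_draws is M x k; log_prior and log_lik are length-M arrays of
    log p(theta^(m)) and log p(y^o | theta^(m))."""
    theta = np.asarray(theta_draws)
    M, k = theta.shape
    mean = theta.mean(axis=0)
    cov = np.cov(theta, rowvar=False)
    dev = theta - mean
    maha = np.einsum("mi,ij,mj->m", dev, np.linalg.inv(cov), dev)
    inside = maha <= chi2.ppf(trunc_prob, df=k)          # truncation region for f
    _, logdet = np.linalg.slogdet(cov)
    log_f = (-0.5 * (k * np.log(2 * np.pi) + logdet + maha)
             - np.log(trunc_prob))                       # truncated normal log density
    terms = np.where(inside, log_f - log_prior - log_lik, -np.inf)
    # the average of f / (prior * likelihood) estimates 1 / p(y^o)
    return -(np.logaddexp.reduce(terms) - np.log(M))
```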

4.4. Quality control


It is important to establish that the MCMC algorithm of Section 4.1 and Appendix
8.3 is correct, as is the code that implements the algorithm. We use a new procedure
described in [12] that is powerful against both analytic and coding errors, including
mistakes in the derivation of the data density.
Recall the simulation of the unconditional predictive distribution described in Section 3.2:
$$\theta^{(m)} \sim p(\theta), \qquad y^{(m)} \sim p\left(y \mid \theta^{(m)}\right).$$
Neither simulation requires that one express the prior density $p(\theta)$ or the data density $p(y \mid \theta)$ analytically. The simulation is straightforward, and even for complex models can generally be accomplished in short order by a competent graduate student. Clearly
$$\left\{\theta^{(m)}, y^{(m)}\right\} \sim p(\theta, y) = p(\theta)\,p(y \mid \theta).$$
Next consider the MCMC simulator of Section 4.1, which requires considerably more effort, including correct expression of prior and data densities in order to determine the requisite conditional distributions. Rather than keep $y$ fixed (that is, $y = y^o$) before each draw $\theta^{(m)} \sim p(\theta \mid y)$, draw a simulated data set from $p\left(y \mid \theta^{(m-1)}\right)$; thus
$$\tilde\theta^{(0)} \sim p(\theta),$$
$$\left.\begin{aligned}\tilde y^{(m)} &\sim p\left(y \mid \tilde\theta^{(m-1)}\right),\\ \tilde\theta^{(m)} &\sim p\left(\theta \mid \tilde y^{(m)}\right)\end{aligned}\right\}\quad (m = 1, 2, \ldots).$$
The unique invariant density of this Markov chain is $p(\theta, y)$. (In fact, there are no transition dynamics at the outset, since $\tilde\theta^{(0)}$ is drawn from the marginal distribution of $\theta$ in $p(\theta, y)$.)
Thus there are two different algorithms for producing draws from $p(\theta, y)$, using the simulators from $p(\theta)$, $p(y \mid \theta)$ and $p(\theta \mid y)$ already in hand. In fact, for the second simulation all that is needed is a few extra lines of code in the MCMC simulator to generate a new $\tilde y^{(m)}$ at the start of each iteration. With no errors in derivations or coding, $\left\{\theta^{(m)}, y^{(m)}\right\}$ and $\left\{\tilde\theta^{(m)}, \tilde y^{(m)}\right\}$ are drawn independently from the same distribution. Methods described in [11] and available in the BACC software [15] can be used to test this proposition formally. The only step in common between the two simulators is drawing from the data density $p(y \mid \theta)$. Errors here are typically less likely to arise than elsewhere, but even if they do, then $\left\{\theta^{(m)}, y^{(m)}\right\}$ and $\left\{\tilde\theta^{(m)}, \tilde y^{(m)}\right\}$ will generally not be drawn from the same distribution.
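Schematically, the two simulators can be written as follows (Python; draw_prior, simulate_data and mcmc_step are hypothetical stand-ins for the prior of Section 3.1, the data density of Section 2.1 and one sweep of the algorithm of Section 4.1).

```python
def marginal_conditional(M, draw_prior, simulate_data, rng):
    """Draws {theta^(m), y^(m)} from p(theta, y) via the prior and the data density."""
    out = []
    for _ in range(M):
        theta = draw_prior(rng)                 # theta^(m) ~ p(theta)
        out.append((theta, simulate_data(theta, rng)))
    return out

def successive_conditional(M, draw_prior, simulate_data, mcmc_step, rng):
    """Draws {theta~(m), y~(m)} by alternating the data simulator and one sweep
    of the posterior simulator; with no errors, both streams target p(theta, y)."""
    theta = draw_prior(rng)                     # theta~(0) ~ p(theta)
    out = []
    for _ in range(M):
        y = simulate_data(theta, rng)           # y~(m) ~ p(y | theta~(m-1))
        theta = mcmc_step(theta, y, rng)        # theta~(m) ~ p(theta | y~(m))
        out.append((theta, y))
    return out
```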

A separate procedure can be used to examine the possibility that the densities $p(\theta)$ and/or $p(y \mid \theta)$ are being evaluated incorrectly. This is critical for the approximation of the marginal likelihood described in Section 4.3. These evaluations are not part of the MCMC algorithm (Section 4.1) or of draws from the prior or data density, so errors here escape the test just described. However, it is critical for sound approximation of marginal likelihoods that these evaluations be correct.
The principle underlying the Gelfand-Dey approximation of the marginal likelihood may be used to test the correctness of the expressions for $p(\theta)$ and $p(y \mid \theta)$ and their coding. For any p.d.f. $f(\theta, y)$ with support $\Theta^* \times Y^*$ contained in $\Theta \times Y$ ($Y$ being the support of the data density),
$$\int_\Theta\int_Y\frac{f(\theta, y)}{p(\theta)\,p(y \mid \theta)}\,p(\theta, y)\,dy\,d\theta = 1.$$
Hence for the simulator output $\left\{\tilde\theta^{(m)}, \tilde y^{(m)}\right\}$,
$$M^{-1}\sum_{m=1}^{M}\frac{f\left(\tilde\theta^{(m)}, \tilde y^{(m)}\right)}{p\left(\tilde\theta^{(m)}\right)p\left(\tilde y^{(m)} \mid \tilde\theta^{(m)}\right)} \stackrel{a.s.}{\longrightarrow} 1.$$

In other words, when simulated data are used in place of actual data, the marginal
likelihood must be unity.
The code written for this paper has passed both of these tests.

5. Results with artificial data


Here we report a few essential findings using artificial data, in which the population
distribution is known. Subsequent versions of this paper will contain some full-fledged
Monte Carlo studies as well as simulations that illustrate the effects of uncertainty
about the current state st on conditional volatilities and distributions.
In the example reported here we generated $T = 2{,}000$ observations from a compound Markov normal mixture model with $n = m_1 = m_2 = 2$, and parameters
$$P = \begin{bmatrix}0.9 & 0.1\\ 0.3 & 0.7\end{bmatrix}, \quad (h_i) = \begin{pmatrix}0.2\\ 1.0\end{pmatrix}, \quad \Phi = \begin{bmatrix}-0.5 & -1.0\\ 1.5 & 3.0\end{bmatrix},$$
$$R = \begin{bmatrix}0.75 & 0.25\\ 0.5 & 0.5\end{bmatrix}, \quad [h_{ij}] = \begin{bmatrix}1.0 & 1.0\\ 1.0 & 1.0\end{bmatrix},$$
$$\Psi_1 = \begin{bmatrix}0.2 & 0.5\\ -0.6 & -1.5\end{bmatrix}, \quad \Psi_2 = \begin{bmatrix}1.0 & 2.5\\ -1.0 & -2.5\end{bmatrix}.$$

We do not report posterior means or other estimates for these parameters, for rea-
sons discussed in Section 4.2. Instead, we concentrate on the posterior distribution
of the population moments described in Section 2.3. Table 3 provides results for a

single data set. The median and the interquartile range in the table are for the popu-
lation moments, not the sample moments. Two interesting points emerge from Table
3. First, interquartile ranges for the posterior moments bracket the population values
in nearly all cases. Second, the sample moments fall outside the interquartile ranges
in nearly all cases. Both are consistent with the proper functioning of the MCMC
algorithm as established in Section 4.4. They also speak to the efficiency (in the clas-
sical statistical sense) of the model in recovering population moments, in comparison
with the corresponding sample moments. The latter point is not surprising, given
the correct specification of the population by the model in this experiment. A more
interesting question is the trade-off between bias and variance when the compound
Markov normal mixture model is misspecified. Examination of this question awaits a
planned Monte Carlo study.
Table 3
Artificial data: Moments in DGP model
                                       Posterior distribution
            Population   Sample     Median     .25      .75
Skewness    1.212        1.199      1.214      1.160    1.273
Kurtosis    5.973        5.472      5.511      5.144    5.887
c1          0.173        0.161      0.175      0.171    0.180
c2          0.104        0.082      0.107      0.103    0.112
c3          0.062        0.044      0.065      0.061    0.070
s1          0.341        0.390      0.345      0.333    0.355
s2          0.204        0.174      0.211      0.205    0.217
s3          0.123        0.054      0.129      0.124    0.135
k1          1.032        0.878      1.012      0.951    1.067
k2          0.619        0.391      0.619      0.592    0.654
k3          0.372        0.279      0.381      0.365    0.396
k4          0.222        0.148      0.234      0.222    0.244
k5          0.134        0.057      0.142      0.133    0.152

Table 4
Artificial data: Marginal likelihoods
        Variable 1                      Variables 1 and 2
m1  m2  ζ   log(ML)             m1  m2  ζ   log(ML)
1   1   1   -3213.7             1   1   1   -6056.5
1   2   1   -2436.1             1   2   1   -4438.8
2   1   1   -2254.4             2   1   1   -4193.1
2   2   1   -2069.5             2   2   1   -2780.2
2   2   0   -2121.2             2   2   0   -3250.9
2   3   1   -2064.8             2   3   1   -2780.4
3   2   1   -2791.9             3   2   1   -2791.9
The ability to distinguish between different variants of the compound Markov
normal mixture model is illustrated in Table 4, using the same data set. The first

three columns of this table indicate the specification of the number of persistent and
transitory states, and whether serial correlation is permitted in the model (ζ = 1) or
not (ζ = 0). The correct specification has m1 = m2 = 2 and ζ = 1. Models that
impose incorrect constraints in the form of too few states, or no serial correlation,
are overwhelmingly rejected by posterior odds for any reasonable prior odds. For
example, in the case of the restriction of no serial correlation and n = 2, the prior
odds would have to be at least exp (3250.9 − 2780.2) in favor of no serial correlation,
for the posterior odds to favor no serial correlation. Models less parsimonious than
the data generating process are favored when only the first variable is used, but the
data generating process itself is favored (albeit narrowly) when both variables are
used. This point will be investigated further in subsequent experiments.

6. An application to foreign exchange returns


We illustrate inference in the compound Markov normal mixture model using foreign
exchange returns for the British pound, Canadian dollar, German Mark, French franc,
Swiss franc, and Japanese yen. These data are daily, extending from October 5, 1993
through October 18, 1997, and consist of quotations as of 10:00 a.m. Eastern Time.
Thus, n = 6 and T = 1,006. The data, assembled by the Federal Reserve Bank of
New York, are available through ftp.ny.frb.org, and were used by Hora in a recent
Ph.D. dissertation [16]. Returns are computed from daily price series through the
transformation $y_{tj} = 100\log(p_{tj}/p_{t-1,j})$.
For each model, results reported in this section are based on 12,000 iterations of
the MCMC simulator. The first 2,000 iterations were discarded and parameters and
functions of interest were recorded every tenth iteration. These functions exhibit very
little serial correlation, even in the most complex models which contain well over 200
parameters. (See Appendix 8.5 for details.) Several chains were simulated for each
model, and the same results were obtained up to simulation noise.
The benchmark prior distribution of Section 3.2 was used in many variants of the
compound Markov normal mixture model. Three features of the model distinguish
these variants: whether serial correlation is permitted (Case II, ζ = 1) or not (Case I,
ζ = 0); the number of persistent states m1 ; and the number of transitory states m2 .
These models can be compared by examining their marginal likelihoods, as dis-
cussed in Section 4.3. Table 5 provides logarithms of marginal likelihoods for 20
models that permit serial correlation, and Table 6 does the same for 36 models in
which serial correlation is prohibited. Recall that when m1 = 1 there is serial inde-
pendence, and hence this column is missing in Table 5.

Table 5
Foreign exchange returns: Model comparison
Serial correlation permitted
Values of log marginal likelihood
       m1:      2        3        4        5
m2:
1            -1970.9  -1953.0  -1946.5  -1952.5
2            -1932.8  -1915.5  -1920.6  -1923.2
3            -1922.5  -1911.9  -1924.5  -1913.5
4            -1912.0  -1913.5  -1899.0  -1923.3
5            -1914.7  -1919.4  -1923.9  -1926.5

Table 6
Foreign exchange returns: Model comparison
Serial correlation prohibited
Values of log marginal likelihood
       m1:      1        2        3        4        5        6
m2:
1            -2642.4  -1946.8  -1905.8  -1902.0  -1902.6  -1909.1
2            -2025.8  -1908.2  -1872.9  -1856.5  -1860.8  -1855.5
3            -2019.4  -1898.5  -1871.0  -1855.4  -1859.5  -1853.6
4            -2009.5  -1892.1  -1856.7  -1865.7  -1849.1  -1846.6
5            -2010.8  -1885.4  -1861.5  -1849.8  -1852.2  -1867.6
6            -2012.4  -1886.1  -1858.5  -1858.2  -1854.2  -1855.6

Normal mixture models are the cells with m1 = 1 in Table 6. Markov normal
mixture models correspond to the cells with m2 = 1 in both tables. Compound
Markov normal mixture models are the cells with m1 > 1 and m2 > 1 in both tables.
The odds against the i.i.d. Gaussian model are overwhelming, at least 10^268 : 1.
Odds ratios strongly favor models without serial correlation, as a cell-by-cell comparison of the two tables quickly shows. However, the evidence of other forms of serial dependence is also strong. This can be seen by comparing the log marginal likelihoods in Table 6 for m1 = 1 and m2 = j (a serially independent mixture of j normals) with m1 = j and m2 = 1 (a serially uncorrelated, serially dependent mixture of j normals) for the same values of j = 2, 3, 4, 5, 6. For example, when j = 2 the odds ratio in favor of serial dependence is 2 × 10^34 : 1. For larger values of j the ratio is higher. These three findings are entirely expected.
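As an arithmetic check on how such odds ratios follow from the tables, the short sketch below reproduces the j = 2 comparison using the two log marginal likelihoods taken directly from Table 6; the function name is ours:

```python
import math

def odds_ratio(log_ml_a: float, log_ml_b: float) -> float:
    """Posterior odds in favor of model A over model B under equal prior odds:
    the exponential of the difference in log marginal likelihoods."""
    return math.exp(log_ml_a - log_ml_b)

# j = 2: serially dependent but uncorrelated mixture (m1 = 2, m2 = 1) versus a
# serially independent mixture (m1 = 1, m2 = 2); values from Table 6.
print(odds_ratio(-1946.8, -2025.8))   # roughly 2e34, as reported in the text
```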
The evidence against the conventional Markov mixture model is also strong. The odds ratio in favor of a compound normal mixture model with two transitory states (m2 = 2) and the same number of persistent states is on the order of 10^17 : 1 in all cases. Introducing more than two transitory states produces a further increase in the marginal likelihood in most cases, but it is never this dramatic: odds ratios are at most on the order of 10^8 : 1.
The evidence on the number of persistent states is less regular. In models permitting serial correlation (Table 5) the large gains are realized with just two persistent states: odds in favor of additional persistent states never surpass about 10^9 : 1 for any value of m2 > 1. Without serial correlation (Table 6) the same can be said for three persistent states. However, the most favored model with serial correlation (m1 = m2 = 4) enjoys a posterior odds ratio of 2 × 10^14 : 1 against the model with two persistent and two transitory states. Without serial correlation the most favored model has 6 persistent and 4 transitory states, and the posterior odds against m1 = 3, m2 = 2 are 3 × 10^11 : 1.
The question of which model should be used, or whether models should be averaged, has a number of practical dimensions that we are still exploring. One consideration is computation time. For example, m1 = m2 = 2, ζ = 1 requires about 35 minutes on a state-of-the-art Intel processor for 12,000 iterations, whereas m1 = 6, m2 = 4, ζ = 0 requires about twice as much time. In no case is serial correlation in the MCMC algorithm substantial.
A second consideration is whether smaller models produce quite similar results
to larger models with favorable posterior odds, in the ultimate applications for these
models discussed in Section 1. This question is currently under investigation. In
any event, we find it striking that compound normal mixture models with such rich
structures are supported by the evidence and are practical — the largest model in
Table 6 has 272 parameters, and the model with the largest marginal likelihood has
177 parameters.
While the function of the compound Markov normal mixture model in application
to financial returns is only to provide a flexible vehicle for changing conditional distri-
butions, it is nonetheless interesting to examine some parameter values. For reasons
discussed in [4], it is not possible to construct posterior distributions for state-specific
parameters. However, examination of parameter values at representative points in
the posterior can be illuminating. For example, the point with the highest posterior
density evaluation in the case m1 = m2 = 3, ζ = 0 had parameter values
   
$$
P = \begin{bmatrix} .984 & .010 & .005 \\ .003 & .914 & .084 \\ .050 & .212 & .737 \end{bmatrix}, \qquad
\pi = \begin{bmatrix} .486 \\ .381 \\ .132 \end{bmatrix}, \qquad
\left[ h_i^{-1/2} \right] = \begin{pmatrix} .288 & .380 & .433 \end{pmatrix},
$$
$$
R = \begin{bmatrix} .537 & .428 & .035 \\ .957 & .028 & .015 \\ .895 & .054 & .051 \end{bmatrix}, \qquad
\left[ h_{ij}^{-1/2} \right] = \begin{bmatrix} .503 & .812 & .714 \\ .625 & .613 & .587 \\ .939 & .674 & .889 \end{bmatrix},
$$
$$
\Psi_1 = \begin{bmatrix} .004 & .022 & -.032 & -.041 & -.008 & .040 \\ -.064 & -.044 & .137 & .154 & .215 & .008 \\ .772 & .207 & -1.185 & -1.238 & -.141 & -.701 \end{bmatrix},
$$
$$
\Psi_2 = \begin{bmatrix} -.003 & .032 & -.008 & -.013 & -.008 & -.036 \\ .210 & -.946 & -.041 & .066 & -.102 & .099 \\ -.208 & -.266 & .597 & .730 & .745 & 2.148 \end{bmatrix},
$$
$$
\Psi_3 = \begin{bmatrix} -.002 & -.005 & -.015 & .004 & .001 & .009 \\ .165 & -.061 & .852 & -.341 & -.168 & -.140 \\ -.146 & .156 & -.636 & .186 & .165 & .001 \end{bmatrix}.
$$

(Both transitory and persistent states have been ordered by decreasing state probability.) Observe that each state has significant probability under the stationary dis-
tribution π. Variances are similar in each state, but there is less divergence in means
across returns in the first persistent state than in the other two, and overall there is
greater divergence in means in the third persistent state than in the second.
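As an illustration of the relation between P and π reported above, the sketch below (our code, using the displayed values rounded to three digits and the usual convention that the rows of P sum to one) checks numerically that π is the stationary distribution, π′P = π′:

```python
import numpy as np

P = np.array([[.984, .010, .005],
              [.003, .914, .084],
              [.050, .212, .737]])
pi = np.array([.486, .381, .132])

# pi' P should reproduce pi' up to rounding in the reported digits.
print(pi @ P)                                # approximately [0.486, 0.381, 0.132]
print(np.allclose(pi @ P, pi, atol=1e-2))    # True at this precision
```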
The posterior distribution on parameters induces a posterior distribution on the
summary moments set forth in Section 2.3. Table 7 provides posterior medians and
the interquartile range for some of these moments for the model with the greatest
marginal likelihood when serial correlation is permitted (m1 = m2 = 4), and Table 8
does the same when serial correlation is absent (m1 = 6, m2 = 4). The columns headed
“population” indicate the posterior distribution of population moments. The columns
headed “sample” answer the following question: in the context of this compound
Markov normal mixture model, if another, independent data set of the same size were
to be drawn, what would be the computed sample moments? A dash indicates that a
population moment is identically zero.

Table 7
Foreign exchange returns: Moments (m1 = m2 = 4)
Serial correlation permitted
Posterior quantiles for

                     Population                Sample
Moment      Data    .25    .50    .75       .25    .50    .75
Skewness   −.110  −.214  −.127  −.059     −.220  −.118  −.022
Kurtosis   5.583  6.020  6.915  8.416     5.310  6.626  7.543
c1         −.017   .004   .008   .012     −.005   .006   .017
c2          .004   .004   .005   .007     −.004   .005   .015
c3         −.028   .002   .003   .005     −.007   .002   .012
c4          .009   .002   .003   .004     −.007   .002   .011
c5          .015   .001   .002   .003     −.008   .001   .011
s1          .005  −.033  −.022  −.011     −.048  −.017   .011
s2          .024  −.022  −.015  −.008     −.040  −.013   .014
s3          .054  −.016  −.011  −.006     −.034  −.009   .014
s4          .032  −.013  −.009  −.004     −.029  −.006   .016
s5         −.036  −.011  −.007   .004     −.020  −.006   .016
s−1        −.102  −.034  −.023  −.013     −.051  −.023   .006
s−2         .002  −.022  −.016  −.009     −.042  −.017   .012
s−3        −.054  −.016  −.011  −.002     −.030  −.009   .014
s−4         .051  −.013  −.009  −.005     −.030  −.001   .013
s−5        −.026  −.011  −.007  −.004     −.028  −.001   .011
k1          .427   .305   .357   .417      .254   .320   .396
k2          .164   .222   .264   .312      .173   .235   .299
k3          .183   .174   .210   .251      .124   .181   .242
k4          .219   .143   .175   .212      .101   .150   .206
k5          .221   .124   .153   .188      .093   .125   .179

The results convey the fact that autocorrelation is negligible, consistent with the formal evidence favoring absence of serial correlation. Models with serial correlation also permit autoskewness and hence leverage effects, whereas models without serial correlation do not.
The population distributions for dynamic moments (cu, su, ku) are broadly consistent with sample moments, with the possible and notable exception of s−1. This is the one-day leverage effect, linking negative returns in one day with volatile returns the next. The sample moment is −.102. The centered 90% credible region for the population moment (not shown in Table 7) is (−.054, .001) in the model that permits serial correlation, and the 90% credible region for the sample moment is (−.106, .056). The substantially greater width of the second region reflects the sampling variation in the sample moment s−1 in the context of the compound Markov normal mixture model. For the model that prohibits serial correlation the 90% credible region for the sample moment is (−.084, .075). Thus neither model is glaringly at odds with the size of this moment, which can be ascribed in substantial part to sampling variation.

Table 8
Foreign exchange returns: Moments (m1 = 6, m2 = 4, ζ = 0)
Serial correlation prohibited
Posterior quantiles for

                     Population                Sample
Moment      Data    .25    .50    .75       .25    .50    .75
Skewness   −.110  −.138  −.061   .128     −.166  −.058   .047
Kurtosis   5.583  7.151  8.917 13.382     6.231  7.525  9.789
c1         −.017     –      –      –      −.012  −.001   .009
c2          .004     –      –      –      −.011  −.012   .009
c3         −.028     –      –      –      −.010  −.001   .009
c4          .009     –      –      –      −.011  −.001   .008
c5          .015     –      –      –      −.013   .001   .012
s1          .005     –      –      –      −.032   .001   .032
s2          .024     –      –      –      −.028   .002   .025
s3          .054     –      –      –      −.026  −.002   .026
s4          .032     –      –      –      −.025  −.001   .024
s5         −.036     –      –      –      −.023   .001   .023
s−1        −.102     –      –      –      −.031  −.002   .031
s−2         .002     –      –      –      −.027   .000   .028
s−3        −.054     –      –      –      −.027   .000   .023
s−4         .051     –      –      –      −.024  −.001   .026
s−5        −.026     –      –      –      −.025   .000   .023
k1          .427   .320   .363   .428      .263   .331   .405
k2          .164   .232   .271   .316      .184   .243   .312
k3          .183   .179   .212   .251      .132   .183   .252
k4          .219   .145   .173   .207      .099   .146   .205
k5          .221   .119   .145   .175      .071   .121   .183

The posterior distributions of skewness and kurtosis are dissimilar in the two models. In Table 7, interquartile ranges are considerably shorter than in Table 8. It
appears that the model with six persistent states (Table 8) includes states with high
variances but low probabilities, leading to the possibility of high kurtosis and skew-
ness. This is supported by the fact that for sample moments the interquartile range
is smaller than for population moments: in over half of the samples, states with high
variance do not occur.
As in Table 7, the columns headed “population” give the posterior distribution of population moments, and the columns headed “sample” give the distribution of the sample moments that would be computed from another, independent data set of the same size.

7. Conclusion
In this paper we generalize the Markov mixture model by permitting each component
of the mixture to be itself a mixture of Gaussian distributions. The resulting model
provides a flexible characterization of stationary multivariate time series that incor-
porates serial correlation, persistence in conditional variance and higher conditional
moments, and an arbitrarily good (L1) approximation to unconditional distributions.
Constraints may also be imposed that eliminate serial correlation while retaining serial
dependence. The paper illustrates the model using foreign exchange returns.
Other multivariate applications will be included in future versions of the paper.

8. Appendices
8.1. Appendix: Proof of Theorem 2
The instantaneous variance matrix is obtained immediately:
$$
\Gamma_0^{(p)} = E\left[ \left( z_t^{(p)} - \mu^{*(p)} \right)\left( z_t^{(p)} - \mu^{*(p)} \right)' \right]
= E\left( z_t^{(p)} z_t^{(p)\prime} \right) - \mu^{*(p)} \mu^{*(p)\prime}
$$
$$
= \sum_{j=1}^{m} \pi_j \, E\left( z_t^{(p)} z_t^{(p)\prime} \mid s_t = j \right) - \mu^{*(p)} \mu^{*(p)\prime}
= \sum_{j=1}^{m} \pi_j \left( R_j^{(p)} + \mu_j^{(p)} \mu_j^{(p)\prime} \right) - \mu^{*(p)} \mu^{*(p)\prime}.
$$
The dynamic covariance matrices are obtained by conditioning on $s_t$ and $s_{t-u}$, exploiting serial independence of observables after conditioning on the states, and then marginalizing out the states:
$$
\Gamma_u^{(p)} = \mathrm{cov}\left( z_t^{(p)}, z_{t-u}^{(p)} \right)
= E\left( z_t^{(p)} z_{t-u}^{(p)\prime} \right) - \mu^{*(p)} \mu^{*(p)\prime}
$$
$$
= \sum_{j=1}^{m} \sum_{i=1}^{m} E\left( z_t^{(p)} z_{t-u}^{(p)\prime} \mid s_t = j, s_{t-u} = i \right) \left[ P^u \right]_{ij} \pi_i - M^{(p)} \pi \pi' M^{(p)\prime}
$$
$$
= \sum_{j=1}^{m} \sum_{i=1}^{m} E\left( z_t^{(p)} \mid s_t = j \right) E\left( z_{t-u}^{(p)\prime} \mid s_{t-u} = i \right) \left[ P^u \right]_{ij} \pi_i - M^{(p)} \pi \pi' M^{(p)\prime}
$$
$$
= \sum_{j=1}^{m} \sum_{i=1}^{m} \mu_j^{(p)} \mu_i^{(p)\prime} \left[ P^u \right]_{ij} \pi_i - \mu^{*(p)} e_m' \Pi M^{(p)\prime}
= M^{(p)} B_u \Pi M^{(p)\prime},
$$
where $B_u = \left( P - e_m \pi' \right)^u = P^u - e_m \pi'$.
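A minimal numerical sketch of these two expressions, written directly from the formulas above as stated, assuming the state means, within-state variance matrices, transition matrix, and stationary probabilities are available as NumPy arrays (all names are ours):

```python
import numpy as np

def gamma0(mu, R, pi):
    """Instantaneous variance: sum_j pi_j (R_j + mu_j mu_j') - mu* mu*'."""
    mu_star = sum(p * m for p, m in zip(pi, mu))
    return sum(p * (Rj + np.outer(m, m)) for p, Rj, m in zip(pi, R, mu)) \
        - np.outer(mu_star, mu_star)

def gamma_u(mu, P, pi, u):
    """Lag-u autocovariance: M B_u Pi M' with B_u = P**u - e_m pi'."""
    M = np.column_stack(mu)                       # np x m matrix of state means
    m = len(pi)
    B_u = np.linalg.matrix_power(P, u) - np.outer(np.ones(m), pi)
    return M @ B_u @ np.diag(pi) @ M.T
```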

8.2. Appendix: Proof of Theorem 3


Adopt the notation in the proof of Theorem 1, and set
$$
Q = \left[ q_1, q_2, \ldots, q_m \right]' = \left[ \pi, q_2, \ldots, q_m \right]', \qquad
Q^{-1} = \left[ q^1, q^2, \ldots, q^m \right] = \left[ e_m, q^2, \ldots, q^m \right].
$$
Exploiting the spectral decomposition of $P$, $B_u = \sum_{j=2}^{m} \lambda_j^u \, q^j q_j'$. Substituting in (2.18),
$$
\Gamma_u^{(p)} = \sum_{j=2}^{m} \lambda_j^u \, M^{(p)} q^j q_j' M^{(p)\prime}
= \sum_{j=2}^{r+1} \lambda_j^u \, A_j' \qquad (u = 1, 2, 3, \ldots)
$$
where
$$
A_j' = \sum_{h \in H_j} M^{(p)} q^h q_h' M^{(p)\prime}, \qquad
H_j = \left\{ h : q_h' P = \lambda_j q_h', \; M^{(p)} q^h \neq 0 \right\}.
$$
Observe that $r$ is the number of distinct eigenvalues of $P$ with modulus in the open unit interval associated with at least one column of $Q^{-1}$ not in the column null space of $M^{(p)}$. In other words, $r$ can be less than $m - 1$ because some eigenvalues are equal to zero (as in the compound Markov model interpreted as having $m = m_1 m_2$ states), because some eigenvalues are repeated, or because some eigenvalues are associated with columns of $Q^{-1}$ all in the column null space of $M^{(p)}$.

Define now a stochastic process $v_t^{(p)}$ with autocovariances $\tilde{\Gamma}_u^{(p)} = \sum_{j=2}^{r+1} \lambda_j^u A_j'$ $(u > 0)$ and $\tilde{\Gamma}_0^{(p)} = \sum_{j=2}^{r+1} A_j'$. Then for $u > 0$, $\tilde{\Gamma}_u^{(p)} = \Gamma_u^{(p)}$, while
$$
\tilde{\Gamma}_0^{(p)} = \sum_{j=2}^{r+1} A_j' = \sum_{j=1}^{m} \pi_j \, \mu_j^{(p)} \mu_j^{(p)\prime} - \mu^{*(p)} \mu^{*(p)\prime}.
$$
Notice that the matrix $\Gamma_0^{(p)} - \tilde{\Gamma}_0^{(p)} = \sum_{j=1}^{m} \pi_j R_j^{(p)}$ is positive (semi)definite, since each $R_j^{(p)}$ is a variance matrix (2.15).

Given that there are $r$ distinct eigenvalues of $P$, $\lambda_2, \ldots, \lambda_{r+1}$, with modulus in the open unit interval, contributing to the determination of $\Gamma_u^{(p)} = \tilde{\Gamma}_u^{(p)}$, there exists a unique set of constants $\alpha_1, \ldots, \alpha_r$ such that
$$
\lambda_j^{r} - \sum_{i=1}^{r} \alpha_i \lambda_j^{r-i} = 0 \qquad (j = 2, \ldots, r+1).
$$
The coefficients $\alpha_1, \ldots, \alpha_r$ determine a degree $r$ polynomial whose roots are $\lambda_2^{-1}, \ldots, \lambda_{r+1}^{-1}$. Thus for all $u > r$,
$$
\tilde{\Gamma}_u^{(p)} - \sum_{i=1}^{r} \alpha_i \tilde{\Gamma}_{u-i}^{(p)}
= \sum_{j=2}^{r+1} \lambda_j^u A_j' - \sum_{i=1}^{r} \alpha_i \sum_{j=2}^{r+1} \lambda_j^{u-i} A_j'
= \sum_{j=2}^{r+1} \left( \lambda_j^u - \sum_{i=1}^{r} \alpha_i \lambda_j^{u-i} \right) A_j' = 0.
$$
The autocovariance function of $\left\{ v_t^{(p)} \right\}$ therefore satisfies the Yule-Walker equations for a VAR($r$) process with coefficient matrices $\alpha_i I_{np}$ $(i = 1, \ldots, r)$.
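Numerically, the constants α_1, ..., α_r can be recovered from the eigenvalues by solving the linear system in the first display above. The sketch below does this for real-valued placeholder eigenvalues (complex conjugate pairs would require complex arithmetic); the names are ours:

```python
import numpy as np

def yule_walker_alphas(lams):
    """Solve lam_j**r = sum_i alpha_i * lam_j**(r - i), i = 1..r, for alpha."""
    lams = np.asarray(lams, dtype=float)
    r = lams.size
    # Row j: [lam_j**(r-1), lam_j**(r-2), ..., 1]; distinct eigenvalues assumed.
    V = np.vander(lams, N=r, increasing=False)
    rhs = lams ** r
    return np.linalg.solve(V, rhs)

# Placeholder eigenvalues with modulus inside the unit interval.
print(yule_walker_alphas([0.9, 0.4, -0.2]))
```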

8.3. Appendix: The MCMC algorithm in detail


The kernel of the prior density is the product of the following expressions.
$$
p(\beta) \propto \exp\left[ -\left( \beta - \underline{\beta} \right)' \underline{H}_\beta \left( \beta - \underline{\beta} \right) / 2 \right] \tag{8.1}
$$
$$
p(p_i) \propto \prod_{j=1}^{m_1} p_{ij}^{\,r_1 - 1} \qquad (i = 1, \ldots, m_1) \tag{8.2}
$$
$$
p(\rho_i) \propto \prod_{j=1}^{m_2} \rho_{ij}^{\,r_2 - 1} \qquad (i = 1, \ldots, m_1) \tag{8.3}
$$
$$
p(H) \propto |H|^{(\underline{\nu} - n - 1)/2} \exp\left[ -\mathrm{tr}\left( \underline{S} H \right) / 2 \right] \tag{8.4}
$$
$$
p(h_i) \propto h_i^{(\nu_1 - 1)/2} \exp\left( -s_1^2 h_i / 2 \right) \qquad (i = 1, \ldots, m_1) \tag{8.5}
$$
$$
p(h_{ij}) \propto h_{ij}^{(\nu_2 - 1)/2} \exp\left( -s_2^2 h_{ij} / 2 \right) \qquad (i = 1, \ldots, m_1; \; j = 1, \ldots, m_2) \tag{8.6}
$$
$$
p\left( \tilde{\phi} \mid H \right) \propto |H|^{(m_1 - 1)/2} \exp\left[ -h_\phi \tilde{\phi}' \left( I_{m_1 - 1} \otimes H \right) \tilde{\phi} / 2 \right] \tag{8.7}
$$
$$
= |H|^{(m_1 - 1)/2} \exp\left[ -h_\phi \sum_{i=1}^{m_1 - 1} \tilde{\phi}_i' H \tilde{\phi}_i / 2 \right] \tag{8.8}
$$
$$
p\left( \tilde{\psi}_i \mid h_i, H \right) \propto h_i^{n(m_2 - 1)/2} |H|^{(m_2 - 1)/2}
\exp\left[ -h_\psi h_i \tilde{\psi}_i' \left( I_{m_2 - 1} \otimes H \right) \tilde{\psi}_i / 2 \right]
\qquad (i = 1, \ldots, m_1)
$$
$$
p\left( \tilde{\psi} \mid h_1, \ldots, h_{m_1}, H \right)
\propto |H|^{m_1 (m_2 - 1)/2} \prod_{i=1}^{m_1} h_i^{n(m_2 - 1)/2}
\exp\left[ -h_\psi \sum_{i=1}^{m_1} h_i \tilde{\psi}_i' \left( I_{m_2 - 1} \otimes H \right) \tilde{\psi}_i / 2 \right] \tag{8.9}
$$
$$
= |H|^{m_1 (m_2 - 1)/2} \prod_{i=1}^{m_1} h_i^{n(m_2 - 1)/2}
\exp\left[ -h_\psi \sum_{i=1}^{m_1} h_i \sum_{j=1}^{m_2 - 1} \tilde{\psi}_{ij}' H \tilde{\psi}_{ij} / 2 \right] \tag{8.10}
$$
$$
= |H|^{m_1 (m_2 - 1)/2} \prod_{i=1}^{m_1} h_i^{n(m_2 - 1)/2}
\exp\left\{ -h_\psi \tilde{\psi}' \left[ \mathrm{diag}\left( h_1, \ldots, h_{m_1} \right) \otimes I_{m_2 - 1} \otimes H \right] \tilde{\psi} / 2 \right\} \tag{8.11}
$$
1 2

Conditional posterior distribution of H. From (8.4), (8.8), (8.10) and (2.5),
$$
H \sim W\left( \overline{S}, \overline{\nu} \right);
$$
$$
\overline{S} = \underline{S} + \zeta h_\phi \sum_{i=1}^{m_1 - 1} \tilde{\phi}_i \tilde{\phi}_i'
+ h_\psi \sum_{i=1}^{m_1} h_i \sum_{j=1}^{m_2 - 1} \tilde{\psi}_{ij} \tilde{\psi}_{ij}'
+ \sum_{t=1}^{T} \delta_t \varepsilon_t \varepsilon_t'
= \underline{S} + \zeta h_\phi \tilde{\Phi}' \tilde{\Phi}
+ h_\psi \sum_{i=1}^{m_1} h_i \tilde{\Psi}_i' \tilde{\Psi}_i
+ \sum_{t=1}^{T} \delta_t \varepsilon_t \varepsilon_t',
$$
$$
\overline{\nu} = \underline{\nu} + \zeta (m_1 - 1) + m_1 (m_2 - 1) + T.
$$
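A sketch of this updating step, assuming the ingredients of S̄ are held in NumPy arrays with names of our choosing; note that for a kernel of the form |H|^((ν−n−1)/2) exp(−tr(S̄H)/2), the scipy Wishart parameterization takes scale S̄^(−1):

```python
import numpy as np
from scipy.stats import wishart

def draw_H(S_prior, nu_prior, Phi, h_phi, Psi_list, h_list, h_psi,
           resid, delta, zeta, m1, m2):
    """One Gibbs draw of the precision matrix H from its conditional posterior.
    Phi is (m1-1) x n with rows phi_i'; each Psi_i is (m2-1) x n with rows
    psi_ij'; resid is T x n; delta is a length-T weight/indicator vector."""
    T, n = resid.shape
    S_bar = S_prior.copy()
    S_bar += zeta * h_phi * Phi.T @ Phi                # persistent-state means
    for h_i, Psi_i in zip(h_list, Psi_list):           # transitory-state means
        S_bar += h_psi * h_i * Psi_i.T @ Psi_i
    S_bar += (resid * delta[:, None]).T @ resid        # weighted residual outer products
    nu_bar = nu_prior + zeta * (m1 - 1) + m1 * (m2 - 1) + T
    return wishart.rvs(df=nu_bar, scale=np.linalg.inv(S_bar))
```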

Conditional posterior distribution of the $h_i$. From (8.5), (8.10), and (2.5),
$$
\overline{s}_i^2 h_i \sim \chi^2\left( \overline{\nu}_i \right);
$$
$$
\overline{s}_i^2 = s_1^2 + h_\psi \sum_{j=1}^{m_2 - 1} \tilde{\psi}_{ij}' H \tilde{\psi}_{ij}
+ \sum_{j=1}^{m_2} h_{ij} \sum_{t : s_t = (i,j)} \varepsilon_t' H \varepsilon_t
= s_1^2 + h_\psi \,\mathrm{tr}\left( \tilde{\Psi}_i H \tilde{\Psi}_i' \right)
+ \sum_{j=1}^{m_2} h_{ij} \sum_{t : s_t = (i,j)} \varepsilon_t' H \varepsilon_t,
$$
$$
\overline{\nu}_i = \nu_1 + n (m_2 - 1) + n T_i.
$$

Conditional posterior distribution of the $h_{ij}$. From (8.6) and (2.5),
$$
\overline{s}_{ij}^2 h_{ij} \sim \chi^2\left( \overline{\nu}_{ij} \right); \qquad
\overline{s}_{ij}^2 = s_2^2 + h_i \sum_{t : s_t = (i,j)} \varepsilon_t' H \varepsilon_t, \qquad
\overline{\nu}_{ij} = \nu_2 + n U_{ij}.
$$
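Both scalar-precision updates have the same form, s̄²h ∼ χ²(ν̄), so a single helper (ours) covers the h_i and the h_ij draws:

```python
import numpy as np

rng = np.random.default_rng()

def draw_scaled_precision(s2_bar: float, nu_bar: float) -> float:
    """Draw h given that s2_bar * h ~ chi-square(nu_bar)."""
    return rng.chisquare(nu_bar) / s2_bar

# Example with placeholder values of the posterior quantities.
h_ij = draw_scaled_precision(s2_bar=3.7, nu_bar=25.0)
```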

Conditional posterior distribution of P. From (8.2), (2.5), and (2.2),
$$
p(P) \propto \pi_{s_{11}} \prod_{i=1}^{m_1} \prod_{j=1}^{m_1} p_{ij}^{\,r_1 + T_{ij} - 1}
\exp\left( -\sum_{t=1}^{T} \delta_t \varepsilon_t' H \varepsilon_t / 2 \right).
$$
Because $\varepsilon_t = y_t - B' x_t - \psi_{s_t} - \left( z_t^{1\prime} C_0 \otimes I_n \right) \tilde{\phi}$, in Case II $C_0$ is a function of $\pi$ and therefore of $P$. Proceeding one row at a time through $P$, draw the candidate $p_i^*$ from $\mathrm{Beta}\left( r_1 + T_{i1}, \ldots, r_1 + T_{i m_1} \right)$, and let $C_0^*$ be the orthonormal complement of $\pi^*$ corresponding to the resulting $P^*$. For Case II, compute $\varepsilon_t^* = y_t - B' x_t - \psi_{s_t} - \left( z_t^{1\prime} C_0^* \otimes I_n \right) \tilde{\phi}$ and $\varepsilon_t^- = y_t - B' x_t - \psi_{s_t} - \left( z_t^{1\prime} C_0^- \otimes I_n \right) \tilde{\phi}$. (The matrix $C_0^-$ is the orthonormal complement of $\pi$ corresponding to the accepted candidate matrix $P$ after the draw for the immediately preceding row of $P$.) The Metropolis acceptance ratio is
$$
\frac{\pi_{s_{11}}^* \exp\left( -\zeta \sum_{t=1}^{T} \delta_t \varepsilon_t^{*\prime} H \varepsilon_t^* / 2 \right)}
{\pi_{s_{11}}^- \exp\left( -\zeta \sum_{t=1}^{T} \delta_t \varepsilon_t^{-\prime} H \varepsilon_t^- / 2 \right)};
$$
in Case I ($\zeta = 0$) this reduces to $\pi_{s_{11}}^* / \pi_{s_{11}}^-$.

The orthonormal complement $C_0$ of $\pi$ is not unique. As discussed in Section 4.1, nothing substantive in the model depends on which $C_0$ is used. However, if $C_0$ is not a smooth function of $\pi$ then the candidate will be rejected more often than if it is, because $C_0 \tilde{\Phi}$ will change more. Appendix 8.3.1 details an algorithm for constructing $C_0$ that is smooth in $\pi$. For parameter values that are implausible under the posterior, acceptance probabilities in the Metropolis step are low. We have found that this problem is avoided by accepting all candidates in the early iterations of the algorithm. (Of course, these iterations must be discarded in approximating posterior moments.) In the current implementation the rejection step is skipped in the first 1,000 iterations.
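In outline, the candidate-and-accept logic for one row of P might look like the sketch below; `log_ratio_fn` stands in for the logarithm of the acceptance ratio displayed above, the Dirichlet draw is the multivariate Beta used for the candidate row, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng()

def metropolis_row_update(counts, r1, log_ratio_fn, current_row, iteration,
                          burn_in_accept_all=1000):
    """Propose a new row of P from the Dirichlet implied by the prior and the
    transition counts, then accept or reject using the displayed ratio.
    During the first `burn_in_accept_all` iterations every candidate is kept."""
    candidate = rng.dirichlet(r1 + counts)
    log_ratio = log_ratio_fn(candidate, current_row)   # log of the ratio above
    if iteration < burn_in_accept_all or np.log(rng.uniform()) < log_ratio:
        return candidate
    return current_row
```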
Conditional posterior distribution of R. From (8.3), (2.5), and (2.3),
$$
p\left( \rho_j \right) \propto \prod_{k=1}^{m_2} \rho_{jk}^{\,r_2 + U_{jk} - 1}
\exp\left( -\sum_{t : s_{t1} = j} \delta_t \varepsilon_t' H \varepsilon_t / 2 \right),
$$
where in $\varepsilon_t = y_t - B' x_t - \phi_{s_t} - \left( z_t' C \otimes I_n \right) \tilde{\psi}$, if $s_{t1} = j$, then $C_j$ is a function of $\rho_j$. Draw the candidate $\rho_j^*$ from $\mathrm{Beta}\left( r_2 + U_{j1}, \ldots, r_2 + U_{j,m_2} \right)$. Let $C_j^*$ be the orthonormal complement of $\rho_j^*$. (We continue to use the algorithm of Appendix 8.3.1.) For all $t$ for which $s_{t1} = j$, compute $\varepsilon_t^* = y_t - B' x_t - \phi_{s_t} - \left( z_t' C^* \otimes I_n \right) \tilde{\psi}$ and $\varepsilon_t^- = y_t - B' x_t - \phi_{s_t} - \left( z_t' C^- \otimes I_n \right) \tilde{\psi}$, where $C^-$ is the value from the previous iteration. The Metropolis acceptance ratio is
$$
\frac{\exp\left( -\sum_{t : s_{t1} = j} \delta_t \varepsilon_t^{*\prime} H \varepsilon_t^* / 2 \right)}
{\exp\left( -\sum_{t : s_{t1} = j} \delta_t \varepsilon_t^{-\prime} H \varepsilon_t^- / 2 \right)}.
$$

The Metropolis step is used only after the first 1,000 iterations.

Conditional posterior distribution of $\gamma$. Recall that $y_t = W_t' \gamma + \varepsilon_t$, with $\gamma' = \left( \beta', \tilde{\phi}', \tilde{\psi}' \right)$ and
$$
W_t' = \left[ x_t' \otimes I_n \quad z_t^{1\prime} C_0 \otimes I_n \quad z_t' C \otimes I_n \right]. \tag{8.12}
$$
From (8.1), (8.7), (8.11) and (2.10),
$$
\gamma \sim N\left( \overline{\gamma}, \overline{H}_\gamma^{-1} \right), \qquad
\overline{H}_\gamma = \sum_{t=1}^{T} \delta_t W_t H W_t' + \underline{H}_\gamma;
$$
for Case II,
$$
\underline{H}_\gamma = \begin{bmatrix} \underline{H}_\beta & 0 & 0 \\ 0 & \underline{H}_\phi & 0 \\ 0 & 0 & \underline{H}_\psi \end{bmatrix},
$$
$$
\underline{H}_\phi = h_\phi I_{m_1 - 1} \otimes H = h_\phi \,\mathrm{Blockdiag}\left( H, \ldots, H \right),
$$
$$
\underline{H}_\psi = h_\psi \,\mathrm{Diag}\left( h_1, \ldots, h_{m_1} \right) \otimes I_{m_2 - 1} \otimes H
= h_\psi \,\mathrm{Blockdiag}\left( h_1 I_{m_2 - 1} \otimes H, \ldots, h_{m_1} I_{m_2 - 1} \otimes H \right);
$$
$$
\overline{\gamma} = \overline{H}_\gamma^{-1} \overline{c}_\gamma, \qquad
\overline{c}_\gamma = \sum_{t=1}^{T} \delta_t W_t H y_t + \underline{c}_\gamma; \qquad
\underline{c}_\gamma' = \left( \underline{\beta}' \underline{H}_\beta, 0', 0' \right).
$$
Case I proceeds the same way except that the elements of $\gamma$ corresponding to $\tilde{\phi}$ are omitted.
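Given the accumulated posterior precision H̄_γ and location c̄_γ, the draw of γ is a standard normal-from-precision step; a sketch using a Cholesky factorization (names ours):

```python
import numpy as np

rng = np.random.default_rng()

def draw_gamma(H_bar: np.ndarray, c_bar: np.ndarray) -> np.ndarray:
    """Draw gamma ~ N(gamma_bar, H_bar^{-1}) with gamma_bar = H_bar^{-1} c_bar,
    without forming the inverse explicitly."""
    L = np.linalg.cholesky(H_bar)                 # H_bar = L L'
    gamma_bar = np.linalg.solve(H_bar, c_bar)     # posterior mean
    # If L L' = H_bar, then x = (L')^{-1} z with z ~ N(0, I) has variance H_bar^{-1}.
    z = rng.standard_normal(c_bar.shape[0])
    return gamma_bar + np.linalg.solve(L.T, z)
```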
To draw from the conditional posterior for $\gamma$ we must compute the moment matrix
$$
\begin{bmatrix}
\underline{H}_\gamma + \sum_{t=1}^{T} \delta_t W_t H W_t' & \underline{c}_\gamma + \sum_{t=1}^{T} \delta_t W_t H y_t \\
\underline{c}_\gamma' + \sum_{t=1}^{T} \delta_t y_t' H W_t' & \sum_{t=1}^{T} \delta_t y_t' H y_t
\end{bmatrix}.
$$

Let $Z_t = \left[ W_t \mid y_t \right]$. The key computation is to accumulate the matrix $\sum_{t=1}^{T} \delta_t Z_t H Z_t'$ efficiently, given that most entries in $Z_t$ are zero. Appendix 8.3.2 provides such an algorithm. Here, we show how to use this algorithm.
For each observation t, denote st = (u, v) to reduce clutter in the notation. Then
from (8.12),
 
$$
Z_t = \begin{bmatrix}
x_t \otimes I_n \\
c_{(0)u\cdot}' \otimes I_n \\
0_A \\
c_{(u)v\cdot}' \otimes I_n \\
0_B \\
y_t'
\end{bmatrix}.
$$

In this matrix, $c_{(0)u\cdot}$ is the $1 \times (m_1 - 1)$ row $u$ of $C_{(0)}$; $c_{(u)v\cdot}$ is the $1 \times (m_2 - 1)$ row $v$ of $C_{(u)}$; $0_A$ is a $(u - 1)\, n (m_2 - 1) \times n$ matrix of zeros; and $0_B$ is a $(m_1 - u)\, n (m_2 - 1) \times n$ matrix of zeros. The entire $Z_t$ matrix is $\left\{ n \left[ k + \zeta (m_1 - 1) + m_1 (m_2 - 1) \right] + 1 \right\} \times n$.
Set up the $k + \zeta (m_1 - 1) + (m_2 - 1) + n$ distinct nonzero entries of the matrix $Z_t$ in the vector $v' = \left( x_t', c_{(0)u\cdot}, c_{(u)v\cdot}, y_t' \right)$. (For Case I, omit $c_{(0)u\cdot}$.) Notice that in each of the $\left[ k + \zeta (m_1 - 1) + m_1 (m_2 - 1) \right] n$ rows of $W_t$ there is at most one nonzero entry; this corresponds to the fact that there are no cross-equation restrictions on the coefficients. The indexing needed for the algorithm in Appendix 8.3.2 is as follows.
The first $kn$ entries $(r_i, c_i, p_i)$:
$$
n (j - 1) + i; \quad i; \quad j \qquad (i = 1, \ldots, n; \; j = 1, \ldots, k).
$$
The next $\zeta (m_1 - 1) n$ entries $(r_i, c_i, p_i)$:
$$
kn + n (j - 1) + i; \quad i; \quad k + j \qquad (i = 1, \ldots, n; \; j = 1, \ldots, m_1 - 1).
$$
The next $(m_2 - 1) n$ entries $(r_i, c_i, p_i)$:
$$
\left[ k + \zeta (m_1 - 1) \right] n + n (m_2 - 1)(u - 1) + n (j - 1) + i; \quad i; \quad k + \zeta (m_1 - 1) + j \qquad (i = 1, \ldots, n; \; j = 1, \ldots, m_2 - 1).
$$
The last $n$ entries $(r_i, c_i, p_i)$:
$$
\left[ k + \zeta (m_1 - 1) + m_1 (m_2 - 1) \right] n + 1; \quad i; \quad k + \zeta (m_1 - 1) + (m_2 - 1) + i \qquad (i = 1, \ldots, n).
$$
Note that the double indexing in the parentheses works like an implied do-loop: the outer (right) index moves slowly, and the inner (left) index moves quickly. The total number of nonzero entries in $Z_t$ is $\left[ k + \zeta (m_1 - 1) + (m_2 - 1) + 1 \right] n$.
Now we can restate the indexing specifically in terms of the algorithm in Appendix
8.3.2. Define
mw = k + ζ (m1 − 1) + (m2 − 1) and nw = n · mw ,
the total number of distinct nonzero entries, and total number of nonzero entries,
respectively, in Wt . Define
m∗w = k + ζ (m1 − 1) and n∗w = n · m∗w ,
the total number of distinct nonzero entries, and the total number of nonzero entries,
respectively, corresponding to the explanatory variables and persistent states. Define
mz = mw + n and nz = n (mw + 1) = nw + n,
the total number of distinct nonzero entries, and the total number of nonzero entries,
respectively, in Zt . We also require some pointers to the rows of the matrix Zt
(equivalently, to the elements of γ and one beyond). Let
$$
\ell_{\tilde{\phi}} = nk + 1, \qquad
\ell_{\tilde{\psi}} = n \left[ k + \zeta (m_1 - 1) \right] + 1 = n_w^* + 1, \qquad
\ell_y = n \left[ k + \zeta (m_1 - 1) + m_1 (m_2 - 1) \right] + 1.
$$

Many of the indices do not depend on the state assignment $(u, v)$. In the notation of the algorithm in Appendix 8.3.2, these indices are
$$
r_i = i \quad (i = 1, \ldots, n_w^*), \qquad
r_{n_w + i} = \ell_y \quad (i = 1, \ldots, n),
$$
$$
c_i = i \,\mathrm{mod}\, (n) \quad (i = 1, \ldots, n_z), \qquad
p_i = \left[ (i - 1)/n \right] + 1 \quad (i = 1, \ldots, n_w), \qquad
p_{n_w + i} = m_w + i \quad (i = 1, \ldots, n).
$$
The only integer indices depending on state assignments are
$$
r_{n_w^* + i} = n_w^* + n (m_2 - 1)(u - 1) + i \quad (i = 1, \ldots, n (m_2 - 1)).
$$
The composition of the $m_z \times 1$ vector $v$ is different for each $t$:
$$
v_i = x_{ti} \quad (i = 1, \ldots, k), \qquad
v_{k + i} = c_{(0)ui} \quad (i = 1, \ldots, m_1 - 1),
$$
$$
v_{m_w^* + i} = c_{(u)vi} \quad (i = 1, \ldots, m_2 - 1), \qquad
v_{m_w + i} = y_{ti} \quad (i = 1, \ldots, n).
$$

Drawing the state matrix S. This drawing is described fully in Section 4.1.

8.3.1. Appendix: Computation of orthonormal complements


To construct a unique orthonormal complement $C$ of a vector of probabilities $\pi$ with $\sum_{i=1}^m \pi_i = 1$, note that $\pi_j \in (0, 1)$ with probability 1 $(j = 1, \ldots, m)$. Construct a matrix $C^*$ as follows. The first column of $C^*$ is $c_{11}^* = \pi_2$, $c_{21}^* = -\pi_1$, $c_{i1}^* = 0$ $(i = 3, \ldots, m)$. The $j$th column of $C^*$ $(j = 2, \ldots, m - 1)$ is $c_{ij}^* = \pi_i$ $(i = 1, \ldots, j)$, $c_{j+1,j}^* = -\sum_{i=1}^{j} \pi_i^2 / \pi_{j+1}$, $c_{ij}^* = 0$ $(i = j + 2, \ldots, m)$. Construct $C$ from $C^*$ by normalizing each column to have Euclidean length 1.
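A direct transcription of this construction (the function name is ours); the final two lines check numerically that the columns are orthonormal and orthogonal to π:

```python
import numpy as np

def orthonormal_complement(pi: np.ndarray) -> np.ndarray:
    """Return an m x (m-1) matrix C with unit-length columns orthogonal to pi,
    built column by column as described above."""
    m = pi.size
    C = np.zeros((m, m - 1))
    C[0, 0], C[1, 0] = pi[1], -pi[0]              # first column
    for j in range(1, m - 1):                     # columns 2, ..., m-1
        C[: j + 1, j] = pi[: j + 1]
        C[j + 1, j] = -np.sum(pi[: j + 1] ** 2) / pi[j + 1]
    return C / np.linalg.norm(C, axis=0)          # normalize each column

pi = np.array([.486, .381, .132])
C = orthonormal_complement(pi)
print(np.round(pi @ C, 10))        # zeros: columns are orthogonal to pi
print(np.round(C.T @ C, 10))       # identity: columns are orthonormal
```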

8.3.2. Appendix: Algorithm for a sparse outer product problem


We wish to compute the increment $A = \delta Z H Z'$, where $Z$ is $g \times n$, $H$ is $n \times n$, and $Z$ is sparse. We have
$$
a_{ij} = \delta \sum_{k=1}^{n} \sum_{\ell=1}^{n} z_{ik} h_{k\ell} z_{j\ell}.
$$
Now suppose there are $m$ nonzero entries in $Z$. Let the vector $v$ contain the distinct nonzero entries in $Z$. For each entry $i$ we have three indices:
$$
r_i = \text{row of the entry in } Z; \qquad
c_i = \text{column of the entry in } Z; \qquad
p_i = \text{pointer to } v\!: \; z_{r_i, c_i} = v_{p_i}.
$$

Order the indexing so that $r_i$ is nondecreasing. Then for each combination
$$
(i, j) \qquad (j = 1, \ldots, i; \; i = 1, \ldots, m),
$$
add $\delta h_{c_i, c_j} \cdot v_{p_i} \cdot v_{p_j}$ to $a_{r_i, r_j}$. This accumulation provides the lower triangle of the symmetric increment $A = \delta Z H Z'$.
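A transcription of this algorithm in Python (names ours), with one adjustment: when two nonzero entries share a row, as in the y_t′ row of Z_t, both ordered pairs land in the same diagonal cell, so that contribution is doubled. A small dense comparison follows as a check.

```python
import numpy as np

def sparse_outer_increment(g, H, rows, cols, ptrs, v, delta=1.0):
    """Accumulate the lower triangle of A = delta * Z H Z' for a sparse g x n
    matrix Z whose i-th nonzero entry is v[ptrs[i]] at (rows[i], cols[i]);
    rows must be nondecreasing."""
    A = np.zeros((g, g))
    m = len(rows)
    for i in range(m):
        for j in range(i + 1):
            contrib = delta * H[cols[i], cols[j]] * v[ptrs[i]] * v[ptrs[j]]
            if i != j and rows[i] == rows[j]:
                contrib *= 2.0   # (i, j) and (j, i) both map to this diagonal cell
            A[rows[i], rows[j]] += contrib
    return A   # lower triangle of the symmetric increment

# Small dense check with an arbitrary symmetric H.
Z = np.array([[0., 2., 0.],
              [1., 0., 0.],
              [3., 0., 4.]])
H = np.array([[2., 1., 1.],
              [1., 3., 1.],
              [1., 1., 2.]])
rows, cols = np.nonzero(Z)          # row-major order, so rows is nondecreasing
v = Z[rows, cols]
ptrs = np.arange(len(v))
A = sparse_outer_increment(3, H, rows, cols, ptrs, v)
print(np.allclose(np.tril(A), np.tril(Z @ H @ Z.T)))   # True
```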

8.4. Appendix: Approximating the marginal likelihood


Material to be added

8.5. Appendix: Numerical approximation error


Material to be added

References
[1] Andersen, T., T. Bollerslev, F.X. Diebold and P. Labys, 2001, “The Distribution of Realized Exchange Rate Volatility,” Journal of the American Statistical Association 96: 42-55.
[2] Bollerslev, T., 1986, “Generalized Autoregressive Conditional Heteroskedastic-
ity,” Journal of Econometrics 31: 307-327.
[3] Bollerslev, T., 1990, “Modelling the Coherence in Short-Run Nominal Exchange
Rates: A Multivariate Generalized ARCH Model,” Review of Economics and
Statistics 72: 498-505.
[4] Celeux, G., M. Hurn and C.P. Robert, 2000, “Computational and Inferential
Difficulties with Mixture Posterior Distributions,” Journal of the American Sta-
tistical Association 95: 957-970.
[5] Chib, S., 1996, “Calculating Posterior Distributions and Modal Estimates in
Markov Mixture Models,” Journal of Econometrics 75: 79-97.
[6] Chib, S. and E. Greenberg, 1995, “Understanding the Metropolis-Hastings Algo-
rithm,” The American Statistician 49: 327-355.
[7] Engle, R.F. and K.F. Kroner, 1995, “Multivariate Simultaneous Generalized
ARCH,” Econometric Theory 11: 122-150.
[8] Ferguson, T.S., 1973, “A Bayesian Analysis of Some Nonparametric Problems,”
The Annals of Statistics 2: 615-629.
[9] Ferguson, T.S., 1983, “Bayesian Density Estimation by Mixtures of Normal Distributions,” in M.H. Rizvi and J. Rustagi (eds.), Recent Advances in Statistics. New York: Academic Press, pp. 287-302.

[10] Gelfand, A.E., and D.K. Dey, 1994, “Bayesian Model Choice: Asymptotics and
Exact Calculations,” Journal of the Royal Statistical Society Series B 56: 501-
514.
[11] Geweke, J., 1999, “Using Simulation Methods for Bayesian Econometric Models:
Inference, Development and Communication” (with discussion and rejoinder),
Econometric Reviews 18: 1-126.
[12] Geweke, J., 2001, “Getting it Right: Checking for Errors in Likelihood Based
Inference” University of Iowa working paper, July.
[13] Geweke, J. and M. Keane, 2000, “An Empirical Analysis of Income Dynamics
among Men in the PSID: 1968-1989,” Journal of Econometrics 96: 293-356.
[14] Geweke, J. and M. Keane, 2001, “Computationally Intensive Methods for Inte-
gration in Econometrics,” in J.J. Heckman and E.E. Leamer (eds.), Handbook of
Econometrics (vol. 5). Amsterdam: North-Holland (forthcoming).
[15] Geweke, J. and W. McCausland, 2001, “Embedding Bayesian Tools in Mathe-
matical Software,” in E.I. George (ed.), Proceedings of the Sixth International
Meeting of the International Society for Bayesian Analysis. Brussels: Eurostat
(forthcoming).
[16] Hora, M., 1999, “Markov Mixtures with Applications in Finance.” University of
Minnesota unpublished Ph.D. dissertation.
[17] Kiefer, J. and J. Wolfowitz, 1956, “Consistency of the Maximum Likelihood
Estimator in the Presence of Infinitely Many Nuisance Parameters,” Annals of
Mathematical Statistics 27: 887-906.
[18] Nijman, T. and E. Sentana, 1996, “Marginalization and Contemporaneous Aggre-
gation in Multivariate GARCH Processes,” Journal of Econometrics 71: 71-88.
[19] Ryden, T., T. Terasvirta and S. Asbrink, 1998, “Stylized Facts of Daily Return
Series and the Hidden Markov Model,” Journal of Applied Econometrics 13:
217-244.
[20] Tierney, L., 1994, “Markov Chains for Exploring Posterior Distributions,” The
Annals of Statistics 22: 1701-1762.
[21] Timmermann, A., 2000, “Moments of Markov Switching Models,” Journal of Econometrics 96: 75-112.

