
Mixed Hidden Markov Models: An Extension of the Hidden Markov Model to the Longitudinal Data Setting

Rachel MacKay ALTMAN
Hidden Markov models (HMMs) are a useful tool for capturing the behavior of overdispersed, autocorrelated data. These models have
been applied to many different problems, including speech recognition, precipitation modeling, and gene finding and profiling. Typically,
HMMs are applied to individual stochastic processes; HMMs for simultaneously modeling multiple processes, as in the longitudinal data
setting, have not been widely studied. In this article I present a new class of models, mixed HMMs (MHMMs), where I use both covariates
and random effects to capture differences among processes. I define the models using the framework of generalized linear mixed models and
discuss their interpretation. I then provide algorithms for parameter estimation and illustrate the properties of the estimators via a simulation
study. Finally, to demonstrate the practical uses of MHMMs, I provide an application to data on lesion counts in multiple sclerosis patients.
I show that my model, while parsimonious, can describe the heterogeneity among such patients.
KEY WORDS: Hidden Markov model; Latent process; Longitudinal model; Mixed model; Random effect.

Hidden Markov models (HMMs) describe the relationship
between two stochastic processes: an observed process and an
underlying hidden (unobserved) process. The hidden process
is assumed to follow a Markov chain, and the observed data are
modeled as independent conditional on the sequence of hidden
states. The separability of the model for the hidden process and
the conditional model for the observed data leads to great flexibility in the overall model structure.
In particular, the observed data $\{Y_t\}_{t=1}^{n}$ follow a HMM if
1. The hidden states, $\{Z_t\}_{t=1}^{n}$, follow a Markov chain.
2. Given $Z_t$, $Y_t$ is independent of $Y_1, \ldots, Y_{t-1}, Y_{t+1}, \ldots, Y_n$
and $Z_1, \ldots, Z_{t-1}, Z_{t+1}, \ldots, Z_n$.
The HMM is fully specified by the initial and transition probabilities of the hidden Markov chain and by the distribution of
Yt given Zt . Typically, the latter would be chosen from a family
of distributions with mean depending on Zt .
Under the preceding conditions, the observed data may be
autocorrelated, but do not have the Markov property. Furthermore, the marginal distribution of $Y_t$ is a finite mixture. For
example, if $Y_t$ is a count, one might choose the distribution of
$Y_t$ given $Z_t = z$ to be Poisson with mean $\mu_z$. In this case, the
marginal distribution of $Y_t$ would be a finite mixture of Poisson
distributions. Thus, HMMs are one way of describing overdispersion in count (or binary) data.
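As a small numerical illustration of the overdispersion claim, the sketch below computes the marginal mean and variance of a two-component Poisson mixture from the law of total variance. The mixing probabilities and state-specific means are hypothetical values chosen only for illustration; they are not taken from any dataset discussed in this article.

```python
def poisson_mixture_moments(weights, means):
    """Marginal mean and variance of Y when Y | Z = k ~ Poisson(means[k])
    and P(Z = k) = weights[k]."""
    mean = sum(w * m for w, m in zip(weights, means))
    # Law of total variance: E[Var(Y | Z)] + Var(E[Y | Z]).
    within = mean  # a Poisson variance equals its mean in every state
    between = sum(w * (m - mean) ** 2 for w, m in zip(weights, means))
    return mean, within + between


# Hypothetical marginal state probabilities and state-specific Poisson means.
mean, var = poisson_mixture_moments([0.7, 0.3], [1.0, 6.0])
print(mean, var)  # the variance exceeds the mean whenever the state means differ
```

A plain Poisson model would force the variance to equal the mean, so the extra "between-state" term is what allows the mixture to describe overdispersed counts.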
HMMs have many areas of application, including speech
recognition (e.g., Levinson, Rabiner, and Sondhi 1983), gene
profiling and recognition (e.g., Krogh 1998), and the modeling
of fetal lamb movements (Leroux and Puterman 1992). Albert,
McFarland, Smith, and Frank (1994) used HMMs to model lesion counts observed on a multiple sclerosis (MS) patient, an
Rachel Altman is Assistant Professor, Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, British Columbia, Canada
V5A 1S6. This article includes work from the
author's Ph.D. dissertation and was supported in part by a research grant from
the Natural Sciences and Engineering Research Council of Canada. The author
gratefully acknowledges her advisor, John Petkau, for his assistance, both academic and financial. In addition, the author thanks Farouk Nathoo for helpful
conversations, and the editors and referees whose comments led to substantial
improvement of the original manuscript. The author also expresses her appreciation to Paul Albert, Henry McFarland, and the Joseph Frank Experimental
Neuroimaging Section, Laboratory of Diagnostic Radiology Research, Clinical
Center, NIH, for sharing their MS/MRI data. Finally, she thanks Drs. Don Paty
and David Li of the UBC MS/MRI Research Group and Serono International
S.A. for providing the PRISMS data.

application discussed further in Sections 2 and 6. One common
feature of these models is that they have been developed for
single processes.
Few HMMs for multiple processes have been considered in
the literature. Most have been developed in the context of specific applications and, hence, have not been posed in their full
generality. Little is known about the theory surrounding these models.
Hughes and Guttorp (1994) presented one example of a
HMM for multiple processes: a multivariate HMM for data consisting of daily binary rainfall observations (rain or no rain)
at four different stations. These time series are assumed to be
functions of the same underlying process. Given the hidden
state at time t, the observations at this time are considered to
be independent of all observations at other time points. They
may, however, depend on one another. Turner, Cameron, and
Thomson (1998) and Wang and Puterman (1999), working in
the setting of Poisson count data, developed models for independent processes, each of which depends on a different underlying process. To account for between-subject differences, these
authors incorporated covariates in the conditional model for
the observed process. MacDonald and Zucchini (1997, chap. 3)
also provided a brief discussion of this idea. Between-subject
differences can also occur in the transition probabilities; for example, in their rainfall model, Hughes and Guttorp (1994) included covariates in the model for the hidden process.
The addition of random effects is a natural extension of these
models. In the subject-area literature, HMMs with random effects have appeared in a limited way. For instance, Humphreys
(1997, 1998) suggested such a model for representing labor
force transition data. He worked with binary observations on
employment status (employed or unemployed) that are assumed
to contain reporting errors. The misclassification probabilities,
as well as the transition probabilities, depend on a subject-specific random effect. Seltman (2002) proposed a complex
biological model for describing cortisol levels over time in a
group of patients. The initial concentration of cortisol in each
patient is modeled as a random effect.
The goal of this article is to develop a new class of models,
mixed hidden Markov models (MHMMs), which unify existing HMMs for multiple processes and provide a general framework for working in this context. These models extend the class


2007 American Statistical Association

Journal of the American Statistical Association
March 2007, Vol. 102, No. 477, Theory and Methods
DOI 10.1198/016214506000001086


of HMMs by allowing the incorporation of covariates and random effects in both the conditional and the hidden parts of the
model. The advantages of MHMMs are numerous. First, modeling multiple processes simultaneously permits the estimation
of population-level effects, as well as more efficient estimation
of parameters that are common to all processes. Second, these
models are relatively easy to interpret. Finally, MHMMs permit greater flexibility in modeling correlation structure because
they relax the assumption that the observations are independent
given the hidden states.
This article is organized as follows. In Section 2 I describe
the MS study that provided the practical motivation for this
work. In Section 3 I define the class of MHMMs and give some
insight into their interpretation. I discuss estimation of the parameters of these models in Section 4, and in Section 5, for
some simple models, conduct a simulation study to investigate
small sample properties. I illustrate the application of MHMMs
to the MS data in Section 6 and conclude with a discussion in
Section 7.
Magnetic resonance imaging (MRI) scans of relapsing-remitting
multiple sclerosis (RRMS) patients are one source
of data that may be appropriately modeled using the HMM
framework. Patients afflicted with this disease have symptoms
that worsen and then improve in alternating periods of relapse
and remission. One such symptom is lesions in the brain; it is
now believed that exacerbations are associated with increased
numbers of lesions (e.g., Sormani, Bruzzi, Comi, and Filippi
2002). Interferon β-1a (Rebif) is a common treatment and has
been shown to reduce MRI activity, relapse rate, and clinical
progression according to the Expanded Disability Status Scale
(PRISMS Study Group 1998; Li and Paty 1999; PRISMS Study
Group and the University of British Columbia MS/MRI Analysis Group 2001).
Because the number of lesions depends on whether the patient is in relapse or remission, one would expect the distribution of the lesion counts to be a finite mixture. In addition,
one would expect counts observed on the same patient to be
autocorrelated. Thus, a two-state HMM may be reasonable for
describing such data. Indeed, Albert et al. (1994) used these
ideas in the development of a model for lesion counts observed
monthly on three RRMS patients over a period of approximately 2 years. They used a HMM to analyze each patient's
data individually, treating the data as three unrelated processes.
In particular, for a given patient, they let $\{Z_t\}$ be an unobserved,
stationary Markov chain with $P(Z_t = 1 \mid Z_{t-1} = 1) = P(Z_t = -1 \mid Z_{t-1} = -1) = \gamma$ and $P(Z_t = 1) = P(Z_t = -1) = 1/2$. They
then modeled the observed lesion count at time $t$, $Y_t$, conditional
on $Z_t$ as Poisson($\mu_t$), with

$$\mu_t = \begin{cases} \theta\,\mu_{t-1}, & Z_t = 1 \\ (1/\theta)\,\mu_{t-1}, & Z_t = -1. \end{cases}$$

Here $\mu_0$ is an unknown parameter. At first glance, it may appear that this model is not a HMM, because $\mu_t$ depends on
$Z_1, \ldots, Z_{t-1}$ as well as on $Z_t$. However, Altman and Petkau
(2005), by reparameterizing the model, demonstrated that it is,
in fact, a nonhomogeneous HMM. These authors also pointed

Table 1. Analysis of the Data of Albert et al. Under Three Different Models


out that, under this model, $\mu_t$ is restricted to a discrete number of values (evenly spaced on the log scale). Therefore, if the
process $\{Y_t\}$ is stationary, one might expect the model of Albert
et al. (1994) to be similar to a stationary Poisson HMM with
$K < \infty$ hidden states.
There are a number of limitations to the approach of Albert
et al. (1994). First, for each patient, the authors compared the fit
of their HMM to the model that assumes independent Poisson
counts. Based on the Akaike information criterion (AIC) values (reproduced in columns 2 and 3 of Table 1), they claimed
evidence of autocorrelation in the data from patients 2 and 3.
However, the HMM differs from the Poisson model in that it allows not only for autocorrelation but for overdispersion as well.
When one fits a mixture of two Poisson distributions (a model
that allows for overdispersion but not autocorrelation), one sees
that, of the three models, the mixture model actually yields the
lowest AIC values for patients 2 and 3 (column 4 of Table 1). In
other words, although one expects autocorrelation to be present,
it is not detectable using individual HMMs.
Second, as Albert et al. (1994) noted, MS is a very heterogeneous disease, and its degree and behavior are expected to vary
considerably among patients. Although modeling each patient's
data separately certainly allows for interpatient differences, individual models require a large number of parameters, which
leads to increased uncertainty in all parameter estimates. In
addition, this approach prevents the assessment of population-level effects. For example, in an MS clinical trial, models for
individual patients would typically not be sufficient given the
usual sample size per patient and heterogeneity among patients.
Third, despite its simple form, the model of Albert et al.
(1994) is hard to interpret. Specifically, consider the implied
marginal moment structure:

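$$E[Y_1] = \frac{\mu_0}{2}\left(\theta + \frac{1}{\theta}\right),$$

$$E[Y_2] = \frac{\mu_0}{2}\left\{\gamma\theta^2 + 2(1-\gamma) + \frac{\gamma}{\theta^2}\right\},$$

$$E[Y_3] = \frac{\mu_0}{2}\left\{\gamma^2\theta^3 + \left(\theta + \frac{1}{\theta}\right)\left[2\gamma(1-\gamma) + (1-\gamma)^2\right] + \frac{\gamma^2}{\theta^3}\right\}.$$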

The model assumes that the mean lesion counts follow a very
complicated trend. It would be difficult to justify this assumption in practice.
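The moment expressions above can be checked by brute force. The sketch below enumerates every hidden path of the model of Albert et al. (1994) under the following assumptions, which are my reading of the model rather than a quotation of it: $Z_t \in \{-1, 1\}$, $P(Z_1 = 1) = 1/2$, $\gamma$ is the probability of remaining in the same state, and $\mu_t = \mu_0\theta^{Z_1 + \cdots + Z_t}$. The numerical values of `mu0`, `theta`, and `gamma` are hypothetical.

```python
from itertools import product


def mean_count(t, mu0, theta, gamma):
    """E[Y_t] obtained by enumerating all paths (z_1, ..., z_t) in {-1, +1}^t.

    Because Y_t | mu_t is Poisson(mu_t), E[Y_t] = E[mu_t]."""
    total = 0.0
    for path in product((-1, 1), repeat=t):
        prob = 0.5  # P(Z_1 = z_1)
        for prev, cur in zip(path, path[1:]):
            prob *= gamma if cur == prev else 1.0 - gamma
        total += prob * mu0 * theta ** sum(path)
    return total


mu0, theta, gamma = 2.0, 1.5, 0.8  # hypothetical parameter values
closed_form_2 = mu0 / 2 * (gamma * theta**2 + 2 * (1 - gamma) + gamma / theta**2)
print(mean_count(2, mu0, theta, gamma), closed_form_2)
```

Evaluating `mean_count` for successive values of `t` makes the irregular trend in the marginal means easy to inspect for any choice of parameters.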
Finally, little is known about the theoretical properties of
nonhomogeneous HMMs. For instance, there are no tools for
model assessment in this setting.

Altman: Mixed Hidden Markov Models


This example illustrates the motivation for the development
of MHMMs: the need for an interpretable, parsimonious model
that allows interprocess differences while borrowing strength
across processes. MHMMs are most advantageous in the setting where there are many processes. Therefore, to demonstrate the performance of my models in the MS setting, rather
than continuing to work with the data of Albert et al. (1994),
I use a larger MS/MRI dataset that was generated as part of
the PRISMS study (PRISMS Study Group 1998). This study
was a multicenter trial designed to investigate the use of interferon β-1a in treating RRMS. Patients were randomly assigned to the placebo group, the low-dose group, or the high-dose
group. The 39 patients who participated at the University of
British Columbia (UBC) received monthly scans for approximately two years (similar to the data of Albert et al. 1994).

Figure 1 illustrates the UBC PRISMS data. The plots clearly
demonstrate both the heterogeneity among patients and the effectiveness of the treatment in reducing mean lesion count. The
counts in each treatment group are most frequently 0, but range
up to 18 in the placebo group, 8 in the low-dose group, and 7 in
the high-dose group. Some scans are missing in each treatment
group (assumed at random). The PRISMS data are analyzed
further in Section 6.
In this section I specify the MHMM class and discuss the
interpretation of these new models. For concreteness, I present
my ideas in the context where each process corresponds to repeated observations on a patient.
Let Yit be the observation and let Zit be the hidden state associated with patient i, i = 1, . . . , N, at time t, t = 1, . . . , ni .

Figure 1. UBC PRISMS Data.


Assume, for convenience, that these time points are equally
spaced. However, missing data are permissible. Let $\sum_{i=1}^{N} n_i = n_T$.
$Y_i$ denotes the $n_i$-dimensional vector of observations on patient $i$ and $Y$ the $n_T$-dimensional vector of all observations. The
vectors of hidden states, $Z_i$ and $Z$, are defined analogously. The
generic notation $f(x)$ is used to denote the density (or probability mass function) of a random variable (or vector), $X$.
I make the following additional assumptions:
1. $Z_{it}$ takes on values from a finite set, $\{1, 2, \ldots, K\}$, where
$K$ is known.
2. Given the random effects, $\{Z_{it}\}_{t=1}^{n_i}$ are Markov chains.
These processes may or may not be stationary.
3. Conditional on the random effects, the $i$th process,
$\{Y_{it}\}_{t=1}^{n_i}$, is a HMM, and observations on different processes are independent.
If the processes $\{Z_{it}\}_{t=1}^{n_i}$, conditional on the random effects,
are stationary with unique stationary distributions, one may use
these as the initial probabilities. Otherwise, the initial probabilities are typically nuisance parameters. In this case, it may be
sensible to assume that they are fixed parameters and are the
same for all patients (or perhaps the same for all patients in a
given treatment group). One generally has very little information with which to estimate these probabilities, so allowing for
further complexity does not seem necessary.

3.1 Extending the Conditional Model for the Observed Data
I first focus on the addition of random effects to the conditional model for the observed data and assume that random effects do not appear in the model for the hidden processes. In particular, I assume that the hidden processes are homogeneous with transition probabilities, $\{P_{k\ell}\}$, and initial probabilities, $\{\pi_k\}$, common to all subjects. In the RRMS setting, such a model would allow the mean lesion count to vary among patients.

Borrowing from the theory of generalized linear mixed models (GLMMs) (e.g., McCulloch and Searle 2001) and letting $\psi$ be the vector of all model parameters, I assume that, conditional on the random effects, $u$, and the hidden states $Z$, $\{Y_{it}\}$ are independent with distribution in the exponential family:

$$f(y_{it} \mid Z_{it} = k, u, \psi) = \exp\{(y_{it}\eta_{itk} - c(\eta_{itk}))/a(\phi) + d(y_{it}, \phi)\}, \tag{1}$$

where

$$\eta_{itk} = \tau_k + x_{it}'\beta_k + w_{itk}'u. \tag{2}$$

Here $\tau_k$ is the fixed effect when $Z_{it} = k$, $x_{it}$ are covariates for patient $i$ at time $t$, and $w_{itk}$ is the row of the model matrix for the random effects for patient $i$ at time $t$ in state $k$. I denote the distribution of the random effects by $f(u; \psi)$ and assume that the random effects are independent of the hidden states. I take $E[u] = 0$ to avoid problems with model identifiability. The notation $u$ (as opposed to $u_i$) indicates that a random effect could be common to more than one patient, a generalization that would be helpful if, for example, data were collected on patients from multiple centers. The form of (1) assumes that the link function is canonical, an assumption that is not strictly necessary, but one that facilitates the exposition. Likewise, the model could be extended further by allowing the parameter $\phi$ to vary among patients, a case I do not consider here. Henceforth, this model will be referred to as Model I.

The likelihood for this model is

$$\begin{aligned}
L(\psi; y) &= \int_u \sum_z f(y \mid z, u, \psi)\, f(z; \psi)\, f(u; \psi)\, du \\
&= \int_u \sum_z \left\{\prod_{i=1}^{N}\prod_{t=1}^{n_i} f(y_{it} \mid z_{it}, u, \psi)\right\} \left\{\prod_{i=1}^{N} \pi_{z_{i1}} \prod_{t=2}^{n_i} P_{z_{i,t-1}, z_{it}}\right\} f(u; \psi)\, du \\
&= \int_u \prod_{i=1}^{N} \left\{\sum_{z_i} \pi_{z_{i1}} f(y_{i1} \mid z_{i1}, u, \psi) \prod_{t=2}^{n_i} P_{z_{i,t-1}, z_{it}} f(y_{it} \mid z_{it}, u, \psi)\right\} f(u; \psi)\, du.
\end{aligned} \tag{3}$$

The summation associated with each $i$ is just the likelihood of a standard HMM and can be expressed more simply as a product of matrices. Specifically, for a given value of $u$, let $A^{i1}$ be the vector with elements $A^{i1}_k = \pi_k f(y_{i1} \mid Z_{i1} = k, u)$ and let $A^{it}$ be the matrix with elements $A^{it}_{k\ell} = P_{k\ell} f(y_{it} \mid Z_{it} = \ell, u)$, $t > 1$. Let $\mathbf{1}$ be the $K$-dimensional vector of 1s. Then

$$L(\psi; y) = \int_u \prod_{i=1}^{N} \left\{(A^{i1})' \left(\prod_{t=2}^{n_i} A^{it}\right) \mathbf{1}\right\} f(u; \psi)\, du. \tag{4}$$

Thus, the integrand is simple to compute, and it is only the complexity of the integral that can make evaluating and maximizing (4) challenging. In most applications, however, one would expect (4) to reduce to a much simpler form.

Example 1 (Single, patient-specific random effect). Let $u_i$ be a random effect associated with the $i$th patient, $i = 1, \ldots, N$, and let $\{u_i\}$ be iid. Then, observations on different patients are independent, and, given $u_i$ and the sequences of hidden states, the observations on patient $i$ are independent. In this case, (3) and (4) simplify to a one-dimensional integral:

$$\begin{aligned}
L(\psi; y) &= \prod_{i=1}^{N} \int_{u_i} \left\{\sum_{z_i} \pi_{z_{i1}} f(y_{i1} \mid z_{i1}, u_i, \psi) \prod_{t=2}^{n_i} P_{z_{i,t-1}, z_{it}} f(y_{it} \mid z_{it}, u_i, \psi)\right\} f(u_i; \psi)\, du_i \\
&= \prod_{i=1}^{N} \int_{u_i} \left\{(A^{i1})' \left(\prod_{t=2}^{n_i} A^{it}\right) \mathbf{1}\right\} f(u_i; \psi)\, du_i.
\end{aligned} \tag{5}$$
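To make the matrix-product form of the integrand concrete, the following sketch evaluates the likelihood of a single Poisson HMM series for a fixed value of the random effect and compares it with a brute-force sum over all hidden paths. It is a minimal illustration, not the implementation used for the results in this article; the function names, the data `y`, and all parameter values are hypothetical.

```python
import math
from itertools import product


def poisson_pmf(y, mean):
    return math.exp(-mean) * mean**y / math.factorial(y)


def hmm_likelihood(y, init, trans, means):
    """(A^1)' (prod_{t >= 2} A^t) 1 for one Poisson HMM series."""
    K = len(init)
    # Row vector A^1 with elements init[k] * f(y_1 | Z_1 = k).
    alpha = [init[k] * poisson_pmf(y[0], means[k]) for k in range(K)]
    for obs in y[1:]:
        # Right-multiply by A^t, whose (k, l) element is trans[k][l] * f(y_t | Z_t = l).
        alpha = [
            sum(alpha[k] * trans[k][l] for k in range(K)) * poisson_pmf(obs, means[l])
            for l in range(K)
        ]
    return sum(alpha)  # multiplication by the vector of 1s


def brute_force_likelihood(y, init, trans, means):
    """Sum over every hidden path; feasible only for very short series."""
    K, total = len(init), 0.0
    for z in product(range(K), repeat=len(y)):
        p = init[z[0]] * poisson_pmf(y[0], means[z[0]])
        for t in range(1, len(y)):
            p *= trans[z[t - 1]][z[t]] * poisson_pmf(y[t], means[z[t]])
        total += p
    return total


y = [0, 3, 1, 0, 5]
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.3, 0.7]]
means = [0.5, 4.0]  # in a mixed model these would depend on u, e.g. exp(tau_k + u)
print(hmm_likelihood(y, init, trans, means), brute_force_likelihood(y, init, trans, means))
```

The matrix product costs on the order of $nK^2$ operations, whereas the brute-force sum has $K^n$ terms, which is why the product form keeps the integrand cheap even for long series.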



3.2 Extending the Model for the Hidden Process

It may also be desirable to allow the parameters of the hidden
Markov chain to vary randomly among patients. For example,
in the RRMS context, patients may spend differing proportions
of time in relapse and remission. To explore this class of models, I again specify the conditional model for the observed data
by (1) and (2), but I now allow the model for the hidden process
to vary among patients.
In particular, I assume that $\{Z_{it} \mid u\}_{t=1}^{n_i}$ is a Markov chain and that $Z_{it} \mid u$ is independent of $Z_{js} \mid u$ for $i \neq j$. Any model for these Markov chains must satisfy the constraints that the transition probabilities lie between 0 and 1 and that the rows of the transition probability matrices sum to 1. Thus, I propose modeling the transition probabilities as

$$P(Z_{it} = \ell \mid Z_{i,t-1} = k, u, \psi) = \frac{\exp\{\tau^*_{k\ell} + x^{*\prime}_{it}\beta^*_{k\ell} + w^{*\prime}_{itk\ell}u\}}{\sum_{h=1}^{K} \exp\{\tau^*_{kh} + x^{*\prime}_{it}\beta^*_{kh} + w^{*\prime}_{itkh}u\}}. \tag{6}$$

The asterisks distinguish the model matrices and parameters from those in (2). The vector $u$ now contains the random effects associated with both the hidden process and the conditional model for the observed data. To prevent overparameterization, I define $\tau^*_{kK} \equiv \beta^*_{kK} \equiv 0$ for all $k$ and set $w^*_{itkK}$ to be a row of 0s for all $i$, $t$, $k$. I call this model Model II.

The likelihood associated with Model II is very similar to (4). Again, I can write

$$L(\psi; y) = \int_u \prod_{i=1}^{N} \left\{(A^{i1})' \left(\prod_{t=2}^{n_i} A^{it}\right) \mathbf{1}\right\} f(u; \psi)\, du, \tag{7}$$

where now I define the quantities $A^{i1}_k$ and $A^{it}_{k\ell}$ as $A^{i1}_k = \pi_k f(y_{i1} \mid Z_{i1} = k, u, \psi)$ and $A^{it}_{k\ell} = P(Z_{it} = \ell \mid Z_{i,t-1} = k, u, \psi)\, f(y_{it} \mid Z_{it} = \ell, u, \psi)$, $t > 1$.

Example 2 (Patient-specific random effects). I assume that the random effects are patient specific, so that observations on different patients are independent. In particular, for patient $i$, I model the transition probabilities as

$$P(Z_{it} = \ell \mid Z_{i,t-1} = k, u_{ik}, \psi) = \frac{\exp\{\tau^*_{k\ell} + u_{ik\ell}\}}{\sum_{h=1}^{K} \exp\{\tau^*_{kh} + u_{ikh}\}},$$

where $\tau^*_{kK} \equiv u_{ikK} \equiv 0$ for all $i$, $k$. The likelihood for this model can be simplified as in (5). However, I now need $K(K - 1)$ (possibly correlated) random effects for each patient. For $K > 2$, the
resulting integral could be prohibitively complex.
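The multinomial-logit construction guarantees valid transition probabilities for any real-valued linear predictors. The sketch below builds one patient's transition matrix from fixed effects and patient-specific random effects, with the last state as the reference category. The function name and the numerical inputs are hypothetical, and covariates are omitted for brevity.

```python
import math


def transition_matrix(tau_star, u=None):
    """Row k holds P(Z_t = l | Z_{t-1} = k) under a multinomial-logit model.

    tau_star[k][l] and u[k][l] are the fixed and random parts of the linear
    predictor; the last column is the reference category and should be zero."""
    K = len(tau_star)
    if u is None:
        u = [[0.0] * K for _ in range(K)]
    rows = []
    for k in range(K):
        eta = [tau_star[k][l] + u[k][l] for l in range(K)]
        shift = max(eta)  # subtracting the maximum avoids overflow in exp
        expo = [math.exp(e - shift) for e in eta]
        denom = sum(expo)
        rows.append([e / denom for e in expo])
    return rows


# Two states: one free linear predictor per row, so K(K - 1) = 2 random effects.
tau_star = [[2.0, 0.0], [-1.0, 0.0]]
u_i = [[0.4, 0.0], [-0.3, 0.0]]  # hypothetical draws for one patient
for row in transition_matrix(tau_star, u_i):
    print(row)
```

With $K = 3$ the same patient would already need six random effects, which illustrates why the integral in this example grows difficult so quickly.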

3.3 Interpretation of the Models

It is easy to create complex models using latent variables.
However, caution must be exercised in order to avoid unwanted
implications of one's modeling choices. One way of understanding the impact of the fixed and random effects is to examine the resulting marginal moments of the observed process.
In the MHMM setting, the marginal moments do not, in general, have closed forms. Nonetheless, in the case of Model I,
when $\{Z_{it}\}_{t=1}^{n_i}$ is stationary, closed forms do exist for certain
common distributions of $Y_{it} \mid Z_{it}, u$ (e.g., Poisson, normal) and of
the random effects (e.g., multivariate normal). Interpreting the
model under these circumstances is relatively easy (see, e.g.,
Sec. 6).
One can also interpret the impact of the random effects on
the asymptotic covariance. Consider the case where $\{Z_{it} \mid u\}$ is
homogeneous and stationary with unique stationary distribution
and where $x_{it} \equiv x_i$ and $w_{itk} \equiv w_i$ are independent of $t$ and $k$.
Under mild conditions, it can then be shown that

$$\text{cov}[Y_{it}, Y_{is}] \to c \quad \text{as } |t - s| \to \infty,$$

where $c$ is a positive constant. Note that $\text{cov}[Y_{it}, Y_{is}] \to 0$ when
there are no random effects in the model (i.e., when one assumes the same model for each patient). Thus, the random effects allow a long-range, positive dependence in each patient's observations.
In addition, the different roles played by the random effects
in the two parts of the model are noteworthy. Specifically, the
effect of including random effects in the conditional model for
the observed data is to relax the assumption that the observations are conditionally independent given the hidden states. In
contrast, the role of the random effects in the model for the
hidden process is to relax the assumption that this process is the same for all patients.
Traditionally, the expectation-maximization (EM) algorithm
has been used to estimate the parameters of a HMM. However,
the EM algorithm is notoriously slow to converge, and thus,
in this setting, I prefer to maximize the likelihood directly. In
particular, I evaluate the likelihood as a product of matrices
(as described in Sec. 3) and then use a quasi-Newton method
(e.g., Nash 1979) to locate the maximum likelihood estimators
(MLEs). For standard HMMs, I have found this approach to
be much more efficient. Moreover, the quasi-Newton routine
has the added benefit of producing, as a by-product, the estimated variance-covariance matrix of the parameter estimators
(as given by the inverse of the observed Fisher information matrix).
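Direct maximization needs an objective function that is defined for every real-valued parameter vector and that does not underflow for long series. The sketch below is one way to write such a function for a two-state Poisson HMM: means are parameterized on the log scale, transition probabilities on the logit scale, and the matrix product is rescaled at each step. The function name, the fixed initial probabilities of 1/2, and the data are assumptions made for brevity, not details taken from this article.

```python
import math


def neg_log_lik(params, y):
    """Negative log-likelihood of a two-state Poisson HMM.

    params = (log mean in state 1, log mean in state 2,
              logit P(stay in state 1), logit P(stay in state 2)).
    Initial probabilities are fixed at (1/2, 1/2) purely for brevity."""
    mu = [math.exp(params[0]), math.exp(params[1])]
    p11 = 1.0 / (1.0 + math.exp(-params[2]))
    p22 = 1.0 / (1.0 + math.exp(-params[3]))
    trans = [[p11, 1.0 - p11], [1.0 - p22, p22]]

    def pmf(obs, mean):
        return math.exp(-mean + obs * math.log(mean) - math.lgamma(obs + 1))

    alpha = [0.5 * pmf(y[0], mu[k]) for k in range(2)]
    log_lik = 0.0
    for t in range(len(y)):
        if t > 0:
            alpha = [
                (alpha[0] * trans[0][l] + alpha[1] * trans[1][l]) * pmf(y[t], mu[l])
                for l in range(2)
            ]
        scale = sum(alpha)  # rescale so the running product never underflows
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return -log_lik


y = [0, 1, 0, 4, 6, 3, 0, 0]  # hypothetical counts
start = (math.log(0.5), math.log(4.0), 1.0, 1.0)
print(neg_log_lik(start, y))
```

A function of this form can be handed to any quasi-Newton routine. One option, named here only as an example, is SciPy's `scipy.optimize.minimize(neg_log_lik, start, args=(y,), method="BFGS")`, whose result includes an approximate inverse Hessian in `hess_inv`; that matrix is an approximation built up during the iterations, so it is worth checking against a numerically differentiated Hessian before relying on it for standard errors.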
The estimation of MHMMs poses a more challenging problem. In the special cases considered by Humphreys (1997,
1998), (4) can be evaluated analytically. In these cases, the response is binary and the hidden Markov chain has two states.
The random effects are assumed to follow a log-Gamma distribution, and the complementary log-log link is used. Typically,
though, neither (4) nor (7) will have a closed form. Seltman
(2002) took a Bayesian approach to estimating the parameters
of his specialized model (an example of a MHMM with one
random effect), claiming that the frequentist approach is intractable. I disagree with this claim, having successfully implemented two of the frequentist estimation methods discussed below.
4.1 Numerical Integration
In the case where there are only a few random effects, numerical methods of integration work well. In particular, for
common choices of distribution for u (e.g., multivariate normal), Gaussian quadrature methods offer both accuracy and efficiency. The quasi-Newton method can then be used to maximize the approximated likelihood.
Two issues concerning such methods are the starting values
and the number of quadrature points. With respect to the former, I recommend trying a variety of starting values to improve
the chances of locating the global maximum. With respect to
the latter, I recommend increasing the number of quadrature
points with the number of iterations of the quasi-Newton routine. To choose the various numbers of points, the likelihood
can be evaluated (at the starting values) for different numbers
of quadrature points, $q_1 < q_2 < \cdots$. I select $q_k$ as the number
of quadrature points if the use of $q_{k+1}, q_{k+2}, \ldots$ points does not
lead to a substantial change in the value of the integral, where
the definition of "substantial" depends on the number of iterations executed by the quasi-Newton algorithm. (I accept less
accuracy for early iterations, but demand high accuracy for the
final iterations.)
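The quadrature strategy can be sketched for the simplest case: one patient, a normally distributed random intercept, and Poisson counts. The code below builds Gauss-Hermite nodes and weights from scratch (by bisection on the Hermite recurrence, so that only the standard library is needed), integrates the conditional HMM likelihood over the random effect, and picks a number of points by comparing successive candidates. Everything here is illustrative: the function names, the candidate grid, the tolerance, and the parameter values are assumptions, and the stopping rule compares each candidate only with the next one, a simplification of the rule described above.

```python
import math


def hermite(n, x):
    """Physicists' Hermite polynomial H_n(x) via its three-term recurrence."""
    h_prev, h = 0.0, 1.0
    for k in range(n):
        h_prev, h = h, 2.0 * x * h - 2.0 * k * h_prev
    return h


def gauss_hermite(q):
    """Nodes, weights with sum_j w_j g(x_j) ~ integral of g(x) exp(-x^2) dx."""
    roots = []
    for m in range(1, q + 1):
        # Roots of H_m interlace those of H_{m-1} and lie inside (-b, b).
        b = math.sqrt(2.0 * m + 1.0)
        edges = [-b] + roots + [b]
        new_roots = []
        for lo, hi in zip(edges, edges[1:]):
            f_lo = hermite(m, lo)
            for _ in range(100):  # plain bisection on each bracketing interval
                mid = 0.5 * (lo + hi)
                f_mid = hermite(m, mid)
                if f_lo * f_mid > 0:
                    lo, f_lo = mid, f_mid
                else:
                    hi = mid
            new_roots.append(0.5 * (lo + hi))
        roots = new_roots
    norm = 2.0 ** (q - 1) * math.factorial(q) * math.sqrt(math.pi)
    weights = [norm / (q * hermite(q - 1, x)) ** 2 for x in roots]
    return roots, weights


def conditional_likelihood(y, u, tau, init, trans):
    """HMM likelihood of one patient's counts given the random intercept u."""
    K = len(tau)
    means = [math.exp(t + u) for t in tau]

    def pmf(obs, mean):
        return math.exp(-mean + obs * math.log(mean) - math.lgamma(obs + 1))

    alpha = [init[k] * pmf(y[0], means[k]) for k in range(K)]
    for obs in y[1:]:
        alpha = [sum(alpha[k] * trans[k][l] for k in range(K)) * pmf(obs, means[l])
                 for l in range(K)]
    return sum(alpha)


def marginal_likelihood(y, sigma, q, tau, init, trans):
    """Integrate over u ~ N(0, sigma^2) using q Gauss-Hermite points."""
    nodes, weights = gauss_hermite(q)
    total = sum(w * conditional_likelihood(y, math.sqrt(2.0) * sigma * x, tau, init, trans)
                for x, w in zip(nodes, weights))
    return total / math.sqrt(math.pi)


def choose_q(y, sigma, tau, init, trans, candidates=(3, 5, 9, 15, 21), rel_tol=1e-6):
    """Return the first q whose value agrees with the next candidate's value."""
    values = [marginal_likelihood(y, sigma, q, tau, init, trans) for q in candidates]
    for q, cur, nxt in zip(candidates, values, values[1:]):
        if abs(nxt - cur) <= rel_tol * abs(nxt):
            return q, cur
    return candidates[-1], values[-1]


y = [0, 2, 1, 0, 5, 3]
tau, init, trans = [-0.5, 1.2], [0.5, 0.5], [[0.8, 0.2], [0.4, 0.6]]
print(choose_q(y, 0.5, tau, init, trans))
```

Loosening `rel_tol` for early quasi-Newton iterations and tightening it near convergence would mirror the recommendation to accept less accuracy at the start and demand more at the end.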
I have used these techniques to estimate a variety of MHMMs
in the RRMS context. Using a dual 1.2-GHz Athlon processor,
the fitting of a model with no random effects takes less than
1 second, that with one random effect takes approximately 12
seconds, that with two or three correlated random effects takes
several hours, and that with four random effects (two correlated
random effects in the conditional model, independent of two
additional correlated random effects in the hidden model) takes
several days.
4.2 Monte Carlo Methods
For larger numbers of random effects, numerical integration
methods are no longer appropriate. For such complex models,
estimation is significantly more difficult. Of existing estimation
methods, the Monte Carlo expectation-maximization (MCEM)
algorithm (McCulloch 1997) seems to be the most feasible. In
this setting, both the states of the hidden Markov chain and the
random effects are treated as missing data. The complete log-likelihood is then given by

$$\begin{aligned}
\log L_c(\psi; y, z, u) &= \log f(y \mid z, u, \psi) + \log f(z \mid u, \psi) + \log f(u; \psi) \\
&= \sum_{i=1}^{N}\sum_{t=1}^{n_i} \log f(y_{it} \mid z_{it}, u, \psi) + \sum_{i=1}^{N} \log \pi_{z_{i1}} + \sum_{i=1}^{N}\sum_{t=2}^{n_i} \log P_{z_{i,t-1}, z_{it}} + \log f(u; \psi).
\end{aligned}$$

For the E-step, one needs to take the expectation of $\log L_c(\psi; y, z, u)$ conditional on the observed data and parameter estimates at iteration $p$ (denoted by $\psi^p$). Note that

$$\begin{aligned}
f(z, u \mid y, \psi^p) &= \frac{f(y \mid z, u, \psi^p)\, f(z \mid u, \psi^p)\, f(u; \psi^p)}{f(y; \psi^p)} \\
&= \frac{f(y \mid z, u, \psi^p)\, f(z \mid u, \psi^p)\, f(u; \psi^p)}{\int \sum_z f(y \mid z, u, \psi^p)\, f(z \mid u, \psi^p)\, f(u; \psi^p)\, du}.
\end{aligned}$$

Therefore, if one generates samples $u^1, \ldots, u^B$ from $f(u; \psi^p)$ (via a random number generator or more sophisticated methods such as Gibbs sampling), one obtains the approximation

$$E[\log L_c(\psi; y, z, u) \mid y, \psi^p] \approx \sum_{j=1}^{B} \sum_z \log L_c(\psi; y, z, u^j)\, h_j(z),$$

where

$$h_j(z) = \frac{f(y \mid z, u^j, \psi^p)\, f(z \mid u^j, \psi^p)}{\sum_{\ell=1}^{B} \sum_z f(y \mid z, u^\ell, \psi^p)\, f(z \mid u^\ell, \psi^p)}.$$

The values of $h_j(z)$ are easy to compute using the fact that the standard HMM likelihood can be written as a product of matrices (see Sec. 3).

For the M-step, note that, typically, the parameters involved in $f(y_{it} \mid z_{it}, u, \psi)$, the initial probabilities, the transition probabilities, and $f(u; \psi)$ form disjoint sets. (One notable exception occurs when the hidden Markov chains are stationary conditional on $u$, in which case the initial probabilities are functions of the transition probabilities.) In other words, the expressions $E[\sum_{i=1}^{N}\sum_{t=1}^{n_i} \log f(y_{it} \mid z_{it}, u, \psi) \mid y, \psi^p]$, $E[\sum_{i=1}^{N} \log \pi_{z_{i1}} \mid y, \psi^p]$, $E[\sum_{i=1}^{N}\sum_{t=2}^{n_i} \log P_{z_{i,t-1}, z_{it}} \mid y, \psi^p]$, and $E[\log f(u; \psi) \mid y, \psi^p]$ can usually be maximized separately, improving the efficiency of the procedure.

Assuming that the $\pi_\ell$'s are treated as unknown parameters, it is straightforward to show that

$$\hat{\pi}_\ell^{\,p+1} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{B} \sum_z h_j(z)\, 1(z_{i1} = \ell),$$

where $1(z_{i1} = \ell) = 1$ if $z_{i1} = \ell$ and 0 otherwise. Numerical maximization (e.g., via the quasi-Newton routine) would ordinarily be required in order to obtain updates for the other parameter estimates.

Although this method is flexible and theoretically sound, the number of samples, $B$, required to approximate the E-step accurately is an important practical consideration. For integrals of high dimension, $B$ may be very large, resulting in a prohibitive computational burden. I recommend the same approach as for the quadrature method discussed in Section 4.1, namely, to increase the values of $B$ with the number of EM iterations. To choose these different values, I suggest, given $B$ and starting values, estimating

$$E\left[\sum_{i=1}^{N}\sum_{t=1}^{n_i} \log f(y_{it} \mid z_{it}, u, \psi) \,\Big|\, y, \psi^p\right], \quad E\left[\sum_{i=1}^{N} \log \pi_{z_{i1}} \,\Big|\, y, \psi^p\right],$$

$$E\left[\sum_{i=1}^{N}\sum_{t=2}^{n_i} \log P_{z_{i,t-1}, z_{it}} \,\Big|\, y, \psi^p\right], \quad \text{and} \quad E[\log f(u; \psi) \mid y, \psi^p]$$

several times. The accuracy of the E-step is reflected in the
amount of variation in these estimates.
I have successfully used the MCEM algorithm to fit several different models in the RRMS setting. The most complex
of these included three random effects (one in the conditional
model, independent of two additional correlated random effects
in the hidden model). In this case, I used B = 50 for the first 5
iterations, B = 5,000 for the next 10 iterations, and B = 50,000
for subsequent iterations. The computational time required was
approximately 3 days.
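The Monte Carlo E-step weights and the closed-form update for the initial probabilities can be sketched for a toy model with Poisson counts and one normally distributed random intercept per patient. In the code below, sample $j$ consists of one draw per patient, its weight is the product of the patients' conditional HMM likelihoods, and the sum over hidden paths with $z_{i1} = \ell$ is obtained by running the matrix product with a start vector that is zero outside state $\ell$. The function names, data, and parameter values are hypothetical, and the plain product of likelihoods would need rescaling for realistic numbers of patients.

```python
import math
import random


def poisson_pmf(obs, mean):
    return math.exp(-mean + obs * math.log(mean) - math.lgamma(obs + 1))


def forward(y, start, trans, means):
    """Sum over hidden paths of start[z_1] f(y_1|z_1) prod_t P[z_{t-1}][z_t] f(y_t|z_t)."""
    K = len(start)
    alpha = [start[k] * poisson_pmf(y[0], means[k]) for k in range(K)]
    for obs in y[1:]:
        alpha = [sum(alpha[k] * trans[k][l] for k in range(K)) * poisson_pmf(obs, means[l])
                 for l in range(K)]
    return sum(alpha)


def mcem_initial_probs(series, init, trans, tau, sigma, B, rng):
    """One MCEM update of the initial probabilities from B Monte Carlo samples."""
    K, N = len(init), len(series)
    weights, first_state = [], []
    for _ in range(B):
        w_j, post_j = 1.0, []
        for y in series:
            u = rng.gauss(0.0, sigma)  # draw u_i^j from the current f(u; psi^p)
            means = [math.exp(t + u) for t in tau]
            # Joint likelihood of y_i and {z_i1 = l}, one value per state l.
            parts = [forward(y, [init[k] if k == l else 0.0 for k in range(K)], trans, means)
                     for l in range(K)]
            lik = sum(parts)
            w_j *= lik  # f(y | u^j) factorizes over patients
            post_j.append([p / lik for p in parts])
        weights.append(w_j)
        first_state.append(post_j)
    total = sum(weights)
    new_init = [0.0] * K
    for w_j, post_j in zip(weights, first_state):
        for post in post_j:
            for l in range(K):
                new_init[l] += (w_j / total) * post[l] / N
    return new_init


series = [[0, 1, 0, 3, 4], [5, 6, 2, 0, 0], [0, 0, 1, 0, 0]]
init, trans, tau = [0.5, 0.5], [[0.8, 0.2], [0.3, 0.7]], [-0.5, 1.3]
print(mcem_initial_probs(series, init, trans, tau, 0.5, 200, random.Random(0)))
```

Calling the function several times with different seeds and the same `B` gives a direct view of the Monte Carlo variability that the choice of `B` is meant to control.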
It is of interest to note that, in their estimation of a GLMM using the MCEM method, Chan and Kuk (1997) selected a much

smaller value of B (B = 1,000). This difference is an indication
of the complexity of the MHMM as compared to the GLMM,
in particular, of the difficulty in parameter estimation when the
observations are not assumed independent given the random effects.
4.3 Simulated Maximum Likelihood
Another way to estimate the parameters of a MHMM is to approximate the likelihood directly, that is, by generating samples
$u^1, \ldots, u^B$ from an importance sampling distribution, $g(u)$, and
then computing

$$\log L(\psi; y) \approx \log\left\{\frac{1}{B} \sum_{j=1}^{B} \frac{f(y \mid u^j, \psi)\, f(u^j; \psi)}{g(u^j)}\right\}.$$

Again, numerical methods would be needed to maximize this
function, and large values of B may be required.
McCulloch (1997) and Kuk and Cheng (1999), working in
the GLMM context, commented on the poor performance of
this method when used in isolation. I had a similar experience
in the MHMM setting. However, this method may be used successfully after executing a number of iterations of the MCEM
algorithm. One advantage of this approach is that standard errors are readily obtained as a function of the maximized loglikelihood.
4.4 Other Estimation Methods
Methods such as penalized quasi-likelihood (Breslow and
Clayton 1993) and h-likelihood (Lee and Nelder 1996), which
were developed for the estimation of GLMMs, rely on simple
forms for the derivatives of $\log f(y \mid u, \psi)$ with respect to $\psi$. Unlike the GLMM case, in a MHMM, the $Y_{it}$'s are not independent
conditional on u. Thus, these derivatives are very complicated,
and neither method seems appropriate here.
In this section I present the results of a small simulation study
designed to investigate how the uncertainty in my parameter
estimates depends on the model complexity and structure. Because my application of interest is the MS data described in Section 2, I conduct my study in this context. I consider a selection
of models that may be appropriate for such data, focusing my
study on models with one or no random effects. (Models with
more than one random effect take at least several hours to fit;
for this reason, I exclude them from my study.) For each model,
I execute the following two steps:
1. I generate a sequence of 20 counts for each of 30 independent patients divided equally into treatment and control groups (similar to the UBC data, but with only two
treatment groups).
2. I fit the model to the simulated data and record the parameter estimates.
Let Yit be the count observed at time t on patient i and let Zit
be the associated hidden state (remission = 1, relapse = 2). Let
xi = 1 if patient i is in the treatment group and xi = 0 otherwise.
I consider the following six models:


Cf: Treatment effect in the conditional model only;

Yit |Zit Poisson(eaZit +xi ), logit{P(Zit = 1|Zi,t1 =
z)} = z
Cfr: Treatment and random effect in the conditional model;
Yit |Zit , ui Poisson(eaZit +xi +ui ), logit{P(Zit = 1|
Zi,t1 = z)} = z , ui N(0, e2 )
Cfr.Hf: Treatment effect in the conditional and hidden models,
random effect in the conditional model; Yit |Zit , ui
Poisson(eaZit +xi +ui ), logit{P(Zit = 1|Zi,t1 = z)} =
z + z xi , ui N(0, e2 )
Cf.Hf: Treatment effect in the conditional and hidden models;
Yit |Zit Poisson(eaZit +xi ), logit{P(Zit = 1|Zi,t1 =
z)} = z + z xi
Cf.Hr: Treatment effect in the conditional model, random effect in the hidden model; Yit |Zit Poisson(eaZit +xi ),
logit{P(Zit = 1|Zi,t1 = 1, ui )} = 1 + ui , logit{P(Zit =
1|Zi,t1 = 2)} = 2 , ui N(0, e2 )
Cf.Hfr: Treatment effect in the conditional and hidden models, random effect in the hidden model; Yit |Zit
Poisson(eaZit +xi ), logit{P(Zit = 1|Zi,t1 = 1, ui )} =
1 + 1 xi + ui , logit{P(Zit = 1|Zi,t1 = 2)} = 2 + 2 xi ,
ui N(0, e2 ).
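Step 1 of the simulation can be sketched in code for a Cfr-type model (fixed treatment effect plus a patient-specific random effect in the conditional model). This is a minimal illustration, not the code used for the study: the function names, the argument names (`a`, `beta`, `tau`, `log_sd`, `p_init`), and all default values are assumptions introduced here, and the Poisson sampler is a simple textbook method chosen only to keep the example free of external libraries.

```python
import math
import random


def rpois(rng, mean):
    """Draw one Poisson variate by Knuth's multiplication method (adequate for small means)."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate_cfr(rng, n_patients=30, n_times=20, a=(0.0, 1.5), beta=-0.5,
                 tau=(2.0, 0.0), log_sd=-0.5, p_init=0.5):
    """Simulate counts from a Cfr-type model; every default value is a placeholder."""
    data = []
    for i in range(n_patients):
        x = 1 if i < n_patients // 2 else 0        # first half treated, second half control
        u = rng.gauss(0.0, math.exp(log_sd))       # random effect with variance exp(2 * log_sd)
        z = 1 if rng.random() < p_init else 2      # initial hidden state (1 = remission)
        counts = []
        for t in range(n_times):
            if t > 0:
                # tau[z - 1] is the logit of P(next state = 1 | current state = z)
                z = 1 if rng.random() < inv_logit(tau[z - 1]) else 2
            counts.append(rpois(rng, math.exp(a[z - 1] + beta * x + u)))
        data.append((x, counts))
    return data


data = simulate_cfr(random.Random(1))
print(len(data), "patients,", len(data[0][1]), "counts each")
```

Each element of `data` pairs a treatment indicator with that patient's sequence of 20 counts; step 2 would then fit the model to these sequences.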
The labeling of the models indicates whether fixed or random
effects are included and whether they appear in the conditional
or hidden models. For example, Cfr corresponds to the model
with a fixed (f) treatment effect and a random (r) effect in
the conditional (C) model. Similarly, Cf.Hf corresponds to the
model with fixed treatment effects in both the conditional and
the hidden (H) models.
I use logit{P(Zit = 1)} 1 in all cases. I do not include
P(Zit = 2) or P(Zit = 2|Zi,t−1 = z) as unknown parameters in
the models because P(Zit = 2) = 1 − P(Zit = 1) and P(Zit = 2|Zi,t−1 = z) = 1 − P(Zit = 1|Zi,t−1 = z). The parameterizations of the models were chosen so as to avoid the need for
constrained optimization (e.g., the variance of the random effect must be nonnegative, but its log standard deviation can take on any value on the
real line).
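The effect of this parameterization can be seen in a short sketch. It assumes, as in the model definitions, that probabilities enter through a logit and that the random-effect variance is the exponential of twice an unconstrained parameter; the names `to_natural`, `logit_p`, and `log_sd` are labels chosen here for illustration.

```python
import math


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def to_natural(logit_p, log_sd):
    """Map unconstrained values to a probability in (0, 1) and a variance greater than 0."""
    return inv_logit(logit_p), math.exp(2.0 * log_sd)


# Illustrative inputs only: any real number is a legal value for either argument.
for log_sd in (0.0, -0.5, -9.0):
    prob, var = to_natural(0.0, log_sd)
    print(f"log_sd = {log_sd:5.1f} -> variance = {var:.3g}")
```

An optimizer can therefore search over the whole real line: a value of -0.5 corresponds to a variance of about 0.37, whereas a very negative value such as -9 corresponds to a variance that is numerically indistinguishable from 0.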
Models Cf and Cf.Hf assume that the same model applies
to each patient in the same treatment group. Models Cfr and
Cfr.Hf allow the mean lesion count, given the hidden state,
to vary according to a patient-specific random effect. It is assumed that this random effect impacts the two conditional
means equally. Models Cf.Hr and Cf.Hfr allow the probability of remaining in remission from one month to the next to
vary among patients, but treat the probability of moving from
relapse to remission as common to all patients. This assumption would be valid in practice if the distribution of the relapse
periods tended to be relatively homogeneous across patients.
Tables 2 and 3 give the sample mean, sample standard deviation, and average asymptotic standard error of the parameter
estimates obtained based on 200 simulations from each model.
I estimated the parameters by evaluating the likelihood (using
Gauss–Hermite quadrature with 100 points in the case of Cfr,
Cfr.Hf, Cf.Hr, and Cf.Hfr) and then maximizing it with the quasi-Newton routine (using the true parameter values as the starting values).
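The likelihood evaluation for a model with a random effect in the conditional model can be sketched as follows: the forward algorithm gives one patient's HMM likelihood for a fixed value of the random effect, and Gauss–Hermite quadrature integrates that likelihood over the normal distribution of the random effect. This is an illustration under stated assumptions, not the implementation used in the study: it uses an exact 3-point rule instead of 100 points to stay short, omits the scaling needed for long sequences, and all function and argument names are hypothetical.

```python
import math

# Three-point Gauss-Hermite rule for the weight exp(-x^2); these values are exact.
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (math.sqrt(math.pi) / 6, 2 * math.sqrt(math.pi) / 3, math.sqrt(math.pi) / 6)


def pois_pmf(y, mean):
    return math.exp(y * math.log(mean) - mean - math.lgamma(y + 1))


def hmm_lik(counts, means, p_to1, init):
    """Forward algorithm for a two-state Poisson HMM; returns the likelihood.

    p_to1[k] is P(next state = 1 | current state = k + 1); init holds P(first state).
    """
    alpha = [init[k] * pois_pmf(counts[0], means[k]) for k in range(2)]
    for y in counts[1:]:
        to1 = alpha[0] * p_to1[0] + alpha[1] * p_to1[1]
        to2 = alpha[0] * (1 - p_to1[0]) + alpha[1] * (1 - p_to1[1])
        alpha = [to1 * pois_pmf(y, means[0]), to2 * pois_pmf(y, means[1])]
    return sum(alpha)


def mixed_loglik(counts, log_means, p_to1, init, sd):
    """Log-likelihood for one patient, integrating over u ~ N(0, sd^2) by quadrature."""
    total = 0.0
    for node, weight in zip(GH_NODES, GH_WEIGHTS):
        u = math.sqrt(2.0) * sd * node             # change of variables for N(0, sd^2)
        means = [math.exp(m + u) for m in log_means]
        total += weight * hmm_lik(counts, means, p_to1, init)
    return math.log(total / math.sqrt(math.pi))


counts = [0, 1, 3, 0, 5, 2]                        # made-up counts for one patient
print(mixed_loglik(counts, [0.0, 1.5], [0.8, 0.4], [0.5, 0.5], sd=0.6))
```

Summing `mixed_loglik` over patients gives an objective that a quasi-Newton routine can maximize; with `sd = 0` the function reduces to the ordinary HMM log-likelihood.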
For models Cf, Cfr, Cfr.Hf, and Cf.Hf, the mean values in
these tables are close to the true parameter values. Furthermore,
the average asymptotic standard errors agree quite well with the


Journal of the American Statistical Association, March 2007

Table 2. Sample Mean (sample standard deviation, average asymptotic standard error) of the Parameter
Estimates Under Models Cf, Cfr, and Cfr.Hf

Cf                   Cfr                  Cfr.Hf
1.000 (.174, .174)   1.048 (.245, .218)   .999 (.218, .212)
1.500 (.050, .051)   1.485 (.185, .150)   1.516 (.160, .151)
2.011 (.149, .149)   1.975 (.281, .251)   1.975 (.365, .317)
.071 (.536, .543)    .000 (.593, .551)    .007 (.608, .553)
.848 (.201, .191)    .851 (.213, .205)    .863 (.202, .216)
.402 (.207, .211)    .423 (.227, .216)    .383 (.213, .227)
NA                   NA                   1.049 (.765, .689)
NA                   NA                   .956 (.782, .732)
NA                   .560 (.208, .175)    .603 (.240, .188)

NOTE: NA means not applicable.

sample standard deviations. Histograms of the parameter estimates (not shown) do not deviate substantially from the normal
distribution. Therefore, it would seem that the usual asymptotic
properties apply reasonably well when there is no random effect or one random effect in the conditional model, even for
such a modest sample size.
For models Cf.Hr and Cf.Hfr, with the exception of the estimates of the variance parameter of the random effect, the sample means are close to the true values,
and the histograms are well behaved. However, the asymptotic
standard errors are, in general, larger than the sample standard
deviations of the parameter estimates. This suggests that more
data may be required in order to obtain accurate standard errors
when there is a random effect in the hidden model. To confirm this theory, I ran 200 additional simulations from Cf.Hr
and Cf.Hfr but with 60 patients instead of 30. The resulting asymptotic standard errors were very close to the sample standard
deviations (agreeing to two decimal places, in most cases).
In the case of both Cf.Hr and Cf.Hfr, the histograms of the
estimates of the variance parameter of the random effect are distinctly bimodal, with the bulk (78%) of the
estimates clustering around the true value of −.5, and the rest
clustering around −9 (implying that the variance of the random
effect, the exponential of twice this parameter, is approximately 0). This behavior suggests that the
estimated variance of the random effect will be close to 0 unless
the data provide strong evidence otherwise. The simulations using 60 patients support this claim: these histograms are indeed
much closer to unimodal, with less than 5% of the estimates
clustered around −9. Clearly, a considerable amount of data is
necessary for estimating the variance of the random effect in the
hidden model. Not surprisingly, the asymptotic standard errors
associated with the estimates of this parameter were often poorly behaved;
for this reason, I have excluded their averages from Table 3.
In terms of the standard deviations associated with the parameter estimates, I consider first the parameters of the conditional model: a1, a2, and the treatment effect. In Cf, one sees that
these parameters are quite precisely estimated, especially the
largest of these, a2. The inclusion of a random effect in the conditional model (Cfr and Cfr.Hf) leads to increased variability in
these estimates. Moreover, incorporating a fixed treatment effect in the hidden model results in greater variability in the estimate of the conditional treatment effect (Cf.Hf and Cf.Hfr); estimating treatment effects in
both parts of the model is more difficult than estimating a treatment effect in the conditional model alone. In contrast, adding a
random effect to the hidden model (Cf.Hr) appears to have little
effect on the precision of the estimates of a1, a2, and the conditional treatment effect.
Turning now to the parameters of the hidden model, one sees
from Cf.Hr and Cf.Hfr that the estimate of the intercept in the model for P(Zit = 1|Zi,t−1 = 1) tends to be more
variable when this probability includes a random effect.
The estimates of the two treatment effects in the hidden model are not precise in any model and
do not seem to be affected by the addition of a random effect
in either the conditional (Cfr.Hf) or hidden (Cf.Hfr) model.
The results are similar when one considers 60 patients instead
of 30. However, when one increases the variance parameter of the random effect (e.g.,
from −.5 to 0), the variability in the estimate of the treatment effect on P(Zit = 1|Zi,t−1 = 1) in Cf.Hfr is
substantially greater than that in Cf.Hf (simulations not shown).
Therefore, as was the case with the conditional model, one sees
that adding a random effect to the hidden model results in less
precision in the estimates of the fixed effects in this part of the model.

Table 3. Sample Mean (sample standard deviation, average asymptotic standard error) of the Parameter
Estimates Under Models Cf.Hf, Cf.Hr, and Cf.Hfr

Cf.Hf                Cf.Hr                Cf.Hfr
1.027 (.178, .174)   1.017 (.178, .232)   1.030 (.162, .244)
1.503 (.050, .052)   1.508 (.050, .103)   1.496 (.050, .126)
1.985 (.228, .227)   2.011 (.160, .207)   1.970 (.248, .311)
.004 (.581, .526)    .032 (.546, .564)    .043 (.545, .656)
.843 (.202, .200)    .854 (.229, .299)    .823 (.261, .385)
.356 (.217, .219)    .397 (.207, .266)    .402 (.208, .306)
1.019 (.608, .592)   NA                   1.137 (.693, .754)
.977 (.824, .758)    NA                   .988 (.965, .887)
NA                   2.508 (3.778, )      2.633 (3.814, )

NOTE: NA means not applicable, and indicates omitted values.



As expected, the parameter 1 is not precisely estimated in

any case.
In summary, this study serves to illustrate two major points:
1. At least in the context of simple MHMMs, MLEs and
their asymptotic standard errors are typically well behaved.
2. There is far more information about the parameters associated with the conditional model than those associated
with the hidden model. Therefore, in practice, one would
likely confine any complex modeling to the conditional model.
I now use a MHMM to analyze the PRISMS data described
in Section 2. Let Yhit be the lesion count for patient i in treatment group h at time t and let Zhit ∈ {1, . . . , K} be the associated
hidden state. I assume that h ∈ {P, L, H}, where P = placebo,
L = low dose, and H = high dose. The results of the analysis in
MacKay (2002) suggested that two hidden states are appropriate for this type of data, and thus I use K = 2 here.
MacKay (2003) provided indication that a stationary Poisson HMM may be an appropriate model for an individual patient's data. Incorporating these results in the MHMM framework, I assume that, given $Z_{hit} = k$ and $u_{hi}$, $Y_{hit}$ is distributed as
Poisson($\mu_{hik}$) with
$$\log \mu_{hik} = \alpha_k + \beta_{kh} + u_i,$$
where $\{u_i\}$ are iid, each with an $N(0, e^{2\sigma})$ distribution. I take
$\beta_{1P} \equiv \beta_{2P} \equiv 0$ so that $\alpha_k$ determines the placebo group's log
mean lesion count while in state $k$. Thus, $\beta_{kL}$ and $\beta_{kH}$ are the
effects of the low- and high-dose treatments, respectively, on
this log mean. Furthermore, I model the transition probabilities as
$$\mathrm{logit}\{P(Z_{hit} = 1 \mid Z_{hi,t-1} = s)\} = \tau_s + \gamma_{sh}.$$
Again, I take $\gamma_{1P} \equiv \gamma_{2P} \equiv 0$, so that $\tau_s$ determines the placebo
group's probability of making a transition from state $s$ to a state
of remission. Consequently, $\gamma_{sL}$ and $\gamma_{sH}$ are the effects of the
low- and high-dose treatments, respectively, on this probability.
In addition, I assume that $\{Z_{hit}\}_{t \ge 1}$ is stationary for all $h$, $i$, with
$\pi_{hik} \equiv P(Z_{hit} = k)$.
Under this model, the marginal moments discussed in Section 3.3 have a closed form. One clear advantage of this model
over that of Albert et al. (1994) is the relative interpretability of
the marginal mean structure:
$$E[Y_{hit}] = \exp\left\{\frac{e^{2\sigma}}{2}\right\} \sum_{k=1}^{2} \pi_{hik}\, e^{\alpha_k + \beta_{kh}}.$$
It can also be shown that
$$\mathrm{cov}[Y_{is}, Y_{it}] = C_1 + C_2 \rho^{|t-s|},$$
where $C_1$ and $C_2$ are constant with respect to $i$, $s$, and $t$, and
$0 < \rho < 1$. Therefore, this model assumes that the covariance
between two observations on a patient has two components:
a base level, $C_1$, common to all pairs of observations on this
patient; and a secondary level, $C_2 \rho^{|t-s|}$, that declines to 0 as the
time between the observations increases.
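The two-component covariance structure can be illustrated numerically. In this sketch the covariance is a base level plus a term that decays geometrically with the time separation; the name `rho` for the decay base and all numerical values are placeholders chosen here, not estimates from the PRISMS data.

```python
def implied_cov(lag, c1, c2, rho):
    """Covariance c1 + c2 * rho**|lag| between two observations `lag` time points apart."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    return c1 + c2 * rho ** abs(lag)


# Placeholder constants: the covariance falls toward the base level c1 as the lag grows.
for lag in (1, 2, 5, 10, 20):
    print(lag, round(implied_cov(lag, c1=0.5, c2=2.0, rho=0.6), 4))
```

The base level reflects the shared random effect, which links every pair of observations on a patient regardless of spacing, while the decaying term reflects the Markov dependence in the hidden states.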

For the purposes of comparison, I also consider the same
model with no random effect. In other words, I let $Y_{hit} \mid Z_{hit} = k$
be distributed as Poisson($\mu_{hk}$) with $\log \mu_{hk} = \alpha_k + \beta_{kh}$. I refer
to this as the fixed model.
I fit both models using the quasi-Newton method, trying
multiple starting values. In the case of the MHMM, I used
the Gauss–Hermite quadrature formula (with 175 quadrature
points) to approximate the likelihood. The missing data in this
study were easily accommodated; for example, if r sequential
observations were missing on a given patient, I simply used the
r-step (rather than 1-step) transition probability in the appropriate piece of the likelihood. Table 4 gives the parameter estimates, approximate standard errors, and maximized likelihoods
for both models.
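The treatment of missing observations can be sketched with a matrix power: when two consecutive observed scans are several time steps apart, the one-step transition matrix in the likelihood is replaced by the corresponding multi-step matrix. The transition probabilities and the variable `gap` below are hypothetical values for illustration only.

```python
def mat_mult(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def step_matrix(p, steps):
    """Return the `steps`-step transition matrix, i.e., p multiplied by itself `steps` times."""
    result = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(steps):
        result = mat_mult(result, p)
    return result


# Hypothetical one-step transition matrix (rows: current state; columns: next state).
P = [[0.9, 0.1],
     [0.4, 0.6]]

gap = 3  # number of time steps separating two consecutive observed scans
for row in step_matrix(P, gap):
    print([round(v, 4) for v in row])
```

In the forward recursion, the multi-step matrix simply takes the place of `P` between the two observed time points, so no imputation of the missing counts is needed.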
The estimates of the conditional-model treatment effects $\beta_{2L}$, $\beta_{1H}$, and $\beta_{2H}$ suggest that the treatment
has a beneficial effect on the mean number of lesions (the high
dose more so than the low dose). However, because of the large
standard errors associated with the estimates of the hidden-model treatment effects $\gamma_{sh}$, I am unable
to detect an effect of treatment on the transition probabilities.
(These results are consistent with those in Sec. 5 regarding the
varying precision of the parameter estimates.) To validate these
conclusions, I fit the MHMM with a treatment effect in the conditional model only and obtained a maximum value of −506.4
for the log-likelihood. In contrast, when I included a treatment
effect in the hidden model only, I obtained a maximum value
of −512.9 for the log-likelihood. When compared to the maximum value of the log-likelihood of the original MHMM (−505.2),
these values further suggest that the treatment effect in the conditional model is significant, whereas the treatment effect in the
hidden model is not.
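The comparison of maximized log-likelihoods can be made concrete with likelihood ratio statistics. This is an informal sketch: it reads the three reported maxima as log-likelihoods of -505.2, -506.4, and -512.9, and it assumes that each reduced model drops four treatment-effect parameters (two states by two active doses), so that a chi-square distribution with 4 degrees of freedom serves as a rough yardstick. Neither the degrees of freedom nor the use of this reference distribution is stated in the text.

```python
def lr_statistic(loglik_full, loglik_reduced):
    """Likelihood ratio statistic: twice the gain in maximized log-likelihood."""
    return 2.0 * (loglik_full - loglik_reduced)


CHI2_95_DF4 = 9.488  # upper 5% point of the chi-square distribution with 4 df

full = -505.2                    # treatment effects in both parts of the model
conditional_only = -506.4        # treatment effect in the conditional model only
hidden_only = -512.9             # treatment effect in the hidden model only

drop_hidden = lr_statistic(full, conditional_only)
drop_conditional = lr_statistic(full, hidden_only)
print("dropping hidden-model effect:     ", round(drop_hidden, 1), drop_hidden > CHI2_95_DF4)
print("dropping conditional-model effect:", round(drop_conditional, 1), drop_conditional > CHI2_95_DF4)
```

Under these assumptions, removing the hidden-model treatment effect costs very little in fit, whereas removing the conditional-model treatment effect costs considerably more, which matches the conclusion drawn above.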
Table 4 also shows that, by including the random effect, one
observes quite a large increase in the log-likelihood. In addition, one sees substantial changes in some of the estimates of
the fixed effects. More formally, one can use the variance component test described in MacKay (2003) to test the hypothesis that the variance of the random effect is 0. This test is equivalent to a comparison of the
fixed model and the MHMM and involves the bootstrap. I obtain a p value of <.001, implying strong evidence in favor of
the MHMM.
Table 4. Parameter Estimates, Standard Errors, and Maximized
Likelihoods for the UBC PRISMS Data

In addition, in contrast to the analysis of Albert et al. (1994)
discussed in Section 2, the MHMM (and my larger sample size,
presumably) allow one to detect the autocorrelation among observations on the same patient. In particular, if one assumes
that {Zhit} are iid rather than Markovian, then, conditional on
uhi, Yhit has a mixture distribution but is uncorrelated with Yhis,
s ≠ t. The log-likelihood based on this model has a maximum
value of −528.3. The associated AIC value is thus 1,076.6,
compared with 1,036.5 in the case of the MHMM.
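The AIC values can be checked against the reported log-likelihoods using the usual definition, AIC = 2p − 2 log L, where p is the number of free parameters. The parameter counts are not stated in this passage, so the sketch below backs them out from the reported figures instead of assuming them; the function names are chosen here for illustration.

```python
def aic(loglik, n_params):
    """Akaike information criterion: 2 * (number of parameters) - 2 * log-likelihood."""
    return 2.0 * n_params - 2.0 * loglik


def implied_n_params(loglik, aic_value):
    """Number of parameters implied by a reported (log-likelihood, AIC) pair."""
    return (aic_value + 2.0 * loglik) / 2.0


print(round(implied_n_params(-528.3, 1076.6), 2))   # model with iid hidden states
print(round(implied_n_params(-505.2, 1036.5), 2))   # MHMM
print(round(aic(-528.3, 10), 1))
```

The first pair is consistent with 10 parameters and the second with about 13, the small discrepancy in the latter presumably reflecting rounding of the reported log-likelihood. The lower AIC for the MHMM therefore holds despite its larger parameter count.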
Note that, in my preliminary analysis of these data, I attempted to fit models with a patient-specific random effect in
the hidden as well as conditional model. Under these models,
the likelihoods were very flat, and convergence was difficult to
achieve. In both the model where I allowed P(Zhit = 1|Zhi,t−1 =
1) to vary according to a random effect and that where I allowed P(Zhit = 1|Zhi,t−1 = 2) to vary, the maximum value of
the log-likelihood was −505.2 (essentially identical to the maximum achieved by the model with a random effect in the conditional model only). In addition, the estimates of the variance of
the random effects in the hidden model were approximately 0,
and the estimates of the other parameters were very similar (relative to their standard errors) across all models. Thus, for these
data, it appears that the inclusion of random effects in the hidden model is unnecessary.
To conclude, by borrowing strength across patients using a
MHMM with a random effect in the conditional model, I am
able to detect a treatment effect as well as important features
of the data such as autocorrelation. Moreover, this model provides a significantly improved fit over the fixed model with the
addition of only one extra unknown parameter (the variance parameter of the random effect).
In summary, I have shown that the addition of one or two
random effects to the conditional model for the observed data
results in a model that can be readily interpreted and estimated.
I have applied such a model to the PRISMS data and have
demonstrated the advantages of such an approach in this context.
It is also possible to include random effects in the model for
the hidden process, but such models are more difficult to interpret and may involve high-dimensional integrals. In general,
one has less information about the parameters of the hidden
process than about the parameters of the conditional model. In
this case, extending the model to allow patient-to-patient differences on this level may explain very little additional variation
in the observed data and, hence, may not be worthwhile from
a statistical standpoint. Moreover, capturing interpatient heterogeneity in the hidden processes is still possible by incorporating
covariates in this part of the model. Therefore, in practice, one
would be more likely to use Model I than Model II unless the
data provided strong indication that the complexity of Model II
was warranted.
Estimation of models with more than three or four random
effects is an area of ongoing research. The MCEM algorithm
can be used, in theory, to estimate any MHMM, but the computational burden required for models with large numbers of correlated random effects can be prohibitive. However, as computing power increases, so will the feasibility of using the MCEM
method to estimate increasingly complex models.
[Received September 2005. Revised August 2006.]

Albert, P. S., McFarland, H. F., Smith, M. E., and Frank, J. A. (1994), "Time Series for Modelling Counts From a Relapsing–Remitting Disease: Application
to Modelling Disease Activity in Multiple Sclerosis," Statistics in Medicine,
13, 453–466.
Altman, R. M., and Petkau, A. J. (2005), "Application of Hidden Markov Models to Multiple Sclerosis Lesion Count Data," Statistics in Medicine, 24,
Breslow, N. E., and Clayton, D. G. (1993), "Approximate Inference in Generalized Linear Mixed Models," Journal of the American Statistical Association,
88, 9–25.
Chan, J. S. K., and Kuk, A. Y. C. (1997), "Maximum Likelihood Estimation for
Probit-Linear Mixed Models With Correlated Random Effects," Biometrics,
53, 86–97.
Hughes, J. P., and Guttorp, P. (1994), "A Class of Stochastic Models for Relating Synoptic Atmospheric Patterns to Regional Hydrologic Phenomena,"
Water Resources Research, 30, 1535–1546.
Humphreys, K. (1997), "Classification Error Adjustments for Female Labour
Force Transitions Using a Latent Markov Chain With Random Effects," in
Applications of Latent Trait and Latent Class Models in the Social Sciences,
eds. J. Rost and R. Langeheine, New York: Waxmann Munster, pp. 370–380.
Humphreys, K. (1998), "The Latent Markov Chain With Multivariate Random Effects," Sociological Methods & Research, 26, 269–299.
Krogh, A. (1998), "An Introduction to Hidden Markov Models for Biological Sequences," in Computational Methods in Molecular Biology, eds.
S. L. Salzberg, D. B. Searls, and S. Kasif, Amsterdam: Elsevier, pp. 45–63.
Kuk, A. Y. C., and Cheng, Y. W. (1999), "Pointwise and Functional Approximations in Monte Carlo Maximum Likelihood Estimation," Statistics and
Computing, 9, 91–99.
Lee, Y., and Nelder, J. A. (1996), "Hierarchical Generalized Linear Models,"
Journal of the Royal Statistical Society, Ser. B, 58, 619–678.
Leroux, B. G., and Puterman, M. L. (1992), "Maximum-Penalized Likelihood
Estimation for Independent and Markov-Dependent Mixture Models," Biometrics, 48, 545–558.
Levinson, S. E., Rabiner, L. R., and Sondhi, M. M. (1983), "An Introduction to
the Application of the Theory of Probabilistic Functions of a Markov Process
to Automatic Speech Recognition," The Bell System Technical Journal, 62,
Li, D. K., and Paty, D. W. (1999), "Magnetic Resonance Imaging Results of
the PRISMS Trial: A Randomized, Double-Blind, Placebo-Controlled Study
of Interferon-Beta1a in Relapsing–Remitting Multiple Sclerosis," Annals of
Neurology, 46, 197–206.
MacDonald, I. L., and Zucchini, W. (1997), Hidden Markov Models and Other
Models for Discrete-Valued Time Series, London: Chapman & Hall.
MacKay, R. J. (2002), "Estimating the Order of a Hidden Markov Model," The
Canadian Journal of Statistics, 30, 573–589.
MacKay, R. J. (2003), "Hidden Markov Models: Multiple Processes and Model Selection," unpublished Ph.D. dissertation, University of British Columbia,
Dept. of Statistics.
McCulloch, C. E. (1997), "Maximum Likelihood Algorithms for Generalized
Linear Mixed Models," Journal of the American Statistical Association, 92,
McCulloch, C. E., and Searle, S. R. (2001), Generalized, Linear, and Mixed
Models, New York: Wiley.
Nash, J. C. (1979), Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation, New York: Wiley.
PRISMS Study Group (1998), "Randomised Double-Blind Placebo-Controlled
Study of Interferon Beta-1a in Relapsing/Remitting Multiple Sclerosis," The
Lancet, 352, 1498–1504.
PRISMS Study Group and the University of British Columbia MS/MRI Analysis Group (2001), "PRISMS-4: Long-Term Efficacy of Interferon-Beta-1a in
Relapsing MS," Neurology, 56, 1628–1636.
Seltman, H. J. (2002), "Hidden Markov Models for Analysis of Biological
Rhythm Data," in Case Studies in Bayesian Statistics, Vol. 5, Springer-Verlag,
pp. 397–405.
Sormani, M. P., Bruzzi, P., Comi, G., and Filippi, M. (2002), "MRI Metrics
as Surrogate Markers for Clinical Relapse Rate in Relapsing–Remitting MS
Patients," Neurology, 58, 417–421.
Turner, T. R., Cameron, M. A., and Thomson, P. J. (1998), "Hidden Markov
Chains in Generalized Linear Models," The Canadian Journal of Statistics,
26, 107–125.
Wang, P., and Puterman, M. L. (1999), "Markov Poisson Regression Models
for Discrete Time Series," Journal of Applied Statistics, 26, 855–882.