
Hidden Markov Model

Achmad Arifin
Electrical Engineering Department, ITS
arifin@ee.its.ac.id
(Summarized from: Advanced Digital Signal
Processing and Noise Reduction, Saeed V. Vaseghi,
John Wiley & Sons Ltd)
Abstract
⚫ Hidden Markov models (HMMs) are used for the statistical
modelling of non-stationary signal processes, e.g. physiological
signals (ECG, EMG, EEG, etc.), image data, and other time-
varying data (gold prices, stock prices, random noise, etc.).

⚫ The objective of an HMM is to model the variations over time and/or
space of the statistics of a random process with a Markovian
chain of state-dependent stationary subprocesses.

⚫ An HMM is essentially a Bayesian finite state process, with
- a Markovian chain of states for modelling the transitions
between the states, and
- a set of state probability density functions for modelling the
random variations of the signal process within each state.
Non-stationary Process
⚫ A non-stationary process: a process whose statistical parameters vary over
time.
⚫ Biological signals, audio signals, and seismic signals are examples of non-
stationary processes, in that the parameters of the systems that generate the
signals, and the environments in which the signals propagate, change with time.

⚫ A non-stationary model: a double-layered stochastic process.
A hidden process controls the time variations of the statistics
of the observable process.
Continuous and Finite State
⚫ Classification: a) continuously variable state process, b) finite state
process.

⚫ A continuously variable state process: a state process whose underlying
statistics vary continuously with time. Examples: speech and music signals,
whose power and spectral composition vary continuously with time.

⚫ A finite state process: a state process whose statistical characteristics can
switch between a finite number of stationary or non-stationary states.
Examples: impulsive noise, or the ECG signal, which reflects specific physiological
events in the anatomy of the heart.
Continuous state process
⚫ A non-stationary AR(1) process is modelled as a double-layered process.

⚫ The observable AR signal model:

x(m) = a(m) x(m−1) + e(m)

⚫ The hidden model controls the
time variations of the parameter of the
non-stationary AR model:

a(m) = β a(m−1) + ε(m)

• a(m): the time-varying coefficient of the observable AR process.
• β: coefficient of the hidden state-control process.
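A minimal simulation sketch of this double-layered AR(1) model (not from the source; the value of β and the noise variances are illustrative assumptions):

import java.util.Random;

// Minimal simulation of the double-layered non-stationary AR(1) model above:
// a hidden AR(1) process a(m) drives the coefficient of the observable AR(1) process x(m).
// The coefficient beta and the noise variances are illustrative assumptions.
public class HiddenAr1Demo {
    public static void main(String[] args) {
        Random rng = new Random(0);
        int M = 20;                 // number of samples to generate
        double beta = 0.95;         // coefficient of the hidden state-control process
        double a = 0.5, x = 0.0;    // initial hidden coefficient and observable sample
        for (int m = 1; m < M; m++) {
            // hidden layer: a(m) = beta * a(m-1) + epsilon(m)
            a = beta * a + 0.01 * rng.nextGaussian();
            // observable layer: x(m) = a(m) * x(m-1) + e(m)
            x = a * x + rng.nextGaussian();
            System.out.printf("m=%2d  a=%.3f  x=%.3f%n", m, a, x);
        }
    }
}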
Finite state process (binary)

⚫ At each time instant a random switch selects one of the two AR models for
connection to the output terminal:

x(m) = s̄(m) x0(m) + s(m) x1(m)

where the binary switch s(m) selects the state of the process at time m, and s̄(m) = 1 − s(m).

⚫ Continuously variable processes can be approximated by an appropriate
finite state process, as illustrated by the sketch below.
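A sketch of the binary-state switched process above (the two AR coefficients and the switching probability are illustrative assumptions; here the switch is drawn independently at each instant for simplicity, whereas in an HMM it would follow a Markov chain):

import java.util.Random;

// Sketch of the binary finite-state model above: at each time instant a random
// switch s(m) selects one of two AR(1) models, x(m) = (1 - s(m)) x0(m) + s(m) x1(m).
// The AR coefficients (0.9 and -0.5) and the switching probability are assumptions.
public class BinaryStateSwitchDemo {
    public static void main(String[] args) {
        Random rng = new Random(1);
        double x0 = 0.0, x1 = 0.0;
        for (int m = 0; m < 20; m++) {
            x0 = 0.9 * x0 + rng.nextGaussian();     // AR model of state 0
            x1 = -0.5 * x1 + rng.nextGaussian();    // AR model of state 1
            int s = rng.nextDouble() < 0.3 ? 1 : 0; // random switch s(m)
            double x = (1 - s) * x0 + s * x1;       // observable output
            System.out.printf("m=%2d  s=%d  x=%.3f%n", m, s, x);
        }
    }
}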
Hidden Markov Model (1)
⚫ A hidden Markov model (HMM): a double-layered finite state process,
with a hidden Markovian process that controls the selection of the states of
an observable process.
⚫ A binary-state Markovian process can be
illustrated by two containers holding
different mixtures of black and
white balls.
⚫ PB and PW: the probabilities of the black
and the white balls in each container.
⚫ Assume that at successive time
intervals a hidden selection process
selects one of the two containers to
release a ball.
⚫ The balls released are replaced so that the mixture density of the black and
the white balls in each container remains unaffected. Each container can be
considered as an underlying state of the output process.
Hidden Markov Model (2)

⚫ At any time, if the output from the
currently selected container is a black ball
then the same container is selected to
output the next ball, otherwise the other
container is selected.

⚫ This is an example of a Markovian process,
because the next state of the process
depends on the current state, as shown in the binary state model of the figure.
Note that in this example the observable outcome does not unambiguously
indicate the underlying hidden state, because both states are capable of releasing
black and white balls.
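A short simulation sketch of this two-container process (the black-ball probabilities 0.7 and 0.4 are illustrative assumptions, not values from the source):

import java.util.Random;

// Sketch of the two-container example above: the currently selected container
// releases a ball; a black ball keeps the same container for the next draw,
// a white ball switches to the other container. The probabilities of black
// balls in each container (0.7 and 0.4) are illustrative assumptions.
public class BallContainerDemo {
    public static void main(String[] args) {
        Random rng = new Random(2);
        double[] pBlack = {0.7, 0.4}; // P_B for container 0 and container 1
        int state = 0;                // hidden state = currently selected container
        for (int t = 0; t < 15; t++) {
            boolean black = rng.nextDouble() < pBlack[state];
            System.out.printf("t=%2d  container=%d  ball=%s%n", t, state, black ? "black" : "white");
            if (!black) state = 1 - state; // white ball: switch to the other container
        }
    }
}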
Hidden Markov Model (3)
⚫ In general, a hidden Markov model has N states, with each state trained
to model a distinct segment of a signal process. A hidden Markov model can
be used to model a time-varying random process as a probabilistic
Markovian chain of N stationary, or quasi-stationary, elementary subprocesses.
⚫ A general form of a three-state HMM:

⚫ This structure is known as an ergodic
HMM. In the context of an HMM,
the term “ergodic” implies that there are no
structural constraints for connecting any
state to any other state.
Hidden Markov Model (4)
⚫ Time series data → variation: spectral composition & time scale
(articulation).

⚫ Each state → a mechanism to
accommodate the random variations in
different realizations of the segments
that it models.

⚫ The state observation pdfs model the
probability distributions of the spectral
composition of the signal segments
associated with each state.

⚫ The state transition probabilities →
a mechanism for connecting the various
states, and for modelling the variations
in the duration and time scales of the
signals in each state.
Heart sound data
HMM as a Bayesian model
⚫ A hidden Markov model M is a Bayesian structure with a Markovian state
transition probability and a state observation likelihood that can be either a
discrete pmf or a continuous pdf. The posterior pmf of a state sequence s of a
model M, given an observation sequence X, can be expressed using Bayes’
rule as the product of a state prior pmf and an observation likelihood function:

PS|X,M(s|X,M) = [1/fX(X)] PS|M(s|M) fX|S,M(X|s,M)

where the observation sequence X is modelled by the likelihood function
fX|S,M(X|s,M), and fX(X) is the pdf of the observation sequence.

⚫ The posterior probability that an observation signal sequence X was
generated by the model M is summed over all likely state sequences, and
may also be weighted by the model prior PM(M):

PM|X(M|X) = [1/fX(X)] PM(M) Σs PS|M(s|M) fX|S,M(X|s,M)

where PM(M) is the model prior, PS|M(s|M) the state prior, and
fX|S,M(X|s,M) the observation likelihood.
⚫ The Markovian state transition prior can be used to model the time
variations and the sequential dependence of most non-stationary processes.
However, for many applications, such as speech recognition, the state
observation likelihood has far more influence on the posterior probability
than the state transition prior.
HMM parameters
⚫ Number of states N. This is usually set to the total number of distinct, or
elementary, stochastic events in a signal process. For example, in
modelling a binary-state process such as impulsive noise, N is set to 2,
and in isolated-word speech modelling N is set between 5 and 10.
⚫ State transition probability matrix A = {aij, i, j = 1, ..., N}. This provides a
Markovian connection network between the states, and models the
variations in the duration of the signals associated with each state. For
a left–right HMM, aij = 0 for i > j, and hence the
transition matrix A is upper-triangular.
⚫ State observation vectors {μi1, μi2, ..., μiM, i = 1, ..., N}. For each state, a set
of M prototype vectors models the centroids of the signal space associated with that state.
⚫ State observation vector probability model. This can be either a discrete
model composed of the M prototype vectors and their associated
probability mass functions (pmfs) P = {Pij(·); i = 1, ..., N, j = 1, ..., M}, or it
may be a continuous (usually Gaussian) pdf model F = {fij(·); i = 1, ...,
N, j = 1, ..., M}.
⚫ Initial state probability vector π = [π1, π2, ..., πN].
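The parameters listed above can be gathered into a simple data structure. The following is a hypothetical minimal container for a discrete-observation HMM; the class name and the public fields pi, a and b are chosen to match the code example at the end of these notes and do not correspond to any particular library:

// Hypothetical minimal container for the parameters of a discrete-observation HMM:
// N states, M observation symbols, initial probabilities pi, transition matrix a,
// and observation (emission) probabilities b. This is a sketch, not a library API.
public class HMM {
    public final int N;          // number of states
    public final int M;          // number of observation symbols
    public final double[] pi;    // initial state probability vector, pi[i]
    public final double[][] a;   // transition probabilities, a[i][j] = P(state j at t | state i at t-1)
    public final double[][] b;   // observation probabilities, b[i][m] = P(symbol m | state i)

    public HMM(int numStates, int numSymbols) {
        N = numStates;
        M = numSymbols;
        pi = new double[N];
        a = new double[N][N];
        b = new double[N][M];
    }
}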
State observation model (2)

⚫ For the modelling of a continuous-valued process, the signal space
associated with each state is partitioned into a number of clusters, as shown
in the figure. If the signals within each cluster are modelled by a uniform
distribution, then each cluster is described by its centroid vector and the
cluster probability, and the state observation model consists of M cluster
centroids and the associated pmf {μik, Pik; i = 1, ..., N, k = 1, ..., M}. In effect,
this results in a discrete state observation HMM for a continuous-valued
process. The figure shows a partitioning, and quantisation, of a signal
space into a number of centroids.
State observation model (3)
⚫ If each cluster of the state observation space is modelled by a
continuous pdf, such as a Gaussian pdf, then a continuous density
HMM results. The most widely used state observation pdf for an
HMM is the mixture Gaussian density

fX|S(x|s=i) = Pi1 N(x, μi1, Σi1) + Pi2 N(x, μi2, Σi2) + ... + PiM N(x, μiM, ΣiM)

where N(x, μik, Σik) is a Gaussian density with mean vector μik
and covariance matrix Σik, and Pik is a mixture weighting factor for
the kth Gaussian pdf of the state i. Note that Pik is the prior
probability of the kth mode of the mixture pdf for the state i.
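A sketch of how this mixture Gaussian state observation pdf could be evaluated, assuming diagonal covariance matrices (the parameter layout is an assumption for illustration):

// Sketch of evaluating the mixture Gaussian state observation pdf above,
// f(x | state i) = sum over k of P_ik * N(x; mu_ik, Sigma_ik), assuming diagonal
// covariance matrices. The parameter layout is an illustrative assumption.
public class MixtureGaussianPdf {

    // weights[k], means[k][d], vars[k][d]: mixture weight, mean vector and
    // diagonal covariance of the k-th Gaussian component of one state.
    public static double density(double[] x, double[] weights,
                                 double[][] means, double[][] vars) {
        double f = 0.0;
        for (int k = 0; k < weights.length; k++) {
            double logN = 0.0;
            for (int d = 0; d < x.length; d++) {
                double diff = x[d] - means[k][d];
                logN += -0.5 * Math.log(2.0 * Math.PI * vars[k][d])
                        - 0.5 * diff * diff / vars[k][d];
            }
            f += weights[k] * Math.exp(logN); // P_ik * N(x; mu_ik, Sigma_ik)
        }
        return f;
    }
}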
State Transition Probabilities (1)
⚫ The first-order Markovian property of an HMM entails that the
transition probability to any state s(t) at time t depends only on the
state of the process at time t−1, s(t−1), and is independent of the
previous states of the HMM. It is usually expressed as

aij = P(s(t) = j | s(t−1) = i)

where s(t) denotes the state of the HMM at time t.

⚫ The transition probabilities provide a probabilistic mechanism for
connecting the states of an HMM, and for modelling the variations
in the duration of the signals associated with each state.
⚫ The probability of occupancy of a state i for d consecutive time
units, Pi(d), can be expressed in terms of the state self-loop
transition probability aii as

Pi(d) = aii^(d−1) (1 − aii)
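For example (illustrative numbers, not from the source): with a self-loop probability aii = 0.8, the probability of occupying state i for exactly d = 3 consecutive time units is Pi(3) = 0.8^(3−1) × (1 − 0.8) = 0.128.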
State Transition Probabilities (2)
⚫ The mean occupancy duration for each state of an HMM can be
expressed as

Mean duration of state i = Σd d Pi(d) = 1/(1 − aii)

⚫ State trellis diagram:

⚫ Each state sequence has a prior probability that can be obtained
by multiplication of the state transition probabilities along the
sequence.

⚫ An N-state HMM can reproduce N^T different realizations of the
random process of length T.
Three Problems
1. Model evaluation problem
⚫ What is the probability of the observation?
⚫ Given an observed sequence and an HMM, how probable is
that sequence?
⚫ Forward algorithm
2. Path decoding problem
⚫ What is the best state sequence for the observation?
⚫ Given an observed sequence and an HMM, what is the most
likely state sequence that generated it?
⚫ Viterbi algorithm
3. Model training problem
⚫ How to estimate the model parameters?
⚫ Given an observation, can we learn an HMM for it?
⚫ Baum-Welch re-estimation algorithm
A hidden Markov model for relating numbers of ice creams
eaten (the observations) to the weather (H or C, the
hidden variables)
Likelihood Computation: The Forward Algorithm
Computing Likelihood: Given an HMM λ = (A, B) and an observation
sequence O, determine the likelihood P(O|λ).

Computation of the joint probability of our ice-cream observation 3 1 3
from one possible hidden state sequence hot hot cold.

Computation of the forward probability for our observation 3 1 3 and one
possible hidden state sequence hot hot cold.

Total probability of the observations: the sum over all possible
hidden state sequences.

An HMM with N hidden states and an observation sequence of length T has N^T possible
hidden state sequences; when N and T are both large, computing the total probability by
enumerating them is prohibitively expensive (an exponential algorithm).

Forward algorithm: an O(N^2 T) dynamic programming algorithm, that is, an
algorithm that uses a table to store intermediate values as it builds up the
probability of the observation sequence.

The forward algorithm computes the observation probability by summing over
the probabilities of all possible hidden state paths that could generate the
observation sequence, but it does so efficiently by implicitly folding each of
these paths into a single forward trellis.
Forward Algorithm
Computation of a single element αt(i) in the trellis: sum all the previous
values αt−1(j), weighted by their transition probabilities aji, and multiply by the
observation probability bi(ot):

αt(i) = Σj αt−1(j) aji bi(ot)
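A minimal sketch of the forward algorithm for a discrete-observation HMM, written against the hypothetical HMM container sketched earlier (fields pi, a, b); the method name and layout are assumptions for illustration:

// Minimal sketch of the forward algorithm for a discrete-observation HMM.
// "hmm" is the hypothetical parameter container sketched earlier (fields pi, a, b);
// "obs" is the observation sequence as symbol indices o_1 ... o_T.
public class ForwardAlgorithm {

    // Returns P(O | model), summing over all hidden state paths in O(N^2 T) time.
    public static double likelihood(HMM hmm, int[] obs) {
        int N = hmm.N, T = obs.length;
        double[][] alpha = new double[T][N];
        // Initialization: alpha_1(i) = pi_i * b_i(o_1)
        for (int i = 0; i < N; i++) {
            alpha[0][i] = hmm.pi[i] * hmm.b[i][obs[0]];
        }
        // Recursion: alpha_t(i) = sum_j alpha_{t-1}(j) * a_ji * b_i(o_t)
        for (int t = 1; t < T; t++) {
            for (int i = 0; i < N; i++) {
                double sum = 0.0;
                for (int j = 0; j < N; j++) {
                    sum += alpha[t - 1][j] * hmm.a[j][i];
                }
                alpha[t][i] = sum * hmm.b[i][obs[t]];
            }
        }
        // Termination: P(O | model) = sum_i alpha_T(i)
        double p = 0.0;
        for (int i = 0; i < N; i++) p += alpha[T - 1][i];
        return p;
    }
}

For instance, with the uniform 2-state, 4-symbol initialization shown at the end of these notes, likelihood(hmm, obs) returns 0.25^T for any observation sequence of length T.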
Decoding
Given as input an HMM λ = (A, B) and a sequence of observations
O = o1, o2, ..., oT, find the most probable sequence of states Q = q1 q2 q3 ... qT.
The Viterbi trellis for computing the best path through the hidden state
space for the ice-cream eating events 3 1 3. Hidden states are in circles,
observations in squares. White (unfilled) circles indicate illegal transitions.
Viterbi backtrace
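A matching sketch of the Viterbi algorithm for the same hypothetical HMM container (again an illustrative layout, not the notation of any particular library):

// Minimal sketch of the Viterbi algorithm: returns the most probable hidden
// state sequence for the observation sequence "obs", using the hypothetical
// HMM container sketched earlier (fields pi, a, b).
public class ViterbiAlgorithm {

    public static int[] bestPath(HMM hmm, int[] obs) {
        int N = hmm.N, T = obs.length;
        double[][] delta = new double[T][N]; // best path probability ending in state i at time t
        int[][] psi = new int[T][N];         // backpointer to the best predecessor state
        for (int i = 0; i < N; i++) {
            delta[0][i] = hmm.pi[i] * hmm.b[i][obs[0]];
        }
        for (int t = 1; t < T; t++) {
            for (int i = 0; i < N; i++) {
                double best = -1.0;
                int arg = 0;
                for (int j = 0; j < N; j++) {
                    double v = delta[t - 1][j] * hmm.a[j][i];
                    if (v > best) { best = v; arg = j; }
                }
                delta[t][i] = best * hmm.b[i][obs[t]];
                psi[t][i] = arg;
            }
        }
        // Viterbi backtrace from the most probable final state
        int[] path = new int[T];
        for (int i = 1; i < N; i++) {
            if (delta[T - 1][i] > delta[T - 1][path[T - 1]]) path[T - 1] = i;
        }
        for (int t = T - 1; t > 0; t--) {
            path[t - 1] = psi[t][path[t]];
        }
        return path;
    }
}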
Training of the HMM (1)
⚫ The first step in training the parameters of an HMM is to collect a
training database of a sufficiently large number of different
examples of the random process to be modeled.
⚫ Assume that the examples in a training database consist of L vector-
valued sequences [X]=[Xk; k=0, ..., L–1], with each sequence
Xk=[x(t); t=0, ..., Tk–1] having a variable number of Tk vectors.
⚫ The objective is to train the parameters of an HMM to model the
statistics of the signals in the training data set. In a probabilistic
sense, the fitness of a model is measured by the posterior probability
PM|X(M|X) of the model M given the training data X. The training
process aims to maximise the posterior probability of the model M
and the training data [X], expressed using Bayes’ rule.
Training of the HMM (2)
⚫ The likelihood of an observation vector sequence X given a model
M can be expressed as

fX|M(X|M) = Σs PS|M(s|M) fX|S,M(X|s,M)

⚫ where fX|S,M(X|s,M), the pdf of the signal sequence X
along the state sequence s = [s(0), s(1), ..., s(T−1)] of the model
M, and the Markovian state sequence probability PS|M(s|M) are given by

fX|S,M(X|s,M) = fX|S(x(0)|s(0)) fX|S(x(1)|s(1)) ⋯ fX|S(x(T−1)|s(T−1))

PS|M(s|M) = πs(0) as(0)s(1) as(1)s(2) ⋯ as(T−2)s(T−1)

⚫ Substituting these gives the likelihood of an observation vector sequence X given a model
M:

fX|M(X|M) = Σs πs(0) fX|S(x(0)|s(0)) as(0)s(1) fX|S(x(1)|s(1)) ⋯ as(T−2)s(T−1) fX|S(x(T−1)|s(T−1))
• Learning problem. Given some training observation sequences O = o1 o2 ... oK
and the general structure of the HMM (numbers of hidden and visible states),
determine the HMM parameters M = (A, B, π) that best fit the training data,
that is, maximize P(O | M).

• There is no algorithm producing globally optimal parameter values.

• Use the iterative expectation-maximization algorithm to find a local maximum of
P(O | M): the Baum-Welch algorithm.

• If the training data include information about the sequence of hidden states (as in the word
recognition example), then use maximum likelihood estimation of the parameters:

aij = P(sj | si) = (Number of transitions from state si to state sj) / (Number of transitions out of state si)

bi(vm) = P(vm | si) = (Number of times observation vm occurs in state si) / (Number of times in state si)
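A counting-based sketch of this maximum likelihood estimation, for a single training sequence in which the hidden state sequence is known (the method signature and the reuse of the HMM container above are assumptions; states that never occur would need smoothing, which is ignored here):

// Sketch of maximum likelihood estimation by counting, for the case above where
// the training data include the hidden state sequence. "states" and "obs" are
// aligned sequences of state indices and symbol indices (illustrative assumption).
public class CountingEstimation {

    public static HMM estimate(int N, int M, int[] states, int[] obs) {
        HMM hmm = new HMM(N, M);
        double[] outOf = new double[N];   // number of transitions out of each state
        double[] timesIn = new double[N]; // number of times in each state
        for (int t = 0; t < states.length; t++) {
            timesIn[states[t]] += 1.0;
            hmm.b[states[t]][obs[t]] += 1.0;            // count emissions
            if (t + 1 < states.length) {
                hmm.a[states[t]][states[t + 1]] += 1.0; // count transitions
                outOf[states[t]] += 1.0;
            }
        }
        hmm.pi[states[0]] = 1.0; // single training sequence: initial state frequency
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) hmm.a[i][j] /= outOf[i];
            for (int m = 0; m < M; m++) hmm.b[i][m] /= timesIn[i];
        }
        return hmm;
    }
}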
Baum-Welch algorithm
General idea:

aij = P(sj | si) = (Expected number of transitions from state si to state sj) / (Expected number of transitions out of state si)

bi(vm) = P(vm | si) = (Expected number of times observation vm occurs in state si) / (Expected number of times in state si)

πi = P(si) = Expected frequency in state si at time k=1.
Baum-Welch algorithm: expectation step (1)
• Define the variable ξk(i,j) as the probability of being in state si at time k and in
state sj at time k+1, given the observation sequence o1 o2 ... oK:

ξk(i,j) = P(qk = si, qk+1 = sj | o1 o2 ... oK)

ξk(i,j) = P(qk = si, qk+1 = sj, o1 o2 ... oK) / P(o1 o2 ... oK)

= P(qk = si, o1 o2 ... ok) aij bj(ok+1) P(ok+2 ... oK | qk+1 = sj) / P(o1 o2 ... oK)

= αk(i) aij bj(ok+1) βk+1(j) / Σi Σj αk(i) aij bj(ok+1) βk+1(j)
Baum-Welch algorithm: expectation step (2)
• Define the variable γk(i) as the probability of being in state si at time k, given
the observation sequence o1 o2 ... oK:

γk(i) = P(qk = si | o1 o2 ... oK)

γk(i) = P(qk = si, o1 o2 ... oK) / P(o1 o2 ... oK) = αk(i) βk(i) / Σi αk(i) βk(i)
Baum-Welch algorithm: expectation step (3)
• We calculated ξk(i,j) = P(qk = si, qk+1 = sj | o1 o2 ... oK)
and γk(i) = P(qk = si | o1 o2 ... oK).

• Expected number of transitions from state si to state sj = Σk ξk(i,j)

• Expected number of transitions out of state si = Σk γk(i)

• Expected number of times observation vm occurs in state si = Σk γk(i), where k is such that ok = vm

• Expected frequency in state si at time k=1: γ1(i).
Baum-Welch algorithm: maximization step

aij = (Expected number of transitions from state si to state sj) / (Expected number of transitions out of state si) = Σk ξk(i,j) / Σk γk(i)

bi(vm) = (Expected number of times observation vm occurs in state si) / (Expected number of times in state si) = Σ{k: ok = vm} γk(i) / Σk γk(i)

πi = (Expected frequency in state si at time k=1) = γ1(i).
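A sketch of one complete Baum-Welch re-estimation iteration built from these formulas, for a single observation sequence and without the scaling normally used to avoid numerical underflow (both simplifications are assumptions for illustration); it reuses the hypothetical HMM container introduced earlier:

// Sketch of one Baum-Welch re-estimation iteration for the hypothetical HMM
// container used above (single observation sequence, no underflow scaling).
public class BaumWelchStep {

    public static void reestimate(HMM hmm, int[] obs) {
        int N = hmm.N, M = hmm.M, T = obs.length;

        // Forward probabilities alpha_k(i) and backward probabilities beta_k(i)
        double[][] alpha = new double[T][N];
        double[][] beta = new double[T][N];
        for (int i = 0; i < N; i++) {
            alpha[0][i] = hmm.pi[i] * hmm.b[i][obs[0]];
            beta[T - 1][i] = 1.0;
        }
        for (int t = 1; t < T; t++)
            for (int i = 0; i < N; i++) {
                double s = 0.0;
                for (int j = 0; j < N; j++) s += alpha[t - 1][j] * hmm.a[j][i];
                alpha[t][i] = s * hmm.b[i][obs[t]];
            }
        for (int t = T - 2; t >= 0; t--)
            for (int i = 0; i < N; i++) {
                double s = 0.0;
                for (int j = 0; j < N; j++)
                    s += hmm.a[i][j] * hmm.b[j][obs[t + 1]] * beta[t + 1][j];
                beta[t][i] = s;
            }
        double pObs = 0.0;
        for (int i = 0; i < N; i++) pObs += alpha[T - 1][i];

        // E-step: gamma_k(i) and xi_k(i,j)
        double[][] gamma = new double[T][N];
        double[][][] xi = new double[T - 1][N][N];
        for (int t = 0; t < T; t++)
            for (int i = 0; i < N; i++)
                gamma[t][i] = alpha[t][i] * beta[t][i] / pObs;
        for (int t = 0; t < T - 1; t++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    xi[t][i][j] = alpha[t][i] * hmm.a[i][j]
                            * hmm.b[j][obs[t + 1]] * beta[t + 1][j] / pObs;

        // M-step: re-estimate pi, a and b from the expected counts
        for (int i = 0; i < N; i++) {
            hmm.pi[i] = gamma[0][i];
            double occupancy = 0.0;                     // expected times in state i (t < T-1)
            for (int t = 0; t < T - 1; t++) occupancy += gamma[t][i];
            for (int j = 0; j < N; j++) {
                double trans = 0.0;                     // expected transitions i -> j
                for (int t = 0; t < T - 1; t++) trans += xi[t][i][j];
                hmm.a[i][j] = trans / occupancy;
            }
            double total = occupancy + gamma[T - 1][i]; // expected times in state i (all t)
            for (int m = 0; m < M; m++) {
                double count = 0.0;                     // expected times symbol m emitted in state i
                for (int t = 0; t < T; t++) if (obs[t] == m) count += gamma[t][i];
                hmm.b[i][m] = count / total;
            }
        }
    }
}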
HMM Example: Weather Forecast

Start probability:
{Rainy=0.26, Cloudy=0.24, Sunny=0.5}

Transition probability:
{Rainy={Rainy=0.4, Cloudy=0.2, Sunny=0.4},
Cloudy={Rainy=0.5, Cloudy=0.4, Sunny=0.1},
Sunny={Rainy=0.1, Cloudy=0.3, Sunny=0.6}}

Hidden states: Sunny, Cloudy, Rainy

Emission probability:
{Shop={Rainy=0.0, Cloudy=0.5, Sunny=0.5},
Walk={Rainy=0.1, Cloudy=0.4, Sunny=0.5},
Clean={Rainy=0.55, Cloudy=0.35, Sunny=0.1}}

Observed outputs: Clean, Walk, Shop
HMM Learning Baum-Welch Algorithm

// Initialize an HMM with 2 hidden states and 4 observation symbols,
// using uniform parameters, before Baum-Welch re-estimation.
HMM hmm = new HMM(2, 4);

// Initial state probabilities
hmm.pi[0] = 0.5;
hmm.pi[1] = 0.5;

// State transition probabilities
hmm.a[0][0] = 0.5;
hmm.a[0][1] = 0.5;
hmm.a[1][0] = 0.5;
hmm.a[1][1] = 0.5;

// Observation (emission) probabilities
hmm.b[0][0] = 0.25;
hmm.b[0][1] = 0.25;
hmm.b[0][2] = 0.25;
hmm.b[0][3] = 0.25;

hmm.b[1][0] = 0.25;
hmm.b[1][1] = 0.25;
hmm.b[1][2] = 0.25;
hmm.b[1][3] = 0.25;
