
Clustering applied to hidden Markov models

Ibrahim Kaddouri
Supervised by: Elisabeth Gassiat and Zacharie Naulet

Université Paris Saclay, Laboratoire de mathématiques d’Orsay

September 8, 2022


Outline

1 Clustering and Hidden Markov Models

2 Inference in HMMs

3 Reconstructing the hidden states

4 Efficiency of the plug-in empirical Bayes classifier

5 Asymptotic Bayes risk

6 Bounds on the asymptotic Bayes risk



Clustering and Hidden Markov Models

Clustering
Clustering is an ill-posed problem that aims to uncover interesting structure in the data or to derive a useful grouping of the observations.

[Figure: DBSCAN clustering (left) vs. K-means clustering (right) on the same dataset.]


Mixture models: Probabilistic definition of clusters

Observations Y = (Y_k)_{1≤k≤n} come from J populations.

Given that an observation comes from population j, it has emission distribution F_j.

Define latent variables X = (X_k)_{1≤k≤n} such that, for each k,

    Y_k | X_k = j ∼ F_j

Then Y_k has distribution

    Σ_{j=1}^{J} π_j F_j

where π_j is the probability of coming from population j.

Mixture models are useful to model data coming from heterogeneous populations.
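To make this generative definition concrete, here is a minimal Python sketch (the function name and the sampler interface are ours, not from the talk): the latent label is drawn first, then the observation is drawn from the corresponding component.

import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(pi, component_samplers, n):
    # Draw latent labels X_k with P(X_k = j) = pi[j], then Y_k ~ F_{X_k}.
    # `component_samplers` is a hypothetical list of functions, one per F_j.
    X = rng.choice(len(pi), size=n, p=pi)
    Y = np.array([component_samplers[j]() for j in X])
    return X, Y

# Example with J = 2 Gaussian components:
# X, Y = sample_mixture(np.array([0.3, 0.7]),
#                       [lambda: rng.normal(0.0, 1.0), lambda: rng.normal(3.0, 1.0)], 500)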

Mixture models: Identifiability

Mixture models are not identifiable:

    Σ_{j=1}^{J} π_j F_j = (π_1/2) F_1 + (π_1/2 + π_2) · [(π_1/2) F_1 + π_2 F_2] / (π_1/2 + π_2) + Σ_{j=3}^{J} π_j F_j

The same mixture distribution can thus be written with different components and weights.

Learning the population components is possible only under additional structural assumptions, such as:
Parametric mixtures
Shape restrictions (Gaussian, multinomial, ...)

→ This might lead to poor results in practice.


Hidden Markov Models and why they are useful

Markov process:   X_0 → X_1 → X_2 → ··· → X_{T−1}

Observations:     Y_0   Y_1   Y_2   ···   Y_{T−1}

Figure: A Hidden Markov Model.

The observations (Y_k)_k are conditionally independent given (X_k)_k.
The latent (unobserved) variables (X_k)_k form a Markov chain.
HMMs are identifiable without any additional assumption.
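As a minimal illustration, the following Python sketch (the function name and interface are ours, assuming discrete emissions with F[i, y] = P(Y_k = y | X_k = i)) simulates such an HMM:

import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(nu, Q, F, n):
    # nu: initial distribution, Q: transition matrix, F[i, y] = P(Y=y | X=i).
    r, m = F.shape
    X = np.zeros(n, dtype=int)
    Y = np.zeros(n, dtype=int)
    X[0] = rng.choice(r, p=nu)
    Y[0] = rng.choice(m, p=F[X[0]])
    for k in range(1, n):
        X[k] = rng.choice(r, p=Q[X[k - 1]])  # hidden chain transition
        Y[k] = rng.choice(m, p=F[X[k]])      # emission given the current state
    return X, Y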



Inference in HMMs


Inference in Hidden Markov Models

Purpose: estimate the model parameters and the hidden states associated with the observations.

The HMM parameters are:
The initial distribution ν.
The transition matrix Q.
The emission distributions (F_i)_{1≤i≤J}.


Definition (Smoothing, Filtering, Prediction)


For positive indices k, l and n with l ≥ k, denote by ϕ_{k:l|n} the conditional distribution of X_{k:l} = (X_k, .., X_l) given Y_{0:n} = (Y_0, .., Y_n).
Specific choices of k and l give rise to several cases of interest:
- Smoothing: ϕ_{k|n} for n ≥ k ≥ 0;
- Filtering: ϕ_{n|n} for n ≥ 0;
- Prediction filtering: ϕ_{n+1|n} for n ≥ 0.
Because the use of filtering will be preeminent in what follows, we shall most often abbreviate ϕ_{n|n} to ϕ_n.


Forward filtering (discrete emissions)

Algorithm 1: Forward filtering

Assume X = {0, .., r−1} and that (ν, Q, F) is given.
Initialization: ϕ_{0|−1}(i) = ν(i) for i = 0, .., r−1.
for k ∈ {0, .., n} do
    c_k ← Σ_{i=0}^{r−1} ϕ_{k|k−1}(i) F_i(Y_k)
    for j ∈ {0, .., r−1} do
        ϕ_k(j) ← ϕ_{k|k−1}(j) F_j(Y_k) / c_k
    for j ∈ {0, .., r−1} do
        ϕ_{k+1|k}(j) ← Σ_{i=0}^{r−1} ϕ_k(i) q(i, j)
end
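A minimal NumPy sketch of Algorithm 1 (the interface is ours; the state dimension is vectorized):

import numpy as np

def forward_filter(nu, Q, F, Y):
    # Returns the filtering distributions phi[k] = P(X_k = . | Y_{0:k})
    # and the normalizing constants c[k], following Algorithm 1.
    n, r = len(Y), len(nu)
    phi = np.zeros((n, r))
    c = np.zeros(n)
    pred = nu.copy()                  # prediction filter phi_{k|k-1}
    for k in range(n):
        weighted = pred * F[:, Y[k]]  # phi_{k|k-1}(i) F_i(Y_k)
        c[k] = weighted.sum()
        phi[k] = weighted / c[k]      # filtering update
        pred = phi[k] @ Q             # phi_{k+1|k}(j) = sum_i phi_k(i) q(i, j)
    return phi, c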


Backward smoothing

Algorithm 2: Backward smoothing

Assume X = {0, .., r−1} and that (ν, Q, F) is given.
Using Algorithm 1, compute c_0, .., c_n and ϕ_0, .., ϕ_n.
for j ∈ {0, .., r−1} do
    β_{n|n}(j) ← c_n^{−1}
for k ∈ {n−1, .., 0} do
    for i ∈ {0, .., r−1} do
        β_{k|n}(i) ← c_k^{−1} Σ_{j=0}^{r−1} q(i, j) g_{k+1}(j) β_{k+1|n}(j)
for k ∈ {0, .., n−1} do
    for i ∈ {0, .., r−1} do
        ϕ_{k|n}(i) ← ϕ_k(i) β_{k|n}(i) / Σ_{j=0}^{r−1} ϕ_k(j) β_{k|n}(j)

(Here g_k(j) = F_j(Y_k) denotes the emission likelihood of observation Y_k in state j.)
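A corresponding sketch of Algorithm 2, reusing forward_filter above; it also returns the normalized backward variables, which the EM sketch further below needs:

def backward_smooth(phi, c, Q, F, Y):
    # Backward recursion for beta_{k|n}, then the smoothing distributions
    # phi_{k|n}(i), proportional to phi_k(i) beta_{k|n}(i).
    n, r = phi.shape
    beta = np.zeros((n, r))
    beta[-1] = 1.0 / c[-1]
    for k in range(n - 2, -1, -1):
        # beta_{k|n}(i) = c_k^{-1} sum_j q(i, j) g_{k+1}(j) beta_{k+1|n}(j)
        beta[k] = (Q @ (F[:, Y[k + 1]] * beta[k + 1])) / c[k]
    smooth = phi * beta
    smooth /= smooth.sum(axis=1, keepdims=True)
    return smooth, beta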


EM algorithm

Algorithm 3: EM algorithm

Randomly initialize the matrices Q and F; set nb_iter, the number of iterations.
for t ∈ {0, .., nb_iter − 1} do
    Compute the marginal distributions (ϕ_{k|n}(·; θ′))_{0≤k≤n} and the joint distributions (ϕ_{k−1:k|n}(·; θ′))_{1≤k≤n} under the current parameters θ′, using Algorithms 1 and 2.
    for i, j ∈ {0, .., r−1} do
        q(i, j) ← Σ_{k=1}^{n} ϕ_{k−1:k|n}(i, j; θ′) / Σ_{k=1}^{n} Σ_{l=0}^{r−1} ϕ_{k−1:k|n}(i, l; θ′)
    for i ∈ {0, .., r−1} do
        for j ∈ {0, .., m−1} do
            f(i, j) ← Σ_{0≤k≤n, Y_k=j} ϕ_{k|n}(i; θ′) / Σ_{k=0}^{n} ϕ_{k|n}(i; θ′)
end
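One EM iteration can then be sketched as follows (a sketch reusing the functions above; ν is kept fixed for simplicity, and the pairwise probabilities ϕ_{k−1:k|n} are assembled from the forward and backward variables):

def em_step(nu, Q, F, Y):
    # E-step: smoothing marginals and pairwise probabilities under (Q, F).
    Y = np.asarray(Y)
    phi, c = forward_filter(nu, Q, F, Y)
    smooth, beta = backward_smooth(phi, c, Q, F, Y)
    # xi[k-1, i, j], proportional to phi_{k-1}(i) q(i, j) g_k(j) beta_{k|n}(j)
    xi = phi[:-1, :, None] * Q[None, :, :] * (F[:, Y[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    # M-step: normalized expected transition counts ...
    Q_new = xi.sum(axis=0)
    Q_new /= Q_new.sum(axis=1, keepdims=True)
    # ... and expected emission counts.
    F_new = np.zeros_like(F)
    for j in range(F.shape[1]):
        F_new[:, j] = smooth[Y == j].sum(axis=0)
    F_new /= F_new.sum(axis=1, keepdims=True)
    return Q_new, F_new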


Example

Consider the following HMM with parameters

    Q = (0.22  0.78 ; 0.66  0.34),   F = (0.1  0.9 ; 0.85  0.15),   n = 7000

(rows of each matrix separated by ";").



Figure: The two plots show the estimated parameters at each EM iteration, together with the true values. Left: parameters Q_{0,1} and Q_{1,1}. Right: parameters F_{0,0} and F_{1,0}.



Figure: Log-likelihood growth during the iterations.


Behavior near the i.i.d. frontier


The dependence structure in a HMM is lost whenever the following
quantity approaches 0.
2 !
q−p

1− (1 − p − q) ∥f0 − f1 ∥
q+p

where : !
1−p p
Q= and F = (f0 , f1 ).
q 1−q
When p + q = 1, the Markov chain becomes:

p
1−p 0 1 p

1−p


− It becomes impossible to learn the model parameters.
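A small helper (ours; it assumes ‖·‖ is the L1 distance between the emission probability vectors, which is one natural choice) to evaluate this quantity:

def dependence_strength(p, q, f0, f1):
    # (1 - ((q-p)/(q+p))^2) * (1 - p - q) * ||f0 - f1||, with ||.|| taken
    # as the L1 distance between emission probability vectors (assumption).
    gap = np.abs(np.asarray(f0) - np.asarray(f1)).sum()
    return (1.0 - ((q - p) / (q + p)) ** 2) * (1.0 - p - q) * gap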
Reconstructing the hidden states


First loss function


Let θ = (ν, Q, F).
Consider the loss function

    L_1(x_{1:n}, x′_{1:n}) = 1_{x_{1:n} ≠ x′_{1:n}}

The risk associated with a classifier h is

    R_{n,HMM}(h) = E_θ[L_1(X_{1:n}, h(Y_{1:n}))] = P_θ(h(Y_{1:n}) ≠ X_{1:n})

The associated Bayes risk is

    R^⋆_{n,HMM} = E_θ[ min_{x_{1:n} ∈ X^n} P_θ(X_{1:n} ≠ x_{1:n} | Y_{1:n}) ]


Reconstruction algorithm

Algorithm 4: Viterbi algorithm

Initialization: m_0(i) ← log(ν(i) g_0(i)) for i = 0, .., r−1.
for k ∈ {0, .., n−1} do
    for j ∈ {0, .., r−1} do
        m_{k+1}(j) ← max_{i ∈ {0,..,r−1}} [m_k(i) + log q(i, j)] + log g_{k+1}(j)
    end
end
x_n ← arg max_{j ∈ {0,..,r−1}} m_n(j)
for k ∈ {n−1, .., 0} do
    x_k ← the state i at which the maximum is attained in the update of m_{k+1}(j) above, with j = x_{k+1}
end

(Again g_k(j) = F_j(Y_k).)
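A NumPy sketch of Algorithm 4 (interface ours), working in log space to avoid numerical underflow:

def viterbi(nu, Q, F, Y):
    n, r = len(Y), len(nu)
    m = np.zeros((n, r))                # m[k, j]: best log-probability ending in state j
    back = np.zeros((n, r), dtype=int)  # argmax pointers for backtracking
    m[0] = np.log(nu) + np.log(F[:, Y[0]])
    for k in range(n - 1):
        scores = m[k][:, None] + np.log(Q)  # scores[i, j] = m_k(i) + log q(i, j)
        back[k + 1] = scores.argmax(axis=0)
        m[k + 1] = scores.max(axis=0) + np.log(F[:, Y[k + 1]])
    x = np.zeros(n, dtype=int)
    x[-1] = m[-1].argmax()
    for k in range(n - 2, -1, -1):          # backtracking pass
        x[k] = back[k + 1, x[k + 1]]
    return x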


Second loss function

Consider the loss function

    L_2(x_{1:n}, x′_{1:n}) = (1/n) Σ_{k=1}^{n} 1_{x_k ≠ x′_k}

The risk associated with a classifier h is

    R_{n,HMM}(h) = E_θ[L_2(X_{1:n}, h(Y_{1:n}))] = E_θ[ (1/n) Σ_{i=1}^{n} 1_{[h(Y_{1:n})]_i ≠ X_i} ]

The associated Bayes risk is

    R^⋆_{n,HMM} = (1/n) Σ_{i=1}^{n} E_θ[ min_{x ∈ X} P_θ(X_i ≠ x | Y_{1:n}) ]

Reconstruction algorithm

Algorithm 5: MAP classifier algorithm

Assume X = {0, .., r−1} and that (ν, Q, F) is given.
Using Algorithm 2, compute ϕ_{0|n}, .., ϕ_{n|n}.
for k ∈ {0, .., n} do
    x_k ← arg max_{0≤i≤r−1} ϕ_{k|n}(i)
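Equivalently, in code (a sketch reusing forward_filter and backward_smooth above):

def map_classifier(nu, Q, F, Y):
    # Classify each X_k by the mode of its smoothing distribution phi_{k|n}.
    phi, c = forward_filter(nu, Q, F, Y)
    smooth, _ = backward_smooth(phi, c, Q, F, Y)
    return smooth.argmax(axis=1)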


Example

Consider the following HMM with parameters

    Q = (0.75  0.25 ; 0.3  0.7),   F = (0.1  0.9 ; 0.85  0.15),   n = 1000

Below are the observations and the associated states:

Figure: The observations and the associated hidden states. For the observations, the top vertical lines correspond to observations equal to 1; the lower ones, to observations equal to 0.

Reconstructions of the hidden states

The following are the reconstructions of the hidden states:



Efficiency of the plug-in empirical Bayes classifier


Notations

Bernoulli emission densities: (f_0(y), f_1(y)) = (p_0^y (1−p_0)^{1−y}, p_1^y (1−p_1)^{1−y})

θ = (ν, Q, (f_x)_{x=0,1}): the true parameters
θ̂ = (ν̂, Q̂, (f̂_x)_{x=0,1}): estimators of the true parameters
ϕ^⋆_{i|n}(·, Y_{1:n}): smoothing distribution under the true parameters θ
ϕ̂_{i|n}(·, Y_{1:n}): smoothing distribution under the estimated parameters θ̂
h^⋆(Y_{1:n}) = (1_{ϕ^⋆_{i|n}(1, Y_{1:n}) > 1/2})_{1≤i≤n}: Bayes classifier
ĥ(Y_{1:n}) = (1_{ϕ̂_{i|n}(1, Y_{1:n}) > 1/2})_{1≤i≤n}: plug-in empirical Bayes classifier
ϕ^⋆_{i+1|i}(·, Y_{0:i}): prediction filter under the true θ
ϕ̂_{i+1|i}(·, Y_{0:i}): prediction filter under the estimated θ̂
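In code, the plug-in empirical Bayes classifier simply runs the smoother under the estimated parameters and thresholds at 1/2 (a sketch reusing the functions above):

def plug_in_classifier(nu_hat, Q_hat, F_hat, Y):
    # hat h_i(Y) = 1 if hat phi_{i|n}(1, Y_{1:n}) > 1/2, else 0.
    phi, c = forward_filter(nu_hat, Q_hat, F_hat, Y)
    smooth, _ = backward_smooth(phi, c, Q_hat, F_hat, Y)
    return (smooth[:, 1] > 0.5).astype(int)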


Online vs offline clustering

We study the risk of clustering the observations in two frameworks:

Online: classification may use only past observations. Classifiers are of the form h(Y_{0:n−1}) = (h_i(Y_{0:i−1}))_{1≤i≤n}.
Offline: all observations are used in the classification procedure. Classifiers are of the form h(Y_{1:n}) = (h_i(Y_{1:n}))_{1≤i≤n}.


Control of the smoothing distribution

Theorem (De Castro, Gassiat, Le Corff (2018))


Suppose the initial distribution is the stationary distribution and δ = min_{i,j} Q_{i,j} > 0. Then, for all 1 ≤ i ≤ n and all y_{1:n} ∈ Y^n:

    ‖ϕ^⋆_{i|n}(·, y_{1:n}) − ϕ̂_{i|n}(·, y_{1:n})‖_TV ≤ (4(1−δ)/δ²) [ ρ^{i−1} ‖ν − ν̂‖_2 + (1/(1−ρ) + 1/(1−ρ̂)) ( ‖Q − Q̂‖_F + Σ_{l=1}^{n} ((ρ̂ ∨ ρ)^{|l−i|} / (f_0(y_l) ∨ f_1(y_l))) max_{x=0,1} |f_x(y_l) − f̂_x(y_l)| ) ]

where:
δ̂ = min_{i,j} Q̂_{i,j}
ρ = (1−2δ)/(1−δ) and ρ̂ = (1−2δ̂)/(1−δ̂)
f_x(y) = p_x^y (1−p_x)^{1−y}


Efficiency of the plug-in empirical Bayes classifier

Theorem
Assume in addition that c^⋆ = min(p_0, 1 − p_0, p_1, 1 − p_1) > 0. Then:

    R^{Offline}_{n,HMM}(ĥ) − R^{⋆,Offline}_{n,HMM} ≤ (4(1−δ)/δ²) E_θ[ ‖ν − ν̂‖_2 / (n(1−ρ)) + (1/(1−ρ) + 1/(1−ρ̂)) ( ‖Q − Q̂‖_F + (2/c^⋆) max_{x=0,1} ‖p_x − p̂_x‖_∞ ) ]

    R^{Online}_{n,HMM}(ĥ) − R^{⋆,Online}_{n,HMM} ≤ (8(1−δ)/δ²) E_θ[ ‖ν − ν̂‖_2 / (n(1−ρ)) + (1/(1−ρ) + 1/(1−ρ̂)) ( ‖Q − Q̂‖_F + (1/c^⋆) max_{x=0,1} ‖p_x − p̂_x‖_∞ ) ] + 2 E_θ[‖Q − Q̂‖_F]

where δ = min_{i,j} Q_{i,j} > 0 and ρ = (1−2δ)/(1−δ).
Asymptotic Bayes risk


Exponential forgetting
Proposition
Assume:
- The initial distribution is the stationary distribution;
- The HMM has two hidden states and discrete emissions;
- δ = min_{i,j} Q_{i,j} > 0.

Then, for 0 ≤ j′ ≤ j, k ≥ 0 and n ≥ 0:

    ‖P_θ(X_k ∈ · | Y_{−j:n}) − P_θ(X_k ∈ · | Y_{−j′:n})‖_TV ≤ 2 ρ_0^{(k∧n)+j′} ρ_1^{k−(k∧n)}

Similarly, for 0 ≤ k ≤ j′ ≤ j and n ≥ 0, one has:

    ‖P_θ(X_k ∈ · | Y_{−n:j}) − P_θ(X_k ∈ · | Y_{−n:j′})‖_TV ≤ 2 ρ_0^{j′−k}

where ρ_0 = (1−2δ)/(1−δ) and ρ_1 = 1 − 2δ.

Asymptotic Bayes risk

Consider the following quantities:

    R^{⋆,Offline}_{n,HMM} = (1/n) Σ_{i=1}^{n} E_θ[ min(P_θ(X_i = 1 | Y_{1:n}), P_θ(X_i = 0 | Y_{1:n})) ]

    R^{⋆,Offline}_{∞,HMM} = E_θ[ min(P_θ(X_0 = 1 | Y_{−∞:+∞}), P_θ(X_0 = 0 | Y_{−∞:+∞})) ]

    R^{⋆,Online}_{n,HMM} = (1/n) Σ_{i=1}^{n} E_θ[ min(P_θ(X_i = 1 | Y_{0:i−1}), P_θ(X_i = 0 | Y_{0:i−1})) ]

    R^{⋆,Online}_{∞,HMM} = E_θ[ min(P_θ(X_1 = 1 | Y_{−∞:0}), P_θ(X_1 = 0 | Y_{−∞:0})) ]
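These quantities can be approximated by Monte Carlo. For instance, for the offline case, one can average min_x P_θ(X_i = x | Y_{1:n}) along a simulated trajectory (a sketch reusing sample_hmm and the filtering/smoothing sketches above):

def offline_bayes_risk_mc(nu, Q, F, n):
    # Simulate one trajectory and average min(phi_{i|n}(0), phi_{i|n}(1)).
    _, Y = sample_hmm(nu, Q, F, n)
    phi, c = forward_filter(nu, Q, F, Y)
    smooth, _ = backward_smooth(phi, c, Q, F, Y)
    return smooth.min(axis=1).mean()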


Theorem
Under the same assumptions:

    R^{⋆,Offline}_{n,HMM} − R^{⋆,Offline}_{∞,HMM} ≤ 2 / (n(1−ρ_0))

    R^{⋆,Online}_{n,HMM} − R^{⋆,Online}_{∞,HMM} ≤ (ρ_1/(2n)) · 1/(1−ρ_0)

where ρ_0 = (1−2δ)/(1−δ), ρ_1 = 1 − 2δ and δ = min_{i,j} Q_{i,j} > 0.



Bounds on the asymptotic Bayes risk


Case of online clustering


We introduce the following reparametrization which will be useful in the
statement of the results:
q−p
 
ϕ(θ) = , 1 − p − q, ∥f0 − f1 ∥
q+p

Theorem
- The asymptotic Bayes risk for online clustering verifies:

1 |ϕ1 |
R⋆,Online
∞,HMM ≤ −
2 2
c c2
- Assuming in addition that min (f0 , f1 ) ≥ c and that |ϕ2 | ≤ 12 ∧ 4,
then one has:
1 |ϕ1 | (1 − ϕ21 )ϕ22 ϕ3
R⋆,Online
∞,HMM ≥ − −
2 2 2c