
Hierarchical Dirichlet Process and
Infinite Hidden Markov Model

Paper by Y. W. Teh, M. I. Jordan, M. J. Beal & D. M. Blei, NIPS 2004

Duke University Machine Learning Group
Presented by Kai Ni
February 17, 2006
Outline
• Motivation

• Dirichlet Processes (DP)

• Hierarchical Dirichlet Processes (HDP)

• Infinite Hidden Markov Model (iHMM)

• Results & Conclusions


Motivation

• Problem – “multi-task learning” in which the “tasks” are clustering problems.

• Goal – Share clusters among multiple, related clustering problems. The number of clusters is open-ended and inferred automatically by the model.

• Applications
– Genome pattern analysis
– Information retrieval over document corpora
Hierarchical Model

• A single clustering problem can be analyzed as a Dirichlet process (DP):
  G ~ DP(α₀, G₀),  G = Σ_{k=1}^∞ β_k δ_{φ_k}
– Draws G from a DP are discrete, so values drawn from G are generally not distinct.

• For J groups, we consider a group-specific DP Gⱼ for j = 1,…,J:
  Gⱼ ~ DP(α₀ⱼ, G₀ⱼ),  Gⱼ = Σ_{k=1}^∞ β_{jk} δ_{φ_{jk}}

• To share information, we link the group-specific DPs: Gⱼ ~ DP(α₀, G₀(τ)).
– If G₀(τ) is continuous, the draws Gⱼ have no atoms in common with probability one.
– HDP solution: G₀ is itself a draw from a DP(γ, H).
Dirichlet Process &
Hierarchical Dirichlet Process
• Three different perspectives
– Stick-breaking
– Chinese restaurant
– Infinite mixture models

• Setup
– DP:  G ~ DP(α₀, G₀)
– HDP: G₀ | γ, H ~ DP(γ, H);  Gⱼ | α₀, G₀ ~ DP(α₀, G₀)

• Properties of DP



Stick-breaking View
• A mathematically explicit form of the DP. Draws from a DP are discrete.

• In DP:
  G = Σ_{k=1}^∞ β_k δ_{φ_k}  with  β ~ Stick(α₀),  φ_k ~ G₀

• In HDP:
  Gⱼ = Σ_{k=1}^∞ π_{jk} δ_{θ_k},   G₀ = Σ_{k=1}^∞ β_k δ_{θ_k}
  π_j ~ DP(α₀, β),   β ~ Stick(γ),   θ_k ~ H
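The stick-breaking construction above can be sketched numerically. This is a minimal illustration, not the authors' code: the truncation level K, the Gaussian base measure for the atoms, and the finite Dirichlet surrogate for π_j ~ DP(α₀, β) are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, K):
    # Truncated stick-breaking: v_k ~ Beta(1, alpha),
    # beta_k = v_k * prod_{l<k} (1 - v_l)
    v = rng.beta(1.0, alpha, size=K)
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

K = 50                                   # truncation level (approximation choice)
beta = stick_breaking(alpha=1.0, K=K)    # DP weights; sum -> 1 as K grows
phi = rng.normal(0.0, 1.0, size=K)       # atom locations phi_k ~ G0 (here N(0, 1))

# HDP: every group reuses the SAME atoms; only the weights differ per group.
alpha0 = 5.0
pi_j = rng.dirichlet(alpha0 * beta / beta.sum() + 1e-9)  # surrogate for DP(alpha0, beta)
```

Because all groups draw their weights around the same global β, they place mass on the same atoms, which is exactly the sharing mechanism the HDP needs.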
DP – Chinese Restaurant Process

• Exhibits the clustering property.

• Let φ₁, …, φ_{i-1} be i.i.d. random variables distributed according to G; let θ₁, …, θ_K be the distinct values taken on by φ₁, …, φ_{i-1}, and n_k the number of φ_{i'} = θ_k for 0 < i' < i. Integrating out G gives the predictive distribution
  φ_i | φ₁, …, φ_{i-1} ~ Σ_{k=1}^K (n_k / (i-1+α₀)) δ_{θ_k} + (α₀ / (i-1+α₀)) G₀
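The restaurant metaphor can be simulated directly: each customer joins a table in proportion to its occupancy, or opens a new table with probability proportional to α₀. A minimal sketch (the function name and parameter values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def crp_assignments(n, alpha):
    """Seat n customers sequentially by the Chinese restaurant process."""
    tables = []                  # tables[k] = number of customers at table k
    labels = []                  # table index chosen by each customer
    for i in range(n):
        total = i + alpha        # i customers already seated, plus alpha
        probs = [c / total for c in tables] + [alpha / total]
        k = int(rng.choice(len(tables) + 1, p=probs))
        if k == len(tables):
            tables.append(1)     # new table: a fresh draw theta_k ~ G0
        else:
            tables[k] += 1       # join an existing table (cluster)
        labels.append(k)
    return labels, tables

labels, tables = crp_assignments(n=100, alpha=2.0)
```

Rich tables get richer, which is the clustering property the slide refers to: the number of occupied tables grows only logarithmically in n.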
HDP – Chinese Restaurant Franchise
• First level: within each group, a DP mixture
  Gⱼ ~ DP(α₀, G₀),  φ_{ji} | Gⱼ ~ Gⱼ,  x_{ji} | φ_{ji} ~ F(φ_{ji})
– Let φ_{j1}, …, φ_{j(i-1)} be i.i.d. random variables distributed according to Gⱼ; let ψ_{j1}, …, ψ_{jTⱼ} be the values taken on by φ_{j1}, …, φ_{j(i-1)}, and n_{jt} the number of φ_{ji'} = ψ_{jt} for 0 < i' < i.

• Second level: across groups, sharing clusters
– The base measure of each group is a draw from a DP:
  ψ_{jt} | G₀ ~ G₀,  G₀ ~ DP(γ, H)
– Let θ₁, …, θ_K be the values taken on by the ψ_{jt}, and m_k the number of ψ_{jt} = θ_k over all j, t.
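The two-level scheme can be sketched as nested restaurant processes: customers choose tables within their group's restaurant, and each new table orders a dish from a franchise-wide menu that is itself a CRP. A hedged sketch (function name and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def crf(group_sizes, alpha0, gamma):
    """Chinese restaurant franchise: tables per group, dishes shared across groups."""
    dish_counts = []                       # m_k: tables (all groups) serving dish k
    labels = []                            # dish label of every customer, per group
    for n_j in group_sizes:
        table_counts = []                  # n_jt for this group's restaurant
        dish_of = []                       # dish served at each table of this group
        grp = []
        for i in range(n_j):
            total = i + alpha0
            probs = [c / total for c in table_counts] + [alpha0 / total]
            t = int(rng.choice(len(table_counts) + 1, p=probs))
            if t == len(table_counts):     # new table: pick its dish from the
                table_counts.append(1)     # top-level CRP over dishes
                m_total = sum(dish_counts) + gamma
                dprobs = [m / m_total for m in dish_counts] + [gamma / m_total]
                k = int(rng.choice(len(dish_counts) + 1, p=dprobs))
                if k == len(dish_counts):
                    dish_counts.append(1)  # brand-new dish theta_k ~ H
                else:
                    dish_counts[k] += 1
                dish_of.append(k)
            else:
                table_counts[t] += 1
            grp.append(dish_of[t])
        labels.append(grp)
    return labels, dish_counts

labels, dish_counts = crf(group_sizes=[50, 50, 50], alpha0=2.0, gamma=1.0)
```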
HDP – CRF graph
• The values of φ are shared between groups, as well as within groups. This is a key property of the HDP.

[Figure: CRF graphical model, obtained by integrating out G₀]
DP Mixture Model
• One of the most important applications of the DP: a nonparametric prior distribution on the components of a mixture model.

• G can be viewed as an infinite mixture model:
  G ~ DP(α₀, G₀),  G = Σ_{k=1}^∞ β_k δ_{φ_k}
  φᵢ | G ~ G
  xᵢ | φᵢ ~ F(φᵢ)
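The generative process above can be sampled forward. A minimal sketch under assumed choices: truncation to K components, base measure G₀ = N(0, 25), and a Gaussian likelihood F(φ) = N(φ, 1).

```python
import numpy as np

rng = np.random.default_rng(3)

# Truncated stick-breaking weights for G (K is an assumed truncation level)
K, alpha0 = 30, 1.0
v = rng.beta(1.0, alpha0, size=K)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()                       # renormalize the truncated weights

phi = rng.normal(0.0, 5.0, size=K)       # component parameters phi_k ~ G0

# phi_i | G ~ G : pick an atom of G;  x_i | phi_i ~ F(phi_i), here N(phi_i, 1)
n = 200
z = rng.choice(K, size=n, p=beta)        # index of the atom each phi_i equals
x = rng.normal(phi[z], 1.0)              # observed data
```

Only a few atoms carry most of the weight, so the n observations cluster around a small, data-determined number of component means.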
HDP mixture model

• The HDP can be used as the prior distribution over the factors for nested group data.

• We consider a two-level DP: G₀ links the child DPs Gⱼ and forces them to share components. The Gⱼ are conditionally independent given G₀.
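The forced sharing can be made concrete with a truncated sketch: all groups reuse one global set of atoms θ and differ only in their weights. The truncation level, Gaussian choices for H and F, and group count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def stick(a, K):
    # Truncated Stick(a) weights, renormalized
    v = rng.beta(1.0, a, size=K)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return w / w.sum()

K, gamma, alpha0, J = 20, 1.0, 3.0, 4
beta = stick(gamma, K)                    # global weights (from G0)
theta = rng.normal(0.0, 5.0, size=K)      # shared atoms theta_k ~ H

# Group weights pi_j ~ DP(alpha0, beta): same atoms, reweighted per group
pi = np.array([rng.dirichlet(alpha0 * beta + 1e-6) for _ in range(J)])

# Every group draws from the SAME theta -- components are shared across groups,
# while the groups remain conditionally independent given (beta, theta).
data = [rng.normal(theta[rng.choice(K, size=50, p=pi[j])], 1.0) for j in range(J)]
```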
Infinite Hidden Markov Model

• The number of hidden states is allowed to be countably infinite.

• The transition probabilities in the i-th row of the transition matrix A can be interpreted as mixing proportions:
  πᵢ = (a_{i1}, a_{i2}, …, a_{ik}, …)

• Thus each row of A in an HMM is a DP. These DPs must be linked, because they should share the same set of “next states”. The HDP provides the natural framework for the infinite HMM.
iHMM via HDP
• Assign observations to groups, where the groups are indexed by the value of the previous state variable in the sequence. The current state and emission distribution then define a group-specific mixture model.

• Multiple iHMMs can be linked by adding an additional level of Bayesian hierarchy: a master DP couples the iHMMs, each of which is itself a set of DPs.
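The "each transition row is a DP tied through a shared β" idea can be sketched with a truncated construction. This is an illustrative sketch, not the paper's sampler: the truncation level K and the Gaussian emission model are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

K, gamma, alpha0 = 15, 2.0, 4.0           # truncation level and concentrations (assumed)

# Shared "next state" weights beta ~ Stick(gamma), truncated and renormalized
v = rng.beta(1.0, gamma, size=K)
beta = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
beta /= beta.sum()

# Each row of the transition matrix is a DP draw centered on the SAME beta,
# so every state agrees on the menu of possible next states.
A = np.array([rng.dirichlet(alpha0 * beta + 1e-6) for _ in range(K)])
mu = rng.normal(0.0, 5.0, size=K)         # per-state emission means (assumed Gaussian B)

# Forward-simulate a state/observation sequence
T, s = 100, 0
states, obs = [], []
for _ in range(T):
    s = int(rng.choice(K, p=A[s]))        # s_i | s_{i-1} ~ A(s_{i-1}, :)
    states.append(s)
    obs.append(rng.normal(mu[s], 1.0))    # y_i | s_i ~ B(s_i, :)
```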
HDP & iHMM

                HDP (CRF aspect)             iHMM
Group           Restaurant j (fixed)         Indexed by s_{i-1} (random)
Data            Customer x_{ji}              y_i
Hidden factor   Table ψ_{jt} = θ_k, k=1,…    s_i = k, k=1,…
                Dish θ_k ~ H                 B(s_i, :)
DP weights      Popularity π_{jk}, k=1,…     A(s_{i-1}, :)
Likelihood      F(x_{ji} | φ_{ji})           B(s_i, y_i)
Non-trivialities in iHMM

• The HDP assumes a fixed partition of the data into groups, while the HMM is for time-series data and the definition of groups is itself random.

• In the CRF aspect of the HDP, the number of restaurants is infinite. Also, in the sampling scheme, changing s_t may affect the assignment of all subsequent data.

• The CRF is natural for describing the iHMM, but it is awkward for sampling; we need sampling algorithms derived from the other representations of the iHMM.
HDP Results
iHMM Results
Conclusion

• The HDP is a hierarchical, nonparametric model for clustering problems involving multiple groups of data.

• The mixture components are shared across groups, and the appropriate number of components is determined automatically by the HDP.

• The HDP can be extended to the infinite HMM, providing an effective inference algorithm.
References

• Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei, “Sharing Clusters among Related Groups: Hierarchical Dirichlet Processes”, NIPS 2004.

• M. J. Beal, Z. Ghahramani and C. E. Rasmussen, “The Infinite Hidden Markov Model”, NIPS 2002.

• Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei, “Hierarchical Dirichlet Processes”, revised version to appear in JASA, 2006.
