
Probabilistic Machine Learning

Lecture 20
Latent Dirichlet Allocation

Philipp Hennig
29 June 2021

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  content                               Ex
 1  Introduction                           1
 2  Reasoning under Uncertainty
 3  Continuous Variables                   2
 4  Monte Carlo
 5  Markov Chain Monte Carlo               3
 6  Gaussian Distributions
 7  Parametric Regression                  4
 8  Learning Representations
 9  Gaussian Processes                     5
10  Understanding Kernels
11  Gauss-Markov Models                    6
12  An Example for GP Regression
13  GP Classification                      7
14  Generalized Linear Models
15  Exponential Families                   8
16  Graphical Models
17  Factor Graphs                          9
18  The Sum-Product Algorithm
19  Example: Modelling Topics             10
20  Latent Dirichlet Allocation
21  Mixture Models                        11
22  Bayesian Mixture Models
23  Expectation Maximization (EM)         12
24  Variational Inference
25  Example: Collapsed Inference          13
26  Example: EM for Kernel Topic Models
27  Making Decisions
28  Revision

Designing a probabilistic machine learning method:
1. get the data
1.1 try to collect as much meta-data as possible
2. build the model
2.1 identify quantities and data structures; assign names
2.2 design a generative process (graphical model)
2.3 assign (conditional) distributions to factors/arrows (use exponential families!)
3. design the algorithm
3.1 consider conditional independence
3.2 try standard methods for early experiments
3.3 run unit-tests and sanity-checks
3.4 identify bottlenecks, find customized approximations and refinements

A Mixture of Probabilities?
Desiderata

[Figure: the document-word matrix X (D documents × V words) factorized into Π (D documents × K topics) times Θ (K topics × V words): X ∼ Π × Θ]

▶ topics should be probabilities: p(xd | k) = ∏_{v=1}^{V} ∏_{i=1}^{xdv} θk(i),v
▶ but documents should have several topics! Let πdk be the probability to draw a word from topic k
doc sparsity each document d should only contain a small number of topics
word sparsity each topic k should only contain a small number of the words v in the vocabulary
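▶ taken together, the word distribution of document d becomes a convex combination of the topic distributions: p(w = v | d) = ∑_{k=1}^{K} πdk θkv, which is exactly the factorization X ∼ Π × Θ sketched above.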
Each document discusses a few topics
Dirichlet sparsity in the document-topic distribution

[Figure: weights of document d over topics k; left panel: mixture model, right panel: topic model]
Each topic contains a few words
Dirichlet sparsity in the topic-word distribution

[Figure: weights of topic k over words v; left panel: Gaussian base measure, right panel: Dirichlet base measure]

A Discrete Topic Model
almost there

[Graphical model: πd → cdi → wdi ← θk, with plates i = 1, . . . , Id and d = 1, . . . , D, and k = 1, . . . , K]

To draw Id words wdi ∈ [1, . . . , V] of document d ∈ [1, . . . , D]:



▶ Draw topic assignments cdi = [cdi1 , . . . , cdiK ] ∈ {0; 1}^K with ∑_k cdik = 1 for each word wdi from

     p(C | Π) = ∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} πdk^{cdik}      ⇒ C ∈ {0; 1}^{D×Id×K}

▶ Draw word wdi from

     p(wdi = v | cdi , Θ) = ∏_{k=1}^{K} θkv^{cdik}

But we need priors for Π, Θ. And we’d like them to be sparse!


Reminder: The Dirichlet Distribution
a sparsity prior for probability distributions


p(x | π) = ∏_{i=1}^{n} πxi                          xi ∈ {1, . . . , K}

         = ∏_{k=1}^{K} πk^{nk}                      nk := |{xi | xi = k}|

p(π | α) = D(π; α) = Γ(∑_k αk) / ∏_k Γ(αk) · ∏_{k=1}^{K} πk^{αk − 1} = 1/B(α) · ∏_{k=1}^{K} πk^{αk − 1}

p(π | x) = D(π; α + n)

Peter Gustav Lejeune Dirichlet (1805–1859)
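As a quick numerical check of this conjugate update (a small NumPy sketch; the numbers and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
alpha = np.full(K, 0.5)                  # prior concentration parameters α_k
pi_true = np.array([0.7, 0.2, 0.1])      # categorical distribution generating the data

x = rng.choice(K, size=200, p=pi_true)   # n = 200 draws x_i ∈ {0, ..., K-1}
n = np.bincount(x, minlength=K)          # counts n_k = |{x_i : x_i = k}|

alpha_post = alpha + n                   # conjugacy: p(π | x) = D(π; α + n)
print("posterior parameters:", alpha_post)
print("posterior mean:", alpha_post / alpha_post.sum())  # ≈ pi_true for many observations
```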
Dirichlets can encode sparsity
for α ∼ 0.01, 0.1, 0.5 (100 samples each)

[Figure: three scatter plots of 100 Dirichlet samples each on the probability simplex, axes π1, π2, π3]
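The effect in the figure is easy to reproduce (a small NumPy sketch; the printed summary statistic is my own choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

for a in (0.01, 0.1, 0.5):
    samples = rng.dirichlet(np.full(3, a), size=100)   # 100 draws from D(π; α = a·1)
    # average mass of the largest component: close to 1 for small a (samples sit in
    # the corners of the simplex, i.e. are sparse), closer to 1/3 for larger a
    print(a, samples.max(axis=1).mean())
```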

Latent Dirichlet Allocation
our generative model [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) JMLR 3, 993–1022]

[Graphical model: αd → πd → cdi → wdi ← θk ← βk, with plates i = 1, . . . , Id and d = 1, . . . , D, and k = 1, . . . , K]

To draw Id words wdi ∈ [1, . . . , V] of document d ∈ [1, . . . , D]:


▶ Draw K topic distributions θk over V words from p(Θ | β) = ∏_{k=1}^{K} D(θk ; βk)
▶ Draw D document distributions over K topics from p(Π | α) = ∏_{d=1}^{D} D(πd ; αd)
▶ Draw topic assignments cdik of word wdi from p(C | Π) = ∏_{i,d,k} πdk^{cdik}
▶ Draw word wdi from p(wdi = v | cdi , Θ) = ∏_k θkv^{cdik}

Useful notation: ndkv = #{i : wdi = v, cdik = 1}. Write ndk: := [ndk1 , . . . , ndkV ] and ndk· = ∑_v ndkv , etc.
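To make the generative process concrete, here is a small NumPy sketch of ancestral sampling from this model (all sizes, the symmetric hyperparameters, and the variable names are illustrative choices, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)

D, K, V, I_d = 50, 5, 30, 100          # documents, topics, vocabulary size, words per document
alpha = np.full(K, 0.1)                # symmetric stand-in for the α_d
beta = np.full(V, 0.1)                 # symmetric stand-in for the β_k

Theta = rng.dirichlet(beta, size=K)    # θ_k ~ D(β),  shape (K, V)
Pi = rng.dirichlet(alpha, size=D)      # π_d ~ D(α),  shape (D, K)

docs = np.empty((D, I_d), dtype=int)   # word indices w_di
topics = np.empty((D, I_d), dtype=int) # topic of each word, i.e. the k with c_dik = 1
for d in range(D):
    for i in range(I_d):
        k = rng.choice(K, p=Pi[d])              # c_di ~ Categorical(π_d)
        topics[d, i] = k
        docs[d, i] = rng.choice(V, p=Theta[k])  # w_di ~ Categorical(θ_k)
```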

Building an algorithm
A more careful look at the model

[Graphical model of LDA, as above]

p(C, Π, Θ | W) = p(C, Π, Θ, W) / ∫∫∫ p(C, Π, Θ, W) dC dΠ dΘ

               = p(W | C, Π, Θ) · p(C, Π, Θ) / ∫∫∫ p(C, Π, Θ, W) dC dΠ dΘ

Building an algorithm
A more careful look at the model

[Graphical model of LDA, as above]

p(C, Π, Θ, W) = p(Π | α) · p(C | Π) · p(Θ | β) · p(W | C, Θ)


= (∏_{d=1}^{D} p(πd | αd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} p(cdi | πd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} p(wdi | cdi , Θ)) · (∏_{k=1}^{K} p(θk | βk))

= (∏_{d=1}^{D} D(πd ; αd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} πdk^{cdik}) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} θkwdi^{cdik}) · (∏_{k=1}^{K} D(θk ; βk))

= (∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1}) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} πdk^{cdik}) · . . .

Building an algorithm
A more careful look at the model

[Graphical model of LDA, as above]

p(Π | α) · p(C | Π) = (∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1}) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} πdk^{cdik})

Useful notation: ndkv = #{i : wdi = v, cdik = 1}. Write ndk: := [ndk1 , . . . , ndkV ] and ndk· = ∑_v ndkv , etc.

Since ∏_{i=1}^{Id} πdk^{cdik} = πdk^{∑_i cdik} = πdk^{ndk·}, the product over word instances collapses into the counts:

= ∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1 + ndk·}

Building an algorithm
A more careful look at the model

[Graphical model of LDA, as above]

p(C, Π, Θ, W) = p(Π | α) · p(C | Π) · p(Θ | β) · p(W | C, Θ)


= (∏_{d=1}^{D} p(πd | αd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} p(cdi | πd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} p(wdi | cdi , Θ)) · (∏_{k=1}^{K} p(θk | βk))

= (∏_{d=1}^{D} D(πd ; αd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} πdk^{cdik}) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} θkwdi^{cdik}) · (∏_{k=1}^{K} D(θk ; βk))

= (∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1 + ndk·}) · (∏_{k=1}^{K} Γ(∑_v βkv) / ∏_v Γ(βkv) · ∏_{v=1}^{V} θkv^{βkv − 1 + n·kv})

Building an algorithm
A more careful look at the model

[Graphical model of LDA, as above]

p(C, Π, Θ, W) = (∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1 + ndk·}) · (∏_{k=1}^{K} Γ(∑_v βkv) / ∏_v Γ(βkv) · ∏_{v=1}^{V} θkv^{βkv − 1 + n·kv})

▶ If we had Π, Θ (which we don’t), then the posterior p(C | Θ, Π, W) would be easy:


p(C | Θ, Π, W) = p(W, C, Θ, Π) / ∑_C p(W, C, Θ, Π) = ∏_{d=1}^{D} ∏_{i=1}^{Id} [ ∏_{k=1}^{K} (πdk θkwdi)^{cdik} ] / [ ∑_{k′} πdk′ θk′wdi ]

▶ note that this conditional independence can easily be read off from the above graph!
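▶ concretely, since each cdi is one-hot, the assignments decouple across words, and each one is a categorical draw with p(cdik = 1 | Θ, Π, W) = πdk θkwdi / ∑_{k′} πdk′ θk′wdi .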
Let’s look at the structure again
Latent Dirichlet Allocation

[Graphical model of LDA, as above]
p(C, Π, Θ, W) = (∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1 + ndk·}) · (∏_{k=1}^{K} Γ(∑_v βkv) / ∏_v Γ(βkv) · ∏_{v=1}^{V} θkv^{βkv − 1 + n·kv})

▶ If we had C (which we don’t), then the posterior p(Θ, Π | C, W) would be easy:


p(Θ, Π | C, W) = p(C, W, Π, Θ) / ∫ p(Θ, Π, C, W) dΘ dΠ = (∏_d D(πd ; αd) ∏_k πdk^{ndk·}) · (∏_k D(θk ; βk) ∏_v θkv^{n·kv}) / p(C, W)

               = (∏_d D(πd ; αd: + nd:·)) · (∏_k D(θk ; βk: + n·k:))

▶ note that this conditional independence can not easily be read off from the above graph!
The Toolbox

Framework:

∫ p(x1 , x2 ) dx2 = p(x1 )        p(x1 , x2 ) = p(x1 | x2 )p(x2 )        p(x | y) = p(y | x)p(x) / p(y)

Modelling:
▶ graphical models
▶ Gaussian distributions
▶ (deep) learnt representations
▶ Kernels
▶ Markov Chains
▶ Exponential Families / Conjugate Priors
▶ Factor Graphs & Message Passing

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ maximum likelihood / MAP
▶ Laplace approximations
▶

Gibbs sampling:
▶ xt ← xt−1 ; xti ∼ p(xti | xt1 , xt2 , . . . , xt(i−1) , xt(i+1) , . . . )
▶ a special case of Metropolis-Hastings:
▶ q(x′ | xt ) = δ(x′\i − xt,\i )p(x′i | xt,\i )
▶ p(x′ ) = p(x′i | x′\i )p(x′\i ) = p(x′i | xt,\i )p(xt,\i )
▶ acceptance rate: a = p(x′)/p(xt) · q(xt | x′)/q(x′ | xt) = p(x′i | xt,\i)/p(xti | xt,\i) · p(xti | xt,\i)/p(x′i | xt,\i) = 1
Markov Chain Monte Carlo Methods provide a relatively simple way to construct approximate posterior
distributions. They are asymptotically exact. Compared to other approximate inference methods, like
variational inference, they tend to be easier to implement but harder to monitor, and may also be more
computationally expensive.


∫ f(x) p(x) dx ≈ (1/S) ∑_{s=1}^{S} f(xs)        if xs ∼ p

A Gibbs Sampler
Simple Markov Chain Monte Carlo inference

Iterate between (recall ndkv = #{i : wdi = v, cdik = 1})



Θ ∼ p(Θ | C, W) = ∏_k D(θk ; βk: + n·k:)

Π ∼ p(Π | C, W) = ∏_d D(πd ; αd: + nd:·)

C ∼ p(C | Θ, Π, W) = ∏_{d=1}^{D} ∏_{i=1}^{Id} [ ∏_{k=1}^{K} (πdk θkwdi)^{cdik} ] / [ ∑_{k′} πdk′ θk′wdi ]

▶ This is comparatively easy to implement: there are libraries for sampling from Dirichlets, and discrete sampling is trivial. All we have to keep around are the counts n (which are sparse!) and Θ, Π (which are comparatively small). Thanks to factorization, much can also be done in parallel! (A minimal code sketch follows below.)
▶ Unfortunately, this sampling scheme is relatively slow to move away from its initialization, because the assignments C depend strongly on Θ and Π, and vice versa.
▶ Properly vectorizing the code is important for speed.
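A minimal, unvectorized NumPy sketch of one such blocked Gibbs sweep, assuming the corpus is stored as an integer array docs[d, i] of word indices (equal document lengths purely for simplicity; the function name and all variable names are mine, not from the lecture):

```python
import numpy as np

def lda_gibbs_sweep(docs, Pi, Theta, alpha, beta, rng):
    """One blocked Gibbs sweep: resample C | Π, Θ, W, then Π, Θ | C, W."""
    D, I_d = docs.shape
    K, V = Theta.shape

    # C | Π, Θ, W: per word a categorical draw proportional to π_dk · θ_{k,w_di};
    # only the counts n_dk· and n_·kv need to be kept.
    n_dk = np.zeros((D, K))
    n_kv = np.zeros((K, V))
    for d in range(D):
        for i in range(I_d):
            v = docs[d, i]
            p = Pi[d] * Theta[:, v]
            k = rng.choice(K, p=p / p.sum())
            n_dk[d, k] += 1
            n_kv[k, v] += 1

    # Π, Θ | C, W: conjugate Dirichlet resampling with the counts added to α and β
    Pi = np.vstack([rng.dirichlet(alpha + n_dk[d]) for d in range(D)])
    Theta = np.vstack([rng.dirichlet(beta + n_kv[k]) for k in range(K)])
    return Pi, Theta, n_dk, n_kv
```

Repeated sweeps, after a burn-in phase, yield approximate posterior samples of C, Π, Θ; the inner loop over words is what the vectorization remark above refers to.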
A visual example
building a toy model adapted from T. L. Griffiths & M. Steyvers, Finding scientific topics, PNAS 101, 5228–5235

[Figure: ten example documents d = 0, . . . , 9 of the toy corpus, shown as images of their word counts]

▶ K = 10, V = 25, N = 100, D = 1000



▶ p(Π) = ∏_d D(πd ; α = 1)

A visual example
building a toy model adapted from T. L. Griffiths & M. Steyvers, Finding scientific topics, PNAS 101, 5228–5235

[Figure: the K = 10 topics of the toy model, each shown as a distribution over the V = 25 words]

▶ K = 10, V = 25, N = 100, D = 1000



▶ p(Π) = ∏_d D(πd ; α = 1)

It works!
your homework this week

Designing a Probabilistic Machine Learning Model
1. get the data
1.1 try to collect as much meta-data as possible
1.2 take a close look at the data
2. build the model
2.1 identify quantities and data structures; assign names
2.2 design a generative process (graphical model)
2.3 assign (conditional) distributions to factors/arrows (use exponential families!)
3. design the algorithm
3.1 consider conditional independence
3.2 try standard methods for early experiments
3.3 run unit-tests and sanity-checks
3.4 identify bottlenecks, find customized approximations and refinements
4. test the setup
5. revisit the model and try to improve it, using creativity

Probabilistic ML — P. Hennig, SS 2021 — Lecture 20: Latent Dirichlet Allocation — © Philipp Hennig, 2021 CC BY-NC-SA 3.0