
Probabilistic Machine Learning

Lecture 20
Latent Dirichlet Allocation

Philipp Hennig
29 June 2021

Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
 #  content                               Ex
 1  Introduction                           1
 2  Reasoning under Uncertainty
 3  Continuous Variables                   2
 4  Monte Carlo
 5  Markov Chain Monte Carlo               3
 6  Gaussian Distributions
 7  Parametric Regression                  4
 8  Learning Representations
 9  Gaussian Processes                     5
10  Understanding Kernels
11  Gauss-Markov Models                    6
12  An Example for GP Regression
13  GP Classification                      7
14  Generalized Linear Models
15  Exponential Families                   8
16  Graphical Models
17  Factor Graphs                          9
18  The Sum-Product Algorithm
19  Example: Modelling Topics             10
20  Latent Dirichlet Allocation
21  Mixture Models                        11
22  Bayesian Mixture Models
23  Expectation Maximization (EM)         12
24  Variational Inference
25  Example: Collapsed Inference          13
26  Example: EM for Kernel Topic Models
27  Making Decisions
28  Revision

Designing a probabilistic machine learning method:
1. get the data
1.1 try to collect as much meta-data as possible
2. build the model
2.1 identify quantities and data structures; assign names
2.2 design a generative process (graphical model)
2.3 assign (conditional) distributions to factors/arrows (use exponential families!)
3. design the algorithm
3.1 consider conditional independence
3.2 try standard methods for early experiments
3.3 run unit-tests and sanity-checks
3.4 identify bottlenecks, find customized approximations and refinements

A Mixture of Probabilities?
Desiderata

[Figure: the document-word matrix X (D documents × V words) factorized into Π (D documents × K topics) times Θ (K topics × V words): X ∼ Π × Θ]

▶ topics should be probabilities: p(xd | k) = ∏_{v=1}^{V} ∏_{i=1}^{xdv} θk(i),v
▶ but documents should have several topics! Let πdk be the probability to draw a word from topic k
doc sparsity each document d should only contain a small number of topics
word sparsity each topic k should only contain a small number of the words v in the vocabulary
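▶ taken together, the word distribution of document d becomes a convex combination of the topic distributions: p(w = v | d) = ∑_{k=1}^{K} πdk θkv, which is exactly the factorization X ∼ Π × Θ sketched above.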
Each document discusses a few topics
Dirichlet sparsity in the document-topic distribution

[Figure: weights of document d over topics k; left panel: mixture model, right panel: topic model]
Each topic contains a few words
Dirichlet sparsity in the topic-word distribution

[Figure: weights of topic k over words v; left panel: Gaussian base measure, right panel: Dirichlet base measure]

A Discrete Topic Model
almost there

[Graphical model: πd → cdi → wdi ← θk, with plates i = 1, . . . , Id and d = 1, . . . , D, and k = 1, . . . , K]

To draw Id words wdi ∈ [1, . . . , V] of document d ∈ [1, . . . , D]:



▶ Draw topic assignments cdi = [cdi1 , . . . , cdiK ] ∈ {0; 1}^K with ∑_k cdik = 1 for each word wdi from

     p(C | Π) = ∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} πdk^{cdik}      ⇒ C ∈ {0; 1}^{D×Id×K}

▶ Draw word wdi from

     p(wdi = v | cdi , Θ) = ∏_{k=1}^{K} θkv^{cdik}

But we need priors for Π, Θ. And we’d like them to be sparse!


Reminder: The Dirichlet Distribution
a sparsity prior for probability distributions


p(x | π) = ∏_{i=1}^{n} πxi                          xi ∈ {1, . . . , K}

         = ∏_{k=1}^{K} πk^{nk}                      nk := |{xi | xi = k}|

p(π | α) = D(π; α) = Γ(∑_k αk) / ∏_k Γ(αk) · ∏_{k=1}^{K} πk^{αk − 1} = 1/B(α) · ∏_{k=1}^{K} πk^{αk − 1}

p(π | x) = D(π; α + n)

Peter Gustav Lejeune Dirichlet (1805–1859)
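As a quick numerical check of this conjugate update (a small NumPy sketch; the numbers and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
alpha = np.full(K, 0.5)                  # prior concentration parameters α_k
pi_true = np.array([0.7, 0.2, 0.1])      # categorical distribution generating the data

x = rng.choice(K, size=200, p=pi_true)   # n = 200 draws x_i ∈ {0, ..., K-1}
n = np.bincount(x, minlength=K)          # counts n_k = |{x_i : x_i = k}|

alpha_post = alpha + n                   # conjugacy: p(π | x) = D(π; α + n)
print("posterior parameters:", alpha_post)
print("posterior mean:", alpha_post / alpha_post.sum())  # ≈ pi_true for many observations
```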
Dirichlets can encode sparsity
for α ∼ 0.01, 0.1, 0.5 (100 samples each)

[Figure: three scatter plots of 100 Dirichlet samples each on the probability simplex, axes π1, π2, π3]
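The effect in the figure is easy to reproduce (a small NumPy sketch; the printed summary statistic is my own choice, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

for a in (0.01, 0.1, 0.5):
    samples = rng.dirichlet(np.full(3, a), size=100)   # 100 draws from D(π; α = a·1)
    # average mass of the largest component: close to 1 for small a (samples sit in
    # the corners of the simplex, i.e. are sparse), closer to 1/3 for larger a
    print(a, samples.max(axis=1).mean())
```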

Latent Dirichlet Allocation
our generative model [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) JMLR 3, 993–1022]

[Graphical model: αd → πd → cdi → wdi ← θk ← βk, with plates i = 1, . . . , Id and d = 1, . . . , D, and k = 1, . . . , K]

To draw Id words wdi ∈ [1, . . . , V] of document d ∈ [1, . . . , D]:


▶ Draw K topic distributions θk over V words from p(Θ | β) = ∏_{k=1}^{K} D(θk ; βk)
▶ Draw D document distributions over K topics from p(Π | α) = ∏_{d=1}^{D} D(πd ; αd)
▶ Draw topic assignments cdik of word wdi from p(C | Π) = ∏_{i,d,k} πdk^{cdik}
▶ Draw word wdi from p(wdi = v | cdi , Θ) = ∏_k θkv^{cdik}

Useful notation: ndkv = #{i : wdi = v, cdik = 1}. Write ndk: := [ndk1 , . . . , ndkV ] and ndk· = ∑_v ndkv , etc.
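To make the generative process concrete, here is a small NumPy sketch of ancestral sampling from this model (all sizes, the symmetric hyperparameters, and the variable names are illustrative choices, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(42)

D, K, V, I_d = 50, 5, 30, 100          # documents, topics, vocabulary size, words per document
alpha = np.full(K, 0.1)                # symmetric stand-in for the α_d
beta = np.full(V, 0.1)                 # symmetric stand-in for the β_k

Theta = rng.dirichlet(beta, size=K)    # θ_k ~ D(β),  shape (K, V)
Pi = rng.dirichlet(alpha, size=D)      # π_d ~ D(α),  shape (D, K)

docs = np.empty((D, I_d), dtype=int)   # word indices w_di
topics = np.empty((D, I_d), dtype=int) # topic of each word, i.e. the k with c_dik = 1
for d in range(D):
    for i in range(I_d):
        k = rng.choice(K, p=Pi[d])              # c_di ~ Categorical(π_d)
        topics[d, i] = k
        docs[d, i] = rng.choice(V, p=Theta[k])  # w_di ~ Categorical(θ_k)
```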

Building an algorithm
A more careful look at the model

[Graphical model of LDA, as above]

p(C, Π, Θ | W) = p(C, Π, Θ, W) / ∫∫∫ p(C, Π, Θ, W) dC dΠ dΘ

               = p(W | C, Π, Θ) · p(C, Π, Θ) / ∫∫∫ p(C, Π, Θ, W) dC dΠ dΘ

Building an algorithm
A more careful look at the model

[Graphical model of LDA, as above]

p(C, Π, Θ, W) = p(Π | α) · p(C | Π) · p(Θ | β) · p(W | C, Θ)


= (∏_{d=1}^{D} p(πd | αd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} p(cdi | πd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} p(wdi | cdi , Θ)) · (∏_{k=1}^{K} p(θk | βk))

= (∏_{d=1}^{D} D(πd ; αd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} πdk^{cdik}) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} θkwdi^{cdik}) · (∏_{k=1}^{K} D(θk ; βk))

= (∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1}) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} πdk^{cdik}) · . . .

Building an algorithm
A more careful look at the model

[Graphical model of LDA, as above]

p(Π | α) · p(C | Π) = (∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1}) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} πdk^{cdik})

Useful notation: ndkv = #{i : wdi = v, cdik = 1}. Write ndk: := [ndk1 , . . . , ndkV ] and ndk· = ∑_v ndkv , etc.

Since ∏_{i=1}^{Id} πdk^{cdik} = πdk^{∑_i cdik} = πdk^{ndk·}, the product over word instances collapses into the counts:

= ∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1 + ndk·}

Building an algorithm
A more careful look at the model

[Graphical model of LDA, as above]

p(C, Π, Θ, W) = p(Π | α) · p(C | Π) · p(Θ | β) · p(W | C, Θ)


= (∏_{d=1}^{D} p(πd | αd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} p(cdi | πd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} p(wdi | cdi , Θ)) · (∏_{k=1}^{K} p(θk | βk))

= (∏_{d=1}^{D} D(πd ; αd)) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} πdk^{cdik}) · (∏_{d=1}^{D} ∏_{i=1}^{Id} ∏_{k=1}^{K} θkwdi^{cdik}) · (∏_{k=1}^{K} D(θk ; βk))

= (∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1 + ndk·}) · (∏_{k=1}^{K} Γ(∑_v βkv) / ∏_v Γ(βkv) · ∏_{v=1}^{V} θkv^{βkv − 1 + n·kv})

Building an algorithm
A more careful look at the model

[Graphical model of LDA, as above]

p(C, Π, Θ, W) = (∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1 + ndk·}) · (∏_{k=1}^{K} Γ(∑_v βkv) / ∏_v Γ(βkv) · ∏_{v=1}^{V} θkv^{βkv − 1 + n·kv})

▶ If we had Π, Θ (which we don’t), then the posterior p(C | Θ, Π, W) would be easy:


p(C | Θ, Π, W) = p(W, C, Θ, Π) / ∑_C p(W, C, Θ, Π) = ∏_{d=1}^{D} ∏_{i=1}^{Id} [ ∏_{k=1}^{K} (πdk θkwdi)^{cdik} ] / [ ∑_{k′} πdk′ θk′wdi ]

▶ note that this conditional independence can easily be read off from the above graph!
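▶ concretely, since each cdi is one-hot, the assignments decouple across words, and each one is a categorical draw with p(cdik = 1 | Θ, Π, W) = πdk θkwdi / ∑_{k′} πdk′ θk′wdi .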
Let’s look at the structure again
Latent Dirichlet Allocation

[Graphical model of LDA, as above]
p(C, Π, Θ, W) = (∏_{d=1}^{D} Γ(∑_k αdk) / ∏_k Γ(αdk) · ∏_{k=1}^{K} πdk^{αdk − 1 + ndk·}) · (∏_{k=1}^{K} Γ(∑_v βkv) / ∏_v Γ(βkv) · ∏_{v=1}^{V} θkv^{βkv − 1 + n·kv})

▶ If we had C (which we don’t), then the posterior p(Θ, Π | C, W) would be easy:


p(Θ, Π | C, W) = p(C, W, Π, Θ) / ∫ p(Θ, Π, C, W) dΘ dΠ = (∏_d D(πd ; αd) ∏_k πdk^{ndk·}) · (∏_k D(θk ; βk) ∏_v θkv^{n·kv}) / p(C, W)

               = (∏_d D(πd ; αd: + nd:·)) · (∏_k D(θk ; βk: + n·k:))

▶ note that this conditional independence can not easily be read off from the above graph!
The Toolbox

Framework:

∫ p(x1 , x2 ) dx2 = p(x1 )        p(x1 , x2 ) = p(x1 | x2 )p(x2 )        p(x | y) = p(y | x)p(x) / p(y)

Modelling:
▶ graphical models
▶ Gaussian distributions
▶ (deep) learnt representations
▶ Kernels
▶ Markov Chains
▶ Exponential Families / Conjugate Priors
▶ Factor Graphs & Message Passing

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ maximum likelihood / MAP
▶ Laplace approximations
▶

Gibbs sampling:
▶ xt ← xt−1 ; xti ∼ p(xti | xt1 , xt2 , . . . , xt(i−1) , xt(i+1) , . . . )
▶ a special case of Metropolis-Hastings:
▶ q(x′ | xt ) = δ(x′\i − xt,\i )p(x′i | xt,\i )
▶ p(x′ ) = p(x′i | x′\i )p(x′\i ) = p(x′i | xt,\i )p(xt,\i )
▶ acceptance rate: a = p(x′)/p(xt) · q(xt | x′)/q(x′ | xt) = p(x′i | xt,\i)/p(xti | xt,\i) · p(xti | xt,\i)/p(x′i | xt,\i) = 1
Markov Chain Monte Carlo Methods provide a relatively simple way to construct approximate posterior
distributions. They are asymptotically exact. Compared to other approximate inference methods, like
variational inference, they tend to be easier to implement but harder to monitor, and may also be more
computationally expensive.


∫ f(x) p(x) dx ≈ (1/S) ∑_{s=1}^{S} f(xs)        if xs ∼ p

A Gibbs Sampler
Simple Markov Chain Monte Carlo inference

Iterate between (recall ndkv = #{i : wdi = v, cdik = 1})



Θ ∼ p(Θ | C, W) = ∏_k D(θk ; βk: + n·k:)

Π ∼ p(Π | C, W) = ∏_d D(πd ; αd: + nd:·)

C ∼ p(C | Θ, Π, W) = ∏_{d=1}^{D} ∏_{i=1}^{Id} [ ∏_{k=1}^{K} (πdk θkwdi)^{cdik} ] / [ ∑_{k′} πdk′ θk′wdi ]

▶ This is comparatively easy to implement: there are libraries for sampling from Dirichlets, and discrete sampling is trivial. All we have to keep around are the counts n (which are sparse!) and Θ, Π (which are comparatively small). Thanks to factorization, much can also be done in parallel! (A minimal code sketch follows below.)
▶ Unfortunately, this sampling scheme is relatively slow to move away from its initialization, because the assignments C depend strongly on Θ and Π, and vice versa.
▶ Properly vectorizing the code is important for speed.
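A minimal, unvectorized NumPy sketch of one such blocked Gibbs sweep, assuming the corpus is stored as an integer array docs[d, i] of word indices (equal document lengths purely for simplicity; the function name and all variable names are mine, not from the lecture):

```python
import numpy as np

def lda_gibbs_sweep(docs, Pi, Theta, alpha, beta, rng):
    """One blocked Gibbs sweep: resample C | Π, Θ, W, then Π, Θ | C, W."""
    D, I_d = docs.shape
    K, V = Theta.shape

    # C | Π, Θ, W: per word a categorical draw proportional to π_dk · θ_{k,w_di};
    # only the counts n_dk· and n_·kv need to be kept.
    n_dk = np.zeros((D, K))
    n_kv = np.zeros((K, V))
    for d in range(D):
        for i in range(I_d):
            v = docs[d, i]
            p = Pi[d] * Theta[:, v]
            k = rng.choice(K, p=p / p.sum())
            n_dk[d, k] += 1
            n_kv[k, v] += 1

    # Π, Θ | C, W: conjugate Dirichlet resampling with the counts added to α and β
    Pi = np.vstack([rng.dirichlet(alpha + n_dk[d]) for d in range(D)])
    Theta = np.vstack([rng.dirichlet(beta + n_kv[k]) for k in range(K)])
    return Pi, Theta, n_dk, n_kv
```

Repeated sweeps, after a burn-in phase, yield approximate posterior samples of C, Π, Θ; the inner loop over words is what the vectorization remark above refers to.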
A visual example
building a toy model adapted from T. L. Griffiths & M. Steyvers, Finding scientific topics, PNAS 101, 5228–5235

[Figure: ten example documents d = 0, . . . , 9 of the toy corpus, shown as images of their word counts]

▶ K = 10, V = 25, N = 100, D = 1000



▶ p(Π) = ∏_d D(πd ; α = 1)

A visual example
building a toy model adapted from T. L. Griffiths & M. Steyvers, Finding scientific topics, PNAS 101, 5228–5235

[Figure: the K = 10 topics of the toy model, each shown as a distribution over the V = 25 words]

▶ K = 10, V = 25, N = 100, D = 1000



▶ p(Π) = ∏_d D(πd ; α = 1)

It works!
your homework this week

Designing a Probabilistic Machine Learning Model
1. get the data
1.1 try to collect as much meta-data as possible
1.2 take a close look at the data
2. build the model
2.1 identify quantities and data structures; assign names
2.2 design a generative process (graphical model)
2.3 assign (conditional) distributions to factors/arrows (use exponential families!)
3. design the algorithm
3.1 consider conditional independence
3.2 try standard methods for early experiments
3.3 run unit-tests and sanity-checks
3.4 identify bottlenecks, find customized approximations and refinements
4. test the setup
5. revisit the model and try to improve it, using creativity

Probabilistic ML — P. Hennig, SS 2021 — Lecture 20: Latent Dirichlet Allocation — © Philipp Hennig, 2021 CC BY-NC-SA 3.0