Supervised learning
Unsupervised learning
Expectation Maximization
• In a model with hidden states (e.g. POS tagging), how can we estimate the model if we don't have annotated data?
• We can see the outputs (e.g. word strings), but we do not know which hidden sequence (POS strings) generated them
• EM does the following (see the sketch after this list):
  – takes an initial model parameterization, and calculates the expected frequencies of being in states and taking transitions
  – uses these expected frequencies to maximize the likelihood of the data
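As a sketch of the overall loop (illustrative Python; e_step and m_step stand for the forward-backward computations developed on the following slides, and the convergence test is an assumption, not part of the lecture):

def run_em(model, data, e_step, m_step, tol=1e-6, max_iters=100):
    # Alternate: the E-step computes expected state/transition frequencies
    # and the data log-likelihood; the M-step re-estimates parameters.
    prev_ll = float("-inf")
    for _ in range(max_iters):
        expected_counts, log_likelihood = e_step(model, data)
        model = m_step(expected_counts)
        if log_likelihood - prev_ll < tol:  # likelihood stopped improving
            break
        prev_ll = log_likelihood
    return model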
Expectation Maximization
HMM POS-tag model (from previous lecture)
Baum-Welch (forward-backward)
Forward and backward probabilities
Backward probability (probability of seeing the remaining sequence $w_{t+1} \ldots w_n$ given tag $i$ at time $t$):

$\beta_i(n) = a_{i0} \qquad \beta_i(t) = \sum_{j=1}^{m} \beta_j(t+1)\, a_{ij}\, b_j(w_{t+1})$

$P(w_1 \ldots w_n) = \beta_0(0) = \sum_{i=1}^{m} \alpha_i(n)\, a_{i0}$
Expected frequency definitions
$\gamma_i(t) = \dfrac{\alpha_i(t)\,\beta_i(t)}{\sum_{j=1}^{m} \alpha_j(t)\,\beta_j(t)}$

$\xi_{ij}(t) = \dfrac{\gamma_i(t)\, a_{ij}\, b_j(w_{t+1})\, \beta_j(t+1)}{\beta_i(t)}$
Maximization step (new model)
$\tilde{b}_i(v_k) = \dfrac{\sum_{t=1}^{n} \delta_{w_t,v_k}\, \gamma_i(t)}{\sum_{t=1}^{n} \gamma_i(t)}$

$\tilde{a}_{ij} = \dfrac{\sum_{t=1}^{n-1} \xi_{ij}(t)}{\sum_{t=1}^{n} \gamma_i(t)}$

$\tilde{a}_{0j} = \gamma_j(1) \qquad \tilde{a}_{i0} = \dfrac{\gamma_i(n)}{\sum_{t=1}^{n} \gamma_i(t)}$
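A NumPy sketch of these updates. Assumed layout (illustrative, shared with the E-step sketch below): gamma is (n+1)×(m+1) and xi is (n+1)×(m+1)×(m+1), indexed from t = 1 as on the slides, with row/column 0 reserved for the start/end state; words[t] is the integer id of the t-th word and V the vocabulary size.

import numpy as np

def m_step(gamma, xi, words, V):
    n = gamma.shape[0] - 1
    m = gamma.shape[1] - 1
    # sum_{t=1}^{n} gamma_i(t), per state (assumes every state
    # has nonzero expected count)
    denom = gamma[1:].sum(axis=0)

    b = np.zeros((m + 1, V))                # emission re-estimates b~_i(v_k)
    for t in range(1, n + 1):
        b[:, words[t]] += gamma[t]          # numerator: sum_t delta(w_t, v_k) gamma_i(t)
    b[1:] /= denom[1:, None]

    a = np.zeros((m + 1, m + 1))            # transition re-estimates
    a[1:, 1:] = xi[1:n].sum(axis=0)[1:, 1:] / denom[1:, None]  # a~_ij
    a[0, 1:] = gamma[1, 1:]                 # a~_0j = gamma_j(1)
    a[1:, 0] = gamma[n, 1:] / denom[1:]     # a~_i0 = gamma_i(n) / sum_t gamma_i(t)
    return a, b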
Forward-backward algorithm, E-step
word sequence $W = w_1 \ldots w_n$, size of tagset $|T| = m$

$\alpha_0(0) \leftarrow 1$
for $t = 1$ to $n$:
    for $j = 1$ to $m$:
        $\alpha_j(t) \leftarrow \sum_{i=0}^{m} \alpha_i(t-1)\, a_{ij}\, b_j(w_t)$
for $i = 1$ to $m$:
    $\beta_i(n) \leftarrow a_{i0}$
for $i = 1$ to $m$:
    $\gamma_i(n) \leftarrow \dfrac{\alpha_i(n)\,\beta_i(n)}{\sum_{j=1}^{m} \alpha_j(n)\,\beta_j(n)}$
for $t = n-1$ down to $1$:
    for $i = 1$ to $m$:
        $\beta_i(t) \leftarrow \sum_{j=1}^{m} \beta_j(t+1)\, a_{ij}\, b_j(w_{t+1})$
    for $i = 1$ to $m$:
        $\gamma_i(t) \leftarrow \dfrac{\alpha_i(t)\,\beta_i(t)}{\sum_{j=1}^{m} \alpha_j(t)\,\beta_j(t)}$
        for $j = 1$ to $m$:
            $\xi_{ij}(t) \leftarrow \dfrac{\gamma_i(t)\, a_{ij}\, b_j(w_{t+1})\, \beta_j(t+1)}{\beta_i(t)}$
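A direct NumPy transcription of this E-step, under assumed conventions (all names illustrative): a is an (m+1)×(m+1) transition matrix with state 0 as the start/end state, b is an (m+1)×V emission matrix with row 0 unused, and words carries a dummy entry at index 0 so that words[t] matches the slide's 1-based indexing.

import numpy as np

def e_step(a, b, words, m):
    # a[i, j]: P(tag j | tag i), with state 0 the start/end state
    # b[j, w]: P(word w | tag j); words = [None, w1, ..., wn]
    n = len(words) - 1
    alpha = np.zeros((n + 1, m + 1))
    beta = np.zeros((n + 1, m + 1))
    gamma = np.zeros((n + 1, m + 1))
    xi = np.zeros((n + 1, m + 1, m + 1))

    alpha[0, 0] = 1.0                              # alpha_0(0) <- 1
    for t in range(1, n + 1):                      # forward pass
        for j in range(1, m + 1):
            alpha[t, j] = (alpha[t - 1] @ a[:, j]) * b[j, words[t]]

    beta[n, 1:] = a[1:, 0]                         # beta_i(n) <- a_i0
    z = alpha[n, 1:] @ beta[n, 1:]                 # = P(w_1 ... w_n)
    gamma[n, 1:] = alpha[n, 1:] * beta[n, 1:] / z

    for t in range(n - 1, 0, -1):                  # backward pass
        for i in range(1, m + 1):
            beta[t, i] = np.sum(beta[t + 1, 1:] * a[i, 1:] * b[1:, words[t + 1]])
        gamma[t, 1:] = alpha[t, 1:] * beta[t, 1:] / (alpha[t, 1:] @ beta[t, 1:])
        for i in range(1, m + 1):                  # assumes beta_i(t) > 0
            for j in range(1, m + 1):
                xi[t, i, j] = (gamma[t, i] * a[i, j] * b[j, words[t + 1]]
                               * beta[t + 1, j]) / beta[t, i]
    return gamma, xi, z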
Forward-backward algorithm, M-step
Important: use log accumulators!
Example: $\beta_i(t) \leftarrow \sum_{j=1}^{m} \beta_j(t+1)\, a_{ij}\, b_j(w_{t+1})$

Want, instead, to calculate $\log \beta_i(t)$ to avoid underflow

Recall: $\log \dfrac{ab}{c} = \log a + \log b - \log c$
Good trick with logs
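One such trick, and likely the one meant here (an assumption on my part): log-sum-exp, which lets sums of probabilities be accumulated entirely in log space by factoring out the larger term, $\log(e^x + e^y) = x + \log(1 + e^{y-x})$ for $x \ge y$. A sketch:

import math

def log_add(x, y):
    # stable log(e^x + e^y); factor out the larger exponent
    if y > x:
        x, y = y, x
    if y == float("-inf"):       # adding probability zero
        return x
    return x + math.log1p(math.exp(y - x))

# e.g. accumulating log beta_i(t) over j:
# acc = float("-inf")
# for j in range(1, m + 1):
#     acc = log_add(acc, log_beta[t + 1][j] + log_a[i][j] + log_b[j][w])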
Example
Forward-backward, initialize
                 Time
 j  lbl    0    1      2         3            4
 0         1    0      0         0
 1  JJ     0
 2  NN     0
 3  NNS    0
 4  VB     0
 5  RB     0
αj(t)
Forward-backward, forward 1
                 Time
 j  lbl    0    1      2         3            4
 0         1    0      0         0
 1  JJ     0    0
 2  NN     0    0.02
 3  NNS    0    0
 4  VB     0    0
 5  RB     0    0
αj(t)
Forward-backward, forward 2
                 Time
 j  lbl    0    1      2         3            4
 0         1    0      0         0
 1  JJ     0    0      0
 2  NN     0    0.02   0
 3  NNS    0    0      0.00004
 4  VB     0    0      0.0004
 5  RB     0    0      0
αj(t)
Forward-backward, forward 3
                 Time
 j  lbl    0    1      2         3            4
 0         1    0      0         0
 1  JJ     0    0      0         0.0000022
 2  NN     0    0.02   0         0
 3  NNS    0    0      0.00004   0
 4  VB     0    0      0.0004    0.00000012
 5  RB     0    0      0         0.0000372
αj(t)
Forward-backward, forward finalize
                 Time
 j  lbl    0    1      2         3            4
 0         1    0      0         0            0.000007904
 1  JJ     0    0      0         0.0000022
 2  NN     0    0.02   0         0
 3  NNS    0    0      0.00004   0
 4  VB     0    0      0.0004    0.00000012
 5  RB     0    0      0         0.0000372
αj(t)
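As a quick check (assuming $a_{i0} = 0.2$ for every tag, which is consistent with the $\beta_i(3) = a_{i0} = 0.2$ values on the next slide):

$$\alpha_0(4) = \sum_{i=1}^{m} \alpha_i(3)\, a_{i0} = (0.0000022 + 0.00000012 + 0.0000372) \times 0.2 = 0.000007904$$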
Forward-backward, backward time 3
                 Time
 j  lbl    0    1      2                3                4
 0         1    0      0                0                0.000007904/1
 1  JJ     0    0      0                0.0000022/0.2
 2  NN     0    0.02   0                0
 3  NNS    0    0      0.00004          0
 4  VB     0    0      0.0004           0.00000012/0.2
 5  RB     0    0      0                0.0000372/0.2
αj(t)/βj(t)
Forward-backward, backward time 2
                 Time
 j  lbl    0    1      2                3                4
 0         1    0      0                0                0.000007904/1
 1  JJ     0    0      0                0.0000022/0.2
 2  NN     0    0.02   0                0
 3  NNS    0    0      0.00004/0.0076   0
 4  VB     0    0      0.0004/0.0190    0.00000012/0.2
 5  RB     0    0      0                0.0000372/0.2
αj(t)/βj(t)
Forward-backward, backward times 1, 0
                 Time
 j  lbl    0    1      2                3                4
 0         1    0      0                0                0.000007904/1
 1  JJ     0    0      0                0.0000022/0.2
 2  NN     0    0.02   0                0
 3  NNS    0    0      0.00004/0.0076   0
 4  VB     0    0      0.0004/0.0190    0.00000012/0.2
 5  RB     0    0      0                0.0000372/0.2
αj(t)/βj(t)
Forward-backward, γ calculations
$\gamma_i(t) \leftarrow \dfrac{\alpha_i(t)\,\beta_i(t)}{\sum_{j=1}^{m} \alpha_j(t)\,\beta_j(t)}$
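Using the $\alpha$ and $\beta$ values from the worked example at $t = 3$, where the denominator $\sum_j \alpha_j(3)\,\beta_j(3) = P(w_1 w_2 w_3) = 0.000007904$:

$$\gamma_{JJ}(3) = \frac{0.0000022 \times 0.2}{0.000007904} \approx 0.056 \qquad \gamma_{VB}(3) = \frac{0.00000012 \times 0.2}{0.000007904} \approx 0.003 \qquad \gamma_{RB}(3) = \frac{0.0000372 \times 0.2}{0.000007904} \approx 0.941$$

(the three values sum to 1, as expected)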
Joint models
Generative model
• Joint models of this sort are also known as generative
• They can be used to generate strings: first generate $\tau_1$, then generate $w_1$ and $\tau_2$ conditioned on $\tau_1$, etc. (see the sketch below)
• Not to be confused with Generative Linguistics, although the meaning of the term is similar in both cases
• Examples of generative models:
  – Smoothed n-gram language models
  – Speech recognition HMMs
  – HMM POS-tagging models
• Often used for disambiguation, not generation
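A minimal sketch of generating from an HMM tagging model in exactly this order; the transition/emission tables and the sampling helper are illustrative assumptions:

import random

def sample(dist):
    # dist: {outcome: probability}; draw one outcome
    r = random.random()
    for outcome, p in dist.items():
        r -= p
        if r <= 0:
            break
    return outcome

def generate(trans, emit, start="<s>", end="</s>"):
    # trans[tag]: distribution over next tags; emit[tag]: distribution over words
    tags, words = [], []
    tag = sample(trans[start])            # first generate tau_1
    while tag != end:
        tags.append(tag)
        words.append(sample(emit[tag]))   # then w_t conditioned on tau_t
        tag = sample(trans[tag])          # and tau_{t+1} conditioned on tau_t
    return words, tags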
Conditional modeling
Joint (a) versus Conditional (b) modeling
[Figure: two graphical models over the state sequence $s_1 s_2 s_3 \ldots s_n$ and the observation sequence $o_1 o_2 o_3 \ldots o_n$: (a) the joint model, (b) the conditional model]
Log linear modeling
Global feature vectors
Parameter optimization
Discriminative training
One approach: perceptron
Perceptron algorithm (Collins, 2002)
Approach assumes:
• Training examples $(x_i, y_i)$ for $i = 1 \ldots N$, where $x_i$ is the input and $y_i$ is the true output, e.g. $(w_1 \ldots w_k,\ \tau_1 \ldots \tau_k)$, where $\tau_1 \ldots \tau_k$ is the true tag sequence
• A function GEN which enumerates a set of candidates GEN($x$) for an input $x$
• A representation $\Phi$ mapping each $(x, y) \in X \times Y$ to a feature vector $\Phi(x, y) \in \mathbb{R}^d$
• A parameter vector $\bar{\alpha} \in \mathbb{R}^d$
Perceptron algorithm (Collins, 2002)
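A minimal sketch of the Collins structured perceptron under the assumptions just listed; GEN, Phi, the feature dimension d, and the number of passes T are illustrative stand-ins:

import numpy as np

def perceptron(examples, GEN, Phi, d, T=5):
    # examples: list of (x, y) training pairs; Phi(x, y) returns a vector in R^d
    alpha = np.zeros(d)
    for _ in range(T):                      # T passes over the training data
        for x, y in examples:
            # decode: highest-scoring candidate under current parameters
            z = max(GEN(x), key=lambda c: Phi(x, c) @ alpha)
            if z != y:
                alpha += Phi(x, y) - Phi(x, z)   # promote truth, demote guess
    return alpha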
Notes about perceptron
Conditional Random Fields (CRFs)
Conditional Random Fields (CRF)
Derivative of LLR: refresher
Derivative of LLR
$$LL_R(\bar{\alpha}) = \sum_{i=1}^{N} \left[ \Phi(x_i, y_i) \cdot \bar{\alpha} - \log Z(x_i, \bar{\alpha}) \right] - \frac{||\bar{\alpha}||^2}{2\sigma^2}$$

$$= \sum_{i=1}^{N} \left[ \sum_{s=1}^{d} \Phi_s(x_i, y_i)\,\alpha_s - \log \sum_{y \in GEN(x_i)} \exp\Big(\sum_{j=1}^{d} \Phi_j(x_i, y)\,\bar{\alpha}_j\Big) \right] - \sum_{m=1}^{d} \frac{\bar{\alpha}_m^2}{2\sigma^2}$$

$$\frac{\partial LL_R}{\partial \alpha_s} = \sum_{i=1}^{N} \left[ \Phi_s(x_i, y_i) - \frac{\sum_{y \in GEN(x_i)} \exp\big(\sum_{j=1}^{d} \Phi_j(x_i, y)\,\bar{\alpha}_j\big)\, \Phi_s(x_i, y)}{\sum_{y \in GEN(x_i)} \exp\big(\sum_{j=1}^{d} \Phi_j(x_i, y)\,\bar{\alpha}_j\big)} \right] - \frac{2\alpha_s}{2\sigma^2}$$

$$= \sum_{i=1}^{N} \left[ \Phi_s(x_i, y_i) - \sum_{y \in GEN(x_i)} \frac{\exp\big(\sum_{j=1}^{d} \Phi_j(x_i, y)\,\bar{\alpha}_j\big)}{Z(x_i, \bar{\alpha})}\, \Phi_s(x_i, y) \right] - \frac{\alpha_s}{\sigma^2}$$

$$= \sum_{i=1}^{N} \left[ \Phi_s(x_i, y_i) - \sum_{y \in GEN(x_i)} p(y|x_i)\, \Phi_s(x_i, y) \right] - \frac{\alpha_s}{\sigma^2}$$
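A direct NumPy transcription of the last line, reusing the GEN/Phi conventions from the perceptron slides (names illustrative); it enumerates GEN(x_i) explicitly, which is only feasible for small candidate sets:

import numpy as np

def llr_gradient(examples, GEN, Phi, alpha, sigma2):
    # d LL_R / d alpha_s: observed features minus expected features,
    # minus the derivative of the L2 regularizer, alpha / sigma^2.
    grad = -alpha / sigma2
    for x, y_true in examples:
        candidates = list(GEN(x))
        scores = np.array([Phi(x, y) @ alpha for y in candidates])
        scores -= scores.max()              # stabilize exp (the log trick again)
        p = np.exp(scores)
        p /= p.sum()                        # p(y | x_i)
        grad += Phi(x, y_true)
        for p_y, y in zip(p, candidates):
            grad -= p_y * Phi(x, y)
    return grad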
Sha & Pereira (2003)
$c_i$ is the class of $w_i$
$t_i$ is the POS tag of $w_i$
$y_i = c_{i-1} c_i$, e.g. BI or IO, but never OI (see the check below)
$c(y_i) = c_i$
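As a quick illustration (the sentence is the chunking example from the shallow-parse slide below; the class assignment itself is my assumption):

def valid_bigram(prev_c, c):
    # y_i = c_{i-1} c_i: an I-class word must continue a chunk begun by
    # B or I, so the bigram OI is never generated.
    return not (prev_c == "O" and c == "I")

# "[NP They] [VP are starting to buy] [NP growth stocks]"
classes = ["B", "B", "I", "I", "I", "B", "I"]
assert all(valid_bigram(p, c) for p, c in zip(["O"] + classes, classes))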
Full versus shallow parse
Full parse:
(S (NP They)
   (VP are
       (VP starting
           (S (VP to
                  (VP buy
                      (NP growth stocks)))))))

Shallow parse:
[NP They] [VP are starting to buy] [NP growth stocks]
Final thoughts