
Agenda for today

Part 1 • Supervised and unsupervised sequence learning


• Expectation Maximization (EM) Algorithm
– Basic idea
– Some known properties
• Baum-Welch (forward-backward) algorithm
– relation to the Viterbi algorithm
– use for POS-tagging
• Discriminative sequence learning techniques (time permitting)
– Perceptron algorithm
– Globally conditional log-linear models
Part 2 • Gene prediction

1
Supervised learning

• Last session we considered a supervised learning approach to POS-tagging
– for every string in the training corpus, we are given the true POS tag sequence
– Using frequencies from this corpus, we used the relative frequency estimator (plus some smoothing)

P(vi|τj) = c(τj/vi) / c(τj)

P(τj|τi) = (c(τiτj) + 1) / (c(τi) + m + 1)
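As a concrete sketch of these two estimators (a toy Python example; the corpus, names, and the placement of add-one smoothing are illustrative, not from the slides):

from collections import defaultdict

# Toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

emit_count = defaultdict(int)    # c(tau_j / v_i): tag-word pair counts
tag_count = defaultdict(int)     # c(tau_j)
trans_count = defaultdict(int)   # c(tau_i tau_j), with <s> / </s> markers

for sent in corpus:
    prev = "<s>"
    tag_count[prev] += 1
    for word, tag in sent:
        emit_count[(tag, word)] += 1
        trans_count[(prev, tag)] += 1
        tag_count[tag] += 1
        prev = tag
    trans_count[(prev, "</s>")] += 1

m = len([t for t in tag_count if t != "<s>"])   # size of the tagset

def p_word_given_tag(word, tag):
    # relative frequency: P(v_i | tau_j) = c(tau_j/v_i) / c(tau_j)
    return emit_count[(tag, word)] / tag_count[tag]

def p_tag_given_tag(tag_j, tag_i):
    # add-one smoothed: P(tau_j | tau_i) = (c(tau_i tau_j) + 1) / (c(tau_i) + m + 1)
    return (trans_count[(tag_i, tag_j)] + 1) / (tag_count[tag_i] + m + 1)

print(p_word_given_tag("dog", "NN"))   # 0.5
print(p_tag_given_tag("NN", "DT"))     # (2 + 1) / (2 + 3 + 1) = 0.5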

2
Unsupervised learning

• Relative frequency estimation is the maximum likelihood estimator for the kinds of problems we deal with
• That also holds in unsupervised cases, where we do not know the true frequency of hidden states and transitions
• In this case, we will use expected frequency to maximize the likelihood
• New maximum likelihood models will give us new expected frequencies

3
Expectation Maximization

• In a model with hidden states (e.g. POS tagging), how can we estimate a model if we don’t have annotated data?
• We can see the outputs (e.g. word strings), but we do not know which hidden sequence (POS string) generated them
• EM does the following:
– takes an initial model parameterization, and calculates the expected frequencies of being in states and taking transitions
– uses these expected frequencies to maximize the likelihood of the data

4
Expectation Maximization

• The Viterbi algorithm is a special case of general dynamic programming techniques, applied to HMMs
• In a similar way, Baum-Welch, or forward-backward, is a special case of general EM
• EM is guaranteed to improve likelihood of data until convergence
• That is terrific, if likelihood is your objective
• Likelihood of the data may not optimize other interesting objectives (such as accuracy of labeling)

5
HMM POS-tag model (from previous lecture)

word sequence: W = w1 . . . wn, for time 1 ≤ t ≤ n
input (word) vocabulary: vi ∈ V for 1 ≤ i ≤ k
output (tag) vocabulary: τj ∈ T for 1 ≤ j ≤ m

Let bj(vi) = P(vi|τj) = c(τj/vi) / c(τj)
Let aij = P(τj|τi) = (c(τiτj) + 1) / (c(τi) + m + 1)
Let a0j = P(τj| <s>) = (c(<s>τj) + 1) / (c(<s>) + m)
Let ai0 = P(</s> |τi) = (c(τi</s>) + 1) / (c(τi) + m + 1)
Let α0(0) = 1 and αj(t) = Σ_{i=1..m} αi(t − 1) aij bj(wt)

6
Baum-Welch (forward-backward)

• For a given observation sequence w1 . . . wn and a given model λ = {bj, aij}
– For any tag τk at any time t, what is P(τ(t) = τk | w1 . . . wn, λ)
– For any tags τj, τk at any time t, what is P(τ(t) = τj and τ(t + 1) = τk | w1 . . . wn, λ)
• Our forward probability αj(t) is insufficient to calculate these conditional probabilities
• Also need a backward probability

7
Forward and backward probabilities

word sequence: W = w1 . . . wn, for time 1 ≤ t ≤ n

Forward probability:
(probability of seeing initial sequence w1 . . . wt and having tag j at time t)
α0(0) = 1     αj(t) = Σ_{i=1..m} αi(t − 1) aij bj(wt)

Backward probability:
(probability of seeing remaining sequence wt+1 . . . wn given tag i at time t)
βi(n) = ai0     βi(t) = Σ_{j=1..m} βj(t + 1) aij bj(wt+1)

P(w1 . . . wn) = β0(0) = Σ_{i=1..m} αi(n) ai0
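A minimal numpy sketch of the two recursions (the layout is my own: state 0 is reserved for <s>/</s>, a is an (m+1)×(m+1) array with row 0 = a0j and column 0 = ai0, and b maps tag index j to a dict of word → P(word | tag)):

import numpy as np

def forward_backward(words, a, b):
    m = a.shape[0] - 1
    n = len(words)
    alpha = np.zeros((n + 1, m + 1))
    beta = np.zeros((n + 1, m + 1))
    alpha[0, 0] = 1.0                                    # alpha_0(0) = 1
    for t in range(1, n + 1):
        for j in range(1, m + 1):
            # alpha_j(t) = sum_i alpha_i(t-1) a_ij b_j(w_t)
            alpha[t, j] = alpha[t - 1].dot(a[:, j]) * b[j].get(words[t - 1], 0.0)
    for i in range(1, m + 1):
        beta[n, i] = a[i, 0]                             # beta_i(n) = a_i0
    for t in range(n - 1, 0, -1):
        for i in range(1, m + 1):
            # beta_i(t) = sum_j beta_j(t+1) a_ij b_j(w_{t+1})
            beta[t, i] = sum(beta[t + 1, j] * a[i, j] * b[j].get(words[t], 0.0)
                             for j in range(1, m + 1))
    prob = alpha[n, 1:].dot(a[1:, 0])                    # P(w_1..w_n) = sum_i alpha_i(n) a_i0
    return alpha, beta, prob

Plugged into the "fruit flies fast" example model later in the deck, this returns prob ≈ 7.904e-06.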

8
Expected frequency definitions

Probability of having tag i at time t given w1 . . . wn

γi(t) = αi(t) βi(t) / Σ_{j=1..m} αj(t) βj(t)

Probability of having tag i at time t and tag j at time t + 1, given w1 . . . wn

ξij(t) = γi(t) aij bj(wt+1) βj(t + 1) / βi(t)
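Given α and β arrays as in the sketch above, these posteriors can be computed as follows (note that γi(t) aij bj(wt+1) βj(t+1) / βi(t) equals αi(t) aij bj(wt+1) βj(t+1) / P(W), which avoids dividing by a possibly zero βi(t); array names are illustrative):

import numpy as np

def posteriors(words, a, b, alpha, beta):
    n, m = len(words), a.shape[0] - 1
    p_w = alpha[n, 1:].dot(beta[n, 1:])                  # = P(w_1..w_n)
    gamma = np.zeros((n + 1, m + 1))
    xi = np.zeros((n + 1, m + 1, m + 1))
    for t in range(1, n + 1):
        # gamma_i(t) = alpha_i(t) beta_i(t) / sum_j alpha_j(t) beta_j(t);
        # the denominator equals P(W) at every time t
        gamma[t, 1:] = alpha[t, 1:] * beta[t, 1:] / p_w
    for t in range(1, n):                                # xi is defined for t = 1..n-1
        for i in range(1, m + 1):
            for j in range(1, m + 1):
                xi[t, i, j] = (alpha[t, i] * a[i, j] * b[j].get(words[t], 0.0)
                               * beta[t + 1, j]) / p_w
    return gamma, xi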

9
Maximization step (new model)

b̃i(vk) = Σ_{t=1..n} δ(wt, vk) γi(t) / Σ_{t=1..n} γi(t)

ãij = Σ_{t=1..n−1} ξij(t) / Σ_{t=1..n} γi(t)

ã0j = γj(1)

ãi0 = γi(n) / Σ_{t=1..n} γi(t)

10
Forward-backward algorithm, E-step
word sequence: W = w1 . . . wn, size of tagset |T| = m
α0(0) ← 1
for t = 1 to n
  for j = 1 to m
    αj(t) ← Σ_{i=0..m} αi(t − 1) aij bj(wt)
for i = 1 to m
  βi(n) ← ai0
for i = 1 to m
  γi(n) ← αi(n) βi(n) / Σ_{j=1..m} αj(n) βj(n)
for t = n − 1 to 1
  for i = 1 to m
    βi(t) ← Σ_{j=1..m} βj(t + 1) aij bj(wt+1)
  for i = 1 to m
    γi(t) ← αi(t) βi(t) / Σ_{j=1..m} αj(t) βj(t)
    for j = 1 to m
      ξij(t) ← γi(t) aij bj(wt+1) βj(t + 1) / βi(t)

11
Forward-backward algorithm, M-step

corpus of N sentences, Ws = w1^s . . . w^s_|Ws|, size of tagset |T| = m
initialize aij, a0j, aj0, and bj(vk) to 0 for all i, j, k
for i = 1 to m
  c(i) ← Σ_{s=1..N} Σ_{t=1..|Ws|} γi^s(t)
  a0i ← (1/N) Σ_{s=1..N} γi^s(1)
  ai0 ← (1/c(i)) Σ_{s=1..N} γi^s(|Ws|)
  for j = 1 to m
    aij ← (1/c(i)) Σ_{s=1..N} Σ_{t=1..|Ws|−1} ξij^s(t)
  for k = 1 to |V|
    bi(vk) ← (1/c(i)) Σ_{s=1..N} Σ_{t=1..|Ws|} δ(wt^s, vk) γi^s(t)
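A sketch of this M-step in Python, assuming per-sentence arrays gamma[s][t, i] and xi[s][t, i, j] (times 1..|Ws|, tags 1..m) as produced by an E-step like the sketches above; vocab is assumed to contain every word in the corpus, and names and layout are mine:

import numpy as np

def m_step(gammas, xis, sentences, vocab, m):
    N = len(gammas)
    a = np.zeros((m + 1, m + 1))                        # row 0 = a_0j, column 0 = a_i0
    b = [None] + [dict.fromkeys(vocab, 0.0) for _ in range(m)]
    for i in range(1, m + 1):
        # c(i): expected number of visits to state i over the whole corpus
        c_i = sum(gammas[s][1:, i].sum() for s in range(N))
        a[0, i] = sum(gammas[s][1, i] for s in range(N)) / N
        a[i, 0] = sum(gammas[s][len(sentences[s]), i] for s in range(N)) / c_i
        for j in range(1, m + 1):
            a[i, j] = sum(xis[s][1:len(sentences[s]), i, j].sum()
                          for s in range(N)) / c_i
        for s in range(N):
            for t, w in enumerate(sentences[s], start=1):
                b[i][w] += gammas[s][t, i] / c_i        # adds delta(w_t, v_k) gamma_i(t)
    return a, b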

12
Important: use log accumulators!
Example: βi(t) ← Σ_{j=1..m} βj(t + 1) aij bj(wt+1)
Want, instead, to calculate log βi(t) to avoid underflow
Recall: log(ab/c) = log a + log b − log c

log βi(t) ← log 0
for j = 1 to m
  upd ← log βj(t + 1) + log aij + log bj(wt+1)
  log βi(t) ← log(e^{log βi(t)} + e^{upd})

13
Good trick with logs

Recall: e^{x+y} = e^x e^y

log(e^A + e^B) = log(e^{B+A−B} + e^B)
             = log(e^B e^{A−B} + e^B)
             = log(e^B (e^{A−B} + 1))
             = log e^B + log(e^{A−B} + 1)
             = B + log(e^{A−B} + 1)
             = A + log(e^{B−A} + 1)

Don’t want e^{A−B} to be large. Hence, if A > B,
calculate A + log(e^{B−A} + 1)
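A sketch of this "log-add" as a helper function (representing log 0 as -inf; names are illustrative):

import math

def log_add(A, B):
    """Return log(e^A + e^B) without overflow: max + log(e^(min - max) + 1)."""
    if A == float("-inf"):
        return B
    if B == float("-inf"):
        return A
    if A < B:
        A, B = B, A
    return A + math.log1p(math.exp(B - A))   # log1p(x) = log(1 + x)

# The log-domain backward update of the previous slide then becomes:
#   log_beta = float("-inf")
#   for j in range(1, m + 1):
#       log_beta = log_add(log_beta, log_beta_next[j] + log_a[i][j] + log_b[j][w_next])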

14
Example

Sentence: fruit flies fast
Candidate tags: fruit: NN; flies: NNS, VB; fast: VB, RB, JJ

Emission probabilities bj(w):
b2(fruit) = P(fruit | NN) = 0.1
b3(flies) = P(flies | NNS) = 0.01
b4(flies) = P(flies | VB) = 0.1
b4(fast) = P(fast | VB) = 0.01
b5(fast) = P(fast | RB) = 0.3
b1(fast) = P(fast | JJ) = 0.05

Transition probabilities aij = P(τj|τi):
    j:    0      1     2     3     4     5
 i        </s>   JJ    NN    NNS   VB    RB
 0 <s>    0      0.3   0.2   0.2   0.2   0.1
 1 JJ     0.2    0.1   0.3   0.2   0.1   0.1
 2 NN     0.2    0.1   0.2   0.2   0.2   0.1
 3 NNS    0.2    0.1   0.1   0.2   0.3   0.1
 4 VB     0.2    0.1   0.2   0.2   0     0.3
 5 RB     0.2    0.1   0.2   0.1   0.2   0.2
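A sketch that encodes this model as numpy arrays and reruns the forward pass of the next few slides (tag indices 0-5 as in the table; emissions not listed above are taken to be 0, which is what reproduces the α values that follow):

import numpy as np

# a[i][j] = P(tag j | tag i); row 0 is <s>, column 0 is </s>
a = np.array([[0.0, 0.3, 0.2, 0.2, 0.2, 0.1],
              [0.2, 0.1, 0.3, 0.2, 0.1, 0.1],
              [0.2, 0.1, 0.2, 0.2, 0.2, 0.1],
              [0.2, 0.1, 0.1, 0.2, 0.3, 0.1],
              [0.2, 0.1, 0.2, 0.2, 0.0, 0.3],
              [0.2, 0.1, 0.2, 0.1, 0.2, 0.2]])
# b[j] maps a word to P(word | tag j); 1=JJ, 2=NN, 3=NNS, 4=VB, 5=RB
b = {1: {"fast": 0.05}, 2: {"fruit": 0.1}, 3: {"flies": 0.01},
     4: {"flies": 0.1, "fast": 0.01}, 5: {"fast": 0.3}}

words = ["fruit", "flies", "fast"]
n, m = len(words), 5
alpha = np.zeros((n + 1, m + 1))
alpha[0, 0] = 1.0
for t in range(1, n + 1):
    for j in range(1, m + 1):
        alpha[t, j] = alpha[t - 1].dot(a[:, j]) * b[j].get(words[t - 1], 0.0)
prob = alpha[n, 1:].dot(a[1:, 0])
print(alpha)                      # matches the alpha_j(t) tables on the next slides
print(prob, np.log(prob))         # 7.904e-06 and about -11.748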

15
Forward-backward, initialize

Time
j lbl 0 1 2 3 4
0 1 0 0 0
1 JJ 0
2 NN 0
3 NNS 0
4 VB 0
5 RB 0
αj (t)

16
Forward-backward, forward 1

Time
j lbl 0 1 2 3 4
0 1 0 0 0
1 JJ 0 0
2 NN 0 0.02
3 NNS 0 0
4 VB 0 0
5 RB 0 0
αj (t)

17
Forward-backward, forward 2

Time
j lbl 0 1 2 3 4
0 1 0 0 0
1 JJ 0 0 0
2 NN 0 0.02 0
3 NNS 0 0 0.00004
4 VB 0 0 0.0004
5 RB 0 0 0
αj (t)

18
Forward-backward, forward 3

Time
j lbl 0 1 2 3 4
0 1 0 0 0
1 JJ 0 0 0 0.0000022
2 NN 0 0.02 0 0
3 NNS 0 0 0.00004 0
4 VB 0 0 0.0004 0.00000012
5 RB 0 0 0 0.0000372
αj (t)

19
Forward-backward, forward finalize

Time
j lbl 0 1 2 3 4
0 1 0 0 0 0.000007904
1 JJ 0 0 0 0.0000022
2 NN 0 0.02 0 0
3 NNS 0 0 0.00004 0
4 VB 0 0 0.0004 0.00000012
5 RB 0 0 0 0.0000372
αj (t)

P(fruit flies fast) = 0.000007904
ln P(fruit flies fast) = -11.748

20
Forward-backward, backward time 3

Time
j lbl 0 1 2 3 4
0 1 0 0 0 0.000007904/1
1 JJ 0 0 0 0.0000022/0.2
2 NN 0 0.02 0 0
3 NNS 0 0 0.00004 0
4 VB 0 0 0.0004 0.00000012/0.2
5 RB 0 0 0 0.0000372/0.2
αj (t)/βj (t)

21
Forward-backward, backward time 2

Time
j lbl 0 1 2 3 4
0 1 0 0 0 0.000007904/1
1 JJ 0 0 0 0.0000022/0.2
2 NN 0 0.02 0 0
3 NNS 0 0 0.00004/0.0076 0
4 VB 0 0 0.0004/0.0190 0.00000012/0.2
5 RB 0 0 0 0.0000372/0.2
αj (t)/βj (t)

22
Forward-backward, backward times 1, 0

Time
j lbl 0 1 2 3 4
0 1 0 0 0 0.000007904/1
1 JJ 0 0 0 0.0000022/0.2
2 NN 0 0.02 0 0
3 NNS 0 0 0.00004/0.0076 0
4 VB 0 0 0.0004/0.0190 0.00000012/0.2
5 RB 0 0 0 0.0000372/0.2
αj (t)/βj (t)

β2(1) = β3(2) a23 b3(flies) + β4(2) a24 b4(flies) = 0.0003952
β0(0) = β2(1) a02 b2(fruit) = 0.000007904

23
Forward-backward, γ calculations

γi(t) ← αi(t) βi(t) / Σ_{j=1..m} αj(t) βj(t)

t   j  lbl   αj(t)        βj(t)      αj(t)βj(t)    γj(t)
1   2  NN    0.02         0.0003952  0.000007904   1
2   3  NNS   0.00004      0.0076     0.000000304   0.038
    4  VB    0.0004       0.0190     0.0000076     0.962
3   1  JJ    0.0000022    0.2        0.00000044    0.056
    4  VB    0.00000012   0.2        0.000000024   0.003
    5  RB    0.0000372    0.2        0.00000744    0.941

• Alternative to the Viterbi algorithm for tagging new strings
– Can pick the highest probability tag at each time
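A one-line sketch of this posterior decoding, given a gamma array like the one above (names are illustrative):

import numpy as np

def posterior_decode(gamma, tags):
    # at each time t = 1..n, pick the tag i with the highest gamma_i(t)
    return [tags[int(np.argmax(gamma[t, 1:])) + 1] for t in range(1, gamma.shape[0])]

# For the example above this yields ['NN', 'VB', 'RB'].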

24
Joint models

• Up to now, we have had the model

P(τ1 . . . τn, w1 . . . wn) = Π_{i=1..n} P(τi|τi−1) P(wi|τi)

where τ0 = <s> and τn = wn = </s>
• This is a joint model of the tag and word sequence
• Useful for class-based language modeling
• Also used for finding the tag sequence

T̂ = argmax_T P(T|W) = argmax_T P(T, W) / P(W) = argmax_T P(T, W)

25
Generative model
• Joint models of this sort are also known as generative
• Can be used to generate strings: first generate τ1,
then generate w1 and τ2 conditioned on τ1, etc.
• Not to be confused with Generative Linguistics,
although the meaning of that term is similar in both
• Examples of generative models:
– Smoothed n-gram language models
– Speech recognition HMMs
– HMM POS-tagging models
• Often used for disambiguation, not generation

26
Conditional modeling

• Remember our POS-tagging equation

T̂ = argmax_T P(T|W)

• Rather than using Bayes rule inversion, we can try to estimate and evaluate this directly
• One possible conditional decomposition:

P(τ1 . . . τn|w1 . . . wn) = Π_{i=1..n} P(τi|τi−1, wi)

27
Joint (a) versus Conditional (b) modeling

[Figure: two chain-structured graphical models over states s1 . . . sn and observations o1 . . . on; (a) the joint (generative) model, (b) the conditional model]
28
Log linear modeling

• Define a d-dimensional vector of features φ
e.g. φ1000(τi−1, τi, wi) = 1 if τi is DT and wi is "the", 0 otherwise
• Estimate a d-dimensional parameter vector α
• Then

P(τi|τi−1, wi) = exp(Σ_{s=1..d} αs φs(τi−1, τi, wi)) / Z(τi−1, wi)

where

Z(τi−1, wi) = Σ_{τ′} exp(Σ_{s=1..d} αs φs(τi−1, τ′, wi))

• We can just consider log P:

log P(τi|τi−1, wi) = Σ_{s=1..d} αs φs(τi−1, τi, wi) − log Z(τi−1, wi)
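A sketch of this local log-linear (softmax) distribution; the feature templates and weights are invented for illustration:

import math

def phi(prev_tag, tag, word):
    # indicator features phi_s(tau_{i-1}, tau_i, w_i); templates are made up
    return {f"tag={tag}^word={word}": 1.0, f"prev={prev_tag}^tag={tag}": 1.0}

def p_tag(tag, prev_tag, word, weights, tagset):
    def score(t):
        return sum(weights.get(f, 0.0) * v for f, v in phi(prev_tag, t, word).items())
    Z = sum(math.exp(score(t)) for t in tagset)          # Z(tau_{i-1}, w_i)
    return math.exp(score(tag)) / Z

weights = {"tag=DT^word=the": 2.0, "prev=<s>^tag=DT": 1.0}
print(p_tag("DT", "<s>", "the", weights, ["DT", "NN", "VB"]))   # about 0.91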

29
Global feature vectors

• To do global conditional modeling, we need to define a global feature vector

Φs(τ1 . . . τn, w1 . . . wn) = Σ_{i=1..n} φs(τi−1, τi, wi)     (1)

• Thus, if φ1000(τi−1, τi, wi) = 1 if τi is DT and wi is "the", 0 otherwise, then Φ1000(τ1 . . . τn, w1 . . . wn) is the count of the number of times DT/the occurs in the tag/word sequence. Then

log P(τ1 . . . τn|w1 . . . wn) = Σ_{s=1..d} αs Φs(τ1 . . . τn, w1 . . . wn) − log Z(w1 . . . wn)
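A sketch of the global feature vector as the sum of local feature vectors (reusing a local phi like the one sketched above; names are illustrative):

from collections import Counter

def phi(prev_tag, tag, word):
    return Counter({f"tag={tag}^word={word}": 1, f"prev={prev_tag}^tag={tag}": 1})

def Phi(tags, words):
    # Phi_s(tau_1..tau_n, w_1..w_n) = sum_i phi_s(tau_{i-1}, tau_i, w_i)
    total, prev = Counter(), "<s>"
    for tag, word in zip(tags, words):
        total += phi(prev, tag, word)
        prev = tag
    return total

print(Phi(["DT", "NN", "DT", "NN"], ["the", "boy", "the", "dog"]))
# e.g. 'tag=DT^word=the' fires twice, so that feature's global count is 2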

30
Parameter optimization

• Log linear modeling requires searching for optimal parameterizations
• Typically this is more complicated than with simple HMM models
– Iterative search techniques, e.g. gradient descent or iterative scaling
• For disambiguation, the normalization constant is constant, and can be ignored
– but it is typically required for training

31
Discriminative training

• Statistical model training involves maximizing some objective function
• For an HMM, we use maximum likelihood training
– maximize the probability of the training set
• Reduction in errors is the true objective of learning
• Another option is to try to directly optimize error
rate or some other closely related objective
• Consider not just truth, but also other candidates

32
One approach: perceptron

• One approach that has been around since the late 60s is the perceptron
• Basic idea:
– Find the best scoring analysis
(e.g. POS tag sequence)
– Make its score lower, by penalizing its features
– Make the score of the truth better, by rewarding
its features
– Go on to the next example

33
Perceptron algorithm (Collins, 2002)

Approach assumes:
• Training examples (xi, yi) for i = 1 . . . N where xi is
the input and yi is the true output.
e.g. (w1 . . . wk , τ1 . . . τk ), where τ1 . . . τk is the true
tag sequence
• A function GEN which enumerates a set of
candidates GEN(x) for an input x.
• A representation Φ mapping each (x, y) ∈ X × Y to a feature vector Φ(x, y) ∈ R^d.
• A parameter vector ᾱ ∈ R^d.

34
Perceptron algorithm (Collins, 2002)

Inputs: Training examples (xi, yi)
Initialization: Set ᾱ = 0
Algorithm:
  For t = 1 . . . T, i = 1 . . . N
    Calculate zi = argmax_{z∈GEN(xi)} Φ(xi, z) · ᾱ
    If (zi ≠ yi) then ᾱ = ᾱ + Φ(xi, yi) − Φ(xi, zi)
Output: Parameters ᾱ
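A direct Python translation of this loop, assuming the caller supplies GEN(x) (a list of candidate outputs) and Phi(x, y) (a dict-valued global feature map); this mirrors the pseudocode rather than any particular released implementation:

from collections import defaultdict

def perceptron_train(data, GEN, Phi, T=5):
    """data: list of (x, y_true) pairs; returns the weight vector alpha as a dict."""
    alpha = defaultdict(float)

    def score(x, y):
        return sum(alpha[f] * v for f, v in Phi(x, y).items())

    for _ in range(T):
        for x, y_true in data:
            z = max(GEN(x), key=lambda y: score(x, y))   # best-scoring candidate
            if z != y_true:
                for f, v in Phi(x, y_true).items():      # reward the truth
                    alpha[f] += v
                for f, v in Phi(x, z).items():           # penalize the wrong guess
                    alpha[f] -= v
    return alpha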

35
Notes about perceptron

• Because this technique is optimizing (sequence) error rate, it does not involve a normalization factor Z(w1 . . . wk)
• This will overtrain – i.e. it will do very well on the training set, not so well on new data, like unsmoothed maximum likelihood
– there are techniques for controlling overtraining, such as regularization, voting and averaging
• Approach outperforms maximum likelihood optimized models on a range of tasks: POS-tagging, NP-chunking

36
Conditional Random Fields (CRFs)

• The perceptron algorithm only pays attention to the best scoring (argmax) path
• What if there were two top analyses, very close in score?
– Should penalize features on both
– How do we allocate the penalty?
• CRFs are a way to do this, by optimizing the conditional log-likelihood of the truth

37
Conditional Random Fields (CRF)

Define a conditional distribution over the members of GEN(x) for a given input x:

pᾱ(y|x) = (1 / Z(x, ᾱ)) exp(Φ(x, y) · ᾱ)

where

Z(x, ᾱ) = Σ_{y∈GEN(x)} exp(Φ(x, y) · ᾱ)

NOTE: this is just like what fsmpush does for costs!
(Also, can be calculated with the forward-backward algorithm)
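A brute-force sketch that literally sums over GEN(x) (candidates assumed hashable, with GEN and Phi supplied as in the perceptron sketch); in practice Z(x, ᾱ) is computed with forward-backward as noted above:

import math

def crf_prob(y, x, alpha, GEN, Phi):
    def score(cand):
        return sum(alpha.get(f, 0.0) * v for f, v in Phi(x, cand).items())
    Z = sum(math.exp(score(cand)) for cand in GEN(x))    # Z(x, alpha)
    return math.exp(score(y)) / Z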
38
CRF Objective Function

Choose ᾱ to maximize the conditional log-likelihood of the training data:

LL(ᾱ) = Σ_{i=1..N} log pᾱ(yi|xi) = Σ_{i=1..N} [Φ(xi, yi) · ᾱ − log Z(xi, ᾱ)]

Use a zero-mean Gaussian prior on the parameters, resulting in the regularized objective function:

LLR(ᾱ) = Σ_{i=1..N} [Φ(xi, yi) · ᾱ − log Z(xi, ᾱ)] − ||ᾱ||² / (2σ²)

The value σ is typically estimated using held-out data.


39
CRF Optimization

• The objective function is convex and there is a globally optimal solution.
• Can use general numerical optimization techniques to find the global optimum
– e.g. for a language modeling project we used a general limited memory variable metric method to optimize LLR from a publicly available software library
• The optimizer needs the function value and the derivative (or gradient)

40
Derivative of LLR: refresher

Remember the chain rule:

df(g(x))/dx = (df/dg)(dg/dx)

Also remember the derivative of the (natural) log:

d log(x)/dx = 1/x

And don’t forget the derivative of exp:

d exp(ax)/dx = a exp(ax)

41
Derivative of LLR

LLR(ᾱ) = Σ_{i=1..N} [Φ(xi, yi) · ᾱ − log Z(xi, ᾱ)] − ||ᾱ||² / (2σ²)

       = Σ_{i=1..N} [ Σ_{s=1..d} Φs(xi, yi) αs − log Σ_{y∈GEN(xi)} exp(Σ_{j=1..d} Φj(xi, y) ᾱj) ] − Σ_{m=1..d} ᾱm² / (2σ²)

∂LLR/∂αs = Σ_{i=1..N} [ Φs(xi, yi) − Σ_{y∈GEN(xi)} exp(Σ_{j=1..d} Φj(xi, y) ᾱj) Φs(xi, y) / Σ_{y∈GEN(xi)} exp(Σ_{j=1..d} Φj(xi, y) ᾱj) ] − 2αs / (2σ²)

         = Σ_{i=1..N} [ Φs(xi, yi) − Σ_{y∈GEN(xi)} (exp(Σ_{j=1..d} Φj(xi, y) ᾱj) / Z(xi, ᾱ)) Φs(xi, y) ] − αs/σ²

         = Σ_{i=1..N} [ Φs(xi, yi) − Σ_{y∈GEN(xi)} p(y|xi) Φs(xi, y) ] − αs/σ²
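The last line translates directly into code: per training example, add the features of the truth, subtract the model-expected features under p(y|xi), and subtract αs/σ² at the end. A brute-force sketch over GEN(x) (names and the enumeration are illustrative; real implementations get the expectations from forward-backward):

import math
from collections import defaultdict

def llr_gradient(data, alpha, GEN, Phi, sigma2):
    grad = defaultdict(float)
    for x, y_true in data:
        for f, v in Phi(x, y_true).items():              # observed features
            grad[f] += v
        scores = {y: sum(alpha.get(f, 0.0) * v for f, v in Phi(x, y).items())
                  for y in GEN(x)}
        Z = sum(math.exp(s) for s in scores.values())
        for y, s in scores.items():                      # expected features
            p = math.exp(s) / Z
            for f, v in Phi(x, y).items():
                grad[f] -= p * v
    for f, w in alpha.items():                           # Gaussian prior term
        grad[f] -= w / sigma2
    return grad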

42
Sha & Pereira (2003)

• Shallow parsing is a kind of labeled bracketing
(NP the boy) (VP saw) (NP his brother)
• It is actually equivalent to a tagging task
NP-B/the NP-I/boy VP-B/saw NP-B/his NP-I/brother
• NP chunking only annotates for noun-phrases
B/the I/boy O/saw B/his I/brother
• Sha & Pereira (2003) use CRFs to tag
input: word string
output: sequence of B/I/O
• Perceptron training gives nearly identical performance to CRF
43
Sha & Pereira (2003) features

ci is the class of wi
ti is the POS-tag of wi
yi = ci−1ci, e.g. BI or IO, but never OI
c(yi) = ci

44
Full versus shallow parse

Full parse (bracketed):
(S (NP They)
   (VP are
       (VP starting
           (S (VP to
                  (VP buy
                      (NP growth stocks)))))))

Shallow parse:
[NP They] [VP are starting to buy] [NP growth stocks]
45
Final thoughts

• Good feature sets matter a lot
• These discriminative methods allow for easy use of many features
– Unlike HMM-based methods
• In Sha & Pereira, perceptron performance is not statistically significantly different from CRF with the same feature set
• Training can be very expensive

46
