
EM

expectation maximization

semi-supervised learning for noisy channel / HMMs

1 hmm
[Figure: an HMM trellis. hidden: start → {s1, s2, s3} → {s4, s5, s6} → {s7, s8, s9} → end, with transition probabilities labeled p(s1|start), p(s4|s1), p(s7|s4), p(end|s7) along one path; observed: o1, o2, o3, one emission per column.]
What are the probabilities of a certain observation given a certain source state (emission, p(oi|si)) or of one source state given another (transition, p(sj|si))?¹
If we know those, we can make an educated guess about what sources generated what
we’ve observed.
(find the max p(src|obs) via Bayes by calculating p(src), p(obs), and p(obs|src); equivalently, maximize the joint p(obs, src))
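A quick sketch of that Bayes step in Python; the tables and numbers here are made up purely for illustration, not from any real model:

```python
# hypothetical prior and emission tables, purely for illustration
p_src = {"s1": 0.5, "s2": 0.3, "s3": 0.2}            # p(src)
p_obs_given_src = {                                   # p(obs|src)
    "s1": {"o1": 0.7, "o2": 0.3},
    "s2": {"o1": 0.4, "o2": 0.6},
    "s3": {"o1": 0.05, "o2": 0.95},
}

def best_source(obs):
    # p(src|obs) = p(src) p(obs|src) / p(obs); p(obs) is the same for every
    # candidate source, so maximizing the joint p(obs, src) is enough
    return max(p_src, key=lambda s: p_src[s] * p_obs_given_src[s][obs])

print(best_source("o2"))  # "s3": 0.2 * 0.95 beats 0.3 * 0.6 and 0.5 * 0.3
```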

2 generative model
With models that estimate the probability of the data, we can prefer those that make the
data more likely.

p(~o, ~s) = p(s1|start) ... p(sn|sn−1) p(end|sn) p(o1|s1) ... p(on|sn)   (1)

e.g. p(o1, o2, o3, s1, s4, s7) = p(s1|start) p(s4|s1) p(s7|s4) p(end|s7) p(o1|s1) p(o2|s4) p(o3|s7)   (2)

p(~o) = Σ_{~s∈S} p(s1|start) ... p(sn|sn−1) p(end|sn) p(o1|s1) ... p(on|sn)   (3)
yikes, exponentially many paths to sum over! better do something dynamic! but first, what about
estimating our parameters?
¹ if you want to trigram words, the preceding source state must encode both of the two preceding words; in HMMs decisions are severely local.
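Here is a minimal sketch of equations (1), (2), and (3) in Python, with a brute-force sum over state sequences; the uniform tables are just illustrative starting guesses (the later sketches reuse them):

```python
from itertools import product

states = ["s1", "s2", "s3"]
vocab = ["o1", "o2", "o3"]

# uniform starting guesses, purely illustrative
p_trans = {"start": {s: 1/3 for s in states}}
for s in states:
    p_trans[s] = {t: 1/4 for t in states + ["end"]}
p_emit = {s: {o: 1/3 for o in vocab} for s in states}

def joint(obs, srcs):
    """equation (2): p(obs, srcs) for one hypothesized state sequence"""
    p = p_trans["start"][srcs[0]] * p_trans[srcs[-1]]["end"]
    for prev, nxt in zip(srcs, srcs[1:]):
        p *= p_trans[prev][nxt]
    for o, s in zip(obs, srcs):
        p *= p_emit[s][o]
    return p

def marginal(obs):
    """equation (3): sum over every state sequence -- exponentially many!"""
    return sum(joint(obs, srcs) for srcs in product(states, repeat=len(obs)))

print(marginal(("o1", "o2", "o3")))  # p(~o) under the uniform guesses
```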

3 fractional counting
Let’s count what we can observe and estimate unobserved counts from our prior beliefs.
We can then update those beliefs with our new counts, count again, and repeat until
there’s just not much more improvement (often that takes only a few iterations). Though
I won’t prove it here, this method is guaranteed never to decrease the likelihood of the
data. Be sure to normalize wherever necessary!

• count(o1, o2, o3, s1, s4, s7) = p(s1|start) p(s4|s1) p(s7|s4) p(end|s7) p(o1|s1) p(o2|s4) p(o3|s7)

• count(s1) = count(s1) + p(s1|start) p(s4|s1) p(s7|s4) p(end|s7) p(o1|s1) p(o2|s4) p(o3|s7)

As we explore all possible explanations for the data, we can increment the individual
members of a joint observation by as much as we believe in that hypothesized explanation
(sketched in code after these bullets).
initially,

• count(o1, o2, o3, s1, s4, s7) = 1/3 ∗ 1/3 ∗ 1/3 ∗ 1 ∗ 1 ∗ 1 ∗ 1 = 1/27

(but if s4 = s1, we must normalize to p(o1|s1) = .5, p(o2|s1) = .5)


• count(s1, o1) = count(s1, o1) + 1/27

• count(s1) = count(s1) + 1/27
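Continuing the sketch above (it reuses states, joint, and marginal), one round of fractional counting for the emission table might look like this; the brute-force enumeration over explanations is exactly what the next section replaces with dynamic programming:

```python
from collections import defaultdict
from itertools import product

def em_step(corpus):
    """one E+M round over a list of observation tuples; emissions only here,
    transition counts are collected the same way"""
    counts = defaultdict(lambda: defaultdict(float))
    for obs in corpus:
        z = marginal(obs)  # p(obs), the normalizer
        for srcs in product(states, repeat=len(obs)):
            w = joint(obs, srcs) / z  # how much we believe this explanation
            for o, s in zip(obs, srcs):
                counts[s][o] += w     # fractional count
    # M-step: normalize the fractional counts back into probabilities
    new_emit = {}
    for s, row in counts.items():
        total = sum(row.values())
        new_emit[s] = {o: c / total for o, c in row.items()}
    return new_emit

print(em_step([("o1", "o2", "o3"), ("o1", "o1", "o2")]))
```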

4 dynamic efficiency
need an efficient way to count all these paths:
(see Baum, Baker, Viterbi, etc. for major developers of these methods)

4.1 forward-backward
Calculate the probability up to any given point as the sum of probabilities of all possible
paths to that point times the probability that point contributes.
fi = pi Σ_{f∈Fi} f   (4)
[Figure: states s1, s2, s3 with forward values p1, p2, p3 all feeding s4, which emits o4; each incoming edge contributes pi p(s4|si), so f(s4) = p(o4|s4) Σ_{i=1}^{3} pi p(s4|si).]

The probability of the observation is f_end.


Do the same thing backwards:

bj = pj Σ_{b∈Bj} b   (5)

b_start better equal f_end!

[Figure: an edge from si (with forward value fi, summarizing region Fi) to sj (with backward value bj, summarizing region Bj), weighted p(sj|si).]

count(si, sj) += fi ∗ p(sj|si) ∗ bj / f_end   (6)
rather than Ns^No paths, more like Ns ∗ No work!
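A minimal sketch of the two passes, again reusing the states, p_trans, and p_emit tables from the earlier sketch; the backward values here include each state's own emission, matching the convention of equation (6):

```python
from collections import defaultdict

def forward(obs):
    # f[t][s] = p(o1..ot, state t = s); includes s's own emission
    f = [{s: p_trans["start"][s] * p_emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        f.append({s: p_emit[s][o] * sum(f[-1][r] * p_trans[r][s]
                                        for r in states)
                  for s in states})
    f_end = sum(f[-1][s] * p_trans[s]["end"] for s in states)  # = p(obs)
    return f, f_end

def backward(obs):
    # b[t][s] = p(ot..on, end | state t = s); also includes s's own emission
    b = [None] * len(obs)
    b[-1] = {s: p_emit[s][obs[-1]] * p_trans[s]["end"] for s in states}
    for t in range(len(obs) - 2, -1, -1):
        b[t] = {s: p_emit[s][obs[t]] * sum(p_trans[s][j] * b[t + 1][j]
                                           for j in states)
                for s in states}
    b_start = sum(p_trans["start"][s] * b[0][s] for s in states)
    return b, b_start  # b_start better equal f_end!

def expected_transitions(obs):
    # equation (6): count(si, sj) += fi * p(sj|si) * bj / f_end
    f, f_end = forward(obs)
    b, _ = backward(obs)
    counts = defaultdict(float)
    for t in range(len(obs) - 1):
        for i in states:
            for j in states:
                counts[(i, j)] += f[t][i] * p_trans[i][j] * b[t + 1][j] / f_end
    return counts
```

Dividing by f_end turns each path weight into a posterior belief, so a full pass costs roughly Ns ∗ Ns ∗ No work instead of enumerating Ns^No paths.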

4.2 inside-outside
messier but possible with trees. we can store a chart of all possible parses of a string of
words:

ca(w1, w2, w3), ..., cz(w1, w2, w3), ..., cz(ca(w1, w2), w3), ...

ca(w1, w2), ..., cz(w1, w2)        ca(w2, w3), ..., cz(w2, w3)

w1        w2        w3

the chart is essentially a pyramid, with any particular block containing all possible
classifications of the sub-string which it sits on. possible classifications of (w1 , w2 , w3 ) are
presumably affected by possible classifications of (w1 , w2 ) and (w2 , w3 ).
p(w1, ..., wn) = Σ p of all possible parse generations of (w1, ..., wn)   (7)

count(daughters, mother) += I ∗ p(daughters|mother) ∗ O   (8)

I_cij = Σ p of ways of classifying wi ... wj as c   (9)

O_cij = Σ p of ways to the root which classify wi ... wj as c   (10)
as above, the sums are entirely local to the node they operate on:
the probability of the path up to that point is just read off the relevant inside
or outside value,
the hypothesized probability of whatever operation is being considered is multiplied in,
and the result is pushed along whatever paths that operation connects.
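As a concrete (if toy) version of the inside half, here is a sketch that fills the chart pyramid bottom-up for a hypothetical grammar in Chomsky normal form; the rules and probabilities are made up, and the outside pass would sweep top-down in the same local fashion:

```python
from collections import defaultdict

# hypothetical CNF grammar: p(mother -> left right) and p(class -> word)
binary = {("S", "A", "B"): 1.0, ("B", "A", "A"): 1.0}
lexical = {("A", "w1"): 0.4, ("A", "w2"): 0.3, ("A", "w3"): 0.3}

def inside(words, root="S"):
    n = len(words)
    # chart[i][j][c] = I_cij, equation (9): summed probability of all ways
    # of classifying words[i:j] as class c
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for (c, word), p in lexical.items():
            if word == w:
                chart[i][i + 1][c] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (m, l, r), p in binary.items():
                    chart[i][j][m] += p * chart[i][k][l] * chart[k][j][r]
    return chart[0][n][root]  # equation (7): p(w1..wn) over all parses

print(inside(["w1", "w2", "w3"]))  # 1.0 * 0.4 * (1.0 * 0.3 * 0.3) = 0.036
```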
