
Maximum Likelihood

USC Linguistics

August 8, 2007

Ratnaparkhi (1998), p. 8 (slightly modified):


\[
p(a \mid b) \;=\; \frac{\prod_j w_j^{\,f_j(a,b)}}{\sum_a \prod_j w_j^{\,f_j(a,b)}} \tag{1}
\]

Training criterion (the log-likelihood of the training data, to be maximized):

\[
L(p) \;=\; \sum_{a,b} \tilde{p}(a,b)\,\log p(a \mid b) \tag{2}
\]
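To make (1) and (2) concrete, here is a minimal Python sketch; the function names (`cond_prob`, `log_likelihood`) and data layout are my own choices, not Ratnaparkhi's. `feat_vec(a, b)` is assumed to return the feature vector (f_1(a,b), ..., f_k(a,b)), and `p_tilde` maps (a, b) pairs to empirical probabilities.

```python
import math

def cond_prob(weights, feat_vec, b, outcomes):
    """p(a|b) as in (1): product of w_j**f_j(a,b), renormalized over the outcomes a."""
    scores = {a: math.prod(w ** f for w, f in zip(weights, feat_vec(a, b)))
              for a in outcomes}
    z = sum(scores.values())
    return {a: s / z for a, s in scores.items()}

def log_likelihood(weights, feat_vec, p_tilde, outcomes):
    """L(p) as in (2): sum over (a,b) of p~(a,b) * log p(a|b)."""
    return sum(p_ab * math.log(cond_prob(weights, feat_vec, b, outcomes)[a])
               for (a, b), p_ab in p_tilde.items() if p_ab > 0)
```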

Generalized Iterative Scaling (GIS) (Ratnaparkhi (1998) p. 14, citing Darroch and
Ratcliff (1972)):

\[
\sum_f f(a,b) \;=\; C \tag{3}
\]

If necessary (i.e., when the feature sums are not already constant), set C to the maximum feature sum and add a correction feature:

\[
C \;=\; \max_{a,b} \sum_f f(a,b) \;\;;\;\; f_{n+1}(a,b) \;=\; C - \sum_f f(a,b) \tag{4}
\]
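In code, the constant C and the correction feature of (3)-(4) can be set up as below; `feature_fns` (the list of feature functions) and `pairs` (the observed (a, b) events) are hypothetical names, and this is only a sketch of the standard GIS bookkeeping.

```python
def add_correction_feature(feature_fns, pairs):
    """Compute C as in (4) and append the correction feature f_{n+1}."""
    C = max(sum(f(a, b) for f in feature_fns) for a, b in pairs)
    def f_corr(a, b):
        # pads every feature sum up to the constant C
        return C - sum(f(a, b) for f in feature_fns)
    return C, feature_fns + [f_corr]
```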
\[
w_{j:0} \;=\; 1 \;\;;\;\; w_{j:i+1} \;=\; w_{j:i}\left(\frac{E_{\tilde{p}}\,f_j}{E_{p_i}\,f_j}\right)^{\frac{1}{C}} \tag{5}
\]
\[
E_{p_i} f_j \;=\; \sum_{a,b} \tilde{p}(b)\,f_j(a,b)\,p_i(a \mid b) \tag{6}
\]

\[
p_i(a \mid b) \;=\; \frac{\prod_j w_{j:i}^{\,f_j(a,b)}}{\sum_a \prod_j w_{j:i}^{\,f_j(a,b)}} \tag{7}
\]
\[
E_{\tilde{p}} f_j \;=\; \sum_{a,b} \tilde{p}(a,b)\,f_j(a,b) \;=\; \frac{1}{N}\sum_n f_j(a_n, b_n) \tag{8}
\]
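Putting (5)-(8) together, one way to render the whole GIS loop in Python, reusing `cond_prob` from the sketch above (again, the names and data layout are assumptions, not Ratnaparkhi's code):

```python
def gis(feature_fns, counts, C, iterations):
    """GIS sketch; counts maps (a, b) event types to empirical counts c(b, a)."""
    N = sum(counts.values())
    outcomes = sorted({a for a, _ in counts})
    contexts = sorted({b for _, b in counts})
    feat_vec = lambda a, b: [f(a, b) for f in feature_fns]
    # empirical context probabilities p~(b) and observed expectations E_p~ f_j, as in (8)
    p_tilde_b = {b: sum(c for (_, b2), c in counts.items() if b2 == b) / N
                 for b in contexts}
    observed = [sum(c * f(a, b) for (a, b), c in counts.items()) / N
                for f in feature_fns]
    w = [1.0] * len(feature_fns)                      # w_{j:0} = 1, as in (5)
    for _ in range(iterations):
        expected = [0.0] * len(feature_fns)           # E_{p_i} f_j, as in (6)
        for b in contexts:
            p = cond_prob(w, feat_vec, b, outcomes)   # p_i(a|b), as in (7)
            for j, f in enumerate(feature_fns):
                expected[j] += sum(p_tilde_b[b] * f(a, b) * p[a] for a in outcomes)
        w = [w_j * (obs / exp) ** (1.0 / C)           # the multiplicative update in (5)
             for w_j, obs, exp in zip(w, observed, expected)]
    return w
```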

 c(b,a)   cp1(b)   cp2(b)   a      f1     f2     f3     f4     fc
 -----------------------------------------------------------------
   20        T        T     T       1      1      0      0      0
    0        T        T     F       0      0      0      0      2
    0        T        F     T       1      0      0      0      1
   20        T        F     F       0      0      0      1      1
    0        F        T     T       0      1      0      0      1
   20        F        T     F       0      0      1      0      1
    0        F        F     T       0      0      0      0      2
   20        F        F     F       0      0      1      1      0
 -----------------------------------------------------------------
   80 (total)    weighted counts:  20     20     40     40     40
                 E_p̃ f_j:          .25    .25    .5     .5     .5
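This toy corpus can be encoded directly in Python. The feature definitions below are one reading of the table's columns (f1 fires on a = T with cp1(b) = T, f2 on a = T with cp2(b) = T, f3 on a = F with cp1(b) = F, f4 on a = F with cp2(b) = F, and fc is the correction feature, since every row's features already sum to C = 2, as stated in (9) below); they reproduce every 0/1/2 entry in the table.

```python
# counts c(b, a) from the table; a context b is the pair (cp1(b), cp2(b))
counts = {
    ('T', ('T', 'T')): 20, ('F', ('T', 'T')): 0,
    ('T', ('T', 'F')): 0,  ('F', ('T', 'F')): 20,
    ('T', ('F', 'T')): 0,  ('F', ('F', 'T')): 20,
    ('T', ('F', 'F')): 0,  ('F', ('F', 'F')): 20,
}
f1 = lambda a, b: int(a == 'T' and b[0] == 'T')
f2 = lambda a, b: int(a == 'T' and b[1] == 'T')
f3 = lambda a, b: int(a == 'F' and b[0] == 'F')
f4 = lambda a, b: int(a == 'F' and b[1] == 'F')
fc = lambda a, b: 2 - (f1(a, b) + f2(a, b) + f3(a, b) + f4(a, b))  # correction feature, C = 2
feats = [f1, f2, f3, f4, fc]
```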

\[
C = 2 \tag{9}
\]

\[
w_{j:i+1} \;=\; w_{j:i}\left[\frac{\frac{1}{N}\sum_n f_j(a_n,b_n)}{\sum_{a,b}\tilde{p}(b)\,f_j(a,b)\,p_i(a \mid b)}\right]^{\frac{1}{2}} \tag{10}
\]

\[
p_i(a \mid b) \;=\; \frac{\prod_j w_{j:i}^{\,f_j(a,b)}}{\sum_a \prod_j w_{j:i}^{\,f_j(a,b)}} \tag{11}
\]

\[
E_{\tilde{p}} f_{1,2} \;=\; 20/80 \;=\; .25 \;\;;\;\; E_{\tilde{p}} f_{3,4,c} \;=\; 40/80 \;=\; .5 \tag{12}
\]
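With that encoding, the empirical expectations in (12) fall out of a one-liner (this reuses `counts` and `feats` from the block above):

```python
N = sum(counts.values())                                                 # 80
emp = [sum(c * f(a, b) for (a, b), c in counts.items()) / N for f in feats]
print(emp)                                                               # [0.25, 0.25, 0.5, 0.5, 0.5]
```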

Weights by iteration:

           f1      f2      f3      f4      fc
 w_x:0      1       1       1       1       1
 w_x:1      1       1     1.41    1.41     .71
 w_x:2     .96     .96    1.71    1.71     .57
 w_x:3     .91     .91    1.94    1.94     .47

Conditional probabilities p_i(a|b) by iteration:

        T|TT   F|TT   T|TF   F|TF   T|FT   F|FT   T|FF   F|FF
 p_0     .5     .5     .5     .5     .5     .5     .5     .5
 p_1     .66    .33    .41    .59    .41    .59    .2     .8
 p_2     .74    .26    .36    .64    .36    .64    .1     .9
 p_3     .79    .21    .32    .68    .32    .68    .06    .94

Model expectations E_{p_i} f_j by iteration:

          f1     f2     f3     f4     fc
 E_p0    .25    .25    .25    .25     1
 E_p1    .27    .27    .34    .34    .77
 E_p2    .28    .28    .39    .39    .73
 E_p3    .28    .28    .41    .41    .73
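Running the `gis` sketch on this encoding reproduces the tables above to within rounding; the handout rounds intermediate values to two decimals, so the third-iteration weights come out slightly different under full precision:

```python
w3 = gis(feats, counts, C=2, iterations=3)
print([round(x, 2) for x in w3])
# approximately [0.92, 0.92, 1.94, 1.94, 0.49]; cf. the w_x:3 row above
```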

\[
p_0(T \mid (T,T)) \;=\; \frac{1^1 \cdot 1^1 \cdot 1^0 \cdot 1^0 \cdot 1^0}{(1^1 \cdot 1^1 \cdot 1^0 \cdot 1^0 \cdot 1^0) + (1^0 \cdot 1^0 \cdot 1^0 \cdot 1^0 \cdot 1^2)} \;=\; .5 \tag{13}
\]

\[
p_0(F \mid (T,T)) \;=\; \frac{1^0 \cdot 1^0 \cdot 1^0 \cdot 1^0 \cdot 1^2}{(1^1 \cdot 1^1 \cdot 1^0 \cdot 1^0 \cdot 1^0) + (1^0 \cdot 1^0 \cdot 1^0 \cdot 1^0 \cdot 1^2)} \;=\; .5 \tag{14}
\]
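Equations (13)-(14) are just `cond_prob` with the all-ones starting weights; a quick check using the example features from above:

```python
p0_TT = cond_prob([1, 1, 1, 1, 1], lambda a, b: [f(a, b) for f in feats],
                  ('T', 'T'), ['T', 'F'])
print(p0_TT)   # {'T': 0.5, 'F': 0.5}
```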

\[
E_{p_0} f_1 \;=\; (.25 \cdot 1 \cdot p_0(T \mid (T,T))) + (.25 \cdot 1 \cdot p_0(T \mid (T,F))) \;=\; .25 \tag{15}
\]

\[
E_{p_0} f_3 \;=\; (.25 \cdot 1 \cdot p_0(F \mid (F,T))) + (.25 \cdot 1 \cdot p_0(F \mid (F,F))) \;=\; .25 \tag{16}
\]

\[
\begin{aligned}
E_{p_0} f_c \;=\;\; & (.25 \cdot 2 \cdot p_0(F \mid (T,T)))\; + \\
& (.25 \cdot 1 \cdot p_0(T \mid (T,F))) + (.25 \cdot 1 \cdot p_0(F \mid (T,F)))\; + \\
& (.25 \cdot 1 \cdot p_0(T \mid (F,T))) + (.25 \cdot 1 \cdot p_0(F \mid (F,T)))\; + \\
& (.25 \cdot 2 \cdot p_0(T \mid (F,F))) \;=\; 1
\end{aligned} \tag{17}
\]
\[
w_{1:1} = w_{2:1} = 1 \;\;;\;\; w_{3:1} = w_{4:1} = \sqrt{2} \;\;;\;\; w_{c:1} = \sqrt{.5} \tag{18}
\]
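(18) is the first application of the update in (5), with every w_{j:0} = 1, C = 2, and the expectations from (15)-(17); the arithmetic checks out:

```python
print((.25 / .25) ** .5,    # w_1:1 = w_2:1 = 1.0
      (.5 / .25) ** .5,     # w_3:1 = w_4:1 = sqrt(2) ≈ 1.41
      (.5 / 1.0) ** .5)     # w_c:1 = sqrt(.5) ≈ 0.71
```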

\[
p_1(T \mid (T,T)) \;=\; \frac{1^1 \cdot 1^1 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^0}{(1^1 \cdot 1^1 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^0) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^2)} \;=\; .66 \tag{19}
\]

\[
p_1(F \mid (T,T)) \;=\; \frac{1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^2}{(1^1 \cdot 1^1 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^0) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^2)} \;=\; .33 \tag{20}
\]

\[
p_1(T \mid (T,F)) \;=\; \frac{1^1 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^1}{(1^1 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^1) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^1 \cdot .71^1)} \;=\; .41 \tag{21}
\]

\[
p_1(F \mid (T,F)) \;=\; \frac{1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^1 \cdot .71^1}{(1^1 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^1) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^1 \cdot .71^1)} \;=\; .59 \tag{22}
\]

\[
p_1(T \mid (F,F)) \;=\; \frac{.71^2}{.71^2 + (1.41 \cdot 1.41)} \;=\; .2 \tag{23}
\]

\[
p_1(F \mid (F,F)) \;=\; \frac{1.41 \cdot 1.41}{.71^2 + (1.41 \cdot 1.41)} \;=\; .8 \tag{24}
\]

\[
E_{p_1} f_1 \;=\; (.25 \cdot 1 \cdot .66) + (.25 \cdot 1 \cdot .41) \;=\; .27 \tag{25}
\]

\[
E_{p_1} f_3 \;=\; (.25 \cdot 1 \cdot .59) + (.25 \cdot 1 \cdot .8) \;=\; .34 \tag{26}
\]

\[
w_{1:2} = w_{2:2} = .96 \;\;;\;\; w_{3:2} = w_{4:2} = 1.71 \;\;;\;\; w_{c:2} = .57 \tag{27}
\]
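(27) is the same update once more, using the rounded values from (25)-(26) and E_{p_1} f_c = .77 from the expectation table:

```python
print(round(1 * (.25 / .27) ** .5, 2),      # w_1:2 = w_2:2 ≈ 0.96
      round(1.41 * (.5 / .34) ** .5, 2),    # w_3:2 = w_4:2 ≈ 1.71
      round(.71 * (.5 / .77) ** .5, 2))     # w_c:2 ≈ 0.57
```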

References
Berger, Adam L., Della Pietra, Stephen A., and Della Pietra, Vincent J. (1996) "A Maximum
Entropy Approach to Natural Language Processing," Computational Linguistics, 22(1),
39–71.

Darroch, J.N. and Ratcliff, D. (1972) “Generalized iterative scaling for log-linear models,”
Ann. Math. Statist., 43, 1470–1480.

Ratnaparkhi, A. (1998) "Maximum Entropy Models for Natural Language Ambiguity Resolution,"
Ph.D. thesis, University of Pennsylvania.
