Julia Hockenmaier
juliahmr@illinois.edu, 3324 Siebel Center
Office Hours: Fri, 2:00-3:00pm
http://www.cs.uiuc.edu/~juliahmr/cs598
Estimating P(X → α | X)

Supervised: When we have labels (parse trees), we can use relative frequencies to estimate P(X → α | X):

P(X \to \alpha \mid X) = \frac{f(X \to \alpha)}{f(X)}

Unsupervised: Without labels, we re-estimate iteratively (EM): the current model P^{(n)} yields expected rule counts C, which define the next model:

P^{(n+1)}(X \to \alpha \mid X) = \frac{C(X \to \alpha)}{C(X)}
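As a concrete sketch of the supervised case, relative frequency estimates can be read off a toy treebank like this (the trees and the tuple encoding are illustrative, not from the slides):

```python
from collections import Counter

# Tiny hypothetical treebank: trees as nested tuples (label, children...)
trees = [
    ("S", ("NP", ("N", "dogs")), ("VP", ("V", "bark"))),
    ("S", ("NP", ("N", "cats")), ("VP", ("V", "sleep"))),
]

lhs_counts, rule_counts = Counter(), Counter()

def count_rules(node):
    """Count every rule occurrence and every LHS occurrence in a tree."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        rule = (label, children[0])                      # lexical rule X -> w
    else:
        rule = (label, tuple(c[0] for c in children))    # rule X -> Y Z ...
        for c in children:
            count_rules(c)
    lhs_counts[label] += 1
    rule_counts[rule] += 1

for t in trees:
    count_rules(t)

# Relative frequency estimate: P(X -> alpha | X) = f(X -> alpha) / f(X)
probs = {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}
```

For example, N occurs twice in this toy treebank and rewrites to "dogs" once, so P(N → dogs | N) = 0.5.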
MLE

The maximum likelihood estimate chooses the parameters θ that maximize the probability of the data D:

\hat{\theta} = \arg\max_{\theta} \prod_{x_i \in D} P(x_i \mid \theta)

The relative frequency estimate:

P_{RF}(A \to \alpha \mid A) = \frac{f_D(A \to \alpha)}{f_D(A)}

What is θ for PCFGs? Why does the relative frequency estimate yield an MLE for PCFGs?
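A quick numeric sanity check of the MLE claim: on toy rule counts (assumed here purely for illustration), the relative frequency probabilities give at least as high a likelihood as other parameter settings for the same nonterminal.

```python
from math import log

# Hypothetical observed rule counts for one nonterminal A
counts = {"A -> a": 3, "A -> b": 1}
total = sum(counts.values())

def log_likelihood(probs):
    # Log-probability of the observed rule applications under `probs`
    return sum(c * log(probs[r]) for r, c in counts.items())

# Relative frequency estimate: P_RF = f_D(A -> alpha) / f_D(A)
rfe = {r: c / total for r, c in counts.items()}

# Compare against some alternative distributions over the same rules
alternatives = [{"A -> a": p, "A -> b": 1 - p} for p in (0.5, 0.6, 0.9)]
assert all(log_likelihood(alt) <= log_likelihood(rfe) for alt in alternatives)
```

Here rfe is {A → a: 0.75, A → b: 0.25}, and no alternative distribution over the two rules scores higher on the observed counts.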
CS 598 JH: Advanced NLP (Spring 09)
3
Inside/outside probabilities

For a sentence w = w_1...w_n:
- Inside probability: \beta_{ij}(X) = P(X \Rightarrow^* w_i \dots w_j), the probability that X derives the substring w_{i...j}.
- Outside probability: \alpha_{ij}(X) = P(S \Rightarrow^* w_1 \dots w_{i-1}\, X\, w_{j+1} \dots w_n), the probability of deriving the rest of the sentence around X.

These give the expected counts needed for re-estimation:

P^{(n+1)}(X \to Y\,Z \mid X) = \frac{C(X \to Y\,Z)}{C(X)} \qquad P^{(n+1)}(X \to w \mid X) = \frac{C(X \to w)}{C(X)}
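The inside probabilities can be computed bottom-up, CKY-style. A minimal sketch for a CNF grammar (the grammar and probabilities here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical CNF PCFG: binary rules (X, Y, Z) -> prob, lexical (X, w) -> prob
binary = {("S", "A", "A"): 1.0, ("A", "A", "A"): 0.2}
lexical = {("A", "a"): 0.8}

def inside(words):
    """beta[(i, j, X)] = P(X =>* w_i ... w_j), word positions inclusive."""
    n = len(words)
    beta = defaultdict(float)
    # Width-1 spans: lexical rules
    for i, w in enumerate(words):
        for (X, word), p in lexical.items():
            if word == w:
                beta[(i, i, X)] += p
    # Wider spans: sum over binary rules and split points
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (X, Y, Z), p in binary.items():
                for k in range(i, j):
                    beta[(i, j, X)] += p * beta[(i, k, Y)] * beta[(k + 1, j, Z)]
    return beta

beta = inside(["a", "a"])
# P(S =>* "a a") = P(S -> A A) * P(A -> a)^2 = 1.0 * 0.8 * 0.8 = 0.64
```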
Computing C(X)

The number of times we expect to see nonterminal X in D is the sum of the number of times we expect to see X in each of the sentences w_t.
The number of times we expect to see X in w_t is the sum of the number of times we expect to see X in any substring w_{i...j} of w_t.
The number of times we expect to see X in w_{i...j} is

P^{(n)}(X@w_{i...j} \mid w_t) = \frac{P^{(n)}(X@w_{i...j},\, w_t)}{P^{(n)}(w_t)} = \frac{\alpha^{(n)}_{ij}(X)\, \beta^{(n)}_{ij}(X)}{P^{(n)}(w_t)}

Hence:

C_{P^{(n)}}(X) = \sum_{t=1}^{N} \sum_{i=0}^{n_t} \sum_{j=i}^{n_t} \frac{\alpha^{(n)}_{ij}(X)\, \beta^{(n)}_{ij}(X)}{P^{(n)}(w_t)}
Computing C(X → Y Z)

The number of times we expect to see X → Y Z in w_{i...j} is

P^{(n)}(X \to Y Z@w_{i...j} \mid w_t) = \frac{P^{(n)}(X \to Y Z@w_{i...j},\, w_t)}{P^{(n)}(w_t)} = \frac{\sum_{k=i}^{j-1} \alpha^{(n)}_{ij}(X)\, P^{(n)}(X \to Y Z)\, \beta^{(n)}_{ik}(Y)\, \beta^{(n)}_{(k+1)j}(Z)}{P^{(n)}(w_t)}

Hence:

C_{P^{(n)}}(X \to Y Z) = \sum_{t=1}^{N} \sum_{i=0}^{n_t} \sum_{j=i}^{n_t} \sum_{k=i}^{j-1} \frac{\alpha^{(n)}_{ij}(X)\, P^{(n)}(X \to Y Z)\, \beta^{(n)}_{ik}(Y)\, \beta^{(n)}_{(k+1)j}(Z)}{P^{(n)}(w_t)}
Computing C(X → w)

The number of times we expect to see X → w at position i is:
- C(X → w) = 0 if w ≠ w_i
- Otherwise:

P^{(n)}(X \to w@w_i \mid w_t) = \frac{P^{(n)}(X \to w@w_i,\, w_t)}{P^{(n)}(w_t)} = \frac{\alpha^{(n)}_{ii}(X)\, P^{(n)}(X \to w)}{P^{(n)}(w_t)}

Hence:

C_{P^{(n)}}(X \to w) = \sum_{t=1}^{N} \sum_{i:\, w_{ti} = w} \frac{\alpha^{(n)}_{ii}(X)\, P^{(n)}(X \to w)}{P^{(n)}(w_t)}
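Putting the three expected-count formulas together, here is a self-contained sketch of one inside-outside pass over a corpus. The grammar, probabilities, and function names are assumptions for illustration, not the slides' example:

```python
from collections import defaultdict

# Hypothetical CNF PCFG used only for illustration
binary = {("S", "A", "A"): 1.0, ("A", "A", "A"): 0.1}
lexical = {("A", "a"): 0.9}
start = "S"

def inside(words):
    """beta[(i, j, X)] = P(X =>* w_i ... w_j)."""
    n = len(words)
    beta = defaultdict(float)
    for i, w in enumerate(words):
        for (X, word), p in lexical.items():
            if word == w:
                beta[(i, i, X)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (X, Y, Z), p in binary.items():
                for k in range(i, j):
                    beta[(i, j, X)] += p * beta[(i, k, Y)] * beta[(k + 1, j, Z)]
    return beta

def outside(words, beta):
    """alpha[(i, j, X)] = P(S =>* w_1..w_{i-1} X w_{j+1}..w_n), top-down."""
    n = len(words)
    alpha = defaultdict(float)
    alpha[(0, n - 1, start)] = 1.0
    for span in range(n, 1, -1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (X, Y, Z), p in binary.items():
                a = alpha[(i, j, X)]
                if a == 0.0:
                    continue
                for k in range(i, j):
                    alpha[(i, k, Y)] += a * p * beta[(k + 1, j, Z)]
                    alpha[(k + 1, j, Z)] += a * p * beta[(i, k, Y)]
    return alpha

def expected_counts(corpus):
    """C(X), C(X -> Y Z), and C(X -> w), summed over the corpus."""
    c_nt = defaultdict(float)    # C(X)
    c_rule = defaultdict(float)  # C(X -> Y Z) and C(X -> w)
    for words in corpus:
        n = len(words)
        beta = inside(words)
        alpha = outside(words, beta)
        sent_prob = beta[(0, n - 1, start)]  # P(w_t)
        # C(X): sum of alpha * beta over all spans
        for (i, j, X), b in list(beta.items()):
            c_nt[X] += alpha[(i, j, X)] * b / sent_prob
        # C(X -> Y Z): sum over spans and split points
        for (X, Y, Z), p in binary.items():
            for i in range(n):
                for j in range(i + 1, n):
                    for k in range(i, j):
                        c_rule[(X, Y, Z)] += (alpha[(i, j, X)] * p *
                                              beta[(i, k, Y)] *
                                              beta[(k + 1, j, Z)]) / sent_prob
        # C(X -> w): only at positions where w actually occurs
        for (X, w), p in lexical.items():
            for i, wi in enumerate(words):
                if wi == w:
                    c_rule[(X, w)] += alpha[(i, i, X)] * p / sent_prob
    return c_nt, c_rule

c_nt, c_rule = expected_counts([["a", "a"]])
```

The re-estimation step is then just P^(n+1)(X → α | X) = c_rule / c_nt; e.g. on this one-sentence corpus, C(A) = 2 and C(A → a) = 2, so A → a gets probability 1 in the next iteration.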
Caveats

Inside/Outside is not guaranteed to find a global maximum.
The structures it finds don't always correspond to linguistically intuitive ones.
Complexity of re-estimation: O(n^3 |N|^3) for each sentence with n words and a grammar with |N| nonterminals.
Tightness of PCFGs

A PCFG for a language L is tight if it assigns probability mass 1 to L.
A PCFG may assign probability mass < 1 to L if it assigns too much probability to recursive rules, e.g.:

S -> S S   0.9
S -> a     0.1

Test for tightness (Booth & Thompson, 1973):
- Construct a square matrix E, where e_ij = the expected number of occurrences of nonterminal A_j in the right-hand sides of rules with left-hand side A_i.
- If the largest eigenvalue of E has magnitude < 1, the PCFG is tight.
- PCFGs estimated from labeled or unlabeled data are tight (Chi & Geman 1998; Chi 1999).
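The Booth/Thompson test can be sketched in a few lines: build the expectation matrix E and estimate its largest eigenvalue by power iteration (valid here since E is nonnegative). The grammar below is the slides' recursive example, read as S → S S (0.9), S → a (0.1):

```python
def expectation_matrix(rules, nonterminals):
    """E[i][j] = expected number of A_j's produced by one rewrite of A_i.
    rules: list of (lhs, rhs_symbols, prob)."""
    idx = {A: i for i, A in enumerate(nonterminals)}
    E = [[0.0] * len(nonterminals) for _ in nonterminals]
    for lhs, rhs, p in rules:
        for sym in rhs:
            if sym in idx:            # terminals don't count
                E[idx[lhs]][idx[sym]] += p
    return E

def spectral_radius(E, iters=200):
    """Largest eigenvalue magnitude via power iteration (nonnegative E)."""
    n = len(E)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(E[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w) or 1.0
        v = [x / lam for x in w]
    return lam

rules = [("S", ["S", "S"], 0.9), ("S", ["a"], 0.1)]
E = expectation_matrix(rules, ["S"])
rho = spectral_radius(E)  # e_SS = 2 * 0.9 = 1.8
```

Since rho = 1.8 > 1, each rewrite of S produces 1.8 new S's in expectation, derivations fail to terminate with positive probability, and the PCFG is not tight.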