
CS 598 JH: Advanced NLP (Spring 09)

Parameter estimation for PCFGs

Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Office Hours: Fri, 2:00-3:00pm
http://www.cs.uiuc.edu/~juliahmr/cs598

Estimating P(X → α | X)

Supervised:
When we have labels (parse trees), we can use relative frequencies to estimate P(X → α | X):

$$P(X \rightarrow \alpha \mid X) = \frac{f(X \rightarrow \alpha)}{f(X)}$$

Unsupervised: Inside/outside (EM) algorithm
When we don't have parse trees, we need to use expected relative frequencies:

$$P^{(n+1)}(X \rightarrow \alpha \mid X) = \frac{\langle C(X \rightarrow \alpha) \rangle_{P^{(n)}}}{\langle C(X) \rangle_{P^{(n)}}}$$

Start with some initial model P(0)(X → α | X), and recompute expected counts iteratively over the data to get a better guess of P(X → α | X).
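To make the supervised case concrete, here is a minimal Python sketch (not from the slides) of relative-frequency estimation. The toy treebank is hypothetical: each tree is represented simply as the list of (LHS, RHS) rule occurrences it contains.

```python
from collections import Counter

# Hypothetical toy treebank: each tree = the rules used in it.
treebank = [
    [("S", ("NP", "VP")), ("NP", ("she",)), ("VP", ("V", "NP")),
     ("V", ("saw",)), ("NP", ("stars",))],
    [("S", ("NP", "VP")), ("NP", ("stars",)), ("VP", ("V", "NP")),
     ("V", ("saw",)), ("NP", ("her",))],
]

rule_counts = Counter()   # f(X -> alpha)
lhs_counts = Counter()    # f(X)
for tree in treebank:
    for lhs, rhs in tree:
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1

# P(X -> alpha | X) = f(X -> alpha) / f(X)
P = {rule: count / lhs_counts[rule[0]] for rule, count in rule_counts.items()}
for (lhs, rhs), p in sorted(P.items()):
    print(f"P({lhs} -> {' '.join(rhs)} | {lhs}) = {p:.3f}")
```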

Relative frequency estimation yields Maximum Likelihood estimates

Relative frequency estimate:
f_D(...) = number of times the event has been observed in corpus D = x_1...x_n

$$P_{RF}(A \rightarrow \alpha \mid A) = \frac{f_D(A \rightarrow \alpha)}{f_D(A)}$$

Maximum likelihood estimate (MLE) for parameter θ from corpus D = x_1...x_n:

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(D \mid \theta) = \arg\max_{\theta} \prod_{x_i \in D} P(x_i \mid \theta)$$

What is θ for PCFGs? Why does RFE yield an MLE for PCFGs?
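As a quick numeric sanity check (not from the slides), with hypothetical counts f(A → a) = 3 and f(A → b) = 1, the relative-frequency estimate p = 3/4 should maximize the likelihood L(D | p) = p³(1 − p) over a grid of candidate values:

```python
import math

# Hypothetical counts for a two-rule nonterminal A.
f_a, f_b = 3, 1

def log_likelihood(p):
    # log L(D | p) = f_a * log p + f_b * log(1 - p)
    return f_a * math.log(p) + f_b * math.log(1 - p)

best = max((i / 100 for i in range(1, 100)), key=log_likelihood)
print(best)  # 0.75, i.e. f_a / (f_a + f_b): the relative-frequency estimate
```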

The inside-outside algorithm

Inside/outside probabilities

[Figure: a parse tree in which nonterminal XP spans w_i...w_j, with w_1...w_{i-1} to its left and w_{j+1}...w_n to its right]

Outside probability of XP spanning i..j: α_ij(XP)

$$\alpha_{ij}(XP) = P(S \Rightarrow^{*} w_1 \ldots w_{i-1}\; XP\; w_{j+1} \ldots w_n)$$

Inside probability of XP spanning i..j: β_ij(XP)

$$\beta_{ij}(XP) = P(XP \Rightarrow^{*} w_i \ldots w_j)$$
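For reference, here is a minimal sketch of the inside computation (a probabilistic CKY pass), assuming a hypothetical toy CNF grammar; β is built bottom-up exactly as in the definition above. The outside probabilities α are computed by an analogous top-down pass (not shown).

```python
from collections import defaultdict

# Hypothetical toy CNF PCFG: binary and lexical rules with probabilities.
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
lexical = {("NP", "she"): 0.5, ("NP", "stars"): 0.5, ("V", "saw"): 1.0}

def inside(words):
    n = len(words)
    # beta[(i, j)][X] = P(X =>* w_i ... w_j), with 1-indexed spans
    beta = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(words, start=1):
        for (X, word), p in lexical.items():
            if word == w:
                beta[(i, i)][X] += p
    for width in range(2, n + 1):
        for i in range(1, n - width + 2):
            j = i + width - 1
            for k in range(i, j):                # split point
                for (X, Y, Z), p in binary.items():
                    beta[(i, j)][X] += p * beta[(i, k)][Y] * beta[(k + 1, j)][Z]
    return beta

beta = inside(["she", "saw", "stars"])
print(beta[(1, 3)]["S"])  # P(w) under the toy grammar: 0.25
```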

Estimating P(X → α | X) from raw text

- Our data D = {w_1, ..., w_N} consists of raw sentences w_t, each consisting of n_t words: w_t = w_{t1}, ..., w_{tn_t}
- We have a (nonprobabilistic) CFG G = ⟨N, T, R⟩ with nonterminals N, terminals T (w_t ∈ T*), and rules R in Chomsky Normal Form
- We start with some initial PCFG P(0) over G, which defines P(0)(X → α | X), and recompute expected counts iteratively over the data to get a better guess of P(X → α | X):

$$P^{(n+1)}(X \rightarrow \alpha \mid X) = \frac{\langle C(X \rightarrow \alpha) \rangle_{P^{(n)}}}{\langle C(X) \rangle_{P^{(n)}}}$$
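A minimal sketch of this setup in Python, assuming a hypothetical toy CNF grammar; the uniform initializer below is just one possible choice of P(0):

```python
from collections import defaultdict

# Hypothetical CNF grammar: binary rules X -> Y Z and lexical rules X -> w.
binary_rules = {"S": [("NP", "VP")], "VP": [("V", "NP")]}
lexical_rules = {"NP": ["she", "stars"], "V": ["saw"], "VP": ["slept"]}

# One possible initial model P(0): uniform over all expansions of each LHS.
P0 = defaultdict(dict)
for X in set(binary_rules) | set(lexical_rules):
    expansions = list(binary_rules.get(X, []))
    expansions += [(w,) for w in lexical_rules.get(X, [])]
    for rhs in expansions:
        P0[X][rhs] = 1.0 / len(expansions)

print(dict(P0["VP"]))  # {('V', 'NP'): 0.5, ('slept',): 0.5}
```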

Computing ⟨C(X)⟩

The number of times we expect to see nonterminal X in D is the sum of the number of times we expect to see X in each of the sentences w_t.
The number of times we expect to see X in w_t is the sum of the number of times we expect to see X spanning any substring w_{i...j} of w_t.
The number of times we expect to see X spanning w_{i...j} is

$$P^{(n)}(X@w_{i \ldots j} \mid w_t) = \frac{P^{(n)}(X@w_{i \ldots j},\, w_t)}{P^{(n)}(w_t)} = \frac{\alpha^{(n)}_{ij}(X)\, \beta^{(n)}_{ij}(X)}{P^{(n)}(w_t)}$$

Hence:

$$\langle C(X) \rangle_{P^{(n)}} = \sum_{t=1}^{N} \sum_{i=1}^{n_t} \sum_{j=i}^{n_t} \frac{\alpha^{(n)}_{ij}(X)\, \beta^{(n)}_{ij}(X)}{P^{(n)}(w_t)}$$
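A minimal sketch of this computation for a single sentence, assuming alpha and beta are span-indexed tables like the inside sketch earlier (the values in the smoke test are made up) and sent_prob = P(w_t):

```python
from collections import defaultdict

def expected_count_nt(X, alpha, beta, n, sent_prob):
    # sum over all spans i..j of alpha_ij(X) * beta_ij(X), divided by P(w_t)
    return sum(alpha[(i, j)][X] * beta[(i, j)][X]
               for i in range(1, n + 1) for j in range(i, n + 1)) / sent_prob

# Tiny smoke test with hypothetical one-word-sentence tables:
table = lambda: defaultdict(lambda: defaultdict(float))
alpha, beta = table(), table()
alpha[(1, 1)]["NP"], beta[(1, 1)]["NP"] = 1.0, 0.5
print(expected_count_nt("NP", alpha, beta, n=1, sent_prob=0.5))  # 1.0
```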

Computing ⟨C(X → Y Z)⟩

The number of times we expect to see X → Y Z used at span w_{i...j} is

$$P^{(n)}(X \rightarrow YZ\,@\,w_{i \ldots j} \mid w_t) = \frac{P^{(n)}(X \rightarrow YZ\,@\,w_{i \ldots j},\, w_t)}{P^{(n)}(w_t)} = \frac{\sum_{k=i}^{j-1} \alpha^{(n)}_{ij}(X)\, P^{(n)}(X \rightarrow YZ)\, \beta^{(n)}_{ik}(Y)\, \beta^{(n)}_{(k+1)j}(Z)}{P^{(n)}(w_t)}$$

Hence:

$$\langle C(X \rightarrow YZ) \rangle_{P^{(n)}} = \sum_{t=1}^{N} \sum_{i=1}^{n_t} \sum_{j=i}^{n_t} \sum_{k=i}^{j-1} \frac{\alpha^{(n)}_{ij}(X)\, P^{(n)}(X \rightarrow YZ)\, \beta^{(n)}_{ik}(Y)\, \beta^{(n)}_{(k+1)j}(Z)}{P^{(n)}(w_t)}$$
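A minimal sketch for a single sentence, with the same assumed alpha/beta tables as above; p_rule is the current estimate P(n)(X → Y Z):

```python
def expected_count_binary(X, Y, Z, p_rule, alpha, beta, n, sent_prob):
    total = 0.0
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):   # spans of width >= 2
            for k in range(i, j):       # split point between Y and Z
                total += (alpha[(i, j)][X] * p_rule
                          * beta[(i, k)][Y] * beta[(k + 1, j)][Z])
    return total / sent_prob
```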

Computing ⟨C(X → w)⟩

The number of times we expect to see X → w at position i is:
- ⟨C(X → w)⟩ = 0 iff w ≠ w_i
- Otherwise:

$$P^{(n)}(X \rightarrow w\,@\,w_i \mid w_t) = \frac{P^{(n)}(X \rightarrow w\,@\,w_i,\, w_t)}{P^{(n)}(w_t)} = \frac{\alpha^{(n)}_{ii}(X)\, P^{(n)}(X \rightarrow w)}{P^{(n)}(w_t)}$$

Hence:

$$\langle C(X \rightarrow w) \rangle_{P^{(n)}} = \sum_{t=1}^{N} \sum_{i:\, w_{ti} = w} \frac{\alpha^{(n)}_{ii}(X)\, P^{(n)}(X \rightarrow w)}{P^{(n)}(w_t)}$$
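A minimal sketch for a single sentence, with the same assumptions as above; the sum only visits positions i where the i-th word equals w:

```python
def expected_count_lexical(X, w, p_rule, alpha, beta, words, sent_prob):
    total = 0.0
    for i, word in enumerate(words, start=1):
        if word == w:                   # only positions where w_ti = w
            total += alpha[(i, i)][X] * p_rule
    return total / sent_prob
```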

The inside/outside algorithm

Initialization: Assign initial parameters P(0).

E-step: Compute ⟨C(X)⟩_P(n), ⟨C(X → Y Z)⟩_P(n), and ⟨C(X → w)⟩_P(n) for all nonterminals X and all rules X → Y Z and X → w over the entire corpus:

$$\langle C(X) \rangle_{P^{(n)}} = \sum_{t=1}^{N} \sum_{i=1}^{n_t} \sum_{j=i}^{n_t} \frac{\alpha^{(n)}_{ij}(X)\, \beta^{(n)}_{ij}(X)}{P^{(n)}(w_t)}$$

$$\langle C(X \rightarrow YZ) \rangle_{P^{(n)}} = \sum_{t=1}^{N} \sum_{i=1}^{n_t} \sum_{j=i}^{n_t} \sum_{k=i}^{j-1} \frac{\alpha^{(n)}_{ij}(X)\, P^{(n)}(X \rightarrow YZ)\, \beta^{(n)}_{ik}(Y)\, \beta^{(n)}_{(k+1)j}(Z)}{P^{(n)}(w_t)}$$

$$\langle C(X \rightarrow w) \rangle_{P^{(n)}} = \sum_{t=1}^{N} \sum_{i:\, w_{ti} = w} \frac{\alpha^{(n)}_{ii}(X)\, P^{(n)}(X \rightarrow w)}{P^{(n)}(w_t)}$$

M-step: Recompute the new parameters P(n+1):

$$P^{(n+1)}(X \rightarrow YZ) := \frac{\langle C(X \rightarrow YZ) \rangle_{P^{(n)}}}{\langle C(X) \rangle_{P^{(n)}}} \qquad P^{(n+1)}(X \rightarrow w) := \frac{\langle C(X \rightarrow w) \rangle_{P^{(n)}}}{\langle C(X) \rangle_{P^{(n)}}}$$

Repeat the E- and M-steps until convergence.
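A high-level sketch of one EM iteration, tying the pieces together. It assumes the expected-count helpers sketched above, the inside pass from earlier, and a hypothetical outside(words, P, beta) pass (not shown); P maps each rule (X, rhs) to its current probability, and "S" is assumed to be the start symbol. Note that ⟨C(X)⟩ is accumulated here as the total expected count of all rules with LHS X, which is equivalent to the α·β formula above.

```python
from collections import defaultdict

def em_step(corpus, P, inside, outside):
    rule_counts = defaultdict(float)   # <C(X -> alpha)>_{P(n)}
    nt_counts = defaultdict(float)     # <C(X)>_{P(n)}

    # E-step: accumulate expected counts over the entire corpus.
    for words in corpus:
        n = len(words)
        beta = inside(words, P)
        alpha = outside(words, P, beta)     # hypothetical outside pass
        sent_prob = beta[(1, n)]["S"]       # P(w_t)
        for (X, rhs), p in P.items():
            if len(rhs) == 2:               # binary rule X -> Y Z
                c = expected_count_binary(X, rhs[0], rhs[1], p,
                                          alpha, beta, n, sent_prob)
            else:                           # lexical rule X -> w
                c = expected_count_lexical(X, rhs[0], p,
                                           alpha, beta, words, sent_prob)
            rule_counts[(X, rhs)] += c
            nt_counts[X] += c               # <C(X)> = sum of its rule counts

    # M-step: renormalize the expected counts into new parameters P(n+1).
    return {rule: c / nt_counts[rule[0]] for rule, c in rule_counts.items()}
```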

Caveats

Inside/outside is not guaranteed to find a global maximum.
The structures it finds don't always correspond to linguistically intuitive ones.
Complexity of re-estimation: O(n³|N|³) for each sentence with n words and a grammar with |N| nonterminals.

Tightness of PCFGs

A PCFG for a language L is tight if it assigns probability mass 1 to L.
A PCFG may assign probability mass < 1 to L if it assigns too much probability to recursive rules, e.g.:
  S → A A  0.9
  S → a    0.1

Test for tightness (Booth/Thompson 1973):
- Construct a square matrix E, where e_ij = the expected number of occurrences of nonterminal A_j in the rules with LHS A_i
- If the largest eigenvalue of E has magnitude < 1, the PCFG is tight
- PCFGs estimated from labeled or unlabeled data are tight (Chi 1999; Chi & Geman 1998)
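A minimal sketch of this test with NumPy; the grammar below is hypothetical (a self-recursive one-nonterminal variant in the spirit of the example above), chosen so that the non-tight case actually shows up:

```python
import numpy as np

# Hypothetical grammar: S -> S S (0.9), S -> a (0.1).
nonterminals = ["S"]
rules = [("S", ("S", "S"), 0.9), ("S", ("a",), 0.1)]

# e_ij = expected number of occurrences of A_j when A_i is rewritten once.
idx = {A: i for i, A in enumerate(nonterminals)}
E = np.zeros((len(nonterminals), len(nonterminals)))
for lhs, rhs, p in rules:
    for sym in rhs:
        if sym in idx:
            E[idx[lhs], idx[sym]] += p

spectral_radius = np.max(np.abs(np.linalg.eigvals(E)))
print(spectral_radius)                                   # 1.8
print("tight" if spectral_radius < 1 else "not tight")   # not tight
```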

Learning PCFGs from partially bracketed data (Pereira/Schabes '92)

How do you estimate a PCFG if you have a corpus that contains some (unlabeled) brackets?

Head-lexicalized PCFGs (Carroll/Rooth 1998)

How do you extend EM when you want to estimate the parameters of a head-lexicalized PCFG?
This can also be used to learn selectional preferences.
(Challenges: unobserved events, model size)
