Julia Hockenmaier
juliahmr@illinois.edu, 3324 Siebel Center
Office Hours: Fri, 2:00-3:00pm
http://www.cs.uiuc.edu/~juliahmr/cs598
Estimating P(X → α | X)

Supervised: When we have labels (parse trees), we can use relative frequencies to estimate P(X → α | X):

P(X \to \alpha \mid X) = \frac{f(X \to \alpha)}{f(X)}

Unsupervised: Without labels, we re-estimate iteratively (EM): the current model P^{(n)} yields expected rule counts C, which define the next model:

P^{(n+1)}(X \to \alpha \mid X) = \frac{C(X \to \alpha)}{C(X)}
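As a concrete sketch of the supervised case, relative frequency estimates can be read off a toy treebank like this (the trees and the tuple encoding are illustrative, not from the slides):

```python
from collections import Counter

# Tiny hypothetical treebank: trees as nested tuples (label, children...)
trees = [
    ("S", ("NP", ("N", "dogs")), ("VP", ("V", "bark"))),
    ("S", ("NP", ("N", "cats")), ("VP", ("V", "sleep"))),
]

lhs_counts, rule_counts = Counter(), Counter()

def count_rules(node):
    """Count every rule occurrence and every LHS occurrence in a tree."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        rule = (label, children[0])                      # lexical rule X -> w
    else:
        rule = (label, tuple(c[0] for c in children))    # rule X -> Y Z ...
        for c in children:
            count_rules(c)
    lhs_counts[label] += 1
    rule_counts[rule] += 1

for t in trees:
    count_rules(t)

# Relative frequency estimate: P(X -> alpha | X) = f(X -> alpha) / f(X)
probs = {r: c / lhs_counts[r[0]] for r, c in rule_counts.items()}
```

For example, N occurs twice in this toy treebank and rewrites to "dogs" once, so P(N → dogs | N) = 0.5.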
MLE

The maximum likelihood estimate chooses the parameters θ that maximize the probability of the data D:

\hat{\theta} = \arg\max_{\theta} \prod_{x_i \in D} P(x_i \mid \theta)

The relative frequency estimate:

P_{RF}(A \to \alpha \mid A) = \frac{f_D(A \to \alpha)}{f_D(A)}

What is θ for PCFGs? Why does the relative frequency estimate yield an MLE for PCFGs?
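A quick numeric sanity check of the MLE claim: on toy rule counts (assumed here purely for illustration), the relative frequency probabilities give at least as high a likelihood as other parameter settings for the same nonterminal.

```python
from math import log

# Hypothetical observed rule counts for one nonterminal A
counts = {"A -> a": 3, "A -> b": 1}
total = sum(counts.values())

def log_likelihood(probs):
    # Log-probability of the observed rule applications under `probs`
    return sum(c * log(probs[r]) for r, c in counts.items())

# Relative frequency estimate: P_RF = f_D(A -> alpha) / f_D(A)
rfe = {r: c / total for r, c in counts.items()}

# Compare against some alternative distributions over the same rules
alternatives = [{"A -> a": p, "A -> b": 1 - p} for p in (0.5, 0.6, 0.9)]
assert all(log_likelihood(alt) <= log_likelihood(rfe) for alt in alternatives)
```

Here rfe is {A → a: 0.75, A → b: 0.25}, and no alternative distribution over the two rules scores higher on the observed counts.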
CS 598 JH: Advanced NLP (Spring 09)
3
Inside/outside probabilities

For a sentence w = w_1...w_n:
- Inside probability: \beta_{ij}(X) = P(X \Rightarrow^* w_i \dots w_j), the probability that X derives the substring w_{i...j}.
- Outside probability: \alpha_{ij}(X) = P(S \Rightarrow^* w_1 \dots w_{i-1}\, X\, w_{j+1} \dots w_n), the probability of deriving the rest of the sentence around X.

These give the expected counts needed for re-estimation:

P^{(n+1)}(X \to Y\,Z \mid X) = \frac{C(X \to Y\,Z)}{C(X)} \qquad P^{(n+1)}(X \to w \mid X) = \frac{C(X \to w)}{C(X)}
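The inside probabilities can be computed bottom-up, CKY-style. A minimal sketch for a CNF grammar (the grammar and probabilities here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical CNF PCFG: binary rules (X, Y, Z) -> prob, lexical (X, w) -> prob
binary = {("S", "A", "A"): 1.0, ("A", "A", "A"): 0.2}
lexical = {("A", "a"): 0.8}

def inside(words):
    """beta[(i, j, X)] = P(X =>* w_i ... w_j), word positions inclusive."""
    n = len(words)
    beta = defaultdict(float)
    # Width-1 spans: lexical rules
    for i, w in enumerate(words):
        for (X, word), p in lexical.items():
            if word == w:
                beta[(i, i, X)] += p
    # Wider spans: sum over binary rules and split points
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (X, Y, Z), p in binary.items():
                for k in range(i, j):
                    beta[(i, j, X)] += p * beta[(i, k, Y)] * beta[(k + 1, j, Z)]
    return beta

beta = inside(["a", "a"])
# P(S =>* "a a") = P(S -> A A) * P(A -> a)^2 = 1.0 * 0.8 * 0.8 = 0.64
```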
Computing C(X)

The number of times we expect to see nonterminal X in D is the sum of the number of times we expect to see X in each of the sentences w_t.
The number of times we expect to see X in w_t is the sum of the number of times we expect to see X in any substring w_{i...j} of w_t.
The number of times we expect to see X in w_{i...j} is

P^{(n)}(X@w_{i...j} \mid w_t) = \frac{P^{(n)}(X@w_{i...j},\, w_t)}{P^{(n)}(w_t)} = \frac{\alpha^{(n)}_{ij}(X)\, \beta^{(n)}_{ij}(X)}{P^{(n)}(w_t)}

Hence:

C_{P^{(n)}}(X) = \sum_{t=1}^{N} \sum_{i=0}^{n_t} \sum_{j=i}^{n_t} \frac{\alpha^{(n)}_{ij}(X)\, \beta^{(n)}_{ij}(X)}{P^{(n)}(w_t)}
Computing C(X → Y Z)

The number of times we expect to see X → Y Z in w_{i...j} is

P^{(n)}(X \to Y Z@w_{i...j} \mid w_t) = \frac{P^{(n)}(X \to Y Z@w_{i...j},\, w_t)}{P^{(n)}(w_t)} = \frac{\sum_{k=i}^{j-1} \alpha^{(n)}_{ij}(X)\, P^{(n)}(X \to Y Z)\, \beta^{(n)}_{ik}(Y)\, \beta^{(n)}_{(k+1)j}(Z)}{P^{(n)}(w_t)}

Hence:

C_{P^{(n)}}(X \to Y Z) = \sum_{t=1}^{N} \sum_{i=0}^{n_t} \sum_{j=i}^{n_t} \sum_{k=i}^{j-1} \frac{\alpha^{(n)}_{ij}(X)\, P^{(n)}(X \to Y Z)\, \beta^{(n)}_{ik}(Y)\, \beta^{(n)}_{(k+1)j}(Z)}{P^{(n)}(w_t)}
Computing C(X → w)

The number of times we expect to see X → w at position i is:
- C(X → w) = 0 if w ≠ w_i
- Otherwise:

P^{(n)}(X \to w@w_i \mid w_t) = \frac{P^{(n)}(X \to w@w_i,\, w_t)}{P^{(n)}(w_t)} = \frac{\alpha^{(n)}_{ii}(X)\, P^{(n)}(X \to w)}{P^{(n)}(w_t)}

Hence:

C_{P^{(n)}}(X \to w) = \sum_{t=1}^{N} \sum_{i:\, w_{ti} = w} \frac{\alpha^{(n)}_{ii}(X)\, P^{(n)}(X \to w)}{P^{(n)}(w_t)}
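Putting the three expected-count formulas together, here is a self-contained sketch of one inside-outside pass over a corpus. The grammar, probabilities, and function names are assumptions for illustration, not the slides' example:

```python
from collections import defaultdict

# Hypothetical CNF PCFG used only for illustration
binary = {("S", "A", "A"): 1.0, ("A", "A", "A"): 0.1}
lexical = {("A", "a"): 0.9}
start = "S"

def inside(words):
    """beta[(i, j, X)] = P(X =>* w_i ... w_j)."""
    n = len(words)
    beta = defaultdict(float)
    for i, w in enumerate(words):
        for (X, word), p in lexical.items():
            if word == w:
                beta[(i, i, X)] += p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (X, Y, Z), p in binary.items():
                for k in range(i, j):
                    beta[(i, j, X)] += p * beta[(i, k, Y)] * beta[(k + 1, j, Z)]
    return beta

def outside(words, beta):
    """alpha[(i, j, X)] = P(S =>* w_1..w_{i-1} X w_{j+1}..w_n), top-down."""
    n = len(words)
    alpha = defaultdict(float)
    alpha[(0, n - 1, start)] = 1.0
    for span in range(n, 1, -1):
        for i in range(n - span + 1):
            j = i + span - 1
            for (X, Y, Z), p in binary.items():
                a = alpha[(i, j, X)]
                if a == 0.0:
                    continue
                for k in range(i, j):
                    alpha[(i, k, Y)] += a * p * beta[(k + 1, j, Z)]
                    alpha[(k + 1, j, Z)] += a * p * beta[(i, k, Y)]
    return alpha

def expected_counts(corpus):
    """C(X), C(X -> Y Z), and C(X -> w), summed over the corpus."""
    c_nt = defaultdict(float)    # C(X)
    c_rule = defaultdict(float)  # C(X -> Y Z) and C(X -> w)
    for words in corpus:
        n = len(words)
        beta = inside(words)
        alpha = outside(words, beta)
        sent_prob = beta[(0, n - 1, start)]  # P(w_t)
        # C(X): sum of alpha * beta over all spans
        for (i, j, X), b in list(beta.items()):
            c_nt[X] += alpha[(i, j, X)] * b / sent_prob
        # C(X -> Y Z): sum over spans and split points
        for (X, Y, Z), p in binary.items():
            for i in range(n):
                for j in range(i + 1, n):
                    for k in range(i, j):
                        c_rule[(X, Y, Z)] += (alpha[(i, j, X)] * p *
                                              beta[(i, k, Y)] *
                                              beta[(k + 1, j, Z)]) / sent_prob
        # C(X -> w): only at positions where w actually occurs
        for (X, w), p in lexical.items():
            for i, wi in enumerate(words):
                if wi == w:
                    c_rule[(X, w)] += alpha[(i, i, X)] * p / sent_prob
    return c_nt, c_rule

c_nt, c_rule = expected_counts([["a", "a"]])
```

The re-estimation step is then just P^(n+1)(X → α | X) = c_rule / c_nt; e.g. on this one-sentence corpus, C(A) = 2 and C(A → a) = 2, so A → a gets probability 1 in the next iteration.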
Caveats

Inside/Outside is not guaranteed to find a global maximum.
The structures it finds don't always correspond to linguistically intuitive ones.
Complexity of re-estimation: O(n^3 |N|^3) for each sentence with n words and a grammar with |N| nonterminals.
Tightness of PCFGs

A PCFG for a language L is tight if it assigns probability mass 1 to L.
A PCFG may assign probability mass < 1 to L if it assigns too much probability to recursive rules, e.g.:

S -> S S   0.9
S -> a     0.1

Test for tightness (Booth & Thompson, 1973):
- Construct a square matrix E, where e_ij = the expected number of occurrences of nonterminal A_j in the right-hand sides of rules with left-hand side A_i.
- If the largest eigenvalue of E has magnitude < 1, the PCFG is tight.
- PCFGs estimated from labeled or unlabeled data are tight (Chi & Geman 1998; Chi 1999).
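The Booth/Thompson test can be sketched in a few lines: build the expectation matrix E and estimate its largest eigenvalue by power iteration (valid here since E is nonnegative). The grammar below is the slides' recursive example, read as S → S S (0.9), S → a (0.1):

```python
def expectation_matrix(rules, nonterminals):
    """E[i][j] = expected number of A_j's produced by one rewrite of A_i.
    rules: list of (lhs, rhs_symbols, prob)."""
    idx = {A: i for i, A in enumerate(nonterminals)}
    E = [[0.0] * len(nonterminals) for _ in nonterminals]
    for lhs, rhs, p in rules:
        for sym in rhs:
            if sym in idx:            # terminals don't count
                E[idx[lhs]][idx[sym]] += p
    return E

def spectral_radius(E, iters=200):
    """Largest eigenvalue magnitude via power iteration (nonnegative E)."""
    n = len(E)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(E[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(abs(x) for x in w) or 1.0
        v = [x / lam for x in w]
    return lam

rules = [("S", ["S", "S"], 0.9), ("S", ["a"], 0.1)]
E = expectation_matrix(rules, ["S"])
rho = spectral_radius(E)  # e_SS = 2 * 0.9 = 1.8
```

Since rho = 1.8 > 1, each rewrite of S produces 1.8 new S's in expectation, derivations fail to terminate with positive probability, and the PCFG is not tight.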