
Tutorial on
Probabilistic Context-Free Grammars

Raphael Hoffmann
590AI, Winter 2009
Outline
PCFGs: Inference and Learning
Parsing English
Discriminative CFGs
Grammar Induction
Image Search for "pcfg"
[Live Search screenshot]
Outline
PCFGs: Inference and Learning
Parsing English
Discriminative CFGs
Grammar Induction
The velocity of the seismic waves rises to ...
[example parse tree]

Slide based on Foundations of Statistical Natural Language Processing by Christopher Manning and Hinrich Schütze


A CFG consists of
Terminals $w^1, w^2, \ldots, w^V$
Nonterminals $N^1, N^2, \ldots, N^n$
Start symbol $N^1$
Rules $N^i \to \zeta^j$, where $\zeta^j$ is a sequence of terminals and nonterminals


A (generative) PCFG consists of
Terminals $w^1, w^2, \ldots, w^V$
Nonterminals $N^1, N^2, \ldots, N^n$
Start symbol $N^1$
Rules $N^i \to \zeta^j$, where $\zeta^j$ is a sequence of terminals and nonterminals
Rule probabilities, such that
$$\forall i: \sum_j P(N^i \to \zeta^j) = 1$$
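For instance, taking the NP rules of the astronomers grammar used later in this deck, the constraint instantiates as:

$$P(\text{NP} \to \text{NP PP}) + P(\text{NP} \to \text{astronomers}) + \ldots = 0.4 + 0.1 + 0.18 + 0.04 + 0.18 + 0.1 = 1$$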


Notation
sentence: sequence of words $w_1 w_2 \ldots w_m$
$w_{ab}$: the subsequence $w_a \ldots w_b$
$N^i_{ab}$: nonterminal $N^i$ dominates $w_a \ldots w_b$
$N^i \overset{*}{\Rightarrow} \zeta$: repeated derivation from $N^i$ gives $\zeta$


Probability of a Sentence
$$P(w_{1m}) = \sum_t P(w_{1m}, t)$$
where $t$ is a parse tree of $w_{1m}$


Example
Terminals: with, saw, astronomers, ears, stars, telescopes
Nonterminals: S, PP, P, NP, VP, V
Start symbol: S
Rules (the table is pinned down by the probabilities used in the parse and CYK slides below):

S → NP VP     1.0        NP → NP PP          0.4
PP → P NP     1.0        NP → astronomers    0.1
VP → V NP     0.7        NP → ears           0.18
VP → VP PP    0.3        NP → saw            0.04
P → with      1.0        NP → stars          0.18
V → saw       1.0        NP → telescopes     0.1


astronomers saw stars with ears
[parse tree $t_1$: the PP "with ears" attaches to the NP "stars"]


astronomers saw stars with ears
[parse tree $t_2$: the PP "with ears" attaches to the VP]


Probabilities
$$P(t_1) = 1.0 \times 0.1 \times 0.7 \times 1.0 \times 0.4 \times 0.18 \times 1.0 \times 1.0 \times 0.18 = 0.0009072$$
$$P(t_2) = 1.0 \times 0.1 \times 0.3 \times 0.7 \times 1.0 \times 0.18 \times 1.0 \times 1.0 \times 0.18 = 0.0006804$$
$$P(w_{15}) = P(t_1) + P(t_2) = 0.0015876$$


Assumptions of PCFGs
1. Place invariance (like time invariance in HMMs):
$\forall k: P(N^j_{k(k+c)} \to \zeta)$ is the same
2. Context free:
$P(N^j_{kl} \to \zeta \mid \text{words outside } w_k \ldots w_l) = P(N^j_{kl} \to \zeta)$
3. Ancestor free:
$P(N^j_{kl} \to \zeta \mid \text{ancestor nodes of } N^j_{kl}) = P(N^j_{kl} \to \zeta)$


Some Features of PCFGs
Partial solution for grammar ambiguity
Can be learned from positive data alone (but grammar induction is difficult)
Robustness (admit everything, with low probability)
Gives a probabilistic language model
Predictive power better than that of an HMM


Some Features of PCFGs
Not lexicalized (probabilities do not factor in lexical co-occurrence)
A PCFG is a worse language model for English than n-gram models
Certain biases: smaller trees are more probable (the average WSJ sentence has 23 words)


Inconsistent Distributions
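A standard illustration, supplied here as an assumption about what the slide showed: take the grammar with $P(S \to S\,S) = p$ and $P(S \to a) = 1 - p$. The probability that a derivation terminates is the smallest non-negative solution of

$$x = (1-p) + p\,x^2 \quad\Rightarrow\quad x = \min\!\left(1, \tfrac{1-p}{p}\right)$$

so for $p > 1/2$ the PCFG puts total probability less than 1 on finite sentences, e.g. $2/3$ for $p = 0.6$.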


Questions
Let $w_{1m}$ be a sentence, $G$ a grammar, and $t$ a parse tree.
What is the probability of a sentence? $P(w_{1m} \mid G)$
What is the most likely parse of a sentence? $\arg\max_t P(t \mid w_{1m}, G)$
What rule probabilities maximize the probabilities of sentences? Find the $G$ that maximizes $P(w_{1m} \mid G)$


Chomsky Normal Form
Any CFG can be represented in CNF, where all rules take the form
$N^i \to N^j N^k$
$N^i \to w^j$
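As an illustration not on the slides: an $n$-ary rule is binarized by introducing fresh nonterminals, e.g.

$$\text{VP} \to \text{V NP PP} \quad\leadsto\quad \text{VP} \to \text{V } @\text{VP}, \qquad @\text{VP} \to \text{NP PP}$$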


HMMs and PCFGs
HMMs: distribution over strings of a certain length
$$\forall n: \sum_{w_{1n}} P(w_{1n}) = 1$$
PCFGs: distribution over the strings of a language $L$
$$\sum_{w \in L} P(w) = 1$$
Consider (example lost in this rendering): high probability in the HMM, low probability in the PCFG
Probabilistic Regular Grammar:
$N^i \to w^j N^k$
$N^i \to w^j$


Inside and Outside Probabilities
For HMMs we have
Forwards: $\alpha_i(t) = P(w_{1(t-1)}, X_t = i)$
Backwards: $\beta_i(t) = P(w_{tT} \mid X_t = i)$
For PCFGs we have
Outside: $\alpha_j(p, q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} \mid G)$
Inside: $\beta_j(p, q) = P(w_{pq} \mid N^j_{pq}, G)$


Probability of a Sentence
Outside: $\alpha_j(p, q) = P(w_{1(p-1)}, N^j_{pq}, w_{(q+1)m} \mid G)$
Inside: $\beta_j(p, q) = P(w_{pq} \mid N^j_{pq}, G)$
$$P(w_{1m} \mid G) = \beta_1(1, m)$$
$$P(w_{1m} \mid G) = \sum_j \alpha_j(k, k)\, P(N^j \to w_k) \quad \text{(for any position } k\text{)}$$
Inside Probabilities
$$\beta_j(p, q) = P(w_{pq} \mid N^j_{pq}, G)$$
Base case:
$$\beta_j(k, k) = P(w_{kk} \mid N^j_{kk}, G) = P(N^j \to w_k \mid G)$$
Induction: we want to find $\beta_j(p, q)$ for $p < q$.
Since we assume Chomsky Normal Form, the first rule must be of the form $N^j \to N^r N^s$, so we can divide the sentence in two at various places and sum the result:
$$\beta_j(p, q) = \sum_{r,s} \sum_{d=p}^{q-1} P(N^j \to N^r N^s)\, \beta_r(p, d)\, \beta_s(d+1, q)$$
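As a concrete sketch (not from the slides), the recursion can be implemented directly. The rule dicts below encode the astronomers grammar reconstructed earlier; the names BINARY, LEXICAL, and inside are illustrative:

```python
# A minimal sketch of the inside algorithm (probabilistic CYK) for a PCFG in CNF.
from collections import defaultdict

# Binary rules (parent, left, right) -> prob; lexical rules (parent, word) -> prob
BINARY = {("S", "NP", "VP"): 1.0, ("PP", "P", "NP"): 1.0,
          ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
          ("NP", "NP", "PP"): 0.4}
LEXICAL = {("P", "with"): 1.0, ("V", "saw"): 1.0,
           ("NP", "astronomers"): 0.1, ("NP", "ears"): 0.18,
           ("NP", "saw"): 0.04, ("NP", "stars"): 0.18,
           ("NP", "telescopes"): 0.1}

def inside(words):
    """Return a chart beta[(p, q)][N] = P(w_pq | N_pq), spans 1-based inclusive."""
    m = len(words)
    beta = defaultdict(lambda: defaultdict(float))
    for k, word in enumerate(words, start=1):      # base case: beta_j(k,k) = P(N^j -> w_k)
        for (parent, w), prob in LEXICAL.items():
            if w == word:
                beta[(k, k)][parent] += prob
    for span in range(2, m + 1):                   # induction: widen spans
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (parent, left, right), prob in BINARY.items():
                for d in range(p, q):              # split into w_p..w_d and w_{d+1}..w_q
                    beta[(p, q)][parent] += prob * beta[(p, d)][left] * beta[(d + 1, q)][right]
    return beta

chart = inside("astronomers saw stars with ears".split())
print(chart[(1, 5)]["S"])   # ~0.0015876, matching P(t1) + P(t2) on the earlier slide
```

Replacing the sum over split points with a max (and keeping backpointers) turns the same chart into a Viterbi parser for the most likely parse.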


CYK Algorithm
Chart for "astronomers saw stars with ears" (cells hold inside probabilities):

NP = 0.1 (astronomers)   V = 1.0, NP = 0.04 (saw)   NP = 0.18 (stars)   P = 1.0 (with)   NP = 0.18 (ears)
VP = 0.126 (saw stars)   PP = 0.18 (with ears)
S = 0.0126 (astronomers saw stars)   NP = 0.01296 (stars with ears)
VP = 0.015876 (saw stars with ears)
S = 0.0015876 (astronomers saw stars with ears)

Worst case: O(m³r)
m = length of sentence
r = number of rules in the grammar
n = number of nonterminals
If we consider all possible CNF rules: O(m³n³)


Outside Probabilities
Compute top-down (after the inside probabilities).
Base case:
$$\alpha_1(1, m) = 1$$
$$\alpha_j(1, m) = 0 \quad \text{for } j \neq 1$$
Induction:
$$\alpha_j(p, q) = \sum_{f,g} \sum_{e=q+1}^{m} \alpha_f(p, e)\, P(N^f \to N^j N^g)\, \beta_g(q+1, e) \;+\; \sum_{f,g} \sum_{e=1}^{p-1} \alpha_f(e, q)\, P(N^f \to N^g N^j)\, \beta_g(e, p-1)
$$
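Continuing the sketch above (same assumptions; reuses BINARY, inside, and the defaultdict import), the outside pass can equivalently push probability from parents down to children rather than following the slide's formula literally:

```python
def outside(words, beta, start="S"):
    """Return alpha[(p, q)][N] = P(w_1(p-1), N_pq, w_(q+1)m | G)."""
    m = len(words)
    alpha = defaultdict(lambda: defaultdict(float))
    alpha[(1, m)][start] = 1.0                   # base case: root spans the whole sentence
    for span in range(m, 1, -1):                 # top-down: shrink spans
        for p in range(1, m - span + 2):
            q = p + span - 1
            for (parent, left, right), prob in BINARY.items():
                a = alpha[(p, q)][parent]
                if a == 0.0:
                    continue
                for d in range(p, q):            # parent covers (p,q); children split at d
                    alpha[(p, d)][left] += a * prob * beta[(d + 1, q)][right]
                    alpha[(d + 1, q)][right] += a * prob * beta[(p, d)][left]
    return alpha

words = "astronomers saw stars with ears".split()
beta = inside(words)
alpha = outside(words, beta)
# Sanity check: sum_j alpha_j(k,k) * beta_j(k,k) = P(w_1m | G) at every position k
print(sum(alpha[(3, 3)][n] * beta[(3, 3)][n] for n in list(beta[(3, 3)])))  # ~0.0015876
```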


Probability of a Node Existing
As with an HMM, we can form a product of the inside and outside probabilities:
$$\alpha_j(p, q)\, \beta_j(p, q) = P(w_{1m}, N^j_{pq} \mid G)$$
Therefore,
$$P(w_{1m}, N_{pq} \mid G) = \sum_j \alpha_j(p, q)\, \beta_j(p, q)$$
Just in the cases of the root node and the preterminals, we know there will be some such constituent.
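One step the slides leave implicit: dividing by the sentence probability gives the posterior probability that a constituent $N^j$ spans $w_p \ldots w_q$, which is the quantity the EM re-estimation below accumulates:

$$P(N^j_{pq} \mid w_{1m}, G) = \frac{\alpha_j(p, q)\, \beta_j(p, q)}{P(w_{1m} \mid G)}$$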
Training
If we have tree-annotated data, count:
$$P(N^j \to \zeta) = \frac{C(N^j \to \zeta)}{\sum_\gamma C(N^j \to \gamma)}$$
Else use EM (the Inside-Outside Algorithm):
repeat
  compute $\alpha$s and $\beta$s
  compute
    $P(N^j \to N^r N^s) = \ldots$
    $P(N^j \to w^k) = \ldots$
end
(two really long formulas with $\alpha$s and $\beta$s)
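For reference, the "two really long formulas" are the standard inside-outside re-estimates (after Manning & Schütze, ch. 11), stated here for a single training sentence:

$$\hat{P}(N^j \to N^r N^s) = \frac{\sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{d=p}^{q-1} \alpha_j(p, q)\, P(N^j \to N^r N^s)\, \beta_r(p, d)\, \beta_s(d+1, q)}{\sum_{p=1}^{m} \sum_{q=p}^{m} \alpha_j(p, q)\, \beta_j(p, q)}$$

$$\hat{P}(N^j \to w^k) = \frac{\sum_{h=1}^{m} \alpha_j(h, h)\, \beta_j(h, h)\, \mathbf{1}\{w_h = w^k\}}{\sum_{p=1}^{m} \sum_{q=p}^{m} \alpha_j(p, q)\, \beta_j(p, q)}$$

(a common factor $1/P(w_{1m} \mid G)$ cancels here, but matters when summing over a corpus)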


EM Problems
Slow: O(m³n³) for each sentence and each iteration
Local maxima (Charniak: 300 trials led to 300 different maxima)
In practice, one needs >3 times more nonterminals than are theoretically needed
No guarantee that the learned nonterminals correspond to NP, VP, ...


Bracketing Helps
Pereira & Schabes '92:
Train on sentences: 37% of predicted brackets correct
Train on sentences + brackets: 90% of predicted brackets correct


Grammar Induction
Rules are typically selected by a linguist
Automatic induction is very difficult for context-free languages
It is easy to find some form of structure, but it bears little resemblance to that of linguistics/NLP


Outline
PCFGs: Inference and Learning
Parsing English
Discriminative CFGs
Grammar Induction

Parsing for Disambiguation
The post office will hold out discounts and service concessions as incentives.


Parsing for Disambiguation
There are typically many syntactically possible parses
We want to find the most likely parses


Treebanks
If grammar induction does not work, why not count expansions in many parse trees?
Penn Treebank


PCFG Weaknesses
No context (immediate prior context, speaker, ...)
No lexicalization: VP → V NP NP is more likely if the verb is "hand" or "tell"; PCFGs fail to capture lexical dependencies (n-grams do)
No structural context: how an NP expands depends on its position


PCFG Weaknesses

Expansion        % as Subj   % as Obj
NP → PRP           13.7%       2.1%
NP → NNP            3.5%       0.9%
NP → DT NN          5.6%       4.6%
NP → NN             1.4%       2.8%
NP → NP SBAR        0.5%       2.6%
NP → NP PP          5.6%      14.1%

Expansion        % as 1st Obj   % as 2nd Obj
NP → NNS               7.5%         0.2%
NP → PRP              13.4%         0.9%
NP → NP PP            12.2%        14.4%
NP → DT NN            10.4%        13.3%
NP → NNP               4.5%         5.9%
NP → NN                3.9%         9.2%
NP → JJ NN             1.1%        10.4%
NP → NP SBAR           0.3%         5.1%
Other Approaches
Challenge: use lexical and structural context without too many parameters or sparse data
Other grammars:
Probabilistic Left-Corner Grammars
Phrase Structure Grammars
Dependency Grammars
Probabilistic Tree Substitution Grammars
History-based Grammars


Outline
PCFGs: Inference and Learning
Parsing English
Discriminative CFGs
Grammar Induction

Generative vs. Discriminative
An HMM (or PCFG) is a generative model: $P(y, w)$
Often a discriminative model is sufficient: $P(y \mid w)$
Easier, because it does not contain $P(w)$
An HMM cannot model dependent features, so one picks only one feature: the word's identity
Generative and Discriminative Models
[Diagram after Sutton & McCallum: generative directed models — Naive Bayes (single class), HMMs (sequence), PCFGs (tree), general directed models (general graphs); conditioning yields their discriminative counterparts — logistic regression, linear-chain CRFs, an open "?" for trees, and general CRFs]

Slide based on An Introduction to Conditional Random Fields for Relational Learning by Charles Sutton and Andrew McCallum
Discriminative Context-Free Grammars
Terminals $w^1, w^2, \ldots, w^V$
Nonterminals $N^1, N^2, \ldots, N^n$
Start symbol $N^1$
Rules $N^i \to \zeta^j$, where $\zeta^j$ is a sequence of terminals and nonterminals
Rule scores:
$$S(N^i \to \zeta^j, p, q) = \sum_{k=1}^{F} \lambda_k(N^i \to \zeta^j)\, f_k(w_1 w_2 \ldots w_m, p, q, N^i \to \zeta^j)$$

Slide based on Learning to Extract Information from Semi-Structured Text using a Discriminative Context-Free Grammar by Paul Viola and Mukund Narasimhan


Features
$$S(N^i \to \zeta^j, p, q) = \sum_{k=1}^{F} \lambda_k(N^i \to \zeta^j)\, f_k(w_1 w_2 \ldots w_m, p, q, N^i \to \zeta^j)$$
Features can depend on all tokens plus the span.
Consider the feature AllOnTheSameLine:
"Mavis Wood Products" on one line vs. "Mavis Wood / Products" wrapped across lines
[compare to a linear-chain CRF: $f_k(s_t, s_{t-1}, w_1 w_2 \ldots w_m, t)$]
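As an illustrative sketch of such a span feature (the function name and the line_of encoding are assumptions, not the paper's API):

```python
# A span feature in the spirit of AllOnTheSameLine: fires iff every token in
# the span w_p..w_q (1-based, inclusive) sits on the same physical line.
def all_on_the_same_line(tokens, line_of, p, q):
    """tokens: list of words; line_of: line number of each token."""
    return 1.0 if len({line_of[i - 1] for i in range(p, q + 1)}) == 1 else 0.0

tokens = ["Mavis", "Wood", "Products"]
print(all_on_the_same_line(tokens, [1, 1, 1], 1, 3))  # 1.0: all on line 1
print(all_on_the_same_line(tokens, [1, 1, 2], 1, 3))  # 0.0: "Products" wrapped
```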


Features
$$S(N^i \to \zeta^j, p, q) = \sum_{k=1}^{F} \lambda_k(N^i \to \zeta^j)\, f_k(w_1 w_2 \ldots w_m, p, q, N^i \to \zeta^j)$$
No independence between features is necessary
Can create features based on words, dictionaries, digits, capitalization, ...
Can still do efficient Viterbi inference in O(m³r)


Example
BizContact → BizName Address BizPhone
PersonalContact → BizName Address HomePhone


Training
Train a feature weight vector for each rule
We have labels, but not parse trees; efficiently create trees by ignoring leaves


Collins' Averaged Perceptron
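The algorithm itself did not survive this rendering; below is a hedged sketch of Collins-style averaged perceptron training for this setting. phi (feature counts of a parse) and best_parse (CYK/Viterbi decoding under the current weights) are assumed helpers, not names from the paper:

```python
# A minimal sketch of the averaged structured perceptron.
from collections import defaultdict

def perceptron_train(data, phi, best_parse, epochs=10):
    """data: list of (sentence, gold_tree). Returns averaged weights."""
    w = defaultdict(float)       # current weights
    total = defaultdict(float)   # running sum of weights, for averaging
    steps = 0
    for _ in range(epochs):
        for sentence, gold in data:
            guess = best_parse(sentence, w)
            if guess != gold:    # standard update: promote gold, demote guess
                for feat, count in phi(sentence, gold).items():
                    w[feat] += count
                for feat, count in phi(sentence, guess).items():
                    w[feat] -= count
            for feat, value in w.items():   # naive averaging; fine for a sketch
                total[feat] += value
            steps += 1
    return {feat: value / steps for feat, value in total.items()}
```

Averaging the weight vector over all updates is what makes the perceptron stable enough to use as a structured learner here.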


Results

                    Linear CRF   Discriminative CFG   Improvement
Word Error Rate       11.57%          6.29%             45.63%
Record Error Rate     54.71%         27.13%             50.41%


Outline
PCFGs: Inference and Learning
Parsing English
Discriminative CFGs
Grammar Induction

Gold's Theorem ('67)
Any formal language which has hierarchical structure capable of infinite recursion is unlearnable from positive evidence alone.

Slide based on Wikipedia
Empirical Problems
Even finite search spaces can be too big
Noise
Insufficient data
Many local optima

Slide based on Unsupervised Grammar Induction with Minimum Description Length by Roni Katzir


Common Approach
Minimize total description length
Simulated annealing


random_neighbor(G)
Insert
Delete
New Rule
Split
Substitute


Energy
Define a binary representation for $G$ and an encoding $\mathrm{code}(D \mid G)$ of the data given the grammar; the energy is the total description length $|\mathrm{code}(G)| + |\mathrm{code}(D \mid G)|$.
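A minimal sketch of the resulting search loop, assuming random_neighbor from the slides and a description_length function (the latter name is an assumption) as primitives:

```python
# MDL-based grammar search via simulated annealing.
import math
import random

def anneal(grammar, data, random_neighbor, description_length,
           temp=10.0, cooling=0.999, steps=100_000):
    """Accept shorter descriptions greedily; accept longer ones with
    probability exp(-delta / temp), cooling the temperature each step."""
    current = grammar
    energy = description_length(current, data)   # |code(G)| + |code(D|G)| in bits
    for _ in range(steps):
        candidate = random_neighbor(current)     # Insert/Delete/NewRule/Split/Substitute
        cand_energy = description_length(candidate, data)
        delta = cand_energy - energy
        if delta <= 0 or random.random() < math.exp(-delta / temp):
            current, energy = candidate, cand_energy
        temp *= cooling
    return current
```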


Experiment 1
Word segmentation by 8-month-old infants
Vocabulary: pabiku, golatu, daropi, tibudo
Saffran '96: used a speech synthesizer, no word breaks, 2 minutes = 180 words
Infants can distinguish words from non-words
Now try grammar induction (60 words)


Experiment 1
[results figure lost]

Experiment 2
[stimuli figure lost]
Accurate segmentation, but inaccurate structural learning
Prototype-Driven Grammar Induction
Semi-supervised approach
Give only a few dozen prototypical examples (for NP, e.g., determiner-noun, pronouns, ...)
On the English Penn Treebank: F1 = 65.1 (52% reduction over naive PCFG induction)

Aria Haghighi and Dan Klein. Prototype-Driven Grammar Induction. ACL 2006.
Dan Klein and Chris Manning. A Generative Constituent-Context Model for Improved Grammar Induction. ACL 2002.
That's it!
