
Natural Language Grammar Induction using a

Constituent-Context Model
Dan Klein and Christopher D. Manning
Computer Science Department
Stanford University
Stanford, CA 94305-9040
{klein, manning}@cs.stanford.edu

Abstract
This paper presents a novel approach to the unsupervised learning of syntactic analyses of natural language text. Most previous work has focused
on maximizing likelihood according to generative PCFG models. In contrast, our probabilistic model over trees is based directly on constituent
identity and linear context, and uses a hard EM procedure to induce structure. Despite employing a simpler model of structure, this approach
produces higher quality analyses, giving the best published constituent
precision, recall, and F-score results on the ATIS dataset.

1 Overview
To enable a wide range of subsequent tasks, human language sentences are standardly given
tree-structure analyses, whereby the nodes in a tree dominate contiguous spans of words
called constituents, as in figure 1. These constituents represent the linguistically coherent
units in the sentence, and are usually labeled with a constituent category, such as noun
phrase or verb phrase. An aim of grammar induction systems is to figure out, given just
the sentences in a corpus S , what tree structures correspond to them. In this sense, the
grammar induction problem is an incomplete data problem, where the complete data is the
corpus of trees T , but we only observe their yields S . This paper presents a new approach
to this problem, which gains leverage by directly making use of constituent contexts.
It is an open problem whether entirely unsupervised methods can produce linguistically
accurate parses of sentences. Due to the difficulty of this task, the vast majority of statistical parsing work has focused on supervised learning approaches to parsing [8, 4]. But
there are compelling motivations for unsupervised grammar induction. Building supervised training data requires considerable resources, including time and linguistic expertise.
Investigating unsupervised methods can shed light on linguistic phenomena which are implicit within a supervised parser's supervisory information (e.g., unsupervised systems often have difficulty correctly attaching subjects to verbs above objects, whereas for a supervised CFG parser, this ordering is implicit in the given sentence structures). Finally, while
the presented system makes no claims to modeling human language acquisition, results on
whether there is enough information in sentences to recover their structure are important data
for linguistic theory, where it has standardly been assumed that the information in the data
is deficient, and strong innate knowledge is required for language acquisition [5].

2 Problems with inducing PCFGs using ML, MDL, and EM


Most grammar induction work assumes that sentence trees are in fact generated by a symbolic or probabilistic context-free grammar (CFG or PCFG). These systems generally boil
down to one of two types. Some fix the structure of the grammar in advance [14], often
with an aim to incorporate linguistic constraints [3] or prior knowledge [15]. These systems
typically then attempt to find the grammar production parameters Θ which maximize the
likelihood P(S | Θ) using the inside-outside algorithm [1], which is an efficient (dynamic
programming) instance of the EM algorithm [9] applied to PCFGs. Other systems (which
have generally been more successful) incorporate a structural search as well, typically using
a heuristic to propose candidate grammar modifications which minimize the joint encoding
of data and grammar using an MDL criterion [7, 20, 19]. These approaches can also be
seen as likelihood maximization where the objective function is the a posteriori likelihood
of the grammar given the data, and the description length provides a structural prior.
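For reference, the M-step of this inside-outside procedure can be written as the usual expected-count ratio; the formula below is the standard textbook form (with inside probabilities β and outside probabilities α), stated here for concreteness rather than taken from any of the systems cited.

```latex
% Standard inside-outside (EM) re-estimation of a binary rule A -> B C,
% where beta is the inside probability, alpha the outside probability,
% and P(s) the total probability of sentence s under the current grammar:
\hat{P}(A \to B\,C) =
  \frac{\sum_{s \in S} \frac{1}{P(s)} \sum_{p \le d < q}
        \alpha_A(p,q)\, P(A \to B\,C)\, \beta_B(p,d)\, \beta_C(d+1,q)}
       {\sum_{s \in S} \frac{1}{P(s)} \sum_{p \le q}
        \alpha_A(p,q)\, \beta_A(p,q)}
```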
The PCFG model family works reasonably well for supervised parsing, where we are given
a corpus of fully parsed sentences and asked to induce a model which parses unseen sentences. However, there are linguistic reasons to distrust an ML objective function. First,
the optimal model is very strongly data-dependent. The grammar G which maximizes
P(S | G) depends on the corpus S, which, in some sense, the core of a given language's
phrase structure should not. Second, there is pressure for the symbols and rules to align in
ways which maximize the truth of the conditional independence assumptions embodied by
the PCFG. The symbols and rules of a natural language grammar, on the other hand, represent syntactically and semantically coherent units, for which a host of linguistic arguments
have been made [16]. None of these have anything to do with conditional independence;
traditional linguistic constituency reflects only grammatical regularities and possibilities
for expansion. There are expected to be strong connections across phrases (such as dependencies between verbs and their selected arguments). It could be that ML over PCFGs
and linguistic criteria align, but in practice they do not always seem to, and so maximizing
one criterion cannot be expected to maximize the other. For example, in [13], we found
that noun phrase (NP) constituents which contained numbers were rarely incorporated into
the NP grammar. This was partly because numbers have a distinct distribution: they are
common objects of verbs, but very rare subjects. However, a linguist would take this as a
selectional characteristic of the data set, not an indication that numbers are not NPs.
Another common objective function is MDL, which asserts that a good analysis is a short
one, in that the joint encoding of the grammar and the data is compact [18]. The compact
grammar aspect of MDL is perhaps closer to some traditional linguistic argumentation
which at times has argued for minimal grammars on grounds of analytical [11] or cognitive
[6] economy. However, some CFGs which might possibly be taken as the acquisition goal
are anything but compact. A more serious issue with MDL is that the target grammar is
presumably bounded in size, while adding more and more data will on average cause MDL
methods to choose ever larger grammars. More generally, if the data is large, the MDL prior
is relatively weak, and MDL reduces to ML. However, the most serious practical issue with
MDL systems is the following. These systems are primarily distinguished from each other
in the mechanism used for postulating new non-terminals and rules to the grammar. To do
so requires a heuristic for detecting which non-terminals represent constituent categories
(see [7, 13] for various heuristics). These heuristics can be effective, but are imperfect, and
require various assumptions about the nature of the rules in the grammars, which are true
for some categories but not others. As shown in these papers, improving these heuristics
results in some of the best-performing systems to date, but the fact remains that heuristic
design is necessary and the resulting induction systems are multi-phase and fairly complex.
While early work showed that small, artificial context-free grammars could be induced
with the EM algorithm [14], studies with large natural language grammars have generally
suggested that using this method of completely unsupervised acquisition is ineffective. For

instance, Carroll and Charniak [3] describe experiments running the EM algorithm from
random starting points, which produced widely varying learned grammars, almost all of
extremely poor quality. It is well-known that EM is only locally optimal, and one might
think that this merely provides evidence that the locality of the search procedure, not the
objective function, is to blame. However, [14] describe an experiment in which, starting
from fixed, correct structure, EM produced a grammar which had higher log-likelihood
than the linguistically determined grammar, but lower parsing accuracy.
To investigate EM over fixed-structure PCFGs, we duplicated one of the experiments in
[3] using [12]. In it, grammars were restricted to simple adjunctions: rules are of the form
x → x y | y x, where there is one category x for each part-of-speech. Such a restricted
CFG is isomorphic to a dependency grammar, and is a reasonable linguistic bias. We
began reestimation from a grammar with uniform rewrite probabilities. Figure 3 shows that the
resulting grammar is not quite as bad as conventional wisdom suggests. Its performance
is about midway between the baseline of random structure and the top scores of the system
presented in this paper, and comparable to some other recent acquisition systems.
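For concreteness, the following sketch builds such a restricted grammar: one nonterminal per part-of-speech tag, adjunction rules x → x y | y x, and uniform rewrite probabilities as the starting point. It is only an illustration of the grammar family under the assumptions just stated, not the code of [3] or [12].

```python
def dependency_style_grammar(tagset):
    """Restricted CFG of section 2: one nonterminal X_t per POS tag t, binary
    adjunction rules X_x -> X_x X_y and X_x -> X_y X_x for all tags x, y, plus a
    unary projection X_t -> t. Rewrite probabilities start out uniform."""
    rules = {}
    for x in tagset:
        rhs_set = {(x,)}                          # unary projection to the tag itself
        for y in tagset:
            rhs_set.add((f"X_{x}", f"X_{y}"))     # x takes y as a right dependent
            rhs_set.add((f"X_{y}", f"X_{x}"))     # x takes y as a left dependent
        p = 1.0 / len(rhs_set)                    # uniform starting probabilities
        rules[f"X_{x}"] = {rhs: p for rhs in rhs_set}
    return rules

# Toy tagset; the experiments above use the full Penn Treebank tagset.
grammar = dependency_style_grammar(["DT", "NN", "VBD"])
print(sorted(grammar["X_NN"]))
```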
Charniak is right to observe that the search space is riddled with pronounced local maxima, and EM does not do nearly so well when randomly initialized. The system has been
supplied with a linguistic bias, and inspection shows that, while it makes some of the same
systematic mistakes as other systems, it also makes some serious additional errors such as
grouping articles and prepositions. The need for random seeding in using EM over PCFGs
is two-fold. First, for some grammars, such as a grammar over a set X of non-terminals
in which any x_1 → x_2 x_3, x_i ∈ X is possible, it is needed to break symmetry. This is
not the case for dependency grammars, where symmetry is broken by the yields (e.g., a
short sentence noun verb can only be covered by a noun- or verb-projection). The second
reason is to start the search from a random region of the space. But unless one plans on
many random restarts, the uniform starting condition is better than beginning at an extreme
point in the space, and produces superior results.
We conjecture that PCFG models often fail to propagate contextual cues efficiently. The
reason we expect an algorithm to converge on a good PCFG is that there seem to be coherent categories, like noun phrases, which occur in distinctive environments, like between
the beginning of the sentence and the verb phrase. In the inside-outside algorithm, the
product of inside and outside probabilities α_j(p, q) β_j(p, q) is the probability of generating
the sentence with a j constituent spanning words p through q: the outside probability captures the environment, and the inside probability the coherent category. If we had a good
idea of what verb phrases and noun phrases looked like, then if a novel NP appeared in an
NP context, the outside probabilities should pressure the sequence to be parsed as an NP.
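In other words, normalizing this product by the sentence probability gives the posterior probability of a j-labelled span, which is exactly the quantity that should pull a novel NP in an NP context toward an NP analysis; this is the standard PCFG identity, written out only to make the argument explicit.

```latex
% Posterior probability that a node labelled j spans positions p..q of sentence s,
% with alpha the outside and beta the inside probability:
P(j \text{ spans } p..q \mid s) = \frac{\alpha_j(p,q)\,\beta_j(p,q)}{P(s)}
```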
However, what happens early in the EM procedure, when we have no real idea about the
grammar parameters? With randomly-weighted, complete grammars over a symbol set X ,
we have observed that a frequent, short, noun phrase sequence often does get assigned to a
category x. However, since there is not a clear overall structure learned, there is only very
weak pressure for other NPs, even if they occur in the same positions, to also be assigned to
x, and the reestimation process goes astray. To enable this kind of inside-outside pressure
to be effective early in the process, we propose the model in the following section.

3 The Constituent-Context model


We propose a simpler, alternate parametric family of models over trees. The primary task
in parsing sentences is deciding which spans of a sentence are constituents, not what their
labels would be if they were. For example, the sequence DT NN IN DT NN ([the man in
the moon]) is virtually always a noun phrase when it is a constituent, but it is only a constituent 66% of the time, because the IN DT NN is often attached elsewhere ([[we sent a
man] [to the moon]]). Thus, it is important that an induction system be able to detect con-

stituents, either implicitly or explicitly. A variety of methods of constituent detection have


been proposed [13, 7], usually based on information-theoretic properties of a sequence's
distributional context. However, we here rely entirely on the following two simple assumptions: (i) Constituents of a parse do not cross each other, and (ii) Constituents occur in
constituent contexts. The first property is self-evident from the nature of the parse trees.
The second is an extremely weakened version of classic linguistic constituency tests [16].

Let α be a part-of-speech tag sequence. Every occurrence of α will be in some context
c(α) = ⟨x, y⟩, where x and y are the adjacent tags or sentence boundaries. Then any tree
t over a sentence s can be seen as a collection of span sequences and span contexts. Good
trees will include spans whose sequences are frequently constituents and whose contexts
frequently surround constituents. Formally, we use a conditional Gibbs model of the form

    p(t | s, Θ) = exp( Σ_{⟨α,c⟩ ∈ t} θ_α f_α + θ_c f_c ) / Σ_{t′ : yield(t′) = s} exp( Σ_{⟨α,c⟩ ∈ t′} θ_α f_α + θ_c f_c )

We have one feature f_α(t) for each sequence α whose value on a tree t is the number of
nodes in t with yield α, and one feature f_c(t) for each context c representing the number of
times c is the context of the yield of some node in the tree. No joint features over c and α
are used, and, unlike many other systems, there is no distinction between constituent types.
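To make the feature definitions concrete, here is a small sketch that extracts span sequences and contexts from a bracketed tag sequence and computes the unnormalized log score Σ θ_α f_α + θ_c f_c of a tree; the span encoding and the boundary marker "#" are illustrative choices, not part of the model.

```python
def spans_and_contexts(tags, brackets):
    """Yield (alpha, c(alpha)) for each bracketed span (i, j), 0-based and
    end-exclusive: alpha is the tag subsequence, and the context is the pair of
    adjacent tags, with '#' standing in for a sentence boundary."""
    for i, j in brackets:
        alpha = tuple(tags[i:j])
        context = (tags[i - 1] if i > 0 else "#",
                   tags[j] if j < len(tags) else "#")
        yield alpha, context

def tree_log_score(tags, brackets, theta_seq, theta_ctx):
    """Unnormalized log score of a tree: the sum of theta_alpha + theta_c over its
    spans (unseen sequences and contexts default to weight 0 in this sketch)."""
    return sum(theta_seq.get(a, 0.0) + theta_ctx.get(c, 0.0)
               for a, c in spans_and_contexts(tags, brackets))

# The Penn Treebank bracketing of "The screen was a sea of red" (tags only).
tags = "DT NN VBD DT NN IN NN".split()
brackets = [(0, 7), (0, 2), (2, 7), (3, 7), (3, 5), (5, 7)]
print(list(spans_and_contexts(tags, brackets)))
```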

We model only the conditional likelihood of the trees, p(S, T | Θ) = p(T | S, Θ) p̃(S), where
p̃ is the empirical distribution over yields. We then use hard-assignment EM to find a
local maximum P(T | Θ) of the completed data (trees) T. For the E-step, we find the most
probable tree structure according to this model using a simple dynamic program. For the
M-step, it is necessary to fit the model to the completed data. We found that in practice
running a complete fitting procedure such as IIS [2] was not necessary for convergence.
Rather, we found that simple relative frequency estimates θ_α = log(count(f_α)/count(α))
and θ_c = log(count(f_c)/count(c)) resulted in extremely similar data completions to a
full fitting phase. This appears to be because the crossing constraints are doing most of the
work, with the exact feature weights mattering very little.
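The following sketch shows the shape of one hard-EM round as we read the description above: the E-step selects the binary bracketing that maximizes the total span score with a CKY-style dynamic program, and the M-step recomputes θ from relative frequencies. Details such as scoring trivial one-tag spans and the '#' boundary marker are simplifications of ours.

```python
import math
from collections import Counter

BOUNDARY = "#"

def span_score(tags, i, j, theta_seq, theta_ctx):
    alpha = tuple(tags[i:j])
    c = (tags[i - 1] if i > 0 else BOUNDARY,
         tags[j] if j < len(tags) else BOUNDARY)
    return theta_seq.get(alpha, 0.0) + theta_ctx.get(c, 0.0)

def best_brackets(tags, theta_seq, theta_ctx):
    """E-step: the highest-scoring binary bracketing under the current weights."""
    n = len(tags)
    chart = {}                      # (i, j) -> (best score, best split point)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            score = span_score(tags, i, j, theta_seq, theta_ctx)
            if length == 1:
                chart[i, j] = (score, None)
            else:
                k, inner = max(((k, chart[i, k][0] + chart[k, j][0])
                                for k in range(i + 1, j)), key=lambda t: t[1])
                chart[i, j] = (score + inner, k)

    def collect(i, j):
        spans, k = [(i, j)], chart[i, j][1]
        return spans if k is None else spans + collect(i, k) + collect(k, j)
    return collect(0, n)

def reestimate(parsed_corpus):
    """M-step: theta_alpha = log(count(alpha as a constituent) / count(alpha)),
    and likewise for contexts, over the completed (parsed) corpus."""
    as_span, as_ctx, seq_total, ctx_total = Counter(), Counter(), Counter(), Counter()
    for tags, brackets in parsed_corpus:
        n, bracket_set = len(tags), set(brackets)
        for i in range(n):
            for j in range(i + 1, n + 1):
                alpha = tuple(tags[i:j])
                c = (tags[i - 1] if i > 0 else BOUNDARY,
                     tags[j] if j < n else BOUNDARY)
                seq_total[alpha] += 1
                ctx_total[c] += 1
                if (i, j) in bracket_set:
                    as_span[alpha] += 1
                    as_ctx[c] += 1
    theta_seq = {a: math.log(as_span[a] / seq_total[a]) for a in as_span}
    theta_ctx = {c: math.log(as_ctx[c] / ctx_total[c]) for c in as_ctx}
    return theta_seq, theta_ctx
```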
Since there is one feature per tag sequence, most features will have very little support in
the data; very many will correspond to sequences which only occurred once. We dealt with
this issue in two ways. When correctly fitting the θ values, we can simply discard features
with support below a given threshold. However, we achieved the best results by keeping
them in and using smoothed relative frequency estimators. Additionally, we noticed that
problems with over-dependence on first-round parses were removed by heavily smoothing
these estimators during the first few rounds of re-estimation.
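The exact smoothing formula is not spelled out above, so the following is only one plausible reading: an add-λ interpolation of the relative-frequency ratio, with a larger λ in the first few rounds; the λ values in the comment are made-up illustrations.

```python
import math

def smoothed_theta(span_count, total_count, lam=1.0):
    """Smoothed relative-frequency estimate: interpolate the span/total ratio
    toward 1/2 so that rare sequences and contexts are not trusted too much."""
    return math.log((span_count + lam) / (total_count + 2 * lam))

# Using a heavier value early (e.g. lam=10.0 for the first rounds, then 1.0)
# plays the role of the heavy first-round smoothing described above.
```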

4 Results
In all experiments, we used Penn Treebank training data, mainly sentences from the WSJ
of length ≤ 10 after removal of punctuation, though the system behaves qualitatively the
same on longer sentences. For ATIS results we trained on the WSJ sentences and tested
on ATIS. Figure 1 shows sample data and results. Many systems, including the present
one, start not with the actual words in the sentences, but with the part-of-speech tags of
the words (all computation is done with the Penn treebank tagset). This assumption, in
general, makes inducing a reasonable grammar easier but finding the correct parse for a
given sentence harder. For example, the system has been told that "screen" and "sea" are
to have the same behavior in the grammar. This reduces the space of grammars drastically,
but there is no longer a distinction between this sentence and "The man bought a ticket
on Monday", which has the same part-of-speech tag sequence but a different correct parse.
The task of inducing parts of speech for words has been separately tackled, with reasonable
success [10, 17].
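As an illustration of this preprocessing, the sketch below reads tagged sentences, drops punctuation by tag, and keeps tag sequences of length ≤ 10. It uses the small WSJ sample bundled with NLTK purely as a stand-in (the experiments above use the actual treebank sections), and the punctuation tag list is our guess at what "removal of punctuation" covers.

```python
from nltk.corpus import treebank  # small WSJ sample shipped with NLTK

PUNCT_TAGS = {".", ",", ":", "``", "''", "-LRB-", "-RRB-", "-NONE-", "$", "#"}

def tag_sequences(max_len=10):
    """Yield punctuation-free POS tag sequences of length <= max_len."""
    for sent in treebank.tagged_sents():
        tags = [tag for _, tag in sent if tag not in PUNCT_TAGS]
        if 0 < len(tags) <= max_len:
            yield tags

print(next(tag_sequences()))
```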

[Figure 1 tree diagrams for the sentence "The screen was a sea of red" are not reproduced here.]
Figure 1: Alternate trees for a sentence: from left, the Penn Treebank tree (deemed correct),
the one found by our system CP-FREQ, and the one found by DEP-PCFG.
Sequence    Example             CORRECT
DT NN       the man             1
NNP NNP     United States       2
CD CD       4 1/2               3
JJ NNS      daily yields        4
DT JJ NN    the top rank        5
DT NNS      the people          6
JJ NN       plastic furniture   7
CD NN       12 percent          8
IN NN       on Monday           9
IN DT NN    for the moment      10
NN NNS      fire trucks         11
NN NN       fire truck          22
TO VB       to go               26
DT JJ       ?the big            78
IN DT       *of the             90
PRP VBZ     ?he says            95
PRP VBP     ?they say           180
NNS VBP     ?people are         =350
NN VBZ      ?value is           =532
NN IN       *man from           =648
NNS VBD     ?people were        =648

[Per-ranking columns for FREQUENCY, ENTROPY, DEP-PCFG, CP-FREQ, and CP-RAND are not reproduced.]
Figure 2: Top non-trivial sequences by actual constituent counts, raw frequency of sequence, scaled
entropy, and according to DEP-PCFG, CP-RANDOM, and CP-FREQUENCY.

The grammar acquired by our system is implicit in the learned feature weights. However,
these are not by themselves particularly interpretable, and not directly comparable to the
grammars produced by other systems. Therefore, we present two kinds of results. First,
any grammar which parses the trees in our corpus will have a distribution over constituent
yields. We examine these distributions, as they give a good sense of what structures are and
are not being learned. Second, we compare the trees produced by our grammar with those
produced by others, and show our grammar's performance on standardized parsing tasks.
Figure 2 shows the top scoring constituents by several rankings. These lists do not say
very much about how long, complex, recursive constructions are being parsed by a given
system, but grammar induction systems are still at the level where major mistakes manifest
themselves in short, frequent sequences. CORRECT indicates the frequency rank ordering
of constituents in the correct parses. FREQUENCY lists POS sequences by their tag subsequence frequencies in the sentences. Note that the sequence IN DT (e.g., "of the") is high
on this list, and is a typical error of many early systems. DEP-PCFG is the frequency rank of
constituents when parsing according to the dependency grammar PCFG described in section 2. ENTROPY is the ranking according to the heuristic proposed in [13] which ranks by
context entropy. It is better in practice than FREQUENCY, but that isn't self-evident from
this list. CP-RAND is a list from our system when initialized randomly, while CP-FREQ
is the list from our system when initialized by frequency. Clearly, the lists produced by

WSJ:
Method      UR     UP     F1     NP UR   PP UR   VP UR
RANDOM      29.0   31.0   30.0   42.8    23.6    26.3
DEP-PCFG    39.5   42.3   40.9   69.7    44.1    22.8
CP-FREQ     49.4   52.9   51.1   83.4    79.2    28.9
CP-RAND     47.3   50.7   48.9   78.1    69.3    33.2

ATIS:
System      UR     UP     F1     CB
EMILE       16.8   51.6   25.4   0.84
ABL         35.6   43.6   39.2   2.12
CDC-40      34.6   53.4   42.0   1.46
CP-FREQ     44.3   51.5   47.6   1.96
CP-RAND     46.4   54.0   49.9   1.75
Figure 3: Comparative accuracy on WSJ sentences and on the ATIS corpus.
UR = unlabeled recall; UP = unlabeled precision; F1 = the harmonic mean of UR and UP;
CB = crossing brackets. Separate recall values are shown for three major categories.
our system are closer to correct than the others. They look much like a censored version
of the frequency list, where frequent sequences which do not co-exist with higher-ranked
ones have been removed (e.g., IN DT often crosses DT NN). This observation may explain
a good part of the success of this method.
In figure 3, we report summary results for each of our systems on WSJ parsing (using
traditional PARSEVAL scoring, with matching of unlabeled constituents, ignoring unary
productions), and for parsing the ATIS treebank, this time using not the original PARSEVAL measures, but rather the definitions used by [8] and EVALB, a standardized scoring
program, to facilitate comparison to other systems. For WSJ, the baseline method is RANDOM, where random binary parsing decisions are made. The sentences are short enough
that random parsing does surprisingly (or embarrassingly) well, since one gets the root of
the tree right and, on average, 1 or 2 other constituents. DEP-PCFG is the result of using the
grammar described in section 2. CP-RAND is our system using random first-round parses,
while CP-FREQ is our system with initial θ values proportional to the log-frequency of the
sequences (and zero for the contexts). For ATIS, EMILE and ABL are lexical systems,
whose performance is described in [19]. The results for CDC-40, taken from [7], reflect
training on much more data (12M words).
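For reference, unlabeled precision, recall, and F1 reduce to span-set matching; the sketch below is one straightforward reading in which labels and unary productions are ignored by collapsing each tree to a set of spans, and single-word spans are dropped as trivial. Exact exclusions differ between PARSEVAL variants and EVALB settings, so this is a sketch rather than the scoring program used above.

```python
def unlabeled_prf(gold_brackets, test_brackets):
    """Unlabeled bracketing scores for one sentence. Brackets are (start, end)
    spans; labels and unary chains collapse into a single span, and length-one
    spans are ignored as trivial."""
    keep = lambda spans: {(i, j) for i, j in spans if j - i > 1}
    gold, test = keep(gold_brackets), keep(test_brackets)
    matched = len(gold & test)
    precision = matched / len(test) if test else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy check against the Figure 1 spans: one missing gold span costs recall only.
gold = [(0, 7), (0, 2), (2, 7), (3, 7), (3, 5), (5, 7)]
test = [(0, 7), (0, 2), (2, 7), (3, 7), (5, 7)]
print(unlabeled_prf(gold, test))   # (1.0, 0.833..., 0.909...)
```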
There are a number of issues in how to interpret the results of an unsupervised system
when comparing with hand-given supervised parses. Errors come in several kinds. First
are innocent sins of commission. Treebank trees are very flat. For example, there is no
analysis of the inside of many short noun phrases ([two hard drives] rather than [two
[hard drives]]). Our system gives a (usually correct) analysis of the insides of NPs for
which it is penalized in terms of unlabeled precision (though not crossing brackets) when
compared to the treebank. Some are genuine errors in parsing. Our system tends to form
verb groups and attaches the subject below the object for transitive verbs. As a result, most
VPs are systematically incorrect, boosting crossing bracket scores and dramatically impacting VP recall, substantially pulling down the overall figures. Finally, the treebank's grammar is sometimes an arbitrary, and even inconsistent, standard for an unsupervised learner:
an alternate analysis may be just as good. For example, transitive sentences are bracketed [subject [verb object]] (The president [executed the law]) while nominalizations are
bracketed [[possessive noun] complement] ([The president's execution] of the law), an
arbitrary inconsistency which is unlikely to be learned automatically. Notwithstanding this,
we use standard parser evaluation measures to facilitate comparison with other systems.

5 Conclusions and future work


We have presented an alternate parametric probability model over trees which is based on
simple assumptions about the nature of natural language structure. The model is designed
to allow direct propagation of constituency between sequences and their environments with
the hope that this will reduce the problem of local maxima in the search space. Using EM,
we show that this model, despite its simplicity, produces higher quality structural analyses
than previous ML and MDL methods which employ the PCFG model family.

Despite, or perhaps due to, its simplicity, our model predicts bracketings very well. However, it
makes no distinction between different types of constituents, a distinction which is necessary for
any practical application that uses syntactic analyses. There are at least two ways this system
could be extended to incorporate constituent labels. The direct extension would treat the
labels as further hidden data and add features which are joint over sequences or contexts
and labels. Another method would be to separately cluster sequences, as in [7]. Beyond
adding labels, it remains to be seen whether this model has any merit in supervised parsing (perhaps by adding features which are joint over the contexts and sequences) or, as is
entirely possible, this model is good for basic chunk acquisition but too disconnected from
the high-level, recursive structure of language to be good at parsing long sentences with
frequent, complex constructions.
References
[1] James K. Baker. Trainable grammars for speech recognition. In D. H. Klatt and J. J. Wolf, editors, Speech Communication Papers for the 97th Meeting of the Acoustical Society of America,
pages 547-550, 1979.
[2] Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum entropy
approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
[3] Glenn Carroll and Eugene Charniak. Two experiments on learning probabilistic dependency
grammars from corpora. In Carl Weir, Stephen Abney, Ralph Grishman, and Ralph Weischedel,
editors, Working Notes of the Workshop Statistically-Based NLP Techniques, pages 1-13. AAAI
Press, Menlo Park, CA, 1992.
[4] Eugene Charniak. A maximum-entropy-inspired parser. In NAACL 1, pages 132-139, 2000.
[5] Noam Chomsky. Knowledge of Language: Its Nature, Origin, and Use. Praeger, New York,
1986.
[6] Noam Chomsky and Morris Halle. The Sound Pattern of English. Harper & Row, New York,
1968.
[7] Alexander Clark. Unsupervised induction of stochastic context-free grammars using distributional clustering. In Proceedings of CoNLL 2001, 2001.
[8] Michael John Collins. Three generative, lexicalised models for statistical parsing. In ACL
35/EACL 8, pages 16-23, 1997.
[9] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via
the EM algorithm. J. Royal Statistical Society Series B, 39:1-38, 1977.
[10] Steven Finch and Nick Chater. Distributional bootstrapping: From word class to proto-sentence.
In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pages
301-306, Hillsdale, NJ, 1994. Lawrence Erlbaum.
[11] Zellig Harris. Methods in Structural Linguistics. University of Chicago Press, Chicago, 1951.
[12] Mark Johnson. Inside-outside algorithm code. http://www.cog.brown.edu/~mj/, 2000.
[13] Dan Klein and Christopher D. Manning. Distributional phrase structure induction. In The Fifth
Conference on Natural Language Learning, 2001.
[14] K. Lari and S. J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 4:35-56, 1990.
[15] Fernando Pereira and Yves Schabes. Inside-outside reestimation from partially bracketed corpora. In ACL 30, pages 128-135, 1992.
[16] Andrew Radford. Transformational Grammar. Cambridge University Press, Cambridge, 1988.
[17] Hinrich Schütze. Distributional part-of-speech tagging. In EACL 7, pages 141-148, 1995.
[18] Andreas Stolcke and Stephen M. Omohundro. Inducing probabilistic grammars by Bayesian
model merging. In Grammatical Inference and Applications: Proceedings of the Second International Colloquium on Grammatical Inference. Springer Verlag, 1994.
[19] M. van Zaanen and P. Adriaans. Comparing two unsupervised grammar induction systems:
Alignment-based learning vs. EMILE. Technical Report 2001.05, University of Leeds, 2001.
[20] J. G. Wolff. Learning syntax and meanings through optimization and distributional analysis. In
Y. Levy, I. M. Schlesinger, and M. D. S. Braine, editors, Categories and processes in language
acquisition, pages 179-215. Lawrence Erlbaum, Hillsdale, NJ, 1988.
