there is no s(2, 1, 4) for the adjacent edges from hit to John and ball). This independence between left and right descendants allows us to use an O(n^3) second-order projective parsing algorithm, as we will see later. We write s(xi, −, xj) when xj is the first left or first right dependent of word xi. For example, s(2, −, 4) is the score of creating a dependency from hit to ball, since ball is the first child to the right of hit. More formally, if the word xi0 has the children shown in this picture,

xi0
xi1 . . . xij xij+1 . . . xim

the score factors as follows:

sum_{k=1}^{j-1} s(i0, i_{k+1}, i_k) + s(i0, −, i_j) + s(i0, −, i_{j+1}) + sum_{k=j+1}^{m-1} s(i0, i_k, i_{k+1})

This second-order factorization subsumes the first-order factorization, since the score function could just ignore the middle argument to simulate first-order scoring. The score of a tree for second-order parsing is now

s(x, y) = sum_{(i,k,j) ∈ y} s(i, k, j)

where k and j are adjacent, same-side children of i in the tree y.

The second-order model allows us to condition on the most recent parsing decision, that is, the last dependent picked up by a particular word, which is analogous to the Markov conditioning of the Charniak parser (Charniak, 2000).

2.2 Exact Projective Parsing

For projective MST parsing, the first-order algorithm can be extended to the second-order case, as was noted by Eisner (1996). The intuition behind the algorithm is shown graphically in Figure 3, which displays both the first-order and second-order algorithms. In the first-order algorithm, a word gathers its left and right dependents independently, by gathering each half of the subtree rooted in its dependent in separate stages. By splitting chart items into left and right components, the Eisner algorithm only requires 3 indices to be maintained at each step, as discussed in detail elsewhere (Eisner, 1996; McDonald et al., 2005b). For the second-order algorithm, the key insight is to delay the scoring of edges until pairs of dependents have been gathered. This allows for the collection of pairs of adjacent dependents in a single stage, which allows for the incorporation of second-order scores while maintaining cubic-time parsing.

The Eisner algorithm can be extended to an arbitrary mth-order model with a complexity of O(n^{m+1}), for m > 1. An mth-order parsing algorithm will work similarly to the second-order algorithm, except that we collect m pairs of adjacent dependents in succession before attaching them to their parent.

2.3 Approximate Non-projective Parsing

Unfortunately, second-order non-projective MST parsing is NP-hard, as shown in Appendix A. To circumvent this, we designed an approximate algorithm based on the exact O(n^3) second-order projective Eisner algorithm. The approximation works by first finding the highest scoring projective parse. It then rearranges edges in the tree, one at a time, as long as such rearrangements increase the overall score and do not violate the tree constraint. We can easily motivate this approximation by observing that even in non-projective languages like Czech and Danish, most trees are primarily projective, with just a few non-projective edges (Nivre and Nilsson, 2005). Thus, by starting with the highest scoring projective tree, we are typically only a small number of transformations away from the highest scoring non-projective tree.

The algorithm is shown in Figure 4. The expression y[i → j] denotes the dependency graph identical to y except that xj's parent is xi instead

2-order-non-proj-approx(x, s)
Sentence x = x0 . . . xn, x0 = root
Weight function s : (i, k, j) → R
1.  Let y = 2-order-proj(x, s)
2.  while true
3.    m = −∞, c = −1, p = −1
4.    for j : 1 · · · n
5.      for i : 0 · · · n
6.        y′ = y[i → j]
7.        if ¬tree(y′) or ∃k : (i, k, j) ∈ y continue
8.        δ = s(x, y′) − s(x, y)
9.        if δ > m
10.         m = δ, c = j, p = i
11.     end for
12.   end for
13.   if m > 0
14.     y = y[p → c]
15.   else return y
16. end while

Figure 4: Approximate second-order non-projective parsing algorithm.
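The search in Figure 4 is a greedy hill-climb over single head changes. A minimal Python sketch, assuming a black-box scorer `score_tree` standing in for the learned s(x, y) and a parent-array encoding of the tree (both names are illustrative, not from the paper); the exact projective parse of line 1 is taken as given:

```python
# Sketch of the edge-transformation loop of Figure 4. `parent[j]` is the
# head of word j (index 0 is the artificial root and is unused).

def is_tree(parent):
    """True iff every word reaches the root, i.e. no cycles."""
    n = len(parent) - 1
    for j in range(1, n + 1):
        seen, k = set(), j
        while k != 0:
            if k in seen:
                return False        # j never reaches the root: cycle
            seen.add(k)
            k = parent[k]
    return True

def approx_second_order(parent, score_tree):
    """Greedily apply the best single head change until none helps."""
    best = score_tree(parent)
    while True:
        m, c, p = float("-inf"), -1, -1
        n = len(parent) - 1
        for j in range(1, n + 1):           # word whose head may change
            for i in range(0, n + 1):       # candidate new head
                if i == j or parent[j] == i:
                    continue                # edge (i, j) already in y
                cand = list(parent)
                cand[j] = i                 # y' = y[i -> j]
                if not is_tree(cand):
                    continue
                delta = score_tree(cand) - best
                if delta > m:
                    m, c, p = delta, j, i
        if m > 0:
            parent[c] = p                   # keep the best single change
            best += m
        else:
            return parent                   # local optimum reached
```

In the paper's setting the score change δ is computed incrementally in O(1) per candidate; the full re-scoring above is only for clarity of the control flow.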
[Figure 3 diagram: FIRST-ORDER items (A) and SECOND-ORDER sibling items (B), showing spans headed by h1, h2, h3 combined at split point r, r+1]

Figure 3: An O(n^3) extension of the Eisner algorithm to second-order dependency parsing. This figure shows how h1 creates a dependency to h3 with the second-order knowledge that the last dependent of h1 was h2. This is done through the creation of a sibling item in part (B). In the first-order model, the dependency to h3 is created after the algorithm has forgotten that h2 was the last dependent.
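Concretely, the second-order tree score s(x, y) = sum s(i, k, j) of Section 2.1 can be computed by walking each head's left and right dependents inside-out and scoring adjacent sibling pairs. A small sketch, where the score function `s` is a stand-in for the learned, feature-based scorer and `None` plays the role of the "−" argument for a first dependent:

```python
# Sketch of the second-order score factorization: each head's left and
# right dependents are scored separately, pairing adjacent siblings and
# using s(i, None, j) for the innermost dependent on each side.

def second_order_score(parent, s):
    """parent[j] is the head of word j; index 0 is the root."""
    n = len(parent) - 1
    deps = {i: [] for i in range(n + 1)}    # dependents grouped by head
    for j in range(1, n + 1):
        deps[parent[j]].append(j)
    total = 0.0
    for i, children in deps.items():
        left = sorted((j for j in children if j < i), reverse=True)
        right = sorted(j for j in children if j > i)
        for side in (left, right):          # both lists run inside-out
            prev = None                     # None marks the first child
            for j in side:
                total += s(i, prev, j)
                prev = j
    return total
```

Ignoring the middle argument of `s` recovers first-order edge-factored scoring, as noted in the text.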
of what it was in y. The test tree(y) is true iff the dependency graph y satisfies the tree constraint.

In more detail, line 1 of the algorithm sets y to the highest scoring second-order projective tree. The loop of lines 2–16 exits only when no further score improvement is possible. Each iteration seeks the single highest-scoring parent change to y that does not break the tree constraint. To that effect, the nested loops starting in lines 4 and 5 enumerate all (i, j) pairs. Line 6 sets y′ to the dependency graph obtained from y by changing xj's parent to xi. Line 7 checks that the move from y to y′ is valid by testing that xj's parent was not already xi and that y′ is a tree. Line 8 computes the score change from y to y′. If this change is larger than the previous best change, we record how this new tree was created (lines 9–10). After considering all possible valid edge changes to the tree, the algorithm checks that the best new tree does have a higher score. If that is the case, we change the tree permanently and re-enter the loop. Otherwise we exit, since there are no single edge switches that can improve the score.

This algorithm allows for the introduction of non-projective edges because we do not restrict any of the edge changes except to maintain the tree property. In fact, if any edge change is ever made, the resulting tree is guaranteed to be non-projective; otherwise there would have been a higher scoring projective tree that would have already been found by the exact projective parsing algorithm. It is not difficult to find examples for which this approximation will terminate without returning the highest-scoring non-projective parse.

It is clear that this approximation will always terminate: there are only a finite number of dependency trees for any given sentence, and each iteration of the loop requires an increase in score to continue. However, the loop could potentially take exponential time, so we will bound the number of edge transformations to a fixed value M. It is easy to argue that this will not hurt performance. Even in freer word-order languages such as Czech, almost all non-projective dependency trees are primarily projective, modulo a few non-projective edges. Thus, if our inference algorithm starts with the highest scoring projective parse, the best non-projective parse differs only by a small number of edge transformations. Furthermore, it is easy to show that each iteration of the loop takes O(n^2) time, resulting in an O(n^3 + Mn^2) runtime algorithm. In practice, the approximation terminates after a small number of transformations, and we did not need to bound the number of iterations in our experiments.

We should note that this is one of many possible approximations we could have made. Another reasonable approach would be to first find the highest scoring first-order non-projective parse, and then rearrange edges based on second-order scores in a similar manner to the algorithm we described. We implemented this method and found that the results were slightly worse.

3 Danish: Parsing Secondary Parents

Kromann (2001) argued for a dependency formalism called Discontinuous Grammar and annotated a large set of Danish sentences using this formalism to create the Danish Dependency Treebank (Kromann, 2003). The formalism allows for a
word to have multiple parents. Examples include verb coordination, in which the subject or object is an argument of several verbs, and relative clauses, in which words must satisfy dependencies both inside and outside the clause. An example is shown in Figure 5 for the sentence He looks for and sees elephants. Here, the pronoun He is the subject of both verbs in the sentence, and the noun elephants the corresponding object.

root Han spejder efter og ser elefanterne
     He  looks  for  and sees elephants

Figure 5: An example dependency tree from the Danish Dependency Treebank (from Kromann (2003)).

In the Danish Dependency Treebank, roughly 5% of words have more than one parent, which breaks the single-parent (or tree) constraint we have previously required of dependency structures. Kromann also allows for cyclic dependencies, though we deal only with acyclic dependency graphs here. Though less common than trees, dependency graphs involving multiple parents are well established in the literature (Hudson, 1984). Unfortunately, the problem of finding the dependency structure with highest score in this setting is intractable (Chickering et al., 1994).

To create an approximate parsing algorithm for dependency structures with multiple parents, we start with our approximate second-order non-projective algorithm outlined in Figure 4. We use the non-projective algorithm since the Danish Dependency Treebank contains a small number of non-projective arcs. We then modify lines 7–10 of this algorithm so that it looks for the change in parent or the addition of a new parent that causes the highest change in overall score and does not create a cycle.² As before, we make one change per iteration, and that change will depend on the resulting score of the new tree. Using this simple new approximate parsing algorithm, we train a new parser that can produce multiple parents.

² We are not concerned with violating the tree constraint.

4 Online Learning and Approximate Inference

In this section, we review the work of McDonald et al. (2005b) for online large-margin dependency parsing. As usual for supervised learning, we assume a training set T = {(x_t, y_t)}_{t=1}^T, consisting of pairs of a sentence x_t and its correct dependency representation y_t.

The algorithm is an extension of the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003) to learning with structured outputs, in the present case dependency structures. Figure 6 gives pseudo-code for the algorithm. An online learning algorithm considers a single training instance for each update to the weight vector w. We use the common method of setting the final weight vector as the average of the weight vectors after each iteration (Collins, 2002), which has been shown to alleviate overfitting.

On each iteration, the algorithm considers a single training instance. We parse this instance to obtain a predicted dependency graph, and find the smallest-norm update to the weight vector w that ensures that the training graph outscores the predicted graph by a margin proportional to the loss of the predicted graph relative to the training graph, which is the number of words with incorrect parents in the predicted tree (McDonald et al., 2005b). Note that we only impose margin constraints between the single highest-scoring graph and the correct graph relative to the current weight setting. Past work on tree-structured outputs has used constraints for the k-best scoring trees (McDonald et al., 2005b) or even all possible trees by using factored representations (Taskar et al., 2004; McDonald et al., 2005c). However, we have found that a single margin constraint per example leads to much faster training with a negligible degradation in performance. Furthermore, this formulation relates learning directly to inference, which is important, since we want the model to set weights relative to the errors made by an approximate inference algorithm. This algorithm can thus be viewed as a large-margin version of the perceptron algorithm for structured outputs (Collins, 2002).

Online learning algorithms have been shown to be robust even with approximate rather than exact inference in problems such as word alignment (Moore, 2005), sequence analysis (Daumé and Marcu, 2005; McDonald et al., 2005a) and phrase-structure parsing (Collins and Roark, 2004). This robustness to approximations comes from the fact that the online framework sets weights with respect to inference. In other words, the learning method sees common errors due to
approximate inference and adjusts weights to correct for them. The work of Daumé and Marcu (2005) formalizes this intuition by presenting an online learning framework in which parameter updates are made directly with respect to errors in the inference algorithm. We show in the next section that this robustness extends to approximate dependency parsing.

Training data: T = {(x_t, y_t)}_{t=1}^T
1. w^(0) = 0; v = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     min ||w^(i+1) − w^(i)||
       s.t. s(x_t, y_t; w^(i+1)) − s(x_t, y′; w^(i+1)) ≥ L(y_t, y′)
       where y′ = arg max_{y′} s(x_t, y′; w^(i))
5.     v = v + w^(i+1)
6.     i = i + 1
7. w = v/(N ∗ T)

Figure 6: MIRA learning algorithm. We write s(x, y; w^(i)) to mean the score of tree y using weight vector w^(i).

5 Experiments

The score of adjacent edges relies on the definition of a feature representation f(i, k, j). As noted earlier, this representation subsumes the first-order representation of McDonald et al. (2005b), so we can incorporate all of their features as well as the new second-order features we now describe. The old first-order features are built from the parent and child words, their POS tags, and the POS tags of surrounding words and of words between the child and the parent, as well as the direction and distance from the parent to the child. The second-order features are built from the following conjunctions of word and POS identity predicates:

xi-pos, xk-pos, xj-pos
xk-pos, xj-pos
xk-word, xj-word
xk-word, xj-pos
xk-pos, xj-word

where xi-pos is the part-of-speech tag of the ith word in the sentence. We also include conjunctions between these features and the direction and distance from sibling j to sibling k. We determined the usefulness of these features on the development set, which also helped us find that features such as the POS tags of words between the two siblings would not improve accuracy. We also ignored features over triples of words, since this would explode the size of the feature space.

We evaluate dependencies on per-word accuracy, which is the percentage of words in the sentence with the correct parent in the tree, and on complete dependency analysis. In our evaluation we exclude punctuation for English and include it for Czech and Danish, which is the standard.

5.1 English Results

To create data sets for English, we used the Yamada and Matsumoto (2003) head rules to extract dependency trees from the WSJ, setting sections 2-21 as training, section 22 for development and section 23 for evaluation. The models rely on part-of-speech tags as input, and we used the Ratnaparkhi (1996) tagger to provide these for the development and evaluation sets. These data sets are exclusively projective, so we only compare the projective parsers using the exact projective parsing algorithms. The purpose of these experiments is to gauge the overall benefit from including second-order features with exact parsing algorithms, which can be attained in the projective setting. Results are shown in Table 1. We can see that there is clearly an advantage in introducing second-order features. In particular, the complete tree metric is improved considerably.

English
                          Accuracy  Complete
1st-order-projective      90.7      36.7
2nd-order-projective      91.5      42.1

Table 1: Dependency parsing results for English.

5.2 Czech Results

For the Czech data, we used the predefined training, development and testing split of the Prague Dependency Treebank (Hajič et al., 2001), and the automatically generated POS tags supplied with the data, which we reduce to the POS tag set from Collins et al. (1999). On average, 23% of the sentences in the training, development and test sets have at least one non-projective dependency, though less than 2% of total edges are actually non-projective. Results are shown in Table 2.

Czech
                            Accuracy  Complete
1st-order-projective        83.0      30.6
2nd-order-projective        84.2      33.1
1st-order-non-projective    84.1      32.2
2nd-order-non-projective    85.2      35.9

Table 2: Dependency parsing results for Czech.

McDonald et al. (2005c) showed a substantial improvement in accuracy by modeling non-projective edges in Czech, shown by the difference between the two first-order models. Table 2 shows that a second-order model provides a comparable accuracy boost, even using an approximate non-projective algorithm. The second-order non-projective model accuracy of 85.2% is the highest reported accuracy for a single parser for these data. Similar results were obtained by Hall and Nóvák (2005) (85.1% accuracy), who take the best output of the Charniak parser extended to Czech and rerank slight variations on this output that introduce non-projective edges. However, this system relies on a much slower phrase-structure parser as its base model, as well as an auxiliary reranking module. Indeed, our second-order projective parser analyzes the test set in 16m32s, and the non-projective approximate parser needs 17m03s to parse the entire evaluation set, showing that runtime for the approximation is completely dominated by the initial call to the second-order projective algorithm and that the post-process edge transformation loop typically iterates only a few times per sentence.

5.3 Danish Results

For our experiments we used the Danish Dependency Treebank v1.0. The treebank contains a small number of inter-sentence and cyclic dependencies, and we removed all sentences that contained such structures. The resulting data set contained 5384 sentences. We partitioned the data into contiguous 80/20 training/testing splits. We held out a subset of the training data for development purposes.

We compared three systems: the standard second-order projective and non-projective parsing models, as well as our modified second-order non-projective model that allows for the introduction of multiple parents (Section 3). All systems use gold-standard part-of-speech tags, since no trained tagger is readily available for Danish. Results are shown in Table 3.

Danish
                                              Precision  Recall  F-measure
2nd-order-projective                          86.4       81.7    83.9
2nd-order-non-projective                      86.9       82.2    84.4
2nd-order-non-projective w/ multiple parents  86.2       84.9    85.6

Table 3: Dependency parsing results for Danish.

As might be expected, the non-projective parser does slightly better than the projective parser because around 1% of the edges are non-projective. Since each word may have an arbitrary number of parents, we must use precision and recall rather than accuracy to measure performance. This also means that the correct training loss is no longer the Hamming loss. Instead, we use false positives plus false negatives over edge decisions, which balances precision and recall as our ultimate performance metric.

As expected, for the basic projective and non-projective parsers, recall is roughly 5% lower than precision, since these models can pick up at most one parent per word. For the parser that can introduce multiple parents, we see an increase in recall of nearly 3% absolute with a slight drop in precision. These results are very promising and further show the robustness of discriminative online learning with approximate parsing algorithms.

6 Discussion

We described approximate dependency parsing algorithms that support higher-order features and multiple parents. We showed that these approximations can be combined with online learning to achieve fast parsing with competitive parsing accuracy. These results show that the gain from allowing richer representations outweighs the loss from approximate parsing, and further show the robustness of online learning algorithms with approximate inference.

The approximations we have presented are very simple. They start with a reasonably good baseline and make small transformations until the score of the structure converges. These approximations work because the freer word-order languages we studied are still primarily projective, making the approximate starting point close to the goal parse. However, we would like to investigate the benefits for parsing of more principled approaches to approximate learning and inference, such as the learning as search optimization framework of Daumé and Marcu (2005). This framework will possibly allow us to effectively include more global features over the dependency structure than
those in our current second-order model.

Acknowledgments

This work was supported by NSF ITR grant 0205448.

R. McDonald, K. Crammer, and F. Pereira. 2005b. Online large-margin training of dependency parsers. In Proc. ACL.

R. McDonald, F. Pereira, K. Ribarov, and J. Hajič. 2005c. Non-projective dependency parsing using spanning tree algorithms. In Proc. HLT-EMNLP.