
A Fast and Accurate Dependency Parser using Neural Networks

Danqi Chen
Computer Science Department, Stanford University
danqi@cs.stanford.edu

Christopher D. Manning
Computer Science Department, Stanford University
manning@stanford.edu

Abstract

Almost all current dependency parsers classify based on millions of sparse indicator features. Not only do these features generalize poorly, but the cost of feature computation restricts parsing speed significantly. In this work, we propose a novel way of learning a neural network classifier for use in a greedy, transition-based dependency parser. Because this classifier learns and uses just a small number of dense features, it can work very fast, while achieving about a 2% improvement in unlabeled and labeled attachment scores on both English and Chinese datasets. Concretely, our parser is able to parse more than 1000 sentences per second at 92.2% unlabeled attachment score on the English Penn Treebank.

1 Introduction

In recent years, enormous parsing success has been achieved by the use of feature-based discriminative dependency parsers (Kübler et al., 2009). In particular, for practical applications, the speed of the subclass of transition-based dependency parsers has been very appealing.

However, these parsers are not perfect. First, from a statistical perspective, these parsers suffer from the use of millions of mainly poorly estimated feature weights. While in aggregate both lexicalized features and higher-order interaction term features are very important in improving the performance of these systems, nevertheless, there is insufficient data to correctly weight most such features. For this reason, techniques for introducing higher-support features such as word class features have also been very successful in improving parsing performance (Koo et al., 2008). Second, almost all existing parsers rely on a manually designed set of feature templates, which require a lot of expertise and are usually incomplete. Third, the use of many feature templates causes a less studied problem: in modern dependency parsers, most of the runtime is consumed not by the core parsing algorithm but in the feature extraction step (He et al., 2013). For instance, Bohnet (2010) reports that his baseline parser spends 99% of its time doing feature extraction, despite that being done in standard efficient ways.

In this work, we address all of these problems by using dense features in place of the sparse indicator features. This is inspired by the recent success of distributed word representations in many NLP tasks, e.g., POS tagging (Collobert et al., 2011), machine translation (Devlin et al., 2014), and constituency parsing (Socher et al., 2013). Low-dimensional, dense word embeddings can effectively alleviate sparsity by sharing statistical strength between similar words, and can provide us a good starting point to construct features of words and their interactions.

Nevertheless, there remain challenging problems of how to encode all the available information from the configuration and how to model higher-order features based on the dense representations. In this paper, we train a neural network classifier to make parsing decisions within a transition-based dependency parser. The neural network learns compact dense vector representations of words, part-of-speech (POS) tags, and dependency labels. This results in a fast, compact classifier, which uses only 200 learned dense features while yielding good gains in parsing accuracy and speed on two languages (English and Chinese) and two different dependency representations (CoNLL and Stanford dependencies). The main contributions of this work are: (i) showing the usefulness of dense representations that are learned within the parsing task, (ii) developing a neural network architecture that gives good accuracy and speed, and (iii) introducing a novel activation function for the neural network that better captures higher-order interaction features.

2 Transition-based Dependency Parsing

Transition-based dependency parsing aims to predict a transition sequence from an initial configuration to some terminal configuration, which derives a target dependency parse tree, as shown in Figure 1. In this paper, we examine only greedy parsing, which uses a classifier to predict the correct transition based on features extracted from the configuration. This class of parsers is of great interest because of their efficiency, although they tend to perform slightly worse than search-based parsers because of subsequent error propagation. However, our greedy parser can achieve comparable accuracy with very good speed.[1]

[1] Additionally, our parser can be naturally incorporated with beam search, but we leave this to future work.

As the basis of our parser, we employ the arc-standard system (Nivre, 2004), one of the most popular transition systems. In the arc-standard system, a configuration c = (s, b, A) consists of a stack s, a buffer b, and a set of dependency arcs A. The initial configuration for a sentence w1, ..., wn is s = [ROOT], b = [w1, ..., wn], A = ∅. A configuration c is terminal if the buffer is empty and the stack contains the single node ROOT, and the parse tree is given by A_c. Denoting s_i (i = 1, 2, ...) as the ith top element on the stack, and b_i (i = 1, 2, ...) as the ith element on the buffer, the arc-standard system defines three types of transitions:

• LEFT-ARC(l): adds an arc s1 → s2 with label l and removes s2 from the stack. Precondition: |s| ≥ 2.

• RIGHT-ARC(l): adds an arc s2 → s1 with label l and removes s1 from the stack. Precondition: |s| ≥ 2.

• SHIFT: moves b1 from the buffer to the stack. Precondition: |b| ≥ 1.

In the labeled version of parsing, there are in total |T| = 2N_l + 1 transitions, where N_l is the number of different arc labels. Figure 1 illustrates an example of one transition sequence from the initial configuration to a terminal one.
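To make the arc-standard system just described concrete, the following minimal Python sketch encodes a configuration and the three transitions. It is purely illustrative: the names (Configuration, apply) are ours, not the authors' implementation, which is written in Java.

    # Minimal sketch of the arc-standard system described above (illustrative names).
    class Configuration:
        def __init__(self, words):
            self.stack = [0]                                # token indices; 0 is ROOT
            self.buffer = list(range(1, len(words) + 1))    # w1 ... wn
            self.arcs = []                                  # the set A, as (head, dependent, label)

        def is_terminal(self):
            return not self.buffer and self.stack == [0]

    def apply(c, transition):
        """Apply 'SHIFT', ('LEFT-ARC', l) or ('RIGHT-ARC', l) to configuration c."""
        if transition == 'SHIFT':                           # precondition: |b| >= 1
            c.stack.append(c.buffer.pop(0))
        else:
            name, label = transition                        # precondition: |s| >= 2
            s1, s2 = c.stack[-1], c.stack[-2]
            if name == 'LEFT-ARC':                          # add arc s1 -> s2, remove s2
                c.arcs.append((s1, s2, label))
                del c.stack[-2]
            elif name == 'RIGHT-ARC':                       # add arc s2 -> s1, remove s1
                c.arcs.append((s2, s1, label))
                c.stack.pop()
        return c

With N_l arc labels, this yields exactly the |T| = 2N_l + 1 transitions enumerated above.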
The essential goal of a greedy parser is to predict a correct transition from T, based on one given configuration. Information that can be obtained from one configuration includes: (1) all the words and their corresponding POS tags (e.g., has / VBZ); (2) the head of a word and its label (e.g., nsubj, dobj) if applicable; (3) the position of a word on the stack/buffer, or whether it has already been removed from the stack.

[Figure 1: An example of transition-based dependency parsing. Above left: a desired dependency tree for the sentence "He has good control ." (arcs root, nsubj, amod, dobj, punct; POS tags PRP VBZ JJ NN .). Above right: an intermediate configuration with stack [ROOT has], buffer [good control .] and the arc nsubj(has, He); the correct transition is SHIFT. Bottom: the following transition sequence of the arc-standard system.]

Transition        Stack                     Buffer                    A
                  [ROOT]                    [He has good control .]   ∅
SHIFT             [ROOT He]                 [has good control .]
SHIFT             [ROOT He has]             [good control .]
LEFT-ARC(nsubj)   [ROOT has]                [good control .]          A ∪ nsubj(has,He)
SHIFT             [ROOT has good]           [control .]
SHIFT             [ROOT has good control]   [.]
LEFT-ARC(amod)    [ROOT has control]        [.]                       A ∪ amod(control,good)
RIGHT-ARC(dobj)   [ROOT has]                [.]                       A ∪ dobj(has,control)
...               ...                       ...                       ...
RIGHT-ARC(root)   [ROOT]                    []                        A ∪ root(ROOT,has)
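As a quick check, the transition sequence of Figure 1 can be replayed with the sketch from above (again purely illustrative):

    # Replaying the Figure 1 sequence for "He has good control ." (tokens 1..5).
    c = Configuration(['He', 'has', 'good', 'control', '.'])
    for t in ['SHIFT', 'SHIFT', ('LEFT-ARC', 'nsubj'), 'SHIFT', 'SHIFT',
              ('LEFT-ARC', 'amod'), ('RIGHT-ARC', 'dobj')]:
        apply(c, t)
    # c.arcs now holds nsubj(has, He), amod(control, good) and dobj(has, control),
    # matching the rows shown in the table above.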

Conventional approaches extract indicator features such as the conjunction of 1 ∼ 3 elements from the stack/buffer, using their words, POS tags or arc labels. Table 1 lists a typical set of feature templates chosen from the ones of (Huang et al., 2009; Zhang and Nivre, 2011).[2] These features suffer from the following problems:

[2] We exclude sophisticated features using labels, distance, valency and third-order features in this analysis, but we will include all of them in the final evaluation.

Single-word features (9)
s1.w; s1.t; s1.wt; s2.w; s2.t; s2.wt; b1.w; b1.t; b1.wt
Word-pair features (8)
s1.wt ◦ s2.wt; s1.wt ◦ s2.w; s1.wt ◦ s2.t; s1.w ◦ s2.wt; s1.t ◦ s2.wt; s1.w ◦ s2.w; s1.t ◦ s2.t; s1.t ◦ b1.t
Three-word features (8)
s2.t ◦ s1.t ◦ b1.t; s2.t ◦ s1.t ◦ lc1(s1).t; s2.t ◦ s1.t ◦ rc1(s1).t; s2.t ◦ s1.t ◦ lc1(s2).t; s2.t ◦ s1.t ◦ rc1(s2).t; s2.t ◦ s1.w ◦ rc1(s2).t; s2.t ◦ s1.w ◦ lc1(s1).t; s2.t ◦ s1.w ◦ b1.t

Table 1: The feature templates used for analysis. lc1(si) and rc1(si) denote the leftmost and rightmost children of si; w denotes word, t denotes POS tag.

• Sparsity. The features, especially lexicalized features, are highly sparse, and this is a common problem in many NLP tasks. The situation is severe in dependency parsing, because it depends critically on word-to-word interactions and thus on higher-order features. To give a better understanding, we perform a feature analysis using the features in Table 1 on the English Penn Treebank (CoNLL representations). The results given in Table 2 demonstrate that: (1) lexicalized features are indispensable; (2) not only are the word-pair features (especially of s1 and s2) vital for predictions, the three-word conjunctions (e.g., {s2, s1, b1}, {s2, lc1(s1), s1}) are also very important.

• Incompleteness. Incompleteness is an unavoidable issue in all existing feature templates, because, even with expertise and manual handling involved, they still do not include the conjunction of every useful word combination. For example, the conjunction of s1 and b2 is omitted in almost all commonly used feature templates; however, it could indicate that we cannot perform a RIGHT-ARC action if there is an arc from s1 to b2.

• Expensive feature computation. The feature generation of indicator features is generally expensive: we have to concatenate some words, POS tags, or arc labels to generate feature strings, and look them up in a huge table containing several millions of features. In our experiments, more than 95% of the time is consumed by feature computation during the parsing process.

Features                             UAS
All features in Table 1              88.0
single-word & word-pair features     82.7
only single-word features            76.9
excluding all lexicalized features   81.5

Table 2: Performance of different feature sets. UAS: unlabeled attachment score.

So far, we have discussed the preliminaries of transition-based dependency parsing and the existing problems of sparse indicator features. In the following sections, we will elaborate on our neural network model for learning dense features, along with experimental evaluations that prove its efficiency.

3 Neural Network Based Parser

In this section, we first present our neural network model and its main components. Later, we give details of training and of speeding up the parsing process.

3.1 Model

Figure 2 describes our neural network architecture. First, as usual word embeddings, we represent each word as a d-dimensional vector e^w_i ∈ R^d, and the full embedding matrix is E^w ∈ R^{d×N_w}, where N_w is the dictionary size. Meanwhile, we also map POS tags and arc labels to a d-dimensional vector space, where e^t_i, e^l_j ∈ R^d are the representations of the ith POS tag and the jth arc label. Correspondingly, the POS and label embedding matrices are E^t ∈ R^{d×N_t} and E^l ∈ R^{d×N_l}, where N_t and N_l are the number of distinct POS tags and arc labels.

We choose a set of elements based on the stack / buffer positions for each type of information (word, POS or label), which might be useful for our predictions. We denote the sets as S^w, S^t and S^l, respectively.

[Figure 2: Our neural network architecture. The configuration (stack, buffer, and partially built arcs) supplies word, POS tag and arc label inputs; the input layer is [x^w, x^t, x^l]; the hidden layer is h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3; the softmax layer is p = softmax(W_2 h).]

For example, given the configuration in Figure 2 and S^t = {lc1(s2).t, s2.t, rc1(s2).t, s1.t}, we will extract PRP, VBZ, NULL, JJ in order. Here we use a special token NULL to represent a non-existent element.

We build a standard neural network with one hidden layer, where the corresponding embeddings of our chosen elements from S^w, S^t, S^l will be added to the input layer. Denoting n_w, n_t, n_l as the number of chosen elements of each type, we add x^w = [e^w_{w_1}; e^w_{w_2}; ...; e^w_{w_{n_w}}] to the input layer, where S^w = {w_1, ..., w_{n_w}}. Similarly, we add the POS tag features x^t and arc label features x^l to the input layer.

We map the input layer to a hidden layer with d_h nodes through a cube activation function:

    h = (W_1^w x^w + W_1^t x^t + W_1^l x^l + b_1)^3

where W_1^w ∈ R^{d_h×(d·n_w)}, W_1^t ∈ R^{d_h×(d·n_t)}, W_1^l ∈ R^{d_h×(d·n_l)}, and b_1 ∈ R^{d_h} is the bias.

A softmax layer is finally added on top of the hidden layer for modeling multi-class probabilities p = softmax(W_2 h), where W_2 ∈ R^{|T|×d_h}.

POS and label embeddings

To the best of our knowledge, this is the first attempt to introduce POS tag and arc label embeddings instead of discrete representations.

Although the POS tags P = {NN, NNP, NNS, DT, JJ, ...} (for English) and arc labels L = {amod, tmod, nsubj, csubj, dobj, ...} (for Stanford Dependencies on English) are relatively small discrete sets, they still exhibit many semantic similarities like words. For example, NN (singular noun) should be closer to NNS (plural noun) than to DT (determiner), and amod (adjective modifier) should be closer to num (numeric modifier) than to nsubj (nominal subject). We expect these semantic meanings to be effectively captured by the dense representations.

Cube activation function

As stated above, we introduce a novel activation function in our model: cube, g(x) = x^3, instead of the commonly used tanh or sigmoid functions (Figure 3).

[Figure 3: Different activation functions used in neural networks: cube, sigmoid, tanh, and identity.]

Intuitively, every hidden unit is computed by a (non-linear) mapping on a weighted sum of input units plus a bias. Using g(x) = x^3 can directly model the product terms x_i x_j x_k for any three different elements at the input layer:

    g(w_1 x_1 + ... + w_m x_m + b) = Σ_{i,j,k} (w_i w_j w_k) x_i x_j x_k + b Σ_{i,j} (w_i w_j) x_i x_j + ...

In our case, x_i, x_j, x_k could come from different dimensions of three embeddings. We believe that this better captures the interaction of three elements, which is a highly desirable property for dependency parsing.
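To make the model concrete, here is a small NumPy sketch of the forward computation defined above (cube non-linearity followed by a softmax over the |T| transitions). The function and variable names are ours, and it assumes each embedding matrix E^w, E^t, E^l stores one column per word, tag or label. As an aside on the expansion abbreviated by "..." above: writing u = Σ_i w_i x_i, we have g(u + b) = u^3 + 3b·u^2 + 3b^2·u + b^3, so the bias also contributes pairwise and linear terms.

    import numpy as np

    def forward(ids_w, ids_t, ids_l, E_w, E_t, E_l, W1_w, W1_t, W1_l, b1, W2):
        """One forward pass for a single configuration (illustrative sketch).
        ids_*: indices of the chosen S^w, S^t, S^l elements; E_*: d x N embedding matrices."""
        # Concatenate the d-dimensional embeddings of the chosen elements into x^w, x^t, x^l.
        x_w = E_w[:, ids_w].T.reshape(-1)      # length d * n_w
        x_t = E_t[:, ids_t].T.reshape(-1)      # length d * n_t
        x_l = E_l[:, ids_l].T.reshape(-1)      # length d * n_l

        # Hidden layer with the cube activation.
        h = (W1_w @ x_w + W1_t @ x_t + W1_l @ x_l + b1) ** 3

        # Softmax over the |T| transitions (at test time the argmax of W2 @ h suffices).
        scores = W2 @ h
        scores = scores - scores.max()         # numerical stability
        p = np.exp(scores) / np.exp(scores).sum()
        return h, p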

Experimental results also verify the success of the cube activation function empirically (see more comparisons in Section 4). However, the expressive power of this activation function still remains to be investigated theoretically.

The choice of S^w, S^t, S^l

Following (Zhang and Nivre, 2011), we pick a rich set of elements for our final parser. In detail, S^w contains n_w = 18 elements: (1) the top 3 words on the stack and buffer: s1, s2, s3, b1, b2, b3; (2) the first and second leftmost / rightmost children of the top two words on the stack: lc1(si), rc1(si), lc2(si), rc2(si), i = 1, 2; (3) the leftmost of leftmost / rightmost of rightmost children of the top two words on the stack: lc1(lc1(si)), rc1(rc1(si)), i = 1, 2.

We use the corresponding POS tags for S^t (n_t = 18), and the corresponding arc labels of words excluding those 6 words on the stack/buffer for S^l (n_l = 12). A good advantage of our parser is that we can add a rich set of elements cheaply, instead of hand-crafting many more indicator features.

3.2 Training

We first generate training examples {(c_i, t_i)}_{i=1}^m from the training sentences and their gold parse trees using a "shortest stack" oracle which always prefers LEFT-ARC(l) over SHIFT, where c_i is a configuration and t_i ∈ T is the oracle transition.

The final training objective is to minimize the cross-entropy loss, plus an l2-regularization term:

    L(θ) = − Σ_i log p_{t_i} + (λ/2) ||θ||^2

where θ is the set of all parameters {W_1^w, W_1^t, W_1^l, b_1, W_2, E^w, E^t, E^l}. A slight variation is that, in practice, we compute the softmax probabilities only among the feasible transitions.

For initialization of parameters, we use pre-trained word embeddings to initialize E^w and use random initialization within (−0.01, 0.01) for E^t and E^l. Concretely, we use the pre-trained word embeddings from (Collobert et al., 2011) for English (#dictionary = 130,000, coverage = 72.7%), and our trained 50-dimensional word2vec embeddings (Mikolov et al., 2013) on the Wikipedia and Gigaword corpus for Chinese (#dictionary = 285,791, coverage = 79.0%). We will also compare with random initialization of E^w in Section 4. The training error derivatives will be back-propagated to these embeddings during the training process.

We use mini-batched AdaGrad (Duchi et al., 2011) for optimization and also apply dropout (Hinton et al., 2012) with a 0.5 rate. The parameters which achieve the best unlabeled attachment score on the development set will be chosen for final evaluation.

3.3 Parsing

We perform greedy decoding in parsing. At each step, we extract all the corresponding word, POS and label embeddings from the current configuration c, compute the hidden layer h(c) ∈ R^{d_h}, and pick the transition with the highest score: t = argmax_{t feasible} W_2(t, ·) h(c), and then execute c → t(c).

Compared with indicator features, our parser does not need to compute conjunction features and look them up in a huge feature table, and thus greatly reduces feature generation time. Instead, it involves many matrix addition and multiplication operations. To further speed up parsing, we apply a pre-computation trick, similar to (Devlin et al., 2014). For each position chosen from S^w, we pre-compute the matrix multiplications for the 10,000 most frequent words. Thus, computing the hidden layer only requires looking up the table for these frequent words and adding the resulting d_h-dimensional vectors. Similarly, we also pre-compute the matrix computations for all positions and all POS tags and arc labels. We only use this optimization in the neural network parser, since it is only feasible for a parser which, like ours, uses a small number of features. In practice, this pre-computation step increases the speed of our parser 8 ∼ 10 times.
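A rough illustration of this pre-computation trick is given below. It is a sketch under two assumptions that are ours, not the paper's: the weight and embedding layout matches the forward-pass sketch above, and the most frequent words occupy the first columns of E^w.

    import numpy as np

    def precompute_word_tables(W1_w, E_w, d, n_w, top_k=10000):
        """For each of the n_w word positions, cache W1^w[:, pos*d:(pos+1)*d] @ E_w[:, v]
        for the top_k most frequent words v (assumed to be the first columns of E_w)."""
        tables = []
        for pos in range(n_w):
            W_slice = W1_w[:, pos * d:(pos + 1) * d]     # (d_h, d) slice for this position
            tables.append(W_slice @ E_w[:, :top_k])      # (d_h, top_k) cached products
        return tables

    # At parse time, for a frequent word v at position pos, tables[pos][:, v] replaces a
    # matrix-vector product; rare words fall back to the multiplication. The same caching
    # can be done for all POS-tag and arc-label positions, since N_t and N_l are small.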
4 Experiments

4.1 Datasets

We conduct our experiments on the English Penn Treebank (PTB) and the Chinese Penn Treebank (CTB) datasets.

Dataset   #Train  #Dev   #Test  #words (N_w)  #POS (N_t)  #labels (N_l)  projective (%)
PTB: CD   39,832  1,700  2,416  44,352        45          17             99.4
PTB: SD   39,832  1,700  2,416  44,389        45          45             99.9
CTB       16,091  803    1,910  34,577        35          12             100.0

Table 3: Data statistics. "Projective" is the percentage of projective trees on the training set.

For English, we follow the standard splits of PTB3, using sections 2-21 for training, section 22 as development set and 23 as test set. We adopt two different dependency representations: CoNLL Syntactic Dependencies (CD) (Johansson and Nugues, 2007) using the LTH Constituent-to-Dependency Conversion Tool[3] and Stanford Basic Dependencies (SD) (de Marneffe et al., 2006) using the Stanford parser v3.3.0.[4] The POS tags are assigned using the Stanford POS tagger (Toutanova et al., 2003) with ten-way jackknifing of the training data (accuracy ≈ 97.3%).

[3] http://nlp.cs.lth.se/software/treebank converter/
[4] http://nlp.stanford.edu/software/lex-parser.shtml

For Chinese, we adopt the same split of CTB5 as described in (Zhang and Clark, 2008). Dependencies are converted using the Penn2Malt tool[5] with the head-finding rules of (Zhang and Clark, 2008). Following (Zhang and Clark, 2008; Zhang and Nivre, 2011), we use gold segmentation and POS tags for the input.

[5] http://stp.lingfil.uu.se/ nivre/research/Penn2Malt.html

Table 3 gives statistics of the three datasets.[6] In particular, over 99% of the trees are projective in all datasets.

[6] Pennconverter and Stanford dependencies generate slightly different tokenization, e.g., Pennconverter splits the token WCRS\/Boston NNP into three tokens WCRS NNP / CC Boston NNP.

4.2 Results

The following hyper-parameters are used in all experiments: embedding size d = 50, hidden layer size h = 200, regularization parameter λ = 10^−8, initial learning rate of AdaGrad α = 0.01.

To situate the performance of our parser, we first make a comparison with our own implementation of greedy arc-eager and arc-standard parsers. These parsers are trained with the structured averaged perceptron using the "early-update" strategy. The feature templates of (Zhang and Nivre, 2011) are used for the arc-eager system, and they are also adapted to the arc-standard system.[7]

[7] Since arc-standard is bottom-up, we remove all features using the head of stack elements, and also add the right child features of the first stack element.

Furthermore, we also compare our parser with two popular, off-the-shelf parsers: MaltParser, a greedy transition-based dependency parser (Nivre et al., 2006),[8] and MSTParser, a first-order graph-based parser (McDonald and Pereira, 2006).[9] In this comparison, for MaltParser, we select stackproj (arc-standard) and nivreeager (arc-eager) as parsing algorithms, and liblinear (Fan et al., 2008) for optimization.[10] For MSTParser, we use default options.

[8] http://www.maltparser.org/
[9] http://www.seas.upenn.edu/ strctlrn/MSTParser/MSTParser.html
[10] We do not compare with libsvm optimization, which is known to be slightly more accurate, but orders of magnitude slower (Kong and Smith, 2014).

On all datasets, we report unlabeled attachment scores (UAS) and labeled attachment scores (LAS); punctuation is excluded in all evaluation metrics.[11] Our parser and the baseline arc-standard and arc-eager parsers are all implemented in Java. The parsing speeds are measured on an Intel Core i7 2.7GHz CPU with 16GB RAM, and the runtime does not include pre-computation or parameter loading time.

[11] A token is a punctuation if its gold POS tag is {“ ” : , .} for English and PU for Chinese.

Table 4, Table 5 and Table 6 show the comparison of accuracy and parsing speed on PTB (CoNLL dependencies), PTB (Stanford dependencies) and CTB respectively.

Parser       Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard     89.9     88.7     89.7      88.3      51
eager        90.3     89.2     89.9      88.6      63
Malt:sp      90.0     88.8     89.9      88.5      560
Malt:eager   90.1     88.9     90.1      88.7      535
MSTParser    92.1     90.8     92.0      90.5      12
Our parser   92.2     91.0     92.0      90.7      1013

Table 4: Accuracy and parsing speed on PTB + CoNLL dependencies.

Clearly, our parser is superior in terms of both accuracy and speed. Compared with the baseline arc-eager and arc-standard parsers, our parser achieves around 2% improvement in UAS and LAS on all datasets, while running about 20 times faster.
Parser       Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard     90.2     87.8     89.4      87.3      26
eager        89.8     87.4     89.6      87.4      34
Malt:sp      89.8     87.2     89.3      86.9      469
Malt:eager   89.6     86.9     89.4      86.8      448
MSTParser    91.4     88.1     90.7      87.6      10
Our parser   92.0     89.7     91.8      89.6      654

Table 5: Accuracy and parsing speed on PTB + Stanford dependencies.

Parser       Dev UAS  Dev LAS  Test UAS  Test LAS  Speed (sent/s)
standard     82.4     80.9     82.7      81.2      72
eager        81.1     79.7     80.3      78.7      80
Malt:sp      82.4     80.5     82.4      80.6      420
Malt:eager   81.2     79.3     80.2      78.4      393
MSTParser    84.0     82.1     83.0      81.2      6
Our parser   84.0     82.4     83.9      82.4      936

Table 6: Accuracy and parsing speed on CTB.

It is worth noting that the efficiency of our parser even surpasses MaltParser using liblinear, which is known to be highly optimized, while our parser achieves much better accuracy.

Also, despite the fact that the graph-based MSTParser achieves a similar result to ours on PTB (CoNLL dependencies), our parser is nearly 100 times faster. In particular, our transition-based parser has a great advantage in LAS, especially for the fine-grained label set of Stanford dependencies.

4.3 Effects of Parser Components

Herein, we examine the components that account for the performance of our parser.

Cube activation function

We compare our cube activation function (x^3) with two widely used non-linear functions: tanh ((e^x − e^(−x))/(e^x + e^(−x))), sigmoid (1/(1 + e^(−x))), and also the identity function (x), as shown in Figure 4 (left).

In short, cube outperforms all other activation functions significantly, and identity works the worst. Concretely, cube can achieve 0.8% ∼ 1.2% improvement in UAS over tanh and the other functions, thus verifying the effectiveness of the cube activation function empirically.

Initialization of pre-trained word embeddings

We further analyze the influence of using pre-trained word embeddings for initialization. Figure 4 (middle) shows that using pre-trained word embeddings can obtain around 0.7% improvement on PTB and 1.7% improvement on CTB, compared with using random initialization within (−0.01, 0.01). On the one hand, the pre-trained word embeddings of Chinese appear more useful than those of English; on the other hand, our model is still able to achieve comparable accuracy without the help of pre-trained word embeddings.

POS tag and arc label embeddings

As shown in Figure 4 (right), POS embeddings yield around 1.7% improvement on PTB and nearly 10% improvement on CTB, and the label embeddings yield a much smaller 0.3% and 1.4% improvement respectively.

However, we can obtain little gain from label embeddings when the POS embeddings are present. This may be because the POS tags of two tokens already capture most of the label information between them.

4.4 Model Analysis

Last but not least, we will examine the parameters we have learned, and hope to investigate what these dense features capture. We use the weights learned from the English Penn Treebank using Stanford dependencies for analysis.

What do E^t, E^l capture?

We first introduced E^t and E^l as the dense representations of all POS tags and arc labels, and we wonder whether these embeddings could carry some semantic information.

Figure 5 presents t-SNE visualizations (van der Maaten and Hinton, 2008) of these embeddings. It clearly shows that these embeddings effectively exhibit the similarities between POS tags or arc labels. For instance, the three adjective POS tags JJ, JJR, JJS have very close embeddings, and also the three labels representing clausal complements, acomp, ccomp, xcomp, are grouped together.

Since these embeddings can effectively encode the semantic regularities, we believe that they can also be used as alternative features of POS tags (or arc labels) in other NLP tasks, and help boost the performance.
What do W_1^w, W_1^t, W_1^l capture?

Knowing that E^t and E^l (as well as the word embeddings E^w) can capture semantic information very well, we next hope to investigate what each feature in the hidden layer has really learned.

Since we currently only have h = 200 learned dense features, we wonder if they are sufficient to learn the word conjunctions used as sparse indicator features, or even more. We examine the weights W_1^w(k, ·) ∈ R^{d·n_w}, W_1^t(k, ·) ∈ R^{d·n_t}, W_1^l(k, ·) ∈ R^{d·n_l} for each hidden unit k, and reshape them to d × n_w, d × n_t, d × n_l matrices, such that the weights of each column correspond to the embeddings of one specific element (e.g., s1.t).

We pick the weights with absolute value > 0.2, and visualize them for each feature. Figure 6 gives the visualization of three sampled features, and it exhibits many interesting phenomena:

• Different features have varied distributions of the weights. However, most of the discriminative weights come from W_1^t (the middle zone in Figure 6), and this further justifies the importance of POS tags in dependency parsing.

• We carefully examine many of the h = 200 features, and find that they actually encode very different views of information. For the three sampled features in Figure 6, the largest weights are dominated by:

  – Feature 1: s1.t, s2.t, lc(s1).t.
  – Feature 2: rc(s1).t, s1.t, b1.t.
  – Feature 3: s1.t, s1.w, lc(s1).t, lc(s1).l.

These features all seem very plausible, as observed in the experiments on indicator feature systems. Thus our model is able to automatically identify the most useful information for predictions, instead of hand-crafting them as indicator features.

• More importantly, we can easily extract features regarding the conjunctions of more than 3 elements, and also those not present in the indicator feature systems. For example, the 3rd feature above captures the conjunction of the word and POS tag of s1, the tag of its leftmost child, and also the label between them, while this information is not encoded in the original feature templates of (Zhang and Nivre, 2011).

5 Related Work

There have been several lines of earlier work in using neural networks for parsing which have points of overlap but also major differences from our work here. One big difference is that much early work uses localist one-hot word representations rather than the distributed representations of modern work. Mayberry III and Miikkulainen (1999) explored a shift-reduce constituency parser with one-hot word representations and did subsequent parsing work in Mayberry III and Miikkulainen (2005).

Henderson (2004) was the first to attempt to use neural networks in a broad-coverage Penn Treebank parser, using a simple synchrony network to predict parse decisions in a constituency parser. More recently, Titov and Henderson (2007) applied Incremental Sigmoid Belief Networks to constituency parsing, and then Garg and Henderson (2011) extended this work to transition-based dependency parsers using a Temporal Restricted Boltzmann Machine. These are very different neural network architectures, and are much less scalable; in practice a restricted vocabulary was used to make the architecture practical.

There have been a number of recent uses of deep learning for constituency parsing (Collobert, 2011; Socher et al., 2013). Socher et al. (2014) have also built models over dependency representations, but this work has not attempted to learn neural networks for dependency parsing.

Most recently, Stenetorp (2013) attempted to build recursive neural networks for transition-based dependency parsing; however, the empirical performance of his model is still unsatisfactory.

6 Conclusion

We have presented a novel dependency parser using neural networks. Experimental evaluations show that our parser outperforms other greedy parsers using sparse indicator features in both accuracy and speed. This is achieved by representing all words, POS tags and arc labels as dense vectors, and modeling their interactions through a novel cube activation function. Our model only relies on dense features, and is able to automatically learn the most useful feature conjunctions for making predictions.

An interesting line of future work is to combine our neural network based classifier with search-based models to further improve accuracy. Also, there is still room for improvement in our architecture, such as better capturing word conjunctions, or adding richer features (e.g., distance, valency).
[Figure 4: Effects of different parser components (UAS on PTB:CD, PTB:SD and CTB). Left: comparison of different activation functions (cube, tanh, sigmoid, identity). Middle: comparison of pre-trained word vectors and random initialization. Right: effects of POS and label embeddings (word+POS+label, word+POS, word+label, word).]

[Figure 5: t-SNE visualization of POS and label embeddings.]

[Figure 6: Three sampled features. In each feature, each row denotes a dimension of embeddings and each column denotes a chosen element, e.g., s1.t or lc(s1).w, and the parameters are divided into 3 zones, corresponding to W_1^w(k, :) (left), W_1^t(k, :) (middle) and W_1^l(k, :) (right). White and black dots denote the most positive weights and most negative weights respectively.]

Acknowledgments

Stanford University gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040 and the Defense Threat Reduction Agency (DTRA) under Air Force Research Laboratory (AFRL) contract no. FA8650-10-C-7020. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government.

References

Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. In Coling.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research.

Ronan Collobert. 2011. Deep learning for efficient discriminative parsing. In AISTATS.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC.

Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In ACL.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research.

Nikhil Garg and James Henderson. 2011. Temporal restricted Boltzmann machines for dependency parsing. In ACL-HLT.

He He, Hal Daumé III, and Jason Eisner. 2013. Dynamic feature selection for dependency parsing. In EMNLP.

James Henderson. 2004. Discriminative training of a neural network statistical parser. In ACL.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

Liang Huang, Wenbin Jiang, and Qun Liu. 2009. Bilingually-constrained (monolingual) shift-reduce parsing. In EMNLP.

Richard Johansson and Pierre Nugues. 2007. Extended constituent-to-dependency conversion for English. In Proceedings of NODALIDA, Tartu, Estonia.

Lingpeng Kong and Noah A. Smith. 2014. An empirical comparison of parsing methods for Stanford dependencies. CoRR, abs/1404.4314.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In ACL.

Sandra Kübler, Ryan McDonald, and Joakim Nivre. 2009. Dependency Parsing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool.

Marshall R. Mayberry III and Risto Miikkulainen. 1999. SARDSRN: A neural network shift-reduce parser. In IJCAI.

Marshall R. Mayberry III and Risto Miikkulainen. 2005. Broad-coverage parsing with neural networks. Neural Processing Letters.

Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In EACL.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. MaltParser: A data-driven parser-generator for dependency parsing. In LREC.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In ACL.

Richard Socher, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. TACL.

Pontus Stenetorp. 2013. Transition-based dependency parsing using recursive neural networks. In NIPS Workshop on Deep Learning.

Ivan Titov and James Henderson. 2007. Fast and robust multilingual dependency parsing with a generative latent variable model. In EMNLP-CoNLL.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. The Journal of Machine Learning Research.

Yue Zhang and Stephen Clark. 2008. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing using beam-search. In EMNLP.

Yue Zhang and Joakim Nivre. 2011. Transition-based dependency parsing with rich non-local features. In ACL.

