
Improved Semantic Representations From
Tree-Structured Long Short-Term Memory Networks

Kai Sheng Tai, Richard Socher*, Christopher D. Manning
Computer Science Department, Stanford University; *MetaMind Inc.

arXiv:1503.00075v3 [cs.CL] 30 May 2015

Abstract

Because of their superior ability to preserve sequence information over time, Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have obtained strong results on a variety of sequence modeling tasks. The only underlying LSTM structure that has been explored so far is a linear chain. However, natural language exhibits syntactic properties that would naturally combine words to phrases. We introduce the Tree-LSTM, a generalization of LSTMs to tree-structured network topologies. Tree-LSTMs outperform all existing systems and strong LSTM baselines on two tasks: predicting the semantic relatedness of two sentences (SemEval 2014, Task 1) and sentiment classification (Stanford Sentiment Treebank).

[Figure 1: Top: A chain-structured LSTM network. Bottom: A tree-structured LSTM network with arbitrary branching factor.]

1 Introduction

Most models for distributed representations of phrases and sentences—that is, models where real-valued vectors are used to represent meaning—fall into one of three classes: bag-of-words models, sequence models, and tree-structured models. In bag-of-words models, phrase and sentence representations are independent of word order; for example, they can be generated by averaging constituent word representations (Landauer and Dumais, 1997; Foltz et al., 1998). In contrast, sequence models construct sentence representations as an order-sensitive function of the sequence of tokens (Elman, 1990; Mikolov, 2012). Lastly, tree-structured models compose each phrase and sentence representation from its constituent subphrases according to a given syntactic structure over the sentence (Goller and Kuchler, 1996; Socher et al., 2011).

Order-insensitive models are insufficient to fully capture the semantics of natural language due to their inability to account for differences in meaning as a result of differences in word order or syntactic structure (e.g., "cats climb trees" vs. "trees climb cats"). We therefore turn to order-sensitive sequential or tree-structured models. In particular, tree-structured models are a linguistically attractive option due to their relation to syntactic interpretations of sentence structure. A natural question, then, is the following: to what extent (if at all) can we do better with tree-structured models as opposed to sequential models for sentence representation? In this paper, we work towards addressing this question by directly comparing a type of sequential model that has recently been used to achieve state-of-the-art results in several NLP tasks against its tree-structured generalization.

Due to their capability for processing arbitrary-length sequences, recurrent neural networks

(RNNs) are a natural choice for sequence modeling tasks. Recently, RNNs with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997) have re-emerged as a popular architecture due to their representational power and effectiveness at capturing long-term dependencies. LSTM networks, which we review in Sec. 2, have been successfully applied to a variety of sequence modeling and prediction tasks, notably machine translation (Bahdanau et al., 2014; Sutskever et al., 2014), speech recognition (Graves et al., 2013), image caption generation (Vinyals et al., 2014), and program execution (Zaremba and Sutskever, 2014).

In this paper, we introduce a generalization of the standard LSTM architecture to tree-structured network topologies and show its superiority for representing sentence meaning over a sequential LSTM. While the standard LSTM composes its hidden state from the input at the current time step and the hidden state of the LSTM unit in the previous time step, the tree-structured LSTM, or Tree-LSTM, composes its state from an input vector and the hidden states of arbitrarily many child units. The standard LSTM can then be considered a special case of the Tree-LSTM where each internal node has exactly one child.

In our evaluations, we demonstrate the empirical strength of Tree-LSTMs as models for representing sentences. We evaluate the Tree-LSTM architecture on two tasks: semantic relatedness prediction on sentence pairs and sentiment classification of sentences drawn from movie reviews. Our experiments show that Tree-LSTMs outperform existing systems and sequential LSTM baselines on both tasks. Implementations of our models and experiments are available at https://github.com/stanfordnlp/treelstm.

2 Long Short-Term Memory Networks

2.1 Overview

Recurrent neural networks (RNNs) are able to process input sequences of arbitrary length via the recursive application of a transition function on a hidden state vector h_t. At each time step t, the hidden state h_t is a function of the input vector x_t that the network receives at time t and its previous hidden state h_{t-1}. For example, the input vector x_t could be a vector representation of the t-th word in the body of text (Elman, 1990; Mikolov, 2012). The hidden state h_t ∈ R^d can be interpreted as a d-dimensional distributed representation of the sequence of tokens observed up to time t.

Commonly, the RNN transition function is an affine transformation followed by a pointwise nonlinearity such as the hyperbolic tangent function:

    h_t = tanh(W x_t + U h_{t-1} + b).    (1)

Unfortunately, a problem with RNNs with transition functions of this form is that during training, components of the gradient vector can grow or decay exponentially over long sequences (Hochreiter, 1998; Bengio et al., 1994). This problem with exploding or vanishing gradients makes it difficult for the RNN model to learn long-distance correlations in a sequence.

The LSTM architecture (Hochreiter and Schmidhuber, 1997) addresses this problem of learning long-term dependencies by introducing a memory cell that is able to preserve state over long periods of time. While numerous LSTM variants have been described, here we describe the version used by Zaremba and Sutskever (2014).

We define the LSTM unit at each time step t to be a collection of vectors in R^d: an input gate i_t, a forget gate f_t, an output gate o_t, a memory cell c_t and a hidden state h_t. The entries of the gating vectors i_t, f_t and o_t are in [0, 1]. We refer to d as the memory dimension of the LSTM.

The LSTM transition equations are the following:

    i_t = σ(W^(i) x_t + U^(i) h_{t-1} + b^(i)),
    f_t = σ(W^(f) x_t + U^(f) h_{t-1} + b^(f)),
    o_t = σ(W^(o) x_t + U^(o) h_{t-1} + b^(o)),
    u_t = tanh(W^(u) x_t + U^(u) h_{t-1} + b^(u)),
    c_t = i_t ⊙ u_t + f_t ⊙ c_{t-1},
    h_t = o_t ⊙ tanh(c_t),

where x_t is the input at the current time step, σ denotes the logistic sigmoid function and ⊙ denotes elementwise multiplication. Intuitively, the forget gate controls the extent to which the previous memory cell is forgotten, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state. The hidden state vector in an LSTM unit is therefore a gated, partial view of the state of the unit's internal memory cell. Since the value of the gating variables vary for each vector element, the model can learn to represent information over multiple time scales.
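The transition equations above can be sketched directly in NumPy. This is a minimal illustration, not the paper's released implementation; the parameter-dictionary keys (`W_i`, `U_i`, `b_i`, …) are our own naming convention, and no training is shown.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, the gating nonlinearity sigma."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM transition. `p` maps names like 'W_i', 'U_i', 'b_i'
    to the weight matrices and biases of each gate."""
    i = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])   # input gate
    f = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])   # forget gate
    o = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])   # output gate
    u = np.tanh(p["W_u"] @ x_t + p["U_u"] @ h_prev + p["b_u"])   # candidate update
    c = i * u + f * c_prev       # memory cell: gated update plus gated carry-over
    h = o * np.tanh(c)           # hidden state: gated exposure of the cell
    return h, c
```

To process a sequence, one would iterate `lstm_step` over the token vectors starting from zero vectors for h and c.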

2.2 Variants

Two commonly-used variants of the basic LSTM architecture are the Bidirectional LSTM and the Multilayer LSTM (also known as the stacked or deep LSTM).

Bidirectional LSTM. A Bidirectional LSTM (Graves et al., 2013) consists of two LSTMs that are run in parallel: one on the input sequence and the other on the reverse of the input sequence. At each time step, the hidden state of the Bidirectional LSTM is the concatenation of the forward and backward hidden states. This setup allows the hidden state to capture both past and future information.

Multilayer LSTM. In Multilayer LSTM architectures, the hidden state of an LSTM unit in layer ℓ is used as input to the LSTM unit in layer ℓ+1 in the same time step (Graves et al., 2013; Sutskever et al., 2014; Zaremba and Sutskever, 2014). Here, the idea is to let the higher layers capture longer-term dependencies of the input sequence.

These two variants can be combined as a Multilayer Bidirectional LSTM (Graves et al., 2013).

3 Tree-Structured LSTMs

A limitation of the LSTM architectures described in the previous section is that they only allow for strictly sequential information propagation. Here, we propose two natural extensions to the basic LSTM architecture: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. Both variants allow for richer network topologies where each LSTM unit is able to incorporate information from multiple child units.

As in standard LSTM units, each Tree-LSTM unit (indexed by j) contains input and output gates i_j and o_j, a memory cell c_j and hidden state h_j. The difference between the standard LSTM unit and Tree-LSTM units is that gating vectors and memory cell updates are dependent on the states of possibly many child units. Additionally, instead of a single forget gate, the Tree-LSTM unit contains one forget gate f_jk for each child k. This allows the Tree-LSTM unit to selectively incorporate information from each child. For example, a Tree-LSTM model can learn to emphasize semantic heads in a semantic relatedness task, or it can learn to preserve the representation of sentiment-rich children for sentiment classification.

[Figure 2: Composing the memory cell c_1 and hidden state h_1 of a Tree-LSTM unit with two children (subscripts 2 and 3). Labeled edges correspond to gating by the indicated gating vector, with dependencies omitted for compactness.]

As with the standard LSTM, each Tree-LSTM unit takes an input vector x_j. In our applications, each x_j is a vector representation of a word in a sentence. The input word at each node depends on the tree structure used for the network. For instance, in a Tree-LSTM over a dependency tree, each node in the tree takes the vector corresponding to the head word as input, whereas in a Tree-LSTM over a constituency tree, the leaf nodes take the corresponding word vectors as input.

3.1 Child-Sum Tree-LSTMs

Given a tree, let C(j) denote the set of children of node j. The Child-Sum Tree-LSTM transition equations are the following:

    h̃_j = Σ_{k∈C(j)} h_k,    (2)
    i_j = σ(W^(i) x_j + U^(i) h̃_j + b^(i)),    (3)
    f_jk = σ(W^(f) x_j + U^(f) h_k + b^(f)),    (4)
    o_j = σ(W^(o) x_j + U^(o) h̃_j + b^(o)),    (5)
    u_j = tanh(W^(u) x_j + U^(u) h̃_j + b^(u)),    (6)
    c_j = i_j ⊙ u_j + Σ_{k∈C(j)} f_jk ⊙ c_k,    (7)
    h_j = o_j ⊙ tanh(c_j),    (8)

where in Eq. 4, k ∈ C(j). Intuitively, we can interpret each parameter matrix in these equations as encoding correlations between the component vectors of the Tree-LSTM unit, the input x_j, and the hidden states h_k of the unit's children.
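Eqs. 2–8 can be sketched as a bottom-up recursion over the tree. This is again a minimal NumPy illustration under assumed conventions (a node is a dict with an input vector `'x'` and a list of `'children'`; the parameter keys are ours), not the released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_tree_lstm(node, p):
    """Evaluate Eqs. 2-8 bottom-up and return (h_j, c_j) for `node`."""
    # Recurse into the children first; leaves have an empty child list.
    states = [child_sum_tree_lstm(k, p) for k in node["children"]]
    # Eq. 2: sum the child hidden states.
    h_tilde = sum((h for h, _ in states), np.zeros_like(p["b_i"]))
    i = sigmoid(p["W_i"] @ node["x"] + p["U_i"] @ h_tilde + p["b_i"])   # Eq. 3
    o = sigmoid(p["W_o"] @ node["x"] + p["U_o"] @ h_tilde + p["b_o"])   # Eq. 5
    u = np.tanh(p["W_u"] @ node["x"] + p["U_u"] @ h_tilde + p["b_u"])   # Eq. 6
    # Eqs. 4 and 7: one forget gate per child, applied to that child's cell.
    c = i * u
    for h_k, c_k in states:
        f_k = sigmoid(p["W_f"] @ node["x"] + p["U_f"] @ h_k + p["b_f"])
        c = c + f_k * c_k
    h = o * np.tanh(c)   # Eq. 8
    return h, c
```

Note that because each forget gate f_jk is conditioned on its own child's hidden state h_k (rather than on h̃_j), the unit can gate each child's memory cell independently.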
For example, the model can learn parameters W^(i) such that the components of the input gate i_j have values close to 1 (i.e., "open") when a semantically important content word (such as a verb) is given as input, and values close to 0 (i.e., "closed") when the input is a relatively unimportant word (such as a determiner).

Dependency Tree-LSTMs. Since the Child-Sum Tree-LSTM unit conditions its components on the sum of child hidden states h_k, it is well-suited for trees with high branching factor or whose children are unordered. For example, it is a good choice for dependency trees, where the number of dependents of a head can be highly variable. We refer to a Child-Sum Tree-LSTM applied to a dependency tree as a Dependency Tree-LSTM.

3.2 N-ary Tree-LSTMs

The N-ary Tree-LSTM can be used on tree structures where the branching factor is at most N and where children are ordered, i.e., they can be indexed from 1 to N. For any node j, write the hidden state and memory cell of its kth child as h_jk and c_jk respectively. The N-ary Tree-LSTM transition equations are the following:

    i_j = σ(W^(i) x_j + Σ_{ℓ=1}^{N} U_ℓ^(i) h_jℓ + b^(i)),    (9)
    f_jk = σ(W^(f) x_j + Σ_{ℓ=1}^{N} U_kℓ^(f) h_jℓ + b^(f)),    (10)
    o_j = σ(W^(o) x_j + Σ_{ℓ=1}^{N} U_ℓ^(o) h_jℓ + b^(o)),    (11)
    u_j = tanh(W^(u) x_j + Σ_{ℓ=1}^{N} U_ℓ^(u) h_jℓ + b^(u)),    (12)
    c_j = i_j ⊙ u_j + Σ_{ℓ=1}^{N} f_jℓ ⊙ c_jℓ,    (13)
    h_j = o_j ⊙ tanh(c_j),    (14)

where in Eq. 10, k = 1, 2, …, N. The introduction of separate parameter matrices for each child k allows the N-ary Tree-LSTM model to learn more fine-grained conditioning on the states of a unit's children than the Child-Sum Tree-LSTM. Consider, for example, a constituency tree application where the left child of a node corresponds to a noun phrase, and the right child to a verb phrase. Suppose that in this case it is advantageous to emphasize the verb phrase in the representation. Then the U_kℓ parameters can be trained such that the components of f_j1 are close to 0 (i.e., "forget"), while the components of f_j2 are close to 1 (i.e., "preserve").

Forget gate parameterization. In Eq. 10, we define a parameterization of the kth child's forget gate f_jk that contains "off-diagonal" parameter matrices U_kℓ^(f), k ≠ ℓ. This parameterization allows for more flexible control of information propagation from child to parent. For example, this allows the left hidden state in a binary tree to have either an excitatory or inhibitory effect on the forget gate of the right child. However, for large values of N, these additional parameters are impractical and may be tied or fixed to zero.

Constituency Tree-LSTMs. We can naturally apply Binary Tree-LSTM units to binarized constituency trees since left and right child nodes are distinguished. We refer to this application of Binary Tree-LSTMs as a Constituency Tree-LSTM. Note that in Constituency Tree-LSTMs, a node j receives an input vector x_j only if it is a leaf node.

Note that when the tree is simply a chain, both Eqs. 2–8 and Eqs. 9–14 reduce to the standard LSTM transitions.

4 Models

We now describe two specific models that apply the Tree-LSTM architectures described in the previous section. In the remainder of this paper, we focus on the special cases of Dependency Tree-LSTMs and Constituency Tree-LSTMs. These architectures are in fact closely related; since we consider only binarized constituency trees, the parameterizations of the two models are very similar. The key difference is in the application of the compositional parameters: dependent vs. head for Dependency Tree-LSTMs, and left child vs. right child for Constituency Tree-LSTMs.

4.1 Tree-LSTM Classification

In this task, we wish to predict labels ŷ from a discrete set of classes Y for some subset of nodes in a tree.

At each node j. The classifier close to the gold rating y ∈ [1. tions.  yˆj = arg max pˆθ (y | {x}j ) . K]: yˆ = rT pˆθ ≈ y. i = byc  0 otherwise  The cost function is the negative log-likelihood of the true class labels y (k) at each labeled node: for 1 ≤ i ≤ K. we use a softmax classifier to We want the expected rating under the predicted predict the label yˆj given the inputs {x}j observed distribution pˆθ given model parameters θ to be at nodes in the subtree rooted at j.parse tree could correspond to some property of comparison of the signs of the input representa- the phrase spanned by that node. y − byc. takes the hidden state hj at the node as input: We therefore define a sparse target distribution1 p   that satisfies y = rT p: pˆθ (y | {x}j ) = softmax W (s) hj + b(s) . The cost function is the regular- m ized KL-divergence between p and pˆθ : 1 X  .  i = byc + 1 y pi = byc − y + 1.

 λ J(θ) = − log pˆθ y (k) .

{x}(k) + kθk22 . .

4. the semantic relatedness of sentence pairs. K} is some ordinal scale of similarity. The sequence sampled from movie reviews and (2) predicting {1. We use considers both the distance and angle between the the Stanford Sentiment Treebank (Socher et al.. tial LSTMs. . we wish to predict a We evaluate our Tree-LSTM architectures on two real-valued similarity score in some range [1. hidden states2 . where higher scores indicate greater degrees of In comparing our Tree-LSTMs against sequen- similarity. pair (hL .2 Semantic Relatedness of Sentence Pairs 5 Experiments Given a sentence pair. There are two subtasks: binary classifica- tion of sentences. . superscript k indicates the kth sentence pair. Details for each model variant are We first produce sentence representations hL summarized in Table 1.1 Sentiment Classification Given these sentence representations. . and λ is an L2 regularization hyperpa. we predict In this task. and we allow real-valued scores to ac. we predict the sentiment of sen- the similarity score yˆ using a neural network that tences sampled from movie reviews. and hR for each sentence in the pair using a Tree-LSTM model over each sentence’s parse tree. 5. where m is the number of labeled nodes in the m 2 k=1 training set. K]. tasks: (1) sentiment classification of sentences where K > 1 is an integer. hR ): 2013). and fine-grained classification h× = hL . 2. . where m is the number of training pairs and the rameter. m 2 m k=1 1 X  (k)  λ J(θ) = KL p(k) pˆθ + kθk22 . we control for the number of LSTM count for ground-truth ratings that are an average parameters by varying the dimensionality of the over the evaluations of several human annotators. the superscript k indicates the kth la- beled node.

for the fine-grained classification subtask (there are fewer examples for the binary subtask since yˆ = rT pˆθ . hR . . the parameters of the for- distance measures h× and h+ is empirically mo. we found that optimizing where rT = [1 2 . The use of both 2 For our Bidirectional LSTMs. this achieved superior performance to Bidirec- tional LSTMs with untied weights and the same number of the use of either measure alone. positive. function is applied elementwise. tral. dard train/dev/test splits of 6920/872/1821 for the   binary classification subtask and 8544/1101/2210 pˆθ = softmax W (p) hs + b(p) . and very positive. In our tivated: we find that the combination outperforms experiments. ward and backward transition functions are shared. . (15) over five classes: very negative. We use the stan-   hs = σ W (×) h× + W (+) h+ + b(h) . K] and the absolute value this objective yielded better performance than a mean squared error objective. negative. 1 In the subsequent experiments. The multiplicative parameters (and therefore smaller hidden vector dimension- measure h× can be interpreted as an elementwise ality). neu- h+ = |hL − hR |. .

word represen- We use the Sentences Involving Composi- tations were held fixed as we did not observe any tional Knowledge (SICK) dataset (Marelli et al.3) 86. The sentences Our models were trained using AdaGrad (Duchi are derived from existing image and video descrip- et al.7 (0.400 168 315. we produce 5. Each label is the average of 10 ratings as- 2012) with a dropout rate of 0.0) 87. related.2. The model parameters were a relatedness score y ∈ [1. classification. serve performance gains using dropout on the se- Here.1) and Constituency Tree-LSTMs Constituency Tree-LSTM and Dependency Tree- (Sec. 2013) 45.5) – Glove vectors. Constituency Tree-LSTM – randomly initialized vectors 43.8) neutral sentences are excluded). of the similarity of the two sentences in meaning.3 (0. 2014) 47..5) 2-layer LSTM 46. Standard bina. Network Dependency Parser (Chen and Manning.9 Standard 150 203.. significant improvement when the representations 2014). For the similarity prediction network 4 (Eqs.6 Dependency Tree 150 203. Binary: positive/negative senti- tial LSTM models are trained on the spans corre.0 (0.4) ant that we evaluate.4 82.stanford.1 Constituency Tree 142 205. For our experiments. Each sentence pair is annotated with minibatch size of 25. Dependency Tree-LSTM 48. timent Treebank. 2003). the semantic relat- classification task. 2011) with a learning rate of 0.1 (1.0 (0.7 85. 4.1 with both Dependency Tree-LSTMs dency parses of the sentences in the dataset for our (Sec. we report For the sequential LSTM baselines.800 DRNN (Irsoy and Cardie.0 (0. consisting of 9927 sentence pairs in a were tuned..400 168 315. We initialized our word representations using 5. 3 5 Dependency parses produced by the Stanford Neural Trained on 840 billion tokens of Common Crawl data.7 (0..0 87. – Glove vectors.8 2-layer 108 203. http://nlp. 4. word representations were up- edness task is to predict a human-generated rating dated during training with a learning rate of 0.5 86. 
we use the similarity model described in mantic relatedness task.6) 82. Relatedness Sentiment Method Fine-grained Binary RAE (Socher et al.4 88.05 and a tion datasets.2 Semantic Relatedness publicly available 300-dimensional Glove vec- tors5 (Pennington et al. The sequen. 2014) 49.400 168 315.4) 87.472 120 318. fixed 49. 2014) 48.8 Paragraph-Vec (Le and Mikolov.8 86. The Constituency Tree-LSTMs are LSTM models.6) Bidirectional LSTM 49. with 1 indicating regularized with a per-minibatch L2 regularization that the two sentences are completely unrelated. 2014) 48.720 CNN-multichannel (Kim.4 (0. 2014). 2013) 44. For the sentiment For a given pair of sentences.3) rized constituency parse trees are provided for each sentence in the dataset.7 87.. tuned 51.5 (1.0) 88.0) function parameter counts |θ| for each LSTM vari.5 (0.2 (1. The sentiment classifier was ad- and 5 indicating that the two sentences are very ditionally regularized using dropout (Hinton et al.3 Hyperparameters and Training Details dependency parses3 of each sentence.2).720 CNN-non-static (Kim. .. We did not ob- signed by different human annotators. We Constituency parses produced by the Stanford PCFG Parser (Klein and Manning. 4500/500/4927 train/dev/test split. ment classification.4 LSTM Variant d |θ| d |θ| MV-RNN (Socher et al.6) Table 1: Memory dimensions d and composition 2-layer Bidirectional LSTM 48. sponding to labeled nodes in the training set. strength of 10−4 . we predict mean accuracies over 5 runs (standard deviations the sentiment of a phrase using the representation in parentheses). matches a labeled span in the training set. 2014) 48.472 120 318. Sec.2 Bidirectional 2-layer 108 203.4) 85.190 150 316..4 (1.0 (1.1. For the semantic relatedness task.4 Bidirectional 150 203. 3. each node The hyperparameters for our models were tuned in a tree is given a sentiment label if its span on the development set for each task.9 (0. 15) we use a hidden layer of size 50.1) 84. 
For the Dependency Tree-LSTMs. 3. 2013) 43. 2014). and each node in Table 2: Test set accuracies on the Stanford Sen- these trees is annotated with a sentiment label.840 LSTM 46.5 (0.9 (0.2 82.840 RNTN (Socher et al. structured according to the provided parse trees.5.840 DCNN (Blunsom et al. Fine-grained: 5-class sentiment given by the final LSTM hidden state. 5]. We use the classification model described in produce binarized constituency parses4 and depen- Sec.
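The similarity predictor of Eq. 15 and the sparse target distribution can be sketched as follows. This is a NumPy illustration only: the parameter names (`W_mul`, `W_abs`, …) are ours, the sentence vectors are taken as given, and no training loop is shown.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def predict_similarity(h_L, h_R, p, K=5):
    """Eq. 15: combine angle and distance features of the two sentence
    vectors, then read off the expected rating r^T p_hat in [1, K]."""
    h_mul = h_L * h_R            # elementwise product: sign/angle comparison
    h_abs = np.abs(h_L - h_R)    # elementwise absolute distance
    h_s = sigmoid(p["W_mul"] @ h_mul + p["W_abs"] @ h_abs + p["b_h"])
    p_hat = softmax(p["W_p"] @ h_s + p["b_p"])
    r = np.arange(1, K + 1)
    return r @ p_hat

def sparse_target(y, K=5):
    """Sparse target distribution p with r^T p = y, used in the KL objective.
    Mass is split between the classes floor(y) and floor(y) + 1."""
    p = np.zeros(K)
    lo = int(np.floor(y))
    if lo == y:
        p[lo - 1] = 1.0          # integer rating: all mass on one class
    else:
        p[lo - 1] = lo - y + 1   # weight on class floor(y)
        p[lo] = y - np.floor(y)  # weight on class floor(y) + 1
    return p
```

For example, a gold rating y = 3.6 yields p = [0, 0, 0.4, 0.6, 0], whose expected rating under r = [1, 2, 3, 4, 5] is again 3.6.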

Following task. The art on the binary subtask. This difference RNN that uses a separate transformation for each is due to (1) the dependency representations con.2 Semantic Relatedness 6 We list the strongest results we were able to find for this Our results are summarized in Table 3.2831 (0.0042) 0. 2014) 0. 2014) 0. trained to capture sentiment. rived from WordNet or the Paraphrase Database ize our word representations were not originally (Ganitkevitch et al. (Zhao et al. We also compare against four of the top- responding span in the training data.0108) Dependency Tree-LSTM 0. Our results are summarized in Table 2. (3) Sequential LSTMs.. 319K earity.0027) 0. We compare our models against a number of stituency Tree-LSTM outperforms existing sys.2736 (0. the representations of the constituent words. The first two metrics are measures of correlation against human evaluations of semantic 6. (2014).2532 (0. 2013). (this finding is consistent with previous work on generally using a combination of surface form this task by Kim (2014)). Our LSTM models outperform all these sys- 6.3550 Meaning Factory (Bjerva et al. For our experiments.2.0074) LSTM 0.2762 (0. these results are stronger than the official performance by the team on the shared task.. 2014) 0.7923 (0.3848 (0..0063) 2-layer LSTM 0.0031) 0. 2014)..7538 0.0090) DT-RNN (Socher et al. including the LSTM models.7966 (0.2838 (0.0052) Table 3: Test set results on the SICK semantic relatedness subtask.8676 (0. . 2014).0088) 0. This 2014) both compose vector representations for the performance gap is at least partially attributable to nodes in a dependency tree as a sum over affine- the fact that the Dependency Tree-LSTM is trained transformed child vectors. For each of our baselines.7489 0. match about 9% of the dependency nodes to a cor. 6 Results tion metrics.3822 (0.8070 0.0066) 0.7993 0.7896 (0.. we use the similarity stituency representations. non-LSTM baselines. The Con.0059) 0.7319 (0.8083 (0.0053) 0.0028) 0. 
2014). 2014) 0. Method Pearson’s r Spearman’s ρ MSE Illinois-LH (Lai and Hockenmaier. we find that DT-RNN and SDT-RNN models (Socher et al.3224 ECNU (Zhao et al.8280.2734 (0.8515 (0. the listed result by Zhao et al.8414 – – Mean vectors 0.0071) 0. These gains are to be overlap features and lexical distance features de- expected since the Glove vectors used to initial. Results are grouped as follows: (1) SemEval 2014 submissions. 4. we report mean scores over 5 runs (standard deviations in parentheses). 2014) 0. The Meaning Factory (Bjerva bedding) yields a significant boost in performance et al. we use Pearson’s r.7577 (0. 2014) 0. Marelli et al.0030) 0.6738 (0.0076) 0.0070) 0.0150) 2-layer Bidirectional LSTM 0.8558 (0. 2014 semantic relatedness shared task: ECNU tions during training (“fine-tuning” the word em. followed by a nonlin- on less data: about 150K labeled nodes vs.7900 (0. For example.4557 (0.3692 UNAL-NLP (Jimenez et al.0042) 0. a minor gain on the binary classification subtask These systems are heavily feature engineered.0014) 0.1 Sentiment Classification relatedness.. taining fewer nodes than the corresponding con. The mean vector baseline tems on the fine-grained classification subtask and computes sentence representations as a mean of achieves accuracy comparable to the state-of-the. In particular.7911 (0. 2014). dependency relation.8582 (0.0013) 0. in some cases.0038) 0. (2) Our own baselines.. (2014) is stronger than their man’s ρ and mean squared error (MSE) as evalua. performing systems6 submitted to the SemEval We found that updating the word representa.7965 (0.7304 (0.. submitted system’s Pearson correlation score of 0.0020) Constituency Tree-LSTM 0.8268 0. (4) Tree-structured LSTMs. UNAL-NLP (Jimenez et al.7966 (0. on the fine-grained classification subtask and gives and Illinois-LH (Lai and Hockenmaier.7721 0. 
The SDT-RNN is an extension of the DT- for the Constituency Tree-LSTM..0137) SDT-RNN (Socher et al..0053) 0.0092) Bidirectional LSTM 0.8567 (0. Spear.0018) 0.8528 (0. it outperforms the Dependency Tree-LSTM. and (2) the inability to model described in Sec.

.65 0. the sentence representations that they compose. Given with the best results achieved by the Dependency the query “two men are playing guitar”. front of a crowd” (note as well that there is zero cation task where supervision was also provided token overlap between the two phrases). We conjecture that in 7.82 DT-LSTM 0. 2010. et al. Figure 4: Pearson correlations r between pre- curacy vs. in contrast to the sentiment classifi. tionally. Examples of the length distribution are batched in the final in the tail of the length distribution are batched in window (` = 45).40 CT-LSTM CT-LSTM LSTM 0. tial counterparts on the relatedness task for tors for each sentence. `+2]. the Dependency Tree-LSTM benefits from its more compact structure relative to the One hypothesis to explain the empirical strength Constituency Tree-LSTM.50 0. Addi. Each data point is a mean score In Table 4. Collobert et al.84 r 0. In Figs. shorter sentences. 2013. related phrase “dancing and singing in of the tree.1 Modeling Semantic Relatedness specific metric. ` + 2]. longer sentences of length 13 to 15 (Fig. 2011. 0... we would ex- LSTM. 1988.80 LSTM 0. the Tree- Tree-LSTM.55 accuracy 0. We compare the neighbors ranked by the We observe that while the Dependency Tree- Dependency Tree-LSTM model against a baseline LSTM does significantly outperform its sequen- ranking by cosine similarity of the mean word vec. which indicates that the Tree- LSTM is able to both preserve and emphasize in. and error bars have been omitted for trieved from a 1000-sentence sample of the SICK clarity.35 Bi-LSTM Bi-LSTM 0. over 5 runs. 2012. For each `. 7 Discussion and Qualitative Analysis we show the relationship between sentence length and performance as measured by the relevant task- 7.86 0. we plot r for the pairs with the window [` − 2. both Tree. Distributed representations of words (Rumelhart formation from relatively distant nodes. 3 and 4. 
7 Discussion and Qualitative Analysis

7.1 Modeling Semantic Relatedness

In Table 4, we list nearest-neighbor sentences retrieved from a 1000-sentence sample of the SICK test set. We compare the neighbors ranked by the Dependency Tree-LSTM model against a baseline ranking by cosine similarity of the mean word vectors for each sentence.

The Dependency Tree-LSTM model exhibits several desirable properties. Note that in the dependency parse of the second query sentence, the word "ocean" is the second-furthest word from the root ("waving"), with a depth of 4. Regardless, the retrieved sentences are all semantically related to the word "ocean", which indicates that the Tree-LSTM is able to both preserve and emphasize information from relatively distant nodes. Additionally, the Tree-LSTM model shows greater robustness to differences in sentence length. Given the query "two men are playing guitar", the Tree-LSTM associates the phrase "playing guitar" with the longer, related phrase "dancing and singing in front of a crowd" (note as well that there is zero token overlap between the two phrases).

7.2 Effect of Sentence Length

One hypothesis to explain the empirical strength of Tree-LSTMs is that tree structures help mitigate the problem of preserving state over long sequences of words. If this were true, we would expect to see the greatest improvement over sequential LSTMs on longer sentences. In Figs. 3 and 4, we show the relationship between sentence length and performance as measured by the relevant task-specific metric.

[Figure 3: Fine-grained sentiment classification accuracy vs. sentence length. For each ℓ, we plot accuracy for the test set sentences with length in the window [ℓ−2, ℓ+2]. Examples in the tail of the length distribution are batched in the final window (ℓ = 45).]

[Figure 4: Pearson correlations r between predicted similarities and gold ratings vs. sentence length. For each ℓ, we plot r for the pairs with mean length in the window [ℓ−2, ℓ+2]. Examples in the tail of the length distribution are batched in the final window (ℓ = 18.5). Each data point is a mean score over 5 runs, and error bars have been omitted for clarity.]

We observe that while the Dependency Tree-LSTM does significantly outperform its sequential counterparts on the relatedness task for longer sentences of length 13 to 15 (Fig. 4), it also achieves consistently strong performance on shorter sentences. This suggests that unlike sequential LSTMs, Tree-LSTMs are able to encode semantically useful structural information in the sentence representations that they compose. Recall that in this task, in contrast to the sentiment classification task where supervision was also provided at the intermediate nodes, the Tree-LSTM models only receive supervision at the root of the tree. We conjecture that in this setting the Dependency Tree-LSTM benefits from its more compact structure relative to the Constituency Tree-LSTM, in the sense that paths from input word vectors to the root of the tree are shorter on aggregate for the Dependency Tree-LSTM.

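The mean word vector baseline used for the comparison above can be sketched as follows; the embeddings dictionary stands in for the Glove lookup table, and the sentence splitting is deliberately naive.

```python
import numpy as np

def mean_vector(tokens, embeddings):
    """Sentence representation as the mean of its word vectors."""
    return np.mean([embeddings[t] for t in tokens], axis=0)

def cosine(u, v):
    """Cosine similarity of two vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def rank_neighbors(query, corpus, embeddings):
    """Rank corpus sentences by cosine similarity of mean vectors to the query,
    most similar first."""
    q = mean_vector(query.split(), embeddings)
    scored = [(cosine(q, mean_vector(s.split(), embeddings)), s) for s in corpus]
    return sorted(scored, reverse=True)
```

Because this baseline is order-insensitive and purely lexical, it tends to favor surface token overlap, whereas the Tree-LSTM ranking can surface paraphrases with little or no overlap.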
8 Related Work

Distributed representations of words (Turian et al., 2010; Mikolov et al., 2013; Pennington et al., 2014; Huang et al., 2012) have found wide applicability in a variety of NLP tasks. Following this success, there has been substantial interest in the area of learning distributed phrase and sentence representations (Mitchell and Lapata, 2010; Yessenalina and Cardie, 2011; Grefenstette et al., 2013; Mikolov et al., 2013), as well as distributed representations of longer bodies of text such as paragraphs and documents (Srivastava et al., 2013; Le and Mikolov, 2014).

Our approach builds on recursive neural networks (Goller and Kuchler, 1996; Socher et al., 2011), which we abbreviate as Tree-RNNs in order to avoid confusion with recurrent neural networks. Under the Tree-RNN framework, the vector representation associated with each node of a tree is composed as a function of the vectors corresponding to the children of the node. The choice of composition function gives rise to numerous variants of this basic framework. Tree-RNNs have been used to parse images of natural scenes (Socher et al., 2011), compose phrase representations from word vectors (Socher et al., 2012), and classify the sentiment polarity of sentences (Socher et al., 2013).

9 Conclusion

In this paper, we introduced a generalization of LSTMs to tree-structured network topologies. The Tree-LSTM architecture can be applied to trees with arbitrary branching factor. We demonstrated the effectiveness of the Tree-LSTM by applying the architecture in two tasks: predicting semantic relatedness and sentiment classification, outperforming existing systems on both. Controlling for model dimensionality, we demonstrated that Tree-LSTM models are able to outperform their sequential counterparts. Our results suggest further lines of work in characterizing the role of structure in producing distributed representations of sentences.

Acknowledgements

We thank our anonymous reviewers for their valuable feedback. Stanford University gratefully acknowledges the support of a Natural Language Understanding-focused gift from Google Inc. and the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040. Any opinions, findings, and conclusion or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the DARPA, AFRL, or the US government.
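The Tree-RNN composition described in Section 8 — each node's vector computed as a function of its children's vectors — can be sketched as follows. This is an illustrative variant, not the paper's model: it sums the child vectors before a single affine-plus-tanh layer so that it handles arbitrary branching factors, whereas the standard binary Tree-RNN concatenates its two child vectors; all names here are ours.

```python
import numpy as np

def compose(children_vecs, W, b):
    """One possible composition function: sum the child vectors,
    then apply a single affine transform followed by tanh."""
    return np.tanh(W @ np.sum(children_vecs, axis=0) + b)

def tree_rnn(node, embeddings, W, b):
    """Bottom-up evaluation: leaves (strings) map to word vectors;
    each internal node's vector is a function of its children's vectors."""
    if isinstance(node, str):  # leaf: a word
        return embeddings[node]
    child_vecs = [tree_rnn(child, embeddings, W, b) for child in node]
    return compose(child_vecs, W, b)
```

Swapping in a different `compose` (e.g. matrix-vector products per child, or the gated Tree-LSTM cell) yields the variants discussed in the related work.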

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.

Bjerva, Johannes, Johan Bos, Rob van der Goot, and Malvina Nissim. 2014. The Meaning Factory: Formal semantics for recognizing textual entailment and determining semantic similarity. SemEval 2014.

Chen, Danqi and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 740–750.

Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research 12:2493–2537.

Duchi, John, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research 12:2121–2159.

Elman, Jeffrey L. 1990. Finding structure in time. Cognitive Science 14(2):179–211.

Foltz, Peter W., Walter Kintsch, and Thomas K. Landauer. 1998. The measurement of textual coherence with latent semantic analysis. Discourse Processes 25(2-3):285–307.

Ganitkevitch, Juri, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The Paraphrase Database. In HLT-NAACL. pages 758–764.

Goller, Christoph and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In IEEE International Conference on Neural Networks.

Graves, Alex, Navdeep Jaitly, and A.-R. Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). pages 273–278.

Grefenstette, Edward, Georgiana Dinu, Yao-Zhong Zhang, Mehrnoosh Sadrzadeh, and Marco Baroni. 2013. Multi-step regression learning for compositional distributional semantics. arXiv preprint arXiv:1301.6939.

Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Hochreiter, Sepp. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(02):107–116.

Hochreiter, Sepp and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9(8):1735–1780.

Huang, Eric H., Richard Socher, Christopher D. Manning, and Andrew Y. Ng. 2012. Improving word representations via global context and multiple word prototypes. In Annual Meeting of the Association for Computational Linguistics (ACL).

Irsoy, Ozan and Claire Cardie. 2014. Deep recursive neural networks for compositionality in language. In Advances in Neural Information Processing Systems. pages 2096–2104.

Jimenez, Sergio, George Duenas, Julia Baquero, Alexander Gelbukh, Av Juan Dios Bátiz, and Av Mendizábal. 2014. UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. SemEval 2014.

Kalchbrenner, Nal, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics.

Kim, Yoon. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Klein, Dan and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1. pages 423–430.

Lai, Alice and Julia Hockenmaier. 2014. Illinois-LH: A denotational and distributional approach to semantics. SemEval 2014.

Landauer, Thomas K. and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review 104(2):211.

Le, Quoc V. and Tomas Mikolov. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Marelli, Marco, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. SemEval 2014.

Mikolov, Tomáš. 2012. Statistical Language Models Based on Neural Networks. Ph.D. thesis, Brno University of Technology.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. pages 3111–3119.

Mitchell, Jeff and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science 34(8):1388–1429.

Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014) 12.

Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. 1988. Learning representations by back-propagating errors. Cognitive Modeling 5.

Socher, Richard, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. pages 1201–1211.

Socher, Richard, Andrej Karpathy, Quoc V. Le, Christopher D. Manning, and Andrew Y. Ng. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2:207–218.

Socher, Richard, Cliff C. Lin, Chris Manning, and Andrew Y. Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). pages 129–136.

Socher, Richard, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Srivastava, Nitish, Ruslan R. Salakhutdinov, and Geoffrey E. Hinton. 2013. Modeling documents with deep Boltzmann machines. arXiv preprint arXiv:1309.6865.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. pages 3104–3112.

Turian, Joseph, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. pages 384–394.

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2014. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555.

Yessenalina, Ainur and Claire Cardie. 2011. Compositional matrix-space models for sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. pages 172–182.

Zaremba, Wojciech and Ilya Sutskever. 2014. Learning to execute. arXiv preprint arXiv:1410.4615.

Zhao, Jiang, Tian Tian Zhu, and Man Lan. 2014. ECNU: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. SemEval 2014.