
1 Measuring Similarity with TreeLSTM and Dependency Relations

1.1 Background & Literature Survey


A limitation of the LSTM architecture described in Section ?? is that, although it takes the ordering of the text into account, it is strictly sequential: it cannot consider the underlying grammatical structure of a sentence. To address this issue, Tai et al. [1] proposed a tree-structured LSTM, or tree-LSTM, that generalizes the LSTM architecture to tree-structured network topologies. The main difference between a standard LSTM and a tree-LSTM is that a node in a tree-LSTM computes its state from the hidden states of its children, whereas in a standard LSTM the current step depends on the entire previous sequence. In [1], the tree structure can be constructed from a dependency tree as well as a constituency tree [2]; the former variant is known as the child-sum tree-LSTM and the latter as the N-ary tree-LSTM. In a child-sum tree-LSTM, each node in the tree takes the vector corresponding to the head word as input, whereas in an N-ary tree-LSTM, the leaf nodes take the corresponding word vectors as input. We only consider the child-sum tree-LSTM and, for the rest of the report, refer to it simply as tree-LSTM.
Many variants of tree-structured LSTMs have been proposed and applied to numerous language processing tasks such as natural language inference [3], relation extraction [4], image captioning [5], and machine translation [6]. Multiplicative tree-structured LSTMs [7] investigate the use of Abstract Meaning Representation in tree-structured models. Zhou et al. [8] applied attention over the tree topology for semantic similarity, paraphrase identification, and true-false question selection tasks. To the best of our knowledge, no existing model also uses the dependency relation types in a tree-LSTM, so we propose a method to incorporate this information into our model.

Figure 1: An example of a sentence with its dependency parse

1.2 Model Architecture


Consider a sentence pair X = (X_1, X_2), where X_1 = (x_1^{(1)}, x_2^{(1)}, \ldots, x_n^{(1)}) and X_2 = (x_1^{(2)}, x_2^{(2)}, \ldots, x_m^{(2)}); n and m denote the number of words in the two sentences, respectively. Our aim is to predict their relatedness score in the range [1, 5]. For this, we have created models that are based on the tree-LSTM and that also incorporate the dependency analysis of the sentences and co-attention between the sentences to enhance their representations.

Tree-LSTM. In the tree-LSTM, the tree structure of a given sentence is constructed from its dependency tree [9]. However, it does not consider the edge types, i.e., the dependency relation between two nodes. For example, Figure 1 depicts the dependency parse tree of the sentence “the pilot is making an announcement”. We use the edges between the words to create the tree structure and then apply an LSTM over this structure to learn the sentence representation. This is shown in Figure 2a, where h_1, h_2, \ldots, h_n are the hidden states of a sentence of length n, corresponding to the words x_1^{(1)}, x_2^{(1)}, \ldots, x_n^{(1)}. Here, the root is “making” and it has three children, “pilot”, “is”, and “announcement”, so its representation depends on the representations of these three nodes.

Figure 2: Visualization of the tree-LSTM and dep-tree-LSTM. (a) Each word’s hidden state depends on the word and its children. (b) The dependency relation of a word with its parent is shown beside its hidden state.

The final hidden representation of a sentence is the representation of the root node of that sentence; in the given example it is h_4, i.e., the representation of “making”. Suppose C(j) denotes the set of children of node j. Then \tilde{h}_j is the sum of the hidden states of the children of node j:

\tilde{h}_j = \sum_{k \in C(j)} h_k \qquad (1)
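As a concrete illustration of how the children sets C(j) and the edge labels can be obtained from a parse, the following is a minimal sketch assuming spaCy as the dependency parser; the report only specifies the Stanford dependency scheme [9], so the toolkit choice here is our assumption.

# Minimal sketch: extracting children sets C(j) and edge labels from a dependency parse.
# spaCy is an assumption; the report itself only names the Stanford dependency scheme [9].
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("the pilot is making an announcement")

children = {tok.i: [] for tok in doc}   # node index -> C(j), indices of its children
relation = {}                           # child index -> dependency relation to its parent
root = None
for tok in doc:
    if tok.head is tok:                 # spaCy marks the root as its own head
        root = tok.i
    else:
        children[tok.head.i].append(tok.i)
        relation[tok.i] = tok.dep_      # e.g. "nsubj", "aux", "dobj", "det"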

This modified hidden state \tilde{h}_j is fed to the input gate (i), the output gate (o), and the intermediate cell state (u):

i_j = \sigma(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)})
o_j = \sigma(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}) \qquad (2)
u_j = \tanh(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)})

A tree-LSTM node has k forget gates (f), where k is the number of children of the target node, and each forget gate is computed as follows:

f_{jk} = \sigma(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}), \quad k \in C(j) \qquad (3)

The final cell state (c) of a node depends on the forget gates, the intermediate cell state, and the input:

c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k \qquad (4)

Finally, the hidden state of a node is calculated using Equation (5):

h_j = o_j \odot \tanh(c_j) \qquad (5)
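To make Equations (1)-(5) concrete, the following is a minimal PyTorch sketch of one child-sum tree-LSTM node update; module and variable names are our own, and this is a sketch rather than the authors' implementation.

# Minimal PyTorch sketch of one child-sum tree-LSTM node update (Equations 1-5).
import torch
import torch.nn as nn

class ChildSumTreeLSTMCell(nn.Module):
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        # W* act on the input word vector x_j, U* on the (summed) child hidden states.
        self.W_iou = nn.Linear(in_dim, 3 * hid_dim)
        self.U_iou = nn.Linear(hid_dim, 3 * hid_dim, bias=False)
        self.W_f = nn.Linear(in_dim, hid_dim)
        self.U_f = nn.Linear(hid_dim, hid_dim, bias=False)

    def forward(self, x_j, child_h, child_c):
        # child_h, child_c: (num_children, hid_dim); empty tensors for leaf nodes.
        h_tilde = child_h.sum(dim=0)                                   # Eq. (1)
        i, o, u = self.W_iou(x_j).add(self.U_iou(h_tilde)).chunk(3, dim=-1)
        i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)    # Eq. (2)
        f = torch.sigmoid(self.W_f(x_j) + self.U_f(child_h))           # Eq. (3), one gate per child
        c = i * u + (f * child_c).sum(dim=0)                           # Eq. (4)
        h = o * torch.tanh(c)                                          # Eq. (5)
        return h, c

In use, the cell is applied recursively from the leaves to the root, passing each node's (h, c) up to its parent.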

Tree-LSTM with dependency relation. Figure 2b shows the dependency relation of each word with its parent. The edge types from Figure 1 are also used to learn the sentence representation. To incorporate the dependency relation type of a node with its parent, we change Equation (1) as follows, where D_{r(k,j)} is the matrix that represents the relation type r between the k-th child and node j:

\tilde{h}_j = \sum_{k \in C(j)} D_{r(k,j)} h_k \qquad (6)

Here, we introduce a matrix for every dependency relation that occurs in the dependency tree underlying the tree-LSTM. These matrices represent a linear transform on the hidden state representation of a child node, which is then composed at the parent node. This allows the learnt representations to depend on the relations. There are 52 relation types in the Stanford Universal Dependency Parser [9], of which only 37 occur in the SICK dataset. In the first model we consider all 37 matrices separately. We will refer to this model as dep-tree-LSTM.

Figure 3: Frequency distribution of dependency relations in the SICK dataset (x-axis: frequency of the relation; y-axis: dependency relation)
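A minimal sketch of how the relation matrices D_r in Equation (6) could be realized, assuming one learnable hid_dim x hid_dim matrix per relation type; the class name, shapes, and initialization are our own assumptions.

# Sketch of the relation-aware child aggregation in Equation (6).
import torch
import torch.nn as nn

class RelationAggregation(nn.Module):
    def __init__(self, hid_dim, relation_types):
        super().__init__()
        self.rel2idx = {r: i for i, r in enumerate(relation_types)}
        # D: (num_relations, hid_dim, hid_dim), one transform D_r per relation type.
        self.D = nn.Parameter(torch.randn(len(relation_types), hid_dim, hid_dim) * 0.01)

    def forward(self, child_h, child_rels):
        # child_h: (num_children, hid_dim); child_rels: one relation label per child.
        idx = torch.tensor([self.rel2idx[r] for r in child_rels])
        transformed = torch.einsum('nij,nj->ni', self.D[idx], child_h)  # D_{r(k,j)} h_k
        return transformed.sum(dim=0)                                   # Eq. (6)

In dep-tree-LSTM, the output of this aggregation would replace the plain child sum of Equation (1) inside the tree-LSTM cell; in rand-tree-LSTM, the same matrices would simply stay at their random initialization instead of being trained.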

Tree-LSTM with merged dependency. In dep-tree-LSTM, there are 37 learnable matrices, corresponding to the different relations present in the dataset. To reduce this number of matrices, we look at the frequency of the dependency relation types as well as the hierarchy of dependency types [9]. Figure 3 shows the frequency distribution of the top 25 relations (excluding det) in the SICK dataset. Some relation types, e.g., det, nsubj, dobj, and amod, occur very frequently, while others, such as iobj, tmod, and advcl, are rarely seen in the dataset. The hierarchy of dependency types defines the grammatical relations among the different relation types. Since we want to capture the interaction of grammatical relations in a sentence while keeping the number of grammatical relations, i.e., dependency types, small, it is reasonable to use the dependency hierarchy rather than the frequency of relation types. With the help of the dependency hierarchy, we empirically merge similar dependency relations into a single relation type. We do not combine prep and det with any other relations and merge the remaining relations in the following manner (a compact mapping is sketched after the list):

• aux, auxpass, cop as aux
• acomp, ccomp, xcomp, pcomp as comp
• dobj, iobj, pobj as obj
• nsubj, nsubjpass, csubj, csubjpass as subj
• advcl, npadvmod, rcmod, vmod, advmod, amod as mod
• the rest of the relation types are grouped as other.
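For concreteness, this merging is just a lookup table; the sketch below restates the list above in code (the helper name is ours).

# Sketch of the relation-merging map used by m-dep-tree-LSTM; "other" is the catch-all bucket.
MERGE_MAP = {
    "aux": "aux", "auxpass": "aux", "cop": "aux",
    "acomp": "comp", "ccomp": "comp", "xcomp": "comp", "pcomp": "comp",
    "dobj": "obj", "iobj": "obj", "pobj": "obj",
    "nsubj": "subj", "nsubjpass": "subj", "csubj": "subj", "csubjpass": "subj",
    "advcl": "mod", "npadvmod": "mod", "rcmod": "mod",
    "vmod": "mod", "advmod": "mod", "amod": "mod",
    "prep": "prep", "det": "det",
}

def merge_relation(rel: str) -> str:
    """Map a raw dependency label to one of the 8 merged relation types."""
    return MERGE_MAP.get(rel, "other")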

Method Pearson’s r Spearman’s ρ MSE
tree-LSTM 0.8649 (0.0045) 0.8058 (0.0040) 0.2588 (0.0091)
rand-tree-LSTM 0.8589 (0.0037) 0.7989 (0.0051) 0.2691 (0.0069)
dep-tree-LSTM 0.8614 (0.0056) 0.8016 (0.0085) 0.2662 (0.0111)
m-dep-tree-LSTM 0.8667 (0.0015) 0.8077 (0.0021) 0.2545 (0.0036)

Table 1: Test set results on the SICK dataset. We report mean scores over 5 runs with standard deviations in parentheses.

We now have 8 relation matrices to represent the dependency types. We will refer to this model as m-dep-tree-LSTM.

rand-tree-LSTM. In dep-tree-LSTM, all the relation matrices are learnable. To see whether the model can be simulated with random matrices, we use a slightly different training strategy. Instead of training the dependency relation matrices, we randomly initialize them at the beginning of training and use them unchanged in the subsequent linear transformations. We will refer to this model as rand-tree-LSTM.

Semantic relatedness of sentence pairs. Given a sentence pair, our goal is to predict a real-valued similarity score in the range [1, K], where K > 1 and K ∈ Z. Suppose h_L and h_R are the final representations of the sentence pair produced by one of the models described above. We then use a neural network to predict the similarity score ŷ, as proposed in [1]:

h_\times = h_L \odot h_R
h_+ = |h_L - h_R|
h_s = \sigma(W^{(\times)} h_\times + W^{(+)} h_+ + b^{(h)}) \qquad (7)
\hat{p}_\theta = \mathrm{softmax}(W^{(p)} h_s + b^{(p)})
\hat{y} = r^T \hat{p}_\theta, \quad r^T = [1, 2, \ldots, K]
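A minimal sketch of this similarity head (Equation (7)); the layer names and sizes are assumptions.

# Sketch of the similarity head in Equation (7).
import torch
import torch.nn as nn

class SimilarityHead(nn.Module):
    def __init__(self, hid_dim, mlp_dim, K=5):
        super().__init__()
        self.Wx = nn.Linear(hid_dim, mlp_dim, bias=False)
        self.Wp = nn.Linear(hid_dim, mlp_dim)          # carries the bias b^(h)
        self.out = nn.Linear(mlp_dim, K)
        self.register_buffer("r", torch.arange(1, K + 1, dtype=torch.float))

    def forward(self, h_L, h_R):
        h_times = h_L * h_R                            # element-wise product
        h_plus = torch.abs(h_L - h_R)
        h_s = torch.sigmoid(self.Wx(h_times) + self.Wp(h_plus))
        p_hat = torch.softmax(self.out(h_s), dim=-1)   # distribution over scores 1..K
        y_hat = (self.r * p_hat).sum(dim=-1)           # expected score ŷ = r^T p̂
        return p_hat, y_hat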

Since we want to model the similarity score as a distribution \hat{p}_\theta, we redefine the target score y as a distribution p such that y = r^T p:

p_i = \begin{cases} y - \lfloor y \rfloor & i = \lfloor y \rfloor + 1 \\ \lfloor y \rfloor - y + 1 & i = \lfloor y \rfloor \\ 0 & \text{otherwise} \end{cases} \qquad (8)

for 1 \le i \le K. The loss is then computed using the KL-divergence as follows, where \theta denotes the model parameters, m is the number of training sentence pairs, k indexes the k-th sentence pair, and \lambda is an L2 regularization hyperparameter:

J(\theta) = \frac{1}{m} \sum_{k=1}^{m} \mathrm{KL}\!\left(p^{(k)} \,\Big\|\, \hat{p}_\theta^{(k)}\right) + \frac{\lambda}{2} \|\theta\|_2^2 \qquad (9)
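A sketch of the score-to-distribution conversion (Equation (8)) and the KL term of the loss (Equation (9)); how the λ term is applied, here left to the optimizer's weight decay, is our assumption.

# Sketch: gold score -> target distribution (Eq. 8) and batched KL loss (Eq. 9).
import math
import torch

def score_to_distribution(y: float, K: int = 5) -> torch.Tensor:
    """Place the real-valued score y in [1, K] on the two nearest integer bins."""
    p = torch.zeros(K)
    floor_y = math.floor(y)
    if floor_y == K:                        # y == K exactly: all mass on the last bin
        p[K - 1] = 1.0
    else:
        p[floor_y - 1] = floor_y - y + 1    # bin i = floor(y)   (1-indexed)
        p[floor_y] = y - floor_y            # bin i = floor(y)+1
    return p

def kl_loss(p: torch.Tensor, p_hat: torch.Tensor) -> torch.Tensor:
    """KL(p || p_hat) averaged over the batch; p and p_hat have shape (batch, K).
    The (lambda/2)*||theta||^2 term is assumed to be handled via weight decay."""
    eps = 1e-8
    kl = (p * (p.clamp_min(eps).log() - p_hat.clamp_min(eps).log())).sum(dim=-1)
    return kl.mean()

For example, score_to_distribution(4.3) gives p = [0, 0, 0, 0.7, 0.3], so r^T p = 4 * 0.7 + 5 * 0.3 = 4.3, recovering the gold score.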

1.3 Experiment
1.3.1 Dataset

The SICK [10] dataset consists of 9927 sentence pairs with a 4500/500/4927 train/dev/test split. Each
pair is annotated with a relatedness label y ∈ [1, 5] corresponding to the average relatedness assigned by 10 different human annotators, where 1 indicates that the two sentences are totally different and 5 indicates that the two sentences are very similar.

Figure 4: Pearson correlation r between predicted similarities and gold ratings vs. mean sentence length, for tree-LSTM, m-dep-tree-LSTM, and rand-tree-LSTM

1.3.2 Training

We use 300-dimensional GloVe [11] word embeddings to initialize the word representations. In the tree-LSTM, the inputs are projected into 150-dimensional vectors h_L and h_R. The size of the hidden layer in Equation (7) is 50. Since we apply a linear layer over each node of the tree-LSTM to incorporate the dependency relations, taking its dimension as 150 × 150 in dep-tree-LSTM and m-dep-tree-LSTM would produce a large number of sparse matrices that are difficult to learn. We therefore reduce their dimension to 50 × 50 and the dimension of the hidden layer in Equation (7) to 20. Our models were trained using the AdaGrad [12] optimizer with a learning rate of 0.05 and a mini-batch size of 25. The model parameters were regularized with a per-minibatch L2 regularization strength of 10^{-4}. Each model is trained for 20 epochs, and the model with the best results on the validation set is used to evaluate test performance.
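A sketch of the optimizer setup described above, assuming PyTorch and treating the L2 strength as weight decay; the placeholder model stands in for any of the tree-LSTM variants.

# Optimizer setup as described above (AdaGrad, lr 0.05, per-minibatch L2 of 1e-4).
import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # placeholder for any of the tree-LSTM variants
optimizer = torch.optim.Adagrad(
    model.parameters(),
    lr=0.05,
    weight_decay=1e-4,      # assuming the L2 strength is applied as weight decay
)
# Training: 20 epochs with mini-batches of 25 sentence pairs; the checkpoint with the
# best validation score is used for testing.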

1.3.3 Results

Table 1 summarizes the results of the different models; we report the mean score over five runs. Our baseline is the tree-LSTM [1]. Our proposed models are dep-tree-LSTM and m-dep-tree-LSTM, where the former uses all the dependency relations in the dataset and the latter uses merged, semantically related relations. The evaluation metrics are the Pearson correlation coefficient r, Spearman's rank correlation coefficient ρ, and the Mean Squared Error (MSE). The results of tree-LSTM and m-dep-tree-LSTM are close, but the model performs rather poorly when we randomly initialize the dependency relation matrices in rand-tree-LSTM.

Type of sentences                  Number of          Prediction closer to gold
                                   sentence pairs       M      T    Same
Low similarity (< 1.5)                  311             73    234      4
High similarity (4.8 − 5.0)             331             79    239     13
Negation in same context                476            242    212     22
Negation in any context                1016            479    493     44
Non-restrictive relative clause         134             69     63      2

Table 2: Quantitative analysis of the SICK test dataset. M and T denote m-dep-tree-LSTM and tree-LSTM, respectively.

1.4 Discussion
Table 2 shows a quantitative analysis of different kinds of sentence pairs. For sentences with a high similarity score (4.8 − 5.0), tree-LSTM performs significantly better than m-dep-tree-LSTM. In this range, the sentence pairs are either literally the same sentence repeated twice, an active sentence paired with its passive counterpart, a sentence with some nouns or verbs replaced by similar ones, or a rephrased sentence; there are 331 such pairs. Of these, m-dep-tree-LSTM was closer to the gold score than tree-LSTM in only 79 pairs. Similarly, for low similarity scores (< 1.5), tree-LSTM performed better than m-dep-tree-LSTM, although both models' similarity scores differ substantially from the gold scores. Only 311 sentence pairs have a gold rating below 1.5, and m-dep-tree-LSTM performed better on 73 of them. In contrast, m-dep-tree-LSTM was better at predicting the semantic relatedness of sentences with relative clauses. For sentence pairs that are negations of each other, i.e., one sentence contains words like “no” or “not”, m-dep-tree-LSTM also shows better performance. Table 3 displays similarity scores for some sentence pairs belonging to the various cases described above.
Figure 4 shows the relationship between the Pearson score r and the mean sentence length. For each length l, we plot r for the pairs with mean length in the window [l − 2, l + 2]; examples in the tail of the length distribution are batched into the final window. Short sentences have very high correlation scores, and as the mean sentence length increases we see a decrease in r. Tree-LSTM and m-dep-tree-LSTM show nearly identical behaviour, whereas rand-tree-LSTM has a lower correlation score.
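The length-window analysis behind Figure 4 can be reproduced with a short sketch like the following, assuming numpy/scipy and arrays pred, gold, and mean_len over the test pairs (all names are ours).

# Sketch of the binning behind Figure 4: Pearson r over mean-sentence-length windows.
import numpy as np
from scipy.stats import pearsonr

def r_by_length(pred, gold, mean_len, centers=range(4, 19, 2), width=2):
    """pred, gold, mean_len: 1-D numpy arrays over the test sentence pairs."""
    rs = []
    for l in centers:
        mask = (mean_len >= l - width) & (mean_len <= l + width)
        if l == max(centers):                 # batch the tail into the final window
            mask = mean_len >= l - width
        rs.append(pearsonr(pred[mask], gold[mask])[0])
    return list(centers), rs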

1.5 Conclusion
In this work, we have proposed m-dep-tree-LSTM, a method that incorporates dependency relations into the tree-LSTM. We hypothesized that adding information about relation types would improve the model's ability to measure the semantic relatedness of sentences, but our experiments show only a slight improvement on the sentence similarity task.

References
[1] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations
from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.

Sentence 1                                  Sentence 2                                  Gold   Tree    M      D      R
A man is cutting a paper plate              A paper plate is being cut by the man       5.0    4.83   4.81   4.8    4.78
The puppy is playing with a plastic         The puppy is playing with a container       5.0    4.75   4.73   4.7    4.68
container                                   made of plastic
A girl is playing a guitar                  A woman is dancing near a fire              1.3    1.63   2.09   2.22   2.13
A man is riding a scooter                   A man is not riding a scooter               3.0    3.54   3.47   3.52   3.47
A snowboarder is grinding down a long       There is no snowboarder grinding down       4.4    3.69   4.05   3.77   3.93
concrete rail                               a long concrete rail
A person, who is riding a bike, is          One man is wearing a black helmet and       3.3    4.07   3.99   4.14   4.35
wearing gear which is black                 pushing a bicycle

Table 3: Similarity scores of sentence pairs from the SICK test data. Tree / M / D / R denote the relatedness predicted by tree-LSTM / m-dep-tree-LSTM / dep-tree-LSTM / rand-tree-LSTM.

[2] Dan Klein and Christopher D. Manning. Corpus-based induction of syntactic structure: Models
of dependency and constituency. In Proceedings of the 42nd Annual Meeting of the Association
for Computational Linguistics, ACL ’04, Barcelona, Spain, 2004. Association for Computational
Linguistics.

[3] Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM
for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, 2017.

[4] Makoto Miwa and Mohit Bansal. End-to-end relation extraction using LSTMs on sequences and
tree structures. arXiv preprint arXiv:1601.00770, 2016.

[5] Zhenxing Niu, Mo Zhou, Le Wang, Xinbo Gao, and Gang Hua. Hierarchical multimodal LSTM
for dense visual-semantic embedding. In Proceedings of the IEEE International Conference on
Computer Vision, pages 1881–1889, 2017.

[6] Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. Tree-to-sequence attentional
neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), pages 823–833, 2016.

[7] Nam Khanh Tran and Weiwei Cheng. Multiplicative tree-structured long short-term memory
networks for semantic representations. In Proceedings of the Seventh Joint Conference on Lexical
and Computational Semantics, pages 276–286, 2018.

[8] Yao Zhou, Cong Liu, and Yan Pan. Modelling sentence pairs with tree-structured attentive en-
coder. In Proceedings of COLING 2016, the 26th International Conference on Computational
Linguistics: Technical Papers, pages 2912–2922, 2016.

[9] Marie-Catherine De Marneffe and Christopher D Manning. Stanford typed dependencies manual.
Technical report, Stanford University, 2008.

[10] Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto
Zamparelli. SemEval-2014 task 1: Evaluation of compositional distributional semantic models
on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th
international workshop on semantic evaluation (SemEval 2014), pages 1–8, 2014.

[11] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word
representation. In Proceedings of the 2014 conference on empirical methods in natural language
processing (EMNLP), pages 1532–1543, 2014.

[12] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning
and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
