Concessive Sentences In concessive sentences, the two clauses have opposite polarities, usually related by a contrary-to-expectation implicature. We plot evolving representations over time for two concessives in Figure 3. The plots suggest:

1. For tasks like sentiment analysis, whose goal is to predict a specific semantic dimension (as opposed to general tasks like language-model word prediction), too large a dimensionality leaves many dimensions non-functional (with values close to 0), causing two sentences of opposite sentiment to differ in only a few dimensions. This may explain why more dimensions do not necessarily lead to better performance on such tasks (for example, as reported in Socher et al. (2013), optimal performance is achieved when word dimensionality is set to between 25 and 35).

2. Both sentences contain two clauses connected by the conjunction “though”. Such two-clause sentences might either work collaboratively—models would remember the word “though” and make the second clause share the same sentiment orientation as the first—or competitively, with the stronger clause dominating. The region within the dotted line in Figure 3(a) favors the second assumption: the difference between the two sentences is diluted when the final words (“interesting” and “boring”) appear.

Clause Composition In Figure 4 we explore this clause composition in more detail. Representations move closer to the negative-sentiment region when negative clauses like “although it had bad acting” or “but it is too long” are added to the end of a simply positive “I like the movie”. By contrast, adding a concessive clause to a negative clause does not move the representation toward the positive; “I hate X but ...” is still very negative, not much different from “I hate X”. This difference again suggests that the model is able to capture negative asymmetry (Clark and Clark, 1977; Horn, 1989; Fraenkel and Schul, 2008).
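The trajectory plots in Figures 3 and 4 come from recording the sentence representation after each successive word. A minimal sketch of how such a plot can be produced is below; the random recurrent parameters and the PCA projection to two dimensions are illustrative assumptions, not the paper's trained model or its exact projection method.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 16, 60
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
V = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))

def encode_prefixes(word_embeddings):
    # Record the hidden state after each successive word.
    h, states = np.zeros(hidden_dim), []
    for e_t in word_embeddings:
        h = np.tanh(W @ h + V @ e_t)
        states.append(h)
    return np.array(states)

sentence = rng.normal(size=(8, embed_dim))  # stand-in word embeddings
states = encode_prefixes(sentence)

# Project the evolving representations to 2-D (PCA via SVD) and plot
# the trajectory the sentence traces as each word is added.
centered = states - states.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
xy = centered @ vt[:2].T
plt.plot(xy[:, 0], xy[:, 1], marker="o")
plt.show()
```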
Figure 5: Saliency heatmap for “I hate the movie .”. Each row corresponds to the saliency scores for the corresponding word representation, with each grid representing one dimension.

Figure 6: Saliency heatmap for “I hate the movie I saw last night .”.

Figure 7: Saliency heatmap for “I hate the movie though the plot is interesting .”.
Negation
“I hate the movie that I saw last night” All three models assign the correct sentiment. The simple recurrent models again do poorly at filtering out irrelevant information, assigning too much salience to words unrelated to sentiment. However, none of the models suffers from gradient vanishing problems despite this sentence being longer; the salience of “hate” still stands out after 7-8 following convolutional operations.

“I hate the movie though the plot is interesting” The simple recurrent model emphasizes only the second clause “the plot is interesting”, assigning no credit to the first clause “I hate the movie”. This might seem to be caused by a vanishing gradient, yet the model correctly classifies the sentence as very negative, suggesting that it is successfully incorporating information from the first negative clause. We separately tested the individual clause “though the plot is interesting”; the standard recurrent model confidently labels it as positive. Thus, despite the lower saliency scores for words in the first clause, the simple recurrent system manages to rely on that clause and downplay the information from the latter positive clause—despite the higher saliency scores of the later words. This illustrates a limitation of saliency visualization: first-order derivatives do not capture all the information we would like to visualize, perhaps because they are only a rough approximation to individual contributions and might not suffice to deal with highly non-linear cases. By contrast, the LSTM emphasizes the first clause, sharply dampening the influence from the second clause, while the Bi-LSTM focuses on both “hate the movie” and “plot is interesting”.

5.2 Results on Sequence-to-Sequence Autoencoder

Figure 9 shows the saliency heatmap for the autoencoder in terms of predicting the corresponding token at each time step. We compute first derivatives with respect to each preceding word through back-propagation as decoding goes on. Each grid corresponds to the magnitude of the average saliency value for each 1000-dimensional word vector. The heatmaps give a clear overview of the behavior of neural models during decoding. Observations can be summarized as follows:

1. At each time step of word prediction, SEQ2SEQ models manage to link the word to predict back to the corresponding region of the inputs (automatically learning alignments); e.g., the input region centered around the token “hate” exerts more impact when the token “hate” is to be predicted, and similarly for the tokens “movie”, “plot” and “boring”.

2. Neural decoding combines the previously built representation with the word predicted at the current step. As decoding proceeds, the influence of the initial input on decoding (i.e., tokens in the source sentence) gradually diminishes as more previously-predicted words are encoded in the vector representations. Meanwhile, the influence of the language model gradually dominates: when the word “boring” is to be predicted, the models attach more weight to the earlier predicted tokens “plot” and “is” and less to the corresponding region of the inputs, i.e., the word “boring” in the inputs.

Figure 9: Saliency heatmap for SEQ2SEQ auto-encoder in terms of predicting the corresponding token at each time step.
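To make the first-derivative saliency computation concrete, the sketch below backpropagates a class score to the input word embeddings and reads off the gradient magnitude per token. The small GRU classifier, its dimensions, and the token ids are hypothetical stand-ins for the models analyzed above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embed_dim, hidden_dim, num_classes = 100, 16, 32, 2
embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, num_classes)

tokens = torch.tensor([[5, 17, 3, 42, 8]])  # hypothetical ids, e.g. "i hate the movie ."

embeds = embedding(tokens)  # (1, seq_len, embed_dim)
embeds.retain_grad()        # keep the gradient on this non-leaf tensor

_, h_n = rnn(embeds)               # final hidden state summarizes the sequence
score = classifier(h_n[-1])[0, 1]  # score of one class (say, "negative")
score.backward()                   # first derivatives via back-propagation

# Saliency of each word: magnitude of d(score)/d(embedding), here averaged
# over the embedding dimensions to get one value per token.
saliency = embeds.grad.abs().mean(dim=-1)
print(saliency)
```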
6 Average and Variance

For settings where word embeddings are treated as parameters to optimize from scratch (as opposed to using pre-trained embeddings), we propose a second, surprisingly easy and direct way to visualize important indicators. We first compute the average of the word embeddings for all the words within the sentence. The measure of salience or influence for a word is its deviation from this average. The idea is that during training, models learn to render indicators different from non-indicator words, enabling them to stand out even after many layers of computation.

Figure 8 shows a map of variance; each grid corresponds to the value of

|| e_{i,j} − (1/N_S) Σ_{i′∈N_S} e_{i′,j} ||²

where e_{i,j} denotes the value of the j-th dimension of word i, and N_S denotes the number of tokens within the sentence.

As the figure shows, the variance-based salience measure also does a good job of emphasizing the relevant sentiment words. The approach does have shortcomings: (1) it can only be used in scenarios where word embeddings are parameters to learn; (2) it is unclear how well it is able to visualize local compositionality.
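As a concrete sketch of this measure, the snippet below computes the squared deviation of each word's embedding from the sentence average, matching the formula above; the random embedding matrix stands in for embeddings learned from scratch during training.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 60))  # N_S = 5 words, 60 embedding dimensions

mean_embedding = E.mean(axis=0)           # (1 / N_S) * sum over i' of e_{i', j}
salience_map = (E - mean_embedding) ** 2  # squared deviation per word, per dimension
word_salience = salience_map.sum(axis=1)  # optional per-word aggregate
print(word_salience)
```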
7 Conclusion

In this paper, we offer several methods to help visualize and interpret neural models: to understand how neural models are able to compose meanings, demonstrating asymmetries of negation, and explaining some aspects of the strong performance of LSTMs at these tasks.

Though our attempts only touch superficial points in neural models, and each method has its pros and cons, together they may offer some insights into the behaviors of neural models in language-based tasks, marking one initial step toward understanding how they achieve meaning composition in natural language processing. Our future work includes using the results of the visualization to perform error analysis, and understanding the strengths and limitations of different neural models.
References

Herbert H. Clark and Eve V. Clark. 1977. Psychology and language: An introduction to psycholinguistics. Harcourt Brace Jovanovich.

Navneet Dalal and Bill Triggs. 2005. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE.

Jeffrey L. Elman. 1989. Representation and structure in connectionist models. Technical Report 8903, Center for Research in Language, University of California, San Diego.

Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2009. Visualizing higher-layer features of a deep network. Dept. IRO, Université de Montréal, Tech. Rep.

Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of EACL, volume 2014.

Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. 2015. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004.

Tamar Fraenkel and Yaacov Schul. 2008. The meaning of negated adjectives. Intercultural Pragmatics, 5(4):517–540.

Alona Fyshe, Leila Wehbe, Partha P. Talukdar, Brian Murphy, and Tom M. Mitchell. 2015. A compositional and interpretable semantic space. In Proceedings of NAACL-HLT, Denver, USA.

Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 580–587. IEEE.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Laurence R. Horn. 1989. A natural history of negation, volume 960. University of Chicago Press, Chicago.

Yangfeng Ji and Jacob Eisenstein. 2014. Representation learning for text-level discourse parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 13–24.

Andrej Karpathy, Justin Johnson, and Fei-Fei Li. 2015. Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

David G. Lowe. 2004. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.

Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2014. Addressing the rare word problem in neural machine translation. arXiv preprint arXiv:1410.8206.

Aravindh Mahendran and Andrea Vedaldi. 2014. Understanding deep image representations by inverting them. arXiv preprint arXiv:1412.0035.

Brian Murphy, Partha Pratim Talukdar, and Tom M. Mitchell. 2012. Learning effective and interpretable semantic models using non-negative sparse embedding. In COLING, pages 1933–1950.

Anh Nguyen, Jason Yosinski, and Jeff Clune. 2014. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. arXiv preprint arXiv:1412.1897.

Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 45(11):2673–2681.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1631, page 1642. Citeseer.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.

Carl Vondrick, Aditya Khosla, Tomasz Malisiewicz, and Antonio Torralba. 2013. HOGgles: Visualizing object detection features. In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 1–8. IEEE.

Philippe Weinzaepfel, Hervé Jégou, and Patrick Pérez. 2011. Reconstructing an image from its local descriptors. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 337–344. IEEE.

Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pages 818–833. Springer.

Appendix
Recurrent Models A recurrent network successively takes word w_t at step t, combines its vector representation e_t with the previously built hidden vector h_{t−1} from time t−1, calculates the resulting current embedding h_t, and passes it to the next step. The embedding h_t for the current time t is thus:

h_t = f(W · h_{t−1} + V · e_t)    (4)

where W and V denote compositional matrices. If N_S denotes the length of the sequence, h_{N_S} represents the whole sequence S; h_{N_S} is used as input to a softmax function for classification tasks.
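A minimal sketch of this update, assuming f = tanh; the dimensions and random parameters below are illustrative stand-ins, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 16, 32
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # compositional matrix W
V = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))   # compositional matrix V

def step(h_prev, e_t):
    # h_t = f(W . h_{t-1} + V . e_t), with f = tanh
    return np.tanh(W @ h_prev + V @ e_t)

sequence = rng.normal(size=(5, embed_dim))  # five word embeddings e_1 .. e_5
h = np.zeros(hidden_dim)
for e_t in sequence:
    h = step(h, e_t)
print(h.shape)  # h_{N_S}: fed to a softmax classifier for classification
```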
Multi-layer Recurrent Models Multi-layer recurrent models extend the one-layer recurrent structure by operating on a deep neural architecture that enables more expressivity and flexibility. The model associates each time step of each layer with a hidden representation h_{l,t}, where l ∈ [1, L] denotes the index of the layer and t denotes the index of the time step. Taking the current state of the layer below, h_{l−1,t} (with h_{0,t} = e_t), in place of the word embedding, h_{l,t} is given by:

h_{l,t} = f(W · h_{l,t−1} + V · h_{l−1,t})
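A sketch of this stacked update under the same illustrative assumptions; using separate parameters per layer is itself an assumption, since the equation above leaves parameter sharing unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
L, T, dim = 3, 5, 32                           # layers, time steps, shared width
W = rng.normal(scale=0.1, size=(L, dim, dim))  # per-layer W (assumption)
V = rng.normal(scale=0.1, size=(L, dim, dim))  # per-layer V (assumption)

embeddings = rng.normal(size=(T, dim))  # h_{0, t} = e_t
h = np.zeros((L, dim))                  # h_{l, 0} for every layer
for t in range(T):
    below = embeddings[t]
    for l in range(L):
        h[l] = np.tanh(W[l] @ h[l] + V[l] @ below)  # h_{l, t}
        below = h[l]
print(h[-1].shape)  # top-layer representation after the last word
```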
Bidirectional Models Bidirectional models (Schuster and Paliwal, 1997) add bidirectionality to the recurrent framework, where embeddings for each time step are calculated both forwardly and backwardly:

h_t^→ = f(W^→ · h_{t−1}^→ + V^→ · e_t)
h_t^← = f(W^← · h_{t+1}^← + V^← · e_t)    (7)

Normally, bidirectional models feed the concatenation vector calculated from both directions, [e_1^←, e_{N_S}^→], to the classifier. Bidirectional models can be similarly extended to both multi-layer neural models and the LSTM version.
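A sketch of the two directional passes in Eq. (7), again with f = tanh and random illustrative parameters; concatenating the two end states mirrors the classifier input described above.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, T = 16, 32, 5
sequence = rng.normal(size=(T, embed_dim))  # word embeddings e_1 .. e_T

def run(Wd, Vd, embeddings):
    # One directional pass: h = f(W . h_prev + V . e_t) at each step.
    h = np.zeros(hidden_dim)
    for e_t in embeddings:
        h = np.tanh(Wd @ h + Vd @ e_t)
    return h

Wf = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
Vf = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))
Wb = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
Vb = rng.normal(scale=0.1, size=(hidden_dim, embed_dim))

h_forward = run(Wf, Vf, sequence)         # left to right, uses h_{t-1}
h_backward = run(Wb, Vb, sequence[::-1])  # right to left, uses h_{t+1}

# The concatenation from both directions is fed to the classifier.
h_concat = np.concatenate([h_backward, h_forward])
print(h_concat.shape)
```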