On Extractive and Abstractive Neural Document Summarization With Transformer Language Models
Sandeep Subramanian1,2,3,∗, Raymond Li1,∗, Jonathan Pilault1,2,4,∗,
Preprint. Sent to peer review, May 2019.
Introduction

Language models (LMs) are trained to estimate the joint probability of an arbitrary sequence of words or characters using a large corpus of text. They typically factorize the joint distribution of tokens p(x_1, x_2, ..., x_n) into a product of conditional probabilities \prod_{i=1}^{n} p(x_i | x_{<i}). It is possible to use n-gram based models to estimate these conditional probabilities via counts, relying on Markovian assumptions. However, Markovian assumptions and the curse of dimensionality make it harder for n-gram LMs to model long range dependencies and learn smooth functions that can learn similarities between words in the vocabulary. This has led to a preference for recurrent or feed-forward neural language models (Bengio et al. 2003; Mikolov et al. 2010) in recent years due to their ability to learn expressive conditional probability distributions (Radford et al. 2019).

Figure 1: Proposed model for abstractive summarization of a scientific article. An older version of this paper is shown as the reference document. First, a sentence pointer network extracts important sentences from the paper. Next, these sentences are provided along with the whole scientific article to be arranged in the following order: introduction, extracted sentences, abstract & the rest of the paper. A transformer language model is trained on articles organized in this format. During inference, the introduction and the extracted sentences are given to the language model as context to generate a summary. In domains like news and patent documents, the introduction is replaced by the entire document.

The sequence-to-sequence (seq2seq) paradigm (Sutskever, Vinyals, and Le 2014) uses language models that learn the conditional probability of one sequence given another. Here, a language model serves as a "decoder" that is typically conditioned on a representation of an input sequence produced by an encoder neural network. These types of encoder-decoder architectures have been particularly successful when applied to problems such as machine translation (Bahdanau, Cho, and Bengio 2014) and abstractive summarization (Rush, Chopra, and Weston 2015). The encoder and conditional decoder language models are often parameterized as recurrent neural networks (RNNs). Attention mechanisms (Bahdanau, Cho, and Bengio 2014) are used in the decoder to provide more informative conditioning on the representations produced by the encoder and to ease gradient flow into the encoder. RNNs, however, are limited by their sequential nature, making them 1) difficult to optimize and learn for long sequences with long range dependencies (Hochreiter 1998; Pascanu, Mikolov, and Bengio 2013), and 2) hard to parallelize on modern hardware like GPUs, limiting their scalability.

∗ Equal contribution, order determined by coin flip
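The counts-based n-gram estimation and chain-rule factorization described above can be made concrete with a toy example. The sketch below is illustrative only: an unsmoothed first-order (bigram) Markov model, with function names of our own choosing.

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Estimate p(x_i | x_{i-1}) via counts (first-order Markov assumption).
    Unsmoothed: unseen contexts would divide by zero, so this is a sketch only."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return lambda prev, cur: bigrams[(prev, cur)] / unigrams[prev]

def sequence_probability(cond_prob, sentence):
    """Chain rule: p(x_1 .. x_n) = prod_i p(x_i | x_{i-1}) under the Markov assumption."""
    tokens = ["<s>"] + sentence
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p *= cond_prob(prev, cur)
    return p
```

Neural LMs replace the count-based conditional with a learned distribution conditioned on the full prefix x_{<i}, which is what removes the Markovian restriction discussed above.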
There has therefore been a recent shift towards feedforward architectures for sequential data, such as convolutional models (Kalchbrenner et al. 2016; Van Den Oord et al. 2016; Gehring et al. 2017) or fully attentive models popularized by architectures known as transformers (Vaswani et al. 2017). These techniques have a logarithmic or constant path length (as opposed to linear in RNNs) between a network's output and any of its inputs, making gradient flow much easier, thereby opening up the possibility of learning very long term dependencies.

The abstractive summarization of news or scientific papers typically requires encoding and generating hundreds or thousands of words. Recent work by (Radford et al. 2019) (GPT-2) has demonstrated that transformers with a large receptive field and trained on a lot of data yield language models that are capable of capturing long range dependencies. If one is interested in creating a coherent, high-quality summary of long documents, such GPT-like architectures possess many desirable properties. Their results also show that unconditional language models can implicitly learn to perform summarization or machine translation as a consequence of the data on which they are trained. If the data is formatted sequentially into different aspects of a document (introduction, body, summary), each divided by "tl;dr", the model can be coaxed to generate one of these aspects. For example, the model can be made to solve a summarization task by presenting it similarly formatted data at test time; i.e. a document's introduction and body followed by "tl;dr", from which the language model, conditioned on this context, will generate an abstract.

In this work, we take this idea a step further by doing away with the sequence-to-sequence paradigm and formatting data for abstractive summarization in a manner such that transformer language models can make use of all of the available data present in the documents and their summaries (akin to language model pre-training on mono-lingual data in machine translation (Gulcehre et al. 2015)). Specifically, we use a single GPT-like Transformer LM (TLM) trained on documents followed by their summaries. During inference, we generate from the LM, conditioned on the document (see figure 1). Unlike most previous approaches to neural abstractive summarization, we do not use a seq2seq formulation with an explicit encoder and decoder for word generation. We split the task in two: an extractive step and an abstractive step (Chen and Bansal 2018; Gehrmann, Deng, and Rush 2018). To deal with extremely long documents that exceed several thousand words, we first perform sentence extraction using two different hierarchical document models: one based on pointer networks (Vinyals, Fortunato, and Jaitly 2015), similar to the variant proposed in (Chen and Bansal 2018), and the other based on a sentence classifier (Nallapati, Zhai, and Zhou 2017). This extracts important sentences from the document (described in the Extractive Models section) that can be used to better condition the transformer LM on relevant information before it is tasked with generating a summary. We show that this extractive step significantly improves summarization results.

The contributions of this work are two-fold:

• We demonstrate that transformer language models are surprisingly effective at summarizing long scientific articles and outperform typical seq2seq approaches, even without a copy mechanism.

• We show that our approach produces more "abstractive" summaries compared to prior work that employs a copy mechanism (See, Liu, and Manning 2017) while still achieving higher ROUGE scores.

Related Work

Automatic summarization systems seek to condense the size of a piece of text while preserving most of its important information content and meaning. The earliest attempts at automatic summarization focused on extractive techniques, which find words or sentences in a document that capture its most salient content. In the past, various similarity scores based on specific sentence features (keywords, position, length, frequency, linguistic) and metrics (structure-based, vector-based and graph-based) were employed to estimate salience (Steinberger and Jezek 2004; Erkan and Radev 2004) between a sentence in a document and its reference summary. More recently, with advances in distributed representations of words, phrases and sentences, researchers have proposed to use these to compute similarity scores. Such techniques were further refined by (Nallapati, Zhou, and Ma 2016; Cheng and Lapata 2016; Chen and Bansal 2018) with encoder-decoder architectures - the representations learned by the encoder are used to choose the most salient sentences. (Cheng and Lapata 2016) and (Nallapati, Zhou, and Ma 2016) trained encoder-decoder neural networks as a binary classifier to determine if each sentence in a document should belong to the extractive summary or not. (Nallapati et al. 2016) also present an alternative that can pick an unordered set of sentences from the source document to assemble an extractive summary. (Chen and Bansal 2018) use a pointer network (Vinyals, Fortunato, and Jaitly 2015) to sequentially pick sentences from the document that comprise its extractive summary.

Human summarizers have four common characteristics. They are able to (1) interpret a source document, (2) prioritize the most important parts of the input text, (3) paraphrase key concepts into coherent paragraphs and (4) generate diverse output summaries. While extractive methods are arguably well suited for identifying the most relevant information, such techniques may lack the fluency and coherency of human generated summaries. Abstractive summarization has shown the most promise towards addressing points (3) and (4) above. Abstractive generation may produce sentences not seen in the original input document. Motivated by neural network success in machine translation experiments, the attention-based encoder-decoder paradigm has recently been widely studied in abstractive summarization (Rush, Chopra, and Weston 2015; Nallapati et al. 2016; Chopra, Auli, and Rush 2016). By dynamically accessing the relevant pieces of information based on the hidden states of the decoder during generation of the output sequence, the model revisits the input and attends to important information. The advantages of extractive, abstractive and attention-based models were first combined in (Gu et al. 2016) with a copy mechanism for out-of-vocabulary words present in the source document. Similarly, (See, Liu, and Manning 2017) used the attention scores to calculate the probability of generating vs copying a word. A coverage mechanism was also added to penalize the attention score of previously attended words, diminishing the model's tendency to repeat itself.

Framework

Our model comprises two distinct and independently trainable components: 1) a hierarchical document representation model that either points to or classifies sentences in a document to build an extractive summary, and 2) a transformer language model that conditions on the extracted sentences as well as a part of or the entire document.

Extractive Models

We describe the two neural extractive models used in this work in this section.

Hierarchical Seq2seq Sentence Pointer. Our extractive model is similar to the sentence pointer architecture developed by (Chen and Bansal 2018), with the main difference being the choice of encoder. We use a hierarchical bidirectional LSTM encoder with word and sentence level LSTMs, while (Chen and Bansal 2018) use a convolutional word level encoder for faster training and inference. The decoder is in both cases an LSTM.

The extractive model considers the document as a list of N sentences D = (S_1, ..., S_N), and each sentence as a list of tokens. We are given a ground-truth extracted summary of M sentences (S_{i_1}, ..., S_{i_M}), where i_1 < ... < i_M are the indices of the extracted sentences. The procedure to determine ground-truth extraction targets is identical to previous work: finding the two sentences in the document that have the highest ROUGE score with each sentence in the summary.

We use an encoder-decoder architecture for this extractor. The encoder has a hierarchical structure that combines a token- and sentence-level RNN. First, the "sentence encoder" or token-level RNN is a bi-directional LSTM (Hochreiter and Schmidhuber 1997) encoding each sentence. The last hidden state of the last layer from the two directions produces sentence embeddings (s_1, ..., s_N), where N is the number of sentences in the document. The sentence-level LSTM or "document encoder", another bi-directional LSTM, encodes this sequence of sentence embeddings to produce document representations (d_1, ..., d_N).

The decoder is an autoregressive LSTM taking the sentence-level LSTM hidden state of the previously extracted sentence as input and predicting the next extracted sentence. Let i_t be the index of the previous extracted sentence at time step t. The input to the decoder is s_{i_t}, or a zero vector at time-step t = 0. The decoder's output is computed by an attention mechanism from the decoder's hidden state h_t over the document representations (d_1, ..., d_N). We used the dot-product attention method from (Luong, Pham, and Manning 2015). The attention weights a_t produce a context vector c_t, which is then used to compute an attention-aware hidden state \tilde{h}_t. Following the input-feeding approach from (Luong, Pham, and Manning 2015), the attention-aware hidden state \tilde{h}_t is concatenated to the input in the next time step, giving the following recurrence h_t = LSTM([s_{i_t}^T, \tilde{h}_{t-1}^T]^T, h_{t-1}), with

\tilde{h}_t = W_{\tilde{h}} [c_t ; h_t], \quad c_t = \sum_{i=1}^{N} a_t(i) d_i, \quad \alpha_t(i) = d_i^T h_t,   (1)

a_t(i) = \frac{\exp(\alpha_t(i))}{\sum_{i'} \exp(\alpha_t(i'))}, \quad \text{for } i = 1..N.   (2)

The attention weights a_t are used as the output probability distribution over the document sentences for the choice of the next extracted sentence. We choose the convention to signal the end of the extraction by putting the same index twice in a row. Thus, the input to the decoder is the sequence 0, s_{i_1}, ..., s_{i_M}, and the target i_1, ..., i_M, i_M, where M is the length of the ground-truth extracted summary and both sequences have M + 1 elements. The model is trained to minimize the cross-entropy of picking the correct sentence at each decoder time step. At inference, we use beam search to generate the extracted summary.

Sentence Classifier. As with the pointer network, we use a hierarchical LSTM to encode the document and produce a sequence of sentence representations d_1, ..., d_N, where N is the number of sentences in the document. We compute a final document representation as follows:

d = \tanh\left(b_d + W_d \frac{1}{N} \sum_{i=1}^{N} d_i\right)   (3)

where b_d and W_d are learnable parameters. Finally, the probability of each sentence belonging to the extractive summary is given by:

o_i = \sigma(W_o [d_i ; d] + b_o)   (4)

where \sigma is the sigmoid activation function. The model is trained to minimize the binary cross-entropy loss with respect to the sentences in the gold-extracted summary.

Model Details. The model uses word embeddings of size 300. The token-level LSTM (sentence encoder), sentence-level LSTM (document encoder) and decoder each have 2 layers of 512 units, and a dropout of 0.5 is applied at the output of each intermediate layer. We trained it with Adam, a learning rate of 0.001, a weight decay of 10^-5, and batch sizes of 32. We evaluate the model every 200 updates, using a patience of 50. At inference, we decode using beam search with a beam size of 4 for the pointer model, and pick the k most likely sentences from the sentence classifier, where k is the average number of sentences in the summary across the training dataset.
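As an illustration of equations (1)-(4), the attention and classifier computations can be sketched in NumPy. This is a minimal sketch of the math only, not the paper's implementation: the LSTM recurrence is omitted, and all shapes and names are our own simplifications.

```python
import numpy as np

def pointer_attention(d, h_t, W_htilde):
    """Eqs (1)-(2): dot-product attention of decoder state h_t (dim,)
    over sentence representations d (N x dim). Returns attention
    weights a_t, context c_t, and attention-aware state htilde_t."""
    alpha = d @ h_t                                 # alpha_t(i) = d_i^T h_t
    a = np.exp(alpha - alpha.max())                 # softmax, eq (2)
    a = a / a.sum()
    c = a @ d                                       # c_t = sum_i a_t(i) d_i
    htilde = W_htilde @ np.concatenate([c, h_t])    # eq (1): W [c_t; h_t]
    return a, c, htilde

def sentence_probs(d_sents, W_d, b_d, W_o, b_o):
    """Eqs (3)-(4): document representation and per-sentence
    extraction probabilities for the sentence classifier."""
    doc = np.tanh(b_d + W_d @ d_sents.mean(axis=0))         # eq (3)
    logits = np.array([W_o @ np.concatenate([d_i, doc]) + b_o
                       for d_i in d_sents])                 # eq (4)
    return 1.0 / (1.0 + np.exp(-logits))                    # sigmoid
```

In the pointer model, a_t directly gives the distribution over which sentence to extract next; in the classifier, the top-k sentences by probability form the extractive summary.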
Transformer Language Models (TLM)
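Following figure 1, training documents for the TLM are arranged in the order: introduction, extracted sentences, abstract, and the rest of the paper, so that a single left-to-right language model sees the summary-relevant context before the abstract. A minimal sketch of that formatting step is below; the separator strings are hypothetical placeholders, not the tokens actually used in the paper.

```python
def format_for_tlm(intro, extracted_sents, abstract, rest, training=True):
    """Arrange document parts in the order used for TLM training
    (figure 1): introduction, extracted sentences, abstract, rest.
    At inference, only the context (intro + extracted sentences) is
    given and the model generates the abstract as a continuation.
    <EXTRACT>, <ABSTRACT>, <REST> are illustrative placeholders."""
    context = intro + " <EXTRACT> " + " ".join(extracted_sents)
    if training:
        return context + " <ABSTRACT> " + abstract + " <REST> " + rest
    return context + " <ABSTRACT> "  # model continues from here
```

For news or patent documents, where there is no natural introduction section, the entire document would take the place of `intro`.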
[...] are presented in figure 2. First, the original abstract and our TLM conditioned on the intro have small and very similar overlap fractions with the original article. A model using a pointing mechanism (we used our own implementation of the model developed by (Cohan et al. 2018))[1] copies more than our transformer model, especially for higher n-grams. In particular, more than 10% of the 20-grams from the abstracts generated by the pointing model are also found in the article, showing that it tends to copy long sequences of words. On the other hand, our proposed model produces more "abstractive" summaries, demonstrating its ability to paraphrase. Our model tends to copy longer sequences when conditioned on the introduction and the sentences from the extractor. We hypothesize that providing extracted sentences from the article that already contain a lot of words present in the reference abstract makes the transformer's task easier, by allowing it to copy words and phrases from the extracted sentences. We find empirical evidence of this in figure 2, showing that the majority of n-gram copies come from the extracted sentences. For 5-grams, close to 2/3 of the copied words are from the extracted sentences. As the n-gram length increases to 25, 4/5 of the copied words are from the extracted sentences.

[1] This model achieved the following ROUGE-1, 2, 3 and L on the arXiv dataset: 41.33, 14.73, 6.80, 36.34.

Table 6: Summarization results on the Newsroom dataset. Previous work results from (Grusky, Naaman, and Artzi 2018) and (Mendes et al. 2019). Each subset reports ROUGE-1 / ROUGE-2 / ROUGE-L.

Model          Type     Extractive          Mixed               Abstractive
Previous Work
Seq2Seq        Abs      6.1 / 0.2 / 5.4     5.7 / 0.2 / 5.1     6.2 / 1.1 / 5.7
TextRank       Ext      32.4 / 19.7 / 28.7  22.3 / 7.9 / 17.7   13.5 / 1.9 / 10.5
Pointer-gen    Mix      39.1 / 27.9 / 36.2  25.5 / 11.0 / 21.1  14.7 / 2.3 / 11.4
Lead-3         Ext      53.0 / 49.0 / 52.4  25.1 / 12.9 / 22.1  13.7 / 2.4 / 11.2
Exconsumm      Mix      68.4 / 62.9 / 67.3  31.7 / 16.1 / 27.0  17.1 / 3.1 / 14.1
Our Models
Sent-CLF       Ext      53.0 / 47.0 / 52.1  26.8 / 12.6 / 23.6  15.4 / 2.7 / 12.8
Sent-PTR       Ext      60.7 / 55.2 / 59.7  28.9 / 14.1 / 25.1  15.9 / 2.8 / 13.0
TLM            Abs      49.8 / 39.7 / 47.4  27.1 / 11.6 / 22.8  20.4 / 6.9 / 17.1
TLM+E (G,M)    Mix      63.3 / 57.3 / 61.8  31.9 / 16.6 / 27.4  20.1 / 6.5 / 16.6
Oracle
Gold Ext       Oracle   68.1 / 64.5 / 67.3  40.8 / 24.6 / 34.2  21.9 / 5.2 / 16.3
TLM+E (G,G)    Oracle   78.8 / 74.0 / 77.8  38.6 / 22.0 / 33.6  24.5 / 9.6 / 20.8

Conclusion

We have demonstrated that Transformer language models can generate high-quality summaries of long sequences of text via an extractive step followed by an abstractive step. We quantitatively measure the positive impact of the extractive step by comparing it to an abstractive model variant that only sees the input text itself. Our approach outperforms previous extractive and abstractive summarization methods on the arXiv, PubMed and bigPatent datasets and is less prone to copying entire phrases or sentences from the input text. The fluency and coherency of the sample summaries suggests that these models are ready for comprehensive human evaluation studies. As with other problem domains, we have observed that abstractive summaries generated by transformers can contain imaginary content. We advise that such evaluations should probe multiple aspects of the summarization results, including both factual correctness and coherency. We also note that for evaluating the correctness of the summaries of scientific articles and patents, one must have highly trained evaluators who are willing to invest significant amounts of time to read the underlying papers and patents. Such studies could therefore require significant investments of resources. We have also presented an upper bound on extractive + abstractive models, by conditioning the abstractive step on gold-extracted sentences. In future work, we are also interested in exploring the possibility of training the extractive and abstractive steps in an end-to-end manner. While we believe that this work is a step forward towards generating more abstractive summaries, it remains an open challenge to develop models that respect the underlying facts of the content being summarized while matching the creative ability of humans to coherently and concisely synthesize summaries.

Table 7: Qualitative Results — Generated abstracts of select papers using our Intro Only TLM.

References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3(Feb):1137–1155.
Chen, Y.-C., and Bansal, M. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.
Cheng, J., and Lapata, M. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
Chopra, S.; Auli, M.; and Rush, A. M. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 93–98.
Cohan, A.; Dernoncourt, F.; Kim, D. S.; Bui, T.; Kim, S.; Chang, W.; and Goharian, N. 2018. A discourse-aware attention model for abstractive summarization of long documents. CoRR abs/1804.05685.
Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Erkan, G., and Radev, D. R. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22:457–479.
Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. 1243–1252.
Gehrmann, S.; Deng, Y.; and Rush, A. M. 2018. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.
Graham, Y. 2015. Re-evaluating automatic summarization with BLEU and 192 shades of ROUGE. 128–137.
Grusky, M.; Naaman, M.; and Artzi, Y. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283.
Gu, J.; Lu, Z.; Li, H.; and Li, V. O. K. 2016. Incorporating copying mechanism in sequence-to-sequence learning. CoRR abs/1603.06393.
Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Barrault, L.; Lin, H.-C.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2015. On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.
Hochreiter, S. 1998. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(02):107–116.
Kalchbrenner, N.; Espeholt, L.; Simonyan, K.; Oord, A. v. d.; Graves, A.; and Kavukcuoglu, K. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
Lin, C.-Y. 2004. Looking for a few good metrics: Automatic summarization evaluation - how many samples are enough? In NTCIR.
Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
Mendes, A.; Narayan, S.; Miranda, S.; Marinho, Z.; Martins, A. F.; and Cohen, S. B. 2019. Jointly extracting and compressing documents with summary state representations. arXiv preprint arXiv:1904.02020.
Micikevicius, P.; Narang, S.; Alben, J.; Diamos, G.; Elsen, E.; Garcia, D.; Ginsburg, B.; Houston, M.; Kuchaiev, O.; Venkatesh, G.; et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740.
Mikolov, T.; Karafiát, M.; Burget, L.; Černockỳ, J.; and Khudanpur, S. 2010. Recurrent neural network based language model.
Nallapati, R.; Zhou, B.; Gulcehre, C.; Xiang, B.; et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.
Nallapati, R.; Zhai, F.; and Zhou, B. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents.
Nallapati, R.; Zhou, B.; and Ma, M. 2016. Classify or select: Neural architectures for extractive document summarization. arXiv preprint arXiv:1611.04244.
Pascanu, R.; Mikolov, T.; and Bengio, Y. 2013. On the difficulty of training recurrent neural networks. 1310–1318.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners.
Rush, A. M.; Chopra, S.; and Weston, J. 2015. A neural attention model for abstractive sentence summarization. CoRR abs/1509.00685.
See, A.; Liu, P. J.; and Manning, C. D. 2017. Get to the point: Summarization with pointer-generator networks. CoRR abs/1704.04368.
Sennrich, R.; Haddow, B.; and Birch, A. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
Sharma, E.; Li, C.; and Wang, L. 2019. BIGPATENT: A large-scale dataset for abstractive and coherent summarization. arXiv preprint arXiv:1906.03741.
Steinberger, J., and Jezek, K. 2004. Using latent semantic analysis in text summarization and summary evaluation. Proc. ISIM 4:93–100.
Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. 3104–3112.
Van Den Oord, A.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A. W.; and Kavukcuoglu, K. 2016. WaveNet: A generative model for raw audio. SSW 125.
Vanderwende, L.; Suzuki, H.; Brockett, C.; and Nenkova, A. 2007. Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing & Management 43(6):1606–1618.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. CoRR abs/1706.03762.
Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. 2692–2700.
Weber, N.; Shekhar, L.; Balasubramanian, N.; and Cho, K. 2018. Controlling decoding for more abstractive summaries with copy-based networks. arXiv preprint arXiv:1803.07038.
Appendix

Samples from the arXiv test set

Samples from the TLM-I and TLM-I+E (G,M) models on a random example from the arXiv test set.

t-SNE of learned word embeddings

We visualize the word embeddings learned by our TLM model using t-SNE. We find that words that are often associated with computer science are clustered in a different part of space when compared to words associated with physics. We use the arXiv REST API to find the submission category of each paper in the training set and then find the ∼300 most representative words for each category, using TF-IDF scores, and plot them.

Table 8: Qualitative Results - Generated abstracts from our models on a random example from the test set of (Cohan et al. 2018)
TLM conditioned on intro — deep convolutional neural net-
works ( cnns ) have created new benchmarks in the object
recognition challenge . most of these benchmark datasets con-
tain thousands of images of interest . in addition to these
datasets , several standard data - sets , such as pascal voc
2012 and the ms coco , have been used in conjunction with
the same network in order to improve the object detection task
. in this paper , we present a method for object localization
using the deep convolutional neural networks trained for the
simpler task of image - level classification . we propose local-
ization candidates ( or bounding boxes ) which maximize clas-
sification scores for a class of interest . we explore the beam
search strategy where the number of candidate bounding boxes
are restricted as we traverse from the root node towards the
leaf nodes . we then gradually explore smaller and smaller re-
gions of interest till a point is reached where the classifier is
not able to discriminate among the classes of interest . the de-
tection candidates are organized in a search tree , the root node
being represented by the entire test image . we then gradually
explore smaller and smaller regions of interest till a point is
reached where the classifier is not able to distinguish among
the classes of interest .
TLM conditioned on intro + pointer network extracted sentences — deep convolutional neural networks ( cnns ) are widely used in image recognition applications . however , they face a number of challenges . first , cnn models are computationally expensive to train and to perform . second , they are not easy to implement and can not exploit global features . to alleviate these issues , we present a method that leverages the correlation between semantic information present in feature maps and localization of an object of interest . the method leverages feature map activations for object localization . we demonstrate that this method is significantly faster than state - of - the - art in localization accuracy by a significant margin of up to 8 map on two standard data - sets with complex scenes , pascal voc 2012 and the much larger ms coco .

Figure 3: t-SNE visualization of the TLM-learned word embeddings. The model appears to partition the space based on the broad paper category in which a word frequently occurs.
Ground truth abstract — object localization is an important
computer vision problem with a variety of applications . the
lack of large scale object - level annotations and the relative
abundance of image - level labels makes a compelling case for
weak supervision in the object localization task . deep convo-
lutional neural networks are a class of state-of-the-art methods
for the related problem of object recognition . in this paper , we
describe a novel object localization algorithm which uses clas-
sification networks trained on only image labels . this weakly
supervised method leverages local spatial and semantic pat-
terns captured in the convolutional layers of classification net-
works . we propose an efficient beam search based approach
to detect and localize multiple objects in images . the proposed
method significantly outperforms the state-of-the-art in stan-
dard object localization data - sets with a 8 point increase in
map scores .