A Reinforced Topic-Aware Convolutional Sequence-to-Sequence Model for Abstractive Text Summarization
2 Related Work
Automatic text summarization has been widely investigated, and many approaches have been proposed to address this challenging task. Various methods [Neto et al., 2002] focus on extractive summarization, which selects important content from the text and combines it verbatim to produce a summary. On the other hand, abstractive summarization models are able to produce a grammatical summary with novel expressions, most of which [Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016a] are built upon the neural attention-based sequence-to-sequence framework [Sutskever et al., 2014].
The predominant models are based on RNNs [Nallapati et al., 2016b; Shen et al., 2016; Paulus et al., 2017], where the encoder and decoder are constructed using either Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997] or Gated Recurrent Unit (GRU) [Cho et al., 2014] cells. However, very few methods have explored the performance of convolutional structures on summarization tasks. Compared to RNNs, convolutional neural networks (CNNs) enjoy several advantages, including efficient training by leveraging parallel computing and mitigation of the gradient vanishing problem due to fewer non-linearities [Dauphin et al., 2016]. Notably, the recently proposed gated convolutional network [Dauphin et al., 2016; Gehring et al., 2017] outperforms state-of-the-art RNN-based models on language modeling and machine translation tasks.

While the ConvS2S model has also been evaluated on abstractive summarization [Gehring et al., 2017], it has several limitations. First, the model is trained by minimizing a maximum-likelihood loss, which is sometimes inconsistent with the quality of a summary and with metrics that are evaluated over whole sentences, such as ROUGE [Lin, 2004]. In addition, exposure bias [Ranzato et al., 2015] occurs because the model is exposed only to the training data distribution instead of its own predictions. More importantly, the ConvS2S model utilizes only word-level alignment, which may be insufficient for summarization and prone to incoherent, generalized summaries. Therefore, higher-level alignment could be a potential assist. For example, topic information has been introduced to an RNN-based sequence-to-sequence model [Xing et al., 2017] for chatbots to generate more informative responses.

3 Reinforced Topic-Aware Convolutional Sequence-to-Sequence Model
In this section, we propose the Reinforced Topic-Aware Convolutional Sequence-to-Sequence model, which consists of a convolutional architecture with both input words and topics, a joint multi-step attention mechanism, a biased generation structure, and a reinforcement learning procedure. The graphical illustration of the topic-aware convolutional architecture can be found in Figure 1.

Figure 1: A graphical illustration of the topic-aware convolutional architecture. Word and topic embeddings of the source sequence are encoded by the associated convolutional blocks (bottom left and bottom right). Then we jointly attend to words and topics by computing dot products of decoder representations (top left) and word/topic encoder representations. Finally, we produce the target sequence through a biased probability generation mechanism.

3.1 ConvS2S Architecture
We exploit the ConvS2S architecture [Gehring et al., 2017] as the basic infrastructure of our model. In this paper, two convolutional blocks are employed, associated with the word-level and topic-level embeddings, respectively. We introduce the former in this section and the latter in the next, along with the new joint attention and the biased generation mechanism.

Position Embeddings
Let x = (x_1, . . . , x_m) denote the input sentence. We first embed the input elements (words) in a distributional space as w = (w_1, . . . , w_m), where w_i ∈ R^d are rows of a randomly initialized matrix D^word ∈ R^{V×d} with V being the size of the vocabulary. We also add a positional embedding, p = (p_1, . . . , p_m) with p_i ∈ R^d, to retain the order information. Thus, the final embedding for the input is e = (w_1 + p_1, . . . , w_m + p_m). Similarly, let q = (q_1, . . . , q_n) denote the embeddings of output elements that were already generated by the decoder and are fed back at the next step.
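For concreteness, the sum of word and positional embeddings can be written as a small PyTorch module. This is a minimal sketch, assuming a fixed maximum length; the class name and the vocab_size, max_len, and dim arguments are illustrative choices rather than details from the paper.

```python
import torch
import torch.nn as nn

class SourceEmbedding(nn.Module):
    """Sketch of e_i = w_i + p_i: word embedding plus positional embedding."""

    def __init__(self, vocab_size: int, max_len: int, dim: int):
        super().__init__()
        self.word = nn.Embedding(vocab_size, dim)  # rows of D^word, randomly initialized
        self.pos = nn.Embedding(max_len, dim)      # positional embeddings p_i

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, m) integer ids of the source sentence x = (x_1, ..., x_m)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return self.word(tokens) + self.pos(positions)  # e = (w_1 + p_1, ..., w_m + p_m)
```

The same construction applies to the decoder inputs q, with a separate embedding table for the target vocabulary.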
Convolutional Layer
Both encoder and decoder networks are built by stacking several convolutional layers. Suppose that the kernel has width k and the input embedding dimension is d. The convolution takes a concatenation of k input elements X ∈ R^{kd} and maps it to an output element Y ∈ R^{2d}, namely

Y = f_conv(X) = W_Y X + b_Y ,  (1)

where the kernel matrix W_Y ∈ R^{2d×kd} and the bias term b_Y ∈ R^{2d} are the parameters to be learned.

Rewrite the output as Y = [A; B], where A, B ∈ R^d. Then the gated linear unit (GLU) [Dauphin et al., 2016] is given by

g([A; B]) = A ⊗ σ(B) ,  (2)

where σ is the sigmoid function, ⊗ is the point-wise multiplication, and the output of the GLU is in R^d.

We denote the outputs of the l-th layer as h^l = (h_1^l, . . . , h_n^l) for the decoder, and z^l = (z_1^l, . . . , z_m^l) for the encoder. Take the decoder for illustration. The convolution unit i on the l-th layer is computed with residual connections as

h_i^l = g ∘ f_conv([h_{i−k/2}^{l−1}; · · · ; h_{i+k/2}^{l−1}]) + h_i^{l−1} ,  (3)

where h_i^l ∈ R^d and ∘ is the function composition operator.
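A minimal PyTorch sketch of one such layer, combining Eqs. (1)-(3), is given below. It assumes an odd kernel width k so that the window [i − k/2, i + k/2] is symmetric and uses the built-in GLU; weight normalization, dropout, and the causal padding used in a real ConvS2S decoder are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    """Sketch of h_i^l = g(f_conv([h_{i-k/2}^{l-1}; ...; h_{i+k/2}^{l-1}])) + h_i^{l-1}."""

    def __init__(self, dim: int, kernel_width: int):
        super().__init__()
        # Maps k concatenated d-dim inputs to a 2d-dim output Y = [A; B]  (Eq. 1).
        self.conv = nn.Conv1d(dim, 2 * dim, kernel_width,
                              padding=kernel_width // 2)

    def forward(self, h_prev: torch.Tensor) -> torch.Tensor:
        # h_prev: (batch, seq_len, dim), the outputs of layer l-1
        y = self.conv(h_prev.transpose(1, 2)).transpose(1, 2)  # (batch, seq_len, 2*dim)
        gated = F.glu(y, dim=-1)   # g([A; B]) = A * sigmoid(B)   (Eq. 2)
        return gated + h_prev      # residual connection           (Eq. 3)
```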
Multi-step Attention
The attention mechanism is introduced to make the model access historical information. To compute the attention, we first embed the current decoder state h_i^l as h̃_i^l ∈ R^d. The attention weights are then obtained from dot products between h̃_i^l and the outputs of the last encoder layer, and the resulting conditional input c_i^l provides point information about a specific input element. Once c_i^l has been computed, it is added to the output of the corresponding decoder layer h_i^l and serves as a part of the input to h_i^{l+1}.
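The detailed attention equations follow ConvS2S [Gehring et al., 2017]; the snippet below is only a simplified sketch of the dot-product attention described above and in the Figure 1 caption, with the function name and tensor shapes chosen for illustration.

```python
import torch

def dot_product_attention(dec_state: torch.Tensor,
                          enc_out: torch.Tensor,
                          enc_emb: torch.Tensor) -> torch.Tensor:
    """Simplified sketch of one ConvS2S-style attention step.

    dec_state: (batch, n, d)  embedded decoder states (h~_i^l)
    enc_out:   (batch, m, d)  outputs of the last encoder layer (z_j)
    enc_emb:   (batch, m, d)  source input embeddings (e_j)
    """
    scores = torch.bmm(dec_state, enc_out.transpose(1, 2))  # dot products, (batch, n, m)
    alpha = torch.softmax(scores, dim=-1)                    # attention weights
    c = torch.bmm(alpha, enc_out + enc_emb)                  # conditional inputs c_i^l
    return c
```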
3.2 Topic-Aware Attention Mechanism
Topic Embeddings
The topic embeddings are obtained by classical topic models such as Latent Dirichlet Allocation (LDA) [Blei et al., 2003]. During pre-training, we use LDA to assign topics to the input texts. The top N non-universal words with the highest probabilities of each topic are chosen into the topic vocabulary K. More details will be given in Section 4. While the vocabulary of the texts is denoted as V, we assume that K ⊂ V. Given an input sentence x = (x_1, . . . , x_m), if a word x_i ∉ K, we embed it as before to attain w_i. However, if a word x_i ∈ K, we embed this topic word as t_i ∈ R^d, which is a row in the topic embedding matrix D^topic ∈ R^{K×d}, where K is the size of the topic vocabulary. The embedding matrix D^topic is normalized from the corresponding pre-trained topic distribution matrix, whose rows are proportional to the number of times that each word is assigned to each topic. In this case, the positional embedding vectors are also added to the encoder and decoder elements, respectively, to obtain the final topic embeddings r = (r_1, . . . , r_m) and s = (s_1, . . . , s_n).
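As an illustration only, the following sketch builds D^topic from a hypothetical word-topic count matrix and selects between topic and word embeddings for each token. It assumes the number of LDA topics equals the embedding dimension d and uses a simple row-sum normalization; neither choice is specified in the text above.

```python
import numpy as np

def build_topic_embeddings(topic_word_counts: np.ndarray) -> np.ndarray:
    """topic_word_counts: shape (K, d), row i counts how often topic word i was
    assigned to each of the d topics by a pre-trained LDA model.
    Rows are normalized (one possible choice) to give D^topic in R^{K x d}."""
    counts = topic_word_counts.astype(np.float64)
    return counts / np.clip(counts.sum(axis=1, keepdims=True), 1e-12, None)

def embed_token(word: str,
                word_vectors: dict,
                topic_vocab: dict,
                d_topic: np.ndarray) -> np.ndarray:
    """Embed a token as t_i (a row of D^topic) if it lies in the topic
    vocabulary K, and as its ordinary word embedding w_i otherwise."""
    if word in topic_vocab:                  # x_i in K  ->  row of D^topic
        return d_topic[topic_vocab[word]]
    return word_vectors[word]                # x_i not in K  ->  w_i
```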
Joint Attention
Again we take the decoder for illustration. Following the convolutional layer introduced before, we can obtain the convolution unit i on the l-th layer of the topic-level decoder. The joint attention is then computed analogously to the word-level multi-step attention, attending to both the word and topic encoder representations.

Biased Probability Generation
Finally, we compute a distribution over all possible next target elements y_{i+1} ∈ R^T through the biased probability generation mechanism.
Table 7: Examples of generated summaries on the LCSTS dataset. D: source document, R: reference summary, OR: output of the Reinforced-
ConvS2S model, OT: output of the Reinforced-Topic-ConvS2S model. The words marked in blue are topic words not in the reference
summaries. The words marked in red are topic words neither in the reference summaries nor in the source documents. All the texts are
carefully translated from Chinese.
References
[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[Barzilay and McKeown, 2005] Regina Barzilay and Kathleen R McKeown. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–328, 2005.
[Blei et al., 2003] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[Cho et al., 2014] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
[Chopra et al., 2016] Sumit Chopra, Michael Auli, and Alexander M Rush. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, 2016.
[Dauphin et al., 2016] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083, 2016.
[Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
[Graff and Cieri, 2003] David Graff and C Cieri. English Gigaword corpus. Linguistic Data Consortium, 2003.
[Gu et al., 2016] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393, 2016.
[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[Hu et al., 2015] Baotian Hu, Qingcai Chen, and Fangze Zhu. LCSTS: A large scale Chinese short text summarization dataset. arXiv preprint arXiv:1506.05865, 2015.
[Kraaij et al., 2002] Wessel Kraaij, Martijn Spitters, and Anette Hulth. Headline extraction based on a combination of uni- and multidocument summarization techniques. In Proceedings of the ACL Workshop on Automatic Summarization/Document Understanding Conference (DUC 2002). ACL, 2002.
[Lin, 2004] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, volume 8. Barcelona, Spain, 2004.
[Nallapati et al., 2016a] Ramesh Nallapati, Bing Xiang, and Bowen Zhou. Sequence-to-sequence RNNs for text summarization. 2016.
[Nallapati et al., 2016b] Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023, 2016.
[Neto et al., 2002] Joel Neto, Alex Freitas, and Celso Kaestner. Automatic text summarization using a machine learning approach. Advances in Artificial Intelligence, pages 205–215, 2002.
[Over et al., 2007] Paul Over, Hoa Dang, and Donna Harman. DUC in context. Information Processing & Management, 43(6):1506–1520, 2007.
[Paszke et al., 2017] Adam Paszke, Sam Gross, and Soumith Chintala. PyTorch, 2017.
[Paulus et al., 2017] Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. CoRR, abs/1705.04304, 2017.
[Ranzato et al., 2015] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
[Rennie et al., 2016] Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jarret Ross, and Vaibhava Goel. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.
[Rush et al., 2015] Alexander M Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685, 2015.
[Shen et al., 2016] Shiqi Shen, Yu Zhao, Zhiyuan Liu, Maosong Sun, et al. Neural headline generation with sentence-wise optimization. arXiv preprint arXiv:1604.01904, 2016.
[Sutskever et al., 2013] Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
[Sutskever et al., 2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
[Williams and Zipser, 1989] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, June 1989.
[Xing et al., 2017] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. Topic aware neural response generation. In AAAI, pages 3351–3357, 2017.
[Zhou et al., 2017] Qingyu Zhou, Nan Yang, Furu Wei, and Ming Zhou. Selective encoding for abstractive sentence summarization. arXiv preprint arXiv:1704.07073, 2017.