Table 1: Comparison of summarization datasets with respect to overall corpus size; size of the training, validation, and test sets; average document (source) and summary (target) length (in words and sentences); and vocabulary size on both source and target. For CNN and DailyMail, we used the original splits of Hermann et al. (2015) and followed Narayan et al. (2018b) to preprocess them. For NY Times (Sandhaus, 2008), we used the splits and pre-processing steps of Paulus et al. (2018). For the vocabulary, we lowercase tokens.
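The vocabulary figures in Table 1 depend on the lowercasing step mentioned in the caption. A minimal sketch of that computation (whitespace tokenization is our simplifying assumption here; the actual preprocessing follows Narayan et al. (2018b) and Paulus et al. (2018)):

```python
from collections import Counter

def vocab_size(texts):
    """Vocabulary size of one corpus side (source or target),
    lowercasing tokens as described in the Table 1 caption."""
    counts = Counter(tok.lower() for text in texts for tok in text.split())
    return len(counts)

# Three token types survive lowercasing: "the", "cat", "dog".
vocab_size(["The cat", "the dog"])
```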
Table 2: Corpus bias towards extractive methods in the CNN, DailyMail, NY Times, and XSum datasets. We show the proportion of novel n-grams in gold summaries. We also report ROUGE scores for the LEAD baseline and the extractive oracle system EXT-ORACLE. Results are computed on the test set.
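The novel n-gram proportion reported in Table 2 counts summary n-grams that never occur anywhere in the source document. A minimal sketch (assuming whitespace-tokenized, preprocessed text; function names are ours):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def novel_ngram_proportion(source, summary, n):
    """Fraction of n-grams in the gold summary that do not
    appear anywhere in the source document."""
    source_set = set(ngrams(source.split(), n))
    summary_grams = ngrams(summary.split(), n)
    if not summary_grams:
        return 0.0
    novel = sum(1 for g in summary_grams if g not in source_set)
    return novel / len(summary_grams)

# 3 of the 5 summary bigrams are unseen in the source.
novel_ngram_proportion("the cat sat on the mat", "a cat sat on a mat", 2)  # 0.6
```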
across datasets; however, it is much steeper in XSum, whose summaries exhibit approximately 83% novel bigrams, 96% novel trigrams, and 98% novel 4-grams (the comparison datasets display around 47–55% new bigrams, 58–72% new trigrams, and 63–80% novel 4-grams).

We further evaluated two extractive methods on these datasets. LEAD is often used as a strong lower bound for news summarization (Nenkova, 2005) and creates a summary by selecting the first few sentences or words in the document. We extracted the first 3 sentences for CNN documents and the first 4 sentences for DailyMail (Narayan et al., 2018b). Following previous work (Durrett et al., 2016; Paulus et al., 2018), we obtained LEAD summaries based on the first 100 words for NY Times documents. For XSum, we selected the first sentence in the document (excluding the one-line summary) to generate the LEAD. Our second method, EXT-ORACLE, can be viewed as an upper bound for extractive models (Nallapati et al., 2017; Narayan et al., 2018b). It creates an oracle summary by selecting the best possible set of sentences in the document that gives the highest ROUGE (Lin and Hovy, 2003) with respect to the gold summary. For XSum, we simply selected the single-best sentence in the document as summary.

Table 2 reports the performance of the two extractive methods using ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) with the gold summaries as reference. The LEAD baseline performs extremely well on CNN, DailyMail, and NY Times, confirming that these datasets are biased towards extractive methods. EXT-ORACLE further shows that improved sentence selection would bring further performance gains to extractive approaches. Abstractive systems trained on these datasets often have a hard time beating the LEAD, let alone EXT-ORACLE, or display a low degree of novelty in their summaries (See et al., 2017; Tan and Wan, 2017; Paulus et al., 2018; Pasunuru and Bansal, 2018; Celikyilmaz et al., 2018). Interestingly, LEAD and EXT-ORACLE perform poorly on XSum, underscoring the fact that it is less biased towards extractive methods.

In line with our findings, Grusky et al. (2018) have recently reported similar extractive biases in existing datasets. They constructed a new dataset called “Newsroom” which demonstrates a high diversity of summarization styles. XSum is not diverse: it focuses on a single news outlet (i.e., BBC) and a uniform summarization style (i.e., a single sentence). However, it is sufficiently large for neural network training, and we hope it will spur further research towards the development of abstractive summarization models.

3 Convolutional Sequence-to-Sequence Learning for Summarization

Unlike tasks like machine translation and paraphrase generation, where there is often a one-to-one correspondence between source and target words, [...] allowing it to recognize pertinent content (i.e., by foregrounding salient words in the document). In particular, we improve the convolutional encoder by associating each word with a vector representing topic salience, and the convolutional decoder by conditioning each word prediction on the document topic vector.

[Figure 2: architecture of the topic-aware convolutional encoder-decoder (word and position embeddings x_i + p_i, topic embeddings t′_i ⊗ t_D, stacked GLU-gated convolutions, and attention).]

Model Overview At the core of our model is a simple convolutional block structure that computes intermediate states based on a fixed number of input elements. Our convolutional encoder (shown at the top of Figure 2) applies this unit across the document. We repeat these operations in a stacked fashion to get a multi-layer hierarchical representation over the input document, where words at closer distances interact at lower layers while distant words interact at higher layers. The interaction between words through hierarchical layers effectively captures long-range dependencies.
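The claim that nearby words interact at lower layers while distant words interact only at higher layers follows from how the receptive field grows under stacking: each convolution of kernel width k widens the span a state covers by k − 1 positions. A back-of-the-envelope sketch (our illustration, assuming stride-1, undilated convolutions; not the authors' code):

```python
def receptive_field(kernel_width, num_layers):
    """Number of input positions visible to a single state after
    stacking `num_layers` convolutions of width `kernel_width`."""
    return num_layers * (kernel_width - 1) + 1

# With width-3 convolutions, two words 10 positions apart first
# fall inside a shared receptive field at the fifth layer.
[receptive_field(3, layers) for layers in range(1, 6)]  # [3, 5, 7, 9, 11]
```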
Table 6: Example output summaries on the XSum test set with [ROUGE-1, ROUGE-2 and ROUGE-L] scores, gold-standard reference, and corresponding questions. Words highlighted in blue are either the right answer or constitute appropriate context for inferring it; words in red lead to the wrong answer.
CONVS2S have 20.07 and 20.22 words, respectively, while GOLD summaries are the longest with 23.26 words. Interestingly, PTGEN trained on XSum only copies 4% of 4-grams in the source document, 10% of trigrams, 27% of bigrams, and 73% of unigrams. This is in sharp contrast to PTGEN trained on CNN/DailyMail, which exhibits mostly extractive patterns; it copies more than 85% of 4-grams in the source document, 90% of trigrams, 95% of bigrams, and 99% of unigrams (See et al., 2017). This result further strengthens our hypothesis that XSum is a good testbed for abstractive methods.

Human Evaluation In addition to automatic evaluation using ROUGE, which can be misleading when used as the only means to assess the informativeness of summaries (Schluter, 2017), we also evaluated system output by eliciting human judgments in two ways.

In our first experiment, participants were asked to compare summaries produced from the EXT-ORACLE baseline, PTGEN, the best performing system of See et al. (2017), CONVS2S, our topic-aware model T-CONVS2S, and the human-authored gold summary (GOLD). We did not include extracts from the LEAD as they were significantly inferior to other models.

The study was conducted on the Amazon Mechanical Turk platform using Best-Worst Scaling (BWS; Louviere and Woodworth 1991; Louviere et al. 2015), a less labor-intensive alternative to paired comparisons that has been shown to produce more reliable results than rating scales (Kiritchenko and Mohammad, 2017). Participants were presented with a document and summaries generated from two out of five systems and were asked to decide which summary was better and which one was worse in order of informativeness (does the summary capture important information in the document?) and fluency (is the summary written in well-formed English?). Examples of system summaries are shown in Table 6. We randomly selected 50 documents from the XSum test set and compared all possible combinations of two out of five systems for each document. We collected judgments from three different participants for each comparison. The order of summaries was randomized per document and the order of documents per participant.

The score of a system was computed as the percentage of times it was chosen as best minus the percentage of times it was selected as worst. The scores range from −1 (worst) to 1 (best) and are shown in Table 7. Perhaps unsurprisingly, human-authored summaries were considered best, whereas T-CONVS2S was ranked 2nd, followed by EXT-ORACLE and CONVS2S. PTGEN was ranked worst with the lowest score of −0.218. We carried out pairwise comparisons between all models to assess whether system differences are statistically significant. GOLD is significantly different from all other systems, and T-CONVS2S is significantly different from CONVS2S and PTGEN (using a one-way ANOVA with post-hoc Tukey HSD tests; p < 0.01). All other differences are not statistically significant.

Models       Score    QA
EXT-ORACLE   -0.121   15.70
PTGEN        -0.218   21.40
CONVS2S      -0.130   30.90
T-CONVS2S     0.037   46.05
GOLD          0.431   97.23

Table 7: System ranking according to human judgments and QA-based evaluation.

For our second experiment we used a question-answering (QA) paradigm (Clarke and Lapata, 2010; Narayan et al., 2018b) to assess the degree to which the models retain key information from the document. We used the same 50 documents as in our first elicitation study. We wrote two fact-based questions per document, just by reading the summary, under the assumption that it highlights the most important content of the news article. Questions were formulated so as not to reveal answers to subsequent questions. We created 100 questions in total (see Table 6 for examples). Participants read the output summaries and answered the questions as best they could without access to the document or the gold summary. The more questions can be answered, the better the corresponding system is at summarizing the document as a whole. Five participants answered questions for each summary.

We followed the scoring mechanism introduced in Clarke and Lapata (2010). A correct answer was marked with a score of one, partially correct answers with a score of 0.5, and zero otherwise. The final score for a system is the average of all its question scores. Answers were again elicited using Amazon's Mechanical Turk crowdsourcing platform. We uploaded the data in batches (one system at a time) to ensure that the same participant does not evaluate summaries from different systems on the same set of questions.

Table 7 shows the results of the QA evaluation. Based on summaries generated by T-CONVS2S, participants can answer 46.05% of the questions correctly. Summaries generated by CONVS2S, PTGEN, and EXT-ORACLE provide answers to 30.90%, 21.40%, and 15.70% of the questions, respectively. Pairwise differences between systems are all statistically significant (p < 0.01) with the exception of PTGEN and EXT-ORACLE. EXT-ORACLE performs poorly in both QA and rating evaluations. The examples in Table 6 indicate that EXT-ORACLE is often misled into selecting a sentence with the highest ROUGE (against the gold summary), but ROUGE itself does not ensure that the summary retains the most important information from the document. The QA evaluation further emphasizes that in order for the summary to be felicitous, information needs to be embedded in the appropriate context. For example, CONVS2S and PTGEN fail to answer the question “Who has resigned?” (see Table 6, second block) despite containing the correct answer “Dick Advocaat”, due to the wrong context. T-CONVS2S is able to extract important entities from the document with the right theme.

6 Conclusions

In this paper we introduced the task of “extreme summarization” together with a large-scale dataset which pushes the boundaries of abstractive methods. Experimental evaluation revealed that models which have abstractive capabilities do better on this task, and that high-level document knowledge in terms of topics and long-range dependencies is critical for recognizing pertinent content and generating informative summaries. In the future, we would like to create more linguistically-aware encoders and decoders incorporating co-reference and entity linking.
Acknowledgments We gratefully acknowledge the support of the European Research Council (Lapata; award number 681760), the European Union under the Horizon 2020 SUMMA project (Narayan, Cohen; grant agreement 688139), and Huawei Technologies (Cohen).

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, California, USA.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, USA.

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling documents. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pages 2754–2760, New York, USA.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 484–494, Berlin, Germany.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. 2017. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning, pages 933–941, Sydney, Australia.

Adji B. Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2017. TopicRNN: A recurrent neural network with long-range semantic dependency. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France.

John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1998–2008, Berlin, Germany.

Angela Fan, David Grangier, and Michael Auli. 2017. Controllable abstractive summarization. CoRR, abs/1711.05217.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.

Jonas Gehring, Michael Auli, David Grangier, and Yann Dauphin. 2017a. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 123–135, Vancouver, Canada.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017b. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1243–1252, Sydney, Australia.

Shalini Ghosh, Oriol Vinyals, Brian Strope, Scott Roy, Tom Dean, and Larry Heck. 2016. Contextual LSTM (CLSTM) models for large scale NLP tasks. CoRR, abs/1602.06291.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. NEWSROOM: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, USA.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, Las Vegas, USA.

Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pages 1693–1701. Morgan Kaufmann.

Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 465–470, Vancouver, Canada.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 71–78, Edmonton, Canada.
Jordan J. Louviere, Terry N. Flynn, and Anthony Alfred John Marley. 2015. Best-Worst Scaling: Theory, Methods and Applications. Cambridge University Press.

Jordan J. Louviere and George G. Woodworth. 1991. Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper.

Inderjeet Mani. 2001. Automatic Summarization. Natural Language Processing. John Benjamins Publishing Company.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. In Proceedings of the Spoken Language Technology Workshop, pages 234–239. IEEE.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3075–3081, San Francisco, California, USA.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany.

Shashi Narayan, Ronald Cardenas, Nikos Papasarantopoulos, Shay B. Cohen, Mirella Lapata, Jiangsheng Yu, and Yi Chang. 2018a. Document modeling with external attention for sentence extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018b. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, USA.

Shashi Narayan, Nikos Papasarantopoulos, Shay B. Cohen, and Mirella Lapata. 2017. Neural extractive summarization with side information. CoRR, abs/1704.04530.

Ani Nenkova. 2005. Automatic text summarization of newswire: Lessons learned from the Document Understanding Conference. In Proceedings of the 29th National Conference on Artificial Intelligence, pages 1436–1441, Pittsburgh, Pennsylvania, USA.

Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, pages 1310–1318, Atlanta, GA, USA.

Ramakanth Pasunuru and Mohit Bansal. 2018. Multi-reward reinforced summarization with saliency and entailment. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, USA.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, BC, Canada.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12).

Natalie Schluter. 2017. The limits of automatic summarisation according to ROUGE. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Short Papers, pages 41–45, Valencia, Spain.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1073–1083, Vancouver, Canada.

Xing Shi, Kevin Knight, and Deniz Yuret. 2016. Why neural translations are the right length. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2278–2282, Austin, Texas.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-to-end memory networks. In Advances in Neural Information Processing Systems 28, pages 2440–2448. Morgan Kaufmann.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139–1147, Atlanta, GA, USA.

Jiwei Tan and Xiaojun Wan. 2017. Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1171–1181, Vancouver, Canada.