Ming Zhong∗, Pengfei Liu∗, Danqing Wang, Xipeng Qiu†, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{mzhong18,pfliu14,dqwang18,xpqiu,xjhuang}@fudan.edu.cn
Table 3: Results of different architectures over different domains, where Enc. and Dec. denote the document encoder and decoder, respectively. Lead extracts the first k sentences as the summary and usually serves as a competitive lower bound. Oracle denotes the ground truth extracted by the greedy algorithm (Nallapati et al., 2017) and usually serves as the upper bound. The number k in parentheses indicates that k sentences are extracted during testing and that lead-k is taken as the lower bound for that domain. All experiments use word2vec to obtain word representations.
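For concreteness, the following is a minimal sketch of the greedy oracle construction in the spirit of Nallapati et al. (2017): sentences are added one at a time as long as they increase ROUGE against the reference summary. The `rouge` helper here is a simplified stand-in (unigram recall) for a real ROUGE scorer (Lin, 2004), and the stopping rule is an assumption.

```python
# Illustrative sketch of the greedy oracle of Nallapati et al. (2017):
# iteratively add the sentence that most improves ROUGE against the
# reference summary, stopping when no sentence helps.

def rouge(candidate_sents, reference):
    """Stand-in ROUGE: unigram recall against the reference."""
    cand = set(w for s in candidate_sents for w in s.lower().split())
    ref = reference.lower().split()
    return sum(w in cand for w in ref) / max(len(ref), 1)

def greedy_oracle(doc_sents, reference):
    """Return indices of sentences greedily maximizing ROUGE."""
    selected, best = [], 0.0
    while True:
        gains = [
            (rouge([doc_sents[j] for j in selected + [i]], reference), i)
            for i in range(len(doc_sents)) if i not in selected
        ]
        if not gains:
            break
        score, idx = max(gains)
        if score <= best:          # no sentence improves ROUGE: stop
            break
        best = score
        selected.append(idx)
    return sorted(selected)
```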
and “TheGuardian” domains, Pointer surpasses SeqLab by at least 1.0 points (R-1). We attempt to explain this difference from the following three perspectives.

Repetition For domains that require extracting multiple sentences as the summary (the first two domains in Tab. 3), Pointer is aware of its previous predictions, which allows it to reduce the duplication of n-grams compared with SeqLab. As shown in Fig. 1(a), models with Pointer always obtain higher repetition scores than models with SeqLab when extracting six sentences, which indicates that Pointer does capture word-level information from previously selected sentences, with positive effects on subsequent decisions.
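To make the architectural difference behind this behavior concrete, the following illustrative sketch (not the paper's implementation) contrasts the two decoders: SeqLab scores sentences independently, while Pointer selects autoregressively and conditions each step on its previous choice. The tensor shapes, scoring functions, and initial query are assumptions.

```python
import numpy as np

# Illustrative contrast between the two decoders (not the paper's code).
# sent_reprs: (n_sents, d) sentence vectors from some document encoder.

def seqlab_decode(sent_reprs, w, k):
    """SeqLab: score each sentence independently, take the top k."""
    scores = sent_reprs @ w                      # one score per sentence
    return sorted(np.argsort(-scores)[:k])

def pointer_decode(sent_reprs, w_query, k):
    """Pointer: pick sentences one by one; each step conditions the
    query on the previously selected sentence, so the decoder 'sees'
    its own history (e.g., to avoid redundancy)."""
    query = sent_reprs.mean(axis=0)              # initial decoder state
    selected = []
    for _ in range(k):
        scores = sent_reprs @ (w_query @ query)  # attention over sentences
        scores[selected] = -np.inf               # never re-select
        idx = int(np.argmax(scores))
        selected.append(idx)
        query = sent_reprs[idx]                  # condition on last choice
    return sorted(selected)

d, n = 8, 12
reprs = np.random.randn(n, d)
print(seqlab_decode(reprs, np.random.randn(d), k=3))
print(pointer_decode(reprs, np.random.randn(d, d), k=3))
```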
Positional Bias For domains that only require extracting one sentence as the summary (the last six domains in Tab. 3), Pointer still performs better than SeqLab. As shown in Fig. 1(b), the performance gap between the two decoders grows as the positional bias of the datasets increases. For example, from Tab. 3 we can see that in domains with low positional bias, such as “FoxNews (1.8)” and “NYDailyNews (1.9)”, SeqLab achieves performance close to that of Pointer. By contrast, the gap grows on domains with high positional bias (“TheGuardian (2.9)”, “WashingtonPost (3.0)”). Consequently, SeqLab is more sensitive to positional bias, which impairs its performance on some datasets.

Sentence length We find that Pointer can capture sentence-length information based on its previous predictions, while SeqLab cannot. We can see from Fig. 1(c) that models with Pointer tend to choose a longer sentence as the first sentence and then greatly reduce the length of the sentences extracted subsequently. In comparison, models with SeqLab tend to extract sentences of similar length. This ability allows Pointer to adaptively change the length of the extracted sentences, thereby achieving better performance regardless of whether one sentence or multiple sentences are required.

4.3.2 Analysis of Encoders

In this section, we analyze the two encoders, LSTM and Transformer, in different testing environments.

Domains From Tab. 3, we make the following observations:
1) Transformer can outperform LSTM on some datasets such as “NYDailyNews” by a relatively large margin, while LSTM beats Transformer on some domains with close margins. Besides, across the different training phases of these eight domains, the hyper-parameters of Transformer remain unchanged^5, while for LSTM many sets of hyper-parameters are needed.

^5 4 layers and 512 dimensions for Pointer; 12 layers and 512 dimensions for SeqLab.
Figure 1: Different behaviours of the two decoders (SeqLab and Pointer) under different testing environments. (a) shows repetition scores of different architectures when extracting six sentences on CNN/DailyMail. (b) shows the relationship between ∆R and positional bias; the abscissa denotes the positional bias of six different datasets, and ∆R denotes the average ROUGE difference between the two decoders under different encoders. (c) shows the average length of the k-th sentence extracted by different architectures.
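The exact definition of the repetition score is not reproduced here; one plausible formulation consistent with Fig. 1(a), where higher is better, is the fraction of unique n-grams among all n-grams in the extracted summary:

```python
# A plausible repetition score in the spirit of Fig. 1(a): the ratio of
# unique n-grams in the extracted summary (higher = less duplication).
# This is an assumed formulation, not necessarily the paper's exact one.

def rep_n(summary_sents, n):
    tokens = [w for s in summary_sents for w in s.lower().split()]
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

extract = ["the market fell sharply on monday",
           "the market fell again on tuesday"]
print([round(rep_n(extract, n), 2) for n in (2, 3, 4, 5)])  # REP2..REP5
```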
                      | Word2Vec            | GloVe               | BERT                | NEWSROOM
Dec.     Enc.         | R-1   R-2   R-L     | R-1   R-2   R-L     | R-1   R-2   R-L     | R-1   R-2   R-L
SeqLab   LSTM         | 41.22 18.72 37.52   | 41.33 18.78 37.64   | 42.18 19.64 38.53   | 41.48 18.95 37.78
SeqLab   Transformer  | 41.31 18.85 37.63   | 40.19 18.67 37.51   | 42.28 19.73 38.59   | 41.32 18.83 37.63
Pointer  LSTM         | 41.56 18.77 37.83   | 41.15 18.38 37.43   | 42.39 19.51 38.69   | 41.35 18.59 37.61
Pointer  Transformer  | 41.36 18.59 37.67   | 41.10 18.38 37.41   | 42.09 19.31 38.41   | 41.54 18.73 37.83

Table 5: Results of different architectures with different pre-trained knowledge on CNN/DailyMail, where Enc. and Dec. denote the document encoder and decoder, respectively.
Models                      R-1    R-2    R-L
Chen and Bansal (2018)      41.47  18.72  37.76
Dong et al. (2018)          41.50  18.70  37.60
Zhou et al. (2018)          41.59  19.01  37.98
Jadhav and Rajan (2018)^8   41.60  18.30  37.70
LSTM + PN                   41.56  18.77  37.83
LSTM + PN + RL              41.85  18.93  38.13
LSTM + PN + BERT            42.39  19.51  38.69
LSTM + PN + BERT + RL       42.69  19.60  38.85

Table 6: Evaluation on CNN/DailyMail. The top half of the table lists current state-of-the-art models, and the lower half lists our models.

^8 Trained and evaluated on the anonymized version.

...our model can achieve 40.08 on R-1, which is comparable to many existing models. By contrast, once the positional information is removed, the performance drops by a large margin. This experiment shows that the success of such extractive summarization heavily relies on the ability to learn positional information on CNN/DailyMail, which has been the benchmark dataset for most current work.
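A minimal sketch of one way to carry out such an ablation, assuming positional information is destroyed by shuffling the input sentence order while keeping the inverse permutation (the paper's exact procedure may differ):

```python
import random

# One way to ablate positional information: shuffle the sentence order
# while remembering the permutation, so predictions can still be mapped
# back to the original document. (An assumed setup, not the paper's.)

def shuffle_document(sents, seed=0):
    order = list(range(len(sents)))
    random.Random(seed).shuffle(order)
    shuffled = [sents[i] for i in order]
    return shuffled, order

sents = ["s0", "s1", "s2", "s3"]
shuffled, order = shuffle_document(sents)
picked_in_shuffled = [0, 2]                       # model's selections
picked_original = sorted(order[i] for i in picked_in_shuffled)
print(shuffled, picked_original)
```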
4.3.3 Analysis of Transferable Knowledge

Next, we show how different types of transferable knowledge influence our summarization models.
Unsupervised Pre-training Here, as a baseline, word2vec is used to obtain word representations based solely on the training set of CNN/DailyMail.

As shown in Tab. 5, we find that context-independent word representations do not contribute much to the current models. However, when the models are equipped with BERT, we are excited to observe that the performance of all types of architectures improves by a large margin. Specifically, the model CNN-LSTM-Pointer achieves a new state of the art with 42.11 on R-1, dramatically surpassing existing models.
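A minimal sketch of this baseline with gensim's Word2Vec (4.x API); the toy corpus and hyper-parameters stand in for the tokenized CNN/DailyMail training split and the paper's actual settings:

```python
from gensim.models import Word2Vec

# Baseline word representations: word2vec trained only on the training
# split (illustrative corpus and hyper-parameters, not the paper's).
corpus = [["police", "arrested", "two", "suspects"],
          ["the", "market", "fell", "sharply"]]   # stand-in for CNN/DM

w2v = Word2Vec(sentences=corpus, vector_size=300,
               window=5, min_count=1, workers=4)
embedding = w2v.wv["market"]                      # (300,) vector
print(embedding.shape)
```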
Supervised Pre-training In most cases, our models can benefit from pre-trained parameters learned on the NEWSROOM dataset. However, the model CNN-LSTM-Pointer fails, and its performance decreases. We explain this phenomenon as follows: the transfer from CNN/DailyMail to NEWSROOM suffers from a domain-shift problem, in which the distribution of the golden labels' positions changes. Moreover, Fig. 2 shows that CNN-LSTM-Pointer is more sensitive to this ordering change, therefore obtaining lower performance.

Why does BERT work? We investigate two different ways of using BERT to figure out where BERT brings improvements for extractive summarization systems. In the first usage, we feed each individual sentence to BERT to obtain a sentence representation that contains no contextualized information; the model gets a high R-1 score of 41.7. However, when we feed the entire article to BERT to obtain token representations and compute the sentence representation through mean pooling, performance soars to an R-1 score of 42.3.

This experiment indicates that although BERT can provide a powerful sentence embedding, the key factor for extractive summarization is contextualized information, and this type of information carries the positional relationship between sentences, which, as shown above, is critical to the extractive summarization task.
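A sketch of the two usages with the HuggingFace transformers library; the checkpoint name, token spans, and pooling details are illustrative assumptions rather than the paper's exact integration:

```python
import torch
from transformers import BertModel, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sents = ["Police arrested two suspects.", "The market fell sharply."]

# Usage 1: encode each sentence in isolation (no cross-sentence context).
solo = []
for s in sents:
    ids = tok(s, return_tensors="pt")
    with torch.no_grad():
        out = bert(**ids).last_hidden_state      # (1, len, 768)
    solo.append(out[0, 0])                       # [CLS] as sentence vector

# Usage 2: encode the whole article, then mean-pool each sentence's tokens.
ids = tok(" ".join(sents), return_tensors="pt")
with torch.no_grad():
    out = bert(**ids).last_hidden_state[0]       # (len, 768), contextualized
spans = [(1, 7), (7, 13)]                        # illustrative token offsets
ctx = [out[a:b].mean(dim=0) for a, b in spans]   # contextual sentence vectors
```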
4.4 Learning Schema and Complementarity

Besides supervised learning, reinforcement learning has recently been used in text summarization to introduce additional constraints. In this paper, we also explore whether several advanced techniques can be complementary with each other.

We first choose LSTM-Pointer and LSTM-Pointer + BERT as the base models, and then introduce reinforcement learning to further optimize them. As shown in Tab. 6, we observe that even though the performance of LSTM+PN has already been largely improved by BERT, applying reinforcement learning improves it further, which indicates that there is indeed complementarity between architecture, transferable knowledge, and reinforcement learning.
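As an illustration of this learning schema, here is a compact REINFORCE-style sketch in which a sampled extract is rewarded and the corresponding log-probabilities are reinforced; the reward helper, the sampling scheme (with replacement), and the toy "extractor" are placeholders, not the paper's training recipe:

```python
import torch

# REINFORCE-style fine-tuning sketch: sample an extract, reward it, and
# raise the log-probability of high-reward selections. `rouge_reward`
# is a placeholder for a real ROUGE-based reward.

def rouge_reward(selected, oracle):
    """Placeholder reward: overlap with oracle labels."""
    return len(set(selected) & set(oracle)) / max(len(oracle), 1)

scores = torch.randn(10, requires_grad=True)      # extractor's sentence logits
opt = torch.optim.Adam([scores], lr=0.01)

for _ in range(100):
    probs = torch.softmax(scores, dim=0)
    dist = torch.distributions.Categorical(probs)
    picks = dist.sample((3,))                     # sample a 3-sentence extract
    reward = rouge_reward(picks.tolist(), oracle=[0, 2, 5])
    loss = -(dist.log_prob(picks).sum() * reward) # policy gradient
    opt.zero_grad(); loss.backward(); opt.step()
```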
5 Conclusion

In this paper, we seek to better understand how neural extractive summarization systems can benefit from different types of model architectures, transferable knowledge, and learning schemas. Our detailed observations can provide hints that help follow-up researchers design more powerful learning frameworks.

References
Kristjan Arumae and Fei Liu. 2018. Reinforced extractive summarization with question-focused rewards. In Proceedings of ACL 2018, Student Research Workshop, pages 105–111.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–686.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–494.

Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2018. Transformer-XL: Language modeling with longer-term dependency.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Yue Dong, Yikang Shen, Eric Crawford, Herke van Hoof, and Jackie Chi Kit Cheung. 2018. BanditSum: Extractive summarization as a contextual bandit. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3739–3748.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.

Masaru Isonuma, Toru Fujino, Junichiro Mori, Yutaka Matsuo, and Ichiro Sakata. 2017. Extractive summarization using multi-task learning with document classification. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2101–2110.

Aishwarya Jadhav and Vaibhav Rajan. 2018. Extractive summarization with SWAP-NET: Sentences and words from alternating pointer networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 142–151.

Chris Kedzie, Kathleen McKeown, and Hal Daume III. 2018. Content selection in deep learning models of summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1818–1828.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Pengfei Liu, Xipeng Qiu, Jifan Chen, and Xuanjing Huang. 2016a. Deep fusion LSTMs for text semantic matching. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1034–1043.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016b. Recurrent neural network for text classification with multi-task learning. In Proceedings of IJCAI, pages 2873–2879.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–10.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In CoNLL 2016, page 280.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391.

Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803.

Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence.

Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural document summarization by jointly learning to score and select sentences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–663.