
Text Summarization with Pretrained Encoders

Yang Liu and Mirella Lapata


Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
yang.liu2@ed.ac.uk, mlap@inf.ed.ac.uk

arXiv:1908.08345v2 [cs.CL] 5 Sep 2019

Abstract

Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019) represents the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks. In this paper, we showcase how BERT can be usefully applied in text summarization and propose a general framework for both extractive and abstractive models. We introduce a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several inter-sentence Transformer layers. For abstractive summarization, we propose a new fine-tuning schedule which adopts different optimizers for the encoder and the decoder as a means of alleviating the mismatch between the two (the former is pretrained while the latter is not). We also demonstrate that a two-staged fine-tuning approach can further boost the quality of the generated summaries. Experiments on three datasets show that our model achieves state-of-the-art results across the board in both extractive and abstractive settings.1

1 Our code is available at https://github.com/nlpyang/PreSumm.

1 Introduction

Language model pretraining has advanced the state of the art in many NLP tasks ranging from sentiment analysis, to question answering, natural language inference, named entity recognition, and textual similarity. State-of-the-art pretrained models include ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and more recently Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019). BERT combines both word and sentence representations in a single very large Transformer (Vaswani et al., 2017); it is pretrained on vast amounts of text, with an unsupervised objective of masked language modeling and next-sentence prediction, and can be fine-tuned with various task-specific objectives.

In most cases, pretrained language models have been employed as encoders for sentence- and paragraph-level natural language understanding problems (Devlin et al., 2019) involving various classification tasks (e.g., predicting whether any two sentences are in an entailment relationship, or determining the completion of a sentence among four alternative sentences). In this paper, we examine the influence of language model pretraining on text summarization. Different from previous tasks, summarization requires wide-coverage natural language understanding going beyond the meaning of individual words and sentences. The aim is to condense a document into a shorter version while preserving most of its meaning. Furthermore, under abstractive modeling formulations, the task requires language generation capabilities in order to create summaries containing novel words and phrases not featured in the source text, while extractive summarization is often defined as a binary classification task with labels indicating whether a text span (typically a sentence) should be included in the summary.

We explore the potential of BERT for text summarization under a general framework encompassing both extractive and abstractive modeling paradigms. We propose a novel document-level encoder based on BERT which is able to encode a document and obtain representations for its sentences. Our extractive model is built on top of this encoder by stacking several inter-sentence Transformer layers to capture document-level features for extracting sentences. Our abstractive model adopts an encoder-decoder architecture, combining the same pretrained BERT encoder with a randomly-initialized Transformer decoder (Vaswani et al., 2017). We design a new training schedule which separates the optimizers of the encoder and the decoder in order to accommodate the fact that the former is pretrained while the latter must be trained from scratch. Finally, motivated by previous work showing that the combination of extractive and abstractive objectives can help generate better summaries (Gehrmann et al., 2018), we present a two-stage approach where the encoder is fine-tuned twice, first with an extractive objective and subsequently on the abstractive summarization task.

We evaluate the proposed approach on three single-document news summarization datasets representative of different writing conventions (e.g., important information is concentrated at the beginning of the document or distributed more evenly throughout) and summary styles (e.g., verbose vs. more telegraphic; extractive vs. abstractive). Across datasets, we experimentally show that the proposed models achieve state-of-the-art results under both extractive and abstractive settings. Our contributions in this work are three-fold: a) we highlight the importance of document encoding for the summarization task; a variety of recently proposed techniques aim to enhance summarization performance via copying mechanisms (Gu et al., 2016; See et al., 2017; Nallapati et al., 2017), reinforcement learning (Narayan et al., 2018b; Paulus et al., 2018; Dong et al., 2018), and multiple communicating encoders (Celikyilmaz et al., 2018); we achieve better results with a minimum-requirement model without using any of these mechanisms; b) we showcase ways to effectively employ pretrained language models in summarization under both extractive and abstractive settings; we would expect any improvements in model pretraining to translate into better summarization in the future; and c) the proposed models can be used as a stepping stone to further improve summarization performance, as well as baselines against which new proposals are tested.

2 Background

2.1 Pretrained Language Models

Pretrained language models (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Dong et al., 2019; Zhang et al., 2019) have recently emerged as a key technology for achieving impressive gains in a wide variety of natural language tasks. These models extend the idea of word embeddings by learning contextual representations from large-scale corpora using a language modeling objective. Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2019) is a new language representation model which is trained with a masked language modeling and a "next sentence prediction" task on a corpus of 3,300M words.

The general architecture of BERT is shown in the left part of Figure 1. Input text is first preprocessed by inserting two special tokens. [CLS] is appended to the beginning of the text; the output representation of this token is used to aggregate information from the whole sequence (e.g., for classification tasks). Token [SEP] is inserted after each sentence as an indicator of sentence boundaries. The modified text is then represented as a sequence of tokens X = [w1, w2, ..., wn]. Each token wi is assigned three kinds of embeddings: token embeddings indicate the meaning of each token, segmentation embeddings are used to discriminate between two sentences (e.g., during a sentence-pair classification task), and position embeddings indicate the position of each token within the text sequence. These three embeddings are summed into a single input vector xi and fed to a bidirectional Transformer with multiple layers:

h̃^l = LN(h^(l−1) + MHAtt(h^(l−1)))   (1)
h^l = LN(h̃^l + FFN(h̃^l))   (2)

where h^0 = x are the input vectors; LN is the layer normalization operation (Ba et al., 2016); MHAtt is the multi-head attention operation (Vaswani et al., 2017); and superscript l indicates the depth of the stacked layer. On the top layer, BERT generates an output vector ti for each token with rich contextual information.
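Equations (1) and (2) describe a standard residual-plus-layer-normalization Transformer block. The following is a minimal PyTorch sketch of such a block, given only as an illustration of the equations rather than BERT's or our actual implementation; the hidden size, number of heads, and feed-forward width are placeholder values.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One post-norm Transformer layer: Eq. (1) wraps multi-head self-attention
    in a residual connection and layer normalization, Eq. (2) does the same
    around a position-wise feed-forward network."""

    def __init__(self, hidden: int = 768, heads: int = 8, ff: int = 2048, drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=drop, batch_first=True)
        self.ln1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(nn.Linear(hidden, ff), nn.GELU(), nn.Linear(ff, hidden))
        self.ln2 = nn.LayerNorm(hidden)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Eq. (1): h_tilde = LN(h + MHAtt(h))
        attn_out, _ = self.attn(h, h, h)
        h_tilde = self.ln1(h + attn_out)
        # Eq. (2): h_new = LN(h_tilde + FFN(h_tilde))
        return self.ln2(h_tilde + self.ffn(h_tilde))

# x: a batch of summed token/segment/position embeddings, shape (batch, seq_len, hidden)
x = torch.randn(2, 16, 768)
h = x
for layer in [TransformerBlock(), TransformerBlock()]:
    h = layer(h)  # h^l computed from h^(l-1)
```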
Pretrained language models are usually used to enhance performance in language understanding tasks. Very recently, there have been attempts to apply pretrained models to various generation problems (Edunov et al., 2019; Rothe et al., 2019). When fine-tuning for a specific task, unlike ELMo whose parameters are usually fixed, parameters in BERT are jointly fine-tuned with additional task-specific parameters.

2.2 Extractive Summarization

Extractive summarization systems create a summary by identifying (and subsequently concatenating) the most important sentences in a document.
[Figure 1 appears here: for an example three-sentence input, it shows the token, segment, and position embeddings that are summed and fed to the Transformer layers of the original BERT (left) and of BERTSUM (right).]

Figure 1: Architecture of the original BERT model (left) and BERTSUM (right). The sequence on top is the input document, followed by the summation of three kinds of embeddings for each token. The summed vectors are used as input embeddings to several bidirectional Transformer layers, generating contextual vectors for each token. BERTSUM extends BERT by inserting multiple [CLS] symbols to learn sentence representations and using interval segmentation embeddings (illustrated in red and green color) to distinguish multiple sentences.
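To make the BERTSUM input format shown in Figure 1 concrete, here is a small self-contained sketch (an illustration of the scheme described in Section 3.1, not our preprocessing code) that builds the token sequence, interval segment ids, and position ids for a toy document; the sentences are assumed to be already split and subword-tokenized.

```python
from typing import List, Tuple

def bertsum_input(sentences: List[List[str]]) -> Tuple[List[str], List[int], List[int], List[int]]:
    """Build BERTSUM inputs from a document given as a list of already
    tokenized sentences (a stand-in for BERT's subword tokenizer)."""
    tokens, segment_ids, cls_positions = [], [], []
    for i, sent in enumerate(sentences):
        cls_positions.append(len(tokens))   # each [CLS] later represents sent_i
        seg = i % 2                         # interval segments: E_A / E_B alternate per sentence
        piece = ["[CLS]"] + sent + ["[SEP]"]
        tokens.extend(piece)
        segment_ids.extend([seg] * len(piece))
    position_ids = list(range(len(tokens)))  # extended beyond 512 for long documents
    return tokens, segment_ids, position_ids, cls_positions

doc = [["sent", "one"], ["2nd", "sent"], ["sent", "again"]]
tokens, segs, pos, cls_pos = bertsum_input(doc)
# tokens:  [CLS] sent one [SEP] [CLS] 2nd sent [SEP] [CLS] sent again [SEP]
# segs:    0 0 0 0 1 1 1 1 0 0 0 0   (alternating interval segment ids)
# cls_pos: indices of the [CLS] vectors used as sentence representations
```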

Neural models consider extractive summarization as a sentence classification problem: a neural encoder creates sentence representations and a classifier predicts which sentences should be selected as summaries. SUMMARUNNER (Nallapati et al., 2017) is one of the earliest neural approaches, adopting an encoder based on Recurrent Neural Networks. REFRESH (Narayan et al., 2018b) is a reinforcement learning-based system trained by globally optimizing the ROUGE metric. More recent work achieves higher performance with more sophisticated model structures. LATENT (Zhang et al., 2018) frames extractive summarization as a latent variable inference problem; instead of maximizing the likelihood of "gold" standard labels, their latent model directly maximizes the likelihood of human summaries given selected sentences. SUMO (Liu et al., 2019) capitalizes on the notion of structured attention to induce a multi-root dependency tree representation of the document while predicting the output summary. NEUSUM (Zhou et al., 2018) scores and selects sentences jointly and represents the state of the art in extractive summarization.

2.3 Abstractive Summarization

Neural approaches to abstractive summarization conceptualize the task as a sequence-to-sequence problem, where an encoder maps a sequence of tokens in the source document x = [x1, ..., xn] to a sequence of continuous representations z = [z1, ..., zn], and a decoder then generates the target summary y = [y1, ..., ym] token-by-token, in an auto-regressive manner, hence modeling the conditional probability p(y1, ..., ym | x1, ..., xn).

Rush et al. (2015) and Nallapati et al. (2016) were among the first to apply the neural encoder-decoder architecture to text summarization. See et al. (2017) enhance this model with a pointer-generator network (PTGEN) which allows it to copy words from the source text, and a coverage mechanism (COV) which keeps track of words that have been summarized. Celikyilmaz et al. (2018) propose an abstractive system where multiple agents (encoders) represent the document together with a hierarchical attention mechanism (over the agents) for decoding. Their Deep Communicating Agents (DCA) model is trained end-to-end with reinforcement learning. Paulus et al. (2018) also present a deep reinforced model (DRM) for abstractive summarization which handles the coverage problem with an intra-attention mechanism where the decoder attends over previously generated words. Gehrmann et al. (2018) follow a bottom-up approach (BOTTOMUP); a content selector first determines which phrases in the source document should be part of the summary, and a copy mechanism is applied only to preselected phrases during decoding. Narayan et al. (2018a) propose an abstractive model which is particularly suited to extreme summarization (i.e., single-sentence summaries), based on convolutional neural networks and additionally conditioned on topic distributions (TCONVS2S).

3 Fine-tuning BERT for Summarization

3.1 Summarization Encoder

Although BERT has been used to fine-tune various NLP tasks, its application to summarization is not as straightforward.
Since BERT is trained as a masked-language model, the output vectors are grounded to tokens instead of sentences, while in extractive summarization most models manipulate sentence-level representations. Although segmentation embeddings represent different sentences in BERT, they only apply to sentence-pair inputs, while in summarization we must encode and manipulate multi-sentential inputs. Figure 1 illustrates our proposed BERT architecture for SUMmarization (which we call BERTSUM).

In order to represent individual sentences, we insert external [CLS] tokens at the start of each sentence, and each [CLS] symbol collects features for the sentence preceding it. We also use interval segment embeddings to distinguish multiple sentences within a document. For senti we assign segment embedding EA or EB depending on whether i is odd or even. For example, for document [sent1, sent2, sent3, sent4, sent5], we would assign embeddings [EA, EB, EA, EB, EA]. This way, document representations are learned hierarchically: lower Transformer layers represent adjacent sentences, while higher layers, in combination with self-attention, represent multi-sentence discourse.

Position embeddings in the original BERT model have a maximum length of 512; we overcome this limitation by adding more position embeddings that are initialized randomly and fine-tuned with the other parameters in the encoder.

3.2 Extractive Summarization

Let d denote a document containing sentences [sent1, sent2, ..., sentm], where senti is the i-th sentence in the document. Extractive summarization can be defined as the task of assigning a label yi ∈ {0, 1} to each senti, indicating whether the sentence should be included in the summary. It is assumed that summary sentences represent the most important content of the document.

With BERTSUM, vector ti, the vector of the i-th [CLS] symbol from the top layer, can be used as the representation for senti. Several inter-sentence Transformer layers are then stacked on top of the BERT outputs to capture document-level features for extracting summaries:

h̃^l = LN(h^(l−1) + MHAtt(h^(l−1)))   (3)
h^l = LN(h̃^l + FFN(h̃^l))   (4)

where h^0 = PosEmb(T); T denotes the sentence vectors output by BERTSUM, and the function PosEmb adds sinusoid positional embeddings (Vaswani et al., 2017) to T, indicating the position of each sentence.

The final output layer is a sigmoid classifier:

ŷi = σ(Wo hi^L + bo)   (5)

where hi^L is the vector for senti from the top layer (the L-th layer) of the Transformer. In experiments, we implemented Transformers with L = 1, 2, 3 and found that a Transformer with L = 2 performed best. We name this model BERTSUMEXT.

The loss of the model is the binary classification entropy of prediction ŷi against gold label yi. Inter-sentence Transformer layers are jointly fine-tuned with BERTSUM. We use the Adam optimizer with β1 = 0.9 and β2 = 0.999. Our learning rate schedule follows Vaswani et al. (2017) with warmup (warmup = 10,000):

lr = 2e−3 · min(step^−0.5, step · warmup^−1.5)

3.3 Abstractive Summarization

We use a standard encoder-decoder framework for abstractive summarization (See et al., 2017). The encoder is the pretrained BERTSUM and the decoder is a 6-layered Transformer initialized randomly. It is conceivable that there is a mismatch between the encoder and the decoder, since the former is pretrained while the latter must be trained from scratch. This can make fine-tuning unstable; for example, the encoder might overfit the data while the decoder underfits, or vice versa. To circumvent this, we design a new fine-tuning schedule which separates the optimizers of the encoder and the decoder.

We use two Adam optimizers with β1 = 0.9 and β2 = 0.999 for the encoder and the decoder, respectively, each with different warmup steps and learning rates:

lrE = lr̃E · min(step^−0.5, step · warmupE^−1.5)   (6)
lrD = lr̃D · min(step^−0.5, step · warmupD^−1.5)   (7)

where lr̃E = 2e−3 and warmupE = 20,000 for the encoder, and lr̃D = 0.1 and warmupD = 10,000 for the decoder. This is based on the assumption that the pretrained encoder should be fine-tuned with a smaller learning rate and smoother decay, so that the encoder can be trained with more accurate gradients while the decoder is becoming stable.
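A minimal sketch, assuming PyTorch, of the separate-optimizer schedule in Equations (6) and (7): each optimizer's base learning rate is scaled by its own warmup factor. The encoder and decoder modules below are placeholders standing in for the pretrained BERTSUM encoder and the randomly initialized Transformer decoder.

```python
import torch

def noam_factor(step: int, warmup: int) -> float:
    # min(step^-0.5, step * warmup^-1.5), as in Eqs. (6)-(7); step starts at 1
    return min(step ** -0.5, step * warmup ** -1.5)

# Placeholders: in practice these would be the BERTSUM encoder and the Transformer decoder.
encoder = torch.nn.Linear(768, 768)
decoder = torch.nn.Linear(768, 768)

lr_e_tilde, warmup_e = 2e-3, 20_000   # encoder: smaller peak lr, longer warmup
lr_d_tilde, warmup_d = 0.1, 10_000    # decoder: larger peak lr, shorter warmup

opt_enc = torch.optim.Adam(encoder.parameters(), lr=lr_e_tilde, betas=(0.9, 0.999))
opt_dec = torch.optim.Adam(decoder.parameters(), lr=lr_d_tilde, betas=(0.9, 0.999))

# LambdaLR multiplies each base lr by the returned factor at every scheduler step
sched_enc = torch.optim.lr_scheduler.LambdaLR(opt_enc, lambda s: noam_factor(s + 1, warmup_e))
sched_dec = torch.optim.lr_scheduler.LambdaLR(opt_dec, lambda s: noam_factor(s + 1, warmup_d))

for step in range(3):                 # training-loop skeleton
    # ... compute the loss and call loss.backward() here ...
    opt_enc.step(); opt_dec.step()
    sched_enc.step(); sched_dec.step()
    opt_enc.zero_grad(); opt_dec.zero_grad()
```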
Datasets    # docs (train/val/test)    avg. doc length        avg. summary length    % novel bi-grams
                                       (words / sentences)    (words / sentences)    in gold summary
CNN         90,266/1,220/1,093         760.50 / 33.98         45.70 / 3.59           52.90
DailyMail   196,961/12,148/10,397      653.33 / 29.33         54.65 / 3.86           52.16
NYT         96,834/4,000/3,452         800.04 / 35.55         45.54 / 2.44           54.70
XSum        204,045/11,332/11,334      431.07 / 19.77         23.26 / 1.00           83.31

Table 1: Comparison of summarization datasets: size of training, validation, and test sets and average document and summary length (in terms of words and sentences). The proportion of novel bi-grams that do not appear in source documents but do appear in the gold summaries quantifies corpus bias towards extractive methods.

In addition, we propose a two-stage fine-tuning approach, where we first fine-tune the encoder on the extractive summarization task (Section 3.2) and then fine-tune it on the abstractive summarization task (Section 3.3). Previous work (Gehrmann et al., 2018; Li et al., 2018) suggests that using extractive objectives can boost the performance of abstractive summarization. Notice also that this two-stage approach is conceptually very simple: the model can take advantage of information shared between the two tasks without fundamentally changing its architecture. We name the default abstractive model BERTSUMABS and the two-stage fine-tuned model BERTSUMEXTABS.

4 Experimental Setup

In this section, we describe the summarization datasets used in our experiments and discuss various implementation details.

4.1 Summarization Datasets

We evaluated our model on three benchmark datasets, namely the CNN/DailyMail news highlights dataset (Hermann et al., 2015), the New York Times Annotated Corpus (NYT; Sandhaus 2008), and XSum (Narayan et al., 2018a). These datasets represent different summary styles ranging from highlights to very brief one-sentence summaries. The summaries also vary with respect to the type of rewriting operations they exemplify (e.g., some showcase more cut-and-paste operations while others are genuinely abstractive). Table 1 presents statistics on these datasets (test set); example (gold-standard) summaries are provided in the supplementary material.

CNN/DailyMail contains news articles and associated highlights, i.e., a few bullet points giving a brief overview of the article. We used the standard splits of Hermann et al. (2015) for training, validation, and testing (90,266/1,220/1,093 CNN documents and 196,961/12,148/10,397 DailyMail documents). We did not anonymize entities. We first split sentences with the Stanford CoreNLP toolkit (Manning et al., 2014) and pre-processed the dataset following See et al. (2017). Input documents were truncated to 512 tokens.

NYT contains 110,540 articles with abstractive summaries. Following Durrett et al. (2016), we split these into 100,834/9,706 training/test examples, based on the date of publication (the test set contains all articles published from January 1, 2007 onward). We used 4,000 examples from the training set as a validation set. We also followed their filtering procedure: documents with summaries of less than 50 words were removed from the dataset. The filtered test set (NYT50) includes 3,452 examples. Sentences were split with the Stanford CoreNLP toolkit (Manning et al., 2014) and pre-processed following Durrett et al. (2016). Input documents were truncated to 800 tokens.

XSum contains 226,711 news articles accompanied with a one-sentence summary, answering the question "What is this article about?". We used the splits of Narayan et al. (2018a) for training, validation, and testing (204,045/11,332/11,334) and followed the pre-processing introduced in their work. Input documents were truncated to 512 tokens.

Aside from various statistics on the three datasets, Table 1 also reports the proportion of novel bi-grams in gold summaries as a measure of their abstractiveness. We would expect models with extractive biases to perform better on datasets with (mostly) extractive summaries, and abstractive models to perform more rewrite operations on datasets with abstractive summaries. CNN/DailyMail and NYT are somewhat extractive, while XSum is highly abstractive.

4.2 Implementation Details

For both extractive and abstractive settings, we used PyTorch, OpenNMT (Klein et al., 2017) and the 'bert-base-uncased'2 version of BERT to implement BERTSUM. Both source and target texts were tokenized with BERT's subword tokenizer.

2 https://git.io/fhbJQ
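For illustration, and assuming the HuggingFace transformers package rather than our exact tooling, the checkpoint and its subword tokenizer can be loaded as follows:

```python
from transformers import BertModel, BertTokenizer

# 'bert-base-uncased' is the checkpoint used for BERTSUM
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

pieces = tokenizer.tokenize("Text summarization with pretrained encoders.")
print(pieces)  # prints a list of WordPiece subword strings
```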
Extractive Summarization  All extractive models were trained for 50,000 steps on 3 GPUs (GTX 1080 Ti) with gradient accumulation every two steps. Model checkpoints were saved and evaluated on the validation set every 1,000 steps. We selected the top-3 checkpoints based on the evaluation loss on the validation set, and report the averaged results on the test set. To train extractive models, we used a greedy algorithm similar to Nallapati et al. (2017) to obtain an oracle summary for each document. The algorithm generates an oracle consisting of multiple sentences which maximize the ROUGE-2 score against the gold summary.

When predicting summaries for a new document, we first use the model to obtain a score for each sentence. We then rank these sentences by their scores from highest to lowest, and select the top-3 sentences as the summary.

During sentence selection we use Trigram Blocking to reduce redundancy (Paulus et al., 2018). Given summary S and candidate sentence c, we skip c if there exists a trigram overlapping between c and S. The intuition is similar to Maximal Marginal Relevance (MMR; Carbonell and Goldstein 1998): we wish to minimize the similarity between the sentence being considered and the sentences which have already been selected as part of the summary.
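A minimal sketch of the trigram blocking heuristic just described (an illustration, not our implementation): a candidate sentence is skipped whenever one of its trigrams already occurs in the partial summary.

```python
from typing import List, Set, Tuple

def trigrams(tokens: List[str]) -> Set[Tuple[str, ...]]:
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def select_summary(ranked_sentences: List[List[str]], max_sents: int = 3) -> List[List[str]]:
    """Pick sentences in score order, skipping any candidate that shares
    a trigram with the sentences already selected (trigram blocking)."""
    selected: List[List[str]] = []
    seen: Set[Tuple[str, ...]] = set()
    for sent in ranked_sentences:       # assumed sorted from highest to lowest score
        if trigrams(sent) & seen:
            continue                    # overlapping trigram -> too similar, skip
        selected.append(sent)
        seen |= trigrams(sent)
        if len(selected) == max_sents:
            break
    return selected
```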
Abstractive Summarization  In all abstractive models, we applied dropout (with probability 0.1) before all linear layers; label smoothing (Szegedy et al., 2016) with smoothing factor 0.1 was also used. Our Transformer decoder has 768 hidden units and the hidden size for all feed-forward layers is 2,048. All models were trained for 200,000 steps on 4 GPUs (GTX 1080 Ti) with gradient accumulation every five steps. Model checkpoints were saved and evaluated on the validation set every 2,500 steps. We selected the top-3 checkpoints based on their evaluation loss on the validation set, and report the averaged results on the test set.

During decoding we used beam search (size 5), and tuned the α for the length penalty (Wu et al., 2016) between 0.6 and 1 on the validation set; we decode until an end-of-sequence token is emitted and repeated trigrams are blocked (Paulus et al., 2018). It is worth noting that our decoder applies neither a copy nor a coverage mechanism (See et al., 2017), despite their popularity in abstractive summarization. This is mainly because we focus on building a minimum-requirements model and these mechanisms may introduce additional hyper-parameters to tune. Thanks to the subword tokenizer, we also rarely observe issues with out-of-vocabulary words in the output; moreover, trigram blocking produces diverse summaries and manages to reduce repetitions.

Model                                   R1     R2     RL
ORACLE                                  52.59  31.24  48.87
LEAD-3                                  40.42  17.62  36.67
Extractive
SUMMARUNNER (Nallapati et al., 2017)    39.60  16.20  35.30
REFRESH (Narayan et al., 2018b)         40.00  18.20  36.60
LATENT (Zhang et al., 2018)             41.05  18.77  37.54
NEUSUM (Zhou et al., 2018)              41.59  19.01  37.98
SUMO (Liu et al., 2019)                 41.00  18.40  37.20
TransformerEXT                          40.90  18.02  37.17
Abstractive
PTGEN (See et al., 2017)                36.44  15.66  33.42
PTGEN+COV (See et al., 2017)            39.53  17.28  36.38
DRM (Paulus et al., 2018)               39.87  15.82  36.90
BOTTOMUP (Gehrmann et al., 2018)        41.22  18.68  38.34
DCA (Celikyilmaz et al., 2018)          41.69  19.47  37.92
TransformerABS                          40.21  17.76  37.09
BERT-based
BERTSUMEXT                              43.25  20.24  39.63
BERTSUMEXT w/o interval embeddings      43.20  20.22  39.59
BERTSUMEXT (large)                      43.85  20.34  39.90
BERTSUMABS                              41.72  19.39  38.76
BERTSUMEXTABS                           42.13  19.60  39.18

Table 2: ROUGE F1 results on the CNN/DailyMail test set (R1 and R2 are shorthands for unigram and bigram overlap; RL is the longest common subsequence). Results for comparison systems are taken from the authors' respective papers or obtained on our data by running publicly released software.

5 Results

5.1 Automatic Evaluation

We evaluated summarization quality automatically using ROUGE (Lin, 2004). We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) as a means of assessing informativeness, and the longest common subsequence (ROUGE-L) as a means of assessing fluency.

Table 2 summarizes our results on the CNN/DailyMail dataset. The first block in the table includes the results of an extractive ORACLE system as an upper bound. We also present the LEAD-3 baseline (which simply selects the first three sentences in a document).

The second block in the table includes various extractive models trained on the CNN/DailyMail dataset (see Section 2.2 for an overview).
For comparison to our own model, we also implemented a non-pretrained Transformer baseline (TransformerEXT) which uses the same architecture as BERTSUMEXT, but with fewer parameters. It is randomly initialized and only trained on the summarization task. TransformerEXT has 6 layers, a hidden size of 512, and a feed-forward filter size of 2,048. The model was trained with the same settings as in Vaswani et al. (2017).

The third block in Table 2 highlights the performance of several abstractive models on the CNN/DailyMail dataset (see Section 2.3 for an overview). We also include an abstractive Transformer baseline (TransformerABS) which has the same decoder as our abstractive BERTSUM models; the encoder is a 6-layer Transformer with a hidden size of 768 and a feed-forward filter size of 2,048.

The fourth block reports results with fine-tuned BERT models: BERTSUMEXT and its two variants (one without interval embeddings, and one with the large version of BERT), BERTSUMABS, and BERTSUMEXTABS. BERT-based models outperform the LEAD-3 baseline, which is not a strawman; on the CNN/DailyMail corpus it is indeed superior to several extractive (Nallapati et al., 2017; Narayan et al., 2018b; Zhou et al., 2018) and abstractive models (See et al., 2017). BERT models collectively outperform all previously proposed extractive and abstractive systems, only falling behind the ORACLE upper bound. Among BERT variants, BERTSUMEXT performs best, which is not entirely surprising; CNN/DailyMail summaries are somewhat extractive and even abstractive models are prone to copying sentences from the source document when trained on this dataset (See et al., 2017). Perhaps unsurprisingly, we observe that larger versions of BERT lead to performance improvements and that interval embeddings bring only slight gains.

Model                               R1     R2     RL
ORACLE                              49.18  33.24  46.02
LEAD-3                              39.58  20.11  35.78
Extractive
COMPRESS (Durrett et al., 2016)     42.20  24.90  —
SUMO (Liu et al., 2019)             42.30  22.70  38.60
TransformerEXT                      41.95  22.68  38.51
Abstractive
PTGEN (See et al., 2017)            42.47  25.61  —
PTGEN+COV (See et al., 2017)        43.71  26.40  —
DRM (Paulus et al., 2018)           42.94  26.02  —
TransformerABS                      35.75  17.23  31.41
BERT-based
BERTSUMEXT                          46.66  26.35  42.62
BERTSUMABS                          48.92  30.84  45.41
BERTSUMEXTABS                       49.02  31.02  45.55

Table 3: ROUGE Recall results on the NYT test set. Results for comparison systems are taken from the authors' respective papers or obtained on our data by running publicly released software. Table cells are filled with — whenever results are not available.

Table 3 presents results on the NYT dataset. Following the evaluation protocol in Durrett et al. (2016), we use limited-length ROUGE Recall, where predicted summaries are truncated to the length of the gold summaries. Again, we report the performance of the ORACLE upper bound and the LEAD-3 baseline. The second block in the table contains previously proposed extractive models as well as our own Transformer baseline. COMPRESS (Durrett et al., 2016) is an ILP-based model which combines compression and anaphoricity constraints. The third block includes abstractive models from the literature, and our Transformer baseline. BERT-based models are shown in the fourth block. Again, we observe that they outperform previously proposed approaches. On this dataset, abstractive BERT models generally perform better compared to BERTSUMEXT, almost approaching ORACLE performance.

Model                               R1     R2     RL
ORACLE                              29.79  8.81   22.66
LEAD                                16.30  1.60   11.95
Abstractive
PTGEN (See et al., 2017)            29.70  9.21   23.24
PTGEN+COV (See et al., 2017)        28.10  8.02   21.72
TCONVS2S (Narayan et al., 2018a)    31.89  11.54  25.75
TransformerABS                      29.41  9.77   23.01
BERT-based
BERTSUMABS                          38.76  16.33  31.15
BERTSUMEXTABS                       38.81  16.50  31.27

Table 4: ROUGE F1 results on the XSum test set. Results for comparison systems are taken from the authors' respective papers or obtained on our data by running publicly released software.

Table 4 summarizes our results on the XSum dataset. Recall that summaries in this dataset are highly abstractive (see Table 1), consisting of a single sentence conveying the gist of the document. Extractive models here perform poorly, as corroborated by the low performance in Table 4 of the LEAD baseline (which simply selects the leading sentence from the document) and of the ORACLE (which selects a single best sentence in each document). As a result, we do not report results for extractive models on this dataset.
lr̃E \ lr̃D    1      0.1    0.01   0.001
2e−2          50.69  9.33   10.13  19.26
2e−3          37.21  8.73   9.52   16.88

Table 5: Model perplexity (CNN/DailyMail; validation set) under different combinations of encoder and decoder learning rates.

[Figure 2 appears here: it plots, for ORACLE, BERTSUMEXT, and TransformerEXT summaries, the proportion of selected sentences (y-axis) against sentence position in the source document (x-axis, positions 1 to 18).]

Figure 2: Proportion of extracted sentences according to their position in the original document.

The second block in Table 4 presents the results of various abstractive models taken from Narayan et al. (2018a) and also includes our own abstractive Transformer baseline. In the third block we show the results of our BERT summarizers, which again are superior to all previously reported models (by a wide margin).

5.2 Model Analysis

Learning Rates  Recall that our abstractive model uses separate optimizers for the encoder and decoder. In Table 5 we examine whether the combination of different learning rates (lr̃E and lr̃D) is indeed beneficial. Specifically, we report model perplexity on the CNN/DailyMail validation set for varying encoder/decoder learning rates. We can see that the model performs best with lr̃E = 2e−3 and lr̃D = 0.1.

Position of Extracted Sentences  In addition to the evaluation based on ROUGE, we also analyzed in more detail the summaries produced by our model. For the extractive setting, we looked at the position (in the source document) of the sentences which were selected to appear in the summary. Figure 2 shows the proportion of selected summary sentences which appear in the source document at positions 1, 2, and so on. The analysis was conducted on the CNN/DailyMail dataset for Oracle summaries, and for those produced by BERTSUMEXT and TransformerEXT. We can see that Oracle summary sentences are fairly smoothly distributed across documents, while summaries created by TransformerEXT mostly concentrate on the first document sentences. BERTSUMEXT outputs are more similar to Oracle summaries, indicating that with the pretrained encoder the model relies less on shallow position features and learns deeper document representations.

Novel N-grams  We also analyzed the output of abstractive systems by calculating the proportion of novel n-grams that appear in the summaries but not in the source texts. The results are shown in Figure 3. In the CNN/DailyMail dataset, the proportion of novel n-grams in automatically generated summaries is much lower compared to reference summaries, but in XSum this gap is much smaller. We also observe that on CNN/DailyMail, BERTSUMEXTABS produces fewer novel n-grams than BERTSUMABS, which is not surprising. BERTSUMEXTABS is more biased towards selecting sentences from the source document since it is initially trained as an extractive model.

The supplementary material includes examples of system output and additional ablation studies.

5.3 Human Evaluation

In addition to automatic evaluation, we also evaluated system output by eliciting human judgments. We report experiments following a question-answering (QA) paradigm (Clarke and Lapata, 2010; Narayan et al., 2018b) which quantifies the degree to which summarization models retain key information from the document. Under this paradigm, a set of questions is created based on the gold summary under the assumption that it highlights the most important document content. Participants are then asked to answer these questions by reading system summaries alone, without access to the article. The more questions a system can answer, the better it is at summarizing the document as a whole.

Moreover, we also assessed the overall quality of the summaries produced by abstractive systems, which due to their ability to rewrite content may produce disfluent or ungrammatical output. Specifically, we followed the Best-Worst Scaling method (Kiritchenko and Mohammad, 2017), where participants were presented with the output of two systems (and the original document) and asked to decide which one was better according to the criteria of Informativeness, Fluency, and Succinctness.
[Figure 3 appears here: two bar charts showing the proportion of novel 1-grams, 2-grams, and 3-grams in summaries produced by BERTSUMEXTABS and BERTSUMABS and in the reference summaries, for (a) the CNN/DailyMail dataset and (b) the XSum dataset.]

Figure 3: Proportion of novel n-grams in model-generated summaries.

Both types of evaluation were conducted on the Amazon Mechanical Turk platform. For the CNN/DailyMail and NYT datasets we used the same documents (20 in total) and questions as previous work (Narayan et al., 2018b; Liu et al., 2019). For XSum, we randomly selected 20 documents (and their questions) from the release of Narayan et al. (2018a). We elicited 3 responses per HIT. With regard to QA evaluation, we adopted the scoring mechanism of Clarke and Lapata (2010): correct answers were marked with a score of one, partially correct answers with 0.5, and zero otherwise. For quality-based evaluation, the rating of each system was computed as the percentage of times it was chosen as better minus the percentage of times it was selected as worse. Ratings thus range from -1 (worst) to 1 (best).

Extractive     CNN/DM   NYT
LEAD           42.5†    36.2†
NEUSUM         42.2†    —
SUMO           41.7†    38.1†
Transformer    37.8†    32.5†
BERTSUM        58.9     41.9

Table 6: QA-based evaluation. Models with † are significantly different from BERTSUM (using a paired student t-test; p < 0.05). Table cells are filled with — whenever system output is not available.

               CNN/DM           NYT              XSum
Abstractive    QA      Rank     QA      Rank     QA      Rank
LEAD           42.5†   —        36.2†   —        9.20†   —
PTGEN          33.3†   -0.24†   30.5†   -0.27†   23.7†   -0.36†
BOTTOMUP       40.6†   -0.16†   —       —        —       —
TCONVS2S       —       —        —       —        52.1    -0.20†
GOLD           —       0.22†    —       0.33†    —       0.38†
BERTSUM        56.1    0.17     41.8    -0.07    57.5    0.19

Table 7: QA-based and ranking-based evaluation. Models with † are significantly different from BERTSUM (using a paired student t-test; p < 0.05). Table cells are filled with — whenever system output is not available. GOLD is not used in the QA setting, and LEAD is not used in the Rank evaluation.

Results for extractive and abstractive systems are shown in Tables 6 and 7, respectively. We compared the best-performing BERTSUM model in each setting (extractive or abstractive) against various state-of-the-art systems (whose output is publicly available), the LEAD baseline, and the GOLD standard as an upper bound. As shown in both tables, participants overwhelmingly prefer the output of our model over the comparison systems across datasets and evaluation paradigms. All differences between BERTSUM and the comparison models are statistically significant (p < 0.05), with the exception of TCONVS2S (see Table 7; XSum) in the QA evaluation setting.

6 Conclusions

In this paper, we showcased how pretrained BERT can be usefully applied in text summarization. We introduced a novel document-level encoder and proposed a general framework for both abstractive and extractive summarization. Experimental results across three datasets show that our model achieves state-of-the-art results across the board under automatic and human-based evaluation protocols. Although we mainly focused on document encoding for summarization, in the future we would like to take advantage of the capabilities of BERT for language generation.

Acknowledgments

This research is supported by a Google PhD Fellowship to the first author. We gratefully acknowledge the support of the European Research Council (Lapata, award number 681760, "Translating Multiple Modalities into Text"). We would also like to thank Shashi Narayan for providing us with the XSum dataset.
References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Jaime G. Carbonell and Jade Goldstein. 1998. The use of MMR and diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336, Melbourne, Australia.

Asli Celikyilmaz, Antoine Bosselut, Xiaodong He, and Yejin Choi. 2018. Deep communicating agents for abstractive summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1662–1675, New Orleans, Louisiana.

James Clarke and Mirella Lapata. 2010. Discourse constraints for document compression. Computational Linguistics, 36(3):411–441.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.

Yue Dong, Yikang Shen, Eric Crawford, Herke van Hoof, and Jackie Chi Kit Cheung. 2018. BanditSum: Extractive summarization as a contextual bandit. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3739–3748, Brussels, Belgium.

Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1998–2008, Berlin, Germany.

Sergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4052–4059, Minneapolis, Minnesota.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109, Brussels, Belgium.

Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.

Svetlana Kiritchenko and Saif Mohammad. 2017. Best-worst scaling more reliable than rating scales: A case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 465–470, Vancouver, Canada.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pages 67–72, Vancouver, Canada.

Wei Li, Xinyan Xiao, Yajuan Lyu, and Yuanzhuo Wang. 2018. Improving neural abstractive document summarization with explicit information selection modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1787–1796, Brussels, Belgium.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain.

Yang Liu, Ivan Titov, and Mirella Lapata. 2019. Single document summarization as tree induction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1745–1755, Minneapolis, Minnesota.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3075–3081, San Francisco, California.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018a. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018b. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759, New Orleans, Louisiana.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. CoRR, abs/1704.01444.

Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. 2019. Leveraging pre-trained checkpoints for sequence generation tasks. arXiv preprint arXiv:1907.12461.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia, 6(12).

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, Las Vegas, Nevada.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. 2018. Neural latent extractive document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 779–784, Brussels, Belgium.

Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5059–5069, Florence, Italy.

Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural document summarization by jointly learning to score and select sentences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–663, Melbourne, Australia.
