
BANDITSUM: Extractive Summarization as a Contextual Bandit

Yue Dong∗ (Mila/McGill University, yue.dong2@mail.mcgill.ca)
Yikang Shen∗ (Mila/University of Montréal, yi-kang.shen@umontreal.ca)
Eric Crawford (Mila/McGill University, eric.crawford@mail.mcgill.ca)
Herke van Hoof (University of Amsterdam, h.c.vanhoof@uva.nl)
Jackie C.K. Cheung (Mila/McGill University, jcheung@cs.mcgill.ca)

arXiv:1809.09672v3 [cs.CL] 7 May 2019

Abstract

In this work, we propose a novel method for training neural networks to perform single-document extractive summarization without heuristically-generated extractive labels. We call our approach BANDITSUM as it treats extractive summarization as a contextual bandit (CB) problem, where the model receives a document to summarize (the context), and chooses a sequence of sentences to include in the summary (the action). A policy gradient reinforcement learning algorithm is used to train the model to select sequences of sentences that maximize ROUGE score. We perform a series of experiments demonstrating that BANDITSUM is able to achieve ROUGE scores that are better than or comparable to the state-of-the-art for extractive summarization, and converges using significantly fewer update steps than competing approaches. In addition, we show empirically that BANDITSUM performs significantly better than competing approaches when good summary sentences appear late in the source document.¹

1 Introduction

Single-document summarization methods can be divided into two categories: extractive and abstractive. Extractive summarization systems form summaries by selecting and copying text snippets from the document, while abstractive methods aim to generate concise summaries with paraphrasing.

This work is primarily concerned with extractive summarization. Though abstractive summarization methods have made strides in recent years, extractive techniques are still very attractive as they are simpler, faster, and more reliably yield semantically and grammatically correct sentences.

Many extractive summarizers work by selecting sentences from the input document (Luhn, 1958; Mihalcea and Tarau, 2004; Wong et al., 2008; Kågebäck et al., 2014; Yin and Pei, 2015; Cao et al., 2015; Yasunaga et al., 2017). Furthermore, a growing trend is to frame this sentence selection process as a sequential binary labeling problem, where binary inclusion/exclusion labels are chosen for sentences one at a time, starting from the beginning of the document, and decisions about later sentences may be conditioned on decisions about earlier sentences. Recurrent neural networks may be trained with stochastic gradient ascent to maximize the likelihood of a set of ground-truth binary label sequences (Cheng and Lapata, 2016; Nallapati et al., 2017). However, this approach has two well-recognized disadvantages. First, it suffers from exposure bias, a form of mismatch between training and testing data distributions which can hurt performance (Ranzato et al., 2015; Bahdanau et al., 2017; Paulus et al., 2018). Second, extractive labels must be generated by a heuristic, as summarization datasets do not generally include ground-truth extractive labels; the ultimate performance of models trained on such labels is thus fundamentally limited by the quality of the heuristic.

∗ Equal contribution.
¹ Our code can be found at https://github.com/yuedongP/BanditSum
An alternative to maximum likelihood training is to use reinforcement learning to train the model to directly maximize a measure of summary quality, such as the ROUGE score between the generated summary and a ground-truth abstractive summary (Wu and Hu, 2018). This approach has become popular because it avoids exposure bias, and directly optimizes a measure of summary quality. However, it also has a number of downsides. For one, the search space is quite large: for a document of length T, there are 2^T possible extractive summaries. This makes the exploration problem faced by the reinforcement learning algorithm during training very difficult. Another issue is that due to the sequential nature of selection, the model is inherently biased in favor of selecting earlier sentences over later ones, a phenomenon which we demonstrate empirically in Section 7.

The first issue can be resolved to a degree using either a cumbersome maximum likelihood-based pre-training step (using heuristically-generated labels) (Wu and Hu, 2018), or placing a hard upper limit on the number of sentences selected. The second issue is more problematic, as it is inherent to the sequential binary labeling setting.

In the current work, we introduce BANDITSUM, a novel method for training neural network-based extractive summarizers with reinforcement learning. This method does away with the sequential binary labeling setting, instead formulating extractive summarization as a contextual bandit. This move greatly reduces the size of the space that must be explored, removes the need to perform supervised pre-training, and prevents systematically privileging earlier sentences over later ones. Although the strong performance of Lead-3 indicates that good sentences often occur early in the source article, we show in Sections 6 and 7 that the contextual bandit setting greatly improves model performance when good sentences occur late without sacrificing performance when good sentences occur early.

Under this reformulation, BANDITSUM takes the document as input and outputs an affinity for each of the sentences therein. An affinity is a real number in [0, 1] which quantifies the model's propensity for including a sentence in the summary. These affinities are then used in a process of repeated sampling-without-replacement which does not privilege earlier sentences over later ones. BANDITSUM is free to process the document as a whole before yielding affinities, which permits affinities for different sentences in the document to depend on one another in arbitrary ways. In our technical section, we show how to apply policy gradient reinforcement learning methods to this setting.

The contributions of our work are as follows:

• We propose a theoretically grounded method, based on the contextual bandit formalism, for training neural network-based extractive summarizers with reinforcement learning. Based on this training method, we propose the BANDITSUM system for extractive summarization.

• We perform experiments demonstrating that BANDITSUM obtains state-of-the-art performance on a number of datasets and requires significantly fewer update steps than competing approaches.

• We perform human evaluations showing that in the eyes of human judges, summaries created by BANDITSUM are less redundant and of higher overall quality than summaries created by competing approaches.

• We provide evidence, in the form of experiments in which models are trained on subsets of the data, that the improved performance of BANDITSUM over competitors stems in part from better handling of summary-worthy sentences that come near the end of the document (see Section 7).
2 Related Work

Extractive summarization has been widely studied in the past. Recently, neural network-based methods have been gaining popularity over classical methods (Luhn, 1958; Gong and Liu, 2001; Conroy and O'leary, 2001; Mihalcea and Tarau, 2004; Wong et al., 2008), as they have demonstrated stronger performance on large corpora. Central to the neural network-based models is the encoder-decoder structure. These models typically use either a convolutional neural network (Kalchbrenner et al., 2014; Kim, 2014; Yin and Pei, 2015; Cao et al., 2015), a recurrent neural network (Chung et al., 2014; Cheng and Lapata, 2016; Nallapati et al., 2017), or a combination of the two (Narayan et al., 2018; Wu and Hu, 2018) to create sentence and document representations, using word embeddings (Mikolov et al., 2013; Pennington et al., 2014) to represent words at the input level. These vectors are then fed into a decoder network to generate the output summary.

The use of reinforcement learning (RL) in extractive summarization was first explored by Ryang and Abekawa (2012), who proposed to use the TD(λ) algorithm to learn a value function for sentence selection. Rioux et al. (2014) improved this framework by replacing the learning agent with another TD(λ) algorithm. However, the performance of their methods was limited by the use of shallow function approximators, which required performing a fresh round of reinforcement learning for every new document to be summarized. The more recent work of Paulus et al. (2018) and Wu and Hu (2018) use reinforcement learning in a sequential labeling setting to train abstractive and extractive summarizers, respectively, while Chen and Bansal (2018) combines both approaches, applying abstractive summarization to a set of sentences extracted by a pointer network (Vinyals et al., 2015) trained via REINFORCE. However, pre-training with a maximum likelihood objective is required in all of these models.

The two works most similar to ours are Yao et al. (2018) and Narayan et al. (2018). Yao et al. (2018) recently proposed an extractive summarization approach based on deep Q-learning, a type of reinforcement learning. However, their approach is extremely computationally intensive (a minimum of 10 days before convergence), and was unable to achieve ROUGE scores better than the best maximum likelihood-based approach. Narayan et al. (2018) uses a cascade of filters in order to arrive at a set of candidate extractive summaries, which we can regard as an approximation of the true action space. They then use an approximation of a policy gradient method to train their neural network to select summaries from this approximated action space. In contrast, BANDITSUM samples directly from the true action space, and uses exact policy gradient parameter updates.

3 Extractive Summarization as a Contextual Bandit

Our approach formulates extractive summarization as a contextual bandit which we then train an agent to solve using policy gradient reinforcement learning. A bandit is a decision-making formalization in which an agent repeatedly chooses one of several actions, and receives a reward based on this choice. The agent's goal is to quickly learn which action yields the most favorable distribution over rewards, and choose that action as often as possible. In a contextual bandit, at each trial, a context is sampled and shown to the agent, after which the agent selects an action and receives a reward; importantly, the rewards yielded by the actions may depend on the sampled context. The agent must quickly learn which actions are favorable in which contexts. Contextual bandits are a subset of Markov Decision Processes in which every episode has length one.

Extractive summarization may be regarded as a contextual bandit as follows. Each document is a context, and each ordered subset of a document's sentences is a different action. Formally, assume that each context is a document d consisting of sentences s = (s_1, . . . , s_{N_d}), and that each action is a length-M sequence of unique sentence indices i = (i_1, . . . , i_M), where i_t ∈ {1, . . . , N_d}, i_t ≠ i_{t′} for t ≠ t′, and M is an integer hyper-parameter. For each i, the extractive summary induced by i is given by (s_{i_1}, . . . , s_{i_M}). An action i taken in context d is given a reward R(i, a), where a is the gold-standard abstractive summary that is paired with document d, and R is a scalar reward function quantifying the degree of match between a and the summary induced by i.

A policy for extractive summarization is a neural network pθ(·|d), parameterized by a vector θ, which, for each input document d, yields a probability distribution over index sequences. Our goal is to find parameters θ which cause pθ(·|d) to assign high probability to index sequences that induce extractive summaries that a human reader would judge to be of high quality. We achieve this by maximizing the following objective function with respect to the parameters θ:

J(θ) = E[R(i, a)]    (1)

where the expectation is taken over documents d paired with gold-standard abstractive summaries a, as well as over index sequences i generated according to pθ(·|d).
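To make this formulation concrete, the following minimal sketch (our own illustration, not the authors' released code) shows how an index sequence i induces an extractive summary and receives a scalar reward against the abstractive reference a. The unigram-F1 scorer here is only a stand-in for the ROUGE-based reward defined later in Section 3.4.

```python
# Sketch: a document is a list of sentences (the context), an action i is a
# length-M tuple of distinct sentence indices, and the reward compares the
# induced extractive summary against the abstractive reference summary a.
from collections import Counter

def induced_summary(doc_sentences, action):
    """Map an index sequence i = (i_1, ..., i_M) to the summary (s_i1, ..., s_iM)."""
    return [doc_sentences[i] for i in action]

def unigram_f1(candidate, reference):
    """Toy stand-in for ROUGE: F1 over unigram counts."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def reward(doc_sentences, action, abstractive_reference):
    """R(i, a): scalar match between the induced summary and the gold abstract a."""
    summary = " ".join(induced_summary(doc_sentences, action))
    return unigram_f1(summary, abstractive_reference)

doc = ["The cat sat on the mat.", "It was a sunny day.", "The dog barked loudly."]
print(reward(doc, (0, 2), "A cat sat on a mat while a dog barked."))
```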

3.1 Policy Gradient Reinforcement Learning

Ideally, we would like to maximize (1) using gradient ascent. However, the required gradient cannot be obtained using usual techniques (e.g. simple backpropagation) because i must be discretely sampled in order to compute R(i, a).

Fortunately, we can use the likelihood ratio gradient estimator from reinforcement learning and stochastic optimization (Williams, 1992; Sutton et al., 2000), which tells us that the gradient of this function can be computed as:

∇θ J(θ) = E[∇θ log pθ(i|d) R(i, a)]    (2)

where the expectation is taken over the same variables as (1).

Since we typically do not know the exact document distribution and thus cannot evaluate the expected value in (2), we instead estimate it by sampling. We found that we obtained the best performance when, for each update, we first sample one document/summary pair (d, a), then sample B index sequences i_1, . . . , i_B from pθ(·|d), and finally take the empirical average:

∇θ J(θ) ≈ (1/B) Σ_{b=1}^{B} ∇θ log pθ(i_b|d) R(i_b, a)    (3)

This overall learning algorithm can be regarded as an instance of the REINFORCE policy gradient algorithm (Williams, 1992).

3.2 Structure of pθ(·|d)

There are many possible choices for the structure of pθ(·|d); we opt for one that avoids privileging early sentences over later ones. We first decompose pθ(·|d) into two parts: πθ, a deterministic function which contains all the network's parameters, and µ, a probability distribution parameterized by the output of πθ. Concretely:

pθ(·|d) = µ(·|πθ(d))    (4)

Given an input document d, πθ outputs a real-valued vector of sentence affinities whose length is equal to the number of sentences in the document (i.e. πθ(d) ∈ R^{N_d}) and whose elements fall in the range [0, 1]. The t-th entry π(d)_t may be roughly interpreted as the network's propensity to include sentence s_t in the summary of d.

Given sentence affinities πθ(d), µ implements a process of repeated sampling-without-replacement. This proceeds by repeatedly normalizing the set of affinities corresponding to sentences that have not yet been selected, thereby obtaining a probability distribution over unselected sentences, and sampling from that distribution to obtain a new sentence to include. This normalize-and-sample step is repeated M times, yielding M unique sentences to include in the summary.

At each step of sampling-without-replacement, we also include a small probability ε of sampling uniformly from all remaining sentences. This is used to achieve adequate exploration during training, and is similar to the ε-greedy technique from reinforcement learning.

Under this sampling scheme, we have the following expression for pθ(i|d):

pθ(i|d) = Π_{j=1}^{M} [ ε/(N_d − j + 1) + (1 − ε) π(d)_{i_j} / (z(d) − Σ_{k=1}^{j−1} π(d)_{i_k}) ]    (5)

where z(d) = Σ_t π(d)_t. For index sequences that have length different from M, or that contain duplicate indices, we have pθ(i|d) = 0.

Using this expression, it is straightforward to use automatic differentiation software to compute ∇θ log pθ(i|d), which is required for the gradient estimate in (3).
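The distribution µ and the probability in (5) translate directly into a few lines of code. The sketch below is our PyTorch paraphrase of the normalize-and-sample procedure, not the released implementation; the function name and the default value of ε are assumptions (the paper does not report the ε it uses).

```python
import torch

def sample_summary_indices(affinities: torch.Tensor, M: int = 3, eps: float = 0.1):
    """Sample M distinct sentence indices from mu(.|pi_theta(d)) and return them
    together with log p_theta(i|d) as given by Eq. (5).  `affinities` has shape
    (N_d,) with strictly positive entries (the output of a sigmoid)."""
    n = affinities.shape[0]
    remaining = torch.ones(n, dtype=torch.bool, device=affinities.device)
    indices, log_prob = [], affinities.new_zeros(())
    for j in range(M):
        masked = torch.where(remaining, affinities, affinities.new_zeros(()))
        # j-th factor of Eq. (5): uniform exploration over the N_d - j remaining
        # sentences mixed with their renormalized affinities.
        probs = eps * remaining.float() / (n - j) + (1 - eps) * masked / masked.sum()
        idx = torch.multinomial(probs, 1).item()
        log_prob = log_prob + torch.log(probs[idx])
        indices.append(idx)
        remaining = remaining.clone()
        remaining[idx] = False          # sampled sentence is excluded from later steps
    return indices, log_prob
```

Because `log_prob` is built from differentiable operations on the affinities, calling `backward()` on it yields exactly the ∇θ log pθ(i|d) term needed in (3).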
3.3 Baseline for Variance Reduction

Our sample-based gradient estimate can have high variance, which can slow the learning. One potential cause of this high variance can be seen by inspecting (3), and noting that it basically acts to change the probability of a sampled index sequence to an extent determined by the reward R(i, a). However, since ROUGE scores are always positive, the probability of every sampled index sequence is increased, whereas intuitively, we would prefer to decrease the probability of sequences that receive a comparatively low reward, even if it is positive. This can be remedied by the introduction of a so-called baseline which is subtracted from all rewards.

Using a baseline r, our sample-based estimate of ∇θ J(θ) becomes:

(1/B) Σ_{b=1}^{B} ∇θ log pθ(i_b|d) (R(i_b, a) − r)    (6)

It can be shown that the introduction of r does not bias the gradient estimator and can significantly reduce its variance if chosen appropriately (Sutton et al., 2000).

There are several possibilities for the baseline, including the long-term average reward and the average reward across different samples for one document-summary pair. We choose an approach known as self-critical reinforcement learning, in which the test-time performance of the current model is used as the baseline (Ranzato et al., 2015; Rennie et al., 2017; Paulus et al., 2018). More concretely, after sampling the document-summary pair (d, a), we greedily generate an index sequence using the current parameters θ:

i_greedy = argmax_i pθ(i|d)    (7)

and calculate the baseline for the current update as r = R(i_greedy, a). This baseline has the intuitively satisfying property of only increasing the probability of a sampled label sequence when the summary it induces is better than what would be obtained by greedy decoding.

3.4 Reward Function

A final consideration is a concrete choice for the reward function R(i, a). Throughout this work we use:

R(i, a) = (1/3) (ROUGE-1_f(i, a) + ROUGE-2_f(i, a) + ROUGE-L_f(i, a))    (8)

The above reward function optimizes the average of all the ROUGE variants (Lin, 2004) while balancing precision and recall.

4 Model

In this section, we discuss the concrete instantiation of the neural network πθ that we use in our experiments. We break πθ up into two components: a document encoder fθ1, which outputs a sequence of sentence feature vectors (h_1, . . . , h_{N_d}), and a decoder gθ2, which yields sentence affinities:

h_1, . . . , h_{N_d} = fθ1(d)    (9)
πθ(d) = gθ2(h_1, . . . , h_{N_d})    (10)

Encoder. Features for each sentence in isolation are first obtained by applying a word-level Bidirectional Recurrent Neural Network (BiRNN) to the embeddings for the words in the sentence, and averaging the hidden states over words. A separate sentence-level BiRNN is then used to obtain a representation h_i for each sentence in the context of the document.

Decoder. A multi-layer perceptron is used to map from the representation h_t of each sentence through a final sigmoid unit to yield sentence affinities πθ(d).

The use of a bidirectional recurrent network in the encoder is crucial, as it allows the network to process the document as a whole, yielding representations for each sentence that take all other sentences into account. This procedure is necessary to deal with some aspects of summary quality, such as redundancy (avoiding the inclusion of multiple sentences with similar meaning), which requires the affinities for different sentences to depend on one another. For example, to avoid redundancy, if the affinity for some sentence is high, then sentences which express similar meaning should have low affinities.
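A compact sketch of the encoder-decoder network πθ described above, written in PyTorch, is given below. The layer sizes anticipate the settings reported in Section 5.4; the class name, batching convention, and the treatment of a document as a list of word-id tensors are our own assumptions, not details taken from the released code.

```python
import torch
import torch.nn as nn

class AffinityNetwork(nn.Module):
    """Sketch of pi_theta: a word-level BiLSTM averaged over words, a
    sentence-level BiLSTM over the document, and an MLP decoder with a final
    sigmoid, following the description in Section 4."""

    def __init__(self, vocab_size: int, emb_dim: int = 100, hidden: int = 200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_rnn = nn.LSTM(emb_dim, hidden, num_layers=1,
                                bidirectional=True, batch_first=True)
        self.sent_rnn = nn.LSTM(2 * hidden, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(2 * hidden, 100), nn.ReLU(),
                                     nn.Linear(100, 1), nn.Sigmoid())

    def forward(self, document):
        # `document` is a list of LongTensors, one per sentence, holding word ids.
        sent_feats = []
        for word_ids in document:
            states, _ = self.word_rnn(self.embed(word_ids).unsqueeze(0))
            sent_feats.append(states.mean(dim=1))      # average hidden states over words
        h = torch.cat(sent_feats, dim=0).unsqueeze(0)  # shape (1, N_d, 2*hidden)
        h, _ = self.sent_rnn(h)                        # contextual sentence representations
        return self.decoder(h).squeeze(0).squeeze(-1)  # affinities in [0, 1], shape (N_d,)
```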
5 Experiments

In this section, we discuss the setup of our experiments. We first discuss the corpora that we used and our evaluation methodology. We then discuss the baseline methods against which we compared, and conclude with a detailed overview of the settings of the model parameters.

5.1 Corpora

Three datasets are used for our experiments: the CNN, the Daily Mail, and the combined CNN/Daily Mail corpora (Hermann et al., 2015; Nallapati et al., 2016). We use the standard split of Hermann et al. (2015) for training, validating, and testing, and the same setting without anonymization on the three corpora as See et al. (2017). The Daily Mail corpus has 196,557 training documents, 12,147 validation documents and 10,397 test documents, while the CNN corpus has 90,266/1,220/1,093 documents, respectively.

5.2 Evaluation

The models are evaluated based on ROUGE (Lin, 2004). We obtain our ROUGE scores using the standard pyrouge package² for the test set evaluation and a faster Python implementation of the ROUGE metric³ for training and evaluating on the validation set. We report the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L, which measure the unigram, bigram, and longest common subsequence overlap with the reference summaries.

² https://pypi.python.org/pypi/pyrouge/0.1.3
³ We use the modified version based on https://github.com/pltrdy/rouge

5.3 Baselines

We compare BANDITSUM with other extractive methods including: the Lead-3 model, SummaRuNNer (Nallapati et al., 2017), Refresh (Narayan et al., 2018), RNES (Wu and Hu, 2018), DQN (Yao et al., 2018), and NN-SE (Cheng and Lapata, 2016). The Lead-3 model simply produces the leading three sentences of the document as the summary.

5.4 Model Settings

We use 100-dimensional GloVe embeddings (Pennington et al., 2014) as our embedding initialization. We do not limit the sentence length, nor the maximum number of sentences per document. We use a one-layer BiLSTM for the word-level RNN, and a two-layer BiLSTM for the sentence-level RNN. The hidden state dimension is 200 for each direction on all LSTMs. For the decoder, we use a feed-forward network with one hidden layer of dimension 100.

During training, we use Adam (Kingma and Ba, 2015) as the optimizer with a learning rate of 5e−5, beta parameters (0, 0.999), and a weight decay of 1e−6, to maximize the objective function defined in equation (1). We employ gradient clipping of 1 to regularize our model. At each iteration, we sample B = 20 times to estimate the gradient defined in equation (3). For our system, the reported performance is obtained within two epochs of training.

At test time, we pick sentences sorted by the predicted probabilities until the length limit is reached. The full-length ROUGE F1 score is used as the evaluation metric. For M, the number of sentences selected per summary, we use a value of 3, based on our validation results as well as on the settings described in Nallapati et al. (2017).
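Putting Sections 3.3 and 5.4 together, one training update can be sketched as follows. This assumes the `sample_summary_indices` sketch from Section 3.2 and the `AffinityNetwork` sketch from Section 4; realizing the greedy sequence in (7) as the top-M affinities, the value of ε, and norm-based clipping are our assumptions, while the Adam settings, B = 20, M = 3, and clipping to 1 follow the text above.

```python
import torch

B, M, EPS = 20, 3, 0.1   # EPS is our own choice; the paper does not report it
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5,
                             betas=(0.0, 0.999), weight_decay=1e-6)

def train_step(document, reward_fn):
    """One self-critical policy-gradient update (Eqs. 6 and 7) on one document;
    `reward_fn(indices)` returns R(i, a) for that document."""
    affinities = model(document)
    # Baseline r = R(i_greedy, a), with greedy decoding taken as the top-M affinities.
    greedy = torch.topk(affinities.detach(), M).indices.tolist()
    r = reward_fn(greedy)

    terms = []
    for _ in range(B):
        indices, log_prob = sample_summary_indices(affinities, M, EPS)
        terms.append(-log_prob * (reward_fn(indices) - r))   # minimizing this ascends (6)
    loss = torch.stack(terms).mean()

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```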
6 Experiment Results

In this section, we present quantitative results from the ROUGE evaluation and qualitative results based on human evaluation. In addition, we demonstrate the stability of our RL model by comparing the validation curve of BANDITSUM with that of SummaRuNNer (Nallapati et al., 2017) trained with a maximum likelihood objective.

Model                        ROUGE-1  ROUGE-2  ROUGE-L
Lead (Narayan et al., 2018)  39.6     17.7     36.2
Lead-3 (ours)                40.0     17.5     36.2
SummaRuNNer                  39.6     16.2     35.3
DQN                          39.4     16.1     35.6
Refresh                      40.0     18.2     36.6
RNES w/o coherence           41.3     18.9     37.6
BANDITSUM                    41.5     18.7     37.6

Table 1: Performance comparison of different extractive summarization models on the combined CNN/Daily Mail test set using full-length F1.

Model        CNN (1 / 2 / L)        Daily Mail (1 / 2 / L)
Lead-3       28.8 / 11.0 / 25.5     41.2 / 18.2 / 37.3
NN-SE        28.4 / 10.0 / 25.0     36.2 / 15.2 / 32.9
Refresh      30.4 / 11.7 / 26.9     41.0 / 18.8 / 37.7
BANDITSUM    30.7 / 11.6 / 27.4     42.1 / 18.9 / 38.3

Table 2: The full-length ROUGE F1 scores of various extractive models on the CNN and the Daily Mail test sets separately.

6.1 ROUGE Evaluation

We present the results of comparing BANDITSUM to several baseline algorithms⁴ on the CNN/Daily Mail corpus in Tables 1 and 2. Compared to other extractive summarization systems, BANDITSUM achieves performance that is significantly better than two RL-based approaches, Refresh (Narayan et al., 2018) and DQN (Yao et al., 2018), as well as SummaRuNNer, the state-of-the-art maximum likelihood-based extractive summarizer (Nallapati et al., 2017). BANDITSUM performs a little better than RNES (Wu and Hu, 2018) in terms of ROUGE-1 and slightly worse in terms of ROUGE-2. However, RNES requires pre-training with the maximum likelihood objective on heuristically-generated extractive labels; in contrast, BANDITSUM is very light-weight and converges significantly faster. We discuss the advantage of framing extractive summarization as a contextual bandit (BANDITSUM) over the sequential binary labeling setting (RNES) in Section 7.

⁴ Due to different pre-processing methods and different numbers of selected sentences, several papers report different Lead scores (Narayan et al., 2018; See et al., 2017). We use the test set provided by Narayan et al. (2018). Since their Lead score is a combination of Lead-3 for CNN and Lead-4 for Daily Mail, we recompute the Lead-3 scores for both CNN and Daily Mail with the preprocessing steps used in See et al. (2017). Additionally, our results are not directly comparable to results based on the anonymized dataset used by Nallapati et al. (2017).

We also noticed that different choices for the policy gradient baseline (see Section 3.3) in BANDITSUM affect learning speed, but do not significantly affect asymptotic performance. Models trained with an average reward baseline learned most quickly, while models trained with three different baselines (greedy, average reward in a batch, average global reward) all perform roughly the same after training for one epoch. Models trained without a baseline were found to underperform other baseline choices by about 2 points of ROUGE score on average.

6.2 Human Evaluation

We also conduct a qualitative evaluation to understand the effects of the improvements introduced in BANDITSUM on human judgments of the generated summaries. To assess the effect of training with RL rather than maximum likelihood, in the first set of human evaluations we compare BANDITSUM with the state-of-the-art maximum likelihood-based model SummaRuNNer. To evaluate the importance of using an exact, rather than approximate, policy gradient to optimize ROUGE scores, we perform another human evaluation comparing BANDITSUM and Refresh, an RL-based method that uses an approximation of the policy gradient.

We follow a human evaluation protocol similar to the one used in Wu and Hu (2018). Given a set of N documents, we ask K volunteers to evaluate the summaries extracted by both systems. For each document, a reference summary and a pair of randomly ordered extractive summaries (one generated by each of the two models) are presented to the volunteers. They are asked to compare and rank the extracted summaries along three dimensions: overall, coverage, and non-redundancy.

To compare with SummaRuNNer, we randomly sample 57 documents from the test set of Daily Mail and ask 5 volunteers to evaluate the extracted summaries. While comparing with Refresh, we use the 20 documents (10 CNN and 10 Daily Mail) provided by Narayan et al. (2018) with 4 volunteers. Tables 3 and 4 show the results of human evaluation in these two settings. BANDITSUM is shown to be better than Refresh and SummaRuNNer in terms of overall quality and non-redundancy. These results indicate that the use of the true policy gradient, rather than the approximation used by Refresh, improves overall quality. It is interesting to observe that, even though BANDITSUM does not have an explicit redundancy avoidance mechanism, it actually outperforms the other systems on non-redundancy.

Model         Overall  Coverage  Non-Redundancy
SummaRuNNer   1.67     1.46      1.70
BANDITSUM     1.33     1.54      1.30

Table 3: Average rank of human evaluation based on 5 participants who expressed 57 pairwise preferences between the summaries generated by SummaRuNNer and BANDITSUM. The model with the lower score is better.

Model       Overall  Coverage  Non-Redundancy
Refresh     1.53     1.34      1.55
BANDITSUM   1.50     1.58      1.30

Table 4: Average rank of manual evaluation with 4 participants who expressed 20 pairwise preferences between the summaries generated by Refresh and our system. The model with the lower score is better.

6.3 Learning Curve

Reinforcement learning methods are known for sometimes being unstable during training. However, this seems to be less of a problem for BANDITSUM, perhaps because it is formulated as a contextual bandit rather than a sequential labeling problem. We show this by comparing the validation curves generated by BANDITSUM and the state-of-the-art maximum likelihood-based model, SummaRuNNer (Nallapati et al., 2017) (Figure 1).

From Figure 1, we observe that BANDITSUM converges significantly more quickly to good results than SummaRuNNer. Moreover, there is less variance in the performance of BANDITSUM. One possible reason is that extractive summarization does not have well-defined supervised labels. There exists a mismatch between the provided labels and human-generated abstractive summaries. Hence, the gradient, computed from the maximum likelihood loss function, is not optimizing the evaluation metric of interest. Another important message is that both models are still far from the estimated upper bound⁵, which shows that there is still significant room for improvement.

⁵ The supervised labels for the upper bound estimation are obtained using the heuristic described in Nallapati et al. (2017).
[Figure 1: Average of ROUGE-1, -2, and -L F1 scores on the Daily Mail validation set within one epoch of training on the Daily Mail training set. The x-axis (multiplied by 2,000) indicates the number of data examples the algorithms have seen. The supervised labels in SummaRuNNer are used to estimate the upper bound.]

6.4 Run Time

On the CNN/Daily Mail dataset, our model's time per epoch is about 25.5 hours on a TITAN Xp. We trained the model for 3 epochs, which took about 76 hours in total. For comparison, DQN took about 10 days to train on a GTX 1080 (Yao et al., 2018). Refresh took about 12 hours on a single GPU to train (Narayan et al., 2018). Note that this figure does not take into account the significant time required by Refresh for pre-computing ROUGE scores.

7 Discussion: Contextual Bandit Setting vs. Sequential Full RL Labeling

We conjecture that the contextual bandit (CB) setting is a more suitable framework for modeling extractive summarization than the sequential binary labeling setting, especially in cases where good summary sentences appear late in the document. The intuition behind this is that models based on the sequential labeling setting are affected by the order of the decisions, which biases them towards selecting sentences that appear earlier in the document. By contrast, our CB-based RL model has more flexibility and freedom to explore the search space, as it samples the sentences without replacement based on the affinity scores. Note that although we do not explicitly make the selection decisions in a sequential fashion, the sequential information about dependencies between sentences is implicitly embedded in the affinity scores, which are produced by bi-directional RNNs.

We provide empirical evidence for this conjecture by comparing BANDITSUM to the sequential RL model proposed by Wu and Hu (2018) (Figure 2) on two subsets of the data: one with good summary sentences appearing early in the article, and the other containing articles where good summary sentences appear late. Specifically, we construct two evaluation datasets by selecting the first 50 documents (D_early, i.e., the best summary occurs early) and the last 50 documents (D_late, i.e., the best summary occurs late) from a sample of 1,000 documents ordered by the average extractive label index idx. Given an article with n sentences indexed from 1, . . . , n and a greedy extractive label set with three sentences (i, j, k)⁶, the average index for the extractive label is computed as idx = (i + j + k)/(3n).

⁶ For each document, a length-3 extractive summary with near-optimal ROUGE score is selected following the heuristic proposed by Nallapati et al. (2017).
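The construction of D_early and D_late amounts to a sort by this average label index; a small sketch (our own illustration, with a hypothetical document representation) is shown below.

```python
def average_label_index(label_indices, n_sentences):
    """idx = (i + j + k) / (3n) for a greedy 3-sentence extractive label set,
    with sentences indexed from 1."""
    i, j, k = label_indices
    return (i + j + k) / (3 * n_sentences)

def split_early_late(documents, n_subset=50):
    """Sort a sample of documents by their average label index and take the
    first and last `n_subset` as D_early and D_late.  Each element of
    `documents` is assumed to be a (doc_id, label_indices, n_sentences) triple."""
    ranked = sorted(documents, key=lambda d: average_label_index(d[1], d[2]))
    return ranked[:n_subset], ranked[-n_subset:]
```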

[Figure 2: Model comparisons of the average of ROUGE-1, -2, and -L F1 scores (f) on D_early and D_late. For each model, the results were obtained by averaging f across ten trials with 100 epochs in each trial. D_early and D_late consist of 50 articles each, such that the good summary sentences appear early and late in the article, respectively. We observe a significant advantage of BANDITSUM compared to RNES and RNES3 (based on the sequential binary labeling setting) on D_late.]

Given these two subsets of the data, three different models (BANDITSUM, RNES, and RNES3) are trained and evaluated on each of the two datasets without extractive labels.
Since the original sequential RL model (RNES) is unstable without supervised pre-training, we propose the RNES3 model, which is limited to selecting no more than three sentences. Starting with random initializations and without supervised pre-training, we train each model ten times for 100 epochs and plot the learning curve of the average ROUGE F1 score of the trained models in Figure 2. We can clearly see that BANDITSUM finds a better solution more quickly than RNES and RNES3 on both datasets. Moreover, it displays a significant speed-up in exploration and finds the best solution when good summary sentences appear later in the document (D_late).

8 Conclusion

In this work, we presented a contextual bandit learning framework, BANDITSUM, for extractive summarization, based on neural networks and reinforcement learning algorithms. BANDITSUM does not require sentence-level extractive labels and optimizes ROUGE scores between summaries generated by the model and abstractive reference summaries. Empirical results show that our method performs better than or comparably to state-of-the-art extractive summarization models which must be pre-trained on extractive labels, and converges using significantly fewer update steps than competing approaches. In future work, we will explore the direction of adding an extra coherence reward (Wu and Hu, 2018) to improve the quality of extracted summaries in terms of sentence discourse relations.

Acknowledgements

The research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC). The authors would like to thank Compute Canada for providing the computational resources.

References

Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. In International Conference on Learning Representations (ICLR).

Ziqiang Cao, Furu Wei, Sujian Li, Wenjie Li, Ming Zhou, and Houfeng Wang. 2015. Learning summary prior representation for extractive summarization. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL).

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL).

Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning.

John M Conroy and Dianne P O'leary. 2001. Text summarization via hidden markov models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 406–407. ACM.

Yihong Gong and Xin Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 19–25. ACM.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

Mikael Kågebäck, Olof Mogren, Nina Tahmasebi, and Devdatt Dubhashi. 2014. Extractive summarization using continuous vector space models. In Workshop on Continuous Vector Space Models and their Compositionality (CVSC) at EACL, pages 31–39.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL).

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference for Learning Representations (ICLR).

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.

Hans Peter Luhn. 1958. The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In International Conference on Learning Representations (ICLR) Workshop.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL).

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL).

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations (ICLR).

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732.

Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7008–7024.

Cody Rioux, Sadid A Hasan, and Yllias Chali. 2014. Fear the reaper: A system for automatic multi-document summarization with reinforcement learning. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 681–690.

Seonggi Ryang and Takeshi Abekawa. 2012. Framework of automatic text summarization using reinforcement learning. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP), pages 256–265.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1073–1083.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), pages 1057–1063.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems (NIPS), pages 2692–2700.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer.

Kam-Fai Wong, Mingli Wu, and Wenjie Li. 2008. Extractive summarization using supervised and semi-supervised learning. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING), pages 985–992.

Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI).

Kaichun Yao, Libo Zhang, Tiejian Luo, and Yanjun Wu. 2018. Deep reinforcement learning for extractive document summarization. Neurocomputing, 284:52–62.

Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu, Ayush Pareek, Krishnan Srinivasan, and Dragomir Radev. 2017. Graph-based neural multi-document summarization. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL), pages 452–462.

Wenpeng Yin and Yulong Pei. 2015. Optimizing sentence modeling and selection for document summarization. In Proceedings of the 24th International Joint Conference on Artificial Intelligence (IJCAI), pages 1383–1389.
BANDITSUM: Extractive Summarization as a Contextual Bandit
Supplemental Materials
A Samples

Source Document
(CNN)A top al Qaeda in the Arabian Peninsula leader – who a few years ago was in a U.S. detention
facility – was among five killed in an airstrike in Yemen, the terror group said, showing the organiza-
tion is vulnerable even as Yemen appears close to civil war. Ibrahim al-Rubaish died Monday night in
what AQAP’s media wing, Al-Malahem Media, called a ”crusader airstrike.” The Al-Malahem Media
obituary characterized al-Rubaish as a religious scholar and combat commander. A Yemeni Defense
Ministry official and two Yemeni national security officials not authorized to speak on record con-
firmed that al-Rubaish had been killed, but could not specify how he died. Al-Rubaish was once held
by the U.S. government at its detention facility in Guantanamo Bay, Cuba. In fact, he was among a
number of detainees who sued the administration of then-President George W. Bush to challenge the
legality of their confinement in Gitmo. He was eventually released as part of Saudi Arabia’s program
for rehabilitating jihadist terrorists, a program that U.S. Sen. Jeff Sessions, R-Alabama, characterized
as ”a failure.” In December 2009, Sessions listed al-Rubaish among those on the virtual ” ’Who’s
Who’ of al Qaeda terrorists on the Arabian peninsula ... who have either graduated or escaped from
the program en route to terrorist acts.” The United States has been active in Yemen, working closely
with governments there to go after AQAP leaders like al-Rubaish. While it was not immediately clear
how he died, drone strikes have killed many other members of the terrorist group. Yemen, however,
has been in disarray since Houthi rebels began asserting themselves last year. The Shiite minority
group even managed to take over the capital of Sanaa and, in January, force out Yemeni President
Abdu Rabu Mansour Hadi – who had been a close U.S. ally in its anti-terror fight. Hadi still claims he
is Yemen’s legitimate leader, and he is working with a Saudi-led military coalition to target Houthis
and supporters of former President Ali Abdullah Saleh. Meanwhile, Yemen has been awash in vio-
lence and chaos – which in some ways has been good for groups such as AQAP. A prison break earlier
this month freed 270 prisoners, including some senior AQAP figures, according to a senior Defense
Ministry official, and the United States pulled the last of its special operations forces out of Yemen
last month, which some say makes things easier for AQAP.
Reference
AQAP says a “crusader airstrike” killed Ibrahim al-Rubaish. Al-Rubaish was once detained by the
United States in Guantanamo.
Refresh
(CNN) A top al Qaeda in the Arabian Peninsula leader – who a few years ago was in a U.S. detention
facility – was among five killed in an airstrike in Yemen, the terror group said, showing the organiza-
tion is vulnerable even as Yemen appears close to civil war. Ibrahim al-Rubaish died Monday night
in what AQAP’s media wing, Al-Malahem Media, called a “crusader airstrike.” Al-Rubaish was once
held by the U.S. government at its detention facility in Guantanamo Bay, Cuba.
BANDITSUM
Ibrahim al-Rubaish died Monday night in what AQAP ’s media wing, Al-Malahem Media, called a
“crusader airstrike.” A Yemeni Defense Ministry official and two Yemeni national security officials
not authorized to speak on record confirmed that al-Rubaish had been killed, but could not specify how
he died. Al-Rubaish was once held by the U.S. government at its detention facility in Guantanamo
Bay, Cuba.

Table 5: Example from the dataset showing summaries generated by BANDITSUM and Refresh. Refresh selected the article's first sentence (blue), which contains the redundant information "Ibrahim al-Rubaish was in a U.S. detention facility". In contrast, BANDITSUM selected the yellow sentence containing extra source information.
Source Document
(CNN)You probably never knew her name, but you were familiar with her work. Betty Whitehead
Willis, the designer of the iconic ”Welcome to Fabulous Las Vegas” sign, died over the weekend. She
was 91. Willis played a major role in creating some of the most memorable neon work in the city.
The Neon Museum also credits her with designing the signs for Moulin Rouge Hotel and Blue Angel
Motel. Willis visited the Neon Museum in 2013 to celebrate her 90th birthday. Born about 50 miles
outside of Las Vegas in Overton, she attended art school in Pasadena, California, before returning
home. She retired at age 77. Willis never trademarked her most-famous work, calling it ”my gift to
the city.” Today it can be found on everything from T-shirts to refrigerator magnets.
Reference
Willis never trademarked her most-famous work, calling it “my gift to the city”. She created some of
the city’s most famous neon work.
Refresh
Betty Whitehead Willis, the designer of the iconic “Welcome to Fabulous Las Vegas” sign, died over
the weekend. Willis played a major role in creating some of the most memorable neon work in the
city. Willis visited the Neon Museum in 2013 to celebrate her 90th birthday.
BANDITSUM
Betty Whitehead Willis , the designer of the iconic “Welcome to Fabulous Las Vegas” sign , died over
the weekend. She was 91. The Neon Museum also credits her with designing the signs for Moulin
Rouge Hotel and Blue Angel Motel.

Table 6: Example from the dataset showing summaries generated by BANDITSUM and Refresh. BANDITSUM tends to select more precise information.
