BanditSum: Extractive Summarization as a Contextual Bandit
Our approach formulates extractive summarization as a contextual bandit, which we then train an agent to solve using policy gradient reinforcement learning. A bandit is a decision-making formalization in which an agent repeatedly chooses one of several actions and receives a reward based on this choice.

3.1 Policy Gradient Reinforcement Learning

Ideally, we would like to maximize (1) using gradient ascent. However, the required gradient cannot be obtained using the usual techniques (e.g. simple backpropagation) because i must be discretely sampled in order to compute R(i, a).

Fortunately, we can use the likelihood-ratio gradient estimator from reinforcement learning and stochastic optimization (Williams, 1992; Sutton et al., 2000), which tells us that the gradient of this function can be computed as:

∇θ J(θ) = E[∇θ log pθ(i|d) R(i, a)]   (2)

where the expectation is taken over the same variables as (1).

Since we typically do not know the exact document distribution and thus cannot evaluate the expected value in (2), we instead estimate it by sampling. We found that we obtained the best performance when, for each update, we first sample one document/summary pair (d, a), then sample B index sequences i_1, ..., i_B from pθ(·|d), and finally take the empirical average:

∇θ J(θ) ≈ (1/B) Σ_{b=1}^{B} ∇θ log pθ(i_b|d) R(i_b, a)   (3)

This overall learning algorithm can be regarded as an instance of the REINFORCE policy gradient algorithm (Williams, 1992).
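The estimator in (3) can be illustrated on a toy problem. The sketch below applies the same sample-based REINFORCE update to a four-armed bandit with a softmax policy; the reward table is a hypothetical stand-in for the ROUGE-based R(i, a), and all names are our own, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical reward table standing in for R(i, a); action 2 is best.
rewards = np.array([0.1, 0.2, 0.9, 0.3])

theta = np.zeros(4)      # policy parameters (logits)
B, lr = 16, 0.5          # samples per update, learning rate

for _ in range(200):
    p = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(B):                   # eq. (3): empirical average over B samples
        i = rng.choice(4, p=p)
        glogp = -p.copy()
        glogp[i] += 1.0                  # ∇θ log pθ(i) for a softmax policy
        grad += glogp * rewards[i]
    theta += lr * grad / B               # gradient ascent on J(θ)

print(softmax(theta).argmax())           # index of the action the policy now prefers
```

Because every reward is positive, each sampled action's probability is pushed up, but actions with above-average reward are pushed up faster, so the policy concentrates on the best arm.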
3.2 Structure of pθ(·|d)

There are many possible choices for the structure of pθ(·|d); we opt for one that avoids privileging early sentences over later ones. We first decompose pθ(·|d) into two parts: πθ, a deterministic function which contains all the network's parameters, and µ, a probability distribution parameterized by the output of πθ. Concretely:

pθ(·|d) = µ(·|πθ(d))   (4)

Given an input document d, πθ outputs a real-valued vector of sentence affinities whose length is equal to the number of sentences in the document (i.e. πθ(d) ∈ R^{N_d}) and whose elements fall in the range [0, 1]. The t-th entry π(d)_t may be roughly interpreted as the network's propensity to include sentence s_t in the summary of d.

Given sentence affinities πθ(d), µ implements a process of repeated sampling-without-replacement. This proceeds by repeatedly normalizing the set of affinities corresponding to sentences that have not yet been selected, thereby obtaining a probability distribution over unselected sentences, and sampling from that distribution to obtain a new sentence to include. This normalize-and-sample step is repeated M times, yielding M unique sentences to include in the summary.

At each step of sampling-without-replacement, we also include a small probability ε of sampling uniformly from all remaining sentences. This is used to achieve adequate exploration during training, and is similar to the ε-greedy technique from reinforcement learning.

Under this sampling scheme, we have the following expression for pθ(i|d):

pθ(i|d) = ∏_{j=1}^{M} ( ε / (N_d − j + 1) + (1 − ε) π(d)_{i_j} / (z(d) − Σ_{k=1}^{j−1} π(d)_{i_k}) )   (5)

where z(d) = Σ_t π(d)_t. For index sequences that have a length different from M, or that contain duplicate indices, we have pθ(i|d) = 0. Using this expression, it is straightforward to use automatic differentiation software to compute ∇θ log pθ(i|d), which is required for the gradient estimate in (3).
3.3 Baseline for Variance Reduction

Our sample-based gradient estimate can have high variance, which can slow the learning. One potential cause of this high variance can be seen by inspecting (3) and noting that it basically acts to change the probability of a sampled index sequence to an extent determined by the reward R(i, a). However, since ROUGE scores are always positive, the probability of every sampled index sequence is increased, whereas intuitively, we would prefer to decrease the probability of sequences that receive a comparatively low reward, even if it is positive. This can be remedied by the introduction of a so-called baseline, which is subtracted from all rewards.

Using a baseline r, our sample-based estimate of ∇θ J(θ) becomes:

(1/B) Σ_{b=1}^{B} ∇θ log pθ(i_b|d) (R(i_b, a) − r)   (6)

It can be shown that the introduction of r does not bias the gradient estimator and can significantly reduce its variance if chosen appropriately (Sutton et al., 2000).

There are several possibilities for the baseline, including the long-term average reward and the average reward across different samples for one document-summary pair. We choose an approach known as self-critical reinforcement learning, in which the test-time performance of the current model is used as the baseline (Ranzato et al., 2015; Rennie et al., 2017; Paulus et al., 2018). More concretely, after sampling the document-summary pair (d, a), we greedily generate an index sequence using the current parameters θ:

i_greedy = argmax_i pθ(i|d)   (7)

and calculate the baseline for the current update as r = R(i_greedy, a). This baseline has the intuitively satisfying property of only increasing the probability of a sampled label sequence when the summary it induces is better than what would be obtained by greedy decoding.
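A minimal numerical sketch of the self-critical baseline, using a fixed toy softmax policy and a hypothetical all-positive reward table in place of ROUGE: subtracting the greedy reward, as in (7), leaves the mean of the estimator unchanged while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy setting: 4 actions with all-positive "ROUGE-like" rewards (hypothetical).
rewards = np.array([0.30, 0.35, 0.60, 0.40])
p = softmax(np.array([0.2, 0.1, 0.5, 0.0]))   # fixed policy; we only probe the estimator

def grad_estimate(baseline):
    """One single-sample likelihood-ratio gradient estimate, eq. (6) with B = 1."""
    i = rng.choice(4, p=p)
    glogp = -p.copy()
    glogp[i] += 1.0
    return glogp * (rewards[i] - baseline)

r_greedy = rewards[p.argmax()]   # eq. (7): reward of the greedy (argmax) action

plain = np.stack([grad_estimate(0.0) for _ in range(20000)])
crit = np.stack([grad_estimate(r_greedy) for _ in range(20000)])

# Same mean direction, lower total variance with the self-critical baseline.
print(plain.var(axis=0).sum(), crit.var(axis=0).sum())
```

The baseline term has zero expectation because Σ_i pθ(i) ∇θ log pθ(i) = 0, which is why the estimator stays unbiased.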
3.4 Reward Function

A final consideration is a concrete choice for the reward function R(i, a). Throughout this work we use:

R(i, a) = (1/3) (ROUGE-1_f(i, a) + ROUGE-2_f(i, a) + ROUGE-L_f(i, a))   (8)

The above reward function optimizes the average of all the ROUGE variants (Lin, 2004) while balancing precision and recall.
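A simplified stand-in for (8) can be written as follows. This is a bare-bones n-gram/LCS F1 scorer for illustration only, not the pyrouge implementation used in the experiments, and the function names are our own:

```python
from collections import Counter

def ngram_f1(hyp, ref, n):
    """F1 over n-gram multiset overlap (simplified ROUGE-n)."""
    h = Counter(zip(*[hyp[i:] for i in range(n)]))
    r = Counter(zip(*[ref[i:] for i in range(n)]))
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def lcs_f1(hyp, ref):
    """F1 based on longest-common-subsequence length (simplified ROUGE-L)."""
    m, n = len(hyp), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if hyp[i] == ref[j] \
                else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / m, lcs / n
    return 2 * prec * rec / (prec + rec)

def reward(hyp_tokens, ref_tokens):
    """Eq. (8): mean of the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores."""
    return (ngram_f1(hyp_tokens, ref_tokens, 1)
            + ngram_f1(hyp_tokens, ref_tokens, 2)
            + lcs_f1(hyp_tokens, ref_tokens)) / 3
```

Averaging the F1 variants, rather than recall alone, is what keeps the agent from being rewarded for simply selecting long summaries.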
4 Model

In this section, we discuss the concrete instantiation of the neural network πθ that we use in our experiments. We break πθ up into two components: a document encoder fθ1, which outputs a sequence of sentence feature vectors (h_1, ..., h_{N_d}), and a decoder gθ2, which yields sentence affinities:

h_1, ..., h_{N_d} = fθ1(d)   (9)
πθ(d) = gθ2(h_1, ..., h_{N_d})   (10)

Encoder. Features for each sentence in isolation are first obtained by applying a word-level Bidirectional Recurrent Neural Network (BiRNN) to the embeddings of the words in the sentence and averaging the hidden states over the words. A separate sentence-level BiRNN is then used to obtain a representation h_i for each sentence in the context of the document.

Decoder. A multi-layer perceptron is used to map from the representation h_t of each sentence through a final sigmoid unit to yield the sentence affinities πθ(d).

The use of a bidirectional recurrent network in the encoder is crucial, as it allows the network to process the document as a whole, yielding representations for each sentence that take all other sentences into account. This is necessary to deal with some aspects of summary quality such as redundancy (avoiding the inclusion of multiple sentences with similar meaning), which requires the affinities for different sentences to depend on one another. For example, to avoid redundancy, if the affinity for some sentence is high, then sentences which express similar meaning should have low affinities.

5 Experiments

In this section, we discuss the setup of our experiments. We first discuss the corpora that we used and our evaluation methodology. We then discuss the baseline methods against which we compared, and conclude with a detailed overview of the settings of the model parameters.

5.1 Corpora

Three datasets are used for our experiments: CNN, Daily Mail, and the combined CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016). We use the standard split of Hermann et al. (2015) for training, validation, and testing, and the same non-anonymized setting on the three corpora as See et al. (2017). The Daily Mail corpus has 196,557 training documents, 12,147 validation documents, and 10,397 test documents, while the CNN corpus has 90,266/1,220/1,093 documents, respectively.

5.2 Evaluation

The models are evaluated based on ROUGE (Lin, 2004). We obtain our ROUGE scores using the standard pyrouge package[2] for the test-set evaluation, and a faster Python implementation of the ROUGE metric[3] for training and for evaluation on the validation set. We report the F1 scores of ROUGE-1, ROUGE-2, and ROUGE-L, which compute the unigram, bigram, and longest-common-subsequence overlap with the reference summaries, respectively.

[2] https://pypi.python.org/pypi/pyrouge/0.1.3
[3] We use the modified version based on https://github.com/pltrdy/rouge
5.3 Baselines

We compare BanditSum with other extractive methods, including: the Lead-3 model, SummaRuNNer (Nallapati et al., 2017), Refresh (Narayan et al., 2018), RNES (Wu and Hu, 2018), DQN (Yao et al., 2018), and NN-SE (Cheng and Lapata, 2016). The Lead-3 model simply produces the leading three sentences of the document as the summary.

Model                        ROUGE-1  ROUGE-2  ROUGE-L
Lead (Narayan et al., 2018)    39.6     17.7     36.2
Lead-3 (ours)                  40.0     17.5     36.2
SummaRuNNer                    39.6     16.2     35.3
DQN                            39.4     16.1     35.6
Refresh                        40.0     18.2     36.6
RNES w/o coherence             41.3     18.9     37.6
BanditSum                      41.5     18.7     37.6

Table 1: Performance comparison of different extractive summarization models on the combined CNN/Daily Mail test set using full-length F1.
Source Document
(CNN)A top al Qaeda in the Arabian Peninsula leader – who a few years ago was in a U.S. detention
facility – was among five killed in an airstrike in Yemen, the terror group said, showing the organiza-
tion is vulnerable even as Yemen appears close to civil war. Ibrahim al-Rubaish died Monday night in
what AQAP’s media wing, Al-Malahem Media, called a “crusader airstrike.” The Al-Malahem Media
obituary characterized al-Rubaish as a religious scholar and combat commander. A Yemeni Defense
Ministry official and two Yemeni national security officials not authorized to speak on record con-
firmed that al-Rubaish had been killed, but could not specify how he died. Al-Rubaish was once held
by the U.S. government at its detention facility in Guantanamo Bay, Cuba. In fact, he was among a
number of detainees who sued the administration of then-President George W. Bush to challenge the
legality of their confinement in Gitmo. He was eventually released as part of Saudi Arabia’s program
for rehabilitating jihadist terrorists, a program that U.S. Sen. Jeff Sessions, R-Alabama, characterized
as “a failure.” In December 2009, Sessions listed al-Rubaish among those on the virtual “ ’Who’s
Who’ of al Qaeda terrorists on the Arabian peninsula ... who have either graduated or escaped from
the program en route to terrorist acts.” The United States has been active in Yemen, working closely
with governments there to go after AQAP leaders like al-Rubaish. While it was not immediately clear
how he died, drone strikes have killed many other members of the terrorist group. Yemen, however,
has been in disarray since Houthi rebels began asserting themselves last year. The Shiite minority
group even managed to take over the capital of Sanaa and, in January, force out Yemeni President
Abdu Rabu Mansour Hadi – who had been a close U.S. ally in its anti-terror fight. Hadi still claims he
is Yemen’s legitimate leader, and he is working with a Saudi-led military coalition to target Houthis
and supporters of former President Ali Abdullah Saleh. Meanwhile, Yemen has been awash in vio-
lence and chaos – which in some ways has been good for groups such as AQAP. A prison break earlier
this month freed 270 prisoners, including some senior AQAP figures, according to a senior Defense
Ministry official, and the United States pulled the last of its special operations forces out of Yemen
last month, which some say makes things easier for AQAP.
Reference
AQAP says a “crusader airstrike” killed Ibrahim al-Rubaish. Al-Rubaish was once detained by the
United States in Guantanamo.
Refresh
(CNN) A top al Qaeda in the Arabian Peninsula leader – who a few years ago was in a U.S. detention
facility – was among five killed in an airstrike in Yemen, the terror group said, showing the organiza-
tion is vulnerable even as Yemen appears close to civil war. Ibrahim al-Rubaish died Monday night
in what AQAP’s media wing, Al-Malahem Media, called a “crusader airstrike.” Al-Rubaish was once
held by the U.S. government at its detention facility in Guantanamo Bay, Cuba.
BanditSum
Ibrahim al-Rubaish died Monday night in what AQAP’s media wing, Al-Malahem Media, called a
“crusader airstrike.” A Yemeni Defense Ministry official and two Yemeni national security officials
not authorized to speak on record confirmed that al-Rubaish had been killed, but could not specify how
he died. Al-Rubaish was once held by the U.S. government at its detention facility in Guantanamo
Bay, Cuba.
Table 5: Example from the dataset showing summaries generated by BanditSum and Refresh. Refresh selected
the article’s first sentence (blue), which contains the redundant information “Ibrahim al-Rubaish was in a U.S.
detention facility”. In contrast, BanditSum selected the yellow sentence containing extra source information.
Source Document
(CNN)You probably never knew her name, but you were familiar with her work. Betty Whitehead
Willis, the designer of the iconic “Welcome to Fabulous Las Vegas” sign, died over the weekend. She
was 91. Willis played a major role in creating some of the most memorable neon work in the city.
The Neon Museum also credits her with designing the signs for Moulin Rouge Hotel and Blue Angel
Motel. Willis visited the Neon Museum in 2013 to celebrate her 90th birthday. Born about 50 miles
outside of Las Vegas in Overton, she attended art school in Pasadena, California, before returning
home. She retired at age 77. Willis never trademarked her most-famous work, calling it “my gift to
the city.” Today it can be found on everything from T-shirts to refrigerator magnets.
Reference
Willis never trademarked her most-famous work, calling it “my gift to the city”. She created some of
the city’s most famous neon work.
Refresh
Betty Whitehead Willis, the designer of the iconic “Welcome to Fabulous Las Vegas” sign, died over
the weekend. Willis played a major role in creating some of the most memorable neon work in the
city. Willis visited the Neon Museum in 2013 to celebrate her 90th birthday.
BanditSum
Betty Whitehead Willis, the designer of the iconic “Welcome to Fabulous Las Vegas” sign, died over
the weekend. She was 91. The Neon Museum also credits her with designing the signs for Moulin
Rouge Hotel and Blue Angel Motel.
Table 6: Example from the dataset showing summaries generated by BanditSum and Refresh. BanditSum
tends to select more precise information.