
Information Sciences 509 (2020) 22–35


An unsupervised constrained optimization approach to compressive summarization

Natalia Vanetik a, Marina Litvak a, Elena Churkin a, Mark Last b,∗

a Department of Software Engineering, Shamoon College of Engineering, Beer Sheva, Israel
b Department of Software and Information Systems Engineering, Ben Gurion University of the Negev, Beer Sheva, Israel

Article history:
Received 10 October 2016
Revised 25 August 2019
Accepted 28 August 2019
Available online 2 September 2019

Keywords:
Compressive summarization
Budgeted sentence compression
Polytope model

Abstract

Automatic summarization is typically aimed at selecting as much information as possible from text documents using a predefined number of words. Extracting complete sentences into a summary is not an optimal way to solve this problem due to redundant information that is contained in some sentences. Removing the redundant information and compiling a summary from compressed sentences should provide a much more accurate result. Major challenges of compressive approaches include the cost of creating large summarization corpora for training the supervised methods, the linguistic quality of compressed sentences, the coverage of the relevant content, and the time complexity of the compression procedure. In this work, we attempt to address these challenges by proposing an unsupervised polynomial-time compressive summarization algorithm. The proposed algorithm iteratively removes redundant parts from original sentences. It uses constituency-based parse trees and hand-crafted rules for generating elementary discourse units (EDUs) from their subtrees (standing for phrases) and selects the ones with a sufficient tree gain. We define a parse tree gain as a weighted function of its node weights, which can be computed by any extractive summarization model capable of assigning importance weights to terms. The results of automatic evaluations on a single-document summarization task confirm that the proposed sentence compression procedure helps to avoid redundant information in the generated summaries. Furthermore, the results of human evaluations confirm that the linguistic quality—in terms of readability and coherency—is preserved in the compressed summaries while improving their coverage. However, the same evaluations show that compression in general harms the grammatical correctness of compressed sentences, though, in most cases, this effect is not significant for the proposed compression procedure.

© 2019 Elsevier Inc. All rights reserved.

1. Introduction

Automatic summarization can help index large repositories of textual data by identifying the essence of each document. Summarization may also significantly reduce the information overload by providing users with shorter versions of the original documents.
Taxonomically, we distinguish between an automatically generated extract – the most salient fragments of the input documents (e.g., sentences or paragraphs) – and an abstract – a reformulated synopsis expressing the main idea of the input documents.


∗ Corresponding author.
E-mail addresses: nataliav@sce.ac.il (N. Vanetik), marinal@sce.ac.il (M. Litvak), elenach@ac.sce.ac.il (E. Churkin), mlast@bgu.ac.il (M. Last).

https://doi.org/10.1016/j.ins.2019.08.079

Since generating abstracts requires deep linguistic analysis of the input documents, most existing summarizers
work in an extractive manner [1].
However, given tight length constraints, extractive systems are quite limited in the summaries they can produce. Long but
partially relevant sentences may “fill out” the limited, predefined length of a summary and prevent the inclusion of other
relevant sentences. Compressive summarization aims at overcoming this limitation by compiling summaries from compressed
sentences composed of strictly relevant information [2].
The main contribution of our work is combining sentence compression and summarization-oriented optimization in or-
der to preserve both informativeness and grammatical correctness of the summary sentences. In our model, sentence com-
pression is driven by a summarization weighting scheme unlike general sentence compression, where salient parts of a
sentence can be omitted. This is because we determine informativeness and importance of sentences and sentence parts at
the document- and document set-level, whereas sentence compression is usually performed at the sentence level only. Our
method tackles this challenge by performing these two tasks together.
The rest of this paper is organized as follows: Section 2 describes the related work. Section 3 introduces the problem statement and relevant definitions. Section 4 explains the EDU generation procedure, the general optimization procedure, and the compression algorithm. Section 5 contains experimental results for automated and human evaluations. Section 6 presents our conclusions and suggestions for future work. The Appendix contains a detailed description of the annotation guidelines used by the human evaluators.

2. Related work

Earlier works in compressive summarization applied text simplification techniques [3–5]. These works pursued the same
goal – to omit redundant and irrelevant information from selected sentences and consequently maximize the coverage of
important information in a summary, all while keeping the valid grammatical structure of generated sentences. More recent
works take different approaches to reaching both objectives. Most approaches compress sentences by editing the syntactic
trees of the original sentences, as described in [6]. The motivation behind using syntactic trees is to keep a valid grammat-
ical structure of compressions. Some works consider the graph structure of texts [7], pursuing the same goal. An unsupervised approach, described in [7], generates a representative sentence for each cluster of similar sentences. This is accomplished by representing a cluster as a word graph and computing the shortest path between the start and end nodes. Hybrid methods that combine classic approaches also exist. For example, McDonald [8] uses a rich feature set consisting of surface-level bigram features and features over dependency and phrase-structure trees in order to learn which words should be dropped from a sentence. In his work, sentences are generated directly, as strings, without explicit editing of trees. The authors of [9] propose an abstractive summarization framework that constructs new sentences by merging fine-grained syntactic units, namely noun phrases and verb phrases. They first extract phrases from constituency trees, then calculate salience scores for them, and finally generate new summary sentences. The sentence generation task is formulated
as an optimization problem. Chali et al. [10] divide the task of summarization into document shrinking, where the sentence
compression and merging are performed to produce new candidate sentences, and extractive summarization, represented as
a submodular function maximization problem under budgeted constraints.
In [11], sentence compression techniques are used to facilitate query-focused multi-document summarization. The authors present a sentence-compression-based framework for the task and design a series of learning-based compression models built on parse trees. An innovative beam search decoder is proposed to efficiently find highly probable compressions.
Discourse-based approaches incorporate discourse properties [12] for deciding whether to remove or retain words from
a sentence [13]. These approaches are supervised and represent documents by trees whose leaves correspond to Elementary
Discourse Units (EDUs) and whose nodes specify how these units are linked to each other. Unlike previous works, Clarke
and Lapata [14] apply a discourse-based compression model to entire documents rather than to isolated sentences. They define
sentence compression as a constrained optimization problem. Given a long sentence, a compression is created by retaining
the words that maximize a scoring function (such as tf-idf of topic-related words). Their model is based on lexical chains
and the “Centering Theory,” which assumes that certain entities mentioned in a sentence (e.g., subjects) are more central
than others. These central entities should be retained in the compressed sentences.
Some works introduce supervised methods for compressive summarization [15–17] by combining extraction and com-
pression scores in an optimization problem and, as such, extract and compress sentences at the same time.
Paper [15] uses Integer Linear Programming (ILP) with a supervised objective function learned from the data in order to ex-
tract and compress sentences at the same time; the grammaticality of a summary is handled by giving preference to sum-
maries containing fewer and longer sentences. Paper [16] introduces another supervised method for simultaneous extraction
and sentence compression based on n-gram types in the summary and ILP. In [17], a supervised ILP-based model for simulta-
neous extraction and compression is proposed; here, grammaticality constraints are derived from arcs in dependency-based
parse trees.
Many recent works have adopted deep learning language models for handling sentence compression [18–21] and abstractive compressive summarization [22,23]. These models are known for their ability to achieve very accurate performance, but they require training on annotated data of considerable size.
Our unsupervised summarization approach combines methods for term weighting and sentence compression into a Weighted Compression (WeC) model. The term-weighting model provides a weight for each term occurrence in the document, learned from unlabeled data, and the sentence compression strives to compress the sentences so that the combined weight of summary terms is maximal.

Fig. 1. System pipeline.
Similar to the approaches proposed in [10,24] and in contrast to approaches proposed in [15–17], we apply the sen-
tence extraction and compression models sequentially in an unsupervised manner. However, unlike the work of Morita
et al. [24], our approach relies on splitting sentences into elementary discourse units, EDUs,1 using constituency parse trees;
learns term weights from data; and compresses sentences by iteratively removing EDUs with low weight. In contrast to
the work in [10], where document shrinking by sentence compression is followed by extractive summarization, we first apply an extractive summarization model, which assigns weights to the words in sentences, and then compress the sentences by removing lower-weighted EDUs. Unlike the approaches presented in [11,15–17,19–23], our method is unsupervised and relies
on LP over rationals with a pre-defined objective function. In contrast, the supervised methods, especially those based on
deep learning, require significant amounts of training data. Furthermore, our sentence compression method gives preference
to grammaticality over compression rate. We discover which sentence terms are more informative with the help of term
weights obtained from unsupervised optimization, but uninformative EDUs are only removed if the remaining sentence is
grammatically sound.
The EDUs are generated from constituency-based syntax trees, similarly to [2,16], and filtered by a set of heuristic rules.
The main difference between our approach and the discourse-based approaches described in [14] is that we rely on term weights, as well as sentence EDUs, to generate compressed sentences.

3. Problem statement

3.1. Outline

An elementary discourse unit (EDU) is a grammatically independent part of a sentence (see [25]). We propose to shorten
sentences by iteratively removing EDUs, preserving important content by optimizing the weight function measuring accumu-
lative importance, and preserving valid syntax by following the syntactic structure of a sentence. We denote our approach
to abstractive (compressive) summarization as weighted compression (WeC). The proposed approach consists of the following
steps, depicted in Fig. 1:

1. Term weight assignment. We apply a weighting model that assigns a non-negative weight to every term occurrence in all sentences of the document.
2. EDU selection and ranking. Our approach builds a summary from compressed sentences. Sentences are compressed by removal of EDUs. At this stage, we prepare a list of candidate EDUs for removal. First, we generate the list of EDUs from constituency-based syntax trees [26] of sentences. Then, we omit EDUs whose removal might create a grammatically incorrect sentence. Finally, we compute weights for all "safe" (remaining) EDU candidates from the term weights obtained in the first stage and sort them by increasing weight.
3. Budgeted sentence compression. We define a summary cost as its length measured in words or characters. We are given a budget for the summary cost, for example, the maximal number of words in a summary. The compressive part of our method is responsible for selecting EDUs in all sentences such that
   (a) the weight-to-cost ratio of the summary is maximal;
   (b) the summary length does not exceed the given budget.

3.2. Notation

Formally speaking, our problem can be defined as follows: We are given a set of sentences S1, . . . , Sn over terms t1, . . . , tm. A term is a word after stemming and stop-word removal. A non-negative real term weight wij in the range [0,1] is assigned to every occurrence of term tj in sentence Si. Additionally, a word limit L is defined for the final summary. We define compressive summarization as the task of generating and selecting compressions C1, . . . , Cn of sentences S1, . . . , Sn so that the sum of compressed and selected sentence weights, $\sum_{i=1}^{n} \sum_{t_j \in C_i} w_{ij}$, is maximized under the length constraint $\sum_{i=1}^{n} |C_i| \le L$. Compressed sentences are expected to be more succinct than the originals, to contain the important content of the originals, and to be grammatically correct.

1 Meaningful sentence parts.

After weight assignment is performed, non-negative real term weights 0 ≤ wij ≤ 1 are assigned to every occurrence of term tj in sentence Si. The vector of term weights of sentence Si is denoted by wi = (wi1, . . . , wim).

Example 1. Let the document contain a single sentence S1 = "My dog likes eating hotdogs." We can say that it contains terms t1 = my, t2 = dog, t3 = like, t4 = eat, and t5 = hotdog. Using the term weighting scheme of Morita et al. [24], which assigns weight 1 to nouns, verbs, and adverbs, and weight 0 to other parts of speech,2 we obtain the following weight vector for sentence S1:

w1 = (w11 = 0.0, w12 = 1.0, w13 = 1.0, w14 = 1.0, w15 = 1.0)

This means that after compression we wish to keep the terms t2, t3, t4, t5 in S1 with confidence 1.0 and the term t1 with confidence 0.0.3

4. Weighted compression model

4.1. Weighting model

This model aims at assigning a non-negative weight to every term occurrence in all sentences of the document. It can be
based on either a sophisticated extraction algorithm or a very simple heuristic filtering. Our system supports three different
weighting schemes4:

1. Polytope weighting model. This model uses an efficient text representation model for extractive summarization, with
the purpose of representing all possible extracts (sentence subsets). Each sentence is represented by a hyperplane,
and all sentences derived from a document form hyperplane intersections (polytope). Then, all possible extracts can
be represented by subplanes of hyperplane intersections that are not located far from the boundary of the polytope.
Following the maximum coverage principle, this model is aimed at finding the extract that optimizes the chosen
objective function. The solution, provided by Linear Programming (LP) over rationals, assigns non-negative real weights in the range from 0 to 1 to all term occurrences in a document, representing their importance for a summary.
One can find the details about the model and possible objective functions in [27]. In this work, we utilized this model
with one particular objective function, which maximizes the information coverage as a weighted term sum, where
terms appearing earlier in the text get higher weight.
2. Extractive models of Gillick and Favre [28] and McDonald [29]. We assign a weight of 1 to all term occurrences in the subset of sentences extracted by the model, and a weight of 0 to the remaining occurrences.
3. The Morita et al. [24] weighting scheme, which assigns weight 1 to nouns, verbs, adverbs, and adjectives, and weight 0 to all other words.

Note that all weights lie in the range [0,1]. Additionally, we set the weight of all occurrences of a term to 0 if the term belongs to a list of stopwords.
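To make the schemes concrete, here is a minimal Python sketch of the third (POS-based) scheme. NLTK is our assumed tooling (the paper does not specify an implementation), and the Penn Treebank tag prefixes stand for the noun, verb, adverb, and adjective classes:

# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
# nltk.download('stopwords').
import nltk
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))
# Penn Treebank tag prefixes for nouns, verbs, adverbs, and adjectives.
CONTENT_TAG_PREFIXES = ('NN', 'VB', 'RB', 'JJ')

def pos_weights(sentence):
    """Return (token, weight) pairs: weight 1 for content words,
    0 otherwise; stopwords always get weight 0, as described above."""
    tokens = nltk.word_tokenize(sentence)
    weights = []
    for token, tag in nltk.pos_tag(tokens):
        if token.lower() in STOPWORDS:
            weights.append((token, 0.0))
        else:
            weights.append((token, 1.0 if tag.startswith(CONTENT_TAG_PREFIXES) else 0.0))
    return weights

# Example 1 revisited: the content words get weight 1, "My" gets weight 0.
print(pos_weights("My dog likes eating hotdogs."))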

4.2. EDU selection and ranking

The aim of this stage of our approach is to provide a list of candidate EDUs for removal from original sentences. We
generate the candidates from phrases contained in constituency-based syntax trees. After candidate EDUs are generated,
the procedure removes the least important EDUs from sentences, thus creating their shorter (compressed) versions. With-
out preserving the syntactic structure of sentences in EDUs generation, we may end up with grammatically incorrect and
meaningless sentences.
In the constituency-based parse trees of constituency grammars (or phrase-structure grammars), nonterminal nodes are labeled with syntactic phrasal categories and parts of speech, whereas terminal nodes are labeled with words. Fig. 2 shows the constituency-based tree of the sentence "My dog likes eating hotdogs."
We generate EDUs as subtrees of a syntax tree. The number of these EDUs for a single sentence can be significant for
constituency-based syntax trees, but this number cannot exceed the number of nodes in the tree. Fig. 3 shows two subtrees,
one representing a “likes eating hotdogs” phrase and another representing a “My dog” phrase.
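A minimal sketch of this enumeration, assuming NLTK-style constituency trees (the tree format is an assumption; the helper name is illustrative):

from nltk.tree import Tree

def candidate_edus(parse_tree):
    """Enumerate proper subtrees of a constituency parse tree as candidate
    EDUs; their number is bounded by the number of tree nodes."""
    # subtrees() yields the tree itself first; skip it, since removing the
    # root would delete the whole sentence.
    return [t for t in parse_tree.subtrees() if t is not parse_tree]

tree = Tree.fromstring(
    "(S (NP (PRP$ My) (NN dog)) "
    "(VP (VBZ likes) (VP (VBG eating) (NP (NNS hotdogs)))) (. .))")
for edu in candidate_edus(tree):
    print(edu.label(), '->', ' '.join(edu.leaves()))
# Among the candidates: the "My dog" NP and the "likes eating hotdogs" VP of Fig. 3.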
Some EDUs cannot be removed safely from a sentence; after removal, a grammatically incorrect or too short sentence
may be generated. Therefore, we have built a set of heuristic rules (see Table 1) in order to select EDUs that can be safely
removed from a sentence. These rules were built by empirical evaluation on the DUC 2002 and DUC 2004 datasets.5

2 We support this scheme among others in our system but do not report results with it, due to its poor performance with our approach.
3 In general, the weights can take any value in the [0,1] range.
4 In our experiments, we report only the results for the first two weighting schemes, which provided the best summaries with the introduced compressive approach.
5 In our further research, we applied a supervised approach, where a set of rules is automatically induced from the training set and then used for classification [30] of EDU candidates. We observed that using a small training set or a large one, coming from completely different domains, makes no significant difference. Therefore, we may assume that both hand-crafted and supervision-based rules are not expected to be too specific.

Fig. 2. Parse tree for the sentence “My dog likes eating hotdogs”.

Fig. 3. Two candidate EDUs in a syntax tree.

Table 1
Selection rules for candidate EDUs.

Column    1           2            3            4           5                 6                   7
Rule #    Max words   Max % of     EDU phrase   First POS   First subphrase   POS following       POS preceding
          in EDU      a sentence   type         of EDU      of EDU            EDU                 EDU

1         1           -            -            RB          -                 -                   -
2         5           -            SBAR         WP          -                 -                   -
3         -           0.5          SBAR         IN          -                 "," or ";" or "."   -
4         -           0.5          SBAR         WDT         WHNP              none                -
5         -           0.5          SBAR         WRB         WHAVP             -                   none
6         -           0.75         PP           RB          -                 "," or ";" or "."   "," or ";" or "-"
7         1           -            -            -           -                 "," or ";" or "-"   none
8         5           -            VP           TO          -                 "," or ";" or "-"   none
9         3           -            -            -           -                 none                "," or ";" or "-"
10        -           0.75         -            CC          -                 none                "," or ";" or "-"
11        -           0.75         PP           VBG or IN   -                 none                "," or ";" or "-"
12        5           -            -            WP or VBD   -                 none                "," or ";" or "-"
13        -           0.75         VP           -           VBG               none                "," or ";" or "-"

Fig. 4. Parse tree with node and subtree weight and cost.

If a candidate EDU does not fit one of the rules, it is omitted from the set of candidates, as its eventual removal may result in a grammatically incorrect sentence.
The rule conditions depend on the EDU size and its structure as a subtree of the constituency-based syntax tree. The EDU size can
be limited by either the number of words or percentage of its content out of the entire sentence (columns 1–2 of Table 1).
For instance, a candidate EDU cannot occupy more than 75% of a sentence. EDU internal structural parameters are its phrase
type (root type of a subtree in the syntax tree), type of the first POS of an EDU, and the type of its first subphrase (root
of the leftmost non-trivial subtree in a syntax tree). These parameters are shown in columns 3–5 of Table 1; “-” denotes
that a rule condition is ignored. EDU external structure parameters are the POS of words immediately following (column 6 of
Table 1) or immediately preceding (column 7 of Table 1) an EDU in a sentence. We rely mostly on punctuation in these
rules (appearance of a semicolon, dash or comma immediately before or immediately after an EDU); “none” denotes that
no such POS exists in a sentence, i.e., an EDU appears at the beginning or at the end of a sentence.
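To make the rule format concrete, the following sketch encodes rules 1 and 2 of Table 1 as predicates over a candidate EDU; the dictionary keys are illustrative names for the table columns, not the authors' implementation:

def rule_1(edu):
    """Rule 1: a one-word EDU whose first POS is RB (an adverb)."""
    return edu['num_words'] <= 1 and edu['first_pos'] == 'RB'

def rule_2(edu):
    """Rule 2: an SBAR clause of at most 5 words starting with WP
    (e.g., a short 'who ...' relative clause)."""
    return (edu['num_words'] <= 5 and edu['phrase_type'] == 'SBAR'
            and edu['first_pos'] == 'WP')

RULES = [rule_1, rule_2]  # rules 3-13 follow the same pattern

def is_removal_candidate(edu):
    """An EDU stays a candidate for removal only if it fits at least one rule."""
    return any(rule(edu) for rule in RULES)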
We define two parameters for syntax subtrees – weight and cost. The weight of a terminal node, i.e., a node that repre-
sents a word, is provided by the weighting model; it is a real number in the range [0,1]. The cost of a terminal node n is defined according to the summary budget. If the summary is limited by the number of words, we set cost(n) = 1 for all words. If the summary is limited by the number of characters, we set cost(n) = |n| for all words, where |n| is the number of characters
in the word of terminal n. The weight of a subtree of a syntax tree is the total weight (sum of weights) of a partial sentence
represented by that subtree. Likewise, the cost of a subtree of a syntax tree is the total cost of a partial sentence represented
by that subtree. Fig. 4 gives examples of subtrees with their weight and cost derived from the weight and cost values for
their leaf nodes.
We compute a weight and a cost for every candidate EDU as the weight and cost of the subtree representing it. Candidate EDUs are then sorted by increasing weight, so that those with lower weights will be removed from sentences first.
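The corresponding weight and cost computations can be sketched as follows (the term_weight interface is an illustrative assumption; leaves() follows the NLTK tree convention used in the earlier sketch):

def edu_weight(subtree, term_weight):
    """Weight of an EDU = sum of the [0,1] weights of the words at its
    leaves, as provided by the weighting model."""
    return sum(term_weight(word) for word in subtree.leaves())

def edu_cost(subtree, by_characters=False):
    """Cost of an EDU under a word budget (1 per word) or a character
    budget (word length)."""
    leaves = subtree.leaves()
    return sum(len(w) for w in leaves) if by_characters else len(leaves)

# Candidates with the lowest weight are removed from sentences first:
# candidates.sort(key=lambda t: edu_weight(t, term_weight))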

4.3. Compression model

Given a cost budget L, which describes the desired summary length in terms of the number of words or characters, the aim is to find

\[ \arg\max_{C} \; weight(C) \quad \text{s.t.} \quad cost(C) \le L \qquad (1) \]


where C is a set of compressed sentences which are grammatically valid.
We propose a sentence compression algorithm that iteratively removes the least significant EDUs from the summary. If
we do not have enough EDUs to remove, we delete entire sentences, starting with the ones with the smallest weight-to-cost ratio. Fig. 5 shows two candidate EDUs, with T1 having the lower weight-to-cost ratio, which equals 0.5.

4.4. Summarization as a weighted compression

This section combines all of the stages described above into one summarization algorithm, named WeC (Weighted Compression) and presented in Algorithm 1.
28 N. Vanetik, M. Litvak and E. Churkin et al. / Information Sciences 509 (2020) 22–35

Fig. 5. Two candidate EDUs for removing.

Algorithm 1: WeC compressive summarization.

1  Input: Syntax trees C1, . . . , Cn of sentences S1, . . . , Sn,
2         weight function weight(),
3         cost function cost(),
4         maximal summary length L.
5  Output: Compressed sentences Ccompressed
6  EDUs ← generateEDUs(C1, . . . , Cn);
7  // Select candidate EDUs
8  foreach T ∈ EDUs do
9      if T satisfies no selection rule then
10         EDUs ← EDUs \ {T}
11     end
12 end
13 Ccompressed ← {S1, . . . , Sn};
14 if cost(Ccompressed) ≤ L then
15     return Ccompressed;
16 end
17 while cost(Ccompressed) > L and EDUs ≠ ∅ do
18     // Select EDU of minimal importance
19     T ← arg min_{T ∈ EDUs} weight(T)/cost(T);
20     Let Si be the sentence containing T; if cost(Si \ T) ≥ 5 and Si \ T contains a noun and a verb then
21         Ccompressed ← Ccompressed \ {T};
22     else
23         EDUs ← EDUs \ {T};
24     end
25 end
26 return Ccompressed;

The input of the algorithm includes:

• A real-valued non-negative weight and cost of each term in every sentence of the document;
• The maximal summary length (cost) L;
• The syntax trees of all sentences in the document.

The main steps of Algorithm 1 are:



Table 2
Extractive and compressive summarization methods.

Notation Description

EGillick&Favre     ILP concept-based extraction method of Gillick and Favre [28]
EMcDonald          ILP extraction method of McDonald [29]
EPoly              Polytope extraction method with the position_first objective function [27]
WCGillick&Favre    Weighted Compression with weights generated by EGillick&Favre
WCMcDonald         Weighted Compression with weights generated by EMcDonald
WCPoly             Weighted Compression with weights generated by EPoly

Table 3
Baseline and topline methods.

Notation     Description

CGillick     ILP concept-based compression method of Gillick et al. [16,28]
MUSE         GA-based supervised extraction method of Litvak et al. [33]
ERandom      Extraction method extracting random sentences
OCCAMS       Topline extraction method for the best coverage of gold standard summaries [31]
WC1          Weighted Compression with all weights assigned to 1
WCRandom     Weighted Compression with randomly generated weights

• Generation of candidate EDUs from syntax trees of each sentence is performed in line 6.
• Filtering out EDUs using ad-hoc rules described in Table 1 is performed in lines 8–11.
• If the initial set of sentences already satisfies the length constraint, no compression needs to be performed (lines
13–16). Otherwise, remaining EDUs are ordered by weight/cost ratio and are removed iteratively until the desired
summary length is reached (lines 17–25). If we do not have enough EDUs to remove, we delete entire sentences, starting with the ones with the smallest weight-to-cost ratio. We make sure that every compressed sentence still has a noun and a verb and at least 5 words when an EDU is removed from it (line 21); a sketch of this loop in Python follows.
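The core greedy loop can be sketched in Python as follows; the data representation is our own simplification (each sentence is a set of hashable EDUs, with weight, cost, and the viability check of line 20 supplied as callables), and whole-sentence deletion is omitted for brevity:

def wec_compress(sentences, candidates, weight, cost, budget, is_viable):
    """Greedy WeC compression (cf. Algorithm 1, lines 17-25): repeatedly drop
    the candidate EDU with the smallest weight-to-cost ratio until the
    summary fits the budget.

    sentences:  dict mapping sentence index -> frozenset of its EDUs.
    candidates: list of (sentence index, EDU) pairs that passed the rules.
    weight, cost: callables over a single EDU.
    is_viable:  callable implementing the >= 5 words / noun-and-verb check.
    """
    pool = list(candidates)
    while pool and sum(cost(e) for edus in sentences.values() for e in edus) > budget:
        # Pick the least important candidate (minimal weight-to-cost ratio).
        i, edu = min(pool, key=lambda p: weight(p[1]) / cost(p[1]))
        pool.remove((i, edu))
        remaining = sentences[i] - {edu}
        if is_viable(remaining):   # keep the removal only if the sentence stays valid
            sentences[i] = remaining
    # If the budget is still exceeded here, whole sentences with the smallest
    # weight-to-cost ratio would be deleted next (omitted for brevity).
    return sentences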

5. Experiments

The primary objective of our experiments was to see whether weighted compression (WeC) improves performance of
extractive summarization. Therefore, three state-of-the-art unsupervised extractive approaches were taken as baselines and
also as a component – the weighting model – of WeC. Section 4.1 explains how we used these approaches for generating weights.
Table 2 gives the list of evaluated approaches and their notations. The first three rows present baseline extractive ap-
proaches. Note that the systems EGillick&Favre and EMcDonald are ILP-based; a comparison with some other ILP-based systems, such as the system of [17], was not possible due to software unavailability. The last three rows present compressive summariza-
tion approaches, where the same compression procedure, described in Section 4.4, was applied to generated weights. Our
experiments consist of two independent parts: automated evaluation, for measuring the coverage of the relevant content,
and human evaluation, for evaluating the linguistic quality of the generated summaries.
The secondary objective of our evaluation was to compare our approach to additional baselines, including supervised ex-
tractive and topline methods. Table 3 contains the description of all baseline and topline methods used in this stage of eval-
uation. Three out of six baselines are extraction methods: OCCAMS [31] is a topline method that computes the best coverage
of the gold standard summaries (it was used as a topline in the MultiLing 2015 community task [32]); MUSE [33] is a super-
vised method6 that computes an optimized model for sentence extraction using a genetic algorithm; and ERandom extracts
random sentences. The other three baselines are compressive models: CGillick is an ILP-based model described in [16,28];
WC1 is our weighted compression model assigning identical weights of 1 to all terms; and WCRandom is our compression
model using randomly generated weights.
Due to time constraints, we did not perform human evaluations for the baselines described in Table 3 and compared
WeC to them only in terms of automatically calculated scores.
The average running time of our compression algorithm per text file was 0.06 sec for the DUC 2002 dataset, 0.27 sec for the DUC 2004 dataset, and 0.62 sec for the DUC 2007 dataset. The computer characteristics were as follows: i7-4610M, 3.00 GHz,
2 Cores, 4 Logical Processors, 16 GB RAM, SSD.

6 In this work, we used the model trained on the DUC 2002 dataset.

Table 4
DUC corpora.

Dataset     # files   Avg file size   Word limit   # of GS per doc

DUC 2002    533       3.3 Kb          100          2–4
DUC 2004    50        35.4 Kb         100          4
DUC 2007    23        70.8 Kb         250          4

Table 5
Results. DUC 2002.

System R-1 R R-1 P R-1 F R-2 R R-2 P R-2 F

OCCAMS             0.490   0.504   0.497   0.243   0.250   0.246
MUSE               0.458   0.452   0.455   0.210   0.206   0.208
ERandom            0.354   0.396   0.374   0.127   0.141   0.133
CGillick           0.294   0.333   0.312   0.079   0.089   0.084
WC1                0.343   0.396   0.367   0.116   0.134   0.124
WCRandom           0.341   0.396   0.365   0.114   0.132   0.122
EGillick&Favre     0.401   0.407   0.401   0.160   0.162   0.160
WCGillick&Favre    0.410∗  0.413   0.409   0.166   0.166   0.165
EMcDonald          0.393   0.407   0.396   0.156   0.159   0.156
WCMcDonald         0.401∗  0.403   0.399   0.158   0.158   0.157
EPoly              0.448   0.453   0.447   0.213   0.214   0.212
WCPoly             0.450   0.450   0.447   0.211   0.210   0.210

Table 6
Results. DUC 2004.

System R-1 R R-1 P R-1 F R-2 R R-2 P R-2 F

OCCAMS             0.358   0.373   0.365   0.077   0.080   0.079
MUSE               0.329   0.329   0.329   0.067   0.067   0.067
ERandom            0.264   0.301   0.281   0.035   0.040   0.037
CGillick           0.204   0.238   0.219   0.026   0.030   0.028
WC1                0.262   0.301   0.280   0.039   0.045   0.041
WCRandom           0.249   0.286   0.266   0.032   0.037   0.034
EGillick&Favre     0.291   0.292   0.292   0.051   0.051   0.051
WCGillick&Favre    0.296   0.297   0.296   0.051   0.051   0.051
EMcDonald          0.285   0.285   0.285   0.045   0.045   0.045
WCMcDonald         0.285   0.286   0.285   0.047   0.047   0.047
EPoly              0.282   0.281   0.282   0.046   0.046   0.046
WCPoly             0.283   0.283   0.283   0.047   0.047   0.047

5.1. Automated evaluation

We used the DUC 2002 corpus [34], the DUC 2004 corpus [35], and the DUC 2007 corpus [36] for this part of our experiments. For the multi-document corpora, we merged document sets into single meta-documents by appending the texts one to another. A summary of the corpora parameters is given in Table 4.
All summarizers were evaluated by the ROUGE-1 and ROUGE-2 metrics7 [37]. The goal of this evaluation was to measure summary quality in terms of content. ROUGE does this by measuring the similarity between the generated summaries and the gold standard provided by human experts.
Tables 5–7 contain ROUGE-1 and ROUGE-2 scores (recall, precision, and F-measure) for all summarizers on the DUC 2002, DUC 2004, and DUC 2007 datasets, respectively. The first two lines show two extractive topline methods: OCCAMS, which extracts sentences with the best coverage of the gold standard, and MUSE, which learns the optimization model for sentence extraction in a supervised manner. The next four lines present the results of our baselines, one of them extractive and three compressive (see Table 3). The six bottom lines contain the results of three pairs of summarizers, where one of them is extractive and the second is based on the introduced weighted compression model with weights generated by the former. The paired T-test showed a statistically significant improvement between EGillick&Favre and WCGillick&Favre and between EMcDonald and WCMcDonald on the DUC 2002 corpus, and between EMcDonald and WCMcDonald on the DUC 2007 corpus (denoted by ∗). There are also slight improvements in most other cases when compression is applied, but they are not statistically significant. As can be seen from the results, using weights generated by an extractive model usually performs better than using random or static weights.

7 The following ROUGE command line has been used: -a -l <word count> -n 1 -2 4 -u.

Table 7
Results. DUC 2007.

System R-1 R R-1 P R-1 F R-2 R R-2 P R-2 F

OCCAMS             0.434   0.427   0.430   0.109   0.107   0.108
MUSE               0.249   0.434   0.315   0.055   0.095   0.069
ERandom            0.323   0.355   0.338   0.058   0.063   0.060
CGillick           0.285   0.309   0.296   0.053   0.057   0.055
WC1                0.319   0.355   0.336   0.057   0.063   0.060
WCRandom           0.307   0.343   0.324   0.053   0.059   0.056
EGillick&Favre     0.339   0.363   0.347   0.067   0.074   0.070
WCGillick&Favre    0.342   0.362   0.349   0.068   0.074   0.070
EMcDonald          0.267   0.367   0.300   0.052   0.070   0.058
WCMcDonald         0.282∗  0.359   0.309   0.054   0.068   0.059
EPoly              0.321   0.355   0.331   0.063   0.069   0.064
WCPoly             0.319   0.354   0.329   0.062   0.069   0.064

Table 8
Human evaluations. Scores (numbers in round brackets denote p-values).

System Readability Coherency Relevance Coverage

EGillick&Favre     3.68   3.34           3.34   2.88
WCGillick&Favre    3.64   3.25           3.41   3.04
EMcDonald          3.54   3.06           3.25   3.08
WCMcDonald         3.54   3.13           3.41   3.27 (0.025)
EPoly              4.02   3.80           3.55   2.99
WCPoly             3.85   3.60 (0.033)   3.53   3.21 (0.007)

Because extracting compressed sentences instead of entire sentences into a summary allows putting more information into it, in most cases the recall is affected positively (more gold-standard information in the resultant summary), as opposed to the precision (the volume of noise in the summary may also increase).
In general, we can conclude that compression improves summarization performance in terms of coverage.

5.2. Human evaluation

This part of the experiments was aimed at comparing extractive and abstractive (compressive) summarization approaches by measuring the grammaticality and other important characteristics of extracted sentences and entire summaries. The same six summarizers from Table 2 were evaluated.
We applied the summarizers to 58 documents manually selected from the MultiLing 2015 [32] corpus.8 The selected documents satisfied such parameters as length (long enough, 600 words on average) and content (meaningful text focused on a few central topics). 128 undergraduate students participated in the pilot, where each participant was asked to read a document and its summaries, generated by the 6 systems and the gold standard, and to evaluate each summary by answering several
multiple-choice questions. Namely, the participants were asked to grade readability, coherency, relevance, and coverage of
the produced summaries, on a scale of 1 to 5. Also, they were asked to answer several questions about the grammaticality
(most/none/some are grammatically incorrect), the length (most/none/some are too long/short), and the order of sentences
in a summary (whether the summary sentences appear in the wrong/right order). We measured the inter-annotator agreement using the following procedure. First, the categorical answers to each question were converted to numerical grades; since we considered "some" and "none" to be closer to each other than "none" and "most", we used a scale of 0 to 2 to map the answers from "none" to "most", respectively. Then we represented each annotator's answers by a vector of the grades they assigned to all documents. Finally, we measured the average Manhattan distance between the annotators' vectors of grades assigned to each question and normalized it by the maximal distance, due to the different scales used in different questions. The average inter-annotator agreement over all questions was calculated as 1 minus the average normalized Manhattan distance and found to be equal to 0.725.
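This computation can be sketched as follows; taking the maximal Manhattan distance to be the scale maximum times the vector length is our reading of the normalization described above:

import itertools

def agreement(grade_vectors, scale_max):
    """Average pairwise agreement = 1 - normalized Manhattan distance.

    grade_vectors holds one equal-length vector of numeric grades per
    annotator; scale_max is the largest grade on the question's scale
    (e.g., 2 for the 0-2 scale above), used to normalize distances.
    """
    normalized = []
    for u, v in itertools.combinations(grade_vectors, 2):
        manhattan = sum(abs(a - b) for a, b in zip(u, v))
        normalized.append(manhattan / (scale_max * len(u)))
    return 1.0 - sum(normalized) / len(normalized)

# Toy example: three annotators grading four documents on the 0-2 scale.
print(agreement([[0, 1, 2, 1], [0, 1, 1, 1], [1, 2, 2, 1]], scale_max=2))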
Tables 8 and 9 contain the results of the statistical analysis of all collected answers.
As can be seen, compression produces less coherent summaries in just one of the three cases (that of WCPoly), but it also improves coverage over EMcDonald and EPoly (numbers in round brackets denote p-values). In terms of other parameters, such as readability and relevance, the extractive and compressive approaches perform without significant differences.
According to the participants’ scores, the proposed compression approach produces shorter sentences that are not less grammatically valid than their originals. Also, compression normalizes the length of summary sentences – far fewer compressive summaries received the “Most sentences are too long” feedback than extractive ones.

8 Available at https://drive.google.com/file/d/1tmo_9AeTlDMzroQ9qTmM97R4o4BR-fiU/view?usp=drivesdk.

Table 9
Human evaluations. Questions (numbers in round brackets denote p-values).

Question:           How many sentences were     How many sentences were      How many sentences were     Were sentences in
                    grammatically correct?      too long?                    too short?                  correct order?
System              Most   None   Some          Most        None   Some      Most   None   Some          No    Yes

EGillick&Favre      4      88     37            27          39     63        3      104    22            17    112
WCGillick&Favre     9      71     48            14 (0.029)  46     68        2      93     33            21    107
EMcDonald           13     66     48            5           92     30        14     71     42            38    89
WCMcDonald          13     64     51            5           85     38        13     70     45            36    92
EPoly               5      100    22            21          55     51        6      100    21            26    101
WCPoly              3      87     36            9 (0.021)   58     59        2      92     32            29    97

6. Conclusions and future work

In this paper, we introduced an unsupervised algorithm for sentence compression based on a new approach of weighted compression (WeC). Its main advantage over supervised techniques is that it does not rely on a costly process of generating a training corpus of document summaries composed of manually selected and compressed sentences.
Our experimental results show that the introduced approach preserves the linguistic quality—in terms of lexical compar-
isons with ROUGE, readability and coherence—of the compressed sentences while improving their coverage. However, the
human evaluations confirm that the proposed compression procedure does not improve the grammaticality of compressed
sentences, regardless of the weighting scheme. It is worth noting that the differences in readability scores are insignificant for all scenarios; therefore, we can claim that readability was also preserved in the compressed summaries. Another impor-
tant advantage of the introduced approach, especially over the ILP-based compression methods, is its polynomial runtime
complexity and its fast actual running time. Also, our approach is adaptable to different objective functions that can be
adjusted to a specific summarization task.
The weaknesses of the proposed methodology can be summarized as follows: (1) Our method relies on syntax
(constituency) parsing, which may be imprecise. (2) The weighting scheme depends on the selected objective function,
which represents the optimal summary requirements. Because summarization quality is task-dependent (generic, query-focused, update, personalized summarization, etc.), it cannot be expressed by a single function. (3) The EDU selection
depends on a set of manually designed rules, which need to be adapted to different languages. As such, the proposed
methodology is language-dependent. In the future, we plan to enhance sentence compression by using supervised learning
in order to select sentence clauses that can be safely removed from a sentence; we expect such methods to produce shorter
sentences with improved precision scores. We believe that supervised selection of EDUs using a large number of syntactic
features will generate better rules for EDU removal. A supervised approach is expected to improve both grammaticality
(EDUs will stay in a sentence if their removal would harm its grammaticality) and compression ratio (more redundant EDUs
can be removed safely). In fact, our work-in-progress on supervised selection of EDUs supports this claim. Moreover, we
have learned empirically that using a large training set rather than a small one does not result in a significant improvement
of compression precision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.

Acknowledgment

This work was partially funded by the U.S. Department of the Navy, Office of Naval Research (W911NF-13-1-0171).

Appendix

In this section, we describe the guidelines we used for human summary evaluation and discuss the agreement between judges. All evaluators received the following guidelines for the evaluation of summaries:

1. First, download the folder containing a full article and its summaries generated by different systems.
2. Then, follow the detailed instructions and answer all questions in the Google form of the experiments (Figs. 6 and 7
show the exact Google form we used):
(a) Start with reading the article thoroughly in order to understand the main topics it describes. You must dedicate at
least 5 minutes to reading the article. You have to read the article to its end.

Fig. 6. Google form used for human evaluations, part 1.



Fig. 7. Google form used for human evaluations, part 2.

(b) Then, read all summaries of the article one by one and fill in the questionnaire for every summary separately. Each questionnaire contains a mandatory field for your ID, a mandatory field for the ID of the article (the same ID for all questionnaires you will fill in), a mandatory field for the ID of a summary (different for each of the summaries you evaluate), and a number of mandatory questions about the summary quality.
(c) Note that there are 6 summaries for each article; therefore, you have to fill in and submit 6 questionnaires.

References

[1] I. Mani, M. Maybury, Advances in automatic text summarization, MIT Press, Cambridge, MA, 1999.
[2] K. Knight, D. Marcu, Statistics-based summarization - step one: sentence compression, in: AAAI/IAAI, 2000, pp. 703–710.
[3] H. Jing, Sentence reduction for automatic text summarization, in: ANLP 2000, 2000, pp. 310–315.
[4] P. Lal, S. Ruger, Extract-based summarization with simplification, in: Proceedings of Document Understanding Conference, 2002.
[5] A.N.A. Siddharthan, K. McKeown, Syntactic simplification for improving content selection in multi-document summarization, in: 20th International
Conference on Computational Linguistics (COLING 2004), 2004, p. 896.
[6] K. Knight, D. Marcu, Summarization beyond sentence extraction: a probabilistic approach to sentence compression, Artif. Intell. 139 (2002) 91–107.
[7] K. Filippova, Multi-sentence compression: finding shortest paths in word graphs, in: COLING 10, 2010, pp. 322–330.
[8] R. McDonald, Discriminative sentence compression with soft syntactic evidence, in: EACL, 2006, pp. 297–304.
[9] L. Bing, P. Li, Y. Liao, W. Lam, W. Guo, R.J. Passonneau, Abstractive multi-document summarization via phrase selection and merging, CoRR
abs/1506.01597 (2015).
[10] Y. Chali, M. Tanvee, M.T. Nayeem, Towards abstractive multi-document summarization using submodular function-based framework, sentence compres-
sion and merging, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2,
2017, pp. 418–424.
[11] L. Wang, H. Raghavan, V. Castelli, R. Florian, C. Cardie, A sentence compression based framework to query-focused multi-document summarization, CoRR abs/1606.07548 (2016).
[12] B.J. Grosz, S. Weinstein, A.K. Joshi, Centering: a framework for modeling the local coherence of discourse, Comput. Linguist. 21 (2) (1995) 203–225.
[13] J. Clarke, M. Lapata, Modelling compression with discourse constraints, in: EMNLP-CoNLL, 2007, pp. 1–11.
[14] J. Clarke, M. Lapata, Discourse constraints for document compression, Computat. Linguist. 36 (3) (2010) 411–441.
[15] A.F.T. Martins, N.A. Smith, Summarization with a joint model for sentence extraction and compression, in: North American Chapter of the Association
for Computational Linguistics: Workshop on Integer Linear Programming for NLP, 2009, pp. 1–9.
[16] T. Berg-Kirkpatrick, D. Gillick, D. Klein, Jointly learning to extract and compress, in: Annual Meeting of the Association for Computational Linguistics,
2011, pp. 481–490.
[17] M. Almeida, A. Martins, Fast and robust compressive summarization with dual decomposition and multi-task learning, in: The 51st Annual Meeting of
the Association for Computational Linguistics (ACL 2013), 2013, pp. 196–206.
[18] K. Filippova, E. Alfonseca, C.A. Colmenares, L. Kaiser, O. Vinyals, Sentence compression by deletion with LSTMs, in: EMNLP, 2015, pp. 360–368.
[19] S. Chopra, M. Auli, A.M. Rush, Abstractive sentence summarization with attentive recurrent neural networks, in: Proceedings of the 2016 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 93–98.
[20] W. Liu, P. Liu, Y. Yang, Y. Gao, J. Yi, An attention-based syntax-tree and tree-LSTM model for sentence summarization, Int. J. Perform.Eng. 13 (5) (2017)
775.
[21] L. Wang, J. Jiang, H.L. Chieu, C.H. Ong, D. Song, L. Liao, Can syntax help? Improving an LSTM-based sentence compression model for new domains, in:
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017, pp. 1385–1393.
[22] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al., Abstractive text summarization using sequence-to-sequence RNNs and beyond, arXiv preprint arXiv:1602.06023 (2016).
[23] J. Cheng, M. Lapata, Neural summarization by extracting sentences and words, arXiv preprint arXiv:1603.07252 (2016).
[24] H. Morita, R. Sasano, H. Takamura, M. Okumura, Subtree extractive summarization via submodular maximization, in: ACL 2013, 2013, pp. 1023–1032.
[25] D. Marcu, From discourse structures to text summaries, in: Proceedings of the ACL, volume 97, 1997, pp. 82–88.
[26] C.D. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
[27] M. Litvak, N. Vanetik, Mining the gaps: towards polynomial summarization, in: Proceedings of the International Joint Conference on Natural Language
Processing, 2013, pp. 655–660.
[28] D. Gillick, B. Favre, A scalable global model for summarization, in: Proceedings of the NAACL HLT Workshop on Integer Linear Programming for Natural
Language Processing, 2009, pp. 10–18.
[29] R. McDonald, A study of global inference algorithms in multi-document summarization, in: Advances in Information Retrieval, 2007, pp. 557–564.
[30] E. Churkin, M. Last, M. Litvak, N. Vanetik, Sentence compression as a supervised learning with a rich feature space, CICLing, 2018.
[31] S.T. Davis, J.M. Conroy, J.D. Schlesinger, OCCAMS – an optimal combinatorial covering algorithm for multi-document summarization, in: 2012 IEEE 12th International Conference on Data Mining Workshops, IEEE, 2012, pp. 454–463.
[32] G. Giannakopoulos, J. Kubina, F. Meade, J.M. Conroy, M.D. Bowie, J. Steinberger, B. Favre, M. Kabadjov, U. Kruschwitz, M. Poesio, Multiling 2015:
Multilingual summarization of single and multi-documents, on-line fora, and call-center conversations, in: 16th Annual Meeting of the Special Interest
Group on Discourse and Dialogue, 2015, p. 270.
[33] M. Litvak, M. Last, M. Friedman, A new approach to improving multilingual summarization using a genetic algorithm, in: Proceedings of the 48th
Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 927–936.
[34] DUC, Document Understanding Conference, 2002. ( http://www-nlpir.nist.gov/projects/duc/data/2002_data.html).
[35] DUC, Document Understanding Conference, 2004. ( http://www-nlpir.nist.gov/projects/duc/data/2004_data.html).
[36] DUC, Document Understanding Conference, 2007. ( http://www-nlpir.nist.gov/projects/duc/data/2007_data.html).
[37] C.-Y. Lin, Rouge: a package for automatic evaluation of summaries, in: Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004),
2004, pp. 25–26.
