Syntactic Trimming of Extracted Sentences For Improving Extractive Multi-Document Summarization
JOURNAL OF COMPUTING, VOLUME 2, ISSUE 7, JULY 2010, ISSN 2151-9617
Abstract— For managing the vast amount of online and offline information, summarization can be a useful means, because users can decide about the relevance of an individual document or a document cluster using just the summary. Multi-document summaries enable users to identify the main theme (central idea) of a cluster of texts very rapidly. This paper presents a sentence-compression-based summarization technique that uses a number of local and global sentence-trimming rules to improve the performance of an extractive multi-document summarization system. For our experiments, we develop (1) a primary summarization system, which extracts sentences to form a draft summary, and (2) a trimming component, which accepts a draft summary for revision. The trimming component eliminates low-content and redundant elements from the sentences in the draft summaries using a number of local and global sentence-trimming rules, without hampering the grammaticality and fluency of the summaries. In effect, the trimming process makes room for more diverse and salient units to appear in a summary. Our test results on the DUC 2004 data set show that the summarization system that integrates both the extraction component and the trimming component performs better than some state-of-the-art summarization approaches.
Index Terms— Information Overload, Multi-document summarization, Natural Language Processing, Syntactic Trimming
3.1 Local Trimming
Besides the above-mentioned patterns, the following time sequences have also been identified from the corpus: "last year", "last week", "last month", "the first time", "next month", "this year".

[But/CC with/IN exorbitant/JJ salaries/NNS paid/VBN to/TO unproven/JJ stars/NNS over/IN the/DT few/JJ years/NNS and/CC 100/CD million/CD deals/NNS sprouting/VB…]

R2: Remove low-content adverbs from the sentences.

The tagger assigns the tag <RB> to adverbs. We identified a few exceptions where this rule should not be applied: some words, such as "ago", "well", "earlier", "before" and "not", are tagged as <RB>, but they should not be deleted, to maintain the fluency of a summary. The constituents to be deleted in the following example are shown in bold italic.

Tagged input:
[Portuguese/JJ writer/NN who/WP took/VBD up/IN literature/NN relatively/RB late/JJ in/IN life/NN and/CC whose/WP$ richly/RB imaginative/JJ novels/NNS soon/RB won/VBD him/PRP a/DT following/VBG of/IN loyal/JJ. . .]

After application of rule R2:
[Portuguese/JJ writer/NN who/WP took/VBD up/IN literature/NN late/JJ in/IN life/NN and/CC whose/WP$ imaginative/JJ novels/NNS won/VBD him/PRP a/DT following/VBG of/IN loyal/JJ . . .]

R3: Remove adjectives with idf < 2.5 from a noun phrase if the noun phrase contains more than 2 keywords. Here idf means inverse document frequency (as used in traditional information-retrieval settings). The DFA used for identifying noun phrases is shown in Fig. 1.

R4: Delete the information source (such as "a spokesman said today") that appears at the end of a sentence with the pattern <comma + segment + dot>, where the segment does not contain any named entity but contains words such as "reported", "said", etc.

The segment indicating news-source information usually starts with a comma (,) and ends with a period (.). This segment is more distinctly identified by a list of domain-specific keywords and phrases such as "reported", "announced", "according to" and "officials said".

R5: If a sentence starts with a PP with the pattern <IN + NP + Comma>, delete it if the noun heads in the PP are lightweight. Here the tag <IN> indicates a preposition, NP means noun phrase and PP means prepositional phrase; lightweight means the weight is less than a threshold (2.2 in this setting).

Tagged input:
[In/IN an/DT almost/RB casual/JJ fashion/NN ,/, the/DT document/NN seems/VBZ to/TO confirm/VB two/CD of/IN the/DT central/JJ charges/NNS of/IN the/DT federal/JJ case/NN against/IN bin/NN Laden/NNP.]

After application of rule R5:
[the/DT document/NN seems/VBZ to/TO confirm/VB two/CD of/IN the/DT central/JJ charges/NNS of/IN the/DT federal/JJ case/NN against/IN bin/NN Laden/NNP.]
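As an illustration of how a local rule such as R2 can operate on the tagged token stream, here is a minimal Python sketch. It is our own reconstruction, not the authors' implementation; the "word/TAG" token format and the exception-word list are taken from the examples and text above.

```python
# Sketch of local trimming rule R2: drop adverbs (tag <RB>) from a
# POS-tagged sentence, keeping the exception words listed in the paper
# ("ago", "well", "earlier", "before", "not"), which must survive to
# preserve fluency. Tokens are assumed to be in "word/TAG" format.

RB_EXCEPTIONS = {"ago", "well", "earlier", "before", "not"}

def apply_r2(tagged_sentence: str) -> str:
    """Remove low-content adverbs (tag RB) from a 'word/TAG' token string."""
    kept = []
    for token in tagged_sentence.split():
        word, _, tag = token.rpartition("/")
        if tag == "RB" and word.lower() not in RB_EXCEPTIONS:
            continue  # delete the low-content adverb
        kept.append(token)
    return " ".join(kept)

print(apply_r2("literature/NN relatively/RB late/JJ in/IN life/NN"))
# -> literature/NN late/JJ in/IN life/NN
```

A real system would apply this after POS tagging each draft-summary sentence, alongside the other local rules (R1, R3–R5).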
[Fig. 1. DFA for noun phrase identification — figure not reproduced; the labels visible in the source are "Start", "Noun", "Article" and "Adjective".]

3.2 Global Trimming
Modifiers are used before or after named entities such as person names, organization names and location names. Noun phrases also contain modifier terms such as adjectives and adverbs. The modifier terms may repeat (partly or in full) in a summary along with the units (such as a named entity (NE) or a noun phrase (NP)) they modify. The redundant information found in such cases can be deleted with minimal loss of information and without loss of grammaticality. We consider the modifier terms as candidates for global trimming; parts of a modifier can be eliminated if they are already found in the previously selected sentences. Before applying trimming rules, the modifier terms should be identified
carefully to maintain grammaticality. We apply different syntactic rules to identify modifier terms from the noun phrases and from the surroundings of the named entities. So we have two types of global trimming: named-entity-centric sentence trimming and sentence trimming by thinning noun phrases.

Named Entity Centric Trimming
Named-entity-centric trimming has three steps: named entity identification, identification of the modifiers surrounding named entities, and formation of trimming rules.

A word with the tag <NNP> (which is used by the tagger to indicate a proper noun) is considered part of a named entity, and a sequence of words tagged with <NNP> constitutes a named entity. For example, the phrase "former Chilean dictator Augusto Pinochet" is tagged by the tagger as "former/JJ Chilean/JJ dictator/NN Augusto/NNP Pinochet/NNP", and "Augusto/NNP Pinochet/NNP" constitutes a named entity since it is a sequence of NNPs.

In our work, we consider a noun phrase having a named entity (NE) at its head as a named entity phrase (NEP). An example of a NEP is "former Chilean dictator Augusto Pinochet", where "Augusto Pinochet" is the named entity at the head of the NEP.

We divide a NEP into two parts, <modifier + NE>. But sometimes a very common word (such as "President") appears as part of a NEP because the tagger assigns it the tag <NNP>. To handle this situation, we convert the word to lowercase and check its idf value. If the word is found in the vocabulary and its idf value is <2.5, we consider the word part of the modifier; otherwise we consider it part of the named entity. The procedure to identify the modifier and the named entity works as follows. Say A is the leftmost word and H is the head word in a NEP. Scan the NEP from right to left checking NNPs, and stop exactly when we encounter a non-NNP or an NNP with idf value <2.5. Say we stop at such a word X. Then we consider the segment spanning from the word A to the word X as the modifier, and the rest of the NEP is considered the NE.

A modifier of a named entity may also appear in another form, called a "noun in apposition"; this form of modifier is extracted using the syntactic patterns <NE, M,> or <NE, M.>, where M is the modifier and NE is the named entity. The rule for named-entity-centric phrase trimming is as follows:

R6A: Delete the modifier of a named entity in its current mention in a sentence if the modifier of its current mention is similar to one of the modifiers of its earlier mentions in already-scanned sentences.

To apply the above rule, we maintain a list of modifiers for each mention of a named entity covered in the previously selected sentences while scanning a draft summary from top to bottom. Two modifiers are taken to be similar when the term-based similarity between them is greater than a threshold value (0.5 in our setting). The similarity between two modifiers M1 and M2 is calculated as follows:

Sim(M1, M2) = (2 * |M1 ∩ M2|) / (|M1| + |M2|)

Articles and prepositions (if any) are removed while measuring the similarity between two modifiers.

Tagged input:
(The candidate segment for trimming is shown in bold italics, and the similar phrase found in the previously scanned sentences is shown in bold only.)

[Saudi/NNP exile/NN Osama/NNP bin/NN Laden/NNP ,/, the/DT alleged/VBN mastermind/NN …]
[…newspaper/NN interview/NN of/IN Afghanistan/NNP -/: based/VBN Saudi/NNP billionaire/NN Osama/NNP bin/NN Laden/NNP who/WP has/VBZ been/VBN accused/VBN…]

After application of rule R6A:

[...Saudi/NNP exile/NN Osama/NNP bin/NN Laden/NNP ,/, the/DT alleged/VBN mastermind/NN ...]
[.... interview/NN of/IN Laden/NNP who/WP has/VBZ been/VBN accused/VBN. . .]

In the above example, "Saudi/NNP exile/NN Osama/NNP bin/NN Laden/NNP" and "Afghanistan/NNP -/: based/VBN Saudi/NNP billionaire/NN Osama/NNP bin/NN Laden/NNP" are two sentence segments, which are basically named entity phrases (NEPs). The two NEPs contain the same named entity (assuming that consecutive NNPs represent a named entity), "Laden", at their heads. So our system divides the first segment into a modifier, "Saudi exile Osama bin", and a head named entity, "Laden", and the second segment into a modifier, "Afghanistan based Saudi billionaire Osama bin", and a head named entity, "Laden". According to the similarity metric mentioned above, the similarity between the two modifiers is >0.5, so the modifier of the entity "Laden" in the second sentence is deleted according to rule R6A.

R6B: If any NP matches completely (word by word) with the modifier of a NEP already seen in the previously scanned sentences, replace the NP with the NE head of the NEP.

Tagged input:
[. . . the/DT trial/NN of/IN Malaysian/NNP former/JJ deputy/NN prime/JJ minister/NN Anwar/NNP Ibrahim/NNP on/IN charges/NNS of/IN corruption/NN. . .]
[. . . his/PRP$ concerns/NNS about/IN the/DT arrest/NN of/IN Malaysian/NNP former/JJ deputy/NN prime/JJ minister/NN ./.]

After application of rule R6B:
[. . . the/DT trial/NN of/IN Malaysian/NNP former/JJ deputy/NN prime/JJ minister/NN Anwar/NNP Ibrahim/NNP on/IN charges/NNS of/IN corruption/NN. . .]
[. . . his/PRP$ concerns/NNS about/IN the/DT arrest/NN of/IN Anwar/NNP Ibrahim/NNP]

Simple NP Trimming
We trim the noun phrases containing a named entity at their heads using the named-entity-centric trimming rules discussed above, but we treat noun phrases having no named entity at their heads differently. In this case, we consider the trailing non-noun words (adjectives and adverbs) of a noun phrase as modifier terms if the length of the NP is >2 and the distance between the word and the noun head is >=2. The distance between two phrasal words is measured as the position of the head word in the phrase minus the position of the word in the phrase; the position value of a word in a phrase increases from left to right. A word satisfying the above syntactic constraints is considered a modifier word if its idf value is <2.5; a low idf value signifies that the word is very common in the text corpus. The NP trimming rules are as follows.

R7A: If A is an NP in a sentence S and Bi is one of the NPs belonging to the list of noun phrases found in the already-scanned sentences and head(A) = head(Bi), delete the modifier words of A that match with those of Bi.

Tagged input:
(The candidate segment for trimming is shown in bold italics, and the similar phrase in the previously revised sentence is shown in bold only.)

[. . . to/TO launch/VB the/DT first/JJ component/NN of/IN a/DT multibillion/JJ dollar/NN international/JJ space/NN station/NN after/IN a/DT year/NN of/IN delay/NN]
[The/DT first/JJ part/NN of/IN the/DT international/JJ space/NN station/NN was/VBD smoothly/RB orbiting/VBG . . .]

After application of rule R7A:

[. . . to/TO launch/VB the/DT first/JJ component/NN of/IN a/DT multibillion/JJ dollar/NN international/JJ space/NN station/NN after/IN a/DT year/NN of/IN delay/NN]
[The/DT first/JJ part/NN of/IN the/DT space/NN station/NN was/VBD smoothly/RB orbiting/VBG . . .]

R7B: Trim noun phrases of the pattern <CD + X>, where CD is the tag for a numeric value and X indicates common nouns specifying human beings. We assume a predetermined list of words such as "people", "persons", "soldiers", etc. for X.

Tagged input:
At/IN least/JJS 32/CD people/NNS were/VBD killed/VBN and/CC widespread/JJ flooding/NN prompted/VBD more/JJR than/IN 150/CD ,/, 000/CD to/TO seek/VB higher/JJR ground/NN ./.

After application of rule R7B:
At/IN least/JJS 32/CD were/VBD killed/VBN and/CC widespread/JJ flooding/NN prompted/VBD more/JJR than/IN 150/CD ,/, 000/CD to/TO seek/VB higher/JJR ground/NN ./.

4 EXPERIMENTS AND RESULTS
For comparing system-generated summaries to reference summaries, we use an automatic summary evaluation tool, ROUGE (Recall-Oriented Understudy for Gisting Evaluation, version 1.5.5) [23][24], developed by the Information Sciences Institute at the University of Southern California. ROUGE is an automated tool that compares a summary generated by an automated system with one or more ideal summaries. The ideal summaries are called models. ROUGE is based on n-gram overlap between the system-produced and reference summaries. ROUGE was used as the evaluation tool in the 2004 and 2005 Document Understanding Conferences (DUC) (National Institute of Standards and Technology (NIST), 2005). We consider ROUGE-1, ROUGE-2 and ROUGE-SU4 average F-scores to measure the performance of each summarizer. ROUGE-1 evaluates unigram-based overlap between the system-produced and reference summaries, ROUGE-2 evaluates bigram co-occurrence, while ROUGE-SU4 evaluates "skip bigrams," which are pairs of words (in sentence order) with intervening word gaps no larger than 4 words. (ROUGE-1.5.5 was run with the arguments: -n 2 -x -m -2 4 -u -b 665.)

We chose as our input data the document sets used in task 2 of the Document Understanding Conference (DUC) in 2004. Task 2 in DUC 2004 was designed to evaluate short multi-document summary generation. This collection contains 50 test document sets, each with approximately 10 news stories. For each document set, four human-generated summaries are provided for the target length of 665 bytes (approximately 100 words).

In our experiment, for each input data set, a draft summary of 200 words is generated by the sentence extraction method discussed in section 2. Then all the draft summaries are tagged by the POS tagger. The trimming rules discussed in section 3 are then applied one by one to the sentences in each draft summary. After trimming and resizing the draft summary for each document set, a final summary of 665 bytes is selected from the trimmed draft summary. The results of the evaluation of the overall summarization performance using the ROUGE package (version 1.5.5) are shown in Tables 1, 2 and 3. We show three ROUGE scores as our experimental results: the ROUGE-1 (unigram-based), ROUGE-2 (bigram-based) and ROUGE-SU4 (skip-bigram) metrics. The summarization performances before trimming and after trimming are shown
separately in the tables in terms of ROUGE scores.
The proposed system has been compared with the best system (peer65) and one baseline system (the lead baseline) that participated in task 2 of DUC 2004. The lead baseline simply takes the first 665 bytes of the most recent news article in a document cluster as the summary.
We can see from Tables 1-3 that our system outperforms the top-performing system and the baseline system on task 2 of DUC 2004 over all three ROUGE metrics.

TABLE 1
ROUGE-1 SYSTEMS COMPARISON ON DUC 2004 DATASET

…be improved by using local and global trimming rules. We have used only syntactic trimming rules to eliminate less important or redundant constituents of the summary sentences.
Further improvement in the overall summarization performance may be possible by introducing new trimming rules to compress a draft summary without hampering the grammaticality and the fluency of the summary. We hope that the trimming component proposed in this paper may be used as a plug-in component at the output of any sentence-extraction-based summarizer to boost its summarization performance.
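The modifier-similarity check that drives rule R6A (Section 3.2) can be sketched as follows. This is a minimal reconstruction under stated assumptions, not the authors' code: |M| is interpreted as the modifier's set of content words, and the article/preposition stop-word list here is a hypothetical stand-in for whatever list the system actually uses. On the bin Laden example from Section 3.2 it yields 2*3/(4+6) = 0.6 > 0.5, so the second modifier would be trimmed.

```python
# Sketch of the Dice-style similarity of rule R6A:
#   Sim(M1, M2) = 2*|M1 ∩ M2| / (|M1| + |M2|)
# Articles and prepositions are removed before measuring similarity,
# as the paper specifies; this particular stop-word set is an assumption.

ARTICLES_PREPOSITIONS = {"a", "an", "the", "of", "in", "on", "at", "to", "for"}

def modifier_similarity(m1: str, m2: str) -> float:
    """Term-based similarity between two named-entity modifiers."""
    w1 = {w.lower() for w in m1.split()} - ARTICLES_PREPOSITIONS
    w2 = {w.lower() for w in m2.split()} - ARTICLES_PREPOSITIONS
    if not w1 or not w2:
        return 0.0
    return 2 * len(w1 & w2) / (len(w1) + len(w2))

m1 = "Saudi exile Osama bin"
m2 = "Afghanistan based Saudi billionaire Osama bin"
sim = modifier_similarity(m1, m2)  # 2*3/(4+6) = 0.6, above the 0.5 threshold
```

A trimming component would call this while scanning a draft summary top to bottom, comparing each entity mention's modifier against the stored modifiers of its earlier mentions and deleting the current modifier whenever the score exceeds 0.5.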