Abstract—Multi-document summarization plays an increasingly important role with the exponential document growth on the web. Among the many traditional multi-document summarization techniques, latent semantic analysis (LSA) is unique due to its use of latent semantic information instead of the original features, which results in better performance. However, since the approaches based on LSA evaluate and select sentences individually, none of them is able to remove redundant sentences. In this paper, we propose a new method that evaluates a sentence subset based on its capacity to reproduce term projections on the right singular vectors. Finally, experiments on the DUC2002 and DUC2004 datasets validate the effectiveness of our proposed method.

Keywords—multi-document summarization; latent semantic analysis; singular value decomposition; forward selection

I. INTRODUCTION

With the exponential growth of electronic documents on the WWW, there has recently been great interest in the multi-document summarization problem. Many methods have been proposed for finding a concise form of a corpus of documents. Generally, they can be classified into extractive summarization and abstractive summarization [1]. In this paper, we focus on extractive multi-document summarization.

In the literature, extractive multi-document summarization techniques have been widely studied. Some classic approaches include centroid-based methods, graph-based methods, NMF (non-negative matrix factorization)-based methods, and CRF (conditional random fields)-based methods [1]. MEAD is one of the centroid-based methods: it uses the centroid value, positional value, and first-sentence overlap to extract sentences and form a summary [2]. LexPageRank first constructs a sentence graph and then computes sentence importance based on the concept of eigenvector centrality [3]. Lee et al. proposed a framework based on the non-negative matrix factorization (NMF) of the sentence-term matrix to extract sentences with high weighted scores [4]. CRF-based summarization splits the input document into a sentence sequence and evaluates each sentence with a CRF to represent its importance [5].

Recently, researchers have focused more on optimization-based approaches to extractive document summarization [6]. Alguliev, Aliguliyev, and Hajirahimova presented a differential evolution algorithm to maximize three criteria simultaneously [6]. In [7], the summarization task was defined as a global inference problem and solved by an approximate dynamic programming approach. Wang and Li explored a weighted combination of several summarizers, where each weight was measured by a summarizer's agreement with the other members of the summarization systems [1].

The LSA-based method, which applies singular value decomposition (SVD) to generic text summarization, is also popular in the machine learning community. The pioneering work is by Gong and Liu, who treat each left singular vector as a 'topic' and use the matrix of right singular vectors to extract sentences [8]. Since Gong and Liu's method treats the top 'topics' as equally important, a summary may include sentences about 'topics' which are not particularly important [9]. Steinberger et al. suggested using the singular values to adjust the importance of the corresponding 'topics', and their experiments showed better performance [9]. Furthermore, Steinberger et al. also developed the Addition method for incorporating anaphoric information into an LSA matrix and showed that even imperfect anaphoric information can improve the performance of LSA-based multi-document summarization [10].

However, the LSA-based summarization methods described above have a significant disadvantage: they all evaluate and select each sentence individually, according to its projection on the latent singular vectors, without considering the sentences' interdependence. Therefore, the summary may include many inappropriate sentences.

In this paper, we present a new LSA-based sentence extraction summarizer which evaluates a set of summary sentences based on the similarity of its projection to that of the full sentence set on the top latent singular vectors. Experimental results show that the performance of our method is consistently better than that of two other LSA-based methods and comparable with that of two non-LSA-based methods.

The structure of the paper is as follows. Section II discusses our proposed LSA-based sentence extraction summarizer. Experimental results are shown and discussed in Section III. Finally, Section IV concludes the paper.
improve with some modifications to the length evaluation of a sentence [9][10]. Formally, the length of sentence $s_k$ is calculated by

$$ s_k = \sqrt{\sum_{i=1}^{r} v_{ki}^2 \, \sigma_i^2 } \qquad (2) $$

where $v_{ki}$ is the $k$-th entry of the $i$-th right singular vector and $\sigma_i$ is the $i$-th non-negative singular value.

B. Selecting Sentences by Their Capacity of Projection Similarity

The above LSA-based summarization methods are easy to implement, but they all evaluate and select sentences independently. However, it is well known that a set of individually important sentences does not necessarily make a good summary, since the selected sentences may be redundant with each other.

Thus, selecting the best summary is reduced to the following global optimization problem:

$$ \text{Minimize } Q(c) = \sum_{i=1}^{m} \sum_{j=1}^{r} (y_{ij} - \hat{y}_{ij})^2 \qquad (7) $$

subject to

$$ c_i \in \{0, 1\} \qquad (8) $$

$$ \sum_{i=1}^{n} c_i = K. \qquad (9) $$

Here, $K$ is the predefined number of sentences in the summary. The integrality constraint (8) on $c_i$ is automatically satisfied in the problem above.
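The sentence-length score of (2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's pipeline: the toy terms-by-sentences matrix `A` and the truncation `r = 2` are assumptions chosen only for demonstration.

```python
import numpy as np

def lsa_sentence_scores(A, r):
    """LSA sentence 'length' per (2): s_k = sqrt(sum_i v_ki^2 * sigma_i^2)
    over the top r latent 'topics'. A is a terms-by-sentences matrix."""
    # SVD: A = U * diag(sigma) * Vt; row i of Vt is the i-th right singular vector
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    r = min(r, len(sigma))
    return np.sqrt(((Vt[:r] ** 2) * (sigma[:r, None] ** 2)).sum(axis=0))

# toy terms-by-sentences matrix: 4 terms, 3 sentences (hypothetical counts)
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
scores = lsa_sentence_scores(A, r=2)
```

A useful sanity check on this formulation: with all singular triplets kept (r equal to the rank), the score of sentence k reduces to the Euclidean norm of its term column, since $A^{\top}A = V\Sigma^2V^{\top}$.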
C. The LSA-based forward sentence selection algorithm

Observe that (7)-(9) is actually a nonlinear integer programming model, and it can be solved by many optimization techniques such as dynamic programming, branch-and-bound, genetic algorithms, etc., but no single technique has been accepted as the best so far [6]. In this paper, a simple but efficient heuristic relying on local descent search is adopted to solve problem (7)-(9). Specifically, a sequential forward-selection algorithm is performed. Its pseudocode is as follows:

1. Initialize summary S to an empty set;
2. Initialize R to the full sentence set of a corpus of documents;
3. Construct the terms-by-sentences matrix A for the corpus of documents;
4. Perform the SVD on A to obtain the singular value matrix Σ, the left singular vector matrix U, and the right singular vector matrix V;
5. For l = 1 : K                % K iterations for K sentences
       h = number of sentences in set R;
       For j = 1 : h            % evaluate each sentence in R
           Take sentence j from set R and add it temporarily to set S;
           Calculate cost function (6) using all sentences in set S;
       End
       Add the sentence that leads to the minimum cost function (6) to set S;
       Eliminate the selected sentence from set R;
   End

The advantage of this algorithm is that it is simple and computationally efficient. To select K sentences from a set of n sentences of a corpus of documents, the runtime of this algorithm is O((1/2)(2n-K+1)K), since each iteration of the loop takes O(n) and the loop iterates K times. Note that if the algorithm is applied in unsupervised learning, the number of sentences to be selected cannot be determined in advance. In this circumstance, the number of sentences to select for the summary can be determined based on the value of cost function (6). The error reduction introduced by the summary is defined as

$$ \text{ERROR(summary)} = \frac{Q(c)}{\sum_{i=1}^{m} \sum_{j=1}^{r} y_{ij}^2} \times 100\% \qquad (10) $$

The criterion ERROR(summary) can be used to measure the performance of the summary and to determine the number K.

III. EXPERIMENTS

We test the summarization results empirically using the DUC2002 and DUC2004 data sets, both of which are open benchmark data sets from the Document Understanding Conference (DUC) for automatic summarization evaluation. The DUC2002 data set consists of 59 clusters, and each cluster includes about 10 documents. The DUC2004 data set consists of 50 clusters, and each cluster includes a fixed number of documents, 10. For each cluster in both data sets, reference summaries produced by humans are provided. The length of the reference summaries for DUC2002 is 200 words, while that for DUC2004 is 665 bytes. ROUGE is widely applied by DUC for performance evaluation of document summarization. It measures the quality of a summary by counting the unit overlaps between the candidate summary and a set of reference summaries. Let ref be a reference summary, se a single sentence, and n-gram a text unit; then ROUGE-N is an n-gram recall computed as follows:

$$ \text{ROUGE-N} = \frac{\sum_{se \in ref} \sum_{n\text{-gram} \in se} \text{Count}_{match}(n\text{-gram})}{\sum_{se \in ref} \sum_{n\text{-gram} \in se} \text{Count}(n\text{-gram})} \qquad (11) $$

where Count_match(n-gram) is the maximum number of n-grams co-occurring in a candidate summary and the reference summaries, and Count(n-gram) is the number of n-grams in the reference summaries. In this paper, we use ROUGE-1 and ROUGE-2 to evaluate our results.

To evaluate the effectiveness of our method, we conduct two sets of experiments. In the first experiment, we compare our method with the most closely related approaches, GLLSA (Gong and Liu's approach) [8] and SJLSA (Steinberger and Jezek's approach) [10], which are also based on latent semantic analysis. Table I shows the ROUGE scores of the three approaches on the two data sets. From the table, it can be seen that our method performs consistently better than the other two approaches. Since all three approaches use only purely lexical LSA features, the improvement of our method over the other two is attributed to our sentence evaluation criterion, which is calculated by considering the rest of the summary in its entirety. Meanwhile, the performance of SJLSA is also consistently better than that of GLLSA, which is consistent with the experiments performed by Steinberger et al. [9].

TABLE I. COMPARISON WITH GLLSA AND SJLSA

Data set   Methods      ROUGE-1   Rank   ROUGE-2   Rank
DUC2002    GLLSA        0.422     3      0.143     3
           SJLSA        0.435     2      0.152     2
           OUR METHOD   0.441     1      0.157     1
DUC2004    GLLSA        0.331     3      0.071     3
           SJLSA        0.352     2      0.074     2
           OUR METHOD   0.367     1      0.087     1
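The sequential forward-selection procedure of Section II-C can be sketched as follows. Since the cost function "(6)" falls in a part of the paper not reproduced here, the `cost` below is an assumed stand-in in the spirit of (7): the squared distance between the weighted term projections of the full sentence set and those of the current summary on the top-r right singular vectors. The matrix `A`, `K`, and `r` are toy assumptions for illustration only.

```python
import numpy as np

def forward_select(A, K, r=2):
    """Greedy forward selection of K sentence indices (Section II-C).
    Each round adds the sentence that minimizes a reconstruction-error
    cost; the exact cost '(6)' is not shown in this excerpt, so a
    hypothetical projection-similarity cost is used instead."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    P = sigma[:r, None] * Vt[:r]        # r x n weighted sentence projections
    target = P.sum(axis=1)              # projections of the full sentence set

    def cost(S):                        # stand-in for cost function "(6)"
        return float(((target - P[:, S].sum(axis=1)) ** 2).sum())

    S, R = [], list(range(A.shape[1]))
    for _ in range(K):                  # K iterations for K sentences
        j = min(R, key=lambda s: cost(S + [s]))  # best candidate this round
        S.append(j)                     # add it to the summary
        R.remove(j)                     # eliminate it from the pool
    return S

# toy terms-by-sentences matrix: 4 terms, 3 sentences (hypothetical counts)
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 0.],
              [0., 0., 1.]])
summary = forward_select(A, K=2)
```

As the paper notes, each of the K rounds scans the remaining pool once, which gives the O((1/2)(2n-K+1)K) runtime; by construction this stand-in cost reaches zero when every sentence is selected.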
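The ROUGE-N recall of (11) can be illustrated with a short sketch. This is a simplified reading over lowercase whitespace tokens, not the official ROUGE toolkit used by DUC; the clipped counting of `Count_match` is one common interpretation of "maximum number of co-occurring n-grams".

```python
from collections import Counter

def rouge_n(candidate, references, n=1):
    """Simplified ROUGE-N recall per (11): clipped n-gram overlap between a
    candidate summary and reference summaries, over reference n-gram count."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand = ngrams(candidate)
    match = total = 0
    for ref in references:
        rc = ngrams(ref)
        total += sum(rc.values())
        # clipped count of n-grams co-occurring in candidate and reference
        match += sum(min(c, cand[g]) for g, c in rc.items())
    return match / total if total else 0.0

score = rouge_n("the cat sat on the mat", ["the cat is on the mat"], n=1)
```

Here five of the six reference unigrams appear in the candidate (with clipping), so the toy call returns 5/6.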
In the second experiment, the centroid-based method MEAD and MMR (Maximal Marginal Relevance) are used to test our method [2][11]. The comparison results are displayed in Table II. From this table, it is observed that, although based only on the original term-sentence features, our method achieves quite good results compared with the MEAD approach, which uses both the term-sentence features and the position feature. Although on DUC2002 our method ranks 2nd on both measures among the compared approaches while MEAD ranks 1st on both, the difference between ours and MEAD's is not significant. Furthermore, our method and MEAD both perform consistently better than the MMR approach.

TABLE II. COMPARISON WITH MEAD AND MMR

Data set   Methods        ROUGE-1   Rank   ROUGE-2   Rank
DUC2002    MEAD           0.452     1      0.164     1
           MMR            0.416     3      0.153     3
           OUR APPROACH   0.441     2      0.157     2
DUC2004    MEAD           0.373     1      0.084     2
           MMR            0.352     3      0.071     3
           OUR APPROACH   0.367     2      0.087     1

IV. CONCLUSION

In this paper, we have shown a novel way of approaching the task of multi-document summarization by using latent semantic analysis. The performance of this approach on the DUC2002 and DUC2004 multi-document summarization tasks shows improvement over other LSA-based summarization algorithms in terms of the ROUGE-1 and ROUGE-2 recall measures. Furthermore, the performance of our method is comparable with that of other state-of-the-art multi-document summarization approaches.

ACKNOWLEDGMENT

This work is partially supported by grants from the Ministry of Education, Humanities and Social Science projects (Project No. 11YJA870024).

REFERENCES

[1] D. Wang and T. Li, "Weighted consensus multi-document summarization," Information Processing and Management, 2012, 48(3), pp. 513–523.
[2] D. Radev, H. Jing, M. Stys, and D. Tam, "Centroid-based summarization of multiple documents," Information Processing and Management, 2004, 40(6), pp. 919–938.
[3] G. Erkan and D. Radev, "LexPageRank: Prestige in multi-document text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004, pp. 365–371.
[4] J.-H. Lee, S. Park, C.-M. Ahn, and D. Kim, "Automatic generic document summarization based on non-negative matrix factorization," Information Processing and Management, 2009, 45(1), pp. 20–34.
[5] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen, "Document summarization using conditional random fields," in Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 2007, pp. 2862–2867.
[6] R. M. Alguliev, R. M. Aliguliyev, and M. S. Hajirahimova, "GenDocSum + MCLR: Generic document summarization based on maximum coverage and less redundancy," Expert Systems with Applications, 2012, 39(16), pp. 12460–12473.
[7] R. McDonald, "A study of global inference algorithms in multi-document summarization," in Proceedings of the 29th European Conference on IR Research, 2007, vol. 4425, pp. 557–564, Springer-Verlag, LNCS.
[8] Y. Gong and X. Liu, "Generic text summarization using relevance measure and latent semantic analysis," in Proceedings of ACM SIGIR, New Orleans, US, 2002, pp. 19–25.
[9] J. Steinberger, M. Poesio, M. A. Kabadjov, and K. Jezek, "Two uses of anaphora resolution in summarization," Information Processing & Management, vol. 43, pp. 1663–1680, November 2007.
[10] J. Steinberger and K. Jezek, "Using latent semantic analysis in text summarization and summary evaluation," in Proceedings of the 7th International Conference ISIM, 2004, pp. 93–100.
[11] J. Carbonell and J. Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998, pp. 335–336.