
2014 Seventh International Symposium on Computational Intelligence and Design

A New Approach for Multi-Document Summarization Based on Latent Semantic Analysis

Shuchu Xiong
College of Computer and Information, Hunan University of Commerce, Changsha, China
xiongshuchu@126.com

Yihui Luo
College of Computer and Information, Hunan University of Commerce, Changsha, China
Yihuiluo217@163.com

Abstract—Multi-document summarization plays an increasingly important role as documents grow exponentially on the web. Among the many traditional multi-document summarization techniques, latent semantic analysis (LSA) is unique due to its use of latent semantic information instead of the original features, which results in better performance. However, because existing LSA-based approaches evaluate and select sentences individually, none of them is able to remove redundant sentences. In this paper, we propose a new method that evaluates a sentence subset based on its capacity to reproduce term projections on the right singular vectors. Experiments on the DUC2002 and DUC2004 datasets validate the effectiveness of the proposed method.

Keywords-multi-document summarization; latent semantic analysis; singular value decomposition; forward selection

I. INTRODUCTION

With the exponential growth of electronic documents on the web, there has recently been great interest in the multi-document summarization problem. Many methods have been proposed for finding a concise form of a corpus of documents. Generally, they can be classified into extractive summarization and abstractive summarization [1]. In this paper, we focus on extractive multi-document summarization.

In the literature, extractive multi-document summarization techniques have been widely studied. Classic approaches include centroid-based, graph-based, NMF (non-negative matrix factorization)-based, and CRF (conditional random fields)-based methods [1]. MEAD is a centroid-based method: it uses the centroid value, positional value, and first-sentence overlap to extract sentences and form a summary [2]. LexPageRank first constructs a sentence graph and then computes sentence importance based on the concept of eigenvector centrality [3]. Lee et al. proposed a framework based on the non-negative matrix factorization (NMF) of the sentence-term matrix to extract sentences with high weighted scores [4]. CRF-based summarization splits the input document into a sentence sequence and evaluates each sentence by a CRF to represent its importance [5].

Recently, researchers have focused more on optimization-based approaches to extractive document summarization [6]. Alguliev, Aliguliyev, and Hajirahimova presented a differential evolution algorithm to maximize three criteria simultaneously [6]. In [7], the summarization task was defined as a global inference problem and solved by an approximate dynamic programming approach. Wang and Li explored a weighted combination of several summarizers, where each weight measures a system's agreement with the other members of the ensemble [1].

The LSA-based method, which applies the singular value decomposition (SVD) to generic text summarization, is also popular in the machine learning community. The pioneering work is by Gong and Liu, who treat each left singular vector as a 'topic' and use the matrix of right singular vectors to extract sentences [8]. Since Gong and Liu's method treats the top 'topics' as equally important, a summary may include sentences about 'topics' that are not particularly important [9]. Steinberger et al. suggested using the singular values to adjust the importance of the corresponding 'topics', and their experiments showed better performance [9]. Furthermore, Steinberger et al. also developed the Addition method for incorporating anaphoric information into the LSA matrix and showed that even imperfect anaphoric information can improve the performance of LSA-based multi-document summarization [10].

However, the LSA-based summarization methods described above share a significant disadvantage: they all evaluate and select each sentence individually according to its projection on the latent singular vectors, without considering the dependence among sentences. Therefore, the summary may include many inappropriate sentences.

In this paper, we present a new LSA-based sentence extraction summarizer that evaluates a set of summary sentences based on the similarity of its projections to those of the full sentence set on the top latent singular vectors. Experimental results show that the performance of our method is consistently better than that of two other LSA-based methods and comparable with that of two non-LSA-based methods.

The structure of the paper is as follows. Section 2 presents our LSA-based sentence extraction summarizer. Experimental results are shown and discussed in Section 3. Finally, Section 4 concludes the paper.

II. SELECTING A SUMMARY BASED ON PROJECTION SIMILARITY

A. LSA-based Summarization

Inspired by latent semantic indexing, Gong and Liu applied the singular value decomposition (SVD) to generic text summarization [8]. Specifically, a document D is split into sentences D = {s_1, s_2, ..., s_n}, where s_i is the i-th sentence in the document, and is represented as a term-by-sentence matrix A = [A_1, A_2, ..., A_n], with each column vector A_i being the weighted term-frequency vector of sentence i in the document under consideration. For an m \times n matrix A, where quite often m \geq n, the SVD of A is defined as

    A = U \Sigma V^T                                                    (1)

where U = [U_1, U_2, ..., U_n] with U_i = [u_{1i}, u_{2i}, ..., u_{mi}]^T is the matrix of left singular vectors; \Sigma = diag(\sigma_1, \sigma_2, ..., \sigma_n) is an n \times n diagonal matrix whose diagonal elements are the non-negative singular values sorted in descending order; and V = [V_1, V_2, ..., V_n] with V_i = [v_{1i}, v_{2i}, ..., v_{ni}]^T is the matrix of right singular vectors. If only the first r largest singular values in \Sigma are kept and the remaining smaller ones are removed, then U becomes an m \times r matrix, \Sigma an r \times r matrix, and V^T an r \times n matrix.

As shown in (1), the SVD derives the latent semantic structure of the document represented by matrix A. In particular, a left singular vector U_i expresses the i-th main 'topic' of the document, the magnitude of a singular value in matrix \Sigma indicates the degree of importance of the corresponding 'topic' within the document, and the item v_{ji} in the vector V_i indicates the relative importance of sentence j for 'topic' i.

According to this semantic structure, Gong and Liu's method simply chooses, for each 'topic', the most important sentence independently: if v_{ji} is the largest item in vector V_i, then sentence j becomes the i-th sentence chosen for the summary. Steinberger et al. follow the same scheme as Gong and Liu, but improve it by modifying how a sentence is scored: each sentence s_k is scored by its length in the latent topic space [9][10]. Formally, the score of sentence s_k is calculated as

    s_k = \sqrt{ \sum_{i=1}^{r} v_{ki}^2 \, \sigma_i^2 }                (2)

where v_{ki} is an item of the i-th right singular vector and \sigma_i is the i-th non-negative singular value.
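To make the two baseline selection schemes concrete, the following sketch (our illustration, not code from the paper; the toy matrix and all variable names are hypothetical) applies Gong and Liu's per-topic pick and the score of (2) to a small term-by-sentence matrix with numpy:

    import numpy as np

    # Toy weighted term-frequency matrix A (m = 4 terms x n = 4 sentences);
    # in practice A is built from the documents, e.g. with tf or tf-idf weights.
    A = np.array([[2.0, 0.0, 1.0, 0.0],
                  [0.0, 1.0, 0.0, 2.0],
                  [1.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 2.0, 1.0]])

    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    r = 2                                    # keep the r largest 'topics'
    Vt_r, sigma_r = Vt[:r], sigma[:r]

    # Gong & Liu: for each topic i, pick the sentence j with the largest |v_ji|.
    gl_picks = [int(np.argmax(np.abs(Vt_r[i]))) for i in range(r)]

    # Steinberger & Jezek, eq. (2): s_k = sqrt(sum_i v_ki^2 * sigma_i^2).
    sj_scores = np.sqrt(((Vt_r ** 2) * (sigma_r ** 2)[:, None]).sum(axis=0))

    print("Gong & Liu picks:", gl_picks)        # one sentence index per topic
    print("Steinberger & Jezek scores:", sj_scores)

Taking absolute values in the per-topic pick hedges against the sign indeterminacy of singular vectors; the paper itself does not discuss signs.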
B. Selecting Sentences by Their Capacity for Projection Similarity

The LSA-based summarization methods above are easy to implement, but they all evaluate and select sentences independently. However, it is well known that a set of individually important sentences does not guarantee a collectively good summary. In the present study, we instead choose a subset of sentences from a set of documents and evaluate its significance collectively, based on its capacity to reproduce the term projections on the right singular vectors. The idea of this summary evaluation is as follows.

Note that one of the appealing properties of latent semantic analysis is its duality: from a mathematical point of view, the singular value decomposition can be considered a linear transform that maps terms (sentences) from the original measurement space to a new space spanned by the right singular vectors (left singular vectors).

From (1), with simple algebra, the representation y_i of term vector x_i of A in the new space spanned by the right singular vectors is

    y_i = \Sigma^{-1} V^T x_i                                           (3)

where x_i = [x_{i1}, x_{i2}, ..., x_{in}]^T, y_i = [y_{i1}, y_{i2}, ..., y_{ir}]^T, and x_{ij} is the weighted term frequency of term i in sentence j. The new variables y_{ij}, j = 1, 2, ..., r, are the values of the projection of term i on the j-th right singular vector. Let c = (c_1, c_2, ..., c_n)^t be a vector of indicator variables, where c_i = 1 if the i-th sentence in the corpus of documents is selected into the summary, and 0 otherwise; the superscript t denotes the transpose of a vector. Let \hat{y}_{ij} denote the projection produced by the summary sentences only. Following (3), \hat{y}_{ij} and y_{ij} are given by (4) and (5), respectively:

    \hat{y}_{ij} = \sum_{k=1}^{n} q_{jk} x_{ik} c_k                     (4)

    y_{ij} = \sum_{k=1}^{n} q_{jk} x_{ik}                               (5)

where q_{jk} is the (j, k) item of the matrix product \Sigma^{-1} V^T. Therefore, the capacity of a summary to reproduce the term projections on the right singular vectors can be evaluated by the following cost function:

    Q(c) = \sum_{i=1}^{m} \sum_{j=1}^{r} ( y_{ij} - \hat{y}_{ij} )^2    (6)

Thus, selecting the best summary reduces to the following global optimization problem:

    Minimize  Q(c) = \sum_{i=1}^{m} \sum_{j=1}^{r} ( y_{ij} - \hat{y}_{ij} )^2    (7)

    subject to  c_i \in \{0, 1\}                                        (8)

                \sum_{i=1}^{n} c_i = K                                  (9)

Here, K is the predefined number of sentences in the summary. The integrality constraint (8) on c_i is automatically satisfied in the problem above.
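The cost (6) is straightforward to compute. The sketch below (our illustration; summary_cost is a hypothetical name, not from the paper) evaluates Q(c) for an indicator vector c given the term-by-sentence matrix A:

    import numpy as np

    def summary_cost(A, c, r):
        """Q(c) of eq. (6): squared error between the term projections y_ij
        of the full sentence set and the projections y_hat_ij reproduced by
        the sentences selected by the 0/1 indicator vector c."""
        U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
        Q_mat = np.diag(1.0 / sigma[:r]) @ Vt[:r]   # q_jk = (Sigma^-1 V^T)_jk
        Y = A @ Q_mat.T            # y_ij: rows are projections of term vectors x_i
        Y_hat = (A * c) @ Q_mat.T  # y_hat_ij: unselected sentence columns zeroed
        return float(((Y - Y_hat) ** 2).sum())

For example, summary_cost(A, np.array([1.0, 0.0, 1.0, 0.0]), r=2) scores the two-sentence subset {s_1, s_3} of the toy matrix above; lower is better.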

C. The LSA-based Forward Sentence Selection Algorithm

Observe that (7)-(9) is a nonlinear integer programming model. It can be solved by many optimization techniques, such as dynamic programming, branch-and-bound, and genetic algorithms, but no single technique has been accepted as the best so far [6]. In this paper, a simple but efficient heuristic relying on local descent search is adopted to solve problem (7)-(9). Specifically, a sequential forward-selection algorithm is performed. Its pseudocode is as follows (a runnable sketch is given after the listing):

1. Initialize the summary S to an empty set.
2. Initialize R to the full sentence set of the corpus of documents.
3. Construct the term-by-sentence matrix A for the corpus of documents.
4. Perform the SVD on A to obtain the singular value matrix \Sigma, the left singular vector matrix U, and the right singular vector matrix V.
5. For l = 1 : K                     % K iterations for K sentences
       h = number of sentences in set R;
       For j = 1 : h                 % evaluate each sentence in R
           Take sentence j from set R and add it temporarily to set S;
           Calculate cost function (6) using all sentences in set S;
       End
       Add the sentence that leads to the minimum cost function (6) to set S;
       Eliminate the selected sentence from set R;
   End
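A possible numpy realization of this loop (again a sketch under our own naming; it reuses the hypothetical summary_cost from Section II.B):

    import numpy as np

    def forward_select(A, K, r):
        """Greedily pick K sentence indices (columns of A) minimizing Q(c)."""
        n = A.shape[1]
        c = np.zeros(n)                        # indicator vector, eqs. (8)-(9)
        selected = []
        for _ in range(K):                     # K passes, one pick per pass
            best_j, best_cost = None, np.inf
            for j in range(n):
                if c[j] == 1.0:                # sentence already in the summary
                    continue
                c[j] = 1.0                     # tentatively add sentence j
                cost = summary_cost(A, c, r)   # cost function (6)
                c[j] = 0.0
                if cost < best_cost:
                    best_j, best_cost = j, cost
            c[best_j] = 1.0                    # commit the best candidate
            selected.append(best_j)
        return selected

As written, summary_cost recomputes the SVD on every call; a practical implementation would factor A once and pass the projection matrix in, which is what makes the O(n) per-candidate cost assumed below plausible.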
The advantage of this algorithm is that it is simple and computationally efficient. To select K sentences from the n sentences of a corpus of documents, the runtime of the algorithm is O((1/2)(2n - K + 1)K), since each pass of the inner loop takes O(n) and the outer loop iterates K times.
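This bound can be checked directly by counting cost evaluations: pass l scans the n - l + 1 sentences still in R, so the total is

    \sum_{l=1}^{K} (n - l + 1) \;=\; Kn - \frac{K(K-1)}{2} \;=\; \frac{1}{2}(2n - K + 1)K .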
Note that if the algorithm is applied in an unsupervised setting, the number of sentences to be selected cannot be determined in advance. In this circumstance, the number of sentences in the summary can be determined from the value of the cost function (6). The relative error introduced by the summary is defined as

    ERROR(summary) = \frac{Q(c)}{\sum_{i=1}^{m} \sum_{j=1}^{r} y_{ij}^2} \times 100\%    (10)

The criterion ERROR(summary) can be used to measure the performance of the summary and to determine the number K.

III. EXPERIMENTS

We test the summarization results empirically on the DUC2002 and DUC2004 data sets, both of which are open benchmark data sets from the Document Understanding Conference (DUC) for automatic summarization evaluation. The DUC2002 data set consists of 59 clusters, each including about 10 documents. The DUC2004 data set consists of 50 clusters, each including a fixed number of documents, namely 10. For each cluster in both data sets, human-produced reference summaries are provided. The length limit of the reference summaries is 200 words for DUC2002 and 665 bytes for DUC2004.

ROUGE is widely applied by DUC for the performance evaluation of document summaries. It measures the quality of a summary by counting the unit overlaps between the candidate summary and a set of reference summaries. Let ref be a reference summary, se a single sentence, and n-gram a text unit; ROUGE-N is then an n-gram recall computed as

    ROUGE-N = \frac{\sum_{se \in ref} \sum_{n\text{-gram} \in se} Count_{match}(n\text{-gram})}
                   {\sum_{se \in ref} \sum_{n\text{-gram} \in se} Count(n\text{-gram})}    (11)

where Count_{match}(n-gram) is the maximum number of n-grams co-occurring in the candidate summary and the reference summaries, and Count(n-gram) is the number of n-grams in the reference summaries. In this paper, we use ROUGE-1 and ROUGE-2 to evaluate our results.
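As an illustration of (11), the sketch below computes the recall for a single reference summary (our code, not the official ROUGE toolkit used for the reported figures):

    from collections import Counter

    def ngrams(tokens, n):
        """All n-grams of a token list, with multiplicity."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n(candidate, reference, n):
        """ROUGE-N recall, eq. (11): clipped n-gram matches over reference count."""
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        match = sum(min(count, cand[g]) for g, count in ref.items())
        total = sum(ref.values())
        return match / total if total else 0.0

    # e.g. rouge_n("the cat sat".split(), "the cat sat down".split(), 1) == 0.75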
To evaluate the effectiveness of our method, we conduct two sets of experiments. In the first experiment, we compare our method with the most closely related approaches, GLLSA (Gong and Liu's approach) [8] and SJLSA (Steinberger and Jezek's approach) [10], which are also based on latent semantic analysis. Table I shows the ROUGE scores of the three approaches on the two data sets. From the table, it can be seen that our method performs consistently better than the other two approaches. Since all three approaches use purely lexical LSA features, the improvement of our method over the other two is attributed to our sentence evaluation criterion, which considers the rest of the summary in its entirety. Meanwhile, the performance of SJLSA is also consistently better than that of GLLSA, which is consistent with the experiments performed by Steinberger et al. [9].

TABLE I. COMPARISON WITH GLLSA AND SJLSA

    Data set | Method      | ROUGE-1 | Rank | ROUGE-2 | Rank
    DUC2002  | GLLSA       | 0.422   | 3    | 0.143   | 3
    DUC2002  | SJLSA       | 0.435   | 2    | 0.152   | 2
    DUC2002  | OUR METHOD  | 0.441   | 1    | 0.157   | 1
    DUC2004  | GLLSA       | 0.331   | 3    | 0.071   | 3
    DUC2004  | SJLSA       | 0.352   | 2    | 0.074   | 2
    DUC2004  | OUR METHOD  | 0.367   | 1    | 0.087   | 1
In the second experiment, the centroid-based method MEAD and MMR (Maximal Marginal Relevance) are used to test our method [2][11]. The results of the comparison are displayed in Table II. From this table, it is observed that, although it is based only on the original term-sentence features, our method achieves quite good results compared with the MEAD approach, which uses both the term-sentence features and the position feature. Although on DUC2002 our method ranks second on both measures among the compared approaches while MEAD ranks first on both, the difference between our results and MEAD's is not significant. Furthermore, our method and MEAD both perform consistently better than the MMR approach.

TABLE II. COMPARISON WITH MEAD AND MMR

    Data set | Method        | ROUGE-1 | Rank | ROUGE-2 | Rank
    DUC2002  | MEAD          | 0.452   | 1    | 0.164   | 1
    DUC2002  | MMR           | 0.416   | 3    | 0.153   | 3
    DUC2002  | OUR APPROACH  | 0.441   | 2    | 0.157   | 2
    DUC2004  | MEAD          | 0.373   | 1    | 0.084   | 2
    DUC2004  | MMR           | 0.352   | 3    | 0.071   | 3
    DUC2004  | OUR APPROACH  | 0.367   | 2    | 0.087   | 1

IV. CONCLUSION

In this paper, we have presented a novel way of approaching the task of multi-document summarization using latent semantic analysis. The performance of this approach on the DUC2002 and DUC2004 multi-document summarization tasks shows improvement over other LSA-based summarization algorithms in terms of the ROUGE-1 and ROUGE-2 recall measures. Furthermore, the performance of our method is comparable with that of other state-of-the-art multi-document summarization approaches.

ACKNOWLEDGMENT

This work is partially supported by grants from the Ministry of Education, Humanities and Social Science Projects (Project No. 11YJA870024).

REFERENCES

[1] D. Wang and T. Li, "Weighted consensus multi-document summarization," Information Processing and Management, 2012, 48(3), pp. 513-523.
[2] D. Radev, H. Jing, M. Stys, and D. Tam, "Centroid-based summarization of multiple documents," Information Processing and Management, 2004, 40(6), pp. 919-938.
[3] G. Erkan and D. Radev, "LexPageRank: Prestige in multi-document text summarization," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004, pp. 365-371.
[4] J.-H. Lee, S. Park, C.-M. Ahn, and D. Kim, "Automatic generic document summarization based on non-negative matrix factorization," Information Processing and Management, 2009, 45(1), pp. 20-34.
[5] D. Shen, J.-T. Sun, H. Li, Q. Yang, and Z. Chen, "Document summarization using conditional random fields," in Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 2007, pp. 2862-2867.
[6] R. M. Alguliev, R. M. Aliguliyev, and M. S. Hajirahimova, "GenDocSum + MCLR: Generic document summarization based on maximum coverage and less redundancy," Expert Systems with Applications, 2012, 39(16), pp. 12460-12473.
[7] R. McDonald, "A study of global inference algorithms in multi-document summarization," in Proceedings of the 29th European Conference on IR Research, 2007, vol. 4425, pp. 557-564, Springer-Verlag, LNCS.
[8] Y. Gong and X. Liu, "Generic text summarization using relevance measure and latent semantic analysis," in Proceedings of ACM SIGIR, New Orleans, US, 2002, pp. 19-25.
[9] J. Steinberger, M. Poesio, M. A. Kabadjov, and K. Jezek, "Two uses of anaphora resolution in summarization," Information Processing & Management, vol. 43, pp. 1663-1680, November 2007.
[10] J. Steinberger and K. Jezek, "Using latent semantic analysis in text summarization and summary evaluation," in Proceedings of the 7th International Conference ISIM, 2004, pp. 93-100.
[11] J. Carbonell and J. Goldstein, "The use of MMR, diversity-based reranking for reordering documents and producing summaries," in Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 1998, pp. 335-336.

