
Extractive Automatic Text Summarization through Lexical Chain Method using WordNet Dictionary

Nimisha Dheer
MTech Scholar
Department of Computer Science Engineering,
Kautilya Institute of Technology & Engineering, Jaipur, India
Email: nimishadheer@gmail.com

Mr. Chetan Kumar
Associate Professor
Department of Computer Science Engineering,
Kautilya Institute of Technology & Engineering, Jaipur, India
Email: chetanmnit@yahoo.in

Abstract- The current technology of automatic text summarization plays an important role in information retrieval (IR) and text classification, and it offers a practical response to the information overload problem. Text summarization is the process of reducing the size of a text while preserving its information content. Considering the size and number of documents available on the Internet and from other sources, the need for a highly efficient tool that produces usable summaries is clear. We present an improved algorithm based on lexical chain computation and WordNet. The algorithm builds lexical chains in a way that is computationally feasible for the user. Using these lexical chains, the user can generate a summary that is more effective than the available solutions and closer to a human-generated summary.

Keywords: Lexical Chains, Summary Generation, Text Summarization, WordNet

I. INTRODUCTION

A summary may be defined as a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s). Text summarization [1] is the process of distilling the most important information from a source (or sources) to produce a short version for a particular user (or users) and task (or tasks).

When summarization is done by a computer, i.e. automatically, it is called automatic text summarization. Summarization can be applied to a single document or to multiple documents, and the source documents may be in a single language (monolingual) or in several languages (trans-lingual or multilingual).

A. Classification of Text Summarization:

Text summarization strategies are commonly classified into extractive and abstractive summarization [2] [4]. An extractive summarization technique consists of selecting important sentences, paragraphs, etc. from the original document and concatenating them into a shorter form. The importance of sentences is determined from statistical and linguistic characteristics [2] of the sentences. This approach is conceptually simple and easy to implement.

An abstractive summarization [2] attempts to develop an understanding of the main concepts in a document and then to express those ideas in clear natural language. It uses linguistic methods to examine and interpret the text, and to find new concepts and expressions that best describe it, generating a new, shorter text that conveys the most important information from the original document.

Extractive text summarization [2] methods are typically divided into two steps:
1. Pre-processing step and
2. Processing step.

Fig 1: Classification of Text Summarization

Pre-processing [2] produces a structured representation of the original text. It usually includes:

a. Sentence boundary identification: In English, a sentence boundary is identified by the presence of a dot (full stop) at the end of the sentence.
b. Stop-word elimination: Common words that carry no semantics and contribute no relevant information to the task are eliminated.
c. Stemming: The purpose of stemming is to obtain the stem or base form of every word, which emphasizes its semantics.

In the processing step, the features influencing the relevance of sentences are determined and calculated, and weights are assigned to these features using a weight-learning technique. The final score of every sentence is computed using the feature-weight equation, and the top-ranked sentences are selected for the final summary. (A minimal sketch of the pre-processing step is given below.)
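To make the pre-processing step concrete, the following is a minimal Java sketch, not an implementation from the literature: the stop-word list is abbreviated, the sentence splitter handles only the full-stop rule described above, and the suffix-stripping stemmer is a crude stand-in for a real stemming algorithm [8].

import java.util.*;

public class PreProcessor {
    // Abbreviated stop-word list; a real system would load a full list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

    // Sentence boundary identification: split on a dot followed by whitespace.
    public static String[] splitSentences(String text) {
        return text.trim().split("\\.\\s+");
    }

    // Stop-word elimination followed by a crude suffix-stripping "stemmer".
    public static List<String> preprocess(String sentence) {
        List<String> out = new ArrayList<>();
        for (String token : sentence.toLowerCase().split("\\W+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            out.add(stem(token));
        }
        return out;
    }

    // Very rough stand-in for stemming: strips a few common English suffixes.
    private static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 5) return word.substring(0, word.length() - 3);
        if (word.endsWith("es") && word.length() > 4)  return word.substring(0, word.length() - 2);
        if (word.endsWith("s")  && word.length() > 3)  return word.substring(0, word.length() - 1);
        return word;
    }

    public static void main(String[] args) {
        for (String s : splitSentences("The dogs were running in the park. A dog barked.")) {
            System.out.println(preprocess(s));
        }
    }
}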

II. RELATED WORK

Text summarization [1] is the process of creating a condensed version of an original document. This condensed version should retain the vital content of the original document. Research has been carried out for many years on producing coherent and indicative summaries [3] using different techniques. Text summarization can be described as a two-step process:
1. Building a source representation from the original document.
2. Generating a summary from the source representation.

Text summarization is generally classified into two types: single-document summarization [8] and multi-document summarization [4]. Text summarization is also classified into extractive and abstractive depending on the nature of the text representation in the summary, as detailed in the previous section. We use extractive summarization [12] in our proposed work.

Automatic text summarization (ATS) [3] is the process of reducing a text document with a computer program so as to form a summary that retains the main or important points of the original document. As the problem of information overload has grown, and as the amount of data has risen, so has interest in automatic summarization [3].

Barzilay & Elhadad [5] gave an improved algorithm that constructs all possible interpretations of the source text using lexical chains. It is an efficient method for text summarization, as lexical chains identify and capture the important concepts of a document without performing deep semantic analysis. Lexical chains are built using a knowledge base that contains nouns and their various associations.

They [5] used WordNet [6] to produce domain-specific extractive summaries using lexical chains built over the nouns in the document. Their algorithm segments the given content into sentences and then into tokens. These tokens are labeled using a POS tagger [7]. The algorithm then attempts to merge the word senses into all of the existing chains in every possible way, hence building every possible interpretation of the segment. Next, it merges chains between segments that share a word in the same sense. The algorithm then calculates the score of the lexical chains, determines the strongest chain, and uses it to generate a summary. They also used the idea of correlation between sentences to obtain a good-quality summary. The terms that occur in the strongest lexical chains are treated as key terms, and the score of each sentence is calculated from the presence of key terms in it. All the sentences are ranked by their score, and the top n sentences are chosen for inclusion in the summary. Then the correlation of sentences is checked, and if any sentence is correlated with the previous sentence, the previous sentence must also be included in the summary, subject to this condition.

Fig 2: Architecture of Text Summarization

Lexical chains can be distributed over sentences and different parts of a text. Words may be grouped into the same lexical chain when [10]:

• Two noun instances are identical and are used in the same sense.
• Two noun instances are used in the same sense (i.e., synonyms).
• The senses of two noun instances have a hypernym/hyponym relation between them.
• The senses of two noun instances are siblings in the hypernym/hyponym tree.

Barzilay and Elhadad first used lexical chains [5] as an intermediate step in the text summarization process, to extract important concepts from a document. They showed that cohesion is one of the surface signs of discourse structure and that lexical chains can be used to identify it. They relied on WordNet [6] to provide sense possibilities for word instances as well as the semantic relations among them. Senses in the WordNet database are represented relationally by synonym sets (synsets), each of which is the collection of all the words sharing a common sense. Words of the same category are linked through semantic relations like synonymy and hyponymy. Lexical chains were constructed in three steps (a sketch of this procedure is given after the list):

1. Select a set of candidate words.
2. For each candidate word, find an appropriate chain according to a relatedness criterion.
3. If such a chain is found, insert the word into the chain and update the chain.

Once the chains were constructed, they showed that picking the concepts represented by the strong lexical chains gives a far better understanding of the central topic of a text than picking only the most frequent words in the text. Finally, they used these strong chains to extract sentences from the original text to construct the summary. After Barzilay [9] and Elhadad, many researchers followed this approach of using lexical chains in text summarization. Silber and McCoy proposed a new algorithm to compute lexical chains that was based on the Barzilay-Elhadad method but was linear in space and time. Since the method proposed in [5] had exponential complexity, it was hard to compute lexical chains for large documents.
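The three-step construction can be made concrete with the short Java sketch below. It is a simplified illustration, not Barzilay and Elhadad's implementation: the relatedness criterion is reduced to a hypothetical related(a, b) test, which a real system would answer from WordNet using the grouping rules listed above, and word-sense disambiguation is ignored.

import java.util.*;

public class ChainBuilder {
    // Hypothetical relatedness test; a real implementation would consult
    // WordNet for identity, synonymy, or hypernym/hyponym links (see the
    // grouping rules above). Here, two words are "related" only if equal.
    static boolean related(String a, String b) {
        return a.equals(b);
    }

    // Greedy three-step construction: for each candidate word, find an
    // existing chain containing a related word; otherwise start a new chain.
    static List<List<String>> buildChains(List<String> candidates) {
        List<List<String>> chains = new ArrayList<>();
        for (String word : candidates) {
            List<String> target = null;
            search:
            for (List<String> chain : chains) {
                for (String member : chain) {
                    if (related(word, member)) { target = chain; break search; }
                }
            }
            if (target == null) {           // no related chain: open a new one
                target = new ArrayList<>();
                chains.add(target);
            }
            target.add(word);               // insert the word, updating the chain
        }
        return chains;
    }

    public static void main(String[] args) {
        List<String> nouns = Arrays.asList("machine", "person", "machine", "person", "machine");
        System.out.println(buildChains(nouns)); // [[machine, machine, machine], [person, person]]
    }
}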

To make chain computation feasible for such documents, Silber and McCoy recompiled the WordNet noun database into a binary format and memory-mapped it. They then created "meta-chains" that represent every possible interpretation of the text. These "meta-chains" were used to disambiguate word senses and to create the lexical chains. Since WordNet had been recompiled into a new format, it could be accessed as a large array, which allowed the algorithm to compute the lexical chains in linear time. After the chains were computed, the strong chains were selected and the summary sentences were extracted.

The growth of the Web has produced huge amounts of information that have become harder to access efficiently. Web users need tools to manage this immense amount of information. The main goal of that research [10] was to build an efficient and effective tool able to summarize large documents quickly. It presents a linear-time algorithm for finding lexical chains, which is a technique for capturing the "aboutness" of a document. The technique is compared with previous, less efficient methods of lexical chain extraction, and alternative methods for extracting and scoring lexical chains are also given. The authors show that their technique gives results similar to previous work but is considerably more efficient. This efficiency matters in web search applications, where several large documents may have to be summarized at once and where the response time to the end user is critical. The initial part of their implementation constructs an array of "meta-chains" [10]. Every meta-chain contains a score and a data structure that encapsulates the chain; the score is computed as each word is inserted into the chain, while the implementation builds a flat representation of the source text. The "best" interpretation is the one for which the overall "score" [10] of all the meta-chains is largest.

We have also studied [11] an algorithm that finds lexical chains in a text by merging several robust knowledge sources: the WordNet thesaurus, a part-of-speech tagger, a shallow parser for the identification of nominal groups, and a segmentation algorithm. Summarization is carried out in four steps: the text is segmented, lexical chains are constructed, strong chains are identified, and significant sentences are extracted.

Lexical chains [8] are created via WordNet. The score of every lexical chain is calculated based on keyword strength and other features. The main idea of using lexical chains is that they help analyze the document semantically, while the notion of correlation of sentences captures the relation of a sentence to its preceding or succeeding sentence. This improves the quality of the generated summary [3]. This [3] is our base paper.

III. PROPOSED MODEL

We propose a model for automatic text summarization using lexical chains and the WordNet dictionary. It proceeds in the following steps:

Step 1: Input the document for which the summary is to be generated (.txt file).
Step 2: Segmentation [13].
Step 3: Tokenization [14].
Step 4: POS tagging [15].
Step 5: Filtering of nouns and compound nouns.
Step 6: Look up the word senses of nouns and compound nouns in the WordNet dictionary.
Step 7: Collection of candidate words.
Step 8: Check the synonyms, hypernyms and antonyms using the WordNet dictionary.
Step 9: Build the lexical chains of candidate words.
Step 10: Score the chains.
Step 11: Generate the summary.
Step 12: Evaluate the value of recall.

Formulas Applied in the Solution:

1: Significance of the lexical chain:
For each chain, the significance is

Sig(L) = -(length(L) / Σ_l length(l)) · log2(length(L) / Σ_l length(l))   (1)

where length(L) is the length of a particular chain L and Σ_l length(l) is the sum of the lengths of all chains in the text. In this way each chain has its significance.

2: Utility of the lexical chain:
The utility of a lexical chain L to a document D is defined as

Utility(L, D) = Sig(L) × length(L)   (2)

3: Computing the threshold value:
This formula is used for selecting the accepted chains:

Threshold = (Σ_L Utility(L, D)) / (2 × number of chains)   (3)

4: Computing the recall:
This formula is used to calculate the percentage of match between the human-generated summary and the summary generated by our algorithm:

Recall = Total Words Matched in Human Summary / Total Words   (4)

5: Computing the time to generate the summary:
Here we take the start time, which is the time the process started, and the end time, when the process ended after generating the recall and the summary:

Total_Seconds = End_Time - Start_Time   (5)
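Formulas (1)-(3) mirror the computation in Algo 6 below; the following self-contained Java sketch shows them on two illustrative chains of lengths 5 and 3.

public class ChainScoring {
    public static void main(String[] args) {
        int[] chainLengths = {5, 3};            // illustrative chain lengths
        int sum = 0;
        for (int len : chainLengths) sum += len;

        double[] sig = new double[chainLengths.length];
        double[] utility = new double[chainLengths.length];
        double sumUtility = 0;
        for (int i = 0; i < chainLengths.length; i++) {
            double p = chainLengths[i] / (double) sum;
            sig[i] = -p * (Math.log(p) / Math.log(2.0));    // formula (1)
            utility[i] = sig[i] * chainLengths[i];          // formula (2)
            sumUtility += utility[i];
        }
        double threshold = sumUtility / (2.0 * chainLengths.length);  // formula (3)

        for (int i = 0; i < chainLengths.length; i++) {
            System.out.printf("Chain %d: sig=%.3f utility=%.3f accepted=%b%n",
                    i + 1, sig[i], utility[i], utility[i] > threshold);
        }
        System.out.printf("Threshold = %.3f%n", threshold);
    }
}

For these inputs the chain proportions are 0.625 and 0.375, giving significances of roughly 0.424 and 0.531, utilities of roughly 2.12 and 1.59, and a threshold of roughly 0.93, so both chains are accepted.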

Page 751 of 1357


1339
2nd IEEE International Conference on Engineering and Technology (ICETECH), 17th & 18th March 2016, Coimbatore, TN, India.
Algo 1: Algorithm of Summary Generation

SummaryGenerate(fname, per2, spath)
[The function SummaryGenerate will generate the summary and find the recall and the time taken. fname is the name of the file which contains the original text, per2 is the percentage of the summary and spath is the path of the human-generated summary]
1. Set starttime := current_system_time, nlines := 0, br := openfile(fname)
2. While ((strLine = br.readLine()) != NULL) {
3. Set slines[nlines] := strLine, Set nlines := nlines + 1; }
[End of while loop]
4. Set totline := nlines & Set ob := new AllDemo(); [create the object of class ALLDEMO]
5. Repeat for nl := 0 to nlines-1 by 1 do:
6. Call STEMTAGG(nlines[nl])
[End of for loop]
7. Call EXTRACT(nlines)
8. Call UNIQUEWORDS(words)
9. Call CHECKREL(uwords)
10. Set sum := 0 & Repeat for chidx := 0 to chn-1 by 1 do:
Set sum := sum + chainlengths[chidx];
[End of for loop]
11. Call SCORE(chainlengths)
12. Open the file mysummary.txt to write the generated summary & open mysummary.txt for reading into object br2
13. Repeat Steps 14 to 16 while ((strLine = br2.readLine()) != NULL) {
14. Set s := strLine & split s into an array of words and label it the twords array.
15. Repeat for i := 0 to twords.length-1 by 1 do:
16. Set twordtot[twcnt++] := twords[i]; [twordtot will contain all words from mysummary.txt]
[End of for loop]
17. Open the file named in variable fname2 to write the generated summary & open fname2 for reading into object br2.
18. Repeat Steps 19 to 21 while ((strLine = br2.readLine()) != NULL) {
19. Set s := strLine & split s into an array of words and label it the twords array.
20. Repeat for i := 0 to twords.length-1 by 1 do:
21. Set twordtot2[twcnt2++] := twords[i]; [twordtot2 will contain all words from the file fname2]
[End of for loop]
22. Compare the twordtot and twordtot2 arrays to find the number of words matched and store it in wmatchcnt; twcnt2 is the total number of words in the original summary.
23. Set recall := wmatchcnt*1.0/twcnt2;
24. Set endtime := current_system_time [generated using the predefined function call] & Set diff := endtime - starttime;
25. Print(recall) & Print(diff);

Algo 2: Algorithm of Stemming & Tagging

STEMTAGG(line)
[Perform stemming and then tagging]
1. Using the WordNet API, obtain the base form of the words.
2. Tag the line using the MAXENTTAGGER class.
3. Return the tagged line.

Algo 3: Algorithm of Noun Compound Filtering

EXTRACT(nlines)
[Extract only nouns and proper nouns; the elines array is declared for that and the words array will store all the words]
1. Set nc := 0, wcount := 0;
2. Split each element of the nlines array and store it in the twords array
3. Set s2 := "";
4. Repeat for k := 0 to twords.length-1 by 1 do:
5. If twords[k] ends with "NN" or "NNP" then:
(a) Set s2 := s2 + twords[k] + " ";
(b) Set words[wcount++] := twords[k].substring(0, twords[k].indexOf("_"));
[End of if structure]
[End of inner for loop]
6. If (s2.length() != 0) then:
Set elines[nc++] := s2;
[End of if structure]
[End of outer for loop]
7. Return elines, words

Algo 4: Algorithm for Finding the Unique Words

UNIQUEWORDS(words)
1. Set ucount := 0 & Repeat for nl := 0 to wcount-1 by 1 do: Set found := 0;
2. Repeat for k := 0 to ucount-1 by 1 do:
3. If (uwords[k] == words[nl]) then:
fcount[k]++, Set found := 1;
[End of if structure]
[End of inner for loop]
4. If (found == 0) then:
(a) Set uwords[ucount] := words[nl];
(b) Set fcount[ucount] := 1;
(c) ucount++;
[End of if structure]
[End of outer for loop]
5. Return uwords, fcount

Algo 5: Algorithm to Check the Synonyms, Antonyms, Hypernyms

CHECKREL(uwords)
[Check the synonyms, antonyms, hypernyms and hyponyms using the WordNet library, and find the chain lengths and store them in the chainlengths array]
1. Set chains := 1, ucont := 0, incn := 0, chn := 0;
2. Repeat for nl := 0 to ucount-1 by 1 do: Set found := 0 & Repeat for k := 0 to incn-1 by 1 do:
3. If (inchain[k] == uwords[nl]) then:
Set found := 1;
[End of if structure]
[End of inner for loop]
4. If (found == 0) then:
Check the synonyms, antonyms, hypernyms and hyponyms, find the chain lengths and store them in the chainlengths array.
[End of if structure]
[End of outer for loop]
5. Return
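Algo 2 names the MAXENTTAGGER class, i.e. the Stanford MaxentTagger. The sketch below shows, under assumptions, how tagging (Algo 2) and noun filtering (Algo 3) fit together; it is not the dissertation code. The model path is illustrative, and we assume the default word_TAG output of tagString, which is what the indexOf("_") call in Algo 3 relies on.

import edu.stanford.nlp.tagger.maxent.MaxentTagger;
import java.util.*;

public class NounFilter {
    public static void main(String[] args) {
        // Illustrative model path; models ship with the Stanford POS tagger distribution.
        MaxentTagger tagger =
                new MaxentTagger("models/english-left3words-distsim.tagger");

        // Algo 2: tag the line; output tokens look like "word_TAG".
        String tagged = tagger.tagString("John quickly read the summary");

        // Algo 3: keep only tokens whose tag is a noun (NN*) or proper noun (NNP*).
        List<String> nouns = new ArrayList<>();
        for (String token : tagged.split("\\s+")) {
            int sep = token.lastIndexOf('_');
            if (sep < 0) continue;
            String word = token.substring(0, sep);
            String tag = token.substring(sep + 1);
            if (tag.startsWith("NN")) nouns.add(word);   // NN, NNS, NNP, NNPS
        }
        System.out.println(nouns);                        // e.g. [John, summary]
    }
}

Algo 4's unique-word counting then reduces, in idiomatic Java, to a single pass over this list with a HashMap<String, Integer>.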
Algo 6: Algorithm to Find the Score of the Chains

SCORE(chainlengths)
[Find the score of each chain using significance and utility]
1. Repeat for chidx := 0 to chn-1 by 1 do:
Set sig[chidx] := -(chainlengths[chidx]/(sum*1.0)) * (Math.log(chainlengths[chidx]/(sum*1.0)) / Math.log(2.0));
2. Print("Chain " + (chidx+1) + " Significance is " + sig[chidx]);
[End of for loop]
3. Repeat for uidx := 0 to chn-1 by 1 do:
4. Set ulits[uidx] := sig[uidx] * chainlengths[uidx];
[End of for loop]
5. Set sutility := 0; [calculate the sum of the utilities of all chains] & Repeat for uidx := 0 to chains-1 by 1 do:
6. Print("Utility of Chain " + (uidx+1) + " = " + ulits[uidx]);
7. Set sutility := sutility + ulits[uidx];
[End of for loop]
8. Set autility := sutility/chn & Set thold := sutility/(chains*2);
9. Set cn := 0; [find the accepted chains] & Repeat for uidx := 0 to chains-1 by 1 do:
10. If (ulits[uidx] > thold) then:
(a) Set sel := uidx; [add the chain words to the achainw array]
(b) Repeat for id := 0 to chainlengths[sel]-1 by 1 do:
(c) Set achainw[cn++] := chainsall[sel][id] + " ";
[End of for loop]
[End of if structure]
[End of outer for loop]
11. Repeat for idx1 := 0 to totline-1 by 1 do:
12. Set matchcnt[idx1][0] := idx1 & Set matchcnt[idx1][1] := 0;
13. Repeat for idx2 := 0 to cn-1 by 1 do:
If (slines[idx1].indexOf(achainw[idx2]) != -1) then: [search for the presence of the word in the line]
Set matchcnt[idx1][1] := matchcnt[idx1][1] + 1;
[End of if structure]
[End of inner for loop]
[End of outer for loop]
14. Sort the matchcnt array in descending order of count using bubble sort.
15. Set exlines := totline*per2/100;
16. Extract the top exlines elements from the matchcnt array.
17. Return
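The sentence-selection phase of Algo 6 (steps 9-16) can be rendered compactly in Java as follows. This is a sketch under assumptions: the accepted chain words and the document lines are taken as given, the standard library sort replaces the bubble sort of step 14, and the selected lines are re-emitted in document order, which is a presentation choice rather than something the pseudocode specifies.

import java.util.*;

public class SentenceSelector {
    // Select the top per2% of lines by the number of accepted chain words they contain.
    static List<String> select(String[] slines, List<String> acceptedWords, int per2) {
        Integer[] order = new Integer[slines.length];
        int[] count = new int[slines.length];
        for (int i = 0; i < slines.length; i++) {
            order[i] = i;
            for (String w : acceptedWords)
                if (slines[i].contains(w)) count[i]++;   // step 13: word present in line
        }
        // Step 14: rank lines by match count, descending.
        Arrays.sort(order, (a, b) -> count[b] - count[a]);
        int exlines = slines.length * per2 / 100;        // step 15
        List<Integer> chosen = new ArrayList<>();
        for (int i = 0; i < exlines; i++) chosen.add(order[i]);
        Collections.sort(chosen);                        // keep original document order
        List<String> summary = new ArrayList<>();
        for (int idx : chosen) summary.add(slines[idx]);
        return summary;
    }

    public static void main(String[] args) {
        String[] lines = {
            "Lexical chains capture document topics.",
            "The weather was pleasant.",
            "Chains of related nouns score each sentence.",
            "Unrelated filler text." };
        System.out.println(select(lines, Arrays.asList("chains", "sentence"), 50));
    }
}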

IV. IMPLEMENTATION

I used the Eclipse Java EE (Enterprise Edition) IDE (Integrated Development Environment) for Web Developers, Version Kepler Service Release 1 (Build id: 20130919-0819), for the implementation in my dissertation. The screenshots below show the system in operation.

[Fig: Run the Automatic Text Summarization Using WordNet of the Proposed Work]
[Fig: Generate the Summary of the Proposed Work]
[Fig: Generate the Summary of the Base Paper [3]]

V. EVALUATION AND EXPERIMENTAL RESULTS

We have run our program on around 40 sample summaries, from which we present four samples here; the results of the comparison are given in the form of a table and graphs. For each of the four sample data sets we generated the base-paper summary and the proposed-work summary, compared both against the human-generated summary, and recorded the value of recall, the total words matched, and the time taken to generate the summary. We used 30% of the lexical chains for the summary of each sample. The results are shown below.

Table 1: Result of text summary generation

Input File | Recall (Base / Proposed) | Total Words Matched (Base / Proposed) | Time to Generate Summary in Sec (Base / Proposed)
Data1      | 0.54 / 0.56              | 77 / 79                               | 214 / 145
Data2      | 0.44 / 0.72              | 37 / 61                               | 122 / 48
Data3      | 0.41 / 0.44              | 63 / 68                               | 128 / 35
Data4      | 0.53 / 0.64              | 102 / 123                             | 105 / 32

[Fig: Comparison of recall values of the base and proposed models]
[Fig: Comparison of total words matched in the base and proposed models]
[Fig: Comparison of time taken by the base and proposed models]

VI. CONCLUSION AND FUTURE WORK

The document summarization problem is very important because of its impact on information retrieval methods as well as on the efficiency of decision-making processes, particularly in the age of Big Data analysis. Although a wide variety of text summarization techniques and algorithms have been developed, there is a need for new approaches that produce precise and reliable document summaries and that can tolerate variations in document characteristics.

In this thesis, we presented a method to compute lexical chains as an efficient intermediate representation of a document. Along with the WordNet API, our method includes both nouns and proper nouns in the computation of the lexical chains. The statistical calculations in our proposed methodology produced better output compared to the base paper.

I wish to pursue further research in the following areas:

Document clustering is the main step towards identifying the various representations in a multi-document collection. An accurate similarity measure plays an important role in determining the overall quality of the clusters or parts of the document. I have computed the similarity measure based on the overlap of nouns and proper nouns between two words and segments; in further studies I will try to extend this similarity measure to include verbs, adverbs, etc.

Multi-document summarization is still a very complex and difficult task. I will try to extend my research to generate summaries of multiple documents at a time.

Acknowledgment

I take this opportunity to express a deep sense of gratitude towards my guide, Mr. Chetan Kumar. I am thankful to all faculty members, especially Mr. Gopal Kumar, Assistant Professor of Kautilya Institute of Technology and Engineering, who taught the fundamentals essential to undertake such a synopsis.

Without their valuable guidance it would have been extremely
difficult to grasp and visualize the synopsis.

References

[1] Elena Lloret, "Text Summarization: An Overview", paper supported by the Spanish Government under the project TEXT-MESS (TIN2006-15265-C06-01), 2008.
[2] Vishal Gupta, Gurpreet Singh Lehal, "A Survey of Text Summarization Extractive Techniques", Journal of Emerging Technologies in Web Intelligence, Vol. 2, No. 3, pp. 258-268, Aug. 2010, doi:10.4304/jetwi.2.3.258-268.
[3] A. R. Kulkarni, S. S. Apte, "An Automatic Text Summarization Using Lexical Cohesion and Correlation of Sentences", IJRET: International Journal of Research in Engineering and Technology, 2014, eISSN: 2319-1163, pISSN: 2321-7308.
[4] A. N. Gulati, S. D. Sawarkar, "A Pandect of Different Text Summarization Techniques", International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 5, Issue 4, April 2015, ISSN: 2277-128X.
[5] R. Barzilay and M. Elhadad, "Using Lexical Chains for Text Summarization", ACL/EACL-97 Workshop on Intelligent Scalable Text Summarization, pp. 10-17, 1997.
[6] "WordNet", http://wordnet.princeton.edu/.
[7] M. Suneetha, S. Sameen Fatima, "Corpus based Automatic Text Summarization System with HMM Tagger", International Journal of Soft Computing and Engineering (IJSCE), Vol. 1, Issue 3, July 2011, ISSN: 2231-2307.
[8] Anjali Ganesh Jivani, "A Comparative Study of Stemming Algorithms", Int. J. Comp. Tech. Appl., Vol. 2, Issue 6, pp. 1930-1938, 2011.
[9] R. Barzilay, "Lexical Chains for Summarization", M.Sc. thesis, Ben-Gurion University of the Negev, Department of Mathematics & Computer Science, 1997.
[10] H. G. Silber and K. F. McCoy, "Efficient Text Summarization Using Lexical Chains", Proceedings of the 5th International Conference on Intelligent User Interfaces, pp. 252-255, January 2000.
[11] Kamal Sarkar, "Using Domain Knowledge for Text Summarization in Medical Domain", International Journal of Recent Trends in Engineering, Vol. 1, No. 1, May 2009.
[12] M. S. Patil, M. S. Bewoor, S. H. Patil, "Survey on Extractive Text Summarization Approaches", NCI2TM, 2014, ISBN: 978-81-927230-0-6.
[13] B. Favre, D. Hakkani-Tur, S. Petrov, D. Klein, "Efficient Sentence Segmentation Using Syntactic Features", IEEE Spoken Language Technology Workshop (SLT 2008), Goa, 15-19 Dec. 2008, pp. 77-80, E-ISBN: 978-1-4244-3472-5, DOI: 10.1109/SLT.2008.4777844.
[14] R. M. Kaplan, "A Method for Tokenizing Text", Inquiries into Words, Constraints and Contexts, 2005, stanford.edu.
[15] D. Cutting, J. Kupiec, J. Pedersen, P. Sibun, "A Practical Part-of-Speech Tagger", Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304, USA.
