COLLEGE OF ENGINEERING
MAYILADUTHURAI, MANNAMPANDAL-609 305
A SEMINAR REPORT
on
A SURVEY ON TEXT MINING SUMMARIZATION
Submitted by
ABINAYA V 820317405002
I YEAR M.E. (CSE) / II SEMESTER
MAY 2018
BONAFIDE CERTIFICATE
Date: ………….
Submitted for the University Examination held in May 2018 at
A.V.C College of Engineering, Mannampandal-Mayiladuthurai.
Examined on:
CHAPTER NO  TITLE
ABSTRACT
LIST OF FIGURES
LIST OF ABBREVIATIONS
1 INTRODUCTION
2 DATA COLLECTION
Research Challenges
CONCLUSION
REFERENCES
FINAL PAPER
ABSTRACT
LIST OF FIGURES
FIGURE NO  FIGURE NAME
1  Text summarization process
LIST OF TABLES
1. INTRODUCTION
Automatic summarization involves reducing a text document or a larger corpus
of multiple documents into a short set of words or a paragraph that conveys the main
meaning of the text. There are two methods for automatic text summarization:
extractive and abstractive. Extractive methods work by selecting a subset of existing
words, phrases, or sentences in the original text to form the summary. In contrast,
abstractive methods build an internal semantic representation and then use natural
language generation techniques to create a summary that is closer to what a human
might generate. Such a summary might contain words not explicitly present in the
original. A large number of techniques and approaches have been developed in this
field of research (Jones 2007). A summary generated by an automatic text summarizer
should consist of the most relevant information in a document and at the same time, it
should occupy less space than the original document. Nevertheless, automatic
summary generation is a challenging task.
With the growth of online publishing, the large number of internet users, and the
fast development of electronic government (e-government), the need for text
summarization has emerged. As information and communication technologies grow at
great speed, a large number of electronic documents are available online, and users
face difficulty finding relevant information. Moreover, the internet provides large
collections of text on a variety of topics, which accounts for the redundancy in the
texts available online. Users get so exhausted reading large amounts of text that they
may skip many important and interesting documents. Therefore, a robust text
summarization system is currently needed. Such systems can compress information
from various documents into a shorter, readable summary.
This approach, based on statistical features, is very simple and crude and is often
used for keyword extraction from documents. No predefined dataset is required. To
extract the keywords from documents, it uses several statistical features of the
document, such as term or word frequency (TF), term frequency-inverse document
frequency (TF-IDF), and position of keyword (POK). These techniques are independent
of any language, so a summarizer built on them can summarize text in any language.
They therefore do not require any additional linguistic knowledge or complex
linguistic processing.
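The statistical features named above can be illustrated with a minimal sketch. It assumes that each sentence is treated as one "document" when computing the inverse document frequency, and that a sentence's score is the sum of the TF-IDF weights of its words. The function names (tokenize, tf_idf_scores, summarize) are hypothetical and do not come from any surveyed system.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf_scores(sentences):
    """Score each sentence by the sum of the TF-IDF weights of its words.

    Assumption for this sketch: every sentence counts as one document
    when the inverse document frequency is computed.
    """
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each word occur?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        scores.append(sum(
            (count / len(tokens)) * math.log(n / df[word])
            for word, count in tf.items()
        ))
    return scores


def summarize(sentences, k=1):
    """Return the k highest-scoring sentences in their original order."""
    scores = tf_idf_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Because the sketch relies only on word counts, it uses no language-specific resources beyond the tokenizer, which here assumes a Latin alphabet. A word that occurs in every sentence receives a weight of zero and does not influence the ranking.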
A coherence-based approach deals with the cohesion relations among words. The
cohesion relations among elements in a text are reference, ellipsis, substitution,
conjunction, and lexical cohesion. Features and resources used in this approach
include lexical chains (LC), WordNet (WN), the lexical chain score of a word (LCS),
the direct lexical chain score of a word (DLCS), the lexical chain span score of a
word (LCSS), the direct lexical chain span score of a word (DLCSS), and Rhetorical
Structure Theory (RST).
2. DATA COLLECTION
Yeh et al. (2005) proposed two techniques for automatic text summarization: the
Modified Corpus-Based Approach (MCBA) and the Latent Semantic Analysis-based TRM
technique (LSA+TRM). MCBA, being a trainable summarizer, depends on a score function
and analyzes important features for generating summaries, such as Position (Pos),
positive keyword, negative keyword, Resemblance to the Title (R2T), and Centrality
(Cen). To improve this corpus-based approach, two new ideas are used: (a) sentence
positions are ranked to denote the importance of the various positions, and (b) a
Genetic Algorithm (GA) (Russell and Norvig 1995) trains the score function to obtain
an appropriate combination of feature weights. The LSA+TRM approach uses LSA to
obtain a document's semantic matrix and builds a semantic text relationship map from
each sentence's semantic representation. LSA is used to extract latent structures
from a document. The MCBA and LSA+TRM approaches focus on summarizing single
documents and produce indicative, extract-based summaries. The disadvantage is that
the summaries lack coherence and cohesion most of the time. In the LSA+TRM approach,
obtaining the best dimension reduction ratio and explaining the effects of LSA are
difficult. Moreover, computing the singular value decomposition (SVD) takes a long
time.
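A score function over the five features listed above might look like the sketch below. It assumes a linear weighted combination in which the negative-keyword feature is subtracted; the exact formula is not given here, so this form is an assumption. The weight values and the names WEIGHTS, sentence_score, and rank_sentences are hypothetical. In MCBA the weights would be tuned by the genetic algorithm, which is not shown.

```python
# Hypothetical feature weights. In MCBA these would be learned by a
# genetic algorithm; fixed values are used here only for illustration.
WEIGHTS = {"pos": 0.3, "pos_kw": 0.25, "neg_kw": 0.15, "r2t": 0.2, "cen": 0.1}


def sentence_score(features, weights=WEIGHTS):
    """Weighted combination of feature values, each assumed to lie in [0, 1].

    The negative-keyword feature is subtracted because it signals that a
    sentence is unlikely to belong in the summary (an assumption of this
    sketch).
    """
    return (
        weights["pos"] * features["pos"]
        + weights["pos_kw"] * features["pos_kw"]
        - weights["neg_kw"] * features["neg_kw"]
        + weights["r2t"] * features["r2t"]
        + weights["cen"] * features["cen"]
    )


def rank_sentences(feature_table):
    """Return sentence indices ordered from highest to lowest score."""
    return sorted(
        range(len(feature_table)),
        key=lambda i: sentence_score(feature_table[i]),
        reverse=True,
    )
```

A summary would then be formed from the top-ranked indices, restored to document order. Training amounts to searching for the weight vector that gives the best summaries on a training corpus.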
2.1.5 Automatic text summarization using MR, GA, FFNN, GMM and PNN
based models
Rafael Ferreira (2014) advocates the thesis that the quality of the summary
obtained with combinations of sentence scoring methods depends on the text subject.
This hypothesis is evaluated in three different contexts: news, blogs, and articles.
Both quantitative and qualitative measures are used to evaluate which combination of
sentence scoring methods yields better results for each context. The best combination
for short, well-formed texts (news) is word frequency, TF-IDF, sentence position, and
resemblance to the title. Blogs, which are short and unstructured texts, also achieve
good results using word frequency and TF-IDF. However, unlike news, combining these
methods with the TextRank score and sentence length improves the results. For
scientific articles, the best combinations include cue-phrase, sentence position,
TF-IDF, and resemblance to the title.
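To make the idea of combining sentence scoring methods concrete, the sketch below combines three of the methods named above: word frequency, sentence position, and resemblance to the title. It assumes that each score is normalized to a maximum of 1.0 and that the combination is an unweighted mean. Both choices are assumptions for illustration, not the procedure of the cited paper, and all function names are hypothetical.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequency_scores(sentences):
    """Average document-wide frequency of the words in each sentence."""
    freq = Counter(w for s in sentences for w in words(s))
    return [
        sum(freq[w] for w in words(s)) / max(len(words(s)), 1)
        for s in sentences
    ]


def position_scores(sentences):
    """Earlier sentences score higher: 1.0 for the first, decreasing linearly."""
    n = len(sentences)
    return [(n - i) / n for i in range(n)]


def title_scores(sentences, title):
    """Fraction of title words that also appear in the sentence."""
    title_words = set(words(title))
    return [
        len(title_words & set(words(s))) / max(len(title_words), 1)
        for s in sentences
    ]


def normalize(scores):
    """Scale scores so that the largest is 1.0 (all zeros stay zero)."""
    top = max(scores, default=0)
    return [s / top if top else 0.0 for s in scores]


def combined_scores(sentences, title):
    """Unweighted mean of the normalized individual scores."""
    methods = [
        normalize(word_frequency_scores(sentences)),
        normalize(position_scores(sentences)),
        normalize(title_scores(sentences, title)),
    ]
    return [sum(column) / len(methods) for column in zip(*methods)]
```

Swapping entries in the methods list is how different combinations could be compared per text type, for example by adding a sentence length score for blogs.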
| TITLE | AUTHOR AND YEAR | PROS | CONS |
|---|---|---|---|
| Trained summarizer and latent semantic analysis for summarization of text | Yeh et al. 2005 | The LSA+TRM approach generates a summary of semantically related sentences. The approaches are language-independent. | Summaries lack coherence and cohesion most of the time. The feature weights of the score function generated by the GA do not always give good performance results for the test corpus. |
| Text understanding and summarization through document concept lattice | Ye et al. 2007 | DCL focuses on semantics and employs only reliable features. The sentences are coherent and represent important and different local topics, and the summary is generated with the least loss of answer. The evaluation framework does not require human-made summaries. | The computation cost of generating a complete DCL is high because all possible combinations of concepts are observed. |
| Evolutionary optimization algorithm for summarizing multiple documents | Alguliev et al. 2013 | This approach reduces redundancy in the summaries, selects important sentences from the document, and includes relevant content of the original document. | The runtime complexity of DE, which is a population-based stochastic search method, is high. |
| Statistical and linguistic based summarization system for multiple documents | Ferreira et al. 2014 | Apart from using statistical and semantic similarities, this approach treats the input text linguistically by performing discourse analysis and co-reference resolution. As it is an unsupervised approach, it does not require an annotated corpus. | This system strives to search for important sentences in groups of different topics and hence suffers from the problem of sentence ordering. |
| Graph-based extractive summarization by considering importance, non-redundancy and coherence | Parveen and Strube 2015 | This approach does not depend on any parameters or training data, as it is an unsupervised technique, and the summary, being coherent, is of good quality. | This approach is capable of generating a summary from a single document only. |
| Tag Clouds for Summarizing Web Search Results | Thomas Hentrich, Benjamin M. Good, Mark D. Wilkinson | The tag cloud interface is advantageous in presenting descriptive information and in reducing user frustration. | It is less effective at the task of enabling users to discover relations between concepts. |
| Summarizing tourist destinations by mining user-generated travelogues and photos | Yanwei Pang | The tags and photos are organized appropriately to generate a representative and comprehensive summary that describes a given destination both textually and visually. Experimental results based on a large collection of travelogues and photos show the effectiveness of the proposed destination summarization framework. | Lack of performance in photo selection and user experience. |