
A.V.C.

COLLEGE OF ENGINEERING
MAYILADUTHURAI, MANNAMPANDAL-609 305

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CP5281 – TERM PAPER WRITING AND SEMINAR

A SEMINAR REPORT
on
A SURVEY ON TEXT MINING SUMMARIZATION

Submitted by

ABINAYA V 820317405002
I YEAR M.E. (CSE) / II SEMESTER
MAY 2018

BONAFIDE CERTIFICATE

Register No: 820317405002

Certified to be the bonafide record of the seminar paper presented by


ABINAYA V of I Year / II Semester M.E. Computer Science and Engineering degree
course for the practical CP5281 – TERM PAPER WRITING AND
SEMINAR at A.V.C. College of Engineering, Mannampandal,
Mayiladuthurai during the academic year 2017-18.

Guide Faculty In-charge HEAD OF THE DEPARTMENT


Dr.S.PADMAPRIYA, M.E., Ph.D.,

Date: ………….

-------------------------------------------------------------------------------------------------
Submitted for the University Examination held in May 2018 at
A.V.C College of Engineering, Mannampandal-Mayiladuthurai.

Examined on:

Examiner-1 Examiner- 2 Examiner - 3


TABLE OF CONTENTS

CHAPTER NO    TITLE

              ABSTRACT
              LIST OF FIGURES
              LIST OF ABBREVIATIONS
1             INTRODUCTION
              1.1 A Survey on Text Summarization
              1.2 Need of Text Summarization
              1.3 Text Summarization Types
              1.4 Text Summarization Approaches
2             DATA COLLECTION
              2.1 Literature Survey
              2.2 Literature Survey Table
3             PAPER CLASSIFICATION AND CATEGORIZATION
              Research Challenges
              CONCLUSION
              REFERENCES
              FINAL PAPER
ABSTRACT

In recent times, data is growing rapidly in every domain, such as
news, social media, banking and education. Owing to this excess of data, there
is a need for an automatic summarizer capable of summarizing the data,
especially textual data, from the original document without losing any critical
information. Text summarization has therefore emerged as an important research
area in the recent past. In this regard, a review of existing work on the text
summarization process is useful for carrying out further research. In this paper,
recent literature on automatic keyword extraction and text summarization is
presented, since the text summarization process depends heavily on keyword
extraction. This literature includes a discussion of the different methodologies
used for text summarization. It also discusses the different datasets used for text
summarization in several domains. Finally, it briefly discusses the issues and
research challenges faced by researchers, along with future directions.

LIST OF FIGURES

FIGURE NO    FIGURE NAME
1            Text summarization process
2            Classification of text summarization approach

LIST OF TABLES

TABLE NO.    TABLE NAME
1            Different types of summaries on the basis of various factors

1. INTRODUCTION
Automatic summarization involves reducing a text document or a larger corpus
of multiple documents into a short set of words or a paragraph that conveys the main
meaning of the text. There are two methods for automatic text summarization:
extractive and abstractive. Extractive methods work by selecting a subset of existing
words, phrases, or sentences in the original text to form the summary. In contrast,
abstractive methods build an internal semantic representation and then use natural
language generation techniques to create a summary that is closer to what a human
might write. Such a summary might contain words not explicitly present in the
original. A large number of techniques and approaches have been developed in this
field of research (Jones 2007). A summary generated by an automatic text summarizer
should contain the most relevant information in a document and, at the same time,
should occupy less space than the original document. Nevertheless, automatic
summary generation is a challenging task.
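
For illustration only, the following minimal Python sketch (not drawn from any particular system surveyed here) performs extractive summarization by scoring each sentence with the frequencies of its words and returning the top-scoring sentences in their original order; the sentence splitting and tokenization are deliberately simplified assumptions.

import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    # Naive sentence splitting; a real system would use a proper tokenizer.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))   # document-level term frequency
    scored = []
    for idx, sent in enumerate(sentences):
        tokens = re.findall(r'[a-z]+', sent.lower())
        score = sum(freq[t] for t in tokens) / (len(tokens) or 1)
        scored.append((score, idx, sent))
    top = sorted(scored, reverse=True)[:num_sentences]
    top.sort(key=lambda item: item[1])                     # restore original sentence order
    return ' '.join(sent for _, _, sent in top)

if __name__ == "__main__":
    doc = ("Text summarization reduces a document to its main points. "
           "Extractive methods select existing sentences from the text. "
           "Abstractive methods generate new sentences instead.")
    print(extractive_summary(doc, num_sentences=2))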

1.1 LITERATURE SURVEY ON TEXT SUMMARIZATION

A literature survey on the text summarization process reveals many techniques
and approaches, and various types of summaries are produced in the process.
Automatic keyword extraction is the process of selecting the words and phrases from a
text document that best capture its core meaning without any human intervention,
depending on the model. The goal of automatic keyword extraction is to apply the
power and speed of current computing to the problem of access and retrieval,
emphasizing information organization without the added cost of human annotators.
Summarization is a process in which the most salient features of a text are extracted
and compiled into a short abstract of the original document. According to Mani and
Maybury, text summarization is the process of distilling the most important
information from a text to produce an abridged version for a particular task and user.
Summaries are usually around 17% of the original text and yet contain everything
that could have been learned from reading the original article. In the wake of big data
analysis, summarization is an efficient and powerful technique to give a glimpse of
the whole data.

1.2 NEED OF TEXT SUMMARIZATION

With the increase in online publishing, the large number of internet users and
the fast development of electronic government (e-government), the need for text
summarization has emerged. As information and communication technologies grow at
great speed, a large number of electronic documents are available online and users
face difficulty in finding relevant information. Moreover, the internet provides large
collections of text on a variety of topics, which accounts for the redundancy in the
texts available online. Users get so exhausted reading large amounts of text that they
may skip many important and interesting documents. Therefore, robust text
summarization systems are needed. Such systems can compress information from
various documents into a shorter, readable summary.

1.3 VARIOUS TYPES OF TEXT SUMMARIZATION

1.3.1 Based on Number of Documents:

Single and Multi-Document Summarization: In single-document summarization
the summary is generated from a single document, whereas in multi-document
summarization many documents are used for generating one summary. Summarization
of a single document is often considered to be extended to generate summaries of
multiple documents, but summarizing multiple documents is more difficult than
summarizing single documents. Redundancy is one of the biggest problems in
summarizing multiple documents. Some systems tackle redundancy by initially
selecting the sentences at the beginning of the paragraph and then measuring the
similarity of the next sentence with the already chosen sentences; only if this sentence
contains relevant new content is it selected (Sarkar 2010). The Maximal Marginal
Relevance (MMR) approach was suggested by Carbonell and Goldstein (1998) for
reducing redundancy; a sketch of the idea is given below. Researchers from all over
the world are investigating different methods to produce the best results in multi-
document summarization.
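
The sketch below illustrates the general MMR idea in Python, under the assumption that TF-IDF cosine similarity is used both for relevance to the query and for redundancy against already selected sentences; it is a simplified rendering for illustration, not the exact formulation of Carbonell and Goldstein (1998).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_select(sentences, query, k=3, lam=0.7):
    # Vectorize sentences and query together so they share one vocabulary.
    matrix = TfidfVectorizer().fit_transform(sentences + [query])
    sent_vecs, query_vec = matrix[:-1], matrix[-1]
    relevance = cosine_similarity(sent_vecs, query_vec).ravel()
    pairwise = cosine_similarity(sent_vecs)
    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Penalize similarity to sentences already in the summary.
            redundancy = max((pairwise[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in sorted(selected)]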

1.3.2 Extractive and Abstractive Summarization:

Extractive versus abstractive summarization is another classification of document
summarization. An extract summary is generated in extractive summarization by
selecting a few relevant sentences from the original document. The summary's length
depends on the compression rate. It is a simple and robust method for summarizing
text: saliency scores are assigned to the sentences in the documents and then the highly
scored sentences are chosen to generate the summary. Abstractive summarization, on
the other hand, produces an abstract summary which includes words and phrases
different from the ones occurring in the source document. An abstract is therefore a
summary that consists of ideas or concepts taken from the original document but
re-interpreted and presented in a different form. It needs extensive natural language
processing and is therefore much more complex than extractive summarization. Thus,
extractive summarization, due to its greater feasibility, has become the standard in
document summarization.

1.3.3 Based on Summary Usage:

Generic and Query-Based: Topic-focused or user-focused summaries are other
names for query-focused summaries. Such a summary includes the query-related
content, whereas a generic summary provides a general sense of the information
present in the document.
Based on the style of the output, there are two types of summaries: indicative and
informative. Indicative summaries tell what the document is about; they give
information about the topic of the document. Informative summaries, while covering
the topics, give the whole information in an elaborated form.

Figure 1. Text summarization process

1.3.4 Based on Techniques:

The summarization task can be either supervised or unsupervised. Training data
is needed in a supervised system for selecting important content from the documents,
and a large amount of labeled or annotated data is needed for the learning techniques.
These systems address the task at the sentence level as a two-class classification
problem, in which sentences belonging to the summary are termed positive samples
and sentences not present in the summary are termed negative samples. For performing
sentence classification, popular classification methods such as the Support Vector
Machine (SVM) and neural networks are employed; a sketch of this formulation is
given below. Unsupervised systems, on the other hand, do not require any training
data. They generate the summary by accessing only the target documents and are thus
suitable for any newly observed data without any advance modification. Such systems
apply heuristic rules to extract highly relevant sentences and generate a summary. A
common technique employed in unsupervised systems is clustering.
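
As a rough illustration of this two-class formulation, the Python sketch below trains a Support Vector Machine on a few toy sentences labelled as summary-worthy (1) or not (0); the sentences, labels and TF-IDF features are assumptions made purely for illustration and are not taken from any surveyed system.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_sentences = [
    "The study proposes a new summarization algorithm.",   # positive sample
    "Results improve over all baseline systems.",          # positive sample
    "The weather was pleasant during the conference.",     # negative sample
    "The authors thank the anonymous reviewers.",          # negative sample
]
train_labels = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_sentences, train_labels)

# 1 -> include the sentence in the summary, 0 -> exclude it.
print(model.predict(["A novel extractive method outperforms prior approaches."]))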

Table 1. Different types of summaries on the basis of various factors

S.No.  Type of summary                  Factor
1.     Single and multi-document        Number of documents
2.     Extractive and abstractive       Output (whether an extract or an abstract is required)
3.     Generic and query-focused        Purpose (whether general or query-related data is required)
4.     Supervised and unsupervised      Availability of training data
5.     Mono-, multi- and cross-lingual  Language
6.     Web-based                        For summarizing web pages
7.     E-mail based                     For summarizing e-mails
8.     Personalized                     Information specific to a user's need
9.     Update                           Current updates regarding a topic
10.    Sentiment-based                  Opinions are detected
11.    Survey                           Important facts regarding a person, place or any other entity

1.4 TEXT SUMMARIZATION APPROACHES

1.4.1 Statistical Based Approach

This approach is simple, even crude, and is often used for keyword extraction
from documents. No predefined dataset is required. To extract the keywords from
documents it uses several statistical features of the document, such as term or word
frequency (TF), term frequency-inverse document frequency (TF-IDF) and position of
keyword (POK). These techniques are independent of any language, so a summarizer
developed using them can summarize text in any language. They therefore do not
require any additional linguistic knowledge or complex linguistic processing.
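
A minimal Python sketch of statistical sentence scoring with TF-IDF is given below; it assumes that each sentence is treated as a separate "document" for the inverse-document-frequency computation, which is one common convention rather than a requirement of the approach.

from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Text summarization condenses a document into a short summary.",
    "Statistical approaches rely on features such as term frequency.",
    "The weather today is unrelated to the topic of the document.",
]
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(sentences)     # rows = sentences, columns = terms
scores = matrix.sum(axis=1).A1              # sentence score = sum of its TF-IDF weights
for score, sent in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.3f}  {sent}")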

1.4.2 Machine Learning Based Approach:

Machine learning is a feature-dependent approach in which an annotated
dataset is needed to train the models. Several popular machine learning methods,
namely Naïve Bayes (NB), decision trees (DTs), Hidden Markov Models (HMM),
Maximum Entropy (ME), Neural Networks (NN) and Support Vector Machines
(SVM), are used for text summarization.

1.4.3 Graph Based Approaches:

In a graph, text elements (words or sentences) are represented by nodes and
edges connect the related (semantically related) text elements. Erkan and Radev
(2004) proposed LexRank, a summarization system for multiple documents in which
the sentences expected to be part of the summary are represented in a graph. If the
similarity between two sentences lies above a given threshold, then there is a
connection between them in the graph. After the network is built, important sentences
are selected by the system by carrying out a random walk on the graph.
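
The Python sketch below illustrates this graph-based idea in the spirit of LexRank: sentences become nodes, edges connect sentence pairs whose TF-IDF cosine similarity exceeds a threshold, and PageRank supplies the random-walk importance scores. The threshold and similarity measure are illustrative assumptions, not the exact choices of Erkan and Radev (2004).

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def graph_rank(sentences, threshold=0.1, k=2):
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i][j] > threshold:              # connect sufficiently similar sentences
                graph.add_edge(i, j, weight=sim[i][j])
    scores = nx.pagerank(graph, weight="weight")   # random-walk importance of each node
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]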

Figure 2. Classification of text summarization approach

1.4.4 Coherent Based Approach

A coherence-based approach basically deals with the cohesion relations among
words. The cohesion relations among elements in a text are: reference, ellipsis,
substitution, conjunction and lexical cohesion. Techniques in this category include the
lexical chain (LC), WordNet (WN), the lexical chain score of a word (LCS), the direct
lexical chain score of a word (DLCS), the lexical chain span score of a word (LCSS),
the direct lexical chain span score of a word (DLCSS) and Rhetorical Structure
Theory (RST).

1.4.5 Algebraic Approach

In this approach, algebraic concepts such as matrices, matrix transposes and
eigenvectors are used. Many algorithms use the algebraic approach for text
summarization, such as Latent Semantic Analysis (LSA), Meta Latent Semantic
Analysis (MLSA), Symmetric Non-negative Matrix Factorization (SNMF), Sentence
Level Semantic Analysis (SLSS), Non-negative Matrix Factorization (NMF),
Singular Value Decomposition (SVD) and Semi-Discrete Decomposition (SDD).
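
The sketch below illustrates the general algebraic (LSA-style) recipe in Python: a term-sentence matrix is factorized with Singular Value Decomposition and, for each of the leading latent topics, the sentence with the strongest weight is picked. It follows only the broad idea and is not a faithful implementation of any specific algorithm listed above.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Latent semantic analysis uncovers hidden topic structure.",
    "Singular value decomposition factorizes the term sentence matrix.",
    "Football scores are irrelevant to this technical discussion.",
]
# Term-sentence matrix: rows are terms, columns are sentences.
A = CountVectorizer(stop_words="english").fit_transform(sentences).T.toarray().astype(float)
U, S, Vt = np.linalg.svd(A, full_matrices=False)
# Row r of Vt gives, for latent topic r, the strength of each sentence.
for topic in range(2):
    best = int(np.argmax(np.abs(Vt[topic])))
    print(f"Topic {topic}: {sentences[best]}")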

1.4.6 Discourse Based Approaches:

This approach uses linguistic techniques for automatic text summarization:
discourse relations in the text are discovered. Discourse relations represent
connections between sentences and parts of a text. Mann and Thompson (1988)
proposed Rhetorical Structure Theory (RST) in the computational linguistics domain
to serve as a model of discourse structure. RST has two main aspects: (a) coherent
texts consist of a small number of units connected together by rhetorical relations,
and (b) in coherent texts there must be some kind of relation between the various
parts of the text. Coherence and cohesion are two of the main challenging issues in
text summarization. Linguistic approaches are helpful in understanding the meaning
of the document for summary generation.

2. DATA COLLECTION

2.1 LITERATURE SURVEY:

2.1.1 Trained summarizer and latent semantic analysis for summarization of text

Yeh et al. (2005) proposed two new techniques for automatic summarization
of text: the Modified Corpus Based Approach (MCBA) and the Latent Semantic
Analysis-based TRM technique (LSA+TRM). MCBA, being a trainable summarizer,
depends on a score function and analyzes important features for generating
summaries, such as Position (Pos), +ve keyword, −ve keyword, Resemblance to the
Title (R2T) and Centrality (Cen). For improving this corpus-based approach, two new
ideas are utilized: (a) in order to denote the importance of various sentence positions,
these sentence positions are ranked, and (b) a Genetic Algorithm (GA) (Russell and
Norvig 1995) trains the score function to obtain an appropriate combination of feature
weights. The LSA+TRM approach uses LSA to obtain a document's semantic matrix
and builds a relationship map for semantic text by employing each sentence's
semantic representation; LSA is used to extract the latent structure of a document.
The MCBA and LSA+TRM approaches focus on summarizing single documents and
produce indicative, extract-based summaries. The disadvantage is that the summaries
lack coherence and cohesion most of the time. In the LSA+TRM approach, obtaining
the best dimension-reduction ratio and explaining the effects of LSA are difficult.
Moreover, computing the SVD takes a long time.

2.1.2 Text understanding and summarization through document concept lattice

Ye et al. (2007) proposed a data structure named the Document Concept Lattice
(DCL), in which the concepts of the source document are represented through a
directed acyclic graph such that sets of overlapping concepts are represented by
nodes. Here, concepts are words representing concrete entities and their corresponding
actions. Thus, concepts indicate important facts and help to answer important
questions. Using the DCL, the summarization algorithm selects a globally optimal set
of sentences that represents the maximum number of possible concepts with the
minimum number of words. Finally, the algorithm produces as the output summary
the set of sentences with the highest representative power. Conclusion: the proposed
approach is competitive with existing sentence-clustering and sentence-scoring
techniques.
2.1.3 Tag Clouds for Summarizing Web Search Results

Hentrich et al. (2007) describe an application, PubCloud, that uses tag clouds
for the summarization of results from queries over the PubMed database of
biomedical literature. PubCloud responds to queries of this database with tag clouds
generated from words extracted from the abstracts returned by the query. By
summarizing the text content of the web pages returned by a query, tag clouds offer
not only an overview of the knowledge represented in the entire response, but also an
interface that enables users to navigate to potentially relevant information hidden deep
down in the response list. As the results suggest, tag clouds are not a panacea for
the summarization of web search results. Although they provide some improvement
in summarizing descriptive information, tag clouds do not help users identify
relational concepts and, in fact, slow them down when they need to retrieve specific
items.

2.1.4 Summarization of emails through conversational cohesion and subjective opinions

Carenini et al. (2008) proposed new approaches for summarizing email
conversations. Initially, a fragment quotation graph is built from the conversation
involving a few emails, in which nodes represent distinct fragments and edges
represent the replying relationship among fragments. This fragment quotation graph
then helps to form a sentence quotation graph such that each sentence in the email
conversation is represented by a distinct node and a replying relationship between two
nodes is represented by an edge. In order to assign weights to the edges, three kinds of
cohesion measures are explored: clue words (stem-dependent), semantic similarity
(WordNet-dependent) and cosine similarity (TF-IDF dependent). The task of
extractive summarization is treated as a node-ranking problem. Therefore, the
Generalized Clue Word Summarizer (CWS) and PageRank, two graph-based
summarization approaches, are used to compute each sentence's (node's) score, and
then the highly scored sentences are used to generate the summary.

2.1.5 Automatic text summarization using MR, GA, FFNN, GMM and PNN
based models

Fattah and Ren (2009) proposed a method to improve content selection in
automatic text summarization with the help of a few statistical features. This method,
being a trainable summarizer, focuses on different statistical features of every
sentence for producing summaries. These features are: Position of Sentence (Pos),
+ve keyword, −ve keyword, Resemblance of sentence to the Title (R2T), Centrality of
Sentence (Cen), Presence of Named Entity in sentence (PNE), Presence of Numbers in
sentence (PN), Bushy Path of sentence (BP), Relative Length of sentence (RL), and
Aggregate Similarity (AS). By combining all these features, Genetic Algorithm (GA)
and Mathematical Regression (MR) models are trained to obtain an appropriate mix
of feature weights. A Feed-Forward Neural Network (FFNN) and a Probabilistic
Neural Network (PNN) are used for sentence classification.

2.1.6 A Frequent Term and Semantic Similarity based Single Document Text
Summarization Algorithm

Naresh Kumar Nagwani (2011) introduced a single-document text
summarization algorithm based on frequent terms; semantic similarity is also used in
the algorithm. The designed algorithm works in three steps. In the first step, the
document to be summarized is processed by eliminating stop words and applying
stemmers. In the second step, term-frequency data is calculated from the document
and frequent terms are selected; for these selected words, semantically equivalent
terms are also generated. Finally, in the third step, all the sentences in the document
that contain the frequent and semantically equivalent terms are filtered for
summarization; a rough sketch of this pipeline is given below. The designed algorithm
is implemented using open-source technologies such as Java, DISCO and the Porter
stemmer, and verified over a standard text mining corpus.
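
The Python sketch below follows the three steps described above in simplified form; the semantic expansion of frequent terms (performed with DISCO in the original work) is omitted, and the stop-word list and frequency threshold are illustrative assumptions.

import re
from collections import Counter
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "that", "this"}
stemmer = PorterStemmer()

def frequent_term_summary(text, top_terms=3):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Step 1: tokenize, drop stop words, stem.
    stems = [stemmer.stem(w) for w in re.findall(r'[a-z]+', text.lower())
             if w not in STOPWORDS]
    # Step 2: select the most frequent stems (semantic expansion omitted here).
    frequent = {term for term, _ in Counter(stems).most_common(top_terms)}
    # Step 3: keep the sentences that contain at least one frequent stem.
    summary = [s for s in sentences
               if frequent & {stemmer.stem(w) for w in re.findall(r'[a-z]+', s.lower())}]
    return ' '.join(summary)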

2.1.7 Summarizing tourist destinations by mining user-generated travelogues and photos

Yanwei Pang (2011) presents a framework for summarizing tourist
destinations by leveraging the rich textual and visual information in a large amount of
user-generated travelogues and photos on the Web. The proposed framework first
discovers location-representative tags from travelogues and then selects relevant and
representative photos to visualize these tags. Finally, the tags and photos are
organized appropriately to generate a representative and comprehensive summary
which describes a given destination both textually and visually. Experimental results
based on a large collection of travelogues and photos show the effectiveness of the
proposed destination summarization framework.

2.1.8 Evolutionary optimization algorithm for summarizing multiple documents

Alguliev et al. (2013) suggested an optimization approach named OCDsum-SaDE
for generic document summarization. This approach deals with content coverage and
redundancy at the same time: it can directly extract important sentences from the
given collection, thus covering the relevant portion of the original documents while
reducing redundancy in the summary. A self-adaptive differential evolution (DE)
algorithm is developed for solving the optimization problem. One of the key problems
while summarizing multiple documents is redundancy; this method addresses all three
aspects of summarization: content coverage, diversity and length. Storn and Price
(1997) proposed the population-based algorithm named DE, which is similar to
Genetic Algorithms (GA) and uses crossover, mutation and selection operators. In this
self-adaptive DE approach, the search begins with a group of individuals randomly
selected from the decision space. The crossover operator is invoked to enhance the
diversity of the parameter vectors, the mutation operation is used as a search method,
and the selection operator directs the search towards promising areas of the search
space. Conclusion: the proposed method achieves competitive performance, and
statistical results show that it performs better than other baseline methods.

2.1.9 An Approach to Automatic Text Summarization Using the Simplified Lesk
Algorithm and WordNet

Pal and Maiti (2013) proposed an approach in which a single-document input
text is summarized to a given percentage of its length using unsupervised learning.
First, the Simplified Lesk approach is applied to each sentence to find its weight.
Next, the sentences with the derived weights are arranged in descending order of
weight. Then, according to the specific percentage of summarization required at a
particular instance, a certain number of sentences are selected for the summary.
Lastly, the selected sentences are rearranged according to their original sequence in
the input text; a sketch of these selection and reordering steps is given below. The
proposed approach is based on the semantic information of the extracts in a text, so
parameters such as formatting and the positions of different units in the text are not
taken into account.
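
The sketch below illustrates the selection and reordering steps in Python; the actual Simplified Lesk weighting (based on WordNet gloss overlap) is replaced by a placeholder word-frequency weight, since the point being shown is the percentage-based selection and the restoration of the original sentence order.

import re
from collections import Counter

def placeholder_weight(sentence, doc_freq):
    # Stand-in for the Simplified Lesk weight of a sentence.
    return sum(doc_freq[t] for t in re.findall(r'[a-z]+', sentence.lower()))

def summarize_by_percentage(text, percent=40):
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    doc_freq = Counter(re.findall(r'[a-z]+', text.lower()))
    # Arrange sentences in descending order of weight.
    ranked = sorted(enumerate(sentences),
                    key=lambda pair: placeholder_weight(pair[1], doc_freq),
                    reverse=True)
    keep = max(1, round(len(sentences) * percent / 100))
    chosen = sorted(ranked[:keep])           # restore the original sequence
    return ' '.join(sent for _, sent in chosen)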

2.1.10 A Context Based Text Summarization System

Rafael Ferreira (2014) advocates the thesis that the quality of the summary
obtained with combinations of sentence-scoring methods depends on the text subject.
This hypothesis is evaluated using three different contexts: news, blogs and articles.
Both quantitative and qualitative measures are used to evaluate which combination of
sentence-scoring methods yields better results for each context. The best combination
for short and well-formed texts (news) is word frequency, TF-IDF, sentence position
and resemblance to the title. Blogs, being short and unstructured texts, also achieve
good results using word frequency and TF-IDF; however, unlike news, combining
these methods with the TextRank score and sentence length improves the results. In
the case of scientific articles, the best combinations include cue-phrase, sentence
position, TF-IDF and resemblance to the title.
2.2 LITERATURE SURVEY TABLE

Title: Trained summarizer and latent semantic analysis for summarization of text
Author and year: Yeh et al. (2005)
Pros: The LSA+TRM approach generates a summary of semantically related sentences. The approaches are language-independent.
Cons: Summaries lack coherence and cohesion most of the time. The feature weights of the score function generated by GA do not always give good performance results for the test corpus.

Title: Text understanding and summarization through document concept lattice
Author and year: Ye et al. (2007)
Pros: DCL focuses on semantics and employs only reliable features. The sentences are coherent and represent important and different local topics, and the summary is generated with the least loss of answers. The evaluation framework does not require human-made summaries.
Cons: Computation cost is high for generating a complete DCL because all possible combinations of concepts are observed.

Title: Evolutionary optimization algorithm for summarizing multiple documents
Author and year: Alguliev et al. (2013)
Pros: This approach reduces redundancy in the summaries, selects important sentences from the document and includes relevant content of the original document.
Cons: The runtime complexity of DE, a population-based stochastic search method, is higher.

Title: Statistical and linguistic based summarization system for multiple documents
Author and year: Ferreira et al. (2014)
Pros: Apart from using statistical and semantic similarities, this approach linguistically treats the input text by performing discourse analysis and co-reference resolution. As it is an unsupervised approach, it does not require an annotated corpus.
Cons: This system strives to search for important sentences in groups of different topics and hence suffers from the problem of sentence ordering.

Title: Graph-based extractive summarization by considering importance, non-redundancy and coherence
Author and year: Parveen and Strube (2015)
Pros: This approach does not depend on any parameters or training data, as it is an unsupervised technique, and the summary, being coherent, is of good quality.
Cons: This approach is capable of generating a summary from a single document only.

Title: Tag Clouds for Summarizing Web Search Results
Author and year: Thomas Hentrich, Benjamin M. Good, Mark D. Wilkinson (2007)
Pros: The tag cloud interface is advantageous in presenting descriptive information and in reducing user frustration.
Cons: It is less effective at the task of enabling users to discover relations between concepts.

Title: Summarizing tourist destinations by mining user-generated travelogues and photos
Author and year: Yanwei Pang (2011)
Pros: The tags and photos are organized appropriately to generate a representative and comprehensive summary which describes a given destination both textually and visually. Experimental results based on a large collection of travelogues and photos show the effectiveness of the proposed destination summarization framework.
Cons: Performance of photo selection and user experience is lacking.

3. PAPER CLASSIFICATION AND CATEGORIZATION

Summarization approach and its author: Trained summarizer and latent semantic analysis for summarization of text (Yeh et al. 2005)
Dataset used: 100 political articles from New Taiwan Weekly
Evaluation measure: Precision, Recall and F-measure
Baseline approaches: CBA and MCBA, LSA-TRM

Summarization approach and its author: Text understanding and summarization through document concept lattice (Ye et al. 2007)
Dataset used: DUC 2005 and DUC 2006
Evaluation measure: ROUGE-2 and ROUGE-SU4
Baseline approaches: Techniques available for clustering and scoring of sentences

Summarization approach and its author: Summarization of emails through conversational cohesion and subjective opinions (Carenini et al. 2008)
Dataset used: Enron email dataset
Evaluation measure: Sentence pyramid precision, ROUGE-2 and ROUGE-L
Baseline approaches: CWS, CWS-Cosine, CWS-lesk, CWS-jcn, PR-Clue, PR-lesk, PR-Cosine, PR-jcn, OpFind, OpBear

Summarization approach and its author: Evolutionary optimization algorithm for summarizing multiple documents (Alguliev et al. 2013)
Dataset used: DUC 2002 and DUC 2004
Evaluation measure: ROUGE-1, ROUGE-2, ROUGE-L and ROUGE-SU
Baseline approaches: DUCbest, Random, FGB, LSA, BSTM, LexRank, Centroid, MCKP, WFS-NMF and WCS
Text summarization is an interesting research field with a wide range of
applications. In particular, more focus has been placed here on the extractive
approaches to text summarization developed in the last decade. Even though it is
impossible to describe all of the different methods and algorithms thoroughly within
the limits of this article, it should give a rough overview of the current progress in the
field of text summarization.
