
Automated Text Summarization

CHAPTER : 1

INTRODUCTION

1.1 ABOUT SUMMARIZATION

The first and most obvious question that arises is: what actually is summarization?
According to Noah Bubenhofer:

“Summarization is the basic process of creating a summary text,
which is a derivative of the source text condensed by selection
and/or generalization on important content.”

The process of summarizing a text therefore ranges from interpreting the source text, grasping
and analyzing its meaning; then representing that meaning with respect to certain areas of
information; and finally creating a summary of what has been understood and what needs to be
represented. In other words, the input is the source text to be condensed, and the output is a
short, content-rich summary.

The basic aim of summarization is thus to grasp the essential meaning of a text in a short time
and in the form of a relatively shorter text. This shorter text, a subset of the original, may not
convey every minute detail, but it aims to convey the central idea of the text presented for
summarization. Pictorially, this process is represented in fig. 1.1.

B.E. (I.T.) 2007-08. 1



Fig. 1.1

Automatic text summarization can be used:

• To summarize news to SMS or WAP format for mobile phones/PDAs.
• To let a computer synthetically read the summarized text; written text can be too long and
boring to listen to.
• In search engines, to present compressed descriptions of the search results (see the
Internet search engine Google).
• In keyword-directed subscription of news items, which are summarized and pushed to the
user (see, e.g., Nyhetsguiden, in Swedish).
• To search in foreign languages and obtain an automatically translated summary of the
automatically summarized text.


1.2 HISTORY
Over the past few years, especially with the emergence of the Internet, the exchange of
information has increased immensely, affecting all of us. On the one hand, the scientific
community makes us aware instantly of its scientific breakthroughs while on the other hand,
journalists present reports from around the world in real time. The growing number of electronic
articles, magazines and books that are made available every day puts more pressure on
professionals from every walk of life as they struggle with information overload.

With the increasing availability of information and the limited time people have to sort through it
all, it has become more and more difficult for professionals in various fields to keep abreast of
developments in their respective disciplines. A large portion of all available information today
exists in the form of unstructured texts. Books, magazine articles, research papers, product
manuals, memorandums, e-mails, and of course the Web all contain textual information in
natural language form. Making informed and correct business decisions often involves analyzing
huge piles of such textual information.

By and large, we all have to deal with reviewing large volumes of textual information.
This problem could be solved by the use of “Automated Text Summarization” systems. Such
systems have been under research for over 50 years; the vision is a system that can model the
information-processing capabilities of the human brain. In general, systems based on traditional
text summarization approaches analyzed a natural language text at the level of individual
sentences. The objective was to create a semantic representation of a sentence in the form of
structured relations between the important words comprising it.

To solve this task, various previously developed linguistic molds were tried with the sentence
and its components. When a mold matched the sentence well, a corresponding semantic
construction was associated with the sentence. This technique provides a good first guidance for
understanding the meaning of a text. But as it turns out, the main problem with this approach is
that there can be too many different molds that one needs to build for analyzing different types of
sentences. In addition, the list of exceptional constructions in this approach quickly grows


prohibitively large. In other words, this approach works well only for a limited subset of natural
language texts.

1.3 BUSINESS PERSPECTIVE

Due to the huge amount of unstructured information available, it becomes a very tedious job for a
professional to find relevant information, and a large number of man-hours are lost in simply
sorting it out. Automated Text Summarization can take care of this problem: it can provide
intelligent summaries of texts automatically. These summaries convey the content of a text
without the need to go through it in its entirety.

For some professions, automated intelligent text analysis capabilities can be critical. An
automated text summarization function could be used by magazine editors, political and business
analysts, venture capitalists, lawyers, and students who wish to see accurate summaries before
plunging into the full documents. Efficient navigation through a text base, as well as clustering
and classification of texts, could enhance the effectiveness of working with large text bases,
including academic documents (for researchers), electronic news flow (for marketers and
investment bankers), and e-mail systems (for all users).

The semantic information retrieval capability could save millions of man-hours by increasing the
relevance and precision of a database search or Internet surfing. Clustering a collection of
documents that represents the press reaction to the latest marketing moves of your company and
your competitors could help assess the effectiveness of your marketing campaign. A combination
of all of these functions with a natural language information retrieval capability could facilitate a
new generation of powerful and intelligent corporate Help Desk and Call Support Center
solutions.


CHAPTER : 2

CHALLENGES

The main challenges for text summarization are the following:

• Document type
• Document language
• Document style

2.1 DOCUMENT TYPE

It is true that a summary of a large article is more useful than going through the entire text of the
article: the summary effectively presents the article in brief. For example, a doctor would find the
summary of a 50-page article on a medical treatment more useful than the résumé of a one-page
article on the same topic. Unfortunately, the difficulty of producing a high-quality summary
grows with the length of the document. It is far more difficult to summarize a long novel such as
Charles Dickens’ “A Tale of Two Cities” than to summarize a news article, and an automated
system faces the same challenge. Moreover, the automated system must proceed rapidly.

The automated system must also be able to process documents of the different formats (such as
Word, plain text, PDF, Web pages, e-mail) it might come across.

2.2 DOCUMENT LANGUAGE

2.2.1 MULTILINGUAL FACTOR:


Summarizing documents would be a much easier task if everybody spoke the same language.
This is evidently not the case, as thousands of languages are spoken worldwide. Challenges posed
by the multilingual factor originate from concrete aspects of grammar and syntax. Consider the
following examples: in German, new words can be created by combining existing ones; Japanese
does not separate words with spaces; and in English, the use of the em dash (—) is common while
rarely, if ever, seen in French. These are just a few examples.
If summarizing a text in German is feasible for a German-speaking person, an automated


solution must do better by taking into account the subtleties inherent in a language and still
generate summaries of high quality.

2.2.2 SEMANTIC FACTOR:


Ambiguity in the vocabulary constitutes another challenge for the summarization process.
Synonyms and related words are sources of vocabulary ambiguity. The word “capital”, for
example, has very different meanings according to the context: geographical (capital of a
country), economic (capital gain), linguistic (capital letters). This ambiguity also appears on
another level in the form of idiomatic expressions. For example, if the author of a document
writes “dancing to a different tune”, it is not meant to be taken literally of course. For all of these
instances, summarization technologies must be able to automatically make use of the context to
ascertain the true meaning and differentiate the words and expressions.

2.3 DOCUMENT STYLE:

2.3.1 HUMAN FACTOR:


Human intervention introduces still more complexity to the summarization process.
Grammar and syntax serve to provide structure to a written language. By following the rules of
grammar and syntax, people construct sentences that are easy to read and whose meaning is
unambiguous. Of course, writing is essentially a creative process. Furthermore, many people
who write texts pay little attention to questions of grammar and syntax.

Newsgroups are a case in point: while it is true that some newsgroup messages are clear and
concise, many others are characterized by poorly constructed sentences, slang and typing errors.
The text summarization process must take the human factor into consideration and be able to
generate superior summaries from poorly written texts as well as from Pulitzer Prize-winning works.

Optimized summarization technologies must be able to cast the net wide enough to grasp the
essential ideas contained in a document, much as a person would do, regardless of the document
type, language or style. Such technologies must process the ideas and present them to the reader
in such a way that the result is a faithful representation of the original text. By reading the
resulting résumé, the reader will know if the original document should be read in its entirety.


CHAPTER : 3

BASIC PREPARATION OF TEXT

FIG 3.1
As can be inferred from fig 3.1, whenever a text in the form of a document or query is submitted
to the ATS (Automated Text Summarization) system, it is first optimized as per our requirements
and then subjected to various algorithms that extract the summary from it.
The basic steps in preparation of the text are: -
• Document Standardization
• Language Detection
• Sentence Boundary Recognition
• Long Document Segmentation
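The four steps above can be sketched as a minimal pipeline. This is an illustrative sketch only, not an actual ATS implementation: each stage is a stub standing in for the much richer processing described in sections 3.1 to 3.4.

```python
import re

# Minimal sketch of the text-preparation pipeline described above.
# Each stage is a placeholder for the richer logic a real ATS system uses.

def standardize(raw: str) -> str:
    # Stage 1: normalize whitespace into a common plain-text format.
    return " ".join(raw.split())

def detect_language(text: str) -> str:
    # Stage 2: placeholder -- assume English here; see section 3.2.
    return "en"

def split_sentences(text: str) -> list:
    # Stage 3: naive boundary recognition on '.', '!' and '?'.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def segment(sentences: list, size: int = 3) -> list:
    # Stage 4: cut a long document into fixed-size groups of sentences.
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

def prepare(raw: str):
    text = standardize(raw)
    lang = detect_language(text)
    return lang, segment(split_sentences(text))

lang, segments = prepare("First sentence.  Second one! Third?   Fourth.")
print(lang, len(segments))  # en 2
```

The segment size of three sentences is an arbitrary illustrative choice; real systems segment by topic shifts, as section 3.4 describes.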


3.1 DOCUMENT STANDARDIZATION:

The first step in the summarization process is the “standardization” of document content:
documents, which exist in a variety of formats, must be converted to a common text format
before their content can be interpreted. There are various text formats available and the ATS
system must be able to work on all of them, so every document is converted to a standard
format. This includes not only pure text formats but also richer representations of text such as
the PDF format.

3.2 LANGUAGE DETECTION:

Documents can be in a number of languages, and the ATS system must be able to identify the
language immediately. Once the language is identified, specific linguistic principles, rule systems
and powerful heuristics are applied based on the rules of the given language. Many ATS
systems employ linguistic algorithms designed to process written language in “real life”
settings; this involves statistical data about word usage gleaned from studying thousands
of documents.
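One lightweight way to identify a language, sketched here as an assumption rather than the method the systems above actually use, is to count hits against small per-language stop-word lists; the language with the most hits wins. The word lists below are tiny illustrative samples.

```python
# Sketch of stop-word-based language identification: the language whose
# stop words appear most often in the text wins. The lists are tiny
# illustrative samples, not complete stop-word inventories.

STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "fr": {"le", "la", "et", "est", "de", "dans"},
    "de": {"der", "die", "und", "ist", "von", "frau"},
}

def detect_language(text: str) -> str:
    words = text.lower().split()
    # Count how many tokens fall into each language's stop-word set.
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(detect_language("the cat is in the garden"))    # en
print(detect_language("le chat est dans le jardin"))  # fr
```

Production systems use character n-gram statistics trained on large corpora, which handle short or mixed inputs far better than this sketch.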

3.3 SENTENCE BOUNDARY RECOGNITION

Punctuation marks are oftentimes a source of ambiguity, thus causing a real problem for
automated systems in determining the beginning and the end of sentences. Automated text
summarization technologies implement a wide range of heuristics to isolate sentences, bulleted
lists, and special strings such as e-mail addresses and scientific formulas. In addition, they
tokenize each and every word according to the context in order to identify actions, people, places
and things.
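The boundary-recognition heuristics described above can be sketched as follows. The abbreviation list and the e-mail rule are illustrative assumptions standing in for the much wider range of heuristics a real system implements.

```python
# Sketch of heuristic sentence splitting: suppress boundaries after common
# abbreviations and inside e-mail addresses before cutting on terminal
# punctuation. The abbreviation list is illustrative only.

ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "fig."}

def split_sentences(text: str):
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        ends = tok.endswith((".", "!", "?"))
        # Do not split after abbreviations or on tokens that are addresses.
        if ends and tok.lower() not in ABBREVIATIONS and "@" not in tok:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "Contact alice@example.com for help. Dr. Smith will reply."
print(split_sentences(text))
```

Here the period after "help." ends a sentence, while the periods in "Dr." and in the e-mail address do not, which is exactly the kind of ambiguity the section describes.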


3.4 LONG DOCUMENT SEGMENTATION

Once the key concepts are determined, the text summarization technologies formulate a sort of
“picture” of the overall document, and then proceed to divide it into its constituent text segments
(refer fig 3.2).

Fig 3.2 DOCUMENT SEGMENTATION

These steps are necessary because they make possible the generation of summaries regardless of
the length of the original document (while long documents remain a major obstacle for
competing systems). This rather delicate operation subsequently triggers a separate analysis of
each segment (refer fig 3.3), which is then followed by the integration of all the composite
segments into a single, complete representation of the original document.


Fig 3.3 ANALYSIS OF SEGMENTS AND INTEGRATION OF COMPOSITE SEGMENTS


CHAPTER: 4

SUMMARIZATION APPROACHES

There are a number of summarization approaches that can be implemented in the ATS systems.
Each developer of a system uses the approach they deem best. A few of the more popular
approaches are given below:

• Multi Document vs. Single Document
• Query Based vs. Generic
• Informative vs. Indicative
• Monolingual vs. Multilingual

4.1 MULTI DOCUMENT VS SINGLE DOCUMENT

Most ATS systems function in the single document context, where a single document is
condensed to a shorter form. In the context of information retrieval, we have multiple documents
that are returned by a single search request. To generate a single output that summarizes the
salient points across these multiple documents is more difficult. Since the documents are related
by a common query, they likely contain similar content, thus a system cannot simply concatenate
many single document summaries together, because repetition of salient points would result. If
ATS is to be a successful methodology for information retrieval, a system that can handle
repetition in multiple documents is a prerequisite.

4.2 QUERY BASED VS GENERIC

ATS systems often produce generic summaries that highlight the most salient points of a given text.
However, in the online search and retrieval context, an ATS system has access to the query given by the
user and should adapt its output to suit the user's declared information need. There are many instances
when a rational IR framework finds query keywords in only a subsection of a larger document. Showing


this relationship between the query terms and the document has been proven to be an important factor.
While it would be acceptable to store generic document summaries and present them in an IR
system, a more favorable approach is to produce per-query customized summaries.

4.3 INFORMATIVE VS INDICATIVE

Informative summaries provide information on the salient aspects of a document, seeking to
cover as many topics as possible. These summaries omit detail and supporting information and
cover just the most important points of the document. Summaries of this type are often used in
place of the document as an overview, and are suitable for fulfilling a user's information need if
the user is browsing for information or has a general interest in the subject of the document.

Indicative summaries, on the other hand, are meant to only hint at the contents of the document.
In the IR context, indicative summaries play an interesting role because they help the user in
judging the relevance of the document, and in determining whether to consider full-text retrieval.
They assist a user who is searching for information and has a specific information need.

4.4 MONOLINGUAL VS MULTILINGUAL

ATS systems also face the difficulty of multiple languages (multilingual input) in a text
document, which is more difficult and complex to handle than single-language (monolingual)
text documents. For multilingual input the ATS system not only has to identify the different
languages but also has to apply the different heuristics applicable to each language. Since many
documents contain more than a few multilingual subsections, it becomes necessary for the
system to be able to identify them. This is especially true of applications such as web search.


CHAPTER: 5

TECHNIQUES OF SUMMARIZATION

Fig. 5.1

As can be inferred from fig. 5.1, there are a number of modules and techniques for
implementing an automated text summarization system. Some use frequency counts to judge
the importance of a sentence while others use ranking mechanisms; other algorithms may
use headings or named entities to find the important portions of the text. Some of the
techniques of summarization are given below:
• Exploitation of Named Entities
• Machine Learning technique


5.1 EXPLOITATION OF NAMED ENTITIES:

Named Entities are often seen as important cues to the topic of a text. They are among the most
information dense tokens of the text and largely define the domain of the text. Therefore, Named
Entity Recognition should greatly enhance the identification of important text segments when
used by an (extraction-based) automatic text summarizer. The technique also exploits the
presence of bold tags used to emphasize parts of the text, and headings are given higher weight.
This approach is particularly useful for newspaper articles and journals, as their topics easily
convey the context and direction of the text.

In newspaper text the most relevant information is always presented at the top; in some cases the
articles are even written to be cuttable from the bottom. Because of this we can use the Position
Score concept: sentences at the beginning of the text are given higher scores than later ones.
Sentences that contain keywords are scored high, where a keyword is an open-class word with a
high Term Frequency (tf). Sentences containing numerical data are also considered to carry
important information. All of the above parameters are put into a naïve combination function
with modifiable weights to obtain the total score of each sentence.
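The naïve combination function can be sketched as below. The weights 0.5/0.3/0.2 and the "frequency ≥ 2" keyword cutoff are illustrative assumptions, not values from the system described above.

```python
import re
from collections import Counter

# Sketch of the naive combination function: Position Score, keyword
# frequency and a numerical-data bonus are mixed with modifiable weights.

def score_sentences(sentences, w_pos=0.5, w_kw=0.3, w_num=0.2):
    words = [w.lower() for s in sentences for w in re.findall(r"[A-Za-z]+", s)]
    tf = Counter(words)
    keywords = {w for w, c in tf.items() if c >= 2}   # crude high-tf keywords
    n = len(sentences)
    scores = []
    for i, s in enumerate(sentences):
        pos = (n - i) / n                             # earlier sentences score higher
        toks = [w.lower() for w in re.findall(r"[A-Za-z]+", s)]
        kw = sum(t in keywords for t in toks) / max(len(toks), 1)
        num = 1.0 if re.search(r"\d", s) else 0.0     # numerical-data bonus
        scores.append(w_pos * pos + w_kw * kw + w_num * num)
    return scores

docs = ["Profits rose sharply.",
        "Profits reached 12 million.",
        "The weather was nice."]
print(score_sentences(docs))
```

The off-topic third sentence receives the lowest score: it is late in the text, has no high-frequency keywords and contains no numbers.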

The extract summaries generated with the Named Entity technique were then manually compared,
at sentence level, with the extracts generated by a sample group of people with majority vote.
With Named Entity Recognition, the generated summaries and the majority-vote standard
summaries had only 33.9% of their sentences in common. On the other hand, without Named
Entity Recognition the generated summaries shared as many as 57.2% of the sentences with the
standard summaries. This figure does not tell us how good the summaries were; it simply tells us
how well the technique mimics human selection.

PROBLEMS:

There are some problems with the use of the Named Entity technique. They are listed below:


a) Reference errors – The most common problem of extraction-based summarization is the
presence of reference errors due to removed antecedents. This problem is of two types:
wrong antecedents, and absent or missing antecedents.

b) Loss of background information – The technique tends to prioritize elaborative sentences
over introductory ones, and thus is sometimes responsible for serious losses of sentences
giving background information. One way of dealing with this problem is of course to raise
the size of the extraction unit: if we raise the extraction unit to encompass, for example,
paragraphs instead of sentences, the system would identify and extract only the most
important paragraphs.

c) Condensed redundancy – When no weighting of Named Entities is carried out, clusters of
interrelated sentences tend to get extracted because of their large number of common
words. This gives high cohesion throughout the summary but sometimes leads to problems
with condensed redundancy. When summarizing with weighting of Named Entities, the
resulting summaries sometimes seem very repetitive but are in fact generally less redundant
than the ones created without weighting of Named Entities.

When producing informative summaries for immediate consumption, for example in a news
surveillance or business intelligence system, the background may often be more or less well
known. In this case the most important parts of the text are what is new and which participants
play a role in the scenario. Here Named Entity Recognition can be helpful in highlighting the
different participants and their respective role in the text. Other suggested and applied methods
of solving the coherence problem are, as we have seen, to raise the extraction unit to the level of
paragraphs or to use a very low, almost insignificant, weight on Named Entities.

5.2 MACHINE LEARNING TECHNIQUES:

We present a summarization procedure based on the application of trainable Machine Learning
algorithms which employ a set of features extracted directly from the original text. These
features are of two kinds: statistical, based on the frequency of some elements in the text; and
linguistic, extracted from a simplified argumentative structure of the text. We also present some
computational results obtained with the application of our summarizer to some well-known text
databases.

One important task in this field is automatic summarization, which consists of reducing the size
of a text while preserving its information content. A summarizer is a system that produces a
condensed representation of its input for user consumption. Summary construction is, in
general, a complex task which ideally would involve deep natural language processing
capacities. In order to simplify the problem, current research is focused on extractive-summary
generation. An extractive summary is simply a subset of the sentences of the original text. These
summaries do not guarantee a good narrative coherence, but they can conveniently represent an
approximate content of the text for relevance judgment.

A summary can be employed in an indicative way – as a pointer to some parts of the original
document, or in an informative way – to cover all relevant information of the text. In both cases
the most important advantage of using a summary is its reduced reading time.

Summary generation by an automatic procedure also has other advantages:

(i) the size of the summary can be controlled;
(ii) its content is deterministic;
(iii) the link between a text element in the summary and its position in the original text can
be easily established.

A frequently employed text model is the vectorial model. After the preprocessing step, each text
element – a sentence, in the case of text summarization – is considered as an N-dimensional
vector, so it is possible to use some metric in this space to measure similarity between text elements.


The most commonly employed metric is the cosine measure, defined for vectors x and y as:

cos θ = ⟨x, y⟩ / (|x| · |y|)

where ⟨·, ·⟩ indicates the scalar product and |x| indicates the modulus of x.

Maximum similarity therefore corresponds to cos θ = 1, whereas cos θ = 0 indicates total
discrepancy between the text elements.
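For sentences represented as bag-of-words count vectors, the cosine measure can be computed directly, as in this small sketch:

```python
import math

# The cosine measure, for text elements represented as bag-of-words
# count vectors (dictionaries mapping word -> count).

def cosine(x: dict, y: dict) -> float:
    dot = sum(x[w] * y.get(w, 0) for w in x)          # scalar product <x, y>
    nx = math.sqrt(sum(v * v for v in x.values()))    # |x|
    ny = math.sqrt(sum(v * v for v in y.values()))    # |y|
    return dot / (nx * ny) if nx and ny else 0.0

a = {"text": 1, "summary": 1}
print(round(cosine(a, a), 6))     # 1.0 -- identical vectors, maximum similarity
print(cosine(a, {"weather": 2}))  # 0.0 -- no shared words, total discrepancy
```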

This technique proposes an automatic procedure to generate reference summaries: if each
original text contains an author-provided summary, the corresponding size-K reference extractive
summary consists of the K sentences most similar to the author-provided summary, according to
the cosine measure. Using this approach it is easy to obtain reference summaries, even for big
document collections.
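The reference-extract construction can be sketched as follows; the whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter

# Sketch of automatic reference-summary construction: the size-K reference
# extract is the K sentences most similar (by cosine) to the author summary.

def vec(text):
    return Counter(text.lower().split())

def cosine(x, y):
    dot = sum(x[w] * y.get(w, 0) for w in x)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def reference_extract(sentences, author_summary, k):
    sv = vec(author_summary)
    ranked = sorted(sentences, key=lambda s: cosine(vec(s), sv), reverse=True)
    return ranked[:k]

sentences = [
    "the model improves accuracy on news data",
    "we thank the funding agencies",
    "accuracy improves with more news data",
]
print(reference_extract(sentences, "improved accuracy on news", 2))
```

The acknowledgements sentence shares no content words with the author summary, so it is the one left out of the size-2 reference extract.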

The set of employed features is as follows:
• Sentence Length - This feature is employed to penalize sentences that are too
short, since these sentences are not expected to belong to the summary [7]. We
use the normalized length of the sentence, which is the ratio of the number of
words occurring in the sentence over the number of words occurring in the
longest sentence of the document.

• Sentence Position - This feature can involve several items, such as the
position of a sentence in the document as a whole, its position in a section, in
a paragraph, etc. We use here the percentile of the sentence position in the
document; the final value is normalized to take on values between 0 and 1.
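The sentence-length and sentence-position features can be sketched as below; whitespace word counting is a simplifying assumption.

```python
# Sketch of the first two features: normalized sentence length (ratio to the
# longest sentence) and normalized position in the document, both in [0, 1].

def length_feature(sentences):
    counts = [len(s.split()) for s in sentences]
    longest = max(counts)
    return [c / longest for c in counts]

def position_feature(sentences):
    n = len(sentences)
    return [i / (n - 1) if n > 1 else 0.0 for i in range(n)]

docs = ["Short one.",
        "This sentence is considerably longer than the first.",
        "Medium length here."]
print(length_feature(docs))    # [0.25, 1.0, 0.375]
print(position_feature(docs))  # [0.0, 0.5, 1.0]
```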

• Sentence-to-Sentence Cohesion - This feature is obtained as follows: for


each sentence s we first compute the similarity between s and each other sentence
s’ of the document; then we add up those similarity values, obtaining the raw


value of this feature for s; the process is repeated for all sentences. The
normalized value (in the range [0, 1]) of this feature for a sentence s is obtained
by computing the ratio of the raw feature value for s over the largest raw feature
value among all sentences in the document. Values closer to 1.0 indicate sentences
with larger cohesion.
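A sketch of the sentence-to-sentence cohesion computation, using whitespace bag-of-words vectors as a simplifying assumption:

```python
import math
from collections import Counter

# Sketch of the sentence-to-sentence cohesion feature: sum each sentence's
# cosine similarity to all other sentences, then normalize by the maximum.

def cosine(x, y):
    dot = sum(x[w] * y.get(w, 0) for w in x)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def cohesion_feature(sentences):
    vecs = [Counter(s.lower().split()) for s in sentences]
    raw = [sum(cosine(v, u) for j, u in enumerate(vecs) if j != i)
           for i, v in enumerate(vecs)]
    top = max(raw)
    return [r / top if top else 0.0 for r in raw]

docs = ["cats like milk", "cats like fish", "stocks fell today"]
print(cohesion_feature(docs))  # the off-topic third sentence scores 0.0
```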

• Sentence-to-Centroid Cohesion - This feature is obtained for a sentence s


as follows: first, we compute the vector representing the centroid of the document,
which is the arithmetic average over the corresponding coordinate values of all
the sentences of the document; then we compute the similarity between the
centroid and each sentence, obtaining the raw value of this feature for each
sentence. The normalized value in the range [0, 1] for s is obtained by computing
the ratio of the raw feature value over the largest raw feature value among all
sentences in the document. Sentences with feature values closer to 1.0 have a
larger degree of cohesion with respect to the centroid of the document, and so are
supposed to better represent the basic ideas of the document.
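The centroid cohesion feature can be sketched as follows, again with whitespace bag-of-words vectors as a simplifying assumption:

```python
import math
from collections import Counter

# Sketch of the sentence-to-centroid cohesion feature: the centroid is the
# coordinate-wise average of all sentence vectors; each sentence is scored
# by its cosine similarity to the centroid, normalized by the maximum.

def cosine(x, y):
    dot = sum(x[w] * y.get(w, 0) for w in x)
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def centroid_feature(sentences):
    vecs = [Counter(s.lower().split()) for s in sentences]
    vocab = set().union(*vecs)
    centroid = {w: sum(v.get(w, 0) for v in vecs) / len(vecs) for w in vocab}
    raw = [cosine(v, centroid) for v in vecs]
    top = max(raw)
    return [r / top for r in raw]

docs = ["summarization reduces text",
        "summarization keeps key text",
        "unrelated footnote"]
print(centroid_feature(docs))  # the off-topic sentence scores lowest
```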

• Depth in tree - It is a consensus that the generation and analysis of the
complete rhetorical structure of a text would be impossible at the current state of
the art in text processing. In spite of this, some methods based on a surface
structure of the text have been used to obtain good-quality summaries. To obtain
this approximate structure we first apply to the text an agglomerative clustering
algorithm. The basic idea of this procedure is that similar sentences must be
grouped together, in a bottom-up fashion, based on their lexical similarity. As a
result a hierarchical tree is produced, whose root represents the entire document.
This tree is binary, since at each step two clusters are grouped. This feature for a
sentence s is the depth of s in the tree.
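The bottom-up tree construction can be sketched as below. The greedy pairwise merging and the shared-word-count similarity are simplifying assumptions standing in for the actual agglomerative clustering algorithm.

```python
from collections import Counter
from itertools import combinations

# Sketch of the depth-in-tree feature: greedily merge the two most similar
# clusters (bottom-up, by lexical overlap) into a binary tree, then report
# each sentence's depth below the root.

def overlap(a, b):
    # Crude lexical similarity: number of shared word occurrences.
    return sum((a & b).values())

def depth_feature(sentences):
    # Each cluster: (bag of words, {sentence_index: depth below this cluster}).
    clusters = [(Counter(s.lower().split()), {i: 0}) for i, s in enumerate(sentences)]
    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: overlap(clusters[p[0]][0], clusters[p[1]][0]))
        (ba, ma), (bb, mb) = clusters[i], clusters[j]
        # Merging adds one level above every sentence in both subtrees.
        merged = (ba + bb, {k: d + 1 for k, d in {**ma, **mb}.items()})
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    root = clusters[0][1]
    return [root[i] for i in range(len(sentences))]

docs = ["cats like milk", "cats like fish", "markets fell"]
print(depth_feature(docs))  # [2, 2, 1]
```

The two lexically similar sentences merge first and end up deeper in the tree than the outlier, which joins only at the root.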

• Indicator of main concepts - This is a binary feature, indicating whether or


not a sentence captures the main concepts of the document. These main concepts
are obtained by assuming that most of the relevant words are nouns. Hence, for each


sentence, we identify its nouns using part-of-speech tagging software [3]. For each noun
we then compute the number of sentences in which it occurs. The fifteen nouns with the
largest occurrence counts are selected as being the main concepts of the text.
Finally, for each sentence the value of this feature is considered “true” if the
sentence contains at least one of those nouns, and “false” otherwise.
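The feature can be sketched as below. As the original procedure relies on a part-of-speech tagger, a hand-supplied noun set stands in for it here, and the top-N cutoff is a parameter (the text uses fifteen).

```python
from collections import Counter

# Sketch of the main-concepts feature: keep the top-N nouns by the number
# of sentences they occur in, then flag sentences containing any of them.
# The `nouns` set is a stand-in for part-of-speech tagger output.

def main_concept_feature(sentences, nouns, top_n=15):
    tokenized = [set(s.lower().split()) for s in sentences]
    df = Counter()                      # per-noun sentence-occurrence counts
    for toks in tokenized:
        for w in toks & nouns:
            df[w] += 1
    main = {w for w, _ in df.most_common(top_n)}
    return [bool(toks & main) for toks in tokenized]

docs = ["the summary keeps key sentences",
        "sentences are ranked by score",
        "finally we conclude"]
nouns = {"summary", "sentences", "score"}
print(main_concept_feature(docs, nouns, top_n=2))  # [True, True, False]
```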

• Occurrence of non-essential information. We consider that some words


are indicators of non-essential information. These words are speech markers such
as “because”, “furthermore”, and “additionally”, and typically occur in the
beginning of a sentence. This is also a binary feature, taking on the value “true” if
the sentence contains at least one of these discourse markers, and “false”
otherwise.
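This binary feature is simple enough to sketch in a few lines; the marker list is the one given above.

```python
# Sketch of the binary non-essential-information feature: true when a
# sentence opens with one of the listed discourse markers.

MARKERS = ("because", "furthermore", "additionally")

def non_essential(sentence: str) -> bool:
    words = sentence.lower().split()
    first = words[0] if words else ""
    return first.strip(",") in MARKERS

print(non_essential("Furthermore, the results improved."))  # True
print(non_essential("The results improved."))               # False
```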

The ML-based trainable summarization framework consists of the following steps:

1. We apply some standard preprocessing information retrieval methods to each document,


namely stop-word removal, case folding and stemming.

2. All the sentences are converted to their vectorial representations.

3. We compute the set of features described in the previous subsection. Continuous features are
discretized: we adopt a simple “class-blind” method, which consists of separating the original
values into equal-width intervals. We did some experiments with different discretization
methods, but surprisingly the selected method, although simple, has produced better results.

4. A trainable ML algorithm is employed; we employ two classical algorithms, namely C4.5 and
Naive Bayes. As usual in the ML literature, these algorithms are trained on a training set and
evaluated on a separate test set.
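The class-blind equal-width discretization of step 3 can be sketched as follows; the bin count is a parameter of the method.

```python
# Sketch of class-blind equal-width discretization: the range of each
# continuous feature is cut into bins of equal width, and each value is
# replaced by its bin index.

def discretize(values, bins=3):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0   # guard against a constant feature
    # Map each value to a bin index in [0, bins - 1].
    return [min(int((v - lo) / width), bins - 1) for v in values]

print(discretize([0.0, 0.2, 0.5, 0.9, 1.0], bins=3))  # [0, 0, 1, 2, 2]
```

"Class-blind" means the cut points ignore the class labels entirely, in contrast to supervised discretization methods.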


5.3 WORD CLUSTERS AND RANKING ALGORITHM TECHNIQUE:

This is a statistical text summarizer based on Machine Learning (ML) and the text-span
extraction paradigm. This approach allows both generic and query-based summaries. However
for evaluation purposes, we present here results for a generic SDS (Single Document
Summarization). For a given document, the system provides a set of unordered extracts which
are supposed to be the most relevant to its topics. Previous work on the application of machine
learning techniques for SDS used the classification framework. Such approaches usually train a
classifier, using a training set of documents and their associated summaries, to distinguish
between summary and non-summary sentences. After training, these systems operate on
unlabeled text by ranking sentences of a new document according to the output of the classifier.
The classifier is learned by comparing its output to a desired output reflecting a global class
information. Under this framework one assumes that all sentences from different documents are
comparable with respect to this class information. This hypothesis holds for scientific articles but
for a large variety of collections, documents are heterogeneous and their summaries depend
much more on the content of their texts than on global class information.

A generic summary of a document has to reflect its key points, so we need statistical features
which give different information about the relevance of sentences for the summary. The features
considered important for a generic summary are grouped into seven categories: indicator phrases
(such as cue words or acronyms), frequency and title keywords, location and sentence-length
cutoff heuristics, and the number of semantic links between a sentence and its neighbors.

Generic queries represent two sources of evidence we use to find relevant sentences in a
document. Since title keywords may be very short, we have employed query-expansion
techniques such as Local Context Analysis (LCA) and thesaurus-based expansion methods (i.e.
WordNet), as well as a learning-based expansion technique.

Expansion via WordNet and LCA: From the title keyword query, we formed two
other queries, reflecting local links between the title keywords and other words in the


corresponding document:

– title keyword and LCA, constituted by the keywords in the title of a document and the most
frequent words in the sentences most similar to the title keyword query according to the
cosine measure;

– title keyword and most frequent terms, constituted by high-frequency document words
and the keywords in the title of a document.

A third expanded query, title keyword and WordNet, is obtained from the title keywords of a
document and their first-order synonyms in WordNet.
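A minimal sketch of this kind of synonym expansion follows. The tiny hand-made synonym table is a stand-in for WordNet (whose lookup would normally go through a library such as NLTK), and the example words are invented:

```python
# Expand a title-keyword query with first-order synonyms.
# SYNONYMS is a hypothetical stand-in for a real WordNet lookup.
SYNONYMS = {
    "car": {"auto", "automobile"},
    "fast": {"quick", "rapid"},
}

def expand_query(title_keywords):
    expanded = set(title_keywords)
    for kw in title_keywords:
        expanded |= SYNONYMS.get(kw, set())  # add first-order synonyms, if any
    return expanded

q = expand_query(["car", "engine"])
```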

5.3.1 EXPANSION WITH WORD CLUSTER:

First, different word clusters are formed based on words co-occurring in the sentences of all
documents in the corpus. To discover these clusters, each word w in the vocabulary V is first
characterized as a p-dimensional vector

w = < n(w, 1), ..., n(w, p) >,

where n(w, i) is the number of occurrences of w in sentence i.

Under this representation, word clustering is then performed with a clustering algorithm that
maximizes the Classification Maximum Likelihood criterion. From these clusters we obtain two
further expanded queries: first, title keyword and term-clusters, by adding to the title keywords
the words in their respective clusters; and second, projected title keyword, by projecting each
sentence of a document and the title keyword query into the space of these word clusters. For
the latter, we characterize each sentence in a document and the title keyword query by a vector
in which each component is the number of occurrences of words from a given cluster in that
sentence or in the title keyword query. The components of this representation thus reflect the
degree to which each word cluster is represented in a given sentence or in the title keyword query.
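The projection step can be sketched as follows. The word clusters here are hand-specified for illustration; a real system would obtain them from the CML-based clustering described above:

```python
# Project a sentence into word-cluster space: one count per cluster.
clusters = [  # hypothetical word clusters
    {"stock", "market", "share"},
    {"game", "team", "player"},
]

def project(sentence_words, clusters):
    # component i = number of sentence words belonging to cluster i
    return [sum(1 for w in sentence_words if w in c) for c in clusters]

v = project(["the", "stock", "market", "and", "the", "team"], clusters)
```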

B.E. (I.T.) 2007-08. 21


Automated Text Summarization

5.3.2 SIMILARITY MEASURES:

The tf-idf representation is used to compute the cosine similarity measure between a query q
and a sentence s as:

Sim1(q, s) = ∑w∈q∩s tf(w, q) tf(w, s) idf²(w) / (||q|| ||s||)

where tf(w, x) is the frequency of word w in x (x standing for q or s),
idf(w) is the inverse document frequency of word w, and ||x|| = √( ∑w∈x (tf(w, x) idf(w))² ).

We also wish to give more weight to sentences containing acronyms, e.g. HMM (Hidden Markov
Models) or NLP (Natural Language Processing). The resulting feature computes the similarity
between the title keywords and sentences using the same similarity measure as Sim1, except that
acronyms are given a higher weight. The resulting similarity measure is written:

Sim2(q, s) = ∑w∈q∩s tf(w, q) tf*(w, s) idf²(w) / (||q|| ||s||)

That is, the term frequency of an acronym is counted twice:

tf*(w, s) = 2 tf(w, s) if w is an acronym, and tf*(w, s) = tf(w, s) otherwise.
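A toy implementation of the two measures follows. The corpus, query, and acronym set are invented, and the norms are computed with the plain tf (one possible reading of the formulas above):

```python
import math

def idf(w, docs):
    # inverse document frequency; 0 for words absent from the corpus
    n = sum(1 for d in docs if w in d)
    return math.log(len(docs) / n) if n else 0.0

def _norm(text, docs):
    return math.sqrt(sum((text.count(w) * idf(w, docs)) ** 2 for w in set(text)))

def sim(q, s, docs, acronyms=()):
    # acronym factor 2 implements tf*(w, s) = 2 tf(w, s) for acronyms
    num = sum(
        q.count(w) * s.count(w) * (2 if w in acronyms else 1) * idf(w, docs) ** 2
        for w in set(q) & set(s)
    )
    denom = _norm(q, docs) * _norm(s, docs)
    return num / denom if denom else 0.0

docs = [["hidden", "markov", "model", "hmm"],
        ["neural", "network", "model"],
        ["cat", "dog"]]
s = ["the", "hmm", "is", "a", "hidden", "markov", "model"]
sim1 = sim(["hmm", "model"], s, docs)                    # Sim1
sim2 = sim(["hmm", "model"], s, docs, acronyms={"hmm"})  # Sim2: acronyms weighted up
```

As expected, the acronym weighting raises the score of a sentence that shares an acronym with the query.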

5.3.3 RANKING FOR TEXT SUMMARIZATION:

In order to combine sentence features, ML approaches for SDS adopt a classification framework.
The motivation for such approaches is that a classification training error of zero implies that the
scores a classifier assigns to relevant/irrelevant sentences are all greater/lower than some constant
c, resulting in an appropriate ranking of sentences. In real-life applications, however, this
classification error is never zero. In that case, for a given document, we cannot predict the rank
of a misclassified sentence relative to the others, because the classification error compares
sentence scores with a constant rather than with each other. A misclassified irrelevant sentence
can therefore receive a higher score than relevant ones. In other words, minimizing the
classification error does not necessarily optimize the ranks of the relevant sentences within a
document.

Algorithms relying on the ML ranking framework are therefore expected to be more effective in
practice for the SDS task. Instead of classifying sentences as relevant/irrelevant, a ranking
algorithm classifies pairs of sentences. More specifically, it considers pairs of sentences (s, s′)
coming from the same document such that exactly one of the two sentences is relevant. The goal
is then to learn a scoring function H under the following assumption: a pair is correctly classified
if and only if the score of the relevant sentence is greater than the score of the irrelevant one.

The error on the pairs of sentences, called the Ranking loss of H, is equal to:

Rloss(D, H) = 1/|D| ∑d∈D 1/(|Sd1| |Sd−1|) ∑s∈Sd1 ∑s′∈Sd−1 [[H(s′) ≥ H(s)]]

where Sd1 and Sd−1 denote the sets of relevant and irrelevant sentences of document d, and
[[π]] is 1 if the predicate π holds and 0 otherwise.
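This quantity can be computed directly from already-scored sentences; a small sketch follows (the scores in the example are invented):

```python
def ranking_loss(docs):
    """docs: list of (relevant_scores, irrelevant_scores) pairs, one per document."""
    total = 0.0
    for rel, irr in docs:
        # fraction of (relevant, irrelevant) pairs that the scorer orders wrongly
        bad = sum(1 for s in rel for sp in irr if sp >= s)
        total += bad / (len(rel) * len(irr))
    return total / len(docs)

# One toy document: relevant sentences scored 0.9 and 0.8,
# irrelevant ones 0.85 and 0.1 -> 1 of 4 pairs is misordered.
loss = ranking_loss([([0.9, 0.8], [0.85, 0.1])])
```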


CHAPTER- 06

RECENT TRENDS AND DEVELOPMENTS

A number of major and minor researches and projects have been undertaken lately, to make the
system more viable and thereby more conducive for the users. The important ones are listed
below:

AREA: Research

I. SUMMARIST

Objective

Summarization is a hard problem of Natural Language Processing because, to do it properly, one
has to really understand the point of a text. This requires semantic analysis, discourse processing,
and inferential interpretation (grouping of the content using world knowledge). The last step,
especially, is complex, because systems without a great deal of world knowledge simply cannot
do it. Therefore, attempts so far at performing true abstraction, that is, creating abstracts as
summaries, have not been very successful.

Fortunately, however, an approximation called extraction is more feasible today. To create an
extract, a system needs simply to identify the most important/topical/central topic(s) of the text
and return them to the reader. Although the summary is not necessarily coherent, the reader can
form an opinion of the content of the original. Most automated summarization systems today
produce extracts only.

SUMMARIST is an attempt to develop robust extraction technology as far as it can go and then
continue research and development of techniques to perform abstraction. This work faces the
depth vs. robustness tradeoff: either systems analyze/interpret the input deeply enough to
produce good summaries (but are limited to small application domains), or they work robustly
over more or less unrestricted text (but cannot analyze deeply enough to fuse the input into a true
summary, and hence perform only topic extraction). In particular, symbolic techniques, using
parsers, grammars, and semantic representations, do not scale up to real-world size, while
Information Retrieval and other statistical techniques, being based on word counting and word
clustering, cannot create true summaries because they operate at the word (surface) level instead
of at the concept level.

To date, SUMMARIST produces extract summaries in five languages (and has been linked to
translation engines for these languages in the MUST system). Work is underway both to extend
the extract-based capabilities of SUMMARIST and to build up the large knowledge collection
required for inference-based abstraction.

Approach

We are building SUMMARIST, a system that combines symbolic concept-level world knowledge
(embodied in ISI's SENSUS ontology, dictionaries, and similar resources) with robust NLP
processing (using techniques from Information Retrieval and elsewhere) to overcome the
problems of the depth/robustness tradeoff. SUMMARIST is based on the following 'equation':

Summarization = Topic Identification + Interpretation + Generation

For each step, the system hybridizes techniques as follows:

1. Topic Identification:

Generalizing word-level IR techniques and adding techniques of topic spotting, we use SENSUS
and dictionaries to perform 'concept counting' and generalization, in order to identify important
topics in the text. English, Japanese, Spanish, Indonesian, and Arabic preprocessing modules and
lexicons provide multilingual capabilities. This is the most developed stage of SUMMARIST at
this time.
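The idea of 'concept counting' can be illustrated with a toy ontology: word frequencies are propagated to parent concepts, so a text mentioning several specific fruits scores high on the general concept "fruit". The ontology below is invented for illustration and is not SENSUS:

```python
from collections import Counter

# Toy word -> parent-concept hierarchy (hypothetical, not SENSUS).
PARENT = {"apple": "fruit", "banana": "fruit", "pear": "fruit",
          "sedan": "car", "car": "vehicle", "truck": "vehicle"}

def concept_counts(words):
    counts = Counter(words)
    # propagate each word's count up the concept hierarchy
    for w, n in list(counts.items()):
        node = w
        while node in PARENT:
            node = PARENT[node]
            counts[node] += n
    return counts

c = concept_counts(["apple", "banana", "pear", "sedan"])
```

Here no single fruit word is frequent, yet the generalized concept "fruit" accumulates a count of 3, which is what makes it surface as an important topic.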


2. Interpretation:

Training on Wall Street Journal and other texts, we employ statistical techniques from IR (word
clustering, tf.idf, chi-squared) and cognitive psychology (latent semantic analysis, WordNet,
etc.), as well as lexicons and dictionaries, to perform 'concept-based' topic fusion (interpretation)
to find true summarizing concepts. To achieve the robust performance required for general
utility, we are busy building a large collection of 'concept families', organized in the SENSUS
ontology.

3. Generation:

We will develop three alternatives: a keyword lister; a phrase template generator; and one of ISI's
sentence planners and sentence generators (Penman, NITROGEN). All three will provide
hyperlinks from the summary back into the source document.

Prototypes of each portion of the system have already been built and separately evaluated. A
formal evaluation of 18 systems was performed under the auspices of the TIPSTER research
funding program in February 1998.

FEATURES

1. Multilinguality:

Since the system uses no parser or grammar, and since for Machine Translation and other
purposes at ISI we have built lexicons of English (90,000 items), Japanese (220,000), Spanish
(45,000), Arabic (60,000), Indonesian (110,000), and Korean (110,000), most of which have
been partially linked to SENSUS, the system's design makes it possible to provide English
summaries or keyword extracts of documents written in any of these languages. Chin-Yew
Lin's recent work on embedding SUMMARIST in a multilingual web access and information
retrieval system called Must, with the addition of a shallow Indonesian-to-English translator,
illustrates this approach.


2. Discourse-level processing:

In order to produce a coherent, fluent summary, and to determine the flow of the author's
argument, it is necessary to determine the overall discourse structure of the text. Daniel Marcu's
work shows how, by first performing automated discourse analysis and then removing sentences
peripheral to the main message of the text, it is possible to construct coherent extracts. This
module will be added to SUMMARIST in the near future.

3. Indicator phrases:

Phrases such as "in conclusion" and "note that" in some genres indicate important content. The
project of Hao Liu focused on developing techniques to learn useful indicator phrases
automatically.

II. EXTRACTOR

Content Summarization:

Extractor is a content summarization utility that uses patented technology to summarize text,
e-mail, and HTML content into weighted lists of keywords and key phrases. Positioned for web
services, Extractor can consume documents of any length and subject matter, distilling the
contextual meaning of any content into keyword and key-phrase summary formats. Extractor's
patented technology delivers precise content summaries in any subject domain without retraining
and without human intervention.

Contextual:

A unique feature of the patented Extractor technology is the ability to summarize content by
showing how keywords and key phrases are used in the context of a document. The resulting
summary provides a high level of subject relevance. This feature allows, for instance, an
analytical comparison of one document against another document, or against a collection of
documents, displaying similar or dissimilar characteristics. It is ideal for portal content
aggregation, document indexing, keyword linking, and semantic-based information systems.

Relevant Information:

By design, Extractor is an objective provider of content summaries, in contrast to traditional
human-influenced, subjective summary approaches. In reported evaluations, Extractor is 85% to
93% accurate regardless of subject domain. The ability to quickly discern relevant and
meaningful information is the cornerstone of the Extractor technology.

Definition of keyphrase extractor:

Many journals ask their authors to provide a list of key words for their articles. We call these
keyphrases, rather than key words, because they are often phrases of two or more words, rather
than single words. We define a keyphrase list as a short list of phrases (typically five to fifteen
phrases) that capture the main topics discussed in a given document. We define automatic
keyphrase extraction as the automatic selection of important, topical phrases from within the
body of a document. Automatic keyphrase extraction is a special case of the more general task of
automatic keyphrase generation, in which the generated phrases do not necessarily appear in the
body of the given document.
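A rough keyphrase extractor can already be obtained from a simple frequency-and-position heuristic. The sketch below is illustrative only: the stoplist and the scoring rule are invented, and this is not Extractor's patented algorithm:

```python
from collections import Counter

STOP = {"the", "a", "of", "and", "in", "to", "is"}  # minimal, hand-made stoplist

def keyphrases(text, k=3):
    words = [w.strip(".,").lower() for w in text.split()]
    # candidate phrases: unigrams and bigrams that avoid stopwords
    cands = [w for w in words if w not in STOP]
    cands += [f"{a} {b}" for a, b in zip(words, words[1:])
              if a not in STOP and b not in STOP]
    freq = Counter(cands)

    def score(p):
        # higher frequency first; ties broken in favour of earlier occurrence
        return (freq[p], -cands.index(p))

    ranked = sorted(freq, key=score, reverse=True)
    return ranked[:k]

phrases = keyphrases("Text summarization shortens text. Automatic text summarization "
                     "selects key sentences.")
```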

Keyphrases for Metadata:

Many researchers believe that metadata is essential to address the problems of document
management. Metadata is meta-information about a document or set of documents. There are
several standards for document metadata, including the Dublin Core Metadata Element Set
(championed by the US Online Computer Library Center), the MARC (Machine-Readable
Cataloging) format (maintained by the US Library of Congress), the GILS (Government
Information Locator Service) standard (from the US Office of Social and Economic Data
Analysis), and the CSDGM (Content Standards for Digital Geospatial Metadata) standard (from
the US Federal Geographic Data Committee). All of these standards include a field for
keyphrases (although they have different names for this field).

Keyphrases for Highlighting:

When we skim a document, we scan for keyphrases, to quickly determine the topic of the
document. Highlighting is the practice of emphasizing keyphrases and key passages (e.g.,
sentences or paragraphs) by underlining the key text, using a special font, or marking the key text
with a special colour. The purpose of highlighting is to facilitate skimming. Automatic keyphrase
extraction can be used for highlighting and also to enable text-to-speech software to provide
audio skimming capability.

Keyphrases for indexing:

An alphabetical list of keyphrases, taken from a collection of documents or from parts of a
single long document (e.g., chapters in a book), can serve as an index.

Keyphrases for Interactive Query Refinement:

Using a search engine is often an iterative process. The user enters a query, examines the
resulting hit list, and modifies the query, then tries again. Most search engines do not have any
special features that support the iterative aspect of searching. One approach to interactive query
refinement is to take the user's query, fetch the first round of documents, extract keyphrases from
them, and then display the first round of documents to the user, along with suggested refinements
to the first query, based on combinations of the first query with the extracted keyphrases.
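This refinement loop can be sketched as follows, using plain word frequency over the first-round hits as a stand-in for full keyphrase extraction; the documents and queries in the example are made up:

```python
from collections import Counter

def refine(query, hit_docs, k=2):
    """Suggest refined queries: the original query plus a frequent term from the hits."""
    query_words = set(query.lower().split())
    terms = Counter(w for doc in hit_docs for w in doc.lower().split()
                    if w not in query_words)
    return [f"{query} {t}" for t, _ in terms.most_common(k)]

hits = ["jaguar cars and jaguar dealers", "jaguar cars review"]
suggestions = refine("jaguar", hits)
```

A search interface would display these suggestions alongside the first round of results, letting the user narrow the query in one click.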


Keyphrases for Web Log Analysis:

Web site managers often want to know what visitors to their site are seeking. Most web servers
have log files that record information about visitors, including the Internet address of the client
machine, the file that was requested by the client, and the date and time of the request. There are
several commercial products that analyze these logs for web site managers. Typically these tools
will give a summary of general traffic patterns and produce an ordered list of the most popular
files on the web site. A web log analysis program can use keyphrases to provide a deeper view of
traffic. Instead of producing an ordered list of the most popular files on the web site, a log
analysis tool can produce a list of the most popular keyphrases on the site. This can give web site
managers insight into which topics on their web site are most popular.

Workforce Optimization:

Delegating responsibility is a Management 101 lesson, but providing the tools to appropriately
empower a workforce for its delegated responsibilities has changed dramatically since
Management 101 was written. Relevant information is a critical tool for the success of any
business, and providing relevant information in exact context is what gives an organization the
ultimate competitive advantage. Rather than working through the normal, time-consuming,
iterative search-engine process, Extractor empowers corporate information with relevant and
meaningful presentations in direct relation to the changing needs of today's dynamic workforce.


III. FIELD: Sentence Extraction

1. Surrey University: Summ-It applet

This summarization system works by extracting sentences using Lexical Cohesion.

2. Royal Institute of Technology (Sweden): SweSum

SweSum extracts sentences to produce an extract-type summary. It is closely related to the work
at ISI. Summaries are created from Swedish or English texts in either the newspaper or
academic domain. Sentences are extracted by ranking them according to weighted word-level
features, trained on a tagged Swedish news corpus. The summarization tool can be hooked up to
search-engine results.

3. University of Ottawa: The Text Summarization Project

Not much is available about this research project except their project proposal. In it they
proposed to use machine learning techniques to identify keywords. Keyword identification can
then be used to select sentences for extraction. They planned to use surface level statistics such
as frequency analysis and surface level linguistic features such as sentence position.

4. Columbia University: FociSum (1998)

The FociSum system takes a question and answer approach to summarization. Sentences that
answer key questions regarding participants, organizations and other wh-questions are extracted.
The result is a concatenation of sentence fragments and clauses found in the original document.
The system first uses a named entity extractor to find the foci of the document. A question
generator is used to suggest relationships between these entities. The document is parsed to find
candidate answers for these questions on the basis of syntactic form. Sentence fragments and
clauses are pulled out of the selected sentences.


5. University of Southern California: ISI Summarist

Summarist produces summaries of web documents. It has been hooked up to the Systran
translation system to provide a gisting tool for news articles in any language. Summarist first
identifies the main topics of the document using statistical techniques on features such as
position and word counts. Current research is underway to use cue phrases and discourse
structure. These concepts are then interpreted so that, from a chain of lexically connected
sentences, the sentence with the most general concept is selected and extracted. Subsequent
work will use these extracted sentences to construct a more coherent summary.

IV. FIELD: Deep Understanding

1. The Sheffield University TRESTLE

This project produces summaries in the news domain. It uses MUC-style information extraction
to extract the main concepts of the text, which are then presumably used to generate summaries.
Unfortunately, not much information is available on the official website regarding the system
architecture.

2. Columbia University: SUMMONS (1996)

Summons is a multi-document summary system in the news domain. It begins with the results of
a MUC-style information extraction process, namely a template with instantiated slots of pre-
defined semantics. From this, it can generate a summary using a sophisticated natural language
generation stage. This stage was previously developed under other projects and includes a
content selection substage, a sentence planning substage, and a surface generation substage.
Because the templates have well-defined semantics, the type of summary produced approaches
that of human abstracts; that is, they are more coherent and readable. However, this approach is
domain specific, relying on the layout of news articles for the information extraction stage.


V. FIELD: Hybrid Approaches

(These combine extraction techniques with more traditional NLP techniques)

1. Columbia University: MultiGen (1999)

MultiGen is a multi-document system in the news domain. It extracts sentence fragments that
represent key pieces of information in a set of related documents. This is done by using machine
learning to group paragraph-sized chunks of text into clusters of related topics. Sentences from
these clusters are parsed and the resulting trees are merged, building logical representations of
propositions containing the commonly occurring concepts. This logical representation is turned
into a sentence using the FUF/SURGE grammar. Matching concepts uses linguistic knowledge
such as stemming, part-of-speech tags, synonymy, and verb classes. Merging trees makes use of
identified paraphrase rules.

2. Copy and Paste (1999).

The Copy and Paste system is a single-document summarizer that is domain independent. It is
designed to take the results of a sentence-extraction summarizer and extract key concepts from
these sentences. These concepts are then combined to form new sentences: the system copies the
surface form of the key concepts and pastes them into the new sentences. This is done by first
reducing each sentence, removing any extraneous information, a step that uses probabilities
learnt from a training corpus together with lexical links. The reduced sentences are then merged
using rules such as adding extra information about speakers, adding conjunctives, and merging
common elements.


AREA: Commercial

1. Data hammer by Glucose

Data hammer is a product designed to summarize online texts and works in conjunction with
the user's web browser. It extracts sentences using an algorithm of their own creation called
'Micro word Tree Trimming'. A demo version is available from their website.

2. Text Analyst by Megaputer

Text Analyst extracts sentences from documents on the user's computer. The official website of Text
Analyst describes the summarization process they use. A semantic network is constructed from
the source document using a neural network. They state that the construction of the semantic
network is not dependent on prior domain specific knowledge. A graphical representation of
concepts and relationships in the source document is shown to the user for selection. Sentences
with matching concepts and relationships are extracted.

3. IBM summarization products

IBM Japan incorporates summarization tools in two of its products: Internet King of Translation
(Japanese) and Lotus Word Pro (Japanese Version). The types of summaries produced are
sentence extracts, selected using rhetorical relations and position within the document.
Extraction is done statistically and can utilise genre specific features.

IBM also has a toolset called Text Analysis, which has as one of its components, a
summarization tool. This toolset is part of the Intelligent Text Miner product. It produces
summaries by extracting sentences. As with most commercial products, this is done by ranking
the sentences by a measure of importance and then selecting the topmost ranks. Ranking is
achieved by word level features and the user can select the extract length.


4. Websumm by Mitre

Mitre's WebSumm performs sentence extraction over single or multiple documents in
conjunction with a search engine. The resulting summary is an extract of sentences based on a
user's query. This is done by representing the source document(s) as a network of sentences;
using the query terms to select related nodes, the corresponding sentences are extracted. The
summarisation tool is able to handle both similar and contrasting sentences across multiple
documents.

5. InText Search Engine

This search-engine development kit is used to build a search engine for a particular website. In
doing so, it is supposed to allow dynamic creation of summaries of the website's documents
when requested by users. Their website gives no information as to how this is done.

6. InText by Island Soft

InText extracts key sentences by using key words, although the exact technique is not mentioned
on their website. Their description mentions that the user may choose one of several extraction
techniques. InText is a product that the user installs and uses on documents already residing on
the computer.

7. British Telecom (ProSum) or NetSum

It is difficult to reach the official site, so this paragraph is based on second-hand descriptions.
British Telecom produced a summarization tool for both offline and online texts which works by
selecting key sentences and extracting them as a summary. However, since ProSum and NetSum
are commercially available products, the internal mechanisms behind the extraction process are
not disclosed. Alternatively, instead of extracting the sentences, the tool can highlight them in
the original document. The user is able to change the length of the summary produced.
According to researchers at the University of Ottawa, ProSum works best with factual documents
of a single theme, including genres such as news and paper articles and technical journals. It does
not work as well with lists and narrative works.

8. inXight (LinguistX)

inXight's Summary Server is an application that creates extraction-based summaries offline.
Users view summaries when they move their mouse over a hypertext link to a document that has
been previously summarized. It uses statistical extraction techniques based on features such as
sentence position, sentence length, and keywords. The application allows the user to specify the
summary length and the salience of certain keywords. It also provides the ability for further
training on structured documents of other genres.

9. IBM Japan Summarization Project

"The importance of a sentence is determined by some surface clues such as the number of
important keywords, the type of sentence (fact, conjecture, opinion, etc.), rhetorical relations in
the context, and the location in which a sentence exists in a document."

• Copernic Summarizer
• Sinope Summarizer
• Pertinence Summarizer
• Microsoft (MS Word Auto Summarization)
• Apple
• Discontinued Summarization Products
o Webcompass by Quarterdeck


CHAPTER- 07

CONCLUSION

Due to the advent of the Internet and the availability of huge amounts of data, it has become
necessary to continue developing summarization tools. Earlier, most summarization tools used
frequency counts and other statistical techniques. Now, tools that can evaluate generic queries
are also available in the market, and the concept of ranking sentences for summarization has
been developed. Research is ongoing in this field to make summaries as human-like as possible.
It was seen that the quality of the original text also affects the quality of the summary, and some
techniques are better suited to certain types of documents while performing less well on others;
for example, the Named Entity technique works best with news articles.

As things stand, summarization tools are not yet developed enough to provide effective
summaries on their own, and further research is needed to produce high-quality summaries
without human intervention. This is a very lucrative field from both the business and the
research point of view.

----------------------------------------------


REFERENCES

• Inderjeet Mani and Mark T. Maybury (eds.): Advances in Automatic Text Summarization.

• Endres-Niggemeyer, Brigitte (1998): Summarizing Information.

• Marcu, Daniel (2000): The Theory and Practice of Discourse Parsing and Summarization.

• Ronen Feldman and James Sanger: The Text Mining Handbook. Cambridge University Press.

• M. Ikonomakis, S. Kotsiantis, V. Tampakas: Text Classification Using Machine Learning
Techniques. WSEAS Transactions on Computers, Issue 8, Volume 4, August 2005, pp. 966-974.

• Müürisep, Kaili and Pilleriin Mutso: ESTSUM - Estonian newspaper texts summarizer.
Proceedings of the Second Baltic Conference on Human Language Technologies, April 4-5,
2005, Tallinn, pages 311-316.

• de Smedt, K., A. Liseth, M. Hassel, H. Dalianis (2005): How short is good? An evaluation of
automatic summarization. In Holmboe, H. (ed.), Nordisk Sprogteknologi 2004. Årbog for
Nordisk Språkteknologisk Forskningsprogram 2000-2004, pp. 267-287, Museum Tusculanums
Forlag.

• Hassel, Martin: Evaluation of automatic text summarization - a practical implementation.
Licentiate thesis, Stockholm, NADA-KTH.


Some important links and websites that have provided me with immense information on this
topic are mentioned below:

• www.summarization.com/
• www.dsv.su.se/~hercules/textsammanfattningeng.html
• www.doc.ic.ac.uk/~nd/surprise_97/journal/vol4/hks/summ.html
• www.mitpress.mit.edu/book-home.tcl
• www.site.uottawa.ca/tanka/ts.html
• www.isi.edu/natural-language/projects/SUMMARIST.html
• www.law.kuleuven.ac.be/icri/conferences/acl_summarization2004.php
• www.extractor.com/
