Automatic summarisation - introduction and evaluation

Constantin Orăsan

March 10, 2010

What is a summary?

Abstract of scientific paper

Source: (Sparck Jones, 2007)

Summary of a news event

Source: Google news http://news.google.com

Summary of a web page

Source: Bing http://www.bing.com

Summary of financial news

Source: Yahoo! Finance http://finance.yahoo.com/


Maps

Source: Google Maps http://maps.google.co.uk/


Summaries in everyday life

• Headlines: summaries of newspaper articles
• Table of contents: summary of a book or magazine
• Digest: summary of stories on the same topic
• Highlights: summary of an event (meeting, sports event, etc.)
• Abstract: summary of a scientific paper
• Bulletin: weather forecast, stock market, news
• Biography: résumé, obituary
• Abridgment: of books
• Review: of books, music, plays
• Scale-downs: maps, thumbnails
• Trailer: of a film or speech

Summaries in computational linguistics usually

• are produced from the text of one or several documents
• are a text or a list of sentences

Definitions of summary

• “an abbreviated, accurate representation of the content of a document preferably prepared by its author(s) for publication with it. Such abstracts are also useful in access publications and machine-readable databases” (American National Standards Institute Inc., 1979)
• “an abstract summarises the essential contents of a particular knowledge record, and it is a true surrogate of the document” (Cleveland, 1983)
• “the primary function of abstracts is to indicate and predict the structure and content of the text” (van Dijk, 1980)

Definitions of summary (II)

• “the abstract is a time saving device that can be used to find a particular part of the article without reading it; [...] knowing the structure in advance will help the reader to get into the article; [...] as a summary of the article, it can serve as a review, or as a clue to the content”. Also, an abstract gives “an exact and concise knowledge of the total content of the very much more lengthy original, a factual summary which is both an elaboration of the title and a condensation of the report [...] if comprehensive enough, it might replace reading the article for some purposes” (Graetz, 1985).
• these definitions refer to human-produced summaries

Definitions for automatic summaries

• these definitions are less ambitious
• “a concise representation of a document’s content to enable the reader to determine its relevance to a specific information” (Johnson, 1995)
• “a summary is a text produced from one or more texts, that contains a significant portion of the information in the original text(s), and is not longer than half of the original text(s)” (Hovy, 2003)

What is automatic summarisation?

What is automatic (text) summarisation?

• Text summarisation is:
  • a reductive transformation of source text to summary text through content reduction by selection and/or generalisation on what is important in the source (Sparck Jones, 1999)
  • the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks) (Mani and Maybury, 1999)
• Automatic text summarisation = the process of producing summaries automatically
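The selective, reductive transformation in these definitions can be illustrated with a minimal sketch (a generic frequency-based extractor, not any specific published system): sentences are scored by how frequent their words are in the source, and the top-scoring ones are emitted in their original order.

```python
import re
from collections import Counter

def summarise(text, n_sentences=2):
    """Extract the n highest-scoring sentences, scored by word frequency."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Frequency of every word in the whole source text.
    freq = Counter(re.findall(r"\w+", text.lower()))
    # A sentence's score is the summed frequency of its words.
    scores = {s: sum(freq[w] for w in re.findall(r"\w+", s.lower()))
              for s in sentences}
    best = set(sorted(sentences, key=scores.get, reverse=True)[:n_sentences])
    # Re-emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in best)
```

This produces an extract; producing an abstract (new text not present in the source) requires generation techniques discussed later.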

Related disciplines

There are many disciplines related to automatic summarisation:
• automatic categorisation/classification
• term/keyword extraction
• information retrieval
• information extraction
• question answering
• text generation
• data/opinion mining

Automatic categorisation/classification

• Automatic text categorisation
  • is the task of building software tools capable of classifying text documents under predefined categories or subject codes
  • each document can be in one or several categories
  • examples of categories: Library of Congress subject headings
• Automatic text classification
  • is usually considered broader than text categorisation
  • includes text clustering and text categorisation
  • it does not necessarily require the classes to be known in advance
• Examples: email/spam filtering, routing

Term/keyword extraction

• automatically identifies terms/keywords in texts
• a term is a word or group of words which is important in a domain and represents a concept of that domain
• a keyword is an important word in a document, but it is not necessarily a term
• terms and keywords are extracted using a mixture of statistical and linguistic approaches
• automatic indexing identifies all the relevant occurrences of a keyword in texts and produces indexes
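As a rough illustration of the statistical side of such approaches, the sketch below ranks the words of one document by TF-IDF against a small corpus. Real term extractors combine this with linguistic filters (e.g. part-of-speech patterns), which are omitted here; the function name is hypothetical.

```python
import math
import re
from collections import Counter

def keywords(documents, doc_index, top_n=5):
    """Rank the words of one document by TF-IDF against a small corpus."""
    tokenised = [re.findall(r"\w+", d.lower()) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for doc in tokenised for w in set(doc))
    # Term frequency within the document of interest.
    tf = Counter(tokenised[doc_index])
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:top_n]]
```

Words occurring in every document (e.g. function words) get a score of zero, which is why TF-IDF is a common first pass for keyword extraction.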

Information retrieval (IR)

• Information retrieval attempts to find information relevant to a user query and rank it according to its relevance
• the output is usually a list of documents, in some cases together with relevant snippets from the documents
• Example: search engines
• needs to be able to deal with enormous quantities of information and to process information in any format (e.g. text, image, video, etc.)
• a field which has achieved a level of maturity and is used in industry and business
• combines statistics, text analysis, link analysis and user interfaces

Information extraction (IE)

• Information extraction is the automatic identification of predefined types of entities, relations or events in free text
• quite often the best results are obtained by rule-based approaches, but machine learning approaches are used more and more
• can generate database records
• is domain dependent
• the field developed a lot as a result of the MUC conferences
• one of the tasks in the MUC conferences was to fill in templates
• Example: Ford appointed Harriet Smith as president
  • Person: Harriet Smith
  • Job: president
  • Company: Ford

Question answering (QA)

• Question answering aims at identifying the answer to a question in a large collection of documents
• the information provided by QA is more focused than in information retrieval
• a QA system should be able to answer any question and should not be restricted to a domain (as IE is)
• the output can be the exact answer or a text snippet which contains the answer
• the field took off as a result of the introduction of the QA track in TREC
• user-focused summarisation = open-domain question answering

Text generation

• Text generation creates text from computer-internal representations of information
• most generation systems rely on massive amounts of linguistic knowledge and manually encoded rules for translating the underlying representation into language
• text generation systems are very domain dependent

Data mining

• Data mining is the (semi-)automatic discovery of trends, patterns or unusual data across very large data sets, usually for the purposes of decision making
• Text mining applies methods from data mining to textual collections
• processes really large amounts of data in order to find useful information
• in many cases it is not clearly known what is being sought
• visualisation plays a very important role in data mining

Opinion mining

• Opinion mining (OM) is a recent discipline at the crossroads of information retrieval and computational linguistics which is concerned not with the topic a document is about, but with the opinion it expresses
• usually applied to collections of documents (e.g. blogs) and seen as part of text/data mining
• Sentiment Analysis, Sentiment Classification and Opinion Extraction are other names used in the literature for this discipline
• Examples of OM problems:
  • What is the general opinion on the proposed tax reform?
  • How is popular opinion on the presidential candidates evolving?
  • Which of our customers are unsatisfied? Why?

Characteristics of summaries

Context factors

• the context factors defined by Sparck Jones (1999; 2001) represent a good way of characterising summaries
• they do not necessarily refer to automatic summaries
• they do not even necessarily refer to summaries
• there are three types of factors:
  • input factors: characterise the input document(s)
  • purpose factors: define the transformations necessary to obtain the output
  • output factors: characterise the produced summaries

Context factors

Input factors:   Form (structure, scale, medium, genre, language, format); Subject type; Unit
Purpose factors: Situation; Use; Summary type; Coverage; Relation to source
Output factors:  Form (structure, scale, medium, language, format); Subject matter

Input factors - Form

• structure: the explicit organisation of documents. Can be the problem-solution structure of scientific documents, the pyramidal structure of newspaper articles, or the presence of embedded structure in the text (e.g. rhetorical patterns)
• scale: the length of the documents. Different methods need to be used for a book and for a newspaper article due to very different compression rates
• medium: natural language/sublanguage/specialised language. If the text is written in a sublanguage it is less ambiguous and therefore easier to process

Input factors - Form

• language: monolingual/multilingual/cross-lingual
  • Monolingual: the source and the output are in the same language
  • Multilingual: the input is in several languages and the output is in one of these languages
  • Cross-lingual: the language of the output is different from the language of the source(s)
• formatting: whether the source uses any special formatting. This is more a programming problem, but it needs to be taken into consideration if information is lost as a result of conversion

Input factors

• Subject type: the intended readership. Indicates whether the source was written for the general reader or for specific readers; it influences the amount of background information present in the source
• Unit: single/multiple sources (single- vs. multi-document summarisation); mainly concerned with the amount of redundancy in the text

Why are input factors useful?

The input factors can be used to decide whether to summarise a text or not:
• Brandow, Mitze, and Rau (1995) use the structure of the document (presence of speech, tables, embedded lists, etc.) to decide whether to summarise it or not.
• Louis and Nenkova (2009) train a system on DUC data to determine whether the result is expected to be reliable or not.

Purpose factors

• Use: how the summary is used
  • retrieving: the user uses the summary to decide whether to read the whole document
  • substituting: the summary is used instead of the full document
  • previewing: the summary gives the structure of the source, etc.
• Summary type: indicates what kind of summary it is
  • indicative summaries provide a brief description of the source without going into details
  • informative summaries follow the main ideas and structure of the source
  • critical summaries give a description of the source and discuss its contents (e.g. review articles can be considered critical summaries)

Purpose factors

• Relation to source: whether the summary is an extract or an abstract
  • extract: contains units directly extracted from the document (i.e. paragraphs, sentences, clauses)
  • abstract: includes units which are not present in the source
• Coverage: which type of information should be present in the summary
  • generic: the summary should cover all the important information of the document
  • user-focused: the user indicates what the focus of the summary should be

Output factors

• Scale (also referred to as compression rate): indicates the length of the summary
  • American National Standards Institute Inc. (1979) recommends 250 words
  • Borko and Bernier (1975) point out that imposing an arbitrary limit on summaries is not good for their quality, but that a length of around 10% is usually enough
  • Hovy (2003) requires that the length of the summary is kept to less than half of the source’s size
  • Goldstein et al. (1999) point out that the summary length seems to be independent of the length of the source
• the structure of the output can be influenced by the structure of the input or by existing conventions
• the subject matter can be the same as that of the input, or can be broader when background information is added
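The compression rate is straightforward to compute as the ratio of summary length to source length in words. The helper names below are hypothetical; the 0.5 limit follows the Hovy (2003) requirement quoted above.

```python
def compression_rate(source, summary):
    """Ratio of summary length to source length, measured in words."""
    return len(summary.split()) / len(source.split())

def within_limit(source, summary, max_rate=0.5):
    """Check the requirement that a summary is at most half the source."""
    return compression_rate(source, summary) <= max_rate
```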

Evaluation of automatic summarisation

Why is evaluation necessary?

• Evaluation is very important because it allows us to assess the results of a method or system
• Evaluation allows us to compare the results of different methods or systems
• Some types of evaluation allow us to understand why a method fails
• almost every field has its own specific evaluation methods
• there are several ways to classify evaluation:
  • how the system is considered
  • how humans interact with the evaluation process
  • what is measured

How the system is considered

• black-box evaluation:
  • the system is considered opaque to the user
  • the system is considered as a whole
  • allows direct comparison between different systems
  • does not explain the system’s performance
• glass-box evaluation:
  • each of the system’s components is assessed in order to understand how the final result is obtained
  • is very time consuming and difficult
  • relies on phenomena which are not fully understood (e.g. error propagation)

How humans interact with the process

• off-line evaluation
  • also called automatic evaluation because it does not require human intervention
  • usually involves comparison between the system’s output and a gold standard
  • very often annotated corpora are used as gold standards
  • usually preferred because it is fast and not directly influenced by human subjectivity
  • can be repeated
  • cannot be (easily) used in all fields
• on-line evaluation
  • requires humans to assess the output of the system according to some guidelines
  • useful for tasks where the output of the system cannot be uniquely predicted (e.g. summarisation, text generation, question answering, machine translation)
  • time consuming, expensive and cannot be easily repeated

What is measured

• intrinsic evaluation:
  • evaluates the results of a system directly
  • for example: quality, informativeness
  • sometimes does not give a very accurate view of how useful the output can be for another task
• extrinsic evaluation:
  • evaluates the results of another system which uses the results of the first
  • examples: post-edit measures, relevance assessment, reading comprehension

Evaluation used in automatic summarisation

• evaluation is a very difficult task because there is no clear idea of what constitutes a good summary
• the number of perfectly acceptable summaries of a text is not limited
• there are four types of evaluation methods:


            On-line evaluation     Off-line evaluation
Intrinsic   Direct evaluation      Target-based evaluation
Extrinsic   Task-based evaluation  Automatic evaluation

Direct evaluation

• intrinsic & on-line evaluation
• requires humans to read summaries and measure their quality and informativeness according to some guidelines
• one of the first evaluation methods used in automatic summarisation
• to a certain extent it is quite straightforward, which makes it appealing for small-scale evaluation
• it is time consuming, subjective and in many cases cannot be repeated by others

Direct evaluation: quality
• it tries to assess the quality of a summary independently of

the source
• can be a simple classification of sentences as acceptable or

unacceptable
• Minel, Nugier, and Piat (1997) proposed an evaluation

protocol which considers the coherence, cohesion and legibility of summaries
• cohesion of a summary is measured in terms of dangling

anaphors
• the coherence in terms of discourse ruptures. • the legibility is decided by jurors who are requested to classify

each summary as very bad, bad, mediocre, good or very good. • it does not assess the contents of a summary, so it could be

misleading

Direct evaluation: informativeness
• assesses how correctly the information in the source is

reflected in the summary
• the judges are required to read both the source and the

summary, which makes the process longer and more expensive • judges are generally required to:
• identify important ideas from the source which do not appear

in the summary
• identify ideas from the summary which are not important enough and

therefore should not be there
• identify the logical development of the ideas and see whether

they appear in the summary • given that it is time consuming, automatic methods to

compute the informativeness are preferred

Target-based evaluation

• it is the most widely used evaluation method • compares the automatic summary with a gold standard • it is appropriate for extractive summarisation methods • it is intrinsic and off-line • it does not require humans to be involved in the evaluation • has the advantage of being fast and cheap, and can be repeated

by other researchers
• the drawback is that it requires a gold standard which usually

is not easy to produce

Corpora as gold standards

• usually annotated corpora are used as gold standards • usually the annotation is very simple: for each sentence it

indicates whether it is important enough to be included in the summary or not
• such corpora are normally used to assess extracts • can be produced manually or automatically • these corpora normally represent a single point of view

Manually produced corpora

• Require human judges to read each text from the corpus and

to identify the important units in each text according to guidelines
• Kupiec, Pedersen, and Chen (1995) and Teufel and Moens

(1997) took advantage of the existence of human produced abstracts and asked human annotators to align sentences from the document with sentences from the abstracts.
• it is not necessary to use specialised tools to apply this

annotation, but in many cases they can help

Guidelines for manually annotated corpora
• Edmundson (1969) annotated a heterogeneous corpus

consisting of 200 documents in the fields of physics, life science, information science and humanities. The important sentences were considered to be those which indicated:
• what the subject area is,
• why the research is necessary,
• how the problem is solved,
• which are the findings of the research.

• Hasler, Orăsan, and Mitkov (2003) annotated a corpus of

newspaper articles; the important sentences were considered to be those linked to the main topic of the text as indicated in the title (see http://clg.wlv.ac.uk/projects/CAST/ for the complete guidelines)

Problems with manually produced corpora

• given how subjective the identification of important sentences

is, the agreement between annotators is low
• the inter-annotator agreement is determined by the genre of

texts and the length of summaries
• Hasler, Orăsan, and Mitkov (2003) tried to measure the

agreement between three annotators and noticed a very low value, but
• when the content is compared the agreement increases

Automatically produced corpora
• Relies on the fact that very often humans produce summaries

by copy-pasting from the source
• there are algorithms which identify sets of sentences from the

source which cover the information in the summary
• Marcu (1999) employed a greedy algorithm which eliminates

sentences from the whole document that do not reduce the similarity between the summary and the remaining sentences.
• Jing and McKeown (1999) treat the human produced abstract

as a sequence of words which appears in the document, and reformulate the problem of alignment as the problem of finding the most likely position of the words from the abstract in the full document using a Hidden Markov Model.
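The greedy elimination idea can be sketched as follows. This is a minimal illustration under simplifying assumptions (a bag-of-words cosine as the similarity measure, invented example sentences), not Marcu's actual implementation:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Bag-of-words cosine similarity between two texts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

def greedy_extract(sentences, abstract):
    """Repeatedly drop a sentence whose removal does not reduce the
    similarity between the remaining text and the abstract."""
    remaining = list(sentences)
    improved = True
    while improved and len(remaining) > 1:
        improved = False
        base = cosine(" ".join(remaining), abstract)
        for i in range(len(remaining)):
            rest = remaining[:i] + remaining[i + 1:]
            if cosine(" ".join(rest), abstract) >= base:
                remaining = rest
                improved = True
                break
    return remaining

# the off-topic sentence is eliminated; the rest forms the "extract"
doc = ["the cat sat on the mat", "stocks fell sharply today"]
greedy_extract(doc, "the cat sat on the mat")
```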

Evaluation measures used with annotated corpora
• usually precision, recall and f-measure are used to calculate

the performance of a system
• the list of sentences extracted by the program is compared

with the list of sentences marked by humans
                            Extracted by humans    Not extracted by humans

Extracted by program        True positives         False positives
Not extracted by program    False negatives        True negatives

Precision = TruePositives / (TruePositives + FalsePositives)

Recall = TruePositives / (TruePositives + FalseNegatives)

F-score = (β² + 1) · P · R / (β² · P + R)
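The measures can be computed directly from the two lists of sentences; a minimal sketch (the function name and the toy sentence indices are invented):

```python
def prf(extracted, gold, beta=1.0):
    """Precision, recall and F-score for sets of extracted sentence indices."""
    tp = len(extracted & gold)                     # true positives
    p = tp / len(extracted) if extracted else 0.0  # precision
    r = tp / len(gold) if gold else 0.0            # recall
    f = (beta**2 + 1) * p * r / (beta**2 * p + r) if p + r else 0.0
    return p, r, f

# the program extracted sentences 1-4; humans marked sentences 2, 3 and 5
p, r, f = prf({1, 2, 3, 4}, {2, 3, 5})
# p = 2/4, r = 2/3, f = 4/7
```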

Summary Evaluation Environment (SEE)

• the SEE environment was used in the DUC evaluations • it is a combination of direct and target-based evaluation • it requires humans to assess whether each unit from the

automatic summary appears in the target summary
• it also offers the option to answer questions about the quality

of the summary (e.g. Does the summary build from sentence to sentence to a coherent body of information about the topic?)

Relative utility of sentences (Radev et al., 2000)

• Addresses the problem that humans often disagree when they

are asked to select the top n% sentences from a document
• Each sentence in the document receives a score from 1 to 10

depending on how “summary worthy” it is
• The score of an automatic summary is the normalised score of

the extracted sentences
• When several judges are available the score of a summary is

the average over all judges
• Can be used for any compression rate
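The normalisation can be sketched as follows; the data representation is an assumption, and the score follows the description above: the utility of the extracted sentences divided by the best achievable utility for an extract of the same length.

```python
def relative_utility(utilities, extracted):
    """utilities[i] is one judge's 1-10 "summary worthiness" score for
    sentence i; extracted is the list of sentence indices the system chose."""
    achieved = sum(utilities[i] for i in extracted)
    # an ideal extract of the same length takes the highest-scored sentences
    ideal = sum(sorted(utilities, reverse=True)[:len(extracted)])
    return achieved / ideal

# four sentences scored by one judge; the system extracted sentences 0 and 3
score = relative_utility([9, 2, 7, 5], [0, 3])   # (9 + 5) / (9 + 7) = 0.875
```

With several judges, the same computation is averaged over the judges' utility scores.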

Target-based evaluation without annotated corpora

• They require that the sources have a human provided

summary (but the sources do not need to be annotated)
• Donaway et al. (2000) proposed to use the cosine similarity

between an automatic summary and a human summary, but it relies on word co-occurrences
• ROUGE uses the number of overlapping units (Lin, 2004) • Nenkova and Passonneau (2004) proposed the pyramid

evaluation method which addresses the problem that different people select different content when writing summaries
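The cosine-similarity measure mentioned above can be sketched over raw word counts (a simplification: Donaway et al. experimented with term vectors that may also be weighted):

```python
from collections import Counter
from math import sqrt

def cosine_sim(summary_a, summary_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    va = Counter(summary_a.lower().split())
    vb = Counter(summary_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

# identical texts score 1.0, texts with no shared words score 0.0
cosine_sim("the cable car caught fire", "the cable car caught fire")
```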

ROUGE
• ROUGE = Recall-Oriented Understudy for Gisting Evaluation

(Lin, 2004)
• inspired by BLEU (Bilingual Evaluation Understudy) used in

machine translation (Papineni et al., 2002)
• Developed by Chin-Yew Lin and available at

http://berouge.com
• Assesses the quality of a summary by comparison with ideal

summaries
• Metrics count the number of overlapping units • There are several versions depending on how the comparison

is made

ROUGE-N

N-gram co-occurrence statistics: a recall-oriented metric
• S1: Police killed the gunman • S2: Police kill the gunman • S3: The gunman kill police

• S2 = S3 (each shares exactly one bigram, “the gunman”, with S1)
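A minimal ROUGE-N sketch reproducing the example above (plain whitespace tokenisation; the real package also supports stemming, stopword removal and multiple references):

```python
from collections import Counter

def ngram_counts(text, n):
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n(candidate, reference, n=2):
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    return sum((cand & ref).values()) / sum(ref.values())

ref = "police killed the gunman"
rouge_n("police kill the gunman", ref)   # 1/3: only "the gunman" matches
rouge_n("the gunman kill police", ref)   # 1/3 as well, hence S2 = S3
```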

ROUGE-L

Longest common subsequence
• S1: police killed the gunman • S2: police kill the gunman • S3: the gunman kill police

• S2 = 3/4 (police the gunman) • S3 = 2/4 (the gunman) • S2 > S3
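The scores above can be reproduced with a standard longest-common-subsequence dynamic programme, with recall computed against the reference length (a minimal sketch):

```python
def lcs_len(x, y):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if xi == yj else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l_recall(candidate, reference):
    c, r = candidate.lower().split(), reference.lower().split()
    return lcs_len(c, r) / len(r)

ref = "police killed the gunman"
rouge_l_recall("police kill the gunman", ref)   # 3/4 ("police the gunman")
rouge_l_recall("the gunman kill police", ref)   # 2/4 ("the gunman")
```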

ROUGE-W

Weighted Longest Common Subsequence
• S1: [A B C D E F G] • S2: [A B C D H I J] • S3: [A H B J C I D]

• ROUGE-W favours consecutive matches • S2 better than S3
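The weighted LCS behind ROUGE-W can be sketched as follows; the weighting function f(k) = k^α with α = 2 is an assumption for illustration (Lin, 2004 leaves f configurable):

```python
def rouge_w_recall(candidate, reference, alpha=2.0):
    """Weighted LCS recall: a run of k consecutive matches is credited
    f(k) = k**alpha, so contiguous matches earn more than scattered ones.
    The final score is f^-1(WLCS / f(len(reference)))."""
    x, y = candidate.split(), reference.split()
    f = lambda k: k ** alpha
    c = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]   # weighted LCS score
    w = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]     # current run length
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i-1] == y[j-1]:
                k = w[i-1][j-1]
                c[i][j] = c[i-1][j-1] + f(k + 1) - f(k)
                w[i][j] = k + 1
            else:
                c[i][j] = max(c[i-1][j], c[i][j-1])
    return (c[-1][-1] / f(len(y))) ** (1 / alpha)

s1 = "A B C D E F G"
rouge_w_recall("A B C D H I J", s1)   # 4/7: one run of four consecutive matches
rouge_w_recall("A H B J C I D", s1)   # 2/7: four isolated matches
```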

ROUGE-S
ROUGE-S: Skip-bigram recall metric
• Arbitrary in-sequence bigrams are computed • S1: police killed the gunman (“police killed”, “police the”,

“police gunman”, “killed the”, “killed gunman”, “the gunman”)
• S2: police kill the gunman (“police the”, “police gunman”,

“the gunman”)
• S3: the gunman kill police (“the gunman”) • S4: the gunman police killed (“police killed”, “the gunman”)

• S2 better than S4 better than S3 • ROUGE-SU adds unigrams to ROUGE-S
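A minimal skip-bigram sketch reproducing the example (unlimited skip distance is assumed; in practice ROUGE-S can limit the maximum gap between the two words):

```python
from itertools import combinations

def skip_bigrams(text):
    """All ordered word pairs, with any gap allowed (skip-bigrams)."""
    return set(combinations(text.lower().split(), 2))

def rouge_s_recall(candidate, reference):
    cand, ref = skip_bigrams(candidate), skip_bigrams(reference)
    return len(cand & ref) / len(ref)

ref = "police killed the gunman"                 # 6 skip-bigrams
rouge_s_recall("police kill the gunman", ref)    # 3/6
rouge_s_recall("the gunman police killed", ref)  # 2/6
rouge_s_recall("the gunman kill police", ref)    # 1/6
```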

ROUGE

• Experiments on DUC 2000 – 2003 data show good correlation

with human judgement
• Using multiple references achieved better correlation with

human judgement than just using a single reference.
• Stemming and removing stopwords improved correlation with

human judgement

The pyramid method (Nenkova, Passonneau, and McKeown, 2007)

• attempts to solve Human Variation, Analysis Granularity,

Semantic Equivalence, Extracts vs. Abstracts
• assumes that there are multiple model summaries for each source • it tries to measure overlap in meaning rather than in words • relies on Summary Content Units (SCUs), which are

semantically motivated, subsentential units of variable length but not bigger than a sentential clause.
• SCUs are annotated in the gold standard

The pyramid method - Example
A The cause of the fire was unknown. B A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

The pyramid method - Example
SCU: A cable car caught fire (Weight = 4) A The cause of the fire was unknown. B A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

The pyramid method - Example
SCU: A cable car caught fire (Weight = 4) SCU: The cause of the fire is unknown (Weight = 1) A The cause of the fire was unknown. B A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

The pyramid method - Example
SCU: A cable car caught fire (Weight = 4) SCU: The cause of the fire is unknown (Weight = 1) SCU: The accident happened in the Austrian Alps (Weight = 3) A The cause of the fire was unknown. B A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

The pyramid method

• a pyramid is built on the

basis of the weight of SCUs
• Top: few SCUs, high weight • Bottom: many SCUs, low

weight
• In theory, an informative

summary does not include an SCU from a lower tier unless all the SCUs from higher tiers are included as well

The pyramid method
• for a pyramid with n tiers, Tn the top tier and T1 the bottom tier
• the weight of an SCU in tier Ti is i
• Di is the number of SCUs in the summary that appear in tier i
• the total SCU weight of the summary is D = Σ_{i=1}^{n} i · Di
• the optimal content score for a summary with X SCUs is:

Max = Σ_{i=j+1}^{n} i · |Ti| + j · (X − Σ_{i=j+1}^{n} |Ti|)

where j = max_i (Σ_{t=i}^{n} |Tt| ≥ X)

• the pyramid score is the ratio between D and Max
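The score computation can be sketched as follows; the input representation (a tier-size map plus the weights of the SCUs expressed in the peer summary) is an assumed simplification of the annotated pyramid:

```python
def pyramid_score(tier_sizes, summary_scu_weights):
    """tier_sizes: {weight i: number of SCUs in tier T_i};
    summary_scu_weights: weights of the SCUs found in the peer summary."""
    D = sum(summary_scu_weights)          # observed SCU weight
    remaining = len(summary_scu_weights)  # X, the number of SCUs expressed
    max_weight = 0
    # an ideal summary with X SCUs takes them from the highest tiers first
    for weight in sorted(tier_sizes, reverse=True):
        take = min(tier_sizes[weight], remaining)
        max_weight += weight * take
        remaining -= take
        if remaining == 0:
            break
    return D / max_weight

# pyramid from the cable-car example: one SCU of weight 4, one of weight 3,
# one of weight 1; a peer summary expressing the weight-4 and weight-1 SCUs
score = pyramid_score({4: 1, 3: 1, 1: 1}, [4, 1])   # 5 / (4 + 3)
```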

Task-based evaluation
• is an extrinsic and on-line evaluation • instead of evaluating the summaries directly, humans are

asked to perform tasks using summaries and the accuracy of these tasks is measured
• the assumption is that the accuracy does not decrease when

good summaries are used
• the time taken should decrease • examples of tasks: classification of summaries according to

predefined classes (Saggion and Lapalme, 2000), determining the relevance of a summary to a topic (Miike et al., 1994; Oka and Ueda, 2000), and reading comprehension (Morris, Kasper, and Adams, 1992; Orăsan, Pekar, and Hasler, 2004).

Task-based evaluation

• this evaluation can be very useful because it assesses a summary

in real situations
• it is time consuming and requires humans to be involved in

the evaluation process
• in order to obtain statistically significant results a large

number of judges have to be involved
• this evaluation method has been used in evaluation

conferences

Automatic evaluation
• extrinsic and off-line evaluation method • tries to replace humans in task-based evaluations with

automatic methods which perform the same task and are evaluated automatically • Examples:
• text retrieval (Brandow, Mitze, and Rau, 1995): increase in

precision but drastic reduction of recall
• text categorisation (Kolcz, Prabakarmurthi, and Kalita, 2001):

the performance of categorisation increases • has the advantage of being fast and cheap, but in many cases

the tasks which can benefit from summaries are as difficult to evaluate as automatic summarisation (e.g. Kuo et al. (2002) proposed to use QA)

intrinsic

extrinsic From (Sparck Jones, 2007)

intrinsic
• semi-purpose: inspection (e.g. for proper

English)

extrinsic From (Sparck Jones, 2007)

intrinsic
• semi-purpose: inspection (e.g. for proper

English)
• quasi-purpose: comparison with models (e.g.

ngrams, nuggets)

extrinsic From (Sparck Jones, 2007)

intrinsic
• semi-purpose: inspection (e.g. for proper

English)
• quasi-purpose: comparison with models (e.g.

ngrams, nuggets)
• pseudo-purpose: simulation of task contexts

(e.g. action scenarios)

extrinsic From (Sparck Jones, 2007)

intrinsic
• semi-purpose: inspection (e.g. for proper

English)
• quasi-purpose: comparison with models (e.g.

ngrams, nuggets)
• pseudo-purpose: simulation of task contexts

(e.g. action scenarios)
• full-purpose: operation in task context (e.g.

report writing)

extrinsic From (Sparck Jones, 2007)

Evaluation conferences

• evaluation conferences are conferences where all the

participants have to complete the same task on a common set of data
• these conferences allow direct comparison between the

participants
• such conferences led to rapid advances in several fields: MUC

(information extraction), TREC (Information retrieval & question answering), CLEF (question answering for non-English languages and cross-lingual QA)

SUMMAC

• the first evaluation conference organised in automatic

summarisation (in 1998)
• 6 participants in the dry-run and 16 in the formal evaluation • mainly extrinsic evaluation: • adhoc task: determine the relevance of the source document to a query (topic) • categorisation: assign to each document a category on the basis of its summary • question answering: answer questions using the summary • a small acceptability test where direct evaluation was used

SUMMAC

• the TREC dataset was used • for the adhoc evaluation 20 topics each with 50 documents

were selected
• the time for the adhoc task halves with a slight reduction in

the accuracy (which is not significant)
• for the categorisation task 10 topics each with 100 documents

(5 categories)
• there is no difference in the classification accuracy and the

time is reduced only for 10% summaries
• more details can be found in (Mani et al., 1998)

Text Summarization Challenge
• is an evaluation conference organised in Japan and its main

goal is to evaluate Japanese summarisers
• it was organised using the SUMMAC model • precision and recall were used to evaluate single document

summaries
• humans had to assess how relevant the summaries of texts

retrieved for specific queries were to those queries
• it also included some readability measures (e.g. how many

deletions, insertions and replacements were necessary)
• more details can be found in (Fukusima and Okumura, 2001;

Okumura, Fukusima, and Nanba, 2003)

Document Understanding Conference (DUC)
• it is an evaluation conference organised as part of a larger

program called TIDES (Translingual Information Detection, Extraction and Summarisation)
• organised from 2000 • at the beginning it was not that different from SUMMAC, but

in time more difficult tasks were introduced:
• 2001: single and multi-document generic summaries with 50,

100, 200, 400 words
• 2002: single and multi-document generic abstracts with 50,

100, 200, 400 words, and multi-document extracts with 200 and 400 words • 2003: abstracts of documents and document sets with 10 and 100 words, and focused multi-document summaries

Document Understanding Conference
• in 2004 participants were required to produce short (<665

bytes) and very short (<75 bytes) summaries of single documents and document sets, a short document profile, and headlines
• from 2004 ROUGE has been used as an evaluation method • in 2005: short multi-document summaries, user-oriented

questions
• in 2006: same as in 2005, but pyramid evaluation was also used • more information available at: http://duc.nist.gov/ • in 2007: 250-word summaries and a 100-word update task; pyramid

evaluation was used as a community effort
• in 2008 DUC became TAC (Text Analysis Conference)

References

American National Standards Institute Inc. 1979. American National Standard for Writing Abstracts. Technical Report ANSI Z39.14 – 1979, American National Standards Institute, New York. Borko, Harold and Charles L. Bernier. 1975. Abstracting concepts and methods. Academic Press, London. Brandow, Ronald, Karl Mitze, and Lisa F. Rau. 1995. Automatic condensation of electronic publications by sentence selection. Information Processing & Management, 31(5):675 – 685. Cleveland, Donald B. 1983. Introduction to Indexing and Abstracting. Libraries Unlimited, Inc. Edmundson, H. P. 1969. New methods in automatic extracting. Journal of the Association for Computing Machinery, 16(2):264 – 285, April. Fukusima, Takahiro and Manabu Okumura. 2001. Text Summarization Challenge: Text summarization evaluation in Japan (TSC). In Proceedings of Automatic Summarization Workshop. Goldstein, Jade, Mark Kantrowitz, Vibhu Mittal, and Jaime Carbonell. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 121 – 128, Berkeley, California, August, 15 – 19. Graetz, Naomi. 1985. Teaching EFL students to extract structural information from abstracts. In J. M. Ulijn and A. K. Pugh, editors, Reading for Professional Purposes: Methods and Materials in Teaching Languages. Leuven: Acco, pages 123 – 135. Hasler, Laura, Constantin Orăsan, and Ruslan Mitkov. 2003. Building better corpora for summarisation. In Proceedings of Corpus Linguistics 2003, pages 309 – 319, Lancaster, UK, March, 28 – 31.

Hovy, Eduard. 2003. Text summarisation. In Ruslan Mitkov, editor, The Oxford Handbook of computational linguistics. Oxford University Press, pages 583 – 598. Jing, Hongyan and Kathleen R. McKeown. 1999. The decomposition of human-written summary sentences. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (SIGIR’99), pages 129 – 136, University of Berkeley, CA, August. Johnson, Frances. 1995. Automatic abstracting research. Library review, 44(8):28 – 36. Kolcz, Aleksander, Vidya Prabakarmurthi, and Jugal Kalita. 2001. Summarization as feature selection for text categorization. In Proceedings of the 10th International Conference on Information and Knowledge Management, pages 365 – 370, Atlanta, Georgia, US, October 05 - 10. Kuo, June-Jei, Hung-Chia Wung, Chuan-Jie Lin, and Hsin-Hsi Chen. 2002. Multi-document summarization using informative words and its evaluation with a QA system. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), pages 391 – 401, Mexico City, Mexico, February, 17 – 23. Kupiec, Julian, Jan Pedersen, and Francine Chen. 1995. A trainable document summarizer. In Proceedings of the 18th ACM/SIGIR Annual Conference on Research and Development in Information Retrieval, pages 68 – 73, Seattle, July 09 – 13. Lin, Chin-Yew. 2004. ROUGE: a Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26. Louis, Annie and Ani Nenkova. 2009. Performance confidence estimation for automatic summarization. In Proceedings of the 12th Conference of the European Chapter of the ACL, pages 541 – 548, Athens, Greece, March 30 - April 3.

Mani, Inderjeet, Therese Firmin, David House, Michael Chrzanowski, Gary Klein, Lynette Hirschman, Beth Sundheim, and Leo Obrst. 1998. The TIPSTER SUMMAC text summarisation evaluation: Final report. Technical Report MTR 98W0000138, The MITRE Corporation. Mani, Inderjeet and Mark T. Maybury, editors. 1999. Advances in automatic text summarization. MIT Press. Marcu, Daniel. 1999. The automatic construction of large-scale corpora for summarization research. In The 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99), pages 137 – 144, Berkeley, CA, August 15 – 19. Miike, Seiji, Etsuo Itoh, Kenji Ono, and Kazuo Sumita. 1994. A full-text retrieval system with a dynamic abstract generation function. In Proceedings of the 17th ACM SIGIR conference, pages 152 – 161, Dublin, Ireland, 3-6 July. ACM/Springer. Minel, Jean-Luc, Sylvaine Nugier, and Gerald Piat. 1997. How to appreciate the quality of automatic text summarization? In Proceedings of the ACL’97/EACL’97 Workshop on Intelligent Scalable Text Summarization, pages 25 – 30, Madrid, Spain, July 11. Morris, Andrew H., George M. Kasper, and Dennis A. Adams. 1992. The effect and limitations of automatic text condensing on reading comprehension performance. Information Systems Research, 3(1):17 – 35. Nenkova, Ani, Rebecca Passonneau, and Kathleen McKeown. 2007. The pyramid method: Incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, 4(2). Oka, Mamiko and Yoshihiro Ueda. 2000. Evaluation of phrase-representation summarization based on information retrieval task. In NAACL-ANLP 2000 Workshop on Automatic Summarization, pages 59 – 68, Seattle, Washington, April 30.

Okumura, Manabu, Takahiro Fukusima, and Hidetsugu Nanba. 2003. Text Summarization Challenge 2: Text Summarization Evaluation at NTCIR Workshop 3. In Proceedings of the HLT-NAACL 2003 Workshop on Text Summarization, pages 49 – 56, Edmonton, Alberta, Canada, May 31 – June 1. Orăsan, Constantin, Viktor Pekar, and Laura Hasler. 2004. A comparison of summarisation methods based on term specificity estimation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC2004), pages 1037 – 1041, Lisbon, Portugal, May 26 – 28. Papineni, K., S. Roukos, T. Ward, and W. J. Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual meeting of the Association for Computational Linguistics (ACL 2002), pages 311 – 318. Saggion, Horacio and Guy Lapalme. 2000. Concept identification and presentation in the context of technical text summarization. In NAACL-ANLP 2000 Workshop on Automatic Summarization, pages 1 – 10, Seattle, Washington, April 30. Sparck Jones, Karen. 1999. Automatic summarizing: factors and directions. In Inderjeet Mani and Mark T. Maybury, editors, Advances in automatic text summarization. The MIT Press, chapter 1, pages 1 – 12. Sparck Jones, Karen. 2001. Factorial summary evaluation. In Proceedings of the Workshop on Text Summarization (DUC 2001), New Orleans, Louisiana, USA, September 13-14. Sparck Jones, Karen. 2007. Automatic summarising: The state of the art. Information Processing and Management, 43:1449 – 1481. Teufel, Simone and Marc Moens. 1997. Sentence extraction as a classification task. In Proceedings of the ACL’97/EACL’97 Workshop on Intelligent Scalable Text Summarization, pages 58 – 59, Madrid, Spain, July 11.

van Dijk, Teun A. 1980. Text and context : explorations in the semantics and pragmatics of discourse. London : Longman.