Pre-Trained Language Model-Based
Automatic Text Summarization
with Information Coverage and
Factual Error Detection
Haihan Yu
January 2021
Acknowledgements
List of figures
List of tables
1 Introduction
    1.1 Background
    1.2 Research Problem
    1.3 Thesis Outline
2 Background
    2.1 Text Summarization
        2.1.1 Extractive Summarization
        2.1.2 Abstractive Summarization
        2.1.3 Evaluation Metrics
        2.1.4 Datasets
    2.2 Encoder-Decoder Model
    2.3 Pre-Trained Language Model
        2.3.1 BERT
        2.3.2 BART
    2.4 Semantic Coverage Analysis
        2.4.1 Probabilistic Model-Based Approach
        2.4.2 Neural Network-Based Approach
    2.5 Factual Error Detection and Correction
3 Literature Review
    3.1 Text Summarization with Pre-Trained Language Models
        3.1.1 Extractive Summarization
6 Conclusions
    6.1 Thesis Conclusion
    6.2 Future Work
References
List of figures
Chapter 1
Introduction
1.1 Background
Text summarization is a classic natural language processing (NLP) problem that dates back at least 40 years. The task is to generate a shorter stream of text from the original content while preserving its important ideas. The problem has drawn considerable attention from researchers because it integrates many different NLP sub-tasks and has great potential in real-world applications, such as tools that help users navigate and digest internet content, and question answering [22]. To solve this problem properly, a system has to both understand the original content well (i.e., identify the important content) and generate logical, reasonable output (i.e., aggregate and/or paraphrase the identified content into a summary) [23]. Researchers have identified the major paradigms of summarization [24], and among them, single-document summarization has consistently drawn attention.
Modern approaches to single-document summarization are based on supervised learning over large amounts of labeled data, taking advantage of the success of neural network architectures and their ability to learn continuous features without recourse to pre-processing tools or linguistic annotations. In recent years, language model pre-training has advanced the state of the art in many NLP tasks, ranging from sentiment analysis to question answering. A variety of pre-trained language models, including ELMo [27], BERT [4],
Chapter 2
Background
and quality. So in recent years, many researchers have started trying to solve this problem with other, more advanced techniques.
SummaRuNNer [21] is one of the first summarization schemes to use a neural network approach; the authors adopt an encoder model based on recurrent neural networks (RNNs). Narayan et al. [22] then introduced a training scheme that optimizes the ROUGE metric globally via reinforcement learning. Beyond these approaches, some researchers have also approached the extractive summarization problem from a different angle. An interesting approach is introduced by Zhang et al. [36], which formulates the task as a latent-variable inference problem: the training objective is to directly maximize the likelihood of human-written summaries given the selected sentences, instead of the traditional approach of maximizing the likelihood of "gold" labels.
\[
\mathrm{ROUGE\text{-}N}(cand, ref) = \frac{\sum_{r_i \in ref} \sum_{\text{n-gram} \in r_i} \mathrm{Count}(\text{n-gram}, cand)}{\sum_{r_i \in ref} \mathrm{NumberOfNgrams}(r_i)}
\]
where, for ROUGE-L,
\[
R_{LCS}(cand, ref) = \frac{\sum_{r_i \in ref} |\mathrm{LCS}(cand, r_i)|}{\mathrm{NumberOfWords}(ref)},
\qquad
P_{LCS}(cand, ref) = \frac{\sum_{r_i \in ref} |\mathrm{LCS}(cand, r_i)|}{\mathrm{NumberOfWords}(cand)}
\]
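To make these definitions concrete, the following is a minimal sketch of how ROUGE-N could be computed for one candidate summary against a set of references. It assumes simple whitespace tokenization and clipped n-gram matching, and is only illustrative; published ROUGE scores are normally computed with the official toolkit.

from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    """Recall-oriented ROUGE-N of one candidate against a list of references.
    Matched n-grams are clipped by their count in the candidate; the denominator
    is the total number of n-grams in the references, as in the formula above."""
    cand_counts = ngrams(candidate.lower().split(), n)
    matched, total = 0, 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        total += sum(ref_counts.values())
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0

# Example: bigram overlap of a candidate with a single reference
print(rouge_n("the maglev train set a new world record",
              ["the japan railway maglev train set a decisive new world record"]))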
2.1.4 Datasets
Currently, the most widely used datasets for text summarization are the CNN/Daily Mail corpus [10] and the NYTimes corpus [5]. The CNN/DM dataset contains around 300,000 news articles with accompanying highlight points written by editors, which are usually treated as gold summaries; similarly, the NYTimes dataset contains about 100,000 articles and their abstractive summaries. Other major datasets (e.g., XSum [23]) are also built from news articles.
Table 2.2 lists some key statistics of these mainstream datasets.
Another important problem with the current datasets is that they are entirely in English. For languages with structures similar to English, similar techniques might be transplanted directly, but for languages with completely different syntax or grammar, very little (if any) research has been conducted to see whether these techniques are applicable, due to the lack of data.
For some tasks, no input other than that of the encoder is needed: the state representation itself includes all the information required to make predictions. Other tasks (e.g., fact checking) require an extra input to the decoder to help with the prediction process. Therefore, as with the encoder, we can choose from a variety of candidate structures for the decoder and use the one that best fits the task requirements.
The most important advantage of the encoder-decoder model is that it allows the input and output to be of significantly different lengths, and it gives the flexibility to choose different internal structures while following the same general idea. However, the fixed-size state representation vector has also proven insufficient for summarizing information when the input sequence is too long. Researchers have therefore started to look for new models that help solve these issues, and one major advance is the pre-trained language model.
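The following PyTorch fragment is a toy sketch of this idea (a generic illustration, not the specific architecture of any model discussed here): an RNN encoder compresses a variable-length source into a fixed-size state vector, and a decoder unrolls that state into an output sequence of a different length.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder: the encoder's final hidden state is the 'state
    representation' handed to the decoder, which then generates the output."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.embed(src_ids))            # keep only the final state
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)   # decode from that state
        return self.out(dec_out)                                # per-step vocabulary logits

model = Seq2Seq(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 40)),  # two source sequences of length 40
               torch.randint(0, 10000, (2, 12)))  # two target sequences of length 12
print(logits.shape)  # torch.Size([2, 12, 10000]): output length differs from input length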
status in many NLP tasks. Very recently, there have been successful attempts to apply pre-trained models to various language generation problems.
2.3.1 BERT
Bidirectional Encoder Representations from Transformers, or BERT [4], is probably the most influential pre-trained language model of recent years. Developed by Google in 2018, it is a multi-layer bidirectional Transformer encoder. The BASE version of BERT has 12 layers, 12 attention heads, and 110 million parameters, while the LARGE version has 24 layers, 16 attention heads, and 340 million parameters.
The BERT model is among the first to use a bidirectional architecture instead of a unidirectional one: BERT reads the whole input sequence at once, without a specific left-to-right or right-to-left order, which enables it to learn the context of a word from all of the words surrounding it.
BERT is pre-trained on two tasks, namely masked language modeling and next sentence prediction. In masked language modeling (Masked LM), 15% of the input tokens are randomly chosen as prediction targets. For better training behavior, the authors do not replace all of these chosen tokens with the [MASK] token: 80% are replaced with [MASK], 10% are replaced with a random word, and the remaining 10% are left unchanged. The training objective is to recover these chosen tokens based on the context provided by the other, non-masked tokens in the input sequence. Next sentence prediction is a binary classification task that requires the system to determine whether the second input sentence actually follows the first input sentence. The two tasks help the model understand language at both the word level and the sentence level, allowing it to handle downstream single-sequence and sequence-pair tasks without significant task-specific modification of its structure.
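A minimal sketch of the masking scheme described above is shown below (illustrative only; the real implementation operates on WordPiece token ids rather than plain words).

import random

def mask_tokens(tokens, vocab, target_prob=0.15):
    """BERT-style masking: choose ~15% of positions as prediction targets, then
    replace 80% of them with [MASK], 10% with a random token, and keep 10% unchanged."""
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < target_prob:
            targets.append((i, tok))              # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: leave the original token in place
    return corrupted, targets

tokens = "the maglev train set a new world record".split()
print(mask_tokens(tokens, vocab=tokens))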
When fine-tuning for specific tasks, task-specific adjustments can be made to the BERT model to fit the requirements of downstream tasks, which gives it both flexibility and generality. With such properties, a large amount of BERT-based research appeared within a short period of time and has advanced the state-of-the-art performance for many NLP tasks.
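As a concrete illustration of this fine-tuning workflow, the snippet below loads a pre-trained BERT checkpoint with a fresh classification head using the Hugging Face transformers library; the checkpoint name and the two-label setup are assumptions for the example, not the configuration used elsewhere in this thesis.

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The pre-trained encoder is reused; only the small classification head is new.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The summary is consistent with the source.", return_tensors="pt")
logits = model(**inputs).logits   # fine-tuning would now update all weights end-to-end
print(logits.shape)               # torch.Size([1, 2])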
2.3.2 BART
Bidirectional and Auto-Regressive Transformers, or BART [16], is a denoising auto-encoder for pre-training sequence-to-sequence models. It was introduced by Facebook in 2019, shortly after BERT. BART is pre-trained on texts corrupted by an arbitrary noising function, with the objective of reconstructing the original texts.
BART can be seen as generalizing BERT and many other recent pre-training schemes, including GPT [28]. As we can see from Fig. 2.5, in the BART model the input to the encoder does not have to be aligned with the decoder output, which allows noise transformations of arbitrary form. In pre-training, BART is trained on a document recovery task, in which the document is corrupted by token masking, deletion, and infilling, as well as sentence permutation and document rotation. The pre-training target is to undo such corruption and reconstruct the original documents, i.e., to minimize the negative log-likelihood of the original document.
After pre-training, as with other pre-trained language models, we can further fine-tune the BART model on many different downstream tasks, e.g., sequence classification, token classification, etc. As the authors suggest, due to its pre-training scheme, BART is "particularly effective" for language generation tasks, and it also works well for comprehension tasks. The model has given even better performance than BERT on many tasks. Consequently, researchers soon started to apply BART to abstractive summarization, which is itself a language generation task that requires language comprehension.
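For reference, generating an abstractive summary with a publicly released BART checkpoint takes only a few lines with the Hugging Face transformers library; the facebook/bart-large-cnn checkpoint (fine-tuned on CNN/DM) and the decoding settings below are illustrative choices, not necessarily those used in the experiments later in this thesis.

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "A Japan Railway maglev train hit 603 kilometers per hour on an experimental track in Yamanashi ..."
batch = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(batch["input_ids"], num_beams=4, max_length=60, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))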
not from the source, or sentence structures that are completely different. Therefore, we need a more sophisticated, neural network-based approach to determine the similarity between sentences and help solve the problem.
Neural machine translation (NMT), thanks to the introduction of the attention model [33], has achieved state-of-the-art performance in recent years. But similarly to summarization, because the attention model tends to ignore past alignment information, NMT also suffers from over-translation and under-translation. Researchers have therefore proposed many different methods to alleviate this problem.
In the MT task, a common approach to word-level coverage is to maintain a coverage set that keeps track of which source words have been translated (or "covered") so far. If we define x = {x1, x2, x3, x4} as a sample sentence to be translated, then the coverage set is initialized as C = {0, 0, 0, 0}, denoting that none of the source words has been translated yet. Every time a word is translated, the corresponding change should ideally be made to the set, and for a complete and correct translation we want the coverage set to end up as C = {1, 1, 1, 1}. In this way, we can make sure that every source word is translated exactly once. To reflect the fact that source and target texts do not have a one-to-one word-level correspondence, in practice the coverage is usually kept as a vector: for each word xi in the source text, its coverage is given by a vector (c1, c2, ..., cj), in which each number represents the degree to which it is covered by the corresponding word in the target text.
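A toy sketch of this bookkeeping is shown below; real coverage models for NMT [32] accumulate soft attention weights rather than hard 0/1 marks, and the attention values here are made up for illustration.

def update_coverage(coverage, attention):
    """Accumulate the attention mass each source word has received so far."""
    return [c + a for c, a in zip(coverage, attention)]

source = ["x1", "x2", "x3", "x4"]
coverage = [0.0] * len(source)                 # C = {0, 0, 0, 0}: nothing covered yet
attention_per_step = [[0.7, 0.2, 0.1, 0.0],    # hypothetical decoder attention weights
                      [0.1, 0.8, 0.1, 0.0],
                      [0.0, 0.1, 0.6, 0.3]]
for attention in attention_per_step:
    coverage = update_coverage(coverage, attention)
print(coverage)  # a word whose total stays near 0 is likely under-translated,
                 # one far above 1 is likely over-translated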
Source: A Japan Railway maglev train hit 603 kilometers per hour (374 miles per hour) on an experimental track in Yamanashi Tuesday, setting a decisive new world record. (5 sentences abbreviated) That beat the old record of 581 kilometers per hour (361 miles per hour), which was set in 2003 during another Japanese maglev test. (The other texts abbreviated)
Summary: a japan railway maglev train hit 603 kilometers per hour ( 374 miles per hour ) on an experimental track in yamanashi tuesday<q>the new record was set in 2003 during another japanese maglev test
Table 2.3 Example of Factual Inconsistencies in Abstractive Summary
Approaches that try to solve this problem fall into two categories: one makes the model pay special attention to factual consistency during generation, while the other checks and evaluates factual correctness after the text has been generated. For either approach, a very important limiting factor is the lack of data. Vlachos and Riedel [34] collected and created a dataset of 221 labelled claims and their corresponding sources in the political domain. Beyond that, very few datasets are available for fact checking. Kryscinski et al. [15] addressed this problem with data augmentation: they modify the source text in a variety of ways to create both positive and negative examples from the original sentences, enabling models to be trained on a much larger amount of data.
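As a rough sketch of this kind of augmentation, the function below builds a negative (inconsistent) example by swapping one entity in a summary sentence for a different entity from the source document, using spaCy's NER as in the reviewed work; the model name and the exact swapping rule are assumptions here, and the full transformation set of [15] is considerably richer.

import random
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed spaCy model for the example

def entity_swap(sentence, source_document):
    """Return (sentence, label): label 0 means a corrupted, likely inconsistent example."""
    sent_ents = [e.text for e in nlp(sentence).ents]
    doc_ents = [e.text for e in nlp(source_document).ents]
    candidates = [e for e in doc_ents if e not in sent_ents]
    if not sent_ents or not candidates:
        return sentence, 1                       # nothing to swap: keep as a positive example
    corrupted = sentence.replace(random.choice(sent_ents), random.choice(candidates), 1)
    return corrupted, 0

doc = "The old record of 581 kilometers per hour was set in 2003 during another Japanese maglev test."
print(entity_swap("The record was set in 2003.", doc))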
But consistency checking is only half of the problem; we also need to correct the sentence whenever it is inconsistent with the source. As a sentence generation problem, this was rather unrealistic before the advent of pre-trained language models. BART, a language model that has proven effective for various text denoising tasks, has been one of the most promising architectures for this problem. Cao et al. [3], taking advantage of the BART model and data augmentation techniques, is among
Chapter 3
Literature Review
In the training process, the extractive model was trained for 50,000 steps with gradient accumulation. The authors saved model checkpoints and evaluated them every 1,000 steps to find the best-performing checkpoint.
3.2.1 TF-IDF
Term frequency–inverse document frequency, or TF-IDF for short, introduced in 1972 [13], is a numerical statistic that describes "term specificity", i.e., it quantifies the importance of a word to a document relative to a large corpus D. Let t be a specific word and d a document. Term frequency, with the common definition listed in Table 3.1, captures how many times a word occurs in a document, reflecting the relative importance of each word (under the assumption that a word that appears more often is more important). Meanwhile, inverse document frequency, defined by idf(t, D) = log(|D| / (1 + |{d ∈ D : t ∈ d}|)) (where |{d ∈ D : t ∈ d}| is the number of documents in which word t appears), measures how much information a word provides by giving lower scores to words that appear in more documents across the corpus. The term frequency is then multiplied by the inverse document frequency, which gives higher scores to more important words while assigning almost no weight to common function words such as "the", "a", etc. In this way, TF-IDF gives a relatively simple and reliable metric for determining the similarity of two texts, without the need to actually understand the contents. One report [12] states that more than 80% of digital libraries use TF-IDF in their text-based recommendation systems.
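A minimal sketch of the computation is given below (raw counts for TF and the smoothed IDF defined above; library implementations such as scikit-learn use slightly different conventions).

import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF with raw term counts and idf(t, D) = log(|D| / (1 + |{d in D : t in d}|))."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for d in corpus if term in d)       # number of documents containing the term
    return tf * math.log(len(corpus) / (1 + df))

corpus = [["the", "maglev", "train", "set", "a", "record"],
          ["the", "train", "was", "fast"],
          ["a", "new", "world", "record"]]
print(tf_idf("maglev", corpus[0], corpus))   # rare word: positive weight
print(tf_idf("the", corpus[0], corpus))      # frequent word: weight of zero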
Requiring no training and only light pre-processing, TF-IDF is easy to implement and quick to compute, but this does not reduce its effectiveness in finding key information. Many researchers have tried to justify the TF-IDF score from the perspective of information theory. According to Aizawa [1], from the definition of conditional entropy
From these equations, we can see that summing the TF-IDF scores of all words gives the mutual information between the words and the documents.
Fig. 3.2 Example showing non-perfect translation (left) and better translation
(right) [32]
• Paraphrasing
The model uses the spaCy NER tagger [11] to extract all the entities and numbers from both the source document and the specific sentence. Then, one of the named entities or numbers in the sentence is replaced by a randomly chosen one from the source document that is different from the current one.
• Pronoun Corruption
The model finds all gender-specific pronouns in the sentence. Then, it replaces a random pronoun in the sentence with another pronoun of the
• Sentence Negation
The model scans the sentence for auxiliary verbs such as "am", "is", etc. Once all the auxiliary verbs are found, the model randomly changes one of them into its negation by adding "not" or "n't" after it, or by removing the "not" or "n't" that was originally there.
• Noise Injection
to solve. Noticing this key feature, the authors used BART as the language generation model in their approach. The authors first created a training dataset following the scheme introduced in FactCC. To avoid generating meaningless or incomprehensible texts (which, in other words, can be regarded as mislabelled data), instead of corrupting the source documents to create the training set, the authors corrupted only the summary sentences, which yields corrupted sentences with higher quality and less ambiguity. Because this is now a correction task, in which the assumption is that the input given to the system is inconsistent with the document, we do not need an additional consistent/inconsistent label at this step.
The authors corrupted 70% of the summaries and left the other 30% as they are. In this manner, they created a dataset of the following triplets: corrupted summaries s′ (of which 30% are identical to the original summaries), original summaries s, and source documents d. They then fine-tuned BART on the sentence correction task by training it on this triplet dataset. The training target of the model is to regenerate s based on s′ and d. When testing the model on a test set generated in the same manner, the model gave an overall accuracy above 70% in correcting wrong sentences into correct ones.
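The following is a rough sketch of what one training step of this correction fine-tuning could look like with the Hugging Face transformers library; the checkpoint, the way s′ and d are concatenated, and the hyper-parameters are assumptions for illustration, not the exact setup of the original authors.

import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def correction_step(corrupted_summary, source_document, gold_summary):
    """One step of learning to regenerate the gold summary s from (s', d)."""
    inputs = tokenizer(corrupted_summary + " </s> " + source_document,   # assumed input format
                       return_tensors="pt", truncation=True, max_length=1024)
    labels = tokenizer(gold_summary, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss    # negative log-likelihood of s
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()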
Chapter 4
Adjusting Extractive Summary Based on Semantic Coverage Analysis
As we have seen in Chapter 3, for two sentences S1 = {x1, x2, ..., xi} and S2 = {y1, y2, ..., yj}, we can obtain a vector-valued score for each word in sentence S1, defined as cvw(xk, S2) = (sk1, sk2, ..., skj). Based on this, we extend the idea of word-level coverage to the sentence level: we define
\[
cv'_w(x_k, S2) = \frac{1}{j} \sum_{l=1}^{j} s_{kl}, \qquad cv_s(S1, S2) = \frac{1}{i} \sum_{k=1}^{i} cv'_w(x_k, S2)
\]
as a sentence-level coverage score.
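In code, this sentence-level extension is just a mean over the word-level coverage vectors; the sketch below assumes the word-level scores s_kl have already been produced by the SCV model (the numbers shown are made up).

def word_coverage(word_scores):
    """cv'_w(x_k, S2): mean of the word's coverage vector over the j target words."""
    return sum(word_scores) / len(word_scores)

def sentence_coverage(score_matrix):
    """cv_s(S1, S2): mean of cv'_w over the i words of S1.
    score_matrix[k][l] is the coverage score s_kl of word x_k by word y_l."""
    return sum(word_coverage(row) for row in score_matrix) / len(score_matrix)

scores = [[0.9, 0.1, 0.0, 0.0],   # hypothetical scores for a 3-word sentence
          [0.2, 0.7, 0.1, 0.3],   # against a 4-word sentence
          [0.0, 0.0, 0.1, 0.2]]
print(sentence_coverage(scores))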
4.4 Results
We tested the effect of the two coverage metrics on the CNN/DM dataset, and the results are listed in Table 4.1. As the numbers largely speak for themselves, we observed a small improvement in ROUGE score when using TF-IDF as the coverage metric for the summary in extractive summarization, but the SCV metric does not bring improvements.
4.5 Discussion
As the results show, the simpler method, TF-IDF, actually improves summary quality. We believe the reason behind this improvement is that the sum of TF-IDF scores over all possible terms and documents recovers the mutual information between documents and terms, taking into account all the specificity of their joint distribution [1]. That is, the TF-IDF score of each sentence is directly connected to how much information it carries. This justification supports the idea of using information coverage analysis for extractive summarization, provided we can find an effective measure of the information in each sentence.
However, the semantic coverage vector does not seem to be a suitable candidate for this purpose. As mentioned in the previous section, the training target for the coverage vector is slightly awkward, and as it unfortunately turns out, such a bold assumption is not able to give us a satisfying result. The semantic coverage vector, as we introduced it, is a method that originated in machine translation, where the texts to be compared are of roughly the same length. Our training data does not have this property, and due to this key difference, the training does not give an acceptable and reasonable result for the actual semantic coverage between sentences. To obtain better-quality coverage output, we would need to re-train the SCV model with better data before we can draw a final conclusion on its usability in this scenario. As mentioned previously, back-translation would be able to create a large dataset suitable for this purpose; unfortunately this is not an option for us now due to financial restrictions, but it is definitely worth examining. After such an examination, we will be able to decide whether we should modify the idea of using SCV in this approach or withdraw from it completely.
Besides, another important takeaway from observing the output is that the most important limiting factor lies in the BERT model itself. Although it has brought significant improvement to the summarization task, the upper limit of its input is 512 tokens. However, as listed in Chapter 2, the average lengths of CNN and Daily Mail documents are 760 and 653 words, respectively. Therefore, in the BERTSUM model, the input documents were truncated to 512 tokens in order to fit into the BERT model. Even though a majority of the key information comes from the first few sentences, Fig. 2.1 shows that the cases where key information lies in the last few sentences are not negligible. So, another pre-trained language model that can take in more than 512 tokens would be preferable for further improving the ROUGE score.
Chapter 5
Factual Error Detection and Correction for Abstractive Summarization
In Chapter 2, we pointed out a problem that has not been well solved by current abstractive summarization models: the problem of factual errors. Because the dominant training scheme for abstractive summarization aims at maximizing the ROUGE score, and ROUGE gives no information about factual consistency, there is no mechanism that can stop a factual error from being generated. Therefore, in this chapter we explore the possibility of adding a checking mechanism to the abstractive summarization model and changing the summary whenever necessary.
5.3 Results
The CNN/DM test set has a total of 11,490 documents. In the test, 2,178 summaries were labelled as "inconsistent" with their corresponding source documents. Unfortunately, there is no convenient way to check the correctness of these claims other than manual inspection, and checking all of the questionable summaries is beyond a reasonable workload for a single person. We therefore checked only the first 50 claims and came to the following conclusions. Of the 50 claimed inconsistencies, 28 are actually inconsistent, whereas the remaining 22 are false positives. Furthermore, among the inconsistent cases, 18 have their factual errors corrected, while the other 11 fail to be corrected. A successful correction of a factual error is shown in Table 5.1:
Source: The Cleveland forward, who won two NBA titles with Miami before moving back to Ohio, helped his team to a 114-88 win at the American Airlines Arena. (The other texts abbreviated)
Original Summary: the cleveland forward helped his team to a 114-8888 win at the american airlines arena.
Modified Summary: the cleveland forward helped his team to a 114-88 win at the american airlines arena.
Table 5.1 Example of a Successful Correction of Factual Error
Accuracy (corrupted summaries): 76.8%
Accuracy (clean summaries): 91.2%
Precision (overall): 84.0%
Table 5.2 Precision and Accuracy of Error Detection
5.4 Discussion
Though we were not able to check all of the corrections made by this model, we can still evaluate its overall performance. From the second experiment, we can see that this model makes the correct judgement around 80% of the time, which is similar to what the authors have claimed, so it is in general safe to say this model can tell factual consistency in most cases. At the same time, since we have no way of knowing the factual consistency status of all the summaries, we can only say that, from the sample of 50 sentences labelled as "inconsistent", the model seems to face a rather serious false positive problem when put in this setting.
Suppose that the second experiment, conducted on the corrupted summaries, gives us credible statistics on the actual precision and accuracy of the FECM model. Then, applying some basic Bayesian probabilistic calculation, we can roughly conclude that about 10-20% of the summaries produced by the abstractive model are inconsistent with the source document. In about 85% × (1 − 90%) = 8.5% of the cases, a consistent summary is labelled as inconsistent, while in about 15% × 75% ≈ 11.3% of the cases, an inconsistent summary is correctly labelled. Therefore, the current accuracy on labelled data, though reaching a decent level, is unfortunately at a point where a large fraction of the flagged cases are Type I errors (8.5 / (8.5 + 11.3) ≈ 43%). This calculation is largely consistent with the numbers we obtained from the experiment. Given that in many cases (around 40%) the model is also unable to correct the mistakes in the sentence, the overall improvement in factual consistency from this model is rather limited. Therefore, a factual error detection and correction model with even higher accuracy is necessary to effectively reduce the Type I errors.
If we use the same abstractive summarization model, a correction model with an accuracy of around 90% would effectively alleviate the false positive problem, which means we need roughly a 10% performance boost over the current factual error detection system. Furthermore, as is the case for all probabilistic models, we have not been able to do anything about the false negative errors that occur. Although it is relatively safe to say this will not happen very often, we still need to take account of it by adding extra statistical or probabilistic measures to deal with it.
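The back-of-the-envelope calculation above can be written out explicitly; the rates below are the assumed round figures from the text (about 15% of summaries actually inconsistent, roughly 90% detection accuracy on clean summaries and 75% on corrupted ones).

p_inconsistent = 0.15   # assumed share of generated summaries that are actually inconsistent
acc_clean = 0.90        # detection accuracy on consistent summaries (~91.2% in Table 5.2)
acc_corrupted = 0.75    # detection accuracy on inconsistent summaries (~76.8% in Table 5.2)

false_positive = (1 - p_inconsistent) * (1 - acc_clean)   # consistent but flagged: 0.085
true_positive = p_inconsistent * acc_corrupted            # inconsistent and flagged: ~0.113
type_i_share = false_positive / (false_positive + true_positive)
print(round(type_i_share, 2))  # ~0.43: close to the 22/50 = 44% false positives found by hand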
Statistical pitfalls aside, our experiment has demonstrated the feasibility of the idea of correcting the output after the summaries are generated. With a detection model of higher quality, we can expect a substantial improvement in the factual consistency of abstractive summarization.
Chapter 6
Conclusions
ones. Tests on the system itself give a somewhat satisfying result. Adding the factual error detection and correction model has helped correct some erroneous sentences. But due to the problematic properties of the abstractive summaries generated by the model, it has also introduced a rather serious Type I error problem, which limits the overall improvement from this idea.
[20] Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290. Association for Computational Linguistics, August 2016.
[32] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017.
[34] Andreas Vlachos and Sebastian Riedel. Fact checking: Task definition
and dataset construction. In Proceedings of the ACL 2014 Workshop
on Language Technologies and Computational Social Science, pages 18–
22, Baltimore, MD, USA, June 2014. Association for Computational
Linguistics. doi: 10.3115/v1/W14-2508. URL https://www.aclweb.org/
anthology/W14-2508.
[35] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhut-
dinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining
for language understanding. In Advances in Neural Information Processing
Systems 32, pages 5754–5764. Curran Associates, Inc., 2019.
[36] Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. Neural
latent extractive document summarization. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, pages
779–784, Brussels, Belgium, October-November 2018. Association for
Computational Linguistics.