Pre-Trained Language Model-Based
Automatic Text Summarization
with Information Coverage and
Factual Error Detection
Haihan Yu
January 2021
Acknowledgements
List of figures
List of tables
1 Introduction
    1.1 Background
    1.2 Research Problem
    1.3 Thesis Outline
2 Background
    2.1 Text Summarization
        2.1.1 Extractive Summarization
        2.1.2 Abstractive Summarization
        2.1.3 Evaluation Metrics
        2.1.4 Datasets
    2.2 Encoder-Decoder Model
    2.3 Pre-Trained Language Model
        2.3.1 BERT
        2.3.2 BART
    2.4 Semantic Coverage Analysis
        2.4.1 Probabilistic Model-Based Approach
        2.4.2 Neural Network-Based Approach
    2.5 Factual Error Detection and Correction
3 Literature Review
    3.1 Text Summarization with Pre-Trained Language Models
        3.1.1 Extractive Summarization
6 Conclusions
    6.1 Thesis Conclusion
    6.2 Future Work
References
List of figures
Chapter 1
Introduction
1.1 Background
Text summarization is a classic natural language processing (NLP) problem that dates back at least 40 years. The task is to generate a shorter stream of text from the original content while preserving its important ideas. The problem has drawn considerable attention from researchers because it integrates many different NLP sub-tasks and has great potential in real-world applications, such as tools that help users navigate and digest internet content, and question answering [22]. To solve this problem properly, a system has to both understand the original content well (i.e., identify the important content) and generate logical, reasonable output (i.e., aggregate and/or paraphrase the identified content into a summary) [23]. Researchers have identified the major paradigms of summarization [24], and among them, single-document summarization has consistently drawn attention.
Modern approaches to single-document summarization are based on supervised learning over large amounts of labeled data, taking advantage of the success of neural network architectures and their ability to learn continuous features without recourse to pre-processing tools or linguistic annotations. In recent years, language model pre-training has advanced the state of the art in many NLP tasks, ranging from sentiment analysis to question answering. A variety of pre-trained language models, including ELMo [27], BERT [4],
Chapter 2
Background
and quality. So in recent years, many researchers have started trying to solve this problem with other, more advanced techniques.
SummaRuNNer [21] is one of the first summarization schemes to use a neural network approach; the authors adopt an encoder model based on recurrent neural networks (RNNs). Narayan et al. [22] then introduced a training scheme that optimizes the ROUGE metric globally via reinforcement learning. Beyond these approaches, some researchers have also approached the extractive summarization problem from a different angle. An interesting approach is introduced by Zhang et al. [36], which formulates the task as a latent-variable inference problem: the training objective is to directly maximize the likelihood of human-written summaries given the selected sentences, instead of the traditional approach of maximizing the likelihood of "gold" labels.
\[
\mathrm{ROUGE\text{-}N}(cand, ref) = \frac{\sum_{r_i \in ref} \sum_{\text{n-gram} \in r_i} \mathrm{Count}(\text{n-gram}, cand)}{\sum_{r_i \in ref} \mathrm{NumberOfNgrams}(r_i)}
\]
where, for ROUGE-L,
\[
R_{LCS}(cand, ref) = \frac{\sum_{r_i \in ref} |\mathrm{LCS}(cand, r_i)|}{\mathrm{NumberOfWords}(ref)},
\qquad
P_{LCS}(cand, ref) = \frac{\sum_{r_i \in ref} |\mathrm{LCS}(cand, r_i)|}{\mathrm{NumberOfWords}(cand)}
\]
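To make these definitions concrete, the following is a minimal sketch of how ROUGE-N could be computed for one candidate summary against a set of references. It assumes simple whitespace tokenization and clipped n-gram matching, and is only illustrative; published ROUGE scores are normally computed with the official toolkit.

from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n=2):
    """Recall-oriented ROUGE-N of one candidate against a list of references.
    Matched n-grams are clipped by their count in the candidate; the denominator
    is the total number of n-grams in the references, as in the formula above."""
    cand_counts = ngrams(candidate.lower().split(), n)
    matched, total = 0, 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        total += sum(ref_counts.values())
        matched += sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return matched / total if total else 0.0

# Example: bigram overlap of a candidate with a single reference
print(rouge_n("the maglev train set a new world record",
              ["the japan railway maglev train set a decisive new world record"]))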
2.1.4 Datasets
Currently, the most widely used datasets for text summarization are the CNN/Daily Mail corpus [10] and the NYTimes corpus [5]. The CNN/DM dataset contains around 300,000 news articles with accompanying highlight points written by editors, which are usually treated as gold summaries; similarly, the NYTimes dataset contains about 100,000 articles and their abstractive summaries. Other major datasets (e.g., XSum [23]) are also built from news articles.
Table 2.2 lists some key statistics of these mainstream datasets.
Another important problem with the current datasets is that they are entirely in English. For languages with structures similar to English, similar techniques might be transplanted directly, but for languages with completely different syntax or grammar, very little (if any) research has been conducted to see whether these techniques are applicable, due to the lack of data.
For some tasks, no input other than that of the encoder is needed: the state representation itself includes all the information required to make predictions. Other tasks (e.g., fact checking) require an extra input to the decoder to help with the prediction process. Therefore, as with the encoder, we can choose from a variety of candidate structures for the decoder and use the one that best fits the task requirements.
The most important advantage of the encoder-decoder model is that it allows the input and output to be of significantly different lengths, and it gives the flexibility to choose different internal structures while following the same general idea. However, the fixed-size state representation vector has also proven insufficient for summarizing information when the input sequence is too long. Researchers have therefore started to look for new models that help solve these issues, and one major advance is the pre-trained language model.
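The following PyTorch fragment is a toy sketch of this idea (a generic illustration, not the specific architecture of any model discussed here): an RNN encoder compresses a variable-length source into a fixed-size state vector, and a decoder unrolls that state into an output sequence of a different length.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Toy encoder-decoder: the encoder's final hidden state is the 'state
    representation' handed to the decoder, which then generates the output."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.embed(src_ids))            # keep only the final state
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)   # decode from that state
        return self.out(dec_out)                                # per-step vocabulary logits

model = Seq2Seq(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 40)),  # two source sequences of length 40
               torch.randint(0, 10000, (2, 12)))  # two target sequences of length 12
print(logits.shape)  # torch.Size([2, 12, 10000]): output length differs from input length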
status in many NLP tasks. Very recently, there have been successful attempts to apply pre-trained models to various language generation problems.
2.3.1 BERT
Bidirectional Encoder Representations from Transformers, or BERT [4], is probably the most influential pre-trained language model of recent years. Developed by Google in 2018, it is a multi-layer bidirectional Transformer encoder. The BASE version of BERT has 12 layers, 12 attention heads, and 110 million parameters, while the LARGE version has 24 layers, 16 attention heads, and 340 million parameters.
The BERT model is among the first to use a bidirectional architecture instead of a unidirectional one: BERT reads the whole input sequence at once, without a specific left-to-right or right-to-left order, which enables it to learn the context of a word from all of the words surrounding it.
BERT is pre-trained on two tasks, namely masked language modeling and next sentence prediction. In masked language modeling (Masked LM), 15% of the input tokens are randomly chosen as prediction targets. For better training behavior, the authors do not replace all of these chosen tokens with the [MASK] token: 80% are replaced with [MASK], 10% are replaced with a random word, and the remaining 10% are left unchanged. The training objective is to recover these chosen tokens based on the context provided by the other, non-masked tokens in the input sequence. Next sentence prediction is a binary classification task that requires the system to determine whether the second input sentence actually follows the first input sentence. The two tasks help the model understand language at both the word level and the sentence level, allowing it to handle downstream single-sequence and sequence-pair tasks without significant task-specific modification of its structure.
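A minimal sketch of the masking scheme described above is shown below (illustrative only; the real implementation operates on WordPiece token ids rather than plain words).

import random

def mask_tokens(tokens, vocab, target_prob=0.15):
    """BERT-style masking: choose ~15% of positions as prediction targets, then
    replace 80% of them with [MASK], 10% with a random token, and keep 10% unchanged."""
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if random.random() < target_prob:
            targets.append((i, tok))              # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: leave the original token in place
    return corrupted, targets

tokens = "the maglev train set a new world record".split()
print(mask_tokens(tokens, vocab=tokens))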
When fine-tuning for specific tasks, task-specific adjustments can be made to the BERT model to fit the requirements of downstream tasks, which gives it both flexibility and generality. With such properties, a large amount of BERT-based research appeared within a short period of time and has advanced the state-of-the-art performance for many NLP tasks.
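As a concrete illustration of this fine-tuning workflow, the snippet below loads a pre-trained BERT checkpoint with a fresh classification head using the Hugging Face transformers library; the checkpoint name and the two-label setup are assumptions for the example, not the configuration used elsewhere in this thesis.

from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# The pre-trained encoder is reused; only the small classification head is new.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The summary is consistent with the source.", return_tensors="pt")
logits = model(**inputs).logits   # fine-tuning would now update all weights end-to-end
print(logits.shape)               # torch.Size([1, 2])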
2.3.2 BART
Bidirectional and Auto-Regressive Transformers, or BART [16], is a denoising auto-encoder for pre-training sequence-to-sequence models. It was introduced by Facebook in 2019, shortly after BERT. BART is pre-trained on texts corrupted by an arbitrary noising function, with the objective of reconstructing the original texts.
BART can be seen as generalizing BERT and many other recent pre-training schemes, including GPT [28]. As we can see from Fig. 2.5, in the BART model the input to the encoder does not have to be aligned with the decoder output, which allows noise transformations of arbitrary form. In pre-training, BART is trained on a document recovery task, in which the document is corrupted by token masking, deletion, and infilling, as well as sentence permutation and document rotation. The pre-training target is to undo such corruption and reconstruct the original documents, i.e., to minimize the negative log-likelihood of the original document.
After pre-training, as with other pre-trained language models, we can further fine-tune the BART model on many different downstream tasks, e.g., sequence classification, token classification, etc. As the authors suggest, due to its pre-training scheme, BART is "particularly effective" for language generation tasks, and it also works well for comprehension tasks. The model has given even better performance than BERT on many tasks. Consequently, researchers soon started to apply BART to abstractive summarization, which is itself a language generation task that requires language comprehension.
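For reference, generating an abstractive summary with a publicly released BART checkpoint takes only a few lines with the Hugging Face transformers library; the facebook/bart-large-cnn checkpoint (fine-tuned on CNN/DM) and the decoding settings below are illustrative choices, not necessarily those used in the experiments later in this thesis.

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "A Japan Railway maglev train hit 603 kilometers per hour on an experimental track in Yamanashi ..."
batch = tokenizer(article, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(batch["input_ids"], num_beams=4, max_length=60, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))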
not from the source, or sentence structures that are completely different. Therefore, we need a more sophisticated, neural network-based approach to determine the similarity between sentences and help solve the problem.
Neural machine translation (NMT), thanks to the introduction of the attention model [33], has achieved state-of-the-art performance in recent years. But similarly to summarization, because the attention model tends to ignore past alignment information, NMT also suffers from over-translation and under-translation. Researchers have therefore proposed many different methods to alleviate this problem.
In the MT task, a common approach to word-level coverage is to maintain a coverage set that keeps track of which source words have been translated (or "covered") so far. If we define x = {x1, x2, x3, x4} as a sample sentence to be translated, then the coverage set is initialized as C = {0, 0, 0, 0}, denoting that none of the source words has been translated yet. Every time a word is translated, the corresponding change should ideally be made to the set, and for a complete and correct translation we want the coverage set to end up as C = {1, 1, 1, 1}. In this way, we can make sure that every source word is translated exactly once. To reflect the fact that source and target texts do not have a one-to-one word-level correspondence, in practice the coverage is usually kept as a vector: for each word xi in the source text, its coverage is given by a vector (c1, c2, ..., cj), in which each number represents the degree to which it is covered by the corresponding word in the target text.
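A toy sketch of this bookkeeping is shown below; real coverage models for NMT [32] accumulate soft attention weights rather than hard 0/1 marks, and the attention values here are made up for illustration.

def update_coverage(coverage, attention):
    """Accumulate the attention mass each source word has received so far."""
    return [c + a for c, a in zip(coverage, attention)]

source = ["x1", "x2", "x3", "x4"]
coverage = [0.0] * len(source)                 # C = {0, 0, 0, 0}: nothing covered yet
attention_per_step = [[0.7, 0.2, 0.1, 0.0],    # hypothetical decoder attention weights
                      [0.1, 0.8, 0.1, 0.0],
                      [0.0, 0.1, 0.6, 0.3]]
for attention in attention_per_step:
    coverage = update_coverage(coverage, attention)
print(coverage)  # a word whose total stays near 0 is likely under-translated,
                 # one far above 1 is likely over-translated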
Source: A Japan Railway maglev train hit 603 kilometers per hour (374 miles per hour) on an experimental track in Yamanashi Tuesday, setting a decisive new world record. (5 sentences abbreviated) That beat the old record of 581 kilometers per hour (361 miles per hour), which was set in 2003 during another Japanese maglev test. (The other texts abbreviated)
Summary: a japan railway maglev train hit 603 kilometers per hour ( 374 miles per hour ) on an experimental track in yamanashi tuesday<q>the new record was set in 2003 during another japanese maglev test
Table 2.3 Example of Factual Inconsistencies in Abstractive Summary
Approaches that try to solve this problem fall into two categories: one makes the model pay special attention to factual consistency during generation, while the other checks and evaluates factual correctness after the text has been generated. For either approach, a very important limiting factor is the lack of data. Vlachos and Riedel [34] collected and created a dataset of 221 labelled claims and their corresponding sources in the political domain. Beyond that, very few datasets are available for fact checking. Kryscinski et al. [15] addressed this problem with data augmentation: they modify the source text in a variety of ways to create both positive and negative examples from the original sentences, enabling models to be trained on a much larger amount of data.
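As a rough sketch of this kind of augmentation, the function below builds a negative (inconsistent) example by swapping one entity in a summary sentence for a different entity from the source document, using spaCy's NER as in the reviewed work; the model name and the exact swapping rule are assumptions here, and the full transformation set of [15] is considerably richer.

import random
import spacy

nlp = spacy.load("en_core_web_sm")   # assumed spaCy model for the example

def entity_swap(sentence, source_document):
    """Return (sentence, label): label 0 means a corrupted, likely inconsistent example."""
    sent_ents = [e.text for e in nlp(sentence).ents]
    doc_ents = [e.text for e in nlp(source_document).ents]
    candidates = [e for e in doc_ents if e not in sent_ents]
    if not sent_ents or not candidates:
        return sentence, 1                       # nothing to swap: keep as a positive example
    corrupted = sentence.replace(random.choice(sent_ents), random.choice(candidates), 1)
    return corrupted, 0

doc = "The old record of 581 kilometers per hour was set in 2003 during another Japanese maglev test."
print(entity_swap("The record was set in 2003.", doc))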
But consistency checking is only half of the problem; we also need to correct the sentence whenever it is inconsistent with the source. As a sentence generation problem, this was rather unrealistic before the advent of pre-trained language models. BART, a language model that has proven effective for various text denoising tasks, has been one of the most promising architectures for this problem. Cao et al. [3], taking advantage of the BART model and data augmentation techniques, is among
Chapter 3
Literature Review
In the training process, the extractive model was trained for 50,000 steps with gradient accumulation. The authors saved model checkpoints and evaluated them every 1,000 steps to find the best-performing checkpoint.
3.2.1 TF-IDF
Term frequency–inverse document frequency, or TF-IDF for short, introduced in 1972 [13], is a numerical statistic that describes "term specificity", i.e., it quantifies the importance of a word to a document relative to a large corpus D. Let t be a specific word and d a document. Term frequency, with the common definition listed in Table 3.1, captures how many times a word occurs in a document, reflecting the relative importance of each word (under the assumption that a word that appears more often is more important). Meanwhile, inverse document frequency, defined by idf(t, D) = log(|D| / (1 + |{d ∈ D : t ∈ d}|)) (where |{d ∈ D : t ∈ d}| is the number of documents in which word t appears), measures how much information a word provides by giving lower scores to words that appear in more documents across the corpus. The term frequency is then multiplied by the inverse document frequency, which gives higher scores to more important words while assigning almost no weight to common function words such as "the", "a", etc. In this way, TF-IDF gives a relatively simple and reliable metric for determining the similarity of two texts, without the need to actually understand the contents. One report [12] states that more than 80% of digital libraries use TF-IDF in their text-based recommendation systems.
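A minimal sketch of the computation is given below (raw counts for TF and the smoothed IDF defined above; library implementations such as scikit-learn use slightly different conventions).

import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF with raw term counts and idf(t, D) = log(|D| / (1 + |{d in D : t in d}|))."""
    tf = Counter(doc_tokens)[term]
    df = sum(1 for d in corpus if term in d)       # number of documents containing the term
    return tf * math.log(len(corpus) / (1 + df))

corpus = [["the", "maglev", "train", "set", "a", "record"],
          ["the", "train", "was", "fast"],
          ["a", "new", "world", "record"]]
print(tf_idf("maglev", corpus[0], corpus))   # rare word: positive weight
print(tf_idf("the", corpus[0], corpus))      # frequent word: weight of zero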
Requiring no training and only light pre-processing, TF-IDF is easy to implement and quick to compute, but this does not reduce its effectiveness in finding key information. Many researchers have tried to justify the TF-IDF score from the perspective of information theory. According to Aizawa [1], from the definition of conditional entropy
From these equations, we can see that summing the TF-IDF scores of all words gives the mutual information between the words and the documents.
Fig. 3.2 Example showing non-perfect translation (left) and better translation
(right) [32]
• Paraphrasing
The model uses the spaCy NER tagger [11] to extract all the entities and numbers from both the source document and the specific sentence. Then, one of the named entities or numbers in the sentence is replaced by a randomly chosen one from the source document that is different from the current one.
• Pronoun Corruption
The model finds all gender-specific pronouns in the sentence. Then, it replaces a random pronoun in the sentence with another pronoun of the
• Sentence Negation
The model scans the sentence for auxiliary verbs such as "am", "is", etc. Once all the auxiliary verbs are found, the model randomly changes one of them into its negation by adding "not" or "n't" after it, or by removing the "not" or "n't" that was originally there.
• Noise Injection
to solve. Noticing this key feature, the authors used BART as the language generation model in their approach. The authors first created a training dataset following the scheme introduced in FactCC. To avoid generating meaningless or incomprehensible texts (which, in other words, can be regarded as mislabelled data), instead of corrupting the source documents to create the training set, the authors corrupted only the summary sentences, which yields corrupted sentences with higher quality and less ambiguity. Because this is now a correction task, in which the assumption is that the input given to the system is inconsistent with the document, we do not need an additional consistent/inconsistent label at this step.
The authors corrupted 70% of the summaries and left the other 30% as they are. In this manner, they created a dataset of the following triplets: corrupted summaries s′ (of which 30% are identical to the original summaries), original summaries s, and source documents d. They then fine-tuned BART on the sentence correction task by training it on this triplet dataset. The training target of the model is to regenerate s based on s′ and d. When testing the model on a test set generated in the same manner, the model gave an overall accuracy above 70% in correcting wrong sentences into correct ones.
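The following is a rough sketch of what one training step of this correction fine-tuning could look like with the Hugging Face transformers library; the checkpoint, the way s′ and d are concatenated, and the hyper-parameters are assumptions for illustration, not the exact setup of the original authors.

import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def correction_step(corrupted_summary, source_document, gold_summary):
    """One step of learning to regenerate the gold summary s from (s', d)."""
    inputs = tokenizer(corrupted_summary + " </s> " + source_document,   # assumed input format
                       return_tensors="pt", truncation=True, max_length=1024)
    labels = tokenizer(gold_summary, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss    # negative log-likelihood of s
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()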
Chapter 4
Adjusting Extractive Summary Based on Semantic Coverage Analysis
As we have seen in Chapter 3, for two sentences S1 = {x1, x2, ..., xi} and S2 = {y1, y2, ..., yj}, we can obtain a vector-valued score for each word in sentence S1, defined as cvw(xk, S2) = (sk1, sk2, ..., skj). Based on this, we extend the idea of word-level coverage to the sentence level: we define
\[
cv'_w(x_k, S2) = \frac{1}{j} \sum_{l=1}^{j} s_{kl}, \qquad cv_s(S1, S2) = \frac{1}{i} \sum_{k=1}^{i} cv'_w(x_k, S2)
\]
as a sentence-level coverage score.
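In code, this sentence-level extension is just a mean over the word-level coverage vectors; the sketch below assumes the word-level scores s_kl have already been produced by the SCV model (the numbers shown are made up).

def word_coverage(word_scores):
    """cv'_w(x_k, S2): mean of the word's coverage vector over the j target words."""
    return sum(word_scores) / len(word_scores)

def sentence_coverage(score_matrix):
    """cv_s(S1, S2): mean of cv'_w over the i words of S1.
    score_matrix[k][l] is the coverage score s_kl of word x_k by word y_l."""
    return sum(word_coverage(row) for row in score_matrix) / len(score_matrix)

scores = [[0.9, 0.1, 0.0, 0.0],   # hypothetical scores for a 3-word sentence
          [0.2, 0.7, 0.1, 0.3],   # against a 4-word sentence
          [0.0, 0.0, 0.1, 0.2]]
print(sentence_coverage(scores))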
4.4 Results
We tested the effect of the two coverage metrics on the CNN/DM dataset, and the results are listed in Table 4.1. As the numbers largely speak for themselves, we observed a small improvement in ROUGE score when using TF-IDF as the coverage metric for the summary in extractive summarization, but the SCV metric does not bring improvements.
4.5 Discussion
As the results show, the simpler method, TF-IDF, actually improves summary quality. We believe the reason behind this improvement is that the sum of TF-IDF scores over all possible terms and documents recovers the mutual information between documents and terms, taking into account all the specificity of their joint distribution [1]. That is, the TF-IDF score of each sentence is directly connected to how much information it carries. This justification supports the idea of using information coverage analysis for extractive summarization, provided we can find an effective measure of the information in each sentence.
However, the semantic coverage vector does not seem to be a suitable candidate for this purpose. As mentioned in the previous section, the training target for the coverage vector is slightly awkward, and as it unfortunately turns out, such a bold assumption is not able to give us a satisfying result. The semantic coverage vector, as we introduced it, is a method that originated in machine translation, where the texts to be compared are of roughly the same length. Our training data does not have this property, and due to this key difference, the training does not give an acceptable and reasonable result for the actual semantic coverage between sentences. To obtain better-quality coverage output, we would need to re-train the SCV model with better data before we can draw a final conclusion on its usability in this scenario. As mentioned previously, back-translation would be able to create a large dataset suitable for this purpose; unfortunately this is not an option for us now due to financial restrictions, but it is definitely worth examining. After such an examination, we will be able to decide whether we should modify the idea of using SCV in this approach or withdraw from it completely.
Besides, another important takeaway from observing the output is that the most important limiting factor lies in the BERT model itself. Although it has brought significant improvement to the summarization task, the upper limit of its input is 512 tokens. However, as listed in Chapter 2, the average lengths of CNN and Daily Mail documents are 760 and 653 words, respectively. Therefore, in the BERTSUM model, the input documents were truncated to 512 tokens in order to fit into the BERT model. Even though a majority of the key information comes from the first few sentences, Fig. 2.1 shows that the cases where key information lies in the last few sentences are not negligible. So, another pre-trained language model that can take in more than 512 tokens would be preferable for further improving the ROUGE score.
Chapter 5
Factual Error Detection and Correction for Abstractive Summarization
In Chapter 2, we pointed out a problem that has not been well solved by current abstractive summarization models: the problem of factual errors. Because the dominant training scheme for abstractive summarization aims at maximizing the ROUGE score, and ROUGE gives no information about factual consistency, there is no mechanism that can stop a factual error from being generated. Therefore, in this chapter we explore the possibility of adding a checking mechanism to the abstractive summarization model and changing the summary whenever necessary.
5.3 Results
The CNN/DM test set has a total of 11,490 documents. In the test, 2,178 summaries were labelled as "inconsistent" with their corresponding source documents. Unfortunately, there is no convenient way to check the correctness of these claims other than manual inspection, and checking all of the questionable summaries is beyond a reasonable workload for a single person. We therefore checked only the first 50 claims and came to the following conclusions. Of the 50 claimed inconsistencies, 28 are actually inconsistent, whereas the remaining 22 are false positives. Furthermore, among the inconsistent cases, 18 have their factual errors corrected, while the other 11 fail to be corrected. A successful correction of a factual error is shown in Table 5.1:
Source: The Cleveland forward, who won two NBA titles with Miami before moving back to Ohio, helped his team to a 114-88 win at the American Airlines Arena. (The other texts abbreviated)
Original Summary: the cleveland forward helped his team to a 114-8888 win at the american airlines arena.
Modified Summary: the cleveland forward helped his team to a 114-88 win at the american airlines arena.
Table 5.1 Example of a Successful Correction of Factual Error
Accuracy (corrupted summaries): 76.8%
Accuracy (clean summaries): 91.2%
Precision (overall): 84.0%
Table 5.2 Precision and Accuracy of Error Detection
5.4 Discussion
Though we were not able to check all of the corrections made by this model, we can still evaluate its overall performance. From the second experiment, we can see that this model makes the correct judgement around 80% of the time, which is similar to what the authors have claimed, so it is in general safe to say this model can tell factual consistency in most cases. At the same time, since we have no way of knowing the factual consistency status of all the summaries, we can only say that, from the sample of 50 sentences labelled as "inconsistent", the model seems to face a rather serious false positive problem when put in this setting.
Suppose that the second experiment, conducted on the corrupted summaries, gives us credible statistics on the actual precision and accuracy of the FECM model. Then, applying some basic Bayesian probabilistic calculation, we can roughly conclude that about 10-20% of the summaries produced by the abstractive model are inconsistent with the source document. In about 85% × (1 − 90%) = 8.5% of the cases, a consistent summary is labelled as inconsistent, while in about 15% × 75% ≈ 11.3% of the cases, an inconsistent summary is correctly labelled. Therefore, the current accuracy on labelled data, though reaching a decent level, is unfortunately at a point where a large fraction of the flagged cases are Type I errors (8.5 / (8.5 + 11.3) ≈ 43%). This calculation is largely consistent with the numbers we obtained from the experiment. Given that in many cases (around 40%) the model is also unable to correct the mistakes in the sentence, the overall improvement in factual consistency from this model is rather limited. Therefore, a factual error detection and correction model with even higher accuracy is necessary to effectively reduce the Type I errors.
If we use the same abstractive summarization model, a correction model with an accuracy of around 90% would effectively alleviate the false positive problem, which means we need roughly a 10% performance boost over the current factual error detection system. Furthermore, as is the case for all probabilistic models, we have not been able to do anything about the false negative errors that occur. Although it is relatively safe to say this will not happen very often, we still need to take account of it by adding extra statistical or probabilistic measures to deal with it.
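The back-of-the-envelope calculation above can be written out explicitly; the rates below are the assumed round figures from the text (about 15% of summaries actually inconsistent, roughly 90% detection accuracy on clean summaries and 75% on corrupted ones).

p_inconsistent = 0.15   # assumed share of generated summaries that are actually inconsistent
acc_clean = 0.90        # detection accuracy on consistent summaries (~91.2% in Table 5.2)
acc_corrupted = 0.75    # detection accuracy on inconsistent summaries (~76.8% in Table 5.2)

false_positive = (1 - p_inconsistent) * (1 - acc_clean)   # consistent but flagged: 0.085
true_positive = p_inconsistent * acc_corrupted            # inconsistent and flagged: ~0.113
type_i_share = false_positive / (false_positive + true_positive)
print(round(type_i_share, 2))  # ~0.43: close to the 22/50 = 44% false positives found by hand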
Statistical pitfalls aside, our experiment has demonstrated the feasibility of the idea of correcting the output after the summaries are generated. With a detection model of higher quality, we can expect a substantial improvement in the factual consistency of abstractive summarization.
Chapter 6
Conclusions
ones. Tests on the system itself give a somewhat satisfying result. Adding the factual error detection and correction model has helped correct some erroneous sentences. But due to the problematic properties of the abstractive summaries generated by the model, it has also introduced a rather serious Type I error problem, which limits the overall improvement from this idea.
[20] Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290. Association for Computational Linguistics, August 2016.
[32] Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, 2017.
[34] Andreas Vlachos and Sebastian Riedel. Fact checking: Task definition
and dataset construction. In Proceedings of the ACL 2014 Workshop
on Language Technologies and Computational Social Science, pages 18–
22, Baltimore, MD, USA, June 2014. Association for Computational
Linguistics. doi: 10.3115/v1/W14-2508. URL https://www.aclweb.org/
anthology/W14-2508.
[35] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhut-
dinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining
for language understanding. In Advances in Neural Information Processing
Systems 32, pages 5754–5764. Curran Associates, Inc., 2019.
[36] Xingxing Zhang, Mirella Lapata, Furu Wei, and Ming Zhou. Neural
latent extractive document summarization. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, pages
779–784, Brussels, Belgium, October-November 2018. Association for
Computational Linguistics.