
Early detection of COVID-19 fake news
on social media posts using deep
learning methods

Jasper van Gool


STUDENT NUMBER: 2045796

THESIS SUBMITTED IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE IN COGNITIVE SCIENCE & DATA SCIENCE & SOCIETY
DEPARTMENT OF COGNITIVE SCIENCE & ARTIFICIAL INTELLIGENCE
SCHOOL OF HUMANITIES AND DIGITAL SCIENCES
TILBURG UNIVERSITY

Thesis committee:
Dr. Grzegorz Chrupala
Dr. Yash Satsangi

Tilburg University
School of Humanities and Digital Sciences
Department of Cognitive Science & Artificial Intelligence
Tilburg, The Netherlands
May 2021
Preface

Dear reader,

This thesis marks the end of my Master in Data Science and Society. Thank you
for taking the time to read this report. Before you lies the dissertation "Early detection
of COVID-19 fake news on social media posts using deep learning methods", a study
that dives into the automatic classification of COVID-19 related fake news, by adopting
a deep learning approach. The research was difficult, but conducting extensive
investigation has allowed me to answer the research questions.

Firstly, I would like to thank my thesis supervisor, Grzegorz Chrupala, for your
excellent guidance during this process. The meetings have been very helpful and
contributed to me finishing my thesis. Due to COVID-19, we didn’t meet face-to-face,
but the advice you gave in the online meetings via Teams was very helpful. I also
wish to thank Paulien for your ever-present love and support. My parents deserve a
particular note for their support; your kind and reassuring words helped me through
some difficult times.

I hope you enjoy reading.

Jasper van Gool

Tilburg, May 2021


Early detection of COVID-19 fake news on
social media posts using deep learning
methods

Jasper van Gool

The ’infodemic’ surrounding COVID-19 is affecting people worldwide by causing misunder-
standing, mistrust, and rumors. Fake news diffuses faster than real news, in a time when
trustworthy information is of great importance for public health. Containing the spread of fake
news at an early stage could help in the fight against the ’infodemic’. In this work, classification
of fake news on COVID-19 related social media posts based on linguistic features is investigated
by adopting a deep learning approach. Popular methods for fake news classification are the
traditional language models LSTM and BI-LSTM, and the state-of-the art transformer-based
models. This study compares different variations of these models to determine which model
is superior in terms of predicting fake news on two datasets. The main dataset is a dataset
containing COVID-19 related social media posts containing fake and real news. The additional
dataset is a dataset containing general fake and real news articles. Generalizability of the models
is compared and evaluated by using both datasets crosswise as test sets. The pre-trained natural
language processing system RoBERTa proved to be the best performing deep learning model
on both datasets, which corresponds with previous research on the topic. However, cross-
dataset evaluation using the other dataset showed that generalizability still causes problems.
The model fitted on the COVID-19 data in particular showed a significant decline in performance
when encountering the unseen data. The model fitted on the general news dataset also showed a
decline in performance when predicting whether the COVID-related data was real or fake. The
model did seem to generalize better than the model fitted on COVID-19 news.

Contents
1 Introduction 3
1.1 Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Relevance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Background 5
2.1 Fake news . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Research contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Methods 9
3.1 Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Pre-trained language models . . . . . . . . . . . . . . . . . . . . . . . . . . 10

4 Experimental Setup 13
4.1 Dataset description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 Data preprocessing and preparation . . . . . . . . . . . . . . . . . . . . . . 15
4.3 Experimental Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

5 Results 22
5.1 Machine Learning models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Deep Learning models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

6 Discussion 25
6.1 Model performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.2 Performance on unseen data from other dataset . . . . . . . . . . . . . . . 26
6.3 Limitations and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

7 Conclusion 28


1 Introduction

1.1 Context

The rise of social media has caused substantial growth in our day-to-day consumption of
news. Anyone can post anything without being held accountable for fact-checking, which
means social media platforms are not always trusted news sources. This trend has
had a big influence on our global society, as fake news consumption increases. Fake
news has been around for a long time and every technological innovation unleashes
new possibilities. However, recent events have proven to be a dangerous combination
of technological developments and human behavior. A study conducted by Vosoughi,
Roy, and Aral (2018) has proven that false news diffuses significantly farther, faster,
deeper, and more broadly than the truth in all categories of information. It was found
that false news was more novel than true news, which suggests that people were more
likely to share novel information. Since social media provides the possibility for anyone
to share information, the spread of fake news via the platforms increased. Because
of these developments, fake news influenced major political events, like the Brexit
referendum in the U.K. and Donald Trump’s victory in the U.S. presidential election of
2016 (Gelfert 2018). The ‘infodemic’ surrounding COVID-19 is also affecting people by
causing misunderstanding, mistrust, and rumors. Trustworthy and reliable information
becomes harder to find. One example of a spreading rumor was that drinking methanol
or alcohol-based cleaning products would be a cure for the virus. This resulted in at
least 800 deaths and 5,800 hospitalizations (Coleman 2020).
Because of the global issues that arise due to the spread of fake news, classification
methods have been the focus of intense research in the scientific community over the
past years. Both traditional Natural Language Processing approaches and Deep Learn-
ing approaches have proven to be successful methods for the detection of fake news
(Ghosh and Shah 2019). Deep learning methods, such as Long Short Term Memory,
Bidirectional Long Short Term Memory (Chopra, Jain, and Merriman Sholar 2019) and
transformer structures (Jwa et al. 2019) show promising results as accuracy scores on
several classification tasks improved. However, generalizability forms an important
concern for the fake news classification models. There is a considerable difference
between performances from one dataset to another, which indicates that the choice of
dataset determines the performance of the model (Bozarth and Budak 2020).
Recently, a dataset containing real and fake news on COVID-19 was curated for
classification tasks by Patwa et al. (2020). The dataset, containing real news from verified
sources and fake news verified not to be true, consists of 10,700 social media posts
and articles and is benchmarked with four machine learning baselines - Decision Tree,
Logistic Regression, Gradient Boost, and Support Vector Machine. Deep learning meth-
ods have not yet been explored and could outperform the machine learning baselines.
Since generalizability of existing models is a point of concern, training new models
for classification on this dataset, by adopting a deep learning approach, could offer a
solution in the fight against fake news and the ‘infodemic’ surrounding COVID-19. The
goal of this work is to develop a robust fully automated classification algorithm for the
classification of fake news concerning COVID-19, and to evaluate the adaptability of the
model with the use of another dataset containing real and fake news.


1.2 Research Questions

To define the scope of this study, one main research question and three sub-research questions
were formulated.

Research question: How do deep learning models perform compared to traditional machine
learning baselines on the classification of fake news on COVID-19 related social media posts?

Sub-research question 1A: What is the performance of baseline machine learning methods
on the classification of fake news on COVID-19 related social media posts?
Sub-research question 1B: What is the performance of baseline machine learning methods on the
classification of fake news on general news articles?
Sub-research question 2A: What is the performance of deep learning methods on the classification
of fake news on COVID-19 related social media posts?
Sub-research question 2B: What is the performance of deep learning methods on the classification
of fake news on general news articles?
Sub-research question 3: How do the performances differ? How do performances differ when the
trained models encounter unseen data?

1.3 Relevance

This paper seeks to address the gap presented in the existing literature by focusing on an
improved classification algorithm for the detection of fake news on COVID-19 related
social media posts. In their research Patwa et al. (2020) present the performance of
traditional machine learning baselines on a dataset containing COVID-19 related social
media posts. In the conclusion, they state that future work could be targeted towards
collecting more data and exploring deep learning methods. Since model generalizability
for fake news detection forms an important concern (Bozarth and Budak 2020) and this
dataset focuses specifically on a relatively new topic, existing models could show a
decline in performance when applied to this dataset. Therefore, training a model on
this sub-category could build towards a robust evaluation framework for the detection
of COVID-19 related fake news and help in the fight against the infodemic. Furthermore,
this paper dives deeper into generalizability by comparing performances of models
trained on social media posts with models trained on full news articles. The ability
of a model specifically trained on COVID-related news to adapt to full news articles,
and the ability of a model trained on general articles to adapt to COVID-related news
are evaluated. This is done by using both datasets cross-wise as test set for the best
performing models.
The rest of this paper is organized as follows. In section 2, the background is
discussed, defining the scope of the study and evaluating related work on fake news
classification. Section 3 describes the methods, giving an insight into the deep learn-
ing methods that are used. Section 4 shows the experimental setup. The results are
presented in section 5 and discussed in section 6, followed by the conclusion in
section 7 and suggestions for future work.


2 Background

In this section, the larger scientific context of the problem is explained. The research area
is defined and relevant work is described. Finally, the contribution of this research to the
existing literature is provided.

2.1 Fake news

The following subsection provides insight into the broader context of fake news. The
term fake news is defined in the context of this study, the characteristics are specified,
and the consequences fake news has on individuals are explained.

2.1.1 Definition

Before continuing with the classification of fake news, it is important to define the term
“fake news” in the context of this study. Multiple definitions of the term fake news are in
circulation, some more convincing than others. For this research, a definition
expressed by Gelfert (2018, p.108) is used.

“Fake news is the deliberate presentation of (typically) false or misleading claims as
news, where the claims are misleading by design.”

In his definition, he uses the phrase ‘by design’. With this expression, he means
that it is the originator’s individual intention to mislead an audience or manipulate the
public opinion. Therefore, a misleading title to a factual article, or a factual mistake pre-
sented in an article, does not qualify as fake news, since there is no deliberate presentation
of false or deceptive claims.

2.1.2 Characteristics

The human ability to detect deception is limited, as psychology and communication
studies have shown. Over 100 experiments with over 1,000 participants demonstrated
that the ability is only marginally better than chance, with a mean accuracy of only 54%
(Rubin 2010), indicating that the characteristics of fake news are hard to uncover with
the unaided eye.
Scientific research across different disciplines has provided insight into the charac-
teristics of fake news. An investigation into stylistic differences between real and fake
news conducted by Horne and Adalı (2017), shows that fake and real news articles
are notably different. The complexity and style of fake news are closely related to the
complexity and style of satirical news. The differences are also reflected in the title. Fake
news titles use fewer stop-words and nouns while using more proper nouns and verb
phrases. Real news articles are longer, use punctuation more frequently, contain longer
words, more quotes, and more comparisons.
Propagation-based methods for the detection of fake news have proven to be an
even more robust approach to fake news classification than classification based on
writing style. This form of classification is done by investigating how the news spreads
online. It has been proven that fake news spreads differently from true news, forming
different patterns that can be exploited by machine learning/deep learning models
(Monti et al. 2019). However, this method is inefficient for the detection of fake news
at an early stage.


The social context of fake news also differs from that of real news. In their research,
Shu et al. (2020) divide social media context into three major aspects: user profiles,
users posts, and network structures. In their exploratory research of these aspects, they
discovered that fake news pieces are more likely to be created by fake accounts than
real news. The creation time of user accounts for fake news is significantly different
from the creation time of a real user account. Besides that, almost 22% of users involved
in fake news are bots, whereas only 9% were predicted to be bots for real news. When looking
at responses and likes, fake news pieces tend to have fewer replies and more retweets,
and real news pieces have a higher ratio of likes than fake news pieces. Deep learning and
machine learning methods can find meaningful patterns, both apparent and hidden, in the
characteristics of fake and real news.

2.1.3 Consequences

Fake news has been around for a long time, and every technological innovation un-
leashes new possibilities. However, recent events have proven to be a dangerous com-
bination of technological developments and human behavior. A study conducted by
Vosoughi, Roy, and Aral (2018) has proven that false news diffuses significantly farther,
faster, deeper, and more broadly than the truth in all categories of information. It was
found that false news was more novel than true news, which suggests that people
were more likely to share novel information. Contrary to conventional wisdom, robots
accelerated false and real news at the same rate as humans, implying that the accelerated
spread of false news is caused by humans, not robots. A recent study conducted by Ipsos
(2019) confirms the increase in fake news production and consumption. The impact is
widespread, as 86% of the 25,000 respondents in over 25 economies say they have been
exposed to fake news. Among them, nearly nine in ten initially believed the news was
real.
One can argue that fake news only poses a danger to unsophisticated consumers
who are gullible and easily influenced; that those who at first believed the news was
real can eliminate the problem by fact-checking and careful consumption; and that for
those reasons fake news only endangers the previously-mentioned “unsophisti-
cated consumers”. A threat nonetheless. However, in his research into fake news, Levy
describes this view as the naïve view. In his paper, he argues that fake news poses
dangers even for its sophisticated consumers. Beliefs, once acquired, resist retraction.
Research using different paradigms has shown that, even when a claim is retracted,
people may continue to quote the claim in explaining events. Besides that, repeated
negative claims about a certain person might cause the formation of implicit bias against
him/her (Kang 2012). Repetition effects are the consequence of these repeated negative
claims. Repeated messages, even if they are coming from the same source, can influence
the interpretation of information. Repeated exposure to those negative claims also feeds
partisan bias, causing unwillingness to negotiate and dismissing any claims against
their initial beliefs. Fake news influences our cognitive biases and heuristics in such
a way that it causes polarization and distrust in the media, with online social media
working on the effects “like steroids” (Gelfert 2018).

2.2 Related work

Detection of fake news on social media is a relatively new concept that, due to the risks
it poses, has gained a lot of attention in the literature. Over the past few years, a lot
of extensive research has been put into tackling the problem from a machine learning/deep
learning perspective. Existing approaches can be divided into three main categories
based on algorithmic approach: traditional machine learning, deep learning, and ad-
vanced language model-based detection.
Various machine learning approaches have been proposed for the detection of
fake news. The approaches that focus on context, rely on linguistic features. Afroz,
Brennan, and Greenstadt (2012) use machine learning techniques and linguistic features
to detect hoaxes, frauds, and deception in writing styles. Pérez-Rosas et al. (2017) also
classify fake news based on linguistic features. In their research, machine learning based
approaches are proposed for fake news detection. The following set of features were
extracted and trained on a linear SVM: ngrams, punctuation, psycholinguistic features,
readability, and syntax. Weakly supervised approaches are compared in a paper by
Helmstetter and Paulheim (2018). After applying Naïve Bayes, Decision Trees, SVM,
and Random Forest to a very noisy dataset an F1 score of up to 0.9 was reached.
Research has proven that fake news spreads significantly differently from real news
(Vosoughi, Roy, and Aral 2018). Some papers argue that propagation-based approaches
have advantages compared to content-based approaches. Monti et al. (2019) claim that
language independence is one of them. One of the early studies that dives into informa-
tion credibility on social media based on propagation, is a study conducted by Castillo,
Mendoza, and Poblete (2011). In their research, they use features from users’ posting
and reposting behavior, from the text of posts and citations to external sources for the
automatic detection of the credibility of a tweet. A similar study is conducted by Yang
et al. (2012). In their research, they classify false rumors on a popular Chinese social
network based on propagation patterns. Jin et al. (2013) use epidemiological models
for the characterization of information diffusion of real and fake news on social media
platforms. However, the propagation-based approach has its limits, as it is inefficient
for the detection of fake news at an early stage.
The profiles of users can also be used for the detection of fake news. Social context
based approaches look into demographics, social network structure, and user reactions.
Few papers have investigated fake news detection with the use of social context. Shu,
Wang, and Liu (2019) investigated the correlation between user profile and fake news.
Shu, Bernard, and Liu (2019) investigate the role network properties have in the dissem-
ination of fake news.
Several studies explore the advantages of deep learning models over machine
learning models for fake news detection (Choudhary and Arora 2021; Mouratidis,
Nikiforos, and Kermanidis 2021). Wang (2017) shows that a hybrid convolutional neural
network can outperform machine learning models on fake news detection based on
linguistic patterns. Rashkin et al. (2017) also showed promising results with LSTM. The
models that focused on linguistic aspects outperformed machine learning baselines.
Zhou (2017) proposes a bi-directional Gated Recurrent Unit with a self-attentive mecha-
nism, which ranked 1st in the Clickbait Challenge. Hybrid deep learning models that combine
convolutional and recurrent neural networks for fake news classification performed
slightly better than other non-hybrid baseline methods (Nasir, Khan, and Varlamis
2021). Monti et al. (2019) investigate propagation-based approaches based on geometric
deep learning. Ruchansky, Seo, and Liu (2017) investigate response and text with a
recurrent neural network approach to capture patterns in the articles.
State-of-the-art pre-trained language models (e.g. BERT, ALBERT) are getting a great
deal of attention as they have proven to be especially effective for common natural
language tasks, including fake news classification. Jwa et al. (2019) show that BERT
improved the F-score on classification of fake news by 0.14 over older state-of-the-art
models. In their research, Qazi, Khan, and Ali (2020) show an accuracy score increase
of 15% compared to a hybrid CNN approach. A fine-tuned BERT model for fake news
classification achieved an accuracy of 97.02% in a paper presented by Aggarwal et al.
(2020).

2.3 Research contribution

None of the above-mentioned articles analyzes fake news detection specif-
ically on COVID-19 related social media posts. Since model generalizability for fake
news detection forms an important concern (Bozarth and Budak 2020), existing models
could show a decline in performance when applied to this dataset. Furthermore, none
of the existing studies investigate the ability of a model specifically trained on COVID-
related news to adapt to full news articles. At the same time, the ability of a model
trained on general articles to adapt to COVID-related news is evaluated.


3 Methods

In this section, the general approach is described. The different computational algo-
rithms that are used are explained: LSTM, BI-LSTM, BERT, RoBERTa, and ALBERT.

3.1 Recurrent Neural Network

Traditional Neural Networks (Appendix A) use feed-forward signals, which over the
years have proven to be a successful solution to various machine learning problems.
Within feedforward neural networks (FFNN), signals travel in one direction: from
input to output. FFNNs cannot capture sequential information since their success
depends on the assumption that training and test data are independent of each other.
For time-series data, this is not the case. RNNs address these issues. They are networks
with loops in them, allowing information to persist over time. Because of this, RNNs
are well suited for forecasting problems. In practice, RNNs have some downsides. As
the gap between relevant information and the point where it is needed grows, RNNs
are unable to connect the information. Error signals “flowing backward in time” tend to
either blow up or vanish. Because of this, it can be difficult for standard RNNs to solve
problems that require long-term temporal dependencies.

3.1.1 LSTM

Hochreiter and Schmidhuber (1997) introduced a novel recurrent network ar-
chitecture with an appropriate gradient-based learning algorithm: Long Short-Term
Memory (LSTM). An LSTM is a specific version of a Recurrent Neural Network (RNN).
To avoid the scaling effect, the unit of a neural network was redesigned. Every unit
is enriched with ‘gating units’, thus adding several intermediate steps controlling the
information flowing through to the next cell. Figure 1 shows the inner cell structure of
LSTM. The first gate, the “forget gate” (ft ), decides what information is going through
and what information is thrown away. In the next step, the “input gate” (it ) decides
what information is stored in the cell state. Finally, the “output gate” (ot ) decides what
information is sent to the next cell. The compact forms of the equations are as follows:

f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f)    (1)

i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i)    (2)

o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o)    (3)

Where W and U are weight matrices and b bias vector parameters. σ_g represents the
sigmoid function.
The cell state (c_t) and output vector (h_t) are obtained as follows:

c_t = f_t · c_{t−1} + i_t · c̃_t    (4)

h_t = o_t · σ_h(c_t)    (5)

Where σ_h represents the hyperbolic tangent function, c_{t−1} the previous cell state,
and c̃_t represents the cell input activation vector.
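As an illustration, the minimal NumPy sketch below implements a single LSTM cell step following equations (1)–(5); the toy dimensions and random initialization are assumptions for illustration only and do not correspond to the models trained in this study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step following equations (1)-(5).
    W, U, b hold the parameters of the forget (f), input (i),
    output (o), and candidate (g) transformations."""
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])       # forget gate (1)
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])       # input gate (2)
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])       # output gate (3)
    c_tilde = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # cell input activation
    c_t = f_t * c_prev + i_t * c_tilde                           # new cell state (4)
    h_t = o_t * np.tanh(c_t)                                     # output vector (5)
    return h_t, c_t

# Toy dimensions for illustration only
n_in, n_hid = 8, 4
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(n_hid, n_in)) for k in "fiog"}
U = {k: rng.normal(size=(n_hid, n_hid)) for k in "fiog"}
b = {k: np.zeros(n_hid) for k in "fiog"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```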

Figure 1
LSTM memory cell with input (i.e. t), output (i.e. o), and forget gate(i.e. f ). Each gate computes
an activation at timestep t also considering the activation of the memory cell c at timestep t − 1.
The S-like curve represents the application of a differentiable function to a weighted sum. Figure
from Greff et al. (2017).

3.1.2 BI-LSTM

Traditional RNNs only use the previous context, which is a shortcoming. Bidirectional
RNNs (BRNN) can also access future inputs by processing data in both directions
(Graves and Schmidhuber 2005). Combining LSTM with BRNNs gives Bidirectional
LSTM (BI-LSTM). BI-LSTM also processes data in 2 directions; forwards and backward
(Figure 2). They process the data with two hidden layers, after which it is fed forward
to the same output layer. By doing this, the model accesses both past and future input
features for a given time. The past features are accessed via forward states and the future
features are accessed via backward states (Huang et al. 2015). This gives BI-LSTM the
ability to access long-range context in both input directions.
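To illustrate how such a model can be assembled, the sketch below builds a minimal bidirectional LSTM classifier in Keras; the vocabulary size, embedding dimension, and number of units are placeholder assumptions, not the tuned settings reported in section 4.3.3.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

# Placeholder values; the tuned settings used in this study are reported in section 4.3.3
vocab_size, embedding_dim, max_len = 20000, 100, 40

model = Sequential([
    Embedding(vocab_size, embedding_dim, input_length=max_len),
    Bidirectional(LSTM(128)),          # one forward and one backward pass over the sequence
    Dropout(0.2),
    Dense(1, activation="sigmoid"),    # binary real/fake output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```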

3.2 Pre-trained language models

AI developers and researchers put a lot of time and effort into pre-training models
to solve a specific problem. Transfer learning capabilities enable authors to pre-train
a model on a large dataset. The model can then be adapted and fine-tuned so that
it is applicable to other NLP tasks. Building a language model yourself is very time-
consuming and needs a lot of computational resources, so using a pre-trained model
saves a lot of effort. There are several pre-trained language models available, categorized
based on the purpose they serve. Here, the models that are applied in this study will be
explained.

Figure 2
A bidirectional LSTM. Figure from Huang et al. (2015).

3.2.1 BERT

A team at Google introduced a new, simple, and empirically powerful pre-trained
model, called BERT. BERT stands for Bidirectional Encoder Representations from Trans-
formers (Devlin et al. 2018). The architecture is a multilayer bidirectional transformer
based on the basic implementation of the Transformer as described in Appendix B
introduced by Vaswani et al. (2017). It mainly focuses on the multi-headed self-attention
mechanism. Unlike the basic Transformer, BERT applies bidirectional training which
causes a deeper sense of language context and flow. Since BERT generates a language
model, only the encoder mechanism of the transformer is necessary.
BERT extracts word and sentence embedding vectors from text input. In the past,
one-hot encoding, and fixed-length feature embeddings were used to translate the
features into numerical representations. This causes a fixed representation of words
under, for example, Word2Vec. BERT produces dynamic word representation informed
by the words around them. BERT uses WordPiece embeddings to represent the text (Wu
et al.). WordPiece splits words into pieces. Here’s an example:

• Sentence: Jet makers feud over seat width with big orders at stake
• Wordpieces: _J et _makers _fe ud _over _seat _width _with _big _orders _at _stake

For example, “Jet” is broken into “_J” and “et”, the “_” character indicating the start
of a word. The sub-word tokens retain the contextual meaning of the original word.
Besides token embeddings, segment embeddings are taken into account. The segment
embeddings are unique for each input sentence. Position embeddings represent the
position of each WordPiece. The input representation of a token is constructed by
summing the token, segment, and position embedding. The visualization can be seen in
figure 3.
In the framework, there are two steps: pre-training and finetuning. In the pre-training
phase, the model is trained on unlabeled data by performing several pre-training tasks.
One of the training strategies is Masked Language Modeling (MLM). 15% of the words in a sequence are
removed before they are fed into BERT. A [MASK] token fills up the removed words and
the model tries to predict the original word under the masked word based on the
non-masked words in the sequence. The other training technique applied in the pre-
training phase is Next Sentence Prediction (NSP). In this process, the model receives
two sentences as input and learns to predict if sentence B is subsequent to sentence A
in the original document. The execution of this task helps the model to understand the
relationship between sentences better. A sentence embedding and a positional embed-
ding are also added. MLM and NSP are trained together to minimize the combined loss.
To fine-tune BERT all the parameters are initialized and fine-tuned. Devlin et al. (2018)
used the Transformer self-attention mechanism (Appendix B).

Figure 3
BERT input presentation. [SEP] marks the separation token that separates the sentences. [CLS]
marks the start of a sequence. Figure from Devlin et al. (2018).

3.2.2 RoBERTa
In their study, Liu et al. (2019) found that BERT was significantly undertrained. Thus
RoBERTa was introduced. RoBERTa is an optimized robust version of BERT. The model
improved performance on several NLP tasks in comparison with BERT. Like BERT, the
model uses the masking strategy as described in section 3.2.1. Within the model, BERT’s
key hyperparameters are modified, including the removal of next sentence prediction.
The learning rates are also modified and training is done in much larger mini-batches.
Because of this, masked language modeling and downstream task performance
improved (Liu et al. 2019).

3.2.3 ALBERT
ALBERT, which stands for A Lite BERT, was introduced by Lan et al. (2020). The model
was introduced to address the issues caused by GPU/TPU memory limitations that
come along with increasing model sizes. Two reduction techniques are applied to the
parameters to reduce the computational load. The first key difference between the two
models is that the parameters of the word embeddings are factorized with ALBERT.
BERT’s large vocabulary embedding matrix is decomposed into two smaller matrices.
Because of this, it’s easier to grow the hidden size without increasing the parameters.
The second difference is that layer parameters are shared for every similar subsegment.
Because the parameters are shared, the number of parameters is reduced. This means
that, for example, multi-head self-attention subsegments share parameters across all
twelve layers (Lan et al. 2020).


4 Experimental Setup

This section provides a detailed description of the datasets that are used and the
experimental procedure. Preparation and preprocessing of the data, finetuning of the
model, evaluation metrics, and actual implementation are reported.

4.1 Dataset description

For this study, two different datasets are analyzed. The first dataset used in this
study consists of social media posts containing real and fake news on COVID-19. The
second dataset is a dataset that contains general fake and real news articles.

4.1.1 COVID-19 dataset


The COVID-19 dataset was curated and manually annotated by the Association for
the Advancement of Artificial Intelligence (AAAI). Each observation is labeled, either
real or fake. Fake news is fact-checked by sources like Politifact, NewsChecker, and
Boomlive (Patwa et al. 2020). For obtaining real tweets, verified Twitter handles were
used. The dataset contains the id, the text of the tweet, and the corresponding label
(real/fake). Table 1 gives some examples of real and fake news.

Table 1
Fake and real news examples

Label   Source           Text
Fake    Twitter          All hotels, restaurants, pubs etc will be closed till 15th Oct 2020 as per
                         tourism minister of India.
Real    Twitter (WHO)    Almost 200 vaccines for COVID19 are currently in clinical and pre-clinical
                         testing. The history of vaccine developments tells us that some will fail and
                         some will succeed-@DrTedros #UNGA #UN75

The dataset contains a total of 10,700 social media posts and is predefined in a
train (60%), validation (20%), and test (20%) split. The class-wise distribution can be
seen in Table 2. The classes are roughly balanced, as 52% of the samples consist of real
news and 48% consist of fake news.

Table 2
Class-wise distribution across splits

Split        Real    Fake    Total
Training     3360    3060    6420
Validation   1120    1020    2140
Test         1120    1020    2140

Table 3 shows that the social media posts containing real news generally are longer,
containing more words and thus more characters. The average social media post con-
taining real news contains 47% more words. The total dataset contains 38,720 unique
words. Real news has a slightly bigger vocabulary than fake news. The Wordclouds
depicted in Figure 4 show a visual representation of the textual data. The font size shows
the importance of occurring words in the dataset. As the wordclouds show, there is an
overlap of important words. This makes sense since the dataset specifically focuses on
COVID-19 related posts. Analyzing the text on a token level, after removing stopwords,
with the use of a unigram and a bigram shows some differences between real and fake
news.

Table 3
Numeric Features COVID-19 dataset

                      Real      Fake      Total
Unique words          23,282    20,150    38,720
Average words         31.97     21.65     27.05
Average characters    217.12    143.21    181.89

Top 10 most frequent tokens:


• Real: Covid19, case, new, test, state, number, death, total, people, confirmed
• Fake: Coronavirus, covid19, people, virus, claim, trump, vaccine, say, new, case
Top 10 most frequent bigrams:
• Real: (new, case), (confirmed, case), (case, covid19), (total, number), (active, case),
(covid19, case), (new, zealand), (state, reported), (managed isolation), (update,
published)
• Fake: (novel, coronavirus), (donald, trump), (coronavirus, pandemic), (new, coro-
navirus), (bill, gate), (covid19, case), (video, show), (cure, covid19), (president,
trump), (covid19, pandemic)
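A minimal sketch of how such token and bigram frequency lists can be produced is shown below, using NLTK stopword removal and simple counting; the toy example posts and the exact tokenization choices are illustrative assumptions, not the exact procedure of this study.

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.util import bigrams

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)
stop_words = set(stopwords.words("english"))

def top_terms(texts, n=10):
    """Return the n most frequent tokens and bigrams after stopword removal."""
    token_counts, bigram_counts = Counter(), Counter()
    for text in texts:
        words = [w.lower() for w in nltk.word_tokenize(text)
                 if w.isalpha() and w.lower() not in stop_words]
        token_counts.update(words)
        bigram_counts.update(bigrams(words))
    return token_counts.most_common(n), bigram_counts.most_common(n)

# Example usage on a hypothetical list of fake-news posts
fake_posts = ["Drinking bleach cures the novel coronavirus",
              "Bill Gates created the novel coronavirus"]
print(top_terms(fake_posts, n=5))
```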

Figure 4
Word clouds COVID-19 news for fake news (A), real news (B), and all news articles (C)

4.1.2 General news dataset


The second dataset is a dataset that is derived from a fake news detection task hosted
by Kaggle, the Fake News Detection Challenge KDD 2020. The task was hosted for
the Second International Truefact Workshop: Make a Credible Web for tomorrow. The
dataset contains the id, the text of the news article, and the corresponding label (1:
fake, 0: true). The dataset consists of 4986 articles, 2014 containing fake news and 2971
containing real news.


Since this dataset contains news articles instead of social media posts, the average
length of each article is longer. The number of characters and words is about the same
for real and fake news articles. Real news does have a bigger vocabulary than fake news.
Table 4 shows an overview of the numeric features for the dataset. The word clouds
depicted in Figure 5 show a visual representation of the textual data. The unigrams
show a great deal of overlap between real and fake news. The bigrams already show
some differences, as the news seems to focus more on certain individuals.

Table 4
Numeric features general news dataset

                      Real       Fake       Total
Unique words          136,457    101,662    189,677
Average words         661.41     662.87     661.87
Average characters    3,961      3,902      3,925

Top 10 most frequent tokens:


• Real: said, one, time, year, new, also, show, people, first like
• Fake: time, year, said, one, new, also, like, film, first, people
Top 10 most frequent bigrams:
• Real: (new, york), (united, state), (los, angeles), (social, medium), (getty, image),
(first, time), (york, city), (last, year), (prince, harry), (donald, trump)
• Fake: (new, york), (los, angeles), (brad, pitt), (united, state), (kim, kardashian),
(meghan, markle), (source, told), (angelina, jolie), (prince, harry), (social, medium)

Figure 5
Word clouds general news for fake news (A), real news (B), and all news articles (C)

4.2 Data preprocessing and preparation

The dataset that was provided by Patwa et al. (2020) was already free of miss-
ing data and duplicates. The label variables in the dataset were “real” and
“fake” and needed to be transformed to be fed into the models. Following Chollet
(2016), the labels are cast as integers and transformed to arrays of 16-bit integer
observations. The dataset containing general fake news articles had one mislabeled
observation, as it was labeled "label". This observation was removed from the dataset.


4.2.1 LSTM – BI-LSTM


Since raw textual data cannot go straight to fitting a machine learning or deep learning
model, the data first needs to be cleaned. A function that cleans the textual data is ap-
plied to each entry. The whole text is made lowercase, text in square brackets is removed,
links are removed, the text is stripped from punctuation, and words containing numbers
are removed. With the help of The Natural Language Toolkit (NLTK) stopwords are
removed from the corpus. Removing the stop words reduces the noise of the textual
data. Discarding the non-discriminative words helps the model reach more accurate
results. By doing this, the feature space of the classifiers will be reduced (Silva and
Ribeiro 2003).
After the noise of the data is reduced, the text is normalized with the help of
stemming. Stemming reduces words to their word stems. The words are run through a
series of conditionals. Based on the conditionals, the algorithm determines how to cut each
word down. For example, the words “walk”, “walking”, and “walked” are all reduced to the
same stem: “walk”. For this research, NLTK’s Snowball stemmer is used. The stemmer
reduces the size of the dictionary and thus the input dimensions. Stemming is
not done in combination with GloVe word embeddings, since the stemmed words may
not exist in the GloVe corpus and would then receive all-zero vectors. For fastText and Word2Vec
the words are stemmed.

Text before cleaning: “ The CDC currently reports 99031 deaths. In general the
discrepancies in death counts between different sources are small and explicable. The
death toll stands at roughly 100000 people today.”.
Text after cleaning: “cdc current report death general discrep death count differ sourc
small explic death toll stand rough peopl today”.
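A minimal sketch of such a cleaning and stemming function is given below; the exact regular expressions used in this study may differ slightly, so this is an illustrative approximation of the steps described above.

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def clean_text(text):
    """Lowercase, strip brackets/links/punctuation/digit-words, remove stopwords, stem."""
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)                 # text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # links
    text = re.sub(r"\w*\d\w*", "", text)                # words containing numbers
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [stemmer.stem(w) for w in text.split() if w not in stop_words]
    return " ".join(tokens)

print(clean_text("The CDC currently reports 99031 deaths."))  # -> "cdc current report death"
```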

The next step in preprocessing the text was tokenizing the input sequences with
the help of Keras Tokenizer. By tokenizing the input sequences, the sentences are split
into smaller units, in this case the individual words in each sentence. Since the text
is cleaned and punctuation marks and numbers were removed, each token is a word.
Keras’ Tokenizer first creates vocabulary indexes based on word frequency, so every
word gets a unique integer value starting from 1, 0 is reserved for padding. Then,
each word in the text is replaced with its corresponding number from the word index
dictionary. Since the models can only take inputs of the same length and dimensions, the
sequences are padded to a maximum length. To define the length of the input sequences,
the number of words in the tweets is evaluated. The distribution of the number of words
for each observation for both datasets is visualized in Figure 6. Since 8532 out of 8560
tweets have fewer than 40 words, the sequences are pre-padded to a maximum length
of 40 words for the COVID-19 dataset. Tweets that are longer are truncated, and shorter
tweets are padded with zeroes. 4413 out of 4986 news articles are shorter than 600 words,
therefore the sequences are pre-padded to a maximum length of 600 words.
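The sketch below illustrates this tokenization and pre-padding step with Keras; the vocabulary limit is an illustrative assumption, while the maximum length of 40 follows the choice described above for the COVID-19 tweets.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Toy cleaned posts standing in for the training texts
train_texts = ["cdc current report death",
               "hotel restaurant pub close tourism minister"]

tokenizer = Tokenizer(num_words=20000)   # vocabulary limit is an illustrative assumption
tokenizer.fit_on_texts(train_texts)      # builds the word index by frequency (0 reserved for padding)

sequences = tokenizer.texts_to_sequences(train_texts)
padded = pad_sequences(sequences, maxlen=40, padding="pre", truncating="pre")
print(padded.shape)  # (number of posts, 40)
```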
The sequential models were trained using three different types of word embeddings
to compare performances: Word2Vec, GloVe, and fastText. Word embeddings help the
model understand relationships between words. Representations of texts are learned in
an n-dimensional space. Words that have a similar meaning, have a similar representa-
tion. They are closely placed in a vector space.
Figure 6
Input length distribution of the COVID-19 data (A) and of the general news articles (B)

Word2Vec was developed by Mikolov et al. (2013) at Google. The method uses a
two-layer neural network to create the word embeddings. It turns the text into a numerical
form that deep learning models understand.
Continuous bag of words (CBOW) and Skip-gram model. Firstly, by treating the entire
context as one observation, CBOW smoothes over the distributional information. It
predicts a target word based on the surrounding words. After that, Skip-gram does
the inverse of CBOW and predicts the context based on the target words. The algorithm
looks at a window of words (5 in this study) and aims to learn context and meaning for
the words based on the provided text corpus. The Gensim library is used to implement
the Word2Vec word vectors. The model is trained on the text corpus of the datasets used
in this study.
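A minimal Gensim sketch of training such Word2Vec vectors on a corpus is shown below; the vector size of 100 and the toy corpus are illustrative assumptions, while the window of 5 follows the description above (the call uses the Gensim 4.x API).

```python
from gensim.models import Word2Vec

# Each document is a list of (cleaned) tokens
corpus = [["cdc", "current", "report", "death"],
          ["hotel", "restaurant", "pub", "close", "tourism", "minister"]]

# sg=1 selects the skip-gram objective; vector_size is an illustrative assumption
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1, workers=4)

vector = w2v.wv["death"]                      # 100-dimensional embedding for an in-corpus word
similar = w2v.wv.most_similar("death", topn=3)  # nearest neighbours in the embedding space
```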
Stanford’s GloVe embedding (Pennington, Socher, and Manning 2014), which
stands for “Global Vectors”, has quite a bit of synergy with Word2Vec. However, un-
like Word2Vec, GloVe focuses on global statistics instead of local statistics to train the
word vectors. GloVe is a count-based model. The model stores how often each word
is seen in a context. This information is stored in a co-occurrence matrix. The matrix is
factorized in order to yield a lower-dimensional matrix. This is done by normalizing the
counts and log-smoothing them. The model looks at ratios of co-occurrence probabilities
and predicts surrounding words by maximizing the probability of a context word by
performing a dynamic logistic regression. For this study, pre-trained word vectors are
used. The vectors are trained on Wikipedia data with a 400,000 word vocabulary and
are 100 dimensional.
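A common way to plug such pre-trained GloVe vectors into a Keras embedding layer is sketched below; the file name glove.6B.100d.txt is a hypothetical local path, and the small tokenizer is only there to make the example self-contained.

```python
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.text import Tokenizer

embedding_dim = 100
glove_path = "glove.6B.100d.txt"   # hypothetical local path to the pre-trained vectors

# Load the pre-trained vectors into a dictionary
glove_index = {}
with open(glove_path, encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        glove_index[word] = np.asarray(values, dtype="float32")

# Build the embedding matrix aligned with the Keras tokenizer's word index
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(["cdc current report death"])   # toy corpus for illustration
word_index = tokenizer.word_index

embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    if word in glove_index:
        embedding_matrix[i] = glove_index[word]   # words missing from GloVe keep all-zero vectors

embedding_layer = Embedding(len(word_index) + 1, embedding_dim,
                            weights=[embedding_matrix], trainable=False)
```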
fastText is an extension of the Word2Vec model (Bojanowski et al. 2017). Where Word2Vec
learns the vectors for words directly, fastText represents each word as a bag of character
n-grams. So, the word artificial, with n=3, is represented as <ar, art, rti, tif, ifi, fic, ici,
ial, al> by fastText. After representing the words using n-grams, the above-mentioned
skip-gram is trained to learn the embeddings. fastText can thus be seen as operating on a bag
of character n-grams with a sliding window of size n. This helps capture the meaning of shorter
words and works well with rare words. Unlike Word2Vec and GloVe, the model can get
embeddings of words that weren’t seen during training on the dictionary by breaking
words down into n-grams. For this research, the pre-trained word vectors, trained on
Wikipedia using fastText, are used. The vectors were trained using CBOW in dimension
300.
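The key practical difference, obtaining vectors for words unseen during training via their character n-grams, can be illustrated with Gensim's FastText implementation; the tiny corpus and hyperparameters below are purely illustrative (Gensim 4.x API), not the pre-trained Wikipedia vectors used in this study.

```python
from gensim.models import FastText

corpus = [["coronavirus", "vaccine", "trial"],
          ["vaccine", "approved", "today"]]

ft = FastText(sentences=corpus, vector_size=100, window=5, min_count=1, min_n=3, max_n=6)

# "vaccines" never occurs in the corpus, but its character n-grams overlap with "vaccine",
# so fastText can still produce an embedding for it.
oov_vector = ft.wv["vaccines"]
print(oov_vector.shape)  # (100,)
```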

4.2.2 Pre-trained language models


BERT requires very little preprocessing with Huggingface pre-trained models. In order
to preprocess the data for BERT, the text is tokenized using the AutoTokenizer of the
pre-trained models “bert-large-uncased” and "bert-large-cased" from Huggingface. The
models, which are trained on lower-cased and cased English text respectively, have 24 layers, a
hidden size of 1024, and 16 attention heads. They are trained on Wikipedia (2,500M words) and Book
Corpus (800M words), a dataset containing over 10,000 books. The tokenizer transforms
the input data into a specific format with the special token to indicate the start ([CLS])
and the separation of sentences ([SEP]). For example, the sentence “Encode me!” is
transformed in the following way by the tokenizer:

Input: Encode me!


Encode: [101, 4372, 16044, 2033, 999, 102]
Decode: [CLS] encode me! [SEP]

The tokenizer is applied to all input sequences to transform the data into the
specific format expected by BERT. BERT requires input ids and attention masks, which
are produced alongside the tokenized sentences. The input ids map each input token to its
index number in the vocabulary. The attention masks indicate which positions contain actual
tokens and which contain padding, so that the model only attends to the real tokens.
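A minimal sketch of this tokenization step with the Hugging Face AutoTokenizer is given below; the maximum length of 128 is an illustrative assumption rather than the setting used in this study.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")

encoded = tokenizer(
    ["Encode me!",
     "Almost 200 vaccines for COVID19 are currently in clinical testing."],
    padding="max_length",      # pad every sequence to the same length
    truncation=True,
    max_length=128,            # illustrative maximum length
    return_tensors="np",
)

print(encoded["input_ids"][0][:8])       # ids for [CLS] encode me ! [SEP] plus padding
print(encoded["attention_mask"][0][:8])  # 1 for real tokens, 0 for padding
```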

4.3 Experimental Procedure

The experimental procedure includes the description of the task, the algorithms, and
choosing and fine-tuning the parameters. The actual implementation and evaluation
criteria are also described.

4.3.1 General procedure


Based on the sub-questions formulated in the introduction, this research can be divided
into three parts. The first part explores the performance of baseline machine learning
methods on both COVID-19 related social media posts and general news articles. Since
Patwa et al. (2020) already explored machine learning baselines in their research, their
methods will be identically applied to the dataset containing general news articles,
to get the baselines for that dataset. The second part dives into the performance of
deep learning methods. The experimental procedure of the deep learning methods will
be described in the subsections below. Finally, to evaluate the generalizability of the
models, both datasets will be used crosswise as test sets for the trained models.

4.3.2 Algorithms
Since research questions one and two involve the detection of fake and real news,
this paper investigates a binary classification problem with two output classes. This
indicates the need for a binary deep learning model. Therefore different LSTM models,
BERT, RoBERTa, and ALBERT are trained on the data.

• A simple LSTM, also called a vanilla LSTM (Wu et al. 2018) as introduced by
Graves and Schmidhuber (2005). The vanilla LSTM is an LSTM model with a
single hidden layer of LSTM units, making it the simplest version of an LSTM
model.
• Bidirectional LSTM with one hidden layer moving the information forwards, and
one hidden layer moving the input in the reverse direction.
• BERT has proven to be the current state-of-the-art on natural language processing
tasks. “BERTlarge”, which is used in this study, has 24 encoders stacked on top of
each other, 16 attention heads, and a hidden size of 1024 (Devlin et al. 2018). Both
BERT cased and BERT uncased are tested. In BERT uncased, the text is made lowercase
before WordPiece tokenization. In BERT cased no changes are made to
the text.
• A RoBERTa model, that uses the BERT architecture (Liu et al. 2019).
• The base ALBERT model is trained on the data. Like BERT, the model has 24
encoders, 16 attention heads, and a hidden size of 1024. Unlike BERT, the model has
17 million parameters instead of 336 million (Lan et al. 2020).

To put the results into perspective, performances are compared with the machine
learning baselines as presented by Patwa et al. (2020). In their research, they compared
Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine. The
best performance of 93.46% F1-score was obtained with Support Vector Machine.

4.3.3 Finetuning
(BI-)LSTM - Optimization of the hyperparameters is a crucial step in training a (BI-)
LSTM. To find the optimal model, the number of neurons in each layer, the dropout rate,
and the number of epochs were finetuned. After testing different batch sizes, the batch
size for the models was set to 16 based on the optimization tests. The Adam optimization algorithm
is used to update the network’s weights (Kingma and Ba 2015) and binary cross-entropy
is used as loss function. A dynamic learning rate was used during training, meaning
that when a metric stops improving, the learning rate of the model is reduced. Models
often benefit from reducing the learning rate
(Liu et al. 2019). To optimize the performance for each of the models, the number of
neurons were optimized. For each model 16, 32, 64, 128, 256, and 512 neurons were
tested on the training and validation data. For this research, only LSTM models with
a single hidden layer were tested. The BI-LSTM was also tested with 16, 32, 64, 128,
256 and 512 neurons. To prevent the models from overfitting, dropout is added. For
each model, dropout rates of 15%, 20%, and 25% were tested. The number of epochs refers
to the number of complete passes through the training dataset. Too many epochs will
lead to overfitting. 5, 7, 9, 12, and 15 epochs were tested. The chosen hyperparameters
for all models and datasets can be seen in Table 5. The same set of hyperparameters
was applied to each word embedding technique. The process of setting the
hyperparameters is described in Appendices C through F.
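The sketch below shows how these training choices (Adam, binary cross-entropy, batch size 16, and a learning rate that is reduced when a monitored metric stops improving) can be wired together in Keras; the model and the data arrays are assumed to come from the earlier sketches, and the reduction factor and patience are illustrative assumptions.

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

# model, X_train, y_train, X_val, y_val are assumed to come from the earlier
# model-building and preprocessing sketches; they are not defined here.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=1, min_lr=1e-6)

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=16,        # batch size chosen after the optimization tests
    epochs=7,             # one of the tested values (5, 7, 9, 12, 15)
    callbacks=[reduce_lr],
)
```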
Pre-trained language models - To optimize the performance of the pre-trained
language models, the models were tested with a dropout rate of 0.2 and without
dropout. In addition, the models have been tested with and without an additional
dense layer on top of the model. Finally, to optimize the performance of the model,
different numbers of epochs were tested, namely 3, 4, and 5 epochs. The models were
eventually trained with no dropout, no dense layer, and with 5 epochs.
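A hedged sketch of how such a pre-trained model can be fine-tuned for this binary task with the Transformers and Keras APIs is given below; the checkpoint name, learning rate, and use of TFAutoModelForSequenceClassification are assumptions for illustration, not the exact configuration of this study.

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Two output classes: real vs. fake; "roberta-base" is an illustrative checkpoint choice
model = TFAutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),   # illustrative learning rate
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# encoded_train / encoded_val are assumed to be tokenizer outputs as in section 4.2.2,
# and y_train / y_val the integer labels; they are not defined here.
model.fit(
    dict(encoded_train), y_train,
    validation_data=(dict(encoded_val), y_val),
    epochs=5,   # the number of epochs eventually used for the pre-trained models
)
```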

Table 5
Hyperparameters set for LSTM and BI-LSTM models

                 COVID-19 dataset        General news dataset
                 LSTM      BI-LSTM       LSTM      BI-LSTM
Dropout          0.20      0.20          0.20      0.20
Epochs           7         9             7         9
Neurons          256       128           64        64

4.3.4 Evaluation
To measure the performance of the different models in an objective manner, a stratified
split was applied to equally distribute the classes among the train and test set for the general
news dataset. 20% of the test set is used for validation. The COVID-19 dataset was
already predefined in train-validation-test. For comparison reasons, F1-score is used
as the evaluation metric. Patwa et al. (2020) also used it as the evaluation metric in
their research. The score combines the precision and recall of the model: it is
the harmonic mean of the model’s precision and recall. Recall refers to the proportion
of actual positives that are correctly predicted as positive. Precision refers to the percentage
of positive predictions that are actually positive. The formula of the F1-score is the
following:

Precision = TP / (TP + FP)    (6)

Recall = TP / (TP + FN)    (7)

F1 = 2 · (Precision · Recall) / (Precision + Recall)    (8)

Where TP indicates the number of true positives classified by the model, FN the number
of false negatives classified by the model, and FP indicates the false positives classified
by the model.
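These metrics can be computed directly with scikit-learn, as in the minimal sketch below; the label vectors shown are toy values used only to illustrate the calls.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy ground-truth and predicted labels (1 = fake, 0 = real)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
```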

4.3.5 Software
The programming language that is used is Python, version 3.7.10. The code was
executed in Google Colaboratory (Colab). Colab is a free Jupyter notebook en-
vironment that runs entirely in the cloud. Colab allows you to use the computing
power of Google’s servers. The LSTM models were implemented with the use of Keras
(Chollet 2016). Keras runs on Tensorflow as backend. The pre-trained language models are
implemented with the use of the Transformers library (Wolf et al. 2019). Transformers is backed by
Tensorflow and PyTorch. Below is an overview of other libraries and packages that were
used in this research:
• Pandas (McKinney 2010)
• Numpy (Harris et al. 2020)
• NLTK (Bird 2009)
• Matplotlib (Hunter 2007)
• Sklearn (Pedregosa et al. 2011)


• Seaborn (Waskom et al. 2017)


• Gensim (Řehůřek and Sojka 2010)
• Wordcloud (Oesper et al. 2011)


5 Results

This chapter contains the results of the optimized models, for
both the machine learning methods and the deep learning methods. The metrics are
provided and summarized together with the performance of the baseline methods. In addition,
metrics of generalizability are provided by using both datasets crosswise as test sets for
the best performing models, for further evaluative purposes.

5.1 Machine Learning models

As mentioned before, the performance of the baseline methods as presented by Patwa
et al. (2020) will be used to evaluate the performance of the deep learning models.
Decision Tree, Logistic Regression, Gradient Boost, and Support Vector Machine were
all used to benchmark the dataset. In their study, SVM was the best performing model
(93.46% F1-score). An overview of the performance of all the baseline machine learning models
on the COVID-19 dataset is summarized in Appendix G. Table 6 summarizes the scores
of the machine learning models on the dataset containing regular news articles.

Table 6
Performance of machine learning baselines on regular news dataset
Model Accuracy Precision Recall F1-score
Decision Tree 67.76 67.96 67.76 67.85
Logistic regression 76.10 80.94 76.10 77.24
SVM 77.31 78.65 77.31 77.69
Gradient boost 74.50 79.12 74.50 75.64

5.2 Deep Learning models

To classify whether a post contains real or fake news, traditional language models, such
as LSTM and BI-LSTM, were trained with different embedding techniques (fastText,
Word2Vec, GloVe), giving a total of six language models. Several transformer-
based models were also tested: BERT-cased, BERT-uncased, RoBERTa, and ALBERT.
Performances are compared based on the F1-score column.

Table 7
Performance of pre-trained language models on COVID-19 data
Model Accuracy Precision Recall F1-score
BERT uncased 96.82 96.55 97.41 96.98
BERT cased 96.40 95.79 97.41 96.59
RoBERTa 97.01 95.60 98.84 97.19
ALBERT 95.56 95.00 96.91 95.79

On the COVID-19 data, the pre-trained transformer models show a significant
increase in performance over both the machine learning baselines and the traditional
language models. From Table 7 it can be observed that RoBERTa is the best performing
model with an F1-score of 97.19%.

Table 8
Best performing traditional language models on COVID-19 data
Model              Accuracy   Precision   Recall   F1-score
LSTM - GloVe       93.60      94.37       92.06    93.20
BI-LSTM - GloVe    93.92      94.68       92.45    93.55

Looking at the traditional language models, only the
BI-LSTM trained with GloVe embeddings outperforms the best performing machine
learning baseline with an F1-score of 93.55%, as can be seen in Table 8. From all em-
bedding techniques, GloVe embedding showed the best results, fastText showed the
worst results. A summary of the scores of all 10 models can be found in Appendix H.
Appendix I visualizes the improvement of the best performing deep learning model
compared to the best performing machine learning model with confusion matrices.
On the dataset containing general news articles, all deep learning models outperformed
the best performing machine learning model based on the F1-score. On this dataset, the
performance of the pre-trained transformer models did not diverge much from the
performance of the traditional language models. The best performing model was again
RoBERTa, with an F1-score of 81.96%. The performance of all pre-trained language
models is shown in table 9. An overview of the performance of all models can be found
in Appendix J. Appendix K visualizes the improvement of the best performing deep
learning model over the best performing machine learning model with confusion
matrices on the general news data.

Table 9
Performance of pre-trained language models on general news data
Model Accuracy Precision Recall F1-score
BERT uncased 76.05 75.14 89.41 81.66
BERT cased 75.05 77.55 81.85 79.64
RoBERTa 77.56 78.67 85.55 81.96
ALBERT 74.55 75.18 85.55 80.03

Of all embedding techniques, Word2Vec performed best on this dataset, even
outperforming most of the pre-trained transformer models. The BI-LSTM trained with
Word2Vec embeddings reached an F1-score of 81.72%, making it the second-best
performing model on this dataset. Table 10 shows the scores of the best performing
traditional language models on the dataset containing general news.

Table 10
Best performing traditional language models on general news data
Model Accuracy Precision Recall F1-score
LSTM - Word2Vec 75.14 73.51 91.12 81.37
BI-LSTM - Word2Vec 77.15 78.06 85.73 81.72

To evaluate the generalizability of the models, both datasets were used crosswise as
test sets for the best performing models. This means that the entire COVID-19 dataset is
used as test set for the RoBERTa model trained on the regular news dataset. At the same
time, the entire general news dataset is used as test set for the RoBERTa model trained
on the COVID-19 data. The same is done for the SVM trained on each dataset. Table 11
shows the crosswise performances of the models on both datasets.

Table 11
Performance of the best performing models on unseen data from another dataset
Model Accuracy Precision Recall F1-score
COVID-19 SVM 47.63 54.06 47.63 48.14
COVID-19 RoBERTa 59.76 59.79 99.36 74.65
General news SVM 52.60 70.13 52.60 57.63
General news RoBERTa 68.84 67.03 68.17 67.60

All models show a significant decline in performance when they encounter unseen
data from the other dataset. The confusion matrices in Figure 7 describe the performance
of the algorithms on both datasets.

Figure 7
SVM: Confusion matrix on COVID-19 data when using regular news as test set (A), Confusion
matrix on regular news data when using COVID-19 data as test set (B)
RoBERTa: Confusion matrix on COVID-19 data when using regular news as test set (C),
Confusion matrix on regular news data when using COVID-19 data as test set (D)
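
The crosswise evaluation described above amounts to scoring a fitted model on the entire other dataset. A minimal sketch of how such a report could be produced is given below; the prediction function and the metric averaging are assumptions for illustration, not the exact evaluation code of this study.

```python
# Sketch of the crosswise evaluation: a model fitted on one dataset is scored on
# the entire other dataset. `predict_fn` stands for the fitted SVM pipeline or a
# wrapper around the fine-tuned RoBERTa model; labels are assumed to be strings.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def cross_dataset_report(predict_fn, texts, labels):
    """Evaluate a fitted model on unseen data from the other dataset."""
    predictions = predict_fn(texts)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "precision": precision_score(labels, predictions, average="weighted"),
        "recall": recall_score(labels, predictions, average="weighted"),
        "f1": f1_score(labels, predictions, average="weighted"),
        "confusion_matrix": confusion_matrix(labels, predictions),
    }

# Example: the COVID-trained model scored on the full general news dataset.
# report = cross_dataset_report(covid_model.predict, news_texts, news_labels)
```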


6 Discussion

The main goal of this study is to develop an improved automated classification algo-
rithm for COVID-19 related fake news in social media posts using deep learning
methods. Prior research found that deep learning methods outperformed machine
learning methods on similar fake news classification tasks. This study explores deep
learning methods such as LSTM, BI-LSTM, and pre-trained transformer models in an
attempt to outperform the machine learning methods as presented by Patwa et al.
(2020). The adaptability of the models is evaluated with the help of an additional
dataset containing fake and real news articles. All above-mentioned models are fitted to
both datasets, and the adaptability of the best performing model for each dataset is
tested by using the other dataset as its test set.

6.1 Model performances

The first sub-question explores the performance of machine learning models on the
COVID-19 dataset and the regular news dataset. Gravanis et al. (2019) compared
different models for fake news classification based on content features and found SVM
to be the model best capable of classifying fake news. The results of Patwa et al. (2020),
who compared different machine learning methods for the classification of COVID-19
related fake news, were in line with those of Gravanis et al. (2019): SVM was the best
performing model, reaching an F1-score of 93.46%. This setup was replicated in the
present study by applying the same machine learning models to the dataset containing
general news articles. The best performing model on that dataset was also SVM, with an
F1-score of 77.69%. When the performances on both datasets are compared, the
predictive quality shows a clear decline, indicating that the characteristics of the general
news dataset make the fake news classification task harder.
To improve performance on the fake news classification task, the performance of
more complex deep learning methods is evaluated in sub-question 2. The results were
in line with expectations, as the pre-trained language models were superior in terms of
predicting fake news, which corresponds with other studies (Aggarwal et al. 2020;
Schütz et al. 2021; Liu, Liu, and Ren). RoBERTa, the robustly optimized BERT pre-training
approach, was the best performing pre-trained language model, reaching an F1-score of
97.19% on the COVID-19 dataset and 81.96% on the regular news dataset. Two different
word embedding techniques produced the best traditional language models: GloVe for
the COVID-19 dataset and Word2Vec for the general news dataset. GloVe outperforming
the Word2Vec embedding is not a surprising result, as pre-trained embeddings are
encouraged when the training set is small (Gravanis et al. 2019). Word2Vec
outperforming the pre-trained embedding techniques on the general news data could be
the result of that dataset's bigger corpus. What is striking is the large difference between
precision and recall for the deep learning models trained on the regular news dataset.
Imbalance in the output classes could bias the classifier towards the majority class.
Future research could address the imbalance in the dataset by applying oversampling
techniques such as SMOTE, which has shown promising results, or undersampling
techniques (Nik 2020).
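
As an illustration of this suggestion, SMOTE oversampling could be combined with a text classifier through the imbalanced-learn library as sketched below; this was not part of the experiments reported in this study, and the classifier and features are assumptions.

```python
# Sketch of addressing class imbalance with SMOTE oversampling on TF-IDF features
# (imbalanced-learn). The sampler is applied only during fitting, so the test data
# remain untouched. Not part of the experiments reported in this study.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

balanced_pipeline = ImbPipeline([
    ("tfidf", TfidfVectorizer()),
    ("smote", SMOTE(random_state=42)),   # synthesizes minority-class samples
    ("svm", LinearSVC()),
])
# balanced_pipeline.fit(train_texts, train_labels)
# predictions = balanced_pipeline.predict(test_texts)
```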


6.2 Performance on unseen data from the other dataset

Evaluating generalizability by using both datasets as test sets for the fitted models
generated some interesting results. Both models showed a significant decline in
predictive performance when making predictions on the other dataset. The best
performing model fitted on the COVID-19 dataset showed the highest F1-score when
predicting fake news on the other dataset, 74.65%. However, the other metrics and the
confusion matrix indicate that this result should be taken with a grain of salt. The model
reached an accuracy of 59.76%, performing only slightly better than a 50% random
guessing baseline. The confusion matrix shows a remarkable pattern, as the model
predicted the input to be fake 99.1% of the time. This indicates that the model does not
generalize well to general news articles. The best performing model fitted on the
regular news dataset reached an F1-score of 67.60% when encountering unseen data
from the other dataset. This F1-score is lower than that of the COVID model, but when
all metrics are compared, it can be concluded that this model generalizes better than the
model fitted on the COVID-19 data. Its better ability to generalize to unseen data seems
logical, since the model is not trained on a specific topic. Robustness also improved
when the crosswise scores of the deep learning and machine learning models are
compared: both the accuracy and the F1-scores of the deep learning models were
substantially higher than those of the SVMs when predicting on data from the other
dataset. However, the improvement of the model fitted on the COVID-19 data should
once again be taken with a grain of salt, as it mainly reflects an increase in the number of
times "fake" is predicted.
This study concludes that pre-trained language models are superior in terms of
predicting fake news on COVID-19 related social media posts. They outperform both the
machine learning models and the traditional language models. RoBERTa showed the
best performance and outperformed every other model on both datasets. However, the
model does not adapt well to general news articles, as it predicted 99.1% of those
articles to be fake. The model trained on general news articles performed better when
predicting whether a COVID-19 related social media post was real or fake.

6.3 Limitations and future work

The additional dataset used during this study is one of its limitations. To evaluate the
adaptability of the model, a dataset containing general news articles was used. Because
of this, there are two major differences between the datasets. The first difference is the
topic of the news: one corpus focuses on COVID-19 related news, while the other corpus
does not contain COVID-19 related news. The second difference is the length of the
texts. The COVID-19 dataset contains social media posts, which are limited in the
number of characters. The general news dataset contains news articles, which are not
limited in the number of characters and are therefore much longer. A more similar
dataset, isolating only one major difference, would make for a more focused study. For
further research and evaluation of adaptability, another dataset containing COVID-19
related news would be advisable; either social media posts or news articles could work.
Unfortunately, I did not manage to find a dataset that met these requirements. Future
research could also be targeted towards collecting more data.
Due to time limitations, some techniques that showed promising results in other
studies were not covered in the scope of this study. Hybrid models and other pre-trained
language models (XLM, ELMo) could potentially show promising results. Exploring the
logic behind the decisions of the models with the help of interpretability techniques
could offer an explanation for the differences in performance across the datasets.
Getting familiar with the clues that the models are using could help in building a more
robust and adaptable model.
Finally, manually searching for the optimal combination of hyperparameter settings
was a very time-consuming activity, which limited this research. Although this approach
is generally accepted, it is not optimal. Since this study trained a large number of
models, not every parameter was optimized. The hyperparameter optimization
technique grid search could overcome this issue and have a positive effect on tuning the
hyperparameters of the deep learning models. Further fine-tuning could improve the
performance of the models, something that future research could look into.
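
A grid search over the hyperparameters tuned in Appendices C-F could be sketched as follows; the parameter ranges and the helper `build_and_evaluate`, which is assumed to train a (BI-)LSTM with the given settings and return its validation F1-score, are hypothetical.

```python
# Sketch of replacing the manual search with an exhaustive grid search over the
# hyperparameters tuned in Appendices C-F (number of neurons, dropout rate, epochs).
from itertools import product

param_grid = {
    "neurons": [32, 64, 128],
    "dropout": [0.1, 0.2, 0.3],
    "epochs": [3, 5, 10],
}

def grid_search(build_and_evaluate):
    """build_and_evaluate(neurons, dropout, epochs) -> validation F1-score."""
    best_score, best_params = -1.0, None
    for neurons, dropout, epochs in product(*param_grid.values()):
        score = build_and_evaluate(neurons=neurons, dropout=dropout, epochs=epochs)
        if score > best_score:
            best_score = score
            best_params = {"neurons": neurons, "dropout": dropout, "epochs": epochs}
    return best_params, best_score
```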


7 Conclusion

In this final section, the research questions are restated and answered. Information
from all previous sections is evaluated and put in perspective to answer the questions.

Research question: How do deep learning models perform compared to traditional machine
learning baselines on the classification of fake news on COVID-19 related social media posts?
In general, deep learning models perform better at predicting fake news on COVID-19
related social media posts than machine learning models. The best performing deep
learning model improved the F1-score of the best performing machine learning model
by 3.73%.

Sub-research question 1A: What is the performance of baseline machine learning methods
on the classification of fake news on COVID-19 related social media posts?
The best performing model, as presented by Patwa et al. (2020), is the SVM. The model
reached an F1-score of 93.46%, outperforming the decision tree, logistic regression, and
gradient boost.

Sub-research question 1B: What is the performance of baseline machine learning methods
on the classification of fake news on general news articles?
In line with the study conducted by Patwa et al. (2020), decision tree, logistic regression,
SVM and gradient boost were evaluated. The best performing model was SVM,
reaching an F1-score of 77.69%.

Sub-research question 2A: What is the performance of deep learning methods on the classification
of fake news on COVID-19 related social media posts?
After fine-tuning the deep learning models, the results showed that all the transformer
models were superior in terms of predicting fake news. The models significantly
outperformed the traditional language models. RoBERTa was the best performing
model, reaching an F1-score of 97.19%. Of all traditional language models, only the
BI-LSTM trained with GloVe word embeddings slightly outperformed the machine
learning baseline, reaching an F1-score of 93.55%.

Sub-research question 2B: What is the performance of deep learning methods on the classification
of fake news on general news articles?
All deep learning models trained on the regular news dataset reached a higher
F1-score than the best performing machine learning baseline. The best performing
model was once again RoBERTa, reaching an F1-score of 81.96%. The transformer
models did not outperform the traditional language models as much as they did on the
COVID-19 dataset; for example, the second-best performing model of all was the
BI-LSTM trained with Word2Vec word embeddings.

Sub-research question 3: How do performances differ when the trained model encounters
unseen data?
The deep learning methods generally showed an increase in performance compared
with the machine learning methods. Saying that deep learning methods always
outperform the machine learning methods in this study would be an exaggeration, since
not all models performed better than the best performing machine learning model.
However, the best performing deep learning method did show a significant increase.
The deep learning models also outperformed the machine learning models when
applied to unseen data from the other dataset, indicating that the generalizability of the
deep learning models is better than that of the machine learning models. However, this
conclusion should be taken with a grain of salt, as the deep learning model fitted on the
COVID-19 data predicted the news to be fake 99.1% of the time.


References
2020. Fake News Detection Regarding the Hong Kong Events from Tweets. volume 585 IFIP,
pages 177–186, Springer.
Afroz, Sadia, Michael Brennan, and Rachel Greenstadt. 2012. Detecting Hoaxes, Frauds, and
Deception in Writing Style Online.
Aggarwal, Akshay, Aniruddha Chauhan, Deepika Kumar, Mamta Mittal, and Sharad Verma.
2020. Classification of Fake News by Fine-tuning Deep Bidirectional Transformers based
Language Model. EAI Endorsed Transactions on Scalable Information Systems.
Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python.
O'Reilly Media.
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word
Vectors with Subword Information. Transactions of the Association for Computational Linguistics,
5:135–146.
Bozarth, L. and C. Budak. 2020. View of Toward a Better Performance Evaluation Framework for
Fake News Classification. Technical report.
Castillo, Carlos, Marcelo Mendoza, and Barbara Poblete. 2011. Information Credibility on Twitter.
Chollet, François. 2016. The limitations of deep learning. pages 10–12.
Chopra, Sahil, Saachi Jain, and John Merriman Sholar. 2019. Towards Automatic Identification of
Fake News: Headline-Article Stance Detection with LSTM Attention Models. Technical report.
Choudhary, Anshika and Anuja Arora. 2021. Linguistic feature based learning model for fake
news detection and classification. Expert Systems with Applications, 169:114171.
Coleman, Alistair. 2020. ’Hundreds dead’ because of Covid-19 misinformation - BBC News.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training
of Deep Bidirectional Transformers for Language Understanding. NAACL HLT 2019 - 2019
Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies - Proceedings of the Conference, 1:4171–4186.
Gelfert, Axel. 2018. Fake news: A definition. Informal Logic, 38(1):84–117.
Ghosh, Souvick and Chirag Shah. 2019. Toward Automatic Fake News Classification.
Gravanis, Georgios, Athena Vakali, Konstantinos Diamantaras, and Panagiotis Karadais. 2019.
Behind the cues: A benchmarking study for fake news detection. Expert Systems With
Applications, 128:201–213.
Graves, Alex and Jürgen Schmidhuber. 2005. Framewise Phoneme Classification with
Bidirectional LSTM and Other Neural Network Architectures. Technical report.
Greff, Klaus, Rupesh K. Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber.
2017. LSTM: A Search Space Odyssey. IEEE Transactions on Neural Networks and Learning
Systems.
Harris, Charles R., K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen,
David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert
Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane,
Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin
Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E.
Oliphant. 2020. Array programming with NumPy.
Helmstetter, Stefan and Heiko Paulheim. 2018. Weakly supervised learning for fake news
detection on Twitter. In Proceedings of the 2018 IEEE/ACM International Conference on Advances
in Social Networks Analysis and Mining, ASONAM 2018, pages 274–277, Institute of Electrical
and Electronics Engineers Inc.
Hochreiter, Sepp and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation,
9(8):1735–1780.
Horne, Benjamin D and Sibel Adalı. 2017. This Just In: Fake News Packs a Lot in Title, Uses
Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News. Technical
Report 1.
Huang, Zhiheng, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence
Tagging. Technical report.
Hunter, John D. 2007. Matplotlib: A 2D graphics environment. Computing in Science and
Engineering, 9(3):90–95.
Ipsos. 2019. Fake News: A Global Epidemic Vast Majority (86%) of Online Global Citizens Have
Been Exposed to it.
Jin, Fang, Edward Dougherty, Parang Saraf, Yang Cao, and Naren Ramakrishnan. 2013.
Epidemiological Modeling of News and Rumors on Twitter.


Jwa, Heejung, Dongsuk Oh, Kinam Park, Jang Kang, and Hueiseok Lim. 2019. exBAKE:
Automatic Fake News Detection Model Based on Bidirectional Encoder Representations from
Transformers (BERT). Applied Sciences, 9(19):4062.
Kang, Jerry. 2012. Communications law: Bits of bias. In Implicit Racial Bias Across the Law.
Cambridge University Press, pages 132–145.
Kingma, Diederik P and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In
3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.
Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu
Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language
Representations. Technical report, Google Research.
Levy, Neil. The Bad News About Fake News.
Liu, Liyuan, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and
Jiawei Han. 2019. On the variance of the adaptive learning rate and beyond.
Liu, Shuaipeng, Shuo Liu, and Lei Ren. Trust or Suspect? An Empirical Ensemble Framework
for Fake News Classification. Technical report.
McKinney, Wes. 2010. Data Structures for Statistical Computing in Python. In Proceedings of the
9th Python in Science Conference, pages 56–61.
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of
Words and Phrases and their Compositionality. Technical report.
Monti, Federico, Fabrizio Frasca, Davide Eynard, Damon Mannion, Michael M Bronstein, Fabula
Ai, and Usi Lugano. 2019. Fake News Detection on Social Media using Geometric Deep
Learning. Technical report.
Mouratidis, Despoina, Maria Nefeli Nikiforos, and Katia Lida Kermanidis. 2021. Deep learning
for fake news detection in a pairwise textual input schema. Computation, 9(2):1–15.
Nasir, Jamal Abdul, Osama Subhani Khan, and Iraklis Varlamis. 2021. Fake news detection: A
hybrid CNN-RNN based deep learning approach. International Journal of Information
Management Data Insights, 1(1):100007.
Oesper, Layla, Daniele Merico, Ruth Isserlin, and Gary D. Bader. 2011. WordCloud: A Cytoscape
plugin to create a visual semantic summary of networks. Source Code for Biology and Medicine,
6.
Patwa, Parth, Shivam Sharma, Srinivas PYKL, Vineeth Guptha, Gitanjali Kumari, Md Shad
Akhtar, Asif Ekbal, Amitava Das, and Tanmoy Chakraborty. 2020. Fighting an Infodemic:
COVID-19 Fake News Dataset. arXiv.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion,
Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake
Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and
Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine
Learning Research, 12:2825–2830.
Pennington, Jeffrey, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for
word representation. In EMNLP 2014 - 2014 Conference on Empirical Methods in Natural
Language Processing, Proceedings of the Conference, pages 1532–1543.
Pérez-Rosas, Verónica, Bennett Kleinberg, Alexandra Lefevre, and Rada Mihalcea. 2017.
Automatic Detection of Fake News. Technical report.
Qazi, Momina, Muhammad U.S. Khan, and Mazhar Ali. 2020. Detection of Fake News Using
Transformer Model. In 2020 3rd International Conference on Computing, Mathematics and
Engineering Technologies: Idea to Innovation for Building the Knowledge Economy, iCoMET 2020,
Institute of Electrical and Electronics Engineers Inc.
Rashkin, Hannah, Eunsol Choi, Jin Yea Jang, Svitlana Volkova, Yejin Choi, and Paul G Allen.
2017. Truth of Varying Shades: Analyzing Language in Fake News and Political
Fact-Checking. Technical report.
Rubin, Victoria L. 2010. On deception and deception detection: Content analysis of
computer-mediated stated beliefs. Proceedings of the American Society for Information Science and
Technology, 47(1):1–10.
Ruchansky, Natali, Sungyong Seo, and Yan Liu. 2017. CSI: A Hybrid Deep Model for Fake News
Detection. International Conference on Information and Knowledge Management, Proceedings, Part
F131841:797–806.


Schütz, Mina, Alexander Schindler, Melanie Siegel, and Kawa Nazemi. 2021. Automatic Fake
News Detection with Pre-trained Transformer Models. Technical report.
Shu, Kai, H Russell Bernard, and Huan Liu. 2019. Studying Fake News via Network Analysis:
Detection and Mitigation. Technical report.
Shu, Kai, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. 2020.
FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal
Information for Studying Fake News on Social Media.
Shu, Kai, Suhang Wang, and Huan Liu. 2019. Understanding User Profiles on Social Media for
Fake News Detection. Technical report.
Silva, Catarina and Bernardete Ribeiro. 2003. The Importance of Stop Word Removal on Recall
Values in Text Categorization. In Proceedings of the International Joint Conference on Neural
Networks, volume 3, pages 1661–1666.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. Technical report.
Vosoughi, Soroush, Deb Roy, and Sinan Aral. 2018. The spread of true and false news online.
Science, 359(6380):1146–1151.
Wang, William Yang. 2017. "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News
Detection. ACL 2017 - 55th Annual Meeting of the Association for Computational Linguistics,
Proceedings of the Conference (Long Papers), 2:422–426.
Waskom, Michael, Olga Botvinnik, Drew O’Kane, Paul Hobson, Saulius Lukauskas, David C
Gemperline, Tom Augspurger, Yaroslav Halchenko, John B. Cole, Jordi Warmenhoven, Julian
de Ruiter, Cameron Pye, Stephan Hoyer, Jake Vanderplas, Santi Villalba, Gero Kunter, Eric
Quintero, Pete Bachant, Marcel Martin, Kyle Meyer, Alistair Miles, Yoav Ram, Tal Yarkoni,
Mike Lee Williams, Constantine Evans, Clark Fitzgerald, Brian, Chris Fonnesbeck, Antony
Lee, and Adel Qalieh. 2017. mwaskom/seaborn: v0.8.1 (September 2017).
Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang
Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah,
Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo,
Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason
Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey
Dean. Google’s Neural Machine Translation System: Bridging the Gap between Human and
Machine Translation. Technical report.
Wu, Yuting, Mei Yuan, Shaopeng Dong, Li Lin, and Yingqi Liu. 2018. Remaining useful life
estimation of engineered systems using vanilla LSTM neural networks. Neurocomputing,
275:167–179.
Yang, Fan, Xiaohui Yu, Yang Liu, and Min Yang. 2012. Automatic detection of rumor on Sina
Weibo. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pages 1–7, ACM Press, New York, New York, USA.
Zhou, Yiwei. 2017. Clickbait Detection in Tweets Using Self-attentive Network The Zingel
Clickbait Detector at the Clickbait Challenge 2017. Technical report.
Řehůřek, Radim and Petr Sojka. 2010. Fast and Faster: A Comparison of Two Streamed Matrix
Decomposition Algorithms. Technical report.


Appendix A: Artificial Neural Network

An artificial neural network (ANN) is a mathematical model inspired by the function-
ality of the human brain. Within the model, different layers work together to solve
tasks like image processing, forecasting, and classification by recognizing patterns.
ANNs group unlabeled data according to similarities in the numerical data contained in
vectors, and classify the data when they are accompanied by a labeled dataset. Over the
past few decades, the ANN has become a very popular and powerful tool, as it is able to
process large amounts of data.
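
A minimal sketch of such a network, assuming a simple feed-forward architecture in Keras with illustrative layer sizes, is shown below.

```python
# Minimal sketch of a feed-forward artificial neural network in Keras: stacked
# dense layers map an input vector to a binary class probability. Layer sizes
# are illustrative only.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_ann(input_dim):
    model = Sequential([
        Dense(64, activation="relu", input_shape=(input_dim,)),  # hidden layer 1
        Dense(32, activation="relu"),                             # hidden layer 2
        Dense(1, activation="sigmoid"),                           # output layer
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```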


Appendix B: Transformer

In 2017, in the paper 'Attention Is All You Need', a new architecture called the Trans-
former was introduced (Vaswani et al. 2017). The authors showed that the architecture
produces promising results, as it improved the state of the art on translation and other
tasks. Transformers use attention mechanisms and, like the LSTM, form an architecture
for transforming sequences with the help of an encoder and a decoder. The encoder and
decoder are stackable modules that contain Multi-Head Attention and Feed Forward
layers.

The Multi-Head Attention bricks can be described by the following equation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad \text{(B.1)}$$

Here Q is a matrix containing the vector representation of a word (Query), K contains
the vector representations of all the words in the sequence (Keys), and V again contains
the vector representations of all the words (Values). The weights on the values are
computed from QK^T, divided by the square root of d_k, and the outcome is passed
through a softmax function. This means that the weights are defined by how each word
(Q) is influenced by every other word in the sequence (K). Those weights are then
applied to all words in the sequence V. Figure 1 shows how multiple attention
mechanisms can operate side by side at the same time. Because of this, the system learns
from different representations of Q, K, and V, which benefits the model.
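
A minimal NumPy sketch of the scaled dot-product attention in Equation (B.1), with illustrative toy shapes, is given below.

```python
# Minimal NumPy sketch of scaled dot-product attention (Equation B.1).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (sequence_length, d_k)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise word-to-word scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of the values

# Toy example: a sequence of 3 "words" with d_k = 4.
x = np.random.rand(3, 4)
print(scaled_dot_product_attention(x, x, x).shape)       # (3, 4)
```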

Besides the attention layers, each layer of the encoder consists of a fully connected
feed-forward network (FFN). The FFN is applied identically and independently to each
position. Since there are no recurrent networks that can remember sequences, the
positions of the elements are added to the embedded representation of each word. As
Vaswani et al. (2017, p. 2) explain: "The Transformer relies entirely on an attention
mechanism to draw global dependencies between input and output. It allows for
significantly more parallelization and can reach a new state of the art."


Appendix C: COVID data LSTM optimizing

Optimizing number of neurons


Optimizing dropout rate


Optimizing number of epochs


Appendix D: COVID data BI-LSTM optimizing

Optimizing number of neurons


Optimizing dropout rate


Optimizing number of epochs


Appendix E: Kaggle data LSTM optimizing

Optimizing number of neurons


Optimizing dropout rate


Optimizing number of epochs


Appendix F: Kaggle data BI-LSTM optimizing

Optimizing number of neurons


Optimizing dropout rate


Optimizing number of epochs


Appendix G: Performance of baseline machine learning models as presented by Patwa et al. (2020)


Appendix H: Performance of deep learning models on COVID-19 data


Appendix I: Confusion matrices on COVID-19 data

Figure 1
Confusion matrix RoBERTa (A), confusion matrix SVM (B)


Appendix J: Performance of deep learning models on general news data


Appendix K: Confusion matrices on general news data

Figure 1
Confusion matrix RoBERTa (A), confusion matrix SVM (B)
