Abstract
Fake news has a profound impact on readers' minds and has therefore become a major concern. Identifying fake news, or differentiating between fake and authentic news, is quite challenging. The prevalence of fake news in Pakistan has grown considerably over the last decade. This research aims to develop the first comprehensive fake news detection dataset for Pakistani news by using multiple fact-checked news APIs. The developed dataset is then evaluated using multiple state-of-the-art artificial intelligence techniques: five machine learning techniques, namely Naive Bayes, KNN, Logistic Regression, SVM, and Decision Trees, and two deep learning techniques, CNN and LSTM, used with GloVe and BERT embeddings. The performance of all the applied models and embeddings is compared based on precision, recall, F1-score, and accuracy. The results show that the LSTM initialized with GloVe embeddings performs best. The research also analyzes the misclassified samples by comparing them with human judgments.
Keywords: NLP, Fake News, BERT, GloVe, LSTM, CNN
1. Introduction
In the past few decades, the internet has become accessible to almost everyone, which has steadily increased its use in our daily lives. With
* Corresponding author
Email addresses: azkakish22@gmail.com (Azka Kishwar), adeel.zafar@riphah.edu.pk (Adeel Zafar)
1.2. Contributions
This paper is organized into multiple sections. The related work is discussed in Section 2, while Section 3 covers the details of the developed dataset and its evaluation, along with feature extraction and the experimental setup of the applied models. The results are discussed in Section 4, and Section 5 covers the findings, conclusion, and future work.
2. Related Work
This section provides a brief overview of the need for automated fake news detection and the existing work that uses multiple techniques. The section also covers the currently available datasets and their limitations.
Fake news and its detection is a hot topic among researchers. Many researchers have highlighted the need for automated news detection in multiple fields. The need for research on automated deception detection in multiple languages, such as Asian languages, has been mentioned [8]. Additionally, the need for
Multiple open-source datasets are used for classification. Most of the datasets include news from the political sector. A few of the models are built on data scraped from online news websites. One of the most used and publicly available fake news datasets is "Liar, Liar Pants on Fire". This dataset is commonly known as LIAR and contains 12.8K short news statements labeled into six categories [9]. A few other datasets, namely Fake-or-Real news, Twitter, BuzzFeedNews, and Weirdo, are also used by multiple researchers [2, 1, 10, 11, 12, 13]. Fake news detection has also been applied to many web-scraped datasets covering multiple countries, including Pakistani media news [7].
A lot of work has been done on identifying the important features that can help improve model accuracy. Discourse and pragmatics, that is, the use of language to accomplish communication, have been used [14]. Word-level features, such as lexical and semantic features of a headline, profanity, or slang, are also considered important [15]. Auxiliary information, including a person's social engagements on social media, has been considered an important feature that can help improve accuracy [2]. Speaker profiles, such as speaker title, party affiliation, credit history, and location, are also considered important features [6]. Multiple features are extracted from the text and then used for classification. These features include term frequency (TF) and inverse document frequency (IDF) [7].
Google Fact Checker. The claim search API of Google's fact-checking APIs1 has been used to gather fact-checked news data related to Pakistani news. The news data gathered by this API includes data for both the real and fake classes. The API takes a search string as input and returns the fact-checked news claims related to that search term.
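As a rough illustration, a claim-search request can be assembled as below; the endpoint and parameter names reflect my understanding of Google's Fact Check Tools API and should be verified against its documentation:

```python
from urllib.parse import urlencode

# Assumed claim-search endpoint of the Fact Check Tools API
BASE = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

def build_claim_search_url(query, api_key, page_size=50):
    """Build the request URL for fact-checked claims matching a search term."""
    params = {"query": query, "pageSize": page_size, "key": api_key}
    return BASE + "?" + urlencode(params)

# The resulting URL can then be fetched with any HTTP client, e.g. urllib.request
url = build_claim_search_url("Pakistan", "YOUR_API_KEY")
```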
PolitiFact. PolitiFact2 is one of the most famous fact-checking websites, operated by the Poynter Institute. The website contains fact-checked news data categorized into one of the following six categories:
• True
• Mostly True
• Half True
• Mostly False
• False
• Pants on Fire
1 https://toolbox.google.com/factcheck/apis
2 https://www.politifact.com/
The fact-checked news data related to Pakistan has been scraped from the website. The news data gathered from PolitiFact includes data for both the real and fake classes.
TheNewsAPI. The News API3 is one of the most famous, simple, and easy-to-use REST APIs. It provides hundreds of thousands of authentic news articles published by sources worldwide. The news data gathered by this API includes data for the real class only. The API takes a search string as input and returns the news related to that search term.
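A response from such a news API can be flattened into labeled records; the JSON shape below (an "articles" list with "title" fields) is an assumption about the API's output format:

```python
import json

def articles_to_records(response_json):
    """Convert a news-API JSON response into records labeled as real news."""
    payload = json.loads(response_json)
    records = []
    for article in payload.get("articles", []):
        text = article.get("title") or ""
        # this source provides authentic news, so every record gets the real label
        records.append({"text": text, "label": "real"})
    return records

sample = json.dumps({"articles": [{"title": "PM inaugurates new dam project"}]})
records = articles_to_records(sample)
```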
Kaggle. Kaggle is one of the world's largest and most famous data science communities and helps achieve multiple data science goals. There are multiple challenges related to fake news detection on Kaggle, all of which provide state-of-the-art datasets for fake news detection. The news records related to Pakistan are included in the developed dataset. The news data gathered from these datasets includes data for both the real and fake classes.
3 https://newsapi.org/
4 https://www.factcheck.org/
• Fake News5
AFP Factcheck. AFP is one of the leading global news agencies. Its digital verification service has likewise become a leading global fact-checking organization [26]. The fact-checking section of the website9 contains fact-checked news data. The fact-checked news data related to Pakistan has been scraped from the website. The news data gathered from AFP Factcheck includes data for both the real and fake classes.
Prntly. Prntly11 is one of America's top fake news websites [28]. The news data related to Pakistan has been scraped from the website. The news data gathered from the Prntly website includes data for the fake class only.
5 https://www.kaggle.com/c/fake-news/data
6 https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
7 https://www.kaggle.com/c/fakenewskdd2020/data
8 https://www.kaggle.com/c/fake-detection/data
9 https://factcheck.afp.com/
10 https://notallowedto.com/
11 https://prntly.com/
Authentic Source of News. The news data, whether fake or real, must come from an authentic source. There are many official Pakistani news websites that provide real and authentic news, but there are very few authentic resources for gathering fake news data.
Limited fact-checking news resources. There are very limited resources that provide fact-checked news data. Almost all of the fact-checking websites contain fact-checked international news, and the fact-checked news data related to Pakistan is very limited. Additionally, most of the fact-checking websites do not provide a search option for finding specific news.
Limited REST APIs for Gathering News. There are very limited resources that provide well-structured and documented REST APIs to gather news data. Most of the resources either do not have APIs or no longer maintain them. One such example is PolitiFact14: its API for news gathering has been removed and is no longer maintained. Following is the list of
12 https://realnewsrightnow.com/
13 https://guides.stlcc.edu/fakenews/factchecking
14 https://www.politifact.com/
Website                                             Search option   Limitations
https://fullfact.org/                               No              The content does not contain the news data; it includes the evidence to prove a claim real or fake
https://www.politifact.com/                         Yes             Less data related to Pakistan
https://www.snopes.com/                             No              Very limited data related to Pakistan
https://www.washingtonpost.com/news/fact-checker/   No              No search option available for searching specific fact-checked news
• TheNewsAPI 16
Limited News Data due to Specific Search Area. Most of the fact-checking websites are international and therefore contain international fact-checked news. Consequently, when the fact-checked news data is limited to data related to Pakistan, very little data can be gathered. One such example is PolitiFact17, which contains only 169 fact-checks related to Pakistan18, even though the website is one of the most popular fact-checking websites and contains thousands of fact-checked news items.
Issues faced in Web Scraping. There are many news websites that do not provide any API to gather data, so the news data has been scraped from those websites. Some websites restrict scraping, and the scraped response contains no content except the HTML of the loading page. For such websites, the HTML content has been manually extracted from the page source, and the data has then been scraped from this hardcoded HTML content. One such example is NotAllowedTo19.
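Extracting news text from a saved page source can be done with the standard-library HTML parser; the sketch below pulls paragraph text from a hardcoded snippet (the markup is invented for illustration):

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> tags of a saved page source."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data

# Page source copied manually from the browser, then parsed offline
page_source = "<html><body><p>Claim one.</p><p>Claim two.</p></body></html>"
parser = ParagraphExtractor()
parser.feed(page_source)
```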
Class Imbalance. Class imbalance means that the data is not evenly distributed across the classes. As fake news detection is a binary classification problem, each news text belongs to either the real or the fake class. The gathered data is highly imbalanced, as there is far more real news data than fake news data. The data statistics are explained in detail in a later section. The classes are imbalanced because of the very limited fact-checking resources. Multiple official Pakistani news websites are available which
15 https://toolbox.google.com/factcheck/apis
16 https://newsapi.org/
17 https://www.politifact.com/
18 https://www.politifact.com/search/factcheck/?q=pakistan
19 https://notallowedto.com/
Discard duplicate records. The news data gathered from multiple resources contains many duplicate records where the news text is the same. As the first step of data cleaning, all duplicate records are discarded, leaving only unique news records in the dataset.
Discard Missing Data. The most important attribute in the news data for fake news detection is the news text. Therefore, as the next step of data cleaning, all news records where the news text is missing are discarded. This results in a dataset that contains unique and complete news text records.
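The two cleaning steps above can be combined in a single pass; a minimal sketch, assuming each record is a dict with a "text" field:

```python
def clean_records(records):
    """Drop records with missing news text, then drop duplicate texts."""
    seen = set()
    cleaned = []
    for record in records:
        text = (record.get("text") or "").strip()
        if not text:          # discard missing data
            continue
        if text in seen:      # discard duplicate records
            continue
        seen.add(text)
        cleaned.append(record)
    return cleaned
```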
Attribute Selection. The gathered data contains additional metadata such as URL, title, review date, publisher site, publisher name, claim date, claimant, content, published at, and author information. In a real-life scenario, however, this additional information might not always be available along with the news text. Therefore, all this additional metadata is ignored, and all experimentation is performed on the news text only.
News Text Preprocessing. The raw news text contains noise which might affect the performance of the classification models. Therefore, the raw news text is preprocessed before being fed into the classification models. The following steps are performed while preprocessing the raw news text:
20 https://towardsdatascience.com/what-is-data-cleaning-how-to-process-data-for-analytics-and-machine-learning-modeling-c2afcf4fbf45
• Removing URLs
• Removing Punctuation
• Text Tokenization
• Removing Stopwords
• Stemming
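These five steps can be chained as below; the stopword list and suffix-stripping rule are simplified stand-ins for a full stopword corpus and a proper stemmer such as Porter's:

```python
import re

# Illustrative subset of English stopwords, not a complete list
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "on"}

def preprocess(text):
    """Apply URL removal, punctuation removal, tokenization, stopword removal, and stemming."""
    text = re.sub(r"https?://\S+", "", text)              # remove URLs
    text = re.sub(r"[^\w\s]", " ", text)                  # remove punctuation
    tokens = text.lower().split()                         # tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]    # remove stopwords
    # crude suffix stripping as a stemming placeholder
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]
```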
Transforming Labels into Binary Classes. The textual rating of a news record is considered the label for the classifier. As the news data is gathered from multiple resources, the textual rating attribute contains different values, which can be grouped into 49 categories. Fake news detection is a binary classification problem, so there should be only two class labels21. To achieve this, the news labels are transformed into binary classes: True, Partly True, and Half True labels are transformed into the true (real) class, while all other labels are transformed into the false (fake) class.
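The mapping can be expressed directly; a minimal sketch of the rule described above:

```python
# Textual ratings that map to the true (real) class; everything else is fake
TRUE_LABELS = {"True", "Partly True", "Half True"}

def to_binary(textual_rating):
    """Collapse one of the 49 textual ratings into a binary class label."""
    return "real" if textual_rating in TRUE_LABELS else "fake"
```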
Fig. 2 shows a sample of news data of both classes from the developed
dataset.
21 https://machinelearningmastery.com/types-of-classification-in-machine-learning
80% of the dataset is used for training the algorithms, while 20% is used for testing them. As the dataset contains imbalanced classes, it is split using stratified sampling.
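Stratified sampling preserves the real/fake ratio in both splits; a self-contained sketch (library routines such as scikit-learn's `train_test_split` with its `stratify` parameter offer the same behavior):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=42):
    """Split indices so each class keeps the same proportion in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * test_frac)   # per-class share of the test set
        test_idx.extend(idxs[:cut])
        train_idx.extend(idxs[cut:])
    return train_idx, test_idx
```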
The developed dataset is evaluated by classifying fake and real news using multiple machine learning and deep learning algorithms. To achieve this, the raw dataset labels are first converted into binary classes: True, Partly True, and Half True labels are transformed into the true (real) class, while all other labels are transformed into the false (fake) class.
Multiple machine learning and deep learning models are developed to evalu-
ate the dataset. This section explains the experimental setup and details of the
developed models.
This section presents an in-depth analysis of the results of all the applied models, along with performance comparisons. The performance of all the models is compared and evaluated using different performance metrics, including accuracy, precision, recall, and F1-score. The best performing models are shown in bold.
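All four metrics derive from the confusion-matrix counts; a minimal sketch with illustrative names, not code from the paper:

```python
def metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1-score, and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```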
Table 4 shows the results of all the machine learning models. Fig. 3 shows
the performance comparison of machine learning models.
The results show that KNN with lexical and sentiment features is the best performing machine learning model, achieving the highest F1-score and an accuracy of 89%. The results also show that adding sentiment features to the lexical features does not have a significant effect on the performance of the SVM model.
Table 5 shows the results of both deep learning models along with the applied
embeddings. Fig. 4 shows the performance comparison of applied deep learning
models.
The results show that the LSTM initialized with pre-trained GloVe word embeddings is the best performing model, achieving an F1-score and accuracy of 0.94. Overall, CNN has better performance in each case than LSTM.
Fig. 5 shows the comparison of F1-scores of the different embeddings applied with both CNN and LSTM. The results show that the GloVe word embeddings have outperformed with both CNN and LSTM, while the BERT embeddings have shown better performance with CNN than with LSTM. This is because the news text in the dataset mostly consists of short statements, therefore applying word embedding
Table 6 shows the best performing models of both the machine learning and deep learning techniques, and Fig. 6 shows the corresponding performance bar chart. The results show that LSTM has performed better on the developed dataset. Overall, the deep learning techniques perform better than the machine learning techniques. Among the deep learning techniques, the GloVe embeddings have shown better performance with both CNN and LSTM, achieving F1-scores of 0.93 and 0.94 respectively.
The results above show that the overall best performing model is the LSTM initialized with pre-trained GloVe word embeddings, which achieved an F1-score and accuracy of 0.94.
A few randomly selected misclassified samples, where either the real class is predicted as fake or the fake class is predicted as real, are available in Appendix 5.1. The total number of misclassified samples is 133.
The first five rows of the table show samples which are actually fake but predicted as real by the model, while the last five rows show samples which are actually real but predicted as fake. Most of the news texts which are real but predicted as fake read like real news; a human might also judge them as real because the content sounds plausible. Similarly, most of the news texts which are predicted as real but
Fig. 7 shows a bar chart of the response counts from the human survey. The results show that the news predicted as fake by the model is mostly predicted
The key findings of this research include the development and evaluation of the first comprehensive Pakistani fake news detection dataset. The results show that the LSTM initialized with GloVe embeddings has shown the best performance on the dataset, achieving an F1-score of almost 0.94. It is also seen that the GloVe embeddings have performed better than the BERT embeddings. The comparison of misclassified samples with human judgments shows that the human judgments are also weak, as the misclassified samples are wrongly predicted by humans as well.
In the future, the developed dataset can be extended to balance both classes and then evaluated as a balanced dataset. Additionally, the performance on the dataset can be compared using multiple other
References
[1] V. Rubin, Deception Detection and Rumor Debunking for Social Media,
2016, pp. 342–364.
[2] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake news detection on social
media: A data mining perspective, SIGKDD Explor. Newsl. 19 (1) (2017)
22–36. doi:10.1145/3137597.3137600.
URL https://doi.org/10.1145/3137597.3137600
[4] Y. Chen, N. Conroy, V. Rubin, News in an online world: The need for an "automatic crap detector", Vol. 6–10, 2015.
[6] Y. Long, Q. Lu, R. Xiang, M. Li, C.-R. Huang, Fake news detection through
multi-perspective speaker profiles, in: Proceedings of the Eighth Interna-
tional Joint Conference on Natural Language Processing (Volume 2: Short
Papers), Asian Federation of Natural Language Processing, Taipei, Taiwan,
2017, pp. 252–256.
URL https://www.aclweb.org/anthology/I17-2043
[7] I. Kareem, S. Awan, Pakistani media fake news classification using ma-
chine learning classifiers, 2019, pp. 1–6. doi:10.1109/ICIC48496.2019.
8966734.
[9] W. Wang, "Liar, liar pants on fire": A new benchmark dataset for fake news detection, 2017, pp. 422–426. doi:10.18653/v1/P17-2067.
[10] N. Ruchansky, S. Seo, Y. Liu, Csi: A hybrid deep model for fake news
detection, 2017, pp. 797–806. doi:10.1145/3132847.3132877.
[11] Y. Wang, F. Ma, Z. Jin, Y. Yuan, G. Xun, K. Jha, L. Su, J. Gao, Eann:
Event adversarial neural networks for multi-modal fake news detection,
Association for Computing Machinery, New York, NY, USA, 2018, pp.
849–857. doi:10.1145/3219819.3219903.
URL https://doi.org/10.1145/3219819.3219903
[20] C. Song, K. Shu, B. Wu, Temporally evolving graph neural network for fake news detection, Information Processing & Management 58 (6) (2021) 102712. doi:10.1016/j.ipm.2021.102712.
URL https://www.sciencedirect.com/science/article/pii/S0306457321001965
5.1. Appendices