Today: Proceedings
Manuscript Draft
Abstract: Social media has displaced conventional news sources such as
newspapers and TV news channels and has become a major source of news.
According to the Pew Research Center, a research organization, 62 percent
of adults in the United States get their news from social media [1].
Social media has become an open platform on which any type of fake news
or propaganda can be spread all over the world with a single click.
Platforms such as Facebook and Google use reporting and flagging tools as
preventive measures against fake news. However, these measures are
user-based: only users can report a news item as fake or genuine. Machine
learning and artificial intelligence may help to create algorithms that
detect and remove fake news automatically. This paper explores a machine
learning algorithm named Bernoulli's Naive Bayes Classifier, a variant of
Multinomial Naive Bayes in which the predictors are Boolean variables,
i.e. 0 and 1, to detect fake news. Previous studies applied Gaussian
Naive Bayes [2]. Our proposed methodology classifies the input data into
two classes, 0 or 1, where '0' stands for a fake and '1' for a genuine
news article. From the experiments, we observe that the classification
results improve with Bernoulli's Naive Bayes Classifier as compared to
Gaussian Naive Bayes. The comparison is done in terms of accuracy,
precision, recall, and F1 measure. The accuracy is improved by 10%,
precision by 15%, and F1 measure by 6%.
Keywords: Fake news, Bernoulli’s Naive Bayes, Gaussian Naive Bayes, Machine learning
Introduction
The concept of fake news came into the spotlight when rumours emerged about the spread of
fake news during the US Presidential election held in 2016. News that is spread with the
intention to mislead is known as fake news [3]. To minimize the spread of fake news,
conventional news sources such as newspapers and TV news channels follow strict codes of
practice. Other news sources, such as social media, allow an individual to spread any kind
of news, genuine or fake, easily all over the world. Fake news is now often spread in order
to generate revenue: when a fake news item goes viral, many people click on it and
substantial advertising revenue is generated. In text-based analysis, the opinions of
individuals are converted to text messages and filtered to extract certain keywords. These
keywords are then recognized as constructive or destructive [4].
A brief history of Fake News
On 25 August 1835, the New York Sun newspaper began publishing a series of articles about
the discovery of life on the moon. The series is now known as “The Great Moon Hoax” [5].
On 13 December 2006, a Belgian public television station broadcast a news item titled “The
Flemish parliament has declared its independence from the Kingdom of Belgium”. During the
programme, a new caption was displayed stating that the news was a hoax [6]. After the
2016 US Presidential election, most of the American public believed that fake news spread
over Facebook had played an important role in the result of the election [7]. Other examples
include the 1975 conspiracy theory about the killing of Martin Luther King and the 2010
conspiracy theory that Barack Obama was born in another country.
Impact of Fake News
Fake news can have adverse effects at every scale, from an individual to an entire nation.
Some of the major impacts of fake news include:
1. Effects on health
2. Financial impact
3. Fear
4. Racist ideas
5. Bullying and Violence against innocent people
6. Democratic impacts
Detection of Fake News
The harmful impact of fake news has led researchers to address the problem of its spread.
Many researchers are trying to create algorithms that can detect fake news and limit its
spread. Most of these detection techniques use features of a news article such as the source
from which the news originated, the content of the news, the author’s background, and public
reviews of the article. These techniques make use of supervised learning, where the
algorithm is first trained on a training dataset and then tested to measure its accuracy.
Supervised classifiers have also been applied to related security problems: for example,
websites can be protected from pharming attacks using an SVM detection technique with an
accuracy of 97 percent [8]. Previously implemented fake news detection techniques are
discussed in the Related Work section.
Naive Bayes Classifier
The Naive Bayes classifier is based on Bayes’ theorem. It is not a single algorithm but a
family of classification algorithms that share one principle: all features used for
classification are assumed to be independent of each other given the class. Mathematically,
Bayes’ theorem is stated as
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{1}$$
Here A and B are two events with P(B) > 0. P(A|B) is the probability of event A given that
event B has already occurred, and P(B|A) is the probability of event B given that event A
has occurred. P(A) and P(B) are the probabilities of events A and B on their own.
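As a small worked illustration of Bayes' theorem, the sketch below computes P(A|B) from P(B|A), P(A), and P(B). The function name and all probability values are hypothetical and chosen only for the arithmetic; they are not taken from the dataset used in this paper.

```python
def posterior(prior_a: float, likelihood_b_given_a: float, evidence_b: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if evidence_b <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood_b_given_a * prior_a / evidence_b


# Hypothetical values: A = "article is fake", B = "article contains a given word".
p_a = 0.4            # P(A): share of fake articles (assumed)
p_b_given_a = 0.3    # P(B|A): share of fake articles containing the word (assumed)
p_b = 0.15           # P(B): share of all articles containing the word (assumed)

p_a_given_b = posterior(p_a, p_b_given_a, p_b)
print(round(p_a_given_b, 3))
```

With these assumed inputs the posterior is 0.3 × 0.4 / 0.15, so observing the word raises the probability that the article is fake from the prior of 0.4.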
scikit-learn is a machine learning library for the Python programming language. It provides
a range of classification, regression, and clustering algorithms, such as SVM (support
vector machine), random forest, and k-means clustering. Among the Naive Bayes models
available in this library are the following three:
• Gaussian: It assumes that the features follow a normal distribution.
• Multinomial: It is used when the features are discrete counts.
• Bernoulli: It is useful when the features are binary vectors.
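The difference between the three models can be sketched on a toy word-count matrix. The example below is only an illustration: the data are made up, and the label convention (0 = fake, 1 = genuine) follows the abstract. It uses the scikit-learn classes `GaussianNB`, `MultinomialNB`, and `BernoulliNB`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Toy data (made up): rows are documents, columns are counts of three words.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 2, 1],
    [0, 3, 0],
])
y = np.array([0, 0, 1, 1])  # 0 = fake, 1 = genuine

models = {
    # Treats each column as a continuous, normally distributed value.
    "gaussian": GaussianNB(),
    # Models the word counts directly.
    "multinomial": MultinomialNB(),
    # Default binarize=0.0 maps every positive count to 1 (word present).
    "bernoulli": BernoulliNB(),
}

x_new = np.array([[1, 0, 0]])  # a document containing only the first word
predictions = {
    name: int(model.fit(X_counts, y).predict(x_new)[0])
    for name, model in models.items()
}
print(predictions)
```

On real text the three models can disagree, because Bernoulli Naive Bayes also uses the absence of a word as evidence, while the multinomial model only uses the words that occur.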
Related Work
Granik and Mesyura [9] show how a Naive Bayes classifier can be used to detect fake news.
The approach was implemented as a software system and tested on Facebook posts, where it
reached a classification accuracy of 74 percent. However, the main suggestions of the paper
rely on using artificial intelligence methods for the detection of fake news. Limitation:
This approach is less accurate when the dataset contains few similar words. The suggested
artificial intelligence methods are not discussed in the paper, so the proposal is
futuristic rather than realistic.
Dey et al. [10] propose a decision-making model to identify news deception. The dataset used
in the paper contains 200 tweets. In the first step the tweets undergo “text normalization”;
in the next step, features are extracted to classify the news. A comprehensive linguistic
analysis is then performed on the tweets, extracting a bag of words that helps to find
noticeable patterns, and finally the k-nearest neighbor algorithm is applied to separate
polarized news from credible news. For a particular sample, two out of three items were
labelled fake and one was labelled real. The system achieved an accuracy of 66.66%.
Limitation: This model is applicable only to a predefined dataset.
Atodiresei et al. [11] introduce a natural language processing (NLP) technique that
identifies fake news and fake accounts on Twitter using a Named Entity Recognition (NER)
component. The method splits the text of an article into small component parts such as
entities, topics, social tags, overall tweet sentiment, and hashtag sentiment.
A more complex framework for fake news detection is proposed in [12]. The technique
implements a machine learning model for incident classification. The model consists of five
NLP features and three knowledge verification features, expressed as questions about the
scope, the spread, and the consistency of the source. Limitation: The notion of similarity
is the key element of automated knowledge verification. In particular, the approach depends
on establishing semantic similarity.
Zhang et al. [13] propose a systematic approach to fake news detection. The technique uses
a two-layered approach, which leads to the detection of fake topics and fake events. The
efficiency of the proposed method is determined through the implementation and validation
of a novel Fake News Detection (FEND) system. The proposed technique achieves a
classification accuracy of 92.49% and a recall of 94.16% at a specific threshold value of
0.6. Limitation: The study is limited to distinguishing fact from opinion articles and does
not cover other categories of news.
Dataset
The dataset is taken from Kaggle [14]. It is divided into two sub-datasets, i.e. a training
dataset and a test dataset. The training dataset is pre-labelled as fake or genuine, since
the algorithm used is based on supervised learning. There are 25116 labelled news articles
in the training dataset and 5880 unlabelled news articles in the test dataset. The training
dataset has five attributes, as follows:
• ID: A unique id for each article in the dataset.
• Title: The heading of the news article.
• Author: The person who posted the article.
• Text: The content of the news article.
• Label: The label that marks the news article as fake or genuine.
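The five attributes above can be pictured as one record per article. The following is a minimal sketch that reads such records from CSV text using only the Python standard library. The lowercase column names, the `Article` class, the `load_articles` helper, and the two sample rows are assumptions made for illustration; the actual Kaggle files may be laid out differently.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Article:
    id: int
    title: str
    author: str
    text: str
    label: Optional[int]  # 0 = fake, 1 = genuine; None for unlabelled test rows


# Two made-up rows standing in for the training file (column names are assumed).
SAMPLE_CSV = """id,title,author,text,label
0,Example headline,Jane Doe,Some article text,1
1,Another headline,John Roe,Other article text,0
"""


def load_articles(handle) -> List[Article]:
    """Parse CSV rows into Article records; a missing label becomes None."""
    articles = []
    for row in csv.DictReader(handle):
        raw_label = row.get("label")
        articles.append(
            Article(
                id=int(row["id"]),
                title=row["title"],
                author=row["author"],
                text=row["text"],
                label=int(raw_label) if raw_label not in (None, "") else None,
            )
        )
    return articles


articles = load_articles(io.StringIO(SAMPLE_CSV))
print(len(articles), articles[0].label, articles[1].label)
```

The same helper can read the unlabelled test file, since a missing `label` column simply yields `None` for every record.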
Implementation
The proposed technique is implemented using scikit-learn, a machine learning library for
Python. The Naive Bayes classifier classifies a news article as fake or real based on the
words present in it. The model is based on conditional probability, which is calculated
using
$$P(f \mid w) = \frac{P(w \mid f)\,P(f)}{P(w \mid f)\,P(f) + P(w \mid r)\,P(r)} \tag{2}$$
Here, P(f|w) is the probability that a news article is fake given that it contains the
specified words. P(w|f) is the probability that the words are found in fake articles, and
P(f) is the overall probability of fake news articles. Similarly, P(w|r) is the probability
that the words are present in genuine articles, and P(r) is the overall probability of
genuine news articles. The conditional probability that a news article containing the
specified words is genuine can be calculated analogously using eq. 2. However, if the value
of any of these variables is manipulated, it may directly or indirectly affect whether the
news article is judged fake or genuine.
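The calculation can be illustrated numerically. The sketch below assumes the two-class form of Bayes' theorem, P(f|w) = P(w|f)P(f) / (P(w|f)P(f) + P(w|r)P(r)), built from the four quantities defined above. The function name and the probability values are hypothetical and are not estimates from the paper's dataset.

```python
def prob_fake_given_word(p_w_given_f: float, p_f: float,
                         p_w_given_r: float, p_r: float) -> float:
    """Return P(f|w) for a two-class (fake/genuine) problem."""
    numerator = p_w_given_f * p_f
    # Total probability of seeing the word in any article, fake or genuine.
    denominator = numerator + p_w_given_r * p_r
    if denominator == 0:
        raise ValueError("the word never occurs, so P(f|w) is undefined")
    return numerator / denominator


# Hypothetical values: the word occurs in 20% of fake and 5% of genuine
# articles, and the two classes are equally likely a priori.
p_fake = prob_fake_given_word(p_w_given_f=0.20, p_f=0.5, p_w_given_r=0.05, p_r=0.5)
p_genuine = 1.0 - p_fake
print(round(p_fake, 3), round(p_genuine, 3))
```

Changing any single input, for instance raising P(w|r), shifts the posterior, which is the sensitivity the text above refers to.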
Proposed Algorithm
The proposed algorithm is implemented in two phases: the training phase and the testing
phase.
a. Training Phase
The model is trained in the following steps:
• Loading the training dataset.
• Splitting the dataset into two parts, xtrain and ytrain, where xtrain holds the attributes
id, title, author, and text, and ytrain holds the label attribute of the training dataset.
• Applying Bernoulli’s Naive Bayes classifier, which detects particular words in the text
attribute of xtrain and checks the corresponding label in ytrain.
b. Testing Phase
• Loading the test dataset.
• Since the test dataset is unlabelled, ytest holds no attribute and xtest holds all the
attributes, i.e. id, title, author, and text.
• Applying the trained Bernoulli’s Naive Bayes classifier again. The classifier compares
particular words in the text attribute of xtest with those seen in xtrain. If a match is
found, the corresponding ytest entry is assigned the same label as in ytrain.
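The two phases can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only the text attribute, turns each article into binary word-presence features with `CountVectorizer(binary=True)`, and trains on four made-up sentences. The variable names `xtrain`, `ytrain`, `xtest`, and `ytest` mirror the steps above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Training phase: made-up texts with labels (0 = fake, 1 = genuine).
xtrain = [
    "shocking miracle cure doctors hate",
    "you will not believe this shocking secret",
    "parliament passes annual budget bill",
    "central bank holds interest rates steady",
]
ytrain = [0, 0, 1, 1]

# binary=True records only whether a word occurs, which matches the
# Boolean predictors that Bernoulli Naive Bayes expects.
model = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
model.fit(xtrain, ytrain)

# Testing phase: unlabelled texts; the model assigns the labels.
xtest = ["shocking secret cure revealed", "bank raises interest rates"]
ytest = model.predict(xtest)
print(list(ytest))
```

Words in `xtest` that never appeared in `xtrain` are ignored by the vectorizer, so the prediction rests on the vocabulary learned during training.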
Results
When the Gaussian Naive Bayes model was applied to the same dataset, the accuracy achieved
was approximately 72% [2]. The Bernoulli’s Naive Bayes model performs considerably better,
increasing the accuracy to approximately 83%. The accuracy is calculated using the formula
$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{3}$$
where tp, tn, fp, and fn denote the true positive, true negative, false positive, and false
negative values of the confusion matrix. Other measures such as precision, recall, and F1
score can be calculated as:
$$\text{Precision} = \frac{tp}{tp + fp}, \quad \text{Recall} = \frac{tp}{tp + fn}, \quad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$
These measures have been calculated from the confusion matrix shown below.
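For readers who want to reproduce such measures, the sketch below computes accuracy, precision, recall, and F1 score from the four confusion-matrix counts using their standard definitions. The counts are hypothetical and are not the confusion matrix of this study; which class counts as "positive" is also an assumption left to the user.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    if total == 0 or tp + fp == 0 or tp + fn == 0:
        raise ValueError("counts must give non-zero denominators")
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall (0 if both are 0).
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for 20 test articles.
metrics = classification_metrics(tp=8, tn=6, fp=2, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```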
References
[2] https://github.com/rockash/Fake-newsDetection
[5] “‘The Great Moon Hoax’ is published in the ‘New York Sun’”, www.history.com.
[8] Manhas S., Taterh S. and Singh D., “Detection of Pharming Attack on Websites Using SVM
Classifier”, International Journal of Scientific & Technology Research, vol. 8, no. 11, 2019.
[9] Granik M. and Mesyura V., “Fake news detection using naive Bayes classifier”, 2017 IEEE
First Ukraine Conference on Electrical and Computer Engineering (UKRCON), pp. 900–903,
2017, IEEE.
[10] Dey A., Rafi R. Z., Parash S. H., Arko S. K. and Chakrabarty A., “Fake news pattern
recognition using linguistic analysis”, 2018 Joint 7th International Conference on
Informatics, Electronics & Vision (ICIEV) and 2018 2nd International Conference on
Imaging, Vision & Pattern Recognition (icIVPR), pp. 305–309, 2018, IEEE.
[11] Atodiresei C., Selea T. A. and Iftene A., “Identifying fake news and fake users on Twitter”,
Procedia Computer Science, vol. 126, pp. 451–461, 2018, Elsevier.
[12] Ibrishimova M. D. and Li K. F., “A machine learning approach to fake news detection using
knowledge verification and natural language processing”, International Conference on
Intelligent Networking and Collaborative Systems, pp. 223–234, 2019, Springer.
[13] Zhang C., Gupta A., Kauten C., Deokar A. V. and Qin X., “Detecting fake news for reducing
misinformation risks using analytics approaches”, European Journal of Operational
Research, vol. 279, pp. 1036–1052, 2019, Elsevier.
[15] Kaur P., Boparai R. S. and Singh D., “Hybrid Text Classification Method for Fake News
Detection”, International Journal of Engineering and Advanced Technology, vol. 8, no. 5,
2019.
Credit Statement
The authors declare that they have no conflict of interest and transfer the right of the paper to the
Journal for publication.
Mandeep Singh, Mohammed Wasim Bhatt, Harpreet Singh Bedi, Umang Mishra