
Elsevier Editorial System(tm) for Materials Today: Proceedings
Manuscript Draft

Manuscript Number: MATPR-D-20-09284

Title: Performance of Bernoulli's Naive Bayes Classifier in the detection of Fake News

Article Type: ICMSTE2K21

Keywords: Fake news, Bernoulli's Naive Bayes, Gaussian Naive Bayes, Machine learning

Corresponding Author: Mr. Mohammed Wasim Bhatt, M.Tech

Corresponding Author's Institution:

First Author: Mandeep Singh, Ph.D

Order of Authors: Mandeep Singh, Ph.D; Mohammed Wasim Bhatt, M.Tech; Harpreet Singh Bedi, M.Tech; Umang Mishra, B.E


Performance of Bernoulli’s Naive Bayes Classifier in the detection of Fake News
Mandeep Singh1, Mohammed Wasim Bhatt*2, Harpreet Singh Bedi3, Umang Mishra4
1,4 Department of Computer Science Engineering, Chandigarh University, India.
2 Department of Computer Science Engineering, Central University of Punjab, India.
3 School of Electronics and Electrical Engineering, Lovely Professional University, India.

Abstract
Social media has displaced conventional news sources such as newspapers and TV news
channels and has become a major source of news. According to the Pew Research Center, a US
research organization, 62 percent of adults in the United States get their news from
social media [1]. Social media has also become an open platform for spreading any type of fake
news or propaganda all over the world with a single click. Platforms such as Facebook
and Google take preventive measures against fake news by providing reporting and
flagging tools. However, these measures are user-based: only users can report a news item as
fake or genuine. Machine learning and artificial intelligence may help to create
algorithms that detect fake news automatically and stop it from spreading. This paper
explores a machine learning algorithm named Bernoulli’s Naive Bayes Classifier, a variant
of Multinomial Naive Bayes whose predictors are Boolean variables, i.e. 0 and 1, to detect
fake news. Previous studies applied Gaussian Naive Bayes [2]. The proposed
methodology classifies the input data into two classes, 0 or 1, where ’0’ stands for a fake and ’1’
for a genuine news article. The experiments show that the classification results improve
with Bernoulli’s Naive Bayes Classifier compared with Gaussian Naive Bayes. The comparison
is made in terms of accuracy, precision, recall, and F1 measure. Accuracy improves by
about 10 percentage points, precision by about 15, and F1 measure by about 6.
Keywords: Fake news, Bernoulli’s Naive Bayes, Gaussian Naive Bayes, Machine learning
Introduction
The concept of fake news came into the spotlight amid reports that fake news was being
spread during the US presidential election held in 2016. News that is spread with the
intention to mislead is known as fake news [3]. To minimize the spread of fake news,
conventional news sources such as newspapers and TV news channels follow strict codes of
practice. Other news sources, such as social media, allow an individual to spread any kind
of news, genuine or fake, easily all over the world. Fake news is now frequently used to
generate revenue: when a fake news item goes viral, many people click on it and substantial
advertising revenue is generated. In text classification, the opinions of an individual are
converted to text messages and filtered to extract certain keywords. These keywords are then
recognized as constructive or destructive [4].
A brief history of Fake News
On 25 August 1835, the New York Sun newspaper began publishing a series of articles
about the discovery of life on the moon. The series later became known as “The
Great Moon Hoax” [5]. On 13 December 2006, a Belgian public television station
broadcast a news report claiming that the Flemish parliament had declared its independence
from the Kingdom of Belgium. During the programme, a caption was introduced stating that
the news was a hoax [6]. After the 2016 US presidential election, most of the American public
believed that fake news spread over Facebook had played an important role in the election
results [7]. Other conspiracy theories include the 1975 conspiracy theory about the killing of
Martin Luther King and the 2010 conspiracy theory that Barack Obama was born in another country.
Impact of Fake News
Fake news can have adverse effects on anyone from an individual to an entire nation. Some of
the major impacts of fake news include:
1. Effects on health
2. Financial impact
3. Fear
4. Racist ideas
5. Bullying and Violence against innocent people
6. Democratic impacts
Detection of Fake News
The harmful impact of fake news has led researchers to address the problem of its spread.
Many researchers are trying to create powerful algorithms that detect fake news and limit
its spread. Most of these detection techniques use features of a news article such as the
source from which the news originated, the content of the news, the author’s background,
and public reviews of the news article. These techniques make use of supervised learning,
where the algorithm is first trained on a training dataset and then tested to measure its
accuracy. Supervised classifiers have also been applied to related security problems; for
example, websites can be protected from pharming attacks using an SVM detection technique
with an accuracy of 97 percent [8]. The previously implemented fake news detection
techniques are discussed in the Related Work section.
Naive Bayes Classifier
The Naive Bayes classifier is based on Bayes' theorem. It is not a single algorithm but a
family of classification algorithms that share one principle: all features are assumed to be
independent of each other given the class. Mathematically, Bayes' theorem is stated as

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{1}$$

Here A and B are two events with P(B) > 0. P(A|B) is the probability of event A given that
event B has already occurred. P(A) and P(B) are the marginal probabilities of events A and B.
P(B|A) is the probability of event B given that event A has occurred.
scikit-learn is a machine learning library for the Python programming language.
It provides classification, regression, and clustering algorithms such as SVM (support
vector machine), random forest, and k-means clustering.
Among others, this library offers three types of Naive Bayes models:
• Gaussian: assumes that the features follow a normal distribution.
• Multinomial: used when the features are discrete counts.
• Bernoulli: useful when the feature vectors are binary.
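The difference between the three models can be illustrated with a minimal sketch. The toy word counts and labels below are hypothetical and are not taken from the paper's dataset; the sketch only assumes that scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Hypothetical toy data (not from the paper): rows are documents,
# columns are counts of three vocabulary words.
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [0, 4, 1],
                   [0, 2, 3]])
labels = np.array([0, 0, 1, 1])  # 0 = fake, 1 = genuine

# Bernoulli NB models word presence/absence, so reduce the counts to 0/1.
presence = (counts > 0).astype(int)

gaussian = GaussianNB().fit(counts, labels)        # treats counts as real values
multinomial = MultinomialNB().fit(counts, labels)  # treats counts as counts
bernoulli = BernoulliNB().fit(presence, labels)    # treats features as binary

# A new document that contains only the first word, once.
new_doc = np.array([[1, 0, 0]])
predictions = {
    "gaussian": int(gaussian.predict(new_doc)[0]),
    "multinomial": int(multinomial.predict(new_doc)[0]),
    "bernoulli": int(bernoulli.predict(new_doc)[0]),
}
print(predictions)
```

On this toy input all three models assign the new document to class 0, because its only word occurs exclusively in the class-0 documents. On real text the models can disagree, since each makes a different assumption about how the features are distributed.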
Related Work
Granik and Mesyura [9] show how fake news can be detected using a Naive Bayes classifier.
The approach was implemented as a software system and tested on Facebook posts, where it
achieved a classification accuracy of 74 percent. However, its main suggestions for
improvement rely on artificial intelligence methods for the detection of fake news.
Limitation: This approach gives lower accuracy when the dataset has few similar words. The
artificial intelligence methods are not discussed in it, so it is futuristic rather than realistic.
Dey et al. [10] propose a decision-maker to identify news deception. The dataset used in this
paper contains 200 tweets. In the first step the tweets undergo “text normalization”; in the
next step features are extracted to classify the news. A comprehensive linguistic analysis of
the tweets then extracts a bag of words that helps to find a noticeable pattern, and finally the
k-nearest neighbor algorithm is applied to separate polarized news from credible news. For a
particular sample, two out of three items were labelled fake and one was labelled real. The
system gave an accuracy of 66.66%. Limitation: This model is only applicable to a
predefined dataset.
Atodiresei et al. [11] introduce a natural language processing (NLP) technique that can
identify fake news and fake accounts on Twitter using a named entity recognition (NER)
component. This method splits the text of the article into small component parts such as
entities, topics, social tags, overall tweet sentiment, and hashtag sentiment.
A complex structure for fake news detection is proposed in [12]. The technique implements
a machine learning model for incident classification. The model consists of five NLP
features and three knowledge verification features in the form of questions related to
the scope, the spread, and the consistency of the source. Limitation: The notion of similarity
is the key element for automated knowledge verification. In particular, it depends on
establishing semantic similarity.
Zhang et al. [13] propose a systematic approach for fake news detection. The technique uses
a two-layered approach that detects fake topics and fake events. The effectiveness of the
proposed method is evaluated through the implementation and validation of a novel fake news
detection (FEND) system. The proposed technique achieves a classification accuracy of 92.49%
and a recall of 94.16% at a threshold value of 0.6. Limitations: This study is limited to
distinguishing fact from opinion articles and does not cover other types of news
categories.
Dataset
The dataset is taken from Kaggle [14]. It is divided into two sub-datasets, i.e. the
training and test datasets. The training dataset is pre-labelled as fake or genuine, since the
algorithm used is based on supervised learning. There are 25116 labelled news articles in the
training dataset, and the test dataset has 5880 unlabelled news articles. The training dataset
has five attributes:
• ID: A unique id for each article in the dataset.
• Title: The heading of the news article.
• Author: The person who posted the article.
• Text: The content of the news article.
• Label: Marks the news article as fake or genuine.
Implementation
The proposed technique is implemented using scikit-learn, a machine learning library for Python.
The Naive Bayes classifier classifies a news article as fake or real based on the words
present in it. The model is based on conditional probability, which is calculated using

$$P(f \mid w) = \frac{P(w \mid f)\,P(f)}{P(w \mid f)\,P(f) + P(w \mid r)\,P(r)} \tag{2}$$

Here, P(f|w) is the probability of a news article being fake when it contains specified words
as mentioned in the dataset. P(w|f) is the probability that the words are found in fake articles.
P(f) is the overall probability of fake news articles. Similarly, P(w|r) is the probability that
the words are present in genuine articles, and P(r) is the overall probability of genuine news
articles. The conditional probability of a news article being genuine given that it contains
the specified words can be calculated analogously from Eq. (2) by exchanging f and r. However,
if the value of any of these variables is manipulated, it may directly or indirectly affect
the predicted nature of the news.
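As a worked illustration of the conditional probability described above, the following sketch evaluates the two-class form of Bayes' theorem for a single word. The function name and the numeric probabilities are hypothetical and chosen only for illustration; they are not values from the paper's dataset.

```python
def posterior_fake(p_w_given_f, p_f, p_w_given_r, p_r):
    """Return P(f|w) for a two-class (fake/genuine) problem.

    P(f|w) = P(w|f) P(f) / (P(w|f) P(f) + P(w|r) P(r))
    """
    evidence = p_w_given_f * p_f + p_w_given_r * p_r
    return p_w_given_f * p_f / evidence


# Hypothetical numbers: the word occurs in 30% of fake articles and in
# 10% of genuine articles, and both classes are equally likely a priori.
p_fake_given_word = posterior_fake(0.30, 0.5, 0.10, 0.5)

# With only two classes, the posterior of "genuine" is the complement.
p_real_given_word = 1.0 - p_fake_given_word

print(p_fake_given_word, p_real_given_word)
```

With these assumed inputs the numerator is 0.15 and the denominator is 0.20, so an article containing the word would be considered fake with probability 0.75.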

Proposed Algorithm
The proposed algorithm is implemented in two phases: the training phase and the testing
phase.
a. Training Phase
The model is trained in the following steps:
• Loading the training dataset.
• Splitting the dataset into two parts, xtrain and ytrain, where xtrain holds the attributes id,
title, author, and text, and ytrain holds the label attribute of the training dataset.
• Applying Bernoulli’s Naive Bayes classifier, which detects particular words in the
text attribute of xtrain and checks the corresponding label in ytrain.
b. Testing Phase
• Loading the test dataset.
• Since the test dataset is unlabelled, ytest initially holds no values and xtest holds all
attributes, i.e. id, title, author, and text.
• Applying Bernoulli’s Naive Bayes classifier again. The classifier compares particular
words in the text attribute of xtest with those of xtrain. If a match is found, the
corresponding ytest entry is assigned the same label as in ytrain.
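The two phases can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' code: the tiny in-memory data frames are hypothetical stand-ins for the Kaggle files, the column names follow the Dataset section, and the use of CountVectorizer with binary=True to turn the text attribute into word-presence features is an assumption about how the binary predictors are obtained.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the labelled training dataset.
train = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "title": ["t0", "t1", "t2", "t3"],
    "author": ["a0", "a1", "a2", "a3"],
    "text": [
        "shocking miracle cure doctors hate",
        "aliens secretly control the election shocking",
        "parliament passes annual budget bill",
        "central bank publishes quarterly inflation report",
    ],
    "label": [0, 0, 1, 1],  # 0 = fake, 1 = genuine
})

# Hypothetical stand-in for the unlabelled test dataset.
test = pd.DataFrame({
    "id": [4, 5],
    "title": ["t4", "t5"],
    "author": ["a4", "a5"],
    "text": [
        "shocking miracle election secret",
        "bank publishes budget report",
    ],
})

# Training phase: split attributes from labels, then fit on the text attribute.
xtrain, ytrain = train.drop(columns="label"), train["label"]
model = make_pipeline(
    CountVectorizer(binary=True),  # 1 if a word is present, else 0
    BernoulliNB(),
)
model.fit(xtrain["text"], ytrain)

# Testing phase: the test set has no labels, so ytest is predicted.
xtest = test
ytest = model.predict(xtest["text"])
print(list(ytest))
```

Only the text attribute is used here, in line with the description above; the id, title, and author attributes are carried along in xtrain and xtest but are not passed to the classifier.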
Results
When the Gaussian Naive Bayes model was applied to the same dataset, the accuracy achieved
was approximately 72% [2]. The results of Bernoulli’s Naive Bayes model are considerably
better: the accuracy increased to approximately 82%. The accuracy is calculated using
the formula

$$\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{3}$$

where tp, tn, fp, and fn denote the true positive, true negative, false positive, and false
negative counts of the confusion matrix. Other measures such as precision, recall, and F1 score
can be calculated as:

$$\text{Precision} = \frac{tp}{tp + fp}, \qquad \text{Recall} = \frac{tp}{tp + fn}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

These measures have been calculated from the confusion matrix shown in Figure 1.

Figure 1: Confusion Matrix for Bernoulli’s Naive Bayes Model


Table 1 shows the comparison of these measures for Gaussian and Bernoulli’s Naive Bayes
classifier models.
Measures (%)    Gaussian    Bernoulli
Accuracy        72.33       82.42
Precision       67.95       83.40
Recall          86.04       81.58
F1 score        75.93       82.48
Table 1: Performance comparison of Gaussian and Bernoulli’s Naive Bayes Models.
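The evaluation measures can be computed from confusion-matrix counts as in the sketch below. The counts tp, tn, fp, and fn used here are hypothetical, since the values in Figure 1 are not reproduced in the text. The last two lines recompute the F1 score from the precision and recall reported in Table 1 as a consistency check.

```python
def accuracy(tp, tn, fp, fn):
    # Share of all articles that were classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Share of articles predicted positive that are truly positive.
    return tp / (tp + fp)


def recall(tp, fn):
    # Share of truly positive articles that were predicted positive.
    return tp / (tp + fn)


def f1_score(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)


# Hypothetical confusion-matrix counts (not the values behind Figure 1).
tp, tn, fp, fn = 80, 85, 15, 20
acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)
f1 = f1_score(prec, rec)
print(acc, prec, rec, f1)

# Consistency check using the precision and recall reported in Table 1.
f1_bernoulli = f1_score(83.40, 81.58)
f1_gaussian = f1_score(67.95, 86.04)
print(round(f1_bernoulli, 2), round(f1_gaussian, 2))
```

Rounded to two decimals, the recomputed F1 scores agree with the F1 values listed in Table 1 for both models.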
Discussion
The measures calculated in the Results section (accuracy, precision, recall, and F1 score)
characterize the performance of the model. All these values are calculated from the
confusion matrix shown in Figure 1. Accuracy is the proportion of news articles that are
correctly classified. Precision is the proportion of articles predicted as genuine that are
actually genuine. All the techniques discussed in the Related Work section are based on
supervised learning, i.e. they use predefined datasets, and they rely on larger algorithms
together with tool kits such as the Natural Language Toolkit and named entity recognition,
which are difficult to handle. In contrast, the technique used in this paper does not use
extra tool kits and is implemented using a simple algorithm named Bernoulli’s Naive Bayes.
The spread of fake news has become a major problem, as it spreads very easily through
social media platforms and can affect a large community in an instant. It is especially
dangerous for ordinary people, since a single fake message on a person’s social media
account can affect them mentally, socially, or even financially. To stop the spread of fake
news, it must first be detected reliably.
Conclusion and Future scope
The main strength of this model is its simplicity: it only compares words in the training
and testing datasets and declares a news article to be fake or genuine. News from all types of
sources, such as social media, publications, and newspapers, can be considered. Although this
technique is very simple, some limitations reduce the accuracy of the model. We found
articles that share no words, or share most of their words, with the training articles and yet
are labelled incorrectly by the classifier. In those cases, the accuracy of the model is
strongly affected. Some models use a hybrid technique divided into three phases
(pre-processing, feature extraction, and classification) to increase accuracy, but the
outcomes are not satisfactory [15]. These limitations may be overcome if other parameters,
such as public views and the author and source of the news, are also taken into consideration.
References
[1] Elas S. and Katerina E. M., “News Use Across Social Media Platforms 2018”,
http://www.journalism.org, September 10, 2018.

[2] https://github.com/rockash/Fake-newsDetection

[3] “Explained: What is False Information (Fake News)?”, https://www.webwise.ie/teachers/what-is-fake-news/.
[4] Kaur P., Boparai R. S. and Singh D., “A Review on Detecting Fake News through Text
Classification”, International Journal of Electronics Engineering, vol. 11, no. 1, pp. 393-406,
2019.

[5] “‘The Great Moon Hoax’ is published in the ‘New York Sun’”, www.history.com.

[6] “Flemish Secession Hoax”, http://hoaxes.org/archive/permalink/flemishsecession-hoax.


[7] Danielle Kurtzleben, “Did Fake News On Facebook Help Elect Trump? Here’s What We
Know”, http://www.npr.org/2018/04/11/601323233/6-facts-we-know-about-fakenews-in-the-2016-election

[8] Manhas S., Taterh S. and Singh D., “Detection of Pharming Attack on Websites Using Svm
Classifier”, International Journal of Scientific & Technology Research, vol. 8, no. 11, 2019.
[9] Granik M. and Mesyura V., “Fake news detection using naive Bayes classifier”, 2017 IEEE
First Ukraine Conference on Electrical and Computer Engineering (UKRCON), pp. 900–903,
2017, IEEE.

[10] Dey A., Rafi R. Z., Parash S. H., Arko S. K. and Chakrabarty A., “Fake news pattern
recognition using linguistic analysis”, 2018 Joint 7th International Conference on
Informatics, Electronics & Vision (ICIEV) and 2018 2nd International Conference on
Imaging, Vision & Pattern Recognition (icIVPR), pp. 305–309, 2018, IEEE.

[11] Atodiresei C., Selea T. A. and Iftene A., “Identifying fake news and fake users on Twitter”,
Procedia Computer Science, vol. 126, pp. 451–461, 2018, Elsevier.

[12] Ibrishimova M. D. and Li K. F., “A machine learning approach to fake news detection using
knowledge verification and natural language processing”, International Conference on
Intelligent Networking and Collaborative Systems, pp. 223–234, 2019, Springer.

[13] Zhang C., Gupta A., Kauten C., Deokar A. V. and Qin X., “Detecting fake news for reducing
misinformation risks using analytics approaches”, European Journal of Operational
Research, vol. 279, pp. 1036–1052, 2019, Elsevier.

[14] https://www.kaggle.com/c/fake-news/data, February, 2018.

[15] Kaur P., Boparai R. S. and Singh D., “Hybrid Text Classification Method for Fake News
Detection”, International Journal of Engineering and Advanced Technology, vol. 8, no. 5,
2019.
*Credit Author Statement

Credit Statement

The authors declare that they have no conflict of interest and transfer the right of the paper to the
Journal for publication.

Mohammed Wasim Bhatt


