Professional Documents
Culture Documents
Abstract—Internet acts as the best medium for ‘digital wildfires’, that is, unreliable information going
proliferation and diffusion of fake news. Information viral online (aka fake news) would be one of the biggest
quality on the internet is a very important issue, but threats faced by society. Given its negative impacts (e.g.
web-scale data hinders the expert’s ability to correct It brings needless agitations and social unrest in the
much of the inaccurate content or fake content present society) [14,15], detection of fake news has become an
over these platforms. Thus, a new system of safeguard is increasingly important issue [11].
needed. Traditional Fake news detection systems are This paper presents a novel fake news detection model
based on content-based features (i.e. analyzing the which uses both content-based and social features for
content of the news) of the news whereas most recent fake news detection. The proposed model has
models focus on the social features of news (i.e. how the outperformed existing approaches in the literature and
news is diffused in the network). This paper aims to build obtained higher accuracies than traditional content-based
a novel machine learning model based on Natural methods on a publically available standard dataset that
Language Processing (NLP) techniques for the detection was recently published [2].
of ‘fake news’ by using both content-based features and
social features of news. The proposed model has shown
remarkable results and has achieved an average accuracy II. RELATED WORKS
of 90.62% with F1 Score of 90.33% on a standard
dataset. Extensive research has been done in order to develop
an accurate and reliable automatic fake news detector.
Index Terms—Fake News Detection, Machine Learning Traditionally content-based approach has proven
Classifier, Natural Language Processing, Probabilistic effective for the news lacking social information. In [5]
Classifiers. authors showed that a simple model based on term
frequency-inverse document frequency (tf-idf) offers a
baseline accuracy of 88.5%. In [6] authors have used
I. INTRODUCTION tf-idf with six different machine learning classifiers on a
2000 news dataset, obtaining a 92% accuracy. In [7]
Fake news refers to false information masqueraded as authors used syntactic and semantic features of news
authentic news. Fake news is mainly of the following articles for classifying between genuine and fake news
types: satirical news, hoax, completely fabricated news articles using a trigram language model, obtaining an
and government propaganda (political). It is typically accuracy of 91.5%. But the main problems associated
distributed to attract viewers and to generate advertising with content-based methods for real-world fake news
revenue. The reasons behind fake news include media detection is that these news articles are intentionally
manipulation and propaganda, political and social written in a style that makes it impervious to word
influence, provocation and social unrest and financial analysis.
profit [9]. However, people and groups with potentially Research has been conducted for developing fake news
malicious agendas have been known to initiate fake news detector using social features of news. In [8] authors
in order to influence events and policies around the world. showed that Facebook posts can be classified with high
[10] showed the significant impact of fake news in the accuracy as hoaxes or non-hoaxes on the basis of the
context of the 2016 US presidential elections. [16] users who “like” or “dislike” them and achieved
conducted a survey on how fake news is spread on social accuracies exceeding 99% even with very small training
media websites like “Facebook” by analyzing the dataset, authors have used logistic regression and
activities of respondents’ social media accounts. In 2013, harmonic boolean label crowd-sourcing methods.
The World Economic Forum warned that the so-called Although the method proposed by [8] offers very high
Copyright © 2019 MECS I.J. Information Engineering and Electronic Business, 2019, 4, 1-10
2 Natural Language Processing based Hybrid Model for Detecting Fake News Using
Content-Based Features and Social Features
accuracy, its application is limited just to the use-cases flowchart captures the major part of the developed
that garner enough social media attention (i.e. the number algorithm and each segment of the flowchart is described
of likes and dislikes). In [11] authors have proposed a in later sections.
Multi-source Multi-class Fake news Detection (MMFD)
A. Data Preprocessing
framework and also introduced an automated and
interpretable way to integrate information from multiple Data preprocessing consists of various techniques
sources, but the accuracy of the model on real-world data which can be used to convert the data present in raw
is only 38.81%. Similar research has been carried out by format (data gathered from various resources which are
authors in [17], they have built an automated system not feasible for the analysis e.g. news grabbers extract
called “FakeNewsTracker” for understanding and news from the web page of news editors and store the
detection of fake news. This system collects news various features of that news, this data need further
contents and social context automatically. In [4] authors processing before feeding it to classification algorithms)
have used content-based methods only when the social to clean format (specific formatted data as required by a
based methods perform poorly. They have built the classification algorithm). Preprocessing refers to the
model based entirely on only one type of feature at a time various transformations that are applied to the data before
and tested this model on real-world data and have feeding it to the classification algorithm. Data
obtained an accuracy of 81.7%. Our model is preprocessing increases the efficiency of machine
substantially different from the model proposed in [4], learning algorithms as some of the algorithms require the
instead of relying on a single type of feature of news data to be in a specified format. Data Preprocessing is
articles (i.e. either content-based or social features) for needed only on the content-based part of the news
prediction of fake news, our model uses both (heading, body, etc.). Social features of news don’t
content-based and social features of news articles require any preprocessing. Each of these content-based
simultaneously for detection of hoax articles. parts of news present in the dataset is modeled using the
Bag of Words Model which is a simplified representation
used in Natural Language Processing (NLP). In this
III. METHODOLOGY model, a text (such as a document or a sentence) is
represented as a bag (multiset) of its words, by
The flowchart of the proposed methodology of the
disregarding the stop words and the ordering of words in
entire process for determining whether the news article is
the text, but the multiplicity of each word is stored.
fake or genuine using the content-based and social
features of the news article is presented in Fig. 1. This
Copyright © 2019 MECS I.J. Information Engineering and Electronic Business, 2019, 4, 1-10
Natural Language Processing based Hybrid Model for Detecting Fake News Using 3
Content-Based Features and Social Features
A.1. Normalization day, then these dates should be matched with each other
during content-based matching.
Normalization phase includes various preprocessing
e.g. Let’s assume that 31.10.18 was mentioned in news
steps that are needed to be performed on each word, these
A and 10.31.2018 was mentioned in news B, if these date
steps are described in detail in the subsequent section.
formats are not normalized then during content-based
1. Changing Uppercase to Lowercase matching these dates will mismatch, but if they are
normalized to a standard form that will be used
This phase of data preprocessing focuses on the throughout the dataset i.e. convert 10.31.2018
case-folding operation. Case-Folding is a technique to
(mm.dd.yyyy) to 31.10.18 (dd.mm.yy) then both these
reduce all the letters of the word to lowercase which dates will match during content-based classification.
allows case-insensitive comparisons, all the textual Apart from the above mention transformation, there
content of the news (present in the corpus, and the news
are many other forms of normalization that can be
that is to be labeled as fake or real) that is in Uppercase performed on the content portion of the news such as
form is converted to Lowercase form. This operation is Unit Normalization, etc. This paper only focuses on the
necessary because all the text present in the news corpus
above mention transformations for normalization as other
should have a common representation format. preprocessing transformations are not providing
If the two words that possess same meaning and are significant results.
used in the same context, but they differ in the format of
representation, then the matching algorithm would not be A.2. Stop Word Removal
able to match both the words and will treat them as a
Stop words are the most common words in a language
separate entity. e.g. “Automobile” and “automobile” will
(e.g. a, an, the, was, in, etc.). These words don’t possess
be treated as two different words because they differ in
any discriminating power and need to be discarded before
their format of representation, which would be incorrect.
constructing the bag of words model, because stop words
Thus by reducing “Automobile” to lower case
take up more space and increase the processing time,
“automobile” the efficiency of the matching algorithm
such words do not possess much relevance. The word
can be increased.
having higher relevance possess high local frequency
The main disadvantage associated with Case-Folding
(number of time word occurs in the given document that
operation is that many proper nouns are derived from
is to be classified) and low document frequency (number
common nouns and can only be distinguished using cases.
of documents in the corpus containing that word). This
e.g. “General Motors” and “general motors” the first
paper focuses on removing the stop word by calculating
word refer to the company and it should be treated
its relevancy using the term frequency-inverse document
differently from the second word, but after applying
frequency (tf-idf).
case-folding operation both words will be in the same
Tf-idf is a numerical statistical method that is used to
format and will be treated as same. Thus case-folding
describe how important a particular word is to a
operation leads to loss of information about the proper
document in a collection. For a term t present in
noun.
document d, the tf-idf score of that term in that document
2. Removing Special Characters is given by tf-𝑖𝑑𝑓𝑡,𝑑 as shown in equation (1).
Special characters as such don’t possess any 𝑁
significance during content-based matching and should 𝑡𝑓 − 𝑖𝑑𝑓𝑡,𝑑 = (1 + log 𝑡𝑓𝑡,𝑑 ) ∗ log( ) (1)
𝑑𝑓𝑡
be omitted before applying the classification algorithm.
Although by removing special characters for e.g. “$” Where, 𝑡𝑓𝑡,𝑑 = Term Frequency of term t in document
some context-based information can be lost but there is d.
no loss of content-based information. e.g. let’s assume a
news A contains a statement “cost of a book is $100” now N = Number of the document in the corpus.
here if the special character i.e. “$” is removed due to 𝑑𝑓𝑡 = Document Frequency (i.e. Number of
preprocessing. The information about the currency will Documents in the corpus containing the term t).
be lost and the statement after preprocessing “cost of an
item is 100” now may be referring to dollar or rupee or The inverse of document frequency of the stop word is
any other currency. very low as it occurs in mostly all documents of the
3. Date Format Normalization corpus, once these stop words are identified they are
discarded before modeling content-based features using
Since dates that are mentioned in the news may serve Bag of Words Model.
as the most important factor that can further improve the
efficiency of the classification model. By using this A.3. Lemmatization
information we can check whether the news is a For grammatical reasons, various news editors may use
derivation of the original news or fabrication of it. E.g. different forms of a word (such as run, ran, running). In
Date can occur in various formats in the news such as addition to this, there is a family of derivationally related
dd/mm/yy, mm/dd/yyyy, dd/mm/yyyy. All these forms words (Terms in different syntactic categories that have
should be converted to a standard form such that two the same root and are semantically related) e.g.
dates that are mentioned in news and refer to the same
Copyright © 2019 MECS I.J. Information Engineering and Electronic Business, 2019, 4, 1-10
4 Natural Language Processing based Hybrid Model for Detecting Fake News Using
Content-Based Features and Social Features
photograph, photography, photographic. Lemmatization learning model M which will calculate the class scores of
is a linguistic transformation, it is the process to group all d with respect to both Fake and Real class.
the inflectional forms of a word so that all of them can be The score of d for Fake class is calculated using
analyzed using a single item and sometimes equations (4), (10), (12), (14), (16), (18), and (20) and is
derivationally related forms of a word to a common base given by equation (2)
form by removing inflectional endings only and returning
the base or dictionary form of a word, which is known as 𝑆𝑐𝑜𝑟𝑒(𝑑, 𝐹𝑎𝑘𝑒) = ∑7𝑖=1 𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝑓𝑖 , 𝐹𝑎𝑘𝑒) (2)
the lemma. This technique uses the vocabulary and the
morphological analysis of the word to group all inflected Where, FeatureScore(𝑓𝑖 , Fake) is the score of the
forms. If two sentences only differ in the use of the feature 𝑓𝑖 in d with respect to the Fake class.
inflectional forms of the word then these sentences The score of d for the Real class is calculated using
should match with each other but if lemmatization is not equations (7), (11), (13), (15), (17), (19), and (21) and is
performed then the classification model will mismatch given by equation (3)
such type of sentences.
E.g. John runs, John is running, John ran. Although all 𝑆𝑐𝑜𝑟𝑒(𝑑, 𝑅𝑒𝑎𝑙) = ∑7𝑖=1 𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝑓𝑖 , 𝑅𝑒𝑎𝑙) (3)
these sentences are describing the action john does and so
all the sentences should match syntactically but due to the Where FeatureScore( 𝑓𝑖 , Real) is the score of the
presence of more than one inflectional form of the word feature 𝑓𝑖 in d with respect to the Real class.
matching algorithm fails to group them as one. So in Based on the class scores of document d as calculated
order to improve the matching of words that are using equation (2), and (3). M will predict the Class label
syntactically same, there is a need to convert all the for d by using Algorithm 1:-
inflectional form of the word to a common base form that
can represent all of them. After lemmatization all these Algorithm 1: AssignClassLabel( )
sentences maps to “John run”, because of all the Input: Score(d, Real), Score(d, Fake)
inflectional form of the word i.e., run, ran, running will Output: ClassLabel
now map to a common base run. The main disadvantage
1. ClassLabel ← NULL
of lemmatization is the loss of timing information as
2. if Score(d, Real) >= Score(d, Fake) then
different inflectional forms of word convey different
timing information such as ran refers to past, running 3. ClassLabel ← Real
refers to the present. But contextual information is not 4. else
much relevant as compared to the content-based 5. ClassLabel ← Fake
information for this problem domain. 6. end if
7. return ClassLabel
B. Proposed Model
Given a news article d and a set of previously
published news articles in the corpus C which consists of
news that is either labeled as Real or Fake. The goal of IV. MATHEMATICAL BACKGROUND BEHIND THE
Fake News Detection is to predict whether the d is fake PROPOSED MODEL
or not using a supervised machine learning classifier M This section describes the analysis that has been
which is trained over dataset consisting of news articles carried out to understand the demarcation of the fake
from C. news articles from the genuine news articles, Later in this
section, various content-based and social-based features
1, 𝑅𝑒𝑎𝑙 are discussed which best discriminates the fake news
M (d, C) = { articles from genuine news articles. This section also
0, 𝐹𝑎𝑘𝑒
contains the mathematical background of how the
proposed model uses these features for identifying the
The goal of this paper is to build a machine learning
fake news.
classification model M which will classify the given news
documents into a fake and real class. Hence, this problem A. Content-Based Features
can be mapped into a binary classification problem and
Content-based features of fake news consist of a
can be solved using a Modified Probabilistic Classifier,
headline and body. After preprocessing these features are
which incorporates both content-based features and social
modeled using the Bag of Words Model. These features
features of the news article for predicting whether the
mostly emphasize on the content portion and don’t
news is fake or not. M uses 7 features f1, f2,..., f7 (i.e.
incorporate any contextual meaning.
Headline, Body, Description, Facebook PageID,
Facebook AppID, Source, Authors) of the news articles A.1. Headline
for demarcation and these features are discussed in detail
Headlines are independent of the body of the news.
in Section (IV).
Headlines are the simple representation of the complex
Let us Assume that there is a news article d which has
text and they are used to capture the summary of the
to be classified as either fake or real and a machine
Copyright © 2019 MECS I.J. Information Engineering and Electronic Business, 2019, 4, 1-10
Natural Language Processing based Hybrid Model for Detecting Fake News Using 5
Content-Based Features and Social Features
𝐶𝑜𝑢𝑛𝑡(𝑤𝑖 ,𝑑)
underlying text. They act as a medium to communicate 𝑊𝑜𝑟𝑑𝑆𝑐𝑜𝑟𝑒(𝑤𝑖 , 𝑅𝑒𝑎𝑙) = ∗ 𝐺𝑊 (𝑤𝑖 , 𝐶𝑟 ) (8)
|𝑑|
the gist of the news. Headlines play a key role in gaining
the attention of potential readers. Headlines of the fake
Where, 𝐶𝑟 is the set of news articles in corpus C that
news are generally eye-catching, sensational and
are labeled as Real, Count(𝑤𝑖 , d) is the number of
exaggerated. Some headlines try to attract the attention of
occurrences of the word 𝑤𝑖 in d, Further, GW(𝑤𝑖 , 𝐶𝑟 ) is
the reader by saying imaginary things e.g. “This is what
your fingerprint says about your destiny”. Some the global weight of the word 𝑤𝑖 with respect to 𝐶𝑟
given by the equation (9).
headlines try to scare or horrify the reader e.g. “Patient
was declared dead but a few minutes later rose from bed 1+𝐷𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝐶𝑜𝑢𝑛𝑡(𝑤𝑖 ,𝐶𝑟 )
in XYZ hospital”. Some headlines try to create awareness 𝐺𝑊 (𝑤𝑖 , 𝐶𝑟 ) = |𝐶𝑟 |
(9)
about health risks e.g. “You drink it every day without
knowing that it can cause cancer”. Some headlines Where, DocumentCount(wi, 𝐶𝑟 ) is the number of
provoke the reader to share the news article e.g. “95%
documents in 𝐶𝑟 containing the word 𝑤𝑖 and |𝐶𝑟 | is the
discount on shoes - share the article to avail your
cardinality of the set 𝐶𝑟 .
discount”.
In most of the cases, the only aim of the headline (that A.2. Body
belongs to fake article) is to grab the attention of the
Study of thousands of news articles reveals a stylistic
reader and this types of headlines are loosely connected
difference between genuine and a fake article. Genuine
to the underlying portion of the news article. Thus by
news articles contain more language conveying
examining the headlines of the news articles, its
differentiation whereas fake news articles are expressed
authenticity can be verified.
with more certainty. Words that are more likely to be used
Let us assume that headline h consists of n words w1, in genuine articles are the words that convey
w2, ..., wn. Then the score of feature headline for Fake differentiation or express insight or that quantify for e.g.
class is given by equation (4). think, know, consider, not, without, but, instead, against,
𝑛
whereas the words that are more likely to be used in fake
news articles are the words that convey certainty or that
𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(ℎ, 𝐹𝑎𝑘𝑒) = ∑ 𝑊𝑜𝑟𝑑𝑆𝑐𝑜𝑟𝑒(𝑤𝑖 , 𝐹𝑎𝑘𝑒) express positive emotion or that focus on future for e.g.
𝑖=1 always, never, proven, pretty, good, cause, know, ought,
(4)
gonna, soon.
Thus by examining the content of the news article, its
Where WordScore(𝑤𝑖 , Fake) is the score of the word
authenticity can be proved. Let us assume that the Body
𝑤𝑖 with respect to Fake class. This can be calculated using portion b of the news articles consists of n words. Then
equation (5). the score of feature body for Fake class is given by
𝐶𝑜𝑢𝑛𝑡(𝑤𝑖 ,𝑑)
equation (10).
𝑊𝑜𝑟𝑑𝑆𝑐𝑜𝑟𝑒(𝑤𝑖 , 𝐹𝑎𝑘𝑒) = ∗ 𝐺𝑊 (𝑤𝑖 , 𝐶𝑓 )
|𝑑|
(5) 𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝑏, 𝐹𝑎𝑘𝑒) = ∑𝑛𝑖=1 𝑊𝑜𝑟𝑑𝑆𝑐𝑜𝑟𝑒(𝑤𝑖 , 𝐹𝑎𝑘𝑒) (10)
1+𝐷𝑜𝑐𝑢𝑚𝑒𝑛𝑡𝐶𝑜𝑢𝑛𝑡(𝑤𝑖 ,𝐶𝑓 )
𝐺𝑊 (𝑤𝑖 , 𝐶𝑓 ) = (6) Where WordScore(𝑤𝑖 , Real) can be calculated by using
|𝐶𝑓 |
the equation (8).
A.3. Description
Where, DocumentCount(wi,Cf) is the number of
documents in 𝐶𝑓 containing the word 𝑤𝑖 and |𝐶𝑓 | is the The description is the gist of the content of the news
cardinality of the set 𝐶 𝑓 . article. It helps to identify whether the news article is
biased towards a particular point of view. Analysis of
Similarly, the score of the feature headline for the Real
thousands of Fake news articles that were published
class is given by equation (7).
reveals that these articles are biased towards a particular
topic and in most of the cases, these articles won’t reveal
𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(ℎ, 𝑅𝑒𝑎𝑙) = ∑𝑛𝑖=1 𝑊𝑜𝑟𝑑𝑆𝑐𝑜𝑟𝑒(𝑤𝑖 , 𝑅𝑒𝑎𝑙) (7)
the full story. Fake news articles try to play with the
emotions of the potential readers and make the reader
Where WordScore(𝑤𝑖 , Real) is the score of the word
angry, happy or scared. Thus by analyzing the description
𝑤𝑖 with respect to the Real class. This can be calculated
or the gist of the news articles its authenticity can be
using equation (8).
checked.
Copyright © 2019 MECS I.J. Information Engineering and Electronic Business, 2019, 4, 1-10
6 Natural Language Processing based Hybrid Model for Detecting Fake News Using
Content-Based Features and Social Features
Let us assume that the description or gist g of the news a page identification number 𝐹𝑝 and |𝐶𝑟 | is the cardinality
articles consists of n words. Then the score of feature of the set 𝐶𝑟 .
description for Fake class is given by equation (12).
B.2. Facebook AppID
𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝑔, 𝐹𝑎𝑘𝑒) = ∑𝑛𝑖=1 𝑊𝑜𝑟𝑑𝑆𝑐𝑜𝑟𝑒(𝑤𝑖 , 𝐹𝑎𝑘𝑒) (12) When a person uses Facebook Login on a website or a
mobile App, an ID is created for the specific Facebook
Where WordScore( 𝑤𝑖 , Fake) can be calculated by App, which is called App-Scoped ID. Using this feature
using the equation (5). we can get to know that from which AppID the news
Similarly, the score of a feature description for the article was posted on Facebook. It was observed that a
Real class is given by equation (13). single Facebook account is used multiple times to post or
share the Fake news. Let us assume that a news article
𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝑔, 𝑅𝑒𝑎𝑙) = ∑𝑛𝑖=1 𝑊𝑜𝑟𝑑𝑆𝑐𝑜𝑟𝑒(𝑤𝑖 , 𝑅𝑒𝑎𝑙) (13) was posted on Facebook using App having an
identification number 𝐹𝐴 . Then the score of feature
Where WordScore(𝑤𝑖 , Real) can be calculated by using Facebook AppID (𝐹𝐴 ) with respect to the Fake class is
the equation (8). given by the equation (16).
B. Social Features of news 1+𝐶𝑜𝑢𝑛𝑡(𝐹𝐴 ,𝐶𝑓 )
𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝐹𝐴 , 𝐹𝑎𝑘𝑒) = (16)
|𝐶𝑓 |
Social features of the news articles tell how the news
article is diffused in the network, who has authored this
article and who has published it. Social features play a Where, 𝐶𝑓 is the set of news articles in C that are
significant role in proving the authenticity of the article. labeled as Fake and Count(𝐹𝐴 , 𝐶𝑓 ) is the number of news
Because the internet acts as a medium for the diffusion of articles in 𝐶𝑓 that are posted on Facebook using App that
the news. Thus by examining the social features of the has the identification number 𝐹𝐴 and | 𝐶𝑓 | is the
news articles, its authenticity can be established. cardinality of the set 𝐶𝑓 .
B.1. Facebook PageID Similarly, the score of the feature 𝐹𝐴 with respect to
the Real class is given by the equation (17).
This feature tells about the Identification number of the
Facebook Page on which the news article was shared or
1+𝐶𝑜𝑢𝑛𝑡(𝐹𝐴 ,𝐶𝑟 )
posted. Each page that is created on Facebook by the user 𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝐹𝐴 , 𝑅𝑒𝑎𝑙) = (17)
|𝐶𝑟 |
has a unique identification number and using this
identification number one can easily access the Where, 𝐶𝑟 is the set of news articles in C that are
information regarding the admins, the members of that labeled as Real and Count(𝐹𝐴 ,𝐶𝑟 ) is the number of news
page. It is observed that the Facebook page on which fake articles in 𝐶𝑟 that are posted on Facebook using App that
or hoax article was posted earlier, then there exists a high has the identification number 𝐹𝐴 and | 𝐶𝑟 | is the
chance that in future the authors will use the same page to cardinality of the set 𝐶𝑟 .
post another fake or hoax article. Let us assume that a
news article was posted on the Facebook page having a B.3. Source
page identification number 𝐹𝑝 . Then the score of feature This feature indicates the publisher of the news article.
Facebook PageID (𝐹𝑝 ) with respect to the Fake class is Generally, those publishers which publish genuine news
given by the equation (14). articles tend to validate the authenticity of the news
article before publishing it. Whereas those publishers
1+𝐶𝑜𝑢𝑛𝑡(𝐹𝑝 ,𝐶𝑓 ) who don’t check the authenticity of the news before
𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝐹𝑝 , 𝐹𝑎𝑘𝑒) = (14)
|𝐶𝑓 | publishing may sometimes publish the fake news. Also,
those websites which regularly publishes fake news can
Where 𝐶𝑓 is the set of news articles in C that are be easily identified by examining the URL e.g.
labeled as Fake and Count(𝐹𝑝 , 𝐶𝑓 ) is the number of “szabadonebredok.info”, “mindenegybenblog.hu” etc.
news articles in 𝐶𝑓 that are posted on a Facebook page Thus by examining the URL of the published news article,
having a page identification number 𝐹𝑝 and |𝐶𝑓 | is the its authenticity can be established. Let us assume that the
news article is published by a Source 𝑆𝑘 . Then the score
cardinality of the set 𝐶𝑓 .
of the feature 𝑆𝑘 with respect to the Fake class is given
Similarly, the score of feature Facebook PageID (𝐹𝑝 ) by equation (18).
with respect to the Real class is given by the equation
(15). 1+𝐶𝑜𝑢𝑛𝑡(𝑆𝑘 ,𝐶𝑓 )
𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝑆𝑘 , 𝐹𝑎𝑘𝑒) = (18)
|𝐶𝑓 |
1+𝐶𝑜𝑢𝑛𝑡(𝐹𝑝 ,𝐶𝑟 )
𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝐹𝑝 , 𝑅𝑒𝑎𝑙) = (15)
|𝐶𝑟 |
Where 𝐶𝑓 is the set of news articles in C that are
Where, 𝐶𝑟 is the set of news articles in C that are labeled as Fake and Count(𝑆𝑘 ,𝐶𝑓 ) is the number of news
labeled as Real and Count(𝐹𝑝 , 𝐶𝑟 ) is the number of news articles in 𝐶𝑓 that are published by 𝑆𝑘 and |𝐶𝑓 | is the
articles in 𝐶𝑟 that are posted on a Facebook page having cardinality of the set 𝐶𝑓 .
Copyright © 2019 MECS I.J. Information Engineering and Electronic Business, 2019, 4, 1-10
Natural Language Processing based Hybrid Model for Detecting Fake News Using 7
Content-Based Features and Social Features
Similarly, the score of the feature 𝑆𝑘 with respect to Where, 𝐶𝑟 is the set of news articles in C that are
the Real class is given by equation (19). labeled as Real and Count(𝑎𝑖 , 𝐶𝑟 ) is the number of news
articles in 𝐶𝑟 that are authored by 𝑎𝑖 and |𝐶𝑟 | is the
1+𝐶𝑜𝑢𝑛𝑡(𝑆𝑘 ,𝐶𝑟 )
𝐹𝑒𝑎𝑡𝑢𝑟𝑒𝑆𝑐𝑜𝑟𝑒(𝑆𝑘 , 𝑅𝑒𝑎𝑙) = (19) cardinality of the set 𝐶𝑟 .
|𝐶𝑟 |
1+𝐶𝑜𝑢𝑛𝑡(𝑎𝑖 ,𝐶𝑟 )
𝑃𝑟 (𝑎𝑖 , 𝑅𝑒𝑎𝑙) = (23)
|𝐶𝑟 |
Copyright © 2019 MECS I.J. Information Engineering and Electronic Business, 2019, 4, 1-10
8 Natural Language Processing based Hybrid Model for Detecting Fake News Using
Content-Based Features and Social Features
3. Accuracy: Accuracy is the most intuitive news articles were used for testing purpose out of which
performance measure and it is simply a ratio of 44 news articles were labeled as Fake and remaining 43
correctly predicted observation to the total news articles as Real. The outcome obtained in the
observations. Accuracy is given by equation (26). second Iteration is shown in Table 2.
𝑇𝑃+𝑇𝑁 Table 2. Confusion Matrix for Second Iteration
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = (26)
𝑇𝑃+𝐹𝑃+𝑇𝑁+𝐹𝑁
Fake Real
Fake Real
The accuracy obtained in the first iteration is 90.58%,
with precision 94.87%, recall 86.04% and F1 Score of Fake 36 (TP) 5 (FN)
90.23% as calculated using equations (24), (25), (26), and
Real 3 (FP) 36 (TN)
(27).
2. Second Iteration The accuracy obtained in the fourth iteration is 90.00%,
In the second iteration, the proposed model was trained with precision 92.30%, recall 87.80% and F1 Score of
on 335 news articles chosen randomly from the dataset 89.99% as calculated using equations (24), (25), (26), and
out of which 167 news articles were labeled as Fake and (27).
the remaining 168 news articles as Real. Remaining 87
Copyright © 2019 MECS I.J. Information Engineering and Electronic Business, 2019, 4, 1-10
Natural Language Processing based Hybrid Model for Detecting Fake News Using 9
Content-Based Features and Social Features
Copyright © 2019 MECS I.J. Information Engineering and Electronic Business, 2019, 4, 1-10
10 Natural Language Processing based Hybrid Model for Detecting Fake News Using
Content-Based Features and Social Features
[14] M. Balmas (2012), “When Fake News Becomes Real: Prajal Jain was born in Jobat, India in
Combined Exposure to Multiple News Sources and 1997. He is currently pursuing a
Political Attitudes of Inefficacy, Alienation, and Cynicism”, bachelor’s degree in Computer Science
Communication Research, vol. 41(3), pp. 430-454. and Engineering from Maulana Azad
[15] S. Jang, and K. K. Joon (2018). “Third person effects of National Institute of Technology,
fake news: Fake news regulation and media literacy Bhopal, India. His research interests
interventions”, Computers in Human Behavior, vol. 80, pp. include Data Structures, Algorithms
295-302. and Machine Learning.
[16] A. Guess, J. Nagler, and J. Tucker (2019), “Less than you
think: Prevalence and predictors of fake news
dissemination on Facebook”, Science Advances, vol. 5(1),
pp. Dr. Meenu Chawla completed her BE
[17] K. Shu, D. Mahudeswaran, and H. Liu (2018), (Computer Technology) from Maulana
“FakeNewsTracker: a tool for fake news collection, Azad College of Technology, India in
detection, and visualization”, Computational and 1990. She did her M. Tech (Computer
Mathematical Organization Theory, vol. 25(1), pp. 60-71. Science and Engineering) at the Indian
Institute of Technology, Kanpur, India,
in 1995 and received her Ph.D. in the
area of Ad hoc Networks (Computer
Authors’ Profiles Science) from Maulana Azad National
Institute of Technology, India, in 2012. She has more than 25
Shubham Bauskar was born in years of teaching and research experience. Currently, she is a
Burhanpur, India in 1998. He is Professor in the Department of Computer Science and
currently pursuing a bachelor’s degree in Engineering at Maulana Azad National Institute of Technology,
Computer Science and Engineering from India. She has published more than 50 research papers in the
Maulana Azad National Institute of reputed journals and conferences. She is a Member of IEEE,
Technology, Bhopal, India. His research CSI, and ISTE. Her research and teaching interests include Data
interests include Machine Learning, Structure and Algorithms, Wireless communication and Mobile
Natural Language Processing, and Computing, Mobile Ad Hoc and Sensor Networks, Cognitive
Pattern Recognition. Radio Networks and Big Data.
How to cite this paper: Shubham Bauskar, Vijay Badole, Prajal Jain, Meenu Chawla, " Natural Language Processing
based Hybrid Model for Detecting Fake News Using Content-Based Features and Social Features", International
Journal of Information Engineering and Electronic Business(IJIEEB), Vol.11, No.4, pp. 1-10, 2019. DOI:
10.5815/ijieeb.2019.04.01
Copyright © 2019 MECS I.J. Information Engineering and Electronic Business, 2019, 4, 1-10
© 2019. Notwithstanding the ProQuest Terms and Conditions, you may use
this content in accordance with the associated terms available at
http://www.mecs-press.org/ijcnis/terms.html