4, 2022 307
Riktesh Srivastava
City University College of Ajman,
Sheikh Ammar Road – Al Tallah 2, Ajman, UAE
Email: r.srivastava@cuca.ae
Khushboo Agnihotri
Amity Business School,
Amity University,
Amity Rd., Sector 125, Noida,
Uttar Pradesh 201301, India
Email: agnihotrikhushboo@gmail.com
Email: kagnihotri@lko.amity.ac.edu
Abstract: Social media has become a major source of the latest news. Much of the
news we read online appears authentic at first glance, but its authenticity
cannot always be assured. According to a report published by Gartner, by 2022
most mature economies will receive more fake information than correct
information, mainly through social media. Fake news is one of the prevalent
threats in our digitally linked world. This paper proposes a model for
recognising fake news using a dataset from Kaggle. The dataset contains 3,000
news items collected from various social media sources, of which 2,725 form the
training dataset and 275 the test dataset. The news is classified as fake or
real using five machine learning classification algorithms, and the results are
compared and analysed. The five classification algorithms are support vector
machine (SVM), naïve Bayes, logistic regression, random forest, and neural
networks.
1 Introduction
Web 3.0 has prompted the rise of user-generated content through social media, which
empowers users to examine any form of news and believe it (Srivastava et al., 2020).
The main objective of social media was to let people socialise with friends and colleagues and use it
The impact on society of false news spreading 309
for different purposes such as education and business. Sadly, it has become a platform for
performing unlawful activities and spreading fake news (Islam et al., 2020). Whether it is
fake news about a product a customer recently purchased online or an expression of political
views, spreading it through social media has become a trend (Lazer et al., 2018). Social
media helps people communicate and spread the news (Duffy et al., 2020).
Unfortunately, it was observed that 62% of the news spread on social media is fake
(Statista, 2020). Fake news spread through social media is often presented
sensationally; thus, it is rapidly picked up and circulated (Bergström and Belfrage, 2018;
Harper, 2010; Li et al., 2017). The spread of fake news through social media has also
damaged different domains of society, as several authors have reported. These domains
include financial markets (Kogan et al., 2020), online retailing
(Martens and Maalej, 2019), and healthcare (Lara-Navarra et al., 2020; Smaldone et al.,
2020). Manzoor et al. (2019) and Shu et al. (2017) state that fake news has a negative
impact on people's lives.
Fake news created on social media aims to misguide readers (Fernandez and
Alani, 2018; Zhang et al., 2018) via a false account (Kumar and Shah, 2018; Shu et al.,
2019). Such news is usually well-written, long, and well-referenced
(Collins et al., 2020; Pennycook and Rand, 2021). Over the past years, researchers have
applied various techniques to distinguish between fake and real news, or between real and
fraudulent accounts. However, conventional methods have struggled to analyse and predict
all types of fake news (Saxena et al., 2017).
This paper recommends a methodology for creating a predictive model that detects whether
news spread on social media is fake or real based on its words, phrases, sources, and
titles. The suggested predictive models apply supervised machine learning algorithms to
an annotated dataset. Feature selection then picks the best-fit features to obtain the
highest precision. The predictive models are trained on the training data, and the one
with the highest precision is selected. The selected model is then applied to the unseen
test data for further analysis.
2 Literature review
Research on detecting fake news spread through social media using various machine learning
algorithms is available. However, current research focuses primarily on social
features and keywords combined with one specific classification algorithm.
Thota et al. (2018) proposed a deep learning algorithm using binary classification to
detect fake news from social media with 94.21% accuracy.
Liu and Wu (2018) demonstrated the detection of fake news on social media through
a convolution algorithm using time series data. The authors studied reports on Twitter
and Sina Weibo, with 85% and 92% correct classification respectively.
Aldwairi and Alwahedi (2018) used a logistic classification algorithm to detect
fake news on social media, claiming 99.2% accuracy.
Cardoso Durier da Silva et al. (2019) studied neural networks for detecting the spread of
fake news on social media. The authors claimed that fake news spreads because
of a deficiency of consensual evidence.
Ahmad et al. (2020) proposed an ensemble machine learning approach to detect fake
news on social media and evaluated it using four performance metrics: accuracy, precision,
recall, and F-1 score.
310 R. Srivastava et al.
3 Research methodology
The research follows the steps of the Team Data Science Process (TDSP) framework
(Martinez et al., 2021) and includes the following phases:
• download and collect the dataset, which includes both the train and test observations
(Section 3.2)
• perform the data cleaning for the test and train dataset (Section 3.2)
• [data modelling] propose the predictive model to classify the news as fake or
authentic (Section 4)
• [performance analysis] use performance metrics to identify the machine learning
models with the best results (Section 5).
4 Data modelling
The modelling process consists of choosing among the predictive models used in the
research. The study's five predictive models are logistic regression, SVM,
random forest, naïve Bayes, and neural networks. The accuracy of predictive models
generally increases with the amount of data available during training. The dataset is
divided into two parts in a 90:10 ratio, one used for training and the other for testing
(see Figure 2).
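The split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_dataset`, the fixed seed, and the placeholder news items are assumptions. Note that the paper reports 2,725 training and 275 test items, so the authors' actual split differs slightly from the strict 90:10 cut shown here.

```python
import random


def split_dataset(items, train_ratio=0.9, seed=42):
    """Shuffle a copy of `items` and split it into (train, test) parts."""
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical stand-in for 3,000 labelled news items: (text, label).
news = [(f"headline {i}", i % 2) for i in range(3000)]
train, test = split_dataset(news)
print(len(train), len(test))
```

Shuffling before cutting avoids a split that simply follows the order in which the items were collected.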
Figure 2 Fake news detection using predictive models (see online version for colours)
[Flowchart; recoverable labels: Raw Data; Data Cleaning; Document Embedding; Training Data; Test Data; Predictive Model (Naïve Bayes, Logistic Regression, Random Forest, Neural Networks); Trained Model; Models Evaluation (CA, Precision, Recall, F-1 score); Optimal model selection; Prediction]
Once the relevant attributes are selected after data cleaning, the next step is
document embedding. Document embedding maps the words of each text to word embeddings
and aggregates them by calculating the mean.
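Mean aggregation of word embeddings can be illustrated with a short sketch. The two-dimensional `toy_vectors` are invented for the example; in practice the vectors would come from a pretrained embedding model, which the paper does not name.

```python
def document_embedding(text, word_vectors):
    """Average the vectors of the known words in `text`.

    Returns None when no word of the text has a vector.
    """
    vectors = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    if not vectors:
        return None
    dim = len(vectors[0])
    # Component-wise mean over all word vectors of the document.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Hypothetical word vectors, for illustration only.
toy_vectors = {"fake": [1.0, 0.0], "news": [0.0, 1.0], "spreads": [1.0, 1.0]}
doc = document_embedding("Fake news spreads", toy_vectors)
print(doc)
```

Every document ends up as a vector of the same length regardless of how many words it contains, which is what the classifiers need as input.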
The fake news posted on social media is identified and tested using the five proposed
predictive models. All five perform a binary classification of the news as fake or real.
This section describes the mathematical formulation of each selected predictive model
and the mechanism of its classification. The final model is selected based on the
receiver operating characteristic (ROC) curve.
$$(X) \in \{x_1, x_2\}.$$
$$Y = w_0 + w_1 x_1 \tag{2}$$
$$\text{Cost}(h_\theta(X), y) = \begin{cases} -\log(h_\theta(X)), & y = 1 \\ -\log(1 - h_\theta(X)), & y = 0 \end{cases} \tag{3}$$
The text data of the news is converted into vectors. Encoding the text as numbers helps
decide whether the vector representation of a news item belongs to the fake or the real
class. A naïve Bayes classifier is trained to automatically categorise news as fake
or real using the probabilities defined by Bayes' theorem. Replacing A and B in
equation (5) with X and y, the feature matrix and response vector respectively,
gives equation (6):
$$P(X \mid y) = P(y \mid X) \ast \frac{P(X)}{P(y)} \tag{6}$$
Thus, the probability of predicting class y (fake or real) given the feature matrix X is
proportional to the probability of X given that class times the probability of belonging
to that class.
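The following sketch shows one common way to turn this rule into a text classifier: a multinomial naïve Bayes over word counts with add-one smoothing. The smoothing, the whitespace tokenisation, and the four toy documents are assumptions made for illustration; the paper does not state which naïve Bayes variant or features it used.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_naive_bayes(model, doc):
    """Return the class with the highest log P(y) + sum of log P(word | y)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior, log P(y)
        for word in doc.lower().split():
            if word in vocab:  # words never seen in training are skipped
                count = word_counts[label][word]
                # Add-one smoothing avoids log(0) for words unseen in a class.
                score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data, for illustration only.
docs = [
    "shocking miracle cure revealed",
    "miracle weight loss secret",
    "government publishes budget report",
    "council approves budget plan",
]
labels = ["fake", "fake", "real", "real"]
model = train_naive_bayes(docs, labels)
print(predict_naive_bayes(model, "miracle secret cure"))
```

Working with log-probabilities replaces the product of many small numbers by a sum, which avoids numerical underflow on long texts.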
which is as:
$$\sum_{i=1}^{m} w_i x_i = w^T x \tag{8}$$
where $w^T$ is $w_n$.
Logistic regression uses the probabilistic logistic function based on equation (8):
$$P(w^T x) = \frac{1}{1 + e^{-w^T x}} \tag{9}$$
Based on equation (9), if the resulting probability for a data point is close to 1, we
predict class 1 for that data point, and class 0 otherwise:
$$P(\hat{y} = 1 \mid x; w) = \frac{1}{1 + e^{-w^T x}}, \quad \text{for class 1} \tag{10}$$
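The logistic function and the resulting class decision can be sketched as below. The 0.5 decision threshold and the example weights are assumptions; the paper only says that values near 1 map to class 1.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, features):
    """P(y_hat = 1 | x; w) for the weighted sum w^T x."""
    z = sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(weights, features, threshold=0.5):
    """Class 1 when the probability reaches the (assumed) threshold, else 0."""
    return 1 if predict_proba(weights, features) >= threshold else 0


# Hypothetical weights and feature vector, for illustration only.
weights = [0.5, -1.0, 2.0]
features = [1.0, 0.0, 1.0]
print(predict_proba(weights, features), predict_class(weights, features))
```

A weighted sum of zero gives a probability of exactly 0.5, so the sign of the weighted sum decides the class under this threshold.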
For correct classification, stochastic gradient descent (SGD) iteratively minimises the
cost function
$$C = \frac{1}{2}(\hat{y} - y)^2 \tag{11}$$
where initial $y = 0$, and $\hat{y}$ is based on equation (10).
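A minimal SGD loop for this squared-error cost might look like the sketch below. With ŷ = σ(wᵀx), the chain rule gives ∂C/∂wᵢ = (ŷ − y) · ŷ(1 − ŷ) · xᵢ. The learning rate, epoch count, zero initial weights, and the four toy samples are assumptions, not values from the paper.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def mean_cost(weights, samples):
    """Average of C = 0.5 * (y_hat - y)^2 over all samples."""
    return sum(0.5 * (sigmoid(dot(weights, x)) - y) ** 2 for x, y in samples) / len(samples)


def sgd_train(samples, n_features, learning_rate=0.5, epochs=200, seed=0):
    """Update the weights one sample at a time, in shuffled order."""
    rng = random.Random(seed)
    weights = [0.0] * n_features
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            y_hat = sigmoid(dot(weights, x))
            # Shared gradient factor: (y_hat - y) * y_hat * (1 - y_hat).
            common = (y_hat - y) * y_hat * (1.0 - y_hat)
            weights = [w - learning_rate * common * xi for w, xi in zip(weights, x)]
    return weights


# Hypothetical samples: x = [bias term, one feature], y = class label.
samples = [([1.0, 0.0], 0), ([1.0, 0.2], 0), ([1.0, 0.8], 1), ([1.0, 1.0], 1)]
trained = sgd_train(samples, n_features=2)
print(mean_cost([0.0, 0.0], samples), mean_cost(trained, samples))
```

Logistic regression is more commonly trained with a log-loss cost; the squared-error form is used here only because it is the cost stated in the text.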
where $p_i$ is the probability of an object being classified to a particular class, and c
refers to the classes (fake or real).
The analysis and adoption of the model are based on the ROC curve (see Figure 3).
The ROC curve is considered the most accurate and straightforward method of assessing
how well a model separates the classes fake and real. The higher the area under the
curve (AUC), the better the predictive model is at differentiating between fake and real
news on social media.
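The AUC can be computed without drawing the curve, using its rank interpretation: it equals the probability that a randomly chosen positive item receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise form; the labels and scores are invented, and which class counts as positive is an assumption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores, for illustration only.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))  # 0.75: three of the four pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic in the number of items, which is acceptable for a test set of a few hundred items.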
Figure 3 ROC curve of five predictive models (see online version for colours)
The AUC values for the five predictive models are given in Table 1.
Based on Table 1, the logistic regression predictive model has the highest AUC value
and is thus suitable for predicting whether news is fake or real. For a detailed
analysis, the following section applies four accuracy standards to all five
predictive models and then selects the most suitable one.
Table 1 AUC values for five predictive models
5 Results evaluation
The data modelling process involves selecting machine learning techniques for predictive
modelling. The logistic regression model is used to recognise news as fake or real. The
accuracy standards from the confusion matrix are classification accuracy (CA), precision,
recall, and F-1 score (see Table 2).
Table 2 Accuracy standard to evaluate fake news on social media
This phase assesses the predictive models' abilities using four accuracy criteria
(classification accuracy, precision, recall, and F-1 score) derived from the confusion
matrix and summarised in Table 3. The outcomes are for the 90% of the dataset used to
train the predictive models.
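The four criteria follow directly from the four cells of a binary confusion matrix, as the sketch below shows. It treats one class as positive, which is an assumption: the paper does not say whether its reported values are per class or averaged over both classes. The counts in the example are invented.

```python
def classification_metrics(tp, fp, fn, tn):
    """CA, precision, recall and F-1 score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    ca = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"CA": ca, "precision": precision, "recall": recall, "F1": f1}


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The guards against empty denominators return 0.0 when a model never predicts the positive class, rather than raising a division error.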
As shown in Table 3, two predictive models, namely logistic regression and
random forest, achieve almost 80% on all four accuracy criteria.
Nevertheless, we selected logistic regression because its AUC of 0.888 is higher than
the random forest AUC value (0.868).
Table 3 Evaluation metrics
Table 4 presents the confusion matrix of the logistic regression predictive model, which
classified the news as fake or real with almost 70% accuracy on the 10% test
data.
Table 4 Logistic regression predictive model outcomes for test data
• The true-negative rate (TNR) was 67.8%, the share of negative-class news on social
media that the model classified correctly
• The true-positive rate (TPR) was 69.8%; overall, the proposed model
classified 189 news items correctly out of the 275 in the test data.
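As a quick consistency check on these figures, and assuming that the 189 correctly classified items refer to the whole 275-item test set, the implied classification accuracy is:

$$\text{CA} = \frac{189}{275} \approx 0.687$$

This agrees with the CA value of 0.687 reported for logistic regression.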
This paper used different predictive models to categorise news as fake or real for the
downloaded Kaggle dataset. The results were investigated using four performance
metrics for assessing the suggested models: classification accuracy (CA), precision,
recall, and F-1 score. The experiment found that logistic regression provides
acceptable outcomes (CA = 0.687, precision = 0.688, recall = 0.687 and F-1 score = 0.687).
Potential future work for this study is further development using other
predictive models such as k-NN, AdaBoost, or tree predictive models. The available data
have restrictions regarding the facts of defaulters and timeline, which specify the
comportment of default news obtainable from social media.
References
Ahmad, I., Yousaf, M., Yousaf, S. and Ahmad, M. O. (2020) ‘Fake news detection using machine
learning ensemble methods’, Complexity, Hindawi, 17 October,
https://doi.org/10.1155/2020/8885861.
Aldwairi, M. and Alwahedi, A. (2018) ‘Detecting fake news in social media networks’, Procedia
Computer Science, Vol. 141, pp.215–222.
Bergström, A. and Belfrage, M.J. (2018) ‘News in social media’, Digital Journalism, Vol. 6, No. 5,
pp.583–598, https://doi.org/10.1080/21670811.2018.1423625.
Cardoso Durier da Silva, F., Vieira, R. and Garcia, A.C. (2019) ‘Can machines learn to detect fake
news? A survey focused on social media’, Proceedings of the 52nd Hawaii International
Conference on System Sciences, 8 January, https://doi.org/10.24251/HICSS.2019.332.
Collins, B., Hoang, D.T., Nguyen, N.T. and Hwang, D. (2020) ‘Trends in combating fake news on
social media – a survey’, Journal of Information and Telecommunication, pp.1–20,
https://doi.org/10.1080/24751839.2020.1847379.
Duffy, A., Tandoc, E. and Ling, R. (2020) ‘Too good to be true, too good not to share: the social
utility of fake news’, Information, Communication & Society, Vol. 23, No. 13, pp.1965–1979,
https://doi.org/10.1080/1369118X.2019.1623904.
Fernandez, M. and Alani, H. (2018) ‘Online misinformation: challenges and future directions’,
Companion Proceedings of The Web Conference, pp.595–602,
https://doi.org/10.1145/3184558.3188730.
Harper, R.A. (2010) ‘The social media revolution: exploring the impact on journalism
and news media organizations’, Inquiries Journal, Vol. 2, No. 3 [online]
http://www.inquiriesjournal.com/articles/202/the-social-media-revolution-exploring-the-
impact-on-journalism-and-news-media-organizations (accessed 20 October 2021).
Islam, M.R., Liu, S., Wang, X. and Xu, G. (2020) ‘Deep learning for misinformation detection on
online social networks: a survey and new perspectives’, Social Network Analysis and Mining,
Vol. 10, No. 1, p.82, https://doi.org/10.1007/s13278-020-00696-x.
Kaliyar, R., Goswami, A. and Narang, P. (2021) ‘FakeBERT: fake news detection in social media
with a BERT-based deep learning approach’, Multimedia Tools and Applications,
https://doi.org/10.1007/s11042-020-10183-2.
Kogan, S., Moskowitz, T.J. and Niessner, M. (2020) ‘Fake news in financial markets’, SSRN
Scholarly Paper ID 3237763, Social Science Research Network,
https://doi.org/10.2139/ssrn.3237763.
Kumar, S. and Shah, N. (2018) ‘False information on web and social media: a survey’, 23 April,
Vol. 1, No. 1, pp.1–35.
Lara-Navarra, P., Falciani, H., Sánchez-Pérez, E.A. and Ferrer-Sapena, A. (2020) ‘Information
management in healthcare and environment: towards an automatic system for fake news
detection’, International Journal of Environmental Research and Public Health, Vol. 17,
No. 3, p.1066, https://doi.org/10.3390/ijerph17031066.
Lazer, D.M.J., Baum, M.A., Benkler, Y., Berinsky, A.J., Greenhill, K.M., Menczer, F.,
Metzger, M.J., Nyhan, B., Pennycook, G., Rothschild, D., Schudson, M., Sloman, S.A.,
Sunstein, C.R., Thorson, E.A., Watts, D.J. and Zittrain, J.L. (2018) ‘The science of fake
news’, Science, Vol. 359, No. 6380, pp.1094–1096, https://doi.org/10.1126/science.aao2998.
Li, B., Stokowski, S., Dittmore, S.W. and Scott, O.K.M. (2017) ‘For better or for worse: the impact
of social media on Chinese sports journalists’, Communication & Sport, Vol. 5, No. 3,
pp.311–330, https://doi.org/10.1177/2167479515617279.
Liu, Y. and Wu, Y-F. (2018) ‘Early detection of fake news on social media through propagation
path classification with recurrent and convolutional networks’.
Manzoor, S.I., Singla, J. and Nikita (2019) ‘Fake news detection using machine learning
approaches a systematic review’, 3rd International Conference on Trends in Electronics and
Informatics (ICOEI), pp.230–234, https://doi.org/10.1109/ICOEI.2019.8862770.
Martens, D. and Maalej, W. (2019) ‘Towards understanding and detecting fake reviews in app
stores’, Empirical Software Engineering, Vol. 24, No. 6, pp.3316–3355,
https://doi.org/10.1007/s10664-019-09706-9.
Martinez, I., Viles, E. and Olaizola, I.G. (2021) ‘Data science methodologies: current challenges
and future approaches’, Big Data Research, Vol. 24, p.100183,
https://doi.org/10.1016/j.bdr.2020.100183.
Oliveira, N.R. de, Medeiros, D.S.V. and Mattos, D.M.F. (2020) ‘A sensitive stylistic approach to
identify fake news on social networking’, IEEE Signal Processing Letters, Vol. 27,
pp.1250–1254, https://doi.org/10.1109/LSP.2020.3008087.
Pennycook, G. and Rand, D.G. (2021) ‘The psychology of fake news’, Trends in Cognitive
Sciences, Vol. 25, No. 5, pp.388–402, https://doi.org/10.1016/j.tics.2021.02.007.
Saxena, A. and Srivastava, S. K. (2017) ‘Online to offline platform: a case study of Firstcry.com’,
International Journal of Economic Perspectives, Vol. 11, No. 3, pp.424–430.
Shu, K., Sliva, A., Wang, S., Tang, J. and Liu, H. (2017) ‘Fake news detection on social media: a
data mining perspective’, ACM SIGKDD Explorations Newsletter, Vol. 19, No. 1, pp.22–36,
https://doi.org/10.1145/3137597.3137600.
Shu, K., Zhou, X., Wang, S., Zafarani, R. and Liu, H. (2019) ‘The role of user-profiles for fake
news detection’, Proceedings of the IEEE/ACM International Conference on Advances in
Social Networks Analysis and Mining, pp.436–439,
https://doi.org/10.1145/3341161.3342927.
Smaldone, F., Ippolito, A. and Ruberto, M. (2020) ‘The shadows know me: exploring the dark side
of social media in the healthcare field’, European Management Journal, Vol. 38, No. 1,
pp.19–32, https://doi.org/10.1016/j.emj.2019.12.001.
Srivastava, S.K. and Agnihotri, K. (2020) ‘Relational study between significance level of frontline
executives and their happiness level in an organisational setup: a critical analysis’, Int. J. Work
Organisation and Emotion, Vol. 11, No. 1, pp.62–76.
Statista (2020) ‘Media sources are believed to contain fake news worldwide in 2019’, Statista
[online] https://www.statista.com/statistics/1112026/fake-news-prevalence-attitudes-
worldwide/ (accessed 24 October 2021).
Thota, A., Tilak, P., Ahluwalia, S. and Lohia, N. (2018) ‘Fake news detection: a deep learning
approach’, SMU Data Science Review, Vol. 1, No. 3 [online]
https://scholar.smu.edu/datasciencereview/vol1/iss3/10 (accessed 25 October 2021).
Zhang, H., Kuhnle, A., Smith, J. D. and Thai, M. T. (2018) ‘Fight under uncertainty: restraining
misinformation and pushing out the truth’, pp.266–273,
https://doi.org/10.1109/ASONAM.2018.8508402.