
14th IEEE International Conference on Computational Intelligence and Communication Networks

Fake News Detection Using Machine Learning Models

Malak Aljabri
1 Department of Computer Science, College of Computer and Information Systems, Umm Al-Qura University, Makkah 21955, Saudi Arabia
2 Department of Computer Science, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia
mssjabri@uqu.edu.sa

Dorieh M. Alomari
SAUDI ARAMCO Cybersecurity Chair, Department of Computer Engineering, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia
2180007089@iau.edu.sa

Menna Aboulnour
SAUDI ARAMCO Cybersecurity Chair, Department of Computer Science, College of Computer Science and Information Technology, Imam Abdulrahman Bin Faisal University, P.O. Box 1982, Dammam 31441, Saudi Arabia
2180007190@iau.edu.sa

Abstract—Nowadays, with the widespread use of technology, fake news and rumors are spreading as well. Fake news strongly impacts individuals and society, and it can also serve as a vehicle for phishing attempts and information theft. Artificial Intelligence (AI) and Machine Learning (ML) have demonstrated their effectiveness in many areas of our lives, and Natural Language Processing (NLP) has shown promising results in text classification applications. In this study, we propose an experimental study for detecting fake news using ML models. The proposed model analyzes the main text of a news article using NLP techniques and then classifies the article as fake or real. We used a new dataset that combines multiple fake news datasets, and we studied the impact of feature extraction methods on the performance of the developed models. Eight experiments were performed using Random Forest (RF) and Support Vector Machine (SVM) models, each with a different feature extraction technique. The SVM model achieved the best performance, with an accuracy of 98%. This result indicates that the model is reliable enough to be deployed for fake news detection in real-world settings.

Keywords—Fake news, Features extraction, Machine Learning, NLP

I. INTRODUCTION

The internet provides vast magnitudes of news articles, which makes it very convenient to reach any piece of information at the click of a mouse. While people have become increasingly aware that some of the news found on the web is invalid, a large amount of news is still falsely trusted and shared. Moreover, fake news is becoming more and more believable, aiming to retain the curiosity of audiences in order to sell information. Even younger generations, who were once better at identifying misleading websites, find themselves confused as those websites continue to develop. A study conducted by Common Sense Media reported that 44% of teenagers frequently struggle to detect whether a news article is fake, and that 31% of children between the ages of 10 and 18 have shared at least one news story online that they later realized was inaccurate. This increases concerns related to digital literacy and the legitimacy of information [1].

Phishing is one of the most serious cyber threats to the internet environment and to people's daily lives. In this attack, the attacker mimics a trusted entity with the aim of stealing sensitive information [2][3]. Fake news detection is therefore an important need in our society.

Moreover, distinguishing between trustworthy and fake news is difficult because fake news is diverse with regard to subjects, styles, and media platforms [4]. Nevertheless, researchers have argued that deceptive news is given away by linguistic cues. If we are able to recognize these indicators, we can develop an intelligent detector that can outperform manual inspection [4]. Effective tools are therefore crucial to distinguish between reliable and fake news and to facilitate the identification of unreliable news articles.

Machine learning (ML) techniques have proven to be an effective tool for the automated detection of anomalies in different sectors [5][6]. With enough training on relevant and useful data, these models have shown their efficiency over time [7][8]. ML algorithms are typically used for prediction purposes or to detect patterns that are difficult to identify manually. In this paper, we aim to detect fake news using Support Vector Machine (SVM) and Random Forest (RF) ML algorithms.

The main contributions of this paper are as follows:

• Utilize a new dataset related to fake news.

• Analyze news articles and build a set of models using different feature extraction techniques and different ML classifiers to detect fake news.

• Perform a comparative analysis to evaluate the performance of the set of models.

The remainder of this paper is organized as follows: Section 2 reviews the related literature. Section 3 explains the methodology followed in this study. Section 4 contains the experimental setup. The results are discussed in Section 5, and Section 6 concludes the paper.

II. LITERATURE REVIEW

Stahl [9] analyzed previous and current approaches for fake news detection in textual formats and proposed a three-part technique using Naïve Bayes (NB), Support Vector Machines (SVM), and Semantic Analysis to detect fake news on social media. The study aimed to guide researchers in determining which combination of methods should be used to reliably detect fake news on social media.

Pérez-Rosas et al. [10] focused on the automatic identification of fake content in online news. They introduced two datasets for the task of fake news detection, covering seven different news domains. Once the datasets were prepared, they conducted various experiments to build accurate fake news detectors using several classification models they developed. The models were trained using 100 fake news articles and 100 real news articles. The model they constructed relies on a combination of syntactic, lexical, and semantic information, along with features representing text readability properties. The best performing model reached an accuracy comparable to human ability at detection, with an accuracy of 74%, precision of 74%, recall of 75%, and F1-score of 74% on the "FakeNews" dataset [11].

Jain et al. [12] combined ML and NLP to distinguish between legitimate and fake news. Specifically, they combined NB, SVM, and NLP to construct their model and enhance its reliability. The introduced model reached an accuracy of 93.6%.

A system to detect fake news using classification techniques aimed at classifying large data was proposed by Hiramath and Deshpande [13]. The classifiers used include Logistic Regression (LR), NB, SVM, RF, and a Deep Neural Network (DNN). The DNN outperformed the rest of the algorithms in terms of accuracy and time, achieving an accuracy of 91%, followed by NB with an accuracy of 89%.

Aldwairi and Alwahedi [14] aimed to discover an approach to detect and filter out sites containing fake news. In their study, they used simple and carefully selected features to reliably identify fake posts. Preliminary experiments were conducted to evaluate the classifiers' performance in identifying possible sources of fake news. The models used include Bayes Net, Logistic, RF, and NB. The best performing classifier was the Logistic classifier, which obtained an accuracy of 99.4%, precision of 99.4%, recall of 99.3%, and F1-score of 99.3%. However, in terms of ROC, the Logistic classifier, which obtained a value of 99.5%, was outperformed by both Bayes Net and NB, which obtained a ROC of 100%.

Most of the reviewed articles did not focus on the feature extraction step. Detecting fake news requires identifying the words that are highly related to fake news, and this can be achieved using feature extraction techniques. In this study, we examined different feature extraction techniques and utilized a large dataset that combines several fake news datasets.

III. METHODOLOGY

This study's main goal is to employ ML approaches to create classifiers that can accurately identify fake news. SVM and RF are the two models employed. The models were trained on the WELFake dataset [15] with the intention of classifying articles as real or fake. Additionally, we evaluated the classifiers' performance using a number of evaluation metrics, such as classification accuracy, precision, recall, and F1-score. Eight experiments were run using several feature extraction strategies, including hash vectorization and TF-IDF with various n-gram ranges, utilizing an 80-20 holdout split to generate the models. Fig. 1 shows the procedural steps of the research methodology followed in this paper.

Fig. 1. Methodology Framework

A. Dataset Description

In this study, we used the WELFake dataset, which was designed to include more linguistic features for fake news detection. The dataset is roughly balanced, consisting of 72,134 news articles of which 35,028 are legitimate and 37,106 are fake. Moreover, the dataset aims to produce unbiased classification output, as it merges four popular news datasets, namely Kaggle, Reuters, McIntire, and BuzzFeed Political, which helps avoid over-fitting and provides more text data to ensure better model training. Each sample is described using four columns: the serial number, the article title, the article content, and the label defining whether it is real or fake.
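As a minimal sketch of this step, the snippet below loads the dataset and creates the 80-20 holdout split mentioned above; the file name and the column names ("text", "label") are assumptions based on this description, not the authors' exact code.

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file and column names for the WELFake CSV (serial number, title, text, label).
df = pd.read_csv("WELFake_Dataset.csv")
df = df.dropna(subset=["text", "label"])          # drop rows with a missing body or label

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"],
    test_size=0.20,                               # 80-20 holdout split used in the experiments
    random_state=42,
    stratify=df["label"],                         # keep the real/fake ratio in both splits
)
print(X_train.shape[0], "training articles,", X_test.shape[0], "test articles")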
B. Data Preprocessing

In the data preprocessing step, the text was cleaned and prepared for Natural Language Processing (NLP). This step includes text normalization, stop-word removal, tokenization, and stemming.

1) Text Normalization
Text normalization means cleaning the text and converting it into a uniform format [16]. First, special characters such as ('@', '$', '*') and URLs were removed. Second, digits were also removed. Lastly, all letters were converted to lower case.

2) Stop-words Removal
Stop words are words that are highly repeated in the text but do not affect its meaning [16], for example ('a', 'the', 'is', 'are'). All such stop words were removed.

3) Tokenization
Tokenization means splitting a sentence into chunks of tokens [16]. For example, the sentence ('no work today') was tokenized into ('no', 'work', 'today').

4) Stemming
Stemming is the process of removing parts of a token, such as suffixes and prefixes, to return the base form [16]. For example, the word ('studying') was stemmed into ('study').
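The following sketch illustrates one possible implementation of this preprocessing pipeline in Python; the paper does not name a specific library, so the use of NLTK and the regular expressions below are assumptions.

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text: str) -> str:
    text = text.lower()                                   # normalization: lower-casing
    text = re.sub(r"http\S+|www\.\S+", " ", text)         # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)                 # remove digits and special characters
    tokens = word_tokenize(text)                          # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    tokens = [STEMMER.stem(t) for t in tokens]            # stemming (Porter reduces "studying" to "studi")
    return " ".join(tokens)

print(preprocess("No work today! Studying NLP at https://example.com"))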
C. Features Extraction
Before training the ML models, we need to extract the important words that give the best performance. This step is known as feature extraction because the words are the features used to train the model. Three feature extraction techniques were used in this study: the Hash vectorizer, the Term Frequency–Inverse Document Frequency (TF-IDF) vectorizer, and n-grams. After cleaning the data, each feature extraction technique was applied to the data.

1) Hash Vectorization
The hash vectorizer converts words into vectors based on their occurrence frequency [17]. Each word is represented by a unique hash, which reduces the memory needed to store the words. Fig. 2 shows an example of how the hash vectorizer works: each hash number represents a word of the sentence, the number of occurrences is calculated for each word, and that count is then assigned to the hash value.

Fig. 2. Hash Vectorizer
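A minimal sketch of hash vectorization with scikit-learn's HashingVectorizer is shown below; the number of hash features is an illustrative choice, not a value reported in the paper.

from sklearn.feature_extraction.text import HashingVectorizer

docs = ["no work today", "no news today"]

hash_vec = HashingVectorizer(n_features=2**18,    # fixed-size output keeps the memory footprint low
                             alternate_sign=False,
                             ngram_range=(1, 3))  # unigrams to trigrams, as in the experiments
X = hash_vec.transform(docs)                      # stateless: no vocabulary needs to be stored
print(X.shape)                                    # (2, 262144) sparse matrix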
2) Term Frequency–Inverse Document Frequency (TF-IDF) Vectorization
TF-IDF converts words into vectors based on their frequency and on how relevant each word is to the document [17]. This is done by finding the term frequency (TF) and then multiplying it by the inverse document frequency (IDF), as indicated by Equations (5) and (6):

TF-IDF(t, d) = TF(t, d) × IDF(t)    (5)

IDF(t) = log(n / DF(t)) + 1    (6)

where TF(t, d) is the frequency of term t in document d, n is the total number of documents, and DF(t) is the number of documents that contain the term t.

This technique highlights the words most related to each document by assigning them high values, while words that are repeated across all documents receive low values even if they are used frequently.
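The sketch below shows the corresponding scikit-learn TfidfVectorizer; with smooth_idf=False the library computes idf(t) = ln(n / DF(t)) + 1, which matches Equation (6) up to the logarithm base, and ngram_range mirrors the n-gram technique described in the next subsection. The sample documents are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "breaking news about the election",
    "weather news for today",
    "election results announced today",
]

tfidf = TfidfVectorizer(smooth_idf=False,         # idf(t) = ln(n / DF(t)) + 1, as in Eq. (6)
                        ngram_range=(1, 3))       # same n-gram ranges used in the experiments
X = tfidf.fit_transform(docs)                     # rows: documents, columns: terms/n-grams

# Terms that appear in more documents get a lower idf weight than terms unique to one document.
for term, idf in sorted(zip(tfidf.get_feature_names_out(), tfidf.idf_))[:5]:
    print(f"{term!r}: idf = {idf:.2f}")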
3) N-Grams
N-grams are sets of words extracted from a sentence based on a specified window size; the n-gram range takes a minimum and a maximum value [17]. For example, for the sentence ('The weather is very cold today'), an n-gram range of (2, 2), meaning only 2-grams, gives the following: {'weather very', 'very cold', 'cold today'}.

With an n-gram range of (1, 3), meaning 1-, 2-, and 3-grams, the result is: {'weather', 'very', 'cold', 'today', 'weather very', 'very cold', 'cold today', 'weather very cold', 'very cold today'}.

The stop words 'The' and 'is' are ignored by this technique.

This method helps in predicting the relation between words based on probability, as some words are more likely to be used after certain sequences of words.

D. Proposed Models

1) Support Vector Machines (SVM)
SVM is a supervised learning technique used for solving classification and regression problems. The algorithm treats the training instances as belonging to one of two categories and later classifies new instances based on what it has learned. The classifier maps the instances so as to maximize the margin between the categories, resulting in a hyperplane that separates the two categories [18].

2) Random Forest (RF)
RF is a supervised learning technique that is useful in both classification and regression contexts. The algorithm consists of a collection of decision trees whose predictions are weighted and combined to produce a more reliable prediction and improve performance [19].

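As an illustration of how these two classifiers can be combined with the vectorizers above, the sketch below builds scikit-learn pipelines; the pipeline layout and the default hyperparameters are assumptions, not the authors' released code (the tuned values are reported later in Table 1).

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

svm_model = Pipeline([
    ("vec", HashingVectorizer(ngram_range=(1, 3), alternate_sign=False)),
    ("clf", SVC(kernel="rbf")),                        # RBF kernel separates the two classes
])

rf_model = Pipeline([
    ("vec", TfidfVectorizer(ngram_range=(1, 3))),
    ("clf", RandomForestClassifier(n_estimators=100)), # ensemble of decision trees
])

# X_train and y_train come from the 80-20 holdout split shown earlier:
# svm_model.fit(X_train, y_train)
# rf_model.fit(X_train, y_train)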
E. Evaluation Metrics

1) Accuracy
The ratio of correct predictions, calculated using Equation (1) by dividing the total number of correct predictions by the total number of predictions made [20]:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

2) Precision
A measure of the number of positive ("fake") instances out of all the instances that were predicted as positive, calculated using Equation (2) [20]:

Precision (P) = TP / (TP + FP)    (2)

where TP is the number of correctly predicted samples of the positive class and the denominator is the total number of samples predicted as positive by the model.

3) Recall
A measure of the correctly predicted positive instances over the number of positive instances actually present in the dataset, regardless of whether they were predicted as positive or not. It indicates how successful the model is at finding the true positives and is calculated using Equation (3) [20]:

Recall (R) = TP / (TP + FN)    (3)

where the denominator is the total number of samples that actually belong to the positive class.

4) F1-score
A measure that combines the precision and recall values into a single number, calculated using Equation (4) [20]:

F-Measure (F) = 2 × (P × R) / (P + R)    (4)
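The four metrics of Equations (1) through (4) can be computed with scikit-learn as in the sketch below; the toy label vectors are placeholders for the real test-set predictions.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_test = [1, 0, 1, 1, 0, 1, 0, 0]      # 1 = fake, 0 = real (toy labels for illustration)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]      # in practice: model.predict(X_test)

print("Accuracy :", accuracy_score(y_test, y_pred))    # Eq. (1)
print("Precision:", precision_score(y_test, y_pred))   # Eq. (2): TP / (TP + FP)
print("Recall   :", recall_score(y_test, y_pred))      # Eq. (3): TP / (TP + FN)
print("F1-score :", f1_score(y_test, y_pred))          # Eq. (4)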
IV. EXPERIMENTAL SETUP

To build the proposed models, the Python 3.9 programming language was used. The experiments were performed online using Google Colaboratory notebooks. Eight different experiments were performed to compare the results. First, the two types of vectorization, Hash Vectorization and TF-IDF, were compared. Each technique was used with two n-gram ranges, (1, 3) and (2, 2). The resulting vectors were used to train the SVM and RF classifiers, yielding eight different models.

A. Parameter Tuning

Grid Search Cross Validation (CV) was used for the parameter tuning process [21]. Grid Search CV works by trying all possible sets of parameters until it finds the best set. Table 1 shows the optimal parameter values for the SVM and RF models.

TABLE 1. PARAMETER VALUES USED IN THE EXPERIMENTS

Model   Parameter      Value
SVM     C              10
        kernel         rbf
        gamma          scale
RF      criterion      gini
        max_features   sqrt
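A hedged sketch of this tuning step is shown below; the candidate parameter grids are illustrative values around the optima reported in Table 1, not the authors' full search space.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

svm_grid = GridSearchCV(
    SVC(),
    param_grid={"C": [1, 10, 100], "kernel": ["rbf", "linear"], "gamma": ["scale", "auto"]},
    cv=5, scoring="accuracy",
)

rf_grid = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"criterion": ["gini", "entropy"], "max_features": ["sqrt", "log2"]},
    cv=5, scoring="accuracy",
)

# After vectorizing the training text (X_train_vec), fitting tries every parameter
# combination and keeps the best one:
# svm_grid.fit(X_train_vec, y_train); print(svm_grid.best_params_)   # e.g. C=10, kernel=rbf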
V. RESULTS AND DISCUSSION

After training the eight models, their performances were compared, as indicated in Table 2.

TABLE 2. EXPERIMENTAL RESULTS

Model   Vectorizer        N-grams range   Accuracy   Precision   Recall   F1-score
SVM     Hash Vectorizer   (1,3)           0.98       0.98        0.98     0.98
SVM     Hash Vectorizer   (2,2)           0.97       0.97        0.97     0.97
SVM     TF-IDF            (1,3)           0.97       0.97        0.97     0.97
SVM     TF-IDF            (2,2)           0.97       0.97        0.97     0.97
RF      Hash Vectorizer   (1,3)           0.94       0.94        0.94     0.94
RF      Hash Vectorizer   (2,2)           0.94       0.94        0.94     0.94
RF      TF-IDF            (1,3)           0.93       0.93        0.93     0.93
RF      TF-IDF            (2,2)           0.93       0.93        0.93     0.93

As demonstrated by Table 2, the SVM model with the Hash vectorizer and the (1, 3) n-gram range outperformed the other models, with an accuracy of 98%. The lowest accuracy achieved by SVM was 97%, obtained with TF-IDF (both ranges) and with the Hash vectorizer using the (2, 2) n-gram range. The SVM kernel handles the non-linearity of the data, which could be the reason behind its high performance in these experiments; the kernel function transforms the input data into the required dimension to simplify the data for training the model. Although RF is an ensemble model, it achieved accuracies of 94% and 93% for the Hash and TF-IDF vectorizers respectively. As indicated in the table, changing the n-gram range did not affect the performance of the SVM model with TF-IDF, and the performance of the RF model was not affected by changing the n-gram range with either vectorization technique. This could be due to the nature of RF, as it performs additional feature selection during training by choosing a different set of features for each of the trees. These results show the impact of feature extraction on the models' performance. Furthermore, the hash vectorizer required less time during the training phase, as it needs little memory space and processing time, while producing the highest accuracy. Fig. 3 shows the confusion matrix of the SVM model with Hash vectorization and the (1, 3) n-gram range.

Fig. 3. Confusion Matrix

Fig. 3 indicates that the model performs well at distinguishing between the two classes, as the number of misclassified samples in each class is considerably low compared to the size of the testing set. This suggests that the results of the model are generalizable and that it can be used on new data.
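A confusion matrix like the one in Fig. 3 can be produced as in the sketch below; the toy labels stand in for the SVM model's predictions on the held-out test set and are not the paper's actual results.

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_test = [1, 0, 1, 1, 0, 1, 0, 0]      # toy labels: 1 = fake, 0 = real
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]      # in practice: svm_model.predict(X_test)

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
ConfusionMatrixDisplay(cm, display_labels=["real", "fake"]).plot(cmap="Blues")
plt.show()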

A. Comparison with Benchmark Study

In this study, the WELFake dataset was used to perform eight experiments using different ML models and feature extraction techniques. The dataset was preprocessed and cleaned, and then two vectorization techniques were used, the Hash vectorizer and the TF-IDF vectorizer, each with two n-gram ranges, (1, 3) and (2, 2). The proposed model obtained an accuracy of 98% using the SVM model with the Hash vectorizer and the (1, 3) n-gram range. Another study [15] used the same dataset and proposed different models; its SVM model achieved an accuracy of 96.73%. Table 3 shows a detailed comparison.

TABLE 3. COMPARISON AGAINST THE BENCHMARK STUDY

Study        Model   Dataset           Accuracy   F1-score
This study   SVM     WELFake dataset   98%        98%
[15]         SVM     WELFake dataset   96.73%     96.56%

VI. CONCLUSION

There is a demand for systems that detect fake news in order to protect society from scams, public disturbance, and the other harms this type of news may lead to. By collecting the needed data and using ML algorithms, models can be built to solve these problems and identify anomalies.

Still, this issue has not been completely resolved. Even though there have been several attempts to address it, existing techniques need modification and improvement. In this work, we sought to address the issue by developing ML models that can detect different types of fake news. We applied three different feature extraction techniques to improve our models' performance and developed an SVM model that achieved an accuracy, recall, precision, and F1-score of 98%.

For future work, we aim to implement deep learning techniques to handle fake news. In addition, we plan to generate new datasets related to fake news with new features and study the effect of the different features on the models' detection performance.

ACKNOWLEDGMENT

We would like to thank the SAUDI ARAMCO Cybersecurity Chair at Imam Abdulrahman bin Faisal University for funding this project.

REFERENCES

[1] Á. Figueira and L. Oliveira, "The current state of fake news: Challenges and opportunities," in Procedia Computer Science, 2017, vol. 121, doi: 10.1016/j.procs.2017.11.106.
[2] M. Aljabri and S. Mirza, "Phishing Attacks Detection using Machine Learning and Deep Learning Models," Proc. 2022 7th Int. Conf. Data Sci. Mach. Learn. Appl. (CDMA), pp. 175–180, 2022, doi: 10.1109/CDMA54072.2022.00034.
[3] M. Aljabri et al., "An Assessment of Lexical, Network, and Content-Based Features for Detecting Malicious URLs Using Machine Learning and Deep Learning Models," Comput. Intell. Neurosci., vol. 2022, pp. 1–14, Aug. 2022, doi: 10.1155/2022/3241216.
[4] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, "Fake News Detection on Social Media: A Data Mining Perspective," ACM SIGKDD Explor. Newsl., vol. 19, no. 1, pp. 22–36, Sep. 2017, doi: 10.1145/3137597.3137600.
[5] M. Aljabri, A. A. Alahmadi, R. M. A. Mohammad, M. Aboulnour, D. M. Alomari, and S. H. Almotiri, "Classification of Firewall Log Data Using Multiclass Machine Learning Models," Electronics, vol. 11, no. 12, p. 1851, Jun. 2022, doi: 10.3390/electronics11121851.
[6] M. Aljabri et al., "Intelligent Techniques for Detecting Network Attacks: Review and Research Directions," Sensors, vol. 21, no. 21, p. 7070, Oct. 2021, doi: 10.3390/s21217070.
[7] R. M. A. Mohammad, M. Aljabri, M. Aboulnour, S. Mirza, and A. Alshobaiki, "Classifying the Mortality of People with Underlying Health Conditions Affected by COVID-19 Using Machine Learning Techniques," Appl. Comput. Intell. Soft Comput., vol. 2022, 2022, doi: 10.1155/2022/3783058.
[8] M. Aljabri et al., "Sentiment Analysis of Arabic Tweets Regarding Distance Learning in Saudi Arabia during the COVID-19 Pandemic," Sensors, vol. 21, no. 16, p. 5431, Aug. 2021, doi: 10.3390/s21165431.
[9] K. Stahl, "Fake news detection in social media."
[10] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, and R. Mihalcea, "Automatic Detection of Fake News," Aug. 2017.
[11] "FakeNewsNet Dataset | Papers With Code."
[12] A. Jain, A. Shakya, H. Khatter, and A. K. Gupta, "A Smart System for Fake News Detection Using Machine Learning," IEEE Int. Conf. Issues Challenges Intell. Comput. Tech. (ICICT), Sep. 2019, doi: 10.1109/ICICT46931.2019.8977659.
[13] C. K. Hiramath and G. C. Deshpande, "Fake News Detection Using Deep Learning Techniques," in Proc. 1st IEEE Int. Conf. Adv. Inf. Technol. (ICAIT), pp. 411–415, Jul. 2019, doi: 10.1109/ICAIT47043.2019.8987258.
[14] M. Aldwairi and A. Alwahedi, "Detecting Fake News in Social Media Networks," Procedia Comput. Sci., vol. 141, pp. 215–222, Jan. 2018, doi: 10.1016/j.procs.2018.10.171.
[15] P. K. Verma, P. Agrawal, and R. Prodan, "WELFake dataset for fake news detection in text data," Feb. 2021, doi: 10.5281/zenodo.4561253.
[16] "Text Preprocessing NLP | Text Preprocessing in NLP with Python codes." https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/ (accessed Oct. 27, 2022).
[17] "Vectorization Techniques in NLP [Guide] - neptune.ai." https://neptune.ai/blog/vectorization-techniques-in-nlp-guide (accessed Oct. 27, 2022).
[18] T. Evgeniou and M. Pontil, "Support vector machines: Theory and applications," Lect. Notes Comput. Sci., vol. 2049 LNAI, pp. 249–257, 2001, doi: 10.1007/3-540-44673-7_12.
[19] L. Breiman, "Random Forests," Mach. Learn., vol. 45, no. 1, pp. 5–32, Oct. 2001, doi: 10.1023/A:1010933404324.
[20] "Evaluation Metrics Definition | DeepAI." https://deepai.org/machine-learning-glossary-and-terms/evaluation-metrics (accessed Sep. 13, 2022).
[21] "GridSearchCV for Beginners | by Scott Okamura | Towards Data Science." https://towardsdatascience.com/gridsearchcv-for-beginners-db48a90114ee (accessed Feb. 22, 2022).

