
Arba Minch Institute of Technology

Faculty of Computing and Software Engineering


Title: Fake News Detection Project Using Machine Learning Algorithms Report
Name of student: Solomon Kebede Kekele
ID No: PRAMIT/108/15


Submitted to: Dr Mohammed AMIT
Arba Minch, Ethiopia
June 26, 2023

1 By: Solomon Kebede 09/04/2023


Outline

Introduction
Literature Review
Methodology or Algorithm Used
Problem
Research gap
Related work
Method and results
Strong point
Weak point



INTRODUCTION
 In the modern era, most tasks are done online. Newspapers that were once
preferred as hard copies are being replaced by applications such as Facebook
and Twitter and by news articles read online. WhatsApp forwards are also a
major source.
 The growing problem of fake news complicates matters, because it tries to
change or hamper people's opinions and attitudes towards the use of digital
technology. When a person is deceived by fake news, one possible outcome is
that they start believing their perceptions about a particular topic are true
as assumed.
 Fake news contains misleading information that could be checked. It may
perpetuate a lie about a certain statistic in a country or exaggerate the cost
of certain services, which may give rise to unrest in some countries.



Cont…....
 In this project I implemented Logistic Regression for training, testing,
modeling, and prediction, because it gave the highest accuracy among the
algorithms I fitted to my dataset. With static search, my best model was
Logistic Regression, with an accuracy of 98.50%.
 I then used grid search parameter optimization to improve the performance
of Logistic Regression, which raised the accuracy to 98.80%.
 Feature selection methods were applied to experiment with and choose the
best-fit features to obtain the highest precision, according to the confusion
matrix results.
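The grid search step can be illustrated with a minimal sketch. It tries every combination of candidate parameter values and keeps the one with the best validation score. The toy headlines, the cue-word list, and the `accuracy` scoring function are hypothetical stand-ins for illustration only; they are not the Logistic Regression model or the parameters used in this project. In practice a library routine such as scikit-learn's GridSearchCV performs the same exhaustive search with cross-validation.

```python
from itertools import product

# Hypothetical validation headlines (1 = fake, 0 = real); not the project's data.
VALIDATION = [
    ("SHOCKING miracle cure doctors hate", 1),
    ("You won't believe this shocking secret", 1),
    ("Central bank holds interest rates steady", 0),
    ("Parliament passes annual budget bill", 0),
    ("Miracle diet melts fat overnight", 1),
    ("City council approves new bus routes", 0),
]
# Hypothetical cue words used by the stand-in classifier.
CUES = {"shocking", "miracle", "secret", "believe"}


def accuracy(threshold, lowercase):
    """Score a stand-in classifier: predict fake if cue-word hits >= threshold."""
    correct = 0
    for text, label in VALIDATION:
        tokens = text.lower().split() if lowercase else text.split()
        hits = sum(token in CUES for token in tokens)
        predicted = 1 if hits >= threshold else 0
        correct += predicted == label
    return correct / len(VALIDATION)


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:  # keep the first combination on ties
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(
    {"threshold": [1, 2, 3], "lowercase": [False, True]}, accuracy
)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (here 3 x 2 = 6 evaluations), which is why grids are usually kept small.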



Cont….
 One solution is to develop a system that provides a credible automated index
score, or rating, for the credibility of different publishers and news contexts.
 This paper proposes a methodology for building a model that detects whether an
article is authentic or fake based on its words, phrases, sources, and titles, by
applying supervised machine learning algorithms to a labeled dataset. I have also
developed my Fake News Detection system, which takes input from the user and
classifies it as true or fake.
 The algorithms that can be used in fake news detection systems include Support
Vector Machines, Random Forests, Decision Trees, Stochastic Gradient Descent,
Naïve Bayes, and Logistic Regression. The best model, i.e. the model with the
highest accuracy, is used to classify the news headlines or articles.



Cont…..

 The resulting model will be tested on unseen data and the results will be
plotted. The final product is a model that detects and classifies fake
articles and can be integrated with other systems for future use.



LITERATURE REVIEW

Title 1: Fake News Detection Project Using Machine Learning Algorithms


Objective: The main objective of this paper is to develop a system that can detect fake news
articles available online.
 The authors aim to achieve this by using various concepts and techniques from Artificial
Intelligence, Natural Language Processing, and Machine Learning.
 The spread of fake news has become a major concern in today's world as it can create
biased opinions among people and even sway election outcomes for the benefit of certain
candidates or parties. Moreover, spammers use appealing headlines to generate revenue
through click-baits.
 To address these issues, the authors propose a binary classification approach where
each article is classified as either real or fake based on its content. They plan to use
machine learning algorithms such as Support Vector Machines (SVM), Random Forests
(RF), and Naive Bayes (NB), along with natural language processing techniques such as
the bag-of-words model and word embeddings.
 Overall, the goal of this paper is to provide an effective solution for detecting fake news
articles which will help prevent their negative impact on society.
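As a minimal sketch of the bag-of-words model mentioned above, the snippet below turns each document into a vector of word counts over a shared vocabulary. The two example sentences and the whitespace tokenizer are assumptions made for illustration; the reviewed paper does not specify its tokenization.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index, in sorted order."""
    tokens = {token for doc in documents for token in doc.lower().split()}
    return {token: index for index, token in enumerate(sorted(tokens))}


def bag_of_words(document, vocabulary):
    """Return a count vector for one document; unknown tokens are ignored."""
    counts = Counter(document.lower().split())
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


# Hypothetical example documents.
docs = ["the senator said the bill passed", "shocking secret the media hides"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
for doc, vector in zip(docs, vectors):
    print(vector, "<-", doc)
```

Word order is discarded in this representation; only the counts reach the classifier.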



Problem:
The paper discusses the problem of fake news being circulated online
and its negative consequences such as creating biased opinions or
swaying election outcomes. The authors propose using Artificial
Intelligence (AI), Natural Language Processing (NLP), and Machine
Learning (ML) techniques to classify news articles as real or fake, which
is a challenging task due to the complexity of natural language processing
and the need for large datasets with labeled examples. Additionally, there
may be ethical concerns around censorship if this technology were used
by governments or social media platforms to control what information
people have access to. Finally, while machine learning algorithms can
achieve high accuracy rates in detecting fake news, they are not perfect
and may still make mistakes that could have serious consequences if
relied upon too heavily without human oversight.



Research Gap
 The research gap addressed by this paper is the lack of a comprehensive and
accurate system for detecting fake news articles. While there have been previous
attempts to address this issue, they often rely on manual fact-checking or simple
rule-based approaches that are not effective enough.
 Moreover, most existing systems focus only on specific aspects of the problem such
as identifying click-bait headlines or analyzing social media posts.
 There is a need for an integrated approach that can analyze various features of an
article including its content, source credibility and writing style.
 This paper aims to bridge this research gap by proposing a binary classification
approach using machine learning algorithms along with natural language processing
techniques.
 The authors believe that their proposed system will be more accurate than existing
methods and will help prevent the negative impact of fake news articles on society.
 In summary, while some progress has been made in detecting fake news articles,
there still exists a significant research gap in developing an effective solution which
takes into account multiple factors related to these articles.



Research methods and results:
The paper proposes a binary classification approach for detecting fake news articles using
machine learning algorithms and natural language processing techniques.
The authors collected a dataset of news articles from various sources, both real and fake,
to train their models.
They used several features, such as the frequency of words in the article, a source
credibility score, and writing style, to represent each article. They then applied different
machine learning algorithms, such as Support Vector Machines (SVM), Random Forests (RF),
and Naive Bayes (NB), to these features to classify each article as either real or fake.
The results showed that their proposed system achieved high accuracy in detecting fake
news articles, with an F1-score ranging from 0.85 to 0.95 depending on the algorithm used.
They also compared their system with existing methods and found that it outperformed
them significantly.
Overall, this research method involved collecting datasets of both real and fake news,
then training multiple classifiers using different feature representations, before
evaluating performance with metrics such as precision, recall, and F1-score, which are
commonly used in binary classification tasks like this one.
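For reference, the evaluation metrics named above can be computed directly from the confusion matrix counts. The sketch below is illustrative only: the label lists are made-up values, not results from the reviewed paper, and "fake" is assumed to be the positive class (label 1).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive and truth == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = their harmonic mean."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = fake, 0 = real.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(confusion_counts(y_true, y_pred), precision, recall, f1)
```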



Strong Points
One of the strong points of this paper is its comprehensive approach to detecting fake news
articles. The authors have proposed a binary classification system that takes into account
multiple factors related to an article such as content, source credibility and writing style.
The use of machine learning algorithms along with natural language processing techniques
has enabled the system to achieve high accuracy in detecting fake news articles.
The results showed that their proposed system achieved an F1-score ranging from 0.85-0.95
depending on the algorithm used, which outperformed existing methods significantly.
Another strength of this paper is its practical application in addressing a real-world problem
- the spread of fake news articles online and their negative impact on society including
biased opinions and even swaying election outcomes for certain candidates or parties.
Overall, this paper provides a valuable contribution towards developing effective solutions
for detecting fake news articles using advanced technologies like artificial intelligence and
natural language processing techniques which can help prevent their negative impact on
society.



Weak Points
 One weak point of this paper is the limited scope of the dataset used for training
and testing. The authors collected a relatively small dataset of news articles from
various sources, both real and fake.
 While they achieved high accuracy in detecting fake news articles using their proposed
system, it remains to be seen how well it will perform on larger datasets with more diverse
types of content. Another potential weakness is that the authors did not provide much detail
about how they selected or curated their dataset.
 It is unclear whether any bias was introduced during data collection, which could affect
the generalizability or applicability of their results.
 Additionally, while machine learning algorithms are effective at identifying patterns within
datasets, they can also be prone to overfitting if not properly validated against independent
test sets. The authors did use cross-validation techniques to mitigate this risk, but further
validation on external datasets would strengthen their findings.
 Finally, while binary classification systems are useful for distinguishing between two
classes (real vs fake), there may be cases where an article contains some elements that are
true and others that are false, making it difficult to classify as one or the other without
additional context analysis beyond what was presented in this paper.
 Overall, these weaknesses do not detract significantly from the value provided by this
research but should still be considered when interpreting its results.
2. An Integrated Machine Learning Framework for Effective
Prediction of Cardiovascular Diseases

Title 2: Effective Heart Disease Prediction Using Hybrid Machine Learning Techniques

Problem:
 This research addresses the problem of effectively predicting cardiovascular
diseases, which affect the heart or blood vessels, through machine learning. It
does so by measuring, diagnosing, and controlling the factors responsible for
these diseases, such as high blood pressure, smoking, diabetes, body mass index
(BMI), cholesterol, age, and family history.
 These diseases cause the highest number of deaths globally. Early prediction is
therefore very important so that precautionary measures can be taken before
something serious happens.



What they did:
 In this article, a MaLCaDD (Machine Learning based Cardiovascular Disease
Diagnosis) framework is proposed for the effective and precise prediction of
cardiovascular diseases.
 The framework has four phases. The first phase handles missing values via
the mean replacement technique.
 In the second phase, the data imbalance issue is resolved via the Synthetic
Minority Over-sampling Technique (SMOTE). In the third phase, feature selection
is performed using the feature importance technique.
 Finally, an ensemble of Logistic Regression (LR) and K-Nearest Neighbor (KNN)
is proposed for improved prediction. The framework is validated on three
benchmark datasets (Framingham, Heart Disease, and Cleveland), achieving
accuracies of 99.1%, 98.0%, and 95.5% respectively. According to the paper,
MaLCaDD is highly reliable and can be applied in a real environment for the
early diagnosis of cardiovascular diseases.
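The first two phases described above (mean replacement and minority over-sampling) can be sketched as follows. This is a simplified illustration under stated assumptions, not the MaLCaDD implementation. The feature values are invented, `impute_mean` and `smote_like` are hypothetical helper names, and `smote_like` only mimics the core SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours. The paper's tooling (imblearn) provides a full SMOTE implementation. Phases three and four are not covered here.

```python
import random


def impute_mean(rows):
    """Replace None in each column with the mean of that column's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


# Hypothetical rows: [systolic blood pressure, BMI]; None marks a missing value.
X = [[120.0, 25.0], [None, 31.0], [150.0, None],
     [135.0, 28.0], [160.0, 33.0], [118.0, 22.0]]
y = [0, 0, 1, 0, 1, 0]  # 1 = disease (minority class)

X = impute_mean(X)
minority = [row for row, label in zip(X, y) if label == 1]
synthetic = smote_like(minority, n_new=y.count(0) - y.count(1))
X_balanced = X + synthetic
y_balanced = y + [1] * len(synthetic)
print(len(X_balanced), y_balanced.count(0), y_balanced.count(1))
```

Over-sampling is normally applied to the training split only, so that synthetic points do not leak into the data used for evaluation.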



Research techniques & Tools

 This research paper used exploratory analysis and experimental design
science research. Exploratory research is a methodological approach
that investigates research questions that have not previously been
studied in depth. Data processing was performed in Python, using
libraries such as Pandas, NumPy, Seaborn, matplotlib, sklearn, and
imblearn. The above-mentioned techniques were used for the
experimental setup and validation.



Research Gap:
 A key concern in machine learning research is preparing the data
effectively, because missing values and class imbalance strongly affect
the accuracy of the model.
 Missing values and class imbalance must therefore be handled before any
classification mechanism is proposed.
 Another key step in any machine learning process is selecting the right
subset of features during feature extraction. However, a versatile
framework that is applicable to a wide variety of datasets, accounts for
missing values and class imbalance, and makes reliable predictions (with
minimal features and reduced computational complexity) is hard to find
in the literature.
 In addition, such an integrated framework for cardiovascular diseases
specifically is hard to find in the literature.



Strong point:

 Preprocessing the data to handle missing values, and balancing the data
to avoid overfitting or underfitting of the model, make this paper
stronger than previous research.



Weaknesses of the paper & Recommendations to improve

 Although the paper tries to cover the most important aspects, several
issues remain that, if addressed properly, would make it easier for
readers to understand. The following weaknesses were identified based
on my understanding of the paper:
 The data sample size and time period are too small.
 Other attributes of CVD may be present in different hospitals around
the world, but this paper's work does not include or effectively predict
for those factors.



Title 3: Predicting the Risk of Alcohol Use Disorder Using Machine
Learning: A Systematic Literature Review

Problem: This research addresses the problem of predicting alcohol
use disorder (AUD) in order to reduce the mortality rate. Many
researchers have worked on AUD prediction using machine learning (ML)
techniques. However, there is a lack of a comprehensive systematic
literature review (SLR) that summarizes the existing studies on AUD
prediction using ML in the last ten years.



The main contributions of this research are:

 To review studies from the past decade (January 2010 to July 2021)
along five different dimensions, including the data pre-processing and
sampling techniques that have been used to prepare datasets for AUD
prediction.
 To explore the different types of datasets used in predicting AUD
using ML techniques.
 To analyze the types of features and variables that contribute to the
development of AUD and the techniques used for extraction, selection,
and reduction of the intended features, as well as the ML algorithms
that have been used for AUD prediction and their performance.
 To outline open issues and research challenges related to ML-based
AUD prediction.



Research Gap:

 There is a lack of systematic and rigorous reviews of studies on AUD
prediction, including their data pre-processing and sampling techniques.
 This paper systematically reviews those studies and their data in order
to support effective AUD prediction.



Methods and Results:

The methodology of this paper follows the SLR (Systematic Literature
Review) guidelines. The three steps used to review the work from 2010 to
2021 are planning, implementation, and reporting. Planning covers the
problem statement, objective, and protocol. Implementation covers quality
assessment, data extraction, search keywords, queries, and procedures.
Reporting covers synthesizing the data and analyzing it critically. These
are the SLR guidelines that this paper followed to review the past decade
of work. As mentioned above, the studies were comprehensively reviewed
from five different aspects: collection sites, types, and characteristics
of datasets; data pre-processing and data sampling techniques; feature
types; feature selection and feature extraction techniques; and ML
algorithm utilization and performance evaluation metrics.



Algorithm / Model / Method / Approach

 I implemented two classification algorithms for the prediction model:
the Logistic Regression model and the Naïve Bayes classifier model. I
primarily used the Logistic Regression model to train and test my
dataset, because it yields good binary classification performance
compared to the others. The algorithms and the details of
implementation are explained below.



Logistic Regression

 Logistic Regression is a machine learning technique that uses statistical
methods to estimate relationships among variables. It is well suited to
binary classification problems because it predicts class probabilities,
which is why I chose it as my baseline run.
 It relies on fitting the predicted probability of the true class to the
proportion of true cases actually observed.
 This algorithm also does not require large sample sizes to start giving
fairly good results.
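A minimal from-scratch sketch of the idea: a weighted sum of the features is passed through the sigmoid function to give a probability, and the weights are adjusted by gradient descent to reduce the gap between predicted probabilities and observed labels. The two features and the toy data are assumptions for illustration, not the project's features. The project itself would rely on a library implementation rather than this code.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Estimated probability that the example belongs to class 1 (fake)."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))


def train_logistic_regression(X, y, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(X, y):
            error = predict_proba(features, weights, bias) - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


# Hypothetical features per article: [sensational word count, cited source count].
X = [[3, 0], [4, 1], [2, 0], [0, 2], [1, 3], [0, 4]]
y = [1, 1, 1, 0, 0, 0]  # 1 = fake, 0 = real

weights, bias = train_logistic_regression(X, y)
predictions = [1 if predict_proba(row, weights, bias) >= 0.5 else 0 for row in X]
print(weights, bias, predictions)
```

The 0.5 cut-off is a convention, and it can be moved if false positives and false negatives carry different costs.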



4. Data Description

The dataset for this project was built with a mix of both real and fake
news. The entire dataset amounted to 44,898 news articles, of which
23,481 were fake news and 21,417 were real news. The sources of real
and fake news include Yahoo News, AOL, Reuters, Bloomberg, USA
NewsFlash, Truth-Out, Controversial Files, and so on. To extract
important content from the crawled pages I used two strategies. The
first was to reduce noise by removing insignificant and irrelevant
information such as images, tables, headers, footers, special symbols,
and navigation bars. With this I was able to extract most of the
important information across many web pages. Since each website has
its own layout style and parameters, a one-size-fits-all strategy would
have failed, so I used a generic approach. The collected data was
processed using various text preprocessing measures, as explained
later, and stored in CSV files. The real and fake data were then merged
and shuffled to get a CSV file containing a consolidated randomized
dataset. From the consolidated randomized …

4.1 Real News

 The News Aggregator Dataset from Kaggle’s Getting … was used to
extract real news. This dataset consists of links to the news articles
as originally published on their websites. I extracted the body content
of the articles by removing unnecessary information such as headers,
footers, images, advertisements, and tables. The total number of real
news articles is 21,417.



4.2 Fake News

 For fake news I used Kaggle’s ‘Getting Real about Fake News’ dataset.
The CSV file was available off the shelf, so I had to perform only
minimal text processing on this data. The total number of fake news
articles, as mentioned above, is 23,481.
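The merge-and-shuffle step described in Section 4 can be sketched as below. The column names (`text`, `label`), the fixed seed, and the sample articles are assumptions for illustration; the actual CSV layout used in this project is not shown in the report.

```python
import csv
import io
import random


def merge_and_shuffle(real_texts, fake_texts, seed=42):
    """Label each article, combine both sources, and shuffle reproducibly."""
    rows = [{"text": text, "label": "real"} for text in real_texts]
    rows += [{"text": text, "label": "fake"} for text in fake_texts]
    random.Random(seed).shuffle(rows)
    return rows


def to_csv(rows):
    """Serialize the labelled rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical article bodies standing in for the two source files.
real = ["Central bank holds rates steady", "Council approves new bus routes"]
fake = ["Miracle cure doctors hate", "Aliens endorse candidate", "Moon made of cheese"]

dataset = merge_and_shuffle(real, fake)
csv_text = to_csv(dataset)
print(csv_text)
```

Shuffling with a fixed seed keeps the consolidated file reproducible, so later train and test splits can be compared across runs.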



Conclusions

In the modern era, most tasks are done online. Newspapers that were
once preferred as hard copies are being replaced by applications such
as Facebook and Twitter and by news articles read online. WhatsApp
forwards are also a major source. The growing problem of fake news
complicates matters, because it tries to change or hamper people's
opinions and attitudes towards the use of digital technology. To
address this challenge, I have developed my Fake News Detection system,
which takes input from the user and classifies it as true or fake. To
implement it, various data preprocessing, vectorization, and machine
learning techniques are used. The model is trained on an appropriate
dataset from Kaggle, and its performance is evaluated using various
performance measures. The best model, i.e. the model with the highest
accuracy, is used to classify the news headlines or articles. With
static search, my best model was Logistic Regression, with an accuracy
of 98.50%. I then used grid search parameter optimization to improve
the performance of Logistic Regression, which raised the accuracy to
98.80%. Hence, if a user feeds a particular news article or its
headline into my model, there is about a 98.80% chance, judging by this
accuracy, that it will be classified correctly.



References

1. Senthilkumar Mohan and Gautam Srivastava. Effective Heart Disease Prediction Using
Hybrid Machine Learning Techniques. India: s.n., June 19, 2019.
2. Muhammad Waseem Anwar. An Integrated Machine Learning Framework for Effective
Prediction of Cardiovascular Diseases. Saudi Arabia: IEEE, June 25, 2021.
3. Ali Ebrahimi. Predicting the Risk of Alcohol Use Disorder Using Machine Learning: A
Systematic Literature Review. Denmark: IEEE, November 8, 2021.
4. Bilandi N, Verma HK, Dhir R. An Intelligent and Energy-Efficient Wireless Body Area
Network to Control Coronavirus Outbreak. India: Arabian Journal for Science and
Engineering, Feb 26, 2021.
5. Susan Li. towardsdatascience. [Online] Towards Data Science, Sep 29, 2017.
https://towardsdatascience.com/.
