Fake News Detection: Bachelor of Technology Information Technology

FAKE NEWS DETECTION
A report submitted in partial fulfilment of the requirements for the award of the degree of
BACHELOR OF TECHNOLOGY
in
INFORMATION TECHNOLOGY
By
GURLEEN SINGH (IIITU17318)
SCHOOL OF COMPUTING
INDIAN INSTITUTE OF INFORMATION TECHNOLOGY UNA

HIMACHAL PRADESH
MAY 2021
BONAFIDE CERTIFICATE
This is to certify that the project titled FAKE NEWS DETECTION is a bonafide record of
the work done by
GURLEEN SINGH (IIITU17318)
in partial fulfilment of the requirements for the award of the degree of Bachelor of
Technology in INFORMATION TECHNOLOGY of the INDIAN INSTITUTE OF
INFORMATION TECHNOLOGY UNA, HIMACHAL PRADESH, during the year 2020–2021,
under the guidance of
DR. NIDHI KUSHWAHA
Project viva-voce held on: 27 MAY 2021
Examiner (DR. NIDHI KUSHWAHA)
School of Computing, IIITU [IT-4807] : 17318
ORIGINALITY / NO PLAGIARISM DECLARATION
I certify that this project report is my original work and that no part of it is copied from
any published reports, papers, books, articles, etc. I certify that all the contents of this
report are based on my personal findings and research, and that I have cited all the relevant
sources used in the preparation of this project report, whether they be books, articles,
reports, lecture notes, or any other kind of document. I also certify that this report has not
previously been submitted, in part or as a whole, for the award of a degree at any other
university in India and/or abroad.
I hereby declare that I am fully aware of what constitutes plagiarism and understand that
if this report is found at a later stage to contain any instance of plagiarism, my degree may
be cancelled.
ABSTRACT
Recent political events have led to an increase in the popularity and spread of fake news.
As demonstrated by the widespread effects of the large onset of fake news, humans are
inconsistent if not outright poor detectors of fake news. Because of this, efforts have been
made to automate the process of fake news detection. The most popular of such attempts
include “blacklists” of sources and authors that are unreliable. While these tools are useful,
a more complete end-to-end solution must account for the more difficult cases where
reliable sources and authors release fake news. As such, the goal of this project is to create
a tool for distinguishing fake news from real news using machine learning techniques. The
results of this project demonstrate that machine learning can be useful in this task. The
project is implemented in Python. Using sklearn, we build a TF-IDF Vectorizer on our
dataset. Then, we initialize a Passive Aggressive Classifier and fit the model. In the end,
the accuracy score and confusion matrix tell us how well our model fares. In addition to
PAC, two more classifiers are initialized, but PAC achieves the highest accuracy score of
the three.
Keywords: TF, IDF, PAC.
ACKNOWLEDGEMENT
I would like to thank the following people for their support and guidance, without whom the
completion of this project would not have been possible.
I would like to express my sincere gratitude and heartfelt thanks to Dr. Nidhi Kushwaha for
their unflinching support and guidance, valuable suggestions and expert advice. Their
words of wisdom and expertise in the subject matter were of immense help throughout the
duration of this project.
I also take the opportunity to thank our Director and all the faculty of the School of
Computing, IIIT Una for helping me by providing the necessary knowledge base and resources.
I would also like to thank my parents and friends for their constant support.
GURLEEN SINGH (IIITU17318)
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
TABLE OF CONTENTS
LIST OF ACRONYMS
LIST OF FIGURES
1 Introduction
1.1 Fake and Real News
1.1.1 What is Fake News?
2 Detecting Fake News
2.1 Datasets
2.2 TfidfVectorizer
2.2.1 Classifier
2.3 Detection Steps
References
Appendices
LIST OF ACRONYMS
TF Term Frequency
IDF Inverse Document Frequency
PAC Passive Aggressive Classifier
LIST OF FIGURES
1. 5 Records from Dataset
2. Dataframe Labels
3. Accuracy and confusion matrix by PAC
Chapter 1
Introduction
1.1 Fake and Real News
The rise of fake news during the 2016 U.S. Presidential Election highlighted not
only the dangers of the effects of fake news but also the challenges presented when
attempting to separate fake news from real news. Fake news may be a relatively
new term, but it is not a new phenomenon. Fake news has been around at least
since the appearance and popularity of one-sided, partisan newspapers in the 19th
century. However, advances in technology and the spread of news through different
types of media have increased the spread of fake news today. As such, the effects of
fake news have grown sharply in the recent past, and something must be done to
prevent this from continuing in the future. I have identified the three most prevalent
motivations for writing fake news and chosen only one as the target for this project
as a means to narrow the search in a meaningful way. The first motivation, which
dates back to the 19th-century one-sided party newspapers, is to influence public
opinion. The second, which requires more recent advances in technology, is the use
of fake headlines as clickbait to raise money. The third, which is equally prominent
yet arguably less dangerous, is satirical writing. While all three subsets of fake
news, namely influential, clickbait, and satire, share the common thread of being
fictitious, their widespread effects are vastly different. Therefore, the goal is to use
machine learning to classify, at least as well as humans, the more difficult
discrepancies between real and fake news.
There are two ways in which machines could attempt to solve the fake news
problem better than humans. The first is that machines are better at detecting and
keeping track of statistics than humans. For example, it is easier for a machine to
detect that the majority of verbs used are “suggests” and “implies” rather than
“states” and “proves.” The second is that machines may be more efficient at
surveying a knowledge base to find all relevant articles and answering based on
those many different sources. Either of these methods could prove useful in
detecting fake news, but this project focuses on how a machine can solve the fake
news problem using supervised learning that extracts features of the language and
content only within the source in question, without using any fact checker or
knowledge base. For many fake news detection techniques, a “fake” article
published by a trustworthy author through a trustworthy source would not be
caught. This approach would combat those “false negative” classifications of fake
news. In essence, the task is equivalent to what a human faces when reading a hard
copy of a newspaper article in a coffee shop, without internet access or outside
knowledge of the subject (versus reading something online, where one can simply
look up relevant sources). The machine, like the human in the coffee shop, will have
access only to the words in the article and must use strategies that do not rely on
blacklists of authors and sources.
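As a minimal sketch of the first idea, the following Python snippet counts how often an
article uses hedging verbs versus assertive verbs. The two word lists, the function name
hedging_ratio and the sample sentence are illustrative assumptions, not part of the project
code, and such a ratio is only one possible language feature.

```python
import re
from collections import Counter

# Hypothetical word lists, chosen only for illustration
HEDGING = {"suggests", "implies"}
ASSERTIVE = {"states", "proves"}

def hedging_ratio(text):
    """Return the share of hedging verbs among all hedging and assertive verbs."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    hedging = sum(words[w] for w in HEDGING)
    assertive = sum(words[w] for w in ASSERTIVE)
    total = hedging + assertive
    # Avoid dividing by zero when none of the listed verbs occur
    return hedging / total if total else 0.0

sample = "The report suggests a link and implies a cause, but it proves nothing."
print(hedging_ratio(sample))
```

A supervised classifier would learn from many such features at once rather than from a
single hand-picked ratio.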
1.1.1 WHAT IS FAKE NEWS?
A type of yellow journalism, fake news encapsulates pieces of news that may be hoaxes
and is generally spread through social media and other online media. This is often done to
further or impose certain ideas and is often driven by political agendas. Such news items
may contain false and/or exaggerated claims, may be made viral by algorithms, and may
leave users trapped in a filter bubble.
Chapter 2
Detecting Fake News
2.1 Datasets
The lack of manually labeled fake news datasets is certainly a bottleneck for advancing
computationally intensive, text-based models that cover a wide array of topics. The dataset
for the Fake News Challenge does not suit our purpose because it contains the ground truth
regarding the relationships between texts but not whether those texts are actually true or
false statements. The dataset I have used for this Python project is news.csv. This dataset
has a shape of 7796×4. The first column identifies the news, the second and third are the
title and text, and the fourth column has labels denoting whether the news is REAL or
FAKE.
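To make the column layout concrete, the sketch below parses a tiny in-memory CSV with the
same four columns using only the Python standard library. The two rows are invented
placeholders and the header name id for the first column is an assumption; title, text and
label follow the description above. The project itself reads the real file with pandas.

```python
import csv
import io
from collections import Counter

# Two made-up rows in the four-column layout; "id" is a hypothetical header name
SAMPLE_CSV = """id,title,text,label
1,Example headline A,Body of the first example article.,FAKE
2,Example headline B,Body of the second example article.,REAL
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Shape as (number of records, number of columns)
shape = (len(rows), len(rows[0]))

# How many records carry each label
label_counts = Counter(row["label"] for row in rows)

print(shape)
print(dict(label_counts))
```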
2.2 TfidfVectorizer
TF (Term Frequency): The number of times a word appears in a document is its Term
Frequency. A higher value means a term appears more often than others, and so the
document is a good match when the term is part of the search terms.
IDF (Inverse Document Frequency): Words that occur many times in a document, but also
occur many times in many other documents, may be irrelevant. IDF is a measure of how
significant a term is in the entire corpus.
The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF
features.
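The two definitions can be combined into one weight per term and document. The sketch
below uses the textbook form tf × log(N / df), where N is the number of documents and df is
the number of documents containing the term; the three-document corpus and the function
name tf_idf are illustrative assumptions. sklearn's TfidfVectorizer applies a smoothed IDF
and normalizes each row by default, so its numbers will differ from these, although the
underlying idea is the same.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lower-cased and free of punctuation
docs = [
    "the election results were announced",
    "the aliens rigged the election",
    "the weather was sunny",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents in which each term occurs
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    if df[term] == 0:
        return 0.0  # term never seen in the corpus
    tf = tokens.count(term)
    return tf * math.log(n_docs / df[term])

# "the" occurs in every document, so its weight is zero despite a high count
print(tf_idf("the", tokenized[1]))
# "aliens" occurs in a single document, so it receives a high weight
print(tf_idf("aliens", tokenized[1]))
```

Common words are therefore pushed towards zero, while words that are specific to a few
articles dominate the feature vector.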
2.2.1 Classifier
The Passive Aggressive Classifier is used in this project because it achieved a higher
accuracy score than the other classifiers tried. Passive Aggressive algorithms are online
learning algorithms. Such an algorithm remains passive for a correct classification
outcome, and turns aggressive in the event of a misclassification, updating and adjusting
its weights. Unlike most other algorithms, it does not converge. Its purpose is to make
updates that correct the loss while causing very little change in the norm of the weight
vector.
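The passive and aggressive behaviour can be sketched with the basic update rule of the
passive aggressive family for binary labels y in {-1, +1}: compute the hinge loss
max(0, 1 - y (w · x)); if it is zero, leave w unchanged, otherwise add tau · y · x with
tau = loss / ||x||². The function name pa_update and the toy vectors are illustrative
assumptions, and sklearn's PassiveAggressiveClassifier additionally bounds the step size
through its C parameter, so this is not the library's exact implementation.

```python
def dot(a, b):
    """Dot product of two equal-length lists of numbers."""
    return sum(ai * bi for ai, bi in zip(a, b))

def pa_update(w, x, y):
    """One basic passive aggressive step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))
    if loss == 0.0:
        return w  # passive: the example is classified with enough margin
    tau = loss / dot(x, x)  # aggressive: smallest step that removes the loss
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]
w = pa_update(w, [1.0, 2.0], +1)  # the zero vector has loss 1, so w changes
print(w)

# An example that already has a margin above 1 leaves the weights untouched
unchanged = pa_update([1.0, 0.0], [2.0, 0.0], +1)
print(unchanged)
```

Each aggressive step is the smallest change to w that brings the current example back to
the margin, which is why the norm of the weight vector changes very little.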
2.3 Detection Steps
1. Make necessary imports.
2. Read the data into a DataFrame, and get the shape of the data and the first 5
records.
Fig. 1
3. Get the labels from the DataFrame.
Fig. 2
4. Split the dataset into training and testing sets.
5. Initialize a TfidfVectorizer with stop words from the English language and a
maximum document frequency of 0.7 (terms with a higher document frequency will
be discarded). Stop words are the most common words in a language, which are
filtered out before processing the natural language data. A TfidfVectorizer turns
a collection of raw documents into a matrix of TF-IDF features. Fit the vectorizer
on the training set and transform it, then transform the test set with the fitted
vectorizer.
6. Initialize a PassiveAggressiveClassifier and fit it on tfidf_train and y_train.
7. Predict on the TF-IDF features of the test set and calculate the accuracy with
accuracy_score() from sklearn.metrics.
8. Initialize other classifiers and compare the accuracy scores.
9. Print out a confusion matrix to gain insight into the number of false and true
negatives and positives.
Fig. 3
Using the passive aggressive classifier, we have 588 true positives, 589 true negatives,
40 false positives, and 50 false negatives.
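The four counts above determine the usual summary metrics. The sketch below derives
accuracy, precision and recall from them using the standard definitions; the variable names
are illustrative, and the output is simply what these reported counts imply.

```python
# Counts reported above for the passive aggressive classifier
tp, tn, fp, fn = 588, 589, 40, 50

total = tp + tn + fp + fn
accuracy = (tp + tn) / total   # share of all test articles classified correctly
precision = tp / (tp + fp)     # share of predicted positives that are correct
recall = tp / (tp + fn)        # share of actual positives that are found

print(f"test articles: {total}")
print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
```

Precision and recall are worth reporting alongside accuracy because they show whether the
errors are mostly false positives or false negatives.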
References
[1] M. Risdal. (2016, Nov) Getting real about fake news. [Online]. Available:
https://www.kaggle.com/mrisdal/fake-news
[2] J. Soll, T. Rosenstiel, A. D. Miller, R. Sokolsky, and J. Shafer. (2016, Dec) The long
and brutal history of fake news. [Online]. Available:
https://www.politico.com/magazine/story/2016/12/fake-news-history-long-violent-214535
[3] C. Wardle. (2017, May) Fake news. It’s complicated. [Online]. Available:
https://firstdraftnews.com/fake-news-complicated
[4] T. Ahmad, H. Akhtar, A. Chopra, and M. Waris Akhtar, “Satire detection from web
documents using machine learning methods,” pp. 102–105, 09 2014.
Appendices
Appendix A
Code Attachments
A.1 Code To Detect Fake News
# Install the dependencies first (shell command):
#   pip install numpy pandas scikit-learn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Read the data and inspect its shape and first 5 records
df = pd.read_csv('F:\\news.csv')
print(df.shape)
print(df.head())

# Get the labels (REAL or FAKE)
labels = df.label
print(labels.head())

# Split into training (80%) and testing (20%) sets
x_train, x_test, y_train, y_test = train_test_split(
    df['text'], labels, test_size=0.2, random_state=7)

# Build TF-IDF features: fit on the training set only, then transform the test set
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)

# Fit the passive aggressive classifier
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train, y_train)

# DataFlair - Predict on the test set and calculate accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')
print(confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL']))

# Fit the other classifiers and report their accuracy for comparison
kn = KNeighborsClassifier()
kn.fit(tfidf_train, y_train)
kn_score = accuracy_score(y_test, kn.predict(tfidf_test))
print(f'KNeighborsClassifier accuracy: {round(kn_score*100,2)}%')

dt = DecisionTreeClassifier()
dt.fit(tfidf_train, y_train)
dt_score = accuracy_score(y_test, dt.predict(tfidf_test))
print(f'DecisionTreeClassifier accuracy: {round(dt_score*100,2)}%')

# Result reported for this project
print("Passive aggressive classifier is best with highest accuracy of 92.98%")
print("Accuracy of the other classifiers is lower than that of passive aggressive")