Final CPE Report
ON
“FAKE NEWS DETECTION
USING LSTM”
SUBMITTED TO
MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION,
MUMBAI
SUBMITTED BY
GUIDED BY
Mr. S. M. Bankar
CERTIFICATE
This is to certify that the project report entitled “FAKE NEWS DETECTION USING
LSTM” was successfully completed by students of the Sixth Semester, DIPLOMA IN
COMPUTER ENGINEERING, in partial fulfillment of the requirements for the award of
the Diploma in Computer Engineering, and submitted to the Department of Computer
Engineering of Government Polytechnic Ambad. The work was carried out during the
academic year 2023-24 as per the curriculum.
Secondly, I would also like to thank my parents and friends, who helped me a lot in
finalizing this project within the limited time frame.
2. Sarthak Deshmukh 11
3. Vishal Ugale 37
4. Piyush Chaudhari 50
Date:
Place: Ambad.
ABSTRACT
Designed to deceive readers and manipulate public opinion, fake news can be created for a variety
of reasons ranging from political propaganda to generating revenue through clickbait. Another
significant challenge in combating fake news is the difficult balance between curbing
misinformation and preserving free speech, though some argue for stricter regulations to control the
spread of fake news. Thus, the purpose of this study is to identify fake news using Long
Short-Term Memory (LSTM) networks. LSTM models are often used to analyze the linguistic
features of news articles and social media posts. The dataset we used is a fake news
dataset from Kaggle. The proposed method identifies fake news with average precision,
recall, accuracy, and F-measure values of 0.99 each. The results showed that LSTM
provides superior performance
compared to the Support Vector Classifier, Logistic Regression, and Multinomial Naive Bayes
methods.
Keywords: Fake News Classification, LSTM, Deep Learning
Index

Sr. No.  Title                                            Page No.
1.  Introduction                                          6
2.  Editor, Browser & Server Used                         7
3.  Research Method                                       8-20
      1. Data Collection
      2. Data Preprocessing
      3. Model Training
      4. Evaluation
      5. Analysis and Result
4.  Areas of Improvement                                  21-22
5.  Hardware Requirement Specification                    23
6.  RAM and Disk Used at Time of Model Training           24
7.  Conclusion                                            25
8.  References                                            26
1. INTRODUCTION
The proliferation of fake news in recent years has become a significant challenge,
undermining the trustworthiness of information and disrupting societal discourse. Fake
news, deliberately crafted to deceive and manipulate public opinion, poses substantial
threats to democratic processes, social cohesion, and individual decision-making.
Motivated by various factors such as political agendas or financial incentives from
clickbait, creators of fake news exploit the accessibility and reach of social media
platforms, often prioritizing virality over accuracy.
The consequences of fake news extend beyond mere misinformation, eroding trust in
established sources of information and exacerbating societal polarization. Social media
platforms, designed for rapid information dissemination, have inadvertently facilitated the
spread of misinformation, overshadowing legitimate news sources and complicating the
task of distinguishing truth from falsehood. Moreover, addressing the challenge of fake
news is complicated by the need to balance efforts to combat misinformation with the
protection of free speech rights.
This study advances existing research by proposing a novel approach to fake news
detection using LSTM classification models. By analyzing linguistic features extracted
from news articles and social media posts, the proposed LSTM-based system endeavors
to differentiate between trustworthy information and fake news with high precision and
recall. Through rigorous experimentation and evaluation, this research seeks to contribute
to the evolution of fake news detection techniques, ultimately fostering a more informed
and resilient society.
2. EDITOR BROWSER & SERVER USED
Editor Used :-
Google Colab
Architecture Used :-
System RAM :- 12.7 GB
GPU RAM :- 15.0 GB
3. RESEARCH METHOD
1. Data Collection
2. Data Preprocessing
3. Model Training
4. Evaluation
5. Analysis and Result
1. Data Collection
We obtained a dataset of fake news from Kaggle. The data set used for analysis and
exploration in this research project consisted of a total of 44,898 news articles, with a clear
dichotomy between real and fake news. From a vast pool of news articles, 21,417 articles
were categorized as original news. On the other hand, the dataset also includes
23,481 articles classified as fake news, which were deliberately created and disseminated
to deceive the audience and spread misinformation. The clear separation between genuine
and fake news in this dataset provides a unique opportunity for researchers to analyze
significant patterns and characteristics to develop accurate and robust approaches to
identifying fake news.
Total :- 44,898
Real :- 21,417
Fake :- 23,481
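The split above is close to balanced, which matters when training a classifier; a quick sanity check of the class shares, computed from the counts reported above:

```python
total, real, fake = 44_898, 21_417, 23_481

# Sanity check: the two classes partition the corpus
assert real + fake == total

real_share = real / total
fake_share = fake / total
print(f"real: {real_share:.1%}  fake: {fake_share:.1%}")
# prints: real: 47.7%  fake: 52.3%
```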
Subject count plot of the real news articles (figure).
2. Data Preprocessing
NLP :- NLP stands for Natural Language Processing.
Natural Language Processing (NLP) plays a crucial role in fake news detection models by
enabling computers to analyze and understand textual content for identifying
misinformation. Here's how NLP techniques have been applied in our project:
1. Text Preprocessing:
2. Feature Extraction:
Word Cloud: A word cloud is a visualization technique used to
represent text data, where the size of each word indicates its
frequency or importance within the text. In a word cloud, words
are typically arranged randomly, and the size of each word is
proportional to its frequency in the text.
Word clouds are often used to provide a visual summary of the
most frequently occurring words in a document or a corpus of
text. They are particularly useful for identifying key themes,
topics, or trends within the text data at a glance.
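The word sizes in a word cloud come directly from word frequency counts (rendering is typically done with a library such as wordcloud). A minimal sketch of that counting step using Python's standard library, with an illustrative sample sentence:

```python
from collections import Counter

def word_frequencies(text):
    # Lower-case and split on whitespace; a real pipeline would also
    # strip punctuation and remove stopwords before counting
    return Counter(text.lower().split())

freqs = word_frequencies("fake news spreads fast and fake news misleads")
top = freqs.most_common(2)  # [('fake', 2), ('news', 2)]
```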
3. Word To Vector Model:
4. Tokenization:
1. Word Tokenization:
- Word tokenization involves splitting the text into individual words based
on whitespace or punctuation boundaries.
- For example, the sentence "Natural Language Processing is fascinating!"
would be tokenized into the following tokens: ["Natural", "Language",
"Processing", "is", "fascinating", "!"].
2. Sentence Tokenization:
- Sentence tokenization involves splitting the text into individual sentences.
- For example, the paragraph "NLP is fascinating. It involves analyzing text
data." would be tokenized into the following sentences: ["NLP is
fascinating.", "It involves analyzing text data."].
3. Subword Tokenization:
- Subword tokenization involves breaking down words into smaller
linguistic units, such as prefixes, suffixes, or root words.
- This approach is particularly useful for handling out-of-vocabulary words
and morphologically rich languages.
- For example, the word "unhappiness" might be tokenized into ["un",
"happi", "ness"] using subword tokenization techniques like Byte-Pair
Encoding (BPE) or WordPiece.
Tokenization serves as the first step in many NLP tasks, including text
classification, sentiment analysis, machine translation, and named entity
recognition. Once the text has been tokenized, further processing steps such
as stopword removal, stemming, lemmatization, and feature extraction can
be applied to the tokens to extract meaningful information and patterns from
the text data.
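A minimal word tokenizer matching the example above can be sketched with a regular expression; libraries such as NLTK or the Keras Tokenizer provide more robust implementations:

```python
import re

def word_tokenize(text):
    # Keep runs of word characters as tokens, and emit each
    # punctuation mark as a separate token
    return re.findall(r"\w+|[^\w\s]", text)

tokens = word_tokenize("Natural Language Processing is fascinating!")
# -> ['Natural', 'Language', 'Processing', 'is', 'fascinating', '!']
```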
3. Model Training
Training a model using Long Short-Term Memory (LSTM) networks involves several
steps. LSTMs are a type of recurrent neural network (RNN) architecture that are well-
suited for sequence prediction tasks, such as time series forecasting, natural language
processing, and speech recognition. Here's a high-level overview of the process:
1. Data Preprocessing:
- Prepare your dataset for training. This typically involves tokenizing the text (if
working with text data), splitting it into sequences, and encoding it into numerical
format that the LSTM can process. For example, you might convert words into word
embeddings or use one-hot encoding for categorical variables.
2. Model Architecture:
- Define the architecture of your LSTM model. This includes specifying the number
of LSTM layers, the number of units (or neurons) in each layer, and any additional
layers such as dropout or dense layers. You'll also need to specify the input shape,
which depends on the format of your input data.
3. Compilation:
- Compile the model by specifying the loss function (e.g., binary cross-entropy for a
real/fake classification task), the optimizer (such as Adam), and the metrics to monitor
during training.
4. Training:
- Train the LSTM model on your training data using the `fit()` method. Specify the
training data, validation data (if applicable), batch size, and number of epochs. During
training, the model learns to adjust its parameters (weights and biases) to minimize the
loss function.
5. Validation:
- Monitor the model's performance on the validation data during training to detect
overfitting and adjust hyperparameters accordingly. You can visualize metrics such as
loss and accuracy over epochs using plots or callbacks.
6. Evaluation:
- Evaluate the trained LSTM model on a separate test dataset to assess its
performance on unseen data. Compute relevant metrics such as accuracy, precision,
recall, F1-score, or mean squared error depending on the task.
7. Fine-Tuning (Optional):
- Fine-tune the LSTM model by experimenting with different architectures,
hyperparameters, and preprocessing techniques to improve performance.
Throughout this process, it's essential to monitor the model's performance, iterate on
the architecture and hyperparameters, and incorporate best practices for training deep
learning models, such as regularization and early stopping, to prevent overfitting.
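The steps above can be sketched as a small Keras model. The layer sizes and vocabulary here are illustrative assumptions, not the exact architecture used in this report, and `X_train`/`y_train` in the commented line are placeholders for the preprocessed data:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative hyperparameters, not the report's exact values
vocab_size = 5000
embed_dim = 32

model = keras.Sequential([
    layers.Embedding(vocab_size, embed_dim),   # token ids -> dense vectors
    layers.LSTM(64),                           # sequence -> fixed-size summary
    layers.Dropout(0.2),                       # regularization against overfitting
    layers.Dense(1, activation="sigmoid"),     # binary real/fake probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training would then be:
# model.fit(X_train, y_train, validation_split=0.1, batch_size=32, epochs=6)
```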
Values per Epoch :-
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 1000, 100)         22826500
=================================================================
Total params: 22943877 (87.52 MB)
Trainable params: 117377 (458.50 KB)
Non-trainable params: 22826500 (87.08 MB)
Epoch 1/6
737/737 [==============================] - 35s 42ms/step - loss: 0.1423 - acc: 0.9496 - val_loss: 0.0704 - val_acc: 0.9764
Epoch 2/6
737/737 [==============================] - 30s 41ms/step - loss: 0.0723 - acc: 0.9771 - val_loss: 0.0618 - val_acc: 0.9799
Epoch 3/6
737/737 [==============================] - 30s 41ms/step - loss: 0.0448 - acc: 0.9848 - val_loss: 0.0446 - val_acc: 0.9858
Epoch 4/6
737/737 [==============================] - 29s 40ms/step - loss: 0.0408 - acc: 0.9864 - val_loss: 0.0445 - val_acc: 0.9858
Epoch 5/6
737/737 [==============================] - 30s 41ms/step - loss: 0.0304 - acc: 0.9901 - val_loss: 0.0434 - val_acc: 0.9855
Epoch 6/6
737/737 [==============================] - 30s 41ms/step - loss: 0.0277 - acc: 0.9904 - val_loss: 0.0405 - val_acc: 0.9856
<keras.src.callbacks.History at 0x7f9b70636260>
4. Evaluation
After training your Long Short-Term Memory (LSTM) model for a specific task, such
as text classification or sequence prediction, it's crucial to evaluate its performance to
assess how well it generalizes to unseen data. Here are some common evaluation
metrics and techniques for assessing the performance of an LSTM model:
1. Accuracy:
- Accuracy measures the proportion of correctly predicted labels out of all samples
in the test dataset. It's a common metric for classification tasks. However, accuracy
alone might not provide a complete picture, especially if the classes are imbalanced.
2. Precision, Recall, and F1-Score:
- Precision measures the proportion of predicted positives that are correct, recall
measures the proportion of actual positives that are found, and the F1-score is their
harmonic mean. These metrics give a more complete picture than accuracy alone when
the classes are imbalanced.
3. Confusion Matrix:
- A confusion matrix provides a detailed breakdown of the model's predictions by
comparing them to the actual labels. It shows the number of true positives, true
negatives, false positives, and false negatives. From the confusion matrix, you can
derive other metrics like precision, recall, and accuracy.
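With scikit-learn, the confusion matrix and the metrics derived from it can be computed in a few lines; the labels below are illustrative, not the project's actual predictions:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Illustrative labels (1 = fake, 0 = real), not the project's predictions
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]

cm = confusion_matrix(y_true, y_pred)
# rows = actual class, columns = predicted class:
# [[3 0]    3 true negatives, 0 false positives
#  [1 2]]   1 false negative, 2 true positives
print(classification_report(y_true, y_pred, target_names=["real", "fake"]))
```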
No. of epochs run :- 6 epochs
Classification Report :-
Model Summary :-
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 1000, 100)         22826500
=================================================================
Total params: 22943877 (87.52 MB)
Trainable params: 117377 (458.50 KB)
Non-trainable params: 22826500 (87.08 MB)
_________________________________________________________________
5. Analysis and Result
Result :-
4. AREAS OF IMPROVEMENT
5. Domain-specific Adaptation:
- Investigate techniques for domain adaptation or transfer learning to adapt the model
to specific domains or languages with limited labeled data.
- Pre-train the model on a large corpus of general text data and fine-tune it on the
task-specific dataset to leverage transfer learning.
6. Adversarial Robustness:
- Develop strategies for adversarial defense to mitigate the impact of adversarial
attacks on the model's predictions, such as adversarial training, input perturbation, or
robust optimization.
By focusing on these areas for improvement, we can enhance the effectiveness and
reliability of our fake news detection model, ultimately contributing to the fight against
misinformation.
5. HARDWARE REQUIREMENT SPECIFICATION
Optional Hardware :- External hard drives, printers, scanners, etc., based on project needs.
6. RAM and Disk Used at Time of Model Training :-
7. CONCLUSION
In conclusion, the use of LSTM (Long Short-Term Memory) networks in the detection
of fake news represents a promising approach with significant potential for mitigating
the spread of misinformation in the digital era. Through the application of advanced
natural language processing techniques, such as LSTM, we can effectively analyze
textual data to discern patterns and characteristics indicative of fake news.
Furthermore, the development of this fake news detection system underscores the
importance of interdisciplinary collaboration between machine learning experts,
linguists, and domain specialists. By combining expertise from diverse fields, we have
been able to design a model that not only harnesses the power of deep learning but also
incorporates nuanced linguistic features essential for effective fake news detection.
Moving forward, the success of this project opens up avenues for further research and
refinement. Future efforts may focus on enhancing the model's accuracy and
scalability, exploring additional features or data sources, and addressing emerging
challenges in the ever-evolving landscape of online misinformation.
Ultimately, the deployment of LSTM-based fake news detection systems holds great
promise for promoting information integrity, fostering media literacy, and safeguarding
the public discourse against the pernicious influence of false information. Through
continued innovation and collaboration, we can strive towards a more informed and
resilient society in the digital age.
8. REFERENCES
ChatGPT
YouTube
GitHub
Research Paper 1
Google Colab
Kaggle