
Detecting Spam Messages using

Naive Bayes Classifier


A J Component Report

Submitted by Team Members

Name Registration Number


Prateek Aggarwal 20BCE0387

Ishanvi Sharma 20BCE0394

Nikkhil Chopra 20BCE0927

Prajjwal Chamaria 20BCE0948

in partial fulfilment for the award of the degree of


B.Tech
in
Computer Science and Engineering
Under the guidance of
Faculty: Prof. Delhi Babu R.
School of Computer Science and Engineering
April 2022

School of Computer Science and Engineering

DECLARATION

We hereby declare that the J Component report entitled “DETECTING SPAM MESSAGES
USING NAÏVE BAYES CLASSIFIER” submitted by us to Vellore Institute of Technology,
Vellore-14 in partial fulfilment of the requirement for the award of the degree of B.Tech in
Computer Science and Engineering is a record of bonafide work undertaken by us under the
supervision of Dr. R. Delhi Babu. We further declare that the work reported in this report has
not been submitted and will not be submitted, either in part or in full, for the award of any other
degree or diploma in this institute or any other institute or university.

Prateek Aggarwal (20BCE0387)

Ishanvi Sharma (20BCE0394)

Nikkhil Chopra (20BCE0927)

Prajjwal Chamaria (20BCE0948)

ACKNOWLEDGEMENT

We are grateful to our professor, Delhi Babu R., whose insightful leadership and knowledge
helped us complete this project successfully. Thank you so much for your continuous
support and presence whenever needed.

CONTENTS

1. Executive Summary

2. Project Idea and Scope

3. Problem Statement

4. Literature Review

5. Architecture Diagram

6. Technical Specification

7. Materials Used (Dataset)

8. Design and Implementation

9. Results and Inferences

10. Conclusion

11. References

1. EXECUTIVE SUMMARY

This project explores the classification of a collection of SMS or email messages into two
categories: Spam and Non-Spam (Ham). An effective way to do this is to apply Bayes' theorem
through a Naive Bayes classifier.

Naive Bayes classifiers are a popular statistical technique for e-mail filtering. They typically
use bag-of-words features to identify spam e-mail, an approach commonly used in text
classification. The classifier works on the principle of conditional probability, as given by
Bayes' theorem.

A data set of 5573 messages is used. It is split into training and testing sets and pre-processed
to prepare the data for classification. The messages of the training set are analysed and the
words they contain are counted. The Naive Bayes algorithm is then applied to classify the
messages into spam and ham. A very high accuracy of 97% (on average) is observed in the output.

2. PROJECT IDEA AND SCOPE
Email spam, also referred to as junk email or simply spam, consists of unsolicited messages
sent in bulk by email or messaging. Email spam has steadily grown since the early 1990s, and by
2019 was estimated to account for around 90% of total email traffic.

There are various reasons why spam emails need to be prevented. Spam fills up our inboxes,
which makes it difficult to find genuine and useful emails, and deleting it is time consuming
and strenuous. In addition, spam can slow down internet speed, can contain inappropriate
content or images, and can include malicious links that infect the recipient's computer with
viruses and malware. Spam email can be difficult to stop because it can be sent from botnets,
which are networks of previously infected computers. As a result, the original spammer can be
difficult to trace and stop.

Spam email messages are sent randomly to multiple addresses by all sorts of groups, including
lazy advertisers and criminals who wish to lead the recipient to phishing websites. These sites
attempt to steal valuable personal, financial and electronic information.
While the most widely recognised form of spam is Email spam, the term is applied to similar
abuses in other media as well, which include instant message spam, usenet newsgroup spam,
web search engine spam, spam in blogs, wiki spam, online classified ad spam, mobile phone
message spam, internet forum spam, junk fax transmission, social spam, spam mobile apps,
television advertising and file sharing spam.

The main goal of our project is to show how a spam filtering system can be designed from
scratch using various techniques. Bayesian spam detection (filtering) is used to detect spam in
an email. A Bayesian network is a representation of probabilistic relationships.

3. PROBLEM STATEMENT
A tight competition between filtering methods and spammers goes on every day, as spammers
use tricky methods to get past spam filters, such as using random sender addresses or
appending random characters at the beginning or end of a mail's subject line. There is a lack
of machine learning focus on developing models that can predict this activity. Spam wastes
users' time, since they have to sort out the unwanted junk mail, and it consumes storage space
and communication bandwidth. Rules in existing models must be constantly updated and
maintained, which is a burden to some users, and it is hard to manually compare the
accuracy of classified data.

To proceed with this objective, we needed to understand the role of various methodologies in
spam email detection.
Naive Bayes classifiers are a popular statistical technique for e-mail filtering. They typically
use bag-of-words features to identify spam email, an approach commonly used in text
classification.

The methodology also considers white lists. A white list is a list of addresses from which
users tend to receive emails. Users can also add email addresses or domains to it. An
advantage of a white list is that it allows users or administrators to add the email addresses
of trusted senders, so that valid emails received from addresses in the white list are not
labelled as spam.
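
As a rough illustration of how a white list could sit in front of the classifier, the sketch below lets mail from listed addresses or domains bypass the filter. The names WHITE_LIST_ADDRESSES, WHITE_LIST_DOMAINS and is_whitelisted are hypothetical, the addresses are placeholders, and this check is not part of the implementation described in this report.

```python
# Hypothetical white list: these addresses and domains are illustrative only.
WHITE_LIST_ADDRESSES = {"friend@example.com"}
WHITE_LIST_DOMAINS = {"university.example"}


def is_whitelisted(sender):
    """Return True if the sender's address or domain is on the white list."""
    sender = sender.strip().lower()
    # Everything after the last '@' is treated as the domain
    domain = sender.rpartition("@")[2]
    return sender in WHITE_LIST_ADDRESSES or domain in WHITE_LIST_DOMAINS


print(is_whitelisted("Friend@Example.com"))
print(is_whitelisted("prof@university.example"))
print(is_whitelisted("offers@unknown.example"))
```

A message from a white-listed sender would be delivered directly, and only the remaining messages would be passed to the Naive Bayes classifier.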

4. LITERATURE REVIEW

For each resource, the entries below give the title, the authors and year, the methodology, our comments and the website link.

1. Spam Detection Framework using ML Algorithm
   Author and year: Vinodhini. M, Prithivi. D, Balaji. S (March 2020)
   Methodology: Uses NLP concepts with the random forest algorithm, using 8 different features, each based on an NLP concept.
   Comments: Detects spam in general (not in emails, but in tweets, and gives account info).
   Website link: F1120038620.pdf (ijrte.org)

2. Evaluation of ML techniques for Email Spam Classification
   Author and year: Mahmoud Jazzar, Rasheed F. Yousef, Derar Eleyan (February 2021)
   Methodology: Calculates accuracy, precision and recall for each of the methods used. The following methods are used for classification: J48 decision tree, Support Vector Machine, Artificial Neural Network, Naive Bayes Classifier.
   Comments: This paper is a comparison between the techniques and does not explain the entire working of the system.
   Website link: Preparation_Instruction (mecs-press.org)

3. ML for email spam filtering: review, approaches, and open research problems
   Author: Emmanuel Gbenga Dada, Joseph Stephen Bassi, Haruna Chiroma, Shafi'i Muhammad Abdulhamid, Adebayo Olusola Adetunmbi, Opeyemi Emmanuel Ajibuwa
   Methodology: The paper describes multiple ways to approach the problem after tokenizing the training data. A number of methods are highlighted, such as the Naive Bayes classifier, Support Vector Machine and K Nearest Neighbour.
   Comments: Several machine learning approaches are reviewed. The evolution of spam messages over the years to evade filters is examined. The architecture of the system to solve the problem is also looked into. A very useful paper for hopeless students, such as ourselves, to approach the given problem.
   Website link: https://www.sciencedirect.com/science/article/pii/S2405844018353404

4. Detecting Spam Email With Machine Learning Optimized With Bio-Inspired Metaheuristic Algorithms
   Author and year: Simran Gibson, Biju Issac, Li Zhang, Seibu Mary Jacob (October 2020)
   Methodology: Machine learning models were implemented using Naïve Bayes, Support Vector Machine, Random Forest, Decision Tree and Multi-Layer Perceptron on seven different email datasets, along with feature extraction and pre-processing. Bio-inspired algorithms such as Particle Swarm Optimization (PSO) and Genetic Algorithm (GA) were implemented to optimize the performance of the classifiers. Multinomial Naïve Bayes with Genetic Algorithm performed the best overall.
   Comments: The paper compares the results with other machine learning and bio-inspired models to show the most suitable model. The Genetic Algorithm worked better overall than PSO for both text-based and numerical-based datasets. PSO worked well for Multinomial Naïve Bayes and Stochastic Gradient Descent, whereas GA worked well for Random Forest and Decision Tree. The Naïve Bayes algorithm was shown to be the best algorithm for spam detection.
   Website link: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9222163

5. Spam Detection Approach for Secure Mobile Message Communication Using Machine Learning Algorithms
   Author and year: Luo GuangJun, Shah Nazir, Habib Ullah Khan, and Amin Ul Haq (2020)
   Methodology: Machine learning classifiers such as logistic regression (LR), K-nearest neighbor (K-NN) and decision tree (DT) are used for the classification of ham and spam messages in mobile device communication.
   Comments: The proposed method (Decision Tree) for spam detection uses machine learning predictive models. The experimental results show that the proposed method has a high capability to detect spam. The proposed method achieved 99% accuracy, which is high compared with the other existing methods.
   Website link: Spam Detection Approach for Secure Mobile Message Communication Using Machine Learning Algorithms

6. Email Spam Detection using Machine Learning and Neural Networks
   Author and year: Manoj Sethi, Sumesha Chandra, Vinayak Choudhary, Yash (4 April 2021)
   Methodology: The proposed work showcases differentiating features of the content of other documents. The research says that spam email detection either focuses on natural language processing methodologies with a single machine learning algorithm or on one natural language processing technique with multiple machine learning algorithms. In this project, a modeling pipeline is developed to review the machine learning methodologies.
   Comments: The research proposes that the outcome obtained should be compared with additional spam datasets from various sources. Also, more classification and feature algorithms should be analyzed with email spam datasets.
   Website link: Email Spam Detection using Machine Learning and Neural Networks

7. Machine Learning based Spam E-Mail Detection
   Author and year: Priti Sharma, U. Bhardwaj (2018)
   Methodology: This paper proposes a machine learning based hybrid bagging approach, implementing two machine learning algorithms, Naïve Bayes and J48 (decision tree), for spam email detection. The third experiment is the proposed SMD system, implemented using the hybrid bagged approach. In this process, the dataset is divided into different sets and given as input to each algorithm.
   Comments: A total of three experiments are performed and the results obtained are compared in terms of precision, recall, accuracy, f-measure, true negative rate, false positive rate and false negative rate. An overall accuracy of 87.5% is achieved by the hybrid bagged approach based SMD system.
   Website link: [PDF] Machine Learning based Spam E-Mail Detection

5. ARCHITECTURE DIAGRAM

6. TECHNICAL SPECIFICATION
In this project, we aim to build a spam e-mail message filter using Python and the multinomial
Naïve Bayes algorithm. Our goal is to code a spam filter from scratch that classifies messages
with an accuracy greater than 80%.

To build our spam filter, we'll use a dataset of over 5,000 SMS messages put together from
various sites with the major dataset being from Kaggle.

We have focused on the Python implementation throughout the project and used it to
understand Bayes classification.

● Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.

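● The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) × P(A) / P(B)

Where,

P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.

P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.

P(A) is the prior probability: the probability of the hypothesis before the evidence is observed.

P(B) is the marginal probability: the probability of the evidence.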
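
A small numeric example may make the theorem concrete. The figures below are invented for illustration and are not measured from our dataset: suppose 13% of messages are spam, and the word "free" appears in 30% of spam messages and in 2% of ham messages.

```python
# Illustrative numbers only; they are assumptions, not values from the dataset.
p_spam = 0.13              # P(A): prior probability that a message is spam
p_ham = 1 - p_spam
p_free_given_spam = 0.30   # P(B|A): "free" appears in a spam message
p_free_given_ham = 0.02    # P(B|not A): "free" appears in a ham message

# P(B): total probability of seeing the word "free" in any message
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_spam_given_free = p_free_given_spam * p_spam / p_free

print(round(p_spam_given_free, 3))
```

Under these assumed numbers, seeing the word "free" raises the probability of spam from 0.13 to roughly 0.69.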

When a new message comes in, our multinomial Naive Bayes algorithm makes the
classification based on the values of the two expressions below, where w1 is the first
word and w1, w2, ..., wn is the entire message:

P(Spam | w1, w2, ..., wn) ∝ P(Spam) × P(w1|Spam) × P(w2|Spam) × ... × P(wn|Spam)

P(Ham | w1, w2, ..., wn) ∝ P(Ham) × P(w1|Ham) × P(w2|Ham) × ... × P(wn|Ham)

If P(Spam | w1, w2, ..., wn) is greater than P(Ham | w1, w2, ..., wn), then the message is spam.
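
The comparison can be sketched for a two-word message. The probabilities below are made-up placeholders rather than values learned from the dataset; they only show how the prior and the per-word likelihoods are combined into the two scores.

```python
from math import prod

# Placeholder probabilities (assumptions for illustration only).
p_spam, p_ham = 0.13, 0.87
p_word_given_spam = {"win": 0.020, "prize": 0.015}
p_word_given_ham = {"win": 0.001, "prize": 0.0005}

message = ["win", "prize"]

# Score proportional to P(Spam | w1..wn): prior times the product of likelihoods
spam_score = p_spam * prod(p_word_given_spam[w] for w in message)
# Score proportional to P(Ham | w1..wn)
ham_score = p_ham * prod(p_word_given_ham[w] for w in message)

label = "spam" if spam_score > ham_score else "ham"
print(spam_score, ham_score, label)
```

Only the relative size of the two scores matters, so the common denominator of Bayes' theorem never needs to be computed.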

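To calculate P(wi|Spam) and P(wi|Ham), we need to use separate equations:

P(wi|Spam) = (N(wi|Spam) + α) / (N(Spam) + α × N(Vocabulary))

P(wi|Ham) = (N(wi|Ham) + α) / (N(Ham) + α × N(Vocabulary))

Here N(wi|Spam) is the number of times the word wi occurs in spam messages, N(Spam) is the
total number of words in all spam messages, N(Vocabulary) is the number of unique words in
the vocabulary, and α is the smoothing parameter. The terms for ham are defined in the same way.

For the scope of the project, we have assumed the value of alpha to be 1.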
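
To see what the smoothing term does, consider a tiny made-up corpus. The counts below are assumptions chosen for easy arithmetic, and smoothed_probability is a hypothetical helper that applies additive (Laplace) smoothing, (count + alpha) / (total + alpha × vocabulary size), with alpha = 1.

```python
def smoothed_probability(word_count, total_words, vocabulary_size, alpha=1):
    """Laplace-smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (total_words + alpha * vocabulary_size)


# Made-up counts: 4 words in total across spam messages, 9 unique words overall.
n_spam = 4
n_vocabulary = 9

# A word seen twice in spam, and a word never seen in spam
p_seen = smoothed_probability(2, n_spam, n_vocabulary)
p_unseen = smoothed_probability(0, n_spam, n_vocabulary)

print(p_seen, p_unseen)
```

Without smoothing, the unseen word would get probability 0 and would wipe out the whole product for any message that contains it; with alpha = 1 it gets the small but non-zero value 1/13.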

7. MATERIALS USED (DATASET)
We have started by opening the SMSSpamCollection file with the read_csv() function from
the pandas package.

The splitting of the dataset into training and test sets is also shown here; the aim is for
spam and ham messages to be spread properly across both parts.

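import pandas as pd
import re

# Tab-separated file with no header row: first column is the label, second the message
data = pd.read_csv('SMSSpamCollection', sep='\t', header=None,
                   names=['Label', 'SMS'])

# Use the first 80% of the rows for training and the remaining 20% for testing
index = round(len(data) * 0.8)
train_data = data[:index].reset_index(drop=True)
test_data = data[index:].reset_index(drop=True)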

We have taken the data from one of the datasets available on Kaggle.

(https://www.kaggle.com/datasets)

For the dataset that we have taken, about 87% of the messages are ham (non-spam) and the
remaining 13% are spam. This sample is reasonably representative, although the actual
proportion of spam varies from system to system and from user to user.

For the implementation of the code on this dataset, we have used the first 80% of the data for
training and the remaining 20% for testing.
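
The class proportions in each part can be checked with value_counts. The sketch below uses a tiny stand-in DataFrame (toy_data, an assumption) instead of the real file so that it runs on its own; on the real data the same calls would be applied to train_data and test_data.

```python
import pandas as pd

# Tiny stand-in for the real dataset (assumption: same 'Label' and 'SMS' columns).
toy_data = pd.DataFrame({
    "Label": ["ham", "ham", "spam", "ham", "ham", "ham", "spam", "ham", "ham", "ham"],
    "SMS": ["msg %d" % i for i in range(10)],
})

# Same 80/20 split by position as used for the real data
index = round(len(toy_data) * 0.8)
toy_train = toy_data[:index].reset_index(drop=True)
toy_test = toy_data[index:].reset_index(drop=True)

# Share of ham and spam in each part, as fractions that sum to 1
train_share = toy_train["Label"].value_counts(normalize=True)
test_share = toy_test["Label"].value_counts(normalize=True)
print(train_share)
print(test_share)
```

In this toy case the test part happens to contain no spam at all, which shows why the check is worth doing. If the proportions differ a lot between the two parts, shuffling the rows before splitting (for example with data.sample(frac=1, random_state=1)) is a common remedy; this is a suggestion and not something our code does.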

8. DESIGN AND IMPLEMENTATION

I. Processing
a) Data Cleaning

To make use of the formulae for Bayes’ Theorem, we convert the dataset into a more usable
format. We do this by making all the characters in the messages lowercase, and removing all
punctuation.

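# Replace every non-word character (punctuation) with a space
train_data['SMS'] = train_data['SMS'].str.replace(r'\W', ' ', regex=True)
# Lowercase the text, then split each message into a list of words
train_data['SMS'] = train_data['SMS'].str.lower()
train_data['SMS'] = train_data['SMS'].str.split()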

Since we will be calculating the probability for every message, it is essential that we convert
the dataset into a more convenient format so we can make use of the formulas. We do this in
the following manner:

For every unique word in the dataset, there is a column related to it. If there are ‘n’ unique
words, then there are ‘n’ columns. Now, for every message, there is a label stating whether it’s
ham or spam, and then there are ‘n’ columns. The cells of these columns contain the number
of times a word occurs in the corresponding message.

The Data Cleaning Mechanism included:


● Letter Case and Punctuation
● Creating the Vocabulary (if needed)
● The Final Training Set (using the vocabulary created to make the data transformation
we want)
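
The vocabulary and the final training set can be built along the following lines. This is a minimal sketch on a three-message stand-in for the cleaned train_data; the names words and processed_train_data match those used in the code of this report, but the construction shown here is an assumption about one reasonable way to obtain them.

```python
import pandas as pd

# Stand-in for the cleaned training set: each SMS is already a list of words.
train_data = pd.DataFrame({
    "Label": ["spam", "ham", "ham"],
    "SMS": [["win", "free", "prize"], ["see", "you", "soon"], ["free", "for", "lunch"]],
})

# Vocabulary: every unique word in the training messages
words = sorted({word for sms in train_data["SMS"] for word in sms})

# One column per word; entry i of a column counts how often the word occurs in message i
word_counts = {word: [0] * len(train_data) for word in words}
for i, sms in enumerate(train_data["SMS"]):
    for word in sms:
        word_counts[word][i] += 1

# Final training set: the label, the word list, and one count column per word
processed_train_data = pd.concat([train_data, pd.DataFrame(word_counts)], axis=1)
print(processed_train_data)
```

Because the messages are lowercased during cleaning, the word columns cannot clash with the capitalised 'Label' and 'SMS' column names.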

b) Calculating Constants according to the formula
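Now that we're done with cleaning the training set, we can begin coding the spam filter. The
multinomial Naive Bayes algorithm will need to answer these two probability questions to be
able to classify new messages:

P(Spam | w1, w2, ..., wn) ∝ P(Spam) × P(w1|Spam) × P(w2|Spam) × ... × P(wn|Spam)

P(Ham | w1, w2, ..., wn) ∝ P(Ham) × P(w1|Ham) × P(w2|Ham) × ... × P(wn|Ham)

Also, to calculate P(wi|Spam) and P(wi|Ham) inside the formulas above, we'll need to use these
equations:

P(wi|Spam) = (N(wi|Spam) + α) / (N(Spam) + α × N(Vocabulary))

P(wi|Ham) = (N(wi|Ham) + α) / (N(Ham) + α × N(Vocabulary))

We'll also use Laplace smoothing and set α = 1.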

spam = processed_train_data[processed_train_data['Label'] == 'spam']
ham = processed_train_data[processed_train_data['Label'] == 'ham']

# Prior probabilities P(Spam) and P(Ham)
P_Spam = len(spam) / len(processed_train_data)
P_Ham = len(ham) / len(processed_train_data)

# Total number of words in spam messages, in ham messages, and in the vocabulary
N_Spam = spam['SMS'].apply(len).sum()
N_Ham = ham['SMS'].apply(len).sum()
N_Words = len(words)

c) Calculating Parameters according to the formula
After we have the constant terms calculated above, we can move on with calculating the
parameters P(wi|Spam) and P(wi|Ham).

P(wi|Spam) and P(wi|Ham) will vary depending on the individual words. For instance,
P("secret"|Spam) will have a certain probability value, while P("cousin"|Spam) or
P("lovely"|Spam) will most likely have other values.

Thus, each parameter shall be a conditional probability value associated with each word in the
vocabulary.

The parameters are calculated using the two equations mentioned above.

P_Word_given_Spam = {word: 0 for word in words}
P_Word_given_Ham = {word: 0 for word in words}

for word in words:
    # Laplace-smoothed estimates with alpha = 1
    N_Word_Spam = spam[word].sum()
    P_Word_given_Spam[word] = (N_Word_Spam + 1) / (N_Spam + N_Words)

    N_Word_Ham = ham[word].sum()
    P_Word_given_Ham[word] = (N_Word_Ham + 1) / (N_Ham + N_Words)

II. Classifying Text Messages

Once we have calculated the constants and the parameters, we use them to calculate the
probabilities needed to classify messages.

Our messages are in the format (w1, w2, w3, ..., wn).

We calculate P(Spam|w1, w2, ..., wn) and P(Ham|w1, w2, ..., wn) using the formulas mentioned
earlier. The maximum of these two is the result of the classification. On the off chance that the
probabilities are equal, human intervention is required to classify the message.

In the test messages, there is a chance that one or more words in the message are not known to
our word dictionary. Therefore the model ignores those words and they will not contribute
towards the calculation of probabilities.

The function ‘predict’ processes the text to convert it to a simpler format and then the
probability for its classification is calculated.

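def predict(text):
    # Apply the same cleaning as for the training data
    text = re.sub(r'\W', ' ', text)
    text = text.lower().split()

    # Start from the priors P(Spam) and P(Ham)
    P_Spam_given_Text = P_Spam
    P_Ham_given_Text = P_Ham

    # Words that are not in the vocabulary are skipped
    for word in text:
        if word in P_Word_given_Spam:
            P_Spam_given_Text *= P_Word_given_Spam[word]

        if word in P_Word_given_Ham:
            P_Ham_given_Text *= P_Word_given_Ham[word]

    print('P(Ham):', P_Ham_given_Text)
    print('P(Spam):', P_Spam_given_Text)

    if P_Spam_given_Text > P_Ham_given_Text:
        print("Spam")
    elif P_Spam_given_Text < P_Ham_given_Text:
        print("Ham")
    else:
        # Equal probabilities: left for a human to classify
        print("Equal probabilities, human classification needed")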
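
Multiplying many small probabilities can underflow to 0.0 for long messages. A common remedy, shown here as an optional variant and not as what our code does, is to add logarithms instead of multiplying. The function predict_log and the toy parameter tables below are assumptions for illustration.

```python
import math
import re

# Toy parameters (assumptions); in the project these come from the training step.
P_Spam, P_Ham = 0.13, 0.87
P_Word_given_Spam = {"win": 0.02, "free": 0.03, "lunch": 0.001}
P_Word_given_Ham = {"win": 0.001, "free": 0.004, "lunch": 0.01}


def predict_log(text):
    """Classify text by comparing log-scores instead of raw products."""
    tokens = re.sub(r'\W', ' ', text).lower().split()
    log_spam = math.log(P_Spam)
    log_ham = math.log(P_Ham)
    for word in tokens:
        # Words outside the vocabulary are ignored, as in predict
        if word in P_Word_given_Spam:
            log_spam += math.log(P_Word_given_Spam[word])
        if word in P_Word_given_Ham:
            log_ham += math.log(P_Word_given_Ham[word])
    return 'spam' if log_spam > log_ham else 'ham'


print(predict_log("Win a FREE prize!"))
print(predict_log("Lunch?"))
```

Since the logarithm is an increasing function, comparing the log-scores gives the same decision as comparing the products whenever the products do not underflow.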

The ‘validate’ function is the same as the ‘predict’ function, except that it returns the label
instead of printing it. It is used to classify the test data: for every message in the test data,
the predicted value is stored in a new ‘prediction’ column, and the results can be exported to a
csv file called ‘test_results’. This csv file contains all the test messages, their labels, and the
predicted results. The accuracy of the model is calculated from the number of correct
predictions and the total number of test messages.

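def validate(text):
    # Apply the same cleaning as for the training data
    text = re.sub(r'\W', ' ', text)
    text = text.lower().split()

    P_Spam_given_Text = P_Spam
    P_Ham_given_Text = P_Ham

    # Words that are not in the vocabulary are skipped
    for word in text:
        if word in P_Word_given_Spam:
            P_Spam_given_Text *= P_Word_given_Spam[word]

        if word in P_Word_given_Ham:
            P_Ham_given_Text *= P_Word_given_Ham[word]

    if P_Spam_given_Text > P_Ham_given_Text:
        return 'spam'
    elif P_Spam_given_Text < P_Ham_given_Text:
        return 'ham'
    else:
        # Equal probabilities: left for a human to classify
        return 'needs human classification'

# Store the predicted label of every test message in a new column
test_data['prediction'] = test_data['SMS'].apply(validate)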

9. RESULTS AND INFERENCES
After the classification of the messages into spam and ham, we measure the spam filter’s
accuracy, which forms the result of our project.
Here, we check how well the filter does on our test set, which has 1,114 messages (20%
of the entire dataset).

We have written a function that returns classification labels instead of printing them (as
shown previously).

We have compared the predicted values with the actual values to measure how good our
spam filter is at classifying new messages. To make the measurement, we have used
accuracy as the metric.

Accuracy is measured by dividing the number of correctly classified messages by the
total number of classified messages.
correct = 0
incorrect = 0
# Compare the true label with the predicted label, row by row
for index, row in test_data.iterrows():
    if row['Label'] == row['prediction']:
        correct = correct + 1
    else:
        incorrect = incorrect + 1

print('Test Tuples:', test_data.shape)
print('Correct Results:', correct)
print('Incorrect Results:', incorrect)
print('No of spam msgs:', len(test_data[test_data['Label'] == 'spam']))
print('Accuracy:', correct / test_data.shape[0])

#test_data.to_csv('test_results.csv')
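
The same accuracy can be computed without an explicit loop, and a cross-tabulation shows which kinds of mistakes were made. The sketch below uses a small made-up results table (toy_results, an assumption) in place of test_data.

```python
import pandas as pd

# Made-up labels and predictions standing in for test_data.
toy_results = pd.DataFrame({
    "Label":      ["ham", "ham", "spam", "spam", "ham", "spam"],
    "prediction": ["ham", "ham", "spam", "ham",  "ham", "spam"],
})

# Fraction of rows where the prediction matches the true label
accuracy = (toy_results["Label"] == toy_results["prediction"]).mean()

# Rows are true labels, columns are predicted labels
confusion = pd.crosstab(toy_results["Label"], toy_results["prediction"])

print(accuracy)
print(confusion)
```

Because ham messages are far more common than spam in this kind of data, the table is a useful companion to the single accuracy figure: it separates spam that slipped through from ham that was wrongly flagged.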

The csv file containing the final results is shown below, followed by the final output for the
dataset and the results for a few custom inputs.

Test_results.csv

Final Output of the Dataset

Custom inputs

predict("hey harry would like to play a game to win free prizes?")
predict('Mam we will not be able to submit our project on time')
predict("Win easy money by playing this amazing game.")

Output for Custom Inputs

10. CONCLUSION
Spam is one of the biggest issues of current times. Spam messages not only affect
organisations financially but also exasperate individual users. This project proposes a
Bayesian learning based approach, implementing the Naïve Bayes algorithm for spam message
detection.

In this project, we have coded a spam filter for SMS messages using the multinomial
Naive Bayes algorithm. The filter had an accuracy of about 98.7% on the test set we used,
which is a promising result. We were able to accomplish the initial goals that we had set.

11. REFERENCES

● https://www.ijrte.org/wp-content/uploads/papers/v8i6/F1120038620.pdf
● http://www.mecs-press.org/ijeme/ijeme-v11-n4/IJEME-V11-N4-4.pdf
● https://www.sciencedirect.com/science/article/pii/S2405844018353404
● https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9222163
● https://www.hindawi.com/journals/scn/2020/8873639/
● https://www.irjet.net/archives/V8/i4/IRJET-V8I472.pdf
● https://www.semanticscholar.org/paper/Machine-Learning-based-Spam-E-Mail-Detection-Sharma-Bhardwaj/bc60393ef918832baf4ea808502933c38dc8e346
