
2023 7th International Conference On Computing, Communication, Control And Automation (ICCUBEA)

Pimpri Chinchwad College of Engineering (PCCOE), Pune, India. Aug 18-19, 2023

Classifying Email as High and Low Risk: An Effective Approach to Spam Email Classification
2023 7th International Conference On Computing, Communication, Control And Automation (ICCUBEA) | 979-8-3503-0426-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICCUBEA58933.2023.10392186

Prof. Priya Surana, Dept. of Computer Engineering, PCCOE Pune, India, priya.surana@pccoepune.org
Diksha Waghchoure, Dept. of Computer Engineering, PCCOE Pune, India, diksha.waghchoure19@pccoepune.org
Riya Shah, Dept. of Computer Engineering, PCCOE Pune, India, riya.shah19@pccoepune.org
Siddhesh Vharambale, Dept. of Computer Engineering, PCCOE Pune, India, siddhesh.vharamble19@pccoepune.org
Abhishek Rath, Dept. of Computer Engineering, PCCOE Pune, India, abhishek.rath19@pccoepune.org

Abstract—Email is one of the most widely used and popular forms of communication due to its worldwide accessibility, the relative speed at which messages can be transferred, and its low sending costs. Today, a large portion of the population depends on email, which gives spammers a great opportunity to send spam messages that play on people's interests. The rise in email-based threats is directly attributed to flaws in e-mail protocols and to the growth of electronic commerce and financial activities. Spam emails sneak into users' mailboxes without their consent. They consume network resources, and checking and deleting them takes time. Spam floods inboxes with irrelevant emails, greatly reduces internet speed, and can steal vital information, such as contact information, from the user. In the digital age, spam email has grown into a serious issue, and it is crucial to identify and filter spam emails in order to preserve the integrity of communication channels.

Traditional techniques of classifying email simply identify an email as spam or not spam. We introduce the concept of further classifying spam email as high risk or low risk. Ensemble learning is used for the binary classification into spam or ham; the ensemble includes three models, namely Multinomial Naïve Bayes, SVM, and Decision Tree, and gave an accuracy of 98.95%. For the further classification of spam mail into the high-risk and low-risk subcategories, SVM produced an accuracy of 84.37%.

Index Terms—svm, ensemble, multinomial naive bayes, emails, spam, high risk, low risk

I. INTRODUCTION

There are many different communication options available online. Email is a more professional way to communicate with people. You become a target once spam hits your email inbox. Most of the time, when it comes to computer security, people are the weakest link. Attackers will make repeated attempts to trick them, employing a variety of strategies to persuade them to click on things they should not. If people unintentionally click the wrong link in a spam email, internal information can become public.

As email is commonly exploited to scam customers and obtain their personal information, spam filtering has gained importance and relevance. Every organization must put spam filtering in place. Spam filtering ensures that corporate email operates without a hitch and is used only for its intended purpose, in addition to keeping junk out of email inboxes. Spam filtering is essentially an anti-malware tool, because many email-based attacks try to deceive users into opening an unsafe attachment or entering their credentials, among many other tactics.

In spam email classification, the incoming textual data is preprocessed using the count vectorizer method, and the ensemble learning method is applied to the data for binary classification. Financial emails are considered riskier than other emails because fraudsters can deceive people by sending alluring emails that make them click on suspicious links, which may result in financial loss for the user. Therefore, we trained a model using SVM that further classifies emails into two subcategories: high risk and low risk. Phishing and financial-related emails fall under the high-risk category, and other emails fall under the low-risk category. With this new feature, the user will be more attentive while checking their emails.

We have solved the following problems in this paper.
1. Most previous research papers have evaluated their models on a single dataset or a small set of datasets. Our ensemble of algorithms is trained on multiple datasets, which improved the generalization ability of the model by capturing different aspects of email content or spamming behavior.
2. Most previous research papers have used a limited set of features, such as the frequency of certain words, the length of the email, and the sender's email address. While these features can be useful, they may not be sufficient to capture the complexity of spam emails. Various methods are used to extract more complex features from email content.
3. The temporal dynamics of spamming techniques are captured effectively by combining the strengths of Multinomial Naïve Bayes, SVM, and decision tree: changes in spamming behaviour over time are tracked by detecting anomalies in email content or identifying emerging patterns of spamming activity.
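The count-vectorizer preprocessing mentioned in the introduction can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `tokenize`, `build_vocabulary`, and `count_vectorize` are hypothetical, and the tokenizer simply lowercases the text and keeps alphanumeric runs.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(emails):
    """Map every distinct token in the corpus to a column index, in sorted order."""
    tokens = {tok for text in emails for tok in tokenize(text)}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}


def count_vectorize(emails, vocabulary):
    """Turn each email into a row of word counts, one column per vocabulary token."""
    rows = []
    for text in emails:
        counts = Counter(tokenize(text))
        # Tokens that are not in the vocabulary are ignored; absent tokens count as 0.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


corpus = ["Win money now", "Meeting at noon, money later"]
vocabulary = build_vocabulary(corpus)
features = count_vectorize(corpus, vocabulary)
```

Each row of `features` is a bag-of-words count vector that a classifier can consume; word order is discarded, which is the usual trade-off of this representation.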



Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 20,2024 at 17:30:43 UTC from IEEE Xplore. Restrictions apply.
II. LITERATURE SURVEY

Khalid Iqbal et al. [1] performed a Point-Biserial correlation between each feature and the class label of the University of California Irvine (UCI) Spambase email dataset in order to choose the best features. The dataset was then used to test several distance-based, tree-based, and gradient-based algorithms, including Radial Basis Function (RBF), Artificial Neural Network (ANN), Logistic Regression (LR), and Support Vector Machine (SVM), on the extracted features.

Alanazi Rayan et al. [2] combined two machine learning techniques, random forest and decision trees, to propose a novel hybrid bagging strategy for the detection of email spam. Several subsets of the dataset are supplied to the model. During the preprocessing step, tokenization, stemming, and stop word elimination are carried out. The necessary features are then selected from the preprocessed data using correlation feature selection (CFS). The model is evaluated in terms of its recall, precision, and confusion matrix. The results showed that the hybrid bagged model-based SMD technique has a 98% accuracy rate.

Ashraf S. Mashaleha et al. [3] optimized the choice of dataset features using the Harris Hawks Optimizer (HHO) and identified spam emails with the k-Nearest Neighbors (k-NN) approach. To the authors' knowledge, this is the first time HHO has been used to identify email spam. Several comparison approaches were also modified and employed in the study. The Spambase benchmark dataset, which is frequently used to evaluate newly improved algorithms, was used for evaluation. The outcomes showed that HHO, which can reach 94.3% accuracy in the email spam domain, is effective. In the suggested HHO-KNN technique, HHO is combined with the KNN classifier to find the most suitable set of features for later predicting spam or ham emails. A solution to the feature selection (FS) problem is represented as a vector of ones and zeros, and HHO proceeds through exploration and exploitation phases.

Darshana Chaudhari et al. [4] presented a Term Frequency Inverse Document Frequency (TFIDF) technique combined with the Support Vector Machine algorithm. The outcomes are compared using the confusion matrix, accuracy, and precision. The TFIDF-based SVM system achieves an accuracy of 99.9% on training data and 98.2% on testing data. The authors advocate fully and effectively automating spam detection systems. They note that competition intensifies as online businesses become more popular, and that spam reviews are becoming harder to track as spammers grow smarter every day. Spamming tactics must be identified before counter-algorithms can be created.

Regina Eckhardt et al. [5] proposed using deep neural networks to categorize phishing emails. As phishing emails lack any identifying qualities, it has been difficult to recognize and classify them, leading to a dearth of research on the subject. Two deep neural networks were used and compared in this study: Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM), a form of RNN. RNNs are the most popular neural networks for text categorization, and CNNs have also proven capable of classifying text. In addition to adjusting hyperparameters using various activation functions and optimizers, the study compared the performance of CNN and LSTM on the basis of accuracy and ROC score. LSTM achieved greater accuracy.

Naseeb Grewal et al. [6] created a system that demonstrates how to identify spam emails using machine learning, based on the words, numbers, and characters in the emails. They investigated a number of well-known machine learning models, including Naïve Bayes, Neural Networks, K-NN, Tree, and Logistic Regression, to categorize emails.

Kanish Shah et al. [7] employed text classification, a technique for organizing and deriving insights from text data in which documents are assigned to classes according to their content. They compared Logistic Regression, Random Forest, and KNN on accuracy, recall, F1-score, support, and confusion matrix, and concluded that Logistic Regression was the best-suited model.

Thashina Sultana et al. [8] categorized mail to determine whether or not it was spam. It is difficult to identify spam manually every time, since spammers may send spam messages often. The recommended method not only detects spam by keyword but also identifies the IP address of the machine used to deliver the spam message. In this manner, the system can instantly identify a message as spam, based on the IP address, the next time one is transmitted from the same machine. The suggested model determines whether a particular message is spam using the Bayes theorem, the Naive Bayes classifier, and the sender's IP address.

Based on the words, numbers, and characters in the emails, Naseeb Grewal et al. [9] created a system that demonstrates how to identify spam emails using machine learning. They investigated a number of well-known machine learning models, including Naive Bayes, Neural Network, K-NN, Tree, and Logistic Regression, to categorize emails.

Diksha S et al. [10] took into account the fact that malware might be attached to spam messages as an executable file, a link to a malicious website, or even a link that leads nowhere. Nevertheless, existing filters are either slow or ineffective at solving the spam filtering problem. The bulk of the machine learning approaches now in use are based on either Support Vector Machines or Naive Bayes. SVM-based spam filters provide significant benefits in terms of high precision and recall rates, whereas Naive Bayes-based spam filters offer faster classification and need smaller training sets. The authors

propose a hybrid spam filtering technique that incorporates the advantages of both and is more accurate than NB and SVM used separately.

III. METHODOLOGY USED

Supervised learning techniques were used here to analyze the real-time dataset and forecast performance. Different algorithms have different biases and generalization behavior, so they often make errors in different areas of the instance space. Combining multiple algorithms, each with its own accuracy, is therefore advantageous. The incoming email from the user is first pre-processed, and text feature extraction is then done using a count vectorizer. Two models are implemented. First, ensemble learning is used to classify incoming email as spam or ham. Ensemble learning is a machine-learning technique that combines multiple models to improve the overall performance and accuracy of the prediction. An ensemble model was created by combining the output of multiple classification models, namely Multinomial Naive Bayes, Decision Trees, and SVM, through a soft voting scheme. In soft voting, the final prediction is made by taking the average of the predicted probabilities of all the individual models.

However, after applying the ensemble model to risk analysis, we discovered a misclassification problem. When one model in the ensemble performs better than the others and its output dominates the final prediction, it can lower the ensemble's overall accuracy. In our model, Multinomial Naïve Bayes was dominating all other models. To overcome this issue, a different classification algorithm that is more robust and can handle the complexity of the dataset can be used, so we used SVM. The second model, used for the binary classification of spam email as high risk or low risk, is therefore an SVM. SVMs divide the data into two classes by finding a separating hyperplane. Based on their content and information, we classified emails as high-risk or low-risk using this approach. SVMs may also be used to increase classification accuracy.

Fig. 1. System Architecture

A. Dataset Description

Two datasets are used. The first dataset is an email dataset that has two columns: text and spam. The text column holds the email's textual data, while the spam column indicates whether the email is spam: the value is 1 for spam and 0 for non-spam email. The model was originally trained using this dataset, which has 5728 rows.

The second dataset is a phishing dataset, which has 159 rows and three columns: the subject of the email, the body of the email, and the type of email, which is categorized as fraud, phishing, commercial spam, or false positive. We converted these categories into high-risk and low-risk labels, where all commercial spam and false-positive emails are low-risk and all fraud and phishing emails are high-risk.

B. Classification Algorithms

To classify incoming emails as spam or ham, an ensemble learning method combining support vector machines, Multinomial Naïve Bayes, and decision trees is used. The further classification of spam emails as high risk or low risk is implemented using a support vector machine. The algorithms are as follows.

1) Support Vector Machine: Support vector machines (SVM) are a popular machine learning algorithm that can be used for email classification tasks. In order to classify emails as spam or non-spam (ham), we train an SVM model on a labeled dataset of emails, where each email is labeled as either spam or ham.

Once the SVM model has been trained, it can be used to predict the class of new, unseen emails. The SVM model outputs a score for each email, indicating how likely it is to be spam. A threshold value is then used to classify the email as spam or ham. For example, if the SVM score is above 0.5, the email is classified as spam, and if it is below 0.5, it is classified as ham.

To further classify spam emails as high-risk or low-risk, additional features or metadata of the emails can be used. For example, features such as the sender's domain, the subject line, and the email content can be extracted, as can metadata such as the timestamp and the recipient's email address. The SVM model can then output a score for each email, indicating how likely it is to be high-risk. Based on the threshold value, the email is classified as high risk or low risk.

SVM for Email Classification

Let X be the input feature matrix, where each row $x_i$ represents an email and each column represents a feature, and let y be the corresponding labels (0 for ham, 1 for spam; for the margin constraint below they are recoded as $y_i = -1$ for ham and $y_i = +1$ for spam). We can formulate the SVM optimization problem as follows:

$$\min_{w,\,b,\,\epsilon}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i}\epsilon_i \quad \text{s.t.}\quad y_i\,(w^T x_i + b) \ge 1 - \epsilon_i,\quad \epsilon_i \ge 0,$$

where $\epsilon_i$ are slack variables. At the optimum, $\epsilon_i = \max(0,\ 1 - y_i (w^T x_i + b))$, so the problem is equivalent to minimising $\frac{1}{2}\lVert w\rVert^2 + C\sum_i \max(0,\ 1 - y_i (w^T x_i + b))$.

In this formulation, $w$ and $b$ are the weight vector and bias term of the SVM model, $C$ is the regularization parameter, and $\epsilon_i$ are the slack variables that allow some misclassifications.
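To make the soft-margin SVM objective concrete, the following sketch evaluates it for a fixed weight vector on a toy dataset. It is an illustration only, not the training procedure used in the paper: the function names are hypothetical, the numbers are made up, and labels are assumed to be coded as -1 (ham) and +1 (spam).

```python
def decision_score(w, b, x):
    """Signed score w^T x + b for a single feature vector x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def slack(w, b, x, y):
    """Hinge slack max(0, 1 - y * (w^T x + b)); zero when the margin is satisfied."""
    return max(0.0, 1.0 - y * decision_score(w, b, x))


def svm_objective(w, b, X, y, C=1.0):
    """Soft-margin objective: 0.5 * ||w||^2 + C * sum of slacks."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    return regularizer + C * sum(slack(w, b, xi, yi) for xi, yi in zip(X, y))


def predict(w, b, x):
    """Predict +1 (spam) when the score is non-negative, otherwise -1 (ham)."""
    return 1 if decision_score(w, b, x) >= 0 else -1


# Toy data: two features per email, labels in {-1, +1}.
X = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
y = [1, -1, 1]
w, b = [1.0, -1.0], 0.0
objective = svm_objective(w, b, X, y, C=1.0)
```

Here the third email is classified correctly but lies inside the margin, so it contributes a slack of 0.5; a larger C penalises such margin violations more heavily relative to the size of the weight vector.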

Further Classifying Spam Email as High Risk and Low Risk

Let X' be the additional feature matrix for the spam emails, where each row $x'_i$ represents a spam email and each column represents an additional feature, and let y' be the corresponding labels (0 for low-risk, 1 for high-risk; recoded as $y'_i = -1$ and $y'_i = +1$ for the margin constraint). We can formulate the SVM optimization problem as follows:

$$\min_{w',\,b',\,\epsilon'}\ \frac{1}{2}\lVert w'\rVert^2 + C'\sum_{i}\epsilon'_i \quad \text{s.t.}\quad y'_i\,(w'^T x'_i + b') \ge 1 - \epsilon'_i,\quad \epsilon'_i \ge 0,$$

where $\epsilon'_i$ are slack variables. Equivalently, the objective is $\frac{1}{2}\lVert w'\rVert^2 + C'\sum_i \max(0,\ 1 - y'_i (w'^T x'_i + b'))$.

In this formulation, $w'$ and $b'$ are the weight vector and bias term of the SVM model for the additional features, $C'$ is the regularization parameter, and $\epsilon'_i$ are the slack variables that allow some misclassifications.

2) Multinomial Naive Bayes: Multinomial Naive Bayes (MNB) is a variant of the Naïve Bayes algorithm used for text classification problems. This algorithm assumes that each feature (word) is independent of the others given the class label. This assumption of independence is the "naive" part of the algorithm. During training, it estimates the probability of each word given each class label by counting the number of times each word appears in emails of that class. Bayes' theorem is then used to calculate the probability of each class label given a new email.

Multinomial Naive Bayes assumes that the frequency counts of words in an email follow a multinomial distribution and uses Bayes' theorem to calculate the probability of each class (spam or ham) given the email features (word frequency counts). The probability is calculated as follows:

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)}$$

where $y$ is the class (spam or ham), $X$ is the feature vector (word frequency counts), $P(X \mid y)$ is the likelihood of the feature vector given the class, $P(y)$ is the prior probability of the class, and $P(X)$ is the marginal probability of the feature vector.

We can use a labeled dataset of emails to estimate the likelihood and prior probabilities, and then use them to classify new, unseen emails. The classification rule is:

$$\hat{y} = \arg\max_{y} P(y \mid X)$$

where $\hat{y}$ is the predicted class (spam or ham).

3) Decision Tree: A decision tree is a tree-like model of decisions and their possible consequences. Each internal node represents a test on a feature, each branch represents an outcome of the test, and each leaf node represents a class label (spam or ham). The labeled dataset of emails can be used to learn a decision tree that best separates spam and ham emails based on their features (e.g., word frequency counts). The decision tree can then be used to classify new, unseen emails by traversing the tree according to the values of their features and assigning the leaf node label as the predicted class.

A decision tree used for classification works by recursively partitioning the data into smaller and smaller subsets based on the values of their features until each subset is homogeneous with respect to the target variable. At each node, the feature that best splits the data is selected. The quality of a split is measured by impurity or entropy, which measures the degree of disorder in a set of examples. A good split reduces the impurity of the subsets, which can be measured using various metrics such as information gain or the Gini index. The entropy of a set $S$ is given by:

$$H(S) = -\sum_{c} p(c)\,\log_2 p(c)$$

where $p(c)$ is the proportion of examples in $S$ that belong to class $c$. The information gain (IG) of a feature $F$ is the difference between the entropy of the parent node $S$ and the weighted average of the entropies of the child nodes $S_1, S_2, \ldots, S_K$:

$$IG(F) = H(S) - \sum_{j=1}^{K} \frac{|S_j|}{|S|}\,H(S_j)$$

IV. RESULTS

Fig. 2. Training and Testing Accuracy of Classification into Spam and Ham

Fig. 2 compares the accuracy of SVM, Decision Tree, Multinomial Naive Bayes, and their soft-voting ensemble on a dataset. The ensemble had the highest testing accuracy of 0.989, followed by Multinomial Naive Bayes with an accuracy of 0.987, SVM with 0.981, and Decision Tree with 0.95. Multinomial Naive Bayes did not fit the training data well, whereas SVM, Decision Tree, and the ensemble model did.

In Fig. 3, SVM is used to categorize spam emails as high risk or low risk. SVM was chosen even though the accuracy of Multinomial Naive Bayes was higher, because of SVM's ability to handle complicated decision boundaries and high-dimensional feature spaces. MNB is a popular technique for text classification problems because it performs well with high-dimensional and sparse data. SVM, however, is a better option for data with complicated relationships between features. We proposed a method for further classifying spam emails into high-risk and low-risk categories. For this task, we utilized SVM, which achieved an accuracy of 75%, a precision of 1, and a recall of 0.55.
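The soft-voting scheme behind the ensemble results reported in this section averages the class probabilities predicted by the individual models and picks the class with the highest average. The sketch below illustrates this with made-up probabilities; the function name `soft_vote` and all numbers are hypothetical and do not come from the paper's experiments.

```python
def soft_vote(model_probabilities):
    """Average per-class probabilities across models and return (label, averages).

    model_probabilities is a list with one dict per model, each mapping a
    class label to that model's predicted probability for the label.
    """
    n_models = len(model_probabilities)
    classes = list(model_probabilities[0])
    averages = {
        c: sum(p[c] for p in model_probabilities) / n_models for c in classes
    }
    # The predicted label is the class with the highest averaged probability.
    label = max(averages, key=averages.get)
    return label, averages


# Illustrative outputs of three models for a single email (not measured values).
mnb = {"spam": 0.95, "ham": 0.05}
svm = {"spam": 0.40, "ham": 0.60}
tree = {"spam": 0.45, "ham": 0.55}
label, averages = soft_vote([mnb, svm, tree])
```

In this example two of the three models lean towards ham, yet the averaged spam probability is about 0.6 because one model is very confident. This is the kind of dominance by a single model that soft voting allows, and it would not occur under a simple majority (hard) vote.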

Fig. 3. Training and Testing Accuracy for Classification into High and Low Risk

V. FUTURE WORK

In the future, we can work with a large, live dataset collected directly from users. The accuracy of the models might be increased by training them on such a large dataset. We can also deploy a system in which spammers are detected and blocked from sending further emails.

VI. CONCLUSION

In this paper, we have explored the task of classifying email into spam and ham categories, as well as further classifying spam emails into high-risk and low-risk categories. The results show that our ensemble approach achieved an overall accuracy of 98.95%. Among the individual models, Multinomial Naive Bayes achieved the highest accuracy of 98.77%, followed by SVM with 98.16% and Decision Tree with 95%. These results demonstrate the effectiveness of our approach in accurately classifying emails into spam and ham categories.

REFERENCES

[1] Khalid Iqbal, Muhammad Shehrayar Khan, "Email classification analysis using machine learning techniques," Applied Computing and Informatics, 2022.
[2] Alanazi Rayan, "Analysis of e-Mail Spam Detection Using a Novel Machine Learning-Based Hybrid Bagging Technique," Computational Intelligence and Neuroscience, August 2022.
[3] Ashraf S. Mashaleha, Noor Farizah Binti Ibrahima, Mohammed Azmi, Hossam M. J. Mustafae, and Qussai M. Yaseenc, "Detecting Spam Email with Machine Learning Optimized with the Harris Hawks Optimizer (HHO) Algorithm," Procedia Computer Science, 2022.
[4] Darshana Chaudhari, Deveshri Kolambe, Rajashri Patil, and Sachin Puranik, "Email Spam Detection Using Machine Learning and Python," International Journal of Research Publication and Reviews, April 2022.
[5] Regina Eckhardt, Sikha Bagui, "Convolutional Neural Networks and Long Short Term Memory for Phishing Email Classification," International Journal of Computer Science and Information Security (IJCSIS), Vol. 19, No. 5, May 2021.
[6] Naseeb Grewal, Rahul Nijhawan, and Ankush Mittal, "Email Spam Detection Using Machine Learning and Feature Optimization Methods," Springer, First Online: 12, 2022.
[7] Kanish Shah, Henil Patel, Devanshi Sanghvi, Manan Shah, "A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification," Springer, March 2020.
[8] Thashina Sultana, K A Sapnaz, Fathima Sana, Mrs. Jamedar Najath, "Email based Spam Detection," International Journal of Engineering Research & Technology (IJERT), Vol. 9, Issue 06, June 2020.
[9] Naseeb Grewal, Rahul Nijhawan, Ankush Mittal, "Email Spam Detection Using Machine Learning and Feature Optimization Method," Springer, First Online: 12 September 2022.
[10] Diksha S, Jawale AG, Mahajan RK, Shinkar VV et al., "Hybrid spam detection using Machine Learning," International Journal of Advance Research, Ideas and Innovations in Technology, Volume 4, Issue 2, 2018.
