Classifying Email As High and Low Risk An Effective Approach To Spam Email Classification
Pimpri Chinchwad College of Engineering (PCCOE), Pune, India. Aug 18-19, 2023
Abstract—Email is one of the most widely used and popular forms of communication due to its worldwide accessibility, the relative speed at which messages can be transferred, and the low sending costs. Today, a large portion of the population depends on email, which gives spammers a great opportunity to send spam messages targeted at people's interests. The rise in email-based threats is directly attributed to flaws in email protocols and to the growth of electronic commerce and financial activity. Spam emails sneak into users' mailboxes without their consent. They consume network resources, and users need extra time to check and delete them. Spam floods inboxes with absurd emails, greatly reduces internet speed, and can steal vital information, such as contact information, from the user. In the digital age, spam email has grown to be a serious issue, and it is crucial to identify and filter spam emails in order to preserve the integrity of communication channels.
Traditional techniques of classifying email simply consist of identifying an email as spam or not spam. We introduce the concept of further classifying spam email as high risk or low risk. Ensemble learning is used for the binary classification as spam or ham; it includes three models, namely Multinomial Naïve Bayes, SVM, and Decision Tree, and gave an accuracy of 98.95%. For the further classification of spam mail into two subcategories, high risk or low risk, SVM achieved an accuracy of 84.37%.

Index Terms—svm, ensemble, multinomial naive bayes, emails, spam, high risk, low risk

I. INTRODUCTION
There are many different communication options available online. A more professional way to communicate with people is through email. You become a target once spam hits your email inbox. Most of the time, when it comes to computer security, people are the weakest link. Attackers will make repeated attempts to trick them, employing a variety of strategies to persuade them to click on things they shouldn't. If people unintentionally click a malicious link in a spam email, internal information can become public.
As email is commonly exploited as a way of scamming customers and obtaining their personal information, spam filtering has acquired importance and relevance. Each organization must put spam filtering in place. Spam filtering ensures that corporate emails operate without a hitch and are only used for their intended purpose, in addition to keeping junk out of email inboxes. Spam filtering is essentially an anti-malware tool because many email-based attacks try to deceive users into clicking on an unsafe attachment or entering their credentials, among many other ways.
In spam email classification, the incoming textual data is preprocessed using the count vectorizer method, and the ensemble learning method is applied to the data for binary classification. Financial emails are considered riskier than other emails because fraudsters can deceive people by sending alluring emails that make them click on suspicious links, which may result in financial loss for the user. Therefore, we trained a second model using SVM, which further classifies spam emails into two subcategories: high risk and low risk. Phishing and financial-related emails fall under the high-risk category, and other emails fall under the low-risk category. With this new feature, the user will be more attentive while checking their emails.
We address the following problems in this paper.
1. Most previous research papers evaluated their models on a single dataset or a small set of datasets. Our ensemble of algorithms is trained on multiple datasets, which improved the generalization ability of the model by capturing different aspects of email content or spamming behavior.
2. Most previous research papers used a limited set of features, such as the frequency of certain words, the length of the email, and the sender's email address. While these features can be useful, they may not be sufficient to capture the complexity of spam emails. Various methods are used to extract more complex features from email content.
3. The temporal dynamics of spamming techniques are captured effectively by combining the strengths of Multinomial Naïve Bayes, SVM, and Decision Tree, which track changes in spamming behaviour over time by detecting anomalies in email content or identifying emerging patterns of spamming activity.
Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 20,2024 at 17:30:43 UTC from IEEE Xplore. Restrictions apply.
provide a hybrid spam filtering technique that incorporates all of their advantages and is more accurate than NB and SVM used separately.

III. METHODOLOGY USED
Supervised learning techniques were used here to analyze the real-time dataset and forecast performance. Different algorithms have different biases and generalization behavior, so they often make errors in different areas of the instance space. Combining multiple algorithms, each with its own accuracy, is therefore advantageous. The incoming email from the user is first pre-processed, and then text feature extraction is done using a count vectorizer. Two models are implemented. First, ensemble learning is used to classify incoming email as spam or ham. It is a machine-learning technique that combines multiple models to improve the overall performance and accuracy of the prediction. An ensemble model was created by combining the output of multiple classification models, namely Multinomial Naive Bayes, Decision Trees, and SVM, through a soft voting scheme. In soft voting, the final prediction is made by taking the average of the predicted probabilities of all the individual models.
However, after applying the ensemble model to risk analysis, we observed misclassification errors. When one model in the ensemble performs better than the others and its output dominates the final prediction, it can lower the model's overall accuracy. In our model, Multinomial Naïve Bayes was dominating all other models. To overcome this issue, a single classification algorithm that is more robust and can handle the complexity of the dataset can be used instead, so we used SVM. The second model, used for the binary classification of spam email as high risk or low risk, is therefore an SVM. SVMs divide the data into two classes by finding a separating hyperplane. Using this approach, we classified emails as high-risk or low-risk based on their content and information. SVMs can also be used to increase classification accuracy.

email's textual data, while spam indicates if the email is spam or not. There is a value of 1 for spam and a value of 0 for non-spam email. The model was originally trained using this dataset, which has 5728 rows.
The second dataset is a phishing dataset, which has 159 rows and three columns: the subject of the email, the body of the email, and the type of email. The type is further categorized as fraud, phishing, commercial spam, and false positives. We converted these email categories into high-risk and low-risk labels, where all fraud and phishing emails are high-risk and all commercial spam emails are low-risk.

B. Classification Algorithms
To classify incoming emails as spam or ham, an ensemble learning method combining support vector machines, Multinomial Naïve Bayes, and decision trees is used. The further classification of spam emails as high risk or low risk is implemented using a support vector machine. The algorithms are as follows.
1) Support Vector Machine: Support vector machines (SVM) are a popular machine learning algorithm that can be used for email classification tasks. In order to classify emails as spam or non-spam (ham), we train an SVM model on a labeled dataset of emails, where each email is labeled as either spam or ham.
Once the SVM model has been trained, it can be used to predict the class of new, unseen emails. The SVM model outputs a score for each email, indicating how likely it is to be spam. A threshold value is then used to classify the email as spam or ham. For example, if the SVM score (expressed as a probability) is above 0.5, the email is classified as spam, and if it is below 0.5, it is classified as ham.
To further classify spam emails as high-risk or low-risk, additional features or metadata of the emails can be used. For example, features such as the sender's domain, the subject line, and the email content can be extracted, as can metadata such as the timestamp and the recipient's email address. The SVM model can then output a score for each email, indicating how likely it is to be high-risk. Based on the threshold value, the email is classified as high risk or low risk.
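The two-stage pipeline described in this section (count-vectorized text, a soft-voting ensemble for spam versus ham, then an SVM for high versus low risk) can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus, the labels, the names `spam_filter`, `risk_model`, and `classify`, the linear kernel, and the thresholds are all hypothetical. The sketch thresholds the ensemble's averaged probability at 0.5 and the risk SVM's raw decision value at 0, which is the natural cut-off for an uncalibrated SVM score.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus; the paper's datasets are not reproduced here.
# Each spam entry is (text, risk) with risk 1 = high risk, 0 = low risk.
spam = [
    ("win a free prize now click here", 0),
    ("urgent verify your bank account password", 1),
    ("cheap watches limited offer buy now", 0),
    ("your account is suspended confirm card details", 1),
    ("exclusive discount on shoes this weekend", 0),
    ("transfer fee required to release your lottery funds", 1),
]
ham = [
    "meeting moved to three pm tomorrow",
    "please review the attached project report",
    "lunch on friday with the team",
    "notes from the lecture are uploaded",
    "can you send me the slides from monday",
    "happy birthday see you at the party",
]

spam_texts = [text for text, _ in spam]
risk_labels = [risk for _, risk in spam]
emails = spam_texts + ham
is_spam = [1] * len(spam_texts) + [0] * len(ham)

# Stage 1: count vectorizer + soft-voting ensemble (spam vs. ham).
# Soft voting averages the class probabilities of the three models,
# so the SVM needs probability=True to expose predict_proba.
spam_filter = make_pipeline(
    CountVectorizer(),
    VotingClassifier(
        estimators=[
            ("mnb", MultinomialNB()),
            ("tree", DecisionTreeClassifier(random_state=0)),
            ("svm", SVC(kernel="linear", probability=True, random_state=0)),
        ],
        voting="soft",
    ),
)
spam_filter.fit(emails, is_spam)

# Stage 2: a single SVM trained on spam only (high vs. low risk).
risk_model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
risk_model.fit(spam_texts, risk_labels)


def classify(text):
    """Return 'ham', 'low-risk spam', or 'high-risk spam' for one email."""
    spam_probability = spam_filter.predict_proba([text])[0, 1]
    if spam_probability < 0.5:
        return "ham"
    # Signed distance to the separating hyperplane; positive means class 1.
    risk_score = risk_model.decision_function([text])[0]
    return "high-risk spam" if risk_score > 0 else "low-risk spam"


print(classify("confirm your bank password to avoid suspension"))
```

With a corpus this small the predictions are not meaningful; the point is the structure, in which the risk model only ever sees emails that the first stage has flagged as spam.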
Further Classifying Spam Email as High Risk and Low Risk
Let $X'$ be the additional feature matrix for the spam emails, where each row $x'_i$ represents a spam email and each column represents an additional feature, and let $y'$ be the corresponding labels (0 for low-risk, 1 for high-risk; mapped to $y'_i \in \{-1, +1\}$ for the margin constraints). We can formulate the SVM optimization problem as follows:

$$\min_{w',\, b',\, \epsilon'} \ \frac{1}{2}\,\|w'\|^2 + C' \sum_i \epsilon'_i \quad \text{s.t.} \quad y'_i\,(w'^{T} x'_i + b') \ge 1 - \epsilon'_i, \quad \epsilon'_i \ge 0.$$

At the optimum, $\epsilon'_i = \max(0,\ 1 - y'_i\,(w'^{T} x'_i + b'))$, so the problem is equivalent to minimizing the hinge-loss objective $\frac{1}{2}\|w'\|^2 + C' \sum_i \max(0,\ 1 - y'_i\,(w'^{T} x'_i + b'))$.
In this formulation, $w'$ and $b'$ are the weight vector and bias term of the SVM model for the additional features, $C'$ is the regularization parameter, and $\epsilon'_i$ are the slack variables that allow some misclassifications.

by recursively partitioning the data into smaller and smaller subsets based on the values of their features until each subset is homogeneous with respect to the target variable. In a decision tree, the feature used at each node is the one that best splits the data. The quality of a split is measured by impurity, such as entropy, which measures the degree of disorder in a set of examples. A good split reduces the impurity of the subsets, which can be measured using metrics such as information gain or the Gini index. The entropy of a set $S$ is given by:

$$H(S) = -\sum_{c} p(c)\,\log_2 p(c)$$

where $p(c)$ is the proportion of examples in $S$ that belong to class $c$.
The information gain (IG) of a feature $F$ is the difference between the entropy of the parent node $S$ and the weighted average of the entropies of the child nodes $S_1, S_2, \ldots, S_K$:

$$IG(F) = H(S) - \sum_{j=1}^{K} \frac{|S_j|}{|S|}\, H(S_j)$$
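As a worked illustration of the entropy and information gain definitions above, the following sketch computes both for a made-up set of six emails. The labels and the two binary features (`contains_free`, `has_attachment`) are hypothetical and chosen only so that one feature separates the classes perfectly and the other barely helps.

```python
import math
from collections import Counter


def entropy(labels):
    """H(S) = sum over classes of p(c) * log2(1 / p(c)), i.e. -sum p log2 p."""
    total = len(labels)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )


def information_gain(labels, feature_values):
    """IG(F) = H(S) minus the size-weighted entropy of the child subsets."""
    total = len(labels)
    children = {}
    for value, label in zip(feature_values, labels):
        children.setdefault(value, []).append(label)
    weighted_child_entropy = sum(
        (len(subset) / total) * entropy(subset) for subset in children.values()
    )
    return entropy(labels) - weighted_child_entropy


# Hypothetical toy data: three spam and three ham emails.
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
contains_free = [1, 1, 1, 0, 0, 0]   # splits the classes perfectly
has_attachment = [1, 0, 1, 0, 1, 0]  # each child is still a 2:1 mix

print(entropy(labels))                                     # 1.0
print(information_gain(labels, contains_free))             # 1.0
print(round(information_gain(labels, has_attachment), 4))  # 0.0817
```

A tree builder following this criterion would split on `contains_free` first, because it removes all of the parent node's entropy, while `has_attachment` leaves both children almost as mixed as the parent.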
Fig. 3. Training and Testing Accuracy for classification into High and Low
Risk
V. FUTURE WORK
In future work, we can use a large, live dataset collected directly from users. Providing the models with a much larger dataset might increase their accuracy. We can also deploy a system in which spammers are detected and blocked from sending further emails.
VI. CONCLUSION
In this paper, we have explored the task of classifying email into spam and ham categories, as well as further classifying spam emails into high-risk and low-risk categories. The results show that our ensemble approach achieved an overall accuracy of 98.95%. Among the individual models, Multinomial Naive Bayes achieved the highest accuracy of 98.77%, followed by SVM with 98.16% and Decision Tree with 95%. These results demonstrate the effectiveness of our approach in accurately classifying emails into spam and ham categories.