
Majority Voting Technique to classify emails as Spam or Ham

Dataset: Kaggle Repository


Sohan Narasimhan - D18129634
TU059 - MSc in Computer Science (Data Analytics) - Technological University Dublin

1 Background, Context and Scope

Email filtering is the processing of emails to organize them according to specified criteria (“Email Filtering”, 2020). A spam filter is a program used to detect unsolicited emails and prevent them from getting into a user’s inbox (Margaret, 2006). Emails are widely used for communication, but spammers use them to create problems for email users. Emails that contain undesirable information are called spam. Spam emails are reported to constitute 75-80% of global email traffic (Borde et al., 2017). Several anti-spam techniques have been developed, but they too have limitations. This study primarily focuses on classifying emails as Spam or Ham and on increasing the accuracy and precision of the classifier, considering only the textual content of English-language emails without any attachments. Ensemble learning, the process of combining multiple classifiers to solve a problem, is used primarily to improve model performance. A bag-of-words representation describes the occurrence of words within a text by building a word vocabulary and measuring the presence of known words. A majority voting algorithm is used to create the ensemble: the classifier produces its final prediction from the majority of the votes cast by the individual classifiers. This majority voting is expected to improve model performance.

2 Problem Description

Spam emails are a major trouble that most email users are facing, right from a person to an organization. Although users have not subscribed to a service, they receive emails from that system without their prior consent. Often, one kind of email is ham for one person, and an identical email may be spam for others. Thus, user-based email filtering is necessary, but it is a difficult task. Spam emails are considered a concern because they waste a large amount of network resources, affect most users’ daily routine, consume a lot of user time in filtering, and expose users to disclosure of private information, malware intrusion, and phishing attacks. Addressing this problem is therefore critical, because many companies need to spend a large amount of time, effort, and money on filtering emails. Although human resources can be used to process these emails, the chance of human error may be high, as some spam emails are disguised to look like legitimate email. Handling these emails not only helps to keep spam out of email inboxes, but also helps maintain the quality of business email so that it runs smoothly and is used only for its intended purpose.

2.1 Approaches to solve the problem
Spam email is a big hassle faced by most users. Much work has long been done on email to lessen these unsolicited mails. Dada et al. (2019) surveyed and presented a detailed review exploring the gaps in previous research, in order to frame email spam detection and filtering in a logical, theoretically grounded manner and to facilitate the introduction of spam filtering techniques that operate efficiently. Though some filters are good enough in serving this purpose, they still often produce false negatives. In this context, Wakchaure et al. (2017) proposed that training the filter on domain user preferences reduces the false negatives in the user inbox. Many works have employed the Naïve Bayes classifier for this purpose and have reported that it outperforms other classifiers (Androutsopoulos et al., 2000; Rathod & Pattewar, 2015; Vira et al., 2012; Abraham et al., 2019). Edstrom (2016) employed a deep learning algorithm, an ANN, for filtering emails. State-of-the-art techniques such as Naïve Bayes, SVM, KNN, random forests, and decision trees have been implemented in this area, as have unsupervised and deep learning algorithms. Despite all these works, numerous gaps remain that should be addressed.

Figure 1: Email Classification (K, 2019)

2.2 Gaps in Research

Some works are confined to filtering emails with textual content only. But emails also contain attachments like images, videos, audio files, documents, and URLs, which could also be spam (Dada et al., 2019; Androutsopoulos et al., 2000; Rathod & Pattewar, 2015; Vira et al., 2012; Abraham et al., 2019; Teli et al., 2014; A. Cohen et al., 2018; A. Yasin & Abuhasan, 2016; Bhuiyan et al., 2018; Feng et al., 2016; Li & Li, 2015; S et al., 2014; F. Yasin Adwan & Abuhasan, 2016; Wang et al., 2015; Y. Cohen et al., 2018; Hossain et al., 2019). There is a need to explain what viruses are and how to filter emails containing them (Teli et al., 2014). Spammers also embed text in images to bypass spam filters. Though considerable work has been done to prevent image spamming, most of it is restricted to a specific image format such as JPEG, PNG, or GIF (A. Cohen et al., 2018). Although images can be processed to detect spam, it is difficult to detect spam in a captcha image; hence very limited work has been done on it (Harisinghaney et al., 2014).

Also, much research is restricted to one type of text font style (Harisinghaney et al., 2014). There is a need for a filter that can filter emails with text in any font style. Many works explain filtering of emails that are only in the English language (Teli et al., 2014; Vira et al., 2012; Hossain et al., 2019). Since emails can be in languages other than English, multilingual email filtering is necessary (F. Yasin Adwan & Abuhasan, 2016; Kaya & Ertuğrul, 2016; Ahsan et al., 2016; Rajalingam et al., 2016). Attached documents could be in any format, such as .pdf, .docx, or .txt, but the research done so far explains filtering based on only one of these file formats and not all (A. Cohen et al., 2018).

Although many spam filters are capable of successfully filtering emails, they still produce many false positives. Though some researchers explain how to reduce the false positive rate, there is no filter that can filter emails without producing false positives (Dada et al., 2019; Edstrom, 2016). Since an email of one type could be spam for one person and an identical email could be ham for others, there is a need to develop a filter that considers users' domain preferences to filter emails as spam or ham. Also, many researchers consider only the email body and subject for spam filtering, but since header information also contains information pertaining to spam, it also needs to be considered (Li & Li, 2015; Nizamani et al., 2014; Radhakrishnan & V, 2017; Trivedi & Panigrahi, 2018).

Some filtering techniques may perform very well but consume a lot of system resources and computation time (Peng et al., 2018; Dada et al., 2019; Form et al., 2015; Trivedi, 2016; Dedeturk & Akay, 2020). Feature selection, which is the most important step for spam filtering, is not performed or is performed poorly in some research (Wijaya & Bisri, 2016; Dada et al., 2019; Buber et al., 2017). Although some filters can adapt to text modifications, emails encrypted in HTML/XML to bypass the spam filter should also be addressed and filtered (Peng et al., 2018; Form et al., 2015). Many spam filters are unable to learn incrementally in real time (Dada et al., 2019).

3 Research Question

“To what extent can a majority voting classifier improve the performance of the classifier compared to a single Naïve Bayes, SVM and AdaBoost in classifying emails as Spam or Ham?”

4 Hypothesis

4.1 Null Hypothesis (H0):

If a majority voting classifier is used instead of a single Naïve Bayes, SVM, and AdaBoost to classify emails as Spam or Ham, then the classifier accuracy and precision are not significantly increased.

4.2 Alternate Hypothesis (H1):

If a majority voting classifier is used instead of a single Naïve Bayes, SVM, and AdaBoost to classify emails as Spam or Ham, then the classifier accuracy and precision are significantly increased.
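Both hypotheses hinge on whether the gain is statistically significant, but the proposal does not name a test. As one possible choice (an assumption, not the author's stated method), McNemar's exact test compares two classifiers evaluated on the same test emails by looking only at the emails where exactly one of them is correct. The sketch below uses only the Python standard library; the function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test for two classifiers on one test set.

    Returns (b, c, p_value), where
      b = emails classifier A labels correctly and classifier B does not,
      c = emails classifier B labels correctly and classifier A does not.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        # The two classifiers are right and wrong on exactly the same emails.
        return b, c, 1.0
    # Under H0 the discordant emails split 50/50, so the smaller count
    # follows a Binomial(n, 0.5) distribution; double the lower tail.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Here classifier A could be one individual model and classifier B the majority voting classifier. A p-value below the chosen significance level (for example 0.05) would count as evidence against H0 for that pair. The test is two-sided, so the direction of the difference still has to be read from `b` and `c`.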
4.3 Research Objectives:

• To observe the influence of spam messages on the e-mail system.

• To develop an e-mail classifier for classifying legitimate and spam mails from incoming e-mails.

• To improve the performance of the classifier.

• To demonstrate the proposed classifier’s efficacy compared with the current techniques.
4.4 Methodology:

This research is secondary research because the experiment is performed on existing email data and the data is not newly collected. Since both quantitative and qualitative methods are used for data collection and analysis, it is mixed research. Initially, the email data, which is qualitative, is converted to quantitative form by finding the frequency of occurrence of words in each email, and the experiment is then performed. Hence, it is a sequential exploratory design. It is a form of empirical research because first a research question is defined, then a hypothesis is framed, then an experiment is conducted, and finally observations are recorded and evaluated; based on this evidence, the hypothesis is accepted or rejected. An inductive approach is used: email data is obtained first, experiments are run on this data, and then observations are recorded.

5 Design and Implementation

This work considers the textual content of English-language email data without any form of attachment. The proposed system works on the information present in both the header and the body of an email by analysing the contents of emails. All the work is carried out using a tool named WEKA. The method used for implementation is described in Figure 2.
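The core of this design, combining the predictions of three base classifiers by majority vote and scoring the result with accuracy and precision, can be illustrated independently of WEKA. The following is a minimal sketch in Python under stated assumptions: the function names are hypothetical, the label "spam" is treated as the positive class, and the classifiers themselves are not implemented, only the voting and scoring steps applied to their predicted labels.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list by hard majority vote.

    `predictions` is a list such as [nb_labels, svm_labels, ada_labels];
    all lists must have the same length. With three voters and two classes
    a tie cannot occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy_precision(y_true, y_pred, positive="spam"):
    """Return (accuracy, precision) from the confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision is undefined when nothing is predicted positive; report 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision
```

For example, if the three models predict `["spam", "ham"]`, `["spam", "spam"]`, and `["ham", "ham"]` for two emails, `majority_vote` returns `["spam", "ham"]`. The same `accuracy_precision` function can then be applied to each individual model and to the voted labels so that the figures are computed identically for all four.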
5.1 Dataset:

The dataset used for this work is the well-known SpamAssassin email data, taken from the Kaggle repository. The dataset contains emails in textual format, which is a qualitative type of nominal data. Each email contains a header section and a body. The dataset contains emails of different domains, with 2551 Ham emails and 501 Spam emails. Some of the contents of the email data are:

From: Sender’s email address.
To: Receiver’s email address.
Date: The date the email was sent to the recipient.
Received: Information about intermediate servers and the date the email is processed.
Subject: The subject of the message specified by the sender.
Message Id: Unique id of the message.
Content type: The type of the content present in an email.
Message body: The message that the sender sends to the recipient.

Figure 2: Experiment Design

This email data, which has the .txt file extension, is converted to .arff, WEKA's native data file format.
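The header fields and body listed above can be pulled out of a raw email file before any conversion step. The sketch below is only an illustration using Python's standard `email` package, not part of the WEKA workflow described here; the function name `extract_fields` and the sample message are hypothetical.

```python
from email import message_from_string

# Hypothetical sample message in the plain-text layout described above.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 02 Sep 2002 10:15:00 +0000
Subject: Meeting notes
Message-ID: <123@example.com>
Content-Type: text/plain

Please find the notes from the meeting below.
"""


def extract_fields(raw):
    """Parse one raw email and return its main header fields and body."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "date": msg["Date"],
        # An email may carry several Received headers, one per server hop.
        "received": msg.get_all("Received", []),
        "subject": msg["Subject"],
        "message_id": msg["Message-ID"],
        "content_type": msg.get_content_type(),
        # For a non-multipart message the payload is the body text.
        "body": msg.get_payload(),
    }
```

Calling `extract_fields(RAW_EMAIL)` returns a dictionary whose entries correspond to the fields listed above; multipart messages with attachments would need extra handling, which is outside the scope of this work.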
5.2 Pre-processing:
The emails are taken as input and pre-processed. Pre-processing is performed using the Preprocess tab in the WEKA tool. Some of the pre-processing steps are:

Conversion to lower case: All the characters in emails are converted to lower case to maintain uniformity in the typecase.

Stopword removal: All the articles, prepositions, conjunctions, and most common words are removed.

Duplicate words removal: Redundant words in the content of the text are removed.

Tokenization: The content of the text is divided into strings of words. This division is achieved using white spaces or blank spaces and punctuation in a sentence.

Word Vectorization: The attributes represent the relative frequencies of various important words and characters in emails. These are converted to Boolean values, i.e., 1 if the word or character is present in the email, 0 otherwise. As a result, all the numeric frequency attributes are converted to Booleans. Each email is then represented as a vector with one dimension per word, indicating whether that word exists in the email or not. This is called the bag-of-words representation.

5.3 Feature Selection:

Feature selection is performed using the Information Gain Based Feature Selection method to obtain the list of most significant attributes.

5.4 Train-Test Split:

The data is split into training and testing sets. We use 70% of the data for training and 30% of the data for testing.

5.5 Models:

First, models like Naïve Bayes, SVM and AdaBoost are implemented individually and their performance is evaluated on the test data. Next, the three models are combined using the majority voting ensemble algorithm and the ensemble is evaluated on the test data. Finally, performance is compared between the majority voting classifier and the three individual models.

6 Evaluation of designed solution

Performance metrics like Accuracy and Precision are used to evaluate the model performance.

Accuracy: The accuracy of a classification model is the proportion of correct predictions among all predictions made by the model.
Accuracy = (TP + TN)/(TP + TN + FP + FN)

Precision: The number of correctly predicted positive values out of the total number of predicted positive values.
Precision = TP/(TP + FP)

where,
True Positive (TP): Actual value is positive, and the predicted value is also positive.
False Negative (FN): Actual value is positive, and the predicted value is negative.
True Negative (TN): Actual value is negative, and the predicted value is also negative.
False Positive (FP): Actual value is negative, and the predicted value is positive.

Assessment is performed in three steps. First, the accuracy and precision of the models are obtained when Naïve Bayes, SVM, and AdaBoost are implemented individually. Second, they are obtained when the majority voting classifier is implemented. Third, the findings of both are compared and analysed. The precision and accuracy of the majority voting classifier are assumed to be higher than those of the individual models. If this assumption is supported, the alternate hypothesis is accepted. Otherwise, we will have evidence not to reject the null hypothesis.

Our results can be related to the research question, as they can demonstrate whether model performance is enhanced by using a majority voting classifier to achieve greater accuracy and precision. The findings provide an overview of how a majority voting classifier enhances model efficiency in filtering spam emails against Naïve Bayes, SVM and AdaBoost individually. Statistical analysis indicates the magnitude of the difference in results between the majority voting classifier and the individual models. With the help of this distinction we can say whether or not our proposed method has improved model performance.

References

Abraham, A., P, G., Akhila.R., Kanjamala, Elice, & Thomas, M., Elsa. (2019). Email security classification of imbalanced data using naive bayes classifier. In International journal of wireless communications and networking technologies (Vol. 8, p. 16-20). Retrieved from https://doi.org/10.30534/ijwcnt/2019/04832019

Ahsan, N. I., M, Nahian, T., Kafi, A., Abdullah, Hossain, I., Md, & Shah, M., Faisal. (2016). An ensemble approach to detect review spam using hybrid machine learning technique. In 2016 19th international conference on computer and information technology (iccit) (p. 384-388). doi: 10.1109/iccitechn.2016.7860229
Androutsopoulos, I., Koutsias, J., Chandrinos, V., Konstantinos, Paliouras, G., & Spyropoulos, D., Constantine. (2000). An evaluation of naive bayesian anti-spam filtering. In Proceedings of the workshop on machine learning in the new information age. 11th european conference on machine learning (p. 9-17). Retrieved from https://arxiv.org/pdf/cs/0006013.pdf

Bhuiyan, H., Ashiquzzaman, A., Juthi, I., Tamanna, Biswas, S., & Ara, J. (2018). A survey of existing e-mail spam filtering methods considering machine learning techniques. In Global journal of computer science and technology (Vol. 18, p. 20-29). Retrieved from https://computerresearch.org/index.php/computer/article/view/1705

Borde, S., Agrawal, M., Utkarsh, Bilay, S., Viraj, & Dogra, M., Nilesh. (2017). Supervised machine learning techniques for spam email detection. In International journal for science and advance research in technology (ijsart) (Vol. 3, p. 760-764). Retrieved from https://www.academia.edu/32553348/

Buber, E., Demir, & Sahingoz, K., Ozgur. (2017). Feature selections for the machine learning based detection of phishing websites. In 2017 international artificial intelligence and data processing symposium (idap). doi: 10.1109/idap.2017.8090317

Cohen, A., Nissim, N., & Elovici, Y. (2018). Novel set of general descriptive features for enhanced detection of malicious emails using machine learning methods. In Expert systems with applications (Vol. 110, p. 143-169). doi: 10.1016/j.eswa.2018.05.031

Cohen, Y., Hendler, D., & Rubin, A. (2018). Detection of malicious webmail attachments based on propagation patterns. In Knowledge-based systems (Vol. 141, p. 67-79). doi: 10.1016/j.knosys.2017.11.011

Dada, G., Emmanuel, Bassi, S., Joseph, Chiroma, H., Abdulhamid, M., Shafi’i, Adetunmbi, O., Adebayo, & Ajibuwa, E., Opeyemi. (2019). Machine learning for email spam filtering: Review, approaches and open research problems. In Heliyon (Vol. 5). doi: 10.1016/j.heliyon.2019.e01802

Dedeturk, K., Bilge, & Akay, B. (2020). Spam filtering using a logistic regression model trained by an artificial bee colony algorithm. In Applied soft computing (Vol. 91). doi: 10.1016/j.asoc.2020.106229

Edstrom, A. (2016). Detecting spam with artificial neural networks. In Introduction to artificial neural networks and fuzzy systems (pp. 2672–2680). Retrieved from https://pdfs.semanticscholar.org/deb3/bf26dd1faaf056a245f12ea9404afb16542a.pdf

Email filtering. (2020). Retrieved from https://en.wikipedia.org/wiki/Email_filtering

Feng, W., Sun, J., Zhang, L., Cao, C., & Yang, Q. (2016). A support vector machine based naive bayes algorithm for spam filtering. In 2016 ieee 35th international performance computing and communications conference (ipccc). doi: 10.1109/pccc.2016.7820655

Form, M., Lew, Chiew, L., Kang, Sze, N., San, & Tiong, K., Wei. (2015). Phishing email detection technique by using hybrid features. In 2015 9th international conference on it in asia (cita). doi: 10.1109/cita.2015.73498185

Harisinghaney, A., Dixit, A., Gupta, S., & Arora, A. (2014). Text and image based spam email classification using knn, naïve bayes and reverse dbscan algorithm. In 2014 international conference on reliability optimization and information technology (icroit) (p. 153-155). doi: 10.1109/icroit.2014.6798302

Hossain, S., Md, Zubair, M., Rahman, O., Mohammad, Patwary, K. H., Muhammad, & Rajib, G. S., Md. (2019). A modified naïve bayesian-based spam filter using support vector machine. In 2019 1st international conference on advances in science, engineering and robotics technology (icasert). doi: 10.1109/icasert.2019.8934629

K, K., Naveen. (2019). Naive bayes: Spam detection. Retrieved from https://medium.com/@naveeen.kumar.k/naive-bayes-spam-detection-7d087cc96d9d

Kaya, Y., & Ertuğrul, F., Ömer. (2016). A novel approach for spam email detection based on shifted binary patterns. In Security and communication networks (Vol. 9, p. 1216-1225). doi: 10.1002/sec.1412

Li, L., & Li, C. (2015). Research and improvement of a spam filter based on naive bayes. In 2015 7th international conference on intelligent human-machine systems and cybernetics. doi: 10.1109/ihmsc.2015.208

Margaret, R. (2006). What is spam filter. Retrieved from https://searchsecurity.techtarget.com/definition/spam-filter

Nizamani, S., Memon, N., Glasdam, M., & Nguyen, D., Dong. (2014). Detection of fraudulent emails by employing advanced feature abundance. In Egyptian informatics journal (Vol. 15, p. 169-174). doi: 10.1016/j.eij.2014.07.002

Peng, W., Huang, L., Jia, J., & Ingram, E. (2018). Enhancing the naive bayes spam filter through intelligent text modification detection. In 2018 17th ieee international conference on trust, security and privacy in computing and communications/ 12th ieee international conference on big data science and engineering (trustcom/bigdatase). doi: 10.1109/trustcom/bigdatase.2018.00122

Radhakrishnan, A., & V, V. (2017). Email classification using machine learning algorithms. In International journal of engineering and technology (Vol. 9, p. 335-340). doi: 10.21817/ijet/2017/v9i1/170902310

Rajalingam, M., Raman, V., & Sumari, P. (2016). Implementation of vocabulary-based classification for spam filtering. In 2016 international conference on computing technologies and intelligent data engineering (icctide’16). doi: 10.1109/icctide.2016.7725344

Rathod, B., Sunil, & Pattewar, M., Tareek. (2015). Content based spam detection in email using bayesian classifier. In 2015 international conference on communications and signal processing (iccsp). doi: 10.1109/iccsp.2015.7322709

S, S., Thomas, R., & C, S., Emilin. (2014). Spam email detection using structural features. In International journal of computer applications (Vol. 89, p. 38-41).

Teli, P., Savita, Biradar, k., Santosh, & Patil, Y., Deepak. (2014). Effective email classification for spam and non-spam. In International journal of advanced research in computer and software engineering (Vol. 4, p. 273-278). Retrieved from https://pdfs.semanticscholar.org/d496/77d3f405ea581189967c16edb83a22bff833.pdf

Trivedi, K., Shrawan. (2016). A study of machine learning classifiers for spam detection. In 2016 4th international symposium on computational and business intelligence (iscbi) (p. 176-180). doi: 10.1109/iscbi.2016.7743279

Trivedi, K., Shrawan, & Panigrahi, K., Prabin. (2018). Spam classification: A comparative analysis of different boosted decision tree approaches. In Journal of systems and information technology (Vol. 20, p. 298-320). doi: 10.1108/jsit-11-2017-0105

Vira, D., Raja, P., & Gada, S. (2012). An approach to email classification using bayesian theorem. In Global journal of computer science and technology (Vol. 12, p. 45-50). Retrieved from https://computerresearch.org/index.php/computer/article/view/599

Wakchaure, L., Sushma, Pawar, D., Shailaja, Ghuge, D., Ganesh, & Shinde, B., Bipin. (2017). Overview of anti-spam filtering techniques. In International research journal of engineering and technology (irjet) (p. 429-434). Retrieved from https://www.academia.edu/33585421/Overview_of_Anti-spam_filtering_Techniques

Wang, Y., Liu, Y., Feng, L., & Zhu, X. (2015). Novel feature selection method based on harmony search for email classification. In Knowledge-based systems (Vol. 73, p. 311-323). doi: 10.1016/j.knosys.2014.10.013

Wijaya, A., & Bisri, A. (2016). Hybrid decision tree and logistic regression classifier for email spam detection. In 2016 8th international conference on information technology and electrical engineering (icitee). doi: 10.1109/iciteed.2016.7863267

Yasin, A., & Abuhasan, A. (2016). An intelligent classification model for phishing email detection. In International journal of network security & its applications (Vol. 8, p. 55-72). doi: 10.5121/ijnsa.2016.8405

Yasin, F., Adwan, & Abuhasan, A. (2016). Spam reduction by using e-mail history and authentication (sreha). In International journal of computer network and information security (Vol. 8, p. 17-22). doi: 10.5815/ijcnis.2016.07.03


7 Activities

Thirty-three papers in the field of spam email filtering are reviewed for secondary research.
The email dataset is obtained from the Kaggle repository and the environment for experimenting is set up by mid-September.
Since the data is in text format, it will be pre-processed and translated to a suitable format for use in the WEKA software, and this will be completed by the end of September.
Feature selection will be completed by mid-October.
The Naïve Bayes, SVM and AdaBoost classifiers will be applied to the email dataset and the model evaluation will be completed by mid-November.
The majority voting classifier will be developed using Naïve Bayes, SVM and AdaBoost and applied to the email data, and the model evaluation will be completed by the end of November.
The majority voting classifier will be compared with the individual models, and model performance will be assessed and concluded by the first quarter of December.
The report writing will be completed by the end of December.
8 Gantt Chart
9 Appendix
