Abstract—With the increasing usage of email, the proportion of spam is growing day by day. Spam emails have thus become a major threat that lowers the usage of email as a means of communication. Several machine learning techniques provide email spam filtering methods, such as Naive Bayes (NB), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Artificial Neural Network (ANN), and Decision Tree (DT). This paper considers different machine learning techniques for filtering spam emails, specifically Adaboost and Stochastic Gradient Descent (SGD). The R tool was used for the pre-processing stage, and Adaboost and SGD were implemented in the Orange software to build the classifiers. The experimental results showed that Adaboost and SGD provided true positive values of 100% and 98.1%, respectively, and false positive rates of 0.0% and 1.9%, respectively. The good accuracy of these algorithms and the favorable results put them among the best choices of spam filtering methods.

Index Terms—Spam Filtering, Machine Learning, R tool, Orange tool, Adaboost, Stochastic gradient descent (SGD)

I. INTRODUCTION

Emails, which are used in nearly all fields of communication, education, and manufacturing, can be categorized into ham (legitimate emails) and spam. Spam has grown into a critical threat that lowers the usage of email as a means of communication. Spam consumes network bandwidth and server storage space, thus slowing down email servers, and provides a medium for harmful and/or insulting material [1]. Several critical problems are linked with the increasing volume of spam: filling users' mailboxes, wasting network resources (namely storage space and email bandwidth), consuming users' time in removing spam messages, and damaging computers and laptops through viruses [2]. The security of email systems is therefore essential in our daily lives. If spam finds a way to defeat it, it wastes resources and pollutes the email environment. Moreover, unwanted emails compromise the server's storage memory.

Section 2 discusses various techniques that were used in the spam filtering process. The previous studies conducted in this field are reviewed in Section 3, along with their methodologies, main steps, and final results. Section 4 explains the proposed method in detail, starting with the software and the tested dataset, and then thoroughly explains the spam filtering process. Namely, R and the Orange software were the tools used for the pre-processing steps and for building the classifiers, respectively. Section 5 presents the experimental results. Finally, the paper is concluded in Section 6.

II. SPAM FILTERING TECHNIQUES

A. Non-Machine Learning Spam Filtering Methods

The non-machine learning spam filtering methods can be classified into four categories, which are briefly explained below:
• List Based
List-based filtering attempts to stop spam emails by categorizing senders as spammers or trusted users, and blocking or allowing their emails accordingly [3].
• Content Based
Content-based filters evaluate the words and phrases in an email to determine whether it is spam or legitimate [3].
• Challenge/Response System
In this technique, the system blocks undesirable emails by forcing the sender to perform a task before their message can be delivered. If the task, which is the challenge, is not completed within a certain time period, the message is rejected [3].
• Collaborative Filters
This technique collects input from millions of email users around the globe. Users of these systems can flag any incoming email as legitimate or spam, and these flags are reported to a central database. After a certain number of users mark a particular email as junk, the filter automatically blocks it from reaching the rest of the community's inboxes [3].
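The four non-machine learning categories above can be illustrated with a minimal Python sketch. It is not taken from the paper or from [3]: the sender addresses, trigger phrases, report threshold, timeout, and all function and class names are hypothetical choices made only to show the decision logic of each category.

```python
from collections import Counter

# Hypothetical sender lists; the paper does not name any addresses.
BLOCKED_SENDERS = {"promo@spam.example"}
TRUSTED_SENDERS = {"alice@work.example"}

# Hypothetical trigger phrases for the content-based check.
SPAM_PHRASES = ("free money", "act now", "winner")


def list_based(sender):
    """List based: return 'block', 'allow', or None if the sender is unknown."""
    sender = sender.lower()
    if sender in BLOCKED_SENDERS:
        return "block"
    if sender in TRUSTED_SENDERS:
        return "allow"
    return None


def content_based(body, min_hits=1):
    """Content based: flag the email if it contains enough trigger phrases."""
    text = body.lower()
    hits = sum(1 for phrase in SPAM_PHRASES if phrase in text)
    return hits >= min_hits


def challenge_answered_in_time(sent_at, answered_at, timeout):
    """Challenge/response: deliver only if the task was completed in time.

    Times are plain seconds; answered_at is None if the sender never replied.
    """
    if answered_at is None:
        return False
    return (answered_at - sent_at) <= timeout


class CollaborativeFilter:
    """Collaborative: block a message once enough users have reported it."""

    def __init__(self, threshold=3):  # hypothetical report threshold
        self.threshold = threshold
        self.reports = Counter()  # stands in for the central database

    def report_spam(self, message_id):
        self.reports[message_id] += 1

    def is_blocked(self, message_id):
        return self.reports[message_id] >= self.threshold
```

In this sketch each category is an independent check; a real filter would typically combine several of them and would need far richer lists and phrase sets than the placeholders used here.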
TABLE II
EXPERIMENTAL RESULTS