
Using Adaboost and Stochastic Gradient Descent (SGD) Algorithms with R and Orange Software for Filtering E-mail Spam

Huwaida T. Elshoush
Faculty of Mathematical Sciences, University of Khartoum
Khartoum, Sudan
htelshoush@uofk.edu

Esraa A. Dinar
Faculty of Computing and Information Systems, Sudan International University
Khartoum, Sudan
esra.dinar7@gmail.com

Abstract—With the increasing usage of electronic emails, the ratio of spam is increasing day by day. Thus, spam emails have become a major threat that lowers the usage of electronic emails as a way of communication. There are several machine learning techniques that provide email spam filtering methods, such as Naive Bayes (NB), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Artificial Neural Network (ANN) and Decision Tree (DT). This paper considers different machine learning techniques to filter spam emails, specifically Adaboost and Stochastic Gradient Descent (SGD). The R tool was used for the pre-processing stage, and Adaboost and SGD were implemented in Orange software for building the classifiers. The experimental results showed that Adaboost and SGD provided true positive rates of 100% and 98.1% respectively, and false positive rates of 0.0% and 1.9% respectively. The good accuracy of these algorithms and the favorable results put them among the best choices of spam filtering methods.

Index Terms—Spam Filtering, Machine Learning, R tool, Orange tool, Adaboost, Stochastic Gradient Descent (SGD)

I. INTRODUCTION

Emails, which are used in nearly all fields of communication, education, and manufacturing, can be categorized into ham (legitimate emails) and spam. Spam emails have grown into a critical threat that lowers the usage of electronic emails as a way of communication. Spam consumes network bandwidth and server storage space, thus slowing down email servers, and provides a medium for harmful and/or insulting materials [1]. There are a number of critical troubles linked with the increasing volume of spam: stuffing users' mailboxes, wasting network resources (namely storage space and e-mail bandwidth), consuming users' time in removing all spam letters, and in addition damaging computers and laptops through viruses [2]. Security of the email system is therefore very necessary in our daily lives. If spam finds a way to conquer it, it will cause wastage of resources and also pollute the email environment. Moreover, unwanted emails compromise the server's storage memory.

Section 2 discusses the various techniques that are used in the spam filtering process. The previous studies conducted in this field are reviewed in section 3, along with their methodologies, main steps, and final results. Section 4 explains the proposed method in detail, starting with the software and the tested dataset; it then thoroughly explains the spam filtering process. Namely, R and Orange software were the tools used for the pre-processing steps and for building the classifiers, respectively. Section 5 depicts the experimental results. Finally, the paper is concluded in section 6.

II. SPAM FILTERING TECHNIQUES

A. Non-Machine Learning Spam Filtering Methods

The non-machine learning spam filtering methods can be classified into four categories, which are briefly explained below:
• List Based
List-based filtering attempts to stop spam emails by categorizing the senders as spammers or trusted users, and allowing or blocking their emails accordingly [3].
• Content Based
Content-based filters deal with emails by evaluating words and phrases to determine whether an email is spam or legitimate [3].
• Challenge/Response System
In this technique, the system blocks undesirable emails by forcing the sender to perform a task before their message can be delivered. If the task, which is the challenge, is not completed within a certain time period, the message is rejected [3].
• Collaborative Filters
This technique collects input from millions of email users around the globe. Users of these systems can mark any incoming email as legitimate or spam, and these annotations are reported to a central database. After a certain number of users mark a particular email as junk, the filter automatically blocks it from reaching the rest of the community's inboxes [3].
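The list-based and content-based ideas above are simple enough to sketch in a few lines. The following Python snippet is an illustrative toy, not part of the paper's system: the addresses, keywords, and threshold are invented for the example. It first consults sender lists, then falls back to a crude keyword count.

```python
# Toy non-ML spam filter: list-based check first, then content-based check.
# All lists and the threshold below are illustrative assumptions.

BLACKLIST = {"promo@spamdomain.example"}          # known spammers
WHITELIST = {"friend@trusted.example"}            # trusted senders
SPAM_KEYWORDS = {"winner", "free money", "click here"}

def classify(sender: str, body: str) -> str:
    if sender in WHITELIST:        # list-based: trusted sender, always deliver
        return "ham"
    if sender in BLACKLIST:        # list-based: known spammer, always block
        return "spam"
    text = body.lower()
    # content-based: count how many suspicious phrases occur in the body
    hits = sum(kw in text for kw in SPAM_KEYWORDS)
    return "spam" if hits >= 2 else "ham"          # crude fixed threshold

print(classify("promo@spamdomain.example", "hello"))                     # spam
print(classify("someone@example.com", "You are a WINNER, click here!"))  # spam
print(classify("someone@example.com", "Meeting at noon?"))               # ham
```

A real deployment would maintain the lists centrally and learn keyword weights from data, which is precisely what the machine learning techniques surveyed next do.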

978-1-7281-2952-5/19/$31.00 ©2019 IEEE 41


B. Machine Learning Techniques for Spam Filtering

Machine learning (ML) is a branch of knowledge concerned with the design and implementation of algorithms that allow computers to adjust their behavior according to data [4]. ML automatically learns to identify complex patterns and makes intelligent judgments based on data. The development of data mining applications such as classification and clustering led to the need for ML algorithms to be applied to large-scale data. The aim of ML is to resolve problems in intelligent ways by enhancing the performance of computer programs [2].

a) Commonly Used ML Techniques for Spam Filtering: First, the ML algorithms most commonly used in the field of spam filtering are explained here:
• K-Nearest Neighbor (KNN)
This algorithm makes no assumption about the distribution of the initial data, and there is no training stage [5]. It is well known for its lazy learning [5]. KNN keeps all the training data with it, and this nature of KNN is termed 'lack of generalization' [5]. This training data is used further in the testing stage. The entire training dataset is taken into consideration to reach the final decision [5]. This algorithm works under the idea of a 'characteristics vector', which is set up from the contents of all messages [5].
• Decision Tree
Decision tree is one of the data mining approaches that is based upon the tree data structure. Common statistical techniques normally can only analyze the surface of the data, whereas decision tree algorithms can discover the possible association rules among the significant attributes of the existing data [6]. Furthermore, the predicted classification of unseen data can then be obtained by comparing its attribute values against these association rules [6].
• Naïve Bayesian
The Naive Bayes classifier is a simple probabilistic classifier built on a strong independence assumption [7]. It can be trained very efficiently in a supervised learning setting [7]. The fundamental idea is to discover whether an e-mail is spam or not by looking at which terms are located in the message and which terms are missing from it [7].
• Support Vector Machine (SVM)
Support Vector Machine (SVM) is a popular category of machine learning classifiers. It originates from statistical learning theory along with the structural risk minimization principle. Due to its strength in dealing with high-dimensional data through the utilization of kernel functions, it is one of the most widely used classifiers in this area of study [8].
• Artificial Neural Network
The Artificial Neural Network (ANN) was initiated by McCulloch and Pitts in 1943 [4]. Since its launch it has been progressively used in text classification [4]. It essentially imitates the functionality of the human mind, in which neurons (nerve cells) contact each other by sending messages among themselves [4]. It represents a statistical model of these biological neurons [4]. Neural networks have a huge mapping and pattern-association ability, thus exhibiting generalization, robustness, high fault tolerance, and high-speed information processing [4].

b) Proposed ML Techniques for Spam Filtering: The two proposed ML algorithms for spam filtering, Adaboost and SGD, are introduced herein:
• Adaboost
Boosting is a popular ensemble approach that creates a powerful classifier from a number of weak classifiers. Adaboost was the earliest successful boosting algorithm that meets the needs of binary classification [9]. Adaboost can be used to increase the performance of any machine learning algorithm, and it is best used with weak learners [9]. These are models that accomplish accuracy just beyond random chance on a classification problem [9].
• Stochastic Gradient Descent (SGD)
SGD is a simple and highly effective approach to discriminative learning of linear classifiers under convex loss functions such as (linear) SVM and logistic regression [10]. It has been around in the ML community for a long time, and has received a great amount of attention just recently in the context of large-scale learning [10]. SGD has famously been applied to the large-scale and sparse machine learning problems frequently encountered in text classification and natural language processing [10].

III. RELATED WORK

Over the last decade, many researchers have come up with new techniques to enhance spam filtering. A brief analysis of these studies and their proposed approaches is given in this section.

In 2015, Li et al [11] conducted an empirical study with three different environments, namely a Research Institute, an Academic University and a Commercial Company, from their users' perspective. Five basic supervised machine learning classifiers were evaluated, which are Naive Bayes, J48, IBK, Radial Basis Function Network (RBF-Network) and Library for Support Vector Machines (Lib-SVM). The classification outcome indicates that decision trees along with support vector machines can accomplish better results than the other classifiers in that study [11].

Notice that SVM accomplished better results in that study than in ours because of the environment it was tested on, and because it works under the concept of clustering. Actually, in



that study it is noticed that the data is already divided into different groups, so it was easy to separate each cluster from the others.

Aski et al [12] in 2016 proposed machine-learning algorithms to filter spam from valid emails with low error rates and high productivity, using a multilayer perceptron model and several methods including the C4.5 decision tree classifier [12]. The results were obtained with the Waikato Environment for Knowledge Analysis software tool (WEKA), which is a highly dominant open-source and handy tool with a powerful user interface for running machine learning algorithms and pre-processing steps [12]. The results of the proposed model indicate higher efficiency than the Naïve Bayes and J48 classifier algorithms, with a low rate of false positives [12].

In 2017, Alurkar et al [13] proposed an approach that uses ML techniques to classify spam emails from ham, using parameters of the email such as the To field, From field, Message-ID, Cc/Bcc field, etc [13]. It also considers the email body with commonly used keywords and punctuation [13]. There are four stages in this technique, which are preparation of data, data analysis, assessment and deployment. Enron and UCI are the datasets that were used, and the corpus contains a total of about 0.5M messages [13]. The interface used for expressing the ML algorithms is TensorFlow, which can run on all kinds of devices ranging from phones and tablets to distributed systems [13]. This grants the proposed system greater flexibility, even though this interface has not been used much in the following years.

In 2018, Bassiouni et al [14] and Subasi et al [15] reached results showing that decision tree algorithms are successful in detecting spam emails [14]. Bassiouni et al [14] focused on decision tree algorithms such as Classification and Regression Tree (CART), C4.5, REP Tree, NB Tree, AD Tree, LAD Tree, Random Forest (RF) and Rotation Forest (RoF), where the dataset was downloaded from the UCI ML data repository. The dataset contained 4601 messages, out of which 1813 are categorized as spam emails. The experimental results show that RF and RoF achieved the highest results among the decision tree classifiers [14]. On the other hand, Subasi et al [15] focused on the random forest and random tree algorithms along with eight other algorithms. The dataset was downloaded from the UCI spambase website. Their final results show that the random forest technique accomplished the highest accuracy among the other algorithms.

Jawale et al [16] in 2018 proposed a hybrid approach that joins the support vector machine (SVM) and Naive Bayes (NB) algorithms together. The mechanism of this method is to process the training dataset with the NB algorithm, which calculates the probability for each word and message in the dataset and compares it with a threshold that classifies the data. After that, the data is processed by the SVM to improve the accuracy, calculating the feature vectors, drawing the hyperplane along these vectors and classifying the data. This approach shows that the combination has more accuracy than either algorithm implemented separately in the model.

Also in 2018, R. Mallik et al [17] used text analytics in the field of spam filtering to analyze the text incorporated within spam mails. The Naïve Bayesian classification algorithm is used for building the model, and the R tool is used for the preprocessing step. This method identified the most used and the unused subjects of the spam emails. This approach could be used with different algorithms, including hybrid algorithms, to achieve the best results. Furthermore, Orange software could be used to find the result of each algorithm in a short time. After that, it could be developed into a real-life system for organizations.

This paper considers different ML techniques to filter spam emails, specifically Adaboost and SGD, which were not used in recent work. Hereafter, the proposed method is explained.

IV. THE PROPOSED METHOD

A. Instrumentation

1) R software: R is a language and an environment for statistical computing and graphics. It is similar to the S language and environment, which was developed at Bell Laboratories; R can be considered a different implementation of S. R provides a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, etc.) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity [23].

2) Orange software: Orange is an open-source software package released under the General Public License (GPL). Versions up to 3.0 include core components in C++ with wrappers in Python, available on Github. From version 3.0 onwards, Orange uses common Python open-source libraries for scientific computing, such as numpy, scipy and scikit-learn, while its graphical user interface operates within a cross-platform framework; Orange3 has a separate Github repository. Orange is a component-based visual programming software package for data visualization, machine learning, data mining and data analysis [21] [22]. The Orange tool is easier to use and has a friendlier user interface compared to the WEKA tool.

B. Dataset

The data set used in this research is an email data set that was downloaded from https://github.com/mic0331/MITx-15.071x-The-Analytics-Edge—2016/tree/master/dataset. The Analytics Edge course was last updated on 22nd May 2019 and the file format is .csv. It is a free online course corresponding to an MIT class. There were 5728 emails in the data set, containing 1368 spam and 4360 ham emails.

C. The Proposed Spam Filtering Process

The proposed spam filtering process consists of preprocessing, feature selection, and finally building the model for each classifier. Figure 1 illustrates the data flow of the proposed model. It depends on the idea of cleaning the data set file, whether legitimate or spam email, as shown in figures 2 and 3 respectively. Pre-processing operations were performed using



R software, which performs lower-casing, stop-word removal and stemming. The result is a sparse file, shown in table 1, which presents only selected rows and columns of the sparse file. Table 1 is a file in Comma Separated Value (CSV) format. It presents each email after carrying out feature selection, in which each word is treated as a single feature and its frequency of occurrence in the spam or ham email is shown. As an example, 11 emails have been taken at random out of the 5728. After that, the sparse file is imported into the Orange software to test the classification algorithms, as illustrated in figure 4. The final result is a confusion matrix showing the performance of each classifier. In this paper, two algorithms, Adaboost and stochastic gradient descent, were tested and compared with commonly used ML techniques to assess their accuracy.

Fig. 1. The Proposed Model

TABLE I
SELECTED ROWS AND COLUMNS OF THE SPARSE FILE

file type | york | yet | yesterday | year | worldwid
1 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 3 | 0
1 | 0 | 0 | 0 | 0 | 0
1 | 0 | 1 | 4 | 0 | 0
1 | 0 | 0 | 1 | 8 | 0
1 | 0 | 1 | 6 | 0 | 0
0 | 0 | 4 | 0 | 1 | 0
0 | 0 | 0 | 0 | 2 | 0
0 | 0 | 0 | 1 | 3 | 0
0 | 0 | 0 | 0 | 0 | 0
0 | 0 | 1 | 0 | 1 | 0

The three steps of the proposed model are herein described:

a) Corpus Preprocessing: Eliminating the unnecessary information in the email enhances the classification performance. Thus, corpus preprocessing transforms the content of the email into a uniform shape that is more understandable to the ML algorithms [18] [19]. The steps are:
• Lexical Analysis (Tokenization): Headers, attachments, and HTML tags are eliminated, leaving the email body and subject line to be considered as tokens.
• Stop-word Removal: Non-informative words or symbols, e.g. 'a', 'an', 'the', '#', 'is', etc., are discarded to make the selection of candidate terms more efficient.
• Stemming: Here, the words are converted into their morphological base forms by eliminating plurals, tenses, gerund forms, prefixes and suffixes.
• Representation: The email message is converted into a specific format that can be understood by the ML algorithms.

Fig. 2. An Example of a Legitimate Email

Fig. 3. An Example of a Spam Email

b) Features Selection: This process is done through the Bag of Words (BOW) model (or vector-space model), in which the words occurring in the e-mail are treated as features. Given a set of terms T = {t1, t2, t3, ..., tn}, the bag-of-words model represents a document d as an n-dimensional feature vector x = (x1, x2, x3, ..., xn), where xi is a function of the occurrence of ti in d. It is possible to use all the features for classification. Table 1 illustrates the process by showing each word and its frequency in each spam or ham email. The value 1 in the file type column indicates a spam email, and 0 indicates a ham email [18] [19].

c) Building the Model for Classifiers: In this process, the sparse file is imported into the Orange tool as shown in figure 4; then the targeted class is selected, which is the spam column. The selected features are tested by each algorithm individually to perform the learning part on the emails. By default in the Orange tool, the learning part is performed on 40% of the total number of emails and the other 60% is kept for the testing phase, but the user can customize this. After the learning is done, it is tested and the final score is calculated.

V. EXPERIMENTAL RESULTS

A. Confusion Matrix

A confusion matrix (fault matrix) is a special table layout that enables visualization of the performance of an algorithm [12].
• TP (True Positive): Spam correctly detected as spam.
• FP (False Positive): Ham email predicted as spam by mistake.
• TN (True Negative): Ham email correctly predicted as ham email.



• FN (False Negative): Spam email predicted as ham email by mistake.

Fig. 4. Building the Classifiers Models Using Orange

TABLE II
EXPERIMENTAL RESULTS (ALL VALUES IN %)

Technique | Classifier accuracy | TP | FN | FP | TN
Naïve Bayes | 96.3 | 95.3 | 4.7 | 0.5 | 99.5
KNN | 100 | 100 | 0.0 | 0.0 | 100
Artificial Neural Network | 98.8 | 98.8 | 1.2 | 1.2 | 98.8
SVM | 62.3 | 55.6 | 44.4 | 16.4 | 83.6
Random Forest | 99.6 | 99.9 | 0.1 | 1.5 | 98.5
Adaboost | 100 | 100 | 0.0 | 0.0 | 100
SGD | 98.1 | 98.1 | 1.9 | 1.9 | 98.1

Fig. 5. Classifiers Test Result for Accuracy

The classifier test results of the proposed algorithms compared to the commonly used ML techniques are presented in figure 5 and table 2. They assess the accuracy of each classifier in detecting spam emails from the given data set. Adaboost and SGD, which are the proposed algorithms in this research, showed excellent results along with Naïve Bayes, KNN, ANN and Random Forest, while SVM showed the lowest result compared to the other techniques. SVM accomplishes better results when there is a clustering environment.

Figure 6 shows the results for each classifier according to its accuracy in correctly predicting spam emails as spam (TP) and falsely predicting ham as spam (FP). Figure 7 shows that Adaboost gave 100% TP and 0.0% FP, whereas SGD showed 98.1% TP and 1.9% FP, which are excellent results along with Naïve Bayes, KNN, ANN and Random Forest. On the other hand, SVM demonstrated the lowest result.

Fig. 6. Classifiers Test Result for True Positives (TP) and False Positives (FP)

B. Receiver Operating Characteristic (ROC) Analysis

The ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system, which in this case is whether the result is spam or ham [20]. The ROC plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. Figure 8 depicts the performance of the Adaboost and SGD algorithms, along with the previously used techniques, where the results are gathered near the highest TP value, which is 1, while the SVM value is near the FP rate.

VI. CONCLUSION AND FUTURE WORK

This paper focuses on using the machine learning algorithms Adaboost and Stochastic Gradient Descent (SGD) for filtering



spam emails. R software was used for preprocessing and Orange software for classification and building the model. The spam filtering process involves three phases: preprocessing, feature selection, and building the model for each classifier. The three phases were implemented using a dataset containing 5728 emails, of which 1368 are spam and 4360 are ham emails. After performing preprocessing on the dataset, the size of the data decreased significantly and the data items were grouped into meaningful abstractions. After the model was built for each classifier, the results showed that Adaboost and SGD achieved excellent results, specifically true positive values of 100% and 98.1% respectively and false positive rates of 0.0% and 1.9% respectively. Hence, the proposed algorithms Adaboost and SGD proved efficacious in filtering spam email using the Orange tool and R software.

Fig. 7. Performance of the proposed algorithms

Fig. 8. ROC Curve for spam emails

REFERENCES

[1] Q.D.Dinh, Q.A.Tran, and F.Jiang, "Automated generation of ham rules for Vietnamese spam filtering," in Computational Intelligence for Security and Defense Applications (CISDA), 2014 Seventh IEEE Symposium, pp. 1-5, 2014.
[2] N.O.F.Elssied and O.Ibrahim, "K-means clustering scheme for enhanced spam detection," Research Journal of Applied Sciences, Engineering and Technology, vol. 7, no. 10, pp. 1940-1952, 2014.
[3] A.Bhowmick and S.M.Hazarika, "Machine learning for e-mail spam filtering: Review, techniques and trends," arXiv preprint arXiv:1606.01042, 2016.
[4] T.Subramaniam, H.A.Jalab, and A.Y.Taqa, "Overview of textual anti-spam filtering techniques," International Journal of Physical Sciences, vol. 5, no. 12, pp. 1869-1882, 2010.
[5] A.G.Kakade, P.K.Kharat, A.K.Gupta, and T.Batra, "Spam filtering techniques and mapreduce with svm: A study," in Computer Aided System Engineering (APCASE), 2014 Asia-Pacific Conference on. IEEE, pp. 59-64, 2014.
[6] J.-J.Sheu, Y.-K.Chen, K.-T.Chu, J.-H.Tang, and W.-P.Yang, "An intelligent three phase spam filtering method based on decision tree data mining," Security and Communication Networks, vol. 9, no. 17, pp. 4013-4026, 2016.
[7] D.K.Renuka, T.Hamsapriya, M.R.Chakkaravarthi, and P.L.Surya, "Spam classification based on supervised learning using machine learning techniques," in Process Automation, Control and Computing (PACC), 2011 International Conference on, 2011, pp. 1-7.
[8] S.K.Trivedi and S.Dey, "Interaction between feature subset selection techniques and machine learning classifiers for detecting unsolicited emails," ACM SIGAPP Applied Computing Review, vol. 14, no. 1, pp. 53-61, 2014.
[9] R.E.Schapire, "The boosting approach to machine learning: An overview," in Nonlinear Estimation and Classification. Springer, 2003, pp. 149-171.
[10] L.Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177-186.
[11] W.Li and W.Meng, "An empirical study on email classification using supervised machine learning in real environments," in IEEE International Conference on Communications (ICC), 2015, pp. 7438-7443.
[12] A.S.Aski and N.K.Sourati, "Proposed efficient algorithm to filter spam using machine learning techniques," Pacific Science Review A: Natural Science and Engineering, vol. 18, no. 2, pp. 145-149, 2016.
[13] A.A.Alurkar, S.B.Ranade, S.V.Joshi, S.S.Ranade, P.A.Sonewar, P.N.Mahalle, et al., "A proposed data science approach for email spam classification using machine learning techniques," in Internet of Things Business Models, Users, and Networks, pp. 1-5, 2017.
[14] M.Bassiouni, M.Ali, and E.El-Dahshan, "Ham and spam e-mails classification using machine learning techniques," Journal of Applied Security Research, vol. 13, pp. 315-331, 2018.
[15] A.Subasi, S.Alzahrani, A.Aljuhani, and M.Aljedani, "Comparison of decision tree algorithms for spam e-mail filtering," in 2018 1st International Conference on Computer Applications and Information Security (ICCAIS). IEEE, pp. 1-5, 2018.
[16] D.S.Jawale, A.G.Mahajan, K.R.Shinkar, and V.V.Katdare, "Hybrid spam detection using machine learning," International Journal of Advance Research, Ideas and Innovations in Technology, vol. 4, pp. 2828-2832, 2018.
[17] R.Mallik and A.K.Sahoo, "A novel approach to spam filtering using semantic based naive Bayesian classifier in text analytics," in Emerging Technologies in Data Mining and Information Security, pp. 301-309. Springer, Singapore, 2019.
[18] I.Katakis, G.Tsoumakas, and I.Vlahavas, "Email mining: emerging techniques for email management," in Web Data Management Practices: Emerging Techniques and Technologies, pp. 219-240, 2006.
[19] L.Shi, Q.Wang, X.Ma, M.Weng, and H.Qiao, "Spam email classification using decision tree ensemble," Journal of Computational Information Systems, vol. 8, pp. 949-956, 2012.
[20] S.B.Cantor and M.W.Kattan, "Determining the area under the ROC curve for a binary diagnostic test," Medical Decision Making, vol. 20, no. 4, pp. 468-470, 2000.
[21] Bioinformatics, "Orange - Data Mining Fruitful and Fun," Orange.biolab.si, 2019 [online]. Available: https://orange.biolab.si/. [Last accessed 30-Jul-2019].
[22] Orange Data Mining [online]. Available: http://predictiveanalyticstoday.com/orange-data-mining. [Last accessed 31-Jul-2019].
[23] The R Project for Statistical Computing [online]. Available: https://www.r-project.org/. [Last accessed 30-Jul-2019].

