
25th International Conference on Information Technology (IT)

Žabljak, 16 – 20 February 2021

Decision Tree Model for Email Classification


Ivana Čavor

Abstract— In addition to its undeniable benefits, the development of the Internet has led to many undesirable security effects. Spam emails are one of the most challenging issues faced by Internet users. Spam refers to all emails of unsolicited content that arrive in a user's email box. Spam can often lead to network congestion and blocking, or even damage to the system for receiving and sending electronic messages. Thus, appropriate classification of spam email from legitimate email has become very important. This paper presents a new approach for feature selection together with the Iterative Dichotomiser 3 (ID3) algorithm, which is used to generate the decision tree for email classification. The experimental results indicate that the proposed model achieves very high accuracy.

I. INTRODUCTION

The Internet, as a "network of networks", has expanded the possibilities of communication and placement of content. The email system is one of the most effective and commonly used means of communication. Unfortunately, the continuous growth of email users has led to a massive increase in spam emails [1]. Spam emails are usually sent in bulk and do not target individual recipients. Whether commercial in nature or not, spam emails can cause serious problems in electronic communication. Spam emails produce a large amount of unwanted data and thus affect the network's capacity and usage [2]. Due to the large number of spam emails, it is difficult for users of email services to distinguish useful from unsolicited emails. Managing and filtering emails is therefore an important challenge, and the purpose of filtering is to detect and isolate spam emails.

There are two main approaches to spam detection. The first is based on email header analysis and the second on email body analysis. The email header can be considered the identification of an email because it includes fields such as From, To, Subject, CC (Carbon Copy) and BCC (Blind Carbon Copy), which largely reveal the nature of the email. Recent studies have shown that the information provided by the email header is quite important [3], [4]. Content-based filtering relies on the assumption that the body content of a spam email differs from that of a legitimate (ham) email. In recent years, a number of Machine Learning (ML) and data mining techniques have been proposed to classify email messages based on their content. Classification methods such as Naïve Bayes, Support Vector Machine, Decision Tree, Random Forest and Neural Networks are commonly used to develop efficient email classifiers [5]. For most classification problems, the process of feature extraction and selection from the email body is very important. Features or attributes play a vital role in the process of classification [6]. In this paper, semantic properties of email content are used for feature reduction and selection. In order to reduce the computational demand and to obtain accurate results, the email data is pre-processed [7], [8]. The main aim is to preserve the most important features. After feature selection, the ID3 algorithm is used to generate a decision tree that categorizes emails as spam or ham [9], [10]. The proposed approach is evaluated using accuracy and precision, and the performance of the proposed system is measured against the size of the dataset and the feature size.

This paper is organized as follows. The second section explains the proposed approach for spam detection in detail. The third section presents the result analysis. The fourth section gives the conclusion.

II. SPAM DETECTION SYSTEM

This section presents the workflow of the Spam Detection (SD) system for the classification of emails into ham and spam. The text-based email dataset is initially pre-processed for efficient feature extraction. The SD system consists of four modules: email dataset preparation, pre-processing of data, feature selection and classification. The SD process is presented in Fig. 1 and the proposed procedure is briefly explained in the sections below.

Figure 1. Spam detection process

Ivana Čavor (email: ivana.ca@ucg.ac.me)


Faculty of Maritime Studies Kotor, University of Montenegro,
Put I Bokeljske brigade 44, 85330 Kotor, Montenegro.
A. Email dataset

An email dataset is prepared for the SD system. The dataset consists of a total of 4000 emails, containing both ham and spam emails for the classification purpose: 3465 ham emails and 535 spam emails. This dataset is used for feature selection.

B. Pre-processing of dataset

The email dataset considered is raw in nature, so it needs to be pre-processed before further consideration. Initially, normalization of the text data is done. It is well known that spam mails usually contain phone numbers, email addresses, website URLs, money amounts, and a lot of whitespace and punctuation. Instead of removing these terms, for each training example the terms are replaced with a specific string as follows:

1. Replace email addresses with 'emailaddr'
2. Replace URLs with 'httpaddr'
3. Replace money symbols with 'moneysymb'
4. Replace phone numbers with 'phonenumbr'
5. Replace numbers with 'numbr'

Punctuation is also removed from the text and all whitespace (spaces, line breaks, tabs) is replaced with a single space. The entire dataset is also lowercased. The sentences are split into words known as tokens; the string of text representing an email is tokenized in order to identify the candidate words to be adopted as relevant spam or ham terms. From the tokenized words, stop words are removed. Stop words are unwanted words having no linguistic meaning, and stop-word removal involves removing frequently used non-informative words, e.g. 'a', 'an', 'the' and 'is'. The fourth step in the pre-processing module is stemming [11]. Stemming is the process of converting words to their morphological base forms, mainly eliminating plurals, tenses, gerund forms, prefixes and suffixes. For stemming, Porter's algorithm has been used. Stop-word removal and stemming are important steps in the pre-processing phase as they help to reduce the search space for efficient feature extraction and selection. A sketch of this pipeline is given below.
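The following Python sketch illustrates one possible implementation of the pre-processing steps described above. It is a minimal sketch, not the authors' original code: the regular expressions and the helper name preprocess are illustrative assumptions, and it relies on NLTK's English stop-word list and Porter stemmer.

```python
import re
from nltk.corpus import stopwords      # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess(email_text: str) -> list[str]:
    """Normalize, tokenize, remove stop words and stem an email body."""
    text = email_text.lower()
    # Replace structured terms with placeholder strings instead of removing them.
    text = re.sub(r'\S+@\S+', 'emailaddr', text)                     # email addresses
    text = re.sub(r'https?://\S+', 'httpaddr', text)                 # URLs
    text = re.sub(r'[$€£]', 'moneysymb', text)                       # money symbols
    text = re.sub(r'\b\+?\d[\d\s\-]{7,}\d\b', 'phonenumbr', text)    # phone numbers
    text = re.sub(r'\d+', 'numbr', text)                             # remaining numbers
    # Remove punctuation and collapse all whitespace to single spaces.
    text = re.sub(r'[^\w\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    # Tokenize, drop stop words and stem what remains.
    return [STEMMER.stem(tok) for tok in text.split() if tok not in STOP_WORDS]

print(preprocess("FREE entry! Call +381 63 1234567 or visit http://win.example.com now!!!"))
```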
C. Feature extraction and selection

In this process the emails are analyzed to find the features (words) that are most useful for classification. Features play an important role in any classification system. The SD system works on the assumption that spam mail differs from ham mail in terms of its content. The main idea is to find the words that occur frequently in the dataset, or the words that hold relatively higher importance in understanding the class of an email. In the case of identifying an email as spam, the task is to understand whether there are any specific words or sequences of words that determine whether an email is spam or not. For this purpose, the Term Frequency (TF) method is used. TF can be defined as a numerical statistic intended to reflect how important a word is to a document in a corpus. The TF value is directly proportional to the number of times a word appears in a document. Words like 'free', 'txt' and 'congratulations' are good indicators of spam and have large TF weights. Fig. 2 illustrates a word cloud of common words in spam email: the larger a word appears, the more often it has been found in spam email.

Figure 2. Visual representation of important words for spam email

The TF method is used to represent the text data for the ML algorithm. Such a representation is needed because it is hard to do computation on textual data directly. Accordingly, the frequency of all words in the pre-processed spam dataset is calculated and the twenty most frequent spam words are selected as features. Next, the occurrence of each feature in an email is mapped into the feature matrix shown in Table 1. In order to enhance the ML algorithm accuracy, one more feature is added, representing the total number of important spam words in a specific email. The experimental results indicate that this feature has the biggest effect on the appropriate classification decision.
TABLE I
FEATURE MATRIX: EACH ROW REPRESENTS AN EMAIL WITH THE FEATURES PRESENTED IN COLUMNS

EMAIL     Numbr  Call  Txt  Free  Claim  Httpaddr  Moneysymb  Total_spam_words  DECISION/CLASS
Email_1   0      1     0    0     0      0         0          1                 Ham
Email_2   2      0     0    1     1      1         0          4                 Spam
Email_3   1      0     0    3     0      0         0          2                 Spam
Email_4   1      0     0    0     0      0         0          0                 Ham
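A row of this matrix can be built directly from the pre-processed token list. The sketch below is an illustration under stated assumptions: FEATURES is a shortened stand-in for the twenty selected spam words, and the extra Total_spam_words column is interpreted here as the number of distinct important spam words present in the email.

```python
from collections import Counter

# Illustrative subset; the paper selects the twenty most frequent spam words.
FEATURES = ['numbr', 'call', 'txt', 'free', 'claim', 'httpaddr', 'moneysymb']

def feature_vector(tokens: list[str]) -> list[int]:
    """Map one pre-processed email onto a row of the feature matrix."""
    counts = Counter(tokens)
    row = [counts[f] for f in FEATURES]
    # Extra feature: how many of the important spam words occur in this email
    # (assumed here to count distinct words, not total occurrences).
    row.append(sum(1 for f in FEATURES if counts[f] > 0))
    return row

print(feature_vector(['free', 'call', 'numbr', 'free']))  # -> [1, 1, 0, 2, 0, 0, 0, 3]
```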
D. Decision tree

A decision tree is a structure that represents a procedure for classifying objects based on their attributes. It is a tree where each node represents a feature, each branch represents a decision and each leaf represents an outcome (class or decision). Decision trees can be used to predict the class of an unknown query instance by building a model based on existing data for which the decision is known. To train a decision tree model we need a dataset consisting of a number of training examples characterized by a number of descriptive features and the class. The features can have either nominal or continuous values.

A decision tree consists of a root node, internal nodes and leaf nodes. Internal nodes represent the conditions applied to attributes or features, whereas leaf nodes represent the class. Each node typically has two or more nodes extending from it. When classifying an unknown instance, the instance is routed down the tree according to the values of the attributes in the successive nodes. The main advantage of using a decision tree is that it is easy to follow and understand. Fig. 3 presents an example of a typical decision tree. The words "free" and "money" are typical spam words and they are used as features. If the word "free" appears more than two times in an email, then the email is classified as spam. Otherwise, we ask whether the email contains the word "money": if the word "money" appears more than three times, then the email is classified as spam, otherwise it is ham.
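As a minimal illustration, the example tree from Fig. 3 corresponds to the following classification rule (a hand-written sketch of that particular tree, not the tree generated by the algorithm):

```python
def classify(word_counts: dict) -> str:
    """Classify an email using the example tree from Fig. 3."""
    if word_counts.get('free', 0) > 2:
        return 'spam'
    if word_counts.get('money', 0) > 3:
        return 'spam'
    return 'ham'

print(classify({'free': 3}))              # -> spam
print(classify({'free': 1, 'money': 5}))  # -> spam
print(classify({'money': 2}))             # -> ham
```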
Figure 3. An example of decision tree

The ID3 algorithm is based on the decision tree algorithm. ID3 builds the decision tree using entropy and information gain. Entropy measures the impurity of an arbitrary collection of samples, while information gain measures the reduction in entropy obtained by partitioning the sample according to a certain attribute. If the target attribute (class) takes on n different values, then the entropy of S relative to this n-wise classification is defined as shown in (1):

Entropy(S) = \sum_{i=1}^{n} -p_i \cdot \log_2 p_i    (1)

where p_i is the proportion (probability) of S belonging to class C_i.

Information gain is calculated to split the attributes further in the tree. The attribute with the highest information gain is always preferred first. Entropy and information gain are related by the following equation:

gain(S, A_i) = Entropy(S) - Entropy_{A_i}(S)    (2)

where Entropy_{A_i}(S) is the expected entropy if attribute A_i is used to partition the data.

The algorithm was implemented according to the following steps (a compact sketch follows the list):
1. Create a root node.
2. Calculate the entropy of the whole (sub-)dataset.
3. Calculate the information gain for each single feature and select the feature with the largest information gain.
4. Assign the (root) node the label of the feature with maximum information gain. Grow an outgoing branch for each feature value and add unlabeled nodes at the end.
5. Split the dataset along the values of the maximum-information-gain feature and remove this feature from the dataset.
6. For each sub-dataset, repeat steps 3 to 5 until a stopping criterion is satisfied.
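The sketch below shows how entropy (1), information gain (2) and the recursive split in steps 1-6 might be implemented. It is an assumption-laden illustration, not the authors' implementation: rows are dictionaries of nominal feature values, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i), as in (1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feature):
    """gain(S, A_i) = Entropy(S) minus expected entropy after splitting on A_i, as in (2)."""
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[feature], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part) for part in split.values())
    return entropy(labels) - remainder

def id3(rows, labels, features):
    """Recursively build the tree: pick the best feature, split, recurse (steps 1-6)."""
    if len(set(labels)) == 1:        # pure node: stop and emit a leaf
        return labels[0]
    if not features:                 # no features left: emit the majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, labels, f))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                                [f for f in features if f != best])
    return tree

# Toy usage with binarized features (word present above its threshold or not).
rows = [{'free': 1, 'money': 0}, {'free': 0, 'money': 1}, {'free': 0, 'money': 0}]
labels = ['spam', 'spam', 'ham']
print(id3(rows, labels, ['free', 'money']))
```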
Since the chosen features have continuous values, it is necessary to convert them to nominal values in order to perform a binary split. This is done using a threshold value: the threshold is the value that offers maximum information gain for that attribute. For example, the information gain is maximized when the threshold equals two for the total_spam_words feature. In fact, for most features it turns out that it is not important how many times a certain spam word occurred in an email, but whether it appeared at all. This conclusion has enabled data dimensionality reduction, since some features have no effect on the decision: a feature that has no influence on the class labels can be discarded. The feature reduction has made the data less sparse and more statistically significant for the ID3 algorithm.
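One plausible way to select such a threshold (again a sketch under assumptions, with a hypothetical best_threshold helper) is to scan the candidate cut points of a continuous feature and keep the one with the highest information gain:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Find the cut point on a continuous feature that maximizes information gain."""
    best_t, best_gain = None, -1.0
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        if not left or not right:    # a cut must leave samples on both sides
            continue
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = entropy(labels) - remainder
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# e.g. total_spam_words counts against the class labels
print(best_threshold([0, 1, 2, 4, 5], ['ham', 'ham', 'ham', 'spam', 'spam']))  # -> (2, ...)
```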
III. EXPERIMENTAL RESULTS

The efficiency of the proposed SD system is assessed by evaluating the performance parameters. Parameters such as true negative rate, false negative rate, false positive rate, precision and accuracy are calculated in order to evaluate the performance of the SD system.

Given a set of labeled data and such a predictive model, every data point lies in one of four categories:
TP (True Positive): the number of instances correctly classified to that class.
TN (True Negative): the number of instances correctly rejected from that class.
FP (False Positive): the number of instances incorrectly classified to that class.
FN (False Negative): the number of instances incorrectly rejected from that class.
These values are often presented in a confusion matrix. A confusion matrix is a summary of prediction results on a classification problem. Table 2 represents the confusion matrix for email spam classification.

TABLE II
CONFUSION MATRIX

              Predicted HAM     Predicted SPAM
Actual HAM    True Negative     False Positive
Actual SPAM   False Negative    True Positive

Accordingly, accuracy and precision can be defined as follows:

accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (3)

precision = \frac{TP}{TP + FP}    (4)

For a classifier, accuracy is defined as the number of items categorized correctly divided by the total number of items; it is the fraction of the time the classifier has made the correct decision. Precision is defined as the ratio of true positives to predicted positives; it shows how many actual spams there are among the predicted ones.

The performance of the proposed SD system is measured against the size of the dataset and the feature size. Datasets of different sizes are used for measuring the performance. For example, with 500 emails used for the training process, accuracy was 97.22% using the decision tree classifier. The decision tree classifier achieves over 97.32% classification accuracy for more than 1000 emails. Reducing the number of features also affects the accuracy: as expected, the accuracy increased as the feature size increased. The accuracy using 3 features is 97.12% and using 7 features is 97.4% for the same dataset.
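As a small illustration of how these metrics follow from the four counts (a generic sketch, with 'spam' assumed to be the positive class):

```python
def accuracy_precision(y_true, y_pred, positive='spam'):
    """Compute accuracy (3) and precision (4) from true and predicted labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, precision

y_true = ['spam', 'ham', 'spam', 'ham', 'ham']
y_pred = ['spam', 'ham', 'ham', 'spam', 'ham']
print(accuracy_precision(y_true, y_pred))  # -> (0.6, 0.5)
```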

TABLE III
CLASSIFICATION RESULTS BASED ON DATASET SIZE AND FEATURE SIZE

Dataset size Feature size Accuracy[%] Precision[%]


500 7 97.22 91.99
500 3 96.6 86.16
1500 7 97.32 92.28
1500 3 96.56 85.62
3000 7 97.2 91.52
3000 3 96.3 83.96

IV. CONCLUSION

In this paper, decision tree-based classification is employed for spam email detection. A novel approach for feature selection and reduction is also presented. It is shown that the system achieves high accuracy with a few features and a relatively small training dataset. In the near future, it is planned to incorporate other classifiers and compare their performance with the proposed approach.

REFERENCES

[1] https://www.statista.com/statistics/420391/spam-email-traffic-share, date of last access June 2nd, 2020.
[2] B. Agrawal, N. Kumar and M. Molle, "Controlling Spam E-mail at the Router", in Proceedings of the IEEE International Conference on Communications, vol. 3, pp. 1588-1592, 2005.
[3] A. S. Rajput, J. S. Sohal and V. Athavale, "Email Header Feature Extraction using Adaptive and Collaborative approach for Email Classification", in International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, vol. 8, issue 7S, May 2019.
[4] P. Kulkarni, J. R. Saini and H. Acharya, "Effect of Header-based Features on Accuracy of Classifiers for Spam Email Classification", in International Journal of Advanced Computer Science and Applications (IJACSA), vol. 11, no. 3, 2020.
[5] E. G. Dada, S. B. Joseph, H. Chiroma, S. Abdulhamid, A. Adetunmbi and O. E. Ajibuwa, "Machine learning for email spam filtering: review, approaches and open research problems", in Heliyon, June 2019.
[6] E. M. Bahgat, S. Rady, W. Gad and I. F. Moawad, "Efficient email classification approach based on semantic methods", in Ain Shams Engineering Journal, vol. 9, no. 4, pp. 3259-3269, December 2018.
[7] F. Ruskanda, "Study on the Effect of Preprocessing Methods for Spam Email Detection", in Indonesian Journal on Computing (Indo-JC), vol. 4, p. 109, March 2019.
[8] A. Sharma, Manisha, D. Manisha and D. R. Jain, "Data Pre-Processing in Spam Detection", in International Journal of Science Technology & Engineering (IJSTE), vol. 1, issue 11, May 2015.
[9] L. Shi, Q. Wang, X. Ma, M. Weng and H. Qiao, "Spam Email Classification Using Decision Tree Ensemble", in Journal of Computational Information Systems, vol. 8, March 2012.
[10] S. Balamurugan and R. Rajaram, "Suspicious E-mail Detection via Decision Tree: A Data Mining Approach", January 2007.
[11] C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.