You are on page 1of 6

Spam Filtering Email Classification (SFECM) using

Gain and Graph Mining Algorithm


M. K. Chae 1, Abeer Alsadoon1, P.W.C. Prasad1, Sasikumaran Sreedharan2
1
Charles Sturt University Study Centre, Sydney, Australia
2
Department of Computer Engineering, College of Computer Science, King Khalid University, KSA

Abstract— This paper proposes a hybrid solution of spam model [2] has been developed to better adapt at classifying
email classifier using context based email classification model as emails into homogenous groups.
main algorithm complimented by information gain calculation to
increase spam classification accuracy. Proposed solution consists This paper aims to combine spam filter which uses
of three stages email pre-processing, feature extraction and email information gain calculation and context based email
classification. Research has found that LingerIG spam filter is classification model with the aim of improving the spam email
highly effective at separating spam emails from cluster of classification accuracy to become 100%. The proposed
homogenous work emails. Also experiment result proved the solution uses spam filter to firstly filter all the spam emails
accuracy of spam filtering is 100% as recorded by the team of
from inbox. Then the context-based email classification
developers at University of Sydney. The study has shown that
implementing the spam filter in the context –based email
model can classify emails into several folders.
classification model is feasible. Experiment of the study has This paper is organized as follows: Section I and II
confirmed that spam filtering aspect of context-based presents the introduction and literature review. Section III is
classification model can be improved.
proposed solution and section IV is results and discussion.
Keywords— email classification; graph mining algorithm; Conclusion can be found in section V.
spam; email classifier
II. LITRETURE SURVEY
I. INTRODUCTION
Study of literatures regarding automated email
Email is a cost-effective method of communication classification has found there are at least four different types
commonly found in all areas of industries. Education industry of approaches to automated email classification: Traditional
is not an exception. Workforce in education industry spends approach, Ontology-based approach, Graph-mining approach,
fair amount of time in front of computer chasing up on emails. Neural-Network approach. Among many solutions proposed
This is more so with jobs that deal with high volume of emails by other researchers, Linger and context based email
each day such as administrator in education industry. classification model were notable discoveries.
Managing incoming email is a critical matter to many because
emails can herald important meetings, work messages, lunch, A. Traditional Approaches to email classification
industry related information, upcoming events which many Text classification algorithms have been adopted to email
cannot afford to miss. classification systems [3][4][5]. These includes Naïve Bayes
Also, email is a means to transfer important documents in algorithm [4] and Support Vector Machine [3] which tokenize
education agency. Often the documents contain international the email for calculation determining similarity of emails to
student’s private information and scanned copy of application either spam or other useful type of email.
to apply for admission into education institution such as Experiment conducted by Alsmadi and Alhami [3] have
Universities, TAFEs and private colleges. At present we still found that removing stop words in emails improve accuracy of
find important work related emails in spam folder. Therefore email classification. Jason D. M Rennie [4] performed email
there is still a need to improve accuracy of email classifiers classification using a Naïve Bayes algorithm in an email
using new and existing algorithms. classification system named ifile. An email classification
One possible solution to improving spam classification method named Three-Phase Tournament method devised by
algorithm is using a spam filter named LingerIG implemented Sayed et al [5] has shown very unstable accuracy ranging from
in 2003 in an email classification system named Linger [1]. 2% to 95%.
The basic principle of how this spam filter works bases on B. Ontology-based Approaches to email classification
calculating information gain. However the problem with this
solution is its accuracy in classifying non-spam emails into The template is used to format your paper and style the
folders. Out of many email learner used by Linger, at best, text. All margins, column widths, line spaces, and text fonts
Widrow-Hoff gives unstable accuracy which moves between are prescribed; please do not alter them. You may note
82.40% ~ 48.50% [1] when classifying emails into folders. peculiarities. For example, the head margin in this template
Current solution such as context based email classification measures proportionately more than is customary. This

978-1-5090-5814-3/17/$31.00 ©2017 IEEE

Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 20,2024 at 17:29:31 UTC from IEEE Xplore. Restrictions apply.
measurement and others are deliberate, using specifications before the actual email classification. Activity at each stages
that anticipate your paper as one part of the entire proceedings, of email classification is shown below in Figure 1.
and not as an independent document. Please do not revise any
A. SFECM: Components and Implementation
of the current designations.
This system consists of three stages : Email Preprocessing,
C. Graph-mining approaches to email classification.
Feature Extraction and Email Classification. The proposed
Graph-mining approaches to email classification take system runs POS Tagger on email in email preprocessing
advantage of semantic features and structure in emails by stage to turn email texts into email features. At feature
converting emails into graphs and matching template graphs extraction stage, proposed system filters Spam from a set of
with graphs made from each emails [8][9][10]. Typical graph inputted emails. Then from filtered emails, sign-off words,
mining algorithm converts emails into graphs. Substructures greeting words, keywords are extracted to form email graph.
of graphs are then extracted from graphs. Parameters prune At this stage, template graphs update using new email graphs.
substructures. Representative substructures remain. Template graphs are then ranked in email classification stage
Substructures are ranked just so that in case an email graph to be assigned to represent relevant folder. Then email graphs
matches more than two representative substructures, emails go are matched to representative template graphs and placed to
into a folder which the matched representative with higher folder of the representative template graph that graph matches
rank. most. Detailed diagram of this proposed work is presented in
Figure 2.
eMailSift is a graph mining algorithm devised by Aery and
Chakravarthy [8]. Aery and Chakravarthy have reported the
email classification accuracy increased from 80% to 95% as
the number of inputted emails increased from 60 to 370 [8].
On the contrary, a later work by Chakravarthy et al [9] named
m-InfoSift showed that email classification accuracy
decreased as number of folders increased. Accuracy of the
email classification decreased from 100% to 91% as number
of folders increased from 2 to 4 [9].
D. Current Best Selected Solution.
Graph-mining algorithm named Context-based Email
Classification System was proposed by Wasi et al [10]. It
consists of graph mining algorithm and Event Identification
System. As shown in the Table 3, Accuracy of email
classification was 80% when 300 emails were used for
training. Accuracy of email classification rose to 85% when
750 emails were used [10]. Accuracy reached 88% when 1500
emails used. Accuracy became 93% as number of training
emails counts 3000. As this result shows, it took 10 times
more emails to raise accuracy from 80% to 93% . A major
flaw in this system bases on the huge number of emails it
requires to reach accuracy of 100%. Fig. 1. Flowchart of Proposed Solution.

Also, the context-based email classification model does


not have spam filter even though the model addresses
clustering of homogenous emails into groups. Proposed work
therefore needs to address this insufficiency to improve the
model. An email classification system named Linger was
developed by Jason Clark, Irena Koprinska and Josiah Poon at
University of Sydney. Linger uses neural network [1]. Linger
uses a spam filter named LingerIG (Information Gain). As
shown in the Table 3, result of their experiment showed that
when LingerIG was used, Linger showed 100% accuracy at
spam email classification.
III. PROPOSED SOLUTION
Name of the proposed solution is Spam Filtered Email
Classification Model (SFECM). Proposed solution is based on
the Context-Based Email Classification Model to filter spam Fig. 2. Detailed Diagram of Proposed Solution.

Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 20,2024 at 17:29:31 UTC from IEEE Xplore. Restrictions apply.
B. Spam Filter: Algorithm Following diagrams are visualization of the algorithm.
The proposed solution’s algorithm at spam filtering event
is presented below:
Algorithm(1) : Proposed Spam Filter
INPUT: Test email samples (E) = {E1, E2, … , EN}

OUTPUT: Classified emails (E’)= {E1’,E2’,…,EN’}

BEGIN
Step 1: USE a set of test emails as sample emails.

Step 2: INPUT a sample email into proposed solution’s spam filter.

Step 3: Start a loop


For each email feature in EN if

Email contains email feature that matches feature in


spam feature list, increment count.
Keep counted number of matches as
numMatchSpam; number of email features in the
email that match features from spam filter’s list of
spam email features.

END FOR
Step 4: Start another loop
For each email feature in EN if
Email contains email feature that matches feature in
non-spam feature list, increment count.

Make integer variable in source code


Fig. 3. Step 1 Training Spam Filter using set of test emails
numMatchWork; number of email features in the
email that match features from spam filter’s list of
work email features.

Keep counted number of matches as


numMatchWork.

END FOR

Step 5: Calculate Entropy/Impurity


Calculate the spam email features are contained in email by
dividing number of spam email features by number of email
features in the email.
Call this impuritySpam(impSpam).

Calculate impSpam= numMatchSpam / number of Features


In Email;
Find the work email features are contained in email by using
below formula.

Call this impurityWork (impWork)


impWork = numMatchWork/ number of Features In Email;
Step 6: Move email to either spam or keep email in inbox.

If impSpam > Average Information Gain of all Spam emails


Move the email to spam folder
directory.

If impSpam > impWork

Move the email to spam folder directory.


If impWork > Average Information Gain of all non-spam
emails

Keep the email in inbox.


If impWork > impSpam

Keep the email in inbox. Fig. 4. Step 1 Make SPAM Feature List and WORK Feature List.
Step 7: End of Algorithm

Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 20,2024 at 17:29:31 UTC from IEEE Xplore. Restrictions apply.
of all Spam emails and Average Information Gain of all work
(non-spam) emails. Email features are used to populate list of
arraylist. Email features are also used to train the spam filter.
There is one possible experiment to compare the accuracy of
IG spam filter and accuracy of solution without using IG spam
filter.

Fig. 5. Step 2 & 3 & 5 Match Extracted Words to SPAM Feature List.

Fig. 7. Step 6 Move Email to SPAM Folder.

As it can be seen in the Fig 8, Fig 9, Table 1 and Table 2,


the experiment has used ten groups of emails, each group
consisting of 10 email files. Each group has five spam emails
and five work emails. Tables show number of correctly
classified spam email and work emails for both solutions.
Experiment results have been compared between the
results from current solution and proposed solution. Also the
accuracy of solution at classifying spam email and work email
has been measured by the number of emails being correctly
put into spam folder and work email folder. In order to
compare the difference that data size has on processing time,
size of email groups are listed on the column second from the
left.
As it can be seen in Fig 8 and Table 2, a comparison
between the proposed solution and existing solution shows
that the proposed solution provides absolutely 100% accuracy
when classifying spam emails from sets of emails. This result
proves that adopting spam filter into existing solution
Fig. 6. Step 4 & 5 Match Extracted Words to WORK Feature List.
reinforces its capability to accurately classify spam emails
from work emails.
IV. RESULT AND DISCUSSION Fig 8 also shows that in terms of accuracy the result of the
This section tests the implementation of the proposed experiment on classifying work (non-spam) emails from email
solution. The system has been implemented by Java project. sets are similar. They all provide 100% accuracy when
This project read each eml files into string objects so that classifying work (non-spam) emails from a mixed set of
email features can be organized into arrays. emails.

The proposed system depends on extracted email features Fig 8 shows the improvement in the accuracy of work
from eml file. Each predefined activities utilize extracted emails by adopting IG spam filter. Fig 8 shows that processing
email features such as to calculate Average Information Gain time of existing solution and proposed solution does not show

Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 20,2024 at 17:29:31 UTC from IEEE Xplore. Restrictions apply.
significant difference. Also the relationship between the data TABLE II. RESULTS OF SAMPLE EMAIL SPAM CLASSIFICATION USING
INFORMATION GAIN CALCULATION – PROPOSED SOLUTION
size and processing time does not seem to depend on each
other. Implementation Result of First Step in Proposed Solution :
Email Classification (IG Spam Classifier)
TABLE I. RESULTS OF SAMPLE EMAIL SPAM CLASSIFICATION USING
INFORMATION GAIN CALCULATION – CURRENT BEST SOLUTION
Tested Corpus Proposed Solution
Implementation Result of First Step in Current Best Solution : (eml files)
Email Classification (IG Spam Classifier)
Corpus Data Accuracy (%) Time
Name size (Sec.)
Tested Corpus Current Best Solution (MB) Spam Work
(eml files)

classified spam emails


Number of correctly

Number of spam emails

Accuracy

Classified work emails


Number of correctly
Number of work emails

Accuracy
Corpus Data Accuracy (%) Time
Name size (Sec.)
(MB) Spam Work
classified spam emails
Number of correctly

Number of spam emails

Accuracy
Classified work emails
Number of correctly
Number of work emails

Accuracy

Corpus 6.49 5 5 100 5 5 100 189.46


1

Corpus 1.57 5 5 100 5 5 100 389.22


Corpus1 6.49 4 5 80 4 5 80 195.33 2
s
Corpus2 1.57 4 5 80 3 5 60 382.56 Corpus 8.78 5 5 100 5 5 100 675.34
s 3
Corpus3 8.78 4 5 80 3 5 60 680.94
s Corpus 5.33 5 5 100 5 5 100 714.17
4
Corpus4 5.33 3 5 60 4 5 80 593.95
s
Corpus 3.45 5 5 100 5 5 100 353.93
Corpus5 3.45 3 5 60 4 5 80 345.30
5
s
Corpus 6 13.1 4 5 80 3 5 60 154.65
Corpus 13.1 5 5 100 5 5 100 197.34
s
6
Corpus 7 18.8 3 5 60 3 5 60 166.30
Corpus 18.8 5 5 100 5 5 100 198.22
s
7
Corpus 8 0.8 3 5 60 4 5 80 263.40
Corpus 0.8 5 5 100 5 5 100 159.64
s
8
Corpus 9 3.28 4 5 80 3 5 60 433.79
Corpus 3.28 5 5 100 5 5 100 121.29
s
9
Corpus 2.49 3 5 60 4 5 80 335.12
Corpus 2.49 5 5 100 5 5 100 118.60
10 s
10

Fig. 9. Result of Processing Time.

Fig. 8. Result of Accuracy – Work Emails.

Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 20,2024 at 17:29:31 UTC from IEEE Xplore. Restrictions apply.
TABLE III. COMPARATIVE STUDY – LINGER AND CONTEXT-BASED EMAIL REFERENCES
CLASSIFICATION MODEL
[1] J. Clark, I. Koprinska and J. Poon, "Linger - A Smart Personal Assistant
S.N Method Author Accuracy Processing for E-Mail Classification", in International Conference on Artificial
Name Time
Neural Networks, 2003, pp. 274–277.
[2] S. Wasi, S. Jami and Z. Shaikh, "Context-based email classification

Number of correctly

Can Manual Step be


Is this method time
model", Expert Systems, vol. 33, no. 2, pp. 129-144, 2015.

classified emails

Percentage

consuming

automated
Accuracy
[3] I. Alsmadi and I. Alhami, "Clustering and classification of email
contents", Journal of King Saud University - Computer and Information
Sciences, vol. 27, no. 1, pp. 46-57, 2015.
[4] J. Rennie, "ifile : An Application of Machine Learning to E-Mail
Filtering", in Proceedings of the KDD (Knowledge Discovery in
Databases) Workshop on Text Mining, 2000.
Total 2893
[5] S. Sayed, "Three-Phase Tournament-Based Method for Better Email
emails
Classification", International Journal of Artificial Intelligence &
contained in
Applications, vol. 3, no. 6, pp. 49-56, 2012.
corpus
“LingSpam” [6] M. Fuad, D. Deb and M. Hossain, "A trainable fuzzy spam detection
Clark, J., were 100 - Y system", in 7th International Conference on Computer and Information
Koprinska, classified all % Technology, 2004.
1. Linger I., & Poon, correctly. 481
J. 2003 emails were [7] S. Youn and D. McLeod, "Spam Email Classification using an Adaptive
spam and Ontology", JSW, vol. 2, no. 3, 2007.
2412 were [8] M. Aery and S. Chakravarthy, “eMailSift: Email Classification Based on
legit emails. Structure and Content,” Data Mining, Fifth IEEE Int. Conf., pp. 18–25,
Context- Shaukat 2005.
based Wasi, Syed [9] S. Chakravarthy, A. Venkatachalam, and A. Telang, “A graph-based
email Imran Jami 240 emails of
300 emails approach for multi-folder email classification,” Proc. - IEEE Int. Conf.
2. classifi- and Zubair 80% Y Y Data Mining, ICDM, pp. 78–87, 2010.
cation Ahmed were
model Shaikh, correctly [10] T. Ayodele, S. Zhou, and R. Khusainov, “Email Classification Using
2016 classified. Back Propagation Technique,” Int. J., vol. 1, no. 1, pp. 3–9, 2010.
[11] D. Patil and Y. Dongre, “A Clustering Technique for Email Content
Mining,” Int. J. Comput. Sci. Inf. Technol., vol. 7, no. 3, pp. 73–79,
V. CONCLUSION 2015.

This paper has identified that 100% accuracy in spam [12] K. Taghva, J. Borsack, J. Coombs, A. Condit, S. Lumos, and T. Nartker,
“Ontology-based classification of email,” Proc. ITCC 2003. Int. Conf.
classification of email system is still an unmet need. Project Inf. Technol. Coding Comput., pp. 194–198, 2003.
has drawn upon the work of the existing email classification
systems known as ‘context-based email classification system’
and ‘Linger’ to address the unmet need. Main steps of the
context-based email classification system begins with
preprocessing email using POS Tagger then it extracts several
email features to transform emails into graphs and then
graphs are matched to representative graph so that emails are
classified to the folder which the representative graph with
highest match represent . Linger implements information gain
classifier for filtering spam and use neural network to classify
emails into homogenous clusters. The proposed system
adopts spam filter from Linger to reinforce the accuracy
needed to separate spam emails without any mistake. Proposed
solution provides 100% accuracy at filtering spam from a set
of mixed emails. As far as the experiment shows, processing
time between using spam filter and not using spam filter differ
insignificantly. It is important to however to stress the need to
reduce the processing time of the spam classification because
processing time of 0.1 second is an unmet need in this solution.

Authorized licensed use limited to: PES University Bengaluru. Downloaded on March 20,2024 at 17:29:31 UTC from IEEE Xplore. Restrictions apply.

You might also like