MS Thesis (Presentation)
Comparative Study on Feature Space Reduction Techniques for Spam Detection
• Researchers: Nouman Azam, Dr. Amir Hanif Dar, Samiullah Marwat

The Problem
• Email is the most widely used medium for communication worldwide
– It is cheap, reliable, fast, and easily accessible.

• It is prone to spam emails. Why?
– Due to its wide usage and cheapness – with a single click you can communicate with anyone anywhere around the globe.

• It hardly costs spammers more to send out 1 million emails than to send 10 emails.

Statistics of spam
• Is spam really a problem?
– Some statistics will clarify.

• At the end of 2002,
– as much as 40% of all email traffic consisted of spam.
• (http://zdnet.com.com/2100-1106-955842.html)

• In 2003
– the percentage was estimated to be about 50% of all emails

(http://zdnet.com.com/2100-1105_2-1019528.html)

• In 2006
– BBC news reported 96% of all emails to be spam.
• (http://news.bbc.co.uk/2/hi/technology/5219554.stm)

Statistics of Spam
• Daily spam emails sent: 12.4 billion
• Daily spam received per person: 6
• Annual spam received per person: 2,200
• Spam cost to all non-corporate Internet users: $255 million
• Spam cost to all U.S. corporations in 2002: $8.9 billion
• Email address changes due to spam: 16%
• Annual spam in a 1,000-employee company: 2.1 million
• Users who reply to spam email: 28%

http://spam-filter-review.toptenreviews.com/spam-statistics.html

Statistics of Spam

http://www.junk-o-meter.com/stats/index.php

Problems from Spam
• Wastage of network resources
– bandwidth

• Wastage of time
– Wastes the time of people working in organizations, resulting in reduced productivity.

• Damage to PCs
– Computer viruses can cause serious damage to PCs.

• Ethical issues
– Spam emails advertising pornographic sites can cause problems for children.

Definition of Spam
• Unsolicited (unwanted) email for a recipient, OR
• Any email that the user does not want to have in his inbox.

Existing Approaches
• Rule based
– Hand-made rules for detection of spam made by experts (needs domain experts and constant updating of rules).

• Customer Revolt
– Forcing companies not to publicize personal email IDs given to them (hard to implement).

• Domain filters
– Allowing mail from specific domains only (hard job of keeping track of domains that are valid for a user).

• Blacklisting
– Blacklist filters use databases of known abusers, and also filter unknown addresses (constant updating of the databases would be required).

http://www.templetons.com/brad/spam/spamsol.html

Existing Approaches
• Whitelist Filters
– Mailer programs learn all contacts of a user and let mail from those contacts through directly (everyone must first communicate his email ID to the user before he can send email).

• Hiding address
– Hiding one's original address from the spammers by allowing all emails to be received at a temporary email ID, which is then forwarded to the original email if found valid by the user (hard job of maintaining a couple of email IDs).

• Checks on the number of recipients by the email agent programs.
• Government actions
– Laws implemented by governments against spammers (hard to implement laws).

Lastly
• Automated Recognition of Spam
– Uses machine learning algorithms by first learning from the past data available (seems to be the best at present).
http://www.templetons.com/brad/spam/spamsol.html

Why automated Spam Detection is Best
• Minimal user input required
– The filter will filter spam automatically with minimal user input.

• Adaptation to new kinds of spam
– The filter can adapt itself to new, previously unknown kinds of spam, i.e. it will learn and update itself automatically.

Nature of the Problem
• Instance of document classification
– It can be considered a simple instance of the document classification problem, where we have two classes and our objective is to separate spam from legitimate emails.

• The features in our domain will be words.
• Representation of emails
– Any email can be represented in terms of features (taken to be words in this case) with discrete values based on some statistic of the presence or absence of words.

Main Steps

Preprocessing of Data
• Removal of words shorter than 3 characters
– All words shorter than 3 characters were removed, as they were found to be mostly non-informative.

• Removal of stop words
– Stop words are those which provide the structure of the language and do not provide content.
– Not informative towards the class of the document.
– Examples are pronouns and conjunctions.

• Performing stemming with the Porter stemming algorithm (Porter 1980).
– Stemming reduces words having the same stem to a single word, thus reducing the vocabulary.
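The three steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming NLTK's PorterStemmer is available; the stop-word list shown here is a short illustrative sample, not the full list used in the experiments.

```python
# Minimal preprocessing sketch: drop short words, remove stop words, apply Porter stemming.
# Assumes NLTK is installed (for PorterStemmer); STOP_WORDS is an illustrative sample only.
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "and", "that", "this", "with", "from", "have", "for", "are"}
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # keep alphabetic tokens only
    tokens = [t for t in tokens if len(t) >= 3]           # drop words shorter than 3 characters
    tokens = [t for t in tokens if t not in STOP_WORDS]   # drop stop words
    return [stemmer.stem(t) for t in tokens]              # reduce words to their stems

print(preprocess("Winners are claiming the amazing prizes offered"))
# e.g. ['winner', 'claim', 'amaz', 'prize', 'offer']
```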

Preprocessing
• Some stop words
• Examples of stemmed words

Ling-Spam corpus after preprocessing

Representation of Data
The data is arranged as a matrix whose rows are the examples (example #1, example #2, example #3, example #4, …) and whose columns are the features (feature #1, feature #2, feature #3, feature #4, …).

Representation of Data
• Term Frequency (TF)

wij  tf

Wij = weight of a ij term i in email j, tfij = frequency of a term I in email j
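As an illustration of this representation, the sketch below (assumed helper names, not the thesis code) builds the term-frequency matrix from already-preprocessed emails: rows are emails, columns are vocabulary terms, and each cell holds tf_ij.

```python
# Sketch: build the email-by-term matrix where cell (j, i) holds tf_ij,
# the frequency of term i in email j.
import numpy as np

def tf_matrix(tokenised_emails):
    vocab = sorted({w for email in tokenised_emails for w in email})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(tokenised_emails), len(vocab)))
    for j, email in enumerate(tokenised_emails):
        for w in email:
            X[j, index[w]] += 1            # w_ij = tf_ij
    return X, vocab

emails = [["cheap", "offer", "cheap"], ["meet", "project", "offer"]]
X, vocab = tf_matrix(emails)
print(vocab)   # ['cheap', 'meet', 'offer', 'project']
print(X)       # [[2. 0. 1. 0.]
               #  [0. 1. 1. 1.]]
```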

The Corpus (Data Set)
• The corpus that we used in our experimentation was the Ling-Spam corpus (Androutsopoulos et al. 00).

• Total number of legitimate emails in the corpus was 2412.
• Total number of spam emails was 481.
• Spam percentage was about 16%.

Why Feature Reduction?
• Text classification tasks are characterized by high dimensionality
– The total number of unique features, i.e. words, in the entire corpus was found to be over 40 thousand.

• Hard job
– Computation in over 40 thousand dimensions would be very difficult.
– Secondly, the storage requirements would be very large.
– It should be noted that storing an array of 2890 * 40,000 requires over 220 MB in CSV format.

Feature Reduction Methods
• Mutual Information (MI)
• Latent Semantic Indexing (LSI, PCA or KLT)
• Word Frequency Thresholding (TF)

Mutual Information
• Supervised feature selection method
• MI for feature t can be calculated as

MI(t, c) = \sum_{c \in \{Spam, Leg\}} \sum_{t \in \{0, 1\}} P(t, c) \log \frac{P(t, c)}{P(t) P(c)}

where t = terms or features and c = class. MI scores for all of the terms (features) were calculated, the features were then sorted in descending order, and the top-scoring features were selected. (Sahami et al. 98)
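A sketch of this selection procedure, assuming a term-frequency matrix X_counts and labels y (1 = spam, 0 = legitimate); the probabilities are estimated from document counts and the names are illustrative, not the thesis code.

```python
# Sketch of MI-based feature selection: estimate P(t, c), P(t), P(c) from document
# counts over a binary presence/absence view of the data, score every term, and
# keep the top-scoring ones.
import numpy as np

def mi_scores(X_counts, y):
    """X_counts: (n_emails, n_terms) term frequencies; y: NumPy array, 1 = spam, 0 = legit."""
    present = (X_counts > 0).astype(float)          # t = 1 if the term occurs in the email
    n = len(y)
    scores = np.zeros(X_counts.shape[1])
    for t_val in (0, 1):
        T = present if t_val == 1 else 1.0 - present
        p_t = T.mean(axis=0)                        # P(t)
        for c_val in (0, 1):
            C = (y == c_val).astype(float)
            p_c = C.mean()                          # P(c)
            p_tc = (T.T @ C) / n                    # P(t, c)
            with np.errstate(divide="ignore", invalid="ignore"):
                term = p_tc * np.log(p_tc / (p_t * p_c))
            scores += np.nan_to_num(term)           # zero-probability cells contribute 0
    return scores

def select_top_mi(X_counts, y, k):
    return np.argsort(mi_scores(X_counts, y))[::-1][:k]   # indices of the k best terms
```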

Term Frequency Thresholding
• Unsupervised feature selection
• Term frequency
– TF for a feature in a document is the number of times it appears in that document.

• Term frequency score of a feature
– The TF score for a feature is the sum of its individual term frequencies over the entire set of documents (emails).
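A corresponding sketch for TF thresholding, using the same matrix layout and illustrative names as above:

```python
# Sketch of term-frequency thresholding: score each term by its total frequency
# over all emails and keep the k most frequent terms.
import numpy as np

def select_top_tf(X_counts, k):
    tf_score = X_counts.sum(axis=0)                 # summed term frequency per feature
    return np.argsort(tf_score)[::-1][:k]           # indices of the k highest-scoring terms
```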

Latent Semantic Indexing
• Unsupervised feature extraction
• Also known as Principal Component Analysis and the Karhunen-Loève transform
• It calculates the eigenvectors EV of the covariance matrix C, which is obtained from the multiplication of the mean-adjusted data µ with its transpose.
• The eigenvectors corresponding to the topmost eigenvalues are selected.
• The transformed data TD is obtained by taking the transpose of the eigenvector matrix and multiplying it with the mean-adjusted data, i.e.
• TD = EV' * µ
(Günal et al. 05)
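A minimal NumPy sketch of this projection, with emails as rows and illustrative names; for the full 40,000-term vocabulary an SVD of the mean-adjusted matrix would be cheaper than forming the covariance matrix explicitly.

```python
# Sketch of the LSI/PCA transform: mean-adjust the data, take the eigenvectors of the
# covariance matrix, keep those with the largest eigenvalues, and project the data.
import numpy as np

def lsi_transform(X, k):
    """X: (n_emails, n_terms); returns the data projected onto the top-k components."""
    mu = X - X.mean(axis=0)                         # mean-adjusted data
    C = np.cov(mu, rowvar=False)                    # covariance matrix of the terms
    eigvals, eigvecs = np.linalg.eigh(C)            # eigen-decomposition (C is symmetric)
    top = np.argsort(eigvals)[::-1][:k]             # indices of the top-k eigenvalues
    EV = eigvecs[:, top]                            # corresponding eigenvectors
    return mu @ EV                                  # TD = mean-adjusted data projected on EV
```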

The Classifier
• The classifier used was K-Nearest Neighbor.
• All the data were stored in memory.
• Classification of a new example is carried out by finding its Euclidean distance to all the stored data; the majority class among the k nearest examples is assigned to the new example.
(Androutsopoulos et al. 00)
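A sketch of this memory-based classifier with illustrative names; majority voting is one common way to combine the k neighbours.

```python
# Sketch of k-nearest-neighbour classification with Euclidean distance:
# keep all training rows in memory and let the k closest ones vote.
import numpy as np

def knn_predict(X_train, y_train, x_new, k=1):
    """y_train: integer labels (e.g. 1 = spam, 0 = legitimate)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every stored email
    nearest = np.argsort(dists)[:k]                   # indices of the k closest emails
    return np.bincount(y_train[nearest]).argmax()     # majority class among the neighbours
```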

Experimental settings
• In the first set, the data was represented using term frequency.
• Three algorithms were tested: MI, LSI, and TF thresholding.
• All three algorithms were used to select the topmost 20, 50, 100, and 250 features.

Evaluation Measures
• Accuracy
– Let N_{Spam} and N_{Leg} be the total number of spam and legitimate emails in our data set.
– Let N_{Y→Z} be the number of emails that are classified as Z but belong to class Y. Then

WERR  N Legit  Spam  N Spam Legit N Spam  N Leg

Acc  N SpamSpam  N Leg  Leg N Spam  N Leg

– Identifying legitimate email as spam is more costly than identifying spam as legitimate. To cope with this cost difference, we redefine accuracy as weighted accuracy and error as weighted error:

WAcc = \frac{\lambda \cdot N_{Leg \to Leg} + N_{Spam \to Spam}}{\lambda \cdot N_{Leg} + N_{Spam}}

WErr = \frac{\lambda \cdot N_{Leg \to Spam} + N_{Spam \to Leg}}{\lambda \cdot N_{Leg} + N_{Spam}}
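In code form these measures reduce to a few ratios over the four counts; the sketch below follows the λ-weighted form given above, with illustrative helper names.

```python
# Sketch of accuracy, weighted accuracy, and weighted error.
# n_leg_leg   = legitimate emails classified as legitimate
# n_spam_spam = spam emails classified as spam, and so on.
def accuracy(n_spam_spam, n_leg_leg, n_spam, n_leg):
    return (n_spam_spam + n_leg_leg) / (n_spam + n_leg)

def weighted_accuracy(n_spam_spam, n_leg_leg, n_spam, n_leg, lam=999):
    # each legitimate email counts lam times as much as a spam email
    return (lam * n_leg_leg + n_spam_spam) / (lam * n_leg + n_spam)

def weighted_error(n_leg_spam, n_spam_leg, n_spam, n_leg, lam=999):
    return (lam * n_leg_spam + n_spam_leg) / (lam * n_leg + n_spam)
```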

Evaluation Measures
• Spam Recall
– If we consider identification of spam as a filtering process and filter out all of the identified spam from the legitimate emails, then
– Spam recall measures the percentage of spam messages that the filter manages to block.

SR = \frac{N_{Spam \to Spam}}{N_{Spam}}

• Spam precision

– measures the degree to which the blocked messages are indeed spam

SP = \frac{N_{Spam \to Spam}}{N_{Spam \to Spam} + N_{Leg \to Spam}}

(Androutsopoulos et al. 00; Sahami et al. 98)
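The same counts give spam recall and precision directly (illustrative sketch):

```python
# Sketch of spam recall and spam precision.
def spam_recall(n_spam_spam, n_spam):
    return n_spam_spam / n_spam                          # fraction of all spam that is blocked

def spam_precision(n_spam_spam, n_leg_spam):
    return n_spam_spam / (n_spam_spam + n_leg_spam)      # fraction of blocked mail that is spam
```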

Experimental Results (1)
Features | LSI (PCA) WAC% (λ=9 / 99 / 999) | Thresholding WAC% (λ=9 / 99 / 999) | MI, entire data set WAC% (λ=9 / 99 / 999) | MI, individual files WAC% (λ=9 / 99 / 999)
20   | 94   / 94.1 / 94.1 | 94.5 / 94.8 / 94.9 | 97.5 / 98   / 98   | 96.4 / 96.9 / 96.9
50   | 92.7 / 92.8 / 92.8 | 94.7 / 94.9 / 94.9 | 97.5 / 98   / 98   | 95.4 / 95.9 / 95.9
100  | 90.7 / 90.7 / 90.7 | 93.2 / 93.3 / 93.4 | 96.4 / 96.8 / 96.8 | 93.9 / 94.3 / 94.4
250  | 88.5 / 88.5 / 88.5 | 90.4 / 90.5 / 90.6 | 93.7 / 94.1 / 94.1 | 92.8 / 93.1 / 93.1
500  | -    / -    / -    | 91   / 91.1 / 91.1 | -    / -    / -    | -    / -    / -

Weighted Accuracy Results with k = 1

Experimental Results (1)
Features | LSI (PCA) WAC% (λ=9 / 99 / 999) | Thresholding WAC% (λ=9 / 99 / 999) | MI, entire data set WAC% (λ=9 / 99 / 999) | MI, individual files WAC% (λ=9 / 99 / 999)
20   | 92.9 / 92.9 / 92.9 | 94.3 / 94.6 / 94.6 | 97.5 / 98   / 98   | 96   / 96.5 / 96.5
50   | 91.3 / 91.3 / 91.3 | 93.2 / 93.4 / 93.4 | 97.8 / 98.3 / 98.3 | 95.5 / 95.9 / 96
100  | 89   / 89   / 89   | 91.9 / 92.1 / 92.1 | 96.3 / 96.7 / 96.8 | 93.8 / 94.2 / 94.2
250  | 88.5 / 88.6 / 88.6 | 90.8 / 91   / 91   | 95.1 / 95.6 / 95.6 | 92.7 / 93.1 / 93.2
500  | -    / -    / -    | 90   / 90.1 / 90.1 | -    / -    / -    | -    / -    / -

Weighted Accuracy Results with k = 3

Experimental Results (1)
[Figure: Spam Recall (%) plotted against the number of features (0-300) for LSI, Thresholding, MI (entire data), and MI (individual files).]

Spam Recall values for K = 1

Experimental Results (1)
[Figure: Spam Recall (%) plotted against the number of features (0-300) for LSI, Thresholding, MI (entire data), and MI (individual files).]

Spam Recall values for K = 3

Experimental Results (1)
[Figure: Spam Precision (%) plotted against the number of features (0-300) for LSI, Thresholding, MI (entire data), and MI (individual files).]

Spam Precision values for K = 1

Experimental Results (1)
[Figure: Spam Precision (%) plotted against the number of features (0-300) for LSI, Thresholding, MI (entire data), and MI (individual files).]

Spam Precision values for K = 3

Summary of Results from Experiment
• MI performs well on accuracy.
• MI scores calculated over the entire data set perform better than MI scores calculated on the individual files.
• LSI and TF thresholding perform well on spam recall but are outperformed by MI on spam precision.
• LSI and TF thresholding give similar sorts of results.

Observation
• Changing the Values of K for the Nearest Neighbor
– does not have significant impact on the results.

• Value of K
– Values from 1 to 7 give approximately the same results.

• Feature set size and accuracy
– There isn't any consistent relationship between them.

• Changing the value of λ
– Increasing λ from 9 to 999 (in the weighted accuracy equation) improves the accuracy by 0.5% to 1.5% on average.

• The best accuracy results
– Were obtained with the smaller feature sets, which is a great improvement over the original feature space of over 40 thousand features.

Future work
• Minimum feature set size
– I was unable to find the minimum feature set size after which the performance starts degrading.

• Other features of email
– The corpus I used does not have other features of emails such as attachments, pictures, domain properties, etc. Adding these as features would have a good impact on accuracy and has been examined in (Sahami et al. 97).

• Spam rate of corpus
– The spam rate of the corpus was about 16%, which should be higher. Increasing the spam rate to 70% or 80% might improve the performance in terms of spam recall and precision and would better reflect the current spam rate.
