
First Conference on Email and Anti-Spam (CEAS)

Filtron: A Learning-Based
Anti-Spam Filter
Eirinaios Michelakis (ernani@iit.demokritos.gr),
Ion Androutsopoulos (ion@aueb.gr),
George Paliouras (paliourg@iit.demokritos.gr),
George Sakkis (gsakkis@rutgers.edu),
Panagiotis Stamatopoulos (takis@di.uoa.gr)

Mountain View, CA, July 30th and 31st 2004


Outline

 Spam Filtering: past, present and future


 Anti-spam filtering with Filtron
 In Vitro Evaluation
 In Vivo Evaluation
 Conclusions
Spam Filtering:
past, present and future
 Past:
 Black-lists and white-lists of e-mail addresses
 Handcrafted rules looking for suspicious keywords
and patterns in headers
 Present:
 Machine learning-based filters
– Mostly using the Naïve Bayes classifier
– Examples: Mozilla’s spam filter, POPFile, K9
 Signature-based filtering (Vipul’s Razor)
 Future:
 Combination of several techniques (SpamAssassin)
Filtron: An overview

 A multi-platform learning-based anti-spam filter.


 Features for the simple user:
 Personalized: based on her legitimate messages
 Automatically updating black/white lists
 Efficient: server-side filtering and interception rules
 Features for the advanced user and the researcher:
 Customizable learning component
– Through Weka open source machine learning platform
 Support for creating publicly available message collections
– Privacy-preserving encoding of messages and user profiles
 Portable: Implemented in Java and Tcl/Tk
 Currently supported under POSIX-compatible mail
servers (MS Exchange Server port efforts under way)
Filtron’s Architecture

[Architecture diagram] The Preprocessor reads the user’s spam and legitimate
folders and produces the black and white lists; the Attribute Selector derives
the attribute set; the Vectorizer turns messages into training vectors; and the
Learner induces the classifier that forms the user model.
Preprocessing

1. Break down mailbox(es) into distinct messages
2. Remove from every message:
 mail headers
 html tags
 attached files
3. Remove messages with no textual content
4. Store at most 5 messages per sender
 Avoids bias towards regular correspondents
5. Remove duplicates
6. Encode messages (optional)
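The preprocessing steps above can be sketched as follows. This is a minimal Python illustration, not Filtron’s actual (Java) implementation; the function names and the crude tag-stripping regex are ours:

```python
import re
from collections import defaultdict

def strip_message(raw_body):
    """Step 2 (simplified): remove HTML tags from a message body; headers
    and attachments are assumed to have been dropped by an earlier parse."""
    return re.sub(r"<[^>]+>", "", raw_body).strip()

def preprocess(messages, per_sender_cap=5):
    """messages: list of (sender, body) pairs. Keeps at most
    `per_sender_cap` messages per sender, and drops duplicates and
    messages with no textual content (steps 3-5)."""
    kept, seen, per_sender = [], set(), defaultdict(int)
    for sender, body in messages:
        text = strip_message(body)
        if not text:                          # no textual content
            continue
        if text in seen:                      # duplicate
            continue
        if per_sender[sender] >= per_sender_cap:
            continue                          # avoid bias towards regular correspondents
        seen.add(text)
        per_sender[sender] += 1
        kept.append(text)
    return kept

msgs = [("a@x", "<p>hello</p>"), ("a@x", "hello"), ("a@x", "<b></b>")]
print(preprocess(msgs))  # ['hello'] — duplicate and empty messages dropped
```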
Message Classification

[Message classification diagram] An incoming e-mail arrives at the Unix mail
server, where procmail hands it to Filtron’s classifier; using the user’s
profile, black list, and address book, the classifier labels the message, and
the classified e-mail is delivered to the user’s mailbox.
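The decision flow in the diagram can be sketched as below. This is an illustrative Python fragment (Filtron itself is invoked from procmail on the server); the function name, the use of the address book as a white list, and the toy keyword classifier are our assumptions:

```python
def classify_incoming(sender, body, black_list, address_book, classifier):
    """Decision order from the diagram: black list first, then the
    user's address book as a white list, then the learned classifier."""
    if sender in black_list:
        return "spam"
    if sender in address_book:        # known correspondent -> legitimate
        return "legitimate"
    return classifier(body)           # otherwise the learned model decides

# Toy stand-in for the induced classifier:
naive = lambda body: "spam" if "viagra" in body.lower() else "legitimate"

print(classify_incoming("friend@x.org", "Dear Fred, thanks for the reply!",
                        set(), {"friend@x.org"}, naive))  # legitimate
```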
In Vitro Evaluation

 We investigated the effect of:


 Single-token versus multi-token attributes (n-grams
for n=1,2,3)
 Number of attributes (40-3000)
 Learning algorithm (Naïve Bayes, Flexible Bayes,
SVMs, LogitBoost)
 Training corpus size (~ 10%-100% of full training
corpus)
 Cost-Sensitive Learning Formulation
 Misclassifying a legitimate message as spam (LS)
is λ times more serious an error than misclassifying
a spam message as legitimate (SL)
 Two usage scenarios (λ = 1, 9)
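The weighted accuracy (WAcc) used in the evaluation counts each legitimate message λ times, so false positives dominate the score as λ grows. A minimal sketch (the function name is ours; LL/LS/SS/SL are the confusion-matrix counts):

```python
def weighted_accuracy(ll, ls, ss, sl, lam=1):
    """Weighted accuracy: legitimate messages count lambda times.
    ll/ls: legitimate classified as legitimate/spam;
    ss/sl: spam classified as spam/legitimate."""
    return (lam * ll + ss) / (lam * (ll + ls) + (ss + sl))

# Illustrative counts: with lambda = 9 the same 10 false positives
# hurt far more than the 10 missed spam messages.
print(weighted_accuracy(900, 10, 80, 10, lam=1))           # 0.98
print(round(weighted_accuracy(900, 10, 80, 10, lam=9), 4)) # 0.9879
```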
In Vitro Evaluation (cont.)

 Evaluation:
 Four message collections (PU1, PU2, PU3, PUA)
 Stratified 10-fold cross validation
 Results:
 No clear winner among learning algorithms w.r.t. accuracy
⇒ Efficiency (or other criteria) more important for real usage
 Nevertheless, SVMs were consistently among the two best
 No substantial improvement with n-grams (for n > 1)
 Refer to the TR for more details:
 Learning to filter unsolicited commercial e-mail, TRN 2004/2,
NCSR “Demokritos” (http://www.iit.demokritos.gr/skel/i-config/)
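Stratified 10-fold cross-validation keeps the spam-to-legitimate ratio the same in every fold, so each fold is representative of the full collection. A minimal pure-Python sketch of the stratification step (function name ours):

```python
def stratified_folds(labels, k=10):
    """Assign each message index to one of k folds so that every fold
    gets (nearly) the same proportion of each class."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)   # deal indices round-robin per class
    return folds

labels = ["spam"] * 20 + ["legit"] * 60
folds = stratified_folds(labels, k=10)
# every fold ends up with 2 spam and 6 legitimate messages:
print([sum(labels[i] == "spam" for i in f) for f in folds])
```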
Summary of in Vitro Evaluation

                      λ = 1                    λ = 9
                  Pr     Re     WAcc       Pr     Re     WAcc
1-grams
  Naive Bayes    90.56  94.73  94.65      91.57  92.17  94.87
  Flexible Bayes 95.55  89.89  95.15      98.88  74.63  97.76
  LogitBoost     92.43  90.08  93.64      97.71  74.89  97.24
  SVM            94.95  91.43  95.42      98.12  78.33  97.60
1/2/3-grams
  Flexible Bayes 92.98  91.89  93.89      97.43  81.36  96.91
  SVM            94.73  91.70  95.05      98.70  76.40  97.67
In Vivo Evaluation

 Seven-month live evaluation by the third author
 Training collection: PU3
 2313 legitimate / 1826 spam
 Learning algorithm: SVM
 Cost scenario: λ = 1
 Retained attributes: 520 1-grams
 Numeric values (term frequency)
 No black-list was used
Summary of in Vivo Evaluation

Days used 212


Messages received 6732 (avg. 31.75 per day)
Spam messages received 1623 (avg. 7.66 per day)
Legitimate messages received 5109 (avg. 24.10 per day)
Legitimate-to-Spam Ratio 3.15
Correctly classified legitimate messages (LL) 5057
Incorrectly classified legitimate messages (LS) 52 (avg. 1.72 per week)
Correctly classified spam messages (SS) 1450
Incorrectly classified spam messages (SL) 173 (avg. 5.71 per week)
Precision 96.54% (PU3: 96.43%)
Recall 89.34% (PU3: 95.05%)
WAcc 96.66% (PU3: 96.22%)
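The reported figures follow from the standard definitions: precision = SS/(SS+LS), recall = SS/(SS+SL), and weighted accuracy with λ = 1 is plain accuracy. A quick check against the counts in the table above:

```python
LL, LS, SS, SL = 5057, 52, 1450, 173   # counts from the in vivo table

precision = SS / (SS + LS)                    # spam precision
recall    = SS / (SS + SL)                    # spam recall
wacc      = (LL + SS) / (LL + LS + SS + SL)   # weighted accuracy, lambda = 1

print(f"{precision:.2%} {recall:.2%} {wacc:.2%}")  # 96.54% 89.34% 96.66%
```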
Post-Mortem Analysis
False Positives

 52 false positives (out of 6732)


 52%: Automatically generated messages
 subscription verifications, virus warnings, etc.
 22%: Very short messages
 3-5 words in message body
 Along with attachments and hyperlinks
 26%: Short messages
 1-2 lines
 Written in casual style, often exploited by spammers
 With no attachments or hyperlinks
Post-Mortem Analysis
False Negatives
 173 false negatives (out of 6732)
 30%: “Hard Spam”
 Little textual information, avoiding common suspicious word patterns
 Many images and hyperlinks
 Tricks to confuse tokenizers
 8%: Advertisements of pornographic sites with very casual and well
chosen vocabulary
 23%: Non-English messages
 Under-represented in the training corpus
 30%: Encoded messages
 BASE64 format; Filtron could not process it at that time
 6%: Hoax letters
 Long formal letters (“tremendous business opportunity!”)
 Many occurrences of the receiver’s full name
 3%: Short messages with unusual content
Conclusions

 Signs of an arms race between spammers and content-based filters
 Filtron’s performance deemed satisfactory, though it can be
improved with:
 More elaborate preprocessing to tackle usual countermeasures of
spammers (misspellings, uncommon words, text on images)
 Regular retraining
 Currently most promising approach: combination of
different filtering approaches along with Machine Learning
 Collaborative filtering
 Filtering at the transport layer
 …
