Anti-Spam (CEAS)
Filtron: A Learning-Based Anti-Spam Filter
Eirinaios Michelakis (ernani@iit.demokritos.gr),
Ion Androutsopoulos (ion@aueb.gr),
George Paliouras (paliourg@iit.demokritos.gr),
George Sakkis (gsakkis@rutgers.edu),
Panagiotis Stamatopoulos (takis@di.uoa.gr)
[Figure: Filtron architecture, in two panels. Preprocessing panel: the user's spam folders and legitimate folders, Preprocessor, Attribute Selector, attribute set, black list, white list, Vectorizer, training vectors, Learner, induced classifier, user model. Classification panel: incoming e-mail (example message from sender@provider with attachment File.zip), Unix mail server, Procmail, Classifier, user's profile, black list, address book, classified e-mail, user's mailbox.]
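The classification panel of the figure names a black list, an address book and an induced classifier. How a message moves between them is not spelled out on the slide, so the following is only a minimal sketch under stated assumptions: senders in the white list (address book) are accepted outright, senders in the black list are rejected outright, and only the remaining messages reach the learned classifier. The function names and the toy classifier are hypothetical and are not part of Filtron.

```python
from typing import Callable, Set


def classify_message(
    sender: str,
    body: str,
    black_list: Set[str],
    white_list: Set[str],
    classifier: Callable[[str], bool],
) -> str:
    """Return 'spam' or 'legitimate' for one incoming message.

    Assumed order of checks (not confirmed by the slides):
    white list first, then black list, then the learned classifier.
    """
    sender = sender.strip().lower()
    if sender in white_list:
        return "legitimate"
    if sender in black_list:
        return "spam"
    # classifier(body) is assumed to return True when it predicts spam.
    return "spam" if classifier(body) else "legitimate"


def toy_classifier(body: str) -> bool:
    """Hypothetical stand-in for the induced classifier."""
    return "free money" in body.lower()
```

A mail delivery hook would call `classify_message` once per message and file the message according to the returned label.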
In Vitro Evaluation
Evaluation:
Four message collections (PU1, PU2, PU3, PUA)
Stratified 10-fold cross validation
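Stratified cross validation keeps the spam-to-legitimate ratio roughly equal in every fold. The sketch below illustrates the general technique with the standard library only; it is not the authors' evaluation code, and the function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def stratified_folds(
    labels: Sequence[Hashable], k: int = 10, seed: int = 0
) -> List[List[int]]:
    """Split message indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(
    labels: Sequence[Hashable], k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each of the ten iterations trains on nine folds and tests on the remaining one; the reported figures would then be averages over the ten test folds.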
Results:
No clear winner among the learning algorithms with respect to accuracy
⇒ Efficiency (or other criteria) is more important for real usage.
Nevertheless, SVMs were consistently among the two best
No substantial improvement with n-grams (for n>1)
Refer to the TR for more details:
Learning to filter unsolicited commercial e-mail, TRN 2004/2,
NCSR “Demokritos” (http://www.iit.demokritos.gr/skel/i-config/)
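To make the n-gram comparison concrete, the sketch below generates candidate attributes from a tokenized message. It assumes word-level n-grams over consecutive tokens, which the slide does not state explicitly; `word_ngrams` is a hypothetical helper, not a Filtron function.

```python
from typing import List


def word_ngrams(tokens: List[str], max_n: int = 3) -> List[str]:
    """Return all n-grams of consecutive tokens for n = 1..max_n."""
    grams: List[str] = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams(["buy", "cheap", "pills"], max_n=3))
# ['buy', 'cheap', 'pills', 'buy cheap', 'cheap pills', 'buy cheap pills']
```

With `max_n=1` the attribute space is the plain vocabulary; larger values add many more candidate attributes, which raises processing cost for the limited accuracy gain reported above.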
Summary of In Vitro Evaluation
(Pr = spam precision, Re = spam recall, WAcc = weighted accuracy; all in %)

                          λ=1                       λ=9
                   Pr      Re     WAcc       Pr      Re     WAcc
1-grams
  Naive Bayes     90.56   94.73   94.65     91.57   92.17   94.87
  Flexible Bayes  95.55   89.89   95.15     98.88   74.63   97.76
  LogitBoost      92.43   90.08   93.64     97.71   74.89   97.24
  SVM             94.95   91.43   95.42     98.12   78.33   97.60
1/2/3-grams
  Flexible Bayes  92.98   91.89   93.89     97.43   81.36   96.91
  SVM             94.73   91.70   95.05     98.70   76.40   97.67
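The three measures in the table can be computed from four confusion counts. The sketch below assumes that WAcc is the weighted accuracy used in the authors' related work, where each legitimate message counts λ times: WAcc = (λ·n_LL + n_SS) / (λ·N_L + N_S). Here n_LL and n_SS are the correctly classified legitimate and spam messages, and N_L and N_S are the class totals. The function name and the example counts are hypothetical; they are not results from the evaluation.

```python
from typing import Dict


def spam_metrics(
    n_ss: int, n_sl: int, n_ll: int, n_ls: int, lam: float = 1.0
) -> Dict[str, float]:
    """Compute spam precision, spam recall and weighted accuracy.

    n_ss: spam classified as spam        n_sl: spam classified as legitimate
    n_ll: legitimate kept as legitimate  n_ls: legitimate classified as spam
    lam:  weight of a legitimate message relative to a spam message
    """
    flagged = n_ss + n_ls
    total_spam = n_ss + n_sl
    total_legit = n_ll + n_ls
    precision = n_ss / flagged if flagged else 0.0
    recall = n_ss / total_spam if total_spam else 0.0
    denom = lam * total_legit + total_spam
    wacc = (lam * n_ll + n_ss) / denom if denom else 0.0
    return {"Pr": precision, "Re": recall, "WAcc": wacc}


# Illustrative counts only: 100 spam and 100 legitimate messages.
print(spam_metrics(n_ss=90, n_sl=10, n_ll=95, n_ls=5, lam=1))
print(spam_metrics(n_ss=90, n_sl=10, n_ll=95, n_ls=5, lam=9))
```

Raising λ from 1 to 9 leaves precision and recall unchanged for a fixed classifier but makes each blocked legitimate message cost nine times as much in WAcc, which is why a filter tuned for λ=9 trades recall for precision.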
In Vivo Evaluation