
Spam Filtering Techniques for Short Message Service

Adrien Besson and Dimitris Perdios

Signal Processing Laboratory (LTS5)


École Polytechnique Fédérale de Lausanne (EPFL)

Adaptation and Learning Project


Lausanne, Switzerland
June 4, 2018

Outline

Introduction
  Context
  Data Exploration
Feature Extraction
  The Bag-of-Words Representation
  The Term Frequency-Inverse Document Frequency Weighting
Comparison of Several Classifiers
  Considered Classifiers
  Experimental Settings
  Results
On Sensitivity and Specificity
  Increasing the Sensitivity
  Results
Online Learning Strategies
  Motivations and Strategies
  Experiments
Conclusion
Introduction
Context

▶ Short message service (SMS): a very popular communication service → 15 M SMS sent per minute worldwide in 2017
▶ Spam: unwanted messages sent in bulk to a group of recipients
  ▶ Intrusive
  ▶ Waste of money, time, and network bandwidth
▶ SMS spam filtering still at a relatively early stage
  ▶ No publicly available dataset of sufficient size
  ▶ Current methods inherit from email spam filtering techniques
▶ Supporting code: https://github.com/dperdios/sms-spam-filtering

Introduction
Data Exploration

SMS spam collection dataset


▶ 5572 tagged SMS extracted from the Grumbletext Web site, Caroline Tagg's PhD thesis, and the SMS Spam Corpus v.0.1 Big
▶ Dataset composed of 4825 ham (86 %) and 747 spam (14 %)

Most represented words in the dataset
▶ Stop-words: the most commonly used English words
▶ Uninformative words that have to be removed (a counting sketch follows below)

[Figure: number of occurrences of the 30 most frequent words in the dataset; the top entries are stop-words such as "to", "you", "I", "a", "the", "and"]
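As an illustration, a minimal sketch of how such word counts can be obtained with collections.Counter and scikit-learn's built-in English stop-word list (the two example messages are placeholders, not the actual dataset):

```python
# Count word occurrences over a corpus and flag English stop-words.
import re
from collections import Counter
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

messages = ["Are you free tonight?", "Call now to claim your free prize"]
counts = Counter(w for sms in messages for w in re.findall(r"[a-z]+", sms.lower()))

for word, count in counts.most_common(10):
    tag = "(stop-word)" if word in ENGLISH_STOP_WORDS else ""
    print(word, count, tag)
```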

Introduction
Data Exploration – Ham

Introduction
Data Exploration – Spam

Feature Extraction
The Bag-of-Words Representation
Motivation
A document, considered as a sequence of symbols, cannot be used directly as input of a machine learning algorithm → need for a representation
The Bag-of-Words representation
▶ Document composed of D words d_1, ..., d_D and vocabulary composed of M words:

  x = \sum_{i=1}^{D} x_{d_i},

  where x_{d_i} \in \mathbb{R}^M is the one-hot vector with a 1 at the entry of d_i in the vocabulary
▶ Given a collection of N documents → X \in \mathbb{R}^{N \times M}
SMS dataset
▶ An SMS is a document
▶ The vocabulary is extracted from the training set (see the sketch below)
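A minimal sketch of this representation using scikit-learn's CountVectorizer; the example messages are placeholders, and the actual preprocessing in the supporting code may differ:

```python
# Bag-of-words: each SMS becomes a row of word counts over the vocabulary
# extracted from the training set.
from sklearn.feature_extraction.text import CountVectorizer

train_sms = [
    "Ok lar... Joking wif u oni...",                  # ham
    "WINNER!! You have won a free prize, call now!",  # spam
]

vectorizer = CountVectorizer(stop_words="english")  # drops English stop-words
X = vectorizer.fit_transform(train_sms)             # sparse N x M count matrix

print(X.shape)                             # (2, M), M = vocabulary size
print(vectorizer.get_feature_names_out())  # the extracted vocabulary
```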
Feature Extraction
The Term Frequency-Inverse Document Frequency Weighting
Motivation
▶ Preferable to work with frequencies rather than raw occurrence counts
▶ Need for normalized features
Tf-idf weighting
▶ Term frequency of the document n of size D_n in the collection:

  tf_n = \frac{1}{D_n}

▶ Inverse document frequency of the word d_i in the collection:

  idf_{d_i} = \log \frac{N}{df_{d_i}}, \quad df_{d_i} = number of documents containing d_i

▶ Tf-idf weighting of the word d_i in the document n (sketched below):

  T(n, d_i) = tf_n \times idf_{d_i}
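A minimal sketch using scikit-learn's TfidfVectorizer; note that its defaults differ slightly from the formula above (smoothed idf log((1 + N)/(1 + df)) + 1 and ℓ2 row normalization), so this is an assumed substitute rather than an exact reimplementation:

```python
# Tf-idf features from raw SMS text. Example messages are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer

train_sms = [
    "Ok lar... Joking wif u oni...",
    "WINNER!! You have won a free prize, call now!",
]

vectorizer = TfidfVectorizer(stop_words="english")
X_tfidf = vectorizer.fit_transform(train_sms)  # sparse N x M tf-idf matrix
print(X_tfidf.toarray().round(3))
```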

Comparison of Several Classifiers
Considered Classifiers

Type                     | Method | Main parameters
-------------------------|--------|-------------------------------------------
k-nearest neighbors      | kNN    | Number of neighbors
Multinomial Naive Bayes  | MNB    | Additive smoothing parameter
Logistic Regression      | LR     | –
                         | LR-ℓ2  | Regularization parameter
                         | LR-ℓ1  | Regularization parameter
Support Vector Machine   | SVM-L  | Regularization parameter on slack variable
                         | SVM-R  | Regularization parameter on slack variable
                         | SVM-S  | Regularization parameter on slack variable
Trees                    | DT     | Minimum number of samples at leaf
                         | RF     | Number of trees in the forest
                         | AB     | Number of estimators
Neural Network           | MLP    | Hidden-layer sizes
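A sketch of these classifiers as scikit-learn estimators. Interpreting SVM-L/SVM-R/SVM-S as linear, RBF, and sigmoid kernels is an assumption, and the parameter values shown are illustrative defaults, not the grid-searched ones:

```python
# The compared classifiers as scikit-learn estimators (default parameters;
# the actual values were selected by grid search).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "kNN":   KNeighborsClassifier(n_neighbors=5),
    "MNB":   MultinomialNB(alpha=1.0),
    "LR":    LogisticRegression(penalty=None),  # unpenalized (sklearn >= 1.2)
    "LR-l2": LogisticRegression(penalty="l2", C=1.0),
    "LR-l1": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    "SVM-L": SVC(kernel="linear", C=1.0),   # assumed: L = linear kernel
    "SVM-R": SVC(kernel="rbf", C=1.0),      # assumed: R = RBF kernel
    "SVM-S": SVC(kernel="sigmoid", C=1.0),  # assumed: S = sigmoid kernel
    "DT":    DecisionTreeClassifier(min_samples_leaf=1),
    "RF":    RandomForestClassifier(n_estimators=100),
    "AB":    AdaBoostClassifier(n_estimators=50),
    "MLP":   MLPClassifier(hidden_layer_sizes=(100,)),
}
```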

Comparison of Several Classifiers
Experimental Settings
General remarks:
▶ No dimensionality reduction of the data
▶ Even though the data are imbalanced, we use a threshold of 0.5 for our Bayes plug-in estimator
▶ Dataset divided into a training set (80 %) and a test set (20 %)
▶ Hyper-parameter grid search using 10-fold cross-validation on the training set
Metrics:
▶ Misclassification error (ME):

  ME = \frac{1}{N_t} \sum_{i=1}^{N_t} \mathbb{I}(c(h_i) \neq \gamma(i))

▶ Sensitivity (SE): probability of predicting spam given that it is spam

  SE = \frac{1}{\#spam} \sum_{i=1}^{N_t} \mathbb{I}(c(h_i) = \gamma(i)) \times \mathbb{I}(\gamma(i) = spam)

▶ Specificity (SP): probability of predicting ham given that it is ham

  SP = \frac{1}{\#ham} \sum_{i=1}^{N_t} \mathbb{I}(c(h_i) = \gamma(i)) \times \mathbb{I}(\gamma(i) = ham)
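A sketch of this protocol and of the three metrics, assuming X and y hold the tf-idf features and labels from the previous steps (the grid values and variable names are assumptions):

```python
# 80/20 split, 10-fold CV grid search, and the ME/SE/SP metrics defined above.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import MultinomialNB

def evaluate(y_true, y_pred):
    """Return (ME, SE, SP), with spam as the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    me = np.mean(y_pred != y_true)                    # misclassification error
    se = np.mean(y_pred[y_true == "spam"] == "spam")  # sensitivity
    sp = np.mean(y_pred[y_true == "ham"] == "ham")    # specificity
    return me, se, sp

# Assumed available from the feature-extraction step: X (tf-idf), y (labels).
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
# grid = GridSearchCV(MultinomialNB(), {"alpha": [0.01, 0.1, 1.0]}, cv=10)
# grid.fit(X_tr, y_tr)
# print(evaluate(y_te, grid.predict(X_te)))
```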

Comparison of Several Classifiers
Results

Normalized confusion matrices on the test set (rows: true class, columns: predicted class), with the misclassification error of each classifier:

Classifier | ME [%] | ham→ham | ham→spam | spam→ham | spam→spam
-----------|--------|---------|----------|----------|----------
AB         | 2.152  | 0.996   | 0.004    | 0.153    | 0.847
DT         | 2.870  | 0.988   | 0.012    | 0.153    | 0.847
LR-ℓ1      | 2.063  | 0.993   | 0.007    | 0.122    | 0.878
LR-ℓ2      | 1.614  | 0.999   | 0.001    | 0.130    | 0.870
LR         | 1.614  | 0.999   | 0.001    | 0.130    | 0.870
MLP        | 1.614  | 0.998   | 0.002    | 0.122    | 0.878
MNB        | 2.601  | 1.000   | 0.000    | 0.221    | 0.779
RF         | 2.601  | 0.998   | 0.002    | 0.206    | 0.794
SVM-L      | 1.794  | 0.996   | 0.004    | 0.122    | 0.878
SVM-R      | 1.614  | 0.998   | 0.002    | 0.122    | 0.878
SVM-S      | 1.614  | 0.998   | 0.002    | 0.122    | 0.878
kNN        | 4.574  | 1.000   | 0.000    | 0.389    | 0.611

On Sensitivity and Specificity
Increasing the Sensitivity

Motivations
▶ Optimizing a symmetric loss on an imbalanced dataset → good predictive performance on the majority class and poor results on the minority class
▶ One may want to increase the sensitivity
Methods
▶ Decreasing the threshold of the Bayes classifier (see the sketch after this list)
▶ Resampling:
  ▶ Downsampling of the majority class
  ▶ Upsampling of the minority class
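A minimal sketch of the thresholding idea on synthetic stand-in features; the value τ = 0.3 is illustrative, not the slides' choice:

```python
# Lowering the decision threshold of a probabilistic classifier trades
# specificity for sensitivity. Data below are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))          # stand-in for tf-idf features
y_train = (X_train[:, 0] > 1.0).astype(int)  # 1 = spam (minority), 0 = ham

clf = LogisticRegression().fit(X_train, y_train)
tau = 0.3                                    # below the default 0.5
proba_spam = clf.predict_proba(X_train)[:, 1]
y_pred = (proba_spam >= tau).astype(int)     # flags more messages as spam
print(y_pred.sum(), "messages flagged as spam at threshold", tau)
```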
Proposed experiments (sketched after this list)
1. Random downsampling of the ham class to obtain a balanced dataset
2. Random upsampling with replacement of the spam class to obtain a balanced dataset
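A sketch of both experiments with sklearn.utils.resample; the DataFrame layout (text/label columns) is an assumption, and the five messages are placeholders:

```python
# Random down- and upsampling to balance the two classes.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "text": ["hi", "free prize!", "ok", "call now to win", "see you"],
    "label": ["ham", "spam", "ham", "spam", "ham"],
})
ham, spam = df[df.label == "ham"], df[df.label == "spam"]

# 1. Random downsampling of the ham (majority) class
ham_down = resample(ham, replace=False, n_samples=len(spam), random_state=0)
balanced_down = pd.concat([ham_down, spam])

# 2. Random upsampling with replacement of the spam (minority) class
spam_up = resample(spam, replace=True, n_samples=len(ham), random_state=0)
balanced_up = pd.concat([ham, spam_up])
```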

On Sensitivity and Specificity
Results – Downsampling the Majority Class

Normalized confusion matrices on the test set after downsampling the ham class (rows: true class, columns: predicted class):

Classifier | ME [%] | ham→ham | ham→spam | spam→ham | spam→spam
-----------|--------|---------|----------|----------|----------
AB         | 5.471  | 0.956   | 0.044    | 0.137    | 0.863
DT         | 6.996  | 0.932   | 0.068    | 0.084    | 0.916
LR-ℓ1      | 3.318  | 0.970   | 0.030    | 0.053    | 0.947
LR-ℓ2      | 2.601  | 0.979   | 0.021    | 0.061    | 0.939
LR         | 2.960  | 0.975   | 0.025    | 0.061    | 0.939
MLP        | 1.973  | 0.983   | 0.017    | 0.038    | 0.962
MNB        | 5.381  | 0.944   | 0.056    | 0.038    | 0.962
RF         | 2.063  | 0.989   | 0.011    | 0.092    | 0.908
SVM-L      | 3.049  | 0.973   | 0.027    | 0.053    | 0.947
SVM-R      | 2.511  | 0.980   | 0.020    | 0.061    | 0.939
SVM-S      | 3.139  | 0.970   | 0.030    | 0.038    | 0.962
kNN        | 3.587  | 0.974   | 0.026    | 0.107    | 0.893

On Sensitivity and Specificity
Results – Upsampling the Minority Class

Normalized confusion matrices on the test set after upsampling the spam class (rows: true class, columns: predicted class):

Classifier | ME [%] | ham→ham | ham→spam | spam→ham | spam→spam
-----------|--------|---------|----------|----------|----------
AB         | 1.794  | 0.997   | 0.003    | 0.130    | 0.870
DT         | 3.857  | 0.973   | 0.027    | 0.122    | 0.878
LR-ℓ1      | 1.794  | 0.995   | 0.005    | 0.115    | 0.885
LR-ℓ2      | 1.614  | 0.998   | 0.002    | 0.122    | 0.878
LR         | 1.704  | 0.997   | 0.003    | 0.122    | 0.878
MLP        | 1.435  | 0.997   | 0.003    | 0.099    | 0.901
MNB        | 1.973  | 0.984   | 0.016    | 0.046    | 0.954
RF         | 1.883  | 1.000   | 0.000    | 0.160    | 0.840
SVM-L      | 1.614  | 0.998   | 0.002    | 0.122    | 0.878
SVM-R      | 1.614  | 0.998   | 0.002    | 0.122    | 0.878
SVM-S      | 1.614  | 0.998   | 0.002    | 0.122    | 0.878
kNN        | 4.753  | 1.000   | 0.000    | 0.405    | 0.595

Online Learning Strategies
Motivations and Strategies

Motivations
▶ Spam content obviously changes over time: spammers try to fool spam filtering algorithms
▶ Useful to design algorithms able to adapt to new content labeled by users
▶ Main difficulty in designing an online strategy: the vocabulary size
Naive strategy (sketched after this list)
▶ Increasing vocabulary size, using all the words of the training set
▶ Not viable in the longer term due to memory and computational issues
Windowed-time strategy
▶ Fixed vocabulary size, using only the most frequent words of the training set
▶ Training set windowed in time
▶ Should be quite efficient
▶ But requires frequent refitting
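A sketch of the naive strategy on assumed placeholder data; the batch size of 200 matches the experiments on the next slide, while the windowed-time variant (capped vocabulary, recent messages only) is noted in a comment:

```python
# Naive online strategy: refit from scratch each time a batch of 200 newly
# labeled messages arrives, so the vocabulary grows with every batch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Assumed stream of labeled SMS (placeholder data, not the actual dataset).
stream = [("free prize, call now", "spam"), ("see you tonight", "ham")] * 300
BATCH = 200

texts, labels = [], []
for start in range(0, len(stream), BATCH):
    batch = stream[start:start + BATCH]
    texts += [t for t, _ in batch]
    labels += [l for _, l in batch]
    vectorizer = TfidfVectorizer(stop_words="english")  # vocabulary regrows
    X = vectorizer.fit_transform(texts)
    clf = LogisticRegression().fit(X, labels)
    # A fixed test set would be re-evaluated here after each refit.
    # Windowed-time variant: TfidfVectorizer(max_features=...) fitted on only
    # the most recent messages instead of the full history.
```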

Online Learning Strategies
Experiments

Considerations
▶ We do not have access to time information
▶ Not possible to analyze the evolution of the vocabulary
▶ Relatively small dataset
Online scenario
▶ Naive strategy (increasing training set)
▶ Labeled messages received in batches of 200
▶ Test set fixed at the beginning
▶ Fast classifiers (LR-ℓ2, MNB)

[Figure: misclassification error [%] against training set size for LR-ℓ2 and MNB]

Conclusion

▶ Several SMS spam filtering methods have been studied, from the feature extraction step to the classification
▶ State-of-the-art classifiers work well, with a misclassification error of less than 5 %
▶ Resampling methods can be used to counter class imbalance but impact the classification error → a trade-off is required
▶ Online strategies may have to be adapted to deal with the increasing vocabulary size
▶ Future work: more advanced feature extraction methods (e.g. word embeddings) and classification methods (e.g. deep neural networks)

THANK YOU FOR YOUR ATTENTION!

Adrien Besson and Dimitris Perdios


Code: https://github.com/dperdios/sms-spam-filtering
Signal Processing Laboratory (LTS5): https://lts5www.epfl.ch
École Polytechnique Fédérale de Lausanne
