
3rd IEEE International Workshop on Wireless and Internet Services WISe 2010, Denver, Colorado

Bayesian Statistical Analysis for Spams


Youcef Begriche, Institut Telecom, Telecom ParisTech, Youcef.Begriche@telecom-paristech.fr
Ahmed Serhrouchni, Institut Telecom, Telecom ParisTech, Ahmed.Serhrouchni@enst.fr

Abstract—This paper presents a Bayesian statistical analysis applied to the spam problem. In most anti-spam research, it is generally assumed that the probability of a spam occurrence is equal to 0.5, which is in our opinion unrealistic. It is also assumed that the words of a spam message form an independent family of words. This leads us to examine how the posterior probability behaves when the a priori probability differs from 0.5, and to derive the consequences of the word-independence assumption on the posterior probability. The first assumption pushes us to define prior and posterior probability laws that enhance spam detection and increase the reliability of the decision. This analysis differs from previous results that used the Bayesian approach to the anti-spam issue, especially through the refinement and enhancement of various probability laws.
Keywords – Conditional density, Distribution attachment, Binomial law, Spam (S), Ham (H), Bayesian statistical model, Classification.
I. INTRODUCTION

This article aims to contribute to the study of anti-spam methods by improving the application of Bayes' theorem. The negative effects of spam on information systems have been widely described in [1]. Several research works and efforts are being made to reduce its scale and impact. Solutions have been developed but are still not sufficient for a total eradication. These solutions are based on several methods of treatment, a summary of which can be found in [2]. Many of these methods require a specific network infrastructure or cooperation between network elements (e.g. mail server directories) based on particular organizations that entail specific costs. Anti-spam Bayesian methods operate as "standalone" and are based on statistical and probabilistic mathematical models. These methods benefit from important research in the field of language processing. Indeed, fundamental results in this field have been obtained [3] and make it possible to apply them to the anti-spam field. Paul Graham is probably the pioneer who suggested a complete solution to filter messages based on a Bayesian mathematical method [6]. Several classification-based solutions also exist, for which a summary can be found in [2]. The author proposes three main classes:
• Filtering the content, which amounts to distinguishing the spam (noted S) from the non-spam (called ham and noted H). As mentioned in [2], this type of process is prone to errors, and subject to false negatives [1] and false positives [2]. The goal is to reduce the number of false positives. For this purpose there are several filtering methods: heuristic, adaptive or Bayesian filtering, collaborative filtering and honey pots.
• The identification of one or more fields of the message. Classical cryptographic functions may be used (digital signature and authentication). This method is effective against phishing [3] or any spoofing approach. Several approaches related to this category are deployed.
• Establishing a cost for message transmission. This method drains a significant amount of resources on the spammer side. Several approaches based on costs exist.

Because of their limitations when they are used separately, administrators and users sometimes combine several of these methods. Paul Graham [4] and Gary Robinson [5] worked on message content by assuming that the probability of a message being a spam is equal to that of a ham (Pr(spam) = Pr(ham) = 0.5), and all classifiers that have used this formula worked under the same assumption.

In this paper, we focus on the generalization of Bayes' theorem for the anti-spam application, taking into account that the probabilities mentioned above may differ.

In [6], some naive Bayes (NB) classifiers are proposed: multi-variate Bernoulli NB, multinomial NB, Boolean attributes, and multi-variate Gauss NB.

The classifiers are built on Bayes' theorem: if a message is represented by a vector \vec{x} = (x_1, \ldots, x_n), then the probability that the message belongs to the class c (spam (S) or ham (H)) is given by the formula:

    P(c/\vec{x}) = \frac{P(c) \cdot P(\vec{x}/c)}{P(\vec{x})}    (1)

As there are two classes (spam and ham), the denominator is written:

    P(\vec{x}) = P(\vec{x}/S) \cdot P(S) + P(\vec{x}/H) \cdot P(H)    (2)

The criterion for classifying a message as spam is given by:

    \frac{P(S) \cdot P(\vec{x}/S)}{P(\vec{x}/S) \cdot P(S) + P(\vec{x}/H) \cdot P(H)} > T    (3)

where T is a fixed threshold. The criterion is adapted for each classifier:

1) For the multi-variate Bernoulli NB, the criterion becomes:

    \frac{P(S) \cdot \prod_{i=1}^{m} P(t_i/S)^{x_i} (1 - P(t_i/S))^{1 - x_i}}{\sum_{c \in \{S,H\}} P(c) \cdot \prod_{i=1}^{m} P(t_i/c)^{x_i} (1 - P(t_i/c))^{1 - x_i}} > T    (4)





where:
• x_i = 1 or 0 according to whether the token t_i occurs in the message or not;
• P(t/c) = \frac{1 + M_{t,c}}{2 + M_c}, where M_{t,c} is the number of training messages of category c that contain token t, while M_c is the total number of training messages of category c.
2) For the multinomial NB, the criterion for classifying a message as spam is:

    \frac{P(S) \cdot \prod_{i=1}^{m} P(t_i/S)^{x_i}}{\sum_{c \in \{S,H\}} P(c) \cdot \prod_{i=1}^{m} P(t_i/c)^{x_i}} > T    (5)
3) For the multi-variate Gauss NB, the criterion is:

    \frac{P(S) \cdot \prod_{i=1}^{m} g(x_i;\, \mu_{i,S}, \sigma_{i,S})}{\sum_{c \in \{S,H\}} P(c) \cdot \prod_{i=1}^{m} g(x_i;\, \mu_{i,c}, \sigma_{i,c})} > T    (6)

where:

    g(x_i;\, \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}    (7)

and the mean \mu_{i,c} and standard deviation \sigma_{i,c} of each distribution are estimated from the training data.
In those classifiers, P(S) and P(H) are typically estimated by dividing the number of training messages of category (S or H) by the total number of training messages. A code sketch of the Bernoulli criterion (4) is given below.
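The following Python sketch illustrates the Bernoulli criterion (4) with the Laplace-smoothed estimates defined above. It is an illustration under stated assumptions, not the implementation evaluated in [6]; the tokens, the training counts and the default threshold T are hypothetical.

```python
from math import prod

def bernoulli_nb_is_spam(x, token_stats, n_spam, n_ham, T=0.5):
    """Criterion (4), multi-variate Bernoulli naive Bayes.

    x             -- dict token -> 0/1, presence of the token in the message
    token_stats   -- dict token -> (M_t_spam, M_t_ham), training counts
    n_spam, n_ham -- number of training messages per category
    """
    def p(t_count, m_c):                      # P(t/c) with Laplace smoothing
        return (1.0 + t_count) / (2.0 + m_c)

    def likelihood(m_c, idx):                 # product of per-token Bernoulli factors
        return prod(
            p(token_stats[t][idx], m_c) if xi == 1
            else 1.0 - p(token_stats[t][idx], m_c)
            for t, xi in x.items()
        )

    prior_s = n_spam / (n_spam + n_ham)       # P(S) estimated from the training set
    prior_h = n_ham / (n_spam + n_ham)        # P(H)
    num = prior_s * likelihood(n_spam, 0)
    den = num + prior_h * likelihood(n_ham, 1)
    return num / den > T

# Hypothetical toy data: two tokens, 60 spam and 140 ham training messages.
stats = {"viagra": (40, 2), "meeting": (3, 90)}
print(bernoulli_nb_is_spam({"viagra": 1, "meeting": 0}, stats, 60, 140))
```

The multinomial and Gaussian variants (5)-(7) differ only in the per-token factor inside the products.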
In [7], the criterion for classifying a message as spam is:

    \frac{P(C = S / \vec{X} = \vec{x})}{P(C = H / \vec{X} = \vec{x})} > \lambda    (8)

where:

    P(C = c / \vec{X} = \vec{x}) = \frac{P(C = c) \cdot \prod_{i=1}^{n} P(X_i = x_i / C = c)}{\sum_{k \in \{S,H\}} P(C = k) \cdot \prod_{i=1}^{n} P(X_i = x_i / C = k)}    (9)

For this classifier, P(C = S) and P(C = H) are also estimated by dividing the number of training messages of category (spam or ham) by the total number of training messages.
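A remark (ours) relating this ratio test to the threshold form of (3): since the two posteriors sum to one,

```latex
\frac{P(C=S/\vec{X}=\vec{x})}{P(C=H/\vec{X}=\vec{x})} > \lambda
\;\Longleftrightarrow\;
P(C=S/\vec{X}=\vec{x}) > \frac{\lambda}{1+\lambda},
% i.e. criterion (3) with T = \lambda/(1+\lambda); \lambda = 1 corresponds to T = 1/2.
```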
In [8], authors work on the same formula (1) (Bayes' theorem) as follows:

    P(S/\vec{x}) = \frac{P(S) \cdot P(\vec{x}/S)}{P(\vec{x})} = \frac{P(S) \cdot P(\vec{x}/S)}{P(\vec{x}/S) \cdot P(S) + P(\vec{x}/H) \cdot P(H)} = \frac{\frac{P(\vec{x}/S)}{P(\vec{x}/H)} \cdot P(S)}{\frac{P(\vec{x}/S)}{P(\vec{x}/H)} \cdot P(S) + P(H)}    (10)

In this paper, they worked only on the ratio P(\vec{x}/S)/P(\vec{x}/H), and P(C = S) and P(C = H) are also estimated by dividing the number of training messages of category (spam or legitimate) by the total number of training messages.

In [9], authors use the following criterion: given an input document d, its target class (spam or ham) can be found by choosing the one with the highest posterior probability:

    H(d) = \arg\max_{c_j \in C} \prod_{w \in F} \left[ \frac{P(w/c_j)\, P(c_j)}{\sum_{c' \in C} P(w/c')\, P(c')} \right]^{P(w/d) \cdot CHI(w, c_j)}    (11)

where F is the text feature vector; P(w/d) is the ratio of the frequency of w in d to the total number of words; CHI(w, c_j) is the \chi^2 statistic of word w and class c_j, which measures the lack of independence between w and c_j; and P(c_j) is the number of documents with class label c_j divided by the total number of documents.

In [10], authors determine the maximum a posteriori (MAP) class by calculating:

    e_{MAP} = \arg\max_{e_i \in E} P(e_i) \prod_j P(k_j/e_i)

where k_j is the word found in the j-th position in the unseen query and e_i is the tutorial example/page, with

    P(e_i) = \frac{\text{number of queries in } Q \text{ with } e_i}{\text{number of queries in } Q}    (12)

where Q is the set of queries.

In [11], [12], [13], the degree of confidence W^{NB}_S(\vec{x}) that \vec{x} is spam is introduced:

    W^{NB}_S(\vec{x}) = P(S/\vec{x}) = \frac{P(S) \cdot \prod_{i=1}^{m} P(x_i/S)}{\sum_{k \in \{S,H\}} P(k) \cdot \prod_{i=1}^{m} P(x_i/k)}    (13)

Here P(C = S) and P(C = H) are also estimated by dividing the number of training messages of category (spam or legitimate) by the total number of training messages.

In [14], authors discuss the implementation of the Binomial and the Poisson distributions in a Bayesian spam filter, in order to calculate the probability that a mail containing words not already stored in a database (i.e., encountered by the filter for the first time) is spam. They use:
• a Binomial random variable R with parameters n and p, whose probability function is:

    f(r) = P(R = r) = \binom{n}{r} p^r q^{n-r}, \quad r = 0, 1, 2, 3, \ldots, n    (14)

where p + q = 1;
• a Poisson random variable R, whose probability function is:

    f(r) = P(R = r) = \frac{e^{-m} m^r}{r!}, \quad r = 0, 1, 2, 3, \ldots    (15)

where m = np. Here p is the probability of having a spam word, which is assumed constant from trial to trial (0.4 for the Binomial distribution and 0.04 for the Poisson distribution).

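For illustration, formulas (14) and (15) in Python; the word count n is hypothetical, while p takes the constants assumed in [14] (0.4 in the Binomial case, 0.04 in the Poisson case):

```python
from math import comb, exp, factorial

def binomial_pmf(r, n, p):
    """Formula (14): P(R = r) for a Binomial(n, p) random variable."""
    return comb(n, r) * p**r * (1.0 - p)**(n - r)

def poisson_pmf(r, m):
    """Formula (15): P(R = r) for a Poisson random variable with mean m = n*p."""
    return exp(-m) * m**r / factorial(r)

n = 10                               # hypothetical number of unseen words in a mail
print(binomial_pmf(4, n, 0.4))       # P(exactly 4 spam words), Binomial, p = 0.4
print(poisson_pmf(4, n * 0.04))      # same event under the Poisson model, p = 0.04
```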
[15] defines a local probability. For example:

    P_{local\text{-}S} = 0.5 + \frac{N_S - N_H}{C_1 \cdot (N_S + N_H + C_2)}    (16)

Each word feature generates one such local probability. These local probabilities are then used in a Bayesian chain rule to calculate an overall probability that an incoming text is spam. The Bayesian chain rule is:

    P(\text{in class}/\text{feature}) = \frac{P(\text{feature}/\text{in class}) \times P(\text{in class})}{P(\text{feature}/\text{in class}) \times P(\text{in class}) + P(\text{feature}/\text{not in class}) \times P(\text{not in class})}

which is applied iteratively to chain each local probability feature into an overall probability for the text. Here P(in class) is also estimated by dividing the number of training messages of class (spam or legitimate) by the total number of training messages.
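A minimal Python sketch of this scheme. The iterative update below is our reading of "applied iteratively"; the counts N_S, N_H and the constants C_1, C_2 are illustrative placeholders, not the values used in [15].

```python
def local_probability(n_s, n_h, c1=16.0, c2=1.0):
    """Formula (16): local spam probability of one word feature.
    n_s, n_h -- occurrences of the word in spam and ham training texts."""
    return 0.5 + (n_s - n_h) / (c1 * (n_s + n_h + c2))

def chain(prior, local_probs):
    """Apply the Bayesian chain rule feature by feature, treating each local
    probability as P(feature/in class) and its complement as P(feature/not in class)."""
    p = prior
    for lp in local_probs:
        p = (lp * p) / (lp * p + (1.0 - lp) * (1.0 - p))
    return p

word_counts = [(30, 2), (1, 25), (12, 3)]            # hypothetical (N_S, N_H) per word
locals_ = [local_probability(ns, nh) for ns, nh in word_counts]
print(chain(0.5, locals_))                           # overall spam probability of the text
```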
In [16], they introduce a score notion as follows. If a mail is represented by \vec{x} = (x_1, \ldots, x_n), then:

    \mathrm{score}(\vec{x}) = \Big(\log P(S) + \sum_k \log P(x_k/S)\Big) - \Big(\log P(leg) + \sum_k \log P(x_k/leg)\Big)    (17)

Therefore, if score(\vec{x}) > 0 the e-mail is classified as spam, and as legitimate otherwise; score(\vec{x}) is simply the logarithm of the posterior odds, so this is the ratio criterion (8) with \lambda = 1. P(C = S) and P(C = H) are also estimated by dividing the number of training messages of category (spam or legitimate) by the total number of training messages.
This work differs from the previous works listed above: we give the generalization of Bayes' theorem applied to spam, together with its proof. This generalization gives a relationship between the posterior probability, the prior probability, and the size n of the sample that characterizes the message. This relationship allows us to find the value of the posterior probability for any value of the prior probability and for all values of n. We also explain and compare some results found previously, i.e. when the prior probability is equal to 0.5.
II. EXPLICATION, FORMULA, PROOF

In what follows, spam and ham will be denoted respectively S and H.

A. Explication

If A and B are two events, Bayes' formula:

    P(A/B) = \frac{P(B/A) \cdot P(A)}{P(B)}    (18)

gives the conditional probability of A knowing B. This formula is applied to anti-spam as follows: if a certain message is represented by the vector \vec{x} = (x_1, \ldots, x_n), the probability that this message is a spam is calculated as follows:

    P(S/x_1, \ldots, x_n) = \frac{P(x_1, \ldots, x_n/S) \cdot P(S)}{P(x_1, \ldots, x_n/S) \cdot P(S) + P(x_1, \ldots, x_n/H) \cdot P(H)}    (19)

Here, in previous works, authors assume that P(S) = P(H) = 0.5.

B. Formula

Taking into account that these probabilities may differ, we propose the following formula (a derivation sketch is given just before Section IV):

    P(S/x_1, \ldots, x_n) = \frac{P(S/x_1) \cdots P(S/x_n)}{P(S/x_1) \cdots P(S/x_n) + \left(\frac{P(S)}{1 - P(S)}\right)^{n-1} \cdot P(H/x_1) \cdots P(H/x_n)}    (20)

III. METHOD OF CALCULATION

A. Programmatically

Using MATLAB, we wrote a program that allows us, using formula (20), to calculate whether a given message is spam. Moreover, the probabilities P(S/x_i) involved in the formula are numbers randomly drawn by MATLAB.

B. Classification criterion

When a threshold \alpha is fixed, a message is classified as spam when:

    P(S/x_1, \ldots, x_n) \geq \alpha
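Before turning to the results, we record why formula (20) holds. The original proof is not reproduced in this version; the following sketch (ours) uses only Bayes' theorem (18) and the assumption that the words x_i are independent given the class:

```latex
% Write p_s = P(S) and p_h = P(H) = 1 - p_s. By Bayes' theorem and independence,
\begin{align*}
P(S/x_1,\ldots,x_n)
 &= \frac{p_s \prod_{i} P(x_i/S)}{p_s \prod_{i} P(x_i/S) + p_h \prod_{i} P(x_i/H)} \\
 &= \frac{\prod_{i} P(S/x_i)\,/\,p_s^{\,n-1}}
         {\prod_{i} P(S/x_i)\,/\,p_s^{\,n-1} + \prod_{i} P(H/x_i)\,/\,p_h^{\,n-1}}
 && \text{using } P(x_i/S) = \tfrac{P(S/x_i)\,P(x_i)}{p_s};\ \text{the } \textstyle\prod_i P(x_i) \text{ cancel} \\
 &= \frac{\prod_{i} P(S/x_i)}
         {\prod_{i} P(S/x_i) + \left(\frac{p_s}{1-p_s}\right)^{\!n-1} \prod_{i} P(H/x_i)},
\end{align*}
% which is formula (20).
```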
IV. RESULTS AND INTERPRETATION

Formula (20) binds the posterior probability P(S/x_1, \ldots, x_n), the prior probability ps and the size n of the sample. By fixing n and considering in this formula the posterior probability P(S/x_1, \ldots, x_n) as a function of the prior probability ps, denoted f, the derivative of this function is equal to:

    f'(ps) = \frac{-(n-1) \cdot E \cdot F \cdot ps^{\,n-2}}{(1 - ps)^n \cdot \left[E + \left(\frac{ps}{1 - ps}\right)^{n-1} \cdot F\right]^2}    (21)

where E = P(S/x_1) \cdots P(S/x_n) and F = P(H/x_1) \cdots P(H/x_n).

This derivative is negative, so the function f is necessarily decreasing; therefore the prior and posterior probabilities vary inversely: when the prior probability ps increases from 0 to 1, the posterior probability P(S/x_1, \ldots, x_n) decreases from 1 to 0. This is shown in Figures 1 and 2.

Figure 1 contains a family of histograms. For each value of the prior probability ps, we have a histogram which gives the value of the probability that the message represented by (x_1, \ldots, x_n) is a spam, noted P(S/x_1, \ldots, x_n), for varying sample sizes n. It may be noted that from ps = 0.5 onwards the bars fall below 0.1, that is to say that the posterior probability is less than or equal to 0.1.

Figure 2 contains a family of curves. For each value of the sample size n, there is a curve showing the variations of the posterior probability P(S/x_1, \ldots, x_n) as a function of the prior probability ps. We see that these curves are sharply decreasing in a neighborhood of 0.5.

Figure 3 contains the graph of the derivative (21) as a function of ps and n. We note that this derivative is zero outside a certain neighborhood of 0.5, which explains the variations of the functions in Figure 2.

Figure 1: Histogram of the posterior probability according to the prior probability. [figure not reproduced]
Figure 2: Curves of the posterior probability according to the prior probability. [figure not reproduced]
Figure 3: Curve of the derivative of f given by formula (21). [figure not reproduced]
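This behavior can be checked numerically. The following Python sketch mirrors the MATLAB procedure described in Section III, under our reading of it: the P(S/x_i) are drawn at random, formula (20) is evaluated over a grid of priors for a few sample sizes, and a message is classified against the threshold \alpha. All numeric values are illustrative.

```python
import random

def posterior(p_s_given_x, ps):
    """Formula (20): posterior spam probability for the prior ps = P(S)."""
    n = len(p_s_given_x)
    e = f = 1.0
    for p in p_s_given_x:
        e *= p              # E = product of the P(S/x_i)
        f *= 1.0 - p        # F = product of the P(H/x_i), taking P(H/x_i) = 1 - P(S/x_i)
    return e / (e + (ps / (1.0 - ps)) ** (n - 1) * f)

def is_spam(p_s_given_x, ps, alpha=0.9):
    """Classification criterion of Section III-B."""
    return posterior(p_s_given_x, ps) >= alpha

random.seed(0)
for n in (5, 10, 20):                                  # one curve per sample size n
    probs = [random.random() for _ in range(n)]        # randomly drawn P(S/x_i)
    row = ["%.3f" % posterior(probs, ps) for ps in (0.1, 0.3, 0.5, 0.7, 0.9)]
    print("n=%2d:" % n, " ".join(row), "| spam at ps=0.5:", is_spam(probs, 0.5))
# Each row decreases from left to right, as the negative derivative (21) predicts.
```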
V. CONCLUSION AND FUTURE WORK

According to the remarks made in the previous section, we can say that interesting results in terms of spam filtering are obtained only when it is assumed that the prior probability ps takes values in a neighborhood of 0.5. Obviously, taking ps = 0.5 is not very realistic, because we all know that we do not receive one spam out of every two messages. The histograms in Figure 1, the curves in Figure 2 and the comments made on them allow us to say that if we want to be realistic, by taking values of the prior probability ps far from 0.5, we would have very unreliable results.

To overcome these difficulties, it is useful to make a more specific study by proposing laws for the prior probabilities, laws which best summarize the behavior of ps and which must respect some theoretical criteria. Once these laws are established, they will lead to laws of the posterior probability that will identify the best developments of P(S/x_1, \ldots, x_n). These proposals fall within the scope of future work.
REFERENCES

[1] D. W. K. Khong, "An Economic Analysis of Spam Law," Erasmus Law & Economics Review, vol. 1, pp. 23-45, February 2004.
[2] A. Herzberg, "Controlling Spam by Secure Internet Content Selection," Secure Communication Networks (SCN 2004), LNCS vol. 3352, Springer-Verlag.
[3] M. Sahami, S. Dumais, D. Heckerman and E. Horvitz, "A Bayesian Approach to Filtering Junk E-mail," AAAI Workshop, Madison, Wisconsin, 1998.
[4] P. Graham, "A Plan for Spam," http://paulgraham.com/.
[5] G. Robinson, "A Statistical Approach to the Spam Problem," Linux Journal, issue 107, March 2003.
[6] V. Metsis, I. Androutsopoulos and G. Paliouras, "Spam Filtering with Naive Bayes – Which Naive Bayes?," 3rd Conference on Email and Anti-Spam (CEAS 2006), Mountain View, CA, USA, 2006.
[7] S.-B. Kim, K.-S. Han, H.-C. Rim and S. H. Myaeng, "Some Effective Techniques for Naive Bayes Text Classification," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 11, November 2006.
[8] Y. Guo, Y. Zhang, J. Liu and C. Wang, "Research on the Comprehensive Anti-Spam Filter," 2006 IEEE International Conference on Industrial Informatics, 16-18 August 2006, pp. 1069-1074.
[9] Y. Zhou, M. S. Mulekar and P. Nerellapalli, "Adaptive Spam Filtering Using Dynamic Feature Spaces," International Journal on Artificial Intelligence Tools, vol. 16, no. 4, pp. 627-646, 2007.
[10] O. Hernes and J. Zhang, "A Tutorial Search Engine Based on Bayesian Learning," Proceedings of the 2004 International Conference on Machine Learning and Applications, 16-18 December 2004, pp. 418-422.
[11] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos and P. Stamatopoulos, "Stacking Classifiers for Anti-Spam Filtering of E-mail," Empirical Methods in Natural Language Processing (EMNLP 2001), L. Lee and D. Harman (Eds.), pp. 44-50, Carnegie Mellon University, Pittsburgh, PA, 2001.
[12] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos and P. Stamatopoulos, "A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists," Information Retrieval, vol. 6, no. 1, pp. 49-73, January 2003, Kluwer Academic Publishers / Springer Netherlands.
[13] I. Androutsopoulos, G. Paliouras and E. Michelakis, "Learning to Filter Unsolicited Commercial E-Mail," Technical Report 2004/2, NCSR "Demokritos", revised version October 2004, with additional minor corrections October 2006.
[14] R. Zakariah and S. Ehsan, "Detecting Junk Mails by Implementing Statistical Theory," 20th International Conference on Advanced Information Networking and Applications (AINA 2006), Vienna, Austria, 18-20 April 2006, IEEE Computer Society.
[15] W. S. Yerazunis, "The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It," MIT Spam Conference, January 2004.
[16] Z. Yang, X. Nie, W. Xu and J. Guo, "An Approach to Spam Detection by Naive Bayes Ensemble Based on Decision Induction," Sixth International Conference on Intelligent Systems Design and Applications (ISDA '06), 16-18 October 2006, vol. 2, pp. 861-866.
