(Koller & Sahami 1997). Continuing in this vein, we seek to employ such Bayesian classification techniques to the problem of junk E-mail filtering. By making use of the extensible framework of Bayesian modeling, we can not only employ traditional document classification techniques based on the text of messages, but we can also incorporate problem-specific evidence; moreover, by combining the resulting class probabilities with a loss model, we can make "optimal" decisions from the standpoint of decision theory with respect to the classification of a message as junk or not.

In the remainder of this paper, we first consider methods for learning Bayesian classifiers from textual data. We then turn our attention to the specific features of junk mail filtering (beyond just the text of each message) that can be incorporated into the probabilistic models being learned. To validate our work, we provide a number of comparative experimental results and finally conclude with a few general observations and directions for future work.

C = c_k), which is often impractical to compute without imposing independence assumptions. The oldest and most restrictive form of such assumptions is embodied in the Naive Bayesian classifier (Good 1965), which assumes that each feature X_i is conditionally independent of every other feature, given the class variable C. More recently, there has been a great deal of work on learning much more expressive Bayesian networks from data (Cooper & Herskovits 1992) (Heckerman, Geiger, & Chickering 1995) as well as methods for learning networks specifically for classification tasks (Friedman, Geiger, & Goldszmidt 1997) (Sahami 1996). These later approaches allow for a limited form of dependence between feature variables, so as to relax the restrictive assumptions of the Naive Bayesian classifier. Figure 1 contrasts the structure of the Naive Bayesian classifier with that of the more expressive classifiers. In this paper, we focus on using the Naive Bayesian classifier,
Figure 1: Bayesian networks corresponding to (a) a Naive Bayesian classifier, with class node C and feature nodes X_1, X_2, X_3, ..., X_n; (b) a more complex Bayesian classifier allowing limited dependencies between the features.
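As a concrete illustration of the independence assumption just described, the following is a minimal sketch of a Naive Bayesian classifier over binary word-presence features; the toy corpus and the Laplace smoothing choice are illustrative assumptions, not the paper's actual data or estimator.

```python
import math

def train_naive_bayes(docs, labels):
    """Estimate P(C) and P(X_i = 1 | C) from binary word-presence data,
    using Laplace smoothing. `docs` is a list of sets of words."""
    classes = set(labels)
    vocab = set().union(*docs)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {c: {} for c in classes}
    for c in classes:
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        for w in vocab:
            present = sum(1 for d in class_docs if w in d)
            cond[c][w] = (present + 1) / (len(class_docs) + 2)  # Laplace smoothing
    return prior, cond, vocab

def classify(doc, prior, cond, vocab):
    """Return the class maximizing log P(C) + sum_i log P(X_i | C),
    where absent words contribute log(1 - P(X_i = 1 | C))."""
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for w in vocab:
            p = cond[c][w]
            s += math.log(p if w in doc else 1.0 - p)
        scores[c] = s
    return max(scores, key=scores.get)

# Toy corpus: word sets standing in for binary feature vectors.
docs = [{"free", "money"}, {"free", "winner"},
        {"meeting", "agenda"}, {"project", "meeting"}]
labels = ["junk", "junk", "legit", "legit"]
prior, cond, vocab = train_naive_bayes(docs, labels)
print(classify({"free", "money", "winner"}, prior, cond, vocab))  # junk
```

Because every feature contributes an independent log-probability term, adding non-textual features later amounts to simply extending the vocabulary of binary variables.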
but simply point out here that methods for learning richer probabilistic classification models exist that can be harnessed as needed in future work.

In the context of text classification, specifically junk E-mail filtering, it becomes necessary to represent mail messages as feature vectors so as to make such Bayesian classification methods directly applicable. To this end, we use the Vector Space model (Salton & McGill 1983), in which we define each dimension of this space as corresponding to a given word in the entire corpus of messages seen. Each individual message can then be represented as a binary vector denoting which words are present and absent in the message. With this representation, it becomes straightforward to learn a probabilistic classifier to detect junk mail given a pre-classified set of training messages.

Domain Specific Properties

In considering the specific problem of junk E-mail filtering, however, it is important to note that there are many particular features of E-mail besides just the individual words in the text of a message that provide evidence as to whether a message is junk or not. For example, particular phrases, such as "Free Money", or over-emphasized punctuation, such as "!!!!", are indicative of junk E-mail. Moreover, E-mail contains many non-textual features, such as the domain type of the message sender (e.g., .edu or .com), which provide a great deal of information as to whether a message is junk or not.

It is straightforward to incorporate such additional problem-specific features for junk mail classification into the Bayesian classifiers described above by simply adding additional variables denoting the presence or absence of these features into the vector for each message. In this way, various types of evidence about messages can be uniformly incorporated into the classification models, and the learning algorithms employed need not be modified.

To this end, we consider adding several different forms of problem-specific information as features to be used in classification. The first of these involves examining the message text for the appearance of specific phrases, such as "FREE!", "only $" (as in "only $4.95") and "be over 21". Approximately 35 such hand-crafted phrases that seemed particularly germane to this problem were included. We omit an exhaustive list of these phrases for brevity. Note that many of these features were based on manually constructed phrases used in an existing rule set for filtering junk that was readily outperformed by the probabilistic filtering scheme described here.

In addition to phrasal features, we also considered domain-specific non-textual features, such as the domain type of the sender (mentioned previously). For example, junk mail is virtually never sent from .edu domains. Moreover, many programs for reading E-mail will resolve familiar E-mail addresses (i.e., replace sdumais@microsoft.com with Susan Dumais). By detecting such resolutions, which often happen with messages sent by users familiar to the recipient, we can also provide additional evidence that a message is not junk. Yet another good non-textual indicator for distinguishing if a message is junk is found in examining whether the recipient of a message was the individual user or whether the message was sent via a mailing list.

A number of other simple distinctions, such as whether a message has attached documents (most junk E-mail does not have them), or when a given message was received (most junk E-mail is sent at night), are also powerful distinguishers between junk and legitimate E-mail. Furthermore, we considered a number of other useful distinctions which work quite well in a probabilistic classifier but would be problematic to use in a rule-based system. Such features included the percentage of non-alphanumeric characters in the subject of a mail message (junk E-mail, for example, often has subject descriptions such as "$$$$ BIG MONEY $$$$" which contain a high percentage of non-alphanumeric characters). As shown in Figure 2, there are clear differences in the distributions of non-alphanumeric characters in the subjects of legitimate versus junk messages.

Figure 2: Percentages of legitimate and junk E-mail with subjects comprised of varying degrees of non-alphanumeric characters.

But this feature alone (or a discretized variant of it that checks if a message subject contains more than, say, 5% non-alphanumeric characters) could not be used to make a simple yes/no distinction for junk reliably. This is likewise true for many of the other domain-specific features we consider as well. Rather, we can use such features as evidence in a probabilistic classifier to increase its confidence in a message being classified as junk or not.

In total, we included approximately 20 non-phrasal hand-crafted, domain-specific features in our junk E-mail filter. These features required very little person-effort to create, as most of them were generated during a short brainstorming meeting about this particular task.

Results

To validate our approach, we conducted a number of experiments in junk E-mail detection. Our goal here is both to measure the performance of various enhancements to the simple baseline classification based on the raw text of the messages and to examine the efficacy of learning such a junk filter in an "operational" setting.

The feature space for text will tend to be very large (generally on the order of several thousand dimensions). Consequently, we employ feature selection for several reasons. First, such dimensionality reduction helps provide an explicit control on the model variance resulting from estimating many parameters. Moreover, feature selection also helps to attenuate the degree to which the independence assumption is violated by the Naive Bayesian classifier.

We first employ a Zipf's Law-based analysis (Zipf 1949) of the corpus of E-mail messages to eliminate words that appear fewer than three times as having little resolving power. We then compute the mutual information MI(X_i; C) between each remaining feature X_i and the class C, given by

MI(X_i; C) = \sum_{X_i = x_i,\, C = c} P(X_i, C) \log \frac{P(X_i, C)}{P(X_i)\, P(C)}    (3)

We select the 500 features for which this value is greatest as the feature set from which to build a classifier. While we did not conduct a rigorous suite of experiments to arrive at 500 as the optimal number of features to use, initial experiments showed that this value provided reliable results.

Note that the initial feature set that we select from can include both word-based as well as hand-crafted phrasal and other domain-specific features. Previous work in feature selection (Koller & Sahami 1996) (Yang & Pedersen 1997) has indicated that such information-theoretic approaches are quite effective for text classification problems.

Using Domain-Specific Features

In our first set of experiments, we seek to determine the efficacy of using features that are hand-crafted specifically for the problem of junk E-mail detection. Here, we use a corpus of 1789 actual E-mail messages of which 1578 messages are pre-classified as "junk" and 211 messages are pre-classified as "legitimate." Note that the proportion of junk to legitimate mail in this corpus makes it more likely that legitimate mail will be classified as junk. Since such an error is far worse than marking a piece of junk mail as being legitimate, we believe that this class disparity creates a more challenging classification problem. This data is then split temporally (all the testing messages arrived after the training messages) into a training set of 1538 messages and a testing set of 251 messages.

We first consider using just the word-based tokens in the subject and body of each E-mail message as the feature set. We then augment these features with approximately 35 hand-crafted phrasal features constructed for this task. Finally, we further enhance the feature set with 20 non-textual domain-specific features for junk E-mail detection (several of which are explicitly described above). Using the training data in conjunction with each such feature set, we perform feature selection and then build a Naive Bayesian classifier that is then used to classify the testing data as junk or legitimate.

Recalling that the cost for misclassifying a legitimate E-mail as junk far outweighs the cost of marking
                                        Junk                  Legitimate
Feature Regime                          Precision  Recall     Precision  Recall
Words only                              97.1%      94.3%      87.7%      93.4%
Words + Phrases                         97.6%      94.3%      87.8%      94.7%
Words + Phrases + Domain-Specific       100.0%     98.3%      96.2%      100.0%

Table 1: Classification results using various feature sets.
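The mutual-information criterion of Equation (3), used above to rank candidate features, can be sketched as follows; the toy feature matrix is invented, and we keep a generic top-k rather than the paper's fixed 500 features.

```python
import math
from collections import Counter

def mutual_information(feature_values, class_values):
    """MI(X; C) = sum over (x, c) of P(x, c) * log(P(x, c) / (P(x) P(c)))
    for a single discrete feature X and the class C, from parallel lists."""
    n = len(feature_values)
    joint = Counter(zip(feature_values, class_values))
    px = Counter(feature_values)
    pc = Counter(class_values)
    mi = 0.0
    for (x, c), nxc in joint.items():
        pxc = nxc / n
        mi += pxc * math.log(pxc / ((px[x] / n) * (pc[c] / n)))
    return mi

def select_features(vectors, labels, k):
    """Rank feature indices by mutual information with the class; keep top k."""
    n_features = len(vectors[0])
    scored = []
    for i in range(n_features):
        column = [v[i] for v in vectors]
        scored.append((mutual_information(column, labels), i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Feature 0 predicts the class perfectly; feature 1 is pure noise.
vectors = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["junk", "junk", "legit", "legit"]
print(select_features(vectors, labels, 1))  # [0]
```

A perfectly predictive binary feature scores log 2 here, while a feature independent of the class scores 0, which is why the top-scoring features are the ones retained.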
Bayesian classifier (due to its independence assumptions) reveal that the 99.9% threshold is still reasonable for this task.

The precision and recall for both junk and legitimate E-mail for each feature regime are given in Table 1. More specifically, junk precision is the percentage of messages in the test data classified as junk which truly are. Likewise, legitimate precision denotes the percentage of messages in the test data classified as legitimate which truly are. Junk recall denotes the proportion of actual junk messages in the test set that are categorized as junk by the classifier, and legitimate recall denotes the proportion of actual legitimate messages in the test set that are categorized as legitimate. Clearly, junk precision is of greatest concern to most users (as they would not want their legitimate mail discarded as junk), and this is reflected in the asymmetric notion of cost used for classification. As can be seen in Table 1, while phrasal information does improve performance slightly, the incorporation of even a little domain knowledge for this task greatly improves the resulting classifications.

Figure 3 gives the junk mail Precision/Recall curves using the various feature sets. The figure focuses on the range from 0.85 to 1.0 to more clearly show the greatest variation in these curves. We clearly find that the incorporation of additional features, especially non-textual domain-specific information, gives consistently superior results to just considering the words in the messages. We believe that this provides evidence that for some targeted text classification problems there is a good deal of room for improvement by considering simple salient features of the domain in addition to the raw text which is available. Examples of such features for more general text categorization problems can include information relating to document authors, author affiliations, publishers, etc.

Figure 3: Precision/Recall curves for junk mail using various feature sets.

Sub-classes of Junk E-Mail

In considering the types of E-mail commonly considered junk, there seem to be two dominant groupings. The first is messages related to pornographic Web sites. The second concerns mostly "get-rich-quick" money making opportunities. Since these two groups are somewhat disparate, we consider the possibility of creating a junk E-mail filter by casting the junk filtering problem as a three category learning task. Here, the three categories of E-mail are defined as legitimate, pornographic-junk, and other-junk. By distinguishing between the two sub-groups of junk E-mail, our goal is to better capture the characteristics of such junk by allowing for more degrees of freedom in the learned classifier.

For this experiment, we consider a collection of 1183 E-mail messages of which 972 are junk and 211 are legitimate. This collection is split temporally, as before, into a training set of 916 messages and a testing set of 267 messages. To measure the efficacy of identifying sub-groupings of junk E-mail, we label this data in two different ways. In the first trial, each message is simply given one of the two labels legitimate or junk. In the second trial, each junk message is relabeled as either pornographic-junk or other-junk, thus creating a three-way classification problem.
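The three-category formulation just described amounts to refining the junk label before training and collapsing the two junk sub-classes back together when a junk/legitimate decision is needed. A schematic of that relabeling and collapse; the messages and the predicate used to split junk into sub-classes are hypothetical, not the paper's actual labeling procedure.

```python
def refine_labels(messages, labels, is_porn):
    """Relabel each junk message as porn-junk or other-junk; legitimate
    messages keep their label. `is_porn` is a caller-supplied predicate."""
    refined = []
    for msg, lab in zip(messages, labels):
        if lab == "junk":
            refined.append("porn-junk" if is_porn(msg) else "other-junk")
        else:
            refined.append(lab)
    return refined

def collapse(label):
    """Map three-way predictions back to junk/legitimate for evaluation."""
    return "junk" if label in ("porn-junk", "other-junk") else "legitimate"

msgs = ["XXX hot pics", "MAKE MONEY FAST", "lunch tomorrow?"]
labs = ["junk", "junk", "legitimate"]
three_way = refine_labels(msgs, labs, lambda m: "XXX" in m)
print(three_way)
print([collapse(l) for l in three_way])
```

Training on the refined labels gives the classifier separate parameters for each junk sub-class, while evaluation after collapsing remains directly comparable to the two-class results.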
                                            Junk                  Legitimate
Categories                                  Precision  Recall     Precision  Recall
Legitimate and Junk                         98.9%      94.2%      87.1%      97.4%
Legitimate, Porn-Junk and Other-Junk        95.5%      77.0%      61.1%      90.8%

Table 2: Classification results considering sub-groups of junk E-mail.
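The per-class precision and recall reported in Tables 1 and 2 follow the definitions given earlier: precision for a class is the fraction of messages predicted as that class which truly belong to it, and recall is the fraction of that class's messages which are predicted as such. A minimal sketch on invented labels:

```python
def precision_recall(true_labels, predicted_labels, cls):
    """Precision and recall for one class, from parallel label lists."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == cls and p == cls)
    n_predicted = sum(1 for p in predicted_labels if p == cls)
    n_actual = sum(1 for t in true_labels if t == cls)
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_actual if n_actual else 0.0
    return precision, recall

true = ["junk", "junk", "junk", "legitimate"]
pred = ["junk", "junk", "legitimate", "legitimate"]
print(precision_recall(true, pred, "junk"))        # (1.0, 0.666...)
print(precision_recall(true, pred, "legitimate"))  # (0.5, 1.0)
```

Note how the asymmetric cost discussed above shows up in these numbers: a high decision threshold for the junk class raises junk precision (fewer legitimate messages discarded) at the expense of junk recall.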
which are then used to learn a Naive Bayesian classifier

data we use all 222 messages that are sent to this user during the week following the period from which the