(Koller & Sahami 1997). Continuing in this vein, we seek to employ such Bayesian classification techniques to the problem of junk E-mail filtering. By making use of the extensible framework of Bayesian modeling, we can not only employ traditional document classification techniques based on the text of messages, but we can also incorporate problem-specific evidence; moreover, by combining the resulting class probabilities with a loss model, we can make "optimal" decisions from the standpoint of decision theory with respect to the classification of a message as junk or not.

In the remainder of this paper, we first consider methods for learning Bayesian classifiers from textual data. We then turn our attention to the specific features of junk mail filtering (beyond just the text of each message) that can be incorporated into the probabilistic models being learned. To validate our work, we provide a number of comparative experimental results and finally conclude with a few general observations and directions for future work.

C = c_k), which is often impractical to compute without imposing independence assumptions. The oldest and most restrictive form of such assumptions is embodied in the Naive Bayesian classifier (Good 1965), which assumes that each feature X_i is conditionally independent of every other feature, given the class variable C. More recently, there has been a great deal of work on learning much more expressive Bayesian networks from data (Cooper & Herskovits 1992) (Heckerman, Geiger, & Chickering 1995) as well as methods for learning networks specifically for classification tasks (Friedman, Geiger, & Goldszmidt 1997) (Sahami 1996). These later approaches allow for a limited form of dependence between feature variables, so as to relax the restrictive assumptions of the Naive Bayesian classifier. Figure 1 contrasts the structure of the Naive Bayesian classifier with that of the more expressive classifiers. In this paper, we focus on using the Naive Bayesian classifier,
Figure 1: Bayesian networks corresponding to (a) a Naive Bayesian classifier, with class node C and feature nodes X_1, X_2, X_3, ..., X_n; (b) a more complex Bayesian classifier allowing limited dependencies between the features.
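As a concrete illustration of the independence assumption just described, the following is a minimal sketch of a Naive Bayesian classifier over binary word-presence features; the toy corpus and the Laplace smoothing choice are illustrative assumptions, not the paper's actual data or estimator.

```python
import math

def train_naive_bayes(docs, labels):
    """Estimate P(C) and P(X_i = 1 | C) from binary word-presence data,
    using Laplace smoothing. `docs` is a list of sets of words."""
    classes = set(labels)
    vocab = set().union(*docs)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {c: {} for c in classes}
    for c in classes:
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        for w in vocab:
            present = sum(1 for d in class_docs if w in d)
            cond[c][w] = (present + 1) / (len(class_docs) + 2)  # Laplace smoothing
    return prior, cond, vocab

def classify(doc, prior, cond, vocab):
    """Return the class maximizing log P(C) + sum_i log P(X_i | C),
    where absent words contribute log(1 - P(X_i = 1 | C))."""
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for w in vocab:
            p = cond[c][w]
            s += math.log(p if w in doc else 1.0 - p)
        scores[c] = s
    return max(scores, key=scores.get)

# Toy corpus: word sets standing in for binary feature vectors.
docs = [{"free", "money"}, {"free", "winner"},
        {"meeting", "agenda"}, {"project", "meeting"}]
labels = ["junk", "junk", "legit", "legit"]
prior, cond, vocab = train_naive_bayes(docs, labels)
print(classify({"free", "money", "winner"}, prior, cond, vocab))  # junk
```

Because every feature contributes an independent log-probability term, adding non-textual features later amounts to simply extending the vocabulary of binary variables.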
but simply point out here that methods for learning richer probabilistic classification models exist that can be harnessed as needed in future work.

In the context of text classification, specifically junk E-mail filtering, it becomes necessary to represent mail messages as feature vectors so as to make such Bayesian classification methods directly applicable. To this end, we use the Vector Space model (Salton & McGill 1983), in which we define each dimension of this space as corresponding to a given word in the entire corpus of messages seen. Each individual message can then be represented as a binary vector denoting which words are present and absent in the message. With this representation, it becomes straightforward to learn a probabilistic classifier to detect junk mail given a pre-classified set of training messages.

Domain Specific Properties

In considering the specific problem of junk E-mail filtering, however, it is important to note that there are many particular features of E-mail besides just the individual words in the text of a message that provide evidence as to whether a message is junk or not. For example, particular phrases, such as "Free Money", or over-emphasized punctuation, such as "!!!!", are indicative of junk E-mail. Moreover, E-mail contains many non-textual features, such as the domain type of the message sender (e.g., .edu or .com), which provide a great deal of information as to whether a message is junk or not.

It is straightforward to incorporate such additional problem-specific features for junk mail classification into the Bayesian classifiers described above by simply adding additional variables denoting the presence or absence of these features into the vector for each message. In this way, various types of evidence about messages can be uniformly incorporated into the classification models, and the learning algorithms employed need not be modified.

To this end, we consider adding several different forms of problem-specific information as features to be used in classification. The first of these involves examining the message text for the appearance of specific phrases, such as "FREE!", "only $" (as in "only $4.95") and "be over 21". Approximately 35 such hand-crafted phrases that seemed particularly germane to this problem were included. We omit an exhaustive list of these phrases for brevity. Note that many of these features were based on manually constructed phrases used in an existing rule set for filtering junk that was readily outperformed by the probabilistic filtering scheme described here.

In addition to phrasal features, we also considered domain-specific non-textual features, such as the domain type of the sender (mentioned previously). For example, junk mail is virtually never sent from .edu domains. Moreover, many programs for reading E-mail will resolve familiar E-mail addresses (i.e., replace sdumais@microsoft.com with Susan Dumais). By detecting such resolutions, which often happen with messages sent by users familiar to the recipient, we can also provide additional evidence that a message is not junk. Yet another good non-textual indicator for distinguishing if a message is junk is found in examining whether the recipient of a message was the individual user or whether the message was sent via a mailing list.

A number of other simple distinctions, such as whether a message has attached documents (most junk E-mail does not have them), or when a given message was received (most junk E-mail is sent at night), are also powerful distinguishers between junk and legitimate E-mail. Furthermore, we considered a number of other useful distinctions which work quite well in a probabilistic classifier but would be problematic to use in a rule-based system. Such features included the percentage of non-alphanumeric characters in the subject of a mail message (junk E-mail, for example, often has subject descriptions such as "$$$$ BIG MONEY $$$$" which contain a high percentage of non-alphanumeric characters). As shown in Figure 2, there are clear differences in the distributions of non-alphanumeric characters in the subjects of legitimate versus junk messages.

Figure 2: Percentages of legitimate and junk E-mail with subjects comprised of varying degrees of non-alphanumeric characters.

But this feature alone (or a discretized variant of it that checks if a message subject contains more than, say, 5% non-alphanumeric characters) could not be used to make a simple yes/no distinction for junk reliably. This is likewise true for many of the other domain-specific features we consider as well. Rather, we can use such features as evidence in a probabilistic classifier to increase its confidence in a message being classified as junk or not.

In total, we included approximately 20 non-phrasal hand-crafted, domain-specific features in our junk E-mail filter. These features required very little person-effort to create, as most of them were generated during a short brainstorming meeting about this particular task.

Results

To validate our approach, we conducted a number of experiments in junk E-mail detection. Our goal here is both to measure the performance of various enhancements to the simple baseline classification based on the raw text of the messages and to examine the efficacy of learning such a junk filter in an "operational" setting.

The feature space for text will tend to be very large (generally on the order of several thousand dimensions). Consequently, we employ feature selection for several reasons. First, such dimensionality reduction helps provide an explicit control on the model variance resulting from estimating many parameters. Moreover, feature selection also helps to attenuate the degree to which the independence assumption is violated by the Naive Bayesian classifier.

We first employ a Zipf's Law-based analysis (Zipf 1949) of the corpus of E-mail messages to eliminate words that appear fewer than three times as having little resolving power. We then compute the mutual information MI(X_i; C) between each remaining feature X_i and the class C, given by

MI(X_i; C) = \sum_{X_i = x_i,\, C = c} P(X_i, C) \log \frac{P(X_i, C)}{P(X_i)\, P(C)}    (3)

We select the 500 features for which this value is greatest as the feature set from which to build a classifier. While we did not conduct a rigorous suite of experiments to arrive at 500 as the optimal number of features to use, initial experiments showed that this value provided reliable results.

Note that the initial feature set that we select from can include both word-based as well as hand-crafted phrasal and other domain-specific features. Previous work in feature selection (Koller & Sahami 1996) (Yang & Pedersen 1997) has indicated that such information-theoretic approaches are quite effective for text classification problems.

Using Domain-Specific Features

In our first set of experiments, we seek to determine the efficacy of using features that are hand-crafted specifically for the problem of junk E-mail detection. Here, we use a corpus of 1789 actual E-mail messages of which 1578 messages are pre-classified as "junk" and 211 messages are pre-classified as "legitimate." Note that the proportion of junk to legitimate mail in this corpus makes it more likely that legitimate mail will be classified as junk. Since such an error is far worse than marking a piece of junk mail as being legitimate, we believe that this class disparity creates a more challenging classification problem. This data is then split temporally (all the testing messages arrived after the training messages) into a training set of 1538 messages and a testing set of 251 messages.

We first consider using just the word-based tokens in the subject and body of each E-mail message as the feature set. We then augment these features with approximately 35 hand-crafted phrasal features constructed for this task. Finally, we further enhance the feature set with 20 non-textual domain-specific features for junk E-mail detection (several of which are explicitly described above). Using the training data in conjunction with each such feature set, we perform feature selection and then build a Naive Bayesian classifier that is then used to classify the testing data as junk or legitimate.

Recalling that the cost for misclassifying a legitimate E-mail as junk far outweighs the cost of marking
                                        Junk                  Legitimate
Feature Regime                          Precision  Recall     Precision  Recall
Words only                              97.1%      94.3%      87.7%      93.4%
Words + Phrases                         97.6%      94.3%      87.8%      94.7%
Words + Phrases + Domain-Specific       100.0%     98.3%      96.2%      100.0%

Table 1: Classification results using various feature sets.
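The mutual-information criterion of Equation (3), used above to rank candidate features, can be sketched as follows; the toy feature matrix is invented, and we keep a generic top-k rather than the paper's fixed 500 features.

```python
import math
from collections import Counter

def mutual_information(feature_values, class_values):
    """MI(X; C) = sum over (x, c) of P(x, c) * log(P(x, c) / (P(x) P(c)))
    for a single discrete feature X and the class C, from parallel lists."""
    n = len(feature_values)
    joint = Counter(zip(feature_values, class_values))
    px = Counter(feature_values)
    pc = Counter(class_values)
    mi = 0.0
    for (x, c), nxc in joint.items():
        pxc = nxc / n
        mi += pxc * math.log(pxc / ((px[x] / n) * (pc[c] / n)))
    return mi

def select_features(vectors, labels, k):
    """Rank feature indices by mutual information with the class; keep top k."""
    n_features = len(vectors[0])
    scored = []
    for i in range(n_features):
        column = [v[i] for v in vectors]
        scored.append((mutual_information(column, labels), i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]

# Feature 0 predicts the class perfectly; feature 1 is pure noise.
vectors = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["junk", "junk", "legit", "legit"]
print(select_features(vectors, labels, 1))  # [0]
```

A perfectly predictive binary feature scores log 2 here, while a feature independent of the class scores 0, which is why the top-scoring features are the ones retained.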
Bayesian classifier (due to its independence assumptions) reveal that the 99.9% threshold is still reasonable for this task.

The precision and recall for both junk and legitimate E-mail for each feature regime are given in Table 1. More specifically, junk precision is the percentage of messages in the test data classified as junk which truly are. Likewise, legitimate precision denotes the percentage of messages in the test data classified as legitimate which truly are. Junk recall denotes the proportion of actual junk messages in the test set that are categorized as junk by the classifier, and legitimate recall denotes the proportion of actual legitimate messages in the test set that are categorized as legitimate. Clearly, junk precision is of greatest concern to most users (as they would not want their legitimate mail discarded as junk), and this is reflected in the asymmetric notion of cost used for classification. As can be seen in Table 1, while phrasal information does improve performance slightly, the incorporation of even a little domain knowledge for this task greatly improves the resulting classifications.

Figure 3 gives the junk mail Precision/Recall curves using the various feature sets. The figure focuses on the range from 0.85 to 1.0 to more clearly show the greatest variation in these curves. We clearly find that the incorporation of additional features, especially non-textual domain-specific information, gives consistently superior results to just considering the words in the messages. We believe that this provides evidence that for some targeted text classification problems there is a good deal of room for improvement by considering simple salient features of the domain in addition to the raw text which is available. Examples of such features for more general text categorization problems can include information relating to document authors, author affiliations, publishers, etc.

Figure 3: Precision/Recall curves for junk mail using various feature sets.

Sub-classes of Junk E-Mail

In considering the types of E-mail commonly considered junk, there seem to be two dominant groupings. The first is messages related to pornographic Web sites. The second concerns mostly "get-rich-quick" money making opportunities. Since these two groups are somewhat disparate, we consider the possibility of creating a junk E-mail filter by casting the junk filtering problem as a three category learning task. Here, the three categories of E-mail are defined as legitimate, pornographic-junk, and other-junk. By distinguishing between the two sub-groups of junk E-mail, our goal is to better capture the characteristics of such junk by allowing for more degrees of freedom in the learned classifier.

For this experiment, we consider a collection of 1183 E-mail messages of which 972 are junk and 211 are legitimate. This collection is split temporally, as before, into a training set of 916 messages and a testing set of 267 messages. To measure the efficacy of identifying sub-groupings of junk E-mail, we label this data in two different ways. In the first trial, each message is simply given one of the two labels legitimate or junk. In the second trial, each junk message is relabeled as either pornographic-junk or other-junk, thus creating a three-way classification problem.
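The three-category formulation just described amounts to refining the junk label before training and collapsing the two junk sub-classes back together when a junk/legitimate decision is needed. A schematic of that relabeling and collapse; the messages and the predicate used to split junk into sub-classes are hypothetical, not the paper's actual labeling procedure.

```python
def refine_labels(messages, labels, is_porn):
    """Relabel each junk message as porn-junk or other-junk; legitimate
    messages keep their label. `is_porn` is a caller-supplied predicate."""
    refined = []
    for msg, lab in zip(messages, labels):
        if lab == "junk":
            refined.append("porn-junk" if is_porn(msg) else "other-junk")
        else:
            refined.append(lab)
    return refined

def collapse(label):
    """Map three-way predictions back to junk/legitimate for evaluation."""
    return "junk" if label in ("porn-junk", "other-junk") else "legitimate"

msgs = ["XXX hot pics", "MAKE MONEY FAST", "lunch tomorrow?"]
labs = ["junk", "junk", "legitimate"]
three_way = refine_labels(msgs, labs, lambda m: "XXX" in m)
print(three_way)
print([collapse(l) for l in three_way])
```

Training on the refined labels gives the classifier separate parameters for each junk sub-class, while evaluation after collapsing remains directly comparable to the two-class results.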
                                            Junk                  Legitimate
Categories                                  Precision  Recall     Precision  Recall
Legitimate and Junk                         98.9%      94.2%      87.1%      97.4%
Legitimate, Porn-Junk and Other-Junk        95.5%      77.0%      61.1%      90.8%

Table 2: Classification results considering sub-groups of junk E-mail.
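The per-class precision and recall reported in Tables 1 and 2 follow the definitions given earlier: precision for a class is the fraction of messages predicted as that class which truly belong to it, and recall is the fraction of that class's messages which are predicted as such. A minimal sketch on invented labels:

```python
def precision_recall(true_labels, predicted_labels, cls):
    """Precision and recall for one class, from parallel label lists."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels)
             if t == cls and p == cls)
    n_predicted = sum(1 for p in predicted_labels if p == cls)
    n_actual = sum(1 for t in true_labels if t == cls)
    precision = tp / n_predicted if n_predicted else 0.0
    recall = tp / n_actual if n_actual else 0.0
    return precision, recall

true = ["junk", "junk", "junk", "legitimate"]
pred = ["junk", "junk", "legitimate", "legitimate"]
print(precision_recall(true, pred, "junk"))        # (1.0, 0.666...)
print(precision_recall(true, pred, "legitimate"))  # (0.5, 1.0)
```

Note how the asymmetric cost discussed above shows up in these numbers: a high decision threshold for the junk class raises junk precision (fewer legitimate messages discarded) at the expense of junk recall.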
which are then used to learn a Naive Bayesian classifier

data we use all 222 messages that are sent to this user during the week following the period from which the