...over-simplification. McCallum and Nigam (1998) posit a multinomial Naive Bayes model for text classification and show improved performance compared to the multi-variate Bernoulli model due to the incorporation of frequency information. It is this multinomial version, which we call "multinomial Naive Bayes" (MNB), that we discuss, analyze and improve on in this paper.

2.1. Modeling and Classification

Multinomial Naive Bayes models the distribution of words in a document as a multinomial. A document is treated as a sequence of words and it is assumed that each word position is generated independently of every other. For classification, we assume that there are a fixed number of classes, c ∈ {1, 2, ..., m}, each with a fixed set of multinomial parameters. The parameter vector for a class c is θ_c = {θ_c1, θ_c2, ..., θ_cn}, where n is the size of the vocabulary, Σ_i θ_ci = 1 and θ_ci is the probability that word i occurs in that class. The likelihood of a document is a product of the parameters of the words that appear in the document,

\[ p(d \mid \theta_c) = \frac{\left(\sum_i f_i\right)!}{\prod_i f_i!} \prod_i (\theta_{ci})^{f_i}, \qquad (1) \]

where f_i is the frequency count of word i in document d. By assigning a prior distribution over the set of classes, p(θ_c), we can arrive at the minimum-error classification rule (Duda & Hart, 1973) which selects the class with the largest posterior probability,

\[ l(d) = \arg\max_c \left[ \log p(\theta_c) + \sum_i f_i \log \theta_{ci} \right] \qquad (2) \]

\[ \;\; = \arg\max_c \left[ b_c + \sum_i f_i w_{ci} \right], \qquad (3) \]

where b_c is the threshold term and w_ci is the class c weight for word i. These values are natural parameters for the decision boundary. This is especially easy to see for binary classification, where the boundary is defined by setting the differences between the positive and negative class parameters equal to zero,

\[ (b_+ - b_-) + \sum_i f_i (w_{+i} - w_{-i}) = 0. \]

The form of this equation is identical to the decision boundary learned by the (linear) Support Vector Machine, logistic regression, linear least squares and the perceptron. Naive Bayes' relatively poor performance results from how it chooses the b_c and w_ci.
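Equations 2 and 3 describe a linear classifier over word counts once the weights are fixed. The following minimal sketch (not the authors' code; the priors, weights and counts are made-up illustrative values) makes that form concrete:

```python
import numpy as np

# Equation 3 as code: score_c = b_c + sum_i f_i * w_ci; predict the argmax class.
# b and w are assumed given here; Section 2.2 describes how MNB estimates them.
b = np.log(np.array([0.5, 0.5]))            # b_c: log class priors, shape (m,)
w = np.log(np.array([[0.7, 0.2, 0.1],       # w_ci: log word probabilities, shape (m, n)
                     [0.1, 0.3, 0.6]]))
f = np.array([3, 0, 2])                     # f_i: word counts for one document

scores = b + w @ f                          # one linear score per class
print(int(np.argmax(scores)))               # index of the predicted class
```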
2.2. Parameter Estimation

For the problem of classification, the number of classes and labeled training data for each class is given, but the parameters for each class are not. Parameters must be estimated from the training data. We do this by selecting a Dirichlet prior and taking the expectation of the parameter with respect to the posterior. For details, we refer the reader to Section 2 of Heckerman (1995). This gives us a simple form for the estimate of the multinomial parameter, which involves the number of times word i appears in the documents in class c (N_ci), divided by the total number of word occurrences in class c (N_c). For word i, a prior adds in α_i imagined occurrences so that the estimate is a smoothed version of the maximum likelihood estimate,

\[ \hat{\theta}_{ci} = \frac{N_{ci} + \alpha_i}{N_c + \alpha}, \qquad (4) \]

where α denotes the sum of the α_i. While α_i can be set differently for each word, we follow common practice by setting α_i = 1 for all words.

Substituting the true parameters in Equation 2 with our estimates, we get the MNB classifier,

\[ l_{\text{MNB}}(d) = \arg\max_c \left[ \log \hat{p}(\theta_c) + \sum_i f_i \log \frac{N_{ci} + \alpha_i}{N_c + \alpha} \right], \]

where p̂(θ_c) is the class prior estimate. The prior class probabilities, p(θ_c), could be estimated like the word estimates. However, the class probabilities tend to be overpowered by the combination of word probabilities, so we use a uniform prior estimate for simplicity. The weights for the decision boundary defined by the MNB classifier are the log parameter estimates,

\[ \hat{w}_{ci} = \log \hat{\theta}_{ci}. \qquad (5) \]
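As a concrete illustration of Equation 4 and the resulting classifier, here is a small sketch over a toy count matrix; the variable names and data are assumptions for illustration, not taken from the paper's experiments:

```python
import numpy as np

# Toy MNB training sketch. Rows of `counts` are documents, columns are vocabulary
# words; `labels` gives the class of each document. Smoothing uses alpha_i = 1
# for every word, as in Equation 4.
counts = np.array([[2, 1, 0],
                   [1, 0, 0],
                   [0, 1, 3],
                   [0, 2, 2]])
labels = np.array([0, 0, 1, 1])
n_classes, alpha_i = 2, 1.0
alpha = alpha_i * counts.shape[1]                     # alpha = sum_i alpha_i

N_ci = np.array([counts[labels == c].sum(axis=0) for c in range(n_classes)])
theta_hat = (N_ci + alpha_i) / (N_ci.sum(axis=1, keepdims=True) + alpha)

def classify_mnb(f):
    # Uniform class prior, so only the word term of the MNB rule matters:
    # l(d) = argmax_c sum_i f_i * log(theta_hat_ci).
    return int(np.argmax(np.log(theta_hat) @ f))

print(theta_hat)
print(classify_mnb(np.array([1, 0, 2])))              # expected: 1
```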
3. Correcting Systemic Errors

Naive Bayes has many systemic errors. Systemic errors are byproducts of the algorithm that cause an inappropriate favoring of one class over the other. In this section, we discuss two under-studied systemic errors that cause Naive Bayes to perform poorly. We highlight how they cause misclassifications and propose solutions to mitigate or eliminate them.

3.1. Skewed Data Bias

In this section, we show that skewed data (more training examples for one class than another) can cause the decision boundary weights to be biased. This causes the classifier to unwittingly prefer one class over the other. We show the reason for the bias and propose to alleviate the problem by learning the weights for a class using all training data not in that class.
Table 1. Shown is a simple classification example with two classes. Each class has a binomial distribution with probability of heads θ = 0.25 and θ = 0.2, respectively. We are given one training sample for Class 1 and two training samples for Class 2, and want to label a heads (H) occurrence. We find the maximum likelihood parameter settings (θ̂_1 and θ̂_2) for all possible sets of training data and use these estimates to label the test example with the class that predicts the higher rate of occurrence for heads. Even though Class 1 has the higher rate of heads, the test example is classified as Class 2 more often.

  Class 1   Class 2    p(data)   θ̂_1   θ̂_2   Label
  T         TT         0.48      0      0      none
  T         {HT,TH}    0.24      0      1/2    Class 2
  T         HH         0.03      0      1      Class 2
  H         TT         0.16      1      0      Class 1
  H         {HT,TH}    0.08      1      1/2    Class 1
  H         HH         0.01      1      1      none

  p(θ̂_1 > θ̂_2) = 0.24
  p(θ̂_2 > θ̂_1) = 0.27
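The probabilities in Table 1 can be reproduced by brute-force enumeration. The sketch below assumes only the setup stated in the caption (one Bernoulli(0.25) sample for Class 1, two Bernoulli(0.2) samples for Class 2):

```python
from itertools import product

# Enumerate every possible training outcome, weight it by its probability, and
# compare the maximum-likelihood estimates, as in Table 1.
p1, p2 = 0.25, 0.2
p_class1_wins = p_class2_wins = 0.0
for flip1, flip2a, flip2b in product([1, 0], repeat=3):   # 1 = heads, 0 = tails
    prob = ((p1 if flip1 else 1 - p1)
            * (p2 if flip2a else 1 - p2)
            * (p2 if flip2b else 1 - p2))
    theta1_hat = flip1 / 1.0                    # ML estimate from one Class 1 sample
    theta2_hat = (flip2a + flip2b) / 2.0        # ML estimate from two Class 2 samples
    if theta1_hat > theta2_hat:
        p_class1_wins += prob
    elif theta2_hat > theta1_hat:
        p_class2_wins += prob

print(round(p_class1_wins, 2), round(p_class2_wins, 2))   # 0.24 and 0.27, as in Table 1
```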
MNB estimates its parameters using training data from a single class, c. In contrast, CNB (Complement Naive Bayes) estimates parameters using data from all classes except c. We think CNB's estimates will be more effective because each uses a more even amount of training data per class, which will lessen the bias in the weight estimates. We find we get more stable weight estimates and improved classification accuracy. These improvements might be due to more data per estimate, but overall we are using the same amount of data, just in a way that is less susceptible to the skewed data bias. CNB's estimate is

\[ \hat{\theta}_{\tilde{c}i} = \frac{N_{\tilde{c}i} + \alpha_i}{N_{\tilde{c}} + \alpha}, \qquad (6) \]

where N_c̃i is the number of times word i occurred in documents in classes other than c and N_c̃ is the total number of word occurrences in classes other than c, and α_i and α are smoothing parameters, as in Equation 4. As before, the weight estimate is ŵ_c̃i = log θ̂_c̃i and the classification rule is

\[ l_{\text{CNB}}(d) = \arg\max_c \left[ \log p(\theta_c) - \sum_i f_i \log \frac{N_{\tilde{c}i} + \alpha_i}{N_{\tilde{c}} + \alpha} \right]. \]

The negative sign assigns a document to the class whose complement parameters fit it most poorly.
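A minimal sketch of the complement estimate (Equation 6), reusing the same toy counts as the MNB sketch above; again the data and names are illustrative assumptions:

```python
import numpy as np

# Complement Naive Bayes sketch: parameters for class c are estimated from every
# class *except* c, and the document goes to the class whose complement fits it worst.
counts = np.array([[2, 1, 0],
                   [1, 0, 0],
                   [0, 1, 3],
                   [0, 2, 2]])
labels = np.array([0, 0, 1, 1])
n_classes, alpha_i = 2, 1.0
alpha = alpha_i * counts.shape[1]

N_comp = np.array([counts[labels != c].sum(axis=0) for c in range(n_classes)])
theta_comp = (N_comp + alpha_i) / (N_comp.sum(axis=1, keepdims=True) + alpha)

def classify_cnb(f, log_prior=0.0):
    # l_CNB(d) = argmax_c [ log p(theta_c) - sum_i f_i * log(theta_comp_ci) ]
    return int(np.argmax(log_prior - np.log(theta_comp) @ f))

print(classify_cnb(np.array([1, 0, 2])))   # expected: 1, agreeing with the MNB sketch
```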
[Figure: classification weight as a function of the number of training examples per class.]

The second systemic error comes from the independence assumption: the weights of strongly correlated words are effectively counted more than once. Suppose the word "Boston" appears in Boston documents about as often as "San Francisco" appears in San Francisco documents (as one might expect). Let's also assume that it's rare to see the words "San" and "Francisco" apart. Then, each time a test document has an occurrence of "San Francisco," Multinomial Naive Bayes will double count: it will add in the weight for "San" and the weight for "Francisco." Since "San Francisco" and "Boston" occur equally in their respective classes, a single occurrence of "San Francisco" will contribute twice the weight as an occurrence of "Boston." Hence, the summed contributions of the classification weights may be larger for one class than for the other.
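A toy numeric check of this double-counting argument (the 0.1 rate below is an arbitrary assumption, not a number from the paper):

```python
from math import log

# Assume each city word has the same per-class rate, so each carries the same log weight.
theta = 0.1
w = log(theta)                      # log weight of a single word (Equation 5)

san_francisco_mention = 2 * w       # "san" and "francisco" both contribute
boston_mention = 1 * w              # only "boston" contributes

print(san_francisco_mention, boston_mention)   # the first contribution is twice as large
```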
[Figure 2. Empirical term frequency distribution ("Data") compared with the best-fit multinomial and a power law fit; probability on a log scale versus term frequency, panels (a) and (b).]

Figure 3. Plotted are average term frequencies for words in three classes of Reuters-21578 documents: short documents, medium length documents and long documents. Terms in longer documents have heavier tails.
Term frequencies are better modeled by a power law distribution, p(f_i) ∝ (d + f_i)^{log θ}, where the parameter d has been chosen to closely match the text distribution. The probability is also proportional to θ^{log(d + f_i)}. Because this is similar to the multinomial model, where the probability is proportional to θ^{f_i}, we can use the multinomial model to generate probabilities proportional to a class of power law distributions via a simple transform, f_i' = log(d + f_i). One such transform, f_i' = log(1 + f_i), has the advantage of leaving zero counts at zero and changing small counts only modestly, while pushing down larger counts as we would like. The transform allows us to more realistically handle text while not giving up the advantages of MNB. Although setting d = 1 does not match the data as well as an optimized d, it does produce a distribution that is much closer to the empirical distribution than the best fit multinomial, as shown by the "Power law" line in Figure 2(a).
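A quick sketch of the term frequency transform with d = 1 (the counts are toy values assumed for illustration):

```python
import numpy as np

# f_i' = log(1 + f_i): zero counts stay at zero, small counts change little,
# large counts are pushed down sharply.
f = np.array([0, 1, 2, 5, 20, 100])
print(np.log(1 + f))   # [0.    0.69  1.10  1.79  3.04  4.62]
```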
4.2. Transforming by Document Frequency

Another useful transform discounts words that are common across documents by scaling each term frequency with an inverse document frequency factor,

\[ f_i' = f_i \log \frac{\sum_j 1}{\sum_j \delta_{ij}}, \]

where δ_ij is 1 if word i occurs in document j, 0 otherwise, and the sum is over all document indices (Salton & Buckley, 1988). Rare words are given increased term frequencies; common words are given less weight. We found it to improve performance.
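A sketch of this IDF-style rescaling on a toy document-by-word count matrix (the matrix and names are assumptions for illustration):

```python
import numpy as np

# Rows are documents, columns are words. Each count is scaled by
# log(#documents / #documents containing the word).
counts = np.array([[2.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 1.0, 3.0]])
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)        # sum_j delta_ij for each word
idf = np.log(n_docs / doc_freq)            # rare words get larger factors
print(counts * idf)                        # the word in every document (column 1) is zeroed out
```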
4.3. Transforming Based on Length

Documents have strong word inter-dependencies. After a word first appears in a document, it is more likely to appear again. Since MNB assumes occurrence independence, long documents can negatively affect parameter estimates. We normalize word counts to avoid this problem. Figure 3 shows empirical term frequency distributions for documents of different lengths. It is not surprising that longer documents have larger probabilities for larger term frequencies, but the jump for larger term frequencies is disproportionately large. Documents in the 80-160 group are, on average, twice as long as those in the 0-80 group, yet the chance of a word occurring five times in the 80-160 group is larger than a word occurring twice in the 0-80 group. This would not be the case if text were multinomial.
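The normalization itself (step 3 of Table 4 below) rescales each transformed count vector to unit Euclidean length. A one-line sketch with an assumed toy vector:

```python
import numpy as np

# Length normalization: divide a document's counts by its L2 norm so that long
# documents cannot dominate the parameter estimates.
doc = np.array([3.0, 0.0, 4.0])
print(doc / np.sqrt((doc ** 2).sum()))   # [0.6, 0.0, 0.8], unit length
```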
Table 3. Experiments comparing multinomial Naive Bayes (MNB) to Transformed Weight-normalized Complement Naive Bayes (TWCNB) and the Support Vector Machine (SVM) over several data sets. TWCNB's performance is substantially better than MNB's, and approaches the SVM's performance. Industry Sector and 20 News are reported in terms of accuracy; Reuters results are precision-recall breakeven.

                     MNB     TWCNB   SVM
  Industry Sector    0.582   0.923   0.934
  20 Newsgroups      0.848   0.861   0.862
  Reuters (micro)    0.739   0.844   0.887
  Reuters (macro)    0.270   0.647   0.694

Table 4. Our new Naive Bayes procedure. Assignments are over all possible index values. Steps 1 through 3 distinguish TWCNB from WCNB.

  • Let d = (d_1, ..., d_n) be a set of documents; d_ij is the count of word i in document j.
  • Let y = (y_1, ..., y_n) be the labels.
  • TWCNB(d, y):
      1. d_ij = log(d_ij + 1)                                        (TF transform, § 4.1)
      2. d_ij = d_ij · log( Σ_k 1 / Σ_k δ_ik )                       (IDF transform, § 4.2)
      3. d_ij = d_ij / sqrt( Σ_k (d_kj)^2 )                          (length normalization, § 4.3)
      4. θ̂_ci = ( Σ_{j: y_j ≠ c} d_ij + α_i ) / ( Σ_{j: y_j ≠ c} Σ_k d_kj + α )   (complement, § 3.1)
      5. w_ci = log θ̂_ci
      6. w_ci = w_ci / Σ_i |w_ci|                                    (weight normalization, § 3.2)
      7. Let t = (t_1, ..., t_n) be a test document; let t_i be the count of word i.
      8. Label the document according to
             l(t) = argmin_c Σ_i t_i w_ci
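The steps in Table 4 compose into a compact training and prediction routine. The sketch below follows the table's steps on a toy document-by-word count matrix; the variable names, the data, and small guards (such as avoiding a zero document frequency) are my assumptions rather than the authors' code, and test documents are scored on raw counts as in steps 7 and 8:

```python
import numpy as np

def twcnb_train(counts, labels, alpha_i=1.0):
    """Steps 1-6 of Table 4. counts[j, i] is the count of word i in document j."""
    n_docs, n_words = counts.shape
    alpha = alpha_i * n_words

    d = np.log(counts + 1.0)                                   # 1. TF transform
    doc_freq = (counts > 0).sum(axis=0)
    d = d * np.log(n_docs / np.maximum(doc_freq, 1))           # 2. IDF transform
    d = d / np.sqrt((d ** 2).sum(axis=1, keepdims=True))       # 3. length normalization
                                                               #    (assumes no empty documents)
    classes = np.unique(labels)
    w = np.zeros((len(classes), n_words))
    for idx, c in enumerate(classes):
        comp = d[labels != c]                                  # 4. complement statistics
        theta = (comp.sum(axis=0) + alpha_i) / (comp.sum() + alpha)
        w_c = np.log(theta)                                    # 5. log parameters
        w[idx] = w_c / np.abs(w_c).sum()                       # 6. weight normalization
    return classes, w

def twcnb_predict(classes, w, test_counts):
    """Steps 7-8: label each test document with argmin_c sum_i t_i * w_ci."""
    return classes[np.argmin(test_counts @ w.T, axis=1)]

# Toy usage with made-up data.
X = np.array([[2, 1, 0, 0],
              [3, 0, 1, 0],
              [0, 0, 2, 3],
              [0, 1, 1, 4]])
y = np.array([0, 0, 1, 1])
classes, w = twcnb_train(X, y)
print(twcnb_predict(classes, w, np.array([[1, 0, 0, 2]])))     # expected: [1]
```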