You are on page 1of 6

The Optimality of Naive Bayes

Harry Zhang
Faculty of Computer Science
University of New Brunswick
Fredericton, New Brunswick, Canada E3B 5A3
email: hzhang@unb.ca

Abstract Rule, the probability of an example E = (x1 , x2 , · · · , xn )


being class c is
Naive Bayes is one of the most efficient and effective
inductive learning algorithms for machine learning and p(E|c)p(c)
data mining. Its competitive performance in classifica- p(c|E) = .
p(E)
tion is surprising, because the conditional independence
assumption on which it is based, is rarely true in real- E is classified as the class C = + if and only if
world applications. An open question is: what is the
true reason for the surprisingly good performance of p(C = +|E)
naive Bayes in classification? fb (E) = ≥ 1, (1)
p(C = −|E)
In this paper, we propose a novel explanation on the
superb classification performance of naive Bayes. We where fb (E) is called a Bayesian classifier.
show that, essentially, the dependence distribution; i.e., Assume that all attributes are independent given the value
how the local dependence of a node distributes in each
class, evenly or unevenly, and how the local dependen-
of the class variable; that is,
cies of all nodes work together, consistently (support- n

ing a certain classification) or inconsistently (cancel- p(E|c) = p(x1 , x2 , · · · , xn |c) = p(xi |c),
ing each other out), plays a crucial role. Therefore, i=1
no matter how strong the dependences among attributes
are, naive Bayes can still be optimal if the dependences the resulting classifier is then:
distribute evenly in classes, or if the dependences can-
n
cel each other out. We propose and prove a sufficient p(C = +)  p(xi |C = +)
and necessary conditions for the optimality of naive fnb (E) = . (2)
Bayes. Further, we investigate the optimality of naive
p(C = −) i=1 p(xi |C = −)
Bayes under the Gaussian distribution. We present and
prove a sufficient condition for the optimality of naive The function fnb (E) is called a naive Bayesian classifier,
Bayes, in which the dependence between attributes do or simply naive Bayes (NB). Figure 1 shows an example of
exist. This provides evidence that dependence among naive Bayes. In naive Bayes, each attribute node has no par-
attributes may cancel out each other. In addition, we ent except the class node.
explore when naive Bayes works well.

Naive Bayes and Augmented Naive Bayes C


Classification is a fundamental issue in machine learning
and data mining. In classification, the goal of a learning
algorithm is to construct a classifier given a set of train-
ing examples with class labels. Typically, an example E is A1 A2 A3 A4
represented by a tuple of attribute values (x1 , x2 , , · · · , xn ),
where xi is the value of attribute Xi . Let C represent the
classification variable, and let c be the value of C. In this Figure 1: An example of naive Bayes
paper, we assume that there are only two classes: + (the
positive class) or − (the negative class). Naive Bayes is the simplest form of Bayesian network, in
A classifier is a function that assigns a class label to an ex- which all attributes are independent given the value of the
ample. From the probability perspective, according to Bayes class variable. This is called conditional independence. It
is obvious that the conditional independence assumption is
Copyright 
c 2004, American Association for Artificial Intelli- rarely true in most real-world applications. A straightfor-
gence (www.aaai.org). All rights reserved. ward approach to overcome the limitation of naive Bayes is
to extend its structure to represent explicitly the dependen- Domingos and Pazzani (Domingos & Pazzani 1997)
cies among attributes. An augmented naive Bayesian net- present an explanation that naive Bayes owes its good per-
work, or simply augmented naive Bayes (ANB), is an ex- formance to the zero-one loss function. This function de-
tended naive Bayes, in which the class node directly points fines the error as the number of incorrect classifications
to all attribute nodes, and there exist links among attribute (Friedman 1996). Unlike other loss functions, such as the
nodes. Figure 2 shows an example of ANB. From the view squared error, the zero-one loss function does not penalize
of probability, an ANB G represents a joint probability dis- inaccurate probability estimation as long as the maximum
tribution represented below. probability is assigned to the correct class. This means that
naive Bayes may change the posterior probabilities of each
n class, but the class with the maximum posterior probability
 is often unchanged. Thus, the classification is still correct,
pG (x1 , · · · , xn , c) = p(c) p(xi |pa(xi ), c), (3)
although the probability estimation is poor. For example, let
i=1
us assume that the true probabilities p(+|E) and p(−|E) are
where pa(xi ) denotes an assignment to values of the par- 0.9 and 0.1 respectively, and that the probability estimates
ents of Xi . We use pa(Xi ) to denote the parents of Xi . p(+|E) and p(−|E) produced by naive Bayes are 0.6 and
ANB is a special form of Bayesian networks in which no 0.4. Obviously, the probability estimates are poor, but the
node is specified as a class node. It has been shown that any classification (positive) is not affected.
Bayesian network can be represented by an ANB (Zhang & Domingos and Pazzani’s explanation (Domingos & Paz-
Ling 2001). Therefore, any joint probability distribution can zani 1997) is verified by the work of Frank et al. (Frank et
be represented by an ANB. al. 2000), which shows that the performance of naive Bayes
is much worse when it is used for regression (predicting a
continuous value). Moreover, evidence has been found that
C naive Bayes produces poor probability estimates (Bennett
2000; Monti & Cooper 1999).
In our opinion, however, Domingos and Pazzani (Domin-
gos & Pazzani 1997)’s explanation is still superficial as it
does not uncover why the strong dependencies among at-
tributes could not flip the classification. For the example
A1 A2 A3 A4 above, why the dependencies could not make the probability
estimates p(+|E) and p(−|E) produced by naive Bayes be
0.4 and 0.6? The key point here is that we need to know how
Figure 2: An example of ANB the dependencies affect the classification, and under what
conditions the dependencies do not affect the classification.
When we apply a logarithm to fb (E) in Equation 1, the There has been some work to explore the optimality of
resulting classifier log fb (E) is the same as fb (E), in the naive Bayes (Rachlin, Kasif, & Aha 1994; Garg & Roth
sense that an example E belongs to the positive class, if and 2001; Roth 1999; Hand & Yu 2001), but none of them give
only if log fb (E) ≥ 0. fnb in Equation 2 is similar. In this an explicit condition for the optimality of naive Bayes.
paper, we assume that, given a classifier f , an example E In this paper, we propose a new explanation that the clas-
belongs to the positive class, if and only if f (E) ≥ 0. sification of naive Bayes is essentially affected by the de-
pendence distribution, instead by the dependencies among
Related Work attributes. In addition, we present a sufficient condition for
the optimality of naive Bayes under the Gaussian distribu-
Many empirical comparisons between naive Bayes and mod- tion, and show theoretically when naive Bayes works well.
ern decision tree algorithms such as C4.5 (Quinlan 1993)
showed that naive Bayes predicts equally well as C4.5 (Lan- A New Explanation on the Superb
gley, Iba, & Thomas 1992; Kononenko 1990; Pazzani 1996).
The good performance of naive Bayes is surprising because Classification Performance of Naive Bayes
it makes an assumption that is almost always violated in real- In this section, we propose a new explanation for the surpris-
world applications: given the class value, all attributes are ingly good classification performance of naive Bayes. The
independent. basic idea comes from the observation as follows. In a given
An open question is what is the true reason for the sur- dataset, two attributes may depend on each other, but the
prisingly good performance of naive Bayes on most classifi- dependence may distribute evenly in each class. Clearly, in
cation tasks? Intuitively, since the conditional independence this case, the conditional independence assumption is vio-
assumption that it is based on is almost never hold, its per- lated, but naive Bayes is still the optimal classifier. Fur-
formance may be poor. It has been observed that, however, ther, what eventually affects the classification is the com-
its classification accuracy does not depend on the dependen- bination of dependencies among all attributes. If we just
cies; i.e., naive Bayes may still have high accuracy on the look at two attributes, there may exist strong dependence be-
datasets in which strong dependencies exist among attributes tween them that affects the classification. When the depen-
(Domingos & Pazzani 1997). dencies among all attributes work together, however, they
may cancel each other out and no longer affect the classifi- Intuitively, when the local dependence derivatives in both
cation. Therefore, we argue that it is the distribution of de- classes support the different classifications, the local depen-
pendencies among all attributes over classes that affects the dencies in the two classes cancel partially each other out, and
classification of naive Bayes, not merely the dependencies the final classification that the local dependence supports, is
themselves. the class with the greater local dependence derivative. An-
Before discussing the details, we introduce the formal other case is that the local dependence derivatives in the two
definition of the equivalence of two classifiers under zero- classes support the same classification. Then, the local de-
one loss, which is used as a basic concept. pendencies in the two classes work together to support the
classification.
Definition 1 Given an example E, two classifiers f1 and f2 The discussion above shows that the ratio of the local de-
are said to be equal under zero-one loss on E, if f1 (E) ≥ 0 pendence derivatives in both classes ultimately determines
.
if and only if f2 (E) ≥ 0, denoted by f1 (E) = f2 (E). If for which classification the local dependence of a node supports.
.
every example E in the example space, f1 (E) = f2 (E), f1 Thus we have the following definition.
and f2 are said to be equal under zero-one loss, denoted by
.
f1 = f2 . Definition 3 For a node X on ANB G, the local dependence
derivative ratio at node X, denoted by ddrG (x) is defined
Local Dependence Distribution below:
dd+
G (x|pa(x))
As discussed in the section above, ANB can represent any ddrG (x) = − . (6)
joint probability distribution. Thus we choose an ANB as ddG (x|pa(x))
the underlying probability distribution. Our motivation is to
find out under what conditions naive Bayes classifies exactly From the above definition, ddrG (x) quantifies the influ-
the same as the underlying ANB. ence of X’s local dependence on the classification. Further,
Assume that the underlying probability distribution is an we have the following results.
ANB G with two classes {+, −}, and the dependencies 1. If X has no parents, ddrG (x) = 1.
among attributes are represented by the arcs among attribute
nodes. For each node, the influence of its parents is quanti- 2. If dd+ −
G (x|pa(x)) = ddG (x|pa(x)), ddrG (x) = 1. This
fied by the correspondent conditional probabilities. We call means that x’s local dependence distributes evenly in
the dependence between a node and its parents local depen- class + and class −. Thus, the dependence does not affect
dence of this node. How do we measure the local depen- the classification, no matter how strong the dependence
dence of a node in each class? Naturally, the ratio of the is.
conditional probability of the node given its parents over the 3. If ddrG (x) > 1, X’s local dependence in class + is
conditional probability of the node without the parents, re- stronger than that in class −. ddrG (x) < 1 means the
flects how strong the parents affect the node in each class. opposite.
Thus we have the following definition.
Global Dependence Distribution
Definition 2 For a node X on ANB G, the local dependence
derivative of X in classes + and − are defined as below. Now let us explore under what condition an ANB works ex-
actly the same as its correspondent naive Bayes. The fol-
p(x|pa(x), +) lowing theorem establishes the relation of an ANB and its
dd+
G (x|pa(x)) = (4) correspondent naive Bayes.
p(x|+)
p(x|pa(x), −) Theorem 1 Given an ANB G and its correspondent naive
dd−
G (x|pa(x)) = . (5) Bayes Gnb (i.e., remove all the arcs among attribute nodes
p(x|−)
from G) on attributes X1 , X2 , ..., Xn , assume that fb and
fnb are the classifiers corresponding to G and Gnb , respec-
Essentially, dd+
G (x|pa(x)) reflects the strength of the lo- tively. For a given example E = (x1 , x2 , · · ·, xn ), the equa-
cal dependence of node X in class +, which measures the tion below is true.
influence of X’s local dependence on the classification in n

class +. dd−G (x|pa(x)) is similar. Further, we have the fol- fb (x1 , x2 , ..., xn ) = fnb (x1 , x2 , ..., xn ) ddrG (xi ), (7)
lowing results. i=1
1. When X has no parent, then n
where i=1 ddrG (xi ) is called the dependence distribution
dd+ − factor at example E, denoted by DFG (E).
G (x|pa(x)) = ddG (x|pa(x)) = 1.
Proof: According to Equation 3, we have:
2. When dd+ G (x|pa(x)) ≥ 1, X’s local dependence in class
p(+)  p(xi |pa(xi ), +)
n
+ supports the classification of C = +. Otherwise, it fb (x1 , · · · , xn ) =
supports the classification of C = −. Similarly, when p(−) p(xi |pa(xi ), −)
i=1
dd−G (x|pa(x)) ≥ 1, A’s local dependence in class − sup-
p(+)  p(xi |+)  p(xi |pa(xi ), +)p(xi |−)
n n
ports the classification of C = −. Otherwise, it supports =
the classification of C = +. p(−) p(xi |−) p(xi |pa(xi ), −)p(xi |+)
i=1 i=1

n
ddr+ (xi |pa(xi )) Theorem 2 represents a sufficient and necessary condition
G
= fnb (E) −
ddrG (xi |pa(xi )) for the optimality of naive Bayes on an example E. If for
i=1 .
each example E in the example space, fb (E) = fnb (E);

n .
i.e., fb = fnb , then naive Bayes is globally optimal.
= fnb (E) ddrG (xi )
i=1
Conditions for the Optimality of Naive Bayes
= DFG (E)fnb (E)
In Section , we proposed that naive Bayes is optimal if the
dependences among attributes cancel each other out. That
(8) is, under circumstance, naive Bayes is still optimal even
though the dependences do exist. In this section, we in-
vestigate naive Bayes under the multivariate Gaussian dis-
From Theorem 1, we know that, in fact, it is the depen-
tribution and prove a sufficient condition for the optimality
dence distribution factor DFG (E) that determines the dif-
of naive Bayes, assuming the dependences among attributes
ference between an ANB and its correspondent naive Bayes
do exist. That provides us with theoretic evidence that the
in the classification. Further, DFG (E) is the product of local
dependences among attributes may cancel each other out.
dependence derivative ratios of all nodes. Therefore, it re- Let us restrict our discussion to two attributes X1 and X2 ,
flects the global dependence distribution (how each local de- and assume that the class density is a multivariate Gaussian
pendence distributes in each class, and how all local depen- in both the positive and negative classes. That is,
dencies work together). For example, when DFG (E) = 1, −1
1 1/2 e− 2 (x−µ )
+ T
1 (x−µ+ )
G has the same classification as Gnb on E. In fact, it is p(x1 , x2 , +) = + ,
not necessary to require DFG (E) = 1, in order to make 2π| +
|
an ANB G has the same classification as its correspondent −1
naive Bayes Gnb , as shown in the theorem below. p(x1 , x2 , −) = 1 e
−1
2
(x−µ− )T

(x−µ− )
,
2π| −
| 1/2

Theorem 2 Given an example E = (x1 , x2 , ..., xn ), an  


where x = (x1 , x2 ), + and − are the covariance matri- 
ANB G is equal to its correspondent naive Bayes Gnb un-
.
der zero-one loss; i.e., fb (E) = fnb (E) (Definition 1), if ces in the positive and negative classes respectively, | − |
   −1
and only if when fb (E) ≥ 1, DFG (E) ≤ fb (E); or when and | + | are the determinants of − and + , + and
−1  
fb (E) < 1, DFG (E) > fb (E). + + +
− are the inverses of − and + ; µ = (µ1 , µ2 ) and
Proof: The proof is straightforward by apply Definition µ− = (µ− − + −
1 , µ2 ), µi and µi are the means of attribute Xi in
1 and Theorem 1. the positive and negative classes respectively, and (x−µ+ )T
and (x − µ− )T are the transposes of (x − µ+ ) and (x − µ− ).
From Theorem 2, if the distribution of the dependences We assume
 that
 two classes
 have a common covariance
among attributes satisfies certain conditions, then naive matrix + = − = , and X1 and X2 have the same
Bayes classifies exactly the same as the underlying ANB, variance σ in both classes. Then, when applying a logarithm
even though there may exist strong dependencies among at- to the Bayesian classifier, defined in Equation 1, we obtain
tributes. Moreover, we have the following results: the classifier fb below.
1. When DFG (E) = 1, the dependencies in ANB G has no p(x1 , x2 , +)
influence on the classification. That is, the classification fb (x1 , x2 ) = log
p(x1 , x2 , −)
of G is exactly the same as that of its correspondent naive −1 +
Bayes Gnb . There exist three cases for DFG (E) = 1. 1 +
=− 2
(µ + µ− ) (µ − µ− )
σ
• no dependence exists among attributes. −1
+xT (µ+ − µ− ).
• for each attribute X on G, ddrG (x) = 1; that is, the lo-
cal distribution of each node distributes evenly in both
classes. Then, because of the conditional independence assump-
tion, we have the correspondent naive Bayesian classifier
• the influence that some local dependencies support fnb
classifying E into C = + is canceled out by the influ-
1 + 1 +
ence that other local dependences support classifying fnb (x1 , x2 ) = (µ − µ− −
1 )x1 + 2 (µ2 − µ2 )x2 .
E into C = −. σ2 1 σ
.
2. fb (E) = fnb (E) does not require that DFG (E) = 1. Assume that
The precise condition is given by Theorem 2. That ex-   
σ σ12
plains why naive Bayes still produces accurate classifica- = .
σ12 σ
tion even in the datasets with strong dependencies among
attributes (Domingos & Pazzani 1997). X1 and X2 are independent if σ12 = 0. If σ = σ12 , we
3. The dependencies in an ANB flip (change) the classifica- have  −σ 
−1 σ 2 −σ 2
σ12
σ 2 −σ 2
tion of its correspondent naive Bayes, only if the condition = 12
σ12
12
−σ .
given by Theorem 2 is no longer true. 2 −σ 2
σ12 2 −σ 2
σ12
Note that, an example E is classified into the positive class
by fb , if and only if fb ≥ 0. fnb is similar. Thus, when fb or y
fnb is divided by a non-zero positive constant, the resulting
classifier is the same as fb or fnb . Then,

fnb (x1 , x2 ) = (µ+ − + −


1 − µ1 )x1 + (µ2 − µ2 )x2 , (9) y=w1x+w2

and
fb (x1 , x2 ) =
x
1
= 2 (σ12 (µ+ − + −
2 − µ2 ) − σ(µ1 − µ1 ))x1
σ12 − σ 2
1
+ 2 (σ12 (µ+ − + −
1 − µ1 ) − σ(µ2 − µ2 ))x2
σ12 − σ 2
+a, (10) Figure 3: Naive Bayes has the same classification with that
−1 of the target classifier in the shaded area.
where a = − σ12 (µ+ + µ− ) (µ+ − µ− ), a constant
independent of x.
For any x1 and x2 , naive Bayes has the same classification (2) If µ+ − + − + − + −
1 = µ2 , µ2 = µ1 , then (µ1 − µ1 ) = −(µ2 − µ2 ),
as that of the underlying classifier if and a = 0. Thus, Equation 12 is simplified as below.
fb (x1 , x2 )fnb (x1 , x2 ) ≥ 0. (11) (µ+ − 2
1 − µ1 )
(x1 + x2 )2 ≥ 0, (14)
σ12 − σ
That is,
It is obvious that Equation 14 is true for any x1 and x2 , if
.
σ12 − σ > 0. Therefore, fb = fnb .
1
2
((σ12 (µ+ − + − + − 2 2
1 − µ1 )(µ2 − µ2 ) − σ(µ1 − µ1 ) )x1
σ12 − σ2 Theorem 3 represents an explicit condition that naive
+(σ12 (µ+ − + − + − 2
1 − µ1 )(µ2 − µ2 ) − σ(µ2 − µ2 ) )x2
2 Bayes is optimal. It shows that naive Bayes is still optimal
under certain condition, even though the conditional inde-
+(2σ12 (µ+ − + − + − 2
1 − µ1 )(µ2 − µ2 ) − σ((µ1 − µ1 )
pendence assumption is violated. In other words, the con-
+(µ+ − 2
2 − µ2 ) ))x1 x2 ) ditional independence assumption is not the necessary con-
+a(µ+ − + −
1 − µ1 )x1 + a(µ2 − µ2 )x2 ≥ 0
dition for the optimality of naive Bayes. This provides evi-
(12) dence that the dependence distribution may play the crucial
role in classification.
Theorem 3 gives a strict condition that naive Bayes per-
Equation 12 represents a sufficient and necessary condi- forms exactly as the target classifier. In reality, however, it is
.
tion for fnb (x1 , x2 ) = fb (x1 , x2 ). But it is too complicated. not necessary to satisfy such a condition. It is more practi-
Let (µ+ − + −
1 − µ1 ) = (µ2 − µ2 ). Equation 12 is simplified as
cal to explore when naive Bayes works well by investigating
below. what factors affect the performance of naive Bayes in clas-
w1 (x1 + x2 )2 + w2 (x1 + x2 ) ≥ 0, (13) sification.
Since under the assumption
 that
 two classes
 have a com-
(µ+ −µ− )2
where w1 = + − mon covariance matrix + = − = , and X1 and X2
σ12 +σ , and w2 = a(µ1 − µ1 ). Let
1 1

2
x = x1 + x2 , and y = w1 (x1 + x2 ) + w2 (x1 + x2 ). Figure have the same variance σ in both classes, both fb and fnb
3 shows the area in which naive Bayes has the same classi- are linear functions, we can examine the difference between
fication with the target classifier. Figure 3 shows that, under them by measuring the difference between their coefficients.
certain condition, naive Bayes is optimal. So we have the following definition.
n
The following theorem presents a sufficient condition for n 4 Given classifiers f1 =
Definition i=1 w1i xi + b and
that naive Bayes works exactly as the target classifier. f2 = w
i=1 2i i x + c, where b and c are constants. The
.
Theorem 3 fb = fnb , if one of the following two conditions distance between two classifiers, denoted by D(f1 , f2 ), is
is true: defined as below


n
1. µ+ − − +
1 = −µ2 , µ1 = −µ2 , and σ12 + σ > 0.

D(f1 , f2 ) = (w1i − w2i )2 + (b − c)2 .
2. µ+ − + −
1 = µ2 , µ2 = µ1 , and σ12 − σ > 0. i=1

Proof: (1) If µ+
1 = −µ−
2, µ−
1= −µ+
2,
then (µ+
1− µ−
= 1) D(f1 , f2 ) = 0, if and only if they have the same co-
(µ+
2 − µ−
). It is straightforward to verify that − σ12 (µ+
+ efficients. Obviously, naive Bayes will well approximate
−12 +
µ− ) (µ − µ− ) = 0. That is, for the constant a in the target classifier, if the distance between them are small.
Equation 10, we have a = 0. Since σ12 + σ > 0, Equation
.
Therefore, we can explore when naive Bayes works well by
13 is always true for any x1 and x2 . Therefore, fb = fnb . observing what factors affect the distance.
2
When (σ12 − σ 2 ) > 0, fb can be simplified as below. Friedman, J. 1996. On bias, variance, 0/1-loss, and the
curse of dimensionality. Data Mining and Knowledge Dis-
fb (x1 , x2 ) = (σ12 (µ+ − + −
2 − µ2 ) − σ(µ1 − µ1 ))x1 covery 1.
+(σ12 (µ+
1 − µ− + − 2 2
1 ) − σ(µ2 − µ2 ))x2 + a(σ12 − σ ), Garg, A., and Roth, D. 2001. Understanding probabilistic
plassifiers. In Raedt, L. D., and Flach, P., eds., Proceed-
µ+ −
2 −µ2
Let r = µ+ −µ−. If (µ+ −
1 − µ1 ) > 0, then
ings of 12th European Conference on Machine Learning.
1 1 Springer. 179–191.
fnb (x1 , x2 ) = x1 + rx2 , Hand, D. J., and Yu, Y. 2001. Idiots Bayes - not so stupid
and after all? International Statistical Review 69:385–389.
2
a(σ12 − σ2 )
Kononenko, I. 1990. Comparison of inductive and naive
fb (x1 , x2 ) = (σ12 r − σ)x1 + (σ12 − σr)x2 + . Bayesian learning approaches to automatic knowledge ac-
µ1 − µ−
+
1 quisition. In Wielinga, B., ed., Current Trends in Knowl-
Then, edge Acquisition. IOS Press.
Langley, P.; Iba, W.; and Thomas, K. 1992. An analysis of
a2 (σ12
2
− σ 2 )2
D(fb , fnb ) = (1−σ12 r+σ)2 +(r−σ12 +σr)2 + . Bayesian classifiers. In Proceedings of the Tenth National
(µ1 − µ−
+
1 )
2
Conference of Artificial Intelligence. AAAI Press. 223–
(15) 228.
It is easy to verify that, when (µ+ −
1 − µ1 ) < 0, we can
Monti, S., and Cooper, G. F. 1999. A Bayesian network
get the same D(fb , fnb ) in Equation 15. Similarly, when classifier that combines a finite mixture model and a Naive
2
(σ12 − σ 2 ) < 0, we have Bayes model. In Proceedings of the 15th Conference on
Uncertainty in Artificial Intelligence. Morgan Kaufmann.
a2 (σ12
2
− σ 2 )2 447–456.
D(fb , fnb ) = (1+σ12 r−σ)2 +(r+σ12 −σr)2 + .
(µ1 − µ−
+
1 )
2
Pazzani, M. J. 1996. Search for dependencies in Bayesian
(16) classifiers. In Fisher, D., and Lenz, H. J., eds., Learning
From Equation 15 and 16, we see that D(fb , fnb ) is affected from Data: Artificial Intelligence and Statistics V. Springer
by r, σ12 and σ. It is true that D(fb , fnb ) increases, as |r| in- Verlag.
creases. That means, the absolute ratio of distances between
two means of classes affect significantly the performance of Quinlan, J. 1993. C4.5: Programs for Machine Learning.
naive Bayes. More precisely, the less absolute ratio, the bet- Morgan Kaufmann: San Mateo, CA.
ter performance of naive Bayes. Rachlin, J. R.; Kasif, S.; and Aha, D. W. 1994. To-
ward a better understanding of memory-based reasoning
Conclusions systems. In Proceedings of the Eleventh International Ma-
chine Learning Conference. Morgan Kaufmann. 242–250.
In this paper, we propose a new explanation on the classifica-
tion performance of naive Bayes. We show that, essentially, Roth, D. 1999. Learning in natural language. In Proceed-
the dependence distribution; i.e., how the local dependence ings of IJCAI’99. Morgan Kaufmann. 898–904.
of a node distributes in each class, evenly or unevenly, and Zhang, H., and Ling, C. X. 2001. Learnability of
how the local dependencies of all nodes work together, con- augmented Naive Bayes in nominal domains. In Brod-
sistently (support a certain classification) or inconsistently ley, C. E., and Danyluk, A. P., eds., Proceedings of the
(cancel each other out), plays a crucial role in the classifica- Eighteenth International Conference on Machine Learn-
tion. We explain why even with strong dependencies, naive ing. Morgan Kaufmann. 617–623.
Bayes still works well; i.e., when those dependencies cancel
each other out, there is no influence on the classification. In
this case, naive Bayes is still the optimal classifier. In addi-
tion, we investigated the optimality of naive Bayes under the
Gaussian distribution, and presented the explicit sufficient
condition under which naive Bayes is optimal, even though
the conditional independence assumption is violated.

References
Bennett, P. N. 2000. Assessing the calibration of Naive
Bayes’ posterior estimates. In Technical Report No. CMU-
CS00-155.
Domingos, P., and Pazzani, M. 1997. Beyond inde-
pendence: Conditions for the optimality of the simple
Bayesian classifier. Machine Learning 29:103–130.
Frank, E.; Trigg, L.; Holmes, G.; and Witten, I. H. 2000.
Naive Bayes for regression. Machine Learning 41(1):5–15.

You might also like