The Optimality of Naive Bayes
Harry Zhang
Faculty of Computer Science
University of New Brunswick
Fredericton, New Brunswick, Canada E3B 5A3
email: hzhang@unb.ca
and

$$f_b(x_1, x_2) = \frac{1}{\sigma_{12}^2 - \sigma^2}\big(\sigma_{12}(\mu_2^+ - \mu_2^-) - \sigma(\mu_1^+ - \mu_1^-)\big)x_1 + \frac{1}{\sigma_{12}^2 - \sigma^2}\big(\sigma_{12}(\mu_1^+ - \mu_1^-) - \sigma(\mu_2^+ - \mu_2^-)\big)x_2 + a, \quad (10)$$

where $a = -\frac{1}{2}(\mu^+ + \mu^-)^T \Sigma^{-1}(\mu^+ - \mu^-)$, a constant independent of $x$.
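For concreteness, the coefficients in Equation 10 can be checked numerically. The sketch below (illustrative Python; the parameter values `mu_pos`, `mu_neg`, `sigma`, and `sigma12` are made-up assumptions, not from the paper) confirms that the closed form agrees with the standard linear discriminant weights $w = \Sigma^{-1}(\mu^+ - \mu^-)$.

```python
import numpy as np

# Illustrative parameter values (assumptions for this sketch only):
mu_pos = np.array([2.0, 1.0])    # (mu1+, mu2+)
mu_neg = np.array([-1.0, 0.5])   # (mu1-, mu2-)
sigma, sigma12 = 2.0, 0.8        # common variance and covariance
Sigma = np.array([[sigma, sigma12],
                  [sigma12, sigma]])

# Target classifier f_b(x) = (mu+ - mu-)^T Sigma^{-1} x + a.
w = np.linalg.solve(Sigma, mu_pos - mu_neg)
a = -0.5 * (mu_pos + mu_neg) @ w

# The same x1, x2 coefficients via the closed form of Equation 10.
d1, d2 = mu_pos - mu_neg
denom = sigma12**2 - sigma**2
w_closed = np.array([(sigma12 * d2 - sigma * d1) / denom,
                     (sigma12 * d1 - sigma * d2) / denom])
print(np.allclose(w, w_closed))  # True
```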
For any $x_1$ and $x_2$, naive Bayes has the same classification as that of the underlying classifier if

$$f_b(x_1, x_2) f_{nb}(x_1, x_2) \ge 0. \quad (11)$$

That is,
$$\frac{1}{\sigma_{12}^2 - \sigma^2}\Big(\big(\sigma_{12}(\mu_1^+ - \mu_1^-)(\mu_2^+ - \mu_2^-) - \sigma(\mu_1^+ - \mu_1^-)^2\big)x_1^2 + \big(\sigma_{12}(\mu_1^+ - \mu_1^-)(\mu_2^+ - \mu_2^-) - \sigma(\mu_2^+ - \mu_2^-)^2\big)x_2^2 + \big(2\sigma_{12}(\mu_1^+ - \mu_1^-)(\mu_2^+ - \mu_2^-) - \sigma((\mu_1^+ - \mu_1^-)^2 + (\mu_2^+ - \mu_2^-)^2)\big)x_1 x_2\Big) + a(\mu_1^+ - \mu_1^-)x_1 + a(\mu_2^+ - \mu_2^-)x_2 \ge 0. \quad (12)$$
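Equation 12 is just the product $f_b f_{nb}$ expanded, with $f_{nb}$ taken to be the linear form $(\mu_1^+ - \mu_1^-)x_1 + (\mu_2^+ - \mu_2^-)x_2$ that the terms of Equation 12 imply. The symbolic sketch below (using sympy; the symbols $d_1, d_2$ are shorthand introduced here for the two mean differences, an assumption of this sketch) verifies the expansion.

```python
import sympy as sp

x1, x2, s, s12, a, d1, d2 = sp.symbols('x1 x2 sigma sigma12 a d1 d2')

# f_b from Equation 10 (writing d_i for mu_i^+ - mu_i^-) and the linear f_nb.
f_b = ((s12*d2 - s*d1)*x1 + (s12*d1 - s*d2)*x2) / (s12**2 - s**2) + a
f_nb = d1*x1 + d2*x2

# The left-hand side of Equation 12.
eq12 = ((s12*d1*d2 - s*d1**2)*x1**2
        + (s12*d1*d2 - s*d2**2)*x2**2
        + (2*s12*d1*d2 - s*(d1**2 + d2**2))*x1*x2) / (s12**2 - s**2) \
       + a*d1*x1 + a*d2*x2

print(sp.simplify(sp.expand(f_b*f_nb - eq12)))  # 0: the expansion matches
```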
Equation 12 represents a sufficient and necessary condition for $f_{nb}(x_1, x_2) \doteq f_b(x_1, x_2)$, but it is too complicated. Let $(\mu_1^+ - \mu_1^-) = (\mu_2^+ - \mu_2^-)$. Equation 12 then simplifies to

$$w_1(x_1 + x_2)^2 + w_2(x_1 + x_2) \ge 0, \quad (13)$$

where $w_1 = \frac{(\mu_1^+ - \mu_1^-)^2}{\sigma_{12} + \sigma}$ and $w_2 = a(\mu_1^+ - \mu_1^-)$. Let $x = x_1 + x_2$ and $y = w_1(x_1 + x_2)^2 + w_2(x_1 + x_2)$. Figure 3 shows the area in which naive Bayes has the same classification as the target classifier, and thus that, under certain conditions, naive Bayes is optimal.

[Figure 3: Naive Bayes has the same classification as that of the target classifier in the shaded area.]
The following theorem presents a sufficient condition under which naive Bayes works exactly as the target classifier.

Theorem 3. $f_b \doteq f_{nb}$, if one of the following two conditions is true:

1. $\mu_1^+ = -\mu_2^-$, $\mu_1^- = -\mu_2^+$, and $\sigma_{12} + \sigma > 0$.
2. $\mu_1^+ = \mu_2^-$, $\mu_2^+ = \mu_1^-$, and $\sigma_{12} - \sigma > 0$.

Proof: (1) If $\mu_1^+ = -\mu_2^-$ and $\mu_1^- = -\mu_2^+$, then $(\mu_1^+ - \mu_1^-) = (\mu_2^+ - \mu_2^-)$. It is straightforward to verify that $-\frac{1}{2}(\mu^+ + \mu^-)^T\Sigma^{-1}(\mu^+ - \mu^-) = 0$; that is, for the constant $a$ in Equation 10 we have $a = 0$. Since $\sigma_{12} + \sigma > 0$, Equation 13 is always true for any $x_1$ and $x_2$. Therefore, $f_b \doteq f_{nb}$.

(2) If $\mu_1^+ = \mu_2^-$ and $\mu_2^+ = \mu_1^-$, then $(\mu_1^+ - \mu_1^-) = -(\mu_2^+ - \mu_2^-)$, and $a = 0$. Thus, Equation 12 simplifies to

$$\frac{(\mu_1^+ - \mu_1^-)^2}{\sigma_{12} - \sigma}(x_1 + x_2)^2 \ge 0. \quad (14)$$

It is obvious that Equation 14 is true for any $x_1$ and $x_2$ if $\sigma_{12} - \sigma > 0$. Therefore, $f_b \doteq f_{nb}$.
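Condition (1) of Theorem 3 is easy to check numerically. The sketch below picks means satisfying $\mu_1^+ = -\mu_2^-$ and $\mu_1^- = -\mu_2^+$ (the specific numbers are assumptions for illustration), confirms that $a = 0$, and verifies that $f_b$ and $f_{nb}$ never disagree in sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Means chosen (illustratively) to satisfy condition (1): mu1+ = -mu2-, mu1- = -mu2+.
mu_pos = np.array([2.0, 1.0])    # (mu1+, mu2+)
mu_neg = np.array([-1.0, -2.0])  # (mu1-, mu2-) = (-mu2+, -mu1+)
sigma, sigma12 = 1.0, 0.6        # sigma12 + sigma > 0
Sigma = np.array([[sigma, sigma12], [sigma12, sigma]])

d = mu_pos - mu_neg
w_b = np.linalg.solve(Sigma, d)            # weights of the target classifier
a = -0.5 * (mu_pos + mu_neg) @ w_b
print("a =", a)                            # 0, as in the proof

X = 3.0 * rng.normal(size=(100_000, 2))    # arbitrary test points
print("same sign everywhere:", np.all((X @ w_b + a) * (X @ d) >= 0))  # True
```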
Theorem 3 represents an explicit condition under which naive Bayes is optimal. It shows that naive Bayes is still optimal under certain conditions, even though the conditional independence assumption is violated. In other words, the conditional independence assumption is not a necessary condition for the optimality of naive Bayes. This provides evidence that the dependence distribution may play the crucial role in classification.

Theorem 3 gives a strict condition under which naive Bayes performs exactly as the target classifier. In reality, however, it is not necessary to satisfy such a condition exactly. It is more practical to explore when naive Bayes works well by investigating what factors affect its performance in classification.

Since, under the assumption that the two classes have a common covariance matrix $\Sigma^+ = \Sigma^- = \Sigma$ and that $X_1$ and $X_2$ have the same variance $\sigma$ in both classes, both $f_b$ and $f_{nb}$ are linear functions, we can examine the difference between them by measuring the difference between their coefficients. So we have the following definition.

Definition 4. Given classifiers $f_1 = \sum_{i=1}^n w_{1i}x_i + b$ and $f_2 = \sum_{i=1}^n w_{2i}x_i + c$, where $b$ and $c$ are constants, the distance between the two classifiers, denoted by $D(f_1, f_2)$, is defined as

$$D(f_1, f_2) = \sum_{i=1}^n (w_{1i} - w_{2i})^2 + (b - c)^2.$$

$D(f_1, f_2) = 0$ if and only if the two classifiers have the same coefficients. Obviously, naive Bayes will approximate the target classifier well if the distance between them is small. Therefore, we can explore when naive Bayes works well by observing what factors affect the distance.
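Definition 4 is straightforward to implement; the helper below (`classifier_distance` is a name introduced here, and the coefficients in the usage line are made-up) computes $D(f_1, f_2)$ for two linear classifiers.

```python
import numpy as np

def classifier_distance(w1, b, w2, c):
    """D(f1, f2) from Definition 4: summed squared coefficient
    differences plus the squared difference of the constants."""
    w1, w2 = np.asarray(w1, dtype=float), np.asarray(w2, dtype=float)
    return float(np.sum((w1 - w2) ** 2) + (b - c) ** 2)

# Hypothetical coefficients for f_b and f_nb:
print(classifier_distance([1.0, 0.8], -0.2, [1.0, 1.0], 0.0))  # 0.08
```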
When $(\sigma_{12}^2 - \sigma^2) > 0$, $f_b$ can be simplified as below:

$$f_b(x_1, x_2) = \big(\sigma_{12}(\mu_2^+ - \mu_2^-) - \sigma(\mu_1^+ - \mu_1^-)\big)x_1 + \big(\sigma_{12}(\mu_1^+ - \mu_1^-) - \sigma(\mu_2^+ - \mu_2^-)\big)x_2 + a(\sigma_{12}^2 - \sigma^2).$$

Let $r = \frac{\mu_2^+ - \mu_2^-}{\mu_1^+ - \mu_1^-}$. If $(\mu_1^+ - \mu_1^-) > 0$, then

$$f_{nb}(x_1, x_2) = x_1 + r x_2,$$

and

$$f_b(x_1, x_2) = (\sigma_{12} r - \sigma)x_1 + (\sigma_{12} - \sigma r)x_2 + \frac{a(\sigma_{12}^2 - \sigma^2)}{\mu_1^+ - \mu_1^-}.$$

Then,

$$D(f_b, f_{nb}) = (1 - \sigma_{12} r + \sigma)^2 + (r - \sigma_{12} + \sigma r)^2 + \frac{a^2(\sigma_{12}^2 - \sigma^2)^2}{(\mu_1^+ - \mu_1^-)^2}. \quad (15)$$

It is easy to verify that, when $(\mu_1^+ - \mu_1^-) < 0$, we obtain the same $D(f_b, f_{nb})$ as in Equation 15. Similarly, when $(\sigma_{12}^2 - \sigma^2) < 0$, we have

$$D(f_b, f_{nb}) = (1 + \sigma_{12} r - \sigma)^2 + (r + \sigma_{12} - \sigma r)^2 + \frac{a^2(\sigma_{12}^2 - \sigma^2)^2}{(\mu_1^+ - \mu_1^-)^2}. \quad (16)$$

From Equations 15 and 16, we see that $D(f_b, f_{nb})$ is affected by $r$, $\sigma_{12}$, and $\sigma$. In particular, $D(f_b, f_{nb})$ increases as $|r|$ increases. That is, the absolute ratio between the mean differences of the two attributes significantly affects the performance of naive Bayes: the smaller this absolute ratio, the better naive Bayes performs.
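The effect of $r$ on the distance is easy to see numerically. The sketch below evaluates Equation 15 for growing $|r|$, with $a = 0$ so the last term vanishes; the values of $\sigma$ and $\sigma_{12}$ are illustrative assumptions.

```python
# Equation 15 with a = 0, for illustrative sigma and sigma12 (assumed values).
sigma, sigma12 = 1.0, 0.4

def D(r):
    return (1 - sigma12 * r + sigma) ** 2 + (r - sigma12 + sigma * r) ** 2

for r in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"r = {r:4.1f}   D(f_b, f_nb) = {D(r):8.2f}")
# The printed distance grows monotonically in |r| for these values,
# matching the claim that a smaller absolute ratio between the mean
# differences gives better naive Bayes behavior.
```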
Conclusions

In this paper, we propose a new explanation of the classification performance of naive Bayes. We show that, essentially, the dependence distribution, i.e., how the local dependence of a node distributes in each class, evenly or unevenly, and how the local dependencies of all nodes work together, consistently (supporting a certain classification) or inconsistently (canceling each other out), plays a crucial role in classification. This explains why naive Bayes still works well even with strong dependencies: when those dependencies cancel each other out, they have no influence on the classification, and in this case naive Bayes is still the optimal classifier. In addition, we investigated the optimality of naive Bayes under the Gaussian distribution and presented an explicit sufficient condition under which naive Bayes is optimal even though the conditional independence assumption is violated.