Bayes Classifier and Naive Bayes

Tapas Kumar Mishra

August 7, 2022
Supervised ML Setup

$D = \{(x_1, y_1), \dots, (x_n, y_n)\} \subseteq \mathbb{R}^d \times C$

where:
$\mathbb{R}^d$ is the $d$-dimensional feature space,
$x_i$ is the input vector of the $i$-th sample,
$y_i$ is the label of the $i$-th sample,
$C$ is the label space.
The data points $(x_i, y_i)$ are drawn from some (unknown) distribution $P(X, Y)$. Ultimately we would like to learn a function $h$ such that for a new pair $(x, y) \sim P$, we have $h(x) = y$ with high probability (or $h(x) \approx y$).
Our training set consists of $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, drawn from some unknown distribution $P(X, Y)$. Because all pairs are sampled i.i.d., we obtain

$P(D) = P\big((x_1, y_1), \dots, (x_n, y_n)\big) = \prod_{i=1}^{n} P(x_i, y_i).$
Bayes Optimal Classifier

The Bayes optimal classifier predicts the most likely label under the true conditional distribution:

$h_{\mathrm{opt}}(x) = \operatorname{argmax}_y P(y \mid x)$

Its error rate on $x$ is $1 - P(h_{\mathrm{opt}}(x) \mid x)$, and no classifier can achieve a lower expected error.
Assume for example an email $x$ can either be classified as spam ($+1$) or ham ($-1$). For the same email $x$ the conditional class probabilities are

$P(+1 \mid x) = 0.8, \qquad P(-1 \mid x) = 0.2.$

In this case the Bayes optimal classifier would predict the label $y^* = +1$ as it is most likely, and its error rate would be $\epsilon_{\mathrm{BayesOpt}} = 1 - P(+1 \mid x) = 0.2$.
So how can we estimate $\hat{P}(y|x)$?

Previously we have derived that $\hat{P}(y) = \frac{\sum_{i=1}^{n} I(y_i = y)}{n}$.

Similarly, $\hat{P}(x) = \frac{\sum_{i=1}^{n} I(x_i = x)}{n}$ and $\hat{P}(y, x) = \frac{\sum_{i=1}^{n} I(x_i = x \wedge y_i = y)}{n}$.

We can put these two together:

$\hat{P}(y|x) = \frac{\hat{P}(y, x)}{\hat{P}(x)} = \frac{\sum_{i=1}^{n} I(x_i = x \wedge y_i = y)}{\sum_{i=1}^{n} I(x_i = x)}$
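This estimator is easy to compute directly. Below is a minimal sketch (plain NumPy; the function and variable names are illustrative, not from the slides) that counts matching feature vectors, and it already hints at the problem discussed below: for most query points the match set is empty.

```python
import numpy as np

def estimate_p_y_given_x(X, y, x_query, label):
    """MLE estimate of P(y = label | x = x_query) by counting.

    X: (n, d) array of training inputs, y: (n,) array of labels.
    Returns the fraction of training points identical to x_query
    that carry the given label (nan if no training point matches).
    """
    matches_x = np.all(X == x_query, axis=1)      # I(x_i = x)
    matches_xy = matches_x & (y == label)         # I(x_i = x and y_i = y)
    if matches_x.sum() == 0:                      # |B| = 0: estimate undefined
        return float("nan")
    return matches_xy.sum() / matches_x.sum()

# Tiny illustration with 3 binary features.
X = np.array([[1, 0, 1], [1, 0, 1], [0, 1, 1]])
y = np.array([+1, -1, +1])
print(estimate_p_y_given_x(X, y, np.array([1, 0, 1]), +1))  # 0.5
print(estimate_p_y_given_x(X, y, np.array([0, 0, 0]), +1))  # nan: never observed
```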
The Venn diagram illustrates that the MLE method estimates $\hat{P}(y|x)$ as

$\hat{P}(y|x) = \frac{|C|}{|B|}$

where $B$ is the set of training points with $x_i = x$ and $C \subseteq B$ is the subset of those that additionally satisfy $y_i = y$.
Problem: There is a big problem with this method. The MLE estimate is only good if there are many training vectors with exactly the same features as $x$! In high dimensional spaces (or with continuous $x$), this never happens, so $|B| \to 0$ and $|C| \to 0$. For example, the set of training points with $x_1 = \text{dallas}$, $x_2 = \text{female}$, $x_3 = 5$ is almost always empty, so $\hat{P}(y = \text{yes} \mid x_1 = \text{dallas}, x_2 = \text{female}, x_3 = 5)$ cannot be estimated!
Naive Bayes

By Bayes' rule,

$P(y|x) = \frac{P(x|y)\,P(y)}{P(x)}.$

Estimating $P(y)$ is easy. For example, if $Y$ takes on discrete binary values, estimating $P(Y)$ reduces to coin tossing: we simply count how many times we observe each outcome (in this case each class):

$\hat{P}(y = c) = \frac{\sum_{i=1}^{n} I(y_i = c)}{n} = \hat{\pi}_c$
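In code this is a one-liner per class. A small sketch (NumPy; names are illustrative) that computes $\hat{\pi}_c$ for every class at once:

```python
import numpy as np

def estimate_priors(y):
    """Return a dict mapping each class c to pi_hat_c = (# of y_i == c) / n."""
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes, counts / len(y)))

y = np.array([+1, -1, +1, +1])
print(estimate_priors(y))  # maps -1 -> 0.25 and +1 -> 0.75
```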
Estimating $P(x|y)$, however, is not easy! The additional assumption that we make is the Naive Bayes assumption.
Naive Bayes assumption

$P(x|y) = \prod_{\alpha=1}^{d} P(x_\alpha \mid y)$, where $x_\alpha = [x]_\alpha$ is the value for feature $\alpha$.

That is, the feature values are assumed to be conditionally independent given the label.
MLE

So, for now, let's pretend the Naive Bayes assumption holds. Then the Bayes classifier can be defined as

$h(x) = \operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y \; P(y) \prod_{\alpha=1}^{d} P(x_\alpha \mid y)$

(the denominator $P(x)$ is dropped because it does not depend on $y$). Now we know how we can use our assumption to make the estimation of $P(y|x)$ tractable. There are 3 notable cases in which we can use our naive Bayes classifier:

Case #1: Categorical features.
Case #2: Multinomial features.
Case #3: Continuous features (Gaussian Naive Bayes).

All three cases share the prediction skeleton sketched below; they differ only in how $P(x_\alpha \mid y)$ is modeled and estimated.
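A hedged sketch of that shared skeleton (the framing and names are my own, not from the slides): the per-feature conditional is passed in as a function, and each of the three cases below amounts to a different choice of it.

```python
import numpy as np

def naive_bayes_predict(x, classes, log_prior, log_cond):
    """Generic naive Bayes: argmax_c log P(c) + sum_alpha log P(x_alpha | c).

    log_cond(alpha, value, c) returns log P(x_alpha = value | y = c);
    Cases #1-#3 differ only in how this quantity is modeled.
    """
    scores = [log_prior[c] + sum(log_cond(a, x[a], c) for a in range(len(x)))
              for c in classes]
    return classes[int(np.argmax(scores))]

# Toy usage: two classes, hand-set probabilities for 2 binary features.
log_prior = {+1: np.log(0.5), -1: np.log(0.5)}
table = {+1: [[0.9, 0.1], [0.3, 0.7]],   # table[c][alpha][v] = P(x_alpha = v | c)
         -1: [[0.2, 0.8], [0.6, 0.4]]}
log_cond = lambda a, v, c: np.log(table[c][a][v])
print(naive_bayes_predict([0, 1], [+1, -1], log_prior, log_cond))  # +1
```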
Case #1: Categorical features

Features:

$[x]_\alpha \in \{f_1, f_2, \dots, f_{K_\alpha}\}$

Each feature $\alpha$ falls into one of $K_\alpha$ categories. (Note that the case with binary features is just a specific case of this, where $K_\alpha = 2$.) An example of such a setting may be medical data where one feature could be gender (male / female) or marital status (single / married / widowed).
Case #1: Categorical features

Model $P(x_\alpha \mid y)$:

$P(x_\alpha = j \mid y = c) = [\theta_{jc}]_\alpha$

and

$\sum_{j=1}^{K_\alpha} [\theta_{jc}]_\alpha = 1$

i.e., each feature $\alpha$ has its own categorical distribution for each class $c$.
Case #1: Categorical features

Parameter estimation:

$[\hat{\theta}_{jc}]_\alpha = \frac{\sum_{i=1}^{n} I(y_i = c)\, I(x_{i\alpha} = j) + l}{\sum_{i=1}^{n} I(y_i = c) + l K_\alpha}, \qquad (1)$

where $l$ is a smoothing parameter ($l = 0$ gives the plain MLE; $l = 1$ gives Laplace / add-one smoothing).
Case #1: Categorical features

Prediction:

$\operatorname{argmax}_c \; P(y = c \mid x) \propto \operatorname{argmax}_c \; \hat{\pi}_c \prod_{\alpha=1}^{d} [\hat{\theta}_{jc}]_\alpha \quad \text{where } j = x_\alpha$

$= \operatorname{argmax}_c \; \frac{\sum_{i=1}^{n} I(y_i = c)}{n} \prod_{\alpha=1}^{d} \frac{\sum_{i=1}^{n} I(y_i = c)\, I(x_{i\alpha} = x_\alpha) + l}{\sum_{i=1}^{n} I(y_i = c) + l K_\alpha}$
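Putting Eq. (1) and the prediction rule together, here is a compact sketch (plain NumPy; the class and its names are my own) for categorical features encoded as integers $0, \dots, K_\alpha - 1$, assuming every category appears in the training data:

```python
import numpy as np

class CategoricalNB:
    """Naive Bayes for integer-coded categorical features with Eq. (1) smoothing."""

    def __init__(self, l=1.0):
        self.l = l  # smoothing parameter

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.K = X.max(axis=0) + 1           # K_alpha per feature (assumes all seen)
        n, d = X.shape
        self.log_prior = np.log(np.array([(y == c).mean() for c in self.classes]))
        # log_theta[ci][alpha][j] = log P(x_alpha = j | y = c), smoothed per Eq. (1)
        self.log_theta = []
        for c in self.classes:
            Xc = X[y == c]
            per_feature = []
            for alpha in range(d):
                counts = np.bincount(Xc[:, alpha], minlength=self.K[alpha])
                per_feature.append(
                    np.log((counts + self.l) / (len(Xc) + self.l * self.K[alpha])))
            self.log_theta.append(per_feature)
        return self

    def predict(self, x):
        # argmax_c  log pi_c + sum_alpha log theta_{x_alpha, c}
        scores = [lp + sum(th[alpha][x[alpha]] for alpha in range(len(x)))
                  for lp, th in zip(self.log_prior, self.log_theta)]
        return self.classes[int(np.argmax(scores))]

# Toy data: feature 0 = gender in {0,1}, feature 1 = marital status in {0,1,2}.
X = np.array([[0, 1], [1, 2], [0, 0], [1, 1]])
y = np.array([+1, -1, +1, -1])
print(CategoricalNB().fit(X, y).predict(np.array([0, 1])))  # +1
```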
Case #2: Multinomial features

Features:

$[x]_\alpha \in \{0, 1, 2, \dots, m\}$, with $m = \sum_{\alpha=1}^{d} [x]_\alpha$

Each feature value is a count. The classic example is text classification: $x_\alpha$ is the number of times word $\alpha$ of a $d$-word vocabulary appears in a document of length $m$.

Model $P(x \mid y)$: the multinomial distribution

$P(x \mid m, y = c) = \frac{m!}{x_1!\, x_2! \cdots x_d!} \prod_{\alpha=1}^{d} (\theta_{\alpha c})^{x_\alpha}$

where $\theta_{\alpha c}$ is the probability of drawing (word) $\alpha$ given class $c$, with $\sum_{\alpha=1}^{d} \theta_{\alpha c} = 1$.
Case #2: Multinomial features

Parameter estimation:

$\hat{\theta}_{\alpha c} = \frac{\sum_{i=1}^{n} I(y_i = c)\, x_{i\alpha} + l}{\sum_{i=1}^{n} I(y_i = c)\, m_i + l \cdot d} \qquad (3)$

where $m_i = \sum_{\alpha=1}^{d} x_{i\alpha}$ is the length of the $i$-th example and $l$ is again the smoothing parameter.
Case #2: Multinomial features

Prediction:

$\operatorname{argmax}_c \; P(y = c \mid x) \propto \operatorname{argmax}_c \; \hat{\pi}_c \prod_{\alpha=1}^{d} (\hat{\theta}_{\alpha c})^{x_\alpha}$
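In code, everything is done in log space so long documents do not underflow. A minimal sketch of Eq. (3) and the prediction rule, assuming bag-of-words count vectors (all names are illustrative):

```python
import numpy as np

def fit_multinomial_nb(X, y, l=1.0):
    """Return (classes, log_prior, log_theta) estimated per Eq. (3).

    X: (n, d) word-count matrix, y: (n,) labels, l: smoothing parameter.
    """
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    d = X.shape[1]
    log_theta = []
    for c in classes:
        counts = X[y == c].sum(axis=0)            # sum_i I(y_i = c) x_ialpha
        # counts.sum() equals sum_i I(y_i = c) m_i, the Eq. (3) denominator term
        log_theta.append(np.log((counts + l) / (counts.sum() + l * d)))
    return classes, log_prior, np.array(log_theta)

def predict(x, classes, log_prior, log_theta):
    # argmax_c  log pi_c + sum_alpha x_alpha * log theta_alpha_c
    return classes[int(np.argmax(log_prior + log_theta @ x))]

# Toy corpus: columns = vocabulary words, rows = documents.
X = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
y = np.array([+1, +1, -1, -1])
print(predict(np.array([1, 0, 0]), *fit_multinomial_nb(X, y)))  # +1
```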
Case #3: Continuous features (Gaussian Naive Bayes)

Features:

$x_\alpha \in \mathbb{R}$ (each feature takes a real value, e.g. height or blood pressure)

Model $P(x_\alpha \mid y)$: each feature is modeled as a Gaussian, conditioned on the class,

$P(x_\alpha \mid y = c) = \mathcal{N}(\mu_{\alpha c}, \sigma^2_{\alpha c}) = \frac{1}{\sqrt{2\pi}\,\sigma_{\alpha c}} \exp\!\left(-\frac{(x_\alpha - \mu_{\alpha c})^2}{2\sigma^2_{\alpha c}}\right)$

Parameter estimation: the MLE is the per-class sample mean and variance of each feature,

$\hat{\mu}_{\alpha c} = \frac{1}{n_c} \sum_{i: y_i = c} x_{i\alpha}, \qquad \hat{\sigma}^2_{\alpha c} = \frac{1}{n_c} \sum_{i: y_i = c} (x_{i\alpha} - \hat{\mu}_{\alpha c})^2, \qquad n_c = \sum_{i=1}^{n} I(y_i = c)$

Prediction:

$\operatorname{argmax}_c \; P(y = c \mid x) \propto \operatorname{argmax}_c \; \hat{\pi}_c \prod_{\alpha=1}^{d} \mathcal{N}(x_\alpha;\, \hat{\mu}_{\alpha c}, \hat{\sigma}^2_{\alpha c})$
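A minimal Gaussian Naive Bayes sketch (NumPy; names are illustrative). The small variance floor is a practical tweak of mine, not from the slides; it guards against division by zero when a feature is constant within a class:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class, per-feature mean and variance (MLE), plus class priors."""
    classes = np.unique(y)
    prior = np.array([(y == c).mean() for c in classes])
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    var = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9  # avoid /0
    return classes, prior, mu, var

def predict(x, classes, prior, mu, var):
    # Log of pi_c * prod_alpha N(x_alpha; mu_ac, var_ac), maximized over c.
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return classes[int(np.argmax(np.log(prior) + log_lik))]

# Toy data: features = (height cm, weight kg).
X = np.array([[170.0, 60.0], [180.0, 80.0], [160.0, 50.0], [155.0, 45.0]])
y = np.array([+1, +1, -1, -1])
print(predict(np.array([175.0, 70.0]), *fit_gaussian_nb(X, y)))  # +1
```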
Naive Bayes is a linear classifier

In the binary case (labels $+1$ and $-1$) with the multinomial model, naive Bayes is a linear classifier: there exist a weight vector $w$ and a bias $b$ such that

$w^\top x + b > 0 \iff h(x) = +1.$
Naive Bayes is a linear classifier

As before, we define $P(x_\alpha \mid y = +1) \propto \theta_{\alpha+}^{x_\alpha}$ and $P(Y = +1) = \pi_+$, and set

$[w]_\alpha = \log(\theta_{\alpha+}) - \log(\theta_{\alpha-}), \qquad b = \log(\pi_+) - \log(\pi_-).$
Naive Bayes is a linear classifier

Simplifying this further leads to

$w^\top x + b > 0$

$\iff \sum_{\alpha=1}^{d} [x]_\alpha \underbrace{\big(\log(\theta_{\alpha+}) - \log(\theta_{\alpha-})\big)}_{[w]_\alpha} + \underbrace{\log(\pi_+) - \log(\pi_-)}_{b} > 0$ (plugging in the definition of $w$, $b$)

$\iff \exp\left( \sum_{\alpha=1}^{d} [x]_\alpha \big(\log(\theta_{\alpha+}) - \log(\theta_{\alpha-})\big) + \log(\pi_+) - \log(\pi_-) \right) > 1$ (exponentiating both sides)

$\iff \left( \prod_{\alpha=1}^{d} \frac{\exp\big(\log \theta_{\alpha+}^{[x]_\alpha}\big)}{\exp\big(\log \theta_{\alpha-}^{[x]_\alpha}\big)} \right) \frac{\exp(\log \pi_+)}{\exp(\log \pi_-)} > 1$ (because $a \log(b) = \log(b^a)$ and $\exp(a - b) = \frac{e^a}{e^b}$)
Naive Bayes is a linear classifier

Simplifying this further leads to

$\iff \left( \prod_{\alpha=1}^{d} \frac{\theta_{\alpha+}^{[x]_\alpha}}{\theta_{\alpha-}^{[x]_\alpha}} \right) \frac{\pi_+}{\pi_-} > 1$ (because $\exp(\log(a)) = a$ and $e^{a+b} = e^a e^b$)

$\iff \frac{\left(\prod_{\alpha=1}^{d} P([x]_\alpha \mid Y = +1)\right) \pi_+}{\left(\prod_{\alpha=1}^{d} P([x]_\alpha \mid Y = -1)\right) \pi_-} > 1$ (because $P([x]_\alpha \mid Y = -1) = \theta_{\alpha-}^{x_\alpha}$)

$\iff \frac{P(x \mid Y = +1)\, \pi_+}{P(x \mid Y = -1)\, \pi_-} > 1$ (by the naive Bayes assumption)

$\iff \frac{P(Y = +1 \mid x)}{P(Y = -1 \mid x)} > 1$ (by Bayes' rule; $\pi_+ = P(Y = +1)$)

$\iff P(Y = +1 \mid x) > P(Y = -1 \mid x)$

$\iff \operatorname{argmax}_y P(Y = y \mid x) = +1$

i.e. the point $x$ lies on the positive side of the hyperplane iff naive Bayes predicts $+1$.
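The equivalence is easy to check numerically. This sketch (names are my own; it uses known multinomial parameters rather than estimates) builds $w$ and $b$ from $\theta$ and $\pi$ and confirms that the sign of $w^\top x + b$ matches the naive Bayes prediction on random count vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Known multinomial parameters for the two classes (+1 / -1).
theta_pos = np.array([0.5, 0.3, 0.2])
theta_neg = np.array([0.2, 0.3, 0.5])
pi_pos, pi_neg = 0.6, 0.4

# Linear form derived above: [w]_alpha = log theta_a+ - log theta_a-, b = log pi+ - log pi-.
w = np.log(theta_pos) - np.log(theta_neg)
b = np.log(pi_pos) - np.log(pi_neg)

def nb_predict(x):
    """argmax over c of log pi_c + sum_alpha x_alpha * log theta_alpha_c."""
    s_pos = np.log(pi_pos) + x @ np.log(theta_pos)
    s_neg = np.log(pi_neg) + x @ np.log(theta_neg)
    return +1 if s_pos > s_neg else -1

for _ in range(1000):
    x = rng.integers(0, 10, size=3)          # random word-count vector
    assert (w @ x + b > 0) == (nb_predict(x) == +1)
print("sign(w^T x + b) agrees with naive Bayes on all samples")
```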