
Bayes Classifier and Naive Bayes

Tapas Kumar Mishra

August 7, 2022

Supervised ML Setup

Let us formalize the supervised machine learning setup. Our training data comes in pairs (x, y), where x ∈ R^d is the input instance and y its label.
The entire training data is denoted as

D = {(x_1, y_1), . . . , (x_n, y_n)} ⊆ R^d × C

where:
R^d is the d-dimensional feature space
x_i is the input vector of the i-th sample
y_i is the label of the i-th sample
C is the label space

The data points (x_i, y_i) are drawn from some (unknown) distribution P(X, Y). Ultimately we would like to learn a function h such that for a new pair (x, y) ∼ P, we have h(x) = y with high probability (or h(x) ≈ y).

Our training data consists of the set D = {(x_1, y_1), . . . , (x_n, y_n)} drawn from some unknown distribution P(X, Y).
Because all pairs are sampled i.i.d., we obtain

P(D) = P((x_1, y_1), . . . , (x_n, y_n)) = ∏_{α=1}^n P(x_α, y_α).

If we have enough data, we could estimate P(X, Y) similarly to the coin example in the previous lecture, where we imagine a gigantic die that has one side for each possible value of (x, y).
We can estimate the probability that one specific side comes up through counting:

P̂(x, y) = ( Σ_{i=1}^n I(x_i = x ∧ y_i = y) ) / n,

where I(x_i = x ∧ y_i = y) = 1 if x_i = x and y_i = y, and 0 otherwise.
Bayes Optimal Classifier

Of course, if we are primarily interested in predicting the label y from the features x, we may estimate P(Y|X) directly instead of P(X, Y).
We can then use the Bayes Optimal Classifier for a specific P̂(y|x) to make predictions.
The Bayes optimal classifier predicts y* = h_opt(x) = argmax_y P(y|x).
Although the Bayes optimal classifier is as good as it gets, it can still make mistakes. It is always wrong if a sample does not have the most likely label. We can compute the probability of that happening precisely (which is exactly the error rate):

ϵ_BayesOpt = 1 − P(h_opt(x)|x) = 1 − P(y*|x)

Assume, for example, that an email x can either be classified as spam (+1) or ham (−1). For the same email x the conditional class probabilities are:

P(+1|x) = 0.8, P(−1|x) = 0.2

In this case the Bayes optimal classifier would predict the label y* = +1 as it is most likely, and its error rate would be ϵ_BayesOpt = 0.2.
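As a quick sanity check, here is a minimal Python sketch (my illustration, not part of the slides) that picks the most likely label from a given conditional distribution and reports the Bayes optimal error rate; the dictionary of probabilities is just the example above.

```python
# Minimal sketch: Bayes optimal prediction from known conditional class probabilities.
# In practice P(y|x) is unknown; here we plug in the spam/ham example values.

def bayes_optimal(cond_probs):
    """Return (best_label, error_rate) given a dict {label: P(label|x)}."""
    y_star = max(cond_probs, key=cond_probs.get)  # argmax_y P(y|x)
    error = 1.0 - cond_probs[y_star]              # 1 - P(y*|x)
    return y_star, error

cond_probs = {+1: 0.8, -1: 0.2}   # P(+1|x) = 0.8 (spam), P(-1|x) = 0.2 (ham)
label, err = bayes_optimal(cond_probs)
print(label, err)                 # +1, 0.2
```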

So how can we estimate P̂(y|x)?
Previously we derived that P̂(y) = ( Σ_{i=1}^n I(y_i = y) ) / n.
Similarly, P̂(x) = ( Σ_{i=1}^n I(x_i = x) ) / n and P̂(y, x) = ( Σ_{i=1}^n I(x_i = x ∧ y_i = y) ) / n.
We can put these together:

P̂(y|x) = P̂(y, x) / P̂(x) = Σ_{i=1}^n I(x_i = x ∧ y_i = y) / Σ_{i=1}^n I(x_i = x)
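A small Python sketch of this counting estimate (my own illustration; the toy data and feature names are invented): it computes P̂(y|x) as the fraction of training points whose features match x exactly and that carry label y.

```python
# Counting (MLE) estimate of P(y|x): count exact feature matches, then the fraction with label y.

def p_hat_y_given_x(data, x, y):
    """data: list of (x_i, y_i) pairs with hashable feature tuples x_i."""
    matches_x = sum(1 for xi, yi in data if xi == x)                 # sum_i I(x_i = x)
    matches_xy = sum(1 for xi, yi in data if xi == x and yi == y)    # sum_i I(x_i = x and y_i = y)
    if matches_x == 0:
        return None   # undefined when no training vector equals x exactly
    return matches_xy / matches_x

data = [(("dallas", "female"), "yes"),
        (("dallas", "female"), "no"),
        (("nyc", "male"), "no")]
print(p_hat_y_given_x(data, ("dallas", "female"), "yes"))   # 0.5
print(p_hat_y_given_x(data, ("dallas", "male"), "yes"))     # None: no exact match (the problem discussed below)
```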

The Venn diagram illustrates that the MLE method estimates P̂(y|x) as

P̂(y|x) = |C| / |B|

Problem: There is a big problem with this method. The MLE estimate is only good if there are many training vectors with exactly the same features as x!
In high-dimensional spaces (or with continuous x), this never happens! So |B| → 0 and |C| → 0.
For example, the set behind P(y = yes | x_1 = dallas, x_2 = female, x_3 = 5) is always empty!

Naive Bayes

We can approach this dilemma with a simple trick, and an additional assumption. The trick part is to estimate P(y) and P(x|y) instead, since, by Bayes rule,

P(y|x) = P(x|y) P(y) / P(x).

Recall from Estimating Probabilities from Data that estimating P(y) and P(x|y) is called generative learning.

Naive Bayes

P(y|x) = P(x|y) P(y) / P(x).

Estimating P(y) is easy.
For example, if Y takes on discrete binary values, estimating P(Y) reduces to coin tossing. We simply need to count how many times we observe each outcome (in this case each class):

P(y = c) = ( Σ_{i=1}^n I(y_i = c) ) / n = π̂_c

Naive Bayes

P(y|x) = P(x|y) P(y) / P(x).

Estimating P(x|y), however, is not easy! The additional assumption that we make is the Naive Bayes assumption.

Naive Bayes assumption

P(x|y) = ∏_{α=1}^d P(x_α|y), where x_α = [x]_α is the value of feature α

i.e., feature values are independent given the label! This is a very bold assumption.
For example, a setting where the Naive Bayes classifier is often used is spam filtering. Here, the data is emails and the label is spam or not-spam.
The Naive Bayes assumption implies that the words in an email are conditionally independent, given that you know whether the email is spam or not.
Clearly this is not true.
Another example: given a disease, all of its symptoms are treated as independent (a causal relationship).
Illustration behind the Naive Bayes algorithm. We estimate P(x_α|y) independently in each dimension (middle two images) and then obtain an estimate of the full data distribution by assuming conditional independence, P(x|y) = ∏_α P(x_α|y) (rightmost image).

MLE
So, for now, let's pretend the Naive Bayes assumption holds. Then the Bayes Classifier can be defined as

h(x) = argmax_y P(y|x)
     = argmax_y P(x|y) P(y) / P(x)
     = argmax_y P(x|y) P(y)    (P(x) does not depend on y)
     = argmax_y ∏_{α=1}^d P(x_α|y) · P(y)    (by the Naive Bayes assumption)
     = argmax_y Σ_{α=1}^d log(P(x_α|y)) + log(P(y))    (as log is a monotonic function)

Estimating log(P(x_α|y)) is easy as we only need to consider one dimension. And estimating P(y) is not affected by the assumption.
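As a minimal sketch of this decision rule (my own illustration, not from the slides): given per-class priors P(y) and per-dimension conditional probabilities P(x_α|y), the classifier sums logs and takes the argmax. The toy numbers below are hypothetical.

```python
import math

# Naive Bayes decision rule in log space:
# h(x) = argmax_y  sum_alpha log P(x_alpha | y)  +  log P(y)

def nb_predict(x, priors, cond):
    """
    x      : list of feature values [x_1, ..., x_d]
    priors : dict {class: P(y)}
    cond   : dict {class: [dict of P(x_alpha = value | y) for each dimension alpha]}
    """
    best, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for alpha, value in enumerate(x):
            score += math.log(cond[c][alpha][value])
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical two-feature example.
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": [{"offer": 0.7, "hello": 0.3}, {"now": 0.8, "later": 0.2}],
    "ham":  [{"offer": 0.2, "hello": 0.8}, {"now": 0.4, "later": 0.6}],
}
print(nb_predict(["offer", "now"], priors, cond))   # spam
```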
Estimating P([x]_α | y)

We now know how to use our assumption to make the estimation of P(y|x) tractable. There are three notable cases in which we can use our Naive Bayes classifier:
Case #1: Categorical features.
Case #2: Multinomial features.
Case #3: Continuous features (Gaussian Naive Bayes).

Case #1: Categorical features

Features:
[x]_α ∈ {f_1, f_2, · · · , f_{K_α}}
Each feature α falls into one of K_α categories. (Note that the case with binary features is just a specific case of this, where K_α = 2.)
An example of such a setting may be medical data where one feature could be
gender (male / female) or
marital status (single / married / widowed).

Case #1: Categorical features

Model P(x_α | y):

P(x_α = j | y = c) = [θ_jc]_α

and

Σ_{j=1}^{K_α} [θ_jc]_α = 1

where [θ_jc]_α is the probability of feature α having the value j, given that the label is c. The constraint indicates that x_α must take one of the categories {1, . . . , K_α}.

Case #1: Categorical features

Parameter estimation:

[θ̂_jc]_α = ( Σ_{i=1}^n I(y_i = c) I(x_iα = j) + l ) / ( Σ_{i=1}^n I(y_i = c) + l K_α ),    (1)

where x_iα = [x_i]_α and l is a smoothing parameter.
By setting l = 0 we get an MLE estimator; l > 0 leads to MAP. If we set l = +1 we get Laplace smoothing.
In words (without the l hallucinated samples) this means

(# of samples with label c that have feature α with value j) / (# of samples with label c).

Case #1: Categorical features

Prediction:

argmax_c P(y = c | x) ∝ argmax_c π̂_c ∏_{α=1}^d [θ̂_jc]_α
= argmax_c ( Σ_{i=1}^n I(y_i = c) / n ) · ∏_{α=1}^d ( Σ_{i=1}^n I(y_i = c) I(x_iα = j) + l ) / ( Σ_{i=1}^n I(y_i = c) + l K_α ),

where in each factor j is the observed value of feature α, i.e. j = x_α.
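Below is a short Python sketch of equation (1) and this prediction rule for categorical features (my illustration; the function names and the toy data are assumptions, not from the slides).

```python
from collections import Counter

def fit_categorical_nb(X, y, l=1.0):
    """X: list of feature tuples, y: list of labels. Returns (priors, theta, categories)."""
    n, d = len(X), len(X[0])
    classes = sorted(set(y))
    categories = [sorted({x[a] for x in X}) for a in range(d)]   # observed categories per feature
    priors = {c: sum(1 for yi in y if yi == c) / n for c in classes}
    theta = {}
    for c in classes:
        nc = sum(1 for yi in y if yi == c)
        for a in range(d):
            K = len(categories[a])
            counts = Counter(x[a] for x, yi in zip(X, y) if yi == c)
            for j in categories[a]:
                # Equation (1): ( count(y=c, x_a=j) + l ) / ( count(y=c) + l*K_a )
                theta[(j, c, a)] = (counts[j] + l) / (nc + l * K)
    return priors, theta, categories

def predict_categorical_nb(x, priors, theta):
    scores = {}
    for c, prior in priors.items():
        p = prior
        for a, j in enumerate(x):
            p *= theta[(j, c, a)]
        scores[c] = p
    return max(scores, key=scores.get)

# Toy data loosely modeled on the city/gender example that follows.
X = [("Dallas", "F"), ("Dallas", "M"), ("NYC", "F"), ("NYC", "M"), ("Dallas", "F")]
y = ["Yes", "No", "No", "No", "Yes"]
priors, theta, _ = fit_categorical_nb(X, y, l=1.0)
print(predict_categorical_nb(("Dallas", "F"), priors, theta))   # "Yes" on this toy data
```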

Example

So, the various probabilities we get from the following table are:

P(Dallas|Yes) = 2/3 and P(Dallas|No) = 2/5
P(NewYorkCity|Yes) = 1/3 and P(NewYorkCity|No) = 3/5
P(Dallas) = 4/8 and P(NewYorkCity) = 4/8
P(Yes) = 3/8 and P(No) = 5/8

Example

So, the various probabilities we get from the following table are:

P(Male|Yes) = 1/3 and P(Male|No) = 3/5
P(Female|Yes) = 2/3 and P(Female|No) = 2/5
P(Male) = 5/8 and P(Female) = 3/8
P(Yes) = 3/8 and P(No) = 5/8

Example

Prediction for X = (City = Dallas, Gender = Female):

P(X|illness = No) = P(Dallas|No) · P(Female|No) = 2/5 × 2/5 = 4/25.
P(X|illness = Yes) = P(Dallas|Yes) · P(Female|Yes) = 2/3 × 2/3 = 4/9.
Since P(X|No) · P(No) < P(X|Yes) · P(Yes), by Bayes' rule P(No|X) < P(Yes|X), so Class = Yes.
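A few lines of Python (my check, not from the slides) reproducing this comparison with the probabilities listed above:

```python
# Recomputing the worked example with the probabilities read off the slides.
p_x_given_no  = (2/5) * (2/5)        # P(Dallas|No) * P(Female|No)  = 4/25
p_x_given_yes = (2/3) * (2/3)        # P(Dallas|Yes) * P(Female|Yes) = 4/9
p_no, p_yes = 5/8, 3/8               # class priors

score_no  = p_x_given_no * p_no      # proportional to P(No|X)
score_yes = p_x_given_yes * p_yes    # proportional to P(Yes|X)
print(score_no, score_yes)           # 0.1 < 0.1667, so predict Yes
print("Yes" if score_yes > score_no else "No")
```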

Case #2: Multinomial features

Here, feature values don't represent categories (e.g. male/female) but counts, so we need to use a different model. E.g., in text document categorization, feature value x_α = j means that in this particular document x the α-th word in my dictionary appears j times.
Think of starting with an all-zero vector of dimension d = V. We throw a die with V sides to sample the first word and increase the count of the vector at that position by 1. We keep repeating this experiment until we have m words; that is the email.

Case #2: Multinomial features

Let us consider the example of spam filtering. Imagine the α-th word is indicative of “spam”. Then x_α = 10 means that this email is likely spam (as word α appears 10 times in it). And another email with x′_α = 20 should be even more likely to be spam (as the spammy word appears twice as often).
With categorical features this is not guaranteed. It could be that the training set does not contain any email that contains word α exactly 20 times. In this case you would simply get the hallucinated smoothing values for both spam and not-spam, and the signal is lost.

Case #2: Multinomial features

We need a model that incorporates our knowledge that features are counts.
Features:

x_α ∈ {0, 1, 2, . . . , m} and m = Σ_{α=1}^d x_α    (2)

Each feature α represents a count and m is the length of the sequence. An example of this could be the count of a specific word α in a document of length m, where d is the size of the vocabulary.

Case #2: Multinomial features

Model P(x | y): Use the multinomial distribution

P(x | m, y = c) = ( m! / (x_1! · x_2! · · · · · x_d!) ) ∏_{α=1}^d (θ_αc)^{x_α}

where θ_αc is the probability of selecting word α, x_α is the number of times α is chosen, and Σ_{α=1}^d θ_αc = 1.
So, we can use this to generate a spam email, i.e., a document x of class y = spam, by picking m words independently at random from the vocabulary of d words using P(x | y = spam).
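To make the generative story concrete, here is a small sketch (my illustration; the vocabulary and θ values are invented) that draws a length-m count vector x from the multinomial P(x | m, y = spam) using NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary of d = 4 words and class-conditional word probabilities for y = spam.
vocab = ["offer", "free", "meeting", "report"]
theta_spam = np.array([0.4, 0.4, 0.1, 0.1])    # must sum to 1

m = 20                                          # number of words in the generated email
x = rng.multinomial(m, theta_spam)              # count vector, one entry per vocabulary word
print(dict(zip(vocab, x)))                      # e.g. {'offer': 9, 'free': 8, 'meeting': 2, 'report': 1}
```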

Case #2: Multinomial features

Parameter estimation:

θ̂_αc = ( Σ_{i=1}^n I(y_i = c) x_iα + l ) / ( Σ_{i=1}^n I(y_i = c) m_i + l · d )    (3)

where m_i = Σ_{β=1}^d x_iβ denotes the number of words in document i.
The numerator sums up all counts for feature x_α and the denominator sums up all counts of all features across all data points. E.g.,

(# of times word α appears in all spam emails) / (# of words in all spam emails combined).

Again, l is the smoothing parameter.

Case #2: Multinomial features

Prediction:

argmax_c P(y = c | x) ∝ argmax_c π̂_c ∏_{α=1}^d (θ̂_αc)^{x_α}
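A compact sketch of equation (3) and this prediction rule (my own, with invented toy counts), working in log space to avoid underflow on long documents:

```python
import numpy as np

def fit_multinomial_nb(X, y, l=1.0):
    """X: (n, d) array of word counts, y: length-n array of labels."""
    classes = np.unique(y)
    log_priors = np.array([np.log(np.mean(y == c)) for c in classes])
    log_theta = []
    for c in classes:
        counts = X[y == c].sum(axis=0)                              # sum_i I(y_i = c) x_i,alpha
        theta_c = (counts + l) / (counts.sum() + l * X.shape[1])    # equation (3)
        log_theta.append(np.log(theta_c))
    return classes, log_priors, np.array(log_theta)

def predict_multinomial_nb(x, classes, log_priors, log_theta):
    scores = log_priors + log_theta @ x        # log pi_c + sum_alpha x_alpha log theta_alpha,c
    return classes[np.argmax(scores)]

# Tiny hypothetical corpus: columns are word counts, labels are spam / ham.
X = np.array([[3, 0, 1], [4, 1, 0], [0, 2, 3], [1, 3, 2]])
y = np.array(["spam", "spam", "ham", "ham"])
model = fit_multinomial_nb(X, y, l=1.0)
print(predict_multinomial_nb(np.array([2, 0, 1]), *model))   # -> "spam"
```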

Case #3: Continuous features (Gaussian Naive Bayes)

Illustration of Gaussian NB. Each class-conditional feature distribution P(x_α|y) is assumed to originate from an independent Gaussian distribution with its own mean µ_α,y and variance σ²_α,y.

Case #3: Continuous features (Gaussian Naive Bayes)

Features:

x_α ∈ R (each feature takes on a real value)    (4)

Case #3: Continuous features (Gaussian Naive Bayes)

Model P(x_α | y): Use the Gaussian distribution

P(x_α | y = c) = N(µ_αc, σ²_αc) = ( 1 / (√(2π) σ_αc) ) · exp( −(x_α − µ_αc)² / (2 σ²_αc) )    (5)

Note that the model specified above is based on our assumption about the data: that each feature α comes from a class-conditional Gaussian distribution. The full distribution P(x|y) ∼ N(µ_y, Σ_y), where Σ_y is a diagonal covariance matrix with [Σ_y]_α,α = σ²_α,y.

Case #3: Continuous features (Gaussian Naive Bayes)

Parameter estimation: As always, we estimate the parameters of the distributions for each dimension and class independently. Gaussian distributions only have two parameters, the mean and the variance. The mean µ_α,y is estimated by the average feature value of dimension α over all samples with label y. The (squared) standard deviation is simply the variance of this estimate.

µ_αc ← (1/n_c) Σ_{i=1}^n I(y_i = c) x_iα,   where n_c = Σ_{i=1}^n I(y_i = c)    (6)

σ²_αc ← (1/n_c) Σ_{i=1}^n I(y_i = c) (x_iα − µ_αc)²    (7)
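A short sketch of equations (6) and (7) plus the resulting log-density used at prediction time (my illustration, on made-up numbers):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """X: (n, d) real-valued features, y: length-n labels. Returns per-class priors, means, variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    mu = np.array([X[y == c].mean(axis=0) for c in classes])    # equation (6)
    var = np.array([X[y == c].var(axis=0) for c in classes])    # equation (7), MLE variance
    return classes, priors, mu, var

def predict_gaussian_nb(x, classes, priors, mu, var):
    # log N(x_alpha; mu, sigma^2) summed over dimensions, plus the log prior
    log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)
    return classes[np.argmax(np.log(priors) + log_lik)]

# Hypothetical 1-D income feature (in dollars), label = has illness.
X = np.array([[60000.0], [75000.0], [50000.0], [90000.0], [85000.0], [110000.0]])
y = np.array(["Yes", "Yes", "Yes", "No", "No", "No"])
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(np.array([100000.0]), *model))   # -> "No" for this toy data
```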

Case #3: Continuous features (Gaussian Naive Bayes)

Prediction:

argmax_c P(y = c | x) ∝ argmax_c π̂_c P(x | y = c)

Example

Likelihood of Yes, i.e. the person has the disease:
P(Income = 100000 | Illness = Yes) = ( 1 / (42397.5 · √(2π)) ) · exp( −(100000 − 68386.66)² / (2 · 42397.5²) ) = 0.00000712789
Likelihood of No, i.e. the person does not have the disease:
P(Income = 100000 | Illness = No) = ( 1 / (28972.49 · √(2π)) ) · exp( −(100000 − 79571.8)² / (2 · 28972.49²) ) = 0.0000107421
Example

Prediction for X = (City = Dallas, Gender = Female, Income = 100k):

P(X|illness = No) = P(Dallas|No) · P(Female|No) · P(Income = 100K|Class = No) = 2/5 × 2/5 × 0.00000712789 = 0.00000114046.
P(X|illness = Yes) = P(Dallas|Yes) · P(Female|Yes) · P(Income = 100K|Class = Yes) = 2/3 × 2/3 × 0.0000107421 = 0.00000477426.
Since P(X|No) · P(No) < P(X|Yes) · P(Yes), by Bayes' rule P(No|X) < P(Yes|X), so Class = Yes.

Naive Bayes is a linear classifier

Naive Bayes leads to a linear decision boundary in many common cases. Illustrated here is the case where P(x_α|y) is Gaussian and where σ_α,c is identical for all c (but can differ across dimensions α). The boundaries of the ellipsoids indicate regions of equal probability P(x|y). The black decision line indicates the decision boundary where P(y = 1|x) = P(y = 2|x).
Naive Bayes is a linear classifier

1. Suppose that y_i ∈ {−1, +1} and the features are multinomial.
We can show that

h(x) = argmax_y P(y) ∏_{α=1}^d P(x_α | y) = sign(w^T x + b)

That is,
w^T x + b > 0 ⇐⇒ h(x) = +1.

Naive Bayes is a linear classifier


As before, we define P(xα |y = +1) ∝ θα+ and P(Y = +1) = π+ :

[w]α = log(θα+ ) − log(θα− ) (8)


b = log(π+ ) − log(π− ) (9)

If we use the above to do classification, we can compute for


w> · x + b
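To see the linearity concretely, here is a small sketch (my own, with invented θ and π values) that builds w and b from equations (8) and (9) and checks that the sign of w^T x + b agrees with comparing the two Naive Bayes class scores:

```python
import numpy as np

# Hypothetical multinomial parameters for a 3-word vocabulary.
theta_plus  = np.array([0.5, 0.3, 0.2])   # P(word alpha | y = +1)
theta_minus = np.array([0.2, 0.3, 0.5])   # P(word alpha | y = -1)
pi_plus, pi_minus = 0.4, 0.6              # class priors

w = np.log(theta_plus) - np.log(theta_minus)    # equation (8)
b = np.log(pi_plus) - np.log(pi_minus)          # equation (9)

x = np.array([4, 1, 2])                         # word counts of a new document

linear_score = w @ x + b
nb_log_plus  = x @ np.log(theta_plus) + np.log(pi_plus)    # log of the +1 class score
nb_log_minus = x @ np.log(theta_minus) + np.log(pi_minus)  # log of the -1 class score

print(linear_score > 0, nb_log_plus > nb_log_minus)        # always the same truth value
```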

Naive Bayes is a linear classifier
Simplifying this further leads to

w^T x + b > 0
⇐⇒ Σ_{α=1}^d [x]_α (log(θ_α+) − log(θ_α−)) + log(π_+) − log(π_−) > 0    (plugging in the definitions of w and b)
⇐⇒ exp( Σ_{α=1}^d [x]_α (log(θ_α+) − log(θ_α−)) + log(π_+) − log(π_−) ) > 1    (exponentiating both sides)
⇐⇒ exp( Σ_{α=1}^d log((θ_α+)^{[x]_α}) + log(π_+) ) / exp( Σ_{α=1}^d log((θ_α−)^{[x]_α}) + log(π_−) ) > 1    (because a log(b) = log(b^a) and exp(a − b) = e^a / e^b)
eb
Naive Bayes is a linear classifier
Simplifying this further leads to

⇐⇒ ( ∏_{α=1}^d (θ_α+)^{[x]_α} · π_+ ) / ( ∏_{α=1}^d (θ_α−)^{[x]_α} · π_− ) > 1    (because exp(log(a)) = a and e^{a+b} = e^a e^b)
⇐⇒ ( ∏_{α=1}^d P([x]_α | Y = +1) · π_+ ) / ( ∏_{α=1}^d P([x]_α | Y = −1) · π_− ) > 1    (because P([x]_α | Y = −1) = (θ_α−)^{[x]_α})
⇐⇒ ( P(x | Y = +1) π_+ ) / ( P(x | Y = −1) π_− ) > 1    (by the Naive Bayes assumption)
⇐⇒ P(Y = +1 | x) / P(Y = −1 | x) > 1    (by Bayes rule; π_+ = P(Y = +1))
⇐⇒ P(Y = +1 | x) > P(Y = −1 | x)
⇐⇒ argmax_y P(Y = y | x) = +1

i.e. the point x lies on the positive side of the hyperplane iff Naive Bayes predicts +1.
Naive Bayes is a linear classifier

2. In the case of continuous features (Gaussian Naive Bayes), we can show that

P(y | x) = 1 / (1 + e^{−y (w^T x + b)})

This model is also known as logistic regression. NB and LR produce asymptotically the same model if the Naive Bayes assumption holds.
