
Naïve Bayes

- Supervised Learning (3) -

I239 Machine Learning


Agenda
• Discriminative vs Generative
• Conditional probability
• Bayesʼ rule
• Naïve Bayes Model
– Training
– Smoothing
• Exercise

Discriminative vs Generative
So far, the methods we have tried attempt to learn 𝑝(𝑦 ∣ 𝒙) directly.

Given an input 𝒙, we try to map directly to the output 𝑦.

Any algorithm that does this is called a discriminative learning algorithm.

Another class of algorithms instead tries to model 𝑝(𝒙 ∣ 𝑦) and 𝑝(𝑦).

Such methods are called generative learning algorithms.

Discriminative vs Generative
• Discriminative model
– e.g., SVM
– Learns the boundary between classes
– Learns it directly from the given data

• Generative model
– Today's topic!
– Learns a model as a probability distribution
– Uses Bayesʼ rule to calculate the posterior over classes

For more info:
https://proceedings.neurips.cc/paper/2001/file/7b7a53e239400a13bd6be6c91c4f6c4e-Paper.pdf

Why Generative ML?
• Assume the data is generated by a certain probability rule

[Diagram: a probability model 𝑃(𝑥) generates the observable data]

– Uncertainty from observation noise
– Uncertainty from unknown causality
• How do we represent an uncertain phenomenon?
– Probability theory

Conditional Probability
• The probability that A occurs, given that B has occurred, is called the conditional probability of A given B
• 𝑃(𝐴 ∣ 𝐵) = 𝑃(𝐴 ∩ 𝐵) / 𝑃(𝐵), if 𝑃(𝐵) ≠ 0

[Venn diagram: the probability of A restricted to the condition that B has occurred]

Example 1
• A bowl contains five balls: two red and three blue. Randomly select two balls (without replacement) and define
– A: 2nd ball is red
– B: 1st ball is blue

𝑃(𝐴 ∣ 𝐵) = 𝑃(𝐴 ∩ 𝐵) / 𝑃(𝐵) = (3/5 · 2/4) / (3/5) = 1/2
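
As a quick check (added, not part of the original slides), a small Python enumeration of all ordered draws reproduces this value:

import itertools

balls = ["R", "R", "B", "B", "B"]
draws = list(itertools.permutations(range(len(balls)), 2))  # ordered pairs of distinct balls

b = sum(1 for i, j in draws if balls[i] == "B")                       # 1st ball is blue
ab = sum(1 for i, j in draws if balls[i] == "B" and balls[j] == "R")  # 1st blue and 2nd red

print((ab / len(draws)) / (b / len(draws)))  # P(A|B) = 0.5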

Example 2
• Toss a fair coin twice. Define
– A: Head on 2nd toss
– B: Head on 1st toss

𝑃(𝐴 ∣ 𝐵) = 𝑃(𝐴 ∩ 𝐵) / 𝑃(𝐵) = (1/2 · 1/2) / (1/2) = 1/2

Independent: whether or not the 1st event occurs, the probability that the 2nd event occurs does not change

Law of Total Probability
• Let 𝑆₁, 𝑆₂, ⋯, 𝑆ₖ be mutually exclusive and exhaustive events (that is, one and only one must happen).
• Then, the probability of any event A can be written as
– 𝑃(𝐴) = 𝑃(𝐴 ∩ 𝑆₁) + 𝑃(𝐴 ∩ 𝑆₂) + ⋯ + 𝑃(𝐴 ∩ 𝑆ₖ)
  = 𝑃(𝑆₁)𝑃(𝐴|𝑆₁) + 𝑃(𝑆₂)𝑃(𝐴|𝑆₂) + ⋯ + 𝑃(𝑆ₖ)𝑃(𝐴|𝑆ₖ)

[Venn diagram: A overlapping a partition 𝑆₁, 𝑆₂, 𝑆₃ of the sample space]
Bayesʼ Rule
[Diagram: an unknown cause x generates the observable data y through a noisy process; inference goes from the known y back to the unknown x]

• 𝑃(𝑥 ∣ 𝑦) = 𝑃(𝑥 ∩ 𝑦) / 𝑃(𝑦) = 𝑃(𝑦 ∣ 𝑥)𝑃(𝑥) / 𝑃(𝑦)
• Posterior probability = 𝑃(𝑥 ∣ 𝑦) = 𝑃(Cause ∣ Effect)

Example of Bayesʼ Rule
• Probability that a person has the disease, given the diagnosis result
• P(disease ∣ positive) = P(positive ∣ disease) P(disease) / P(positive)
• P(¬disease ∣ positive) = P(positive ∣ ¬disease) P(¬disease) / P(positive)

P(disease) = 0.01, P(¬disease) = 0.99
P(positive ∣ disease) = 0.98, P(negative ∣ disease) = 0.02
P(positive ∣ ¬disease) = 0.03, P(negative ∣ ¬disease) = 0.97

Example of Bayesʼ Rule (cont.)
Since the class 𝑥 is discrete, we have

𝑝(𝑦) = Σᵢ 𝑝(𝑦 ∣ 𝑥 = 𝑥ᵢ) 𝑝(𝑥 = 𝑥ᵢ)   (the law of total probability, giving the denominator of Bayesʼ rule)

If 𝑥 were continuous, we would just have an integral instead of a sum.

Example of Bayesʼ Rule (cont.)
• P(disease) P(positive ∣ disease) = 0.01 × 0.98 = 0.0098
• P(¬disease) P(positive ∣ ¬disease) = 0.99 × 0.03 = 0.0297
• P(positive) = 0.0098 + 0.0297 = 0.0395

• P(disease ∣ positive) = 0.0098 / 0.0395 ≈ 0.248
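
The same arithmetic in a few lines of Python (added, not part of the original slides):

p_disease = 0.01
p_pos_given_disease = 0.98
p_pos_given_no_disease = 0.03

# Law of total probability for the evidence P(positive)
p_pos = p_disease * p_pos_given_disease + (1 - p_disease) * p_pos_given_no_disease

posterior = p_disease * p_pos_given_disease / p_pos
print(round(p_pos, 4), round(posterior, 3))  # 0.0395 0.248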

What should we do?
• 𝑃(𝑥 ∣ 𝑦) = 𝑃(𝑥 ∩ 𝑦) / 𝑃(𝑦) = 𝑃(𝑦 ∣ 𝑥)𝑃(𝑥) / 𝑃(𝑦)
– Estimate 𝑃(𝑥) by the frequency of class 𝑥 in the data
– How do we estimate 𝑃(𝑦|𝑥)?
• The probability that data 𝑦 is generated from model 𝑥
• Learning 𝑃(𝑦|𝑥) is difficult when there are many features

[Diagram: the class prior 𝑃(𝑥) is estimated from the observable data 𝑦, and a model of how each class generates the data is learned]

Naïve Bayes Model
• Naïve Assumption: “Class conditional independence”
– Conditional independence:
𝑃(𝑦₁ ∩ 𝑦₂ ∩ ⋯ ∩ 𝑦ₙ ∣ 𝑥) = 𝑃(𝑦₁|𝑥) · 𝑃(𝑦₂|𝑥) ⋯ 𝑃(𝑦ₙ|𝑥)
– If the i-th feature is categorical:
• 𝑃(𝑦ᵢ|𝑥) is estimated as the relative frequency of records having value 𝑦ᵢ for the i-th feature within class 𝑥
– If the i-th feature is continuous:
• 𝑃(𝑦ᵢ|𝑥) is estimated through a Gaussian density function
– Computationally easy in both cases (see the sketch below)
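
As an added illustration (not from the original slides), a minimal sketch of both estimators in Python, using a small hypothetical dataset with one categorical and one continuous feature:

import math

# Hypothetical records: (categorical feature, continuous feature, class)
data = [("sunny", 29.5, "No"), ("rain", 21.0, "Yes"), ("sunny", 27.0, "No"),
        ("overcast", 22.5, "Yes"), ("rain", 19.0, "Yes")]

def p_categorical(value, cls):
    """Relative frequency of `value` among records of class `cls`."""
    in_class = [r for r in data if r[2] == cls]
    return sum(1 for r in in_class if r[0] == value) / len(in_class)

def p_gaussian(value, cls):
    """Gaussian density with mean/variance fitted on records of class `cls`."""
    xs = [r[1] for r in data if r[2] == cls]
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return math.exp(-(value - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

print(p_categorical("sunny", "No"))   # 1.0 on this tiny sample
print(p_gaussian(20.0, "Yes"))        # a density value, not a probability mass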
Naïve Bayes Model (cont.)
• Naïve Assumption: “Class conditional independence”
– Event 𝒚 is represented by some features: 𝒚 = <𝑦₁, ⋯, 𝑦ₙ>
– Ex. Taro = <female, 40.5-> in <Gender, HoursWorked>. Is Taro rich or ¬rich?
– Number of parameters for 𝑃(𝑦₁ ∩ 𝑦₂ ∩ ⋯ ∩ 𝑦ₙ ∣ 𝑥)
• 2(2ⁿ − 1)
– Number of parameters for 𝑃(𝑦₁|𝑥) · 𝑃(𝑦₂|𝑥) ⋯ 𝑃(𝑦ₙ|𝑥)
• 2𝑛

Naïve Bayes Model (cont.)
Naive Bayes attempts to reduce the number of parameters required using the (very strong but very useful) assumption that the features are conditionally independent given the class 𝑥

𝑝(𝒚 ∣ 𝑥) = 𝑝(𝑦₁, 𝑦₂, …, 𝑦ₙ ∣ 𝑥)
         = 𝑝(𝑦₁ ∣ 𝑥) 𝑝(𝑦₂ ∣ 𝑦₁, 𝑥) ⋯ 𝑝(𝑦ₙ ∣ 𝑦₁, …, 𝑦ₙ₋₁, 𝑥)    (by the chain rule: (2ⁿ − 1) × 2 parameters)
         ≈ ∏ᵢ 𝑝(𝑦ᵢ ∣ 𝑥)    (by the naïve assumption: 2𝑛 parameters)

Note that if the variables 𝑦ᵢ are binary, we only need 2 parameters per feature:
𝜙ᵢ|ₓ₌₁ = 𝑝(𝑦ᵢ = 1 ∣ 𝑥 = 1) and 𝜙ᵢ|ₓ₌₀ = 𝑝(𝑦ᵢ = 1 ∣ 𝑥 = 0)
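
For concreteness (an added note, not on the slides), with n = 10 binary features the two parameter counts are:

n = 10
full_joint = 2 * (2 ** n - 1)   # a full joint table over 2^n feature vectors, for each of 2 classes
naive = 2 * n                   # one Bernoulli parameter per feature per class
print(full_joint, naive)        # 2046 20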

Data for Example
Day Outlook Temperature Humidity Wind PlayTennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Naïve Bayes Training
• Training in Naïve Bayes is easy (see the sketch below):
– Estimate 𝑃(𝑋 = 𝑥) as the fraction of records with class 𝑥
• 𝑃(𝑥) = count(𝑥) / (number of records)
– Estimate 𝑃(𝑦ᵢ ∣ 𝑥) as the fraction of records with 𝑦ᵢ within class 𝑥
• 𝑃(𝑦ᵢ ∣ 𝑥) = count(𝑦ᵢ ∧ 𝑥) / count(𝑥)
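
A minimal Python sketch of this counting (added for illustration; the records are the PlayTennis table above, encoded by hand):

from collections import Counter, defaultdict

# (Outlook, Temperature, Humidity, Wind, PlayTennis)
records = [
    ("Sunny", "Hot", "High", "Weak", "No"),    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

class_counts = Counter(r[-1] for r in records)
priors = {c: n / len(records) for c, n in class_counts.items()}   # P(x): {'No': 0.357..., 'Yes': 0.643...}

cond_counts = defaultdict(Counter)                                # count(y_i ∧ x) per feature index i
for r in records:
    for i, value in enumerate(r[:-1]):
        cond_counts[(i, r[-1])][value] += 1

def p_cond(i, value, cls):
    """P(y_i = value | x = cls) as a relative frequency."""
    return cond_counts[(i, cls)][value] / class_counts[cls]

print(priors)
print(p_cond(0, "Sunny", "Yes"))   # 2/9 ≈ 0.222, matching the table on the next slide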

NB Classifier Example (cont.)
• Given a training set, compute the probabilities
– P(P) = 9/14, P(N) = 5/14

Outlook P N Humidity P N
Sunny 2/9 3/5 High 3/9 4/5
Overcast 4/9 0 Normal 6/9 1/5
Rain 3/9 2/5

Temp. P N Windy P N
Hot 2/9 2/5 Strong 3/9 3/5
Mild 4/9 2/5 Weak 6/9 5/5
Cool 3/9 1/5

NB Classifier Example (cont.)
• Predict whether tennis is played on a day with conditions
– y = <sunny, cool, high, strong>
– compute P(x ∣ sunny, cool, high, strong) using the training data (verified in the short sketch below)
• P(Yes ∣ sunny, cool, high, strong)
∝ P(sunny∣Yes) P(cool∣Yes) P(high∣Yes) P(strong∣Yes) P(Yes)
= (2/9)(3/9)(3/9)(3/9)(9/14) ≃ 0.0053
• P(No ∣ sunny, cool, high, strong)
∝ P(sunny∣No) P(cool∣No) P(high∣No) P(strong∣No) P(No)
= (3/5)(1/5)(4/5)(3/5)(5/14) ≃ 0.0206
• Normalization
– Yes: 0.0053/(0.0053+0.0206) ≈ 20%
– No: 0.0206/(0.0053+0.0206) ≈ 80%
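
A quick arithmetic check of these products (added; the probabilities are read off the tables on the previous slide):

score_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
score_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

total = score_yes + score_no
print(round(score_yes, 4), round(score_no, 4))                   # 0.0053 0.0206
print(round(score_yes / total, 2), round(score_no / total, 2))   # 0.2 0.8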
Smoothing
• In practice, some of the counts can be zero
– Zero conditional probability problem
• Fix this by adding “virtual” counts:
– 𝑃(𝑦ᵢ ∣ 𝑥) = (count(𝑦ᵢ ∧ 𝑥) + 1) / (count(𝑥) + 𝑚), where 𝑚 is the number of values the feature can take
– Laplace smoothing
– Ex. 𝑃(Overcast ∣ No) = (0 + 1) / (5 + 3) = 1/8

Why? Some feature values may never appear in combination with a given class in the training data.


This is the form used for binary features (Multivariate Bernoulli Naïve Bayes). It is also quite natural to extend the same logic to Multinomial Naïve Bayes.
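
A small sketch of the smoothed estimate in Python (added for illustration; the counts are the ones used in the Overcast example above):

def smoothed(count_joint, count_class, n_values, alpha=1):
    """Laplace-smoothed estimate of P(y_i | x) for a feature with n_values possible values."""
    return (count_joint + alpha) / (count_class + alpha * n_values)

# P(Overcast | No): Overcast never occurs together with No, so the raw estimate would be 0/5.
print(smoothed(count_joint=0, count_class=5, n_values=3))  # 0.125 = 1/8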

Algorithm (Spam Classifier)
Let V be the set of unique tokens in the training set.
Algorithm 1: Bernoulli Naïve Bayes Classifier
for ∀u ∈ V do
    Compute and store p(u ∣ S) ← (number of spam emails containing u + 1) / (number of spam emails + 2)
    Compute and store p(u ∣ ¬S) ← (number of NOT spam emails containing u + 1) / (number of NOT spam emails + 2)
end
Compute p(S) ← |spam emails| / (|spam emails| + |NOT spam emails|) and p(¬S) ← 1 − p(S)
for each email e ∈ test set do
    Create the set of distinct words w₁, …, wₙ in e, ignoring words that are not in V
    c_pred ← argmax_{c∈C} p(c) ∏ᵢ₌₁ⁿ p(wᵢ ∣ c)
end
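
A compact Python rendering of Algorithm 1 (an added sketch, not part of the slides; the tiny training corpus below is hypothetical):

def train_bernoulli_nb(emails):
    """emails: list of (text, label) with label 'spam' or 'ham'."""
    vocab = {w for text, _ in emails for w in text.split()}
    spam = [set(t.split()) for t, y in emails if y == "spam"]
    ham = [set(t.split()) for t, y in emails if y == "ham"]
    p_u_spam = {u: (sum(u in d for d in spam) + 1) / (len(spam) + 2) for u in vocab}
    p_u_ham = {u: (sum(u in d for d in ham) + 1) / (len(ham) + 2) for u in vocab}
    p_spam = len(spam) / (len(spam) + len(ham))
    return vocab, p_spam, p_u_spam, p_u_ham

def classify(email, model):
    vocab, p_spam, p_u_spam, p_u_ham = model
    words = set(email.split()) & vocab            # ignore words not in V
    score_spam, score_ham = p_spam, 1 - p_spam
    for w in words:                               # multiply p(w|c) over present words, as in Algorithm 1
        score_spam *= p_u_spam[w]
        score_ham *= p_u_ham[w]
    return "spam" if score_spam > score_ham else "ham"

# Hypothetical toy corpus, just to show the calling convention.
train = [("win money now", "spam"), ("meeting notes attached", "ham"),
         ("win a prize", "spam"), ("lunch notes", "ham")]
model = train_bernoulli_nb(train)
print(classify("win a meeting", model))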
Multinomial Naïve Bayes
The Bag of Words Representation

Borrows an image from: http://stanford.edu/~jurafsky/slp3/slides/7_NB.pdf

Multinomial Naïve Bayes
𝐶_MNB = argmax_{c∈C} 𝑝(𝑐) ∏ᵢ₌₁ⁿ 𝑝(𝑤ᵢ ∣ 𝑐)

where
𝑝(𝑐) = doccount(𝐶 = 𝑐) / 𝑁_doc
𝑝(𝑤ᵢ ∣ 𝑐) = (count(𝑤ᵢ, 𝑐) + 𝛼) / (∑_{w∈V} count(𝑤, 𝑐) + 𝛼|V|)
where V represents the vocabulary (the set of all tokens)
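
A minimal Python sketch of these estimates and the argmax rule (added for illustration; the toy documents are hypothetical):

from collections import Counter

def train_multinomial_nb(docs, alpha=1.0):
    """docs: list of (list_of_tokens, class). Returns priors and smoothed token probabilities."""
    classes = {c for _, c in docs}
    vocab = {w for tokens, _ in docs for w in tokens}
    priors = {c: sum(1 for _, y in docs if y == c) / len(docs) for c in classes}
    token_counts = {c: Counter(w for tokens, y in docs if y == c for w in tokens) for c in classes}
    totals = {c: sum(token_counts[c].values()) for c in classes}
    def p_w_c(w, c):
        return (token_counts[c][w] + alpha) / (totals[c] + alpha * len(vocab))
    return classes, vocab, priors, p_w_c

def classify(tokens, model):
    classes, vocab, priors, p_w_c = model
    scores = {}
    for c in classes:
        score = priors[c]
        for w in tokens:
            if w in vocab:                 # ignore unseen tokens
                score *= p_w_c(w, c)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy corpus.
docs = [(["cheap", "pills", "cheap"], "spam"), (["project", "meeting"], "ham"),
        (["cheap", "deal"], "spam"), (["meeting", "agenda", "notes"], "ham")]
model = train_multinomial_nb(docs)
print(classify(["cheap", "meeting"], model))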
Multinomial Naïve Bayes

[Training-document table from the original slide is not reproduced in this extraction]
c : Chinese, j : Japanese

Priors: p(c) = 3/4, p(j) = 1/4

Conditional probabilities:
p(Chinese ∣ c) = (5 + 1) / (8 + 6) = 6/14 = 3/7
p(Tokyo ∣ c) = (0 + 1) / (8 + 6) = 1/14
p(Japan ∣ c) = (0 + 1) / (8 + 6) = 1/14
p(Chinese ∣ j) = p(Tokyo ∣ j) = p(Japan ∣ j) = 2/9

Multinomial Naïve Bayes

c : Chinese
j : Japanese

Computing posteriors:
p(c ∣ d1) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
p(j ∣ d1) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001
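
As an arithmetic check (added; this assumes, consistently with the factors shown above, that the test document d1 contains "Chinese" three times plus "Tokyo" and "Japan"):

p_c = 3/4 * (3/7) ** 3 * (1/14) * (1/14)   # ≈ 0.0003  -> choose class c
p_j = 1/4 * (2/9) ** 3 * (2/9) * (2/9)     # ≈ 0.0001
print(round(p_c, 4), round(p_j, 4))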

Conclusions
• Naïve Bayes is based on a conditional independence assumption
– Training is easy and fast: it just requires counting feature values and classes in the data
– Testing is straightforward: just look up the tables or calculate the conditional probabilities
– Watch out for the zero conditional probability problem (use smoothing)
• A popular generative model
– Performance is competitive with most state-of-the-art methods, even when the independence assumption is violated
– Many successful applications, e.g., spam filtering

Exercise 1
• We know that 51% of the population is female and 49% is male; 8% of the females and 12% of the males are high risk.
• A single person is selected at random and found to be high risk.

1. What is the probability that this person is a high-risk female?

2. What is the probability that this person is a male?

Exercise 2
• Assume that we have the following set of emails classified as Spam or Ham.
Spam: “send us your password”
Ham: “send us your review”
Ham: “password review”
Spam: “review us ”
Spam: “send your password”
Spam: “send us your account”

Classify the following new email as Spam or Ham using Bernoulli Naïve Bayes (without smoothing).
New Email: “review us now”

Exercise 3
• Spam filtering examples
(numbers indicate the number of occurrences of each word)
w1 w2 w3 w4 w5 w6 w7 class
D1 1 2 1 1
Usual mail
D2 2 1 1 1
D3 1 1 1 2 Spam mail
• Now, we receive a new mail D = {w1=1, w6=1}.
Is this a usual mail or spam, according to the Multinomial Naïve Bayes classifier with smoothing (𝛼 = 1)?

Exercise 4
• Write pseudocode for implementing a Multinomial Naïve Bayes classifier by modifying Algorithm 1.

