
Naïve Bayes Classifier

Adapted from slides by Ke Chen (University of Manchester) and YangQiu Song (MSRA)

Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)

Discriminative classifiers:
1. Assume some functional form for P(Y|X)
2. Estimate parameters of P(Y|X) directly from training data

Generative classifiers (also called ‘informative’ by Rubinstein & Hastie):
1. Assume some functional form for P(X|Y), P(Y)
2. Estimate parameters of P(X|Y), P(Y) directly from training data
3. Use Bayes rule to calculate P(Y|X = xi)
Bayes Formula
    P(Y|X) = P(X|Y) P(Y) / P(X)

Generative Model
[Diagram: the class label generates the observed features (Color, Size, Texture, Weight, …)]

Discriminative Model
[Diagram: the observed features (Color, Size, Texture, Weight, …) are fed directly into a discriminative classifier such as Logistic Regression, which outputs the class label]
Comparison
• Generative models
– Assume some functional form for P(X|Y), P(Y)
– Estimate parameters of P(X|Y), P(Y) directly from
training data
– Use Bayes rule to calculate P(Y|X= x)
• Discriminative models
– Directly assume some functional form for P(Y|X)
– Estimate parameters of P(Y|X) directly from
training data
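
A minimal Python sketch of this comparison (it assumes scikit-learn is available; the synthetic two-class data is purely illustrative):

# Generative (Gaussian naive Bayes) vs. discriminative (logistic regression) on the same data.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two classes whose two features (e.g., size and weight) have different means.
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
               rng.normal(loc=2.0, scale=1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

gen = GaussianNB().fit(X, y)            # models P(X|Y) and P(Y), then applies Bayes rule
disc = LogisticRegression().fit(X, y)   # models P(Y|X) directly

x_new = np.array([[1.0, 1.0]])
print("generative P(Y|x):    ", gen.predict_proba(x_new))
print("discriminative P(Y|x):", disc.predict_proba(x_new))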
Probability Basics
• Prior, conditional and joint probability for random
variables
– Prior probability: P(X)
– Conditional probability: P(X1|X2), P(X2|X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2|X1)P(X1) = P(X1|X2)P(X2)
– Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1)P(X2)
• Bayesian Rule
    P(C|X) = P(X|C) P(C) / P(X)
    (Posterior = Likelihood × Prior / Evidence)
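
A tiny numeric sketch of the rule (the likelihoods and priors are made-up illustrative values):

# Bayes rule: posterior = likelihood * prior / evidence (illustrative numbers only).
prior = {"c1": 0.6, "c2": 0.4}          # P(C)
likelihood = {"c1": 0.2, "c2": 0.5}     # P(X = x | C)

evidence = sum(likelihood[c] * prior[c] for c in prior)             # P(X = x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)   # {'c1': 0.375, 'c2': 0.625}; the posteriors sum to 1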

Probability Basics
• Quiz: We have two six-sided dice. When they are rolled, the following events can occur: (A) die 1 lands on side “3”, (B) die 2 lands on side “1”, and (C) the two dice sum to eight. Answer the following questions:

1) P(A) = ?
2) P(B) = ?
3) P(C) = ?
4) P(A|B) = ?
5) P(C|A) = ?
6) P(A, B) = ?
7) P(A, C) = ?
8) Does P(A, C) equal P(A) × P(C)?
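
One way to check the answers is a quick Monte Carlo simulation; a minimal Python sketch (the exact values can also be obtained by counting the 36 equally likely outcomes):

# Estimate the quiz probabilities by simulating many rolls of the two dice.
import random

N = 200_000
a = b = c = ab = ac = 0
for _ in range(N):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    A, B, C = (d1 == 3), (d2 == 1), (d1 + d2 == 8)
    a += A; b += B; c += C; ab += (A and B); ac += (A and C)

print("P(A)   ~", a / N)               # ~ 1/6
print("P(B)   ~", b / N)               # ~ 1/6
print("P(C)   ~", c / N)               # ~ 5/36
print("P(A|B) ~", (ab / N) / (b / N))  # ~ 1/6 (A and B are independent)
print("P(C|A) ~", (ac / N) / (a / N))  # ~ 1/6
print("P(A,B) ~", ab / N)              # ~ 1/36
print("P(A,C) ~", ac / N)              # ~ 1/36, while P(A)*P(C) = 5/216, so A and C are not independent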
Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model: P(C|X), C = c1, …, cL, X = (X1, …, Xn)
  [Diagram: a single discriminative probabilistic classifier takes the input x = (x1, x2, …, xn) and outputs the posteriors P(c1|x), P(c2|x), …, P(cL|x)]
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
– Generative model: P(X|C), C = c1, …, cL, X = (X1, …, Xn)
  [Diagram: one generative probabilistic model per class; the model for class ci takes x = (x1, x2, …, xn) and outputs the likelihood P(x|ci), i = 1, …, L]
Probabilistic Classification
• MAP classification rule
– MAP: Maximum A Posteriori
– Assign x to c* if
    P(C = c*|X = x) > P(C = c|X = x),  for all c ≠ c*, c = c1, …, cL

• Generative classification with the MAP rule
– Apply the Bayesian rule to convert the class-conditional likelihoods into posterior probabilities:
    P(C = ci|X = x) = P(X = x|C = ci) P(C = ci) / P(X = x)
                    ∝ P(X = x|C = ci) P(C = ci),   for i = 1, 2, …, L
– Then apply the MAP rule
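
A minimal sketch of this two-step recipe (the class-conditional likelihoods and priors below are illustrative placeholders):

# Generative classification with the MAP rule:
# the posterior P(C = ci | X = x) is proportional to P(X = x | C = ci) * P(C = ci),
# so the classifier returns the class with the largest product.
likelihood = {"c1": 0.05, "c2": 0.20, "c3": 0.10}   # P(X = x | C = ci) for the observed x
prior = {"c1": 0.50, "c2": 0.30, "c3": 0.20}        # P(C = ci)

scores = {c: likelihood[c] * prior[c] for c in prior}   # unnormalized posteriors
c_star = max(scores, key=scores.get)
print(scores)                # {'c1': 0.025, 'c2': 0.06, 'c3': 0.02}
print("MAP label:", c_star)  # 'c2'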
Naïve Bayes
• Bayes classification
    P(C|X) ∝ P(X|C) P(C) = P(X1, …, Xn|C) P(C)
  Difficulty: learning the joint probability P(X1, …, Xn|C)


• Naïve Bayes classification
– Assumption: all input attributes are conditionally independent given the class!
    P(X1, X2, …, Xn|C) = P(X1|X2, …, Xn; C) P(X2, …, Xn|C)
                       = P(X1|C) P(X2, …, Xn|C)
                       = P(X1|C) P(X2|C) ··· P(Xn|C)
– MAP classification rule: for x = (x1, x2, …, xn), assign x to c* if
    [P(x1|c*) ··· P(xn|c*)] P(c*) > [P(x1|c) ··· P(xn|c)] P(c),  for all c ≠ c*, c = c1, …, cL
Naïve Bayes
• Naïve Bayes Algorithm (for discrete input attributes)
– Learning Phase: Given a training set S,
    For each target value ci (ci = c1, …, cL)
      P̂(C = ci) ← estimate P(C = ci) with examples in S;
      For every attribute value xjk of each attribute Xj (j = 1, …, n; k = 1, …, Nj)
        P̂(Xj = xjk|C = ci) ← estimate P(Xj = xjk|C = ci) with examples in S;
    Output: conditional probability tables; for each Xj, Nj × L elements
– Test Phase: Given an unknown instance X’ = (a1, …, an),
    look up the tables to assign the label c* to X’ if
    [P̂(a1|c*) ··· P̂(an|c*)] P̂(c*) > [P̂(a1|c) ··· P̂(an|c)] P̂(c),  for all c ≠ c*, c = c1, …, cL

Example
• Example: Play Tennis
  [Table: 14 training examples with attributes Outlook, Temperature, Humidity and Wind, and the target PlayTennis (9 Yes, 5 No)]
Example
• Learning Phase
Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny        2/9       3/5        Hot            2/9       2/5
Overcast     4/9       0/5        Mild           4/9       2/5
Rain         3/9       2/5        Cool           3/9       1/5

Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
High         3/9       4/5        Strong         3/9       3/5
Normal       6/9       1/5        Weak           6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14
Example
• Test Phase
– Given a new instance,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up tables
P(Outlook=Sunny|Play=Yes) = 2/9 P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9 P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9 P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9 P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14

– MAP rule
    P(Yes|x’) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053
    P(No|x’) ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206

  Since P(Yes|x’) < P(No|x’), we label x’ as “No”.
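
The arithmetic can be checked directly by plugging the table values into the MAP rule; a short Python sketch:

# Verify the test-phase products for x' = (Sunny, Cool, High, Strong).
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)    # Outlook, Temperature, Humidity, Wind, prior
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

print(round(p_yes, 4), round(p_no, 4))            # 0.0053 0.0206
print("label:", "Yes" if p_yes > p_no else "No")  # No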


Relevant Issues
• Violation of the Independence Assumption
– For many real-world tasks, P(X1, …, Xn|C) ≠ P(X1|C) ··· P(Xn|C)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero Conditional Probability Problem
– If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk|C = ci) = 0
– In this circumstance, P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0 during the test phase
– As a remedy, estimate conditional probabilities with the m-estimate:
    P̂(Xj = ajk|C = ci) = (nc + m·p) / (n + m)
  nc : number of training examples for which Xj = ajk and C = ci
  n  : number of training examples for which C = ci
  p  : prior estimate (usually, p = 1/t for t possible values of Xj)
  m  : weight given to the prior (number of “virtual” examples, m ≥ 1)
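
A short sketch of this m-estimate (the counts are illustrative; with p = 1/t and m = t it reduces to Laplace, i.e. add-one, smoothing):

# m-estimate of a conditional probability P(Xj = a_jk | C = ci).
def m_estimate(n_c, n, p, m):
    """n_c: examples with Xj = a_jk and C = ci; n: examples with C = ci;
       p: prior estimate (e.g., 1/t for t possible values); m: weight of the prior."""
    return (n_c + m * p) / (n + m)

# Illustrative numbers: the value was never seen with this class (n_c = 0),
# the class has n = 9 examples, and the attribute has t = 3 possible values.
print(m_estimate(n_c=0, n=9, p=1/3, m=3))   # 0.0833... instead of 0
print(m_estimate(n_c=3, n=9, p=1/3, m=3))   # 0.3333...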
Relevant Issues
• Continuous-valued Input Attributes
– An attribute can take on infinitely many (continuous) values
– Conditional probability modeled with the normal (Gaussian) distribution
    P̂(Xj|C = ci) = 1 / (√(2π) σji) · exp( -(Xj - μji)² / (2σji²) )
  μji : mean (average) of the attribute values Xj over the examples for which C = ci
  σji : standard deviation of the attribute values Xj over the examples for which C = ci

– Learning Phase: for X = (X1, …, Xn), C = c1, …, cL,
  estimate μji and σji for each attribute Xj and each class ci
  Output: n × L normal distributions and P(C = ci), i = 1, …, L
– Test Phase: for X’ = (x1’, …, xn’)
  • Calculate conditional probabilities with all the normal distributions
  • Apply the MAP rule to make a decision
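
A minimal Python sketch of both phases for continuous attributes (the toy one-dimensional data is illustrative; no variance smoothing is applied):

# Gaussian naive Bayes: learning estimates (mu, sigma) per attribute and class plus P(C);
# testing evaluates the normal densities and applies the MAP rule.
import math
from collections import defaultdict

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def learn(examples):
    """examples: list of (feature_list, label)."""
    by_class = defaultdict(list)
    for x, c in examples:
        by_class[c].append(x)
    priors, params = {}, {}
    for c, rows in by_class.items():
        priors[c] = len(rows) / len(examples)
        for j in range(len(rows[0])):
            vals = [r[j] for r in rows]
            mu = sum(vals) / len(vals)
            sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals)) or 1e-9
            params[(j, c)] = (mu, sigma)
    return priors, params

def classify(x, priors, params):
    scores = {c: p * math.prod(normal_pdf(x[j], *params[(j, c)]) for j in range(len(x)))
              for c, p in priors.items()}
    return max(scores, key=scores.get)

# Toy data (illustrative): one continuous attribute, two classes.
data = [([1.0], "a"), ([1.2], "a"), ([0.8], "a"), ([3.0], "b"), ([3.3], "b"), ([2.7], "b")]
priors, params = learn(data)
print(classify([1.1], priors, params))   # 'a'
print(classify([2.9], priors, params))   # 'b'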

Conclusions
• Naïve Bayes is based on the conditional independence assumption
– Training is very easy and fast; it only requires considering each attribute in each class separately
– Testing is straightforward; it only requires looking up tables or calculating conditional probabilities with normal distributions
• A popular generative model
– Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate for a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…

Extra Slides

Naïve Bayes (1)
• Revisit:
    P(Y = y|X = x) = P(X = x|Y = y) P(Y = y) / P(X = x)
• Which is equal to:
    P(Y = y|X = x) = P(X = x|Y = y) P(Y = y) / Σy’ P(X = x|Y = y’) P(Y = y’)
• Naïve Bayes assumes conditional independence:
    P(X = x|Y = y) = Πj P(Xj = xj|Y = y)
• Then the inference of the posterior is:
    P(Y = y|X = x) ∝ P(Y = y) Πj P(Xj = xj|Y = y)

Naïve Bayes (2)
• Training: observations are multinomial; supervised, with label information
– Maximum Likelihood Estimation (MLE): estimate P(Y = y) and P(Xj = xj|Y = y) by relative frequencies in the training data
– Maximum a Posteriori (MAP): put a Dirichlet prior on the multinomial parameters, which adds pseudo-counts to the frequencies (smoothing)
• Classification: ŷ = argmaxy P(Y = y) Πj P(Xj = xj|Y = y)
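
A small sketch of the two estimators for one categorical attribute within one class (the counts are illustrative; the Dirichlet-prior estimate shown is the common add-alpha smoothed form, with alpha = 1 giving Laplace smoothing):

# MLE vs. Dirichlet-smoothed (add-alpha) estimates of P(Xj = v | Y = c) from per-class counts.
counts = {"sunny": 2, "overcast": 4, "rain": 3}   # illustrative counts n_{v,c}
n_c = sum(counts.values())

mle = {v: n / n_c for v, n in counts.items()}     # relative frequencies

alpha = 1.0                                       # pseudo-count per value
K = len(counts)
smoothed = {v: (n + alpha) / (n_c + K * alpha) for v, n in counts.items()}

print(mle)        # {'sunny': 0.222..., 'overcast': 0.444..., 'rain': 0.333...}
print(smoothed)   # {'sunny': 0.25, 'overcast': 0.4166..., 'rain': 0.333...}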
Naïve Bayes (3)
• What if we have continuous Xi?
• Generative training: estimate the class prior P(Y = y) and, for each attribute and class, the parameters of a class-conditional density (e.g., the mean μjy and variance σ²jy of a Gaussian)
• Prediction: ŷ = argmaxy P(Y = y) Πj P(Xj = xj|Y = y), with the class-conditional densities plugged in for the conditional probabilities
Naïve Bayes (4)
• Problems
– Features may overlap
– Features may not be independent
  • e.g., the size and weight of a tiger
– Uses joint distribution estimation (P(X|Y), P(Y)) to solve a conditional problem (P(Y|X = x))
• Can we discriminatively train?
– Logistic regression
– Regularization
– Gradient ascent
