Ke Chen
http://intranet.cs.man.ac.uk/mlo/comp20411/
• Background
• Probability Basics
• Probabilistic Classification
• Naïve Bayes
• Example: Play Tennis
• Relevant Issues
• Conclusions
Background
• There are three approaches to building a classifier
a) Model a classification rule directly
Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class memberships given input data
Example: multi-layered perceptron with the cross-entropy cost
c) Make a probabilistic model of data within each class
Examples: naïve Bayes, model-based classifiers
• a) and b) are examples of discriminative classification
• c) is an example of generative classification
• b) and c) are both examples of probabilistic classification
Probability Basics
• Prior, conditional and joint probability
– Prior probability: P(X)
– Conditional probability: P(X1 | X2), P(X2 | X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2 | X1) P(X1) = P(X1 | X2) P(X2)
– Independence: P(X2 | X1) = P(X2), P(X1 | X2) = P(X1), P(X1, X2) = P(X1) P(X2)
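A quick numeric sanity check of the product rule and of independence, using a made-up joint distribution over two binary variables (the table is an assumption, chosen so that X1 and X2 come out independent):

# Hypothetical joint distribution P(X1, X2) over two binary variables,
# constructed so that X1 and X2 are independent.
P = {(0, 0): 0.3, (0, 1): 0.3,
     (1, 0): 0.2, (1, 1): 0.2}

def marginal_x1(x1):
    # P(X1 = x1), summing the joint over X2
    return sum(p for (a, _), p in P.items() if a == x1)

def cond_x2_given_x1(x2, x1):
    # P(X2 = x2 | X1 = x1) = P(X1, X2) / P(X1)
    return P[(x1, x2)] / marginal_x1(x1)

# Product rule: P(X1, X2) = P(X2 | X1) P(X1)
assert abs(P[(1, 0)] - cond_x2_given_x1(0, 1) * marginal_x1(1)) < 1e-12

# Independence: P(X2 | X1) equals the marginal P(X2)
p_x2_is_0 = P[(0, 0)] + P[(1, 0)]
assert abs(cond_x2_given_x1(0, 1) - p_x2_is_0) < 1e-12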
• Bayesian rule: P(C | X) = P(X | C) P(C) / P(X), i.e. posterior = (likelihood × prior) / evidence
Example by Dieter Fox
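The worked example itself is not reproduced here; as a minimal sketch of the kind of single-step Bayesian update such an example illustrates (all numbers below are hypothetical, not taken from the original slide), consider a robot whose noisy sensor returns a reading z that bears on whether a door is open:

# Bayes' rule with hypothetical numbers: update the belief that a door
# is open after one sensor reading z.
p_open = 0.5            # prior P(open)
p_z_given_open = 0.6    # likelihood P(z | open)
p_z_given_closed = 0.3  # likelihood P(z | not open)

# Evidence by total probability: P(z) = P(z|open)P(open) + P(z|not open)P(not open)
p_z = p_z_given_open * p_open + p_z_given_closed * (1 - p_open)

# Posterior: P(open | z) = P(z | open) P(open) / P(z)
p_open_given_z = p_z_given_open * p_open / p_z
print(p_open_given_z)   # 0.666..., up from the 0.5 prior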
Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model: model the posterior P(C | X) directly, where C ∈ {c1, ..., cL} and X = (X1, ..., Xn)
– Generative model: model the class-conditional distribution P(X | C) for each class, where C ∈ {c1, ..., cL} and X = (X1, ..., Xn)
Feature Histograms
[Figure: class-conditional feature histograms P(x) for two classes, C1 and C2, plotted against the feature value x. Slide by Stephen Marsland]
Naïve Bayes
• Bayes classification
P(C | X) ∝ P(X | C) P(C) = P(X1, ..., Xn | C) P(C)
– Naïve assumption: the attributes are conditionally independent given the class, so
P(X1, ..., Xn | C) = P(X1 | C) × P(X2 | C) × ⋯ × P(Xn | C)
– MAP (maximum a posteriori) rule: assign x to the class c* that maximizes P(x | C = c) P(C = c)
Naïve Bayes
• Naïve Bayes Algorithm (for discrete input attributes)
– Learning Phase: Given a training set S,
For each target value c_i (c_i ∈ {c1, ..., cL}):
P̂(C = c_i) ← estimate P(C = c_i) from the examples in S;
For every attribute value a_jk of each attribute X_j (j = 1, ..., n; k = 1, ..., N_j):
P̂(X_j = a_jk | C = c_i) ← estimate P(X_j = a_jk | C = c_i) from the examples in S;
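A minimal Python sketch of the learning phase above plus the matching MAP test phase, assuming discrete attribute vectors stored as tuples (the function and variable names are illustrative, not from the original slides):

from collections import Counter

def train_naive_bayes(examples):
    """Learning phase: estimate P(C=ci) and P(Xj=ajk | C=ci) by counting.

    `examples` is a list of (attributes, label) pairs, where `attributes`
    is a tuple of discrete values (x1, ..., xn).
    """
    class_counts = Counter(label for _, label in examples)
    # cond_counts[(j, value, label)] = #examples with Xj=value and C=label
    cond_counts = Counter()
    for attrs, label in examples:
        for j, value in enumerate(attrs):
            cond_counts[(j, value, label)] += 1

    n_total = len(examples)
    priors = {c: class_counts[c] / n_total for c in class_counts}
    conditionals = {
        key: count / class_counts[key[2]]   # key = (j, value, label)
        for key, count in cond_counts.items()
    }
    return priors, conditionals

def classify(x, priors, conditionals):
    """Test phase: MAP rule, pick the class maximizing P(C) * prod_j P(Xj|C)."""
    def score(c):
        p = priors[c]
        for j, value in enumerate(x):
            p *= conditionals.get((j, value, c), 0.0)  # unseen value -> 0
        return p
    return max(priors, key=score)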
Example
• Example: Play Tennis
Learning Phase
– Estimate the class prior P(Play=b) and one conditional probability table per attribute:
P(Outlook=o | Play=b), P(Temperature=t | Play=b), P(Humidity=h | Play=b), P(Wind=w | Play=b)
– Test Phase: given the unseen instance x′ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong), apply the MAP rule:
P(Yes | x′) ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No | x′) ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
– Since 0.0206 > 0.0053, the MAP rule labels x′ as No (don't play tennis).
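The two scores are easy to reproduce. The factors below are the relative-frequency estimates from the standard 14-example Play-Tennis training set (the individual counts are the textbook ones and are assumed here; only the final 0.0053 and 0.0206 appear above):

# Unnormalized posterior scores for x' = (Sunny, Cool, High, Strong).
# Each factor is a relative-frequency estimate from the training set.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # attribute terms, then P(Play=Yes)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # attribute terms, then P(Play=No)
print(round(p_yes, 4), round(p_no, 4))           # 0.0053 0.0206 -> predict No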
Relevant Issues
• Violation of Independence Assumption
– For many real-world tasks, P(X1, ..., Xn | C) ≠ P(X1 | C) ⋯ P(Xn | C)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem
– If no training example contains the attribute value X_j = a_jk, then P̂(X_j = a_jk | C = c_i) = 0
– In this circumstance, the whole product P̂(x | c_i) = P̂(x_1 | c_i) ⋯ P̂(a_jk | c_i) ⋯ P̂(x_n | c_i) = 0 during test
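A standard remedy (not spelled out in the text above) is to smooth the frequency estimates so that no conditional probability is exactly zero; Laplace (add-one) smoothing is the simplest instance. A minimal sketch:

def smoothed_conditional(count_jk_i, count_i, n_values, alpha=1.0):
    # Laplace-smoothed estimate of P(Xj = ajk | C = ci):
    #   count_jk_i : number of class-ci examples with Xj = ajk
    #   count_i    : number of class-ci examples
    #   n_values   : Nj, the number of possible values of attribute Xj
    #   alpha=1.0  : classic add-one smoothing
    return (count_jk_i + alpha) / (count_i + alpha * n_values)

# An attribute value never seen with class ci no longer zeroes the product:
print(smoothed_conditional(0, 9, 3))   # 1/12 = 0.0833... instead of 0.0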
Relevant Issues
• Continuous-valued Input Attributes
– A continuous attribute takes uncountably many values, so its conditional probabilities cannot be tabulated
– Instead, the conditional probability is modelled with a normal (Gaussian) distribution:
P̂(X_j | C = c_i) = 1 / (√(2π) σ_ji) · exp( −(X_j − μ_ji)² / (2 σ_ji²) )
μ_ji: mean (average) of the attribute values X_j of the examples for which C = c_i
σ_ji: standard deviation of the attribute values X_j of the examples for which C = c_i
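A minimal Python sketch of this continuous case, fitting μ_ji and σ_ji from one class's attribute values and evaluating the Gaussian density (the readings are made up for illustration):

import math

def fit_class_gaussian(values):
    # Estimate mu_ji and sigma_ji from the attribute values of one class.
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var)

def gaussian_likelihood(x, mu, sigma):
    # P-hat(Xj = x | C = ci) under the per-class normal model.
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical temperature readings for the examples of one class:
mu, sigma = fit_class_gaussian([20.0, 22.5, 19.0, 21.5])
print(gaussian_likelihood(21.0, mu, sigma))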
Conclusions
• Naïve Bayes is based on the conditional independence assumption
– Training is very easy and fast: it only requires estimating each attribute's conditional probabilities within each class, separately
– Testing is straightforward: just look up the learned tables, or evaluate the conditional probabilities under the fitted normal distributions
• A popular generative model
– Performance is competitive with most state-of-the-art classifiers, even when the independence assumption is violated
– Many successful applications, e.g. spam mail filtering
– Apart from classification, naïve Bayes can do more…