
Naïve Bayes Classifier

Dr. Hussain Dawood

Modified from Chen


COMP24111 Machine Learning
Outline
• Background
• Probability Basics
• Probabilistic Classification
• Naïve Bayes
– Principle and Algorithms
– Example: Play Tennis

• Relevant Issues
• Summary

Bayesian Methods
• Our focus in this lecture: learning and classification methods based on
probability theory.
• Bayes theorem plays a critical role in probabilistic
learning and classification.
• Uses prior probability of each category given no
information about an item.
• Categorization produces a posterior probability
distribution over the possible categories given a
description of an item.



Background
• There are three methods to establish a classifier
a) Model a classification rule directly
Examples: k-NN, decision trees, perceptron, SVM
b) Model the probability of class memberships given input data
Example: perceptron with the cross-entropy cost
c) Make a probabilistic model of data within each class
Examples: naive Bayes, model based classifiers
• a) and b) are examples of discriminative classification
• c) is an example of generative classification
• b) and c) are both examples of probabilistic classification

Basic Probability Formulas
• Product rule
  P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
• Sum rule
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Bayes theorem
  P(h|D) = P(D|h)P(h) / P(D)
• Theorem of total probability: if the events A1, ..., An are mutually
  exclusive and their probabilities sum to 1, then
  P(B) = Σ_{i=1}^{n} P(B|Ai)P(Ai)



Bayes Theorem
• Given a hypothesis h and data D which bears on the hypothesis:
  P(h|D) = P(D|h)P(h) / P(D)
• P(h): independent probability of h: prior probability
• P(D): independent probability of D
• P(D|h): conditional probability of D given h:
likelihood
• P(h|D): conditional probability of h given D: posterior
probability



Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: P(X)
– Conditional probability: P(X1|X2), P(X2|X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2|X1)P(X1) = P(X1|X2)P(X2)
– Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1)P(X2)
• Bayesian rule
  P(C|X) = P(X|C)P(C) / P(X),  i.e.  Posterior = (Likelihood × Prior) / Evidence

Probability Basics
• Quiz: We have two six-sided dice. When they are rolled, the following events may
  occur: (A) die 1 lands on side “3”, (B) die 2 lands on side “1”, and (C) the two
  dice sum to eight. Answer the following questions:
  1) P(A) = ?
  2) P(B) = ?
  3) P(C) = ?
  4) P(A|B) = ?
  5) P(C|A) = ?
  6) P(A, B) = ?
  7) P(A, C) = ?
  8) Is P(A, C) equal to P(A) × P(C)?
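The quiz can be answered by brute-force enumeration. The following sketch (my own, not part of the slides) lists all 36 equally likely outcomes of two dice and computes each quantity exactly with fractions; it assumes the dice are fair.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))      # (die1, die2)

def P(event):
    """Probability of an event over the 36 equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == 3                # die 1 lands on "3"
B = lambda o: o[1] == 1                # die 2 lands on "1"
C = lambda o: o[0] + o[1] == 8         # the two dice sum to eight

P_A, P_B, P_C = P(A), P(B), P(C)                     # 1/6, 1/6, 5/36
P_AB = P(lambda o: A(o) and B(o))                    # 1/36
P_AC = P(lambda o: A(o) and C(o))                    # 1/36
P_A_given_B = P_AB / P_B                             # 1/6 (A and B are independent)
P_C_given_A = P_AC / P_A                             # 1/6
print(P_A, P_B, P_C, P_A_given_B, P_C_given_A, P_AB, P_AC)
print("P(A,C) == P(A)P(C)?", P_AC == P_A * P_C)      # False: A and C are dependent
```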
Probabilistic Classification
• Establishing a probabilistic model for classification
  – Discriminative model
    P(C|X),  C ∈ {c1, ..., cL},  X = (X1, ..., Xn)
    [Figure: a single discriminative probabilistic classifier maps an input
    x = (x1, x2, ..., xn) to the posterior probabilities
    P(c1|x), P(c2|x), ..., P(cL|x).]
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
  – Generative model
    P(X|C),  C ∈ {c1, ..., cL},  X = (X1, ..., Xn)
    [Figure: one generative probabilistic model per class; the model for class i
    takes an input x = (x1, x2, ..., xn) and outputs the class-conditional
    likelihood P(x|ci), for i = 1, ..., L.]
Probabilistic Classification
• MAP classification rule
  – MAP: Maximum A Posteriori
  – Assign x to c* if
    P(C = c*|X = x) > P(C = c|X = x),  for all c ≠ c*, c ∈ {c1, ..., cL}

• Generative classification with the MAP rule
  – Apply the Bayesian rule to convert the class-conditional likelihoods and
    priors into posterior probabilities:
    P(C = ci|X = x) = P(X = x|C = ci) P(C = ci) / P(X = x)
                    ∝ P(X = x|C = ci) P(C = ci),   for i = 1, 2, ..., L
  – Then apply the MAP rule

Naïve Bayes
• Bayes classification
  P(C|X) ∝ P(X|C)P(C) = P(X1, ..., Xn|C)P(C)
  Difficulty: learning the joint probability P(X1, ..., Xn|C)
• Naïve Bayes classification
  – Assumption: all input features are conditionally independent given the class!
    P(X1, X2, ..., Xn|C) = P(X1|X2, ..., Xn, C) P(X2, ..., Xn|C)
                         = P(X1|C) P(X2, ..., Xn|C)
                         = P(X1|C) P(X2|C) ··· P(Xn|C)
  – MAP classification rule: for x = (x1, x2, ..., xn), assign c* if
    [P(x1|c*) ··· P(xn|c*)] P(c*) > [P(x1|c) ··· P(xn|c)] P(c),
    for all c ≠ c*, c ∈ {c1, ..., cL}

Naïve Bayes

• Algorithm: Discrete-valued Features
  – Learning Phase: given a training set S,
      For each target value ci (i = 1, ..., L)
          P̂(C = ci) ← estimate P(C = ci) with examples in S;
          For every feature value xjk of each feature Xj (j = 1, ..., F; k = 1, ..., Nj)
              P̂(Xj = xjk | C = ci) ← estimate P(Xj = xjk | C = ci) with examples in S;
  – Test Phase: given an unknown instance X' = (a1, ..., an), assign the label c* if
      [P̂(a1|c*) ··· P̂(an|c*)] P̂(c*) > [P̂(a1|c) ··· P̂(an|c)] P̂(c),
      for all c ≠ c*, c ∈ {c1, ..., cL}
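To make the two phases concrete, here is a minimal Python sketch of counting-based training and MAP prediction for discrete features. It is my own illustration, not course code; the names train_naive_bayes and classify are invented for this example.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples, labels):
    """Learning phase: estimate P(C=ci) and P(Xj=xjk | C=ci) by counting."""
    n = len(labels)
    class_counts = Counter(labels)
    prior = {c: count / n for c, count in class_counts.items()}
    counts = defaultdict(Counter)          # counts[(j, c)][value] = #examples
    for x, c in zip(examples, labels):
        for j, value in enumerate(x):
            counts[(j, c)][value] += 1
    cond = {                                # cond[(j, value, c)] = P(Xj=value | C=c)
        (j, value, c): cnt / class_counts[c]
        for (j, c), value_counts in counts.items()
        for value, cnt in value_counts.items()
    }
    return prior, cond

def classify(x, prior, cond):
    """Test phase: MAP rule, pick the class maximising P(c) * prod_j P(xj | c)."""
    scores = {}
    for c, p_c in prior.items():
        score = p_c
        for j, value in enumerate(x):
            score *= cond.get((j, value, c), 0.0)  # unseen value -> 0 (see 'Relevant Issues')
        scores[c] = score
    return max(scores, key=scores.get), scores
```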

Example
• Example: Play Tennis
  (Training data: 14 examples described by Outlook, Temperature, Humidity and
  Wind, with Play = Yes for 9 of them and Play = No for 5; the counts on the
  next slide are read off this table.)
Example
• Learning Phase
Outlook    Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny        2/9       3/5        Hot            2/9       2/5
Overcast     4/9       0/5        Mild           4/9       2/5
Rain         3/9       2/5        Cool           3/9       1/5

Humidity   Play=Yes  Play=No      Wind         Play=Yes  Play=No
High         3/9       4/5        Strong         3/9       3/5
Normal       6/9       1/5        Weak           6/9       2/5

P(Play=Yes) = 9/14    P(Play=No) = 5/14

Example
• Test Phase
– Given a new instance, predict its label:
  x’ = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up the tables obtained in the learning phase:
  P(Outlook=Sunny|Play=Yes) = 2/9        P(Outlook=Sunny|Play=No) = 3/5
  P(Temperature=Cool|Play=Yes) = 3/9     P(Temperature=Cool|Play=No) = 1/5
  P(Humidity=High|Play=Yes) = 3/9        P(Humidity=High|Play=No) = 4/5
  P(Wind=Strong|Play=Yes) = 3/9          P(Wind=Strong|Play=No) = 3/5
  P(Play=Yes) = 9/14                     P(Play=No) = 5/14

– Decision making with the MAP rule:
  P(Yes|x’) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)] P(Play=Yes) ≈ 0.0053
  P(No|x’)  ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)] P(Play=No)  ≈ 0.0206

Since P(Yes|x’) < P(No|x’), we label x’ as “No”.
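The two scores can be reproduced directly from the look-up tables; a quick numeric check (not part of the slides):

```python
# Unnormalised posterior scores for x' under the MAP rule.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206
print(f"score(Yes) = {p_yes:.4f}, score(No) = {p_no:.4f}")
print("prediction:", "Yes" if p_yes > p_no else "No")   # -> No
```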

Naïve Bayes
• Algorithm: Continuous-valued Features
– A continuous-valued feature takes infinitely many values, so its conditional
  probabilities cannot be tabulated
– The conditional probability is often modelled with the normal distribution
    P̂(Xj | C = ci) = 1 / (√(2π) σji) · exp( −(Xj − μji)² / (2 σji²) )
    μji : mean (average) of the values of feature Xj over the examples with C = ci
    σji : standard deviation of the values of feature Xj over the examples with C = ci

– Learning Phase: for X = (X1, ..., Xn), C ∈ {c1, ..., cL}
  Output: n × L normal distributions and P(C = ci), i = 1, ..., L
– Test Phase: given an unknown instance X’ = (a1, ..., an)
  • Instead of looking up tables, calculate the conditional probabilities with the
    normal distributions obtained in the learning phase
  • Apply the MAP rule to make a decision
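As an illustration of the continuous-valued algorithm, the sketch below (my own, not from the slides) fits one Gaussian per (feature, class) pair and applies the MAP rule; fit_gaussians and classify are invented names, and the standard deviation is estimated with the usual n−1 estimator.

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Normal density used as the class-conditional likelihood P(Xj = x | C = ci)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def fit_gaussians(examples, labels):
    """Learning phase: one (mu, sigma) per feature and class, plus class priors."""
    params, priors = {}, {}
    for c in set(labels):
        rows = [x for x, y in zip(examples, labels) if y == c]
        priors[c] = len(rows) / len(labels)
        for j in range(len(rows[0])):
            col = [row[j] for row in rows]
            params[(j, c)] = (mean(col), stdev(col))   # stdev uses the n-1 estimator
    return params, priors

def classify(x, params, priors):
    """Test phase: MAP rule with Gaussian likelihoods."""
    scores = {
        c: p_c * math.prod(gaussian_pdf(x[j], *params[(j, c)]) for j in range(len(x)))
        for c, p_c in priors.items()
    }
    return max(scores, key=scores.get)
```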
Naïve Bayes
• Example: Continuous-valued Features
– Temperature is naturally of continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and standard deviation for each class:
  μ = (1/N) Σ xₙ ,   σ² = (1/N) Σ (xₙ − μ)²
  μ_Yes = 21.64, σ_Yes = 2.35        μ_No = 23.88, σ_No = 7.09

– Learning Phase: output two Gaussian models for P(temp|C)
  P̂(x|Yes) = 1 / (2.35 √(2π)) · exp( −(x − 21.64)² / (2 · 2.35²) )
  P̂(x|No)  = 1 / (7.09 √(2π)) · exp( −(x − 23.88)² / (2 · 7.09²) )
Relevant Issues
• Violation of the independence assumption
  – For many real-world tasks, P(X1, ..., Xn|C) ≠ P(X1|C) ··· P(Xn|C)
  – Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem
  – If no training example contains the attribute value Xj = ajk, then
    P̂(Xj = ajk|C = ci) = 0
  – In this circumstance, P̂(x1|ci) ··· P̂(ajk|ci) ··· P̂(xn|ci) = 0 during testing
  – As a remedy, conditional probabilities are estimated with
    P̂(Xj = ajk|C = ci) = (nc + m·p) / (n + m)
    nc : number of training examples for which Xj = ajk and C = ci
    n  : number of training examples for which C = ci
    p  : prior estimate (usually p = 1/t for t possible values of Xj)
    m  : weight given to the prior (number of "virtual" examples, m ≥ 1)
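As a small illustration (not from the slides) of how the smoothed estimate behaves, consider the Outlook feature from the Play Tennis example, where Overcast never occurs together with Play = No:

```python
def m_estimate(n_c, n, t, m=1):
    """P_hat(Xj = a_jk | C = ci) with prior p = 1/t and m 'virtual' examples."""
    p = 1.0 / t
    return (n_c + m * p) / (n + m)

# Outlook has t = 3 values; 'Overcast' never occurs with Play = No (n_c = 0, n = 5),
# yet its smoothed estimate stays strictly positive instead of zeroing the product.
print(m_estimate(n_c=0, n=5, t=3))   # 0.0555... instead of 0.0
```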
Summary
• Naïve Bayes: the conditional independence assumption
– Training is very easy and fast: it just requires considering each
attribute in each class separately
– Testing is straightforward: just look up tables or calculate conditional
probabilities with the estimated distributions
• A popular generative model
– Performance is competitive with most state-of-the-art classifiers, even
when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate as a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…