Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X)
Generative classifiers
1. Assume some functional form for P(X|Y), P(Y)
2. Estimate parameters of P(X|Y), P(Y) directly from training data
3. Use Bayes rule to calculate P(Y|X = xi)
Bayes Formula
P(Y|X) = P(X|Y) P(Y) / P(X)
Generative Model
[Figure: a generative model in which the class generates features such as
color, size, texture, weight, …]
Discriminative Model
[Figure: a discriminative model (e.g., logistic regression) that maps
features such as color, size, texture, weight, … directly to the class]
Comparison
• Generative models
– Assume some functional form for P(X|Y), P(Y)
– Estimate parameters of P(X|Y), P(Y) directly from training data
– Use Bayes rule to calculate P(Y|X = x) (see the sketch below)
• Discriminative models
– Directly assume some functional form for P(Y|X)
– Estimate parameters of P(Y|X) directly from training data
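Below is a minimal sketch of the generative recipe on 1-D toy data, assuming
Gaussian class-conditionals; the data, function names, and parameters are
illustrative choices, not from the slides. The discriminative counterpart
(logistic regression trained by gradient ascent) is sketched at the end of
these notes.

import numpy as np

# Toy 1-D data: two classes, 100 points each
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(2, 1, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])

def gaussian_pdf(v, mu, sigma):
    return np.exp(-0.5 * ((v - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

prior = np.array([np.mean(y == c) for c in (0, 1)])    # estimated P(Y=c)
mu    = np.array([x[y == c].mean() for c in (0, 1)])   # per-class mean of X
sigma = np.array([x[y == c].std() for c in (0, 1)])    # per-class std of X

def posterior(v):
    """P(Y=c | X=v) via Bayes rule: P(X|Y) P(Y) / P(X)."""
    joint = gaussian_pdf(v, mu, sigma) * prior         # numerator P(X|Y) P(Y)
    return joint / joint.sum()                         # evidence P(X) normalizes

print(posterior(1.0))   # posterior over both classes at x = 1.0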
Probability Basics
• Prior, conditional and joint probability for random variables
– Prior probability: P(X)
– Conditional probability: P(X1|X2), P(X2|X1)
– Joint probability: X = (X1, X2), P(X) = P(X1, X2)
– Relationship: P(X1, X2) = P(X2|X1)P(X1) = P(X1|X2)P(X2)
– Independence: P(X2|X1) = P(X2), P(X1|X2) = P(X1), P(X1, X2) = P(X1)P(X2)
• Bayesian Rule
P(C|X) = P(X|C) P(C) / P(X)
(Posterior = Likelihood × Prior / Evidence)
Probability Basics
• Quiz: We have two six-sided dice. When they are rolled, the following
events can occur: (A) die 1 lands on side “3”, (B) die 2 lands on side “1”,
and (C) the two dice sum to eight. Answer the following questions (a
checking sketch follows the list):
1) P(A) = ?
2) P(B) = ?
3) P(C) = ?
4) P(A|B) = ?
5) P(C|A) = ?
6) P(A, B) = ?
7) P(A, C) = ?
8) Does P(A, C) equal P(A) P(C)?
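The answers can be checked by brute-force enumeration of the 36 equally
likely outcomes of two fair dice; the sketch below does this with exact
fractions (all names are illustrative).

from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 (die1, die2) pairs

def prob(event):
    """P(event) under the uniform distribution over the 36 outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

A = lambda o: o[0] == 3            # die 1 lands on 3
B = lambda o: o[1] == 1            # die 2 lands on 1
C = lambda o: o[0] + o[1] == 8     # the two dice sum to 8
AB = lambda o: A(o) and B(o)
AC = lambda o: A(o) and C(o)

print(prob(A), prob(B), prob(C))        # 1/6, 1/6, 5/36
print(prob(AB) / prob(B))               # P(A|B) = 1/6
print(prob(AC) / prob(A))               # P(C|A) = 1/6
print(prob(AB), prob(AC))               # 1/36, 1/36
print(prob(AC) == prob(A) * prob(C))    # False: A and C are not independent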
Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model
P(C|X), C ∈ {c1, …, cL}, X = (X1, …, Xn)
[Figure: a discriminative probabilistic classifier maps an input
x = (x1, x2, …, xn) to the posteriors P(c1|x), P(c2|x), …, P(cL|x)]
Probabilistic Classification
• Establishing a probabilistic model for classification
(cont.)
– Generative model
P(X|C), C ∈ {c1, …, cL}, X = (X1, …, Xn)
[Figure: one generative probabilistic model per class; each takes
x = (x1, x2, …, xn) and outputs a likelihood P(x|c1), P(x|c2), …, P(x|cL)]
Probabilistic Classification
• MAP classification rule
– MAP: Maximum A Posteriori
– Assign x to c* if
P(C = c*|X = x) > P(C = c|X = x), for all c ≠ c*, c ∈ {c1, …, cL}
(a code sketch follows)
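A minimal sketch of the MAP rule as code, assuming a hypothetical
posterior(c, x) function that returns P(C = c | X = x); any model that
supplies such a function will do.

def map_classify(x, classes, posterior):
    """Assign x to the class c* with the largest posterior P(C=c|X=x)."""
    return max(classes, key=lambda c: posterior(c, x))

# e.g. map_classify(x_new, ["Yes", "No"], posterior) returns the MAP label.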
Example
• Example: Play Tennis
[Table: 14 training examples with attributes Outlook, Temperature,
Humidity, Wind and the label PlayTennis; 9 positive, 5 negative]
Example
• Learning Phase
Outlook      Play=Yes  Play=No      Temperature  Play=Yes  Play=No
Sunny        2/9       3/5          Hot          2/9       2/5
Overcast     4/9       0/5          Mild         4/9       2/5
Rain         3/9       2/5          Cool         3/9       1/5

Humidity     Play=Yes  Play=No      Wind         Play=Yes  Play=No
High         3/9       4/5          Strong       3/9       3/5
Normal       6/9       1/5          Weak         6/9       2/5
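The learning phase is just counting. Below is a minimal sketch, assuming
the training data are given as (features, label) records; the two records
shown are illustrative stand-ins for the full 14-example table.

from collections import Counter, defaultdict

data = [
    ({"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}, "No"),
    ({"Outlook": "Overcast", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}, "Yes"),
    # ... remaining 12 training examples ...
]

label_counts = Counter(label for _, label in data)   # counts for P(Play)
value_counts = defaultdict(Counter)                  # counts for P(attr=val | Play)
for features, label in data:
    for attr, val in features.items():
        value_counts[(attr, label)][val] += 1

def cond_prob(attr, val, label):
    """P(attr = val | Play = label) by relative frequency."""
    return value_counts[(attr, label)][val] / label_counts[label]

# With the full table this reproduces the entries above,
# e.g. cond_prob("Outlook", "Sunny", "Yes") == 2/9.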
Example
• Test Phase
– Given a new instance,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up tables
P(Outlook=Sunny|Play=Yes) = 2/9      P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9   P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9      P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9        P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14                   P(Play=No) = 5/14
– MAP rule
P(Yes|x’) ∝ [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x’) ∝ [P(Sunny|No)P(Cool|No)P(High|No)P(Strong|No)] P(Play=No) = 0.0206
Since 0.0053 < 0.0206, the MAP rule labels x’ as Play=No.
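A minimal sketch of this test-phase computation, hard-coding the
looked-up fractions from the tables above.

from fractions import Fraction as F

# Looked-up conditionals for x' = (Sunny, Cool, High, Strong)
p_x_yes = F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)   # P(x'|Yes)
p_x_no  = F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)   # P(x'|No)

score_yes = p_x_yes * F(9, 14)   # P(x'|Yes) P(Yes) ≈ 0.0053
score_no  = p_x_no  * F(5, 14)   # P(x'|No)  P(No)  ≈ 0.0206

print(float(score_yes), float(score_no))                   # 0.00529... 0.02057...
print("Play =", "Yes" if score_yes > score_no else "No")   # Play = No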
Conclusions
• Naïve Bayes is based on the independence assumption
– Training is very easy and fast; it only requires considering each
attribute in each class separately
– Testing is straightforward; it only requires looking up tables or
calculating conditional probabilities with normal distributions
• A popular generative model
– Performance is competitive with most state-of-the-art classifiers,
even when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate for a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…
Extra Slides
Naïve Bayes (1)
• Revisit: P(Y|X) = P(X|Y) P(Y) / P(X)
• Which, under the naïve conditional-independence assumption, is equal to
P(Y = y|X = x) ∝ P(Y = y) ∏i P(Xi = xi|Y = y)
• Classification: assign the label y* = argmaxy P(Y = y) ∏i P(Xi = xi|Y = y)
Naïve Bayes (3)
• What if we have continuous Xi?
• Generative training: for each class c and feature i, estimate the mean μic
and variance σic² from the training examples of class c, and model
P(Xi|Y = c) as a Gaussian N(μic, σic²)
• Prediction: y* = argmaxc P(Y = c) ∏i N(xi; μic, σic²) (see the sketch below)
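A minimal sketch of Gaussian naïve Bayes under these assumptions; the toy
data and the 1e-9 variance smoothing term are illustrative choices, not
from the slides.

import numpy as np

def fit(X, y):
    """Estimate P(Y=c) and per-(feature, class) Gaussian parameters by counting."""
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}
    mean  = {c: X[y == c].mean(axis=0) for c in classes}
    var   = {c: X[y == c].var(axis=0) + 1e-9 for c in classes}  # smoothed variance
    return classes, prior, mean, var

def predict(x, classes, prior, mean, var):
    """MAP prediction, done in log space for numerical stability."""
    def log_score(c):
        # log P(Y=c) + sum_i log N(x_i; mu_ic, var_ic)
        return (np.log(prior[c])
                - 0.5 * np.sum(np.log(2 * np.pi * var[c])
                               + (x - mean[c]) ** 2 / var[c]))
    return max(classes, key=log_score)

# Illustrative usage on toy 2-feature data
X = np.array([[1.0, 2.0], [1.2, 1.9], [3.0, 4.0], [3.1, 4.2]])
y = np.array([0, 0, 1, 1])
print(predict(np.array([2.9, 4.1]), *fit(X, y)))   # -> 1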
Naïve Bayes (4)
• Problems
– Features may overlap
– Features may not be independent
• e.g., the size and weight of a tiger
– Naïve Bayes uses a joint distribution estimate (P(X|Y), P(Y)) to solve a
conditional problem (P(Y|X = x))
• Can we train discriminatively?
– Logistic regression
– Regularization
– Gradient ascent (see the sketch below)
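A minimal sketch of the discriminative alternative: binary logistic
regression trained by gradient ascent on an L2-regularized log-likelihood.
The toy data, learning rate, and regularization strength are illustrative
assumptions, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.01, lam=0.1, steps=2000):
    """Gradient ascent on sum_i log P(y_i|x_i; w) - (lam/2) ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)               # current P(Y=1 | x_i)
        grad = X.T @ (y - p) - lam * w   # gradient of the regularized objective
        w += lr * grad                   # ascent step
    return w

# Illustrative usage on toy 1-D data (first column is the bias term)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(2, 1, 50)])
X = np.column_stack([np.ones(100), x])
y = np.concatenate([np.zeros(50), np.ones(50)])
w = train_logreg(X, y)
print(sigmoid(np.array([1.0, 2.0]) @ w))   # estimated P(Y=1 | x = 2.0)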