
Learning

Classification Problem
• Problem statement:
– Given features X1, X2,…, Xn
– Predict a label Y
• Definition (Classification): the task of learning a target
function f that maps each attribute set X to one of the
predefined class labels Y.
Example
Day    Outlook   Temperature  Humidity  Wind    PlayTennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No
Classification problem

• Training data: examples of the form (d, h(d))
– where d is a data object to classify (the input)
– and h(d) ∈ {1, …, K} is the correct class label for d
• Goal: given dnew, predict h(dnew)
Things We’d Like to Do
• Spam Classification
– Given an email, predict whether it is spam or not

• Medical Diagnosis
– Given a list of symptoms, predict whether a patient has
disease X or not

• Weather
– Based on temperature, humidity, etc., predict whether it will rain
tomorrow
Learning
• A classification technique is a systematic approach to building classification models
from an input dataset
• Each technique employs a learning algorithm to identify the model that best fits
the relationship between the attribute set and the class label of the input data.
• Formally, a computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E.
• Thus a learning system is characterized by:
❑ task T
❑ experience E, and
❑ performance measure P
Performance of Classification
• Evaluation is based on the counts of test records correctly and
incorrectly predicted by the model
• These counts are tabulated in a table known as a confusion matrix

                         Predicted Class
                         Class = 1   Class = 0
Actual Class  Class = 1  f11         f10
              Class = 0  f01         f00

Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)
         = (f11 + f00) / (f11 + f10 + f01 + f00)

• Accuracy will yield misleading results if the data set is unbalanced
• For example, if there were 95 cats and only 5 dogs in the data, a classifier might label
every observation as a cat.
• Such a classifier would have a 100% recognition rate for the cat class but a 0% recognition
rate for the dog class, yet its overall accuracy would still be 95%.
Confusion Matrix
                         Predicted Class
                         Class = 1   Class = 0
Actual Class  Class = 1  TP          FN
              Class = 0  FP          TN

• True Positive: you predicted positive and it's true.
• True Negative: you predicted negative and it's true.
• False Positive (Type 1 Error): you predicted positive and it's false.
• False Negative (Type 2 Error): you predicted negative and it's false.
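To make these four counts concrete, here is a minimal sketch in plain Python (the example label lists are made up purely for illustration):

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally TP, FN, FP, TN for a binary task from paired label lists."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1        # predicted positive, actually positive
        elif a == positive:
            fn += 1        # predicted negative, actually positive
        elif p == positive:
            fp += 1        # predicted positive, actually negative
        else:
            tn += 1        # predicted negative, actually negative
    return tp, fn, fp, tn

tp, fn, fp, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
accuracy = (tp + tn) / (tp + fn + fp + tn)   # (f11 + f00) / total
print(tp, fn, fp, tn, accuracy)              # 2 1 1 1 0.6
```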
Performance of Classification contd.

• In addition to classification accuracy, there are two other metrics for
performance evaluation
• Precision (also called positive predictive value) is the fraction of relevant
instances among the retrieved instances, i.e. out of all the instances we
predicted as positive, how many are actually positive
• Recall (also known as sensitivity) is the fraction of relevant instances that
were retrieved out of the total number of relevant instances, i.e. out of all the
actual positive instances, how many we predicted correctly. It should be as high
as possible.

• Example: Suppose a computer program for recognizing dogs in photographs
identifies eight dogs in a picture containing 12 dogs and some cats. Of the eight
dogs identified, five actually are dogs (true positives), while the rest are cats
(false positives).

• Answer: The program's precision is 5/8 while its recall is 5/12.


Performance of Classification contd.

• In simple terms, high precision means that an algorithm returned substantially
more relevant results than irrelevant ones, while high recall means that an
algorithm returned most of the relevant results

               Relevant   Nonrelevant
Retrieved      tp         fp
Not Retrieved  fn         tn

• Precision P = tp/(tp + fp)


• Recall R = tp/(tp + fn)
• F-Measure F = 2*R*P/(R+P)
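Plugging the dog-recognition example above into these formulas (tp = 5, fp = 3, fn = 7), a minimal sketch:

```python
tp, fp, fn = 5, 3, 7        # 5 true dogs of 8 flagged; 12 dogs total, so 7 missed

precision = tp / (tp + fp)  # 5/8  = 0.625
recall    = tp / (tp + fn)  # 5/12 ~ 0.417
f_measure = 2 * recall * precision / (recall + precision)

print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")  # P=0.625 R=0.417 F=0.500
```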
Classification Techniques

• Decision Tree based Methods
• Rule-based Methods
• Artificial Neural Networks
• Naïve Bayes Classifier
• Support Vector Machines
Naïve Bayes Classifier

Thomas Bayes
1702 - 1761
A word about the Bayesian framework
• Allows us to combine observed data and prior knowledge

• It is a generative (model based) approach, which offers a useful
conceptual framework

– This means that any kind of objects (e.g. time series, trees, etc.)
can be classified, based on a probabilistic model specification
– Find out the probability of the previously unseen instance
belonging to each class, then simply pick the most probable class.
Another Application
• Digit Recognition

[Figure: a pixel image fed into a classifier, which outputs the digit "5"]

• X1, …, Xn ∈ {0, 1} (black vs. white pixels)
• Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)
The Bayes Classifier
• A good strategy is to predict the most probable class given the observed
features, i.e. the class y that maximizes P(Y = y | X1, …, Xn)

– (for example: what is the probability that the image
represents a 5 given its pixels?)

• So … How do we compute that?


The Bayes Classifier
• Use Bayes Rule!

P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

– the numerator combines the likelihood P(X1, …, Xn | Y) and the prior P(Y);
the denominator is a normalization constant
• Why did this help? Well, we think that we might be able to
specify how features are “generated” by the class label
Bayes Classifiers
• Bayesian classifiers use Bayes’ theorem, which says
p(cj | d) = p(d | cj) p(cj) / p(d)

• p(cj | d) = probability of instance d being in class cj;
this is what we are trying to compute
• p(d | cj) = probability of generating instance d given class cj;
we can imagine that being in class cj causes you to have feature d
with some probability
• p(cj) = probability of occurrence of class cj;
this is just how frequent the class cj is in our database
• p(d) = probability of instance d occurring;
this can actually be ignored, since it is the same for all classes
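A minimal sketch of this decomposition, with hypothetical priors and likelihoods purely to show the arithmetic; note that dividing by p(d) rescales the scores but never changes which class wins:

```python
# Hypothetical numbers, purely for illustration
priors      = {"c1": 0.6, "c2": 0.4}     # p(cj)
likelihoods = {"c1": 0.10, "c2": 0.30}   # p(d | cj) for one observed instance d

scores = {c: likelihoods[c] * priors[c] for c in priors}  # numerators of Bayes rule
p_d = sum(scores.values())                                # p(d), the normalization constant
posteriors = {c: s / p_d for c, s in scores.items()}      # p(cj | d)

print(posteriors)                    # {'c1': 0.333..., 'c2': 0.666...}
print(max(scores, key=scores.get))   # 'c2' -- same winner with or without dividing by p(d)
```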
The Bayes Classifier
• Let’s expand this for our digit recognition task: compute
P(Y = 5 | X1, …, Xn) and P(Y = 6 | X1, …, Xn)

• To classify, we’ll simply compute these two probabilities and predict based
on which one is greater
Model Parameters
• For the Bayes classifier, we need to “learn” two functions, the
likelihood and the prior

• How many parameters are required to specify the prior for
our digit recognition example?
– Just one: P(Y = 5), since P(Y = 6) = 1 − P(Y = 5)
Model Parameters
• How many parameters are required to specify the likelihood P(X1, …, Xn | Y)?
– (supposing that each image is 30×30 pixels, so n = 900)
Model Parameters
• The problem with explicitly modeling P(X1,…,Xn|Y) is that
there are usually way too many parameters:
– We’ll run out of space
– We’ll run out of time
– And we’ll need tons of training data (which is usually not
available)
Why is this useful?
• # of parameters for modeling P(X1, …, Xn | Y):

▪ 2(2^n − 1)

• # of parameters for modeling P(X1 | Y), …, P(Xn | Y):

▪ 2n
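As a quick sanity check of these counts for the 30×30 digit example (n = 900, two classes), a sketch:

```python
n = 900                       # 30x30 binary pixels
full_joint = 2 * (2**n - 1)   # 2 classes, each needing 2^n - 1 free parameters
naive      = 2 * n            # 2 classes x one Bernoulli parameter per pixel
print(len(str(full_joint)))   # 272 digits: hopelessly many parameters
print(naive)                  # 1800
```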
Example
This is Officer Drew (who arrested me in 1997).
Is Officer Drew a Male or Female?
Luckily, we have a small database with names and sex.
We can use it to apply Bayes rule…

Name     Sex
Drew     Male
Claudia  Female
Drew     Female
Drew     Female
Alberto  Male
Karin    Female
Nina     Female
Sergio   Male

p(cj | d) = p(d | cj) p(cj) / p(d)
Assume that we have two classes: c1 = male and c2 = female.

We have a person whose sex we do not know, say "drew" or d.
Classifying drew as male or female is equivalent to asking which
is more probable, i.e. which is greater:
p(male | drew) or p(female | drew)

What is the probability of being called


“drew” given that you are a male?
What is the probability
of being a male?
p(male | drew) = p(drew | male ) p(male)
p(drew) What is the probability of
being named “drew”?
(actually irrelevant, since it is
that same for all classes)
Name     Sex
Drew     Male
Claudia  Female
Drew     Female
Drew     Female
Alberto  Male
Karin    Female
Nina     Female
Sergio   Male

p(male | drew) = (1/3 * 3/8) / (3/8)     numerator = 0.125
p(female | drew) = (2/5 * 5/8) / (3/8)   numerator = 0.250

Since the denominator p(drew) = 3/8 is the same for both classes, we only
need to compare the numerators: 0.250 > 0.125, so Officer Drew is more
likely to be a Female.

Officer Drew IS a female!

Officer Drew

p(male | drew) = 1/3 * 3/8 = 0.125


3/8 3/8

p(female | drew) = 2/5 * 5/8 = 0.250


3/8 3/8
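A minimal sketch of this computation in plain Python, using the name/sex table above:

```python
from collections import Counter

data = [("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"), ("Drew", "Female"),
        ("Alberto", "Male"), ("Karin", "Female"), ("Nina", "Female"), ("Sergio", "Male")]

sex_counts = Counter(sex for _, sex in data)
for c in ("Male", "Female"):
    prior = sex_counts[c] / len(data)                                 # p(cj)
    likelihood = sum(1 for n, s in data
                     if n == "Drew" and s == c) / sex_counts[c]       # p(drew | cj)
    print(c, likelihood * prior)  # numerator of Bayes rule; p(drew) cancels
# Male 0.125, Female 0.25  ->  predict Female
```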
The Naïve Bayes Assumption
• So far we have only considered Bayes Classification when we have one
attribute
• What can we do if our data d has several attributes?
• Naïve Bayes assumption: Attributes that describe data instances are
conditionally independent given the classification hypothesis
P(d | h) = P(a1, …, aT | h) = ∏t=1..T P(at | h)

– it is a simplifying assumption; obviously it may be violated in reality
– in spite of that, it works well in practice
• The Bayesian classifier that uses the Naïve Bayes assumption and
computes the MAP hypothesis is called Naïve Bayes classifier
Name     Over 170cm  Eye    Hair length  Sex
Drew     No          Blue   Short        Male
Claudia  Yes         Brown  Long         Female
Drew     No          Blue   Long         Female
Drew     No          Blue   Long         Female
Alberto  Yes         Brown  Short        Male
Karin    No          Blue   Long         Female
Nina     Yes         Brown  Short        Female
Sergio   Yes         Blue   Long         Male
• To simplify the task, naïve Bayesian classifiers
assume attributes have independent distributions,
and thereby estimate
p(d|cj) = p(d1|cj) * p(d2|cj) * … * p(dn|cj)

p(officer drew | cj) = p(over_170cm = yes | cj) * p(eye = blue | cj) * …

Officer Drew is blue-eyed, over 170cm tall, and has long hair:

p(officer drew | Female) = 2/5 * 3/5 * …
p(officer drew | Male) = 2/3 * 2/3 * …
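A minimal sketch of this product in plain Python; for completeness it also multiplies in the hair-length factor and the class prior, both computed from the same table above:

```python
# Rows from the table above: (name, over_170cm, eye, hair_length, sex)
data = [("Drew", "No", "Blue", "Short", "Male"),     ("Claudia", "Yes", "Brown", "Long", "Female"),
        ("Drew", "No", "Blue", "Long", "Female"),    ("Drew", "No", "Blue", "Long", "Female"),
        ("Alberto", "Yes", "Brown", "Short", "Male"),("Karin", "No", "Blue", "Long", "Female"),
        ("Nina", "Yes", "Brown", "Short", "Female"), ("Sergio", "Yes", "Blue", "Long", "Male")]

def cond_prob(attr_index, value, sex):
    """p(attribute = value | sex), estimated by counting."""
    in_class = [row for row in data if row[4] == sex]
    return sum(1 for row in in_class if row[attr_index] == value) / len(in_class)

# Officer Drew: over 170cm = Yes, eye = Blue, hair = Long
evidence = [(1, "Yes"), (2, "Blue"), (3, "Long")]
for sex in ("Male", "Female"):
    score = sum(1 for row in data if row[4] == sex) / len(data)  # prior p(sex)
    for idx, val in evidence:
        score *= cond_prob(idx, val, sex)   # naive Bayes: multiply per-attribute likelihoods
    print(sex, round(score, 4))
# Male 0.0556, Female 0.12  ->  predict Female
```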
Example. ‘Play Tennis’ data
Day    Outlook   Temperature  Humidity  Wind    PlayTennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No
Example. ‘Play Tennis’ data

• Based on the examples in the table, classify the following datum x:
x = (Outlook = Sunny, Temp = Cool, Hum = High, Wind = Strong)
• That means: play tennis or not?
Example. ‘Play Tennis’ data

• Answer:

P(PlayTennis = yes) = 9/14 = 0.64
P(PlayTennis = no) = 5/14 = 0.36
P(Wind = strong | PlayTennis = yes) = 3/9 = 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
etc.

P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) = 0.0053
P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) = 0.0206

⇒ answer: PlayTennis(x) = no
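The same computation as a minimal Python sketch over the Play Tennis table above:

```python
# Play Tennis rows: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
        ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rain", "Mild", "High", "Weak", "Yes"),
        ("Rain", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Cool", "Normal", "Strong", "No"),
        ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
        ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "Normal", "Weak", "Yes"),
        ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
        ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "High", "Strong", "No")]

x = ("Sunny", "Cool", "High", "Strong")            # the query instance

for label in ("Yes", "No"):
    rows = [r for r in data if r[4] == label]
    score = len(rows) / len(data)                  # prior P(label)
    for i, value in enumerate(x):                  # multiply P(attribute_i = value | label)
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    print(label, round(score, 4))
# Yes 0.0053, No 0.0206  ->  predict No
```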
Naïve Bayes Classifier with Continuous-Valued Attributes
Example: Continuous-valued Features
– Temperature is naturally continuous-valued.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
– Estimate the mean and variance for each class (the σ values below come from
the sample variance, i.e. dividing by N − 1):

μ = (1/N) Σn xn,   σ² = (1/(N−1)) Σn (xn − μ)²

μYes = 21.64, σYes = 2.35
μNo = 23.88, σNo = 7.09

– Learning phase: output two Gaussian models for P(temp | C):

P̂(x | Yes) = (1 / (2.35 √(2π))) exp(−(x − 21.64)² / (2 · 2.35²))
P̂(x | No) = (1 / (7.09 √(2π))) exp(−(x − 23.88)² / (2 · 7.09²))
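A minimal sketch of this learning phase in Python; it uses the sample variance (dividing by N − 1), which is what reproduces the slide's σ values:

```python
import math

temps_yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
temps_no  = [27.3, 30.1, 17.4, 29.5, 15.1]

def fit_gaussian(xs):
    """Return (mean, std), using the sample variance (N - 1)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_yes, sd_yes = fit_gaussian(temps_yes)   # ~ (21.64, 2.35)
mu_no,  sd_no  = fit_gaussian(temps_no)    # ~ (23.88, 7.09)
print(gaussian_pdf(22.0, mu_yes, sd_yes),  # P^(temp = 22 | Yes)
      gaussian_pdf(22.0, mu_no,  sd_no))   # P^(temp = 22 | No)
```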
Outputting Probabilities
• What’s nice about Naïve Bayes (and generative models in
general) is that it returns probabilities
– These probabilities can tell us how confident the algorithm
is
– So… don’t throw away those probabilities!
Exclusive-OR Example

• For an example where conditional independence fails:
– Y = XOR(X1, X2)

X1  X2  P(Y=0|X1,X2)  P(Y=1|X1,X2)
0   0   1             0
0   1   0             1
1   0   0             1
1   1   1             0
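A small sketch (assuming the four XOR rows occur equally often) of why naïve Bayes fails here: every per-feature conditional equals 1/2, so both classes always get the same score:

```python
# The four equally likely XOR rows: (x1, x2, y) with y = x1 XOR x2
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

def p_xi_given_y(i, value, y):
    in_class = [r for r in rows if r[2] == y]
    return sum(1 for r in in_class if r[i] == value) / len(in_class)

for x1, x2, y in rows:
    for label in (0, 1):
        prior = 0.5                                                    # two rows per class
        score = prior * p_xi_given_y(0, x1, label) * p_xi_given_y(1, x2, label)
        print((x1, x2), "class", label, "score", score)
# Every score is 0.125: naive Bayes cannot distinguish the classes at all
```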
Relevant Issues
• Violation of Independence Assumption
– For many real-world tasks, P(X1, …, Xn | Y) ≠ P(X1 | Y) ⋯ P(Xn | Y)
– Nevertheless, naïve Bayes works surprisingly well anyway!
• Zero conditional probability problem
– If no training example contains the attribute value Xj = ajk, then P̂(Xj = ajk | C = ci) = 0
– In this circumstance, P̂(x1 | ci) ⋯ P̂(ajk | ci) ⋯ P̂(xn | ci) = 0 during testing
– For a remedy, the conditional probabilities are re-estimated with

P̂(Xj = ajk | C = ci) = (nc + m·p) / (n + m)

nc : number of training examples for which Xj = ajk and C = ci
n : number of training examples for which C = ci
p : prior estimate (usually, p = 1/t for t possible values of Xj)
m : weight given to the prior (number of "virtual" examples, m ≥ 1)
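A minimal sketch of this m-estimate (the example counts below are hypothetical):

```python
def m_estimate(nc, n, t, m=1):
    """Smoothed conditional probability P(Xj = ajk | C = ci).

    nc : count of class-ci examples with Xj = ajk
    n  : count of class-ci examples
    t  : number of possible values of Xj (so the prior p = 1/t)
    m  : weight given to the prior ("virtual" examples)
    """
    p = 1 / t
    return (nc + m * p) / (n + m)

# Unsmoothed, 0/9 would zero out the whole product; the m-estimate stays positive:
print(m_estimate(nc=0, n=9, t=3, m=3))   # (0 + 3*(1/3)) / (9 + 3) = 1/12
```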
Summary
• Naïve Bayes: the conditional independence assumption
– Training is very easy and fast: just count each attribute value
in each class separately
– Testing is straightforward: just look up tables or calculate
conditional probabilities with the estimated distributions
• A popular generative model
– Performance competitive with most state-of-the-art classifiers, even
when the independence assumption is violated
– Many successful applications, e.g., spam mail filtering
– A good candidate as a base learner in ensemble learning
– Apart from classification, naïve Bayes can do more…
