
Naïve Bayes Classifier

Outline
• Background
• Probability Basics
• Probabilistic Classification
• Naïve Bayes
• Example: Play Tennis
• Relevant Issues
• Conclusions

*
Quiz: Probability Basics
• Quiz: We have two six-sided dice. When they are rolled, the following events can occur: (A) die 1 lands on side “3”, (B) die 2 lands on side “1”, and (C) the two dice sum to eight. Answer the following questions:

*
Probabilistic Classification
*
Supervised learning
• A Generative Model learns the joint probability distribution
p(x,y). It predicts the conditional probability with the help of
Bayes’ Theorem.

• A Discriminative Model learns the conditional probability
distribution p(y|x).

• Both of these models are generally used in supervised learning problems.

*
Generative Classifiers
Training a classifier involves estimating f: X -> Y, or P(Y|X)
Generative classifiers
• Assume some functional form for P(Y), P(X|Y)
• Estimate the parameters of P(X|Y), P(Y) directly from the training data
• Use Bayes’ rule to calculate P(Y|X)

Examples:
•Naïve Bayes
•Bayesian networks
•Markov random fields
•Hidden Markov Models (HMM)
Probabilistic Classification
• Establishing a probabilistic model for classification (cont.)
– Generative model

[Figure: one generative probabilistic model per class (Class 1, Class 2, …, Class L); each model outputs the probability that the input belongs to its class, e.g. the probability that a fruit is an apple vs. the probability that it is an orange.]

*
Discriminative Classifiers
• Assume some functional form for P(Y|X)
• Estimate the parameters of P(Y|X) directly from the training data

Examples:
• Logistic regression
• Support Vector Machines
• Traditional neural networks
• Nearest neighbour
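To make the contrast with generative models concrete, below is a minimal sketch (with toy data invented for illustration) of a discriminative classifier in scikit-learn: logistic regression fits P(Y|X) directly from the training data.

# A sketch of a discriminative classifier: logistic regression models P(Y|X) directly.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])   # toy 1-D feature
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.0]]))   # estimated [P(Y=0|x), P(Y=1|x)] for x = 2.0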

*
Probabilistic Classification
• Establishing a probabilistic model for classification
– Discriminative model

What is a discriminative probabilistic classifier?
[Figure: a discriminative probabilistic classifier maps the input directly to P(C|x). Example classes: C1 – benign mole, C2 – cancer.]

*
Probability Basics
• We define prior, conditional and joint probability for two random
variables X1 and X2
– Prior probability: P(X1)
– Conditional probability: P(X1|X2), P(X2|X1)
– Joint probability: P(X1, X2)
– Relationship: P(X1, X2) = P(X1|X2)P(X2) = P(X2|X1)P(X1)
– Independence: P(X1|X2) = P(X1), P(X2|X1) = P(X2), P(X1, X2) = P(X1)P(X2)
• Bayes’ Rule: P(C|X) = P(X|C)P(C) / P(X), i.e. posterior = likelihood × prior / evidence
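As a minimal numeric sketch of Bayes’ rule (the prior and likelihood values below are made up for illustration, not taken from the slides):

# Suppose a class C ("spam") has prior P(C) = 0.3, and a feature X ("contains the word 'offer'")
# has likelihoods P(X|C) = 0.8 and P(X|not C) = 0.1.
p_c = 0.3                      # prior P(C)
p_x_given_c = 0.8              # likelihood P(X|C)
p_x_given_not_c = 0.1          # likelihood P(X|not C)

# Evidence P(X) by the law of total probability
p_x = p_x_given_c * p_c + p_x_given_not_c * (1 - p_c)

# Bayes' rule: P(C|X) = P(X|C) P(C) / P(X)
p_c_given_x = p_x_given_c * p_c / p_x
print(p_c_given_x)             # ≈ 0.774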

*
Method: Probabilistic Classification with MAP
• MAP classification rule (we use this rule in many applications)
– MAP: Maximum A Posteriori
– Assign x to c* if P(c*|x) > P(c|x) for all c ≠ c*, c ∈ {c1, …, cL}

• Method of generative classification with the MAP rule
– Estimate the likelihoods P(x|ci) and the priors P(ci) from the training data
• Apply Bayes’ rule to convert them into posterior probabilities
P(ci|x) = P(x|ci)P(ci) / P(x), for i = 1, …, L
• Then apply the MAP rule
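A small sketch of the MAP rule with hypothetical class priors and likelihoods; since the evidence P(x) is the same for every class, comparing the unnormalised products P(x|c)P(c) is enough.

# A minimal sketch of the MAP rule (class names and numbers are hypothetical).
priors = {"c1": 0.5, "c2": 0.3, "c3": 0.2}          # P(c)
likelihoods = {"c1": 0.02, "c2": 0.10, "c3": 0.05}  # P(x|c) for the observed x

scores = {c: likelihoods[c] * priors[c] for c in priors}
c_star = max(scores, key=scores.get)                # MAP decision
print(c_star, scores)                               # 'c2' wins: 0.03 > 0.01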

*
Naïve Bayes

*
Naïve Bayes
Note: for each class, the previous generative model can be decomposed into n generative models, one per input attribute.
• Bayes classification
P(C|X) ∝ P(X|C)P(C) = P(X1,…,Xn|C)P(C)
Difficulty: learning the joint probability P(X1,…,Xn|C)

• Naïve Bayes classification
– Assumption: all input attributes are conditionally independent given the class!
P(X1,…,Xn|C) = P(X1|C)P(X2|C)…P(Xn|C) (a product of individual conditional probabilities)
– MAP classification rule: for x’ = (a1,…,an), assign the label c* if
[P(a1|c*)…P(an|c*)]P(c*) > [P(a1|c)…P(an|c)]P(c), for all c ≠ c*
*
Naïve Bayes Algorithm
• The Naïve Bayes algorithm (for discrete input attributes) has two phases
– 1. Learning Phase: Given a training set S,
for each target value ci ∈ C
Calculate P(ci) from the training data
for each attribute value xjk of attribute Xj
Calculate P(Xj = xjk | C = ci) for each target value ci
Output: conditional probability tables
– 2. Test Phase: Given an unknown instance x’ = (a’1, …, a’n),
Look up the tables to assign the label c* to x’ if
[P(a’1|c*)…P(a’n|c*)]P(c*) > [P(a’1|c)…P(a’n|c)]P(c), for all c ≠ c*
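A minimal sketch of the learning phase for discrete attributes; the four-row training set below is hypothetical (it is not the Play-Tennis table that follows), and the probabilities are estimated as simple relative frequencies.

from collections import Counter, defaultdict

# Each training example is (attribute_values, class_label).
train = [
    (("Sunny", "Hot"), "No"),
    (("Sunny", "Cool"), "Yes"),
    (("Rain", "Cool"), "Yes"),
    (("Rain", "Hot"), "No"),
]

class_counts = Counter(c for _, c in train)
n = len(train)
priors = {c: class_counts[c] / n for c in class_counts}          # P(C = ci)

# cond[(j, value, c)] = P(Xj = value | C = c), estimated by relative frequency
pair_counts = defaultdict(int)
for x, c in train:
    for j, value in enumerate(x):
        pair_counts[(j, value, c)] += 1
cond = {k: pair_counts[k] / class_counts[k[2]] for k in pair_counts}

print(priors)                        # {'No': 0.5, 'Yes': 0.5}
print(cond[(0, "Sunny", "No")])      # 1/2 = 0.5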
Tennis Example
• Example: Play Tennis

*
The learning phase for the tennis example
P(Play=Yes) = 9/14
P(Play=No) = 5/14
We have four attributes; for each attribute Xi we calculate a conditional probability table P(Xi|C):
Outlook      Play=Yes  Play=No
Sunny          2/9       3/5
Overcast       4/9       0/5
Rain           3/9       2/5

Temperature  Play=Yes  Play=No
Hot            2/9       2/5
Mild           4/9       2/5
Cool           3/9       1/5

Humidity     Play=Yes  Play=No
High           3/9       4/5
Normal         6/9       1/5

Wind         Play=Yes  Play=No
Strong         3/9       3/5
Weak           6/9       2/5
*
Formulation of a Classification Problem

• Given the data shown in the last slide:

• For a new point (a vector of attribute values), find which class it belongs to (classify it).

*
The test phase for the tennis example
• Test Phase
– Given a new instance of variable values,
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Given calculated Look up tables
P(Outlook=Sunny|Play=Yes) = 2/9 P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9 P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9 P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9 P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14

– Use the MAP rule to calculate Yes or No

P(Yes|x’): [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053


P(No|x’): [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206

– Given that P(Yes|x’) < P(No|x’), we label x’ as “No”.

*


Example: software exists
• Test Phase
– Given a new instance (from the previous slide),
x’=(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
– Look up tables
P(Outlook=Sunny|Play=Yes) = 2/9 P(Outlook=Sunny|Play=No) = 3/5
P(Temperature=Cool|Play=Yes) = 3/9 P(Temperature=Cool|Play=No) = 1/5
P(Humidity=High|Play=Yes) = 3/9 P(Humidity=High|Play=No) = 4/5
P(Wind=Strong|Play=Yes) = 3/9 P(Wind=Strong|Play=No) = 3/5
P(Play=Yes) = 9/14 P(Play=No) = 5/14
– MAP rule

P(Yes|x’): [P(Sunny|Yes)P(Cool|Yes)P(High|Yes)P(Strong|Yes)]P(Play=Yes) = 0.0053


P(No|x’): [P(Sunny|No) P(Cool|No)P(High|No)P(Strong|No)]P(Play=No) = 0.0206

Given that P(Yes|x’) < P(No|x’), we label x’ as “No”.
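A short Python check of the arithmetic in the slide above (this only reproduces the slide’s numbers, it is not a full implementation):

# Reproducing the calculation for x' = (Sunny, Cool, High, Strong)
# with the probabilities read from the look-up tables above.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ≈ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ≈ 0.0206

label = "Yes" if p_yes > p_no else "No"
print(round(p_yes, 4), round(p_no, 4), label)    # 0.0053 0.0206 No

# If actual posterior probabilities are wanted, normalise the two scores:
print(p_no / (p_yes + p_no))                     # ≈ 0.795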

*
Advantages of Using Naive Bayes
1. Less complex: Compared to other classifiers, Naïve Bayes is
considered a simpler classifier since the parameters are easier to
estimate. As a result, it’s one of the first algorithms learned within data
science and machine learning courses.
2. Scales well: Compared to logistic regression, Naïve Bayes is considered
a fast and efficient classifier that is fairly accurate when the conditional
independence assumption holds. It also has low storage requirements.
3. Can handle high-dimensional data: Use cases, such as document
classification, can have a high number of dimensions, which can be
difficult for other classifiers to manage.
Disadvantages
1. Subject to zero frequency: Zero frequency occurs when a categorical
variable value does not exist within the training set. For example, imagine that
we’re trying to find the maximum-likelihood estimate for the word “sir”
given the class “spam”, but the word “sir” doesn’t exist in the training data.
The probability in this case would be zero, and since this classifier
multiplies all the conditional probabilities together, the
posterior probability will also be zero. To avoid this issue, Laplace smoothing
can be used.
2. Unrealistic core assumption: While the conditional independence
assumption overall performs well, the assumption does not always hold,
leading to incorrect classifications.
Issues Relevant to Naïve Bayes
1. Zero conditional probability problem
– Such a problem exists when no training example of class ci contains the attribute
value ajk, so that P(Xj = ajk | C = ci) = 0

– In this circumstance, during testing the whole product
[P(a1|ci)…P(ajk|ci)…P(an|ci)]P(ci) = 0, no matter what the other probabilities are

– An approach to overcome this ‘zero-frequency problem’ in
a Bayesian environment is to add one to the count for
every attribute value-class combination when an attribute
value doesn’t occur with every class value.
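A minimal sketch of this add-one (Laplace) correction, using the Outlook counts for Play=No from the tennis table above (3 Sunny, 0 Overcast, 2 Rain):

counts = {"Sunny": 3, "Overcast": 0, "Rain": 2}   # Overcast never occurs with class "No"
n_class = sum(counts.values())                    # 5 examples of class "No"
n_values = len(counts)                            # 3 possible attribute values

# Unsmoothed estimate: P(Overcast|No) = 0/5 = 0, which wipes out the whole product.
# Add-one smoothing: add 1 to every value-class count and n_values to the denominator.
smoothed = {v: (counts[v] + 1) / (n_class + n_values) for v in counts}
print(smoothed)   # {'Sunny': 0.5, 'Overcast': 0.125, 'Rain': 0.375}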

*
Another Problem: Continuous-valued Input Attributes
• What to do in such a case?
– An attribute can take infinitely many (continuous) values
– The conditional probability is then modelled with a normal (Gaussian) distribution:
P(Xj | C = ci) = (1 / (√(2π) σji)) exp(−(Xj − µji)² / (2σji²))

– Learning Phase: estimate the mean µji and standard deviation σji of Xj for each class ci
Output: normal distributions and the class priors P(C = ci)

– Test Phase:
1. Calculate the conditional probabilities with all the normal distributions
2. Apply the MAP rule to make a decision
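A minimal sketch of how a continuous attribute could be handled: estimate a per-class mean and standard deviation, then evaluate the normal density at test time (the temperature-like values below are hypothetical).

import math

def gaussian_pdf(x, mu, sigma):
    # Normal density, used as the conditional probability P(Xj | C = ci)
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Learning phase: per-class (mean, std) estimates for one continuous attribute (hypothetical values).
params = {"Yes": (21.0, 3.0), "No": (27.0, 4.0)}

# Test phase: plug the observed value into each class's density, then apply the MAP rule.
x = 25.0
for c, (mu, sigma) in params.items():
    print(c, gaussian_pdf(x, mu, sigma))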
*
Dataset

Prior Probability
Types of Naive Bayes
1. Gaussian Naïve Bayes (GaussianNB): This is a variant of the Naïve
Bayes classifier, which is used with Gaussian distributions—i.e. normal
distributions—and continuous variables. This model is fitted by finding
the mean and standard deviation of each class.
2. Multinomial Naïve Bayes (MultinomialNB): This type of Naïve Bayes
classifier assumes that the features are from multinomial distributions.
This variant is useful when using discrete data, such as frequency
counts, and it is typically applied within natural language processing use
cases, like spam classification.
3. Bernoulli Naïve Bayes (BernoulliNB): This is another variant of the
Naïve Bayes classifier, which is used with Boolean variables—that is,
variables with two values, such as True and False or 1 and 0.
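The three variants above are available in scikit-learn; here is a small sketch with toy data invented for illustration:

import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_cont = np.array([[1.2, 3.4], [0.8, 2.9], [5.1, 6.0], [4.9, 5.8]])   # continuous features
X_counts = np.array([[3, 0, 1], [2, 1, 0], [0, 4, 2], [1, 3, 3]])     # frequency counts
X_bool = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])       # Boolean features
y = np.array([0, 0, 1, 1])

print(GaussianNB().fit(X_cont, y).predict([[1.0, 3.0]]))      # Gaussian NB: continuous data
print(MultinomialNB().fit(X_counts, y).predict([[2, 1, 0]]))  # Multinomial NB: count data
print(BernoulliNB().fit(X_bool, y).predict([[1, 0, 1]]))      # Bernoulli NB: Boolean data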
Conclusion on classifiers
• Naïve Bayes is based on the independence assumption
– Training is very easy and fast; just requiring considering each attribute in
each class separately
– Test is straightforward; just looking up tables or calculating conditional
probabilities with normal distributions

• Naïve Bayes is a popular generative classifier model


1. The performance of Naïve Bayes is competitive with most state-of-the-art
classifiers even when the independence assumption is violated
2. It has many successful applications, e.g., spam mail filtering
– It is a good candidate as a base learner in ensemble learning

*
Question on Naïve Bayes

*
Hom          Yes   No
yes          1/4   3/6
No           3/4   3/6

Status       Yes   No
Employed     1/4   3/6
Business     2/4   2/6
Unemployed   1/4   1/6

Income       Yes   No
High         1/4   3/6
Average      2/4   1/6
Low          1/4   2/6

P(Class=No) =6/10 P(Class=Yes) =4/10


*
Hom          Yes   No
yes          1/4   3/6
No           3/4   3/6

Status       Yes   No
Employed     1/4   3/6
Business     2/4   2/6
Unemployed   1/4   1/6

Income       Yes   No
High         1/4   3/6
Average      2/4   1/6
Low          1/4   2/6

P(Class=No) =6/10 P(Class=Yes) =4/10

X = (‘Homemaker’, ’Employed’, ‘Average’)

P(Class=Yes|X) ∝ P(Homemaker|Yes) * P(Employed|Yes) * P(Average|Yes) * P(Class=Yes)
= 1/4 * 1/4 * 2/4 * 4/10 = 1/80 = 0.0125

P(Class=No|X) ∝ P(Homemaker|No) * P(Employed|No) * P(Average|No) * P(Class=No)
= 3/6 * 3/6 * 1/6 * 6/10 = 1/40 = 0.025

Since P(Class=No|X) > P(Class=Yes|X),

the predicted class is Class = No.
Normalising the two scores gives P(Class=Yes|X) = 0.0125/(0.0125+0.025) = 0.0125/0.0375 = 0.333
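A quick Python check of the arithmetic above:

p_yes = (1/4) * (1/4) * (2/4) * (4/10)          # 0.0125
p_no  = (3/6) * (3/6) * (1/6) * (6/10)          # 0.025
print(round(p_yes, 4), round(p_no, 4), p_no > p_yes)   # 0.0125 0.025 True -> Class = No
print(p_yes / (p_yes + p_no))                   # ≈ 0.333, the normalised P(Class=Yes|X)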
*
*
Code for Naïve Bayes
from sklearn.datasets import load_iris

iris = load_iris()
# store the feature matrix (X) and response vector (y)
X = iris.data
y = iris.target

#splitting X and y into training and testing sets


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

#training the model on training set


from sklearn.naive_bayes import GaussianNB
gnb=GaussianNB()
gnb.fit(X_train, y_train)

# making predictions on the testing set


y_pred = gnb.predict(X_test)

# comparing actual response values (y_test) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)
