
Developing the Naïve Bayes Method

• The goal:
• To be able to classify new records using P(Y=1 | X1,…,Xp)
based on “simple” probabilities, that is, probabilities that
are based on a single predictor only – and are therefore
easy to obtain from the data
• The tools to get there:
• Conditional probabilities – Bayes’ theorem
• Simplifying assumptions
• Algebra

1
Developing the Naïve Bayes Method
• Step 1:

$$P(Y=1 \mid X_1,\ldots,X_p) \;=\; \frac{P(X_1,\ldots,X_p \mid Y=1)\,P(Y=1)}{P(X_1,\ldots,X_p)}$$

Conditional probabilities – Bayes Rule:


Let A = the event “customer accepts a loan” (Loan=1)
Let B = the event “customer has a credit card” (CC=1)
P(A|B) = probability of A given B (the conditional probability that A occurs given
that B occurred), or the probability that the customer accepts a loan given that
she has a credit card.
Bayes’ rule states: $P(A \mid B) \;=\; \dfrac{P(A \cap B)}{P(B)} \;=\; \dfrac{P(B \mid A)\,P(A)}{P(B)}$, if $P(B) > 0$
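A tiny numeric illustration of Bayes’ rule in R (a sketch only; the counts are taken from the loan / credit card pivot table shown a few slides below):

# Sketch: Bayes' rule with the loan / credit-card counts (3000 training customers)
p_A   <- 279/3000            # P(A)   = P(accept)
p_B   <- 883/3000            # P(B)   = P(CC = 1)
p_B_A <- 79/279              # P(B|A) = P(CC = 1 | accept)
p_A_B <- p_B_A * p_A / p_B   # Bayes' rule
p_A_B                        # 79/883, about 0.089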

2
Developing the Naïve Bayes Method
• Step 2:

$$P(Y=1 \mid X_1,\ldots,X_p) \;=\; \frac{P(X_1,\ldots,X_p \mid Y=1)\,P(Y=1)}{P(X_1,\ldots,X_p)} \;\approx\; \frac{P(X_1 \mid Y=1)\,P(X_2 \mid Y=1)\cdots P(X_p \mid Y=1)\,P(Y=1)}{P(X_1,\ldots,X_p)}$$

Rules of probability:
P(X1,…,Xp | Y=1) = P(X1|Y=1) · P(X2|Y=1) ··· P(Xp|Y=1)
if the events X1,…,Xp are independent (given Y=1).
This is usually not true! But we will assume it is true!
Note: the P(Y=1) term is simply the fraction of class 1 records in the sample.

3
Developing the Naïve Bayes Method
• Step 3:

$$P(Y=1 \mid X_1,\ldots,X_p) \;\approx\; \frac{P(X_1 \mid Y=1)\,P(X_2 \mid Y=1)\cdots P(X_p \mid Y=1)\,P(Y=1)}{P(X_1,\ldots,X_p)} \;\approx\; \frac{P(X_1 \mid Y=1)\,P(X_2 \mid Y=1)\cdots P(X_p \mid Y=1)\,P(Y=1)}{\sum_{i=1}^{k} P(X_1 \mid Y=i)\,P(X_2 \mid Y=i)\cdots P(X_p \mid Y=i)\,P(Y=i)}$$

Rules of probability (the law of total probability):
P(X1,…,Xp )=P(X1,…,Xp |Y=1) P(Y=1)+…+P(X1,…,Xp |Y=k) P(Y=k)
We then apply the simplifying assumption again:
P(X1,…,Xp |Y=1) P(Y=1)+…+P(X1,…,Xp |Y=k) P(Y=k)
≈ P(X1|Y=1) P(X2|Y=1) … P(Xp |Y=1) P(Y=1)+…+ P(X1|Y=k) P(X2|Y=k) … P(Xp |Y=k) P(Y=k)
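A minimal numeric sketch of this normalization step (hypothetical likelihood and prior values, k = 2, just to show the arithmetic):

# Sketch: normalizing naive-Bayes numerators over k = 2 classes
prior      <- c(0.90, 0.10)              # P(Y = i), hypothetical values
likelihood <- c(0.02, 0.15)              # prod_j P(Xj | Y = i) for one record, hypothetical
numerator  <- likelihood * prior         # approximate P(X1,...,Xp | Y=i) * P(Y=i)
posterior  <- numerator / sum(numerator) # divide by the summed denominator
posterior                                # approximate P(Y = i | X1,...,Xp)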

4
The Simplifying Assumption
• In our example, the simplifying assumption substitutes
• P(CC=1, Online=1 | accept) with
• P(CC=1 | accept) * P(Online=1 | accept)
• Based on the data (see the pivot table below and the R sketch after it):
• P(CC=1, Online=1 | accept) = 48/279 = 0.172
• P(CC=1 | accept) * P(Online=1 | accept) = (79/279) * ((130+48)/279) = 0.181

Count of Personal Loan, training data (“the universe”):

                                 Online
CreditCard    Personal Loan         0       1   Grand Total
0             0                   794    1123          1917
0             1                    70     130           200
0 Total                           864    1253          2117
1             0                   323     481           804
1             1                    31      48            79
1 Total                           354     529           883
Grand Total                      1218    1782          3000
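The same numbers, computed from the counts in the table (a sketch):

# Sketch: checking the simplifying assumption with the counts above
n_accept  <- 200 + 79                 # customers who accepted the loan
p_joint   <- 48 / n_accept            # P(CC=1, Online=1 | accept) = 0.172
p_cc      <- 79 / n_accept            # P(CC=1 | accept)
p_online  <- (130 + 48) / n_accept    # P(Online=1 | accept)
c(joint = p_joint, product = p_cc * p_online)   # 0.172 vs 0.181: close, not identical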

5
The Simplifying Assumption
• The simplifying assumption means that we assume conditional
independence between CC and Online
• It is not a perfect approximation
• If the conditional dependence is not extreme, it works reasonably well
• One reason it works well is that the exact probability often doesn’t
matter much; what matters is the ranking (order) of the probabilities of a
new record belonging to the different classes

6
Running Naïve Bayes in R

• Use the naiveBayes() function in library e1071

# Requires the library e1071
library(e1071)
# Can handle both categorical and numeric input,
# but the output (target) must be categorical
model <- naiveBayes(Personal.Loan ~ CreditCard + Online,
                    data = dftrain)
prediction <- predict(model, newdata = dfvalidation[,-8])
table(dfvalidation$Personal.Loan, prediction,
      dnn = list('actual', 'predicted'))
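A quick way to summarize how well the predictions match the validation labels (a small sketch; it assumes the model, prediction, and dfvalidation objects created above):

# Sketch: overall accuracy on the validation set,
# using the objects created in the code above
accuracy <- mean(prediction == dfvalidation$Personal.Loan)
accuracy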

7
Running Naïve Bayes in R
• Some output options
# Running the model returns a list, and we can examine attributes
# For the a priori class distribution
model$apriori
#
# For class probabilities
predicted.probability <- predict(model, newdata = dfvalidation[,-8],
type="raw")

8
Output
table(dfvalidation$Personal.Loan, prediction,
      dnn = list('actual', 'predicted'))
predicted
actual 0 1
0 1808 0
1 192 0
model$apriori
Y
0 1
2712 288

9
Output

How does this look?

10
From the output we see that
1. The resulting model is the same as the naïve
rule
2. The resulting model is the opposite of the
naïve rule
3. The model is better than the naïve rule
4. Both 2 and 3
5. None of the above

11
Recall: The Naïve Rule
• Classify a new observation as a member of the majority class
• In the personal loan example, the majority of customers did not
accept the loan
• This will be our baseline
• Not a very strong one!

12
Output using all the variables
• Can we learn anything from the output?
table(dfvalidation$Personal.Loan, prediction,
      dnn = list('actual', 'predicted'))
predicted
actual 0 1
0 1651 157
1 80 112
model$apriori
Y
0 1
2712 288
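A small sketch that turns the confusion matrix above into summary numbers (the counts are taken directly from the printed output):

# Sketch: summary numbers from the confusion matrix above
accuracy    <- (1651 + 112) / (1651 + 157 + 80 + 112)   # ~0.88 overall
sensitivity <- 112 / (80 + 112)                         # ~0.58 of acceptors identified
c(accuracy = accuracy, sensitivity = sensitivity)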

13
Output using all the variables

14
Continuous Predictors
• You will notice that the most recent model includes continuous
predictors
• How does this work?
• For a continuous predictor, Xi, it is unlikely that we can find a case in the
training set with exactly the same value (so the previously outlined approach
won’t work)
• It is assumed P(Xi|Class = C) has some probability distribution (e.g.
Normal) and a density is fit to the data for each class C
• The rest of the calculations are as before (see the sketch below)
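A minimal sketch of the idea, with made-up class means and standard deviations (e1071's naiveBayes estimates these per class for numeric predictors):

# Sketch: Gaussian class-conditional densities for one continuous predictor
# (hypothetical means/sds, for illustration only)
x_new  <- 0.4
prior  <- c(0.55, 0.45)                          # P(Y=0), P(Y=1)
dens_0 <- dnorm(x_new, mean = -0.30, sd = 0.45)  # fitted density for class 0
dens_1 <- dnorm(x_new, mean =  0.25, sd = 0.45)  # fitted density for class 1
num    <- c(dens_0, dens_1) * prior
num / sum(num)                                   # approximate P(class | x_new)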

15
Example in One Dimension
• Recall our 1-dimensional example

        x   class
 -0.39701       0
 -0.11216       0
 -0.08226       0
 -0.83098       0
 0.172896       0
 -0.23603       0
 -0.99109       0
  -0.0739       0
 -0.91048       0
 0.777112       0
 -0.32008       0
 0.521552       1
 -0.70176       1
 0.397391       1
  0.36739       1
 0.098193       1
 0.813719       1
 -0.27989       1
 0.631199       1
 0.378386       1

16
Naïve Bayes in One Dimension
• Step 1: Binning the continuous data (same data as on the previous slide)

setwd("C:/Users/KPrasad/Desktop/BUDT 758T/Etc")
df <- read.csv("Example 1.csv")
df$class <- as.factor(df$class)
#
# We will first bin the continuous variable x
df$group <- as.numeric(cut(df$x, c(-1, -0.35, -0.075, 0.38, 1)))
df$group <- as.factor(df$group)
table(df$group)

##
## 1 2 3 4
## 5 5 5 5

17
Naïve Bayes in One Dimension
• Step 2: Calculating Probabilities
• By hand, it is easiest to use data (pivot) tables

Count of class     group
class                  1        2        3        4   Grand Total
0                 36.36%   36.36%   18.18%    9.09%       100.00%
1                 11.11%   11.11%   33.33%   44.44%       100.00%
Grand Total       25.00%   25.00%   25.00%   25.00%       100.00%

• R calculates the probabilities for us
(for example, the entry in row 0, column 1 is P(X=1 | Y=0) = 0.3636;
a sketch reproducing these tables follows this slide)

## A-priori probabilities:
## Y
##    0    1
## 0.55 0.45
##
## Conditional probabilities:
##    group
## Y            1          2          3          4
##   0 0.36363636 0.36363636 0.18181818 0.09090909
##   1 0.11111111 0.11111111 0.33333333 0.44444444
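The same tables can be reproduced directly from the binned data frame (a sketch; assumes df and df$group from the binning step):

# Sketch: reproducing the probabilities from the binned data frame
prop.table(table(df$class))                # a-priori probabilities: 0.55 / 0.45
prop.table(table(df$class, df$group), 1)   # conditional P(group | class), rows sum to 1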

18
Naïve Bayes in One Dimension
• Step 3: Classifying a new record
• For Bin 1:
• PC1 = 0.45 * 0.1111 = 0.05      PC0 = 0.55 * 0.363636 = 0.2
• P(record is in class 1 | Bin = 1) = 0.05 / (0.05 + 0.2) = 0.2
• P(record is in class 0 | Bin = 1) = 0.2 / (0.05 + 0.2) = 0.8

## A-priori probabilities:
## Y
##    0    1
## 0.55 0.45
##
## Conditional probabilities:
##    group
## Y            1          2          3          4
##   0 0.36363636 0.36363636 0.18181818 0.09090909
##   1 0.11111111 0.11111111 0.33333333 0.44444444

• Can generate the probabilities directly as well:
prediction.probs <- predict(model, newdata = df[,-2], type="raw")
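For reference, the Bin 1 hand calculation as a tiny R sketch (numbers taken from the output above):

# Sketch: Bin 1 posterior by hand
pc1 <- 0.45 * 0.11111111                                   # prior(1) * P(group=1 | class 1)
pc0 <- 0.55 * 0.36363636                                   # prior(0) * P(group=1 | class 0)
c(class1 = pc1 / (pc1 + pc0), class0 = pc0 / (pc1 + pc0))  # 0.2 and 0.8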


19
Naïve Bayes in One Dimension
• Step 3: Classifying a new record

                  Class 1             Class 0
Binned_x     Value     Prob      Value     Prob       Pc1     Pc0   P(class 1)   P(class 0)
                 1   0.1111          1   0.3636      0.05    0.20          0.2          0.8
                 2   0.1111          2   0.3636      0.05    0.20          0.2          0.8
                 3   0.3333          3   0.1818      0.15    0.10          0.6          0.4
                 4   0.4444          4   0.0909      0.20    0.05          0.8          0.2

The decision boundary lies between bins 2 and 3, where the predicted class switches from 0 to 1.

20
Naïve Bayes with circular data
• Continuous data binned into 2 bins (x1=1,2 and x2=1,2)
• There is one prediction per section
• Class 0 is more prevalent in all sections and overall – we predict class 0
everywhere
Example:
P(x1=1|Y=1)*P(x2=1|Y=1)P(Y=1)=0.137
P(x1=1|Y=0)*P(x2=1|Y=0)P(Y=0)=0.185
P(new record is class 1)
=0.137/(0.137+0.185)=0.424
P(new record is class 0)
=0.185/(0.137+0.185)=0.576

21
Naïve Bayes with circular data
• Continuous data binned into 3 bins
• We now predict different classes for each section

Class 1 Class 0 Class 1

Class 0 Class 0 Class 0

Class 1 Class 0 Class 1

22
Naïve Bayes with circular data
• Continuous data binned into 4 bins
• We no longer have data in each section: at least one section has no data,
and another has only a single data point

Predicted classes by section (4 x 4 grid):
Class 1   Class 1   Class 0   Class 1
Class 0   Class 0   Class 0   Class 0
Class 0   Class 0   Class 0   Class 1
Class 1   Class 0   Class 0   Class 1

23
Advantages and Disadvantages
• The good
• Simple
• Can handle a large number of predictors (even n < p)
• High accuracy when the goal is ranking
• Pretty robust to violations of the independence assumption!

• The bad
• Needs lots of data for good performance
• Predictors with “rare” categories → zero probability estimates
• If such a category is important, this can be a problem (see the smoothing sketch below)
• Gives biased probability estimates of class membership
• No insight into the importance/role of each predictor
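One common remedy for the zero-probability issue is Laplace smoothing. e1071's naiveBayes has a laplace argument for this; a minimal sketch, reusing the earlier training data (and assuming the categorical predictors are factors, since smoothing only affects factor predictors):

# Sketch: Laplace smoothing adds a small count to every category so that
# rare or unseen predictor levels no longer get probability zero
model.smooth <- naiveBayes(Personal.Loan ~ CreditCard + Online,
                           data = dftrain, laplace = 1)
model.smooth$tables   # conditional probabilities, now never exactly zero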
24
Summary
• We now have three methods in our classification toolbox
• Logistic regression
• Naïve Bayes
• Classification only
• Builds on a probability model
• KNN
• Classification or prediction
• Nonparametric
• The latter two are among the top 10 data mining algorithms!

25
