Naïve Bayes

Overview
• Naïve Bayes Classifier
• Components and their function
• Independence assumption
• Dealing with missing data
• Continuous Example
• Discrete Example
• Advantages / Disadvantages
Bayesian Classification
• Goal: learn a function f: x → y
• y … one of k classes (e.g. spam/ham, digits 0–9)
• x = x1,…,xd … values of attributes (numerical or categorical)
• Probabilistic classification:
  • Most probable class given the observation: y* = arg max_y P(y|x)
• Bayesian probability of a class:

  P(y|x) = P(x|y) P(y) / Σ_y' P(x|y') P(y')

  (numerator: class model × prior; denominator: normaliser over classes)
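A minimal Python sketch of this computation, using made-up priors and likelihood values (not from the slides) just to show how the posterior and the arg max fall out of the formula:

  # Minimal Bayes-rule classification sketch (illustrative numbers only).
  priors = {"spam": 0.5, "ham": 0.5}          # P(y)
  likelihoods = {"spam": 0.02, "ham": 0.001}  # P(x|y) for one observed x

  # Denominator: sum over classes y' of P(x|y') P(y')
  evidence = sum(likelihoods[y] * priors[y] for y in priors)

  # Posterior P(y|x) for each class, then the most probable class
  posteriors = {y: likelihoods[y] * priors[y] / evidence for y in priors}
  prediction = max(posteriors, key=posteriors.get)
  print(posteriors, prediction)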
Bayesian classification: components

  P(y|x) = P(x|y) P(y) / Σ_y' P(x|y') P(y')

Example: y … Irish patient has Ebola; x … observed symptoms
• P(y): prior probability of each class
  • Encodes which classes are common, which are rare
  • A priori it is much more likely to have something other than Ebola
• P(x|y): class-conditional model
  • Describes how likely it is to see observation x for class y
  • Assuming it is Ebola, do the symptoms look plausible?
• P(x): normalising constant (probability of the observation)
  • Does not affect which class is most likely, because we are looking for the arg max
Naïve Bayes: a generative model
• A complete probability distribution for each class
  • Definite likelihood for any point x
  • P(class) via P(observation): P(y|x) ∝ P(x|y) P(y)
• Can "generate" synthetic observations
  • Will share some properties of the original data
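To illustrate the "generate" point, a small sketch that samples synthetic observations from a hypothetical one-attribute Gaussian model per class (the priors, means and standard deviations are placeholders, not values from the lecture):

  import random

  class_priors = {"adult": 0.25, "child": 0.75}                   # P(y), assumed
  class_models = {"adult": (175.0, 10.0), "child": (120.0, 20.0)}  # (mean, std) of height, assumed

  def generate(n):
      # Pick a class from the prior, then draw the attribute from that class's Gaussian.
      samples = []
      for _ in range(n):
          y = random.choices(list(class_priors), weights=list(class_priors.values()))[0]
          mu, sigma = class_models[y]
          samples.append((y, random.gauss(mu, sigma)))
      return samples

  print(generate(5))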
Independence Assumption (this is when it becomes naïve)
• Compute P(x1,…,xd|y) for every observation x1,…,xd
  • Class-conditional "counts" based on training data
• Problem: we may not have seen every combination x1,…,xd for every y
  • Digits: 2^400 possible black/white patterns (20 × 20 pixels)
  • Spam: every possible combination of words: 2^10,000
  • We often do have observations of the individual xi for every class
• Idea: assume x1,…,xd conditionally independent given y

  P(x1,…,xd | y) = Π_{i=1}^d P(xi | x1,…,x_{i−1}, y)   (chain rule)
                 = Π_{i=1}^d P(xi | y)                  (conditional independence)
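In code, this factorisation is just a product of per-attribute probabilities; a common trick (not required by the slides) is to sum logs instead, so the product of many small numbers does not underflow. The probabilities below are placeholders:

  import math

  # Hypothetical P(x_i | y) values for one class y and one observation x = (x_1, ..., x_d)
  per_attribute_probs = [0.8, 0.1, 0.95, 0.6]

  # Naive Bayes: P(x_1,...,x_d | y) = product_i P(x_i | y); summing logs is the stable equivalent
  log_likelihood = sum(math.log(p) for p in per_attribute_probs)
  likelihood = math.exp(log_likelihood)
  print(likelihood)  # same value as 0.8 * 0.1 * 0.95 * 0.6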


Conditional Independence
• The probabilities of going to the beach and of getting heatstroke are not independent, i.e. P(B,S) > P(B)P(S)
• However, they may be independent if we know the weather is hot, i.e. P(B,S|H) = P(B|H)P(S|H)
• Hot weather "explains" all the dependence between beach and heatstroke
• In classification, the class value explains all the dependence

[Diagram: a "hot weather" node linked to "go to the beach", "getting heatstroke" and "popsicle melts"; analogously, a "3"-ness class node linked to "pixel x1 is black", …, "pixel xi is black", …, "pixel xd is black"]
Continuous Example
• Distinguish children from adults based on size
• Classes: {a, c}; attributes: height [cm], weight [kg]
• Training examples: {hi, wi, yi}, 4 adults, 12 children
• First thing to estimate is the prior
  • Class probabilities: P(a) = 4/(4+12) = 0.25; P(c) = 0.75
• Assuming they are an adult, what is the model for the height and the weight?
  • Height and weight are real numbers, so we cannot simply count how many adults are precisely 181.5 cm tall, for example, as there might be none!
  • So assume a continuous model: heights come from a Gaussian distribution. Given its mean and variance the Gaussian is completely described, and we can get a probability density for any height (and likewise for weight, assuming the same kind of model).
Continuous Example
• Model for adults:
  • Height ~ Gaussian, with mean and variance estimated as:

    μ_{h,a} = (1/4) Σ_{i: yi=a} hi
    σ²_{h,a} = (1/4) Σ_{i: yi=a} (hi − μ_{h,a})²

  • Weight ~ Gaussian with μ_{w,a}, σ²_{w,a}
  • Assume height and weight independent
• Model for children:
  • Same, but labelled μ_{h,c}, σ²_{h,c} and μ_{w,c}, σ²_{w,c}
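A sketch of this parameter estimation, assuming a small made-up set of (height, weight, label) training examples, since the actual data points are not listed in the slides:

  import statistics

  # Hypothetical training rows (height_cm, weight_kg, label); the lecture uses 4 adults and 12 children
  data = [(180, 80, "a"), (170, 75, "a"), (120, 25, "c"), (110, 20, "c"), (130, 30, "c")]

  def fit_class(label):
      # Per-class mean and population variance (divide by n, as in the slide formulas)
      heights = [h for h, w, y in data if y == label]
      weights = [w for h, w, y in data if y == label]
      return {"mu_h": statistics.mean(heights), "var_h": statistics.pvariance(heights),
              "mu_w": statistics.mean(weights), "var_w": statistics.pvariance(weights)}

  models = {y: fit_class(y) for y in ("a", "c")}
  priors = {y: sum(1 for _, _, lbl in data if lbl == y) / len(data) for y in ("a", "c")}
  print(models, priors)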
Continuous Example
P(a) = 4/(4+12) = 0.25; P(c) = 0.75

[Scatter plot: height (cm) vs weight (kg), both axes up to 200, showing the adult and child training points, the fitted per-class Gaussians for height and weight (μ, σ²), and a new query point X]
Continuous Example
P(a) = 4/(4+12) = 0.25; P(c) = 0.75
• Class-conditional densities for a new point x with height h_x and weight w_x:

  p(h_x|c) = 1/√(2π σ²_{h,c}) · exp(−(h_x − μ_{h,c})² / (2 σ²_{h,c}))
  p(w_x|c) = 1/√(2π σ²_{w,c}) · exp(−(w_x − μ_{w,c})² / (2 σ²_{w,c}))
  p(h_x|a) = 1/√(2π σ²_{h,a}) · exp(−(h_x − μ_{h,a})² / (2 σ²_{h,a}))
  p(w_x|a) = 1/√(2π σ²_{w,a}) · exp(−(w_x − μ_{w,a})² / (2 σ²_{w,a}))

• P(x|a) = p(h_x|a) · p(w_x|a)
• P(x|c) = p(h_x|c) · p(w_x|c)
• Posterior:

  P(a|x) = P(x|a) P(a) / (P(x|a) P(a) + P(x|c) P(c))

[Scatter plot as above, with the query point X marked]
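A sketch that puts the pieces above together for a new query point; the fitted means, variances and the query values are placeholders rather than numbers from the lecture:

  import math

  def gaussian_pdf(x, mu, var):
      # Normal density N(x; mu, var)
      return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

  # Hypothetical fitted parameters per class (a = adult, c = child)
  params = {"a": {"mu_h": 175, "var_h": 100, "mu_w": 78, "var_w": 64},
            "c": {"mu_h": 120, "var_h": 225, "mu_w": 25, "var_w": 36}}
  priors = {"a": 0.25, "c": 0.75}

  h_x, w_x = 150, 45  # new query point (height, weight), assumed

  # P(x|y) = p(h_x|y) * p(w_x|y) under the independence assumption
  likelihood = {y: gaussian_pdf(h_x, p["mu_h"], p["var_h"]) * gaussian_pdf(w_x, p["mu_w"], p["var_w"])
                for y, p in params.items()}

  evidence = sum(likelihood[y] * priors[y] for y in priors)
  posterior = {y: likelihood[y] * priors[y] / evidence for y in priors}
  print(posterior)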
Discrete example: spam
• Separate spam from valid email; attributes = words. P(spam) = 4/6, P(ham) = 2/6
  D1: "send us your password"  spam
  D2: "send us your review"    ham
  D3: "review your password"   ham
  D4: "review us"              spam
  D5: "send us your password"  spam
  D6: "send us your account"   spam

  Word       P(word|spam)   P(word|ham)
  password   2/4            1/2
  review     1/4            2/2
  send       3/4            1/2
  us         3/4            1/2
  your       3/4            1/2
  account    1/4            0/2

New email: "review us now"

P(review us|spam) = P(0,1,0,1,0,0|spam)
  = (1 − 2/4) · (1/4) · (1 − 3/4) · (3/4) · (1 − 3/4) · (1 − 1/4) ≈ 0.0044
P(review us|ham) = P(0,1,0,1,0,0|ham)
  = (1 − 1/2) · (2/2) · (1 − 1/2) · (1/2) · (1 − 1/2) · (1 − 0/2) = 0.0625

P(ham|review us) = 0.0625 × (2/6) / (0.0625 × (2/6) + 0.0044 × (4/6)) ≈ 0.87
(note: the identical email "review us" appears in the training data as spam, D4)
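A sketch that reproduces the hand calculation above; the vocabulary, per-word probabilities and priors are read straight off the table, and "now" is ignored because it is not in the vocabulary:

  vocab = ["password", "review", "send", "us", "your", "account"]

  p_word = {  # P(word present | class), from the table above
      "spam": {"password": 2/4, "review": 1/4, "send": 3/4, "us": 3/4, "your": 3/4, "account": 1/4},
      "ham":  {"password": 1/2, "review": 2/2, "send": 1/2, "us": 1/2, "your": 1/2, "account": 0/2},
  }
  priors = {"spam": 4/6, "ham": 2/6}

  new_email = {"review", "us", "now"}  # "now" is not in the vocabulary, so it plays no role

  def likelihood(cls):
      # P(x|cls): product over vocabulary words of P(present|cls) or P(absent|cls)
      p = 1.0
      for w in vocab:
          p *= p_word[cls][w] if w in new_email else (1 - p_word[cls][w])
      return p

  evidence = sum(likelihood(c) * priors[c] for c in priors)
  posterior = {c: likelihood(c) * priors[c] / evidence for c in priors}
  print(posterior)  # ham comes out around 0.87, matching the slide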
Good examples on Wikipedia
• https://en.wikipedia.org/wiki/Naive_Bayes_classifier
Problems with Naïve Bayes
• Zero-frequency problem
• Any mail containing “account” is spam: P(account|ham)=0/2
• Solution: never allow zero frequencies
• Laplace smoothing: add a small positive number to all counts:

  P(w|c) = (num(w,c) + ε) / (num(c) + 2ε)

• May use global statistics in place of ε: num(w)/num
• Very common problem (Zipf’s law: 50% of words occur once)
• Assumes word independence
• Every word contributes independently to P(spam|email)
• Fooling NB: add lots of “hammy” words into spam email
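A tiny sketch of the smoothed estimate, applied to the zero-count case from the slide (here ε is simply assumed to be 1):

  def smoothed(count, class_total, eps=1.0):
      # Laplace smoothing: P(w|c) = (num(w,c) + eps) / (num(c) + 2*eps)
      return (count + eps) / (class_total + 2 * eps)

  print(smoothed(0, 2))  # P(account|ham): 0/2 becomes 1/4 instead of 0
  print(smoothed(1, 4))  # P(account|spam): 1/4 becomes 2/6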
Missing Data
• Suppose we don't have a value for some attribute Xj
  • Applicant's credit history unknown
  • Some medical test not performed on the patient
• How to compute P(X1=x1, …, Xj=?, …, Xd=xd | y)?
• Easy with Naïve Bayes:

  P(x1, …, Xj, …, xd | y) = Π_{i≠j} P(xi | y)

• Ignore the attribute in instances where its value is missing
  • Compute the likelihood based on the observed attributes only
  • No need to "fill in" or explicitly model missing values
  • Based on conditional independence between attributes
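A sketch of this rule: attributes whose value is unknown (None below) are simply skipped in the product. The attribute tables are placeholders:

  # Hypothetical P(x_i = value | y) tables for three attributes and one class y
  attr_probs = [{"high": 0.7, "low": 0.3},
                {"yes": 0.6, "no": 0.4},
                {"red": 0.2, "blue": 0.8}]

  observation = ["high", None, "blue"]  # the second attribute is missing

  # Conditional independence lets us just drop the missing attribute
  likelihood = 1.0
  for probs, value in zip(attr_probs, observation):
      if value is not None:
          likelihood *= probs[value]
  print(likelihood)  # 0.7 * 0.8 = 0.56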
Missing Data II
• Example, three coin tosses: Event = {X1=H, X2=?, X3=T}
  • Event = head, unknown (either head or tail), tail
  • Event = {H,H,T} + {H,T,T}
  • P(event) = P(H,H,T) + P(H,T,T)
• General case: Xj has a missing value

  Σ_{xj} P(x1, …, xj, …, xd | y) = Σ_{xj} P(x1|y) … P(xj|y) … P(xd|y)
                                 = P(x1|y) … (Σ_{xj} P(xj|y)) … P(xd|y)
                                 = P(x1|y) … 1 … P(xd|y)
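A quick numeric check of the coin example, assuming independent tosses of a coin with P(H) = 0.6 (the slide does not fix the bias): summing over the unknown middle toss gives the same answer as dropping it.

  p_h = 0.6            # assumed P(heads)
  p_t = 1 - p_h

  p_marginalised = p_h * p_h * p_t + p_h * p_t * p_t   # P(H,H,T) + P(H,T,T)
  p_dropped = p_h * p_t                                 # just ignore the unknown toss
  print(p_marginalised, p_dropped)  # both 0.24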
