Naïve Bayes

Overview
• Naïve Bayes Classifier
• Components and their function
• Independence assumption
• Dealing with missing data
• Continuous Example
• Discrete Example
• Advantages / Disadvantages
Bayesian Classification
• Goal: learn a function f: x → y
• y … one of k classes (e.g. spam/ham, digits 0–9)
• x = x1,…,xd … values of attributes (numerical or categorical)
• Probabilistic classification:
  • Most probable class given the observation: y* = arg max_y P(y|x)
• Bayesian probability of a class:

  P(y|x) = P(x|y) P(y) / Σ_y' P(x|y') P(y')

  (numerator: class model × prior; denominator: normaliser over classes)
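A minimal Python sketch of this computation, using made-up priors and likelihood values (not from the slides) just to show how the posterior and the arg max fall out of the formula:

  # Minimal Bayes-rule classification sketch (illustrative numbers only).
  priors = {"spam": 0.5, "ham": 0.5}          # P(y)
  likelihoods = {"spam": 0.02, "ham": 0.001}  # P(x|y) for one observed x

  # Denominator: sum over classes y' of P(x|y') P(y')
  evidence = sum(likelihoods[y] * priors[y] for y in priors)

  # Posterior P(y|x) for each class, then the most probable class
  posteriors = {y: likelihoods[y] * priors[y] / evidence for y in priors}
  prediction = max(posteriors, key=posteriors.get)
  print(posteriors, prediction)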
Bayesian classification: components

  P(y|x) = P(x|y) P(y) / Σ_y' P(x|y') P(y')

Example: y … Irish patient has Ebola; x … observed symptoms
• P(y): prior probability of each class
  • Encodes which classes are common, which are rare
  • A priori it is much more likely to have something other than Ebola
• P(x|y): class-conditional model
  • Describes how likely it is to see observation x for class y
  • Assuming it is Ebola, do the symptoms look plausible?
• P(x): normalising constant (probability of the observation)
  • Does not affect which class is most likely, because we are looking for the arg max
Naïve Bayes: a generative model
• A complete probability distribution for each class
  • Definite likelihood for any point x
  • P(class) via P(observation): P(y|x) ∝ P(x|y) P(y)
• Can "generate" synthetic observations
  • Will share some properties of the original data
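To illustrate the "generate" point, a small sketch that samples synthetic observations from a hypothetical one-attribute Gaussian model per class (the priors, means and standard deviations are placeholders, not values from the lecture):

  import random

  class_priors = {"adult": 0.25, "child": 0.75}                   # P(y), assumed
  class_models = {"adult": (175.0, 10.0), "child": (120.0, 20.0)}  # (mean, std) of height, assumed

  def generate(n):
      # Pick a class from the prior, then draw the attribute from that class's Gaussian.
      samples = []
      for _ in range(n):
          y = random.choices(list(class_priors), weights=list(class_priors.values()))[0]
          mu, sigma = class_models[y]
          samples.append((y, random.gauss(mu, sigma)))
      return samples

  print(generate(5))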
Independence Assumption (this is when it becomes naïve)
• Compute P(x1,…,xd|y) for every observation x1,…,xd
  • Class-conditional "counts" based on training data
• Problem: we may not have seen every combination x1,…,xd for every y
  • Digits: 2^400 possible black/white patterns (20 × 20 pixels)
  • Spam: every possible combination of words: 2^10,000
  • We often do have observations of the individual xi for every class
• Idea: assume x1,…,xd conditionally independent given y

  P(x1,…,xd | y) = Π_{i=1}^d P(xi | x1,…,x_{i−1}, y)   (chain rule)
                 = Π_{i=1}^d P(xi | y)                  (conditional independence)
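In code, this factorisation is just a product of per-attribute probabilities; a common trick (not required by the slides) is to sum logs instead, so the product of many small numbers does not underflow. The probabilities below are placeholders:

  import math

  # Hypothetical P(x_i | y) values for one class y and one observation x = (x_1, ..., x_d)
  per_attribute_probs = [0.8, 0.1, 0.95, 0.6]

  # Naive Bayes: P(x_1,...,x_d | y) = product_i P(x_i | y); summing logs is the stable equivalent
  log_likelihood = sum(math.log(p) for p in per_attribute_probs)
  likelihood = math.exp(log_likelihood)
  print(likelihood)  # same value as 0.8 * 0.1 * 0.95 * 0.6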


Conditional Independence
• The probabilities of going to the beach and of getting heatstroke are not independent, i.e. P(B,S) > P(B)P(S)
• However, they may be independent if we know the weather is hot, i.e. P(B,S|H) = P(B|H)P(S|H)
• Hot weather "explains" all the dependence between beach and heatstroke
• In classification, the class value explains all the dependence

[Diagram: a "hot weather" node linked to "go to the beach", "getting heatstroke" and "popsicle melts"; analogously, a "3"-ness class node linked to "pixel x1 is black", …, "pixel xi is black", …, "pixel xd is black"]
Continuous Example
• Distinguish children from adults based on size
• Classes: {a, c}; attributes: height [cm], weight [kg]
• Training examples: {hi, wi, yi}, 4 adults, 12 children
• First thing to estimate is the prior
  • Class probabilities: P(a) = 4/(4+12) = 0.25; P(c) = 0.75
• Assuming they are an adult, what is the model for the height and the weight?
  • Height and weight are real numbers, so we cannot simply count how many adults are precisely 181.5 cm tall, for example, as there might be none!
  • So assume a continuous model: heights come from a Gaussian distribution. Given its mean and variance the Gaussian is completely described, and we can get a probability density for any height (and likewise for weight, assuming the same kind of model).
Continuous Example
• Model for adults:
  • Height ~ Gaussian, with mean and variance estimated as:

    μ_{h,a} = (1/4) Σ_{i: yi=a} hi
    σ²_{h,a} = (1/4) Σ_{i: yi=a} (hi − μ_{h,a})²

  • Weight ~ Gaussian with μ_{w,a}, σ²_{w,a}
  • Assume height and weight independent
• Model for children:
  • Same, but labelled μ_{h,c}, σ²_{h,c} and μ_{w,c}, σ²_{w,c}
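A sketch of this parameter estimation, assuming a small made-up set of (height, weight, label) training examples, since the actual data points are not listed in the slides:

  import statistics

  # Hypothetical training rows (height_cm, weight_kg, label); the lecture uses 4 adults and 12 children
  data = [(180, 80, "a"), (170, 75, "a"), (120, 25, "c"), (110, 20, "c"), (130, 30, "c")]

  def fit_class(label):
      # Per-class mean and population variance (divide by n, as in the slide formulas)
      heights = [h for h, w, y in data if y == label]
      weights = [w for h, w, y in data if y == label]
      return {"mu_h": statistics.mean(heights), "var_h": statistics.pvariance(heights),
              "mu_w": statistics.mean(weights), "var_w": statistics.pvariance(weights)}

  models = {y: fit_class(y) for y in ("a", "c")}
  priors = {y: sum(1 for _, _, lbl in data if lbl == y) / len(data) for y in ("a", "c")}
  print(models, priors)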
Continuous Example
P(a) = 4/(4+12) = 0.25; P(c) = 0.75

[Scatter plot: height (cm) vs weight (kg), both axes up to 200, showing the adult and child training points, the fitted per-class Gaussians for height and weight (μ, σ²), and a new query point X]
Continuous Example
P(a) = 4/(4+12) = 0.25; P(c) = 0.75
• Class-conditional densities for a new point x with height h_x and weight w_x:

  p(h_x|c) = 1/√(2π σ²_{h,c}) · exp(−(h_x − μ_{h,c})² / (2 σ²_{h,c}))
  p(w_x|c) = 1/√(2π σ²_{w,c}) · exp(−(w_x − μ_{w,c})² / (2 σ²_{w,c}))
  p(h_x|a) = 1/√(2π σ²_{h,a}) · exp(−(h_x − μ_{h,a})² / (2 σ²_{h,a}))
  p(w_x|a) = 1/√(2π σ²_{w,a}) · exp(−(w_x − μ_{w,a})² / (2 σ²_{w,a}))

• P(x|a) = p(h_x|a) · p(w_x|a)
• P(x|c) = p(h_x|c) · p(w_x|c)
• Posterior:

  P(a|x) = P(x|a) P(a) / (P(x|a) P(a) + P(x|c) P(c))

[Scatter plot as above, with the query point X marked]
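A sketch that puts the pieces above together for a new query point; the fitted means, variances and the query values are placeholders rather than numbers from the lecture:

  import math

  def gaussian_pdf(x, mu, var):
      # Normal density N(x; mu, var)
      return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

  # Hypothetical fitted parameters per class (a = adult, c = child)
  params = {"a": {"mu_h": 175, "var_h": 100, "mu_w": 78, "var_w": 64},
            "c": {"mu_h": 120, "var_h": 225, "mu_w": 25, "var_w": 36}}
  priors = {"a": 0.25, "c": 0.75}

  h_x, w_x = 150, 45  # new query point (height, weight), assumed

  # P(x|y) = p(h_x|y) * p(w_x|y) under the independence assumption
  likelihood = {y: gaussian_pdf(h_x, p["mu_h"], p["var_h"]) * gaussian_pdf(w_x, p["mu_w"], p["var_w"])
                for y, p in params.items()}

  evidence = sum(likelihood[y] * priors[y] for y in priors)
  posterior = {y: likelihood[y] * priors[y] / evidence for y in priors}
  print(posterior)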
Discrete example: spam
• Separate spam from valid email; attributes = words. P(spam) = 4/6, P(ham) = 2/6
  D1: "send us your password"  spam
  D2: "send us your review"    ham
  D3: "review your password"   ham
  D4: "review us"              spam
  D5: "send us your password"  spam
  D6: "send us your account"   spam

  Word       P(word|spam)   P(word|ham)
  password   2/4            1/2
  review     1/4            2/2
  send       3/4            1/2
  us         3/4            1/2
  your       3/4            1/2
  account    1/4            0/2

New email: "review us now"

P(review us|spam) = P(0,1,0,1,0,0|spam)
  = (1 − 2/4) · (1/4) · (1 − 3/4) · (3/4) · (1 − 3/4) · (1 − 1/4) ≈ 0.0044
P(review us|ham) = P(0,1,0,1,0,0|ham)
  = (1 − 1/2) · (2/2) · (1 − 1/2) · (1/2) · (1 − 1/2) · (1 − 0/2) = 0.0625

P(ham|review us) = 0.0625 × (2/6) / (0.0625 × (2/6) + 0.0044 × (4/6)) ≈ 0.87
(note: the identical email "review us" appears in the training data as spam, D4)
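A sketch that reproduces the hand calculation above; the vocabulary, per-word probabilities and priors are read straight off the table, and "now" is ignored because it is not in the vocabulary:

  vocab = ["password", "review", "send", "us", "your", "account"]

  p_word = {  # P(word present | class), from the table above
      "spam": {"password": 2/4, "review": 1/4, "send": 3/4, "us": 3/4, "your": 3/4, "account": 1/4},
      "ham":  {"password": 1/2, "review": 2/2, "send": 1/2, "us": 1/2, "your": 1/2, "account": 0/2},
  }
  priors = {"spam": 4/6, "ham": 2/6}

  new_email = {"review", "us", "now"}  # "now" is not in the vocabulary, so it plays no role

  def likelihood(cls):
      # P(x|cls): product over vocabulary words of P(present|cls) or P(absent|cls)
      p = 1.0
      for w in vocab:
          p *= p_word[cls][w] if w in new_email else (1 - p_word[cls][w])
      return p

  evidence = sum(likelihood(c) * priors[c] for c in priors)
  posterior = {c: likelihood(c) * priors[c] / evidence for c in priors}
  print(posterior)  # ham comes out around 0.87, matching the slide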
Good examples on Wikipedia
• https://en.wikipedia.org/wiki/Naive_Bayes_classifier
Problems with Naïve Bayes
• Zero-frequency problem
• Any mail containing “account” is spam: P(account|ham)=0/2
• Solution: never allow zero frequencies
• Laplace smoothing: add a small positive number to all counts:

  P(w|c) = (num(w,c) + ε) / (num(c) + 2ε)

• May use global statistics in place of ε: num(w)/num
• Very common problem (Zipf’s law: 50% of words occur once)
• Assumes word independence
• Every word contributes independently to P(spam|email)
• Fooling NB: add lots of “hammy” words into spam email
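A tiny sketch of the smoothed estimate, applied to the zero-count case from the slide (here ε is simply assumed to be 1):

  def smoothed(count, class_total, eps=1.0):
      # Laplace smoothing: P(w|c) = (num(w,c) + eps) / (num(c) + 2*eps)
      return (count + eps) / (class_total + 2 * eps)

  print(smoothed(0, 2))  # P(account|ham): 0/2 becomes 1/4 instead of 0
  print(smoothed(1, 4))  # P(account|spam): 1/4 becomes 2/6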
Missing Data
• Suppose we don't have a value for some attribute Xj
  • Applicant's credit history unknown
  • Some medical test not performed on the patient
• How to compute P(X1=x1, …, Xj=?, …, Xd=xd | y)?
• Easy with Naïve Bayes:

  P(x1, …, Xj, …, xd | y) = Π_{i≠j} P(xi | y)

• Ignore the attribute in instances where its value is missing
  • Compute the likelihood based on the observed attributes only
  • No need to "fill in" or explicitly model missing values
  • Based on conditional independence between attributes
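A sketch of this rule: attributes whose value is unknown (None below) are simply skipped in the product. The attribute tables are placeholders:

  # Hypothetical P(x_i = value | y) tables for three attributes and one class y
  attr_probs = [{"high": 0.7, "low": 0.3},
                {"yes": 0.6, "no": 0.4},
                {"red": 0.2, "blue": 0.8}]

  observation = ["high", None, "blue"]  # the second attribute is missing

  # Conditional independence lets us just drop the missing attribute
  likelihood = 1.0
  for probs, value in zip(attr_probs, observation):
      if value is not None:
          likelihood *= probs[value]
  print(likelihood)  # 0.7 * 0.8 = 0.56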
Missing Data II
• Example, three coin tosses: Event = {X1=H, X2=?, X3=T}
  • Event = head, unknown (either head or tail), tail
  • Event = {H,H,T} + {H,T,T}
  • P(event) = P(H,H,T) + P(H,T,T)
• General case: Xj has a missing value

  Σ_{xj} P(x1, …, xj, …, xd | y) = Σ_{xj} P(x1|y) … P(xj|y) … P(xd|y)
                                 = P(x1|y) … (Σ_{xj} P(xj|y)) … P(xd|y)
                                 = P(x1|y) … 1 … P(xd|y)
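A quick numeric check of the coin example, assuming independent tosses of a coin with P(H) = 0.6 (the slide does not fix the bias): summing over the unknown middle toss gives the same answer as dropping it.

  p_h = 0.6            # assumed P(heads)
  p_t = 1 - p_h

  p_marginalised = p_h * p_h * p_t + p_h * p_t * p_t   # P(H,H,T) + P(H,T,T)
  p_dropped = p_h * p_t                                 # just ignore the unknown toss
  print(p_marginalised, p_dropped)  # both 0.24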
