
Aprendizagem Computacional Gladys Castillo, UA

Bayesian Network Classifiers


Part I: Naive Bayes
The Supervised Classification Problem

A classifier is a function f: Ω_X → Ω_C that assigns a class label c ∈ Ω_C = {c_1, ..., c_m} to objects described by a set of attributes X = {X_1, X_2, ..., X_n} ∈ Ω_X

Given: a dataset D with N labeled examples <x^(1), c^(1)>, <x^(2), c^(2)>, ..., <x^(N), c^(N)> of <X, C>
Build: a classifier, a hypothesis h_C: Ω_X → Ω_C that can correctly predict the class labels of new objects

Learning Phase: a supervised learning algorithm takes the dataset D as input and outputs the hypothesis h_C
Classification Phase: given the attribute values of a new object x^(N+1), the class attached to x^(N+1) is c^(N+1) = h_C(x^(N+1)) ∈ Ω_C

[Figure: a credit-scoring example; two classes of objects plotted over attributes X_1 and X_2, "give credit" vs. "don't give credit".]
Statistical Classifiers

Treat the attributes X = {X_1, X_2, ..., X_n} and the class C as random variables
A random variable is defined by its probability density function f(x)
Give the probability P(c_j | x) that x belongs to a particular class, rather than a simple classification

Instead of having the map X → C, we have X → P(C | X)
The class c* attached to an example is the class with the largest P(c_j | x)

[Figure: probability density function of a random variable and a few observations.]
Bayesian Classifiers

Bayesian because the class c* attached to an example x is determined by the Bayes Theorem:

P(X, C) = P(X | C) P(C)
P(X, C) = P(C | X) P(X)

P(C | X) = P(C) P(X | C) / P(X)

Bayes theorem is the main tool in Bayesian inference: we can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.
Bayes Theorem
Example

Given:
- A doctor knows that meningitis causes stiff neck 50% of the time
- The prior probability of any patient having meningitis is 1/50,000
- The prior probability of any patient having stiff neck is 1/20

If a patient has stiff neck, what's the probability he/she has meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002

In P(C | X) = P(C) P(X | C) / P(X), P(C) is the prior, P(X | C) is the likelihood, and P(C | X) is the posterior: posterior ∝ prior × likelihood.

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
Bayesian Classifier

Bayes Theorem:

P(c_j | x) = P(c_j) P(x | c_j) / P(x)

How to determine P(c_j | x) for each class c_j? P(x) can be ignored because it is the same for all the classes (a normalization constant).

Bayesian Classification Rule (maximum a posteriori classification): classify x into the class with the largest posterior probability:

c* = h_Bayes(x) = arg max_{j=1..m} P(c_j) P(x | c_j)
Naïve Bayes (NB) Classifier
Duda and Hart (1973); Langley (1992)

Bayes because the class c* attached to an example x is determined by the Bayes Theorem:

c* = h_Bayes(x) = arg max_{j=1..m} P(c_j) P(x | c_j)

When the attribute space is high dimensional, direct estimation of P(x | c_j) is hard unless we introduce some assumptions.

Naïve because of its very naïve independence assumption: all the attributes are conditionally independent given the class. Then P(x | c_j) can be decomposed into a product of n terms, one term for each attribute.

NB Classification Rule:

c* = h_NB(x) = arg max_{j=1..m} P(c_j) ∏_{i=1}^{n} P(X_i = x_i | c_j)
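
As a quick illustration, here is a minimal Python sketch of this decision rule; the priors and conditional probability tables below are hypothetical placeholders, and logs are used so the product of many small probabilities does not underflow:

```python
import math

def nb_classify(x, priors, cond_probs):
    """Naive Bayes MAP rule: arg max_c P(c) * prod_i P(X_i = x_i | c).

    priors: dict class -> P(c)
    cond_probs: dict class -> list (one dict per attribute) mapping
                attribute value -> P(X_i = value | c)
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            p = cond_probs[c][i].get(value, 0.0)
            if p == 0.0:          # a zero counter makes the whole product zero
                score = -math.inf
                break
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# hypothetical two-class, two-attribute example
priors = {"+": 0.6, "-": 0.4}
cond_probs = {
    "+": [{"a": 2/3, "b": 1/3}, {"low": 0.5, "high": 0.5}],
    "-": [{"a": 0.0, "b": 1.0}, {"low": 0.25, "high": 0.75}],
}
print(nb_classify(["a", "low"], priors, cond_probs))  # -> '+'
```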
Naïve Bayes (NB)
Learning Phase (Statistical Parameter Estimation)

Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(c_j) for each class c_j:

P(c_j) = N_j / N

2. Estimate P(X_i = x_k | c_j) for each value x_k of the attribute X_i and for each class c_j:

X_i discrete:
P(X_i = x_k | c_j) = N_ijk / N_j

X_i continuous, two options:
- the attribute is discretized and then treated as a discrete attribute
- a Normal distribution is usually assumed:
  P(X_i = x_k | c_j) = g(x_k; μ_ij, σ_ij), where g(x; μ, σ) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²))
  and the mean μ_ij and the standard deviation σ_ij are estimated from D

where N_j is the number of examples of the class c_j, and N_ijk is the number of examples of the class c_j having the value x_k for the attribute X_i.
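
A sketch of this learning phase in Python, assuming complete data given as lists of attribute values (the function name and data layout are illustrative, not from the slides):

```python
from collections import Counter, defaultdict
import statistics

def fit_naive_bayes(X, y, continuous=frozenset()):
    """Estimate NB parameters: P(c_j) = N_j / N, and per class and attribute
    either value frequencies N_ijk / N_j (discrete) or (mean, stdev) (Gaussian)."""
    N = len(y)
    priors = {c: n / N for c, n in Counter(y).items()}
    params = defaultdict(dict)
    for c in priors:
        rows = [x for x, label in zip(X, y) if label == c]
        for i in range(len(X[0])):
            column = [row[i] for row in rows]
            if i in continuous:
                # Gaussian option: fit mu_ij and sigma_ij from D
                params[c][i] = (statistics.mean(column), statistics.stdev(column))
            else:
                # discrete option: relative frequencies N_ijk / N_j
                params[c][i] = {v: n / len(column) for v, n in Counter(column).items()}
    return priors, dict(params)
```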
Continuous Attributes
Normal or Gaussian Distribution

2. Estimate P(X_i = x_k | c_j) for a value of the attribute X_i and for each class c_j. A Normal distribution is usually assumed:

X_i | c_j ~ N(μ_ij, σ²_ij), where the mean μ_ij and the standard deviation σ_ij are estimated from D

P(X_i = x_k | c_j) = g(x_k; μ_ij, σ_ij), with g(x; μ, σ) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²))

The density f(x) is symmetrical around its mean.

[Figure: the probability density function of N(0, σ²).]

Example: for a variable X ~ N(74, 36), the density at the value 66 is given by:
f(66) = g(66; 74, 6) = 0.0273
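
The 0.0273 value can be checked with a couple of lines of Python:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density g(x; mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# X ~ N(74, 36), i.e. mu = 74 and sigma = sqrt(36) = 6
print(round(gaussian_pdf(66, 74, 6), 4))  # 0.0273
```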
Naïve Bayes
Probability Estimates (example from John & Langley, 1995)

Binary Classification Problem
Two classes: + (positive) and - (negative)
Two attributes: X_1, discrete, which takes values a and b; X_2, continuous

Training dataset D:

Class  X_1  X_2
+      a    1.0
+      b    1.2
+      a    3.0
-      b    4.4
-      b    4.5

1. Estimate P(c_j) for each class c_j:
P(C = +) = 3/5    P(C = -) = 2/5

2. Estimate P(X_i = x_k | c_j) for each value x_k and each class c_j:
P(X_1 = a | C = +) = 2/3    P(X_1 = b | C = +) = 1/3
P(X_1 = a | C = -) = 0/2    P(X_1 = b | C = -) = 2/2

For the continuous attribute X_2, a Gaussian is fitted per class:
μ_2+ = 1.73, σ_2+ = 1.10, so P(X_2 = x | C = +) = g(x; 1.73, 1.10)
μ_2- = 4.45, σ_2- = 0.07, so P(X_2 = x | C = -) = g(x; 4.45, 0.07)
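
Feeding this tiny dataset to the fit_naive_bayes sketch from the learning-phase slide reproduces these estimates (stdev is the sample standard deviation):

```python
X = [["a", 1.0], ["b", 1.2], ["a", 3.0], ["b", 4.4], ["b", 4.5]]
y = ["+", "+", "+", "-", "-"]
priors, params = fit_naive_bayes(X, y, continuous={1})
print(priors)          # {'+': 0.6, '-': 0.4}
print(params["+"][0])  # {'a': 0.666..., 'b': 0.333...}
print(params["+"][1])  # (1.733..., 1.101...)  i.e. mu = 1.73, sigma = 1.10
print(params["-"][1])  # (4.45, 0.0707...)     i.e. mu = 4.45, sigma = 0.07
```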
Probability Estimates
Discrete Attributes

Training dataset (adapted from Tan, Steinbach, Kumar, Introduction to Data Mining):

Tid  Refund  Marital Status  Taxable Income  Evade
  1  Yes     Single          125K            No
  2  No      Married         100K            No
  3  No      Single           70K            No
  4  Yes     Married         120K            No
  5  No      Divorced         95K            Yes
  6  No      Married          60K            No
  7  Yes     Divorced        220K            No
  8  No      Single           85K            Yes
  9  No      Married          75K            No
 10  No      Single           90K            Yes

(Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class.)

For each class: P(c_j) = N_j / N
P(No) = 7/10    P(Yes) = 3/10

For each attribute value and class: P(X_i = x_k | c_j) = N_ijk / N_j. To compute N_ijk we need to count the number of examples of the class c_j having the value x_k for the attribute X_i.

Examples:
P(Status = Married | No) = 4/7
P(Refund = Yes | Yes) = 0
Probability Estimates
Continuous Attributes

For each pair attribute-class (X_i, c_j), a Normal distribution is assumed:

P(X_i = x_k | c_j) = g(x_k; μ_ij, σ_ij) = (1 / (σ_ij √(2π))) e^(−(x_k − μ_ij)² / (2σ²_ij))

Example, for (Income, Class = No) in the training dataset above (adapted from Tan, Steinbach, Kumar, Introduction to Data Mining):

If Class = No:
sample mean = 110
sample variance = 2975, sample standard deviation = 54.54

P(Income = 120 | No) = (1 / (√(2π) · 54.54)) e^(−(120 − 110)² / (2 · 2975)) = 0.0072
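
This too is easy to verify in Python, reusing the gaussian_pdf helper defined earlier:

```python
import statistics

incomes_no = [125, 100, 70, 120, 60, 220, 75]  # Taxable Income (in K) for Class = No
mu = statistics.mean(incomes_no)               # 110.0
sigma = statistics.stdev(incomes_no)           # 54.54... (sample stdev, variance 2975)
print(round(gaussian_pdf(120, mu, sigma), 4))  # 0.0072
```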
The Balance-scale Problem
(dataset from the UCI repository)

The dataset was generated to model psychological experimental results.

Each example has 4 numerical attributes:
- the left weight (Left_W)
- the left distance (Left_D)
- the right weight (Right_W)
- the right distance (Right_D)

Each example is classified into 3 classes, the state of the balance scale:
- tip to the right (Right)
- tip to the left (Left)
- is balanced (Balanced)

3 target rules:
If LD × LW > RD × RW, tip to the left
If LD × LW < RD × RW, tip to the right
If LD × LW = RD × RW, it is balanced
The Balance-scale Problem

Balance-Scale DataSet (adapted from João Gama's slides "Aprendizagem Bayesiana"):

Left_W  Left_D  Right_W  Right_D  Class
1       5       4        2        Right
2       5       3        2        Left
3       4       6        2        Balanced
...     ...     ...      ...      ...

Discretization is applied: each attribute is mapped to 5 intervals.
The Balance-scale Problem
Learning Phase

Build contingency tables (565 examples):

Class Counters
Left: 260    Balanced: 45    Right: 260    Total: 565

Attribute: Left_W
Class      I1  I2  I3  I4  I5
Left       14  42  61  71  72
Balanced   10   8   8  10   9
Right      86  66  49  34  25

Attribute: Left_D
Class      I1  I2  I3  I4  I5
Left       16  38  59  70  77
Balanced    8  10   9  10   8
Right      90  57  49  37  27

Attribute: Right_W
Class      I1  I2  I3  I4  I5
Left       87  63  49  33  28
Balanced    8  10  10   9   8
Right      16  37  58  70  79

Attribute: Right_D
Class      I1  I2  I3  I4  I5
Left       91  65  44  35  25
Balanced    8  10   8  10   9
Right      17  37  57  67  82

Assuming complete data, the computation of all the required estimates requires a simple scan through the data, an operation of time complexity O(N n), where N is the number of training examples and n the number of attributes.
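
A sketch of this single counting pass in Python; the row layout is illustrative (each row is a list of discretized attribute values followed by the class label):

```python
from collections import Counter, defaultdict

def contingency_tables(rows, class_index=-1):
    """One O(N*n) scan over the data: class counters plus, for every
    attribute, the counts N_ijk of (value, class) co-occurrences."""
    class_counts = Counter()
    tables = defaultdict(Counter)   # tables[i][(value, c)] -> N_ijk
    for row in rows:
        c = row[class_index]
        class_counts[c] += 1
        for i, value in enumerate(row[:class_index]):
            tables[i][(value, c)] += 1
    return class_counts, dict(tables)
```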
The Balance-scale Problem
Classification Phase

How does NB classify this example?

Left_W  Left_D  Right_W  Right_D  Class
1       5       4        2        ?

The class counters and contingency tables are used to compute the posterior probabilities for each class. We need to estimate P(c_j | x) for each class c_j ∈ {Left, Balanced, Right}:

P(c_j | x) ∝ P(c_j) × P(Left_W = 1 | c_j) × P(Left_D = 5 | c_j) × P(Right_W = 4 | c_j) × P(Right_D = 2 | c_j)

c* = h_NB(x) = arg max_{j=1..m} P(c_j) ∏_{i=1}^{n} P(X_i = x_i | c_j)

The class for this example is the class with the largest posterior probability:

P(Left | x)   P(Balanced | x)   P(Right | x)
0.277796      0.135227          0.586978        → Class = Right
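
A sketch of this step using the counters produced above; the normalization at the end is optional, since only the arg max matters, and the exact figures depend on details such as smoothing:

```python
def posteriors(x, class_counts, tables, total):
    """Normalized NB scores P(c_j) * prod_i P(x_i | c_j) for a discretized x."""
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                        # P(c_j) = N_j / N
        for i, value in enumerate(x):
            score *= tables[i][(value, c)] / n_c   # P(X_i = x_i | c_j)
        scores[c] = score
    z = sum(scores.values()) or 1.0  # all-zero scores are addressed by the Laplace correction later
    return {c: s / z for c, s in scores.items()}
```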
Iris Dataset
(dataset from the UCI repository, introduced by the statistician R. A. Fisher)

Three flower types (classes): Setosa, Virginica, Versicolour
Four continuous attributes: sepal width and length, petal width and length

[Photo: Iris virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Iris Dataset
[Figure-only slides: sample data and plots of the dataset]
Scatter Plot Array of Iris Attributes
The attributes petal width and petal length provide a moderate separation of the Iris species.
Naïve Bayes
Iris Dataset, Continuous Attributes: Normal Probability Density Functions

Attribute: PetalWidth
[Figure: the densities P(PetalWidth | Setosa), P(PetalWidth | Versicolor), P(PetalWidth | Virginica)]

Class Iris-setosa: mean 0.244, standard deviation 0.107
Class Iris-versicolor: mean 1.326, standard deviation 0.198
Class Iris-virginica: mean 2.026, standard deviation 0.275
Naïve Bayes
Iris Dataset, Continuous Attributes: Model

Class Iris-setosa (0.327):
==========================
Attribute sepallength
mean: 5.006, standard deviation: 0.352
Attribute sepalwidth
mean: 3.418, standard deviation: 0.381
Attribute petallength
mean: 1.464, standard deviation: 0.174
Attribute petalwidth
mean: 0.244, standard deviation: 0.107

Class Iris-versicolor (0.327):
==============================
Attribute sepallength
mean: 5.936, standard deviation: 0.516
Attribute sepalwidth
mean: 2.770, standard deviation: 0.314
Attribute petallength
mean: 4.260, standard deviation: 0.470
Attribute petalwidth
mean: 1.326, standard deviation: 0.198

Class Iris-virginica (0.327):
=============================
Attribute sepallength
mean: 6.588, standard deviation: 0.636
Attribute sepalwidth
mean: 2.974, standard deviation: 0.322
Attribute petallength
mean: 5.552, standard deviation: 0.552
Attribute petalwidth
mean: 2.026, standard deviation: 0.275
Naïve Bayes
Iris Dataset, Continuous Attributes: Classification Phase
[Figure: classification output; the predicted class is the one with the maximum posterior value]
Naïve Bayes
Iris Dataset, Continuous Attributes

How does NB classify the example x = (sepalLength = 5, sepalWidth = 3, petalLength = 2, petalWidth = 2)?

Estimate P(c_j | x) for each class c_j ∈ {setosa, versicolor, virginica}:

P(c_j | x) ∝ P(c_j) × P(sepalLength = 5 | c_j) × P(sepalWidth = 3 | c_j) × P(petalLength = 2 | c_j) × P(petalWidth = 2 | c_j)

The class for this example is the class with the largest posterior probability:

P(setosa | x)   P(versicolor | x)   P(virginica | x)
0               0.995               0.005             → Class = versicolor
Naïve Bayes
Iris Dataset, Continuous Attributes

Estimate P(c_j | x) for the versicolor class.

Class Iris-versicolor (0.327):
Attribute sepallength: mean 5.936, standard deviation 0.516
Attribute sepalwidth: mean 2.770, standard deviation 0.314
Attribute petallength: mean 4.260, standard deviation 0.470
Attribute petalwidth: mean 1.326, standard deviation 0.198

P(versicolor | x) ∝ P(versicolor) × P(sepalLength = 5 | versicolor) × P(sepalWidth = 3 | versicolor) × P(petalLength = 2 | versicolor) × P(petalWidth = 2 | versicolor)

P(versicolor | x) ∝ 0.327 × g(5; 5.936, 0.516) × g(3; 2.770, 0.314) × g(2; 4.260, 0.470) × g(2; 1.326, 0.198)
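
The whole computation can be reproduced in a few lines of Python from the Gaussian parameters on the model slide; normalizing the three unnormalized scores yields the posteriors reported above:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# (mean, stdev) per attribute: sepallength, sepalwidth, petallength, petalwidth
model = {
    "setosa":     [(5.006, 0.352), (3.418, 0.381), (1.464, 0.174), (0.244, 0.107)],
    "versicolor": [(5.936, 0.516), (2.770, 0.314), (4.260, 0.470), (1.326, 0.198)],
    "virginica":  [(6.588, 0.636), (2.974, 0.322), (5.552, 0.552), (2.026, 0.275)],
}
x = [5, 3, 2, 2]

scores = {c: 0.327 * math.prod(gaussian_pdf(v, mu, s) for v, (mu, s) in zip(x, params))
          for c, params in model.items()}
z = sum(scores.values())
print({c: round(s / z, 3) for c, s in scores.items()})
# {'setosa': 0.0, 'versicolor': 0.995, 'virginica': 0.005}
```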
Naïve Bayes
Continuous Attributes: Discretization
[Figure: RapidMiner process for the continuous attributes, after BinDiscretization with bins = 3]
Naïve Bayes
Continuous Attributes: Discretization, Model
Class Iris-versicolor (0.327):
==============================
Attribute sepallength
range1: 0.220 range2: 0.720 range3: 0.060
Attribute sepalwidth
range1: 0.460 range2: 0.540 range3: 0.000
Attribute petallength
range1: 0.000 range2: 0.960 range3: 0.040
Attribute petalwidth
range1: 0.000 range2: 0.980 range3: 0.020
Class Iris-setosa (0.327):
==========================
Attribute sepallength
range1: 0.940 range2: 0.060 range3: 0.000
Attribute sepalwidth
range1: 0.020 range2: 0.720 range3: 0.260
Attribute petallength
range1: 1.000 range2: 0.000 range3: 0.000
Attribute petalwidth
range1: 1.000 range2: 0.000 range3: 0.000
Class Iris-virginica (0.327):
=============================
Attribute sepallength
range1: 0.020 range2: 0.640 range3: 0.340
Attribute sepalwidth
range1: 0.380 range2: 0.580 range3: 0.040
Attribute petallength
range1: 0.000 range2: 0.120 range3: 0.880
Attribute petalwidth
range1: 0.000 range2: 0.100 range3: 0.900
Naïve Bayes
Continuous Attributes: Discretization

From the model above we can build a conditional probability table (CPT) for each attribute:

Class Probabilities
Setosa  Versicolor  Virginica
0.327   0.327       0.327

Attribute Sepallength
            Range1  Range2  Range3
Setosa      0.940   0.060   0.000
Versicolor  0.220   0.720   0.060
Virginica   0.020   0.640   0.340

Attribute Sepalwidth
            Range1  Range2  Range3
Setosa      0.020   0.720   0.260
Versicolor  0.460   0.540   0.000
Virginica   0.380   0.580   0.040
Naïve Bayes
Continuous Attributes: Discretization

Attribute Petallength
            Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.960   0.040
Virginica   0.000   0.100   0.900

Attribute Petalwidth
            Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.980   0.020
Virginica   0.000   0.100   0.900
Naïve Bayes
Iris Dataset, Discretized Attributes: Classification (Implementation) Phase
[Figure: discretized examples and classification output; the predicted class is the one with the maximum posterior value]
Naïve Bayes
Classification Phase

How to classify a new example?

                              sepallength  sepalwidth  petallength  petalwidth
Example:                      5            3           2            2
With discretized attributes:  r1           r2          r1           r3

Compute the conditional posterior probabilities for each class:

P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)

P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)

P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)
Naïve Bayes
Continuous Attributes: Discretization

Using the CPTs above for the discretized example x = (r1, r2, r1, r3):

P(setosa | x) = P(setosa)
              × P(sepalLength = r1 | setosa)
              × P(sepalWidth = r2 | setosa)
              × P(petalLength = r1 | setosa)
              × P(petalWidth = r3 | setosa)

P(setosa | x) = 0.327 × 0.940 × 0.720 × 1.000 × 0.000 = 0
Naïve Bayes
Continuous Attributes: Discretization

P(versicolor | x) = P(versicolor)
                  × P(sepalLength = r1 | versicolor)
                  × P(sepalWidth = r2 | versicolor)
                  × P(petalLength = r1 | versicolor)
                  × P(petalWidth = r3 | versicolor)

P(versicolor | x) = 0.327 × 0.220 × 0.540 × 0.000 × 0.020 = 0
Naïve Bayes
Continuous Attributes: Discretization

P(virginica | x) = P(virginica)
                 × P(sepalLength = r1 | virginica)
                 × P(sepalWidth = r2 | virginica)
                 × P(petalLength = r1 | virginica)
                 × P(petalWidth = r3 | virginica)

P(virginica | x) = 0.327 × 0.020 × 0.580 × 0.000 × 0.900 = 0

If all the class probabilities are zero, we cannot determine the class for this example.
Naïve Bayes
Laplace Correction

To avoid zero probabilities due to zero counters we can implement the Laplace correction.

To calculate the conditional probabilities, instead of using the estimate

P(X_i = x_k | c_j) = N_ijk / N_j

we use the Laplace correction:

P(X_i = x_k | c_j) = (N_ijk + 1) / (N_j + k)

where N_ijk is the number of examples in D such that X_i = x_k and C = c_j, N_j is the number of examples in D of class c_j, and k is the number of possible values of X_i.
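
A one-line sketch of the corrected estimate; the counts in the usage example are illustrative:

```python
def laplace_estimate(n_ijk, n_j, k):
    """Laplace-corrected P(X_i = x_k | c_j) = (N_ijk + 1) / (N_j + k)."""
    return (n_ijk + 1) / (n_j + k)

# a zero counter no longer produces a zero probability, e.g. with
# illustrative counts N_ijk = 0, N_j = 49 and k = 3 possible values:
print(laplace_estimate(0, 49, 3))  # 0.0192... instead of 0.0
```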
Bayesian SPAM Filters
(Spambase dataset from the UCI repository)

Binary Classification Problem
57 continuous attributes:
- some word frequencies
- some character frequencies
- capital letters frequency
Two classes: 0 is not SPAM, 1 is SPAM
Bayesian SPAM Filters
Implementation in RapidMiner
[Figure: RapidMiner process with a Discretization Method, a Feature Subset Selection Method, Learning, Testing, and an Evaluation Method]

In this dataset there are no missing values; otherwise we would first need to use a method to replace the missing values.
Bayesian SPAM Filters
Confusion Matrix

Concept Learning Problem: is an e-mail a SPAM?
True Positive (TP) = number of examples classified as positive which truly are positive
False Positive (FP) = number of examples classified as positive which are negative
False Negative (FN) = number of examples classified as negative which are positive

SPAM Precision: percentage of e-mails classified as SPAM which truly are SPAM
precision = TP / (TP + FP)

SPAM Recall (sensitivity, true positive rate): percentage of e-mails classified as SPAM over the total of examples that are SPAM
recall = TPR = TP / (TP + FN)
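
These two measures are straightforward to compute from confusion-matrix counts; the numbers below are hypothetical:

```python
def precision_recall(tp, fp, fn):
    """SPAM precision = TP / (TP + FP); SPAM recall (TPR) = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# hypothetical counts: 160 spams caught, 15 false alarms, 40 spams missed
print(precision_recall(tp=160, fp=15, fn=40))  # (0.914..., 0.8)
```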
Experimental Results
[Figure: results with continuous attributes, with discretization after FSS, and with discretization before FSS]

Legend, order of the pre-processing operators used in each case:
1. Feature Selection (CFS)
2. Genetic Algorithm (CFS)
3. Wrapper
4. Feature Selection (CFS), Minimal Entropy Discretization
5. Genetic Algorithm (CFS), Minimal Entropy Discretization
6. Feature Selection (CFS), Frequency Discretization
7. Wrapper, Minimal Entropy Discretization
8. Minimal Entropy Discretization, Feature Selection (CFS)
9. Minimal Entropy Discretization, Genetic Algorithm (CFS)
10. Minimal Entropy Discretization, Wrapper
Naïve Bayes Performance

NB is one of the simplest and most effective classifiers.
NB has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

[Figure: bias-variance decomposition of the test error of Naive Bayes on the Nursery dataset, for training sets from 500 to 12500 examples; the bias component dominates the variance component.]

In practice the independence assumption is violated → HIGH BIAS; it can lead to poor classification.
However, NB is efficient due to its high variance management: fewer parameters → LOW VARIANCE.
Improving Naïve Bayes

Reducing the bias resulting from the modeling error, by relaxing the attribute independence assumption; one natural extension: Bayesian Network Classifiers.
Reducing the bias of the parameter estimates, by improving the probability estimates computed from data.

Relevant works:
- Webb and Pazzani (1998) - Adjusted probability naive Bayesian induction, in LNCS v. 1502
- J. Gama (2001, 2003) - Iterative Bayes, in Theoretical Computer Science, v. 292
- Friedman, Geiger and Goldszmidt (1997) - Bayesian Network Classifiers, in Machine Learning, 29
- Pazzani (1995) - Searching for attribute dependencies in Bayesian Network Classifiers, in Proc. of the 5th Workshop of Artificial Intelligence and Statistics
- Keogh and Pazzani (1999) - Learning augmented Bayesian classifiers, in Theoretical Computer Science, v. 292