
Aprendizagem Computacional Gladys Castillo, UA

Bayesian Network Classifiers


Part I: Naive Bayes
The Supervised Classification Problem

A classifier is a function f: Ω_X → Ω_C that assigns a class label c ∈ Ω_C = {c_1, ..., c_m} to objects described by a set of attributes X = {X_1, X_2, ..., X_n} ∈ Ω_X

Given: a dataset D with N labeled examples <x^(1), c^(1)>, <x^(2), c^(2)>, ..., <x^(N), c^(N)> of <X, C>
Build: a classifier, a hypothesis h_C: Ω_X → Ω_C that can correctly predict the class labels of new objects

Learning Phase: a supervised learning algorithm takes the dataset D as input and outputs the hypothesis h_C
Classification Phase: given the attribute values of a new object x^(N+1), the class attached to x^(N+1) is c^(N+1) = h_C(x^(N+1)) ∈ Ω_C

[Figure: a credit-scoring example; two classes of objects plotted over attributes X_1 and X_2, "give credit" vs. "don't give credit".]
Statistical Classifiers

Treat the attributes X = {X_1, X_2, ..., X_n} and the class C as random variables
A random variable is defined by its probability density function f(x)
Give the probability P(c_j | x) that x belongs to a particular class, rather than a simple classification

Instead of having the map X → C, we have X → P(C | X)
The class c* attached to an example is the class with the largest P(c_j | x)

[Figure: probability density function of a random variable and a few observations.]
Bayesian Classifiers

Bayesian because the class c* attached to an example x is determined by the Bayes Theorem:

P(X, C) = P(X | C) P(C)
P(X, C) = P(C | X) P(X)

P(C | X) = P(C) P(X | C) / P(X)

Bayes theorem is the main tool in Bayesian inference: we can combine the prior distribution and the likelihood of the observed data in order to derive the posterior distribution.
Bayes Theorem
Example

Given:
- A doctor knows that meningitis causes stiff neck 50% of the time
- The prior probability of any patient having meningitis is 1/50,000
- The prior probability of any patient having stiff neck is 1/20

If a patient has stiff neck, what's the probability he/she has meningitis?

P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002

In P(C | X) = P(C) P(X | C) / P(X), P(C) is the prior, P(X | C) is the likelihood, and P(C | X) is the posterior: posterior ∝ prior × likelihood.

Before observing the data, our prior beliefs can be expressed in a prior probability distribution that represents the knowledge we have about the unknown features. After observing the data, our revised beliefs are captured by a posterior distribution over the unknown features.
Bayesian Classifier

Bayes Theorem:

P(c_j | x) = P(c_j) P(x | c_j) / P(x)

How to determine P(c_j | x) for each class c_j? P(x) can be ignored because it is the same for all the classes (a normalization constant).

Bayesian Classification Rule (maximum a posteriori classification): classify x into the class with the largest posterior probability:

c* = h_Bayes(x) = arg max_{j=1..m} P(c_j) P(x | c_j)
Naïve Bayes (NB) Classifier
Duda and Hart (1973); Langley (1992)

Bayes because the class c* attached to an example x is determined by the Bayes Theorem:

c* = h_Bayes(x) = arg max_{j=1..m} P(c_j) P(x | c_j)

When the attribute space is high dimensional, direct estimation of P(x | c_j) is hard unless we introduce some assumptions.

Naïve because of its very naïve independence assumption: all the attributes are conditionally independent given the class. Then P(x | c_j) can be decomposed into a product of n terms, one term for each attribute.

NB Classification Rule:

c* = h_NB(x) = arg max_{j=1..m} P(c_j) ∏_{i=1}^{n} P(X_i = x_i | c_j)
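
As a quick illustration, here is a minimal Python sketch of this decision rule; the priors and conditional probability tables below are hypothetical placeholders, and logs are used so the product of many small probabilities does not underflow:

```python
import math

def nb_classify(x, priors, cond_probs):
    """Naive Bayes MAP rule: arg max_c P(c) * prod_i P(X_i = x_i | c).

    priors: dict class -> P(c)
    cond_probs: dict class -> list (one dict per attribute) mapping
                attribute value -> P(X_i = value | c)
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            p = cond_probs[c][i].get(value, 0.0)
            if p == 0.0:          # a zero counter makes the whole product zero
                score = -math.inf
                break
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# hypothetical two-class, two-attribute example
priors = {"+": 0.6, "-": 0.4}
cond_probs = {
    "+": [{"a": 2/3, "b": 1/3}, {"low": 0.5, "high": 0.5}],
    "-": [{"a": 0.0, "b": 1.0}, {"low": 0.25, "high": 0.75}],
}
print(nb_classify(["a", "low"], priors, cond_probs))  # -> '+'
```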
Naïve Bayes (NB)
Learning Phase (Statistical Parameter Estimation)

Given a training dataset D of N labeled examples (assuming complete data):

1. Estimate P(c_j) for each class c_j:

P(c_j) = N_j / N

2. Estimate P(X_i = x_k | c_j) for each value x_k of the attribute X_i and for each class c_j:

X_i discrete:
P(X_i = x_k | c_j) = N_ijk / N_j

X_i continuous, two options:
- the attribute is discretized and then treated as a discrete attribute
- a Normal distribution is usually assumed:
  P(X_i = x_k | c_j) = g(x_k; μ_ij, σ_ij), where g(x; μ, σ) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²))
  and the mean μ_ij and the standard deviation σ_ij are estimated from D

where N_j is the number of examples of the class c_j, and N_ijk is the number of examples of the class c_j having the value x_k for the attribute X_i.
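
A sketch of this learning phase in Python, assuming complete data given as lists of attribute values (the function name and data layout are illustrative, not from the slides):

```python
from collections import Counter, defaultdict
import statistics

def fit_naive_bayes(X, y, continuous=frozenset()):
    """Estimate NB parameters: P(c_j) = N_j / N, and per class and attribute
    either value frequencies N_ijk / N_j (discrete) or (mean, stdev) (Gaussian)."""
    N = len(y)
    priors = {c: n / N for c, n in Counter(y).items()}
    params = defaultdict(dict)
    for c in priors:
        rows = [x for x, label in zip(X, y) if label == c]
        for i in range(len(X[0])):
            column = [row[i] for row in rows]
            if i in continuous:
                # Gaussian option: fit mu_ij and sigma_ij from D
                params[c][i] = (statistics.mean(column), statistics.stdev(column))
            else:
                # discrete option: relative frequencies N_ijk / N_j
                params[c][i] = {v: n / len(column) for v, n in Counter(column).items()}
    return priors, dict(params)
```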
Continuous Attributes
Normal or Gaussian Distribution

2. Estimate P(X_i = x_k | c_j) for a value of the attribute X_i and for each class c_j. A Normal distribution is usually assumed:

X_i | c_j ~ N(μ_ij, σ²_ij), where the mean μ_ij and the standard deviation σ_ij are estimated from D

P(X_i = x_k | c_j) = g(x_k; μ_ij, σ_ij), with g(x; μ, σ) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²))

The density f(x) is symmetrical around its mean.

[Figure: the probability density function of N(0, σ²).]

Example: for a variable X ~ N(74, 36), the density at the value 66 is given by:
f(66) = g(66; 74, 6) = 0.0273
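
The 0.0273 value can be checked with a couple of lines of Python:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density g(x; mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# X ~ N(74, 36), i.e. mu = 74 and sigma = sqrt(36) = 6
print(round(gaussian_pdf(66, 74, 6), 4))  # 0.0273
```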
Naïve Bayes
Probability Estimates (example from John & Langley, 1995)

Binary Classification Problem
Two classes: + (positive) and - (negative)
Two attributes: X_1, discrete, which takes values a and b; X_2, continuous

Training dataset D:

Class  X_1  X_2
+      a    1.0
+      b    1.2
+      a    3.0
-      b    4.4
-      b    4.5

1. Estimate P(c_j) for each class c_j:
P(C = +) = 3/5    P(C = -) = 2/5

2. Estimate P(X_i = x_k | c_j) for each value x_k and each class c_j:
P(X_1 = a | C = +) = 2/3    P(X_1 = b | C = +) = 1/3
P(X_1 = a | C = -) = 0/2    P(X_1 = b | C = -) = 2/2

For the continuous attribute X_2, a Gaussian is fitted per class:
μ_2+ = 1.73, σ_2+ = 1.10, so P(X_2 = x | C = +) = g(x; 1.73, 1.10)
μ_2- = 4.45, σ_2- = 0.07, so P(X_2 = x | C = -) = g(x; 4.45, 0.07)
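
Feeding this tiny dataset to the fit_naive_bayes sketch from the learning-phase slide reproduces these estimates (stdev is the sample standard deviation):

```python
X = [["a", 1.0], ["b", 1.2], ["a", 3.0], ["b", 4.4], ["b", 4.5]]
y = ["+", "+", "+", "-", "-"]
priors, params = fit_naive_bayes(X, y, continuous={1})
print(priors)          # {'+': 0.6, '-': 0.4}
print(params["+"][0])  # {'a': 0.666..., 'b': 0.333...}
print(params["+"][1])  # (1.733..., 1.101...)  i.e. mu = 1.73, sigma = 1.10
print(params["-"][1])  # (4.45, 0.0707...)     i.e. mu = 4.45, sigma = 0.07
```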
Probability Estimates
Discrete Attributes

Training dataset (adapted from Tan, Steinbach, Kumar, Introduction to Data Mining):

Tid  Refund  Marital Status  Taxable Income  Evade
  1  Yes     Single          125K            No
  2  No      Married         100K            No
  3  No      Single           70K            No
  4  Yes     Married         120K            No
  5  No      Divorced         95K            Yes
  6  No      Married          60K            No
  7  Yes     Divorced        220K            No
  8  No      Single           85K            Yes
  9  No      Married          75K            No
 10  No      Single           90K            Yes

(Refund and Marital Status are categorical, Taxable Income is continuous, Evade is the class.)

For each class: P(c_j) = N_j / N
P(No) = 7/10    P(Yes) = 3/10

For each attribute value and class: P(X_i = x_k | c_j) = N_ijk / N_j. To compute N_ijk we need to count the number of examples of the class c_j having the value x_k for the attribute X_i.

Examples:
P(Status = Married | No) = 4/7
P(Refund = Yes | Yes) = 0
Probability Estimates
Continuous Attributes

For each pair attribute-class (X_i, c_j), a Normal distribution is assumed:

P(X_i = x_k | c_j) = g(x_k; μ_ij, σ_ij) = (1 / (σ_ij √(2π))) e^(−(x_k − μ_ij)² / (2σ²_ij))

Example, for (Income, Class = No) in the training dataset above (adapted from Tan, Steinbach, Kumar, Introduction to Data Mining):

If Class = No:
sample mean = 110
sample variance = 2975, sample standard deviation = 54.54

P(Income = 120 | No) = (1 / (√(2π) · 54.54)) e^(−(120 − 110)² / (2 · 2975)) = 0.0072
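
This too is easy to verify in Python, reusing the gaussian_pdf helper defined earlier:

```python
import statistics

incomes_no = [125, 100, 70, 120, 60, 220, 75]  # Taxable Income (in K) for Class = No
mu = statistics.mean(incomes_no)               # 110.0
sigma = statistics.stdev(incomes_no)           # 54.54... (sample stdev, variance 2975)
print(round(gaussian_pdf(120, mu, sigma), 4))  # 0.0072
```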
The Balance-scale Problem
(dataset from the UCI repository)

The dataset was generated to model psychological experimental results.

Each example has 4 numerical attributes:
- the left weight (Left_W)
- the left distance (Left_D)
- the right weight (Right_W)
- the right distance (Right_D)

Each example is classified into 3 classes, the state of the balance scale:
- tip to the right (Right)
- tip to the left (Left)
- is balanced (Balanced)

3 target rules:
If LD × LW > RD × RW, tip to the left
If LD × LW < RD × RW, tip to the right
If LD × LW = RD × RW, it is balanced
The Balance-scale Problem

Balance-Scale DataSet (adapted from João Gama's slides "Aprendizagem Bayesiana"):

Left_W  Left_D  Right_W  Right_D  Class
1       5       4        2        Right
2       5       3        2        Left
3       4       6        2        Balanced
...     ...     ...      ...      ...

Discretization is applied: each attribute is mapped to 5 intervals.
The Balance-scale Problem
Learning Phase

Build contingency tables (565 examples):

Class Counters
Left: 260    Balanced: 45    Right: 260    Total: 565

Attribute: Left_W
Class      I1  I2  I3  I4  I5
Left       14  42  61  71  72
Balanced   10   8   8  10   9
Right      86  66  49  34  25

Attribute: Left_D
Class      I1  I2  I3  I4  I5
Left       16  38  59  70  77
Balanced    8  10   9  10   8
Right      90  57  49  37  27

Attribute: Right_W
Class      I1  I2  I3  I4  I5
Left       87  63  49  33  28
Balanced    8  10  10   9   8
Right      16  37  58  70  79

Attribute: Right_D
Class      I1  I2  I3  I4  I5
Left       91  65  44  35  25
Balanced    8  10   8  10   9
Right      17  37  57  67  82

Assuming complete data, the computation of all the required estimates requires a simple scan through the data, an operation of time complexity O(N n), where N is the number of training examples and n the number of attributes.
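
A sketch of this single counting pass in Python; the row layout is illustrative (each row is a list of discretized attribute values followed by the class label):

```python
from collections import Counter, defaultdict

def contingency_tables(rows, class_index=-1):
    """One O(N*n) scan over the data: class counters plus, for every
    attribute, the counts N_ijk of (value, class) co-occurrences."""
    class_counts = Counter()
    tables = defaultdict(Counter)   # tables[i][(value, c)] -> N_ijk
    for row in rows:
        c = row[class_index]
        class_counts[c] += 1
        for i, value in enumerate(row[:class_index]):
            tables[i][(value, c)] += 1
    return class_counts, dict(tables)
```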
The Balance-scale Problem
Classification Phase

How does NB classify this example?

Left_W  Left_D  Right_W  Right_D  Class
1       5       4        2        ?

The class counters and contingency tables are used to compute the posterior probabilities for each class. We need to estimate P(c_j | x) for each class c_j ∈ {Left, Balanced, Right}:

P(c_j | x) ∝ P(c_j) × P(Left_W = 1 | c_j) × P(Left_D = 5 | c_j) × P(Right_W = 4 | c_j) × P(Right_D = 2 | c_j)

c* = h_NB(x) = arg max_{j=1..m} P(c_j) ∏_{i=1}^{n} P(X_i = x_i | c_j)

The class for this example is the class with the largest posterior probability:

P(Left | x)   P(Balanced | x)   P(Right | x)
0.277796      0.135227          0.586978        → Class = Right
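
A sketch of this step using the counters produced above; the normalization at the end is optional, since only the arg max matters, and the exact figures depend on details such as smoothing:

```python
def posteriors(x, class_counts, tables, total):
    """Normalized NB scores P(c_j) * prod_i P(x_i | c_j) for a discretized x."""
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                        # P(c_j) = N_j / N
        for i, value in enumerate(x):
            score *= tables[i][(value, c)] / n_c   # P(X_i = x_i | c_j)
        scores[c] = score
    z = sum(scores.values()) or 1.0  # all-zero scores are addressed by the Laplace correction later
    return {c: s / z for c, s in scores.items()}
```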
Iris Dataset
(dataset from the UCI repository, introduced by the statistician R. A. Fisher)

Three flower types (classes): Setosa, Virginica, Versicolour
Four continuous attributes: sepal width and length, petal width and length

[Photo: Iris virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.]
Iris Dataset
[Figure-only slides: sample data and plots of the dataset]
Scatter Plot Array of Iris Attributes
The attributes petal width and petal length provide a moderate separation of the Iris species.
Naïve Bayes
Iris Dataset, Continuous Attributes: Normal Probability Density Functions

Attribute: PetalWidth
[Figure: the densities P(PetalWidth | Setosa), P(PetalWidth | Versicolor), P(PetalWidth | Virginica)]

Class Iris-setosa: mean 0.244, standard deviation 0.107
Class Iris-versicolor: mean 1.326, standard deviation 0.198
Class Iris-virginica: mean 2.026, standard deviation 0.275
Naïve Bayes
Iris Dataset, Continuous Attributes: Model

Class Iris-setosa (0.327):
==========================
Attribute sepallength
mean: 5.006, standard deviation: 0.352
Attribute sepalwidth
mean: 3.418, standard deviation: 0.381
Attribute petallength
mean: 1.464, standard deviation: 0.174
Attribute petalwidth
mean: 0.244, standard deviation: 0.107

Class Iris-versicolor (0.327):
==============================
Attribute sepallength
mean: 5.936, standard deviation: 0.516
Attribute sepalwidth
mean: 2.770, standard deviation: 0.314
Attribute petallength
mean: 4.260, standard deviation: 0.470
Attribute petalwidth
mean: 1.326, standard deviation: 0.198

Class Iris-virginica (0.327):
=============================
Attribute sepallength
mean: 6.588, standard deviation: 0.636
Attribute sepalwidth
mean: 2.974, standard deviation: 0.322
Attribute petallength
mean: 5.552, standard deviation: 0.552
Attribute petalwidth
mean: 2.026, standard deviation: 0.275
Naïve Bayes
Iris Dataset, Continuous Attributes: Classification Phase
[Figure: classification output; the predicted class is the one with the maximum posterior value]
Naïve Bayes
Iris Dataset, Continuous Attributes

How does NB classify the example x = (sepalLength = 5, sepalWidth = 3, petalLength = 2, petalWidth = 2)?

Estimate P(c_j | x) for each class c_j ∈ {setosa, versicolor, virginica}:

P(c_j | x) ∝ P(c_j) × P(sepalLength = 5 | c_j) × P(sepalWidth = 3 | c_j) × P(petalLength = 2 | c_j) × P(petalWidth = 2 | c_j)

The class for this example is the class with the largest posterior probability:

P(setosa | x)   P(versicolor | x)   P(virginica | x)
0               0.995               0.005             → Class = versicolor
Naïve Bayes
Iris Dataset, Continuous Attributes

Estimate P(c_j | x) for the versicolor class.

Class Iris-versicolor (0.327):
Attribute sepallength: mean 5.936, standard deviation 0.516
Attribute sepalwidth: mean 2.770, standard deviation 0.314
Attribute petallength: mean 4.260, standard deviation 0.470
Attribute petalwidth: mean 1.326, standard deviation 0.198

P(versicolor | x) ∝ P(versicolor) × P(sepalLength = 5 | versicolor) × P(sepalWidth = 3 | versicolor) × P(petalLength = 2 | versicolor) × P(petalWidth = 2 | versicolor)

P(versicolor | x) ∝ 0.327 × g(5; 5.936, 0.516) × g(3; 2.770, 0.314) × g(2; 4.260, 0.470) × g(2; 1.326, 0.198)
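
The whole computation can be reproduced in a few lines of Python from the Gaussian parameters on the model slide; normalizing the three unnormalized scores yields the posteriors reported above:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# (mean, stdev) per attribute: sepallength, sepalwidth, petallength, petalwidth
model = {
    "setosa":     [(5.006, 0.352), (3.418, 0.381), (1.464, 0.174), (0.244, 0.107)],
    "versicolor": [(5.936, 0.516), (2.770, 0.314), (4.260, 0.470), (1.326, 0.198)],
    "virginica":  [(6.588, 0.636), (2.974, 0.322), (5.552, 0.552), (2.026, 0.275)],
}
x = [5, 3, 2, 2]

scores = {c: 0.327 * math.prod(gaussian_pdf(v, mu, s) for v, (mu, s) in zip(x, params))
          for c, params in model.items()}
z = sum(scores.values())
print({c: round(s / z, 3) for c, s in scores.items()})
# {'setosa': 0.0, 'versicolor': 0.995, 'virginica': 0.005}
```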
Naïve Bayes
Continuous Attributes: Discretization
[Figure: RapidMiner process for the continuous attributes, after BinDiscretization with bins = 3]
Naïve Bayes
Continuous Attributes: Discretization, Model
Class Iris-versicolor (0.327):
==============================
Attribute sepallength
range1: 0.220 range2: 0.720 range3: 0.060
Attribute sepalwidth
range1: 0.460 range2: 0.540 range3: 0.000
Attribute petallength
range1: 0.000 range2: 0.960 range3: 0.040
Attribute petalwidth
range1: 0.000 range2: 0.980 range3: 0.020
Class Iris-setosa (0.327):
==========================
Attribute sepallength
range1: 0.940 range2: 0.060 range3: 0.000
Attribute sepalwidth
range1: 0.020 range2: 0.720 range3: 0.260
Attribute petallength
range1: 1.000 range2: 0.000 range3: 0.000
Attribute petalwidth
range1: 1.000 range2: 0.000 range3: 0.000
Class Iris-virginica (0.327):
=============================
Attribute sepallength
range1: 0.020 range2: 0.640 range3: 0.340
Attribute sepalwidth
range1: 0.380 range2: 0.580 range3: 0.040
Attribute petallength
range1: 0.000 range2: 0.120 range3: 0.880
Attribute petalwidth
range1: 0.000 range2: 0.100 range3: 0.900
Naïve Bayes
Continuous Attributes: Discretization

From the model above we can build a conditional probability table (CPT) for each attribute:

Class Probabilities
Setosa  Versicolor  Virginica
0.327   0.327       0.327

Attribute Sepallength
            Range1  Range2  Range3
Setosa      0.940   0.060   0.000
Versicolor  0.220   0.720   0.060
Virginica   0.020   0.640   0.340

Attribute Sepalwidth
            Range1  Range2  Range3
Setosa      0.020   0.720   0.260
Versicolor  0.460   0.540   0.000
Virginica   0.380   0.580   0.040
Naïve Bayes
Continuous Attributes: Discretization

Attribute Petallength
            Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.960   0.040
Virginica   0.000   0.100   0.900

Attribute Petalwidth
            Range1  Range2  Range3
Setosa      1.000   0.000   0.000
Versicolor  0.000   0.980   0.020
Virginica   0.000   0.100   0.900
Naïve Bayes
Iris Dataset, Discretized Attributes: Classification (Implementation) Phase
[Figure: discretized examples and classification output; the predicted class is the one with the maximum posterior value]
Naïve Bayes
Classification Phase

How to classify a new example?

                              sepallength  sepalwidth  petallength  petalwidth
Example:                      5            3           2            2
With discretized attributes:  r1           r2          r1           r3

Compute the conditional posterior probabilities for each class:

P(setosa | x) = P(setosa) × P(sepalLength = r1 | setosa) × P(sepalWidth = r2 | setosa) × P(petalLength = r1 | setosa) × P(petalWidth = r3 | setosa)

P(versicolor | x) = P(versicolor) × P(sepalLength = r1 | versicolor) × P(sepalWidth = r2 | versicolor) × P(petalLength = r1 | versicolor) × P(petalWidth = r3 | versicolor)

P(virginica | x) = P(virginica) × P(sepalLength = r1 | virginica) × P(sepalWidth = r2 | virginica) × P(petalLength = r1 | virginica) × P(petalWidth = r3 | virginica)
Naïve Bayes
Continuous Attributes: Discretization

Using the CPTs above for the discretized example x = (r1, r2, r1, r3):

P(setosa | x) = P(setosa)
              × P(sepalLength = r1 | setosa)
              × P(sepalWidth = r2 | setosa)
              × P(petalLength = r1 | setosa)
              × P(petalWidth = r3 | setosa)

P(setosa | x) = 0.327 × 0.940 × 0.720 × 1.000 × 0.000 = 0
Naïve Bayes
Continuous Attributes: Discretization

P(versicolor | x) = P(versicolor)
                  × P(sepalLength = r1 | versicolor)
                  × P(sepalWidth = r2 | versicolor)
                  × P(petalLength = r1 | versicolor)
                  × P(petalWidth = r3 | versicolor)

P(versicolor | x) = 0.327 × 0.220 × 0.540 × 0.000 × 0.020 = 0
Naïve Bayes
Continuous Attributes: Discretization

P(virginica | x) = P(virginica)
                 × P(sepalLength = r1 | virginica)
                 × P(sepalWidth = r2 | virginica)
                 × P(petalLength = r1 | virginica)
                 × P(petalWidth = r3 | virginica)

P(virginica | x) = 0.327 × 0.020 × 0.580 × 0.000 × 0.900 = 0

If all the class probabilities are zero, we cannot determine the class for this example.
Naïve Bayes
Laplace Correction

To avoid zero probabilities due to zero counters we can implement the Laplace correction.

To calculate the conditional probabilities, instead of using the estimate

P(X_i = x_k | c_j) = N_ijk / N_j

we use the Laplace correction:

P(X_i = x_k | c_j) = (N_ijk + 1) / (N_j + k)

where N_ijk is the number of examples in D such that X_i = x_k and C = c_j, N_j is the number of examples in D of class c_j, and k is the number of possible values of X_i.
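
A one-line sketch of the corrected estimate; the counts in the usage example are illustrative:

```python
def laplace_estimate(n_ijk, n_j, k):
    """Laplace-corrected P(X_i = x_k | c_j) = (N_ijk + 1) / (N_j + k)."""
    return (n_ijk + 1) / (n_j + k)

# a zero counter no longer produces a zero probability, e.g. with
# illustrative counts N_ijk = 0, N_j = 49 and k = 3 possible values:
print(laplace_estimate(0, 49, 3))  # 0.0192... instead of 0.0
```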
Bayesian SPAM Filters
(Spambase dataset from the UCI repository)

Binary Classification Problem
57 continuous attributes:
- some word frequencies
- some character frequencies
- capital letters frequency
Two classes: 0 is not SPAM, 1 is SPAM
Bayesian SPAM Filters
Implementation in RapidMiner
[Figure: RapidMiner process with a Discretization Method, a Feature Subset Selection Method, Learning, Testing, and an Evaluation Method]

In this dataset there are no missing values; otherwise we would first need to use a method to replace the missing values.
Bayesian SPAM Filters
Confusion Matrix

Concept Learning Problem: is an e-mail a SPAM?
True Positive (TP) = number of examples classified as positive which truly are positive
False Positive (FP) = number of examples classified as positive which are negative
False Negative (FN) = number of examples classified as negative which are positive

SPAM Precision: percentage of e-mails classified as SPAM which truly are SPAM
precision = TP / (TP + FP)

SPAM Recall (sensitivity, true positive rate): percentage of e-mails classified as SPAM over the total of examples that are SPAM
recall = TPR = TP / (TP + FN)
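
These two measures are straightforward to compute from confusion-matrix counts; the numbers below are hypothetical:

```python
def precision_recall(tp, fp, fn):
    """SPAM precision = TP / (TP + FP); SPAM recall (TPR) = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

# hypothetical counts: 160 spams caught, 15 false alarms, 40 spams missed
print(precision_recall(tp=160, fp=15, fn=40))  # (0.914..., 0.8)
```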
Experimental Results
[Figure: results with continuous attributes, with discretization after FSS, and with discretization before FSS]

Legend, order of the pre-processing operators used in each case:
1. Feature Selection (CFS)
2. Genetic Algorithm (CFS)
3. Wrapper
4. Feature Selection (CFS), Minimal Entropy Discretization
5. Genetic Algorithm (CFS), Minimal Entropy Discretization
6. Feature Selection (CFS), Frequency Discretization
7. Wrapper, Minimal Entropy Discretization
8. Minimal Entropy Discretization, Feature Selection (CFS)
9. Minimal Entropy Discretization, Genetic Algorithm (CFS)
10. Minimal Entropy Discretization, Wrapper
Naïve Bayes Performance

NB is one of the simplest and most effective classifiers.
NB has a very strong, unrealistic independence assumption: all the attributes are conditionally independent given the value of the class.

[Figure: bias-variance decomposition of the test error of Naive Bayes on the Nursery dataset, for training sets from 500 to 12500 examples; the bias component dominates the variance component.]

In practice the independence assumption is violated → HIGH BIAS; it can lead to poor classification.
However, NB is efficient due to its high variance management: fewer parameters → LOW VARIANCE.
Improving Naïve Bayes

Reducing the bias resulting from the modeling error, by relaxing the attribute independence assumption; one natural extension: Bayesian Network Classifiers.
Reducing the bias of the parameter estimates, by improving the probability estimates computed from data.

Relevant works:
- Webb and Pazzani (1998) - Adjusted probability naive Bayesian induction, in LNCS v. 1502
- J. Gama (2001, 2003) - Iterative Bayes, in Theoretical Computer Science, v. 292
- Friedman, Geiger and Goldszmidt (1997) - Bayesian Network Classifiers, in Machine Learning, 29
- Pazzani (1995) - Searching for attribute dependencies in Bayesian Network Classifiers, in Proc. of the 5th Workshop of Artificial Intelligence and Statistics
- Keogh and Pazzani (1999) - Learning augmented Bayesian classifiers, in Theoretical Computer Science, v. 292