Data Mining and Machine Learning: Fundamental Concepts and Algorithms
Chapter 18: Probabilistic Classification

Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)

dataminingbook.info
Bayes Classifier
Let the training dataset D consist of n points x_i in a d-dimensional space, and let y_i denote the class for each point, with y_i ∈ {c_1, c_2, ..., c_k}.
The Bayes classifier estimates the posterior probability P(c_i|x) for each class c_i, and chooses the class that has the largest posterior probability. The predicted class for x is given as

$$\hat{y} = \arg\max_{c_i} \{P(c_i \mid \mathbf{x})\}$$

By Bayes' theorem, the posterior is expressed via the likelihood and the prior:

$$P(c_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid c_i)\, P(c_i)}{P(\mathbf{x})}$$

Because P(x) is fixed for a given point, Bayes' rule can be rewritten as

$$\hat{y} = \arg\max_{c_i} \{P(c_i \mid \mathbf{x})\} = \arg\max_{c_i} \left\{\frac{P(\mathbf{x} \mid c_i)\, P(c_i)}{P(\mathbf{x})}\right\} = \arg\max_{c_i} \{P(\mathbf{x} \mid c_i)\, P(c_i)\}$$
Estimating the Prior Probability P(c_i)

Let D_i denote the subset of points in D that are labeled with class c_i:

$$\mathbf{D}_i = \{\mathbf{x}_j \in \mathbf{D} \mid y_j = c_i\}$$

Let the size of the dataset D be |D| = n, and let the size of each class-specific subset D_i be |D_i| = n_i. The prior probability is then estimated as the fraction of points in class c_i:

$$\hat{P}(c_i) = \frac{n_i}{n}$$
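As a minimal Python sketch (not from the original slides; the label array below is hypothetical), the prior estimates are just normalized class counts:

```python
import numpy as np

# Prior estimation: P̂(c_i) = n_i / n.
# `y` is a hypothetical array of class labels for the n training points.
y = np.array([0, 0, 1, 1, 1, 2])
classes, counts = np.unique(y, return_counts=True)
priors = counts / len(y)                  # n_i / n for each class c_i
print(dict(zip(classes, priors)))         # {0: 0.333..., 1: 0.5, 2: 0.166...}
```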
Estimating the Likelihood
Numeric Attributes, Parametric Approach

In the parametric approach, each class c_i is modeled as a multivariate normal distribution, with mean μ̂_i and covariance matrix Σ̂_i estimated from D_i. The likelihood is then the density at x:

$$\hat{P}(\mathbf{x} \mid c_i) = \hat{f}(\mathbf{x} \mid \hat{\mu}_i, \hat{\Sigma}_i) = \frac{1}{(2\pi)^{d/2}\,|\hat{\Sigma}_i|^{1/2}} \exp\left\{-\frac{(\mathbf{x} - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (\mathbf{x} - \hat{\mu}_i)}{2}\right\}$$
Bayes Classifier Algorithm

BayesClassifier (D):
1  for i = 1, ..., k do
2      D_i ← {x_j ∈ D | y_j = c_i}            // class-specific subset
3      n_i ← |D_i|                             // cardinality
4      P̂(c_i) ← n_i / n                        // prior probability
5      μ̂_i ← (1/n_i) Σ_{x_j ∈ D_i} x_j         // sample mean
6      Z_i ← D_i − 1 · μ̂_i^T                   // centered data
7      Σ̂_i ← (1/n_i) Z_i^T Z_i                 // sample covariance matrix
8  return P̂(c_i), μ̂_i, Σ̂_i for all i = 1, ..., k

To classify a test point x, predict ŷ = arg max_{c_i} { f̂(x | μ̂_i, Σ̂_i) · P̂(c_i) }.
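A runnable counterpart of the training and testing steps above, as a minimal sketch (not from the original slides): it assumes NumPy arrays `D` (n × d data) and `y` (labels), and uses SciPy's multivariate normal density for the likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_fit(D, y):
    """Estimate P̂(c_i), μ̂_i, Σ̂_i for each class, as in the algorithm above."""
    params = {}
    n = len(y)
    for c in np.unique(y):
        Di = D[y == c]                 # class-specific subset D_i
        ni = len(Di)                   # n_i = |D_i|
        prior = ni / n                 # P̂(c_i) = n_i / n
        mu = Di.mean(axis=0)           # sample mean μ̂_i
        Z = Di - mu                    # centered data
        Sigma = (Z.T @ Z) / ni         # sample covariance Σ̂_i (biased, 1/n_i)
        params[c] = (prior, mu, Sigma)
    return params

def bayes_predict(x, params):
    """ŷ = arg max_i f(x | μ̂_i, Σ̂_i) · P̂(c_i)."""
    scores = {c: multivariate_normal.pdf(x, mean=mu, cov=Sigma) * prior
              for c, (prior, mu, Sigma) in params.items()}
    return max(scores, key=scores.get)
```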
Bayes Classifier: Iris Data
Consider the 2D Iris data (sepal length and sepal width). Class c_1, which corresponds to iris-setosa (circles), has n_1 = 50 points, whereas the other class c_2 (triangles) has n_2 = 100 points. The priors are

$$\hat{P}(c_1) = \frac{n_1}{n} = \frac{50}{150} = 0.33 \qquad \hat{P}(c_2) = \frac{n_2}{n} = \frac{100}{150} = 0.67$$

The estimated means and covariance matrices are

$$\hat{\mu}_1 = \begin{pmatrix} 5.006 \\ 3.418 \end{pmatrix} \qquad \hat{\mu}_2 = \begin{pmatrix} 6.262 \\ 2.872 \end{pmatrix}$$

$$\hat{\Sigma}_1 = \begin{pmatrix} 0.1218 & 0.0983 \\ 0.0983 & 0.1423 \end{pmatrix} \qquad \hat{\Sigma}_2 = \begin{pmatrix} 0.435 & 0.1209 \\ 0.1209 & 0.1096 \end{pmatrix}$$

The plot below shows the contour or level curve (corresponding to 1% of the peak density) of the multivariate normal distribution modeling the probability density for each class.
[Figure: Iris data, X_1 (sepal length) versus X_2 (sepal width). Circles denote class c_1 (iris-setosa), triangles class c_2; the test point x = (6.75, 4.25)^T is shown as a white square.]
Bayes Classifier: Iris Data
Let x = (6.75, 4.25)^T be a test point (the white square in the figure). The posterior probabilities for c_1 and c_2 can be computed as

$$\hat{P}(c_1 \mid \mathbf{x}) \propto \hat{f}(\mathbf{x} \mid \hat{\mu}_1, \hat{\Sigma}_1)\, \hat{P}(c_1) = (4.951 \times 10^{-7}) \times 0.33 = 1.634 \times 10^{-7}$$
$$\hat{P}(c_2 \mid \mathbf{x}) \propto \hat{f}(\mathbf{x} \mid \hat{\mu}_2, \hat{\Sigma}_2)\, \hat{P}(c_2) = (2.589 \times 10^{-5}) \times 0.67 = 1.735 \times 10^{-5}$$

Because P̂(c_2|x) > P̂(c_1|x), the class for x is predicted as ŷ = c_2.
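These numbers can be checked with a short script; a minimal sketch (not from the original slides), plugging in the estimates from the previous slide:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Unnormalized posteriors for the test point, using the 2D Iris estimates.
x = np.array([6.75, 4.25])
mu1, S1 = np.array([5.006, 3.418]), np.array([[0.1218, 0.0983], [0.0983, 0.1423]])
mu2, S2 = np.array([6.262, 2.872]), np.array([[0.435, 0.1209], [0.1209, 0.1096]])

p1 = multivariate_normal.pdf(x, mean=mu1, cov=S1) * (50 / 150)   # ≈ 1.6e-7
p2 = multivariate_normal.pdf(x, mean=mu2, cov=S2) * (100 / 150)  # ≈ 1.7e-5
print('c2' if p2 > p1 else 'c1')                                 # predicts c2
```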
Bayes Classifier
Categorical Attributes
The probability of the categorical point x is obtained from the joint probability mass function (PMF) for the vector random variable X:

$$P(\mathbf{x} \mid c_i) = f(\mathbf{v} \mid c_i) = f(X_1 = e_{1 r_1}, \ldots, X_d = e_{d r_d} \mid c_i)$$

The joint PMF can be estimated directly from the data D_i for each class c_i as follows:

$$\hat{f}(\mathbf{v} \mid c_i) = \frac{n_i(\mathbf{v})}{n_i}$$

where n_i(v) is the number of times the value v occurs in class c_i. However, to avoid zero probabilities, we add a pseudo-count of 1 for each possible value of v:

$$\hat{f}(\mathbf{v} \mid c_i) = \frac{n_i(\mathbf{v}) + 1}{n_i + \prod_{j=1}^{d} m_j}$$

where m_j = |dom(X_j)| denotes the number of distinct values of attribute X_j.
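A minimal sketch of this smoothed joint PMF estimate (not from the original slides; the two-attribute data below is hypothetical):

```python
from collections import Counter
from math import prod

def joint_pmf_estimate(Di, domain_sizes):
    """f̂(v | c_i) with a pseudo-count of 1 per cell:
    (n_i(v) + 1) / (n_i + prod_j m_j).
    `Di` is the list of categorical points in class c_i (as tuples);
    `domain_sizes` gives m_j = |dom(X_j)| for each attribute."""
    counts = Counter(Di)
    ni = len(Di)
    denom = ni + prod(domain_sizes)
    return lambda v: (counts[v] + 1) / denom

# Hypothetical class data over two attributes with m_1 = 4 and m_2 = 3 values:
Di = [('Long', 'Medium'), ('Long', 'Medium'), ('Short', 'Long')]
f_hat = joint_pmf_estimate(Di, [4, 3])
print(f_hat(('Long', 'Medium')))   # (2 + 1) / (3 + 12) = 0.2
print(f_hat(('Long', 'Long')))     # (0 + 1) / (3 + 12), never zero
```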
Discretized Iris Data
Assume that the sepal length and sepal width attributes in the Iris dataset have been discretized as shown below.

(a) Discretized sepal length
Bins         Domain
[4.3, 5.2]   Very Short (a_11)
(5.2, 6.1]   Short (a_12)
(6.1, 7.0]   Long (a_13)
(7.0, 7.9]   Very Long (a_14)

(b) Discretized sepal width
Bins         Domain
[2.0, 2.8]   Short (a_21)
(2.8, 3.6]   Medium (a_22)
(3.6, 4.4]   Long (a_23)
Discretized Iris Data
Class-specific Empirical Joint Probability Mass Function

Class c_1:
X_1 \ X_2           Short (e_21)   Medium (e_22)   Long (e_23)   f̂_X1
Very Short (e_11)   1/50           33/50           5/50          39/50
Short (e_12)        0              3/50            8/50          13/50
Long (e_13)         0              0               0             0
Very Long (e_14)    0              0               0             0
f̂_X2                1/50           36/50           13/50

Class c_2:
X_1 \ X_2           Short (e_21)   Medium (e_22)   Long (e_23)   f̂_X1
Very Short (e_11)   6/100          0               0             6/100
Short (e_12)        24/100         15/100          0             39/100
Long (e_13)         13/100         30/100          0             43/100
Very Long (e_14)    3/100          7/100           2/100         12/100
f̂_X2                46/100         52/100          2/100
Discretized Iris Data: Test Case

Consider the test point x = (6.75, 4.25)^T, which falls in bins (6.1, 7.0] and (3.6, 4.4], that is, v = (e_13, e_23) = (Long, Long). From the joint PMF tables, n_1(v) = 0 and n_2(v) = 0, so

$$\hat{f}(\mathbf{v} \mid c_1) = \frac{0}{50} = 0 \qquad \hat{f}(\mathbf{v} \mid c_2) = \frac{0}{100} = 0$$

Both class likelihoods are zero, so without smoothing the classifier cannot choose between the classes.
Discretized Iris Data: Test Case with Pseudo-counts

Applying the pseudo-count adjustment with m_1 = 4 and m_2 = 3, so that ∏_j m_j = 12:

$$\hat{f}(\mathbf{v} \mid c_1) = \frac{0 + 1}{50 + 12} = 1.61 \times 10^{-2} \qquad \hat{f}(\mathbf{v} \mid c_2) = \frac{0 + 1}{100 + 12} = 8.93 \times 10^{-3}$$

$$\hat{P}(c_1 \mid \mathbf{v}) \propto (1.61 \times 10^{-2}) \times 0.33 = 5.32 \times 10^{-3} \qquad \hat{P}(c_2 \mid \mathbf{v}) \propto (8.93 \times 10^{-3}) \times 0.67 = 5.98 \times 10^{-3}$$

Because P̂(c_2|v) > P̂(c_1|v), the predicted class is ŷ = c_2.
Bayes Classifier: Challenges
The main problem with the full Bayes classifier is that there is rarely enough data to reliably estimate the joint probability density or mass function, especially for high-dimensional data, since the number of parameters (or cells of the joint PMF) grows rapidly with the dimensionality d.
Naive Bayes Classifier
Numeric Attributes
The naive Bayes approach makes the simplifying assumption that all the attributes are independent, which implies that the likelihood can be decomposed into a product of dimension-wise probabilities:

$$P(\mathbf{x} \mid c_i) = P(x_1, x_2, \ldots, x_d \mid c_i) = \prod_{j=1}^{d} P(x_j \mid c_i)$$

For numeric attributes, each dimension-wise probability is modeled via a univariate normal density:

$$P(x_j \mid c_i) \propto \hat{f}(x_j \mid \hat{\mu}_{ij}, \hat{\sigma}_{ij}^2) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}_{ij}} \exp\left\{-\frac{(x_j - \hat{\mu}_{ij})^2}{2\hat{\sigma}_{ij}^2}\right\}$$

where μ̂_ij and σ̂²_ij denote the estimated mean and variance for attribute X_j, for class c_i.
The naive assumption corresponds to setting all the covariances in Σ̂_i to zero, that is,

$$\hat{\Sigma}_i = \begin{pmatrix} \sigma_{i1}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{i2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{id}^2 \end{pmatrix}$$

The naive Bayes classifier thus uses the sample mean μ̂_i = (μ̂_i1, ..., μ̂_id)^T and a diagonal sample covariance matrix Σ̂_i = diag(σ²_i1, ..., σ²_id) for each class c_i. In total, only 2d parameters have to be estimated per class, corresponding to the sample mean and sample variance for each dimension X_j.
Naive Bayes Algorithm

The naive Bayes training algorithm mirrors the Bayes classifier algorithm, except that only the per-dimension variances are estimated: for each class c_i it returns the prior P̂(c_i) = n_i/n, the sample mean μ̂_i, and the variances σ̂²_ij for j = 1, ..., d.
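A minimal NumPy sketch of naive Bayes training and prediction under these assumptions (not from the original slides; log-probabilities are used for numerical stability):

```python
import numpy as np

def naive_bayes_fit(D, y):
    """Per class: prior P̂(c_i), mean μ̂_i, and per-dimension variances σ̂²_ij."""
    params = {}
    for c in np.unique(y):
        Di = D[y == c]
        params[c] = (len(Di) / len(y),      # P̂(c_i)
                     Di.mean(axis=0),        # μ̂_i
                     Di.var(axis=0))         # σ̂²_i1, ..., σ̂²_id (biased, 1/n_i)
    return params

def naive_bayes_predict(x, params):
    """ŷ = arg max_i P̂(c_i) · prod_j f(x_j | μ̂_ij, σ̂²_ij), in log-space."""
    def log_score(prior, mu, var):
        # Sum of univariate normal log-densities plus the log prior.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + log_lik
    return max(params, key=lambda c: log_score(*params[c]))
```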
Iris Data: Naive Bayes
Consider the 2D Iris data consisting of sepal length and sepal width. P̂(c_i) and μ̂_i remain unchanged, but the covariance matrices Σ̂_i are now assumed to be diagonal:

$$\hat{\Sigma}_1 = \begin{pmatrix} 0.1218 & 0 \\ 0 & 0.1423 \end{pmatrix} \qquad \hat{\Sigma}_2 = \begin{pmatrix} 0.435 & 0 \\ 0 & 0.1096 \end{pmatrix}$$

The figure below shows the contour or level curve (corresponding to 1% of the peak density) of the multivariate normal distribution for both classes. The diagonal assumption leads to contours that are axis-parallel ellipses.

For the test point x = (6.75, 4.25)^T, the posterior probabilities for c_1 and c_2 are as follows:

$$\hat{P}(c_1 \mid \mathbf{x}) \propto \hat{f}(\mathbf{x} \mid \hat{\mu}_1, \hat{\Sigma}_1)\, \hat{P}(c_1) = (4.014 \times 10^{-7}) \times 0.33 = 1.325 \times 10^{-7}$$
$$\hat{P}(c_2 \mid \mathbf{x}) \propto \hat{f}(\mathbf{x} \mid \hat{\mu}_2, \hat{\Sigma}_2)\, \hat{P}(c_2) = (9.585 \times 10^{-5}) \times 0.67 = 6.422 \times 10^{-5}$$

Because P̂(c_2|x) > P̂(c_1|x), the predicted class is again ŷ = c_2.
Naive Bayes versus Full Bayes Classifier

[Figure: two side-by-side panels of the Iris data, X_1 (sepal length) versus X_2 (sepal width), each marking the test point x = (6.75, 4.25)^T as a white square. The naive Bayes contours are axis-parallel ellipses, whereas the full Bayes contours are ellipses oriented along the sample covariance.]
Naive Bayes
Categorical Attributes
The independence assumption implies that the likelihood factorizes across the attributes:

$$P(\mathbf{x} \mid c_i) = \prod_{j=1}^{d} P(x_j \mid c_i) = \prod_{j=1}^{d} f(X_j = e_{j r_j} \mid c_i)$$

where f(X_j = e_{j r_j} | c_i) is the probability mass function for X_j, estimated as

$$\hat{f}(v_j \mid c_i) = \frac{n_i(v_j)}{n_i}$$

where n_i(v_j) is the observed frequency of the value v_j = e_{j r_j}, corresponding to the r_j-th categorical value a_{j r_j} of attribute X_j, in class c_i. If the count is zero, we can use the pseudo-count method to avoid zero probabilities. The adjusted estimates with pseudo-counts are given as

$$\hat{f}(v_j \mid c_i) = \frac{n_i(v_j) + 1}{n_i + m_j}$$
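A minimal sketch of this per-attribute smoothed estimate (not from the original slides; the attribute values below are hypothetical):

```python
from collections import Counter

def attr_pmf(Di_col, mj):
    """f̂(v_j | c_i) = (n_i(v_j) + 1) / (n_i + m_j) for one attribute X_j.
    `Di_col` holds the values of X_j over class c_i; m_j = |dom(X_j)|."""
    counts, ni = Counter(Di_col), len(Di_col)
    return lambda vj: (counts[vj] + 1) / (ni + mj)

# Hypothetical: sepal-length bin labels within one class, m_1 = 4 bins.
f1 = attr_pmf(['Short', 'Short', 'Long'], mj=4)
print(f1('Short'))      # (2 + 1) / (3 + 4)
print(f1('Very Long'))  # (0 + 1) / (3 + 4), never zero
```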
For the discretized Iris data, the class-specific marginal PMFs f̂_X1 and f̂_X2 are given by the row and column totals of the earlier joint PMF tables.
Discretized Iris Data: Naive Bayes
Consider again the test point x = (6.75, 4.25)^T, which corresponds to v = (e_13, e_23). Using the marginal PMFs, with a pseudo-count for the zero count n_1(e_13) = 0 (where m_1 = 4):

$$\hat{P}(\mathbf{v} \mid c_1) = \hat{P}(e_{13} \mid c_1) \cdot \hat{P}(e_{23} \mid c_1) = \frac{0+1}{50+4} \cdot \frac{13}{50} = 4.81 \times 10^{-3}$$
$$\hat{P}(\mathbf{v} \mid c_2) = \hat{P}(e_{13} \mid c_2) \cdot \hat{P}(e_{23} \mid c_2) = \frac{43}{100} \cdot \frac{2}{100} = 8.60 \times 10^{-3}$$

$$\hat{P}(c_1 \mid \mathbf{v}) \propto (4.81 \times 10^{-3}) \times 0.33 = 1.59 \times 10^{-3}$$
$$\hat{P}(c_2 \mid \mathbf{v}) \propto (8.60 \times 10^{-3}) \times 0.67 = 5.76 \times 10^{-3}$$

Because P̂(c_2|v) > P̂(c_1|v), the predicted class is ŷ = c_2.
Nonparametric Approach
K Nearest Neighbors Classifier

The likelihood can also be estimated nonparametrically, using the K nearest neighbors of the test point x. Let B_d(x, r) denote the closed d-dimensional ball of radius r centered at x:

$$B_d(\mathbf{x}, r) = \{\mathbf{x}_i \in \mathbf{D} \mid \delta(\mathbf{x}, \mathbf{x}_i) \le r\}$$

Here δ(x, x_i) is the distance between x and x_i, which is usually assumed to be the Euclidean distance, i.e., δ(x, x_i) = ‖x − x_i‖₂. We choose r to be the smallest radius such that the ball contains exactly K points, that is, |B_d(x, r)| = K, and let V denote its volume.
Let K_i denote the number of points among the K nearest neighbors of x that are labeled with class c_i, that is,

$$K_i = |\{\mathbf{x}_j \in B_d(\mathbf{x}, r) \mid y_j = c_i\}|$$

The class conditional probability density at x can be estimated as the fraction of points from class c_i that lie within the hyperball, divided by its volume:

$$\hat{f}(\mathbf{x} \mid c_i) = \frac{K_i / n_i}{V} = \frac{K_i}{n_i V}$$

Using P̂(c_i) = n_i/n and the total density estimate f̂(x) = K/(nV), the posterior simplifies to

$$\hat{P}(c_i \mid \mathbf{x}) = \frac{\hat{f}(\mathbf{x} \mid c_i)\, \hat{P}(c_i)}{\hat{f}(\mathbf{x})} = \frac{\frac{K_i}{n_i V} \cdot \frac{n_i}{n}}{\frac{K}{n V}} = \frac{K_i}{K}$$

The predicted class is therefore

$$\hat{y} = \arg\max_{c_i} \{\hat{P}(c_i \mid \mathbf{x})\} = \arg\max_{c_i} \left\{\frac{K_i}{K}\right\} = \arg\max_{c_i} \{K_i\}$$

Because K is fixed, the KNN classifier predicts the class of x as the majority class among its K nearest neighbors.
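A minimal NumPy sketch of this majority-vote rule (not from the original slides; `D` and `y` are assumed to be NumPy arrays):

```python
import numpy as np
from collections import Counter

def knn_predict(D, y, x, K=5):
    """Majority class among the K nearest neighbors of x, i.e. ŷ = arg max_i K_i."""
    dist = np.linalg.norm(D - x, axis=1)   # δ(x, x_i) = ||x - x_i||_2
    nearest = np.argsort(dist)[:K]         # indices of the K nearest points
    return Counter(y[nearest]).most_common(1)[0][0]
```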
Iris Data: K Nearest Neighbors Classifier

[Figure: Iris data, X_1 (sepal length) versus X_2 (sepal width), with the test point x = (6.75, 4.25)^T shown as a white square and a ball of radius r enclosing its K nearest neighbors.]