Data Mining and Machine Learning: Fundamental Concepts and Algorithms
Chapter 18: Probabilistic Classification

Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)

dataminingbook.info
Bayes Classifier
Let the training dataset D consist of n points x_i in a d-dimensional space, and let y_i denote the class for each point, with y_i ∈ {c_1, c_2, ..., c_k}.
The Bayes classifier estimates the posterior probability P(c_i|x) for each class c_i, and chooses the class that has the largest posterior probability. The predicted class for x is given as

$$\hat{y} = \arg\max_{c_i} \{P(c_i \mid \mathbf{x})\}$$

By Bayes' theorem, the posterior is expressed via the likelihood and the prior:

$$P(c_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid c_i)\, P(c_i)}{P(\mathbf{x})}$$

Because P(x) is fixed for a given point, Bayes' rule can be rewritten as

$$\hat{y} = \arg\max_{c_i} \{P(c_i \mid \mathbf{x})\} = \arg\max_{c_i} \left\{\frac{P(\mathbf{x} \mid c_i)\, P(c_i)}{P(\mathbf{x})}\right\} = \arg\max_{c_i} \{P(\mathbf{x} \mid c_i)\, P(c_i)\}$$
Estimating the Prior Probability P(c_i)

Let D_i denote the subset of points in D that are labeled with class c_i:

$$\mathbf{D}_i = \{\mathbf{x}_j \in \mathbf{D} \mid y_j = c_i\}$$

Let the size of the dataset D be |D| = n, and let the size of each class-specific subset D_i be |D_i| = n_i. The prior probability is then estimated as the fraction of points in class c_i:

$$\hat{P}(c_i) = \frac{n_i}{n}$$
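As a minimal Python sketch (not from the original slides; the label array below is hypothetical), the prior estimates are just normalized class counts:

```python
import numpy as np

# Prior estimation: P̂(c_i) = n_i / n.
# `y` is a hypothetical array of class labels for the n training points.
y = np.array([0, 0, 1, 1, 1, 2])
classes, counts = np.unique(y, return_counts=True)
priors = counts / len(y)                  # n_i / n for each class c_i
print(dict(zip(classes, priors)))         # {0: 0.333..., 1: 0.5, 2: 0.166...}
```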
Estimating the Likelihood
Numeric Attributes, Parametric Approach

In the parametric approach, each class c_i is modeled as a multivariate normal distribution, with mean μ̂_i and covariance matrix Σ̂_i estimated from D_i. The likelihood is then the density at x:

$$\hat{P}(\mathbf{x} \mid c_i) = \hat{f}(\mathbf{x} \mid \hat{\mu}_i, \hat{\Sigma}_i) = \frac{1}{(2\pi)^{d/2}\,|\hat{\Sigma}_i|^{1/2}} \exp\left\{-\frac{(\mathbf{x} - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (\mathbf{x} - \hat{\mu}_i)}{2}\right\}$$
Bayes Classifier Algorithm

BayesClassifier (D):
1  for i = 1, ..., k do
2      D_i ← {x_j ∈ D | y_j = c_i}            // class-specific subset
3      n_i ← |D_i|                             // cardinality
4      P̂(c_i) ← n_i / n                        // prior probability
5      μ̂_i ← (1/n_i) Σ_{x_j ∈ D_i} x_j         // sample mean
6      Z_i ← D_i − 1 · μ̂_i^T                   // centered data
7      Σ̂_i ← (1/n_i) Z_i^T Z_i                 // sample covariance matrix
8  return P̂(c_i), μ̂_i, Σ̂_i for all i = 1, ..., k

To classify a test point x, predict ŷ = arg max_{c_i} { f̂(x | μ̂_i, Σ̂_i) · P̂(c_i) }.
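A runnable counterpart of the training and testing steps above, as a minimal sketch (not from the original slides): it assumes NumPy arrays `D` (n × d data) and `y` (labels), and uses SciPy's multivariate normal density for the likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_fit(D, y):
    """Estimate P̂(c_i), μ̂_i, Σ̂_i for each class, as in the algorithm above."""
    params = {}
    n = len(y)
    for c in np.unique(y):
        Di = D[y == c]                 # class-specific subset D_i
        ni = len(Di)                   # n_i = |D_i|
        prior = ni / n                 # P̂(c_i) = n_i / n
        mu = Di.mean(axis=0)           # sample mean μ̂_i
        Z = Di - mu                    # centered data
        Sigma = (Z.T @ Z) / ni         # sample covariance Σ̂_i (biased, 1/n_i)
        params[c] = (prior, mu, Sigma)
    return params

def bayes_predict(x, params):
    """ŷ = arg max_i f(x | μ̂_i, Σ̂_i) · P̂(c_i)."""
    scores = {c: multivariate_normal.pdf(x, mean=mu, cov=Sigma) * prior
              for c, (prior, mu, Sigma) in params.items()}
    return max(scores, key=scores.get)
```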
Bayes Classifier: Iris Data
Consider the 2D Iris data (sepal length and sepal width). Class c_1, which corresponds to iris-setosa (circles), has n_1 = 50 points, whereas the other class c_2 (triangles) has n_2 = 100 points. The priors are

$$\hat{P}(c_1) = \frac{n_1}{n} = \frac{50}{150} = 0.33 \qquad \hat{P}(c_2) = \frac{n_2}{n} = \frac{100}{150} = 0.67$$

The estimated means and covariance matrices are

$$\hat{\mu}_1 = \begin{pmatrix} 5.006 \\ 3.418 \end{pmatrix} \qquad \hat{\mu}_2 = \begin{pmatrix} 6.262 \\ 2.872 \end{pmatrix}$$

$$\hat{\Sigma}_1 = \begin{pmatrix} 0.1218 & 0.0983 \\ 0.0983 & 0.1423 \end{pmatrix} \qquad \hat{\Sigma}_2 = \begin{pmatrix} 0.435 & 0.1209 \\ 0.1209 & 0.1096 \end{pmatrix}$$

The plot below shows the contour or level curve (corresponding to 1% of the peak density) of the multivariate normal distribution modeling the probability density for each class.
[Figure: Iris data, X_1 (sepal length) versus X_2 (sepal width). Circles denote class c_1 (iris-setosa), triangles class c_2; the test point x = (6.75, 4.25)^T is shown as a white square.]
Bayes Classifier: Iris Data
Let x = (6.75, 4.25)^T be a test point (the white square in the figure). The posterior probabilities for c_1 and c_2 can be computed as

$$\hat{P}(c_1 \mid \mathbf{x}) \propto \hat{f}(\mathbf{x} \mid \hat{\mu}_1, \hat{\Sigma}_1)\, \hat{P}(c_1) = (4.951 \times 10^{-7}) \times 0.33 = 1.634 \times 10^{-7}$$
$$\hat{P}(c_2 \mid \mathbf{x}) \propto \hat{f}(\mathbf{x} \mid \hat{\mu}_2, \hat{\Sigma}_2)\, \hat{P}(c_2) = (2.589 \times 10^{-5}) \times 0.67 = 1.735 \times 10^{-5}$$

Because P̂(c_2|x) > P̂(c_1|x), the class for x is predicted as ŷ = c_2.
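These numbers can be checked with a short script; a minimal sketch (not from the original slides), plugging in the estimates from the previous slide:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Unnormalized posteriors for the test point, using the 2D Iris estimates.
x = np.array([6.75, 4.25])
mu1, S1 = np.array([5.006, 3.418]), np.array([[0.1218, 0.0983], [0.0983, 0.1423]])
mu2, S2 = np.array([6.262, 2.872]), np.array([[0.435, 0.1209], [0.1209, 0.1096]])

p1 = multivariate_normal.pdf(x, mean=mu1, cov=S1) * (50 / 150)   # ≈ 1.6e-7
p2 = multivariate_normal.pdf(x, mean=mu2, cov=S2) * (100 / 150)  # ≈ 1.7e-5
print('c2' if p2 > p1 else 'c1')                                 # predicts c2
```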
Bayes Classifier
Categorical Attributes
The probability of the categorical point x is obtained from the joint probability mass function (PMF) for the vector random variable X:

$$P(\mathbf{x} \mid c_i) = f(\mathbf{v} \mid c_i) = f(X_1 = e_{1 r_1}, \ldots, X_d = e_{d r_d} \mid c_i)$$

The joint PMF can be estimated directly from the data D_i for each class c_i as follows:

$$\hat{f}(\mathbf{v} \mid c_i) = \frac{n_i(\mathbf{v})}{n_i}$$

where n_i(v) is the number of times the value v occurs in class c_i. However, to avoid zero probabilities, we add a pseudo-count of 1 for each possible value of v:

$$\hat{f}(\mathbf{v} \mid c_i) = \frac{n_i(\mathbf{v}) + 1}{n_i + \prod_{j=1}^{d} m_j}$$

where m_j = |dom(X_j)| denotes the number of distinct values of attribute X_j.
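A minimal sketch of this smoothed joint PMF estimate (not from the original slides; the two-attribute data below is hypothetical):

```python
from collections import Counter
from math import prod

def joint_pmf_estimate(Di, domain_sizes):
    """f̂(v | c_i) with a pseudo-count of 1 per cell:
    (n_i(v) + 1) / (n_i + prod_j m_j).
    `Di` is the list of categorical points in class c_i (as tuples);
    `domain_sizes` gives m_j = |dom(X_j)| for each attribute."""
    counts = Counter(Di)
    ni = len(Di)
    denom = ni + prod(domain_sizes)
    return lambda v: (counts[v] + 1) / denom

# Hypothetical class data over two attributes with m_1 = 4 and m_2 = 3 values:
Di = [('Long', 'Medium'), ('Long', 'Medium'), ('Short', 'Long')]
f_hat = joint_pmf_estimate(Di, [4, 3])
print(f_hat(('Long', 'Medium')))   # (2 + 1) / (3 + 12) = 0.2
print(f_hat(('Long', 'Long')))     # (0 + 1) / (3 + 12), never zero
```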
Discretized Iris Data
Assume that the sepal length and sepal width attributes in the Iris dataset have been discretized as shown below.

(a) Discretized sepal length
Bins         Domain
[4.3, 5.2]   Very Short (a_11)
(5.2, 6.1]   Short (a_12)
(6.1, 7.0]   Long (a_13)
(7.0, 7.9]   Very Long (a_14)

(b) Discretized sepal width
Bins         Domain
[2.0, 2.8]   Short (a_21)
(2.8, 3.6]   Medium (a_22)
(3.6, 4.4]   Long (a_23)
Discretized Iris Data
Class-specific Empirical Joint Probability Mass Function

Class c_1:
X_1 \ X_2           Short (e_21)   Medium (e_22)   Long (e_23)   f̂_X1
Very Short (e_11)   1/50           33/50           5/50          39/50
Short (e_12)        0              3/50            8/50          13/50
Long (e_13)         0              0               0             0
Very Long (e_14)    0              0               0             0
f̂_X2                1/50           36/50           13/50

Class c_2:
X_1 \ X_2           Short (e_21)   Medium (e_22)   Long (e_23)   f̂_X1
Very Short (e_11)   6/100          0               0             6/100
Short (e_12)        24/100         15/100          0             39/100
Long (e_13)         13/100         30/100          0             43/100
Very Long (e_14)    3/100          7/100           2/100         12/100
f̂_X2                46/100         52/100          2/100
Discretized Iris Data: Test Case

Consider the test point x = (6.75, 4.25)^T, which falls in bins (6.1, 7.0] and (3.6, 4.4], that is, v = (e_13, e_23) = (Long, Long). From the joint PMF tables, n_1(v) = 0 and n_2(v) = 0, so

$$\hat{f}(\mathbf{v} \mid c_1) = \frac{0}{50} = 0 \qquad \hat{f}(\mathbf{v} \mid c_2) = \frac{0}{100} = 0$$

Both class likelihoods are zero, so without smoothing the classifier cannot choose between the classes.
Discretized Iris Data: Test Case with Pseudo-counts

Applying the pseudo-count adjustment with m_1 = 4 and m_2 = 3, so that ∏_j m_j = 12:

$$\hat{f}(\mathbf{v} \mid c_1) = \frac{0 + 1}{50 + 12} = 1.61 \times 10^{-2} \qquad \hat{f}(\mathbf{v} \mid c_2) = \frac{0 + 1}{100 + 12} = 8.93 \times 10^{-3}$$

$$\hat{P}(c_1 \mid \mathbf{v}) \propto (1.61 \times 10^{-2}) \times 0.33 = 5.32 \times 10^{-3} \qquad \hat{P}(c_2 \mid \mathbf{v}) \propto (8.93 \times 10^{-3}) \times 0.67 = 5.98 \times 10^{-3}$$

Because P̂(c_2|v) > P̂(c_1|v), the predicted class is ŷ = c_2.
Bayes Classifier: Challenges
The main problem with the full Bayes classifier is that there is rarely enough data to reliably estimate the joint probability density or mass function, especially for high-dimensional data, since the number of parameters (or cells of the joint PMF) grows rapidly with the dimensionality d.
Naive Bayes Classifier
Numeric Attributes
The naive Bayes approach makes the simplifying assumption that all the attributes are independent, which implies that the likelihood can be decomposed into a product of dimension-wise probabilities:

$$P(\mathbf{x} \mid c_i) = P(x_1, x_2, \ldots, x_d \mid c_i) = \prod_{j=1}^{d} P(x_j \mid c_i)$$

For numeric attributes, each dimension-wise probability is modeled via a univariate normal density:

$$P(x_j \mid c_i) \propto \hat{f}(x_j \mid \hat{\mu}_{ij}, \hat{\sigma}_{ij}^2) = \frac{1}{\sqrt{2\pi}\,\hat{\sigma}_{ij}} \exp\left\{-\frac{(x_j - \hat{\mu}_{ij})^2}{2\hat{\sigma}_{ij}^2}\right\}$$

where μ̂_ij and σ̂²_ij denote the estimated mean and variance for attribute X_j, for class c_i.
The naive assumption corresponds to setting all the covariances in Σ̂_i to zero, that is,

$$\hat{\Sigma}_i = \begin{pmatrix} \sigma_{i1}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{i2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{id}^2 \end{pmatrix}$$

The naive Bayes classifier thus uses the sample mean μ̂_i = (μ̂_i1, ..., μ̂_id)^T and a diagonal sample covariance matrix Σ̂_i = diag(σ²_i1, ..., σ²_id) for each class c_i. In total, only 2d parameters have to be estimated per class, corresponding to the sample mean and sample variance for each dimension X_j.
Naive Bayes Algorithm

The naive Bayes training algorithm mirrors the Bayes classifier algorithm, except that only the per-dimension variances are estimated: for each class c_i it returns the prior P̂(c_i) = n_i/n, the sample mean μ̂_i, and the variances σ̂²_ij for j = 1, ..., d.
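A minimal NumPy sketch of naive Bayes training and prediction under these assumptions (not from the original slides; log-probabilities are used for numerical stability):

```python
import numpy as np

def naive_bayes_fit(D, y):
    """Per class: prior P̂(c_i), mean μ̂_i, and per-dimension variances σ̂²_ij."""
    params = {}
    for c in np.unique(y):
        Di = D[y == c]
        params[c] = (len(Di) / len(y),      # P̂(c_i)
                     Di.mean(axis=0),        # μ̂_i
                     Di.var(axis=0))         # σ̂²_i1, ..., σ̂²_id (biased, 1/n_i)
    return params

def naive_bayes_predict(x, params):
    """ŷ = arg max_i P̂(c_i) · prod_j f(x_j | μ̂_ij, σ̂²_ij), in log-space."""
    def log_score(prior, mu, var):
        # Sum of univariate normal log-densities plus the log prior.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + log_lik
    return max(params, key=lambda c: log_score(*params[c]))
```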
Iris Data: Naive Bayes
Consider the 2D Iris data consisting of sepal length and sepal width. P̂(c_i) and μ̂_i remain unchanged, but the covariance matrices Σ̂_i are now assumed to be diagonal:

$$\hat{\Sigma}_1 = \begin{pmatrix} 0.1218 & 0 \\ 0 & 0.1423 \end{pmatrix} \qquad \hat{\Sigma}_2 = \begin{pmatrix} 0.435 & 0 \\ 0 & 0.1096 \end{pmatrix}$$

The figure below shows the contour or level curve (corresponding to 1% of the peak density) of the multivariate normal distribution for both classes. The diagonal assumption leads to contours that are axis-parallel ellipses.

For the test point x = (6.75, 4.25)^T, the posterior probabilities for c_1 and c_2 are as follows:

$$\hat{P}(c_1 \mid \mathbf{x}) \propto \hat{f}(\mathbf{x} \mid \hat{\mu}_1, \hat{\Sigma}_1)\, \hat{P}(c_1) = (4.014 \times 10^{-7}) \times 0.33 = 1.325 \times 10^{-7}$$
$$\hat{P}(c_2 \mid \mathbf{x}) \propto \hat{f}(\mathbf{x} \mid \hat{\mu}_2, \hat{\Sigma}_2)\, \hat{P}(c_2) = (9.585 \times 10^{-5}) \times 0.67 = 6.422 \times 10^{-5}$$

Because P̂(c_2|x) > P̂(c_1|x), the predicted class is again ŷ = c_2.
Naive Bayes versus Full Bayes Classifier

[Figure: two side-by-side panels of the Iris data, X_1 (sepal length) versus X_2 (sepal width), each marking the test point x = (6.75, 4.25)^T as a white square. The naive Bayes contours are axis-parallel ellipses, whereas the full Bayes contours are ellipses oriented along the sample covariance.]
Naive Bayes
Categorical Attributes
The independence assumption implies that the likelihood factorizes across the attributes:

$$P(\mathbf{x} \mid c_i) = \prod_{j=1}^{d} P(x_j \mid c_i) = \prod_{j=1}^{d} f(X_j = e_{j r_j} \mid c_i)$$

where f(X_j = e_{j r_j} | c_i) is the probability mass function for X_j, estimated as

$$\hat{f}(v_j \mid c_i) = \frac{n_i(v_j)}{n_i}$$

where n_i(v_j) is the observed frequency of the value v_j = e_{j r_j}, corresponding to the r_j-th categorical value a_{j r_j} of attribute X_j, in class c_i. If the count is zero, we can use the pseudo-count method to avoid zero probabilities. The adjusted estimates with pseudo-counts are given as

$$\hat{f}(v_j \mid c_i) = \frac{n_i(v_j) + 1}{n_i + m_j}$$
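A minimal sketch of this per-attribute smoothed estimate (not from the original slides; the attribute values below are hypothetical):

```python
from collections import Counter

def attr_pmf(Di_col, mj):
    """f̂(v_j | c_i) = (n_i(v_j) + 1) / (n_i + m_j) for one attribute X_j.
    `Di_col` holds the values of X_j over class c_i; m_j = |dom(X_j)|."""
    counts, ni = Counter(Di_col), len(Di_col)
    return lambda vj: (counts[vj] + 1) / (ni + mj)

# Hypothetical: sepal-length bin labels within one class, m_1 = 4 bins.
f1 = attr_pmf(['Short', 'Short', 'Long'], mj=4)
print(f1('Short'))      # (2 + 1) / (3 + 4)
print(f1('Very Long'))  # (0 + 1) / (3 + 4), never zero
```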
For the discretized Iris data, the class-specific marginal PMFs f̂_X1 and f̂_X2 are given by the row and column totals of the earlier joint PMF tables.
Discretized Iris Data: Naive Bayes
Consider again the test point x = (6.75, 4.25)^T, which corresponds to v = (e_13, e_23). Using the marginal PMFs, with a pseudo-count for the zero count n_1(e_13) = 0 (where m_1 = 4):

$$\hat{P}(\mathbf{v} \mid c_1) = \hat{P}(e_{13} \mid c_1) \cdot \hat{P}(e_{23} \mid c_1) = \frac{0+1}{50+4} \cdot \frac{13}{50} = 4.81 \times 10^{-3}$$
$$\hat{P}(\mathbf{v} \mid c_2) = \hat{P}(e_{13} \mid c_2) \cdot \hat{P}(e_{23} \mid c_2) = \frac{43}{100} \cdot \frac{2}{100} = 8.60 \times 10^{-3}$$

$$\hat{P}(c_1 \mid \mathbf{v}) \propto (4.81 \times 10^{-3}) \times 0.33 = 1.59 \times 10^{-3}$$
$$\hat{P}(c_2 \mid \mathbf{v}) \propto (8.60 \times 10^{-3}) \times 0.67 = 5.76 \times 10^{-3}$$

Because P̂(c_2|v) > P̂(c_1|v), the predicted class is ŷ = c_2.
Nonparametric Approach
K Nearest Neighbors Classifier

The likelihood can also be estimated nonparametrically, using the K nearest neighbors of the test point x. Let B_d(x, r) denote the closed d-dimensional ball of radius r centered at x:

$$B_d(\mathbf{x}, r) = \{\mathbf{x}_i \in \mathbf{D} \mid \delta(\mathbf{x}, \mathbf{x}_i) \le r\}$$

Here δ(x, x_i) is the distance between x and x_i, which is usually assumed to be the Euclidean distance, i.e., δ(x, x_i) = ‖x − x_i‖₂. We choose r to be the smallest radius such that the ball contains exactly K points, that is, |B_d(x, r)| = K, and let V denote its volume.
Let K_i denote the number of points among the K nearest neighbors of x that are labeled with class c_i, that is,

$$K_i = |\{\mathbf{x}_j \in B_d(\mathbf{x}, r) \mid y_j = c_i\}|$$

The class conditional probability density at x can be estimated as the fraction of points from class c_i that lie within the hyperball, divided by its volume:

$$\hat{f}(\mathbf{x} \mid c_i) = \frac{K_i / n_i}{V} = \frac{K_i}{n_i V}$$

Using P̂(c_i) = n_i/n and the total density estimate f̂(x) = K/(nV), the posterior simplifies to

$$\hat{P}(c_i \mid \mathbf{x}) = \frac{\hat{f}(\mathbf{x} \mid c_i)\, \hat{P}(c_i)}{\hat{f}(\mathbf{x})} = \frac{\frac{K_i}{n_i V} \cdot \frac{n_i}{n}}{\frac{K}{n V}} = \frac{K_i}{K}$$

The predicted class is therefore

$$\hat{y} = \arg\max_{c_i} \{\hat{P}(c_i \mid \mathbf{x})\} = \arg\max_{c_i} \left\{\frac{K_i}{K}\right\} = \arg\max_{c_i} \{K_i\}$$

Because K is fixed, the KNN classifier predicts the class of x as the majority class among its K nearest neighbors.
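A minimal NumPy sketch of this majority-vote rule (not from the original slides; `D` and `y` are assumed to be NumPy arrays):

```python
import numpy as np
from collections import Counter

def knn_predict(D, y, x, K=5):
    """Majority class among the K nearest neighbors of x, i.e. ŷ = arg max_i K_i."""
    dist = np.linalg.norm(D - x, axis=1)   # δ(x, x_i) = ||x - x_i||_2
    nearest = np.argsort(dist)[:K]         # indices of the K nearest points
    return Counter(y[nearest]).most_common(1)[0][0]
```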
Iris Data: K Nearest Neighbors Classifier

[Figure: Iris data, X_1 (sepal length) versus X_2 (sepal width), with the test point x = (6.75, 4.25)^T shown as a white square and a ball of radius r enclosing its K nearest neighbors.]