
CIS 520: Machine Learning Spring 2021: Lecture 3

Generative Probabilistic Models for Classification

Lecturer: Shivani Agarwal

Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).

Outline
• Introduction

• Multivariate normal class-conditional densities: Quadratic/linear discriminant analysis (QDA/LDA)

• Conditionally independent features: Naı̈ve Bayes

• Extensions to multiclass classification

1 Introduction

Recall that if we know the joint probability distribution D from which labeled examples are generated, then
we can simply use a Bayes optimal classifier for that distribution. For binary classification (under 0-1 loss),
a Bayes optimal classifier is given by h∗(x) = sign(η(x) − 1/2), where η(x) = P(Y = +1|X = x) is the class
probability function under D; in other words, given an instance x, a Bayes optimal classifier predicts class
+1 if the probability η(x) of a positive label given x is greater than 1/2, and predicts class −1 otherwise.
Generative probabilistic models estimate the joint probability distribution D, usually by estimating the
overall class probabilities py = P(Y = y) and the class-conditional distributions p(x|y) ≡ p(x|Y = y). Here
for each class y, p(x|y) denotes a (conditional) probability mass function over the instance space X if X is
discrete, and a (conditional) probability density function over X if X is continuous. Such models are said to
be generative because they can be used to generate new examples from the distribution (by first sampling a
label y with probability py and then sampling an instance x according to p(x|y)). Given such a generative
model, the class probabilities η(x) can be obtained via Bayes’ rule:

η(x) = p_{+1} · p(x|+1) / p(x) = p_{+1} · p(x|+1) / [ p_{+1} · p(x|+1) + p_{−1} · p(x|−1) ].

Once the generative model components py and p(x|y) have been estimated from the training data, they are
used to construct a ‘plug-in’ classifier by using the resulting class probability function estimate in place of
the true function η(x) in the form of the Bayes optimal classifier above.
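As an illustration (not from the lecture), here is a minimal Python sketch of the plug-in construction: it combines estimated generative components via Bayes' rule and thresholds at 1/2. The names p_hat_pos, density_pos, density_neg are hypothetical placeholders for whatever estimates one has obtained.

    def eta_hat(x, p_hat_pos, density_pos, density_neg):
        """Estimated P(Y = +1 | x), obtained by plugging estimated components into Bayes' rule."""
        num = p_hat_pos * density_pos(x)                 # estimate of p_{+1} * p(x|+1)
        den = num + (1.0 - p_hat_pos) * density_neg(x)   # estimate of the evidence p(x)
        return num / den

    def plug_in_classifier(x, p_hat_pos, density_pos, density_neg):
        """Predict +1 iff the estimated class probability exceeds 1/2."""
        return +1 if eta_hat(x, p_hat_pos, density_pos, density_neg) > 0.5 else -1
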
Since generative models model the full joint distribution, they often make simplifying assumptions on the
form of the distribution in order to obtain estimates from a reasonable number of data points. We will see two
examples below: one where the features are continuous and one assumes multivariate normal class-conditional
densities, and another where one makes conditional independence assumptions among the features given
the labels (the Naı̈ve Bayes assumption).


2 Multivariate Normal Class-Conditional Densities: Quadratic/Linear Discriminant Analysis (QDA/LDA)

Suppose first that our instances are continuous feature vectors, with instance space X = R^d, and consider a
binary classification task with label and prediction spaces Y = Ŷ = {±1}. Assume that for each class y ∈ Y,
the class-conditional density of x given y is a multivariate normal density:
 
p(x|y) = 1 / ( (2π)^{d/2} |Σ_y|^{1/2} ) · exp( −(1/2) (x − µ_y)^⊤ Σ_y^{−1} (x − µ_y) ),

where µ_y ∈ R^d and Σ_y ∈ R^{d×d} are the unknown class-conditional mean and covariance matrix, respectively.
Also, let p+1 = P(Y = +1) and p−1 = P(Y = −1) = 1−p+1 . As discussed above, the conditional probability
of a positive class label for any x ∈ X can be obtained using Bayes’ rule:
η(x) = P(Y = +1|x) = p_{+1} · p(x|+1) / [ p_{+1} · p(x|+1) + p_{−1} · p(x|−1) ],
leading to the following Bayes optimal classifier:
h∗(x) = +1 if p_{+1} · p(x|+1) / [ p_{+1} · p(x|+1) + p_{−1} · p(x|−1) ] > 1/2, and −1 otherwise
      = +1 if p(x|+1) / p(x|−1) > p_{−1} / p_{+1}, and −1 otherwise
      = +1 if ln( p(x|+1) / p(x|−1) ) > ln( p_{−1} / p_{+1} ), and −1 otherwise
      = sign( ln( p(x|+1) / p(x|−1) ) − ln( p_{−1} / p_{+1} ) ).
Now, under the above assumption on the class-conditional densities, we have
ln( p(x|+1) / p(x|−1) ) = (1/2) [ x^⊤ ( Σ_{−1}^{−1} − Σ_{+1}^{−1} ) x − 2 ( Σ_{−1}^{−1} µ_{−1} − Σ_{+1}^{−1} µ_{+1} )^⊤ x + µ_{−1}^⊤ Σ_{−1}^{−1} µ_{−1} − µ_{+1}^⊤ Σ_{+1}^{−1} µ_{+1} + ln( |Σ_{−1}| / |Σ_{+1}| ) ].
This is a quadratic function of x, and we can therefore write the Bayes optimal classifier in this case as

h∗(x) = sign( x^⊤ A x + b^⊤ x + c ),
where

A = Σ_{−1}^{−1} − Σ_{+1}^{−1}

b = −2 ( Σ_{−1}^{−1} µ_{−1} − Σ_{+1}^{−1} µ_{+1} )

c = µ_{−1}^⊤ Σ_{−1}^{−1} µ_{−1} − µ_{+1}^⊤ Σ_{+1}^{−1} µ_{+1} + ln( |Σ_{−1}| / |Σ_{+1}| ) − 2 ln( p_{−1} / p_{+1} ).
A classifier of this form is often called a degree-2 polynomial threshold classifier, or simply a quadratic
classifier. Note that if, in addition, the class-conditional densities have equal covariance matrices, Σ+1 =
Σ−1 = Σ, then A = 0, and the Bayes optimal classifier becomes a linear threshold classifier, or simply
a linear classifier.
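To make the form above concrete, here is a minimal numpy sketch (illustrative, not part of the notes) that assembles the coefficients A, b, c from given parameters µ_{±1}, Σ_{±1}, p_{+1} and applies the resulting quadratic classifier; all function and variable names are made up for the example.

    import numpy as np

    def quadratic_coefficients(mu_pos, mu_neg, Sigma_pos, Sigma_neg, p_pos):
        """Coefficients (A, b, c) of the quadratic discriminant derived above."""
        Sp_inv = np.linalg.inv(Sigma_pos)   # inverse of Sigma_{+1}
        Sn_inv = np.linalg.inv(Sigma_neg)   # inverse of Sigma_{-1}
        A = Sn_inv - Sp_inv
        b = -2.0 * (Sn_inv @ mu_neg - Sp_inv @ mu_pos)
        c = (mu_neg @ Sn_inv @ mu_neg - mu_pos @ Sp_inv @ mu_pos
             + np.log(np.linalg.det(Sigma_neg) / np.linalg.det(Sigma_pos))
             - 2.0 * np.log((1.0 - p_pos) / p_pos))
        return A, b, c

    def quadratic_predict(x, A, b, c):
        """Bayes optimal prediction sign(x^T A x + b^T x + c), ties broken to -1."""
        return +1 if x @ A @ x + b @ x + c > 0 else -1
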
Of course, in practice, one does not know the parameters of the class-conditional densities, µ+1 , µ−1 , Σ+1 , Σ−1 ,
or the class probability parameter p+1 = P(Y = +1). In this case, one estimates these quantities from the

given training sample S = ((x1, y1), . . . , (xm, ym)), and uses the estimated values µ̂_{+1}, µ̂_{−1}, Σ̂_{+1}, Σ̂_{−1}, p̂_{+1} to obtain a class probability estimate η̂_S(x), and a corresponding plug-in classifier

h_S(x) = sign( η̂_S(x) − 1/2 ).




For example, a natural approach is to use maximum likelihood estimation, which yields
µ̂_y = (1/m_y) ∑_{i: y_i = y} x_i

Σ̂_y = (1/m_y) ∑_{i: y_i = y} (x_i − µ̂_y)(x_i − µ̂_y)^⊤

p̂_{+1} = m_{+1} / m,

where m_y = |{i ∈ [m] : y_i = y}| denotes the number of training examples with class label y.1 The resulting
classifier is called the quadratic discriminant analysis (QDA) classifier.
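A minimal numpy sketch of these maximum likelihood estimates (illustrative, not from the notes), assuming the sample is given as an (m, d) array X and a length-m label array y with entries in {+1, -1}:

    import numpy as np

    def qda_mle(X, y):
        """Maximum likelihood estimates of the QDA parameters from a sample.

        X: (m, d) array of instances; y: length-m array of labels in {+1, -1}.
        Returns per-class mean and covariance estimates and p_hat_pos = m_{+1}/m.
        """
        mu_hat, Sigma_hat = {}, {}
        for label in (+1, -1):
            X_y = X[y == label]                          # the m_y examples with class label y
            mu_hat[label] = X_y.mean(axis=0)             # empirical class mean
            diff = X_y - mu_hat[label]
            Sigma_hat[label] = diff.T @ diff / len(X_y)  # MLE covariance (divides by m_y, not m_y - 1)
        p_hat_pos = float(np.mean(y == +1))
        return mu_hat, Sigma_hat, p_hat_pos
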
If one assumes the class-conditional covariances are equal, then the maximum likelihood estimate for the
common covariance matrix is given by
Σ̂ = (1/m) ∑_{i=1}^{m} (x_i − µ̂_{y_i})(x_i − µ̂_{y_i})^⊤,

and the resulting classifier is given by


h_S(x) = sign( b̂^⊤ x + ĉ ),

where

b̂ = −2 Σ̂^{−1} ( µ̂_{−1} − µ̂_{+1} )

ĉ = µ̂_{−1}^⊤ Σ̂^{−1} µ̂_{−1} − µ̂_{+1}^⊤ Σ̂^{−1} µ̂_{+1} − 2 ln( (1 − p̂_{+1}) / p̂_{+1} ).

This classifier is called the linear discriminant analysis (LDA) classifier.
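Under the equal-covariance assumption, a corresponding sketch (again illustrative) pools the covariance estimate and applies the linear plug-in classifier sign(b̂^⊤ x + ĉ) with the coefficients above; array shapes and names follow the QDA sketch.

    import numpy as np

    def lda_mle(X, y):
        """MLE for LDA: per-class means, one pooled covariance, and the class prior p_hat_pos."""
        mu_hat = {label: X[y == label].mean(axis=0) for label in (+1, -1)}
        centered = X - np.stack([mu_hat[label] for label in y])  # rows x_i - mu_hat_{y_i}
        Sigma_hat = centered.T @ centered / len(X)               # pooled MLE covariance
        p_hat_pos = float(np.mean(y == +1))
        return mu_hat, Sigma_hat, p_hat_pos

    def lda_predict(x, mu_hat, Sigma_hat, p_hat_pos):
        """Linear plug-in classifier sign(b^T x + c) with the coefficients derived above."""
        S_inv = np.linalg.inv(Sigma_hat)
        b = -2.0 * S_inv @ (mu_hat[-1] - mu_hat[+1])
        c = (mu_hat[-1] @ S_inv @ mu_hat[-1] - mu_hat[+1] @ S_inv @ mu_hat[+1]
             - 2.0 * np.log((1.0 - p_hat_pos) / p_hat_pos))
        return +1 if b @ x + c > 0 else -1
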

3 Conditionally Independent Features: Naı̈ve Bayes

Above, we assumed a certain parametric form (multivariate normal) for the class-conditional distributions
p(x|y). We now consider an alternative, widely used assumption on the class-conditional distributions:
namely, that the features are conditionally independent given the labels. This is typically called the Naı̈ve
Bayes assumption. We will describe it below for the case of discrete features, although the assumption can
also be employed with continuous features.2
1 Here [m] denotes the set of integers from 1 through m, i.e. [m] = {1, . . . , m}.
2 When using Naı̈ve Bayes with continuous features, one usually also assumes a parametric form for the distributions of individual features given labels, p(x_j | y) (j = 1, . . . , d).

Suppose for simplicity that our instances are binary feature vectors, with instance space X = {0, 1}^d, and
consider again a binary classification task with label and prediction spaces Y = Ŷ = {±1}. The class-
conditional distributions p(x|+1) and p(x|−1) are now discrete. Clearly, in the general case, each of these
distributions, defined over the sample space X = {0, 1}^d containing 2^d elements, is parametrized by 2^d − 1
numbers, namely the probabilities of seeing the different elements in X . However these 2^d − 1 parameters
can be estimated reliably only when all instances in X have been seen several times, which is unrealistic in
a typical learning situation. The Naı̈ve Bayes assumption allows these class-conditional distributions to be
represented more compactly. In particular, under Naı̈ve Bayes, we assume that given the class label y, the
individual features in an instance are conditionally independent; i.e. that each class-conditional probability
distribution factors as follows:
p(x|y) = ∏_{j=1}^{d} p(x_j | y) .

In this case, one needs to estimate only d parameters for each class-conditional distribution (why is this not
d − 1?). Denote a random (d-dimensional) feature vector as X = (X_1, . . . , X_d), and for each y ∈ Y and
j ∈ {1, . . . , d}, denote

θ_{y,j} = P(X_j = 1 | Y = y) .
Then we can write
p(x|y) = ∏_{j=1}^{d} (θ_{y,j})^{x_j} (1 − θ_{y,j})^{1−x_j} .
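For instance, a one-line numpy sketch (illustrative) of this factored likelihood for a binary feature vector x and a vector theta_y holding the per-feature parameters θ_{y,j}:

    import numpy as np

    def bernoulli_class_conditional(x, theta_y):
        """p(x|y) = prod_j theta_{y,j}^{x_j} * (1 - theta_{y,j})^{1 - x_j} for a binary vector x."""
        return float(np.prod(np.where(x == 1, theta_y, 1.0 - theta_y)))
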
The conditional probability of a positive label for any x ∈ X is again obtained via Bayes’ rule:
η(x) = P(Y = +1 | x) = p_{+1} · p(x|+1) / [ p_{+1} · p(x|+1) + p_{−1} · p(x|−1) ],
where again p+1 = P(Y = +1), leading again to the following Bayes optimal classifier:
h∗(x) = sign( ln( p(x|+1) / p(x|−1) ) − ln( p_{−1} / p_{+1} ) ).
In this case, we have
ln( p(x|+1) / p(x|−1) ) = ∑_{j=1}^{d} [ x_j ln( θ_{+1,j} / θ_{−1,j} ) + (1 − x_j) ln( (1 − θ_{+1,j}) / (1 − θ_{−1,j}) ) ].

This is a linear function of x, and we can therefore write the Bayes optimal classifier in this case as
h∗(x) = sign( w^⊤ x + b ),
where
w_j = ln( θ_{+1,j} / θ_{−1,j} ) − ln( (1 − θ_{+1,j}) / (1 − θ_{−1,j}) )

b = ∑_{j=1}^{d} ln( (1 − θ_{+1,j}) / (1 − θ_{−1,j}) ) − ln( p_{−1} / p_{+1} ).

As can be seen, this again yields a linear classifier. Again, in practice, one estimates the parameters θy,j and
p+1 from the given training data S = ((x1 , y1 ), . . . , (xm , ym )) using maximum likelihood estimation, which
yields
θ̂_{y,j} = (1/m_y) ∑_{i: y_i = y} 1(x_{ij} = 1)

p̂_{+1} = m_{+1} / m,

where m_y = |{i ∈ [m] : y_i = y}| as before. The resulting plug-in classifier, obtained by substituting these
parameter estimates in the expression for the Bayes optimal classifier above, is known as the naı̈ve Bayes
classifier.
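Putting the pieces together, here is a minimal numpy sketch (illustrative, not the notes' own code) of the naı̈ve Bayes classifier for binary features: maximum likelihood estimation of θ_{y,j} and p_{+1}, followed by the linear plug-in prediction with the weights w and bias b derived above. In practice one often smooths the estimates (e.g. Laplace smoothing) to avoid zero probabilities; the plain MLE is shown here to match the derivation.

    import numpy as np

    def naive_bayes_mle(X, y):
        """MLE of theta_{y,j} (per-class feature frequencies) and p_{+1} for binary features."""
        theta_hat = {label: X[y == label].mean(axis=0) for label in (+1, -1)}
        p_hat_pos = float(np.mean(y == +1))
        return theta_hat, p_hat_pos

    def naive_bayes_predict(x, theta_hat, p_hat_pos):
        """Plug-in linear classifier sign(w^T x + b) with the weights derived above."""
        t_pos, t_neg = theta_hat[+1], theta_hat[-1]
        w = np.log(t_pos / t_neg) - np.log((1.0 - t_pos) / (1.0 - t_neg))
        b = (np.sum(np.log((1.0 - t_pos) / (1.0 - t_neg)))
             - np.log((1.0 - p_hat_pos) / p_hat_pos))
        return +1 if w @ x + b > 0 else -1
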
Exercise. How does the above derivation change if you have q-ary features, X = {0, . . . , q − 1}^d? How
many parameters do you now need to estimate for each class? Do you still get a linear classifier?

4 Extensions to Multiclass Classification

Let us see how things change when there are K > 2 classes, say Y = [K] = {1, . . . , K} (such as in the
handwritten digit recognition example, where K = 10). In this case, we need to consider the conditional
probability of different labels given an instance x ∈ X . For each y ∈ Y, let η_y(x) = P(Y = y | X = x) denote
the conditional probability of seeing label y given x. Clearly, for all x, ∑_{y=1}^{K} η_y(x) = 1 (in the binary case,
we had η_{+1}(x) = η(x) and η_{−1}(x) = 1 − η(x)). What does the optimal classifier look like in this case? This
depends on how we measure the performance of a classifier h : X →[K]. Denoting again the joint distribution
over X × Y by D, say we define again the accuracy of h w.r.t. D as the probability that an example (x, y)
drawn randomly from D is classified correctly by h:

acc_D[h] = P_{(X,Y)∼D}( h(X) = Y ).

Equivalently, we use again the 0-1 loss function, defined now over labels and predictions in Y = [K], giving
ℓ_{0-1} : [K] × [K] → R_+ defined as

ℓ_{0-1}(y, ŷ) = 1(ŷ ≠ y),
with the corresponding 0-1 error of h w.r.t. D defined as

er^{0-1}_D[h] = P_{(X,Y)∼D}( h(X) ≠ Y ) = E_{(X,Y)∼D}[ ℓ_{0-1}(Y, h(X)) ].

We can write this as

er^{0-1}_D[h] = E_{(X,Y)∼D}[ 1(h(X) ≠ Y) ]
             = E_X[ E_{Y|X}[ 1(h(X) ≠ Y) ] ]
             = E_X[ ∑_{y=1}^{K} η_y(X) · 1(h(X) ≠ y) ]
             = E_X[ ∑_{y ≠ h(X)} η_y(X) ]
             = E_X[ 1 − η_{h(X)}(X) ].

The minimum achievable 0-1 error w.r.t. D is therefore


er^{0-1,∗}_D = inf_{h: X → [K]} er^{0-1}_D[h] = 1 − E_X[ max_{y∈[K]} η_y(X) ],

and is clearly achieved by any classifier h∗ : X →[K] satisfying

h∗(x) ∈ arg max_{y∈[K]} η_y(x).

In other words, given an instance x, a Bayes optimal classifier here predicts a class y with highest conditional
probability ηy (x) = P(Y = y|X = x) given x.3
3 Note that if our loss function assigns a different loss/penalty for different types of mistakes (e.g. if misclassifying a digit 8 as 9 incurs a smaller loss than misclassifying it as 0), then the minimum achievable error as well as the optimal classifier achieving this error will be different. This is true also in the case of binary classification, where for example the cost of mis-diagnosing a cancer patient as normal could be higher than mis-diagnosing a normal patient as having cancer (can you see how the optimal binary classifier would change in this case?). Such problems are often referred to as cost-sensitive classification.

Now, suppose our instances are continuous feature vectors, with instance space X = R^d, and assume again
that for each class y ∈ Y, the class-conditional density p(x|y) is a multivariate normal density with mean
vector µ_y and covariance matrix Σ_y. Also, for each y ∈ Y, let p_y = P(Y = y) denote the overall probability
of seeing label y. Then the conditional probability of seeing label y for any x ∈ X can again be obtained by
Bayes’ rule:
η_y(x) = P(Y = y | X = x) = p_y · p(x|y) / ∑_{y′=1}^{K} p_{y′} · p(x|y′),

leading to the following optimal classifier (under the 0-1 loss above):

h∗(x) ∈ arg max_{y∈[K]} p_y · p(x|y) / ∑_{y′=1}^{K} p_{y′} · p(x|y′)
       = arg max_{y∈[K]} p_y · p(x|y)
       = arg max_{y∈[K]} ln( p_y · p(x|y) )
       = arg max_{y∈[K]} [ ln p_y − (1/2) ln |Σ_y| − (1/2) (x − µ_y)^⊤ Σ_y^{−1} (x − µ_y) ].

Again, the parameters µy , Σy , py can be estimated from a given training sample S = ((x1 , y1 ), . . . , (xm , ym ))
(e.g. using maximum likelihood estimation as before), and a plug-in classifier using the estimated values can
then be constructed based on the above. Note that the above classifier amounts to estimating parameters
that determine K quadratic functions f_y : X → R for y ∈ [K], and classifying an instance x according to a
label y with largest value of f_y(x). Similarly, if the class-conditional covariances are assumed to be equal, the
above classifier will amount to learning parameters determining K linear functions, and classifying according
to the largest value.
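A minimal numpy sketch (illustrative) of this multiclass rule: it scores each class with the quadratic function f_y(x) = ln p_y − (1/2) ln|Σ_y| − (1/2)(x − µ_y)^⊤ Σ_y^{−1}(x − µ_y) and returns an arg max; the parameters are assumed to be given as lists indexed by class.

    import numpy as np

    def multiclass_qda_predict(x, mu, Sigma, p):
        """Predict arg max_y [ ln p_y - (1/2) ln|Sigma_y| - (1/2)(x - mu_y)^T Sigma_y^{-1} (x - mu_y) ].

        mu, Sigma, p are lists of length K, with position y - 1 holding the parameters of class y.
        """
        scores = []
        for mu_y, Sigma_y, p_y in zip(mu, Sigma, p):
            diff = x - mu_y
            scores.append(np.log(p_y)
                          - 0.5 * np.log(np.linalg.det(Sigma_y))
                          - 0.5 * diff @ np.linalg.inv(Sigma_y) @ diff)
        return int(np.argmax(scores)) + 1   # return a label in [K] = {1, ..., K}
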
Exercise. Consider a multiclass classification problem with binary features, X = {0, 1}^d and Y = [K], and
assume the features are conditionally independent given the class label. Can you derive the Naı̈ve Bayes
classifier in this setting?
