
Probabilistic Models for Classification: Generative Classification

Piyush Rai

Probabilistic Machine Learning (CS772A)


Aug 17, 2017

Recap

Parameter Estimation in Probabilistic Models
Fully Bayesian inference (what we would ideally like to do)
p(θ|X) = p(X|θ) p(θ) / p(X) = (Likelihood × Prior) / (Marginal likelihood)    (easy when we have conjugacy!)

A cheaper alternative: Maximum Likelihood (ML) Estimation!

θ̂_MLE = arg max_θ log p(X|θ) = arg max_θ ∑_{n=1}^N log p(x_n|θ),  if data X = {x_1, . . . , x_N} is i.i.d.

Another cheaper (and better) alternative: Maximum-a-Posteriori (MAP) Estimation

θ̂_MAP = arg max_θ log p(θ|X) = arg max_θ [log p(X|θ) + log p(θ)] = arg max_θ [∑_{n=1}^N log p(x_n|θ) + log p(θ)],  if data is i.i.d.

In some cases, we may want (or may have to) do a “hybrid” sort of inference (fully Bayesian inference for some parameters and MLE/MAP for other parameters)
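To make these three options concrete, here is a minimal sketch (not from the slides) for a coin-flip model: a Bernoulli likelihood with a conjugate Beta prior, where the MLE, the MAP estimate, and the full posterior all have closed forms. The Beta(2, 2) hyperparameters are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=20)      # i.i.d. Bernoulli observations (coin flips)
N, N1 = len(X), X.sum()                # number of flips and number of heads
a, b = 2.0, 2.0                        # Beta(a, b) prior hyperparameters (assumed values)

# MLE: maximize sum_n log p(x_n | theta)
theta_mle = N1 / N

# MAP: maximize sum_n log p(x_n | theta) + log p(theta); for a Beta prior this is the posterior mode
theta_map = (N1 + a - 1) / (N + a + b - 2)

# Fully Bayesian: by conjugacy the posterior p(theta | X) is Beta(a + N1, b + N - N1)
post_a, post_b = a + N1, b + N - N1
theta_post_mean = post_a / (post_a + post_b)

print(theta_mle, theta_map, theta_post_mean)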
Marginal Likelihood
Given data X from a model M with likelihood p(X|θ, M), the marginal likelihood (or “evidence”)
p(X|M) = ∫ p(X, θ|M) dθ = ∫ p(X|θ, M) p(θ|M) dθ = E_{p(θ|M)}[p(X|θ, M)]

Intuitively, marginal likelihood = the average probability that model M generated X


It’s the likelihood p(X|θ, M) averaged or marginalized/integrated over the prior p(θ|M)
Marginal likelihood can also be seen as a measure of “goodness” of model M for data X

Note: The word “model” is meant here in a rather broad sense


Could mean a model from a class of models (e.g., linear, quadratic, cubic regression)
Could mean a hyperparameter set to some value (each possible value will mean a different model)

When clear from the context, we omit M and simply use p(X) to denote the marginal likelihood
Marginal likelihood can be hard to compute in general, but is a very useful quantity
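As an illustration of the expectation form above, here is a minimal sketch (not from the slides) that estimates the marginal likelihood of a Beta-Bernoulli model by Monte Carlo, averaging the likelihood over samples from the prior, and compares it to the closed-form evidence available under conjugacy.

import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=20)
N, N1 = len(X), X.sum()
a, b = 2.0, 2.0                                  # Beta prior hyperparameters (assumed)

# Monte Carlo estimate of p(X) = E_{p(theta)}[p(X | theta)]
thetas = rng.beta(a, b, size=100_000)            # draws from the prior
lik = thetas**N1 * (1 - thetas)**(N - N1)        # p(X | theta) for each draw
ml_mc = lik.mean()

# Closed-form evidence for the Beta-Bernoulli model: B(a + N1, b + N - N1) / B(a, b)
ml_exact = np.exp(betaln(a + N1, b + N - N1) - betaln(a, b))

print(ml_mc, ml_exact)                           # the two estimates should be close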
Model Selection via Marginal Likelihood
Can use marginal likelihood to compare and choose from a set of models M1 , M2 , . . . , MK
E.g., choose the model with the largest p(X|M) (or largest p(M|X))

Intuition: For a “good” model, many random θ’s from it (as opposed to just a select few!) would fit the data reasonably well (so, “on average”, we can say that this model fits the data well)

Important: Note that here we aren’t talking about a single θ fitting the data well (which could just
be a case of overfitting the data) but many random θ’s from the model M
Important: This approach doesn’t require separate held-out data (therefore it is especially useful when there is very little training data, or in unsupervised learning where cross-validation is impossible)
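A minimal sketch of this idea (an assumed example, not from the slides): treat two different Beta prior settings as two candidate “models” for the same coin-flip data, and pick the one with the larger closed-form marginal likelihood.

import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.9, size=30)                # data from a heavily biased coin
N, N1 = len(X), X.sum()

def log_marginal_likelihood(a, b):
    # log p(X | M) for a Bernoulli likelihood with a Beta(a, b) prior (closed form)
    return betaln(a + N1, b + N - N1) - betaln(a, b)

# Two "models" = two settings of the prior hyperparameters
models = {"M1: Beta(1, 1) prior": (1.0, 1.0),      # agnostic about the coin's bias
          "M2: Beta(50, 50) prior": (50.0, 50.0)}  # insists the coin is nearly fair

scores = {name: log_marginal_likelihood(a, b) for name, (a, b) in models.items()}
print(scores, "-> choose", max(scores, key=scores.get))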
Predictions via Posterior Averaging

Having learned a posterior p(θ|X), the posterior predictive distribution for a new observation x∗

p(x∗|X) = ∫ p(x∗, θ|X) dθ = ∫ p(x∗|θ) p(θ|X) dθ = E_{p(θ|X)}[p(x∗|θ)]

which doesn’t depend on a single θ but averages the prediction p(x ∗ |θ) w.r.t. the posterior p(θ|X)
Note: Computing the posterior predictive p(x ∗ |X) is tractable only in rare cases and may require
an approximation in other cases (e.g., via sampling methods; we will look at these later)
Note: If we only have a point estimate of θ, say θ_MLE or θ_MAP, then
p(x∗|X) = p(x∗|θ_MLE) or p(x∗|X) = p(x∗|θ_MAP)

which is still a distribution but only depends on a single value of θ


Note: The posterior predictive distribution p(x ∗ |X) integrates out θ but may still depend on other
hyperparameters if those are hard to integrate out (or if we know “good” values for those).
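Continuing the Beta-Bernoulli example (an assumed illustration, not from the slides): under conjugacy the posterior predictive has a closed form, and it can be contrasted with the plug-in prediction that uses only the single point estimate θ_MLE.

import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=5)            # deliberately few observations
N, N1 = len(X), X.sum()
a, b = 2.0, 2.0                             # Beta prior hyperparameters (assumed)

# Posterior is Beta(a + N1, b + N - N1); averaging p(x* | theta) over it gives
# p(x* = 1 | X) = E_{p(theta|X)}[theta] = (a + N1) / (a + b + N)
p_pred_bayes = (a + N1) / (a + b + N)

# Plug-in predictive based on a single point estimate (here the MLE)
p_pred_plugin = N1 / N

print(p_pred_bayes, p_pred_plugin)          # the plug-in rule can be overconfident with little data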

Probabilistic Models for Classification
p(y = k|x) =?, k = 1, 2, . . . , K

Two Approaches to Probabilistic Classification
Generative classification
Use training data to learn class-conditional distribution p(x|y ) of inputs from each class y = 1, . . . , K
Use training data to learn class prior probability distribution p(y ) for each class y = 1, . . . , K
Note: We can use MLE/MAP or a fully Bayesian procedure to estimate p(x|y ) and p(y )
Finally, use these distributions to compute p(y = k|x) for an input x by applying Bayes rule

p(y = k|x) = p(y = k) p(x|y = k) / p(x) = p(y = k) p(x|y = k) / ∑_{k′=1}^K p(y = k′) p(x|y = k′)

Some examples: Naïve Bayes, Linear/Quadratic Discriminant Analysis (the name is a misnomer), etc.

Discriminative classification
Based on directly learning a probability distribution p(y = k|x) of class y given the input x
Some examples: Logistic regression, softmax classification, Gaussian Process classification, etc.

Important: Generative approach models the inputs x. Discriminative approach treats x as fixed
Generative Classification

Assume we’ve learned p(x|y ) and p(y ). The classification rule will be

ŷ = arg max_k p(y = k|x) = arg max_k p(y = k) p(x|y = k) / p(x) = arg max_k p(y = k) p(x|y = k)

If we know the true p(x|y) and p(y), this approach provably has the smallest possible classification error
Also known as the Bayes optimal classifier (minimizes the misclassification probability)

However, generative classification also suffers from some issues, e.g.,
May require too many parameters to estimate (especially for class-conditional distributions p(x|y ))
May do badly if our assumed p(x|y ) (say a Gaussian) is very different from the true class-conditional

Nevertheless, a very powerful approach, especially if we have a good method to estimate the
class-conditional densities. Also enables doing semi-supervised learning (more on this later)

Generative Classification: A Generative Story of Labeled Data
Basic idea: Each input x n is generated conditioned on the value of the corresponding label yn

Assume a class probability param. vector π = {π_1, . . . , π_K}, s.t. ∑_{k=1}^K π_k = 1: Defines p(y = k)
Assume θ = {θ_1, . . . , θ_K} to be the parameters of the distributions generating the data: Defines p(x|y = k)
Again note that there is no directly defined p(y = k|x) in this case
The associated generative story for each example (x n , yn ) is as follows
First choose a class/label yn ∈ {1, 2, . . . , K }

yn ∼ multinoulli(π)
Now draw (“generate”) the input x n from a distribution specific to the value yn takes

x_n | y_n ∼ p(x|θ_{y_n})
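A minimal sketch of this generative story for Gaussian class-conditionals (the parameter values below are assumptions for illustration): first sample a label from the multinoulli over classes, then sample the input from that class’s distribution.

import numpy as np

rng = np.random.default_rng(0)
K, D, N = 3, 2, 500
pi = np.array([0.5, 0.3, 0.2])                              # class prior p(y = k) (assumed)
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])        # per-class means (assumed)
Sigma = np.stack([np.eye(D) for _ in range(K)])             # per-class covariances (assumed)

# Generative story: y_n ~ multinoulli(pi), then x_n | y_n ~ p(x | theta_{y_n})
y = rng.choice(K, size=N, p=pi)
X = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in y])

print(X.shape, np.bincount(y) / N)                          # empirical class fractions approx pi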

Generative Classification using Gaussian Class-conditionals
Recall our generative classification model p(y = k|x) = p(y = k) p(x|y = k) / p(x)

Assume each class-conditional to be a Gaussian


 
p(x|y = k) = p(x|θ_k) = N(x|µ_k, Σ_k) = 1/√((2π)^D |Σ_k|) exp[−½ (x − µ_k)^⊤ Σ_k^{−1} (x − µ_k)]
Assume p(y = k) = π_k ∈ (0, 1), s.t. ∑_{k=1}^K π_k = 1 (equivalent to saying that p(y) is multinoulli)
Parameters π = {π_k}_{k=1}^K, θ = {µ_k, Σ_k}_{k=1}^K can be estimated using MLE/MAP/Bayesian approach

After estimating the parameters, the “plug-in” prediction rule will be

p(y = k|x, θ, π) = π_k |Σ_k|^{−1/2} exp[−½ (x − µ_k)^⊤ Σ_k^{−1} (x − µ_k)] / ∑_{k′=1}^K π_{k′} |Σ_{k′}|^{−1/2} exp[−½ (x − µ_{k′})^⊤ Σ_{k′}^{−1} (x − µ_{k′})]

An Important Note: The above rule is not truly Bayesian (it WILL be if we inferred the full posterior
distribution of the parameters and averaged p(y = k|x, θ, π) over that posterior distribution)
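A minimal sketch of this plug-in rule (an assumed example, not from the slides), using scipy’s multivariate normal density: compute π_k N(x|µ_k, Σ_k) for each class and normalize.

import numpy as np
from scipy.stats import multivariate_normal

def predict_proba(x, pi, mu, Sigma):
    # Plug-in p(y = k | x, theta, pi) with Gaussian class-conditionals
    scores = np.array([pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
                       for k in range(len(pi))])
    return scores / scores.sum()            # divide by p(x) = sum_k pi_k N(x | mu_k, Sigma_k)

# Assumed (already estimated) parameters for K = 2 classes in D = 2 dimensions
pi = np.array([0.6, 0.4])
mu = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = [np.eye(2), 0.5 * np.eye(2)]

print(predict_proba(np.array([1.0, 1.0]), pi, mu, Sigma))   # posterior over the two classes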
Parameter Estimation via MLE
Given data D = {x_n, y_n}_{n=1}^N with x_n ∈ R^D and y_n ∈ {1, . . . , K}, the log likelihood will be

log p(D|Θ) = log ∏_{n=1}^N p(x_n, y_n|Θ) = log ∏_{n=1}^N p(x_n|θ_{y_n}) p(y_n|π) = ∑_{n=1}^N [log p(x_n|θ_{y_n}) + log p(y_n|π)]

where Θ = {π_k, µ_k, Σ_k}_{k=1}^K denotes all the unknown parameters of the model

Here log p(x_n|θ_{y_n}) = log N(x_n|µ_{y_n}, Σ_{y_n})


Here log p(y_n|π) = log multinoulli(y_n|π) = log ∏_{k=1}^K π_k^{I[y_n = k]}
Substituting these, we can write the log likelihood as

log p(D|Θ) = ∑_{k=1}^K ∑_{n: y_n = k} log N(x_n|µ_k, Σ_k) + ∑_{n=1}^N ∑_{k=1}^K I[y_n = k] log π_k    (Exercise: Verify this!)

We can now do MLE for the parameters Θ = {π_k, µ_k, Σ_k}_{k=1}^K
Parameter Estimation via MLE

The log likelihood is


log p(D|Θ) = ∑_{k=1}^K ∑_{n: y_n = k} log N(x_n|µ_k, Σ_k) + ∑_{n=1}^N ∑_{k=1}^K I[y_n = k] log π_k

MLE for π_k is straightforward: π_k = N_k/N, where N_k is the number of training examples from class k (makes sense intuitively, but verify as an exercise)
MLE for µ_k, Σ_k is a Gaussian parameter estimation problem given data from class k. Solution is:

µ̂_k = (1/N_k) ∑_{n: y_n = k} x_n,    Σ̂_k = (1/N_k) ∑_{n: y_n = k} (x_n − µ̂_k)(x_n − µ̂_k)^⊤    (Exercise: Verify this!)

.. which is simply the empirical mean and empirical covariance of the inputs from class k
Note: Can also do MAP or fully Bayesian inference for these parameters
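A minimal sketch of these MLE formulas on assumed synthetic data: per-class fractions give π_k, and per-class empirical means and covariances give µ_k and Σ_k.

import numpy as np

rng = np.random.default_rng(0)
# Assumed synthetic labeled data: two 2-D Gaussian classes
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 200),
               rng.multivariate_normal([3, 3], 0.5 * np.eye(2), 200)])
y = np.repeat([0, 1], 200)
N, K = len(y), 2

pi_hat, mu_hat, Sigma_hat = [], [], []
for k in range(K):
    Xk = X[y == k]
    Nk = len(Xk)
    pi_hat.append(Nk / N)                          # pi_k = N_k / N
    mu_hat.append(Xk.mean(axis=0))                 # empirical mean of class k
    centered = Xk - mu_hat[-1]
    Sigma_hat.append(centered.T @ centered / Nk)   # empirical covariance of class k (divide by N_k)

print(pi_hat, mu_hat)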

Decision Boundaries
The generative classification prediction rule was
p(y = k|x, θ) = π_k |Σ_k|^{−1/2} exp[−½ (x − µ_k)^⊤ Σ_k^{−1} (x − µ_k)] / ∑_{k′=1}^K π_{k′} |Σ_{k′}|^{−1/2} exp[−½ (x − µ_{k′})^⊤ Σ_{k′}^{−1} (x − µ_{k′})]

The decision boundary between any pair of classes will be.. a quadratic curve

Reason: For any two classes k and k′, at the decision boundary p(y = k|x) = p(y = k′|x). Thus

(x − µ_k)^⊤ Σ_k^{−1} (x − µ_k) − (x − µ_{k′})^⊤ Σ_{k′}^{−1} (x − µ_{k′}) = 0    (ignoring terms that don’t depend on x)

.. defines the decision boundary, which is a quadratic function of x (this model is popularly known
as Quadratic Discriminant Analysis)
Decision Boundaries
Let’s again consider the generative classification prediction rule with Gaussian class-conditionals
p(y = k|x, θ) = π_k |Σ_k|^{−1/2} exp[−½ (x − µ_k)^⊤ Σ_k^{−1} (x − µ_k)] / ∑_{k′=1}^K π_{k′} |Σ_{k′}|^{−1/2} exp[−½ (x − µ_{k′})^⊤ Σ_{k′}^{−1} (x − µ_{k′})]

Let’s assume all classes to have the same covariance (i.e., same shape/size), i.e., Σk = Σ, ∀k
Now the decision boundary between any pair of classes will be.. linear

Reason: For any two classes k and k′, at the decision boundary p(y = k|x) = p(y = k′|x). Thus

(x − µ_k)^⊤ Σ^{−1} (x − µ_k) − (x − µ_{k′})^⊤ Σ^{−1} (x − µ_{k′}) = 0    (ignoring terms that don’t depend on x)

.. terms quadratic in x cancel out in this case and we get a linear function of x (this model is
popularly known as Linear or “Fisher” Discriminant Analysis)
Decision Boundaries

Depending on the form of the covariance matrices, the boundaries can be quadratic/linear

A Closer Look at the Linear Case
For the linear case (when Σk = Σ), we have
 
p(y = k|x, θ) ∝ π_k exp[−½ (x − µ_k)^⊤ Σ^{−1} (x − µ_k)]

Expanding further, we can write the above as


 
p(y = k|x, θ) ∝ exp[µ_k^⊤ Σ^{−1} x − ½ µ_k^⊤ Σ^{−1} µ_k + log π_k] · [exp(−½ x^⊤ Σ^{−1} x)]

Therefore, the above posterior class probability can be written as


p(y = k|x, θ) = exp(w_k^⊤ x + b_k) / ∑_{k′=1}^K exp(w_{k′}^⊤ x + b_{k′})

where w_k = Σ^{−1} µ_k and b_k = −½ µ_k^⊤ Σ^{−1} µ_k + log π_k

Interestingly, this has exactly the same form as the softmax classification model, which is a
discriminative model (will look at these later), as opposed to a generative model.
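A minimal numerical check (with assumed parameter values) that the shared-covariance posterior really does reduce to a softmax over w_k^⊤ x + b_k with w_k = Σ^{−1} µ_k and b_k = −½ µ_k^⊤ Σ^{−1} µ_k + log π_k.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
K, D = 3, 2
pi = np.array([0.5, 0.3, 0.2])                           # assumed class priors
mu = rng.normal(size=(K, D))                             # assumed class means
A = rng.normal(size=(D, D))
Sigma = A @ A.T + np.eye(D)                              # one shared covariance for all classes
x = rng.normal(size=D)

# Route 1: Bayes rule with Gaussian class-conditionals
num = np.array([pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma) for k in range(K)])
posterior_bayes = num / num.sum()

# Route 2: softmax with w_k = Sigma^{-1} mu_k, b_k = -0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
Sinv = np.linalg.inv(Sigma)
W = mu @ Sinv                                            # row k holds w_k
b = -0.5 * np.einsum('kd,de,ke->k', mu, Sinv, mu) + np.log(pi)
logits = W @ x + b
posterior_softmax = np.exp(logits - logits.max())
posterior_softmax /= posterior_softmax.sum()

print(np.allclose(posterior_bayes, posterior_softmax))   # True: the two routes agree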
Analogy: Classifying by Computing Distances from Classes

We can get a non-probabilistic analogy for the Gaussian generative classification model

Note the decision rule when Σk = Σ


 
ŷ = arg max_k p(y = k|x) = arg max_k π_k exp[−½ (x − µ_k)^⊤ Σ^{−1} (x − µ_k)]
  = arg max_k [log π_k − ½ (x − µ_k)^⊤ Σ^{−1} (x − µ_k)]
Further, let’s assume the classes to be of equal size, i.e., πk = 1/K . Then we will have

ŷ = arg min_k (x − µ_k)^⊤ Σ^{−1} (x − µ_k)

This is equivalent to assigning x to the “closest” class in terms of a Mahalanobis distance


The covariance matrix “modulates” how the distances are computed
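A minimal sketch of this nearest-class view (assumed parameters): assign a point to the class whose mean has the smallest Mahalanobis distance (x − µ_k)^⊤ Σ^{−1} (x − µ_k), assuming equal class priors.

import numpy as np

# Assumed shared covariance and class means (equal class priors pi_k = 1/K)
mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sinv = np.linalg.inv(Sigma)

def classify(x):
    # Pick the class with the smallest Mahalanobis distance to its mean
    diffs = mu - x                                        # shape (K, D)
    d2 = np.einsum('kd,de,ke->k', diffs, Sinv, diffs)     # (x - mu_k)^T Sigma^{-1} (x - mu_k)
    return int(np.argmin(d2))

print(classify(np.array([3.0, 1.0])))                     # index of the closest class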

Generative Classification: Some Comments

A simple but powerful approach to probabilistic classification


Especially easy to learn if class-conditionals are simple
E.g., Gaussian with diagonal covariances ⇒ Gaussian naı̈ve Bayes
Another popular model is multinomial naı̈ve Bayes (widely used for document classification)
The so-called “naı̈ve” models assume features to be independent conditioned on y , i.e.,
D
Y
p(x|θy ) = p(xd |θy ) (significantly reduces the number of parameters to be estimated)
d=1

Generative classification models work seamlessly for any number of classes


Can choose the form of class-conditionals p(x|y ) based on the type of inputs x
Can handle missing data (e.g., if some part of the input x is missing) or missing labels

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Probabilistic Models for Classification: Generative Classification 19
Generative Classification: Some Comments

A simple but powerful approach to probabilistic classification


Especially easy to learn if class-conditionals are simple
E.g., Gaussian with diagonal covariances ⇒ Gaussian naïve Bayes
Another popular model is multinomial naïve Bayes (widely used for document classification)
The so-called “naïve” models assume the features to be independent conditioned on y, i.e.,
p(x|\theta_y) = \prod_{d=1}^{D} p(x_d|\theta_y)   (significantly reduces the number of parameters to be estimated; a small Gaussian naïve Bayes sketch follows this slide)

Generative classification models work seamlessly for any number of classes


Can choose the form of class-conditionals p(x|y ) based on the type of inputs x
Can handle missing data (e.g., if some part of the input x is missing) or missing labels
Generative models are also useful for semi-supervised learning (we will look at this later)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Probabilistic Models for Classification: Generative Classification 19
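The naïve factorization above is straightforward to implement. Below is a minimal sketch (not from the slides) of a Gaussian naïve Bayes classifier fit by MLE: each class gets a prior, a per-feature mean, and a per-feature variance (i.e., a diagonal-covariance Gaussian). The function names fit_gaussian_nb and predict_gaussian_nb and the var_smoothing argument are hypothetical.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """MLE fit of a Gaussian naive Bayes model.

    X : (N, D) real-valued features, y : (N,) integer labels in {0, ..., K-1}.
    Returns per-class priors (K,), means (K, D), and diagonal variances (K, D).
    """
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # Small smoothing term keeps variances strictly positive
    variances = np.array([X[y == k].var(axis=0) for k in classes]) + var_smoothing
    return priors, means, variances

def predict_gaussian_nb(X, priors, means, variances):
    """Predict argmax_k [ log pi_k + sum_d log N(x_d | mu_{kd}, sigma^2_{kd}) ]."""
    diff = X[:, None, :] - means[None, :, :]                              # (N, K, D)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)    # (N, K)
    return np.argmax(np.log(priors)[None, :] + log_lik, axis=1)           # (N,)
```

Note that only K·(2D + 1) numbers are estimated here, versus O(K·D²) for full-covariance Gaussians, which is the parameter saving the bullet refers to.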
Generative Classification: Some Comments

Estimating the class-conditional distributions p(x|y ) reliably is important


In general, the class-conditional p(x|y) may have too many parameters to be estimated (e.g., if we model each class-conditional as a Gaussian with a full covariance matrix)
Can be difficult if we don’t have enough data for each class

Assuming a shared and/or diagonal covariance for each Gaussian can reduce the number of parameters
... or making the “naïve” (conditional independence) assumption

MLE for parameter estimation in these models can be prone to overfitting


Need to regularize the model properly to prevent that
A MAP or fully Bayesian approach can help (will need a prior on the parameters; see the sketch after this slide)

A good density estimation model is necessary for a generative classification model to work well

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Probabilistic Models for Classification: Generative Classification 20
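One simple way to regularize the covariance estimates is to shrink each class’s MLE covariance toward a scaled identity matrix; this mimics the effect of a MAP estimate under a simple prior (the exact correspondence depends on the prior chosen, e.g., an inverse-Wishart). A minimal sketch, with a hypothetical shrinkage weight alpha:

```python
import numpy as np

def shrinkage_covariance(X_k, alpha=0.1):
    """Regularized covariance estimate for the data X_k of a single class.

    X_k   : (N_k, D) array of inputs belonging to class k
    alpha : shrinkage weight in [0, 1]; alpha = 0 recovers the plain MLE
    """
    S = np.cov(X_k, rowvar=False, bias=True)        # MLE covariance (normalizes by N_k)
    D = X_k.shape[1]
    target = np.eye(D) * np.trace(S) / D            # scaled identity with the same average variance
    # Convex combination keeps the estimate well-conditioned when N_k is small relative to D
    return (1 - alpha) * S + alpha * target
```

Using such a regularized estimate in place of the raw per-class MLE covariance is one way to keep the generative classifier from overfitting when some classes have few training examples.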
