
Probabilistic Models for Classification: Generative Classification

Piyush Rai

Probabilistic Machine Learning (CS772A)


Aug 17, 2017

Recap

Parameter Estimation in Probabilistic Models
Fully Bayesian inference (what we would ideally like to do)
p(θ|X) = p(X|θ) p(θ) / p(X) = (Likelihood × Prior) / (Marginal likelihood)    (easy when we have conjugacy!)

A cheaper alternative: Maximum Likelihood (ML) Estimation!

θ̂_MLE = arg max_θ log p(X|θ) = arg max_θ ∑_{n=1}^N log p(x_n|θ),  if data X = {x_1, . . . , x_N} is i.i.d.

Another cheaper (and better) alternative: Maximum-a-Posteriori (MAP) Estimation

θ̂_MAP = arg max_θ log p(θ|X) = arg max_θ [log p(X|θ) + log p(θ)] = arg max_θ [∑_{n=1}^N log p(x_n|θ) + log p(θ)],  if data is i.i.d.

In some cases, we may want (or may have to) do a “hybrid” sort of inference (fully Bayesian inference for some parameters and MLE/MAP for other parameters)
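To make these three options concrete, here is a minimal sketch (not from the slides) for a coin-flip model: a Bernoulli likelihood with a conjugate Beta prior, where the MLE, the MAP estimate, and the full posterior all have closed forms. The Beta(2, 2) hyperparameters are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=20)      # i.i.d. Bernoulli observations (coin flips)
N, N1 = len(X), X.sum()                # number of flips and number of heads
a, b = 2.0, 2.0                        # Beta(a, b) prior hyperparameters (assumed values)

# MLE: maximize sum_n log p(x_n | theta)
theta_mle = N1 / N

# MAP: maximize sum_n log p(x_n | theta) + log p(theta); for a Beta prior this is the posterior mode
theta_map = (N1 + a - 1) / (N + a + b - 2)

# Fully Bayesian: by conjugacy the posterior p(theta | X) is Beta(a + N1, b + N - N1)
post_a, post_b = a + N1, b + N - N1
theta_post_mean = post_a / (post_a + post_b)

print(theta_mle, theta_map, theta_post_mean)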
Marginal Likelihood
Given data X from a model M with likelihood p(X|θ, M), the marginal likelihood (or “evidence”)
p(X|M) = ∫ p(X, θ|M) dθ = ∫ p(X|θ, M) p(θ|M) dθ = E_{p(θ|M)}[p(X|θ, M)]

Intuitively, marginal likelihood = the average probability that model M generated X


It’s the likelihood p(X|θ, M) averaged or marginalized/integrated over the prior p(θ|M)
Marginal likelihood can also be seen as a measure of “goodness” of model M for data X

Note: The word “model” is meant here in a rather broad sense


Could mean a model from a class of models (e.g., linear, quadratic, cubic regression)
Could mean a hyperparameter set to some value (each possible value will mean a different model)

When clear from the context, we omit M and simply use p(X) to denote the marginal likelihood
Marginal likelihood can be hard to compute in general, but is a very useful quantity
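As an illustration of the expectation form above, here is a minimal sketch (not from the slides) that estimates the marginal likelihood of a Beta-Bernoulli model by Monte Carlo, averaging the likelihood over samples from the prior, and compares it to the closed-form evidence available under conjugacy.

import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=20)
N, N1 = len(X), X.sum()
a, b = 2.0, 2.0                                  # Beta prior hyperparameters (assumed)

# Monte Carlo estimate of p(X) = E_{p(theta)}[p(X | theta)]
thetas = rng.beta(a, b, size=100_000)            # draws from the prior
lik = thetas**N1 * (1 - thetas)**(N - N1)        # p(X | theta) for each draw
ml_mc = lik.mean()

# Closed-form evidence for the Beta-Bernoulli model: B(a + N1, b + N - N1) / B(a, b)
ml_exact = np.exp(betaln(a + N1, b + N - N1) - betaln(a, b))

print(ml_mc, ml_exact)                           # the two estimates should be close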
Model Selection via Marginal Likelihood
Can use marginal likelihood to compare and choose from a set of models M1 , M2 , . . . , MK
E.g., choose the model with the largest p(X|M) (or largest p(M|X))

Intuition: For a “good” model, many random θ’s from it (as opposed to just a select few!) would fit the data reasonably well (so, “on average”, we can say that this model fits the data well)

Important: Note that here we aren’t talking about a single θ fitting the data well (which could just
be a case of overfitting the data) but many random θ’s from the model M
Important: This approach doesn’t require separate held-out data (therefore it is especially useful when there is very little training data, or in unsupervised learning where cross-validation is impossible)
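A minimal sketch of this idea (an assumed example, not from the slides): treat two different Beta prior settings as two candidate “models” for the same coin-flip data, and pick the one with the larger closed-form marginal likelihood.

import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.9, size=30)                # data from a heavily biased coin
N, N1 = len(X), X.sum()

def log_marginal_likelihood(a, b):
    # log p(X | M) for a Bernoulli likelihood with a Beta(a, b) prior (closed form)
    return betaln(a + N1, b + N - N1) - betaln(a, b)

# Two "models" = two settings of the prior hyperparameters
models = {"M1: Beta(1, 1) prior": (1.0, 1.0),      # agnostic about the coin's bias
          "M2: Beta(50, 50) prior": (50.0, 50.0)}  # insists the coin is nearly fair

scores = {name: log_marginal_likelihood(a, b) for name, (a, b) in models.items()}
print(scores, "-> choose", max(scores, key=scores.get))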
Predictions via Posterior Averaging

Having learned a posterior p(θ|X), the posterior predictive distribution for a new observation x∗

p(x∗|X) = ∫ p(x∗, θ|X) dθ = ∫ p(x∗|θ) p(θ|X) dθ = E_{p(θ|X)}[p(x∗|θ)]

which doesn’t depend on a single θ but averages the prediction p(x ∗ |θ) w.r.t. the posterior p(θ|X)
Note: Computing the posterior predictive p(x ∗ |X) is tractable only in rare cases and may require
an approximation in other cases (e.g., via sampling methods; we will look at these later)
Note: If we only have a point estimate of θ, say θ_MLE or θ_MAP, then
p(x∗|X) = p(x∗|θ_MLE) or p(x∗|X) = p(x∗|θ_MAP)

which is still a distribution but only depends on a single value of θ


Note: The posterior predictive distribution p(x ∗ |X) integrates out θ but may still depend on other
hyperparameters if those are hard to integrate out (or if we know “good” values for those).
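Continuing the Beta-Bernoulli example (an assumed illustration, not from the slides): under conjugacy the posterior predictive has a closed form, and it can be contrasted with the plug-in prediction that uses only the single point estimate θ_MLE.

import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=5)            # deliberately few observations
N, N1 = len(X), X.sum()
a, b = 2.0, 2.0                             # Beta prior hyperparameters (assumed)

# Posterior is Beta(a + N1, b + N - N1); averaging p(x* | theta) over it gives
# p(x* = 1 | X) = E_{p(theta|X)}[theta] = (a + N1) / (a + b + N)
p_pred_bayes = (a + N1) / (a + b + N)

# Plug-in predictive based on a single point estimate (here the MLE)
p_pred_plugin = N1 / N

print(p_pred_bayes, p_pred_plugin)          # the plug-in rule can be overconfident with little data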

Probabilistic Models for Classification
p(y = k|x) =?, k = 1, 2, . . . , K

Two Approaches to Probabilistic Classification
Generative classification
Use training data to learn class-conditional distribution p(x|y ) of inputs from each class y = 1, . . . , K
Use training data to learn class prior probability distribution p(y ) for each class y = 1, . . . , K
Note: We can use MLE/MAP or a fully Bayesian procedure to estimate p(x|y ) and p(y )
Finally, use these distributions to compute p(y = k|x) for an input x by applying Bayes rule

p(y = k|x) = p(y = k) p(x|y = k) / p(x) = p(y = k) p(x|y = k) / ∑_{k′=1}^K p(y = k′) p(x|y = k′)

Some examples: Naïve Bayes, Linear/Quadratic Discriminant Analysis (the name is a misnomer), etc.

Discriminative classification
Based on directly learning a probability distribution p(y = k|x) of class y given the input x
Some examples: Logistic regression, softmax classification, Gaussian Process classification, etc.

Important: Generative approach models the inputs x. Discriminative approach treats x as fixed
Generative Classification

Assume we’ve learned p(x|y ) and p(y ). The classification rule will be

ŷ = arg max_k p(y = k|x) = arg max_k p(y = k) p(x|y = k) / p(x) = arg max_k p(y = k) p(x|y = k)

If we know the true p(x|y) and p(y), this approach provably has the smallest possible classification error
Also known as the Bayes optimal classifier (minimizes the misclassification probability)

However, generative classification also suffers from some issues, e.g.,
May require too many parameters to estimate (especially for class-conditional distributions p(x|y ))
May do badly if our assumed p(x|y ) (say a Gaussian) is very different from the true class-conditional

Nevertheless, a very powerful approach, especially if we have a good method to estimate the
class-conditional densities. Also enables doing semi-supervised learning (more on this later)

Generative Classification: A Generative Story of Labeled Data
Basic idea: Each input x n is generated conditioned on the value of the corresponding label yn

Assume a class probability param. vector π = {π_1, . . . , π_K}, s.t. ∑_{k=1}^K π_k = 1: Defines p(y = k)
Assume θ = {θ_1, . . . , θ_K} to be the parameters of the distributions generating the data: Defines p(x|y = k)
Again note that there is no directly defined p(y = k|x) in this case
The associated generative story for each example (x n , yn ) is as follows
First choose a class/label yn ∈ {1, 2, . . . , K }

yn ∼ multinoulli(π)
Now draw (“generate”) the input x n from a distribution specific to the value yn takes

x_n | y_n ∼ p(x|θ_{y_n})
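A minimal sketch of this generative story for Gaussian class-conditionals (the parameter values below are assumptions for illustration): first sample a label from the multinoulli over classes, then sample the input from that class’s distribution.

import numpy as np

rng = np.random.default_rng(0)
K, D, N = 3, 2, 500
pi = np.array([0.5, 0.3, 0.2])                              # class prior p(y = k) (assumed)
mu = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 2.0]])        # per-class means (assumed)
Sigma = np.stack([np.eye(D) for _ in range(K)])             # per-class covariances (assumed)

# Generative story: y_n ~ multinoulli(pi), then x_n | y_n ~ p(x | theta_{y_n})
y = rng.choice(K, size=N, p=pi)
X = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in y])

print(X.shape, np.bincount(y) / N)                          # empirical class fractions approx pi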

Generative Classification using Gaussian Class-conditionals
Recall our generative classification model p(y = k|x) = p(y = k) p(x|y = k) / p(x)

Assume each class-conditional to be a Gaussian


 
p(x|y = k) = p(x|θ_k) = N(x|µ_k, Σ_k) = 1/√((2π)^D |Σ_k|) exp[−½ (x − µ_k)^⊤ Σ_k^{−1} (x − µ_k)]
Assume p(y = k) = π_k ∈ (0, 1), s.t. ∑_{k=1}^K π_k = 1 (equivalent to saying that p(y) is multinoulli)
Parameters π = {π_k}_{k=1}^K, θ = {µ_k, Σ_k}_{k=1}^K can be estimated using MLE/MAP/Bayesian approach

After estimating the parameters, the “plug-in” prediction rule will be

p(y = k|x, θ, π) = π_k |Σ_k|^{−1/2} exp[−½ (x − µ_k)^⊤ Σ_k^{−1} (x − µ_k)] / ∑_{k′=1}^K π_{k′} |Σ_{k′}|^{−1/2} exp[−½ (x − µ_{k′})^⊤ Σ_{k′}^{−1} (x − µ_{k′})]

An Important Note: The above rule is not truly Bayesian (it WILL be if we inferred the full posterior
distribution of the parameters and averaged p(y = k|x, θ, π) over that posterior distribution)
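A minimal sketch of this plug-in rule (an assumed example, not from the slides), using scipy’s multivariate normal density: compute π_k N(x|µ_k, Σ_k) for each class and normalize.

import numpy as np
from scipy.stats import multivariate_normal

def predict_proba(x, pi, mu, Sigma):
    # Plug-in p(y = k | x, theta, pi) with Gaussian class-conditionals
    scores = np.array([pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma[k])
                       for k in range(len(pi))])
    return scores / scores.sum()            # divide by p(x) = sum_k pi_k N(x | mu_k, Sigma_k)

# Assumed (already estimated) parameters for K = 2 classes in D = 2 dimensions
pi = np.array([0.6, 0.4])
mu = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma = [np.eye(2), 0.5 * np.eye(2)]

print(predict_proba(np.array([1.0, 1.0]), pi, mu, Sigma))   # posterior over the two classes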
Parameter Estimation via MLE
Given data D = {x_n, y_n}_{n=1}^N with x_n ∈ R^D and y_n ∈ {1, . . . , K}, the log likelihood will be

log p(D|Θ) = log ∏_{n=1}^N p(x_n, y_n|Θ) = log ∏_{n=1}^N p(x_n|θ_{y_n}) p(y_n|π) = ∑_{n=1}^N [log p(x_n|θ_{y_n}) + log p(y_n|π)]

where Θ = {π_k, µ_k, Σ_k}_{k=1}^K denotes all the unknown parameters of the model

Here log p(x_n|θ_{y_n}) = log N(x_n|µ_{y_n}, Σ_{y_n})


Here log p(y_n|π) = log multinoulli(y_n|π) = log ∏_{k=1}^K π_k^{I[y_n = k]}
Substituting these, we can write the log likelihood as

log p(D|Θ) = ∑_{k=1}^K ∑_{n: y_n = k} log N(x_n|µ_k, Σ_k) + ∑_{n=1}^N ∑_{k=1}^K I[y_n = k] log π_k    (Exercise: Verify this!)

We can now do MLE for the parameters Θ = {π_k, µ_k, Σ_k}_{k=1}^K
Parameter Estimation via MLE

The log likelihood is


log p(D|Θ) = ∑_{k=1}^K ∑_{n: y_n = k} log N(x_n|µ_k, Σ_k) + ∑_{n=1}^N ∑_{k=1}^K I[y_n = k] log π_k

MLE for π_k is straightforward: π_k = N_k/N, where N_k is the number of training examples from class k (makes sense intuitively, but verify as an exercise)
MLE for µ_k, Σ_k is a Gaussian parameter estimation problem given data from class k. Solution is:

µ̂_k = (1/N_k) ∑_{n: y_n = k} x_n,    Σ̂_k = (1/N_k) ∑_{n: y_n = k} (x_n − µ̂_k)(x_n − µ̂_k)^⊤    (Exercise: Verify this!)

.. which is simply the empirical mean and empirical covariance of the inputs from class k
Note: Can also do MAP or fully Bayesian inference for these parameters
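A minimal sketch of these MLE formulas on assumed synthetic data: per-class fractions give π_k, and per-class empirical means and covariances give µ_k and Σ_k.

import numpy as np

rng = np.random.default_rng(0)
# Assumed synthetic labeled data: two 2-D Gaussian classes
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 200),
               rng.multivariate_normal([3, 3], 0.5 * np.eye(2), 200)])
y = np.repeat([0, 1], 200)
N, K = len(y), 2

pi_hat, mu_hat, Sigma_hat = [], [], []
for k in range(K):
    Xk = X[y == k]
    Nk = len(Xk)
    pi_hat.append(Nk / N)                          # pi_k = N_k / N
    mu_hat.append(Xk.mean(axis=0))                 # empirical mean of class k
    centered = Xk - mu_hat[-1]
    Sigma_hat.append(centered.T @ centered / Nk)   # empirical covariance of class k (divide by N_k)

print(pi_hat, mu_hat)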

Decision Boundaries
The generative classification prediction rule was
p(y = k|x, θ) = π_k |Σ_k|^{−1/2} exp[−½ (x − µ_k)^⊤ Σ_k^{−1} (x − µ_k)] / ∑_{k′=1}^K π_{k′} |Σ_{k′}|^{−1/2} exp[−½ (x − µ_{k′})^⊤ Σ_{k′}^{−1} (x − µ_{k′})]

The decision boundary between any pair of classes will be.. a quadratic curve

Reason: For any two classes k and k′, at the decision boundary p(y = k|x) = p(y = k′|x). Thus

(x − µ_k)^⊤ Σ_k^{−1} (x − µ_k) − (x − µ_{k′})^⊤ Σ_{k′}^{−1} (x − µ_{k′}) = 0    (ignoring terms that don’t depend on x)

.. defines the decision boundary, which is a quadratic function of x (this model is popularly known
as Quadratic Discriminant Analysis)
Decision Boundaries
Let’s again consider the generative classification prediction rule with Gaussian class-conditionals
p(y = k|x, θ) = π_k |Σ_k|^{−1/2} exp[−½ (x − µ_k)^⊤ Σ_k^{−1} (x − µ_k)] / ∑_{k′=1}^K π_{k′} |Σ_{k′}|^{−1/2} exp[−½ (x − µ_{k′})^⊤ Σ_{k′}^{−1} (x − µ_{k′})]

Let’s assume all classes to have the same covariance (i.e., same shape/size), i.e., Σk = Σ, ∀k
Now the decision boundary between any pair of classes will be.. linear

Reason: For any two classes k and k′, at the decision boundary p(y = k|x) = p(y = k′|x). Thus

(x − µ_k)^⊤ Σ^{−1} (x − µ_k) − (x − µ_{k′})^⊤ Σ^{−1} (x − µ_{k′}) = 0    (ignoring terms that don’t depend on x)

.. terms quadratic in x cancel out in this case and we get a linear function of x (this model is
popularly known as Linear or “Fisher” Discriminant Analysis)
Decision Boundaries

Depending on the form of the covariance matrices, the boundaries can be quadratic/linear

A Closer Look at the Linear Case
For the linear case (when Σk = Σ), we have
 
p(y = k|x, θ) ∝ π_k exp[−½ (x − µ_k)^⊤ Σ^{−1} (x − µ_k)]

Expanding further, we can write the above as


 
p(y = k|x, θ) ∝ exp[µ_k^⊤ Σ^{−1} x − ½ µ_k^⊤ Σ^{−1} µ_k + log π_k] · [exp(−½ x^⊤ Σ^{−1} x)]

Therefore, the above posterior class probability can be written as


p(y = k|x, θ) = exp(w_k^⊤ x + b_k) / ∑_{k′=1}^K exp(w_{k′}^⊤ x + b_{k′})

where w_k = Σ^{−1} µ_k and b_k = −½ µ_k^⊤ Σ^{−1} µ_k + log π_k

Interestingly, this has exactly the same form as the softmax classification model, which is a
discriminative model (will look at these later), as opposed to a generative model.
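A minimal numerical check (with assumed parameter values) that the shared-covariance posterior really does reduce to a softmax over w_k^⊤ x + b_k with w_k = Σ^{−1} µ_k and b_k = −½ µ_k^⊤ Σ^{−1} µ_k + log π_k.

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
K, D = 3, 2
pi = np.array([0.5, 0.3, 0.2])                           # assumed class priors
mu = rng.normal(size=(K, D))                             # assumed class means
A = rng.normal(size=(D, D))
Sigma = A @ A.T + np.eye(D)                              # one shared covariance for all classes
x = rng.normal(size=D)

# Route 1: Bayes rule with Gaussian class-conditionals
num = np.array([pi[k] * multivariate_normal.pdf(x, mean=mu[k], cov=Sigma) for k in range(K)])
posterior_bayes = num / num.sum()

# Route 2: softmax with w_k = Sigma^{-1} mu_k, b_k = -0.5 mu_k^T Sigma^{-1} mu_k + log pi_k
Sinv = np.linalg.inv(Sigma)
W = mu @ Sinv                                            # row k holds w_k
b = -0.5 * np.einsum('kd,de,ke->k', mu, Sinv, mu) + np.log(pi)
logits = W @ x + b
posterior_softmax = np.exp(logits - logits.max())
posterior_softmax /= posterior_softmax.sum()

print(np.allclose(posterior_bayes, posterior_softmax))   # True: the two routes agree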
Analogy: Classifying by Computing Distances from Classes

We can get a non-probabilistic analogy for the Gaussian generative classification model

Note the decision rule when Σk = Σ


 
ŷ = arg max_k p(y = k|x) = arg max_k π_k exp[−½ (x − µ_k)^⊤ Σ^{−1} (x − µ_k)]
  = arg max_k [log π_k − ½ (x − µ_k)^⊤ Σ^{−1} (x − µ_k)]
Further, let’s assume the classes to be of equal size, i.e., πk = 1/K . Then we will have

ŷ = arg min_k (x − µ_k)^⊤ Σ^{−1} (x − µ_k)

This is equivalent to assigning x to the “closest” class in terms of a Mahalanobis distance


The covariance matrix “modulates” how the distances are computed
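A minimal sketch of this nearest-class view (assumed parameters): assign a point to the class whose mean has the smallest Mahalanobis distance (x − µ_k)^⊤ Σ^{−1} (x − µ_k), assuming equal class priors.

import numpy as np

# Assumed shared covariance and class means (equal class priors pi_k = 1/K)
mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
Sinv = np.linalg.inv(Sigma)

def classify(x):
    # Pick the class with the smallest Mahalanobis distance to its mean
    diffs = mu - x                                        # shape (K, D)
    d2 = np.einsum('kd,de,ke->k', diffs, Sinv, diffs)     # (x - mu_k)^T Sigma^{-1} (x - mu_k)
    return int(np.argmin(d2))

print(classify(np.array([3.0, 1.0])))                     # index of the closest class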

Generative Classification: Some Comments

A simple but powerful approach to probabilistic classification


Especially easy to learn if class-conditionals are simple
E.g., Gaussian with diagonal covariances ⇒ Gaussian naı̈ve Bayes
Another popular model is multinomial naı̈ve Bayes (widely used for document classification)
The so-called “naı̈ve” models assume features to be independent conditioned on y , i.e.,
D
Y
p(x|θy ) = p(xd |θy ) (significantly reduces the number of parameters to be estimated)
d=1

Generative classification models work seamlessly for any number of classes


Can choose the form of class-conditionals p(x|y ) based on the type of inputs x
Can handle missing data (e.g., if some part of the input x is missing) or missing labels

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Probabilistic Models for Classification: Generative Classification 19
Generative Classification: Some Comments

A simple but powerful approach to probabilistic classification


Especially easy to learn if class-conditionals are simple
E.g., Gaussian with diagonal covariances ⇒ Gaussian naïve Bayes
Another popular model is multinomial naïve Bayes (widely used for document classification)
The so-called “naïve” models assume the features to be independent conditioned on y, i.e.,
p(x|\theta_y) = \prod_{d=1}^{D} p(x_d|\theta_y)   (significantly reduces the number of parameters to be estimated; a small Gaussian naïve Bayes sketch follows this slide)

Generative classification models work seamlessly for any number of classes


Can choose the form of class-conditionals p(x|y ) based on the type of inputs x
Can handle missing data (e.g., if some part of the input x is missing) or missing labels
Generative models are also useful for semi-supervised learning (we will look at this later)

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Probabilistic Models for Classification: Generative Classification 19
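The naïve factorization above is straightforward to implement. Below is a minimal sketch (not from the slides) of a Gaussian naïve Bayes classifier fit by MLE: each class gets a prior, a per-feature mean, and a per-feature variance (i.e., a diagonal-covariance Gaussian). The function names fit_gaussian_nb and predict_gaussian_nb and the var_smoothing argument are hypothetical.

```python
import numpy as np

def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """MLE fit of a Gaussian naive Bayes model.

    X : (N, D) real-valued features, y : (N,) integer labels in {0, ..., K-1}.
    Returns per-class priors (K,), means (K, D), and diagonal variances (K, D).
    """
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # Small smoothing term keeps variances strictly positive
    variances = np.array([X[y == k].var(axis=0) for k in classes]) + var_smoothing
    return priors, means, variances

def predict_gaussian_nb(X, priors, means, variances):
    """Predict argmax_k [ log pi_k + sum_d log N(x_d | mu_{kd}, sigma^2_{kd}) ]."""
    diff = X[:, None, :] - means[None, :, :]                              # (N, K, D)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :]).sum(axis=2)    # (N, K)
    return np.argmax(np.log(priors)[None, :] + log_lik, axis=1)           # (N,)
```

Note that only K·(2D + 1) numbers are estimated here, versus O(K·D²) for full-covariance Gaussians, which is the parameter saving the bullet refers to.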
Generative Classification: Some Comments

Estimating the class-conditional distributions p(x|y ) reliably is important


In general, the class-conditional p(x|y) may have too many parameters to be estimated (e.g., if we model each class-conditional as a Gaussian with a full covariance matrix)
Can be difficult if we don’t have enough data for each class

Assuming a shared and/or diagonal covariance for each Gaussian can reduce the number of parameters
... or making the “naïve” (conditional independence) assumption

MLE for parameter estimation in these models can be prone to overfitting


Need to regularize the model properly to prevent that
A MAP or fully Bayesian approach can help (will need a prior on the parameters; see the sketch after this slide)

A good density estimation model is necessary for a generative classification model to work well

Probabilistic Machine Learning - CS772A (Piyush Rai, IITK) Probabilistic Models for Classification: Generative Classification 20
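One simple way to regularize the covariance estimates is to shrink each class’s MLE covariance toward a scaled identity matrix; this mimics the effect of a MAP estimate under a simple prior (the exact correspondence depends on the prior chosen, e.g., an inverse-Wishart). A minimal sketch, with a hypothetical shrinkage weight alpha:

```python
import numpy as np

def shrinkage_covariance(X_k, alpha=0.1):
    """Regularized covariance estimate for the data X_k of a single class.

    X_k   : (N_k, D) array of inputs belonging to class k
    alpha : shrinkage weight in [0, 1]; alpha = 0 recovers the plain MLE
    """
    S = np.cov(X_k, rowvar=False, bias=True)        # MLE covariance (normalizes by N_k)
    D = X_k.shape[1]
    target = np.eye(D) * np.trace(S) / D            # scaled identity with the same average variance
    # Convex combination keeps the estimate well-conditioned when N_k is small relative to D
    return (1 - alpha) * S + alpha * target
```

Using such a regularized estimate in place of the raw per-class MLE covariance is one way to keep the generative classifier from overfitting when some classes have few training examples.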
