Intro to Machine Learning (CS771A)
Probabilistic Models for Supervised Learning (Contd.)
Piyush Rai
Recap: Probabilistic Linear Regression

Each response is modeled by a Gaussian likelihood: p(y_n | x_n, w) = N(y_n | w⊤x_n, β⁻¹), with mean w⊤x_n and variance β⁻¹ (β is the noise precision).

Equivalently, y_n = w⊤x_n + ε_n, where ε_n ∼ N(0, β⁻¹) is zero-mean Gaussian noise.
The Likelihood

Assuming the responses are i.i.d. given the inputs, the likelihood is

p(y | X, w) = ∏_{n=1}^N p(y_n | x_n, w) = ∏_{n=1}^N N(y_n | w⊤x_n, β⁻¹)
The Prior

p(w) = N(w | 0, λ⁻¹ I_D)

Zero-mean Gaussian prior encourages weights to be small. Precision λ controls how strong this prior is.
Recap: MLE, MAP, and Bayesian Inference for Prob. Lin. Reg.

For MLE, we maximize the log-likelihood. Ignoring constants w.r.t. w, we have

ŵ_MLE = arg max_w log p(y | X, w) = arg min_w (β/2) ∑_{n=1}^N (y_n − w⊤x_n)²

For MAP, the log of the Gaussian prior adds a (λ/2)‖w‖² term to the same objective; the resulting MAP estimate coincides with the posterior mean µ_N given below.

For Bayesian inference, we compute the full posterior. Easily computable (thanks to conjugacy):

p(w | y, X) = N(µ_N, Σ_N)
Σ_N = (β X⊤X + λ I_D)⁻¹
µ_N = (X⊤X + (λ/β) I_D)⁻¹ X⊤y
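As a minimal NumPy sketch (the synthetic data and the values of beta and lam below are illustrative assumptions, not from the slides), the closed-form MLE and the posterior parameters above can be computed as follows:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N examples, D features (assumed for illustration)
N, D = 50, 3
beta, lam = 25.0, 1.0                    # noise precision and prior precision
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + rng.normal(scale=1.0 / np.sqrt(beta), size=N)

# MLE: least-squares solution (arg min of the squared-error objective)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# Bayesian posterior p(w | y, X) = N(mu_N, Sigma_N)
Sigma_N = np.linalg.inv(beta * X.T @ X + lam * np.eye(D))
mu_N = np.linalg.solve(X.T @ X + (lam / beta) * np.eye(D), X.T @ y)

print("w_MLE:", w_mle)
print("mu_N :", mu_N)                    # also the MAP estimate for this model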
Recap: Predictive Distribution for Prob. Lin. Reg.

With a point estimate ŵ (MLE or MAP), the predictive distribution is simply the plug-in p(y* | x*, ŵ) = N(ŵ⊤x*, β⁻¹), which ignores the uncertainty in w.

When using the full posterior, we can compute the posterior predictive distribution

p(y* | x*, X, y) = ∫ p(y* | x*, w) p(w | X, y) dw = N(µ_N⊤x*, β⁻¹ + x*⊤Σ_N x*)
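A small sketch of evaluating this posterior predictive for one test input (same kind of assumed synthetic setup as before; kept self-contained):

import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
beta, lam = 25.0, 1.0                    # assumed noise and prior precisions
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + rng.normal(scale=1.0 / np.sqrt(beta), size=N)

# Posterior p(w | y, X) = N(mu_N, Sigma_N), as derived above
Sigma_N = np.linalg.inv(beta * X.T @ X + lam * np.eye(D))
mu_N = beta * Sigma_N @ X.T @ y

# Posterior predictive for a test input x*: N(mu_N^T x*, 1/beta + x*^T Sigma_N x*)
x_star = rng.normal(size=D)
pred_mean = mu_N @ x_star
pred_var = 1.0 / beta + x_star @ Sigma_N @ x_star
print(f"predictive mean {pred_mean:.3f}, predictive std {np.sqrt(pred_var):.3f}")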
Recap: Logistic Regression

Logistic Regression models p(y_n = 1 | w, x_n) using the sigmoid function

p(y_n = 1 | w, x_n) = σ(w⊤x_n) = 1 / (1 + exp(−w⊤x_n))

Can also use a Gaussian prior p(w) = N(w | 0, λ⁻¹ I_D), just like in probabilistic linear regression

Can estimate w via MLE, MAP, or (somewhat harder to do) fully Bayesian inference
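A brief sketch of these quantities: the sigmoid probability and the MAP objective (Bernoulli log-likelihood plus the Gaussian log-prior, up to constants). The tiny data set below is made up purely for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_objective(w, X, y, lam):
    """Bernoulli log-likelihood plus log of the N(0, lam^-1 I) prior, dropping
    constants; MAP estimation maximizes this over w."""
    p = sigmoid(X @ w)                              # p(y_n = 1 | w, x_n)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    log_prior = -0.5 * lam * w @ w
    return log_lik + log_prior

X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]])
y = np.array([1, 0, 0])
w = np.zeros(2)
print(map_objective(w, X, y, lam=1.0))              # objective at w = 0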
For K > 2 classes, softmax (multiclass logistic) regression with weight vectors W = [w_1, . . . , w_K] models

p(y_n = k | x_n, W) = exp(w_k⊤x_n) / ∑_{ℓ=1}^K exp(w_ℓ⊤x_n) = µ_nk
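A numerically stable way to evaluate these softmax probabilities (a sketch; the max-subtraction trick is a standard implementation detail, not from the slides):

import numpy as np

def softmax_probs(W, x):
    """Class probabilities p(y = k | x, W); W has shape (K, D), x has shape (D,)."""
    scores = W @ x                       # w_k^T x for each class k
    scores = scores - scores.max()       # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

W = np.array([[0.2, -1.0], [1.5, 0.3], [-0.7, 0.9]])   # K = 3, D = 2 (made up)
x = np.array([1.0, 2.0])
print(softmax_probs(W, x))               # sums to 1 across the 3 classes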
MLE/MAP for logistic/softmax does not have a closed-form solution (unlike the linear regression case)

Computing the full posterior is intractable (since the Bernoulli/multinoulli likelihood and the Gaussian prior are not conjugate)

The Laplace (Gaussian) approximation is one way to get an approximate posterior: approximate p(w | X, y) by N(w_MAP, H⁻¹), where H is the Hessian of the negative log-posterior at w_MAP
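A compact sketch of the Laplace approximation for binary logistic regression: gradient descent on the negative log-posterior to get w_MAP, then the Hessian at w_MAP gives the approximate posterior covariance. The data, step size, and iteration count are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, D, lam = 200, 2, 1.0
X = rng.normal(size=(N, D))
y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=N) > 0).astype(float)

# 1) MAP estimate by gradient descent on the negative log-posterior
w = np.zeros(D)
for _ in range(500):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) + lam * w       # gradient of the negative log-posterior
    w -= 0.01 * grad
w_map = w

# 2) Laplace approximation: covariance = inverse Hessian at w_MAP
p = sigmoid(X @ w_map)
H = X.T @ (X * (p * (1 - p))[:, None]) + lam * np.eye(D)
posterior_cov = np.linalg.inv(H)
print("w_MAP:", w_map)
print("approx posterior covariance:\n", posterior_cov)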
Generative Models for Supervised Learning

Generative models obtain the conditional from a model of the joint distribution of inputs and outputs:

p(y | x) = p(x, y) / p(x)
Generative Classification

Consider a classification problem with K ≥ 2 classes

Using θ to collectively denote all the parameters, the generative classification model is

p(y = k | x, θ) = p(x, y = k | θ) / p(x | θ),   k = 1, . . . , K

Note that the denominator is p(x | θ) = ∑_{k=1}^K p(x, y = k | θ), using the sum rule of probability

Can use the chain rule to re-express the above as

p(y = k | x, θ) = p(y = k | θ) p(x | y = k, θ) / p(x | θ)

This depends on two quantities

p(y = k | θ): The class-marginal distribution (also called "class prior")

p(x | y = k, θ): The class-conditional distribution of the inputs

Generative classification requires first estimating the parameters θ of these two distributions
Generative Classification: Estimating Class-Marginal Distribution

Estimating the class-marginal is usually straightforward in generative classification

The class-marginal distribution is (has to be!) a discrete distribution (multinoulli)

p(y | π) = multinoulli(y | π_1, . . . , π_K) = ∏_{k=1}^K π_k^{I[y = k]}

where the multinoulli parameters are π = [π_1, . . . , π_K], with ∑_{k=1}^K π_k = 1 and π_k = p(y = k)

Given N labeled training examples {(x_n, y_n)}_{n=1}^N, the MLE for π (which won't depend on the x_n's) will be

π_MLE = arg max_π ∑_{n=1}^N log p(y_n | π)

.. which gives π_k = N_k / N (exercise: verify), where N_k = ∑_{n=1}^N I[y_n = k]

Note: If the MAP estimate (or full posterior) is needed, we can use a Dirichlet prior distribution on π

Another exercise: Try to derive the MAP estimate of π and also the full posterior (good news: multinoulli and Dirichlet are conjugate to each other, so the full posterior is easy)
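The MLE above is just class counting; a tiny sketch (the labels below are made up for illustration):

import numpy as np

y = np.array([0, 2, 1, 0, 0, 2, 1, 1, 1, 0])    # N = 10 labels from K = 3 classes
K = 3

N_k = np.bincount(y, minlength=K)                # N_k = number of examples of class k
pi_mle = N_k / len(y)                            # pi_k = N_k / N
print(pi_mle)                                    # [0.4, 0.4, 0.2]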
Generative Classification: Estimating Class-Conditional Distr.

The class-conditional p(x | y = k, θ) is modeled by a distribution suited to the inputs, e.g., a Gaussian N(x | µ_k, Σ_k) for real-valued features

For the assumed class-conditional, we can do MLE/MAP estimation or learn the full posterior for θ

An issue: When D is large, we may need to estimate a huge number of parameters, e.g., a D-dim Gaussian will have D params for the mean and O(D²) params for the covariance matrix

Some workarounds: Regularize well; assume a diagonal covariance (i.e., the features are independent given the class, as used in "naïve" Bayes models) and/or the same covariance for all classes (a fitting sketch follows below)
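A sketch of fitting Gaussian class-conditionals by MLE, with both a full and a diagonal (naïve-Bayes-style) covariance per class; the synthetic data is an assumption for illustration.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-class data in D = 3 dimensions
X0 = rng.normal(loc=0.0, size=(40, 3))
X1 = rng.normal(loc=2.0, size=(60, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 40 + [1] * 60)

params = {}
for k in (0, 1):
    Xk = X[y == k]
    mu_k = Xk.mean(axis=0)                          # MLE of the class mean
    Sigma_k = np.cov(Xk, rowvar=False, bias=True)   # full covariance: O(D^2) params
    diag_k = Xk.var(axis=0)                         # diagonal covariance: only D params
    params[k] = (mu_k, Sigma_k, diag_k)
    print(f"class {k}: mean {mu_k.round(2)}, diag var {diag_k.round(2)}")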
Generative Classification: The Prediction Rule

Suppose we've estimated the parameters of p(y = k | θ) and p(x | y = k, θ) (assuming MLE/MAP)

The "most likely" class for a test input x* will be (dropping θ from the notation)

y* = arg max_k p(y* = k | x*) = arg max_k p(y* = k) p(x* | y* = k) / p(x*) = arg max_k p(y* = k) p(x* | y* = k)

(the denominator p(x*) does not depend on k, so it can be dropped from the arg max)

If p(y = k) is the same for all classes, then we simply compare the class-conditionals p(x* | y* = k)
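A sketch of this prediction rule with Gaussian class-conditionals, comparing classes via log p(y = k) + log p(x* | y = k) for numerical stability. The class priors, means, and covariances below are illustrative assumptions, as if already estimated.

import numpy as np
from scipy.stats import multivariate_normal

# Assumed, already-estimated parameters for K = 2 classes in D = 2 dimensions
pi = np.array([0.4, 0.6])                          # class priors p(y = k)
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]

def predict(x_star):
    # Compare log p(y = k) + log p(x* | y = k); the evidence p(x*) is common to all k
    scores = [np.log(pi[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(x_star)
              for k in range(len(pi))]
    return int(np.argmax(scores))

print(predict(np.array([1.8, 2.1])))               # likely class 1
print(predict(np.array([-0.2, 0.1])))              # likely class 0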
Generative Classification using Gaussian Class-conditionals

Recall our generative classification model: p(y = k | x) = p(y = k) p(x | y = k) / p(x)

Let's model each class-conditional by a Gaussian: p(x | y = k) = N(x | µ_k, Σ_k), with class prior p(y = k) = π_k

Parameters θ = {π_k, µ_k, Σ_k}_{k=1}^K can be estimated using the MLE/MAP/Bayesian approach

We already saw the estimation of the π_k's; (µ_k, Σ_k) can be found via standard Gaussian parameter estimation
Decision Boundaries

The generative classification prediction rule we saw uses

p(y = k | x, θ) = π_k |Σ_k|^{−1/2} exp[−½ (x − µ_k)⊤ Σ_k⁻¹ (x − µ_k)] / ∑_{ℓ=1}^K π_ℓ |Σ_ℓ|^{−1/2} exp[−½ (x − µ_ℓ)⊤ Σ_ℓ⁻¹ (x − µ_ℓ)]

The decision boundary between any pair of classes will be.. a quadratic curve

Reason: For any two classes k and k′, at the decision boundary p(y = k | x) = p(y = k′ | x). Thus

(x − µ_k)⊤ Σ_k⁻¹ (x − µ_k) − (x − µ_k′)⊤ Σ_k′⁻¹ (x − µ_k′) = 0   (ignoring terms that don't depend on x)

.. defines the decision boundary, which is a quadratic function of x (this model is popularly known as Quadratic Discriminant Analysis)
Let's again consider the generative classification prediction rule with Gaussian class-conditionals

p(y = k | x, θ) = π_k |Σ_k|^{−1/2} exp[−½ (x − µ_k)⊤ Σ_k⁻¹ (x − µ_k)] / ∑_{ℓ=1}^K π_ℓ |Σ_ℓ|^{−1/2} exp[−½ (x − µ_ℓ)⊤ Σ_ℓ⁻¹ (x − µ_ℓ)]

Let's assume all classes to have the same covariance (i.e., same shape/size), i.e., Σ_k = Σ, ∀k

Now the decision boundary between any pair of classes will be.. linear

Reason: For any two classes k and k′, at the decision boundary p(y = k | x) = p(y = k′ | x). Thus

(x − µ_k)⊤ Σ⁻¹ (x − µ_k) − (x − µ_k′)⊤ Σ⁻¹ (x − µ_k′) = 0   (ignoring terms that don't depend on x)

.. the terms quadratic in x cancel out in this case and we get a linear function of x (this model is popularly known as Linear or "Fisher" Discriminant Analysis); the cancellation is spelled out below
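Expanding the two quadratic forms makes the cancellation explicit (a short check that follows directly from the expression above):

\begin{aligned}
&(x-\mu_k)^\top \Sigma^{-1} (x-\mu_k) - (x-\mu_{k'})^\top \Sigma^{-1} (x-\mu_{k'}) \\
&= \big(x^\top \Sigma^{-1} x - 2\mu_k^\top \Sigma^{-1} x + \mu_k^\top \Sigma^{-1} \mu_k\big)
 - \big(x^\top \Sigma^{-1} x - 2\mu_{k'}^\top \Sigma^{-1} x + \mu_{k'}^\top \Sigma^{-1} \mu_{k'}\big) \\
&= 2(\mu_{k'} - \mu_k)^\top \Sigma^{-1} x + \mu_k^\top \Sigma^{-1} \mu_k - \mu_{k'}^\top \Sigma^{-1} \mu_{k'}
\end{aligned}

which is linear in x, so setting it to zero gives a hyperplane.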
Depending on the form of the covariance matrices, the boundaries can be quadratic or linear
A Closer Look at the Linear Case

For the linear case (when Σ_k = Σ), we have

p(y = k | x, θ) ∝ π_k exp[−½ (x − µ_k)⊤ Σ⁻¹ (x − µ_k)]

Expanding and dropping the x⊤Σ⁻¹x term (it is the same for every class and cancels in the normalization), this can be written as

p(y = k | x, θ) = exp(w_k⊤x + b_k) / ∑_{ℓ=1}^K exp(w_ℓ⊤x + b_ℓ)

where w_k = Σ⁻¹µ_k and b_k = log π_k − ½ µ_k⊤ Σ⁻¹ µ_k

Interestingly, this has exactly the same form as the softmax classification model (saw it in the last class), which is a discriminative model, as opposed to a generative model
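A quick numerical sanity check of this equivalence (all parameter values below are made up): the generative posterior with shared Σ matches a softmax over w_k⊤x + b_k.

import numpy as np

rng = np.random.default_rng(0)
K, D = 3, 2
pi = np.array([0.2, 0.5, 0.3])
mus = rng.normal(size=(K, D))
A = rng.normal(size=(D, D))
Sigma = A @ A.T + np.eye(D)              # a shared, positive-definite covariance
Sigma_inv = np.linalg.inv(Sigma)
x = rng.normal(size=D)

# Generative posterior p(y = k | x) with shared covariance (common factors cancel)
gen = np.array([pi[k] * np.exp(-0.5 * (x - mus[k]) @ Sigma_inv @ (x - mus[k]))
                for k in range(K)])
gen /= gen.sum()

# Softmax form with w_k = Sigma^{-1} mu_k and b_k = log pi_k - 0.5 mu_k^T Sigma^{-1} mu_k
w = mus @ Sigma_inv                      # row k is w_k (Sigma_inv is symmetric)
b = np.log(pi) - 0.5 * np.einsum('kd,dj,kj->k', mus, Sigma_inv, mus)
scores = w @ x + b
soft = np.exp(scores - scores.max())
soft /= soft.sum()

print(np.allclose(gen, soft))            # True: the two forms agree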
A Very Special Case: Prototype based Classification

We can get a non-probabilistic analogy for the Gaussian generative classification model

In particular, with equal class priors and a shared spherical covariance Σ_k = σ²I_D, the prediction rule reduces to assigning x to the class with the nearest mean, y = arg min_k ‖x − µ_k‖², i.e., the class means act as prototypes
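A minimal sketch of this nearest class-mean (prototype) rule, under the assumptions just stated (the prototypes below are made up):

import numpy as np

# Class prototypes: the per-class means mu_k (made up for illustration)
prototypes = np.array([[0.0, 0.0],
                       [3.0, 0.5],
                       [1.0, 4.0]])

def nearest_mean_class(x):
    # Assign x to the class whose mean is closest in Euclidean distance
    dists = np.linalg.norm(prototypes - x, axis=1)
    return int(np.argmin(dists))

print(nearest_mean_class(np.array([2.7, 0.2])))   # -> 1
print(nearest_mean_class(np.array([0.4, 3.5])))   # -> 2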
Generative Classification: Some Comments

Assuming a shared and/or diagonal covariance for each Gaussian can reduce the number of parameters

.. as can, more generally, the "naïve" assumption that features are independent given the class

A good density estimation model is necessary for a generative classification model to work well
Probabilistic Models for Supervised Learning: Wrapping Up

Both discriminative and generative models estimate the conditional distribution p(y | x)

Note that both are basically doing density estimation for learning p(y | x), but in different ways

Discriminative models directly model p(y | x) via some parameters (say w, if using a linear model)

Generative models get to p(y | x) indirectly, by modeling the class-marginal p(y) and the class-conditional p(x | y) and applying Bayes rule