
Probabilistic Machine Learning (3):

Parameter Estimation via Maximum Likelihood

CS771: Introduction to Machine Learning


Piyush Rai
Probabilistic ML: Some Motivation
In many ML problems, we want to model and reason about data probabilistically, e.g., computing $p(y = \text{red} \mid \boldsymbol{x})$ and $p(y = \text{green} \mid \boldsymbol{x})$ for an input $\boldsymbol{x}$ in a classification task

At a high level, this is the density estimation view of ML, e.g.,


Given input-output pairs $\{(\boldsymbol{x}_n, y_n)\}_{n=1}^{N}$, estimate the conditional distribution $p(y \mid \boldsymbol{x})$
Given inputs $\{\boldsymbol{x}_n\}_{n=1}^{N}$, estimate the distribution $p(\boldsymbol{x})$ of the inputs

Note 1: These distributions will depend on some parameters $\theta$ (to be estimated), and are written as $p(y \mid \boldsymbol{x}, \theta)$ and $p(\boldsymbol{x} \mid \theta)$

Note 2: These distributions are sometimes assumed to have a specific form, but sometimes not

Assuming the form of the distribution to be known, the goal in estimation is to use the observed data to estimate the parameters of these distributions
Probabilistic Modeling: The Basic Idea
Assume $N$ observations $\boldsymbol{y} = \{y_1, y_2, \ldots, y_N\}$, generated from a presumed probabilistic model $y_n \sim p(y \mid \theta)$, assumed independently & identically distributed (i.i.d.)

Here $p(y \mid \theta)$ is a conditional distribution, conditioned on the parameters $\theta$ (to be learned)


Note: $\theta$ may be a fixed unknown or an unknown random variable (we will study both cases)
[Plate-notation diagram: parameter $\theta$ generating each observation $y_n$, with the plate replicated over $n = 1, \ldots, N$. Such diagrams are usually called the “plate notation”.]
Side note: The parameters $\theta$ may themselves depend on other unknown/known parameters (called hyperparameters), which may depend on other unknowns, and so on. This is essentially “hierarchical” modeling (will see various examples later).
Side note: The predictive distribution tells us how likely each possible value of a new observation is. Example: if $y_n$ denotes the outcome of a coin toss, then what is $p(y_{N+1} \mid y_1, \ldots, y_N)$, given previous coin tosses.

 Some of the tasks that we may be interested in


Parameter estimation: Estimating the unknown parameters $\theta$ (and the other unknowns $\theta$ depends on)
Prediction: Estimating the predictive distribution of new data, i.e., $p(y_{N+1} \mid \boldsymbol{y})$ - this is also a conditional distribution (conditioned on the past data $\boldsymbol{y}$, as well as $\theta$ and other things)
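To make the generative story above concrete, here is a minimal sketch of drawing i.i.d. observations from an assumed model $p(y \mid \theta)$, taking the coin-toss example from the side note (a Bernoulli distribution); the particular value of $\theta$ and the number of tosses are arbitrary illustrative choices, not part of the slides.

```python
# Minimal sketch of the generative view: observations y_1, ..., y_N drawn i.i.d.
# from a presumed model p(y | theta), here taken to be Bernoulli(theta).
import numpy as np

rng = np.random.default_rng(0)

theta_true = 0.7                          # the (in practice unknown) parameter: P(head)
N = 10                                    # number of observations
y = rng.binomial(1, theta_true, size=N)   # i.i.d. draws y_n ~ Bernoulli(theta_true)

print(y)   # a sequence of 0/1 coin-toss outcomes; estimating theta from y is the task ahead
```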
Parameter Estimation in Probabilistic Models
Since the data is assumed to be i.i.d., we can write down its total probability as
$$p(\boldsymbol{y} \mid \theta) = \prod_{n=1}^{N} p(y_n \mid \theta)$$
$p(\boldsymbol{y} \mid \theta)$ is called the “likelihood” - the probability of the observed data as a function of the parameters $\theta$

How do I find the best $\theta$? Well, one option is to find the $\theta$ that maximizes the likelihood (the probability of the observed data) - basically, which value of $\theta$ makes the observed data most likely to have come from the assumed distribution. This is Maximum Likelihood Estimation (MLE):
$$\theta_{opt} = \theta_{MLE} = \arg\max_{\theta} \; p(\boldsymbol{y} \mid \theta)$$
This is now essentially an optimization problem ($\theta$ being the unknown)

In parameter estimation, the goal is to find the “best” $\theta$ given observed data
Note: Instead of finding a single best $\theta$, sometimes it may be more informative to learn a distribution for $\theta$ (can tell us about the uncertainty in our estimate of $\theta$ - more later)
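As a rough illustration of “which value of $\theta$ makes the observed data most likely”, the sketch below evaluates the i.i.d. likelihood of some coin-toss data over a grid of candidate $\theta$ values and keeps the maximizer. The Bernoulli model, the particular data, and the grid are illustrative assumptions rather than anything prescribed by the slides.

```python
# Grid-search illustration of maximum likelihood: evaluate the likelihood
# p(y | theta) = prod_n p(y_n | theta) for many candidate theta and keep the best.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])    # hypothetical observed coin tosses

thetas = np.linspace(0.01, 0.99, 99)            # candidate parameter values
# Bernoulli probability of each observation under each candidate theta
per_point = thetas[None, :] ** y[:, None] * (1 - thetas[None, :]) ** (1 - y[:, None])
likelihood = per_point.prod(axis=0)             # product over the i.i.d. observations

theta_best = thetas[np.argmax(likelihood)]
print(theta_best)                               # approximately 0.7, the fraction of heads in y
```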
Maximum Likelihood Estimation (MLE)
The goal in MLE is to find the optimal $\theta$ by maximizing the likelihood $p(\boldsymbol{y} \mid \theta)$
 In practice, we maximize the log of the likelihood (log-likelihood in short)
Taking the log doesn’t affect the optima since log is a monotonic function
Leads to simpler algebra/calculus, and also yields better numerical stability when implementing it on a computer (dealing with logs of probabilities)
$$LL(\theta) = \log p(\boldsymbol{y} \mid \theta) = \log \prod_{n=1}^{N} p(y_n \mid \theta) = \sum_{n=1}^{N} \log p(y_n \mid \theta)$$
Thus the MLE problem is
$$\theta_{opt} = \theta_{MLE} = \arg\max_{\theta} \; LL(\theta) = \arg\max_{\theta} \sum_{n=1}^{N} \log p(y_n \mid \theta)$$
This is now an optimization (maximization) problem. Note: $\theta$ may have constraints


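The numerical-stability point above can be seen in a small sketch: for even a moderately sized dataset, the raw product of per-observation probabilities underflows to zero in floating point, while the sum of log-probabilities stays finite. The Bernoulli model, the parameter value, and the sample size below are illustrative assumptions, not part of the slides.

```python
# Compare the raw likelihood (a product of many small probabilities) with the
# log-likelihood (a sum of log-probabilities) for an i.i.d. Bernoulli model.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.6                                   # assumed parameter value (illustrative)
y = rng.binomial(1, theta, size=2000)         # 2000 hypothetical coin-toss outcomes

per_point = theta ** y * (1 - theta) ** (1 - y)   # p(y_n | theta) for each n
print(per_point.prod())                           # underflows to 0.0 in float64
print(np.log(per_point).sum())                    # finite (on the order of -1300)
```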
Maximum Likelihood Estimation (MLE): Negative Log-Likelihood (NLL)
The MLE problem can also be easily written as a minimization problem
$$\theta_{MLE} = \arg\min_{\theta} \; -\sum_{n=1}^{N} \log p(y_n \mid \theta) = \arg\min_{\theta} \; NLL(\theta)$$

 Thus MLE can also be seen as minimizing the negative log-likelihood (NLL)
NLL is analogous to a loss function
The negative log-likelihood $-\log p(y_n \mid \theta)$ is akin to the loss on each data point
Thus doing MLE is akin to minimizing training loss

Does it mean MLE could overfit? If so, how to prevent this? Indeed, it may overfit. There are several ways to prevent it: use a regularizer or other strategies to prevent overfitting. Alternatively, use “prior” distributions on the parameters that we are trying to estimate (which will kind of act as a regularizer, as we will see shortly). Such priors have various other benefits, as we will see later.
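To connect this with the loss-minimization view, the sketch below treats the per-point negative log-likelihood of a Bernoulli model as a loss and minimizes the total NLL numerically. The use of scipy.optimize.minimize_scalar and the toy data are illustrative choices, not something the slides prescribe.

```python
# Treat -log p(y_n | theta) as a per-point loss and minimize the total NLL.
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])   # hypothetical coin-toss data

def nll(theta):
    # Total negative log-likelihood of i.i.d. Bernoulli(theta) observations,
    # i.e., the sum of the per-point losses -log p(y_n | theta).
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

res = minimize_scalar(nll, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, y.mean())   # the numerical minimizer is (approximately) the fraction of heads
```

For a Bernoulli model, this per-point loss is exactly the log loss (binary cross-entropy) commonly used as a training loss for classifiers.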
MLE: An Example
Consider a sequence of $N$ coin-toss outcomes (observations) $y_1, y_2, \ldots, y_N$
Each observation $y_n$ is a binary random variable. Head: $y_n = 1$, Tail: $y_n = 0$
Each $y_n$ is assumed generated by a Bernoulli distribution with parameter $\theta$ (the probability of a head): $p(y_n \mid \theta) = \theta^{y_n} (1 - \theta)^{1 - y_n}$

Here $\theta$ is the unknown parameter (the probability of a head). We want to estimate it using MLE
Log-likelihood: $LL(\theta) = \sum_{n=1}^{N} \log p(y_n \mid \theta) = \sum_{n=1}^{N} \big[\, y_n \log \theta + (1 - y_n) \log (1 - \theta) \,\big]$

Maximizing the log-likelihood (or minimizing the NLL) w.r.t. $\theta$ gives a closed-form expression: just take the derivative, set it to zero, and solve - an easy optimization (a worked derivation follows below)
$$\theta_{MLE} = \frac{\sum_{n=1}^{N} y_n}{N}$$
Thus the MLE solution is simply the fraction of heads! Makes intuitive sense!

I tossed a coin 5 times - it gave 1 head and 4 tails. Does it mean $\theta = 0.2$?? The MLE approach says so. What if I see 0 heads and 5 tails? Does it mean $\theta = 0$? Indeed - if you want to trust the MLE solution. But with a small number of training observations, MLE may overfit and may not be reliable. We will soon see better alternatives that use prior distributions!
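The “take the derivative, set it to zero, and solve” step, written out for the Bernoulli log-likelihood above (a short worked derivation):

```latex
% Worked derivation of the Bernoulli MLE: differentiate LL(theta) and set to zero
\begin{align*}
LL(\theta) &= \sum_{n=1}^{N} \Big[\, y_n \log \theta + (1 - y_n) \log (1 - \theta) \,\Big] \\
\frac{\partial LL(\theta)}{\partial \theta}
  &= \frac{\sum_{n=1}^{N} y_n}{\theta} \;-\; \frac{N - \sum_{n=1}^{N} y_n}{1 - \theta} \;=\; 0 \\
\Rightarrow \;\; (1 - \theta) \sum_{n=1}^{N} y_n &= \theta \Big( N - \sum_{n=1}^{N} y_n \Big)
\;\;\Rightarrow\;\; \theta_{MLE} = \frac{1}{N} \sum_{n=1}^{N} y_n
\end{align*}
```

The stationary point is indeed the maximizer here, since $LL(\theta)$ is concave in $\theta$ (a sum of concave $\log \theta$ and $\log(1-\theta)$ terms).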
Coming up next
 Prior distributions and their role in parameter estimation
 Maximum-a-Posteriori (MAP) Estimation
 Fully Bayesian inference
 Probabilistic modeling for regression and classification problems

