Lecture 11 - Probabilistic ML (3) - Parameter Estimation
Note: These distributions are sometimes assumed to have a specific form, but sometimes not.
In parameter estimation, the goal is to find the “best” parameters $\theta$ given observed data.
Note: Instead of finding a single best $\theta$, it may sometimes be more informative to learn a distribution over $\theta$ (it can tell us about the uncertainty in our estimate of $\theta$; more on this later).
Maximum Likelihood Estimation (MLE)
The goal in MLE is to find the optimal $\theta$ by maximizing the likelihood $p(\mathbf{y}|\theta)$.
In practice, we maximize the log of the likelihood (log-likelihood, in short), $\log p(\mathbf{y}|\theta)$. Taking the log doesn’t affect the optima, since log is a monotonic function.
This leads to simpler algebra/calculus, and also yields better numerical stability when implementing it on a computer (we deal with logs of probabilities rather than products of small probabilities).
Assuming the observations $\mathbf{y} = \{y_1, \ldots, y_N\}$ are independent given $\theta$, the log-likelihood is
$$LL(\theta) = \log p(\mathbf{y}|\theta) = \log \prod_{n=1}^{N} p(y_n|\theta) = \sum_{n=1}^{N} \log p(y_n|\theta)$$
Thus the MLE problem is
$$\theta_{opt} = \theta_{MLE} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(y_n|\theta)$$
Thus MLE can also be seen as minimizing the negative log-likelihood (NLL):
$$\theta_{MLE} = \arg\min_{\theta} NLL(\theta), \quad \text{where } NLL(\theta) = -\sum_{n=1}^{N} \log p(y_n|\theta)$$
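To make this concrete, here is a minimal sketch (my own, not from the lecture) of MLE as numerical NLL minimization, assuming Gaussian observations with unknown mean and unit variance, and using scipy.optimize.minimize:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: N i.i.d. observations, assumed Gaussian with unknown mean.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=100)

def nll(theta):
    # Negative log-likelihood of N(theta, 1), up to an additive constant
    # that does not affect the location of the optimum.
    return 0.5 * np.sum((y - theta) ** 2)

# MLE = argmin of the NLL; for this model it is just the sample mean.
result = minimize(nll, x0=np.array([0.0]))
print(result.x[0], y.mean())  # the two should (approximately) agree
```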
NLL is analogous to a loss function: the negative log-likelihood $-\log p(y_n|\theta)$ is akin to the loss on each data point.
Thus doing MLE is akin to minimizing training loss. Does that mean MLE could overfit? If so, how do we prevent this?
Indeed, it may overfit. There are several ways to prevent it: use a regularizer or other strategies to prevent overfitting, or alternatively, use “prior” distributions on the parameters that we are trying to estimate (which will act as a kind of regularizer, as we will see shortly). Such priors have various other benefits, as we will see later.
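To illustrate the NLL-as-loss view, a small sketch (my own; it assumes a Bernoulli likelihood, which the example on the next slide uses) showing that the per-point negative log-likelihood is exactly the binary cross-entropy (“log loss”) used in classification:

```python
import numpy as np

def per_point_nll(y_n, theta):
    # -log p(y_n | theta) for a Bernoulli observation:
    # -[y_n * log(theta) + (1 - y_n) * log(1 - theta)],
    # i.e., the binary cross-entropy loss on that data point.
    return -(y_n * np.log(theta) + (1 - y_n) * np.log(1 - theta))

# The total NLL is the sum of per-point losses, so MLE = minimizing training loss.
y = np.array([1, 0, 0, 1, 1])
print(per_point_nll(y, 0.6).sum())
```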
MLE: An Example
Consider a sequence of $N$ coin-toss outcomes (observations) $y_1, \ldots, y_N$.
Each observation $y_n$ is a binary random variable. Head: $y_n = 1$, Tail: $y_n = 0$.
Each $y_n$ is assumed generated by a Bernoulli distribution with parameter $\theta$.
Here the unknown parameter is $\theta$ (the probability of a head). We want to estimate it using MLE.
Log-likelihood: $LL(\theta) = \sum_{n=1}^{N} \log p(y_n|\theta) = \sum_{n=1}^{N} \left[ y_n \log\theta + (1 - y_n)\log(1 - \theta) \right]$
Easy optimization: take the derivative, set it to zero, and solve.
Maximizing the log-likelihood (or minimizing the NLL) w.r.t. $\theta$ gives a closed-form expression.
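As a worked version of that derivative step (a reconstruction; the slide only sketches it):

$$\frac{\partial LL(\theta)}{\partial \theta} = \sum_{n=1}^{N}\left[\frac{y_n}{\theta} - \frac{1 - y_n}{1 - \theta}\right] = 0 \;\Longrightarrow\; (1 - \theta)\sum_{n=1}^{N} y_n = \theta\left(N - \sum_{n=1}^{N} y_n\right) \;\Longrightarrow\; \theta_{MLE} = \frac{1}{N}\sum_{n=1}^{N} y_n$$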
Thus the MLE solution is
$$\theta_{MLE} = \frac{1}{N}\sum_{n=1}^{N} y_n$$
i.e., simply the fraction of heads! Makes intuitive sense!
But consider: I tossed a coin 5 times; it gave 1 head and 4 tails. Does that mean $\theta = 0.2$?? The MLE approach says so. What if I see 0 heads and 5 tails? Does that mean $\theta = 0$?
Indeed, if you want to trust the MLE solution. But with a small number of training observations, MLE may overfit and may not be reliable. We will soon see better alternatives that use prior distributions!
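A minimal sketch (my own illustration, not from the slides) of the Bernoulli MLE and its small-sample brittleness:

```python
import numpy as np

def bernoulli_mle(y):
    # MLE for the Bernoulli parameter: the fraction of heads (y_n = 1).
    return np.asarray(y).mean()

# The scenarios from the slide: 5 tosses each.
print(bernoulli_mle([1, 0, 0, 0, 0]))  # 1 head, 4 tails -> 0.2
print(bernoulli_mle([0, 0, 0, 0, 0]))  # 0 heads -> 0.0: claims heads are impossible!

# With many tosses of a fair coin, the estimate becomes much more reliable.
rng = np.random.default_rng(0)
print(bernoulli_mle(rng.integers(0, 2, size=10_000)))  # close to 0.5
```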
Coming up next
Prior distributions and their role in parameter estimation
Maximum-a-Posteriori (MAP) Estimation
Fully Bayesian inference
Probabilistic modeling for regression and classification problems