
Probabilistic Machine Learning (4):

Parameter Estimation: MAP and Bayesian Inference

CS771: Introduction to Machine Learning


Piyush Rai
MLE and Its Shortcomings..
 MLE finds parameters that make the observed data most probable
$$\theta_{\text{MLE}} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(y_n \mid \theta) \;=\; \arg\min_{\theta} \sum_{n=1}^{N} -\log p(y_n \mid \theta)$$
(the maximized quantity is the log-likelihood; the minimized quantity is the negative log-likelihood, NLL)

 No provision to control overfitting (MLE is just like minimizing training loss)

 How do we regularize probabilistic models in a principled way?


 Also, MLE gives only a single "best" answer (a "point estimate")
 .. and it may not be very reliable, especially when we have very little data
 Desirable: report a probability distribution over the learned parameters instead of a point estimate; this distribution can give us a sense of the uncertainty in the parameter estimate

 Prior distributions provide a nice way to accomplish such things!
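To make the earlier point that MLE is just like minimizing a training loss (the NLL) concrete, here is a minimal Python sketch (the tosses are hypothetical, not from the slides) that recovers a coin's bias by a crude grid search over the NLL and checks it against the closed-form answer N1/N:

```python
# A minimal sketch (hypothetical tosses): MLE for a coin's bias by minimizing
# the Bernoulli negative log-likelihood (NLL) over a grid of candidate values.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # 1 = head, 0 = tail

def nll(theta, y):
    # NLL(theta) = -sum_n [ y_n log(theta) + (1 - y_n) log(1 - theta) ]
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)      # crude grid search over theta
theta_mle = thetas[np.argmin([nll(t, y) for t in thetas])]

print(theta_mle, y.mean())                # both close to N1/N, the closed-form MLE
```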


Priors
 Can specify our prior belief about likely param values via a prob. dist., e.g.,
[Figure: a density p(θ) over θ ∈ [0, 1], a possible prior for the coin-bias estimation problem. The unknown θ is being treated as a random variable, not simply a fixed unknown as we treated it in MLE.]
Note: This is a rather simplistic/contrived prior, just to illustrate the basic idea; we will see more concrete examples of priors shortly. Also, the prior usually depends on (is assumed conditioned on) some fixed/learnable hyperparameters (say some α and β, and is then written as p(θ | α, β)).
 Once we observe the data 𝒚, apply Bayes' rule to update the prior into the posterior
$$\text{Posterior:}\quad p(\theta \mid \boldsymbol{y}) = \frac{p(\theta)\, p(\boldsymbol{y} \mid \theta)}{p(\boldsymbol{y})} = \frac{\text{prior} \times \text{likelihood}}{\text{marginal likelihood}}$$
Note: the marginal likelihood is hard to compute in general, as it requires a summation or integral which may not be easy (we will briefly look at this in CS771, although we will stay away from going too deep in this course; CS775 does that in more detail).
 Two ways now to report the answer:
 Report the maxima (mode) of the posterior: this is maximum-a-posteriori (MAP) estimation
 Report the full posterior (and its properties, e.g., mean, mode, variance, quantiles, etc.): this is fully Bayesian inference
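As a rough illustration of the Bayes' rule update above, here is a sketch with assumed data that discretizes θ on a grid (rather than using any closed form) and computes the posterior numerically as prior times likelihood, normalized:

```python
# A minimal sketch (hypothetical tosses, grid-discretized theta): Bayes' rule
# computed numerically, posterior ∝ prior × likelihood, normalized by p(y).
import numpy as np

thetas = np.linspace(0.01, 0.99, 99)          # candidate values of the coin bias
prior = np.ones_like(thetas) / len(thetas)    # uniform prior over the grid

y = np.array([1, 1, 0, 1])                    # observed tosses (1 = head)
likelihood = np.array([np.prod(t**y * (1 - t)**(1 - y)) for t in thetas])

posterior = prior * likelihood                # numerator of Bayes' rule
posterior /= posterior.sum()                  # divide by (grid estimate of) p(y)

theta_map = thetas[np.argmax(posterior)]      # mode of the posterior (MAP estimate)
posterior_mean = np.sum(thetas * posterior)   # a property of the full posterior
```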
Posterior
 Posterior distribution tells us how probable different parameter values are after
we have observed some data
 Height of the posterior at each value gives the posterior probability of that value
[Figure: p(θ | 𝒚) plotted against θ ∈ [0, 1]; values of θ where the posterior is high are more likely, values where it is low are less likely.]
 Can think of the posterior as a "hybrid" obtained by combining information from the likelihood and the prior
Maximum-a-Posteriori (MAP) Estimation
 The MAP estimation approach reports the maxima/mode of the posterior
$$\theta_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid \boldsymbol{y}) = \arg\max_{\theta}\, \log p(\theta \mid \boldsymbol{y}) = \arg\max_{\theta}\, \log \frac{p(\theta)\, p(\boldsymbol{y} \mid \theta)}{p(\boldsymbol{y})}$$
 Since p(𝒚) is constant w.r.t. θ, the above simplifies to
$$\theta_{\text{MAP}} = \arg\max_{\theta} \left[ \log p(\boldsymbol{y} \mid \theta) + \log p(\theta) \right] = \arg\min_{\theta} \left[ -\log p(\boldsymbol{y} \mid \theta) - \log p(\theta) \right] = \arg\min_{\theta} \left[ \mathrm{NLL}(\theta) - \log p(\theta) \right]$$
Note: the NLL term acts like the training loss and the (negative) log-prior acts as a regularizer. Keep this analogy in mind. 

 Same as MLE with an extra log-prior-distribution term (acts as a regularizer) 


 If the prior is absent or uniform (all values equally likely a priori), then MAP = MLE
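A minimal sketch of the "NLL + regularizer" reading of MAP above (the data and the Beta prior hyperparameters below are assumed for illustration):

```python
# A minimal sketch (hypothetical data and prior): the MAP objective is the NLL
# (training loss) plus the negative log-prior (regularizer), here for a
# Bernoulli likelihood and a Beta(a, b) prior on the coin bias theta.
import numpy as np

def nll(theta, y):                            # Bernoulli NLL (the "training loss")
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

def neg_log_prior(theta, a=2.0, b=2.0):       # Beta prior, up to an additive constant
    return -((a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta))

def map_objective(theta, y):
    return nll(theta, y) + neg_log_prior(theta)   # loss + regularizer

y = np.array([1, 1, 1, 0, 1])
thetas = np.linspace(0.01, 0.99, 99)
theta_map = thetas[np.argmin([map_objective(t, y) for t in thetas])]
print(theta_map)   # with a = b = 1 the regularizer vanishes and MAP equals MLE
```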
MAP Estimation: An Example
 Let’s again consider the coin-toss problem (estimating the bias of the coin)
 Each likelihood term is Bernoulli
$$p(y_n \mid \theta) = \text{Bernoulli}(y_n \mid \theta) = \theta^{y_n}\,(1-\theta)^{1-y_n}$$

 Also need a prior since we want to do MAP estimation


 Since θ ∈ [0, 1], a reasonable choice of prior for θ would be the Beta distribution
$$p(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, \theta^{\alpha-1}\,(1-\theta)^{\beta-1}$$
Here Γ is the gamma function, and α and β (both non-negative reals) are the two hyperparameters of this Beta prior. Using α = 1 and β = 1 makes the Beta prior a uniform prior. Can set these based on intuition, cross-validation, or even learn them.
MAP Estimation: An Example (Contd)
 The log posterior for the coin-toss model is log-lik + log-prior
$$LP(\theta) = \sum_{n=1}^{N} \log p(y_n \mid \theta) + \log p(\theta \mid \alpha, \beta)$$

 Plugging in the expressions for Bernoulli and Beta and ignoring any terms that don't depend on θ, the log posterior simplifies to
$$LP(\theta) = \sum_{n=1}^{N} \left[ y_n \log\theta + (1-y_n)\log(1-\theta) \right] + (\alpha-1)\log\theta + (\beta-1)\log(1-\theta)$$

 Maximizing the above log posterior (or minimizing its negative) w.r.t. θ gives
$$\theta_{\text{MAP}} = \frac{\sum_{n=1}^{N} y_n + \alpha - 1}{N + \alpha + \beta - 2}$$
Using α = 1 and β = 1 gives us the same solution as MLE; recall that α = 1, β = 1 for the Beta distribution is in fact equivalent to a uniform prior (hence making MAP equivalent to MLE).
The prior's hyperparameters have an interesting interpretation: can think of α - 1 and β - 1 as the number of heads and tails, respectively, seen before starting the coin-toss experiment (akin to "pseudo-observations"). Such interpretations of the prior's hyperparameters as "pseudo-observations" exist for various other prior distributions as well (in particular, distributions belonging to the "exponential family" of distributions).
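As a sanity check, here is the maximization step not spelled out above: setting the derivative of LP(θ) to zero, with $N_1 = \sum_{n=1}^{N} y_n$, gives exactly this closed form.

```latex
\frac{d\,LP(\theta)}{d\theta}
  = \frac{N_1 + \alpha - 1}{\theta} - \frac{N - N_1 + \beta - 1}{1 - \theta} = 0
\;\Rightarrow\; (N_1 + \alpha - 1)(1 - \theta) = (N - N_1 + \beta - 1)\,\theta
\;\Rightarrow\; \theta_{\text{MAP}} = \frac{N_1 + \alpha - 1}{N + \alpha + \beta - 2}
```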
Fully Bayesian Inference
 MLE/MAP only give us a point estimate of θ
The MAP estimate is more robust than MLE (due to the regularization effect), but the estimate of uncertainty is missing in both approaches; both just return a single "optimal" solution by solving an optimization problem.
Interesting fact to keep in mind: the use of the prior is making the MLE solution move towards the prior (the MAP solution is kind of a "compromise" between the MLE solution and the mode of the prior). 

 If we want more than just a point estimate, we can compute the full posterior
$$p(\theta \mid \boldsymbol{y}) = \frac{p(\theta)\, p(\boldsymbol{y} \mid \theta)}{p(\boldsymbol{y})}$$
This is computable analytically only when the prior and likelihood are "friends" with each other, i.e., they form a conjugate pair of distributions (distributions from the exponential family have conjugate pairs). An example: Bernoulli and Beta are conjugate; we will see some more such pairs. In other cases, the posterior needs to be approximated (will see 1-2 such cases in this course; more detailed treatment in the advanced course on probabilistic modeling and inference).
“Online” Nature of Bayesian Inference
 Fully Bayesian inference fits naturally into an "online" learning setting
Note: the posterior becomes more and more "concentrated" as the number of observations increases. For very large N, you may expect it to be peaked around the MLE solution.
 Our belief about θ keeps getting updated as we see more and more data
Fully Bayesian Inference: An Example
 Let's again consider the coin-toss problem
 Bernoulli likelihood: $p(y_n \mid \theta) = \theta^{y_n}(1-\theta)^{1-y_n}$
 Beta prior: $p(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$
 Number of heads: $N_1 = \sum_{n=1}^{N} y_n$; number of tails: $N_0 = N - N_1$
 The posterior can be computed as
$$p(\theta \mid \boldsymbol{y}) = \frac{p(\theta \mid \alpha,\beta)\,\prod_{n=1}^{N} p(y_n \mid \theta)}{p(\boldsymbol{y})} \;\propto\; \theta^{\alpha-1}(1-\theta)^{\beta-1}\,\theta^{N_1}(1-\theta)^{N_0} \;=\; \theta^{\alpha+N_1-1}\,(1-\theta)^{\beta+N_0-1}$$
Here we have kept only the θ-dependent parts coming from the numerator, ignoring the other constants in the numerator and the whole denominator, which is also constant w.r.t. θ. The denominator is just the numerator integrated/marginalized over θ; in general it is hard to compute, but with conjugate pairs of prior and likelihood we don't need to compute it, as we see in this example.
Aha! This is nothing but Beta(θ | α + N₁, β + N₀): we found the posterior just by simple inspection, without having to calculate the constant of proportionality. 
The posterior is the same distribution as the prior (both Beta), just with updated hyperparameters; this is a property of the likelihood and prior being conjugate to each other. Also, if you get more observations, you can treat the current posterior as the new prior and obtain a new posterior using these extra observations.
This, of course, is not always possible, but only in simple cases like this.
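A small sketch of this conjugate update (the prior hyperparameters and tosses below are assumed), also showing the "online" view from the previous slide, where each posterior becomes the prior for the next observation:

```python
# A minimal sketch (hypothetical prior and tosses): the Bernoulli-Beta conjugate
# update. The posterior is Beta(alpha + N1, beta + N0), and applying the update
# one toss at a time (posterior becomes the next prior) gives the same answer.
import numpy as np

alpha, beta = 2.0, 2.0                         # assumed prior hyperparameters
y = np.array([1, 0, 1, 1, 1, 0, 1])            # observed tosses

N1 = int(y.sum())                              # number of heads
N0 = len(y) - N1                               # number of tails
post_alpha, post_beta = alpha + N1, beta + N0  # batch update

a, b = alpha, beta                             # online update, one toss at a time
for yn in y:
    a, b = a + yn, b + (1 - yn)

assert a == post_alpha and b == post_beta      # same posterior either way
```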
Conjugacy
 Many pairs of distributions are conjugate to each other
 Bernoulli (likelihood) + Beta (prior) ⇒ Beta posterior
 Binomial (likelihood) + Beta (prior) ⇒ Beta posterior
 Multinomial (likelihood) + Dirichlet (prior) ⇒ Dirichlet posterior
 Poisson (likelihood) + Gamma (prior) ⇒ Gamma posterior
 Gaussian (likelihood) + Gaussian (prior) ⇒ Gaussian posterior (not true in general, but in some cases, e.g., when the mean of the Gaussian likelihood is the only unknown)


 and many other such pairs ..
 Tip: If two distributions are conjugate to each other, their functional forms are similar
 Example: Bernoulli and Beta have the forms
$$\text{Bernoulli}(y \mid \theta) = \theta^{y}\,(1-\theta)^{1-y}, \qquad \text{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$$
This is why, when we multiply them while computing the posterior, the exponents get added and we get the same form for the posterior as the prior, but with updated hyperparameters. Also, we can identify the posterior and its hyperparameters simply by inspection.
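To make the tip concrete, here is the multiplication written out for a single toss y; the N-observation version is exactly the computation on the previous example slide.

```latex
\text{Beta}(\theta \mid \alpha, \beta)\,\text{Bernoulli}(y \mid \theta)
  \;\propto\; \theta^{\alpha-1}(1-\theta)^{\beta-1}\,\theta^{y}(1-\theta)^{1-y}
  \;=\; \theta^{(\alpha+y)-1}(1-\theta)^{(\beta+1-y)-1}
  \;\propto\; \text{Beta}(\theta \mid \alpha + y,\ \beta + 1 - y)
```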
Probabilistic Models: Making Predictions
 Having estimated θ, we can now use it to make predictions
 Prediction entails computing the predictive distribution of a new observation, say y∗ (for example, the PMF of the label of a new test input in classification)

$$p(y_* \mid \boldsymbol{y}) = \int p(y_*, \theta \mid \boldsymbol{y})\, d\theta = \int p(y_* \mid \theta, \boldsymbol{y})\, p(\theta \mid \boldsymbol{y})\, d\theta = \int p(y_* \mid \theta)\, p(\theta \mid \boldsymbol{y})\, d\theta$$
Here p(y∗ | 𝒚) is the conditional distribution of the new observation given the past observations; the first equality marginalizes over the unknown θ, the second decomposes the joint using the chain rule, and the third assumes i.i.d. data, so that y∗ given θ does not depend on 𝒚.

 When doing MLE/MAP, we approximate the posterior by a single point


$$p(y_* \mid \boldsymbol{y}) = \int p(y_* \mid \theta)\, p(\theta \mid \boldsymbol{y})\, d\theta \;\approx\; p(y_* \mid \theta_{\text{opt}})$$
This is a "plug-in prediction": we simply plug in the single estimate we had.

 When doing fully Bayesian estimation, getting the predictive distribution requires computing
$$p(y_* \mid \boldsymbol{y}) = \int p(y_* \mid \theta)\, p(\theta \mid \boldsymbol{y})\, d\theta = \mathbb{E}_{p(\theta \mid \boldsymbol{y})}\!\left[ p(y_* \mid \theta) \right]$$
This computes the predictive distribution by averaging over the full posterior: basically, calculate p(y∗ | θ) for each possible θ, weigh it by how likely that θ is under the posterior p(θ | 𝒚), and sum all such posterior-weighted predictions. Note that not every value of θ is given equal importance in this averaging.
Probabilistic Models: Making Predictions (Example)
 For the coin-toss example, let's compute the probability of a new toss showing head

 This can be done using the MLE/MAP estimate, or using the full posterior
$$\theta_{\text{MLE}} = \frac{N_1}{N}, \qquad \theta_{\text{MAP}} = \frac{N_1 + \alpha - 1}{N + \alpha + \beta - 2}, \qquad p(\theta \mid \boldsymbol{y}) = \text{Beta}(\theta \mid \alpha + N_1,\ \beta + N_0)$$

 Thus for this example (where observations are assumed to come from a Bernoulli):
 Plug-in prediction: $p(y_* = 1 \mid \boldsymbol{y}) \approx \theta_{\text{MLE}}$ or $\theta_{\text{MAP}}$
 Fully Bayesian prediction: $p(y_* = 1 \mid \boldsymbol{y}) = \mathbb{E}_{p(\theta \mid \boldsymbol{y})}[\theta] = \frac{\alpha + N_1}{\alpha + \beta + N}$, the expectation of θ under the Beta posterior that we computed using fully Bayesian inference
Again, keep in mind that the posterior-weighted averaged prediction used in the fully Bayesian case would usually not be as simple to compute as it was in this case. We will look at some hard cases later.
Probabilistic Modeling: A Summary
 Likelihood corresponds to a loss function; prior corresponds to a regularizer
 Can choose likelihoods and priors based on the nature/property of data/parameters
 MLE estimation = unregularized loss function minimization
 MAP estimation = regularized loss function minimization
 Allows us to do fully Bayesian learning (learning the full distribution of the
parameters)
 Makes robust predictions by posterior averaging (rather than using point estimate)
 Many other benefits, such as
 Estimate of confidence in the model’s prediction (useful for doing Active Learning)
 Can do automatic model selection, hyperparameter estimation, handle missing data, etc.
 Formulate latent variable models
 .. and many other benefits (a proper treatment deserves a separate course, but we will see some
of these in this course, too)
Coming up next
 Probabilistic modeling for regression and classification problems

