𝑝(𝒚) is the marginal likelihood of the data (we briefly look at this in CS771, although we will stay away from going too deep into it in this course; CS775 does that in more detail)
Two ways now to report the answer:
- Maximum-a-posteriori (MAP) estimation: report the maximum (mode) of the posterior
- Fully Bayesian inference: report the full posterior (and its properties, e.g., mean, mode, variance, quantiles, etc.)
Posterior
Posterior distribution tells us how probable different parameter values are after
we have observed some data
Height of posterior at each value gives the posterior probability of that value
Plugging in the expressions for Bernoulli and Beta and ignoring any terms that don't depend on 𝜃, the log posterior simplifies to
$$LP(\theta) = \sum_{n=1}^{N} \left[\, y_n \log\theta + (1 - y_n)\log(1-\theta) \,\right] + (\alpha - 1)\log\theta + (\beta - 1)\log(1-\theta)$$
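To make this concrete, here is a minimal Python sketch (the toy coin-toss data and Beta hyperparameters are assumptions for illustration) that evaluates this log posterior on a grid of 𝜃 values and locates its maximizer numerically:

```python
import numpy as np

# Toy coin-toss data (1 = head, 0 = tail) and Beta prior hyperparameters; both assumed for illustration
y = np.array([1, 0, 1, 1, 1, 0, 1, 0])
alpha, beta = 2.0, 2.0

def log_posterior(theta):
    # LP(theta) = sum_n [ y_n log(theta) + (1 - y_n) log(1 - theta) ]
    #             + (alpha - 1) log(theta) + (beta - 1) log(1 - theta)
    return (np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))
            + (alpha - 1) * np.log(theta) + (beta - 1) * np.log(1 - theta))

grid = np.linspace(1e-3, 1 - 1e-3, 1000)         # candidate values of theta in (0, 1)
lp = np.array([log_posterior(t) for t in grid])   # (unnormalized) log posterior at each theta
print("theta maximizing LP on the grid:", grid[np.argmax(lp)])
```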
Maximizing the above log posterior (or minimizing its negative) w.r.t. 𝜃 gives

$$\theta_{MAP} = \frac{\sum_{n=1}^{N} y_n + \alpha - 1}{N + \alpha + \beta - 2}$$

Using $\alpha = 1$ and $\beta = 1$ gives us the same solution as MLE. Recall that $\alpha = \beta = 1$ for the Beta distribution is in fact equivalent to a uniform prior (hence making MAP equivalent to MLE).

The prior's hyperparameters have an interesting interpretation: we can think of $\alpha - 1$ and $\beta - 1$ as the number of heads and tails, respectively, seen before starting the coin-toss experiment (akin to "pseudo-observations"). Such interpretations of the prior's hyperparameters as "pseudo-observations" exist for various other prior distributions as well (in particular, distributions belonging to the "exponential family" of distributions).
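A small sketch of the closed-form estimates under the same assumed toy setup, showing how the prior's pseudo-observations pull the MAP estimate away from the MLE (with a uniform prior, $\alpha = \beta = 1$, the two coincide):

```python
import numpy as np

y = np.array([1, 0, 1, 1, 1, 0, 1, 0])   # toy coin-toss data (assumed), 1 = head
N, N1 = len(y), int(y.sum())

alpha, beta = 2.0, 2.0                    # Beta prior hyperparameters ("pseudo-observations")

theta_mle = N1 / N                                      # MLE: fraction of observed heads
theta_map = (N1 + alpha - 1) / (N + alpha + beta - 2)   # mode of the Beta posterior

print(theta_mle, theta_map)   # with alpha = beta = 1 (uniform prior) the two coincide
```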
Fully Bayesian Inference
MLE/MAP only give us a point estimate of 𝜃. The MAP estimate is more robust than MLE (due to the regularization effect), but the estimate of uncertainty is missing in both approaches: both just return a single "optimal" solution by solving an optimization problem.

Interesting fact to keep in mind: the use of the prior makes the MLE solution move towards the prior (the MAP solution is kind of a "compromise" between the MLE solution and the mode of the prior).

Fully Bayesian inference:
If we want more than just a point estimate, we can compute the full posterior
$$p(\theta \mid \boldsymbol{y}) = \frac{p(\theta)\, p(\boldsymbol{y} \mid \theta)}{p(\boldsymbol{y})}$$

This is computable analytically only when the prior and likelihood are "friends" with each other, i.e., they form a conjugate pair of distributions (distributions from the exponential family have conjugate pairs). An example: Bernoulli and Beta are conjugate. We will see some more such pairs.

In other cases, the posterior needs to be approximated (we will see 1-2 such cases in this course; more detailed treatment in the advanced course on probabilistic modeling and inference).
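A minimal sketch of this conjugacy for the Bernoulli-Beta pair, using scipy.stats (the data and hyperparameters here are assumed for illustration): the posterior is obtained in closed form just by adding the observed head/tail counts to the Beta hyperparameters.

```python
import numpy as np
from scipy import stats

y = np.array([1, 0, 1, 1, 1, 0, 1, 0])    # toy coin-toss data (assumed)
N1 = int(y.sum())                          # number of heads
N0 = len(y) - N1                           # number of tails

alpha, beta = 2.0, 2.0                     # Beta(alpha, beta) prior

# Conjugacy: Beta prior + Bernoulli likelihood => Beta posterior with updated hyperparameters
posterior = stats.beta(alpha + N1, beta + N0)

print("posterior mean:", posterior.mean())
print("posterior mode:", (alpha + N1 - 1) / (alpha + beta + N1 + N0 - 2))
print("95% credible interval:", posterior.interval(0.95))
```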
“Online” Nature of Bayesian Inference
Fully Bayesian inference fits naturally into an “online” learning setting
Our belief about 𝜃 keeps getting updated as we see more and more data.
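A small sketch of this online updating for the Beta-Bernoulli case (the data stream and initial hyperparameters are assumptions): after each observation, the current posterior simply becomes the prior for the next one.

```python
# Online Bayesian updating for the Beta-Bernoulli model (toy data stream, assumed)
alpha, beta = 2.0, 2.0            # prior hyperparameters before seeing any data

stream = [1, 0, 1, 1, 0, 1]       # coin tosses arriving one at a time (1 = head, 0 = tail)

for y_n in stream:
    # The current Beta(alpha, beta) posterior acts as the prior for the new observation:
    # a head increments alpha, a tail increments beta.
    alpha += y_n
    beta += 1 - y_n
    print(f"after y={y_n}: posterior = Beta({alpha}, {beta}), mean = {alpha / (alpha + beta):.3f}")
```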
Fully Bayesian Inference: An Example
Let's again consider the coin-toss problem, with the Bernoulli likelihood $p(y_n \mid \theta) = \theta^{y_n}(1-\theta)^{1-y_n}$ and a Beta prior on 𝜃.
The posterior is the same distribution as the prior (both Beta), just with updated hyperparameters.
Also, if you get more observations, you can treat the current posterior as the new prior and obtain a new posterior using these extra observations (a property that holds when the likelihood and prior are conjugate to each other).
The conditional distribution of the new observation $y_*$, given the past observations $\boldsymbol{y}$:

$$p(y_* \mid \boldsymbol{y}) = \int p(y_* \mid \theta, \boldsymbol{y})\, p(\theta \mid \boldsymbol{y})\, d\theta \quad \text{(decomposing the joint using the chain rule)}$$

$$= \int p(y_* \mid \theta)\, p(\theta \mid \boldsymbol{y})\, d\theta \quad \text{(assuming i.i.d. data: given } \theta, \; y_* \text{ does not depend on } \boldsymbol{y}\text{)}$$
When doing fully Bayesian estimation, getting the predictive distribution will require computing

$$p(y_* \mid \boldsymbol{y}) = \int p(y_* \mid \theta)\, p(\theta \mid \boldsymbol{y})\, d\theta = \mathbb{E}_{p(\theta \mid \boldsymbol{y})}\!\left[\, p(y_* \mid \theta) \,\right]$$

This computes the predictive distribution by averaging over the full posterior: basically, calculate $p(y_* \mid \theta)$ for each possible $\theta$, weigh it by how likely this $\theta$ is under the posterior $p(\theta \mid \boldsymbol{y})$, and sum all such posterior-weighted predictions. Note that not each value of $\theta$ is given equal importance here in the averaging.
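A sketch of this posterior averaging for the Beta-Bernoulli case (the counts and hyperparameters are assumed for illustration): the exact predictive probability of a head equals the posterior mean of 𝜃, and a Monte Carlo average over posterior samples approximates the same integral.

```python
import numpy as np
from scipy import stats

alpha, beta = 2.0, 2.0      # prior hyperparameters (assumed)
N1, N0 = 5, 3               # observed numbers of heads and tails (assumed)

posterior = stats.beta(alpha + N1, beta + N0)

# Exact predictive: p(y* = 1 | y) = E_{p(theta|y)}[theta] = (alpha + N1) / (alpha + beta + N1 + N0)
p_head_exact = posterior.mean()

# Monte Carlo version of the same average: sample theta from the posterior,
# evaluate p(y* = 1 | theta) = theta for each sample, and average the results
samples = posterior.rvs(size=100_000, random_state=0)
p_head_mc = samples.mean()

print(p_head_exact, p_head_mc)
```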
Probabilistic Models: Making Predictions (Example)
For the coin-toss example, let's compute the probability of the next toss showing head.
This can be done using the MLE/MAP estimate, or using the full posterior
$$\theta_{MLE} = \frac{N_1}{N} \qquad \theta_{MAP} = \frac{N_1 + \alpha - 1}{N + \alpha + \beta - 2} \qquad p(\theta \mid \boldsymbol{y}) = \text{Beta}(\theta \mid \alpha + N_1, \, \beta + N_0)$$

where $N_1$ and $N_0$ denote the number of observed heads and tails, respectively.
Thus, for this example (where observations are assumed to come from a Bernoulli):
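As a worked completion of this example (a standard Beta-Bernoulli result, with $N_1$ heads out of $N$ tosses), the probability of the next toss showing head under each approach is:

$$p(y_* = 1 \mid \boldsymbol{y}) \approx \theta_{MLE} = \frac{N_1}{N} \quad \text{(plug-in MLE)}$$

$$p(y_* = 1 \mid \boldsymbol{y}) \approx \theta_{MAP} = \frac{N_1 + \alpha - 1}{N + \alpha + \beta - 2} \quad \text{(plug-in MAP)}$$

$$p(y_* = 1 \mid \boldsymbol{y}) = \mathbb{E}_{p(\theta \mid \boldsymbol{y})}[\theta] = \frac{\alpha + N_1}{\alpha + \beta + N} \quad \text{(fully Bayesian, averaging over the Beta posterior)}$$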