

2 Introduction to Bayesian data analysis - part 3: How to do Bayes?

Hallå hallå, as we say in Sweden. I'm Rasmus Bååth, and welcome to this last part of a three-part introduction to Bayesian data analysis, which will go into the how of Bayesian data analysis: how to do it efficiently and how to do it in practice. But I'll warn you, we will just make the tiniest scratch on the far-reaching surface that is Bayesian computation.

The goal is just for you to get some familiarity with words like Markov chain Monte Carlo and parameter space, and to be able to run a simple Bayesian model using a powerful Bayesian computational framework called Stan. Also, if you haven't watched part 1 and part 2 yet, and especially if you didn't do the exercise in part 1, I really recommend that you do that first, because what's going to follow isn't going to make much sense otherwise. Also, this part 3 is quite a lot longer than the other parts, so feel free to take a break at any time.
So what is the problem with how we have performed Bayesian data analysis so far? Well, we really coded things from scratch, and from my experience that is slow and error-prone. It's especially slow because we've been doing approximate Bayesian computation, which is the most general and conceptually simplest method, but also the slowest method for fitting a Bayesian model.
But there are many faster methods for fitting Bayesian models, methods that are faster because they take computational shortcuts. These faster methods have in common that they require that the likelihood that the generative model will generate any given data can be calculated rather than simulated.
So what we did when we did approximate Bayesian computation was that we defined a generative model function that took a fixed parameter value and simulated some data, and we figured out the likelihood of the generative model simulating the actual data by running it many, many times and then counting how many times it produced data matching the actual data. All this simulation can be very, very time consuming. Faster methods instead require a function that takes both data and fixed parameters as input and directly calculates the likelihood of those parameters producing the data. It's not always straightforward to calculate the likelihood for any generative model, but for a large number of generative models someone has already done this work for you. For example, for most common probability distributions the likelihood is easy to calculate.
Another thing faster methods have in common is that they explore the parameter space in a smarter way. Rather than just sampling from the prior, as we have done, faster methods try to find and explore the regions in parameter space that have higher probability, using different more or less smart methods. More about parameter space soon.
Finally, it is important to remember that what you get from using these faster computational methods are samples, as if you had done the analysis using the type of approximate Bayesian computation we did in part 1 and part 2; you just get the result much, much faster. So if you're using a faster method that you don't completely understand, but you do believe it does its job, then you can interpret the result as coming from the simple approximate Bayesian computation procedure we used in part 1 and part 2.
It's a little bit like when you do optimization: the simplest, and slowest, method is to do an exhaustive search of all possible parameter combinations and return the optimal combination, but this would often just take forever. So if you instead use a faster method, like Nelder-Mead optimization, you can interpret the resulting optimum as being the result of an exhaustive search, which is perhaps easier to understand. If you believe the Nelder-Mead optimization worked, that is.
Now, this slide mentions the concept of parameter spaces, and since we haven't mentioned that before, I thought we would take a little detour and look at the parameter space of a slightly more advanced Bayesian model. These slides are borrowed from another presentation I held at the 2016 European R users meeting in Poland, so you will have to excuse the sudden change in graphical style.
So here is the most advanced model we've looked at so far, but here shown as a nice little model diagram. The data were two counts of successes, x1 and x2, out of n1 and n2 trials. We model this data as being generated by a binomial distribution. In part 1, I think we called it a generative model of people signing up for fish, but that generative model is more often called a binomial distribution. The squiggly little tilde attached to the arrows should here be read as "is generated by" or "comes from".

We have two parameters, which means we have a two-dimensional parameter space, but it's a pretty uninteresting parameter space, as the parameters are not really related in any way. You can easily see that in the diagram: p1 only relates to x1 and p2 only relates to x2, but the streams never cross.
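To make the "streams never cross" point concrete, here is a minimal R sketch (with made-up counts, not the data from the slides) showing that the likelihood of a point (p1, p2) in this two-dimensional parameter space is just the product of two separate binomial likelihoods, one per parameter.

    # Made-up success counts and trial counts for the two groups
    x1 <- 6;  n1 <- 16
    x2 <- 10; n2 <- 16

    # The likelihood factorizes: p1 only touches x1, p2 only touches x2.
    likelihood <- function(p1, p2) {
      dbinom(x1, size = n1, prob = p1) * dbinom(x2, size = n2, prob = p2)
    }

    likelihood(0.4, 0.6)   # likelihood of one point in the 2-D parameter space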
Let's have a look at another simple Bayesian model, but with a more interesting parameter space.
So say you have some outcome y and you want to model how it depends on another variable x. Then a simple approach is to model this relationship as a line and use linear regression. But which line should you choose? One common approach is to calculate the difference between the line and the outcome variable and try to find the line that minimizes this difference.
So maybe this line is better, maybe not. Actually, if you try all the different lines you can, you'll find that this is the line that minimizes the difference. However, this is not how you look at linear regression from the perspective of statistical modeling. In statistical modeling you instead use probability to posit a generative model, a little story for how the data came to be, and the generative model behind classical linear regression goes like this.
So we have the outcome y, and here we read from the bottom of the diagram, or from the top of the text representation to the left, whatever suits you. We have some outcome y that comes from normal distributions that all share the same standard deviation sigma, but that have different means mu, and here mu depends on the predictor variable x through the equation for the line. So for each y you take the corresponding x, multiply it by the slope parameter, and add the intercept parameter, and that gets you the mu for that y.
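As an illustration, here is a minimal R sketch of this generative model. The parameter values and x values are made up for the example, not the data shown on the slides.

    # Simulating data from the generative model behind linear regression:
    # each y comes from a normal distribution whose mean lies on the line.
    set.seed(42)
    n_points  <- 30
    intercept <- -0.5
    slope     <- 0.9
    sigma     <- 1.0

    x  <- runif(n_points, min = 0, max = 10)
    mu <- intercept + slope * x                 # the mean of y depends on x
    y  <- rnorm(n_points, mean = mu, sd = sigma)

    plot(x, y)
    abline(a = intercept, b = slope)            # the underlying line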
Now, this model has three parameters, three unknowns: the standard deviation sigma, the slope, and the intercept, and these are what we want to figure out. But to the right I've just visualized this model for some fixed parameter values, so that we can take a look at an instance of this model.
The black line shows how the mean of y changes as a function of x, and the shaded lines show a band of normal distributions seen from above. The stronger the shade, the more likely it is that a data point will fall in that region.
Now, this is a generative model, but it's not yet a Bayesian model. For that we need to represent all uncertainty by probability and add prior probability distributions over all parameters, and since smart priors are not really the focus here, we're just going to add dumb, flat, uniform distributions for the time being.
OK, we have a Bayesian model; let's add the data we looked at before to this. Here I've added the data points to the fixed instance of this model. The bluer a data point is, the more likely it is to be generated under this specific instance of the model, so you can say this plot on the right shows the likelihood of the data in data space, which has the dimensions x and y.
But we can also look at the likelihood in parameter space. Holding the standard deviation sigma constant for now, we have two free parameters, the intercept and the slope, and the point on the left shows where we currently are in parameter space: at an intercept of -0.5 and a slope of 0.9. The color of this point corresponds to the product of all the likelihoods of all the data points to the right. So now we can move around in parameter space and see how this likelihood changes.
If we make the slope steeper, it makes some data points very unlikely, and we see that the overall likelihood to the left goes down. We could make the slope, say, zero, but then all the data points become unlikely, and this doesn't get better even if we change the intercept.
This is a pretty good fit, perhaps a bit too steep, and a negative slope results in a horrible fit which makes many data points extremely unlikely. Actually, the parameter combination we started with is the combination that makes the data the most likely, the so-called maximum likelihood estimate. If you recognize this term, it's because maximum likelihood estimation is a really common way of fitting models in classical statistics.
And it does make sense that the parameter values that make the data the most likely might be a good guess for what the best parameter values might be. But in Bayesian statistics we're not just interested in one best guess; we're interested in exploring the full parameter space and in how probable different parameter combinations are. And we're almost there.
Instead of searching through parameter space for a best parameter combination, we could evaluate the likelihood of all parameter combinations, or at least a fine grid of combinations.
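Here is a minimal R sketch of this grid evaluation, again using the simulated x and y from above and holding sigma fixed at 1. The grid limits and resolution are arbitrary choices for the example.

    # Evaluate the likelihood over a fine grid of intercept/slope combinations
    grid <- expand.grid(intercept = seq(-5, 5, length.out = 100),
                        slope     = seq(-2, 3, length.out = 100))

    grid$likelihood <- mapply(
      function(a, b) prod(dnorm(y, mean = a + b * x, sd = 1)),
      grid$intercept, grid$slope)

    # The exhaustive-search version of the maximum likelihood estimate is
    # simply the grid point where the data is most likely:
    grid[which.max(grid$likelihood), ]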
So here we have the likelihood of the data given different combinations of intercepts and slopes, which can be written more compactly like this: P(D | theta), the likelihood of the data D given a parameter combination theta. But as we said before in this tutorial, we're not really interested in the probability of the data; we know what the data is. What we want to know is the probability of different parameter combinations given the data.
And what we realized before is that the probability of a parameter combination is proportional to how likely it is that that parameter combination generated the data, weighted by how probable that parameter combination was to begin with.
To make this proportionality into an equality, we just normalize by the weighted likelihood of all other parameter combinations, and that gives us the probability of different parameters given the data: P(theta | D) = P(D | theta) * P(theta) divided by the sum of P(D | theta) * P(theta) over all parameter combinations. And again, this little formula here is, of course, Bayes' theorem. We can use Bayes' theorem to calculate the probability of all combinations in parameter space. Looking at this posterior probability distribution to the left, we see that almost all the probability is concentrated on a slope between 0 and 1.5 and an intercept between -2 and 1.
We could arrive at this probability distribution without having to sample from a generative model, since for this model we could calculate the likelihood directly; it was enough to just explore parameter space and apply Bayes' theorem.
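Continuing the grid sketch above, applying Bayes' theorem is then just a couple of lines: weight the likelihood by the prior and normalize so everything sums to one. With the dumb flat priors used here, the posterior is simply the normalized likelihood.

    # Bayes' theorem over the grid (assumes the grid data frame from above)
    grid$prior     <- 1                                   # flat prior
    grid$posterior <- grid$likelihood * grid$prior /
                      sum(grid$likelihood * grid$prior)   # normalize to sum to 1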
Now that we have a posterior probability distribution in parameter space, we can do a lot of useful stuff with it. For example, we could draw a sample of intercept and slope combinations from this probability distribution and then plot the resulting lines back into data space.
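A sketch of this step, assuming the grid with its posterior column and the x and y from the sketches above: draw intercept/slope combinations in proportion to their posterior probability, plot the corresponding lines over the data, and summarize the marginal distributions.

    # Sample parameter combinations from the grid posterior
    draws <- grid[sample(nrow(grid), size = 1000, replace = TRUE,
                         prob = grid$posterior), ]

    # Plot a subset of the sampled lines back into data space
    plot(x, y)
    for (i in 1:100) {
      abline(a = draws$intercept[i], b = draws$slope[i],
             col = rgb(0, 0, 1, alpha = 0.05))
    }

    # Marginal uncertainty for each parameter
    quantile(draws$intercept, probs = c(0.025, 0.975))
    quantile(draws$slope,     probs = c(0.025, 0.975))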
This now becomes a visual representation of the uncertainty regarding the best regression line. Or, instead of looking at this 2D distribution to the left, we could look at the marginal distributions for each parameter. Here again we get that the slope is likely between 0 and 1.5 and the intercept is somewhere around -2 to 1. All right, that was an example of the parameter space of a slightly more complex Bayesian model.
