
model and how we could fit this model to the data by evaluating the likelihood for all parameter combinations, or at least a fine grid of combinations.
But this is not always so easy, as covering the parameter space can be hard. For example, the parameter space can be vast and multi-dimensional, and we will not be able to explore all relevant parts of it. The parameter space we just looked at was deceptively simple, as it had just two free parameters, but it's not difficult to come up with models that have tens, hundreds, or even thousands of parameters, making it impossible to cover all relevant parts of the parameter space using a grid. And we might not even know which part we should explore and where to start exploring. But fortunately, there is a lot of research going into making Bayesian computation easier for you, and many of the more modern and efficient ways of fitting Bayesian models belong to a class of algorithms called Markov chain Monte Carlo algorithms.
This is a class of algorithms that samples from probability distributions by walking around the parameter space in clever ways, and it's fair to say that the development of efficient Markov chain Monte Carlo algorithms is a main reason why Bayes became popular again.
The probability distributions we are interested in in Bayesian data analysis are posterior distributions: the probability of different parameter combinations after we have used the data.
We mostly care about the parameter combinations that have at least some probability, but the problem with evaluating all possible parameter combinations, or at least a fine grid of combinations, is that you have to decide at what points to evaluate the likelihood before you actually know what the likelihood looks like. It's a bit of a catch-22 situation.
A Markov chain Monte Carlo algorithm instead explores, or walks around, the parameter space in such a way that, in the long run, it will revisit each parameter combination proportionally to how probable it is. How it actually achieves this differs between algorithms, but when the algorithm works (and there are many situations where it won't), it allows you to start the algorithm in some part of the parameter space, record its location as it walks around, and the resulting distribution of visited parameter combinations will be a good representation of the posterior distribution.
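How such a walker can work is easiest to see in code. Here is a minimal sketch of a Metropolis-style algorithm in R, one of the simplest Markov chain Monte Carlo algorithms; it is not what Stan uses, and posterior_density() is a hypothetical function that evaluates the unnormalized posterior of a one-dimensional parameter:

```r
# A minimal sketch of a Metropolis-style MCMC walker in one dimension.
# posterior_density() is a hypothetical, user-supplied function that must
# be positive at the starting point.
metropolis <- function(posterior_density, n_iter = 10000, start = 0) {
  draws <- numeric(n_iter)
  current <- start
  for (i in 1:n_iter) {
    proposal <- current + rnorm(1, mean = 0, sd = 0.5)  # propose a nearby point
    accept_prob <- min(1, posterior_density(proposal) / posterior_density(current))
    if (runif(1) < accept_prob) {
      current <- proposal  # move there with probability accept_prob
    }
    draws[i] <- current  # record the current location
  }
  draws  # in the long run, distributed as the posterior
}
```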
So here is an example of a Markov chain Monte Carlo algorithm let loose in our parameter space. The red dot is the current location, and the black dots record where it has been. The longer it runs, the better the cloud of black dots represents the posterior distribution we looked at before, where the slope was likely between 0 and 1 and the intercept was likely between -2 and 1.
And again, the result would be the same independent of which algorithm we use to fit the Bayesian model: approximate Bayesian computation, evaluating a grid of parameter combinations, or Markov chain Monte Carlo. In theory at least, because even if you use different algorithms to fit the model, it's still the same model, right? But in practice, different algorithms can make a huge difference, with some being much more efficient than others.
And even if you just look at Markov chain Monte Carlo algorithms, there is a bewildering number of different ones. You have Metropolis-Hastings and Gibbs sampling, two of the most common Markov chain Monte Carlo algorithms, but there are many more with exotic names like hit-and-run, the t-walk, particle Monte Carlo, etc.
Another algorithm is Hamiltonian Monte Carlo, which can be a more efficient Monte Carlo algorithm that scales well, but which can be difficult to set up. That is, unless you use Stan.
Stan is a domain-specific probabilistic programming language influenced by R, C++, and BUGS, where BUGS does not refer to errors in computer programs but to the BUGS language, which was the first widespread programming language for Bayesian modeling, and also the winner of hardest-to-Google-for language. I mean, just imagine when you had a bug in BUGS and wanted to Google that. Well, anyway, Stan is its own language, but it's really designed to be used together with another language, and it has interfaces to languages like R, Python, MATLAB, etc. What Stan allows you to do is to just define your model, and then Stan takes care of fitting it efficiently by compiling it down to C++ and using Hamiltonian Monte Carlo to fit the model. If you want to do more serious Bayesian data analysis, Stan is a good place to start. I said that Stan is a domain-specific language, which means it can't do everything, but it's really good within its domain, and Stan is really tailor-made for defining generative models.
So let's take a look at some Stan code, omitting declarations. Here is a minimal Stan program that implements a simple binomial model, basically the same model as in the fish subscription example from part one. Each Stan program consists of a number of code blocks, where the model block describes the generative model, and here we are stating that some number x of successes out of n trials is distributed as a binomial distribution, where the rate of success p is distributed as a uniform between 0 and 1.
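As a sketch, the model block of such a program could look like this (the declarations that complete the program appear later in this section):

```stan
model {
  p ~ uniform(0, 1);   // the rate of success p is uniform between 0 and 1
  x ~ binomial(n, p);  // x successes out of n trials, binomially distributed
}
```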
This program is going to give you the same as the following R code, where we first draw uniformly between 0 and 1, put that value into p, and then draw x successes out of n trials with this p as the rate of success. Running the Stan program or the R program for a large number of iterations is going to give you the same: a large sample of randomly drawn p's. That is, if there is no data, that is, if n and x are set to 0 in the Stan program: no trials, no successes.
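As a sketch, the R code described here could look like this (the number of draws is illustrative):

```r
# Draw p uniformly between 0 and 1, then draw x successes out of n trials
# with this p as the rate of success. With no data, n is set to 0.
n <- 0
n_draws <- 10000
p <- runif(n_draws, min = 0, max = 1)
x <- rbinom(n_draws, size = n, prob = p)
hist(p)  # a large sample of randomly drawn p's
```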
Here is a plot of a large sample of p's drawn using the Stan program, and it does look pretty uniform, which is the output we would also expect from the R code.
However, if we add some data to the Stan program, and x is set to, say, 10, with n being, say, 30, that is, if you have 10 successes out of 30 trials, then the Stan program will produce samples of p that are similar to running the R code with an extra approximate Bayesian computation filtering step at the end. That is, sampling p's uniformly between 0 and 1 and then filtering away all those p's that didn't result in x becoming 10. In other words, Stan will generate draws from the posterior distribution of p given the model and the data x = 10, the result of which you can see in the histogram. A difference here is of course that running the Stan program is going to be much, much more efficient than running the approximate Bayesian computation routine in R.
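As a sketch, that approximate Bayesian computation routine could look like this in R (the number of draws is illustrative):

```r
# Sample p's uniformly, simulate 30 trials for each p, and filter away
# all those p's that didn't result in x becoming 10.
n_draws <- 100000
p <- runif(n_draws, min = 0, max = 1)
x <- rbinom(n_draws, size = 30, prob = p)
posterior_p <- p[x == 10]  # draws from the posterior of p given x = 10
hist(posterior_p)
```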
Now, it's very important to understand that Stan is not actually doing the same as the R code. It does give similar output, but it does not run an approximate Bayesian computation routine internally; it doesn't first sample p from the uniform distribution. Stan uses a Markov chain Monte Carlo algorithm that walks around the parameter space, here just the one-dimensional parameter space of p, where, if everything worked, the records of this walk will be distributed as the posterior distribution. How Stan actually works is out of the scope of this tutorial, but the sweet thing with Stan is that it often works for many models: it's actually enough just to define the generative model, and Stan will do the rest. So, more generally, how do you define a model in Stan? Well, for that you need to know the syntax of Stan, so what I'm going to do now is breeze through a cheat sheet of the syntax of Stan. It will be very brief, but don't worry, in the end you'll get the link to the cheat sheet, so no need to take notes.
All right, a crash course in the syntax of Stan. The basic syntax is similar to all curly-bracket languages such as C and JavaScript, but vectorization is similar to R. Comments can be written with double slashes like in C++, or a hash sign like in many other languages, so if you look through these statements, you'll see that it looks like something you could see in, say, R. As in C and C++, you need to end each statement with a semicolon, as opposed to JavaScript and Python. Stan is statically typed, so the types of all variables, parameters, and data have to be declared in advance. You have the types you get in all languages, like real numbers and integers, but there are also a lot of types useful in statistical modeling, like vectors and matrices. Vectors and matrices can only contain real numbers, so if you want collections of other types, you would have to use the more flexible array type. All types can have added constraints, and constraints are required for variables acting as parameters. For example, if we have a proportion parameter p, we wouldn't want that to ever be outside the range of 0 to 1, and constraints are how you tell Stan things like this. If you give a parameter no constraints, that's the same as saying it could be anything from minus infinity to plus infinity; a standard deviation parameter sigma we might want to give a lower bound at 0; and so on. There are also some specialty types with special constraints, like the simplex vector, which constrains all its elements to be positive and to sum to 1. So these were some of the small pieces of a Stan program.
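As a sketch, some declarations using these types and constraints could look like this (written in current Stan syntax; the names are illustrative):

```stan
int n;                     // an integer
real mu;                   // an unconstrained real: anything from -inf to +inf
real<lower=0> sigma;       // a standard deviation, given a lower bound at 0
real<lower=0, upper=1> p;  // a proportion, constrained to be between 0 and 1
vector[5] v;               // a vector of 5 real numbers
matrix[3, 3] m;            // a 3 by 3 matrix of real numbers
simplex[4] weights;        // 4 positive reals constrained to sum to 1
array[10] int counts;      // an array of 10 integers; arrays can hold other types
```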
But looking at the whole Stan program, you see that it consists of a number of code blocks, where each block fills a different role. You have the data block, which can only contain declarations, where you tell Stan what data this program accepts as input. You have the parameters block, where you declare what the parameters are. The most important block is the model block, where you define the actual generative model. And sometimes you want to calculate derived quantities, like predictions based on the parameters, and the code for that you would put in the generated quantities block. There are more block types, but these are the most commonly used. So, I said Stan was a probabilistic programming language, and what makes it such is that you can define non-deterministic relations in the language, that is, distribution statements, which define probabilistic relations between parameters and data. Distribution statements are written with a tilde and allow you to define, say, that x is distributed as a normal distribution with mean mu and standard deviation sigma. There are many, many built-in distributions in Stan, and it's not too difficult to define your own distributions.
too 24:58 difficult to define your own 24:59 distributions as in R and MATLAB many of 25:05
the functions are vectorized that is 25:07 they work on vectors as well as on reels 25:10 and in-
tegers this allows you to write 25:13 more concise code and you can often 25:15 avoid writing
for loops note that as 25:19 opposed to or for loops are not that 25:20 slow in span to pick out
individual 25:24 values from vectors and matrices you use 25:26 square brackets and if you’re
used to 25:28 Python or C++ you need to remember that 25:31 indexing starts at one so now we
have 25:35 all the pieces to put together a minimal 25:38 stand program here implementing a
25:40 binomial model again the same model as 25:43 we used in the Selman subscription 25:45
example so first we declare what data we 25:49 have in the data block and integer n the 25:51
number of trials and an integer X the 25:54 number of successes the model has a one 25:57
unknown one parameter which is the 25:59 underlying frequency of success T you 26:03 can’t
have less than zero or more than 26:05 100 percent successes so we need to 26:07 constrain
T to be between zero and one 26:10 finally we describe the generative model 26:13 where we
state that T is distributed as 26:15 a uniform distribution between zero and 26:18 one that is this
is the prior on P and X 26:22 is distributed as a binomial 26:24 distribution with n trials and
ap 26:27 frequency of success again you could 26:31 read this as that we first sample P from
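Put together, a sketch of this minimal program could look like this:

```stan
data {
  int n;                     // the number of trials
  int x;                     // the number of successes
}
parameters {
  real<lower=0, upper=1> p;  // the underlying frequency of success
}
model {
  p ~ uniform(0, 1);         // the prior on p
  x ~ binomial(n, p);        // the generative model for the data
}
```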
Again, you could read this as: we first sample p from a uniform distribution and then sample x from a binomial distribution. You could do this, but it's not really what Stan is doing under the hood at all. One way of seeing this is to put the uniform distribution statement under the binomial one: the Stan model is going to run just fine, even though from a sampling perspective we're doing stuff in the wrong order.
So now we just need to get Stan to compile and fit this model somehow. This is usually done from another language such as Python or R, and assuming model_string contains the model from the last slide, this is how you would do it: first you need to define the data, and then you run the stan command, passing in the model string and the data. After waiting a couple of seconds for Stan to compile and fit the model, you get the Stan object back, which contains a large sample of parameter values representing the posterior distribution of p.
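From R, a sketch of this could look like the following, assuming the rstan package is installed and model_string holds the program above:

```r
library(rstan)

# First define the data, then pass the model string and the data to stan(),
# which compiles the model down to C++ and fits it.
data_list <- list(n = 30, x = 10)
fit <- stan(model_code = model_string, data = data_list)

# The fitted object contains a large sample from the posterior of p.
posterior <- extract(fit)
hist(posterior$p)
```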
Okay, so that was a whirlwind tour of Stan, and if you feel confused, that is totally fine; it's not easy to learn a new programming language in ten minutes. Nevertheless, it's time to now try out Stan in exercise number two: Bayesian A/B testing using Markov chain Monte Carlo and Stan. The purpose of this exercise is not to become masters in using Stan; the main purpose is just to install Stan and get something running. So step one is to install Stan for R or Python, which might not be the easiest, so for that just follow the first link. Then, with the help of the Stan cheat sheet we just went through, we will try to replicate the analysis we performed in exercise 1 in part 1 of this tutorial. Just installing a program and redoing something you've done before might not seem like much, but with Stan at your fingertips, ready to do your command, you are able to explore and experiment with Bayesian data analysis much, much more easily. So this is a really good exercise worth doing. I'll be waiting here when you come back. [humming] All right, welcome back! I hope that exercise went well.
if you did 29:12 the exercise or if you at least took a 29:14 look at the solution you should have
29:16 come across the following figure at some 29:18 point it’s called a trace plot and it 29:22
shows the parameter draws from the 29:24 posterior for the underlying rates of 29:26 success
for group a and Group B and also 29:30 their difference but instead of being 29:33 displayed
as say a histogram which would 29:35 be easy to read off we get them as a 29:38 line plot
where the x-axis is the order 29:41 the parameter draws were produced in why 29:44 is this and
why would we want to look at 29:47 such a plot well Stan uses a mark of 29:51 shame of the
coral algorithm which if 29:53 you remember walks around the parameter 29:55 space with the
promise that in the long 29:58 run it will visit each parameter 30:00 combination proportion
often - its 30:02 posterior probability that is how 30:04 probable that parameter combination is
30:06 after having used the data now note that 30:09 I said in the long run as we just 30:14 al-
Now, note that I said in the long run. As we just allowed Stan to go on for a little while, we cannot be sure if our run was long enough, and while we can never be a hundred percent certain of this, we should at least do some Markov chain Monte Carlo sanity checks, because there are many things that could go wrong. For example, our initial parameter values might be way off. The Markov chain Monte Carlo algorithm has to start exploring the parameter space from somewhere, and if that somewhere is way off, it might just have wandered around in a low-probability region, and we stopped it before it really got to the bulk of the probability in parameter space. Or maybe the algorithm got stuck in a part of parameter space, which could happen, for example, if it got stuck in a local maximum. To try to avoid these problems, you could add a little burn-in when running the Markov chain Monte Carlo algorithm, that is, you let it run around for a while so that it is more likely that it starts from a good starting position. You can let loose multiple Markov chain Monte Carlo walkers, or chains as they're also called, in the parameter space from different starting points and see that they explore similar parts of the parameter space. And finally, when you plot the samples from the algorithm, you should look for hairy caterpillars, which is what you would see if the Markov chain Monte Carlo algorithm worked well. Doing these three things does not guarantee in any way that the Markov chain Monte Carlo algorithm did work well; it's just the bare minimum of sanity checks you should perform.
So what could this look like? Here is a trace plot from a model I found online; which one doesn't really matter. If we look at the plot of sigma, we see that both the blue and the purple chains start off a little bit off but later stabilize, so to throw away the first thousand iterations as burn-in wouldn't be completely off. Otherwise, the two chains mix OK and do have that hairy caterpillar look. The trace plot for beta, though, looks bad, as it gets stuck for long periods and looks more like a cityscape. A solution in this case could just be to run the model for a really long time, which could make beta look more OK, but as it is now, it looks a little bit off. If we instead look at the output we got from Stan during the exercise, we see that everything looks OK: big, fat, hairy caterpillars. Again, this is no guarantee that everything is all right, it's just the first sanity check, but since this model is so simple, I'm pretty sure it's OK.
So, we looked at Stan, but there are many, many other tools for fitting Bayesian models. Just off the top of my head: for R we have MCMCpack, an old, stable package that contains a lot of pre-specified models you can use out of the box for regression modeling, etc. Then there is JAGS, which is a domain-specific language to build Bayesian models that has been around for a while. And then there is Stan, and if you want to ease into the Stan language, there is the rethinking package, developed by Richard McElreath together with his great intro book to Bayes, which allows you to define Stan models using a much simplified syntax directly from R. In Python, you again have Stan using the Python interface, but there is also PyMC3, which allows you to define Bayesian models directly in Python.
All right, so it's time to wrap up this tutorial, but first let me say that there was a lot we didn't cover. We didn't talk much about priors, and there are many different methods, more or less principled, for coming up with those. We didn't talk much about statistical distributions; we almost only used the binomial distribution, and since statistical distributions really are just small generative models, there are a lot of different distributions used in Bayesian statistics, as they are the building blocks you use when you construct a Bayesian model. We touched on decision analysis, but this is really a whole field of research. We didn't talk about model selection, where you have two or more competing Bayesian models that you want to compare, and where some people use so-called Bayes factors for this. I deliberately decided not to talk about philosophy, and I tried to keep the math to a minimum, but if you're into math, then rest assured there are a lot of Bayesian textbooks full of that.
so to 34:58 summarize this tutorial Bayesian data 35:01 analysis is a flexible method to fit any
35:04 type of statistical model and maximum 35:07 light can be seen as a special case of 35:09
Bayesian model fitting why use it well 35:12 it makes it possible to design highly 35:15 custom
models where you can include 35:17 information from many sources for 35:19 example both
data and expert knowledge 35:21 and Bayesian data analysis quantifies and 35:24 retains the
uncertainty in parameter 35:27 estimates and predictions which is super 35:29 useful not least
if you want to use the 35:31 results for 35:32 some type of decision analysis and how 35:35
to do it well with Python and or 35:37 together with some of all the great 35:38 packages like
Stan pine see Jags etc 35:43 finally if you want to look more into 35:46 Bayesian data analysis
Finally, if you want to look more into Bayesian data analysis, and I hope you do, I would recommend you pick up something as old-school as a book. Both Bayesian Methods for Hackers by Cameron Davidson-Pilon and Think Bayes by Allen Downey are good introductions if you're comfortable with Python, and they are also available for free, which doesn't hurt. The book I started out with was Doing Bayesian Data Analysis by John Kruschke, which uses both R, JAGS, and Stan, and dude, don't get fooled by the puppies on the cover, it's a seriously great book. Another great book is Statistical Rethinking by Richard McElreath, which is a fantastically clear introduction to Bayes. It has been called a pedagogical masterpiece, by me, in an Amazon review, but I do stand by it. And finally, if you want something more math-heavy, then Machine Learning: A Probabilistic Perspective by Kevin Murphy is an amazing introduction to machine learning from a Bayesian perspective, or perhaps to Bayes from a machine learning perspective, depending on your point of view.
And finally, finally, if you want to try out some more advanced Bayesian modeling right now, I have a bonus exercise for you: Bayesian computation with Stan and Farmer Jöns. It's a farm-themed exercise that takes off from the binomial model we used in the last exercise, and by just changing this model a tiny bit in each question, it will take us through most of the basic statistical models, all the way to running a full Bayesian linear regression in Stan. So, that was that. Thank you so much for watching this tutorial and for staying with me to the end. I'm Rasmus Bååth, and it's been a pleasure being your guide in this introduction to Bayesian data analysis.
