
Bayesianism

[COURSE I]

1 Introduction to Bayesian data analysis

1.1 Part 1: What is Bayes?

Hello, I'm Rasmus Bååth and welcome to part one of this three-part introduction to Bayesian data analysis. This is an introduction that I have given before, for example at the 2015 UseR! conference, and it's targeted at you who isn't necessarily that well-versed in probability theory and statistics, but who does know your way around a programming language such as R or Python. Even though it is in three parts, it is going to be quite brief, and I'm going to warn you that it is also going to be quite hand-wavy in parts. But I do hope it will give you some intuition about what Bayesian data analysis is, why it is useful, and how you can perform Bayesian data analysis yourself. So this part one is about what Bayesian data analysis is. But before we go into that, I'm going to fade myself out and we're going to start by looking at some famous people.
So this is Nate Silver. He's one of the more famous statisticians around, not least because he did a very good job predicting the outcome of the two Obama elections, and he wasn't completely off in the Trump election. He's currently the editor-in-chief of the well-known data-driven news site 538. And here is Sebastian Thrun, who won the 2005 DARPA Grand Challenge, which was about building a self-driving car that could drive through over 200 kilometres of rough terrain, and after that he worked on Google's self-driving car. Finally, here is Alan Turing, a giant in computer science who helped crack the German Enigma cipher during the Second World War, which helped secure Allied victory and likely shortened the war significantly.
So what do these three people have in common? Well, they all worked on complex problems where there was a large inherent uncertainty that needed to be quantified, and that required efficient integration of many sources of information. And they all used Bayesian data analysis. That's because Bayesian data analysis is a great tool, and R and Python have some great tools for doing Bayesian data analysis. But if you google "Bayesian" there's a good chance you won't find articles about how this tool could be used; instead you might get the philosophy.
You'll find articles discussing whether statistics should be subjective or objective, whatever that means, or whether statisticians should adhere to frequentism or Bayesianism, as if these were different religions within statistics, and you will find heated arguments about whether one should or should not use subjective probabilities rather than p-values.
And in this tutorial I won't talk about any of this. I will just talk about Bayesian data analysis as one good tool among many that you should have in your data science tool belt. So this tutorial is about the what, the why and the how of Bayesian data analysis.
(i) Part one, which you are watching right now, tries to answer what Bayesian data analysis is, (ii) part two touches on why you would want to use Bayesian data analysis, and (iii) part three gives some hints on how to actually perform a Bayesian data analysis in practice.
So let's start part one proper: What is Bayesian data analysis? Well, this can be characterized in a number of ways, some more helpful than others. One that isn't too helpful, but that is correct, is that Bayesian statistics is when you use probability to represent uncertainty in all parts of a statistical model. So, if you use probability to represent all uncertainty, then you are, by definition, using a Bayesian model.
You could also see Bayesian data analysis as a flexible extension of maximum likelihood, maximum likelihood being perhaps the most common way of fitting models in classical statistics.
You can also argue that Bayesian data analysis is potentially the most information-efficient method to fit a statistical model, but it's also the most computationally intensive method.
The characterization that we're going to run with in this tutorial is the following:
Bayesian data analysis is a method for figuring out unknowns, often called parameters, that requires three things:
● one, data,
● two, something called a generative model, and
● three, priors: the information the model has before seeing the data.
So what is a generative model here?
Well, it's a very simple concept. It's any kind of computer program, mathematical expression, or set of rules that you can feed fixed parameter values and that will generate simulated data. A typical example of a generative model is a probability distribution, like the normal distribution, which you can use to simulate data, but any kind of function that you can whip up in R or Python that simulates data also counts as a generative model. So a generative model is great if you know what parameter values you want but you're interested in how much the data could vary given those parameters. Because then you can simply plug in those parameter values, run your generative model a large number of times, and look at how much the data jumps around. That is, it's a classical Monte Carlo simulation.
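For instance, a minimal generative model in Python could look like the sketch below (the function name, the choice of a normal distribution, and the parameter values are just illustrative assumptions, not something from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def generative_model(mean, sd, n=20):
    """Given fixed parameter values, return one simulated data set."""
    return rng.normal(loc=mean, scale=sd, size=n)

# Classical Monte Carlo: keep the parameters fixed, run the generative model
# many times, and look at how much the simulated data jumps around.
sample_means = [generative_model(mean=10.0, sd=2.0).mean() for _ in range(10_000)]
print(f"simulated sample means range from {min(sample_means):.2f} to {max(sample_means):.2f}")
```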
But often we are in the complete opposite situation: we know what the data is, it's not uncertain, and we want to know what reasonable parameter values could have given rise to this data. That is, we want to work our way backwards from the data that we know and learn about the parameter values that we don't know. And it is this step that Bayesian inference helps you with. So now I'm going to explain how Bayesian inference works, with a motivating example. Well, it's up to you if you think it's motivating, but it's about fish, and who doesn't like fish?
But to keep things simple, we're actually going to start with just estimating the performance of one method, so it's just A testing, but we will come around to the B later.
Now, Swedish Fish Incorporated is a company that makes money by selling fish subscriptions. You know, you sign up for a year and every month you get a frozen salmon in the mail. They are huge in Sweden, but now they want to break into the lucrative Danish market.
But how should Swedish Fish Incorporated enter the Danish market? Well, the CEO has already come up with a plan; let's call it method A. He put together this colorful brochure which advertises the one-year salmon subscription plan, and marketing has actually already tried this out on 16 randomly chosen Danes, and out of the 16 Danes that got a brochure, six signed up for one year of salmon.
So what we want to know now is: how good is method A? What should we expect the percentage of sign-ups to be if we start sending brochures out on a large scale? Well, we could of course calculate the percentage of sign-ups in our sample; that's just 6 divided by 16, which equals 38 percent, and maybe that is a good guess. But surely this guess is quite uncertain, especially since we have such a small sample.
So not only do we want to know what's a good guess for the percentage of sign-ups, but we also want to know how uncertain this percentage is, and that's what we're going to use Bayesian data analysis for. So remember that Bayesian data analysis requires three things: data, and we have data, so check on that. Then we need a generative model, which we don't have, so let's come up with that. Let's come up with a generative model of people signing up for fish.
There are of course many ways of doing this; I'm just going to go with something simple here. So first, let's assume that there is one underlying rate¹ with which people sign up. For now let's just pick a number, say 55%. Then we "ask" a number of people, where the chance of each person signing up is 55%. "Ask" is in quotes here because we're actually not going to ask anybody; this generative model is something that we could implement in R or Python, and asking here just means that we use some random number function where there's a 55% chance of getting a 'yes' and a 45% chance of getting a 'no'. So how could this look? Let's "ask" 16 people, because that's how many were asked in our actual data set, and let's see how many of those 16 people sign up². So the first person didn't sign up, the second person didn't sign up, the third person signed up, and so on for all our 16 people. And finally we count how many signed up. In this instance 7 out of 16 people signed up, so this time 7 out of 16 is our simulated data.
¹ The hypothesis H.
² And here, what is he doing in relation to 3b1b?
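As a rough sketch, this generative model could be written in Python like this (the names are just illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def simulate_signups(rate, n_people=16):
    """'Ask' n_people whether they sign up, each with probability `rate`."""
    answers = rng.random(n_people) < rate  # True counts as a 'yes'
    return int(answers.sum())

print(simulate_signups(0.55))  # e.g. 7 simulated sign-ups out of 16
```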
Great. So now we have a generative model, a working model where we can plug in fixed parameter values and generate simulated data. The problem is, of course, that this is the opposite of what we really want. We don't want to simulate data; we know what the data is. We know that when marketing asked sixteen randomly selected Danes, six of them signed up, and what we want to do is the opposite: we want to go backwards from the data that we know and figure out what could be reasonable parameters that resulted in this data. That is, what is likely the rate of sign-ups that resulted in six out of sixteen signing up? The good news is that we are almost there; we can almost do this, we just need one more thing. We have data, we have a generative model, but we also need priors: we need to specify what information the model has before seeing the data. And when you do Bayesian data analysis, that is the same as saying that we need to use probability to represent uncertainty in all parts of the model. Right now we're not doing that. Here is the model we have so far: we have the generative model and we have one parameter, one unknown, the overall rate of sign-up. If I look at this model, I see uncertainty in two places. There is uncertainty in the generative model: we don't know what the simulated data will be each time we run it, but that uncertainty is already implicitly represented by probability, as we're using a random number function to simulate data. And if you know your probability distributions, you might have already recognized that the generative model is actually the same as a binomial probability distribution. But there is also uncertainty in what the parameter value is, the overall rate of sign-up, and that uncertainty is not yet represented by any probability or probability distribution. So let's do that. There are many different ways you could go about this, but I'm just going to use an easy, off-the-shelf solution.
I'm going to represent the uncertainty regarding the overall rate of sign-up by a uniform probability distribution from 0 to 1. That is, by using this probability distribution we're stating that, prior to seeing any data, the model assumes that any rate of sign-up between 0 and 1 is equally likely.
The probability distributions that are used in this way, to represent what the model knows prior to seeing the data, are often called prior distributions or just priors.
All right, so now we have specified the prior, and all uncertainty in the model is represented by probability. If we now look at our checklist, we see that we have data, a generative model, and priors, and we should be ready to roll. Now we just need to fit the model somehow. Again, there are many different ways of doing this, but here is one that is conceptually simple. First, we start with our prior and we draw a random parameter value from it. This time it happened to be 0.21, that is, a rate of sign-up of 21%. Then we take this parameter draw and plug it into our generative model, and we use it to simulate some data.
This time, when we ran our generative model, four out of 16 signed up for a year of salmon. And now we're going to do what we just did, drawing from the prior and simulating data, many, many times, say a hundred thousand times. So we draw from the prior and simulate data, again and again, a hundred thousand times; here we're just looking at the first four draws, but really there are a hundred thousand. Now it's time to bring in what we actually know.
Now it's time to bring in the data, because if there is one thing we know, it is that when marketing did this for real, 6 out of 16 people signed up. So now we're going to filter away³ all those parameter draws that didn't result in data consistent with what we actually observed. That's because we're interested in reality and in what a reasonable rate of sign-up could be in reality, and, in reality, 6 out of 16 people signed up. So we remove the first parameter draw because it didn't result in 6 people signing up, we keep the second as that parameter draw actually resulted in 6 people signing up, we remove the third, and we keep the fourth because it resulted in 6 people signing up, and so on for all the hundred thousand parameter draws. Note that sometimes we keep a certain parameter value and sometimes we filter the very same value away; it all depends on whether the generative model simulated matching data that specific time. So, for example, here we toss the first 21% parameter draw but we keep the fourth.
³ He is choosing, within the sample, the significant part, so that it fits the hypothesis H.
So what did all this work give us? Well, this is what we started with: this is the distribution of the hundred thousand draws from the prior, before we did the filtering step. As the prior was a uniform distribution between zero and one, you shouldn't be too surprised to see that the hundred thousand random draws form a pretty uniform distribution, but if you look carefully you see that the bars actually are slightly different.
Then, after having done the filtering step where we removed all the parameter draws that didn't result in matching data, this is the distribution we ended up with, and this blue post-filtering distribution is actually the answer to our original question about what a likely value of the sign-up rate is. That's because a parameter value that is more likely to generate the data we collected is going to be proportionally more common in this blue distribution; that is, a parameter value that is twice as likely as some other parameter value to generate the data we actually saw is going to be roughly twice as common in this blue distribution. So right away we can see, just by looking at the distribution, that parameter values below 0.1
and above 0.8 almost never resulted in the data we observed, so it should be very unlikely that the sign-up rate for the salmon subscription is below 10% or above 80%. We see that the subscription rate is likely somewhere between 20 and 60 percent, with it most likely being between 30 and 40 percent. But, importantly, we see that the distribution is pretty wide, which means that even after using our not-so-impressive data set of 16 data points, the sign-up rate is still very uncertain. And Bayesian data analysis was all about representing uncertainty with probability, but we still haven't calculated any probabilities. Since we have a distribution of samples, though, that's easy to do.
Say that we want to calculate the probability that the sign-up rate is between 30 and 40 percent. Then we first count up how many parameter draws are between 0.3 and 0.4; here it was 1,900. Then we divide by the total number of draws that survived the filtering step, which was 5,700, and if we divide 1,900 by 5,700, there is a 33% probability that the sign-up rate is between 30 and 40 percent. We can of course do the same calculation for all the bars of the distribution, and what we end up with is a probability distribution over likely sign-up rates.
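Continuing the sketch above, this count-and-divide step is a single line of post-processing (the exact percentage will wobble a little between runs):

```python
in_range = (posterior_draws > 0.3) & (posterior_draws < 0.4)
print(f"P(0.30 < rate < 0.40) ≈ {in_range.mean():.2f}")  # roughly 0.33, as in the lecture
```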
Now it's important to note that this probability distribution doesn't stand on its own; it is always given the assumptions and the data we used. In Bayesian jargon these two distributions are called the prior and the posterior distribution. That's because the prior distribution is what the model knows about the parameters before, prior to, using the information in the data, and the posterior distribution is what the model knows about the parameters after, posterior to, having used the data in the filtering step. Now, the posterior distribution is really the end product of a Bayesian analysis; it contains both the information from the model and the information from the data. And we can use it to answer all kinds of questions.
For example, which sign-up rate is the most probable? This is just the parameter value with the highest probability in the posterior. Since the plot of the posterior is binned into ten-percent bins, it's a little difficult to read from the plot, but the sign-up rate with the highest probability is actually 38 percent, so if we just had to report a single number, 38 percent could be given as a best guess. Now, as we used a uniform prior, this is actually also the parameter value that is the most likely to generate the data we actually observed, and in classical statistics this type of estimate is well known under a specific name. Do you want to guess? It's the parameter value with the maximum likelihood to generate the data we observed: yes, it's the so-called maximum likelihood estimate, and maximum likelihood estimation is one of the most common ways of fitting models in classical statistics. And this is the reason why Bayesian data analysis can be seen as an extension of maximum likelihood estimation: as long as you use flat priors, you will always get the maximum likelihood estimate for free when you fit a Bayesian model.
But there are other ways you can summarize a posterior distribution besides the maximum likelihood estimate. You can take the mean of the posterior distribution, the posterior mean, as another best guess of the rate of sign-up; in this case it's almost the same as the maximum likelihood estimate, but that's not always the case. Or you might want to summarize the uncertainty of the sign-up rate as an interval, and then you can find the shortest interval that covers, say, 90% of the probability. This is often called a credible interval, and here we can see that the 90% credible interval goes from 0.30 to 0.54. So we can state that the sign-up rate is between thirty and fifty-four percent with 90 percent probability.
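Continuing the same sketch, these summaries can be read straight off the surviving draws (the interval below is a central 90% interval, a simple stand-in for the shortest 90% interval used in the lecture):

```python
import numpy as np

# A crude "most probable value": the fullest 10%-wide bin of the posterior draws.
counts, edges = np.histogram(posterior_draws, bins=np.arange(0, 1.05, 0.1))
best_bin = counts.argmax()
print(f"most probable bin: {edges[best_bin]:.1f}-{edges[best_bin + 1]:.1f}")

print(f"posterior mean: {posterior_draws.mean():.2f}")

low, high = np.quantile(posterior_draws, [0.05, 0.95])
print(f"central 90% credible interval: ({low:.2f}, {high:.2f})")
```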
All right, that was a simple example of Bayesian data analysis, and at this point in the tutorial we usually do a small exercise where we replicate the analysis I just described to you. So if you are in front of a computer and have either R or Python installed, I recommend that you pause this video and try out this exercise by following either of the links here. It's a really useful exercise for helping you understand Bayesian data analysis. So pause the video now and do it. I'll be waiting for you when you come back.

Alright, okay, welcome back. I hope the exercise went well, and if it didn't, you should take a look at the solution at the bottom of the exercise page. Now we're going to go through what we just did, so nothing new here, but this time we're going to use a tiny bit of math notation. So the green distribution here is the prior, that's what we started with, and the blue distribution is the posterior, that's what we ended up with after having used the data. So how did we go from the prior to the posterior? Let's take a sign-up rate of 35% as an example. First, a sign-up rate of 35% had to be drawn from the prior, and that it did with some probability; here the P with parentheses stands for the probability of drawing a parameter value of 35 percent from the prior. Then, in order to not throw this parameter draw away, it had to simulate data which matched the data we actually observed. And that it also did with some probability; here the p with the vertical bar, where the bar should be read as "given", stands for the probability of generating six sign-ups given a parameter value of 35 percent. And by multiplying these two probabilities together we get the probability of first generating a sign-up rate of 35 percent and then simulating data that matched the data we observed. And this will be proportional to the probability of 35 percent being the parameter value, given the data.
This is in the same way as before, where the 1,900 parameter draws in the posterior between 30 and 40 percent were proportional to the probability of the sign-up rate being in that range. But it wasn't the probability; 1,900 is a count of parameter draws, so it wasn't a probability until we divided by the total number of draws. And in the same way, here we divide by the total sum of the probabilities of generating the data for all the parameter values. And this gives us the probability of a sign-up rate of 35 percent given the data. And of course we could use the same procedure for all the other sign-up rates to retrieve a full probability distribution. Again, this was nothing new; this is just what we did before, but now using a little bit of probability notation.
So, what have we done? We have specified prior information and a generative model, and we have calculated the probability of different parameter values given the data. In this example we used a binomial rate model with one parameter, but the cool thing here is that the general method works for any generative model with any number of parameters. That is, you can have any number of unknowns, also known as parameters, that you plug into any generative model that you can implement, and the data can be multivariate or can consist of completely different data sets. And the Bayesian machinery that we used in the simple case works in the same way here. Now, the equation down at the bottom here is just a generalized version of the one we used in the salmon subscription problem, where D is the data and θ (theta, the symbol with the dash through it) stands for the parameters.
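Written out (again a reconstruction, as the slide's equation is not reproduced in the text), the generalized version is

$$
P(\theta \mid D) \;=\; \frac{P(D \mid \theta)\,P(\theta)}{\sum_{\theta} P(D \mid \theta)\,P(\theta)}
$$

with the sum replaced by an integral when the parameters are continuous.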
So this equation isn't anything new; it's just what we did before, and that equation is what's usually called Bayes' theorem. So there it is.

Now I need to mention that the specific computational method we used in the salmon subscription problem only works in rare cases. It's called approximate Bayesian computation, and what is specific to this method is that you code up a generative model, then simulate from it and only keep parameter draws that match the data. It's a good method because it's conceptually simple, but it's a bad method because it can be incredibly slow and scales horribly with larger data if you use it naively. But there are many, many faster methods, faster because they can take computational shortcuts. The important part is that if one of those faster methods works, the end result will be the same as if you had used approximate Bayesian computation; you will just get the answer much, much faster. So if you hear about a cool method like Hamiltonian Monte Carlo, you don't need to be too nervous if you don't know how it works, because it's just another method of fitting Bayesian models, and when it works, the output will be the same as if you had fitted the model using approximate Bayesian computation. But Hamiltonian Monte Carlo will probably get you a result today rather than in a hundred years, say.
So now we have talked about what Bayesian data analysis is. Let's talk a little bit about what it is not. First, it's not a category of models; it's not like you have regression models, decision trees, neural networks and Bayesian models. Bayesian data analysis is more a way of thinking about and constructing models, and many parts of statistics and machine learning can be done in a Bayesian framework. So you can have Bayesian regression models, Bayesian decision trees, Bayesian neural networks and so on. Bayesian data analysis is also not more subjective than other types of statistics, as far as I can see; or rather, all statistics is equally subjective. Classical methods also require that you make some assumptions, and the result always has to be interpreted in the light of those assumptions. It's the same in Bayesian statistics and classical statistics. Also, even though Bayesian methods have become more popular quite recently, starting in the 90s when suddenly everybody got a PC, it's important to remember that it is not anything new. It's actually called Bayesian because of this guy, Thomas Bayes, who lived in the 1700s. It's named after him because he failed to publish an essay on solving a specific probability problem, which was then published after his death. But Bayesian statistics should really be named after this guy, Pierre-Simon Laplace, who was the first to describe the general theory of Bayesian data analysis. But he didn't call it Bayesian; rather, he called it inverse probability, which sort of makes sense, as we use probability to go backwards from what we know, the data, to figure out what we don't know, the parameters. The first known person to have used the word Bayesian was actually Ronald Fisher, the guy who popularized the p-value, and he didn't mean it as a compliment, because he famously loathed Bayesian statistics.
So maybe Bayesian data analysis is not the best of names, and a better name would actually just be probabilistic modeling, because that's really just what it is. All right, that concludes part 1 of this three-part introduction to Bayesian data analysis. Now that we know what Bayesian data analysis is, we're going to, in part two, take a look at why you would want to use Bayesian data analysis. Again, I'm Rasmus Bååth, and thanks for staying with me to the end.

[COURSE II]

Introduction to Bayesian data analysis - Part 2: Why use Bayes?

This is part 2 of a 3-part introduction to Bayesian data analysis, which will go into the why of Bayesian data analysis. If you haven't checked out part 1 yet, I really recommend you do that first.
So, why use Bayesian data analysis? Why could it be a useful approach rather than using, say, classical statistics? Well, I'm going to give you a couple of reasons.
One reason to use Bayesian data analysis is that you have great flexibility when building models and can focus on that rather than on computational issues.
Now, if you've done some Bayesian modeling before, this might sound a little bit strange to you, as there are often computational issues when you want to fit your model. What I mean here is that, since there is a very clean separation between specifying and fitting a model in a Bayesian framework, you often don't have to focus too much on how your model is going to be computed when you construct it.
That means that you can focus on what assumptions are reasonable and what information you should use, rather than on algorithms, when doing the actual modeling. And with the many good tools that help you fit Bayesian models, like Stan, JAGS and PyMC, there is a good chance that just specifying the model actually is enough, if it's not too complicated. So let me give you an example of how easy it is to change a Bayesian model while the computation stays the same. This is the CEO of Swedish Fish Incorporated, and he is telling us: "I've come up with a new brilliant way of marketing our salmon subscription service."
So I guess we no longer have only one method to advertise salmon subscriptions with, and that means it's time to bring back the B in A/B testing. Remember that method A involves sending out a colorful brochure to advertise the salmon subscription service, and when marketing tried this on 16 randomly selected Danes, 6 out of 16 signed up.
The new method our CEO proposes, let's call it method B, involves sending out the very same colorful brochure, but this time accompanied by a sample frozen salmon, and marketing has actually already tried this method on another 16 Danes, and this time 10 out of 16 signed up.
So what we now want to know is: which seems to be the better method? Sure, there is some evidence that method B is better, but how certain or uncertain should we be that this is the case? So what we want to do is to specify and fit a Bayesian model that helps us answer these questions.
This is the model we had before, when we just had one advertising method. We drew a rate of sign-up from one prior and ran a generative model that gave us one simulated data set. But now we have two advertising methods, and the cool thing here is that all we need to do is to copy and paste the one-group model. So instead we draw two rates of sign-up independently from two priors and separately run two generative models to simulate two data sets.
This is the only change we need to make to fit this new model. We can use the same procedure as we used in part one of this tutorial, the one going under the long name approximate Bayesian computation.
So here we again first draw fixed parameter values from the priors. This time we happened to draw a sign-up rate of 20% for method A and a rate of 72% for method B, and then we plug these parameter draws into the generative models and simulate some data. This time we got four sign-ups for method A and 10 sign-ups for method B.
But we keep these parameter draws only if the simulated data match the actual data. And this time they didn't, so we're going to filter them away.
Sure, for method B the simulated data match the actual data, since in reality we got 10 sign-ups, but it doesn't match for method A, as in reality we got 6 sign-ups there.
And we want all the simulated data to match the real data, so these parameter draws have to go.
So we do it again. This time we draw some other parameter values, and when we run the generative models this time, well, what do you know, we simulated data that matched, so we're keeping these parameter draws. And now, as last time, we do this whole draw-simulate-reject procedure many, many times, say a million times.
And what we are left with are two distributions: the distributions of the parameter draws for method A and method B that made it past the rejection filtering step.
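A minimal self-contained sketch of this two-group version in Python (illustrative names; the rejection step is the same as before, just applied to both data sets at once):

```python
import numpy as np

rng = np.random.default_rng(seed=7)

n_draws = 1_000_000
observed_a, observed_b = 6, 10  # actual sign-ups out of 16 for methods A and B

# Draw two sign-up rates independently from two uniform priors.
prior_a = rng.uniform(0, 1, n_draws)
prior_b = rng.uniform(0, 1, n_draws)

# Run the two generative models.
sim_a = rng.binomial(16, prior_a)
sim_b = rng.binomial(16, prior_b)

# Keep only the draw pairs where BOTH simulated data sets match the real data.
keep = (sim_a == observed_a) & (sim_b == observed_b)
rate_a, rate_b = prior_a[keep], prior_b[keep]
print(f"{keep.sum()} of {n_draws} draw pairs survived the filtering step")
```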
Here is this distribution for the rate of sign-up for method A, and since it's the probability distribution over likely parameter values that we got after having used the data, it's what's usually called a posterior distribution. It should look familiar to you, as it is the same as before, when we only had the data for method A. So again, it seems likely that the rate of sign-up for method A is somewhere between 20 and 60 percent, with it most likely being somewhere around 35 percent.
And here is the posterior distribution for method B. Just looking at it, it seems there is some evidence that method B would result in more sign-ups, as the bulk of the distribution is between 40 and 80 percent, with the sign-up rate most likely being around 65 percent.
But this is just eyeballing the posterior distributions, and we really would like to calculate some probabilities, say the probability that method B does have a higher rate of sign-up than method A.
Fortunately, this is very easy to do, as these posterior probability distributions are represented by a long list of parameter draws.
So here are the numbers behind the two posterior distributions. I only show the first eight rows, but there are many, many more rows in this table. Here each row is a pair of parameter draws that, when plugged into the generative models, simulated data matching the actual real data. So the way these parameter draws are distributed represents the uncertainty around what the rates of sign-up could be.
Now, if we calculate new measures, and we do it separately for each row, then we retain this uncertainty, and the resulting distributions of these new measures can also be interpreted as posterior probability distributions, that is, as what is known about these new measures given the model and the data.
So what could such a measure be? Well, since we're interested in whether method A or method B gives the higher rate of sign-up, why not calculate the difference between rate A and rate B? Using some R-like pseudocode, it could look something like this, and when applied to each row it would give us a new column with the distribution of the difference between method A and method B, where a positive number would be in favor of method B.
So now we can take a look at this new derived distribution. Just eyeballing it, we see that it is quite likely that method B has a higher rate of sign-up: almost all of the probability is to the right of the zero mark, with rate B most likely being around 25 percentage points higher than rate A.
Again, since we 07:55 are working with a table of parameter draws, it is very easy to calculate the
07:59 probability that rate B is higher than 08:01 rate A. We simply count how many rows of
the rate difference are above zero, that 08:07 is, how many times rate B was
higher than 08:10 rate A, and then we divide by the total 08:13 number of draws. This time we
get that 08:16 92% of the rate difference distribution 08:19 is above zero, that is, there is a 92%
probability that rate B is better than rate A.
To arrive at this probability we 08:28 didn't need to change the way we fitted the model;
we could use the same method 08:32 as when we just had data for method A.
All we needed to do was change the model, 08:37 adding a prior and a generative model
for method B, and then we just did some 08:42 simple post-processing of the posterior draws
using basic arithmetic.
08:49 So another reason to use Bayesian data analysis is that it allows you to 08:54 include
information sources in addition to the data, for example expert opinion.
09:00 Here again is the CEO of Swedish Fish Incorporated, and he's come to tell us 09:05
that the signup rate has never been higher than 20%, not even in Norway, and 09:11 it's usually
between 5% and 15%.
Now, I'm not really sure exactly how much we 09:19 should trust our CEO. I mean, I don't
know what he's been smoking, but for 09:25 now let's roll with this new information and
see how we can include this expert 09:30 opinion in the model. Again, this is the model we have
so far; I've dropped 09:36 method B for the time being, so now we're back to just estimating
the rate of 09:41 signup for method A.
So how can we include the CEO's information? Well, a 09:47 natural place to include it is
in the prior, what the model knows about the 09:53 rate of signup before seeing the data.
What we need to do is to change the 09:58 prior from a uniform prior, which basically says
that any rate between 10:03 zero and 100% is equally likely, to a more informative distribution
that 10:08 favors values between 5 and 15 percent. Now, there are many ways to define custom
10:14 prior distributions: we could stitch together a couple of uniform 10:18 distributions where
we put more probability on the range 10:21 covering 5 to 15%, or we could even draw a
probability distribution with pen and 10:26 paper and scan it in. But often the easiest solution
is to use a standard 10:31 probability distribution that is 10:33 flexible enough to represent
the 10:35 information that we have, and that we 10:37 will tweak until it represents that 10:39
information.
For us, a good choice would 10:42 be the beta distribution. The beta is 10:45 a continuous
distribution bounded 10:47 between 0 and 1, which is good because 10:50 the rate of signup
can't be less than 0% nor more than 100%. It has two parameters, 10:55 alpha and beta, that allow it
to take all the forms depicted here. For example, when 11:01 alpha and beta are one it becomes
a uniform distribution. 11:05
The larger the alpha and beta parameters are, the more peaked it 11:10
will become.
So here is the uniform distribution we're using right now as 11:16 the prior for the signup
rate. A uniform prior is sometimes called a non-informative 11:21 prior, as it really doesn't
contain that much information with regards to 11:25 what the signup rate could be.
And here is a proposal for what a more 11:31 informative prior could be. This is a beta dis-
tribution with the alpha parameter set to 3 and the beta 11:37 parameter set to 25, but the specific
parameter values really don't matter 11:41 here; what matters is the shape of the distribution.
Here I wanted to 11:46 capture the information from our CEO that the rate of signup
usually is 11:51 between 5 and 15 percent, so this 11:54 informative prior puts most of the 11:57
probability between 5 and 15 percent but 12:00 does not rule out the possibility that 12:02 the
signup rate could be up to 30 12:05 percent.
You could certainly capture the 12:07 CEO's information in many other ways, but this is
what we're going to roll with. So 12:13 this is our new model: it's the same as 12:15 before, but
now with the informative 12:17 prior on top. And the cool thing, again, is 12:19 that we don't need
to change the computational part of how we fit this 12:23 model. We can use the same procedure
as before; the only difference is that we 12:28 will draw the parameter draws from our 12:30
new informative prior distribution 12:32 instead of from the uniform distribution as 12:35 before.
Here is a distribution you should 12:40 recognize: it's the posterior probability 12:42 distribution
of the likely rate of 12:44 signup using the uniform, non-informative 12:47 prior, and here is
what we got using the new informative prior.
Looking at it, it 12:54 seems that after having used the info from 12:57 the CEO and the info
from the data, it is 13:00 most probable that the rate of signup is 13:02 between 10 and 30%.
The information in 13:06 the data points toward the rate of signup 13:08 being somewhere around
40%, and the CEO 13:11 stated that it's usually around 5 to 15%, 13:14 so it shouldn't come as
a surprise that 13:16 the resulting posterior distribution 13:18 looks like a mix between these two
13:21 information sources.
Now, if we had more 13:24 data, the information in the prior would 13:27 have less and
less influence; with enough 13:30 data the prior wouldn't matter at all. 13:32 Similarly, if we had
less data, the 13:34 posterior would look more like the prior, 13:36 and if we had no data at all,
the 13:39 posterior would be the same as the prior. 13:41
Now we are in a slightly confusing 13:44 situation, however, in that we have run two 13:48
different models and have two different 13:50 results from the same data set, and at 13:53 some
point we should decide whether we 13:55 want to go with the non-informative prior 13:57 or the
prior from the CEO.
But it's 14:00 totally fine to try out different models 14:03 and different priors, and it can
be 14:05 worthwhile to try out an informative 14:07 prior, because if you're not using an 14:10
informative prior you're leaving money 14:12 on the table, as Robert Weiss puts it.
14:15 That is, if you're not using an informative 14:17 prior you're really leaving out
14:19 information from the analysis that you 14:21 have, which seems like a waste.
All right, 14:26 a third reason why Bayesian data analysis 14:29 is useful is because the result
14:32 of a Bayesian analysis retains the 14:34 uncertainty of the estimated parameters,
14:36 which is very useful in prediction and 14:39 decision analysis. Here, decision analysis
14:43 is when you take the results of an analysis 14:45 and bring them closer to what you care 14:48
about. Usually you don't ultimately care 14:51 about the parameter value; what you care 14:53
about is often things like money, and 14:56 what decision to make to get more of it, or 14:59
what you could do to avoid different 15:00 types of loss.
We never seem to get rid 15:05 of our CEO, and here he is again. He asks 15:09 us: so what
should we do? 15:12 And by the way, marketing forgot to tell 15:14 you that the cost of sending
a brochure 15:16 is 30 kronor, the cost of sending a 15:19 salmon is 300 kronor, 15:20 and if a
person signs up we make a 15:23 thousand kronor on average.
Okay, so 15:27 what should we do? Here are the two 15:32 methods that we are consi-
dering: A, 15:33 sending a colorful brochure, or B, 15:36 sending a brochure and a sample of frozen
15:39 salmon. And this is the result we got 15:41 after having fitted the model with the 15:43
data from both method A and method B; we 15:46 did that before, remember.
And while this 15:50 showed that it is probable that method B 15:52 has a higher rate of
signup, it doesn't 15:54 directly tell us what to do, because 15:56 we're not really interested in
the rate 15:58 of signup; we're really interested in 16:00 which method will give us the most
money. 16:03 And while method B seems to have a 16:05 higher rate of signup, it also involves
16:07 sending out costly sample salmons.
But 16:11 since we did a Bayesian analysis and we 16:13 have access to the raw draws
behind 16:15 these two probability distributions, it's 16:17 very easy to do a quick decision
16:19 analysis to figure out which method will 16:22 probably give us the most money.
So to 16:25 the left here we have the first eight 16:27 rows from the many, many draws
that 16:29 make up these two posterior probability 16:32 distributions.
And the distribution of 16:35 these draws represents the uncertainty 16:38 regarding what
the underlying rates of 16:40 signup are for these two methods.
And 16:42 remember that any calculation we perform 16:45 row-wise here will give us a
new 16:47 posterior probability distribution that 16:49 retains this uncertainty.
So one 16:53 reasonable thing to calculate here would 16:54 be the expected profit when using
16:57 method A, which is the rate of signup 17:00 times the thousand kronor we make per
17:02 signup, minus the cost of sending the 17:04 brochure.
So for the first row that would 17:07 be 33.1% times 1,000 kronor, which means we 17:11
would make 331 kronor on average, minus 17:14 the 30 kronor the brochure costs,
so an 17:17 expected profit of 301 kronor 17:20 for that row, and so on for all the
rows.
And 17:24 similarly we can calculate the expected 17:27 profit for method B, which is
almost the 17:30 same calculation, but now minus 300 17:33 kronor for the salmon.
And finally, 17:36 since we're interested in which of these 17:38 two methods would give
the higher 17:40 profit, we will calculate the 17:43 difference in profit between the methods,
17:45 where a positive difference here means 17:48 method B is better. Just looking at
these 17:51 first eight rows, we see that five out of 17:54 eight rows are actually favoring method
17:56 A, but of course we should look at the 17:59 profit difference distribution for all 18:01 the
rows.
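As a sketch, this decision analysis works row-wise on the posterior data frame from before, using the CEO's numbers (1,000 kronor per signup, 30 kronor per brochure, 300 kronor per salmon; whether the 30 kronor brochure cost should also be added to method B is an assumption left to you):

posterior$profit_a    <- posterior$rate_a * 1000 - 30    # expected profit per person, method A
posterior$profit_b    <- posterior$rate_b * 1000 - 300   # expected profit per person, method B
posterior$profit_diff <- posterior$profit_b - posterior$profit_a  # positive favors method B
hist(posterior$profit_diff)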
So here we see that there is 18:04 much uncertainty regarding which method 18:07 would
give the highest profit.
It could 18:10 be that method B is better, but if we 18:13 count up how many draws
are in favor 18:15 of method A, we actually get that there 18:18 is a sixty-one percent probability
that 18:20 method A would result in better profits.
18:22 So if we had to decide, this small 18:26 decision analysis tells us that we 18:27
should go for method A, even if method B 18:31 has a higher rate of signup.
But the 18:34 main takeaway here should really be 18:35 that, given the data that we have,
there 18:37 is much uncertainty, and we really would 18:40 need some better data before making
a 18:42 decision.
All right, so we went from 18:45 estimated rate parameters to a posterior 18:49 probability
distribution of the likely 18:51 difference in profit between these two 18:53 methods.
And I hope you saw how easy that 18:57 was, since we started from the result of 18:59
a Bayesian analysis, that is, probability 19:02 distributions represented as a long 19:04 table of
parameter draws.
If we instead 19:08 had used classical statistical 19:10 methods like maximum li-
kelihood 19:12 estimation, we would just have gotten out 19:14 point estimates, which we wouldn't
be 19:17 able to post-process into something that 19:19 informed us about the expected profit
19:21 and that included some measure of 19:23 uncertainty regarding the 19:25
expected profits.
But with Bayes it was 19:28 pretty simple.
So, a last reason to use 19:32 Bayesian data analysis (there are many 19:34 more
reasons, but a last one for now) 19:36 is because you probably already are.
What 19:41 I mean here is that a lot of classical statistical procedures that you might 19:44
already be familiar with, such as 19:46 classical linear regression or the 19:49 bootstrap, can be
interpreted as 19:51 Bayesian models with priors and generative models, and the same is true
19:56 for many machine learning procedures.
And 19:58 while you don't have to interpret the 20:01 statistical models you use from a Bayesian
20:03 perspective, it helped me better 20:06 understand what many statistical 20:08 proce-
dures actually do.
Not least was the 20:11 Bayesian perspective super useful for me when understanding how
mixed models and 20:15 hierarchical models work, which are simple and straightforward from
a 20:20 Bayesian perspective but slightly 20:22 mysterious from a classical perspective.
20:26 So those were some reasons for why to use Bayesian data analysis.
Let's look at some 20:34 reasons for why not to use Bayesian data analysis. Maybe
everything is working fine as it is and you're happy with 20:43 your tools and your workflow;
then you might not need Bayesian data analysis. Or 20:48 maybe you're not that interested in
20:49 uncertainty: there are many good machine 20:51 learning tools that just give 20:52 pre-
dictions, with no indication of 20:54 these predictions' uncertainty, and if that's what you 20:57
want, maybe you don't need Bayes. Or 21:00 maybe Bayesian statistics is too 21:03 computationally
demanding: maybe you 21:05 would want to fit a Bayesian model, but 21:06 your data set is so
large it would just take 21:08 too long. Or maybe you just feel that 21:12 Bayesian statistics takes too
much work to 21:15 set up; even if you would want to try it, the 21:17 cost-benefit situation doesn't
allow it. 21:20 These are perfectly good reasons not to 21:24 use Bayes, and what I wanted to say with
21:27 this slide here is that Bayesian data 21:29 analysis is just one tool out of many in
21:32 your data science tool belt, and while it 21:35 can be a very useful tool, it's not the 21:37
be-all end-all of data analytical 21:40 methods, even though it's sometimes 21:42 presented as
that.
So that concludes part 21:47 two of this three-part introduction to 21:49 Bayesian data
analysis. If you want to try 21:52 out what we talked about here, you could 21:54 go back to
your solution to the exercise 21:56 in part 1 and change it according to the 21:58 CEO's request,
that is, try adding an 22:02 informative prior to the model, change 22:04 the model so that it can
accommodate 22:05 data from both method A 22:07 and method B, and you can also try 22:10
replicating the decision analysis where 22:12 we looked at the expected profit of each 22:14
method. However, if you try this you could 22:18 run into some trouble, because when you add
the second data source to the model 22:22 you might find that it takes a really 22:24 long time to
run. That's because the method we used to fit Bayesian models in part one, approximate Bayesian
computation, 22:32 was conceptually simple but also extremely slow. So in part three of this
introduction I will give you some hints 22:40 on how you can do speedy Bayesian computation.
And especially we will look 22:45 at a useful tool called Stan. But for now, I'm Rasmus Bååth, and
22:50 thanks for staying with me to the end.
[CURSO III]

2 Introduction to Bayesian data analysis - part 3: How to do Bayes?

Hallå, hello as we say in Sweden. I'm 00:03 Rasmus Bååth and welcome to this last 00:05 part
of a three-part introduction to 00:07 Bayesian data analysis, which will go 00:10 into the how
of Bayesian data analysis: how 00:13 to do it efficiently and how to do it in 00:15 practice. But
I'll warn you, we will just 00:18 make the tiniest scratch on the 00:20 far-reaching surface that is
Bayesian 00:22 computation.
The goal is just for you to 00:24 get some familiarity with words like Markov chain Monte
Carlo and parameter 00:29 space, and to be able to run a simple 00:32 Bayesian model using a
powerful Bayesian computational framework called Stan. Also, 00:37 if you haven't watched part
1 and part 2 00:39 yet, and especially if you didn't do the exercise in part 1, I really recommend
that you do that first, because what's 00:46 going to follow isn't going to make much sense
otherwise. Also, this part 3 is quite a lot longer than the other parts, so feel free to take a break at
any time.
00:57 So what is the problem with how we have 01:00 performed Bayesian data
analysis so far? Well, we really coded things from scratch, and from my experience that is slow
and 01:09 error-prone, and it's especially slow because we've been doing approximate Bayesian
computation, which is the most 01:16 general and conceptually simplest method, but also the
slowest method for fitting a Bayesian model.
But there are many 01:23 faster methods for fitting Bayesian models, methods that are faster
because they take 01:28 computational shortcuts, and these faster methods have in common that they
require 01:33 that the likelihood that the generative model will generate any given data can be 01:38
calculated rather than simulated.
So what we did 01:43 when we did approximate Bayesian 01:44 computation was that
we defined a generative model function that took a 01:49 fixed parameter value and simulated
some 01:52 data, and we figured out the likelihood of the generative model simulating the 01:57
actual data by running it many, many times and then counting how many times 02:03 it produced
data matching the actual data. All this simulation can be very, very 02:09 time consuming, and
faster methods instead require a 02:12 function that takes both data and fixed parameters as input and
02:17 directly calculates the likelihood of these parameters producing the data. It's
02:23 not always straightforward to calculate the likelihood for any generative model, but 02:27 for
a large number of generative models someone has already done this work for 02:33 you. For
example, for most common probability distributions the likelihood 02:37 is easy to calculate.
Another thing faster methods have in common is that 02:42 they explore the parameter
space in a 02:45 smarter way. 02:46
Rather than just sampling from the prior 02:49 as we have done, faster methods try to
02:52 find and explore the regions in 02:54 parameter space that have higher 02:55 probability
