Bayesian Inference
to do this is particularly strong in decision theory. Erroneous decisions may
be quite costly, and it may not be reasonable to ignore pertinent information
just because it is not "objective" sample information.
1. A tea-drinker claims she can detect from a cup of tea whether the milk was added before or after the tea.
2. A music expert claims she can tell from a page of score whether it was written by Haydn or by Mozart.
3. A friend claims she can predict the outcome of tossing a fair coin.
In each of these cases the distribution is binomial with n=10 and unknown parameter π. In frequentist statistics we would establish the null hypothesis (π=.5), pick a probability of Type I error (say, in this case, α=.06), and establish a rejection region for the null against a one-tailed (inexact) alternative: reject if R=8, 9, or 10.
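As a quick check (a minimal sketch in Python using scipy, not part of the original calculation), the actual size of the rejection region {8, 9, 10} under the null can be computed directly:

```python
from scipy.stats import binom

# Probability of landing in the rejection region {8, 9, 10} when pi = .5.
alpha = sum(binom.pmf(r, 10, 0.5) for r in (8, 9, 10))
print(alpha)  # ~0.0547, which stays under the alpha = .06 ceiling
```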
Now, let’s suppose that in each case the number of correct answers equals 8
(which would be in the Neyman-Pearson rejection region). Solely in terms
of the data, we would be forced to draw the same inferences in each case,
and make the same decision: reject the null hypothesis. But our prior beliefs
suggest that we are likely to remain skeptical about the friend's coin-guessing
ability, slightly impressed by the tea-drinker, and not at all
surprised by the music expert. Bayesians would say we should be more
willing to accept the music expert’s claim based on this sample evidence
than we should be to accept the friend’s claim, because of our prior beliefs.
The essential point for Bayesians is that experiments are not abstract
devices. Invariably we have some knowledge about the process being
investigated before obtaining the data and it is sensible that inference be
based on the combined information that the prior knowledge and the data
represent.
Given the much-documented criticisms that have been leveled at frequentist statistics, it may seem a little surprising that Bayesian techniques are not more commonly used in the social sciences. There are probably a number of reasons for this general lack of use. First, Bayesian methods do allow for (although they don't require) a subjectivist definition of probability, with which many people are uncomfortable. Second, Bayesian techniques are much less amenable to a "cook-book" style of teaching than frequentist techniques – at first glance they appear to require more technical knowledge than many social scientists possess or desire. Third, Bayesian methods are seemingly computation-intensive.
Recently, however, the use of Bayesian methods has been increasing, partly
because improvements in computation have made these methods easier to
apply in practice and partly because the Bayesian approach seems to be able
to get useful solutions in some applications where frequentist approaches
cannot.
Recall Bayes' theorem. For events,

$$P(A_k \mid B) = \frac{P(A_k \cap B)}{P(B)} = \frac{P(B \mid A_k) \cdot P(A_k)}{\sum_{i=1}^{n} P(B \mid A_i) \cdot P(A_i)}$$

Restated for a discrete parameter θ and sample data X,

$$P(\theta_k \mid X) = \frac{P(X \mid \theta_k) \cdot P(\theta_k)}{\sum_{i=1}^{n} P(X \mid \theta_i) \cdot P(\theta_i)}$$
The left-hand-side term of the above equation is known as the posterior probability. It is the probability that the population parameter θ is equal to θk, given the sample (reflected in the sample statistic X). The numerator of the right-hand side is the likelihood of the sample (given the population parameter) times the prior probability that the population parameter θ is equal to θk. The denominator of the right-hand side is the sum of the likelihoods times the priors over all possible values of θ.
The marginal probability of the data is the probability of the data averaged over all possible hypotheses (θ) in the sample space (S). Hence, it is a weighted average of sorts. Specifically, in the discrete case:
$$P(\text{data}) = \sum_{i=1}^{n} P(X \mid \theta_i) \cdot P(\theta_i)$$
Remember that for continuous distributions the summation sign is replaced
by the integral sign, but the intuitive meaning of the equation remains
unchanged.
1. Prior Information: All problems are unique and have their own
context from which prior information can be derived. It is the
formulation and exploitation of this prior knowledge which sets
Bayesian inference apart from frequentist inference.
A number of substantive objections to Bayesian inference have been raised
by frequentist statisticians.
purely subjective determination. Similarly, the decision to declare a statistically significant result scientifically meaningful (i.e., to establish a meaningful alternative hypothesis) is a subjective decision.
Consider the researcher who believes a coin is either a fair coin (so, π, the
probability of heads is 0.50) or a trick coin (in which case π is .75). There
are no other possible values the population parameter can take on. The
researcher believes that the coin is likely a trick coin and assigns a prior
probability of .7 to that value of the parameter (π=.75) and a prior
probability of .3 to the “fair” value of the parameter (π=.50). He intends to
flip the coin ten times and use the results to make an inference.
The coin flip is a Bernoulli trial and the number of successes in ten flips is a
binomial random variable. Once he flips the coin, he will get a certain
number of heads (somewhere between 0 and 10). The number of heads he
gets (R) is the sample information.
We already know that the likelihood function L(π | R) is just equal to the probability of R given π, P(R | π), which is easily calculated from the formula of the binomial distribution (or EXCEL, which is even easier). Earlier we calculated the probabilities for each possible value of R, given π=.5 and π=.75. In frequentist terms, these were the null and alternative distributions.
Table 4 (again)

r     P(R=r | π=.5, n=10)    P(R=r | π=.75, n=10)
0     0.001                  0.000
1     0.010                  0.000
2     0.044                  0.000
3     0.117                  0.003
4     0.205                  0.016
5     0.246                  0.058
6     0.205                  0.146
7     0.117                  0.250
8     0.044                  0.282
9     0.010                  0.188
10    0.001                  0.056
Let’s suppose the researcher flips the coin ten times and obtains a head
seven times (R=7). Now we are in a position to use Bayes’ theorem to
update the researcher’s prior probabilities:
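Using the likelihoods from Table 4, P(R=7 | π=.5) = .117 and P(R=7 | π=.75) = .250:

$$P(\pi = .75 \mid R = 7) = \frac{(.250)(.7)}{(.250)(.7) + (.117)(.3)} = \frac{.1750}{.2101} \approx .83$$

$$P(\pi = .50 \mid R = 7) = \frac{(.117)(.3)}{.2101} = \frac{.0351}{.2101} \approx .17$$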
What would happen if the researcher’s priors had been different? Let’s
suppose that the researcher had previously thought there was a 50-50 chance
the coin was rigged. Now, using Bayes’ theorem:
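The arithmetic is the same, but with both priors set to .5:

$$P(\pi = .75 \mid R = 7) = \frac{(.250)(.5)}{(.250)(.5) + (.117)(.5)} = \frac{.1250}{.1835} \approx .68$$

$$P(\pi = .50 \mid R = 7) = \frac{(.117)(.5)}{.1835} = \frac{.0585}{.1835} \approx .32$$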
Not surprisingly, the lower prior of π=.75 reduces the posterior as well. The
sensitivity of the results to the choice of prior is sometimes described as a
measure of how much “signal” there is in the data.
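The whole discrete updating calculation is easy to mechanize. Here is a minimal sketch in Python (using scipy for the binomial PMF; the function name is mine, not First Bayes'):

```python
from scipy.stats import binom

# Likelihoods of R = 7 heads in n = 10 flips under the two candidate values of pi.
n, r = 10, 7
pis = [0.50, 0.75]
likelihoods = [binom.pmf(r, n, p) for p in pis]  # ~0.117 and ~0.250

def posterior(priors):
    """Bayes' theorem for a discrete parameter: prior times likelihood, normalized."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    marginal = sum(joint)  # P(data), the normalizing constant
    return [j / marginal for j in joint]

print(posterior([0.3, 0.7]))  # ~[0.17, 0.83] -- the researcher's original priors
print(posterior([0.5, 0.5]))  # ~[0.32, 0.68] -- the 50-50 priors
```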
Notice, in the above two examples, that the prior probabilities sum to unity, and the posterior probabilities sum to unity. However, the likelihoods do not. This is because the two likelihoods come from two different binomial distributions: one is P(R=7 | π=.5, n=10) and the other is P(R=7 | π=.75, n=10). Thus there is no restriction that they must sum to unity. Notice also that the prior probability of every possible value of π other than π=.50 and π=.75 has implicitly been set to zero. (Obviously this is a somewhat contrived example, designed to simplify the arithmetic.) One of the implications of that is that the posterior probability of every possible value of π other than π=.50 and π=.75 will equal zero, no matter what the sample reveals. This result is simple arithmetic: the posterior is equal to the prior times the likelihood, and if the prior is zero the posterior will also be zero. In Bayesian analysis, when you set a prior probability equal to zero, you have determined that it is impossible for the population parameter to equal that value.
In general,

$$P(\theta \mid X) \propto P(X \mid \theta) \cdot P(\theta)$$

where the symbol ∝ means "is proportional to". Essentially, P(data) is a normalizing constant that ensures that the posterior probabilities sum to unity.
Conjugate Priors
At this stage students may be wondering why the Bayesian approach is
regarded as so much more technically difficult than the frequentist approach.
Thus far it just seems like a pretty simple extension of the frequentist approach, allowing prior information to be incorporated. The problem can be understood by looking again at the formula:
$$P(\theta_k \mid X) = \frac{P(X \mid \theta_k) \cdot P(\theta_k)}{\sum_{i=1}^{n} P(X \mid \theta_i) \cdot P(\theta_i)}$$
β, where α is the reciprocal of λ. The hyperparameter α is called the shape parameter, and the hyperparameter β is called the location or scale parameter. The PDF of the gamma distribution is pretty awful looking and I won't bother giving it to you here.
As you can see from the diagram, holding the scale parameter constant, the larger the shape parameter the more symmetrical the distribution becomes. If the shape parameter were set to 100, the distribution would look like a normal distribution, and this would be true for any value of the scale parameter.
Figure 1
Now let’s consider two different Gamma distributions, one with α = 5 and
β=2, and a second one with α=5 and β=1. The two distributions are plotted
in Figure 2. The curve in blue corresponds to shape (α)=5 and scale (β) = 2
and the curve in red corresponds to shape (α)=5 and scale (β) = 1.
As you can see from Figure 2, the larger the scale parameter the further to the left is the centre of mass of the curve. Setting the scale parameter even larger, say to 10, would move the median to 0.57, and setting it to 20 would move the median to 0.36.
Figure 2
In addition, one of the primary uses of the distribution is a strictly
mathematical one – it serves as the parent to the PDF of a number of very
important statistical distributions – including the normal and the chi-square.
Notice that First Bayes, in its wisdom, puts the scale parameter before the shape parameter. EXCEL does it the other way around. First Bayes also helpfully calculates some of the moments. The mean of this member of the gamma family is 2.5 (α/β) and the variance is 1.25 (α/β²).
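These moments are easy to verify outside First Bayes. A minimal sketch in Python, assuming scipy (which parameterizes the gamma by shape and scale, where scale = 1/β under the rate convention used here):

```python
from scipy.stats import gamma

# Gamma with alpha = 5, beta = 2 (rate convention): mean alpha/beta, var alpha/beta^2.
alpha, beta = 5, 2
dist = gamma(alpha, scale=1/beta)
print(dist.mean())  # 2.5
print(dist.var())   # 1.25
```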
By hitting the “Plot” button you can ask First Bayes to plot out the
distribution. The plot of the distribution should be the same as the blue curve
in Figure 2, and it is.
A random variable X has a beta distribution with hyperparameters (α, β), α,
β > 0, if
$$f(x) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}, \qquad 0 < x < 1,$$

where

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha + \beta)}$$
Like the gamma, the beta is a very flexible distribution that can be used to
model a number of different phenomena.
Its mean and variance are

$$E(X) = \frac{\alpha}{\alpha + \beta}, \qquad V(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$
To plot out the beta, again we can use First Bayes. In the distributions menu
simply enter α=2 and β= 10 (First Bayes calls these parameters p and q),
and hit “Next”.
First Bayes automatically calculates the mean, median, variance, and quartiles of the beta(2,10) distribution. Notice that it calculates the mean as 0.16667, which is indeed 2/(2+10). It also calculates the variance as 0.01068, which is indeed (2)(10)/[(2+10)²(2+10+1)].
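The same numbers can be reproduced in Python (a quick sketch, again assuming scipy):

```python
from scipy.stats import beta as beta_dist

# Beta(2, 10): mean alpha/(alpha+beta), variance per the formula above.
d = beta_dist(2, 10)
print(d.mean())  # 0.16667
print(d.var())   # 0.01068
```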
To get a feel for how the two parameters play out, the beta(2,20) and the
beta(4,10) have been plotted out (in the two figures on the next page) by
First Bayes.
By comparing the three plots, it's pretty easy to see that increasing the first parameter gives the beta a more symmetric-looking distribution, and increasing the second parameter moves the distribution's centre of gravity further to the left.
Back to Conjugate Priors
If the likelihood function is binomial, and the PDF of the prior probability distribution is beta, then the posterior will also be beta, albeit with different parameters. It's easiest to illustrate with an example. Suppose you have a coin that you think is probably fair, or close to it. Your prior probability might be described by the beta(50,50), which looks like this:
Now you flip the coin ten times. Let a success be a "heads" coming up, and let X be the number of successes in ten flips of the coin. This, of course, is a binomial experiment. From Note 8 we remember that the binomial likelihood function looks like this:
$$L(\pi \mid x_1, x_2, x_3, \ldots, x_{10}) \equiv p(x_1, x_2, x_3, \ldots, x_{10} \mid \pi) = \prod_{i=1}^{n} p(X_i = x_i \mid \pi)$$

$$= \prod_{i=1}^{10} \pi^{x_i} (1 - \pi)^{1 - x_i} = \pi^{\sum x_i} (1 - \pi)^{n - \sum x_i}$$
If the prior is a beta(50, 50) and we get three successes in 10 trials, the
posterior distribution is a beta(53, 57). Plotting the prior, likelihood, and
posterior distributions on one graph gives us this.
Here's what's happened. The researcher thought the coin was probably fair, or at least close to it. He was pretty sure that the true value of π was somewhere between 0.4 and 0.6. Flipping only three "heads" in 10 trials doesn't seem very convincing to him, so he has only modified his beliefs slightly.
Suppose he had a lot more data. Let’s say he had flipped the coin 100 times,
and “heads” came up on 30 of those flips. The maximum likelihood
estimate would still be the same, 0.3. But what would the Bayesian’s new
posterior distribution be?
You can probably guess that the posterior distribution is going to be beta(80,120). In general, the α of the posterior is equal to the α of the prior plus the number of successes (Σx), and the β of the posterior is equal to the β of the prior plus the number of failures (n−Σx).
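Here is a minimal sketch of the beta-binomial update in Python (assuming scipy; the helper name update is mine):

```python
from scipy.stats import beta as beta_dist

def update(a_prior, b_prior, successes, n):
    """Beta-binomial conjugate update: add successes to alpha, failures to beta."""
    return a_prior + successes, b_prior + (n - successes)

print(update(50, 50, 3, 10))      # (53, 57)  -- ten flips, three heads
print(update(50, 50, 30, 100))    # (80, 120) -- a hundred flips, thirty heads
print(beta_dist(80, 120).mean())  # 0.4 -- pulled from the prior's 0.5 toward the data's 0.3
```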
We can see that the researcher now thinks it is very likely that π is less than
0.5 (almost no area under his posterior distribution is to the right of 0.5).
If the likelihood is Poisson and the prior is gamma, the posterior will also be gamma. For example, suppose someone is trying to estimate the number of days a machine will run between breakdowns. The researcher's prior is that the average number of days (λ) is somewhere between 3 and 5. Her prior distribution might be described by a gamma distribution with shape parameter 40 and scale parameter 10. We know that's a pretty normal-looking curve (a little skewed) with mean equal to 4 (shape/scale) and variance equal to 0.4 (shape/scale²). Any combination of shape and scale parameters with shape/scale=4 will give us a mean of 4, and the larger the numbers the smaller the variance.
If we ask First Bayes to plot out that member of the gamma family for us, we will see that it is indeed a pretty normal-looking curve with mean equal to 4 and variance equal to 0.4.
Now we collect a sample of 30 observations (30 intervals between breakdowns over a six-month period). The sample is [7, 1, 12, 6, 6, 1, 17, 5, 2, 7, 0, 0, 0, 2, 6, 1, 10, 3, 4, 8, 10, 3, 4, 2, 3, 0, 1, 4, 12, 13]. We could use that sample to calculate the maximum likelihood estimate of the mean number of days between breakdowns, but instead we'll ask First Bayes to calculate the posterior probability:
We can see that the posterior distribution is also a gamma, now with shape
parameter 190 and scale parameter 40. Plotting the three distributions gives
us this:
Before collecting the sample the researcher thought the average was somewhere in the 3-5 range. Now, having analysed the sample, she thinks it's probably in the 4-6 range. Previously she would have said there was about a 50% chance the true value was between 2 and 4; now she would say there is almost no chance it is below 4.
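The same update can be sketched in Python (assuming scipy; for a Poisson likelihood the posterior shape is the prior shape plus Σx, and the posterior "scale" — a rate, in the convention used here — is the prior value plus n):

```python
from scipy.stats import gamma

# Gamma-Poisson conjugate update, using the breakdown data from the text.
data = [7, 1, 12, 6, 6, 1, 17, 5, 2, 7, 0, 0, 0, 2, 6, 1, 10, 3, 4, 8,
        10, 3, 4, 2, 3, 0, 1, 4, 12, 13]
a_post = 40 + sum(data)  # 40 + 150 = 190
b_post = 10 + len(data)  # 10 + 30  = 40

post = gamma(a_post, scale=1/b_post)  # scipy's scale is 1/beta here
print(post.mean())  # 4.75
print(post.cdf(4))  # ~0.015 -- almost no posterior probability below 4
```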
particular type of prior distribution. This is a fundamental issue, and will be
discussed again later. However, some points should be noted:
3. Often we might have a "rough idea" of what the prior should look like
(e.g. perhaps we could give a mean and variance), but cannot be more
specific than that. In such situations we could use a "convenient"
form for the prior which is consistent with our beliefs, but which also
makes the mathematics relatively straightforward.
Uninformative Priors
One issue of concern in Bayesian inference is how strongly the particular selection of a prior distribution influences the results of the inference. Particularly if results are to be used by people who may question the prior distribution used by the researcher, it is desirable to have enough data that the influence of the prior is slight. An uninformative prior (sometimes called a flat prior) is one that provides little or no information. Depending on the situation, uninformative priors may be quite diffuse, may avoid only impossible or quite preposterous values of the parameter, or may not have modes. Bayesian analysis using uninformative priors sometimes gives results similar to those obtained by traditional frequentist methods.
Point Estimation
Bayesians are typically less interested in point estimation than are
frequentists. Bayesians are more interested in all the information contained
in the posterior distribution. However, it is possible to use Bayesian
techniques to provide point estimates by simply taking the maximum point (the mode) of the posterior distribution. Notice that the posterior distribution is proportional to the likelihood function when the prior is uninformative. In such cases Bayesian analysis amounts to the same thing as maximum likelihood analysis. Where the priors are informative, the results will not be the same. For example, in the breakdown example a frequentist would simply calculate the arithmetic average, 150/30 = 5. That would be the point estimate of the average number of days between breakdowns. The Bayesian point estimate (and remember, point estimation is not a key function of Bayesian analysis) is 4.725. These numbers are close, because the "signal" contained in the data is strong. In the coin-flip example, the frequentist estimate would be 0.3 (three successes in 10 flips) but the Bayesian estimate was 0.48. The reason the sample had less influence on the posterior here is that the sample size was only ten in this example.
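Both Bayesian point estimates quoted above are posterior modes, which have simple closed forms for these families. A quick sketch:

```python
# Posterior modes (MAP estimates) for the two worked examples.
# Mode of beta(a, b) is (a-1)/(a+b-2); mode of gamma(a, rate b) is (a-1)/b.
coin_mode = (53 - 1) / (53 + 57 - 2)  # beta(53, 57) posterior
breakdown_mode = (190 - 1) / 40       # gamma(190, 40) posterior
print(coin_mode)       # ~0.481, the 0.48 quoted above
print(breakdown_mode)  # 4.725
```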
Interval Estimation
In Bayesian analysis the term used in interval estimation is “credible
interval” rather than “confidence interval”. It’s actually pretty easy to
calculate, since we have the full posterior distribution. For example, with the coin-flip example, if the posterior distribution is beta(53, 57) we can calculate the CDF of the posterior by typing =BETADIST(X,53,57,0,1) into a cell in EXCEL and it will return the probability that the population parameter in question is less than or equal to X (obviously you have to put a number in for X).
Normally, the computer program you use will be able to calculate the HDR (highest density region) for you. Here is a 90% HDR calculated for the coin-flip example: [0.4037, 0.55987].
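An equal-tailed interval is even easier to compute directly. A sketch in Python (assuming scipy; for a near-symmetric posterior like this one it comes out close to the HDR):

```python
from scipy.stats import beta as beta_dist

# 90% equal-tailed credible interval from the beta(53, 57) posterior.
post = beta_dist(53, 57)
print(post.ppf(0.05), post.ppf(0.95))  # roughly (0.40, 0.56)
```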
Of course, one of the big advantages of the Bayesian credible interval is that
the “natural” interpretation is also the correct one. The Bayesian credible
interval is the probability, given the data, that the true value of the
population parameter lies within the specified interval. The frequentist
confidence interval is the region of the sampling distribution for the statistic
(not the population parameter) such that, in repeated samples, one would
expect the interval, calculated in that way, to “cover” the population
parameter a certain percent of the time.
Often frequentist confidence intervals and Bayesian credible intervals appear similar. If Bayesians use uninformative priors and there are a large number of observations (often several dozen will do), HDRs and frequentist confidence intervals will tend to coincide numerically. The interpretation, of course, remains very different.
Hypothesis Testing
Bayesian hypothesis testing is less formal than non-Bayesian varieties. In fact, Bayesian researchers typically summarize the posterior distribution without applying a rigid decision process. If one wanted to apply a formal process, Bayesian decision theory is the way to go, because it provides a posterior probability distribution over the parameter space and one can make expected-utility calculations based on the costs and benefits of different outcomes. This is a great strength of Bayesian analysis. However, without some decision-theoretic analysis of the costs and benefits of making various mistakes about parameter values, hypothesis testing in the Bayesian approach is kind of useless. Calculating HDRs makes much more sense.
One simple summary is the posterior odds ratio of the two hypotheses:

$$\frac{P(\theta_0 \mid X)}{P(\theta_1 \mid X)}$$
One possible decision rule would be to choose the null anytime the posterior
odds ratio is greater than one, otherwise choose the alternative, but
remember that Bayesians are not that excited about this style of hypothesis
testing.
One of the main objections to Bayesian methods is the subjective input required in specifying the prior, and the way that input affects the posterior probabilities. To partially alleviate this difficulty, Bayes Factors are often computed. The Bayes Factor is
$$BF = \frac{P(X \mid \theta_0)}{P(X \mid \theta_1)}$$
To see how this relates to the posterior probabilities of the hypotheses, note:
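Applying Bayes' theorem to each hypothesis (the P(X) terms cancel) gives the standard identity:

$$\frac{P(\theta_0 \mid X)}{P(\theta_1 \mid X)} = \frac{P(X \mid \theta_0)}{P(X \mid \theta_1)} \cdot \frac{P(\theta_0)}{P(\theta_1)}$$

That is, the posterior odds equal the Bayes Factor times the prior odds, so the Bayes Factor measures how much the data alone shift the odds, independently of the subjective priors.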