
11

Bayesian Inference

Frequentist inference is based entirely on sample information. In general,
certain assumptions are made about the population or process being sampled
from, and these assumptions determine the sampling distributions for
statistics such as the sample mean, sample variance, or sample proportion.
On the basis of these sampling distributions, properties of estimators and
tests of hypotheses can be investigated, and "good" estimators and tests can
be found.

Generally speaking, frequentist statistics consists of inferential and
decision-making procedures that formally are based solely on sample evidence,
although other information may be used to determine the null and alternative
hypotheses and to set the probabilities of Type I and Type II errors. In fact,
non-sample information must be used for those choices, since the same sample
evidence cannot be used both to determine the hypotheses to be tested and to
perform the tests. Further, properly speaking, decisions about the null and
alternative hypotheses and the probabilities of Type I and Type II errors
must be made before the data are collected.

Another approach to statistical inference and decision exists in which
non-sample information is used in a formal way in inferential procedures. The
mechanism used to combine information is Bayes' theorem and, as a result,
the term Bayesian is used to describe this general approach to statistics.
Bayesian and frequentist statistical inference take fundamentally different
viewpoints toward statistical decision making.

The frequentist view of probability, and thus of statistical inference, is based
on the idea of an experiment that can be repeated very many times. Because
it is impossible to talk about information in terms of frequency probabilities
unless the information arises from a random sample from a well-defined
population or process, frequentist statisticians formally admit only sample
information in inferential and decision-making procedures. On the other
hand, the Bayesian approach to statistics enables us to base inferences and
decisions on all of the information that is available, whether or not the
information is in the form of sample information. The motivation to be able
to do this is particularly strong in decision theory. Erroneous decisions may
be quite costly, and it may not be reasonable to ignore pertinent information
just because it is not "objective" sample information.

In practice, a researcher often has some information about a parameter θ
prior to taking a sample. This information may be of a subjective nature, in
which case the prior distribution will consist of a set of subjective
probabilities. In the frequentist approach to statistics, all probabilities
should be based on the long-run frequency interpretation of probability, hence
subjective probabilities are not admissible. Most statisticians advocating the
Bayesian viewpoint contend that all available information about a parameter,
whether it be of an objective or subjective nature, should be utilized in
making inferences or decisions. Since followers of the frequency
interpretation of probability do not admit subjective probabilities, they do not
admit probability statements about a population parameter θ. Frequentists
contend that θ has a certain value, which may be unknown, and that it is
senseless to talk of the probability that θ equals some number; either it does
or it does not. The subjectivist, on the other hand, does think of θ as a
random variable and thus allows probability statements concerning θ.

Here’s a well known example. Consider three possible experiments. In each
case below, a trial of ten observations will be conducted to test the
hypothesis expressed:

1. A tea-drinker claims she can detect from a cup of tea whether the milk
was added before or after the tea.

2. A music expert claims that he can distinguish between a page of
Haydn's work and a page of Mozart's.

3. A friend claims she can predict the outcome of tossing a fair coin.

In each of these cases the distribution is binomial with n=10 and unknown
parameter π. In frequentist statistics we would establish the null (π=.5), pick
a probability of Type I error (let’s say in this case, α=.06), and establish a
rejection region for the null against a one-tailed inexact hypothesis (R=8,
R=9, and R=10).
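
To see where those numbers come from, the size of the rejection region {8, 9, 10}
under the null can be computed directly from the binomial distribution. A minimal
sketch in Python using scipy.stats (my own choice of tool; the notes themselves
use EXCEL):

    from scipy.stats import binom

    n, pi_null = 10, 0.5

    # P(R >= 8 | pi = .5, n = 10): the probability of the rejection region under the null
    alpha = sum(binom.pmf(r, n, pi_null) for r in (8, 9, 10))
    print(round(alpha, 3))   # 0.055, roughly the alpha of .06 quoted above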

Now, let’s suppose that in each case the number of correct answers equals 8
(which would be in the Neyman-Pearson rejection region). Solely in terms
of the data, we would be forced to draw the same inferences in each case,
and make the same decision: reject the null hypothesis. But our prior beliefs
suggest that we are likely to remain skeptical about the friend’s coin
guessing ability, slightly impressed by the tea drinker, and not at all
surprised by the music expert. Bayesians would say we should be more
willing to accept the music expert’s claim based on this sample evidence
than we should be to accept the friend’s claim, because of our prior beliefs.
The essential point for Bayesians is that experiments are not abstract
devices. Invariably we have some knowledge about the process being
investigated before obtaining the data and it is sensible that inference be
based on the combined information that the prior knowledge and the data
represent.

In order to be able to express all information in probabilistic terms, most
(but not all) Bayesians follow the subjective interpretation of probability.
Often, much of the available information consists of the judgments of one or
more persons. If probabilities are interpreted as degrees of belief of
individuals, this sort of information can be formally included in the analysis.
It should be remembered, however, that the terms "Bayesian" and
"subjective" are not synonymous; it is possible to develop the Bayesian
approach without using subjective probability. For example, a pollster may
take a poll every week to determine which party would garner the most votes
“if the election were held today”. For a frequentist, each sample is
independent: what happened in previous samples cannot be admitted.
However, a Bayesian could use what he had learned from previous samples
in drawing inferences from new samples. This “updating” procedure does
not require a subjective interpretation of probability.

In addition, the subjective interpretation of probability does not prevent the
utilization of the sampling distributions that have been discussed in previous
lecture notes. If a researcher subjectively accepts the assumptions such as
normality, equal variance, and so forth, underlying a particular sampling
distribution, then that distribution can be given a subjective interpretation.
Since the Bayesian will utilize the sampling distributions most commonly
encountered in frequentist statistics and, in addition, will use other
information, the Bayesian approach can be thought of as an extension of the
frequentist approach. Under certain conditions, Bayesian and frequentist
methods produce similar or even identical results.

Given the much documented criticisms that have been leveled at frequentist
statistics, it may seem a little surprising that Bayesian techniques are not
more commonly used in social sciences. There are probably a number of
reasons for this general lack of use. First, Bayesian methods do allow for
(although they don’t require) a subjectivist definition of probability, with
which many people are uncomfortable. Second, Bayesian techniques are
much less amenable to a “cook-book” style teaching approach than
frequentist techniques – at first glance they appear to require more technical
knowledge than many social scientists possess or desire. Thirdly, Bayesian
methods are seemingly computation-intensive.

Recently however, the use of Bayesian methods has been increasing, partly
because improvements in computation have made these methods easier to
apply in practice and partly because the Bayesian approach seems to be able
to get useful solutions in some applications where frequentist approaches
cannot.

Probability and Bayes’ Theorem


Bayes’ Theorem can be stated as follows. Let A1, A2, ..., An be
a collection of n mutually exclusive and exhaustive events with P(Ai) > 0
for i = 1, 2, ..., n. Then for any event B for which P(B) > 0,

P(A_k \mid B) = \frac{P(A_k \cap B)}{P(B)} = \frac{P(B \mid A_k) \, P(A_k)}{\sum_{i=1}^{n} P(B \mid A_i) \, P(A_i)}

Now consider the collection of n events to be a collection of n possible
values of a population parameter θ from a discrete distribution, and the event
B to be a sample statistic X. Re-writing, we get:

P(\theta_k \mid X) = \frac{P(X \mid \theta_k) \, P(\theta_k)}{\sum_{i=1}^{n} P(X \mid \theta_i) \, P(\theta_i)}

The left-hand side of the above equation is known as the posterior
probability. It is the probability that the population parameter θ is equal to
θk, given the sample (reflected in the sample statistic X). The numerator of
the right-hand side is the likelihood of the sample (given the population
parameter) times the prior probability that the population parameter θ is
equal to θk. The denominator of the right-hand side is the sum of the
likelihoods times the priors over all possible values of θ.
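
The discrete form of Bayes' theorem translates directly into a few lines of code.
Here is a minimal sketch (the function name and the use of Python are illustrative,
not part of the notes); the numerical example anticipates the coin problem worked
out later in these notes:

    def discrete_posterior(priors, likelihoods):
        """Posterior probabilities for a discrete set of candidate parameter values.

        priors      -- P(theta_i) for each candidate value of theta
        likelihoods -- P(X | theta_i) for the observed sample statistic X
        """
        joint = [p * l for p, l in zip(priors, likelihoods)]
        evidence = sum(joint)               # the denominator, P(X)
        return [j / evidence for j in joint]

    # R = 7 heads in 10 flips, candidate values pi = .50 and pi = .75
    print(discrete_posterior(priors=[0.3, 0.7], likelihoods=[0.117, 0.250]))
    # -> approximately [0.167, 0.833]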

Another way to think about it would be:

p(\text{hypothesis} \mid \text{data}) = \frac{p(\text{data} \mid \text{hypothesis}) \, p(\text{hypothesis})}{p(\text{data})}

Suppose that we are interested in making inferences about a parameter θ, and
we are willing to assume that θ can only take on J possible values θ1, ..., θJ.
These J values might be thought of as J competing hypotheses concerning θ.
Furthermore, assume that the information concerning θ can be summarized
by a probability distribution consisting of a set of probabilities P(θ = θi).
This probability distribution is called the prior distribution of θ. We then
observe a sample, and the outcome of the sample can be summarized by X,
the observed value of a sample statistic. The likelihood function is then of
the form P( X | θ ) if X is discrete or f ( X | θ ) if X is continuous. In frequentist
statistics, all inferences are based on the sample information, and some
frequentist techniques are based directly upon the likelihood function. Recall
that in estimation, the likelihood function is used to find maximum-
likelihood estimators and in hypothesis testing, the likelihood function is
used to develop likelihood ratio tests. The likelihood function is also used as
an input to Bayes' theorem, and it is the input representing the sample
information. For any θi, P( X | θ i ) can be thought of as the likelihood of the
sample result X, given θi, L(θ i | X ) .

These equations say that the ‘posterior’ probability of the scientific
hypothesis we are interested in testing (posterior because it is determined
after observing the data), given the data that we have observed in our
sample, equals the probability of the data given the hypothesis we are testing
(called the ‘likelihood function’ for the data), multiplied by a ‘prior’
probability for the hypothesis (a probability assigned prior to observing any
data), divided by the (marginal) probability of the data.

The marginal probability of the data is simply the probability of the data
under all possible hypotheses (θ) in the sample space (S). Hence, it is
simply a weighted average of sorts. Specifically, in the discrete case:

P(\text{data}) = \sum_{i=1}^{n} P(X \mid \theta_i) \, P(\theta_i)
Remember that for continuous distributions the summation sign is replaced
by the integral sign, but the intuitive meaning of the equation remains
unchanged.

Summary of the Bayesian Approach


We can now identify four fundamental aspects that characterize the
Bayesian approach to inference:

1. Prior Information: All problems are unique and have their own
context from which prior information can be derived. It is the
formulation and exploitation of this prior knowledge which sets
Bayesian inference apart from frequentist inference.

2. Subjective Probability: In contrast to frequentist statistics, the
Bayesian framework formalizes explicitly the idea that probabilities
can be subjective. Inference is based on the posterior distribution,
whose form will depend (via Bayes' Theorem) on the prior
specification.

3. Self-consistency: By treating θ as random, it emerges that the whole
development of Bayesian inference stems quite naturally from
probability theory only. This has many advantages and means that all
inferential issues can be addressed as probability statements about θ
which derive directly from the posterior distribution.

4. No "ad-hockery:" Because frequentist inference cannot make


probability statements about θ, various criteria must be developed to
judge whether a particular estimator is in some sense "good".
Bayesian inference sidesteps this tendency to invent ad-hoc criteria
for judging and comparing estimators by relying on the posterior
distribution to express in straightforward probabilistic terms the entire
inference about the unknown θ.

A number of substantive objections to Bayesian inference have been raised
by frequentist statisticians.

1. Because parameters are fixed, it is unreasonable to place a
probability distribution on them (since they simply are what they
are). More formally, parameters and data can’t share the same sample
space. However, recall that the Bayesian perspective on probability
is that probability is an approach to uncertainty. Whether or not a
parameter is indeed fixed is, to a Bayesian, irrelevant, because we are
still uncertain about its true value. Thus, imposing a probability
distribution over a parameter space is reasonable, because it provides
a means to measure our uncertainty about the parameter’s true value.
Bayesians would also note that doing so has powerful results:
confidence intervals become ‘credible intervals’ with clear
interpretation. Research can build upon previous research, rather than
testing the same ‘null’ hypotheses again and again, etc. Specifically,
the standard frequentist approach yields a statistical test that relates
the probability of the data under some hypothesis. Rejecting (or
failing to reject) the hypothesis tells us nothing about what the true
value of the parameter is! The Bayesian approach, on the other hand,
relates the probability of the hypothesis given prior knowledge
combined with the observed data at hand.

2. Incorporating a prior injects too much subjectivity into statistical
modeling. Bayesians have a number of responses to this argument.
First, all statistics is subjective. The choice of sampling density
(likelihood) to use in a specific project is a subjective determination.
Students in ECON 328 will learn that the art of the econometrician
requires a significant amount of judgment about which model to
specify, a decision that is ultimately based subjectively on the
econometrician’s personal experience. In frequentist inference,
maximum likelihood estimators are obtained by choosing the point in
the parameter space that maximizes the likelihood surface. But one
way of thinking about the Bayesian approach is that it amounts to
averaging across the likelihood surface rather than maximizing. The
controversy comes in as the averaging is weighted according to the
prior distribution. But even in frequentist inference it is quite
common to give different weights to different pieces of information
(e.g. in weighted regression). Furthermore, the choice of a level of
Type I error at which to declare a result ‘statistically significant’ is a
purely subjective determination. Similarly, the decision to
declare a statistically significant result scientifically meaningful (i.e.,
establish a meaningful alternative hypothesis) is a subjective decision.

3. The conclusions of the analysis will depend on the choice of prior.
This is true, but empirically priors tend to be overwhelmed by data.
The prior distribution generally contributes to the posterior once,
whereas data enter into the likelihood function multiple times. In the
limit, as n→∞, the prior’s influence on the posterior becomes
negligible. In addition, priors can be uninformative: a prior can be
made to contribute little or no information to the posterior.

Bayes’ Theorem and Updating Probabilities


One use of Bayesian techniques is to update probabilities based on sample
information. Here’s a very simple example.

Consider the researcher who believes a coin is either a fair coin (so, π, the
probability of heads is 0.50) or a trick coin (in which case π is .75). There
are no other possible values the population parameter can take on. The
researcher believes that the coin is likely a trick coin and assigns a prior
probability of .7 to that value of the parameter (π=.75) and a prior
probability of .3 to the “fair” value of the parameter (π=.50). He intends to
flip the coin ten times and use the results to make an inference.

The coin flip is a Bernoulli trial and the number of successes in ten flips is a
binomial random variable. Once he flips the coin, he will get a certain
number of heads (somewhere between 0 and 10). The number of heads he
gets (R) is the sample information.

We already know that the likelihood function L(π | R) is just equal to
P(R | π), the probability of R given π, which is easily calculated from the
formula of the binomial distribution (or EXCEL, which is even easier). Earlier
we calculated the probabilities for each possible value of R, given π=.5 and
π=.75. In frequentist terms, these were the null and alternative distributions.

Here’s what they looked like.

Table 4 (again)

 r     P(R=r, π=.5, n=10)     P(R=r, π=.75, n=10)
 0           0.001                   0.000
 1           0.010                   0.000
 2           0.044                   0.000
 3           0.117                   0.003
 4           0.205                   0.016
 5           0.246                   0.058
 6           0.205                   0.146
 7           0.117                   0.250
 8           0.044                   0.282
 9           0.010                   0.188
10           0.001                   0.056

Let’s suppose the researcher flips the coin ten times and obtains a head
seven times (R=7). Now we are in a position to use Bayes’ theorem to
update the researcher’s prior probabilities:

π      Prior   Likelihood   Prior x Likelihood        Posterior
.50     .30      .117       (.30) x (.117) = .0351    .0351/(.0351 + .175) = .167
.75     .70      .250       (.70) x (.250) = .175     .175/(.0351 + .175) = .833

The sample information (R=7) is used to update the researcher’s prior
probability. Before he took the sample, he believed there was a 70% chance
the coin was rigged. Now, after the sample, he has revised that subjective
probability to an 83.3% chance.
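
The whole updating table can be reproduced in a few lines of code, computing the
likelihoods from the binomial PMF rather than reading them from Table 4. A sketch
(again assuming scipy is available):

    from scipy.stats import binom

    def update(priors, pis, r, n):
        # likelihood of the observed count r under each candidate value of pi
        like = [binom.pmf(r, n, p) for p in pis]
        joint = [pr * l for pr, l in zip(priors, like)]
        return [j / sum(joint) for j in joint]

    pis = [0.50, 0.75]
    print(update([0.3, 0.7], pis, r=7, n=10))   # approx [0.167, 0.833]
    print(update([0.5, 0.5], pis, r=7, n=10))   # approx [0.319, 0.681] (the 50-50 prior below)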

What would happen if the researcher’s priors had been different? Let’s
suppose that the researcher had previously thought there was a 50-50 chance
the coin was rigged. Now, using Bayes’ theorem:

π      Prior   Likelihood   Prior x Likelihood        Posterior
.50     .50      .117       (.50) x (.117) = .0585    .0585/(.0585 + .125) = .319
.75     .50      .250       (.50) x (.250) = .125     .125/(.0585 + .125) = .681

Not surprisingly, the lower prior of π=.75 reduces the posterior as well. The
sensitivity of the results to the choice of prior is sometimes described as a
measure of how much “signal” there is in the data.

Notice, in the above two examples that the prior probabilities sum to unity,
and the posterior probabilities sum to unity. However the likelihoods do not.
This is because the two likelihoods come from two different binomial
distributions, one is P(R=7, π=.5, n=10) and the other is P(R=7, π=.75,
n=10). Thus there is no restriction that they must sum to unity. Notice also
that the prior probability of every other possible value of π except π=.50 and
π=.75 has implicitly been set to zero. (Obviously this is a somewhat
contrived example, designed to simplify the arithmetic.) However, one of the
implications of that is that the posterior probability of every other possible
value of π except π=.50 and π=.75 will equal zero, no matter what the sample
reveals.
This result is simple arithmetic. The posterior is equal to the prior times the
likelihood and if the prior is zero the posterior will also be zero. In Bayesian
analysis when you set a prior probability equal to zero, you have determined
that it is impossible for the population parameter to equal that value.

Another useful thing to notice is this: since what is generally of interest to a
statistician is the probability of the hypothesis, and the denominator does not
provide us any information about (or does not depend on) θ, you will often
see Bayes’ rule written as:

p(hypothesis | data) ∝ p(data | hypothesis) · p(hypothesis), or

posterior ∝ likelihood × prior,

where the symbol ∝ means “is proportional to”. Essentially the P(data) is a
normalizing constant that ensures that the posterior probabilities sum to
unity.

Conjugate Priors
At this stage students may be wondering why the Bayesian approach is
regarded as so much more technically difficult than the frequentist approach.
Thus far it just seems like a pretty simple extension of the frequentist
approach that allows prior information to be incorporated. The problem can
be understood by looking at the numerator of the formula:

P(\theta_k \mid X) = \frac{P(X \mid \theta_k) \, P(\theta_k)}{\sum_{i=1}^{n} P(X \mid \theta_i) \, P(\theta_i)}

The posterior distribution is proportional to the prior distribution times the
likelihood function. As a general rule, when you multiply the equation of the
likelihood function by the equation of the PDF of your prior distribution,
you’re going to end up with an unusable mess for a posterior distribution,
not a nice neat PDF. Dealing with this problem can be quite tricky.
However, conceptually, Bayesian inference is pretty simple. In fact, it’s
probably more logical than frequentist inference.

In 1961 Howard Raiffa and Robert Schlaifer’s book “Applied Statistical
Decision Theory” introduced the concept of conjugate priors. These are prior
distributions that, when multiplied by certain standard likelihood functions,
return posterior distributions that are in the same family as the priors. The
beta and gamma distributions proved to be particularly useful in this
context. This allowed for much easier calculation of Bayesian posterior
probabilities. Specifically, if the likelihood function is binomial or
geometric, the beta distribution is conjugate (a beta prior times a binomial
likelihood returns a beta posterior). Similarly, if the likelihood function is
Poisson or exponential, the gamma distribution is conjugate. There are other
conjugate pairings as well, but these two will suffice to explain the concept.

The Gamma Distribution


The gamma distribution is derived from the exponential distribution. As you
will recall, the exponential distribution with parameter λ is the waiting-time
distribution until the first observation in a Poisson process with average rate
λ. Suppose we were interested in the waiting time until the αth
observation. The αth arrival time is a continuous random variable and its
density function is simply the α-fold convolution of the exponential
density function.

The resulting function is a bi-parametric family of distributions. Parameters
of conjugate priors are called hyperparameters, in the vain hope that this
convention will prevent confusion between the parameters of the prior and
posterior distributions, and the parameters of the population. By convention
the two hyperparameters are usually (but not always) called α and
β, where β corresponds to the rate λ of the underlying Poisson process. The
hyperparameter α is called the shape parameter, and the hyperparameter β is
called the location or scale parameter. The PDF of the gamma distribution is
a pretty awful-looking thing and I won’t bother giving it to you here.

Not surprisingly, given the derivation of the gamma distribution, the
exponential distribution is a special case of the gamma distribution, where
α=1.
Consider two Gamma distributions, one with α=2 and β=2 and a second one
with α=5 and β=2. The two distributions are shown in Figure 1. The curve
in blue corresponds to shape (α)=2 and scale (β) = 2 and the curve in red
corresponds to shape (α)=5 and scale (β) = 2.

As you can see from the diagram, holding the scale parameter constant, the
larger the shape parameter the more the distribution becomes symmetrical. If
the shape parameter was set to 100 the distribution would look like a normal
distribution, and this would be true for any value of the scale parameter.
Figure 1

Now let’s consider two different Gamma distributions, one with α = 5 and
β=2, and a second one with α=5 and β=1. The two distributions are plotted
in Figure 2. The curve in blue corresponds to shape (α)=5 and scale (β) = 2
and the curve in red corresponds to shape (α)=5 and scale (β) = 1.

As you can see from Figure 2, the larger the scale parameter the further to the
left is the centre of mass of the curve. Setting the scale parameter even
larger, say to 10, would move the median to 0.57 and setting it to 20 would
move the median to 0.36.
Figure 2

The gamma has a number of practical applications. Obviously it is used as a
model for the distribution of times between the occurrences of successive
events. The gamma distribution can be used for the segment of time or space
occurring until some specified number of events has transpired. It is also
used in certain time-to-first-failure problems in electronic systems with
standby exponentially distributed backups, and in certain other engineering
applications. In addition, because of its very flexible shape, it can be used to
model any number of empirical phenomena.

In addition, one of the primary uses of the distribution is a strictly
mathematical one: it serves as the parent of a number of very important
statistical distributions, including the chi-square (a special case of the
gamma), and for large values of the shape parameter it closely approximates
the normal.

The EXCEL command =GAMMADIST(x,α,β,FALSE) gives the PDF of
the gamma(α, β) evaluated at X=x, and the EXCEL command
=GAMMADIST(x,α,β,TRUE) gives the CDF of the gamma(α, β)
evaluated at X=x.
Rather than using EXCEL, it’s a lot easier to plot out the gamma using a free
program called First Bayes. It’s a teaching program, not a research program,
but very useful for playing around with simple Bayesian problems. Just to
demonstrate how we can plot out the gamma(α=5, β=2), I go to the
“distributions” tab and set shape=5 and scale=2, and hit next. This is what I
get:

Notice that First Bayes, in its wisdom, puts the scale parameter before the
shape parameter. EXCEL does it the other way around. First Bayes also
helpfully calculates some of the moments. The mean of this member of the
gamma family is 2.5 (α/β) and the variance is 1.25 (α/β²).
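
Those moments are easy to verify in code. A sketch using scipy.stats.gamma; the
translation of the notes' β into scipy's scale parameter (scale = 1/β, so that the
mean comes out as α/β) is my own assumption about the intended parameterization:

    from scipy.stats import gamma

    alpha, beta = 5, 2
    g = gamma(a=alpha, scale=1/beta)   # scipy's scale is the reciprocal of the notes' beta

    print(g.mean())   # 2.5  = alpha / beta
    print(g.var())    # 1.25 = alpha / beta**2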

By hitting the “Plot” button you can ask First Bayes to plot out the
distribution. The plot of the distribution should be the same as the blue curve
in Figure 2, and it is.

The Beta Distribution


The beta distribution is effectively derived from the uniform distribution.
The beta distribution describes a family of curves that are nonzero only on
the interval (0, 1). The beta is most useful for modeling proportions. By
adjusting its parameters, one can achieve almost any desired density having
a domain on (0, 1).

Let X1, X2, …, Xn be a sequence of independent random variables each
having a U(0, 1) distribution. Let X(1) be the smallest of these random
variables, X(2) be the second smallest, …, X(n) be the largest. The sequence
X(1), X(2), …, X(n) is called the order statistics of the sample. Note that if
n = 2k+1, then the median, M, of these random variables is X(k+1); if n = 2k,
then M = [X(k) + X(k+1)] / 2. It can be shown that the median of (2n+1)
random numbers from (0, 1) is a beta(n+1, n+1) random variable.

A random variable X has a beta distribution with hyperparameters (α, β), α,
β > 0, if

f(x) = \frac{1}{B(\alpha, \beta)} \, x^{\alpha - 1} (1 - x)^{\beta - 1}, \quad 0 < x < 1

where

B(\alpha, \beta) = \frac{\Gamma(\alpha) \, \Gamma(\beta)}{\Gamma(\alpha + \beta)}

The beta(1, 1) distribution is equivalent to the uniform(0,1) distribution.

Like the gamma, the beta is a very flexible distribution that can be used to
model a number of different phenomena.

The mean and variance of a beta random variable are given by


E(X) = \frac{\alpha}{\alpha + \beta}, \qquad V(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}

The EXCEL command =BETADIST(x,α,β,0,1) gives you the CDF of the
beta(α,β) evaluated at X=x. EXCEL does not have a PDF command for this
distribution. Also note that the EXCEL command refers to a more general
beta distribution that relates not to the standard [0, 1] uniform distribution
but to any uniform distribution. That’s why there are two extra parameters
on the command.

To plot out the beta, again we can use First Bayes. In the distributions menu
simply enter α=2 and β= 10 (First Bayes calls these parameters p and q),
and hit “Next”.

First Bayes automatically calculates the mean, median, variance, and
quartiles of the beta(2,10) distribution. Notice that it calculates the mean as
0.16667, which is indeed 2/(2+10). It also calculates the variance as 0.01068,
which is indeed (2)(10)/[(2+10)²(2+10+1)].
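
The same check can be done in code. A sketch using scipy.stats.beta (an assumption
on my part; First Bayes or EXCEL would give the same numbers):

    from scipy.stats import beta

    b = beta(2, 10)
    print(round(b.mean(), 5))   # 0.16667 = 2 / (2 + 10)
    print(round(b.var(), 5))    # 0.01068 = (2)(10) / [(2 + 10)**2 * (2 + 10 + 1)]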

Here’s the shape of the beta(2,10).

To get a feel for how the two parameters play out, the beta(2,20) and the
beta(4,10) have been plotted out (in the two figures on the next page) by
First Bayes.

By comparing the three plots, it’s pretty easy to see that increasing the first
parameter gives the beta a more symmetric looking distribution, and
increasing the second parameter moves the distribution’s centre of gravity
more to the left.

Back to Conjugate Priors

If the likelihood function is binomial, and the PDF of the prior probability
distribution is beta, then the posterior will also be beta, albeit with different
parameters. It’s easiest to illustrate with an example. Suppose you have a
coin that you think is probably fair, or close to it. Your prior probability
might be described by the beta(50,50), which looks like this:

Now you flip the coin ten times. Let a success be when “heads” comes up, and
let X be the number of successes in ten flips of the coin. This, of course, is a
binomial experiment. From Note 8 we remember that the binomial
likelihood function looks like this:

L(\pi \mid x_1, x_2, \ldots, x_{10}) \equiv p(x_1, x_2, \ldots, x_{10} \mid \pi) = \prod_{i=1}^{n} p(X_i = x_i \mid \pi) = \prod_{i=1}^{10} \pi^{x_i} (1 - \pi)^{1 - x_i} = \pi^{\sum x_i} (1 - \pi)^{\,n - \sum x_i}
In this case n is 10 and Σx is the number of successes.


Now suppose the number of successes (Σx) in this experiment turns out to be
three. We know that the likelihood function will peak at 0.3, which is the
maximum likelihood estimate of the unknown π. The maximum likelihood
estimate is a frequentist tool. In frequentist statistics only sample data is
allowed so, if you flip a coin 10 times and three “heads” come up, the value
of π that makes that data most likely is 0.3. On the other hand a Bayesian
would take into account the prior probabilities and perform the usual
calculation to come up with a posterior distribution. If we ask First Bayes to
do that for us we get this:

If the prior is a beta(50, 50) and we get three successes in 10 trials, the
posterior distribution is a beta(53, 57). Plotting the prior, likelihood, and
posterior distributions on one graph gives us this.

Here’s what’s happened. The researcher thought the coin was probably fair,
or at least close to it. He was pretty sure that the true value of π was
somewhere between 0.4 and 0.6. Flipping only three “heads” in 10 trials
doesn’t seem very convincing to him, so he has only modified his beliefs
slightly.
Suppose he had a lot more data. Let’s say he had flipped the coin 100 times,
and “heads” came up on 30 of those flips. The maximum likelihood
estimate would still be the same, 0.3. But what would the Bayesian’s new
posterior distribution be?
You can probably guess that the posterior distribution is going to be
beta(80,120). In general, the α of the posterior is equal to the α of the prior
plus Σx, and the β of the posterior is equal to the β of the prior plus (n − Σx).
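
The conjugate updating rule is simple enough to write down directly; a sketch
(the helper name is illustrative):

    def update_beta(prior_a, prior_b, successes, n):
        """Beta-binomial conjugate update: returns the posterior (alpha, beta)."""
        return prior_a + successes, prior_b + (n - successes)

    print(update_beta(50, 50, successes=3, n=10))     # (53, 57), as above
    print(update_beta(50, 50, successes=30, n=100))   # (80, 120)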

We can see that the researcher now thinks it is very likely that π is less than
0.5 (almost no area under his posterior distribution is to the right of 0.5).

If the likelihood is Poisson and the prior is gamma, the posterior will also be
gamma. For example, suppose someone is trying to estimate the number of
days a machine will run between breakdowns. The researcher’s prior is that
the average number of days (λ) is somewhere between 3 and 5. Her prior
distribution might be described by a gamma distribution with shape
parameter 40 and scale parameter 10. We know that’s a pretty normal
looking curve (a little skewed) with mean equal to 4 (shape/scale) and
variance equal to 0.4 (shape/scale²). Any combination of shape and scale
parameters with shape/scale = 4 will give us a mean of 4, and the larger the
numbers the smaller the variance.

If we ask First Bayes to plot out that member of the gamma family for us,
we will see that it is indeed a pretty normal looking curve with mean equal
to 4 and variance equal to 0.4.

Now we collect a sample of 30 observations (30 intervals between
breakdowns over a six month period). The sample is [7, 1, 12, 6, 6, 1, 17, 5,
2, 7, 0, 0, 0, 2, 6, 1, 10, 3, 4, 8, 10, 3, 4, 2, 3, 0, 1, 4, 12, 13]. We could use
that sample to calculate the maximum likelihood estimate of the mean
number of days between breakdowns but instead we’ll ask First Bayes to
calculate the posterior probability:

We can see that the posterior distribution is also a gamma, now with shape
parameter 190 and scale parameter 40. Plotting the three distributions gives
us this:

Before collecting the sample the researcher thought the average was
somewhere in the 3-5 range. Now, having analysed the sample, she thinks
it’s probably in the 4-6 range. Previously she would have said there was
about a 50% chance the true value was between 2 and 4; now she would say
there is almost no chance it is below 4.
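
The gamma-Poisson update is just as mechanical: the posterior shape is the prior
shape plus the total count Σx, and the posterior scale (which behaves like a rate
here, since the mean is shape/scale) is the prior scale plus the number of
observations n. A sketch reproducing the breakdown example; the parameterization is
my reading of the numbers in the example:

    data = [7, 1, 12, 6, 6, 1, 17, 5, 2, 7, 0, 0, 0, 2, 6, 1,
            10, 3, 4, 8, 10, 3, 4, 2, 3, 0, 1, 4, 12, 13]

    prior_shape, prior_scale = 40, 10
    post_shape = prior_shape + sum(data)    # 40 + 150 = 190
    post_scale = prior_scale + len(data)    # 10 + 30  = 40

    print(post_shape, post_scale)           # 190 40, matching First Bayes
    print(post_shape / post_scale)          # posterior mean: 4.75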

The Choice of a Prior


The Bayesian approach to statistical inference treats population parameters
as random variables (not as fixed, unknown constants). As we know the
distributions of these parameters are called prior distributions. Often both
expert knowledge and mathematical convenience play a role in selecting a
particular type of prior distribution. This is a fundamental issue, and will be
discussed again later. However, some points should be noted:

1. Because the prior represents your beliefs about the unknown
parameter before observing the data, it follows that the subsequent
analysis is unique to you. Someone else's priors would lead to a
different posterior analysis. In this sense, the analysis is subjective.

2. As long as the prior is not "completely unreasonable", the effect
of the prior becomes less influential as more and more data become
available. Thus there is a sense in which misspecification of the prior
is unimportant so long as there is enough data available.

3. Often we might have a "rough idea" of what the prior should look like
(e.g. perhaps we could give a mean and variance), but cannot be more
specific than that. In such situations we could use a "convenient"
form for the prior which is consistent with our beliefs, but which also
makes the mathematics relatively straightforward.

4. Sometimes we might feel that we have no prior information about a
parameter. In such situations we might wish to use a prior which
reflects our ignorance. This is often possible but there are some
difficulties involved.

Uninformative Priors
One issue of concern in Bayesian inference is how strongly the particular
selection of a prior distribution influences the results of the inference.
Particularly if results are to be used by people who may question the prior
distribution used by the researcher, it is desirable to have enough data that
the influence of the prior is slight. An uninformative prior (sometimes
called a flat prior) is one that provides little or no information. Depending on
the situation, uninformative priors may be quite dispersed, may avoid only
impossible or quite preposterous values of the parameter, or may not have
modes. Bayesian analysis using uninformative priors sometimes gives results
similar to those obtained by traditional frequentist methods.

Point Estimation
Bayesians are typically less interested in point estimation than are
frequentists. Bayesians are more interested in all the information contained
in the posterior distribution. However, it is possible to use Bayesian
techniques to provide point estimates by simply taking the maximum point
of the posterior distribution. Notice that the posterior distribution is
proportional to the likelihood function when the prior is uninformative. In
such cases Bayesian analysis amounts to the same thing as maximum
likelihood analysis. Where the priors are informative the results will not be
the same. For example in the breakdown example a frequentist would simply
calculate the arithmetic average of 150/30 = 5. That would be the point
estimate of the average number days between breakdowns. The Bayesian
point estimate (and remember, point estimation is not a key function of
Bayesian analysis) is 4.725. These numbers are close, because the “signal”
contained in the data is strong. In the coin-flip example, the frequentist
estimate would be 0.3 (three successes in 10 flips) but the Bayesian estimate
was 0.48. The reason the sample had less influence on the posterior here is
that the sample size was only ten in this example.
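
Both point estimates quoted above are just the modes of the posterior
distributions, which for the beta and the gamma have simple closed forms
(standard results, not derived in these notes). A quick check:

    # Mode of a beta(a, b), for a, b > 1: (a - 1) / (a + b - 2)
    a, b = 53, 57
    print((a - 1) / (a + b - 2))    # 0.4815..., the 0.48 quoted for the coin example

    # Mode of a gamma(shape, rate), for shape > 1: (shape - 1) / rate
    shape, rate = 190, 40
    print((shape - 1) / rate)       # 4.725, the breakdown example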

Interval Estimation
In Bayesian analysis the term used in interval estimation is “credible
interval” rather than “confidence interval”. It’s actually pretty easy to
calculate, since we have the full posterior distribution. For example, with the
coin flip example, if the posterior distribution is Beta(53, 57) we can
calculate the CDF of the posterior by typing =BETADIST(X,53,57,0,1) into
a cell in EXCEL and it will return the probability that the population
parameter in question is less than or equal to X (obviously you have to put a
number in for X).

A useful concept in Bayesian inference is the highest density region (HDR),
sometimes called the highest density interval (HDI). The HDR is the
smallest interval containing a specified posterior probability. If the posterior
distribution is roughly symmetric, the HDR for a (say 90%) credible interval
will be found by locating the point in the distribution that has 5% of the
probability below it and the point that has 5% of the probability above it.
However, with non-symmetric posterior distributions, you may find that
moving the interval slightly in one direction or the other will provide a
smaller interval with the same level of probability.

Normally, the computer program you use will be able to calculate the HDR
for you. Here is a 90% HDR calculated for the coin flip example: [0.4037,
0.55987].
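
With the full posterior in hand, an equal-tailed 90% credible interval takes one
line, and for a posterior as nearly symmetric as the beta(53, 57) it essentially
coincides with the HDR quoted above. A sketch with scipy (EXCEL's BETAINV would do
the same job):

    from scipy.stats import beta

    posterior = beta(53, 57)
    lower, upper = posterior.ppf(0.05), posterior.ppf(0.95)
    print(lower, upper)   # roughly 0.404 and 0.560, close to the HDR [0.4037, 0.55987]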

Of course, one of the big advantages of the Bayesian credible interval is that
the “natural” interpretation is also the correct one. The Bayesian credible
interval is the probability, given the data, that the true value of the
population parameter lies within the specified interval. The frequentist
confidence interval is the region of the sampling distribution for the statistic
(not the population parameter) such that, in repeated samples, one would
expect the interval, calculated in that way, to “cover” the population
parameter a certain percent of the time.
Often frequentist confidence intervals and Bayesian credible intervals appear
similar. If Bayesians use uninformative priors and there are a large number
of observations (often several dozen will do), HDRs and frequentist
confidence intervals will tend to coincide numerically. The interpretation, of
course, remains very different.

Hypothesis Testing
Bayesian hypothesis testing is less formal than non-Bayesian varieties. In
fact, Bayesian researchers typically summarize the posterior distribution
without applying a rigid decision process. If one wanted to apply a formal
process, Bayesian decision theory is the way to go because it provides a
posterior probability distribution over the parameter space and one can make
expected utility calculations based on the costs and benefits of different
outcomes. This is a great strength of Bayesian analysis. However, without
some decision-theoretic analysis of the costs and benefits of making various
mistakes about parameter values, hypothesis testing in the Bayesian
approach is kind of useless. Calculating HDRs makes much more sense.

It is possible to derive tests of frequentist-type null and alternative hypotheses
using Bayesian techniques, although most Bayesians would prefer a
decision-theoretic approach. Nonetheless, using a Bayesian formulation, we
may consider the individual probabilities that each of the two competing
hypotheses H0: θ = θ0 or H1: θ = θ1 is true. The hypothesis testing problem
thus reduces to determining the posterior probabilities (given the data) that
the two hypotheses are true (these probabilities will always sum to 1).

We begin by placing prior probabilities on the two hypotheses subject to the
constraint that they sum to unity (we are assuming one of the hypotheses is
true). Then the posterior probability that H0 is true is, by Bayes' Theorem,

P(\theta_0 \mid X) = \frac{P(X \mid \theta_0) \, P(\theta_0)}{P(X \mid \theta_0) \, P(\theta_0) + P(X \mid \theta_1) \, P(\theta_1)}

Of course, once we calculate this, then the posterior probability that H1 is true
is just one minus the probability that H0 is true. This allows us to calculate a
posterior odds ratio:

\frac{P(\theta_0 \mid X)}{P(\theta_1 \mid X)}
One possible decision rule would be to choose the null anytime the posterior
odds ratio is greater than one, otherwise choose the alternative, but
remember that Bayesians are not that excited about this style of hypothesis
testing.

One of the main objections made to Bayesian methods is the subjective input
required in specifying the prior, and the way that affects the posterior
probabilities. To partially alleviate this difficulty, often Bayes Factors are
computed. The Bayes Factor is

BF = \frac{P(X \mid \theta_0)}{P(X \mid \theta_1)}
To see how this relates to the posterior probabilities of the hypotheses, note:

\frac{P(\theta_0 \mid X)}{P(\theta_1 \mid X)} = \frac{P(X \mid \theta_0) \, P(\theta_0)}{P(X \mid \theta_1) \, P(\theta_1)} = BF \cdot \frac{P(\theta_0)}{P(\theta_1)}
Thus, the Bayes Factor relates the prior odds ratio to the posterior odds ratio.
To find the posterior odds, compute the prior odds and multiply by the
Bayes Factor. In this sense, the Bayes Factor describes the effect of the data
upon our prior and does not involve the prior probabilities.
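For the simple coin example used throughout these notes (R = 7 heads in 10 flips,
candidate values π = .50 and π = .75), the Bayes Factor and the posterior odds can
be computed directly. A sketch, taking π = .50 as H0:

    from scipy.stats import binom

    r, n = 7, 10
    bf = binom.pmf(r, n, 0.50) / binom.pmf(r, n, 0.75)   # P(X | H0) / P(X | H1)
    print(round(bf, 3))                   # about 0.468: the data favour pi = .75

    prior_odds = 0.3 / 0.7                # the researcher's original priors
    posterior_odds = bf * prior_odds
    print(round(posterior_odds, 3))       # about 0.201, i.e. P(H0 | X) is about 0.167
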
Bayes Factors are the dominant method of Bayesian model testing. They are
the Bayesian analogues of likelihood ratio tests. The basic intuition is that
prior and posterior information are combined in a ratio that provides
evidence in favor of one model specification versus another.
Bayes Factors are very flexible, allowing multiple hypotheses to be
compared simultaneously; nested models are not required in order to
make comparisons, although it goes without saying that the models being
compared should have the same dependent variable.
Factors are rather intuitive, as a practical matter they are often quite difficult
to calculate in problems that are more complicated than our simple coin-flip
case.
In truth, Bayesian analysis is not well suited to frequentist style hypothesis
testing. Instead it is much better suited to decision-theoretic analysis. This is
really a strength of Bayesian analysis, not a weakness. Unfortunately there is
never enough time in a one-semester course for decision theory. However,
Economics students who do some reading on this topic on their own will
soon find themselves in familiar territory with such decision-theoretic
concepts as von Neumann-Morgenstern expected utility, game theory,
quadratic loss functions, and Bayes-Nash equilibria.
