Bayesian Econometrics

There exist two very different approaches to statistics. The traditional
“classical” or “frequentist” approach is what has been presented heretofore in
this course; almost all econometrics textbooks exposit this approach, with little
or no mention of its competitor, the Bayesian approach. One reason for this is
the violent controversy among statisticians concerning the relative merits of
the Bayesian and non-Bayesian methods, centering on the very different
notions of probability they employ. This controversy notwithstanding, it seems
that the main reason the Bayesian approach is used so seldom in econometrics
is that there exist several practical difficulties with its application. In recent
years, with the development of more powerful computers, new software, and
computational innovations, these practical difficulties have for the most part
been overcome; it therefore seems but a matter of time before Bayesian
analyses become common in econometrics.

What Is A Bayesian Analysis?


Suppose that, for illustrative purposes, we are interested in estimating the
value of an unknown parameter, β. Using the classical approach, the data are
fed into an estimating formula β̂ to produce a specific point estimate β̂₀ of β.
This estimate β̂₀ is viewed as a single drawing out of the sampling distribution
of β̂. This sampling distribution indicates the relative frequency of estimates
β̂ would produce in hypothetical repeated samples. If the assumptions of the
CNLR model hold, as is usually assumed to be the case, β̂ is the OLS
estimator and its sampling distribution is normal in form, with mean equal to
the true (unknown) value of β. Any particular estimate β̂₀ of β is viewed as a
random drawing from this sampling distribution, and the use of β̂₀ as a point
estimate of β is defended by appealing to the “desirable” properties, such as
unbiasedness, of the sampling distribution of β̂. This summarizes the
essentials of the classical, non-Bayesian approach.
The output from a Bayesian analysis is very different. Instead of producing a
point estimate of β, a Bayesian analysis produces as its prime piece of output a
density function for β called the “posterior” density function. This density
function looks like a sampling distribution, but is a completely different thing.
This density function relates to β, not β̂, so it most definitely is not a sampling
distribution; it is interpreted as reflecting the odds the researcher would give
when taking bets on the true value of β. For example, the researcher should be
willing to bet three dollars, to win one dollar, that the true value of β is above
the lower quartile of his or her posterior density for β. This “subjective” notion
of probability is conceptually different from the “frequentist” or “objective”
concept employed in the classical approach; this difference is the main bone
of contention between the Bayesians and non-Bayesians.
Following this subjective notion of probability, it is easy to imagine that
before looking at the data the researcher could have a “prior” density function
of β, reflecting the odds that he or she would give, before looking at the data,
if asked to take bets on the true value of β. This prior distribution, when
combined with the data via Bayes’ theorem, produces the posterior distribution
referred to above. This posterior density function is in essence a weighted
average of the prior density and the likelihood (a density providing the
“probability” of getting the actual data).
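As a minimal sketch (all numbers hypothetical), this prior-plus-likelihood weighting can be shown for the simplest conjugate case: normally distributed data with known noise variance and a normal prior for β, where the posterior mean is a precision-weighted average of the prior mean and the sample mean.

```python
from statistics import mean

# Hypothetical illustration: posterior for an unknown mean beta, with
# normally distributed data of known noise variance and a normal prior.
# The posterior mean is a precision-weighted average of the prior mean
# and the sample mean (precision = 1/variance).

def normal_posterior(prior_mean, prior_var, data, noise_var):
    n = len(data)
    prior_prec = 1.0 / prior_var        # weight on the prior belief
    data_prec = n / noise_var           # weight contributed by the data
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * mean(data))
    return post_mean, post_var

# Prior belief: beta ~ N(0, 4); four observations with noise variance 1.
post_mean, post_var = normal_posterior(0.0, 4.0, [2.1, 1.9, 2.3, 1.7], 1.0)
# The posterior mean lies between the prior mean (0) and the sample
# mean (2.0), pulled strongly toward the data, which are more precise.
```

The same weighted-average logic carries over to the regression setting, with matrices replacing the scalar precisions.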
It may seem strange that the main output of the Bayesian analysis is a
density function instead of a point estimate as in the classical analysis. The
reason for this is that the posterior can be used as input to decision problems,
only one example of which is the problem of choosing a point estimate. An
illustration of how the posterior can be used in this way should clarify this. To
begin, a loss function must be specified, giving the loss incurred, using a
specific point estimate β*₀, for every possible true value of β. The expected
loss associated with using β*₀ can be calculated by taking the expectation over
all possible values of β, using for this calculation the posterior density of β.
Note that this expectation is not taken over repeated samples. Next, we
calculate the expected loss associated with every possible point estimate β*.
Finally, we choose as our point estimate the candidate with the smallest
expected loss.
So once the expected losses associated with all alternative estimates have
been calculated, the Bayesian point estimate for that loss function is chosen as
the estimate whose expected loss is the smallest. In reality this process is done
either via a computer or via algebra – to check an infinite number of possible
estimates is otherwise impossible! The most common loss function is the
quadratic loss function where the loss is proportional to the square of the
difference between the estimate and the true value of β. In this case it turns out
that the mean of the posterior distribution minimizes expected loss and so is
chosen as the Bayesian point estimate.
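The expected-loss minimization just described can be sketched numerically; the discretized posterior below is made up purely for illustration.

```python
# Sketch of the decision step: compute the expected loss of every
# candidate point estimate under the (hypothetical, discretized)
# posterior, then pick the candidate with the smallest expected loss.
# Note the expectation runs over beta values, not over repeated samples.

support = [0.0, 0.5, 1.0, 1.5, 2.0]        # possible values of beta
probs   = [0.05, 0.20, 0.40, 0.25, 0.10]   # posterior probabilities (sum to 1)

def expected_loss(estimate, loss):
    return sum(p * loss(estimate, b) for b, p in zip(support, probs))

quadratic = lambda est, b: (est - b) ** 2

# Search candidate estimates on a fine grid (the computer stands in for
# checking the "infinite number" of candidates).
candidates = [i / 100 for i in range(0, 201)]
best = min(candidates, key=lambda c: expected_loss(c, quadratic))

posterior_mean = sum(b * p for b, p in zip(support, probs))
# Under quadratic loss the minimizer coincides with the posterior mean.
```

In practice the grid search is replaced by algebra: for quadratic loss the minimizer is always the posterior mean, as the text notes.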
To summarize, the Bayesian approach consists of three steps.
(1) A prior distribution is formalized, reflecting the researcher’s beliefs about
the parameter(s) in question before looking at the data.
(2) This prior is combined with the data, via Bayes’ theorem, to produce the
posterior distribution, the main output of a Bayesian analysis.
(3) This posterior is combined with a loss or utility function to allow a
decision to be made on the basis of minimizing expected loss or
maximizing expected utility. This third step is optional.

Advantages of the Bayesian Approach



The Bayesian approach claims several advantages over the classical approach,
of which the following are some examples.
(1) The Bayesian approach is concerned with how information in data
modifies a researcher’s beliefs about parameter values and allows
computation of probabilities associated with alternative hypotheses or
models; this corresponds directly to the approach to these problems taken
by most researchers.
(2) Extraneous information is routinely incorporated in a consistent fashion in
the Bayesian method through the formulation of the prior; in the classical
approach such information is more likely to be ignored, and when
incorporated is usually done so in ad hoc ways.
(3) The Bayesian approach can tailor the estimate to the purpose of the study,
through selection of the loss function; in general, its compatibility with
decision analysis is a decided advantage.
(4) There is no need to justify the estimating procedure in terms of the
awkward concept of the performance of the estimator in hypothetical
repeated samples; the Bayesian approach is justified solely on the basis of
the prior and the sample data.

The essence of the debate between the frequentists and the Bayesians rests on
the acceptability of the subjectivist notion of probability. Once one is willing
to view probability in this way, the advantages of the Bayesian approach are
compelling. But most practitioners, even though they have no strong aversion
to the subjectivist notion of probability, do not choose to adopt the Bayesian
approach. The reasons are practical in nature.
(1) Formalizing prior beliefs into a prior distribution is not an easy task.
(2) The mechanics of finding the posterior distribution are formidable.
(3) Convincing others of the validity of Bayesian results is difficult because
they view those results as being “contaminated” by personal (prior) beliefs.
In recent years these practical difficulties have been alleviated by the
development of appropriate computer software. These problems are discussed
in the next section.

Overcoming Practitioners’ Complaints


Complaint 1: Choosing a Prior. A prior distribution tries to capture the
“information which gives rise to that almost inevitable disappointment one
feels when confronted with a straightforward estimation of one’s preferred
structural model.” Non-Bayesians usually employ this information to lead
them to add, drop or modify variables in an ad hoc search for a “better” result.
Bayesians incorporate this information into their prior, exploiting it ex ante in
an explicit, up-front fashion; they maintain that, since human judgement is
inevitably an ingredient in statistical procedures, it should be incorporated in a
formal, consistent manner.

Although non-Bayesian researchers do use such information implicitly in
undertaking ad hoc specification searches, they are extremely reluctant to
formalize this information in the form of a prior distribution or to believe that
others are capable of doing so. To those unaccustomed to the Bayesian
approach, formulating a prior can be a daunting task. This prompts some
researchers to employ an “ignorance” prior, which, as its name implies,
reflects complete ignorance about the values of the parameters in question. In
this circumstance the outcome of the Bayesian analysis is based on the data
alone; it usually produces an answer identical, except for interpretation, to that
of the classical approach. Cases in which a researcher can legitimately claim
that he or she has absolutely no idea of the values of the parameters are rare,
however; in most cases an “informative” prior must be formulated. There are
three basic ways in which this can be done.
(a) Using previous studies. A researcher can allow results from previous
studies to define his or her prior. An earlier study, for example, may have
produced an estimate of the parameter in question, along with an estimate of
that estimate’s variance. These numbers could be employed by the researcher
as the mean and variance of his or her prior. (Notice that this changes
dramatically the interpretation of these estimates.)
(b) Placing hypothetical bets. Since the prior distribution reflects the odds
the researcher would give, before looking at the data, when taking hypothetical
bets on the value of the unknown parameter β, a natural way of determining
the prior is to ask the researcher (or an expert in the area, since researchers
often allow their prior to be determined by advice from experts) various
questions relating to hypothetical bets. For example, via a series of questions a
value β0 may be determined for which the researcher would be indifferent to
betting that the true value of β (1) lies above β0, or (2) lies below β0. As
another example, a similar series of questions could determine the smallest
interval that he or she would be willing to bet, at even odds, contains the true
value of β. Information obtained in this way can be used to calculate the prior
distribution.
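A sketch of how such betting answers might be converted into a prior, assuming a normal prior form and hypothetical elicited numbers:

```python
from statistics import NormalDist

# Hypothetical sketch of method (b): back out a normal prior from two
# elicited betting quantities. Suppose the researcher is indifferent to
# betting either side of 1.0 (so 1.0 is the prior median) and would bet
# at even odds that beta lies in [0.4, 1.6] (a central 50% interval).
# For a normal prior the median equals the mean, and the upper quartile
# pins down the standard deviation.

prior_median = 1.0       # indifference point of the 50/50 bet
upper_quartile = 1.6     # upper end of the even-odds central interval

z75 = NormalDist().inv_cdf(0.75)    # standard normal 75th percentile
prior_sd = (upper_quartile - prior_median) / z75

prior = NormalDist(mu=prior_median, sigma=prior_sd)
# Sanity check: the implied prior should assign probability 0.5 to the
# elicited even-odds interval [0.4, 1.6].
p_interval = prior.cdf(1.6) - prior.cdf(0.4)
```

More betting questions would over-identify the prior and let the researcher check whether a normal form is adequate.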
(c) Using predictive distributions. One problem with method (b) above is
that for many researchers, and particularly for experts whose opinions may be
used to formulate the researcher’s prior, it is difficult to think in terms of
model parameters and to quantify information in terms of a distribution for
those parameters. They may be more comfortable thinking in terms of the
value of the dependent variable associated with given values of the
independent variables. Given a particular combination of values of the
independent variables, the expert is asked for his or her assessment of the
corresponding value of the dependent variable (i.e., a prior is formed on the
dependent variable, not the parameters). This distribution, called a “predictive”
distribution, involves observable variables rather than unobservable
parameters, and thus should relate more directly to the expert’s knowledge and
experience. By eliciting facts about an expert’s predictive distributions at
various settings of the independent variables, it is possible to infer the expert’s
associated (implicit) prior distribution concerning the parameters of the model.
For many researchers, even these methods do not yield a prior with which
they feel comfortable. For such researchers the only way in
which a Bayesian analysis can be undertaken is by structuring a range of prior
distributions encompassing all prior distributions the researcher feels are
reasonable.

Complaint 2: Finding the Posterior. The algebra of Bayesian analyses is
more difficult than that of classical analyses, especially in multidimensional
problems. For example, the Bayesian analysis of a multiple regression with
normally distributed errors requires a multivariate normal-gamma prior
which, when combined with a multivariate normal
likelihood function, produces a multivariate normal-gamma posterior from
which the posterior marginal distribution (marginal with respect to the
unknown variance of the error term) of the vector of slope coefficients can be
derived as a multivariate t distribution. This both sounds and is mathematically
demanding.
From the practitioner’s viewpoint, however, this mathematics is not
necessary. Bayesian textbooks spell out the nature of the priors and likelihoods
relevant to a wide variety of estimation problems, and discuss the form taken
by the resulting output. Armed with this knowledge, the practitioner can call
on several computer packages to perform the calculations required to produce
the posterior distribution. And then when, say, the mean of the posterior
distribution must be found to use as a point estimate, recently-developed
computer techniques can be used to perform the required numerical
integration. Despite all this, some econometricians complain that the Bayesian
approach is messy, requiring numerical integration instead of producing
analytical solutions, and has for this reason taken the fun out of
statistics.
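A minimal sketch of the numerical integration involved, using a grid approximation with a made-up prior and likelihood (any densities could be substituted):

```python
import math

# Hypothetical sketch of the numerical-integration step: when the
# posterior has no convenient closed form, its mean can still be
# computed on a grid, since posterior(beta) is proportional to
# prior(beta) * likelihood(beta).

def prior(b):          # N(0, 1) prior for beta, up to a constant
    return math.exp(-0.5 * b * b)

def likelihood(b):     # made-up likelihood, peaked at beta = 2
    return math.exp(-0.5 * (b - 2.0) ** 2 / 0.25)

step = 0.001
grid = [i * step - 5.0 for i in range(10001)]      # grid on [-5, 5]
post = [prior(b) * likelihood(b) for b in grid]    # unnormalized posterior

norm = sum(post) * step                   # normalizing constant (Riemann sum)
post_mean = sum(b * p for b, p in zip(grid, post)) * step / norm
```

For this conjugate pair the answer can be verified analytically (the posterior is normal with mean 1.6); the grid method, unlike the algebra, works unchanged for non-conjugate priors.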

Complaint 3: Convincing Others. The problem Bayesians have of
convincing others of the validity of their results is captured neatly by the
question ‘Leaving your opinions out of this, what does your experimental
evidence say?’
One way of addressing this problem is to employ either an ignorance prior
or a prior reflecting only results of earlier studies. But a better way of resolving
this problem is to report a range of empirical results corresponding to a range
of priors. This procedure has several advantages. First, it should alleviate any
uncomfortable feeling the researcher may have with respect to his or her
choice of prior. Second, a realistic range of priors should encompass the prior
of an adversary, so that the range of results reported should include a result
convincing to that adversary. Third, if the results are not too sensitive to the
nature of the prior, a strong case can be made for the usefulness of these
results. And fourth, if the results are sensitive to the prior, this should be made
known so that the usefulness of such “fragile” results can be evaluated in that
light.

Special Notes

● It cannot be stressed too strongly that the main difference between
Bayesians and non-Bayesians is the concept of probability employed. For
the Bayesian, probability is regarded as representing a degree of reasonable
belief; numerical probabilities are associated with degrees of confidence that
the researcher has in propositions about empirical phenomena. For the non-
Bayesian (or “frequentist”), probability is regarded as representing the
frequency with which an event would occur in repeated trials.
● The concept of a confidence interval can be used to illustrate the different
concepts of probability employed by the Bayesians and non-Bayesians.
Suppose that the points D and E are placed such that 2.5% of the area under
the posterior distribution appears in each tail; the interval DE can then be
interpreted as being such that the probability that β falls in that interval is
95%. This is the way in which many clients of classical/frequentist
statisticians want to and do interpret classical 95% confidence intervals, in
spite of the fact that it is illegitimate to do so. The comparable classical
confidence interval must be interpreted as either covering or not covering
the true value of β, but being calculated in such a way that, if such intervals
were calculated for a large number of repeated samples, then 95% of these
intervals would cover the true value of β.
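Assuming a normal posterior (with hypothetical numbers), the interval DE can be computed directly from posterior quantiles:

```python
from statistics import NormalDist

# Sketch of the interval DE described above: place D and E so that 2.5%
# of the posterior's area lies in each tail. For an (assumed) normal
# posterior these are just two quantiles of that distribution.

posterior = NormalDist(mu=1.2, sigma=0.3)   # made-up posterior for beta
D = posterior.inv_cdf(0.025)
E = posterior.inv_cdf(0.975)

# The Bayesian reads P(D < beta < E) = 0.95 directly from the posterior;
# a classical interval of the same form carries only the repeated-
# sampling guarantee discussed in the text.
coverage = posterior.cdf(E) - posterior.cdf(D)
```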
● When the decision problem is one of choosing a point estimate for β, the
estimate chosen depends on the loss function employed. For example, if the
loss function is quadratic, proportional to the square of the difference
between the chosen point estimate and the true value of β, then the mean of
the posterior distribution is chosen as the point estimate. If the loss is
proportional to the absolute value of this difference, the median is chosen. A
zero loss for a correct estimate and a constant loss for an incorrect estimate
(an all-or-nothing loss function) leads to the choice of the mode. The
popularity of the squared error or quadratic loss function has led to the mean
of the posterior distribution being referred to as the Bayesian point estimate.
Note that, if the posterior distribution is symmetric, these three examples of
loss functions lead to the same choice of estimate.
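These three loss functions can be checked numerically on a hypothetical skewed discrete posterior, where the three summaries differ:

```python
# Sketch: on a skewed (made-up) discrete posterior, each loss function
# picks out a different summary of the posterior distribution.

support = [0, 1, 2, 3, 4]
probs   = [0.10, 0.35, 0.30, 0.15, 0.10]   # posterior, sums to 1

def best_estimate(loss):
    # Grid-search the estimate minimizing posterior expected loss.
    cands = [i / 100 for i in range(0, 401)]
    return min(cands, key=lambda c: sum(p * loss(c, b)
                                        for b, p in zip(support, probs)))

post_mean = best_estimate(lambda c, b: (c - b) ** 2)     # quadratic -> mean
post_median = best_estimate(lambda c, b: abs(c - b))     # absolute -> median
# All-or-nothing loss rewards only exact hits, so take the most
# probable support point directly:
post_mode = max(zip(probs, support))[1]
```

Here the mean (1.8), median (2) and mode (1) all differ because the posterior is skewed; with a symmetric posterior the three coincide, as the text notes.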
● Bayesians object strenuously to being evaluated on the basis of hypothetical
repeated samples because they do not believe that justification of an
estimator on the basis of its properties in repeated samples is relevant. They
maintain that because the estimate is calculated from the data at hand it must
be justified on the basis of those data. Bayesians recognize, however, that
reliance on an estimate calculated from a single sample could be dangerous,
particularly if the sample size is small. In the Bayesian view sample data
should be tempered by subjective knowledge of what the researcher feels is
most likely to be the true value of the parameter. In this way the influence of
unrepresentative samples (not unusual if the sample size is small) is
moderated. The classical statistician, on the other hand, fears that
calculations using typical samples will become contaminated with poor
prior information.
● The functional form of the prior is chosen for mathematical convenience, to
facilitate calculation of the posterior for the problem at hand. For example,
if we are attempting to estimate the parameter of a binomial distribution, the
derivation of the posterior is much easier if the prior takes the form of a beta
distribution. In this example the posterior also is a beta distribution. Such a
result is very convenient because then the posterior, when used as the prior
for a later analysis, keeps the analysis mathematically easy. The choice of a
functional form for the prior is innocuous: very few people have prior
information so precise that it cannot be approximated adequately by a
mathematically-convenient distribution.
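The beta-binomial conjugacy described above can be sketched in a few lines; the prior parameters and data are hypothetical:

```python
# Sketch of the conjugacy in the bullet above: a Beta(a, b) prior for a
# binomial success probability combines with k successes in n trials to
# give a Beta(a + k, b + n - k) posterior, which can in turn serve as
# the prior for the next batch of data.

def beta_update(a, b, successes, trials):
    return a + successes, b + (trials - successes)

a, b = 2.0, 2.0                  # prior: Beta(2, 2), centered on 0.5
a, b = beta_update(a, b, 7, 10)  # first sample: 7 successes in 10 trials
a, b = beta_update(a, b, 3, 5)   # posterior reused as prior for new data
posterior_mean = a / (a + b)     # Beta mean = a / (a + b)
```

The sequential reuse of the posterior as the next prior is exactly the convenience the bullet describes.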
● There exists considerable evidence that people can be inaccurate in their
personal assessments of probabilities; the existence of this phenomenon
underlines the importance of reporting estimates for a range of priors.
● Bayesians undertake hypothesis testing by estimating the probability that
the null hypothesis is true and comparing it to the probability that the
alternative hypothesis is true. These two probabilities are used in
conjunction with a loss function to decide whether to accept the null or the
alternative. Here are the major differences between the classical and
Bayesian hypothesis testing procedures:
(a) Bayesians “compare” rather than “test” hypotheses; they select one
hypothesis in preference to the other, based on minimizing an expected
loss function.
(b) Bayesians do not adopt an arbitrarily determined type I error rate, instead
allowing this error rate to be whatever minimization of the expected loss
function implies for the data at hand. One implication of this is that as
the sample size grows the Bayesian allows both the type I and type II
error rates to move towards zero whereas the classical statistician forces
the type I error rate to be constant.
(c) Bayesians build prior beliefs explicitly into the hypothesis choice
through the prior density.
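A sketch of the decision rule in (a) and (b), with hypothetical posterior probabilities and losses:

```python
# Sketch of Bayesian hypothesis choice: given posterior probabilities
# of the two hypotheses and a loss for each kind of wrong decision,
# accept whichever hypothesis has the smaller expected loss. All the
# numbers below are hypothetical.

p_null = 0.30                 # posterior probability that H0 is true
p_alt = 1.0 - p_null          # posterior probability that H1 is true

loss_reject_true_null = 10.0  # loss from rejecting a true H0 ("type I")
loss_accept_false_null = 2.0  # loss from accepting a false H0 ("type II")

exp_loss_accept = p_alt * loss_accept_false_null    # accept H0 but H1 true
exp_loss_reject = p_null * loss_reject_true_null    # reject H0 but H0 true

decision = "accept H0" if exp_loss_accept < exp_loss_reject else "reject H0"
```

No significance level appears anywhere: the implied error rates fall out of the losses and the posterior probabilities, which is point (b) above.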
● A useful way of comparing Bayesian and classical estimates is to view the
classical estimates as resulting from a choice of a single “best” specification
and the Bayesian estimates as resulting from a weighted average of several
alternative specifications, where the weights are the “probabilities” of these
alternative specifications being correct. For example, the classical estimate of
parameters in the presence of heteroskedasticity (or autocorrelation) is found
by choosing the “best” estimate of the heteroskedasticity (or autocorrelation)
parameter and then using it to calculate GLS to produce the parameter
estimates. In contrast, the Bayesian estimate results from taking a weighted
average of GLS estimates associated with different possible
heteroskedasticity (or autocorrelation) parameter values. The weights are the
probabilities of these heteroskedasticity (autocorrelation) parameter values
being the “right” ones, and are found from the posterior distribution of this
parameter. In other words, loosely speaking, the Bayesian estimator for the
case of an autocorrelated error is a weighted average of an infinite number
of GLS estimates, corresponding to an infinite number of different values of
ρ, where the weights are given by the heights of the posterior distribution of
ρ. On a grander scale, all of this suggests that empirical analysis using the
Bayesian approach requires averaging different models when making
decisions.
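Loosely sketching this weighted-average idea with hypothetical numbers (a grid of ρ values, the GLS slope estimate conditional on each, and posterior weights over ρ):

```python
# Sketch of the weighted-average estimator for an AR(1) error. The GLS
# estimates conditional on each rho and the posterior weights over rho
# are hypothetical stand-ins for values a real analysis would compute.

rhos         = [0.0, 0.2, 0.4, 0.6, 0.8]
gls_at_rho   = [1.10, 1.08, 1.05, 1.01, 0.96]   # GLS slope given each rho
post_weights = [0.05, 0.15, 0.40, 0.30, 0.10]   # posterior over rho, sums to 1

# Bayesian estimate: average the conditional GLS estimates, weighting
# by the posterior probability of each rho value.
bayes_estimate = sum(w * g for w, g in zip(post_weights, gls_at_rho))

# A classical analysis would instead pick the single "best" rho (here
# the most probable one, 0.4) and report the GLS estimate conditional
# on it, ignoring the uncertainty about rho.
classical_estimate = gls_at_rho[post_weights.index(max(post_weights))]
```

With a continuous posterior for ρ the sum becomes an integral over an infinite number of GLS estimates, as the text describes.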
