
Bayes' rule

In probability theory and applications, Bayes' rule relates the odds of event A1 to the odds of event A2, before and after conditioning on event B. The relationship is expressed in terms of the Bayes factor, Λ. Bayes' rule is derived from and closely related to Bayes' theorem. Bayes' rule may be preferred to Bayes' theorem when the relative probability (that is, the odds) of two events matters, but the individual probabilities do not. This is because in Bayes' rule, P(B) is eliminated and need not be calculated (see Derivation). It is commonly used in science and engineering, notably for model selection.

Under the frequentist interpretation of probability, Bayes' rule is a general relationship between O(A1:A2 | B) and O(A1:A2), for any events A1, A2 and B in the same event space. In this case, Λ represents the impact of the conditioning on the odds.

Under the Bayesian interpretation of probability, Bayes' rule relates the odds on probability models M1 and M2 before and after evidence E is observed. In this case, Λ represents the impact of the evidence on the odds. This is a form of Bayesian inference: the quantity O(M1:M2) is called the prior odds, and O(M1:M2 | E) the posterior odds. By analogy to the prior and posterior probability terms in Bayes' theorem, Bayes' rule can be seen as Bayes' theorem in odds form. For more detail on the application of Bayes' rule under the Bayesian interpretation of probability, see Bayesian model selection.

The rule
Single event
Given events A1, A2 and B, Bayes' rule states that the conditional odds of A1:A2 given B are equal to the marginal odds of A1:A2 multiplied by the Bayes factor Λ:

O(A1:A2 | B) = Λ(A1:A2 | B) · O(A1:A2),

where

Λ(A1:A2 | B) = P(B | A1) / P(B | A2).

In the special case that A1 = A and A2 = ¬A (the complement of A), this may be written as

O(A | B) = Λ(A | B) · O(A).
Multiple events
Bayes' rule may be conditioned on an arbitrary number of events. For two events B and C,

O(A1:A2 | B ∩ C) = Λ(A1:A2 | B ∩ C) · O(A1:A2),

where

Λ(A1:A2 | B ∩ C) = P(B ∩ C | A1) / P(B ∩ C | A2).

In the special case that A1 = A and A2 = ¬A, the equivalent notation is

O(A | B ∩ C) = Λ(A | B ∩ C) · O(A).

Derivation
Consider two instances of Bayes' theorem:

P(A1 | B) = P(B | A1) P(A1) / P(B),
P(A2 | B) = P(B | A2) P(A2) / P(B).

Combining these gives

P(A1 | B) / P(A2 | B) = [P(B | A1) / P(B | A2)] · [P(A1) / P(A2)].

Now defining

O(A1:A2 | B) = P(A1 | B) / P(A2 | B),   O(A1:A2) = P(A1) / P(A2),   Λ(A1:A2 | B) = P(B | A1) / P(B | A2),

this implies

O(A1:A2 | B) = Λ(A1:A2 | B) · O(A1:A2).

A similar derivation applies for conditioning on multiple events, using the appropriate extension of Bayes' theorem.
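The cancellation of P(B) can be checked numerically. The following minimal Python sketch uses made-up probabilities for two disjoint events and confirms that the posterior odds computed directly from Bayes' theorem equal the prior odds multiplied by the Bayes factor.

    # Numerical check of Bayes' rule in odds form, with made-up probabilities.
    P_A1, P_A2 = 0.2, 0.5                  # prior probabilities of two disjoint events (illustrative)
    P_B_given_A1, P_B_given_A2 = 0.9, 0.3  # probability of the evidence B under each event

    prior_odds = P_A1 / P_A2
    bayes_factor = P_B_given_A1 / P_B_given_A2

    # Posterior odds straight from Bayes' theorem; note that P(B) cancels.
    posterior_odds = (P_B_given_A1 * P_A1) / (P_B_given_A2 * P_A2)

    assert abs(posterior_odds - prior_odds * bayes_factor) < 1e-12
    print(prior_odds, bayes_factor, posterior_odds)  # approximately 0.4, 3.0, 1.2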

Examples
Frequentist example
Consider the drug testing example in the article on Bayes' theorem. The same results may be obtained using Bayes' rule. The prior odds on an individual being a drug-user are 199 to 1 against, as P(user) = 0.5% and P(non-user) = 99.5%. The Bayes factor when an individual tests positive is 0.99/0.01 = 99 in favour of being a drug-user: this is the ratio of the probability of a drug-user testing positive to the probability of a non-drug user testing positive. The posterior odds on being a drug-user are therefore 99 : 199 (the prior odds of 1 : 199 multiplied by the Bayes factor of 99), which is very close to 1 : 2. In round numbers, only one in three of those testing positive are actually drug-users.
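A short sketch of the same calculation, assuming the figures quoted above (0.5% prevalence, a test that is 99% sensitive and 99% specific):

    # Drug-testing example in odds form (assumed figures as in the Bayes' theorem article).
    prevalence = 0.005
    sensitivity = 0.99          # P(test positive | user)
    false_positive_rate = 0.01  # P(test positive | non-user)

    prior_odds = prevalence / (1 - prevalence)        # 1/199, i.e. 199 to 1 against
    bayes_factor = sensitivity / false_positive_rate  # 99 in favour

    posterior_odds = prior_odds * bayes_factor        # 99/199, roughly 1:2
    posterior_prob = posterior_odds / (1 + posterior_odds)
    print(round(posterior_prob, 3))   # about 0.332, i.e. roughly one in three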

Prior probability
In Bayesian statistical inference, a prior probability distribution, often called simply the prior, of an
uncertain quantity p (for example, suppose p is the proportion of voters who will vote for the politician
named Smith in a future election) is the probability distribution that would express one's uncertainty
about p before the "data" (for example, an opinion poll) is taken into account. It is meant to attribute
uncertainty rather than randomness to the uncertain quantity. The unknown quantity may be
a parameter or latent variable.
One applies Bayes' theorem, multiplying the prior by the likelihood function and then normalizing, to
get the posterior probability distribution, which is the conditional distribution of the uncertain quantity
given the data.
A prior is often the purely subjective assessment of an experienced expert. Some will choose
a conjugate prior when they can, to make calculation of the posterior distribution easier.
Parameters of prior distributions are called hyperparameters, to distinguish them from parameters of
the model of the underlying data. For instance, if one is using a beta distribution to model the
distribution of the parameter p of a Bernoulli distribution, then:
 p is a parameter of the underlying system (Bernoulli distribution), and
 α and β are parameters of the prior distribution (beta distribution), hence hyperparameters.
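To make the parameter/hyperparameter distinction concrete, the minimal sketch below updates a Beta(α, β) prior on a Bernoulli parameter p after observing some 0/1 data. The hyperparameter values α = β = 2 and the observations are chosen purely for illustration.

    # Conjugate Beta prior for a Bernoulli parameter p (illustrative values).
    alpha, beta = 2.0, 2.0        # hyperparameters of the Beta prior
    data = [1, 0, 1, 1, 0, 1]     # made-up Bernoulli observations

    successes = sum(data)
    failures = len(data) - successes

    # With a conjugate Beta prior, the posterior is again a Beta distribution.
    alpha_post = alpha + successes
    beta_post = beta + failures

    posterior_mean = alpha_post / (alpha_post + beta_post)
    print(alpha_post, beta_post, round(posterior_mean, 3))  # 6.0 4.0 0.6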

Informative priors
An informative prior expresses specific, definite information about a variable. An example is a prior
distribution for the temperature at noon tomorrow. A reasonable approach is to make the prior
a normal distribution with expected value equal to today's noontime temperature, with variance equal
to the day-to-day variance of atmospheric temperature, or a distribution of the temperature for that
day of the year.
This example has a property in common with many priors, namely, that the posterior from one problem (today's temperature) becomes the prior for another problem (tomorrow's temperature): pre-existing evidence which has already been taken into account is part of the prior, and as more evidence accumulates the prior comes to be determined largely by the evidence rather than by any original assumption, provided that the original assumption admitted the possibility of what the evidence is suggesting. The terms "prior" and "posterior" are generally relative to a specific datum or observation.

Uninformative priors
An uninformative prior expresses vague or general information about a variable. The term
"uninformative prior" may be somewhat of a misnomer; often, such a prior might be called a not very
informative prior, or an objective prior, i.e. one that's not subjectively elicited. Uninformative priors can
express "objective" information such as "the variable is positive" or "the variable is less than some
limit".
The simplest and oldest rule for determining a non-informative prior is the principle of indifference,
which assigns equal probabilities to all possibilities.
In parameter estimation problems, the use of an uninformative prior typically yields results which are
not too different from conventional statistical analysis, as the likelihood function often yields more
information than the uninformative prior.
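As a small illustration of this point (with made-up data), the sketch below compares the posterior mean of a binomial proportion under a flat prior and under the Jeffreys prior; with 100 observations the two posteriors, and the ordinary maximum-likelihood estimate, nearly coincide.

    import numpy as np

    # Posterior for a binomial proportion under two uninformative priors (made-up data).
    p = np.linspace(0.001, 0.999, 999)
    k, n = 37, 100                      # 37 successes in 100 trials (illustrative)
    likelihood = p**k * (1 - p)**(n - k)
    dp = p[1] - p[0]

    for name, prior in [("flat", np.ones_like(p)),
                        ("Jeffreys", p**-0.5 * (1 - p)**-0.5)]:
        posterior = prior * likelihood
        posterior /= posterior.sum() * dp            # normalize on the grid
        print(name, round(float((p * posterior).sum() * dp), 4))
    # Both posterior means are about 0.37, essentially the maximum-likelihood estimate k/n.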
Some attempts have been made at finding a priori probabilities, i.e. probability distributions in some
sense logically required by the nature of one's state of uncertainty; these are a subject of
philosophical controversy, with Bayesians being roughly divided into two schools: "objective
Bayesians", who believe such priors exist in many useful situations, and "subjective Bayesians" who
believe that in practice priors usually represent subjective judgements of opinion that cannot be
rigorously justified (Williamson 2010). Perhaps the strongest arguments for objective Bayesianism
were given by Edwin T. Jaynes.
As an example of an a priori prior, due to Jaynes (2003), consider a situation in which one knows a
ball has been hidden under one of three cups, A, B or C, but no other information is available about its location. In this case a uniform prior of P(A) = P(B) = P(C) = 1/3 seems intuitively like the only reasonable choice. More formally, we can see that the problem remains the same if we swap
around the labels ("A", "B" and "C") of the cups. It would therefore be odd to choose a prior for which
a permutation of the labels would cause a change in our predictions about which cup the ball will be
found under; the uniform prior is the only one which preserves this invariance. If one accepts this
invariance principle then one can see that the uniform prior is the logically correct prior to represent
this state of knowledge. This prior is "objective" in the sense of being the
correct choice to represent a particular state of knowledge, but it is not objective in the sense of being
an observer-independent feature of the world: in reality the ball exists under a particular cup, and it
only makes sense to speak of probabilities in this situation if there is an observer with limited
knowledge about the system.
As a more contentious example, Jaynes published an argument (Jaynes 1968) based on Lie
groups that suggests that the prior representing complete uncertainty about a probability should be the Haldane prior p^(−1)(1 − p)^(−1). The example Jaynes gives is of finding a chemical in a lab and asking whether it will dissolve in water in repeated experiments. The Haldane prior [1] gives by far the most weight to p = 0 and p = 1, indicating that the sample will either dissolve every time or never
dissolve, with equal probability. However, if one has observed samples of the chemical to dissolve in
one experiment and not to dissolve in another experiment, then this prior is updated to the uniform
distribution on the interval [0, 1]. This is obtained by applying Bayes' theorem to the data set
consisting of one observation of dissolving and one of not dissolving, using the above prior. The
Haldane prior has been criticized[by whom?] on the grounds that it yields an improper posterior distribution
that puts 100% of the probability content at either p = 0 or at p = 1 if a finite number of observations
have given the same result. The Jeffreys prior p^(−1/2)(1 − p)^(−1/2) is therefore preferred[by whom?] (see below).
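As a rough numerical illustration of that update, the sketch below evaluates the improper Haldane prior on a grid, multiplies it by the likelihood of one dissolution and one non-dissolution, and normalizes; the resulting posterior is flat, matching the uniform distribution described above. The grid resolution is arbitrary.

    import numpy as np

    # Grid over (0, 1), excluding the endpoints where the Haldane prior diverges.
    p = np.linspace(0.001, 0.999, 999)
    dp = p[1] - p[0]

    haldane_prior = 1.0 / (p * (1.0 - p))   # improper prior p^(-1) (1 - p)^(-1)
    likelihood = p * (1.0 - p)              # one dissolution and one non-dissolution

    posterior = haldane_prior * likelihood
    posterior /= posterior.sum() * dp       # normalize numerically

    # The posterior density is (numerically) constant, i.e. uniform on [0, 1].
    print(posterior.min(), posterior.max())  # both approximately 1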
Priors can be constructed which are proportional to the Haar measure if the parameter
space X carries a natural group structure which leaves invariant our Bayesian state of knowledge
(Jaynes, 1968). This can be seen as a generalisation of the invariance principle used to justify the
uniform prior over the three cups in the example above. For example, in physics we might expect that
an experiment will give the same results regardless of our choice of the origin of a coordinate system.
This induces the group structure of the translation group on X, which determines the prior probability
as a constant improper prior. Similarly, some measurements are naturally invariant to the choice of an
arbitrary scale (i.e., it doesn't matter if we use centimeters or inches, we should get results that are
physically the same). In such a case, the scale group is the natural group structure, and the
corresponding prior on X is proportional to 1/x. It sometimes matters whether we use the left-invariant
or right-invariant Haar measure. For example, the left and right invariant Haar measures on the affine
group are not equal. Berger (1985, p. 413) argues that the right-invariant Haar measure is the correct
choice.
Another idea, championed by Edwin T. Jaynes, is to use the principle of maximum
entropy (MAXENT). The motivation is that the Shannon entropy of a probability distribution measures
the amount of information contained in the distribution. The larger the entropy, the less information is
provided by the distribution. Thus, by maximizing the entropy over a suitable set of probability
distributions on X, one finds the distribution that is least informative in the sense that it contains the
least amount of information consistent with the constraints that define the set. For example, the
maximum entropy prior on a discrete space, given only that the probability is normalized to 1, is the
prior that assigns equal probability to each state. And in the continuous case, the maximum entropy prior, given that the density is normalized with mean zero and variance unity, is the standard normal distribution. The principle of minimum cross-entropy generalizes MAXENT to the case of "updating" an arbitrary prior distribution with suitable constraints in the maximum-entropy sense.
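A small check of the discrete claim: among distributions on a finite set, the uniform one has the largest Shannon entropy. The sketch below compares the entropy of the uniform distribution on four states with two arbitrarily chosen alternatives.

    import math

    def shannon_entropy(dist):
        """Shannon entropy in nats of a discrete distribution (zero terms ignored)."""
        return -sum(q * math.log(q) for q in dist if q > 0)

    uniform = [0.25, 0.25, 0.25, 0.25]
    skewed = [0.70, 0.10, 0.10, 0.10]
    degenerate = [1.0, 0.0, 0.0, 0.0]

    for dist in (uniform, skewed, degenerate):
        print(dist, round(shannon_entropy(dist), 3))
    # The uniform distribution attains the maximum, log(4), about 1.386 nats.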
A related idea, reference priors, was introduced by José-Miguel Bernardo. Here, the idea is to
maximize the expected Kullback–Leibler divergence of the posterior distribution relative to the prior.
This maximizes the expected posterior information about X when the prior density is p(x); thus, in
some sense, p(x) is the "least informative" prior about X. The reference prior is defined in the
asymptotic limit, i.e., one considers the limit of the priors so obtained as the number of data points
goes to infinity. Reference priors are often the objective prior of choice in multivariate problems, since
other rules (e.g., Jeffreys' rule) may result in priors with problematic behavior.
Objective prior distributions may also be derived from other principles, such as information or coding
theory (see e.g. minimum description length) or frequentist statistics (see frequentist matching).
Philosophical problems associated with uninformative priors are associated with the choice of an
appropriate metric, or measurement scale. Suppose we want a prior for the running speed of a runner
who is unknown to us. We could specify, say, a normal distribution as the prior for his speed, but
alternatively we could specify a normal prior for the time he takes to complete 100 metres, which is
proportional to the reciprocal of the first prior. These are very different priors, but it is not clear which
is to be preferred. Jaynes' often-overlooked method of transformation groups can answer this
question in some situations.[2]
Similarly, if asked to estimate an unknown proportion between 0 and 1, we might say that all
proportions are equally likely and use a uniform prior. Alternatively, we might say that all orders of
magnitude for the proportion are equally likely, the logarithmic prior, which is the uniform prior on the
logarithm of proportion. The Jeffreys prior attempts to solve this problem by computing a prior which
expresses the same belief no matter which metric is used. The Jeffreys prior for an unknown
proportion p is p^(−1/2)(1 − p)^(−1/2), which differs from Jaynes' recommendation.
Priors based on notions of algorithmic probability are used in inductive inference as a basis for
induction in very general settings.
Practical problems associated with uninformative priors include the requirement that the posterior
distribution be proper. The usual uninformative priors on continuous, unbounded variables are
improper. This need not be a problem if the posterior distribution is proper. Another issue of
importance is that if an uninformative prior is to be used routinely, i.e., with many different data sets, it
should have good frequentist properties. Normally a Bayesian would not be concerned with such
issues, but it can be important in this situation. For example, one would want any decision rule based
on the posterior distribution to be admissible under the adopted loss function. Unfortunately,
admissibility is often difficult to check, although some results are known (e.g., Berger and
Strawderman 1996). The issue is particularly acute with hierarchical Bayes models; the usual priors
(e.g., Jeffreys' prior) may give badly inadmissible decision rules if employed at the higher levels of the
hierarchy.

Improper priors
If Bayes' theorem is written as

P(Ai | B) = P(B | Ai) P(Ai) / Σj P(B | Aj) P(Aj),
then it is clear that the same result would be obtained if all the prior probabilities P(Ai) and P(Aj) were
multiplied by a given constant; the same would be true for a continuous random variable. If the
summation in the denominator converges, the posterior probabilities will still sum (or integrate) to 1
even if the prior values do not, and so the priors may only need to be specified in the correct
proportion. Taking this idea further, in many cases the sum or integral of the prior values may not
even need to be finite to get sensible answers for the posterior probabilities. When this is the case,
the prior is called an improper prior. However, the posterior distribution need not be a proper
distribution if the prior is improper. This is clear from the case where event  B is independent of all of
the Aj.
Some statisticians[citation needed] use improper priors as uninformative priors. For example, if they need a
prior distribution for the mean and variance of a random variable, they may
assume p(m, v) ~ 1/v (for v > 0) which would suggest that any value for the mean is "equally likely"
and that a value for the positive variance becomes "less likely" in inverse proportion to its value. Many
authors (Lindley, 1973; De Groot, 1937; Kass and Wasserman, 1996) [citation needed] warn against the
danger of over-interpreting those priors since they are not probability densities. The only relevance
they have is found in the corresponding posterior, as long as it is well-defined for all observations.
(The Haldane prior is a typical counterexample.[clarification needed][citation needed])

Posterior probability
In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is
the conditional probability that is assigned after the relevant evidence is taken into account. Similarly,
the posterior probability distribution is the distribution of an unknown quantity, treated as a random
variable, conditional on the evidence obtained from an experiment or survey.
Definition
The posterior probability is the probability of the parameters θ given the evidence X: p(θ | X). It contrasts with the likelihood function, which is the probability of the evidence given the parameters: p(X | θ).

The two are related as follows. Suppose the prior belief is that the probability distribution function is p(θ) and the observations x have likelihood p(x | θ); then the posterior probability is defined as

p(θ | x) = p(x | θ) p(θ) / p(x). [1]

The posterior probability can be written in the memorable form as

Posterior probability ∝ Likelihood × Prior probability.
Example
Suppose there is a mixed school having 60% boys and 40% girls as students. The girl students wear
trousers or skirts in equal numbers; the boys all wear trousers. An observer sees a (random) student
from a distance; all the observer can see is that this student is wearing trousers. What is the
probability this student is a girl? The correct answer can be computed using Bayes' theorem.
The event G is that the student observed is a girl, and the event T is that the student observed is
wearing trousers. To compute P(G|T), we first need to know:
 P(G), or the probability that the student is a girl regardless of any other information. Since the
observer sees a random student, meaning that all students have the same probability of being
observed, and the percentage of girls among the students is 40%, this probability equals 0.4.
 P(G'), or the probability that the student is not a girl (i.e., a boy) regardless of any other
information (G' is the complementary event to G). This is 60%, or 0.6.
 P(T|G), or the probability of the student wearing trousers given that the student is a girl. As
they are as likely to wear skirts as trousers, this is 0.5.
 P(T|G'), or the probability of the student wearing trousers given that the student is a boy. This
is given as 1.
 P(T), or the probability of a (randomly selected) student wearing trousers regardless of any
other information. Since P(T) = P(T|G)P(G) + P(T|G')P(G') (via the law of total probability), this
is 0.5×0.4 + 1×0.6 = 0.8.
Given all this information, the probability of the observer having spotted a girl, given that the observed student is wearing trousers, can be computed by substituting these values in the formula:

P(G|T) = P(T|G) P(G) / P(T) = (0.5 × 0.4) / 0.8 = 0.25.
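The same arithmetic as a short Python sketch, using the probabilities listed above:

    # Trousers example: probability the observed student is a girl, given trousers.
    P_G = 0.4          # P(girl)
    P_B = 0.6          # P(boy), the complement of P_G
    P_T_given_G = 0.5  # girls wear trousers half the time
    P_T_given_B = 1.0  # boys always wear trousers

    P_T = P_T_given_G * P_G + P_T_given_B * P_B  # law of total probability: 0.8
    P_G_given_T = P_T_given_G * P_G / P_T        # Bayes' theorem: 0.25
    print(P_T, P_G_given_T)                      # approximately 0.8 and 0.25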
Calculation

The posterior probability distribution of one random variable given the value of another can be calculated with Bayes' theorem by multiplying the prior probability distribution by the likelihood function, and then dividing by the normalizing constant, as follows:

f_{X|Y=y}(x) = f_X(x) L_{X|Y=y}(x) / ∫ f_X(u) L_{X|Y=y}(u) du

gives the posterior probability density function for a random variable X given the data Y = y, where
 f_X(x) is the prior density of X,
 L_{X|Y=y}(x) = f_{Y|X=x}(y) is the likelihood function as a function of x,
 ∫ f_X(u) L_{X|Y=y}(u) du is the normalizing constant, and
 f_{X|Y=y}(x) is the posterior density of X given the data Y = y.
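A minimal numerical sketch of this recipe, assuming purely for illustration a standard normal prior on X and an observation model Y | X = x ~ N(x, 1); the grid and the observed value y = 1.2 are arbitrary.

    import numpy as np

    # Grid approximation of: posterior = prior * likelihood / normalizing constant.
    x = np.linspace(-5, 5, 2001)
    dx = x[1] - x[0]
    y_obs = 1.2                                                       # made-up observation

    prior = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)                  # standard normal prior on X
    likelihood = np.exp(-0.5 * (y_obs - x)**2) / np.sqrt(2 * np.pi)   # Y | X = x ~ N(x, 1)

    unnormalized = prior * likelihood
    posterior = unnormalized / (unnormalized.sum() * dx)              # divide by normalizing constant

    # For this conjugate setup the posterior is N(y/2, 1/2); the grid mean agrees.
    print((x * posterior).sum() * dx)   # about 0.6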

Likelihood function
In statistics, a likelihood function (often simply the likelihood) is a function of the parameters of
a statistical model, defined as follows: the likelihood of a set of parameter values given some
observed outcomes is equal to the probability of those observed outcomes given those parameter
values. Likelihood functions play a key role in statistical inference, especially methods of estimating a
parameter from a set of statistics.
In non-technical parlance, "likelihood" is usually a synonym for "probability" but in statistical usage, a
clear technical distinction is made. One may ask "If I were to flip a fair coin 100 times, what is
the probability of it landing heads-up every time?" or "Given that I have flipped a coin 100 times and it
has landed heads-up 100 times, what is the likelihood that the coin is fair?" but it would be improper to
switch "likelihood" and "probability" in the two sentences.
If a probability distribution depends on a parameter, one may on the one hand consider—for a given
value of the parameter—the probability (density) of the different outcomes, and on the other hand
consider—for a given outcome—the probability (density) this outcome has occurred for different
values of the parameter. The first approach interprets the probability distribution as a function of the
outcome, given a fixed parameter value, while the second interprets it as a function of the parameter,
given a fixed outcome. In the latter case the function is called the "likelihood function" of the
parameter, and indicates how likely a parameter value is in light of the observed outcome.

Definition
For the definition of the likelihood function, one has to distinguish between discrete and continuous
probability distributions.

Discrete probability distribution


Let X be a random variable with a discrete probability distribution p depending on a parameter θ. Then
the function

L(θ | x) = p_θ(x) = P_θ(X = x),

considered as a function of θ, is called the likelihood function (of θ, given the outcome x of X).
Sometimes the probability of the value x of X for the parameter value θ is written as P(X = x | θ),
but should not be considered as a conditional probability, because θ is a parameter and not a random
variable.

Continuous probability distribution


Let X be a random variable with a continuous probability distribution with density function f depending
on a parameter θ. Then the function

L(θ | x) = f_θ(x),

considered as a function of θ, is called the likelihood function (of θ, given the outcome x of X).
Sometimes the density function for the value x of X for the parameter value θ is written as f(x | θ),
but should not be considered as a conditional probability density.
The actual value of a likelihood function bears no meaning. Its use lies in comparing one value with
another. E.g., one value of the parameter may be more likely than another, given the outcome of the
sample. Or a specific value will be most likely: the maximum likelihood estimate. Comparison may
also be performed in considering the quotient of two likelihood values. That is why L(θ | x) is generally permitted to be any positive multiple of the above defined function. More precisely, then, a likelihood function is any representative from an equivalence class of functions,

L(θ | x) = α P(X = x | θ),

where the constant of proportionality α > 0 is not permitted to depend upon θ, and is required to be the same for all likelihood functions used in any one comparison. In particular, the numerical value L(θ | x) alone is immaterial; all that matters are maximum values of L, or likelihood ratios, such as those of the form

L(θ2 | x) / L(θ1 | x),
that are invariant with respect to the constant of proportionality α.


A. W. F. Edwards defined support to be the natural logarithm of the likelihood ratio, and the support
function as the natural logarithm of the likelihood function (the same as the log-likelihood; see
below).[1] However, there is potential for confusion with the mathematical meaning of 'support', and
this terminology is not widely used outside Edwards' main applied field of phylogenetics.
For more about making inferences via likelihood functions, see also the method of maximum
likelihood, and likelihood-ratio testing.

Log-likelihood

For many applications involving likelihood functions, it is more convenient to work in terms of
the natural logarithm of the likelihood function, called the log-likelihood, than in terms of the
likelihood function itself. Because the logarithm is a monotonically increasing function, the logarithm of
a function achieves its maximum value at the same points as the function itself, and hence the log-
likelihood can be used in place of the likelihood in maximum likelihood estimation and related
techniques. Finding the maximum of a function often involves taking the derivative of a function and
solving for the parameter being maximized, and this is often easier when the function being
maximized is a log-likelihood rather than the original likelihood function.
For example, some likelihood functions are for the parameters that explain a collection of statistically
independent observations. In such a situation, the likelihood function factors into a product of
individual likelihood functions. The logarithm of this product is a sum of individual logarithms, and
the derivative of a sum of terms is often easier to compute than the derivative of a product. In addition,
several common distributions have likelihood functions that contain products of factors
involving exponentiation. The logarithm of such a function is a sum of products, again easier to
differentiate than the original function.
As an example, consider the gamma distribution, whose likelihood function is

L(α, β | x) = β^α x^(α−1) e^(−βx) / Γ(α),

and suppose we wish to find the maximum likelihood estimate of β for a single observed value x. This function looks rather daunting. Its logarithm, however, is much simpler to work with:

log L(α, β | x) = α log β + (α − 1) log x − βx − log Γ(α).

The partial derivative with respect to β is simply

∂ log L(α, β | x) / ∂β = α/β − x.

If there are a number of independent random samples x1, …, xn, then the joint log-likelihood will be the sum of individual log-likelihoods, and the derivative of this sum will be the sum of individual derivatives:

∂/∂β Σ_{i=1}^{n} log L(α, β | xi) = nα/β − (x1 + … + xn).

Setting that equal to zero and solving for β yields

β̂ = α / x̄,

where β̂ denotes the maximum-likelihood estimate and x̄ = (x1 + … + xn)/n is the sample mean of the observations.
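A quick numerical check of this closed form, assuming an arbitrary known shape α and some made-up observations: the joint log-likelihood is maximized over a crude grid of β values and compared with α/x̄.

    import math

    # Made-up data and an assumed (known) shape parameter alpha.
    alpha = 3.0
    xs = [1.2, 0.7, 2.5, 1.9, 0.4]
    x_bar = sum(xs) / len(xs)

    def joint_log_likelihood(beta):
        """Sum of gamma log-likelihoods in the rate parameterization."""
        return sum(alpha * math.log(beta) + (alpha - 1) * math.log(x)
                   - beta * x - math.lgamma(alpha) for x in xs)

    # Crude grid search over beta, to compare with the closed-form alpha / x_bar.
    betas = [0.01 * k for k in range(1, 1001)]
    beta_grid = max(betas, key=joint_log_likelihood)

    print(alpha / x_bar, beta_grid)   # both about 2.24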

Likelihood function of a parameterized model


Among many applications, we consider here one of broad theoretical and practical importance. Given
a parameterized family of probability density functions (or probability mass functions in the case of
discrete distributions)

x ↦ f(x | θ),

where θ is the parameter, the likelihood function is

θ ↦ f(x | θ),

written

L(θ | x) = f(x | θ),
where x is the observed outcome of an experiment. In other words, when f(x | θ) is viewed as a function of x with θ fixed, it is a probability density function, and when viewed as a function of θ with x fixed, it is a likelihood function.
Note: This is not the same as the probability that those parameters are the right ones, given the
observed sample. Attempting to interpret the likelihood of a hypothesis given observed evidence as
the probability of the hypothesis is a common error, with potentially disastrous real-world
consequences in medicine, engineering or jurisprudence. See prosecutor's fallacy for an example of
this.
From a geometric standpoint, if we consider f (x, θ) as a function of two variables then the family of
probability distributions can be viewed as a family of curves parallel to the x-axis, while the family of
likelihood functions are the orthogonal curves parallel to the θ-axis.

Likelihoods for continuous distributions


The use of the probability density instead of a probability in specifying the likelihood function above
may be justified in a simple way. Suppose that, instead of an exact observation, x, the observation is the value in a short interval (x_{j−1}, x_j), with length Δ_j, where the subscripts refer to a predefined set of intervals. Then the probability of getting this observation (of being in interval j) is approximately

Pr(x_{j−1} < X ≤ x_j | θ) ≈ f(x* | θ) Δ_j,

where x* can be any point in interval j. Then, recalling that the likelihood function is defined up to a multiplicative constant, it is just as valid to say that the likelihood function is approximately

L(θ | x in interval j) ≈ f(x* | θ),

and then, on considering the lengths of the intervals to decrease to zero,

L(θ | x) = f(x | θ).
Likelihoods for mixed continuous–discrete distributions


The above can be extended in a simple way to allow consideration of distributions which contain both
discrete and continuous components. Suppose that the distribution consists of a number of discrete
probability masses p_k(θ) and a density f(x | θ), where the sum of all the p's added to the integral of f is
always one. Assuming that it is possible to distinguish an observation corresponding to one of the
discrete probability masses from one which corresponds to the density component, the likelihood
function for an observation from the continuous component can be dealt with as above by setting the
interval length short enough to exclude any of the discrete masses. For an observation from the
discrete component, the probability can either be written down directly or treated within the above
context by saying that the probability of getting an observation in an interval that does contain a
discrete component (of being in interval j which contains discrete component k) is approximately

Pr(x_{j−1} < X ≤ x_j | θ) ≈ p_k(θ) + f(x* | θ) Δ_j,

where x* can be any point in interval j. Then, on considering the lengths of the intervals to decrease to zero, the likelihood function for an observation from the discrete component is

L(θ | x) = p_k(θ),
where k is the index of the discrete probability mass corresponding to observation x.


The fact that the likelihood function can be defined in a way that includes contributions that are not
commensurate (the density and the probability mass) arises from the way in which the likelihood
function is defined up to a constant of proportionality, where this "constant" can change with the
observation x, but not with the parameter θ.

Example 1
Let p_H be the probability that a certain coin lands heads up (H) when tossed. So, the probability of getting two heads in two tosses (HH) is p_H². If p_H = 0.5, then the probability of seeing two heads is 0.25.

In symbols, we can say the above as:

P(HH | p_H = 0.5) = 0.25.

Another way of saying this is to reverse it and say that "the likelihood that p_H = 0.5, given the observation HH, is 0.25"; that is:

L(p_H = 0.5 | HH) = P(HH | p_H = 0.5) = 0.25.

But this is not the same as saying that the probability that p_H = 0.5, given the observation HH, is 0.25.

Notice that the likelihood that p_H = 1, given the observation HH, is 1. But it is clearly not true that the probability that p_H = 1, given the observation HH, is 1. Two heads in a row hardly proves that the coin always comes up heads. In fact, two heads in a row is possible for any p_H > 0.

The likelihood function is not a probability density function. Notice that the integral of a likelihood function is not in general 1. In this example, the integral of the likelihood L(p_H | HH) = p_H² over the interval [0, 1] in p_H is 1/3, demonstrating that the likelihood function cannot be interpreted as a probability density function for p_H.
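A brief sketch of these numbers: the likelihood of p_H after observing HH is p_H², evaluated at a couple of values and then integrated over [0, 1]; the grid used for the integral is arbitrary.

    import numpy as np

    def likelihood(p_heads):
        """Likelihood of p_heads given the observation HH (two heads in two tosses)."""
        return p_heads ** 2

    print(likelihood(0.5))   # 0.25
    print(likelihood(1.0))   # 1.0

    # The likelihood is not a density: its integral over [0, 1] is 1/3, not 1.
    p = np.linspace(0.0, 1.0, 10001)
    print(likelihood(p).sum() * (p[1] - p[0]))   # about 0.333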
