The rule
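Single event
Given events $A_1$, $A_2$ and $B$, Bayes' rule states that the conditional odds of $A_1 : A_2$ given $B$ are
equal to the marginal odds of $A_1 : A_2$ multiplied by the Bayes factor $\Lambda$:
$$O(A_1 : A_2 \mid B) = \Lambda(A_1 : A_2 \mid B) \cdot O(A_1 : A_2),$$
where
$$\Lambda(A_1 : A_2 \mid B) = \frac{P(B \mid A_1)}{P(B \mid A_2)}.$$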
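Multiple events
Bayes' rule may be conditioned on an arbitrary number of events. For two events
$B$ and $C$,
$$O(A_1 : A_2 \mid B \cap C) = \Lambda(A_1 : A_2 \mid B \cap C) \cdot \Lambda(A_1 : A_2 \mid B) \cdot O(A_1 : A_2),$$
where
$$\Lambda(A_1 : A_2 \mid B) = \frac{P(B \mid A_1)}{P(B \mid A_2)}, \qquad \Lambda(A_1 : A_2 \mid B \cap C) = \frac{P(C \mid A_1 \cap B)}{P(C \mid A_2 \cap B)}.$$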
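Dividing Bayes' theorem for $A_1$ by Bayes' theorem for $A_2$ cancels $P(B)$ and gives
$$\frac{P(A_1 \mid B)}{P(A_2 \mid B)} = \frac{P(B \mid A_1)}{P(B \mid A_2)} \cdot \frac{P(A_1)}{P(A_2)}.$$
Now defining
$$O(A_1 : A_2 \mid B) = \frac{P(A_1 \mid B)}{P(A_2 \mid B)}, \qquad O(A_1 : A_2) = \frac{P(A_1)}{P(A_2)}, \qquad \Lambda(A_1 : A_2 \mid B) = \frac{P(B \mid A_1)}{P(B \mid A_2)},$$
this implies
$$O(A_1 : A_2 \mid B) = \Lambda(A_1 : A_2 \mid B) \cdot O(A_1 : A_2).$$
A similar derivation applies for conditioning on multiple events, using the appropriate extension of
Bayes' theorem.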
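The chaining of Bayes factors for several events can be sketched in Python. This is only an illustration: the function name `update_odds` and all numerical values are hypothetical, chosen to make the arithmetic easy to follow.

```python
from fractions import Fraction

def update_odds(prior_odds, likelihood_ratios):
    """Multiply the prior odds by each Bayes factor in turn."""
    odds = prior_odds
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds

# Hypothetical numbers, for illustration only.
prior_odds = Fraction(1, 4)          # O(A1:A2)
lambda_b = Fraction(3, 1)            # P(B|A1) / P(B|A2)
lambda_c_given_b = Fraction(2, 1)    # P(C|A1 and B) / P(C|A2 and B)

odds_after_b = update_odds(prior_odds, [lambda_b])
odds_after_b_and_c = update_odds(prior_odds, [lambda_b, lambda_c_given_b])
print(odds_after_b, odds_after_b_and_c)
```

Note that the second factor must be the likelihood ratio for C conditional on B; the two ratios can be multiplied as if unconditional only when B and C are conditionally independent given each hypothesis.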
Examples
Frequentist example
Consider the drug testing example in the article on Bayes' theorem.
The same results may be obtained using Bayes' rule. The prior odds on an individual being a drug-user
are 199 to 1 against, as 0.5% = 1/200 of individuals are users and 99.5% = 199/200 are not. The Bayes factor when an
individual tests positive is 0.99/0.01 = 99:1 in favour of being a drug-user: this is the ratio of the
probability of a drug-user testing positive, to the probability of a non-drug user testing positive. The
posterior odds on being a drug user are therefore 1 × 99 : 199 × 1 = 99:199, which is
very close to 100:200 = 1:2. In round numbers, only one in three of those testing positive are
actually drug-users.
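The odds calculation can be checked with a few lines of Python. This is a minimal sketch: the figures (0.5% prevalence, 99% sensitivity, 1% false-positive rate) are assumed to be those of the drug-testing example referred to above, and the function name `posterior_odds` is illustrative.

```python
from fractions import Fraction

def posterior_odds(prior_odds, bayes_factor):
    """Bayes' rule in odds form: posterior odds = Bayes factor * prior odds."""
    return bayes_factor * prior_odds

# Assumed figures for the drug-testing example.
prevalence = Fraction(5, 1000)       # P(user)
sensitivity = Fraction(99, 100)      # P(positive | user)
false_positive = Fraction(1, 100)    # P(positive | non-user)

prior = prevalence / (1 - prevalence)          # odds of user : non-user
bayes_factor = sensitivity / false_positive
odds = posterior_odds(prior, bayes_factor)
probability = odds / (1 + odds)                # convert odds to a probability

print(odds, float(probability))
```

Under these assumptions the posterior probability comes out just under one third, in line with the round-number statement above.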
Prior probability
In Bayesian statistical inference, a prior probability distribution, often called simply the prior, of an
uncertain quantity p (for example, suppose p is the proportion of voters who will vote for the politician
named Smith in a future election) is the probability distribution that would express one's uncertainty
about p before the "data" (for example, an opinion poll) is taken into account. It is meant to attribute
uncertainty rather than randomness to the uncertain quantity. The unknown quantity may be
a parameter or latent variable.
One applies Bayes' theorem, multiplying the prior by the likelihood function and then normalizing, to
get the posterior probability distribution, which is the conditional distribution of the uncertain quantity
given the data.
A prior is often the purely subjective assessment of an experienced expert. Some will choose
a conjugate prior when they can, to make calculation of the posterior distribution easier.
Parameters of prior distributions are called hyperparameters, to distinguish them from parameters of
the model of the underlying data. For instance, if one is using a beta distribution to model the
distribution of the parameter p of a Bernoulli distribution, then:
p is a parameter of the underlying system (Bernoulli distribution), and
α and β are parameters of the prior distribution (beta distribution), hence hyperparameters.
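The distinction between parameter and hyperparameters can be illustrated with the standard conjugate update for this model: a Beta(α, β) prior combined with Bernoulli observations gives a Beta(α + successes, β + failures) posterior. The sketch below uses hypothetical hyperparameters and data, and the function names are illustrative.

```python
def beta_bernoulli_update(alpha, beta, observations):
    """Return the posterior hyperparameters of a Beta(alpha, beta) prior
    after observing a sequence of Bernoulli outcomes (1 = success)."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Hypothetical hyperparameters and data, for illustration only.
prior_alpha, prior_beta = 2.0, 2.0
data = [1, 0, 1, 1, 0, 1, 1, 1]   # six successes, two failures

post_alpha, post_beta = beta_bernoulli_update(prior_alpha, prior_beta, data)
print(post_alpha, post_beta, beta_mean(post_alpha, post_beta))
```

Only the hyperparameters change in the update; p itself is never observed, and the posterior over p is summarized by the new pair (α, β).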
Informative priors
An informative prior expresses specific, definite information about a variable. An example is a prior
distribution for the temperature at noon tomorrow. A reasonable approach is to make the prior
a normal distribution with expected value equal to today's noontime temperature, with variance equal
to the day-to-day variance of atmospheric temperature, or a distribution of the temperature for that
day of the year.
This example has a property in common with many priors, namely, that the posterior from one
problem (today's temperature) becomes the prior for another problem (tomorrow's temperature); pre-
existing evidence which has already been taken into account is part of the prior and as more evidence
accumulates the prior is determined largely by the evidence rather than any original assumption,
provided that the original assumption admitted the possibility of what the evidence is suggesting. The
terms "prior" and "posterior" are generally relative to a specific datum or observation.
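The way a posterior becomes the next prior can be sketched with the conjugate normal model. This is an assumption made for illustration, not something the text prescribes: the prior is normal, each observation is normal with a known noise variance, and all numbers are hypothetical.

```python
def normal_update(prior_mean, prior_var, observation, noise_var):
    """Posterior mean and variance for a normal prior and a single
    normally distributed observation with known noise variance."""
    precision = 1.0 / prior_var + 1.0 / noise_var
    post_var = 1.0 / precision
    post_mean = post_var * (prior_mean / prior_var + observation / noise_var)
    return post_mean, post_var

# Hypothetical values: prior centred on 20 degrees with variance 9,
# and a measurement noise variance of 4.
mean, var = 20.0, 9.0
for reading in [23.0, 22.0, 24.0]:
    # The posterior from one reading is the prior for the next.
    mean, var = normal_update(mean, var, reading, noise_var=4.0)

print(mean, var)
```

After each reading the variance shrinks and the mean moves toward the data, so the influence of the original prior assumption fades as evidence accumulates.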
Uninformative priors
An uninformative prior expresses vague or general information about a variable. The term
"uninformative prior" may be somewhat of a misnomer; often, such a prior might be called a not very
informative prior, or an objective prior, i.e. one that's not subjectively elicited. Uninformative priors can
express "objective" information such as "the variable is positive" or "the variable is less than some
limit".
The simplest and oldest rule for determining a non-informative prior is the principle of indifference,
which assigns equal probabilities to all possibilities.
In parameter estimation problems, the use of an uninformative prior typically yields results which are
not too different from conventional statistical analysis, as the likelihood function often yields more
information than the uninformative prior.
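A small example of this agreement, under the assumption of a Bernoulli model with a uniform Beta(1, 1) prior: the posterior is Beta(1 + k, 1 + n − k), and its mode coincides with the maximum likelihood estimate k/n. The function names and data below are hypothetical.

```python
def posterior_mode_uniform_prior(successes, trials):
    """Mode of the Beta(1 + k, 1 + n - k) posterior that results from a
    uniform Beta(1, 1) prior on a Bernoulli parameter."""
    alpha = 1 + successes
    beta = 1 + trials - successes
    return (alpha - 1) / (alpha + beta - 2)

def maximum_likelihood_estimate(successes, trials):
    """Conventional estimate of a Bernoulli parameter."""
    return successes / trials

# Hypothetical data: 37 successes in 100 trials.
k, n = 37, 100
print(posterior_mode_uniform_prior(k, n), maximum_likelihood_estimate(k, n))
```

The posterior mean, (k + 1)/(n + 2), differs slightly from k/n, but the difference vanishes as n grows.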
Some attempts have been made at finding a priori probabilities, i.e. probability distributions in some
sense logically required by the nature of one's state of uncertainty; these are a subject of
philosophical controversy, with Bayesians being roughly divided into two schools: "objective
Bayesians", who believe such priors exist in many useful situations, and "subjective Bayesians" who
believe that in practice priors usually represent subjective judgements of opinion that cannot be
rigorously justified (Williamson 2010). Perhaps the strongest arguments for objective Bayesianism
were given by Edwin T. Jaynes.
As an example of an a priori prior, due to Jaynes (2003), consider a situation in which one knows a
ball has been hidden under one of three cups, A, B or C, but no other information is available about its
location.
Improper priors
If Bayes' theorem is written as
$$P(A_i \mid B) = \frac{P(B \mid A_i)\, P(A_i)}{\sum_j P(B \mid A_j)\, P(A_j)},$$
then it is clear that the same result would be obtained if all the prior probabilities P(Ai) and P(Aj) were
multiplied by a given constant; the same would be true for a continuous random variable. If the
summation in the denominator converges, the posterior probabilities will still sum (or integrate) to 1
even if the prior values do not, and so the priors may only need to be specified in the correct
proportion. Taking this idea further, in many cases the sum or integral of the prior values may not
even need to be finite to get sensible answers for the posterior probabilities. When this is the case,
the prior is called an improper prior. However, the posterior distribution need not be a proper
distribution if the prior is improper. This is clear from the case where event B is independent of all of
the Aj.
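The invariance of the posterior under rescaling of the prior is easy to verify numerically. The following sketch uses a hypothetical three-hypothesis problem; the function name `posterior` and all values are illustrative.

```python
def posterior(prior_weights, likelihoods):
    """Discrete Bayes' theorem; the prior weights need not sum to 1."""
    joint = [p * l for p, l in zip(prior_weights, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical three-hypothesis problem.
prior = [0.2, 0.3, 0.5]
likelihood = [0.9, 0.5, 0.1]             # P(B | A_j)

scaled_prior = [10 * p for p in prior]   # same proportions, sums to 10

print(posterior(prior, likelihood))
print(posterior(scaled_prior, likelihood))
```

Both calls print the same posterior, because the constant cancels between numerator and denominator. If the likelihood were the same for every hypothesis (B independent of all the A_j), the posterior would simply be the normalized prior, which does not exist when the prior weights have an infinite sum.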
Some statisticians use improper priors as uninformative priors. For example, if they need a
prior distribution for the mean and variance of a random variable, they may
assume p(m, v) ~ 1/v (for v > 0) which would suggest that any value for the mean is "equally likely"
and that a value for the positive variance becomes "less likely" in inverse proportion to its value. Many
authors (Lindley, 1973; De Groot, 1937; Kass and Wasserman, 1996) warn against the
danger of over-interpreting those priors since they are not probability densities. The only relevance
they have is found in the corresponding posterior, as long as it is well-defined for all observations.
(The Haldane prior is a typical counterexample.)
Posterior probability
In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is
the conditional probability that is assigned after the relevant evidence is taken into account. Similarly,
the posterior probability distribution is the distribution of an unknown quantity, treated as a random
variable, conditional on the evidence obtained from an experiment or survey.
Definition
The posterior probability is the probability of the parameters $\theta$ given the evidence $X$: $p(\theta \mid x)$.
It contrasts with the likelihood function, which is the probability of the evidence given the
parameters: $p(x \mid \theta)$.
The two are related as follows:
Let us have a prior belief that the probability distribution function is $p(\theta)$ and observations $x$ with
the likelihood $p(x \mid \theta)$; then the posterior probability is defined as
$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)}.$$ [1]
Example
Suppose there is a mixed school having 60% boys and 40% girls as students. The girl students wear
trousers or skirts in equal numbers; the boys all wear trousers. An observer sees a (random) student
from a distance; all the observer can see is that this student is wearing trousers. What is the
probability this student is a girl? The correct answer can be computed using Bayes' theorem.
The event G is that the student observed is a girl, and the event T is that the student observed is
wearing trousers. To compute P(G|T), we first need to know:
P(G), or the probability that the student is a girl regardless of any other information. Since the
observer sees a random student, meaning that all students have the same probability of being
observed, and the percentage of girls among the students is 40%, this probability equals 0.4.
P(G'), or the probability that the student is not a girl (i.e., a boy) regardless of any other
information (G' is the complementary event to G). This is 60%, or 0.6.
P(T|G), or the probability of the student wearing trousers given that the student is a girl. As
they are as likely to wear skirts as trousers, this is 0.5.
P(T|G'), or the probability of the student wearing trousers given that the student is a boy. This
is given as 1.
P(T), or the probability of a (randomly selected) student wearing trousers regardless of any
other information. Since P(T) = P(T|G)P(G) + P(T|G')P(G') (via the law of total probability), this
is 0.5×0.4 + 1×0.6 = 0.8.
Given all this information, the probability of the observer having spotted a girl given that the observed
student is wearing trousers can be computed by substituting these values in the formula:
$$P(G \mid T) = \frac{P(T \mid G)\, P(G)}{P(T)} = \frac{0.5 \times 0.4}{0.8} = 0.25.$$
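The same arithmetic can be reproduced in Python using only the quantities listed above; the variable names are illustrative.

```python
p_girl = 0.4
p_boy = 1 - p_girl
p_trousers_given_girl = 0.5
p_trousers_given_boy = 1.0

# Law of total probability: P(T) = P(T|G)P(G) + P(T|G')P(G').
p_trousers = p_trousers_given_girl * p_girl + p_trousers_given_boy * p_boy

# Bayes' theorem: P(G|T) = P(T|G)P(G) / P(T).
p_girl_given_trousers = p_trousers_given_girl * p_girl / p_trousers

print(p_trousers, p_girl_given_trousers)
```

Seeing trousers lowers the probability that the student is a girl from the prior 0.4 to a posterior of one quarter, because trousers are twice as likely on a boy as on a girl.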
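Calculation
The posterior probability distribution of one random variable given the value of another can be
calculated with Bayes' theorem by multiplying the prior probability distribution by the likelihood
function, and then dividing by the normalizing constant, as follows:
$$f_{X \mid Y=y}(x) = \frac{f_X(x)\, L_{X \mid Y=y}(x)}{\int_{-\infty}^{\infty} f_X(u)\, L_{X \mid Y=y}(u)\, du}$$
gives the posterior probability density function for a random variable X given the data Y = y, where
$f_X(x)$ is the prior density of X,
$L_{X \mid Y=y}(x) = f_{Y \mid X=x}(y)$ is the likelihood function as a function of x, and
the integral in the denominator is the normalizing constant.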
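When the normalizing constant has no convenient closed form, the calculation can be approximated on a grid. The sketch below is one simple way to do this, not a recommended general method; the setup (a coin's heads probability with a uniform prior and 7 heads in 10 tosses) and the function name `grid_posterior` are hypothetical.

```python
def grid_posterior(prior_density, likelihood, grid):
    """Approximate a posterior density on an evenly spaced grid by
    multiplying prior and likelihood and normalizing numerically."""
    step = grid[1] - grid[0]
    unnormalized = [prior_density(x) * likelihood(x) for x in grid]
    # Trapezoidal rule for the normalizing constant.
    area = step * (sum(unnormalized) - 0.5 * (unnormalized[0] + unnormalized[-1]))
    return [u / area for u in unnormalized]

# Hypothetical setup: X is a coin's heads probability with a uniform prior,
# and the data y are 7 heads in 10 tosses.
grid = [i / 1000 for i in range(1001)]
prior = lambda x: 1.0
likelihood = lambda x: x**7 * (1 - x)**3

density = grid_posterior(prior, likelihood, grid)
mode = grid[density.index(max(density))]
print(mode)
```

Any constant factor in the likelihood (here the omitted binomial coefficient) cancels in the normalization, so it can be left out.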
Likelihood function
In statistics, a likelihood function (often simply the likelihood) is a function of the parameters of
a statistical model, defined as follows: the likelihood of a set of parameter values given some
observed outcomes is equal to the probability of those observed outcomes given those parameter
values. Likelihood functions play a key role in statistical inference, especially methods of estimating a
parameter from a set of statistics.
In non-technical parlance, "likelihood" is usually a synonym for "probability" but in statistical usage, a
clear technical distinction is made. One may ask "If I were to flip a fair coin 100 times, what is
the probability of it landing heads-up every time?" or "Given that I have flipped a coin 100 times and it
has landed heads-up 100 times, what is the likelihood that the coin is fair?" but it would be improper to
switch "likelihood" and "probability" in the two sentences.
If a probability distribution depends on a parameter, one may on the one hand consider—for a given
value of the parameter—the probability (density) of the different outcomes, and on the other hand
consider—for a given outcome—the probability (density) this outcome has occurred for different
values of the parameter. The first approach interprets the probability distribution as a function of the
outcome, given a fixed parameter value, while the second interprets it as a function of the parameter,
given a fixed outcome. In the latter case the function is called the "likelihood function" of the
parameter, and indicates how likely a parameter value is in light of the observed outcome.
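The two readings of the same formula can be made concrete with the coin example. In this sketch, which assumes independent tosses and uses an illustrative function name, the first computation fixes the parameter and asks about an outcome, while the second fixes the outcome and varies the parameter.

```python
from math import comb

def binomial_probability(heads, tosses, p_heads):
    """Probability of `heads` heads in `tosses` independent tosses."""
    return comb(tosses, heads) * p_heads**heads * (1 - p_heads)**(tosses - heads)

# Probability: the parameter is fixed (fair coin), the outcome varies.
prob_all_heads = binomial_probability(100, 100, 0.5)

# Likelihood: the outcome is fixed (100 heads), the parameter varies.
likelihood = {p: binomial_probability(100, 100, p) for p in (0.5, 0.9, 0.99, 1.0)}

print(prob_all_heads)
for p, value in likelihood.items():
    print(p, value)
```

The likelihood values increase as the candidate parameter approaches 1, but they are not probabilities of the parameter values and need not sum to 1.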
Definition
For the definition of the likelihood function, one has to distinguish between discrete and continuous
probability distributions.
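Let X be a random variable with a discrete probability distribution p depending on a parameter θ. Then the function
$$\mathcal{L}(\theta \mid x) = p_\theta(x) = P_\theta(X = x),$$
considered as a function of θ, is called the likelihood function (of θ, given the outcome x of X).
Sometimes the probability on the value x of X for the parameter value θ is written as $P(X = x \mid \theta)$,
but it should not be considered as a conditional probability, because θ is a parameter and not a random
variable.
Let X be a random variable with a continuous probability distribution with density function f depending on a parameter θ. Then the function
$$\mathcal{L}(\theta \mid x) = f_\theta(x),$$
considered as a function of θ, is called the likelihood function (of θ, given the outcome x of X).
Sometimes the density function for the value x of X for the parameter value θ is written as $f(x \mid \theta)$,
but it should not be considered as a conditional probability density.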
The actual value of a likelihood function bears no meaning. Its use lies in comparing one value with
another. E.g., one value of the parameter may be more likely than another, given the outcome of the
sample. Or a specific value will be most likely: the maximum likelihood estimate. Comparison may
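also be performed in considering the quotient of two likelihood values. That is why $\mathcal{L}(\theta \mid x)$ is
generally permitted to be any positive multiple of the above defined function $\mathcal{L}$. More precisely, then,
a likelihood function is any representative from an equivalence class of functions,
$$\mathcal{L} \in \{\alpha\, P_\theta : \alpha > 0\},$$
where the constant of proportionality α > 0 is not permitted to depend upon θ, and is required to be
the same for all likelihood functions used in any one comparison. In particular, the numerical value
$\mathcal{L}(\theta \mid x)$ alone is immaterial; all that matters are maximum values of $\mathcal{L}$, or likelihood ratios, such as
those of the form
$$\frac{\mathcal{L}(\theta_2 \mid x)}{\mathcal{L}(\theta_1 \mid x)} = \frac{\alpha\, P(X = x \mid \theta_2)}{\alpha\, P(X = x \mid \theta_1)} = \frac{P(X = x \mid \theta_2)}{P(X = x \mid \theta_1)},$$
which are invariant with respect to the constant of proportionality α.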
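That a likelihood ratio does not depend on the constant of proportionality can be checked directly. The sketch below assumes a Bernoulli model and a hypothetical observation of 7 heads in 10 tosses; the `scale` argument plays the role of α.

```python
def likelihood(theta, heads, tosses, scale=1.0):
    """Bernoulli likelihood of theta, up to an arbitrary positive constant."""
    return scale * theta**heads * (1 - theta)**(tosses - heads)

# Hypothetical observation: 7 heads in 10 tosses.
heads, tosses = 7, 10

ratio_unscaled = likelihood(0.7, heads, tosses) / likelihood(0.5, heads, tosses)
ratio_scaled = (likelihood(0.7, heads, tosses, scale=120.0)
                / likelihood(0.5, heads, tosses, scale=120.0))

print(ratio_unscaled, ratio_scaled)
```

The two ratios agree up to floating-point rounding, whereas the individual likelihood values differ by the factor 120. The comparison is only meaningful because the same scale is used in numerator and denominator.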
Log-likelihood
For many applications involving likelihood functions, it is more convenient to work in terms of
the natural logarithm of the likelihood function, called the log-likelihood, than in terms of the
likelihood function itself. Because the logarithm is a monotonically increasing function, the logarithm of
a function achieves its maximum value at the same points as the function itself, and hence the log-
likelihood can be used in place of the likelihood in maximum likelihood estimation and related
techniques. Finding the maximum of a function often involves taking the derivative of a function and
solving for the parameter being maximized, and this is often easier when the function being
maximized is a log-likelihood rather than the original likelihood function.
For example, some likelihood functions are for the parameters that explain a collection of statistically
independent observations. In such a situation, the likelihood function factors into a product of
individual likelihood functions. The logarithm of this product is a sum of individual logarithms, and
the derivative of a sum of terms is often easier to compute than the derivative of a product. In addition,
several common distributions have likelihood functions that contain products of factors
involving exponentiation. The logarithm of such a function is a sum of products, again easier to
differentiate than the original function.
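Besides easing differentiation, working with sums of logarithms has a practical numerical benefit that the text does not mention: a product of many small factors can underflow in floating-point arithmetic. The sketch below, with hypothetical data and illustrative function names, compares the two for independent Bernoulli observations.

```python
import math

def bernoulli_likelihood(p, observations):
    """Product of individual likelihoods for independent 0/1 outcomes."""
    result = 1.0
    for x in observations:
        result *= p if x == 1 else (1 - p)
    return result

def bernoulli_log_likelihood(p, observations):
    """Sum of individual log-likelihoods for the same data."""
    return sum(math.log(p if x == 1 else 1 - p) for x in observations)

# Hypothetical data: 5000 tosses, 3000 of them heads.
data = [1] * 3000 + [0] * 2000

print(bernoulli_likelihood(0.6, data))      # underflows to 0.0 in double precision
print(bernoulli_log_likelihood(0.6, data))  # a finite negative number
```

The log-likelihood remains usable for comparing parameter values even when the likelihood itself is too small to represent.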
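As an example, consider the gamma distribution, whose likelihood function is
$$\mathcal{L}(\alpha, \beta \mid x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x},$$
and suppose we wish to find the maximum likelihood estimate of β for a single observed value x. This
function looks rather daunting. Its logarithm, however, is much simpler to work with:
$$\log \mathcal{L}(\alpha, \beta \mid x) = \alpha \log \beta - \log \Gamma(\alpha) + (\alpha - 1) \log x - \beta x.$$
Its partial derivative with respect to β is
$$\frac{\partial \log \mathcal{L}(\alpha, \beta \mid x)}{\partial \beta} = \frac{\alpha}{\beta} - x.$$
If there are a number of independent random samples $x_1, \ldots, x_n$, then the joint log-likelihood will be the
sum of individual log-likelihoods, and the derivative of this sum will be the sum of individual
derivatives:
$$\frac{\partial \log \mathcal{L}(\alpha, \beta \mid x_1, \ldots, x_n)}{\partial \beta} = \frac{n\alpha}{\beta} - \sum_{i=1}^{n} x_i.$$
Setting that equal to zero and solving for β yields
$$\hat{\beta} = \frac{\alpha}{\bar{x}},$$
where $\hat{\beta}$ denotes the maximum-likelihood estimate and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the sample mean of the
observations.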
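The closed-form estimate can be sanity-checked numerically. The sketch below assumes the shape parameter α is known and uses hypothetical samples; it evaluates the joint log-likelihood of a gamma distribution with shape α and rate β at the estimate and at two nearby values of β.

```python
import math

def gamma_log_likelihood(alpha, beta, samples):
    """Joint log-likelihood of a gamma distribution with shape alpha
    and rate beta for independent samples."""
    n = len(samples)
    return (n * alpha * math.log(beta) - n * math.lgamma(alpha)
            + (alpha - 1) * sum(math.log(x) for x in samples)
            - beta * sum(samples))

def beta_mle(alpha, samples):
    """Closed-form maximizer in beta: alpha divided by the sample mean."""
    return alpha / (sum(samples) / len(samples))

# Hypothetical data, with the shape parameter assumed known.
samples = [1.2, 0.7, 2.5, 1.9, 0.8]
alpha = 2.0

beta_hat = beta_mle(alpha, samples)
for candidate in (0.9 * beta_hat, beta_hat, 1.1 * beta_hat):
    print(candidate, gamma_log_likelihood(alpha, candidate, samples))
```

The middle value, evaluated at the closed-form estimate, is the largest of the three, as expected for a maximum of a function that is concave in β.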
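The use of a probability density in place of a probability in the definition above can be justified as follows. Suppose that, instead of an exact value x, the observation is only known to lie in a short interval j of length $\Delta_j$. The probability of this observation is then approximately
$$\mathcal{L}_{\text{approx}}(\theta \mid x \text{ in interval } j) = \Delta_j\, f(x^* \mid \theta),$$
where x* can be any point in interval j. Then, recalling that the likelihood function is defined up to a
multiplicative constant, it is just as valid to say that the likelihood function is approximately
$$\mathcal{L}_{\text{approx}}(\theta \mid x \text{ in interval } j) = f(x^* \mid \theta),$$
and, as the lengths of the intervals decrease to zero, $\mathcal{L}(\theta \mid x) = f(x \mid \theta)$.
For a distribution that also contains discrete probability masses $p_k(\theta)$, the probability of an observation in an interval j that contains discrete mass k can be written
$$\mathcal{L}_{\text{approx}}(\theta \mid x \text{ in interval } j \text{ containing mass } k) = p_k(\theta) + \Delta_j\, f(x^* \mid \theta),$$
where x* can be any point in interval j. Then, on considering the lengths of the intervals to decrease
to zero, the likelihood function for an observation from the discrete component is
$$\mathcal{L}(\theta \mid x) = p_k(\theta).$$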
Example 1
Let $p_H$ be the probability that a certain coin lands heads up (H) when tossed. So, the probability of
getting two heads in two tosses (HH) is $p_H^2$. If $p_H = 0.5$, then the probability of seeing two heads
is 0.25.
In symbols, we can say the above as:
$$P(\text{HH} \mid p_H = 0.5) = 0.25.$$
Another way of saying this is to reverse it and say that "the likelihood that $p_H = 0.5$, given the
observation HH, is 0.25"; that is:
$$\mathcal{L}(p_H = 0.5 \mid \text{HH}) = P(\text{HH} \mid p_H = 0.5) = 0.25.$$
But this is not the same as saying that the probability that $p_H = 0.5$, given the observation HH, is
0.25.
Notice that the likelihood that $p_H = 1$, given the observation HH, is 1. But it is clearly not true that
the probability that $p_H = 1$, given the observation HH, is 1. Two heads in a row hardly proves that
the coin always comes up heads. In fact, two heads in a row is possible for any $p_H > 0$.
The likelihood function is not a probability density function. Notice that the integral of a likelihood
function is not in general 1. In this example, the integral of the likelihood over the interval [0, 1] in $p_H$
is 1/3, demonstrating that the likelihood function cannot be interpreted as a probability density function
for $p_H$.
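The value of the integral can be confirmed numerically. This is a minimal sketch using a midpoint rule; the function names are illustrative, and the likelihood for the observation HH is taken to be the square of the heads probability, as in the example.

```python
def likelihood_hh(p_heads):
    """Likelihood of p_heads given the observation HH."""
    return p_heads ** 2

def midpoint_integral(func, lower, upper, steps):
    """Midpoint-rule approximation of an integral."""
    width = (upper - lower) / steps
    return sum(func(lower + (i + 0.5) * width) for i in range(steps)) * width

area = midpoint_integral(likelihood_hh, 0.0, 1.0, 10_000)
print(likelihood_hh(0.5), likelihood_hh(1.0), area)
```

The area is approximately one third rather than 1, so the likelihood would have to be multiplied by 3 before it could serve as a density on [0, 1]; that rescaled function is the posterior under a uniform prior, not the likelihood itself.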