
Introduction to Data Science

Bayesian statistics II

Applications of Bayes theorem


The concept of probability in the two frameworks
• Frequentist conception: Relative frequency of the outcome of interest as a proportion of the whole sample space, in the long run. (“Objective” probability)
• Bayesian conception: Degree of belief, plausibility. Beliefs are constantly updated with new information. (“Subjective” probability)

Probability vs. Likelihood: Probability

Probability

p( z > 1.65 | normal distribution with mean = 0, SD = 1) = 0.05


Probability vs. Likelihood: Probability

Probability

p( z > 1.96 | normal distribution with mean = 0, SD = 1) = 0.025
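These two tail probabilities can be checked numerically with SciPy's standard normal survival function (1 − CDF); a minimal sketch:

```python
# Tail probabilities under the standard normal N(0, 1),
# matching the two slides above.
from scipy.stats import norm

p_165 = norm.sf(1.65)  # p(z > 1.65)
p_196 = norm.sf(1.96)  # p(z > 1.96)
print(round(p_165, 3), round(p_196, 3))  # → 0.049 0.025
```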


Probability vs. Likelihood: Likelihood

[Figure: probability density of a normal distribution (mean = 0, SD = 1.3) over x from −5 to 5, with the density at x = 2 marked.]

L(normal distribution with mean = 0, SD = 1.3 | x = 2) = 0.094


Probability vs. Likelihood: Likelihood

[Figure: probability density of a normal distribution (mean = 1, SD = 1.3) over x from −5 to 5, with the density at x = 2 marked.]

L(normal distribution with mean = 1, SD = 1.3 | x = 2) = 0.2283
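The two likelihood values on these slides are just the normal density evaluated at the observed data point; a quick sketch with SciPy, using the parameters from the slides:

```python
# Likelihood of each candidate model given the observation x = 2:
# the density of the model's distribution evaluated at x.
from scipy.stats import norm

L0 = norm.pdf(2, loc=0, scale=1.3)  # model: mean = 0, SD = 1.3
L1 = norm.pdf(2, loc=1, scale=1.3)  # model: mean = 1, SD = 1.3
print(round(L0, 3), round(L1, 4))   # → 0.094 0.2283
```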


Maximum likelihood estimates

Introducing Bayes Factors
• NHST has a somewhat convoluted (yet solid) logic: assume that something makes no difference, show that the observed results are unlikely given that assumption, reject the assumption on these grounds, and conclude that there probably was a difference after all.
• It is also somewhat weak: how large is the difference?
• Bayes Factors provide an alternative/complement to classical null hypothesis significance testing that addresses both of these issues:
• p(D | H) → p(H | D)
• Bayes’ theorem allows us to invert conditional probabilities if (and only if) we know the prior probabilities of the hypotheses.
The Bayes Factor

p(H1 | D) / p(H0 | D) = [ p(D | H1) / p(D | H0) ] × [ p(H1) / p(H0) ]

Posterior odds = Bayes Factor × Prior odds
So what are Bayes factors?
• In essence, likelihood ratios:

p(A | B) = p(A) × p(B | A) / p(B)
(posterior = prior × likelihood / evidence)

BF = p(D | H1) / p(D | H0)
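In odds form, this is a one-liner. A sketch (the two likelihood values plugged in below are the illustrative ones from the earlier likelihood slides; the function name is hypothetical):

```python
def posterior_odds(likelihood_h1, likelihood_h0, prior_odds=1.0):
    """Bayes' rule in odds form: posterior odds = BF × prior odds."""
    bf = likelihood_h1 / likelihood_h0  # Bayes factor BF10
    return bf * prior_odds

# With equal prior odds, the data favor H1 by a factor of ~2.4:
print(round(posterior_odds(0.2283, 0.094), 2))  # → 2.43
```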
The history of Bayes factors
• Developed by Jeffreys (1935), who called them “significance tests”.
• This alternative was not really appreciated until about the 1990s.
• They are still “catching on”.
What do Bayes Factors give us?
• They quantify the strength of the evidence in favor
of a hypothesis, given the data.
• Can be used for model comparison (which model is
more likely, given the data).
Deriving the Bayes factor
Strength of evidence for H1 or H0, given they are
equally likely a priori, from Lee & Wagenmakers, 2013
Example: Determining the most likely effectiveness of
a medication using maximum likelihood estimation
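This example can be sketched numerically. Assuming hypothetical trial data (14 of 20 patients respond to the medication), the maximum likelihood estimate of the response rate θ is found by scanning candidate values of θ for the one that maximizes the binomial likelihood:

```python
import numpy as np

# Hypothetical data: 14 responders out of n = 20 patients.
n, k = 20, 14
thetas = np.linspace(0.01, 0.99, 99)
# Binomial log-likelihood, up to a constant that doesn't depend on theta
loglik = k * np.log(thetas) + (n - k) * np.log(1 - thetas)
theta_mle = thetas[np.argmax(loglik)]
print(round(theta_mle, 2))  # → 0.7, matching the analytic MLE k/n
```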
Bayes’ theorem also allows us to assess
the veracity of the published literature
• What is the probability that a published result is actually true?
• If we set the significance level α to 0.05, does this mean that the false positive rate in the field is 5%?
• No! This is a common misconception.
• The false positive rate also depends on the prior probability of a true effect in a given field, as well as on the statistical power of the studies.
Introducing PPV (Ioannidis, 2005):
• PPV = “positive predictive value” – the *post* study probability that a result is true.
• R: Ratio of true to false effects in a field.
• PPV links power (1−β), α and R:

PPV = (1−β)R / ((1−β)R + α) = (1−β)R / (R − βR + α)
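The formula translates directly into code; a minimal sketch, with illustrative values for power, α, and R:

```python
def ppv(power, alpha, R):
    """Ioannidis (2005): post-study probability that a significant
    result reflects a true effect. power = 1 - beta."""
    return power * R / (power * R + alpha)

# E.g. power = 0.8, alpha = 0.05, R = 0.25 (1 true effect per 4 false ones):
print(round(ppv(0.8, 0.05, 0.25), 2))  # → 0.8
```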
Deriving PPV

RESEARCH \ REALITY | TRUE  | FALSE | TOTAL
SIG                | 1−β   | α     | 1−β + α
NONSIG             | β     | 1−α   | β + 1−α
TOTAL              | 1     | 1     | 2

With NT true and NF false effects in the field (NT + NF = c):

RESEARCH \ REALITY | TRUE      | FALSE     | TOTAL
SIG                | NT·(1−β)  | NF·α      | NT·(1−β) + NF·α
NONSIG             | NT·β      | NF·(1−α)  | NT·β + NF·(1−α)
TOTAL              | NT        | NF        | NT + NF = c

Since R = NT/NF, we have NT = R·NF, so R·NF + NF = c.
Hence NF = c/(R+1) and NT = c·R/(R+1).
Deriving PPV

Substituting NT = c·R/(R+1) and NF = c/(R+1):

RESEARCH \ REALITY | TRUE            | FALSE         | TOTAL
SIG                | c·(1−β)·R/(R+1) | c·α/(R+1)     | c·(R·(1−β)+α)/(R+1)
NONSIG             | c·β·R/(R+1)     | c·(1−α)/(R+1) | c·(1 + R·β − α)/(R+1)
TOTAL              | c·R/(R+1)       | c/(R+1)       | c
Deriving PPV

PPV = p(effect true | significant) = p(sig | true)·p(true) / p(sig)
p(true) = NT/c = R/(R+1)
p(sig | true) = 1−β
p(sig) = p(sig | true)·p(true) + p(sig | false)·p(false)
p(sig) = (1−β)·R/(R+1) + α/(R+1)
PPV = (1−β)·[R/(R+1)] / [(1−β)·R/(R+1) + α/(R+1)]
PPV = (1−β)·R / ((1−β)·R + α)
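As a sanity check on this derivation, a hypothetical Monte Carlo sketch (all parameter values are assumed): simulate many studies and compare the empirical share of true effects among significant results with the closed-form PPV.

```python
import numpy as np

rng = np.random.default_rng(0)
R, alpha, power = 0.25, 0.05, 0.8
n = 200_000

# Each study tests a true effect with probability p(true) = R/(R+1).
true_effect = rng.random(n) < R / (R + 1)
# p(sig | true) = 1 - beta = power; p(sig | false) = alpha.
significant = rng.random(n) < np.where(true_effect, power, alpha)

ppv_sim = true_effect[significant].mean()
ppv_formula = power * R / (power * R + alpha)
print(round(ppv_sim, 2), round(ppv_formula, 2))  # both ≈ 0.8
```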
A study converts the prior odds of the effect being true (R) into the posterior probability (PPV), as a function of statistical power:
Confidence interval vs. credible interval
• Frequentist approach: The value of a parameter 𝜽 is unknown, but fixed.
• We can estimate it by taking samples from the population. This yields a (tychenic) sampling distribution of sample means, which we can use to calculate the CI.
• Example: You are an epidemiologist and want to estimate the prevalence of Herpes simplex in the population using a frequentist approach:
𝜽: The prevalence of Herpes simplex in
the population
[Figure: sampling distribution of sample proportions, centered near 0.5; x-axis from 0.44 to 0.56, y-axis: proportion.]
In a Bayesian framework, 𝜽 is a RV and has
a probability distribution
• This probability distribution corresponds to our
degree of belief.
• If this is our prior belief about the value of the
parameter, this is called the prior distribution.
• The prior distribution can take any shape. A
completely flat prior distribution is called an
“uninformative prior”.
• The sharper the prior distribution, the more
informative it is.
• Often used to model the prior distribution of a
proportion: The Beta distribution
Prior distributions of varying degrees of
informativeness:
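The varying informativeness can be illustrated with SciPy's Beta distribution; the three parameter pairs below are illustrative assumptions, from a flat prior to a sharp one:

```python
from scipy.stats import beta

# Three Beta priors for a proportion; larger parameters = sharper prior.
priors = {"flat Beta(1,1)": (1, 1),
          "mild Beta(5,5)": (5, 5),
          "sharp Beta(50,50)": (50, 50)}
for name, (a, b) in priors.items():
    # The prior SD shrinks as the prior becomes more informative.
    print(name, round(beta(a, b).std(), 3))
```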
So what?
• In Bayesian analysis, we can use the data from a study (which yield the likelihood) in combination with the prior distribution to compute a posterior distribution:
• Posterior ∝ Prior × Likelihood
• For something like Herpes simplex, we could model the likelihood with a binomial distribution (how many people infected, as a proportion of sample size):
• p(𝜽 | y) ∝ p(𝜽) × p(y | 𝜽)
• Data = y
Example: We are epidemiologists and want to
know the likely value of 𝜽 in a certain location.
• We have a somewhat informative prior from the
literature (say a Beta distribution with α = β = 10).
• We take a local sample of 100 people and see how
many are infected with the virus.
• This yields a posterior distribution of 𝜽:
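Because the Beta prior is conjugate to the binomial likelihood, the posterior has a closed form: Beta(α + y, β + n − y). A sketch, assuming a hypothetical count of 35 infected out of 100 (not given on the slide):

```python
# Beta-binomial conjugate update: Beta(10, 10) prior from the literature.
a_prior, b_prior = 10, 10
n, y = 100, 35            # assumed: 35 of 100 people infected

a_post = a_prior + y      # posterior is Beta(a_prior + y, b_prior + n - y)
b_post = b_prior + n - y
post_mean = a_post / (a_post + b_post)
post_mode = (a_post - 1) / (a_post + b_post - 2)
print(a_post, b_post, round(post_mean, 3), round(post_mode, 3))
# → 45 75 0.375 0.373
```

Under this assumed count, the posterior mode (44/118 ≈ 0.373) coincides with the point estimate on the following slide.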
What does our study yield?

[Figure: prior, likelihood, and posterior distributions of 𝜽 over 0 to 1.]

Posterior ∝ Prior × Likelihood

How do we get a new estimate of 𝜽 from the credible interval?
The credible interval covers a specified proportion (e.g. 95%) of the area under the posterior distribution:

[Figure: posterior distribution of 𝜽 over 0 to 1, with the 95% credible interval shaded around the point estimate 𝜽 = 0.373.]


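A 95% equal-tailed credible interval can be read off the posterior with the Beta quantile function. A sketch assuming a hypothetical Beta(45, 75) posterior (e.g. a Beta(10, 10) prior plus 35 infections in a sample of 100):

```python
from scipy.stats import beta

# Equal-tailed 95% credible interval: 2.5th and 97.5th posterior percentiles.
lo, hi = beta.ppf([0.025, 0.975], 45, 75)
print(round(lo, 3), round(hi, 3))
```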
The relative strength of the prior distribution to
determine the location and shape of the
posterior distribution at fixed likelihood:
The impact of larger samples (stronger
likelihood) on the posterior distribution at
a fixed prior
At small sample sizes, the prior matters a lot:
[Figure: 3 × 3 grid of posterior distributions of 𝜽 over 0 to 1, each at n = 20, for priors of varying informativeness.]


At large sample sizes, the prior doesn’t
matter much:
[Figure: 3 × 3 grid of posterior distributions of 𝜽 over 0 to 1, each at n = 500, for the same priors.]

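The pattern in these two grids can be reproduced with a few lines of arithmetic; the priors and observed counts below are illustrative assumptions (data simulated at a fixed true rate of 0.3):

```python
# Posterior mean of a Beta-binomial model under three priors centered at 0.5,
# for a small and a large sample. The prior pulls the estimate toward 0.5
# strongly at n = 20 but barely at n = 500.
for n in (20, 500):
    y = round(0.3 * n)                         # assumed observed count
    for a, b in ((1, 1), (10, 10), (50, 50)):  # flat → sharp prior
        post_mean = (a + y) / (a + b + n)
        print(n, (a, b), round(post_mean, 3))
```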