10. BAYESIAN STATISTICS


LECTURER: DR. NGUYEN THI THANH SANG
REFERENCES:
[10] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin (2013), Bayesian Data Analysis. ISBN 1439840954.
[11] John K. Kruschke, Doing Bayesian Data Analysis (Second Edition), Academic Press, 2015, ISBN 9780124058880, https://doi.org/10.1016/B978-0-12-405888-0.09997-9.


THOMAS BAYES (1701-1761)

[Portrait image from Wikipedia]


BAYES’ THEOREM

• Bayes’ Theorem is about conditional probability.


• It has statistical applications.


CONDITIONAL PROBABILITY

• For events A and B with P(B) > 0, the conditional probability of A given B is
P(A | B) = P(A, B) / P(B)



DEFINE “EVENTS” IN TERMS OF RANDOM VARIABLES INSTEAD OF A, B, ETC.

• For discrete random variables X and Y,
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)


FOR CONTINUOUS RANDOM VARIABLES

• We have conditional densities:
f(x | y) = f(x, y) / f(y)


THERE ARE MANY VERSIONS OF BAYES’ THEOREM

• For discrete random variables,
P(X = x | Y = y) = P(Y = y | X = x) P(X = x) / Σ_{x*} P(Y = y | X = x*) P(X = x*)


FOR CONTINUOUS RANDOM VARIABLES

f(x | y) = f(y | x) f(x) / ∫ f(y | x*) f(x*) dx*


COMPARE

• The discrete and continuous versions have the same form: the denominator normalizes with a sum over x* in the discrete case and an integral over x* in the continuous case.


WHAT IS PROBABILITY?

• Frequentist: Probability is long-run relative frequency.

• Bayesian: Probability is degree of subjective belief.


Why Bayes?

Because Bayes answers the questions we really care about.

Pr(I have the disease | test +) vs Pr(test + | disease)

Pr(A better than B | data) vs Pr(extreme data | A = B)

Bayes is natural (vs interpreting a CI or a P-value)

Note: in each pair, the first quantity (blue on the original slide) is Bayesian, the second (red) frequentist.


You are waiting on a subway platform for a train that is known to run on a regular schedule, only you don't know how much time is scheduled to pass between train arrivals, nor how long it's been since the last train departed.

As more time passes, do you (a) grow more confident that the train will arrive soon, since its eventual arrival can only be getting closer, not further away, or (b) grow less confident that the train will arrive soon, since the longer you wait, the more likely it seems that either the scheduled arrival times are far apart or else that you happened to arrive just after the last train left – or both?

If you choose (a), you're thinking like a frequentist. If you choose (b), you're thinking like a Bayesian.

– from Floyd Bullard


Probability

Limiting relative frequency:

P(a) = lim_{n→∞} (# times a happens in n trials) / n

A nice definition mathematically, but not so great in practice. (So we often appeal to symmetry…)

What if you can't get all the way to infinity today?
What if there is only 1 trial? E.g., P(snow tomorrow)
What if appealing to symmetry fails? E.g., I take a penny out of my pocket and spin it. What is P(H)?
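A minimal R sketch (not in the original slides) of the limiting-relative-frequency idea: the running proportion of heads settles toward the true probability as the number of trials grows. The true probability 0.5 and the 10,000 trials are illustrative assumptions.

# Simulate fair-coin flips and track the running relative frequency of heads.
set.seed(1)
flips <- rbinom(10000, size = 1, prob = 0.5)     # 1 = heads
running <- cumsum(flips) / seq_along(flips)      # relative frequency after each flip
tail(running, 1)                                 # close to 0.5 after 10,000 flips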

Subjective probability

Prob = odds / (1 + odds)        Odds = prob / (1 − prob)

Fair die:

Event      Prob   Odds
even #     1/2    1    [or 1:1]
X > 2      2/3    2    [or 2:1]
roll a 2   1/6    1/5  [or 1/5:1 or 1:5]

Persi Diaconis: "Probability isn't a fact about the world; probability is a fact about an observer's knowledge."
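The two conversions above as R one-liners (a small sketch, not from the slides), checked against the fair-die table:

prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(o) o / (1 + o)
prob_to_odds(c(1/2, 2/3, 1/6))   # 1, 2, 0.2  (i.e. 1:1, 2:1, 1:5)
odds_to_prob(c(1, 2, 1/5))       # 0.5, 0.667, 0.167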

Bayes’ Theorem

P(a|b) = P(a,b)/P(b)

P(b|a) = P(a,b)/P(a), so P(a,b) = P(a)P(b|a)

Thus, P(a|b) = P(a,b)/P(b) = P(a)P(b|a)/P(b)

But what is P(b)? P(b) = Σ_a P(a,b) = Σ_a P(a)P(b|a)

Thus, P(a|b) = P(a)P(b|a) / Σ_{a*} P(a*)P(b|a*)

where a* means “any value of a”.

P(a|b) = P(a)P(b|a) / Σ_{a*} P(a*)P(b|a*)

P(a|b) ∝ P(a) P(b|a)

posterior ∝ (prior) (likelihood)

Bayes’ Theorem is used to take a prior probability, update it with data (the likelihood), and get a posterior probability.

Medical test example. Suppose a test is 95% accurate when the disease is present and 97% accurate when the disease is absent. Suppose that 1% of the population has the disease.

What is P(have the disease | test +)?


P(dis | test+) = P(dis)P(test+ | dis) / [P(dis)P(test+ | dis) + P(¬dis)P(test+ | ¬dis)]

= (0.01)(0.95) / [(0.01)(0.95) + (0.99)(0.03)] = 0.0095 / (0.0095 + 0.0297) ≈ 0.24
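The same calculation in R (a sketch, not course code):

p_dis <- 0.01   # prevalence: P(dis)
sens  <- 0.95   # P(test+ | dis)
spec  <- 0.97   # P(test- | ¬dis), so P(test+ | ¬dis) = 0.03
p_pos <- p_dis * sens + (1 - p_dis) * (1 - spec)   # P(test+) = 0.0392
p_dis * sens / p_pos                               # P(dis | test+) ≈ 0.2423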

Typical statistics problem: There is a parameter, θ, that we want to estimate, and we have some data.

Traditional (frequentist) methods: Study and describe P(data | θ). If the data are unlikely for a given θ, then state “that value of θ is not supported by the data.” (A hyp. test asks whether a particular value of θ might be correct; a CI presents a range of plausible values.)

Bayesian methods: Describe the distribution P(θ | data).

A frequentist thinks of θ as fixed (but unknown), while a Bayesian thinks of θ as a random variable that has a distribution.

Bayesian reasoning is natural and easy to think about. It is becoming much more commonly used.

If Bayes is so great, why hasn’t it always been popular?

(1) Without Markov Chain Monte Carlo, it wasn’t practical.

(2) Some people distrust prior distributions, thinking that science should be objective (as if that were possible).

Bayes is becoming much more common, due to MCMC.

E.g., three recent years’ worth of J. Amer. Stat. Assoc. “applications and case studies” papers: 46 of 84 papers used Bayesian methods (+ 4 others merely included an application of Bayes’ Theorem).

What is MCMC?

We want to find P(θ | data), which is equal to P(θ)·P(data | θ)/P(data) and is proportional to P(θ)·P(data | θ), which is “prior × likelihood”.

But to use Bayesian methods we need to be able to evaluate the denominator, which is the integral of the numerator over the parameter space. In general, this integral is very hard to evaluate.

Note: There are two special cases in which the mathematics works out: a beta prior with binomial likelihood (which gives a beta posterior), and a normal prior with normal likelihood (which gives a normal posterior).
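A minimal random-walk Metropolis sketch in R (illustrative, not from the course scripts): MCMC needs the target only up to the normalizing constant, so it sidesteps the hard integral. The model is an assumption chosen for illustration: a Beta(1,1) prior with 6 successes in 9 Bernoulli trials, so the unnormalized posterior is θ^6 (1−θ)^3.

log_target <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)   # zero density outside (0, 1)
  6 * log(theta) + 3 * log(1 - theta)          # log(prior × likelihood), unnormalized
}
set.seed(1)
n_iter <- 10000
chain  <- numeric(n_iter)
theta  <- 0.5                                  # starting value
for (i in 1:n_iter) {
  proposal <- theta + rnorm(1, sd = 0.2)       # random-walk proposal
  if (log(runif(1)) < log_target(proposal) - log_target(theta))
    theta <- proposal                          # accept; otherwise keep current value
  chain[i] <- theta
}
mean(chain)   # close to the exact Beta(7, 4) posterior mean 7/11 ≈ 0.636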


See Effect on posterior of large n.R. Data: 1 success in 4 trials.


See Effect on posterior of large n.R. Data: 10/40 (vs 1/4)
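The named script isn’t reproduced here; a minimal R sketch of the same comparison, assuming a uniform Beta(1, 1) prior. Both data sets put the likelihood peak at 0.25, but the larger sample gives a visibly narrower posterior.

theta      <- seq(0, 1, by = 0.001)
post_small <- dbeta(theta, 1 + 1,  1 + 3)    # posterior after 1 success in 4 trials
post_large <- dbeta(theta, 1 + 10, 1 + 30)   # posterior after 10 successes in 40 trials
plot(theta, post_large, type = "l", ylab = "posterior density")
lines(theta, post_small, lty = 2)            # dashed: the n = 4 posterior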


See Effect on posterior of sharp prior.R


See BernBetaExampleLoveBlindMen.R. Data: 16 of 36 men correctly identify their partner.
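A quick check of the implied posterior in R, assuming (as an illustration) a uniform Beta(1, 1) prior, since the script itself isn’t included here:

a <- 1 + 16; b <- 1 + 36 - 16      # Beta(17, 21) posterior
a / (a + b)                        # posterior mean ≈ 0.447
qbeta(c(0.025, 0.975), a, b)       # central 95% posterior interval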


PRIOR DISTRIBUTIONS

As with all statistical analyses, we start by positing a model which specifies p(x | θ).

This is the likelihood, which relates all variables in a ‘full probability model’.

However, from a Bayesian point of view:
• θ is unknown, so it should have a probability distribution reflecting our uncertainty about it before seeing the data
• Therefore we specify a prior distribution p(θ)

Note this is like the prevalence in the medical test example.


POSTERIOR DISTRIBUTIONS

Also, x is known, so it should be conditioned on. Here we use Bayes’ theorem to obtain the conditional distribution of the unobserved quantities given the data, known as the posterior distribution:

p(θ | x) = p(θ) p(x | θ) / ∫ p(θ) p(x | θ) dθ ∝ p(θ) p(x | θ)

The prior distribution expresses our uncertainty about θ before seeing the data.
The posterior distribution expresses our uncertainty about θ after seeing the data.



EXAMPLES OF BAYESIAN INFERENCE USING THE NORMAL DISTRIBUTION

Known variance, unknown mean

It is easier to consider first a model with one unknown parameter. Suppose we have a sample of Normal data:

xi ~ N(μ, σ²), i = 1, ..., n.

Let us assume we know the variance, σ², and we assume a prior distribution for the mean, μ, based on our prior beliefs:

μ ~ N(μ0, σ0²)

Now we wish to construct the posterior distribution p(μ | x).


POSTERIOR FOR NORMAL DISTRIBUTION MEAN

So we have

p(μ) = (2πσ0²)^(-1/2) exp(-½ (μ − μ0)² / σ0²)
p(xi | μ) = (2πσ²)^(-1/2) exp(-½ (xi − μ)² / σ²)

and hence

p(μ | x) ∝ p(μ) p(x | μ)
= (2πσ0²)^(-1/2) exp(-½ (μ − μ0)² / σ0²) × ∏_{i=1}^{n} (2πσ²)^(-1/2) exp(-½ (xi − μ)² / σ²)
∝ exp(-½ μ² (1/σ0² + n/σ²) + μ (μ0/σ0² + Σi xi/σ²) + const)



POSTERIOR FOR NORMAL DISTRIBUTION MEAN (CONTINUED)

For a Normal distribution with response y, mean θ, and variance φ, we have

f(y) = (2πφ)^(-1/2) exp{-½ (y − θ)² / φ}
∝ exp{-½ y² φ^(-1) + yθ/φ + const}

We can equate this to our posterior as follows:

p(μ | x) ∝ exp(-½ μ² (1/σ0² + n/σ²) + μ (μ0/σ0² + Σi xi/σ²) + const)

→ φ = (1/σ0² + n/σ²)^(-1) and θ = φ (μ0/σ0² + Σi xi/σ²)


PRECISIONS AND MEANS

• In Bayesian statistics the precision = 1/variance is often more important than the variance.
• For the Normal model we have

1/φ = 1/σ0² + n/σ²   and   θ = φ (μ0/σ0² + x̄/(σ²/n))

In other words, the posterior precision is the sum of the prior precision and the data precision, and the posterior mean is a (precision-weighted) average of the prior mean and the data mean.
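The update as a small R helper (a sketch in the notation above, not course code):

# Posterior for a Normal mean with known variance: prior N(mu0, sig0sq),
# n observations with mean xbar and known variance sigsq.
normal_posterior <- function(mu0, sig0sq, xbar, sigsq, n) {
  prec  <- 1 / sig0sq + n / sigsq                    # posterior precision
  phi   <- 1 / prec                                  # posterior variance
  theta <- phi * (mu0 / sig0sq + n * xbar / sigsq)   # precision-weighted mean
  c(mean = theta, var = phi)
}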



LARGE SAMPLE PROPERTIES

As n → ∞:

Posterior precision 1/φ = 1/σ0² + n/σ² → n/σ², so the posterior variance φ → σ²/n
Posterior mean θ = φ (μ0/σ0² + x̄/(σ²/n)) → x̄

And so the posterior distribution

p(μ | x) → N(x̄, σ²/n)

compared to p(x̄ | μ) = N(μ, σ²/n) in the frequentist setting.


GIRLS HEIGHTS EXAMPLE

• 10 girls aged 18 had both their heights and weights measured.
• Their heights (in cm) were as follows:
169.6, 166.8, 157.1, 181.1, 158.4, 165.6, 166.7, 156.5, 168.1, 165.3

We will assume the variance is known to be 50.

Two individuals gave the following prior distributions for the mean height:
Individual 1: p1(μ) ~ N(165, 2²)
Individual 2: p2(μ) ~ N(170, 3²)


CONSTRUCTING POSTERIOR 1

• To construct the posterior, we use the formulae we have just calculated.
• From the prior, μ0 = 165, σ0² = 4
• From the data, x̄ = 165.52, σ² = 50, n = 10
• The posterior is therefore

p(μ | x) ~ N(μ1, φ1)

where φ1 = (1/4 + 10/50)^(-1) = 2.222,
μ1 = φ1 (165/4 + 1655.2/50) = 165.23.


PRIOR AND POSTERIOR COMPARISON

[Figure: densities of prior 1 and posterior 1]



CONSTRUCTING POSTERIOR 2

• Again, to construct the posterior we use the formulae we calculated earlier.
• From the prior, μ0 = 170, σ0² = 9
• From the data, x̄ = 165.52, σ² = 50, n = 10
• The posterior is therefore

p(μ | x) ~ N(μ2, φ2)

where φ2 = (1/9 + 10/50)^(-1) = 3.214,
μ2 = φ2 (170/9 + 1655.2/50) = 167.12.
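Both posteriors can be checked numerically with the helper sketched earlier:

heights <- c(169.6, 166.8, 157.1, 181.1, 158.4, 165.6, 166.7, 156.5, 168.1, 165.3)
normal_posterior(165, 4, mean(heights), 50, 10)   # prior 1: mean 165.23, var 2.222
normal_posterior(170, 9, mean(heights), 50, 10)   # prior 2: mean 167.12, var 3.214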


PRIOR 2 COMPARISON

[Figure: densities of prior 2 and posterior 2]

Note this prior is not as close to the data as prior 1, and hence the posterior sits between the prior and the likelihood.



OTHER CONJUGATE EXAMPLES

• When the posterior is in the same family as the prior we have conjugacy. Examples include:

Likelihood   Parameter     Prior    Posterior
Normal       Mean          Normal   Normal
Normal       Precision     Gamma    Gamma
Binomial     Probability   Beta     Beta
Poisson      Mean          Gamma    Gamma
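One more pair from the table, sketched in R with made-up data: a Gamma(a, b) prior on a Poisson mean updates to Gamma(a + Σyi, b + n).

y <- c(3, 5, 2, 4)                            # hypothetical Poisson counts
a <- 2; b <- 1                                # hypothetical Gamma prior (shape, rate)
c(shape = a + sum(y), rate = b + length(y))   # posterior: Gamma(16, 5)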


CREDIBLE INTERVALS

• If we consider the heights example with our first prior, then our posterior is
p(μ | x) ~ N(165.23, 2.222),
and a 95% credible interval for μ is
165.23 ± 1.96 × sqrt(2.222) = (162.31, 168.15).

Similarly, prior 2 results in a 95% credible interval for μ of (163.61, 170.63).

Note that credible intervals can be interpreted in the more natural way that there is a probability of 0.95 that the interval contains μ, rather than the frequentist conclusion that 95% of such intervals contain μ.
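The intervals, computed in R (a sketch):

165.23 + c(-1, 1) * 1.96 * sqrt(2.222)        # prior 1: (162.31, 168.15)
167.12 + c(-1, 1) * 1.96 * sqrt(3.214)        # prior 2: (163.61, 170.63)
qnorm(c(0.025, 0.975), 165.23, sqrt(2.222))   # same interval via posterior quantiles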



HYPOTHESIS TESTING

Another big issue in statistical modelling is the ability to test hypotheses and to make model comparisons in general.

The Bayesian approach is in some ways more straightforward. For an unknown parameter θ we simply calculate the posterior probabilities

p0 = P(θ ∈ Θ0 | x),  p1 = P(θ ∈ Θ1 | x)

and decide between H0 and H1 accordingly.

We also require the prior probabilities to achieve this:

π0 = P(θ ∈ Θ0),  π1 = P(θ ∈ Θ1)


BAYES FACTORS

• Prior odds on H0 against H1: π0/π1
• Posterior odds on H0 against H1: p0/p1
• The Bayes factor B in favour of H0 against H1 is

B = (p0/p1) / (π0/π1) = p0 π1 / (p1 π0)

Note that when the hypotheses are simple, B is the likelihood ratio of H0 against H1, i.e. the odds in favour of H0 against H1 given by the data; for complex hypotheses, however, B also involves the prior distributions.



BAYES FACTORS – GIRLS HEIGHT EXAMPLE, PRIOR 1

Let us assume that H0 is μ > 165, and hence H1 is μ ≤ 165. Under the N(165, 4) prior we have π0 = π1 = 0.5.

The posterior is N(165.23, 2.222), which gives p0 = 0.561 and p1 = 0.439, and hence a Bayes factor of 0.561/0.439 = 1.278. Here the Bayes factor is close to 1, so the data have not much altered our beliefs about the hypothesis under discussion.


BAYES FACTORS – GIRLS HEIGHT EXAMPLE, PRIOR 2

Now under the N(170, 9) prior we have π0 = 0.952 and π1 = 0.048, so there is strong a priori evidence for H0 against H1.

The posterior is N(167.12, 3.214), which gives p0 = 0.881 and p1 = 0.119, and hence a Bayes factor of (0.881 × 0.048)/(0.952 × 0.119) = 0.373. So in this case the Bayes factor is smaller than 1, as the data give less evidence for H0 against H1 than the prior distribution does.

It should be noted that care needs to be taken when using Bayes factors with non-informative priors.
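Both Bayes factors can be reproduced in R from Normal tail probabilities (a sketch):

bf <- function(mu0, sd0, mu_post, var_post, cut = 165) {
  pi0 <- 1 - pnorm(cut, mu0, sd0)                  # prior P(mu > 165)
  p0  <- 1 - pnorm(cut, mu_post, sqrt(var_post))   # posterior P(mu > 165)
  (p0 / (1 - p0)) / (pi0 / (1 - pi0))              # posterior odds / prior odds
}
bf(165, 2, 165.23, 2.222)   # prior 1: 1.278
bf(170, 3, 167.12, 3.214)   # prior 2: 0.373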

