10. BAYESIAN STATISTICS


LECTURER: DR. NGUYEN THI THANH SANG
REFERENCES:
[10] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin (2013), Bayesian Data Analysis. ISBN 1439840954.
[11] John K. Kruschke, Doing Bayesian Data Analysis (Second Edition), Academic Press, 2015, ISBN 9780124058880, https://doi.org/10.1016/B978-0-12-405888-0.09997-9.


THOMAS BAYES (1701-1761)

[Portrait image from Wikipedia]


BAYES’ THEOREM

• Bayes’ Theorem is about conditional probability.


• It has statistical applications.


CONDITIONAL PROBABILITY

• For events A and B with P(B) > 0, the conditional probability of A given B is
P(A | B) = P(A, B) / P(B)



DEFINE “EVENTS” IN TERMS OF RANDOM VARIABLES INSTEAD OF A, B, ETC.

• For discrete random variables X and Y,
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)


FOR CONTINUOUS RANDOM VARIABLES

• We have conditional densities:
f(x | y) = f(x, y) / f(y)


THERE ARE MANY VERSIONS OF BAYES’ THEOREM

• For discrete random variables,
P(X = x | Y = y) = P(Y = y | X = x) P(X = x) / Σ_{x*} P(Y = y | X = x*) P(X = x*)


FOR CONTINUOUS RANDOM VARIABLES

f(x | y) = f(y | x) f(x) / ∫ f(y | x*) f(x*) dx*


COMPARE

• The discrete and continuous versions have the same form: the denominator normalizes with a sum over x* in the discrete case and an integral over x* in the continuous case.


WHAT IS PROBABILITY?

• Frequentist: Probability is long-run relative frequency.

• Bayesian: Probability is degree of subjective belief.


Why Bayes?

Because Bayes answers the questions we really care about.

Pr(I have the disease | test +) vs Pr(test + | disease)

Pr(A better than B | data) vs Pr(extreme data | A = B)

Bayes is natural (vs interpreting a CI or a P-value)

Note: in each pair, the first quantity (blue on the original slide) is Bayesian, the second (red) frequentist.


You are waiting on a subway platform for a train that is known to run on a regular schedule, only you don't know how much time is scheduled to pass between train arrivals, nor how long it's been since the last train departed.

As more time passes, do you (a) grow more confident that the train will arrive soon, since its eventual arrival can only be getting closer, not further away, or (b) grow less confident that the train will arrive soon, since the longer you wait, the more likely it seems that either the scheduled arrival times are far apart or else that you happened to arrive just after the last train left – or both?

If you choose (a), you're thinking like a frequentist. If you choose (b), you're thinking like a Bayesian.

– from Floyd Bullard


Probability

Limiting relative frequency:

P(a) = lim_{n→∞} (# times a happens in n trials) / n

A nice definition mathematically, but not so great in practice. (So we often appeal to symmetry…)

What if you can't get all the way to infinity today?
What if there is only 1 trial? E.g., P(snow tomorrow)
What if appealing to symmetry fails? E.g., I take a penny out of my pocket and spin it. What is P(H)?
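A minimal R sketch (not in the original slides) of the limiting-relative-frequency idea: the running proportion of heads settles toward the true probability as the number of trials grows. The true probability 0.5 and the 10,000 trials are illustrative assumptions.

# Simulate fair-coin flips and track the running relative frequency of heads.
set.seed(1)
flips <- rbinom(10000, size = 1, prob = 0.5)     # 1 = heads
running <- cumsum(flips) / seq_along(flips)      # relative frequency after each flip
tail(running, 1)                                 # close to 0.5 after 10,000 flips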

Subjective probability

Prob = odds / (1 + odds)        Odds = prob / (1 − prob)

Fair die:

Event      Prob   Odds
even #     1/2    1    [or 1:1]
X > 2      2/3    2    [or 2:1]
roll a 2   1/6    1/5  [or 1/5:1 or 1:5]

Persi Diaconis: "Probability isn't a fact about the world; probability is a fact about an observer's knowledge."
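The two conversions above as R one-liners (a small sketch, not from the slides), checked against the fair-die table:

prob_to_odds <- function(p) p / (1 - p)
odds_to_prob <- function(o) o / (1 + o)
prob_to_odds(c(1/2, 2/3, 1/6))   # 1, 2, 0.2  (i.e. 1:1, 2:1, 1:5)
odds_to_prob(c(1, 2, 1/5))       # 0.5, 0.667, 0.167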

Bayes’ Theorem

P(a|b) = P(a,b)/P(b)

P(b|a) = P(a,b)/P(a), so P(a,b) = P(a)P(b|a)

Thus, P(a|b) = P(a,b)/P(b) = P(a)P(b|a)/P(b)

But what is P(b)? P(b) = Σ_a P(a,b) = Σ_a P(a)P(b|a)

Thus, P(a|b) = P(a)P(b|a) / Σ_{a*} P(a*)P(b|a*)

where a* means “any value of a”.

P(a|b) = P(a)P(b|a) / Σ_{a*} P(a*)P(b|a*)

P(a|b) ∝ P(a) P(b|a)

posterior ∝ (prior) (likelihood)

Bayes’ Theorem is used to take a prior probability, update it with data (the likelihood), and get a posterior probability.

Medical test example. Suppose a test is 95% accurate when the disease is present and 97% accurate when the disease is absent. Suppose that 1% of the population has the disease.

What is P(have the disease | test +)?


P(dis | test+) = P(dis)P(test+ | dis) / [P(dis)P(test+ | dis) + P(¬dis)P(test+ | ¬dis)]

= (0.01)(0.95) / [(0.01)(0.95) + (0.99)(0.03)] = 0.0095 / (0.0095 + 0.0297) ≈ 0.24
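The same calculation in R (a sketch, not course code):

p_dis <- 0.01   # prevalence: P(dis)
sens  <- 0.95   # P(test+ | dis)
spec  <- 0.97   # P(test- | ¬dis), so P(test+ | ¬dis) = 0.03
p_pos <- p_dis * sens + (1 - p_dis) * (1 - spec)   # P(test+) = 0.0392
p_dis * sens / p_pos                               # P(dis | test+) ≈ 0.2423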

Typical statistics problem: There is a parameter, θ, that we want to estimate, and we have some data.

Traditional (frequentist) methods: Study and describe P(data | θ). If the data are unlikely for a given θ, then state “that value of θ is not supported by the data.” (A hyp. test asks whether a particular value of θ might be correct; a CI presents a range of plausible values.)

Bayesian methods: Describe the distribution P(θ | data).

A frequentist thinks of θ as fixed (but unknown), while a Bayesian thinks of θ as a random variable that has a distribution.

Bayesian reasoning is natural and easy to think about. It is becoming much more commonly used.

If Bayes is so great, why hasn’t it always been popular?

(1) Without Markov Chain Monte Carlo, it wasn’t practical.

(2) Some people distrust prior distributions, thinking that science should be objective (as if that were possible).

Bayes is becoming much more common, due to MCMC.

E.g., three recent years’ worth of J. Amer. Stat. Assoc. “applications and case studies” papers: 46 of 84 papers used Bayesian methods (+ 4 others merely included an application of Bayes’ Theorem).

What is MCMC?

We want to find P(θ | data), which is equal to P(θ)·P(data | θ)/P(data) and is proportional to P(θ)·P(data | θ), which is “prior × likelihood”.

But to use Bayesian methods we need to be able to evaluate the denominator, which is the integral of the numerator over the parameter space. In general, this integral is very hard to evaluate.

Note: There are two special cases in which the mathematics works out: a beta prior with binomial likelihood (which gives a beta posterior), and a normal prior with normal likelihood (which gives a normal posterior).
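A minimal random-walk Metropolis sketch in R (illustrative, not from the course scripts): MCMC needs the target only up to the normalizing constant, so it sidesteps the hard integral. The model is an assumption chosen for illustration: a Beta(1,1) prior with 6 successes in 9 Bernoulli trials, so the unnormalized posterior is θ^6 (1−θ)^3.

log_target <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)   # zero density outside (0, 1)
  6 * log(theta) + 3 * log(1 - theta)          # log(prior × likelihood), unnormalized
}
set.seed(1)
n_iter <- 10000
chain  <- numeric(n_iter)
theta  <- 0.5                                  # starting value
for (i in 1:n_iter) {
  proposal <- theta + rnorm(1, sd = 0.2)       # random-walk proposal
  if (log(runif(1)) < log_target(proposal) - log_target(theta))
    theta <- proposal                          # accept; otherwise keep current value
  chain[i] <- theta
}
mean(chain)   # close to the exact Beta(7, 4) posterior mean 7/11 ≈ 0.636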


See Effect on posterior of large n.R. Data: 1 success in 4 trials.


See Effect on posterior of large n.R. Data: 10/40 (vs 1/4)
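The named script isn’t reproduced here; a minimal R sketch of the same comparison, assuming a uniform Beta(1, 1) prior. Both data sets put the likelihood peak at 0.25, but the larger sample gives a visibly narrower posterior.

theta      <- seq(0, 1, by = 0.001)
post_small <- dbeta(theta, 1 + 1,  1 + 3)    # posterior after 1 success in 4 trials
post_large <- dbeta(theta, 1 + 10, 1 + 30)   # posterior after 10 successes in 40 trials
plot(theta, post_large, type = "l", ylab = "posterior density")
lines(theta, post_small, lty = 2)            # dashed: the n = 4 posterior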


See Effect on posterior of sharp prior.R


See BernBetaExampleLoveBlindMen.R. Data: 16 of 36 men correctly identify their partner.
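A quick check of the implied posterior in R, assuming (as an illustration) a uniform Beta(1, 1) prior, since the script itself isn’t included here:

a <- 1 + 16; b <- 1 + 36 - 16      # Beta(17, 21) posterior
a / (a + b)                        # posterior mean ≈ 0.447
qbeta(c(0.025, 0.975), a, b)       # central 95% posterior interval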


PRIOR DISTRIBUTIONS

As with all statistical analyses, we start by positing a model which specifies p(x | θ).

This is the likelihood, which relates all variables in a ‘full probability model’.

However, from a Bayesian point of view:
• θ is unknown, so it should have a probability distribution reflecting our uncertainty about it before seeing the data
• Therefore we specify a prior distribution p(θ)

Note this is like the prevalence in the medical test example.


POSTERIOR DISTRIBUTIONS

Also, x is known, so it should be conditioned on. Here we use Bayes’ theorem to obtain the conditional distribution of the unobserved quantities given the data, known as the posterior distribution:

p(θ | x) = p(θ) p(x | θ) / ∫ p(θ) p(x | θ) dθ ∝ p(θ) p(x | θ)

The prior distribution expresses our uncertainty about θ before seeing the data.
The posterior distribution expresses our uncertainty about θ after seeing the data.



EXAMPLES OF BAYESIAN INFERENCE USING THE NORMAL DISTRIBUTION

Known variance, unknown mean

It is easier to consider first a model with one unknown parameter. Suppose we have a sample of Normal data:

xi ~ N(μ, σ²), i = 1, ..., n.

Let us assume we know the variance, σ², and we assume a prior distribution for the mean, μ, based on our prior beliefs:

μ ~ N(μ0, σ0²)

Now we wish to construct the posterior distribution p(μ | x).


POSTERIOR FOR NORMAL DISTRIBUTION MEAN

So we have

p(μ) = (2πσ0²)^(-1/2) exp(-½ (μ − μ0)² / σ0²)
p(xi | μ) = (2πσ²)^(-1/2) exp(-½ (xi − μ)² / σ²)

and hence

p(μ | x) ∝ p(μ) p(x | μ)
= (2πσ0²)^(-1/2) exp(-½ (μ − μ0)² / σ0²) × ∏_{i=1}^{n} (2πσ²)^(-1/2) exp(-½ (xi − μ)² / σ²)
∝ exp(-½ μ² (1/σ0² + n/σ²) + μ (μ0/σ0² + Σi xi/σ²) + const)



POSTERIOR FOR NORMAL DISTRIBUTION MEAN (CONTINUED)

For a Normal distribution with response y, mean θ, and variance φ, we have

f(y) = (2πφ)^(-1/2) exp{-½ (y − θ)² / φ}
∝ exp{-½ y² φ^(-1) + yθ/φ + const}

We can equate this to our posterior as follows:

p(μ | x) ∝ exp(-½ μ² (1/σ0² + n/σ²) + μ (μ0/σ0² + Σi xi/σ²) + const)

→ φ = (1/σ0² + n/σ²)^(-1) and θ = φ (μ0/σ0² + Σi xi/σ²)


PRECISIONS AND MEANS

• In Bayesian statistics the precision = 1/variance is often more important than the variance.
• For the Normal model we have

1/φ = 1/σ0² + n/σ²   and   θ = φ (μ0/σ0² + x̄/(σ²/n))

In other words, the posterior precision is the sum of the prior precision and the data precision, and the posterior mean is a (precision-weighted) average of the prior mean and the data mean.
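The update as a small R helper (a sketch in the notation above, not course code):

# Posterior for a Normal mean with known variance: prior N(mu0, sig0sq),
# n observations with mean xbar and known variance sigsq.
normal_posterior <- function(mu0, sig0sq, xbar, sigsq, n) {
  prec  <- 1 / sig0sq + n / sigsq                    # posterior precision
  phi   <- 1 / prec                                  # posterior variance
  theta <- phi * (mu0 / sig0sq + n * xbar / sigsq)   # precision-weighted mean
  c(mean = theta, var = phi)
}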



LARGE SAMPLE PROPERTIES

As n → ∞:

Posterior precision 1/φ = 1/σ0² + n/σ² → n/σ², so the posterior variance φ → σ²/n
Posterior mean θ = φ (μ0/σ0² + x̄/(σ²/n)) → x̄

And so the posterior distribution

p(μ | x) → N(x̄, σ²/n)

compared to p(x̄ | μ) = N(μ, σ²/n) in the frequentist setting.


GIRLS HEIGHTS EXAMPLE

• 10 girls aged 18 had both their heights and weights measured.
• Their heights (in cm) were as follows:
169.6, 166.8, 157.1, 181.1, 158.4, 165.6, 166.7, 156.5, 168.1, 165.3

We will assume the variance is known to be 50.

Two individuals gave the following prior distributions for the mean height:
Individual 1: p1(μ) ~ N(165, 2²)
Individual 2: p2(μ) ~ N(170, 3²)


CONSTRUCTING POSTERIOR 1

• To construct the posterior, we use the formulae we have just calculated.
• From the prior, μ0 = 165, σ0² = 4
• From the data, x̄ = 165.52, σ² = 50, n = 10
• The posterior is therefore

p(μ | x) ~ N(μ1, φ1)

where φ1 = (1/4 + 10/50)^(-1) = 2.222,
μ1 = φ1 (165/4 + 1655.2/50) = 165.23.


PRIOR AND POSTERIOR COMPARISON

[Figure: densities of prior 1 and posterior 1]



CONSTRUCTING POSTERIOR 2

• Again, to construct the posterior we use the formulae we calculated earlier.
• From the prior, μ0 = 170, σ0² = 9
• From the data, x̄ = 165.52, σ² = 50, n = 10
• The posterior is therefore

p(μ | x) ~ N(μ2, φ2)

where φ2 = (1/9 + 10/50)^(-1) = 3.214,
μ2 = φ2 (170/9 + 1655.2/50) = 167.12.
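Both posteriors can be checked numerically with the helper sketched earlier:

heights <- c(169.6, 166.8, 157.1, 181.1, 158.4, 165.6, 166.7, 156.5, 168.1, 165.3)
normal_posterior(165, 4, mean(heights), 50, 10)   # prior 1: mean 165.23, var 2.222
normal_posterior(170, 9, mean(heights), 50, 10)   # prior 2: mean 167.12, var 3.214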


PRIOR 2 COMPARISON

[Figure: densities of prior 2 and posterior 2]

Note this prior is not as close to the data as prior 1, and hence the posterior sits between the prior and the likelihood.



OTHER CONJUGATE EXAMPLES

• When the posterior is in the same family as the prior we have conjugacy. Examples include:

Likelihood   Parameter     Prior    Posterior
Normal       Mean          Normal   Normal
Normal       Precision     Gamma    Gamma
Binomial     Probability   Beta     Beta
Poisson      Mean          Gamma    Gamma
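One more pair from the table, sketched in R with made-up data: a Gamma(a, b) prior on a Poisson mean updates to Gamma(a + Σyi, b + n).

y <- c(3, 5, 2, 4)                            # hypothetical Poisson counts
a <- 2; b <- 1                                # hypothetical Gamma prior (shape, rate)
c(shape = a + sum(y), rate = b + length(y))   # posterior: Gamma(16, 5)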


CREDIBLE INTERVALS

• If we consider the heights example with our first prior, then our posterior is
p(μ | x) ~ N(165.23, 2.222),
and a 95% credible interval for μ is
165.23 ± 1.96 × sqrt(2.222) = (162.31, 168.15).

Similarly, prior 2 results in a 95% credible interval for μ of (163.61, 170.63).

Note that credible intervals can be interpreted in the more natural way that there is a probability of 0.95 that the interval contains μ, rather than the frequentist conclusion that 95% of such intervals contain μ.
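The intervals, computed in R (a sketch):

165.23 + c(-1, 1) * 1.96 * sqrt(2.222)        # prior 1: (162.31, 168.15)
167.12 + c(-1, 1) * 1.96 * sqrt(3.214)        # prior 2: (163.61, 170.63)
qnorm(c(0.025, 0.975), 165.23, sqrt(2.222))   # same interval via posterior quantiles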



HYPOTHESIS TESTING

Another big issue in statistical modelling is the ability to test hypotheses and to make model comparisons in general.

The Bayesian approach is in some ways more straightforward. For an unknown parameter θ we simply calculate the posterior probabilities

p0 = P(θ ∈ Θ0 | x),  p1 = P(θ ∈ Θ1 | x)

and decide between H0 and H1 accordingly.

We also require the prior probabilities to achieve this:

π0 = P(θ ∈ Θ0),  π1 = P(θ ∈ Θ1)


BAYES FACTORS

• Prior odds on H0 against H1: π0/π1
• Posterior odds on H0 against H1: p0/p1
• The Bayes factor B in favour of H0 against H1 is

B = (p0/p1) / (π0/π1) = p0 π1 / (p1 π0)

Note that when the hypotheses are simple, B is the likelihood ratio of H0 against H1, i.e. the odds in favour of H0 against H1 given by the data; for complex hypotheses, however, B also involves the prior distributions.



BAYES FACTORS – GIRLS HEIGHT EXAMPLE, PRIOR 1

Let us assume that H0 is μ > 165, and hence H1 is μ ≤ 165. Under the N(165, 4) prior we have π0 = π1 = 0.5.

The posterior is N(165.23, 2.222), which gives p0 = 0.561 and p1 = 0.439, and hence a Bayes factor of 0.561/0.439 = 1.278. Here the Bayes factor is close to 1, so the data have not much altered our beliefs about the hypothesis under discussion.


BAYES FACTORS – GIRLS HEIGHT EXAMPLE, PRIOR 2

Now under the N(170, 9) prior we have π0 = 0.952 and π1 = 0.048, so there is strong a priori evidence for H0 against H1.

The posterior is N(167.12, 3.214), which gives p0 = 0.881 and p1 = 0.119, and hence a Bayes factor of (0.881 × 0.048)/(0.952 × 0.119) = 0.373. So in this case the Bayes factor is smaller than 1, as the data give less evidence for H0 against H1 than the prior distribution does.

It should be noted that care needs to be taken when using Bayes factors with non-informative priors.
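Both Bayes factors can be reproduced in R from Normal tail probabilities (a sketch):

bf <- function(mu0, sd0, mu_post, var_post, cut = 165) {
  pi0 <- 1 - pnorm(cut, mu0, sd0)                  # prior P(mu > 165)
  p0  <- 1 - pnorm(cut, mu_post, sqrt(var_post))   # posterior P(mu > 165)
  (p0 / (1 - p0)) / (pi0 / (1 - pi0))              # posterior odds / prior odds
}
bf(165, 2, 165.23, 2.222)   # prior 1: 1.278
bf(170, 3, 167.12, 3.214)   # prior 2: 0.373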

