Stat 535 C - Statistical Computing & Monte Carlo Methods
Arnaud Doucet
Email: arnaud@cs.ubc.ca
CS students: don't forget to re-register in CS-535D.
Even if you just audit this course, please do register.
2.1 Outline
Bayesian Statistics.
Testing Hypotheses: The Bayesian way.
Bayesian Model Selection.
3.1 Ingredients of Bayesian Inference
Given the prior $\pi(\theta)$ and the likelihood $l(\theta|x) = f(x|\theta)$, Bayes' formula yields
$$\pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\,d\theta}.$$
It represents all the information on $\theta$ that can be extracted from $x$.
It satisfies the sufficiency and likelihood principles.
On average (with respect to $X$), observing the data reduces the uncertainty about $\theta$; i.e.
$$\mathbb{E}\left[\mathrm{var}\left[\theta|X\right]\right] = \mathrm{var}\left[\theta\right] - \mathrm{var}\left[\mathbb{E}\left[\theta|X\right]\right] \leq \mathrm{var}\left[\theta\right].$$
3.2 Variance Decomposition Identity
If $(\theta, X)$ are two scalar random variables then we have
$$\mathrm{var}(\theta) = \mathbb{E}(\mathrm{var}(\theta|X)) + \mathrm{var}(\mathbb{E}(\theta|X)).$$
Proof:
$$\begin{aligned}
\mathrm{var}(\theta) &= \mathbb{E}\left[\theta^2\right] - \left(\mathbb{E}\left[\theta\right]\right)^2\\
&= \mathbb{E}\left[\mathbb{E}\left[\theta^2|X\right]\right] - \left(\mathbb{E}\left[\mathbb{E}\left[\theta|X\right]\right]\right)^2\\
&= \mathbb{E}\left[\mathbb{E}\left[\theta^2|X\right] - \left(\mathbb{E}\left[\theta|X\right]\right)^2\right] + \mathbb{E}\left[\left(\mathbb{E}\left[\theta|X\right]\right)^2\right] - \left(\mathbb{E}\left[\mathbb{E}\left[\theta|X\right]\right]\right)^2\\
&= \mathbb{E}(\mathrm{var}(\theta|X)) + \mathrm{var}(\mathbb{E}(\theta|X)).
\end{aligned}$$
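A quick Monte Carlo sanity check of this identity, sketched in Python with an illustrative Normal-Normal pair (the model and constants are not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10**6

# Illustrative conjugate model: theta ~ N(0, 1), X | theta ~ N(theta, 1).
theta = rng.normal(0.0, 1.0, N)
x = rng.normal(theta, 1.0)

# For this model E[theta | X] = X / 2 and var(theta | X) = 1 / 2 (standard conjugate result).
post_mean = x / 2.0
post_var = 0.5

lhs = theta.var()                  # var(theta), should be close to 1
rhs = post_var + post_mean.var()   # E[var(theta | X)] + var(E[theta | X])
print(lhs, rhs)                    # both close to 1
```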
3.3 Be careful
Such results appear attractive but one should be careful.
Here there is an underlying assumption that the observations are indeed distributed according to $\pi(x) = \int \pi(\theta)\, f(x|\theta)\, d\theta$.
3.4 Simple Binomial example
(Bayes, 1764): A billiard ball $W$ is rolled on a line of length one, with a uniform probability of stopping anywhere. It stops at $\theta$.
A second ball $O$ is then rolled $n$ times under the same assumptions and $X$ denotes the number of times the ball $O$ stopped to the left of $W$.
Given $X$, what inference can we make on $\theta$?
We have $X|\theta \sim \mathcal{B}(n, \theta)$, the binomial distribution, and select $\theta \sim \mathcal{U}[0, 1]$, so
$$\Pr(X = x|\theta) = f(x|\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \qquad \pi(\theta|x) = \frac{\theta^x(1-\theta)^{n-x}\,\mathbb{1}_{[0,1]}(\theta)}{\int_0^1 \theta^x(1-\theta)^{n-x}\,d\theta}.$$
3.4 Simple Binomial example
We have
$$\pi(x) = \int_0^1 \Pr(X = x|\theta)\,\pi(\theta)\,d\theta = \frac{1}{n+1} \quad \text{for } x = 0, \ldots, n.$$
It follows that $\pi(\theta|x) = \mathcal{B}e(x+1, n+1-x)$.
Prediction. Given $X = x$, you roll the ball once more and $\Pr(Y = 1|\theta) = \theta$, then
$$\Pr(Y = 1|x) = \int \Pr(Y = 1|\theta, x)\,\pi(\theta|x)\,d\theta = \int \theta\,\pi(\theta|x)\,d\theta = \mathbb{E}[\theta|x] = \frac{x+1}{n+2}.$$
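A short numerical illustration of the Beta posterior and the predictive probability (Python sketch; the values $n = 10$, $x = 7$ are arbitrary, not from the slides):

```python
import numpy as np
from scipy import stats

n, x = 10, 7                                # arbitrary illustrative values
posterior = stats.beta(x + 1, n + 1 - x)    # theta | x ~ Be(x+1, n+1-x)

# Exact predictive probability of one more success: (x+1)/(n+2) = E[theta | x]
print((x + 1) / (n + 2), posterior.mean())

# Monte Carlo check: average Pr(Y = 1 | theta) = theta over posterior draws
rng = np.random.default_rng(1)
draws = posterior.rvs(size=10**5, random_state=rng)
print(draws.mean())
```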
3.4 Simple Binomial example
Application. Laplace independently developed such a model.
From 1745 to 1770, 241,945 girls and 251,527 boys were born in Paris. Let $\theta$ be the probability that any birth is female; then, with $n = 251{,}527 + 241{,}945$,
$$\Pr(\theta \geq 0.5 \,|\, x = 241{,}945) \approx 1.15 \times 10^{-42}.$$
Remark: This is completely different from a p-value. We do not integrate over observations we have never seen.
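This probability can be checked numerically from the Beta posterior implied by the uniform prior; a minimal sketch:

```python
from scipy import stats

girls, boys = 241_945, 251_527
n = girls + boys

# Posterior under a uniform prior: theta | x ~ Be(girls + 1, n - girls + 1)
posterior = stats.beta(girls + 1, n - girls + 1)

# Pr(theta >= 0.5 | x): survival function of the Beta posterior at 0.5
print(posterior.sf(0.5))   # extremely small, of the order quoted above
```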
3.5 A Simple Gaussian example
Consider $X_1|\theta \sim \mathcal{N}(\theta, \sigma^2)$ and $\theta \sim \mathcal{N}(m_0, \sigma_0^2)$. Then
$$\begin{aligned}
\pi(\theta|x_1) &\propto f(x_1|\theta)\,\pi(\theta) \propto \exp\left(-\frac{(x_1-\theta)^2}{2\sigma^2} - \frac{(\theta-m_0)^2}{2\sigma_0^2}\right)\\
&\propto \exp\left(-\frac{1}{2}\left[\left(\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\theta^2 - 2\theta\left(\frac{x_1}{\sigma^2} + \frac{m_0}{\sigma_0^2}\right)\right]\right)\\
&\propto \exp\left(-\frac{1}{2\sigma_1^2}(\theta - m_1)^2\right),
\end{aligned}$$
so that $\theta|x_1 \sim \mathcal{N}(m_1, \sigma_1^2)$ with
$$\frac{1}{\sigma_1^2} = \frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}, \quad \text{i.e. } \sigma_1^2 = \frac{\sigma_0^2\sigma^2}{\sigma_0^2 + \sigma^2}, \qquad m_1 = \sigma_1^2\left(\frac{x_1}{\sigma^2} + \frac{m_0}{\sigma_0^2}\right).$$
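A minimal sketch of this conjugate update (the numerical values are arbitrary illustrations):

```python
def gaussian_posterior(x1, m0, s0_sq, s_sq):
    # Posterior of theta given x1 ~ N(theta, s_sq) under the prior theta ~ N(m0, s0_sq):
    # 1/s1_sq = 1/s_sq + 1/s0_sq,  m1 = s1_sq * (x1/s_sq + m0/s0_sq)
    s1_sq = 1.0 / (1.0 / s_sq + 1.0 / s0_sq)
    m1 = s1_sq * (x1 / s_sq + m0 / s0_sq)
    return m1, s1_sq

# Arbitrary illustrative numbers
print(gaussian_posterior(x1=1.2, m0=0.0, s0_sq=4.0, s_sq=1.0))
```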
3.5 A Simple Gaussian example
To predict the distribution of a new observation $X|\theta \sim \mathcal{N}(\theta, \sigma^2)$ in light of $x_1$, we use the predictive distribution
$$f(x|x_1) = \int f(x|\theta)\,\pi(\theta|x_1)\,d\theta.$$
We can do direct calculations or alternatively use the fact that $f(x|x_1)$ is Gaussian, so it is characterized by its mean and variance: writing $X = \theta + V$ with $V \sim \mathcal{N}(0, \sigma^2)$ independent of $\theta$,
$$\mathbb{E}[X|x_1] = \mathbb{E}[\theta + V|x_1] = \mathbb{E}[\theta|x_1] = m_1,$$
$$\mathrm{var}[X|x_1] = \mathrm{var}[\theta + V|x_1] = \mathrm{var}[\theta|x_1] + \mathrm{var}[V] = \sigma_1^2 + \sigma^2.$$
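A sketch checking the predictive mean and variance by simulation (same arbitrary numbers as in the previous sketch):

```python
import numpy as np

rng = np.random.default_rng(2)

m0, s0_sq, s_sq, x1 = 0.0, 4.0, 1.0, 1.2    # arbitrary illustrative values
s1_sq = 1.0 / (1.0 / s_sq + 1.0 / s0_sq)    # posterior variance
m1 = s1_sq * (x1 / s_sq + m0 / s0_sq)       # posterior mean

# Exact predictive: X | x1 ~ N(m1, s1_sq + s_sq)
print(m1, s1_sq + s_sq)

# Monte Carlo check: draw theta from the posterior, then X | theta
theta = rng.normal(m1, np.sqrt(s1_sq), 10**5)
x_new = rng.normal(theta, np.sqrt(s_sq))
print(x_new.mean(), x_new.var())
```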
3.5 A Simple Gaussian example
Now assume that you observe a realization $x_2$ of $X_2|\theta \sim \mathcal{N}(\theta, \sigma^2)$. Then you are now interested in
$$\pi(\theta|x_1, x_2) \propto f(x_2|\theta)\,f(x_1|\theta)\,\pi(\theta) \propto f(x_2|\theta)\,\pi(\theta|x_1) \propto f(x_1|\theta)\,\pi(\theta|x_2).$$
Updating the prior one observation at a time, or with all observations together, does not matter.
The sequential approach can be useful for massive datasets. In this case, at time $n$,
$$\pi(\theta|x_1, \ldots, x_n) \propto f(x_n|\theta)\,\pi(\theta|x_1, \ldots, x_{n-1});$$
i.e. the prior at time $n$ is the posterior at time $n-1$.
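A sketch contrasting sequential and batch updating on synthetic data (arbitrary constants), using the conjugate formulas above:

```python
import numpy as np

rng = np.random.default_rng(3)
m0, s0_sq, s_sq = 0.0, 4.0, 1.0                # arbitrary prior and noise variance
xs = rng.normal(0.7, np.sqrt(s_sq), size=50)   # synthetic observations

# Sequential updating: the prior at time n is the posterior at time n-1
m, v = m0, s0_sq
for x in xs:
    v_new = 1.0 / (1.0 / s_sq + 1.0 / v)
    m = v_new * (x / s_sq + m / v)
    v = v_new

# Batch posterior computed in one shot
v_batch = 1.0 / (1.0 / s0_sq + len(xs) / s_sq)
m_batch = v_batch * (xs.sum() / s_sq + m0 / s0_sq)

print(m, v)
print(m_batch, v_batch)   # identical up to rounding
```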
3.6 Simple Gaussian example: Bayes vs ML
The ML estimate of $\theta$ at time $n$ is simply
$$\widehat{\theta}_{ML} = \arg\sup_{\theta} \prod_{i=1}^n f(x_i|\theta) = \frac{1}{n}\sum_{i=1}^n x_i.$$
The posterior of $\theta$ at time $n$ is
$$\theta|x_1, \ldots, x_n \sim \mathcal{N}(m_n, \sigma_n^2)$$
where
$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}, \quad \text{i.e. } \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2} \approx \frac{\sigma^2}{n}, \qquad m_n = \sigma_n^2\left(\frac{\sum_{i=1}^n x_i}{\sigma^2} + \frac{m_0}{\sigma_0^2}\right) \approx \frac{\sum_{i=1}^n x_i}{n}.$$
Asymptotically in $n$ the prior is washed out by the data and
$$\mathbb{E}[\theta|x_1, \ldots, x_n] = m_n \approx \widehat{\theta}_{ML}.$$
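A small sketch on synthetic data (arbitrary constants) showing the posterior mean $m_n$ approaching the ML estimate as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)
m0, s0_sq, s_sq, theta_true = 0.0, 1.0, 1.0, 2.0   # arbitrary illustrative values

for n in (10, 100, 10_000):
    xs = rng.normal(theta_true, np.sqrt(s_sq), size=n)
    v_n = 1.0 / (1.0 / s0_sq + n / s_sq)
    m_n = v_n * (xs.sum() / s_sq + m0 / s0_sq)
    print(n, m_n, xs.mean())   # m_n gets closer to the ML estimate (sample mean)
```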
3.7 Bayes vs ML
However, keep in mind that the information provided by a Bayesian approach is much richer.
You can compute, for example, posterior probabilities $\Pr(\theta \in A|x_1, \ldots, x_n)$ or $\mathrm{var}(\theta|x_1, \ldots, x_n)$,
or compute the distributions of future observations $f(x|x_1, \ldots, x_n)$.
ML can be reassuring because of consistency and efficiency.
For finite sample sizes, do you really care?
For time series models, for example, there is no such thing.
3.8 A Simple Poisson Model
Assume you have some counting observations $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{P}(\lambda)$; i.e.
$$f(x_i|\lambda) = e^{-\lambda}\frac{\lambda^{x_i}}{x_i!}.$$
Assume we adopt a Gamma prior for $\lambda$; i.e. $\lambda \sim \mathcal{G}a(\alpha, \beta)$,
$$\pi(\lambda) = \mathcal{G}a(\lambda; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}.$$
We have
$$\pi(\lambda|x_1, \ldots, x_n) = \mathcal{G}a\left(\lambda;\, \alpha + \sum_{i=1}^n x_i,\; \beta + n\right).$$
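A sketch of this Gamma-Poisson update on synthetic counts (arbitrary values; note that scipy parameterizes the Gamma by shape and scale $= 1/\beta$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha, beta = 2.0, 1.0               # illustrative Ga(alpha, beta) prior
xs = rng.poisson(lam=3.0, size=20)   # synthetic counts, arbitrary rate

# Conjugate update: lambda | x_1, ..., x_n ~ Ga(alpha + sum x_i, beta + n)
posterior = stats.gamma(a=alpha + xs.sum(), scale=1.0 / (beta + len(xs)))
print(posterior.mean(), posterior.std())
```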
4.1 Testing hypotheses in a Bayesian framework
Consider the problem where we have $\pi(\theta) = \mathcal{U}[0, 1]$ and
$$\Pr(X = x|\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \quad \text{then} \quad \pi(\theta|x) = \mathcal{B}e(x+1, n+1-x).$$
If we want to test $H_0: \theta \geq \tfrac{1}{2}$ vs $H_1: \theta < \tfrac{1}{2}$ then, in a Bayesian approach, you can simply compute
$$\pi(H_0|x) = 1 - \pi(H_1|x) = \int_{1/2}^{1} \pi(\theta|x)\,d\theta.$$
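A minimal sketch of this computation (illustrative $n$ and $x$, not from the slides):

```python
from scipy import stats

n, x = 10, 7                                # arbitrary illustrative values
posterior = stats.beta(x + 1, n + 1 - x)    # theta | x ~ Be(x+1, n+1-x)

pi_H0 = posterior.sf(0.5)                   # Pr(theta >= 1/2 | x)
print(pi_H0, 1.0 - pi_H0)                   # pi(H0 | x) and pi(H1 | x)
```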
Golden rule of Bayesians: Thou shalt not integrate with respect to observations (except for design...).
Contrary to frequentists, your test is never based on observations you don't observe.
4.2 Bayes Factors
More generally, one wants to compare two hypotheses $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$; then the prior is
$$\pi(\theta) = \pi(H_0)\,\pi_0(\theta) + \pi(H_1)\,\pi_1(\theta)$$
where $\pi(H_0) + \pi(H_1) = 1$.
In the previous example, $\pi_0(\theta) = \mathcal{U}\left[\tfrac{1}{2}, 1\right]$ and $\pi_1(\theta) = \mathcal{U}\left[0, \tfrac{1}{2}\right]$ and $\pi(H_0) = \pi(H_1) = \tfrac{1}{2}$.
To compare $H_0$ versus $H_1$, we typically compute the Bayes factor, which partially eliminates the influence of the prior modelling (i.e. $\pi(H_i)$):
$$B_{10} = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int f(x|\theta)\,\pi_1(\theta)\,d\theta}{\int f(x|\theta)\,\pi_0(\theta)\,d\theta} = \frac{\pi(H_1|x)}{\pi(H_0|x)}\cdot\frac{\pi(H_0)}{\pi(H_1)}.$$
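A sketch computing this Bayes factor by numerical integration for the half-uniform priors above (illustrative data), and checking it against the posterior-odds identity:

```python
from scipy import stats
from scipy.integrate import quad

n, x = 10, 7                                       # arbitrary illustrative values
lik = lambda th: stats.binom.pmf(x, n, th)

# pi_0 = U[1/2, 1] and pi_1 = U[0, 1/2], each with density 2 on its support
m_H0, _ = quad(lambda th: 2.0 * lik(th), 0.5, 1.0)
m_H1, _ = quad(lambda th: 2.0 * lik(th), 0.0, 0.5)
B10 = m_H1 / m_H0
print(B10)

# Check: with pi(H0) = pi(H1) = 1/2, B10 equals the posterior odds under the U[0,1] prior
post = stats.beta(x + 1, n + 1 - x)
print(post.cdf(0.5) / post.sf(0.5))
```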
4.3 Towards Bayes Model Selection
Bayes factors are not limited to the comparison of models with the same parameter space.
Assume you have some data and two statistical models.
Under $H_0$, $\theta_0 \in \Theta_0$, the prior is $\pi_0(\theta_0)$ and the likelihood is $f_0(x|\theta_0)$; under $H_1$, $\theta_1 \in \Theta_1$, the prior is $\pi_1(\theta_1)$ and the likelihood is $f_1(x|\theta_1)$; then
$$B_{10} = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int f_1(x|\theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int f_0(x|\theta_0)\,\pi_0(\theta_0)\,d\theta_0}.$$
One can have $\Theta_0 = \mathbb{R}$ and $\Theta_1 = \mathbb{R}^{1000}$.
4.3 Towards Bayes Model Selection
Jeffreys' scale of evidence says that
- if $\log_{10}(B_{10})$ varies between 0 and 0.5, the evidence against $H_0$ is poor,
- if it is between 0.5 and 1, it is substantial,
- if it is between 1 and 2, it is strong, and
- if it is above 2, it is decisive.
The Bayes factor tells you whether one should prefer $H_0$ to $H_1$: it does NOT tell you whether any of these models is sensible!
4.3 Towards Bayes Model Selection
Bayes procedures can be directly used to test a point null hypothesis; i.e. $H_0: \theta = \theta_0$ (that is, $\pi_0(\theta) = \delta_{\theta_0}(\theta)$) versus $H_1: \theta \in \Theta_1$, where the prior is then defined as
$$\pi(\theta) = \pi(H_0)\,\pi_0(\theta) + \pi(H_1)\,\pi_1(\theta).$$
The associated Bayes factor is simply
$$B_{10}(x) = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int f(x|\theta)\,\pi_1(\theta)\,d\theta}{f(x|\theta_0)}.$$
4.4 Example: The celebrated coin example
Assume you have a coin, you toss it 10 times and get $x = 10$ heads. Is it biased?
Let $\theta$ be the probability of getting a head; then we can test $H_0: \theta = \tfrac{1}{2}$.
The two-sided p-value $2\Pr(X \geq 10|H_0) = 2^{-9} \approx 0.002$ and the hypothesis is rejected.
In a Bayesian framework, we test $H_0$ versus $H_1: \theta \sim \mathcal{U}\left[\tfrac{1}{2}, 1\right]$ using
$$B_{10} = \frac{2\int_{1/2}^{1}\theta^x(1-\theta)^{10-x}\,d\theta}{\left(\tfrac{1}{2}\right)^x\left(\tfrac{1}{2}\right)^{10-x}} = \frac{2\int_{1/2}^{1}\theta^{10}\,d\theta}{\left(\tfrac{1}{2}\right)^{10}} = \frac{2047}{11} \approx 186.$$
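A quick numerical check of the Bayes factor above (assuming the $\mathcal{U}\left[\tfrac{1}{2}, 1\right]$ prior has density 2 on its support):

```python
from scipy.integrate import quad

n, x = 10, 10
# Marginal likelihood under H1: theta ~ U[1/2, 1], density 2 on [1/2, 1]
m_H1, _ = quad(lambda th: 2.0 * th**x * (1.0 - th)**(n - x), 0.5, 1.0)
# Likelihood under H0: theta = 1/2
m_H0 = 0.5**n

print(m_H1 / m_H0)   # Bayes factor B10 = 2047/11, roughly 186
```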
4.5 Testing the mean of a Gaussian
Assume you have $X|\theta, \sigma^2 \sim \mathcal{N}(\theta, \sigma^2)$ where $\sigma^2$ is assumed known but $\theta$ (the parameter) is unknown.
We want to test $H_0: \theta = 0$ vs $H_1: \theta \sim \mathcal{N}(0, \tau^2)$; then
$$B_{10}(x) = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int \mathcal{N}(x; \theta, \sigma^2)\,\mathcal{N}(\theta; 0, \tau^2)\,d\theta}{f(x|0)} = \sqrt{\frac{\sigma^2}{\sigma^2+\tau^2}}\,\exp\left(\frac{\tau^2 x^2}{2\sigma^2(\sigma^2+\tau^2)}\right).$$
Alternatively, if $\pi(H_0) = \rho = 1 - \pi(H_1)$ then
$$\pi(H_0|x) = \pi(\theta = 0|x) = \left(1 + \frac{1-\rho}{\rho}\,B_{10}(x)\right)^{-1}.$$
4.5 Testing the mean of a Gaussian
The Bayes factor depends heavily on $\tau^2$. As $\tau^2 \to \infty$, the prior becomes uninformative, but then $B_{10}(x) \to 0$ whatever $x$ is, and $\pi(H_0|x) \to 1$.
We will see this next week, but using vague priors for model selection is a very, very bad idea... (Lindley's paradox).
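A sketch of this behaviour, using the closed-form $B_{10}(x)$ derived above (prior mean 0 and $\rho = 1/2$; $x$ and $\sigma^2$ are arbitrary illustrative values):

```python
import numpy as np

def bayes_factor(x, sigma_sq, tau_sq):
    # B10(x) for H0: theta = 0 vs H1: theta ~ N(0, tau_sq), with X | theta ~ N(theta, sigma_sq)
    return np.sqrt(sigma_sq / (sigma_sq + tau_sq)) * np.exp(
        tau_sq * x**2 / (2.0 * sigma_sq * (sigma_sq + tau_sq)))

x, sigma_sq = 2.0, 1.0
for tau_sq in (1.0, 10.0, 1e4, 1e8):
    b10 = bayes_factor(x, sigma_sq, tau_sq)
    print(tau_sq, b10, 1.0 / (1.0 + b10))   # B10 -> 0 and pi(H0 | x) -> 1 as tau_sq grows
```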