
Stat 535 C - Statistical Computing & Monte Carlo Methods

Arnaud Doucet

Email: arnaud@cs.ubc.ca

CS students: don't forget to re-register in CS-535D.

Even if you just audit this course, please do register.

2.1 Outline

Bayesian Statistics.

Testing Hypotheses: The Bayesian way.

Bayesian Model Selection.



3.1 Ingredients of Bayesian Inference

Given the prior $\pi(\theta)$ and the likelihood $l(\theta|x) = f(x|\theta)$, Bayes' formula yields

$$\pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\,d\theta}.$$

The posterior represents all the information on $\theta$ that can be extracted from $x$.

It satisfies the sufficiency and likelihood principles.

On average (with respect to $X$), observing $X$ reduces the uncertainty about $\theta$; i.e.

$$E[\text{var}[\theta|X]] = \text{var}[\theta] - \text{var}[E[\theta|X]] \le \text{var}[\theta].$$
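
A minimal numerical sketch of Bayes' formula (hypothetical Python/NumPy code, not from the lecture; the data n = 10, x = 7 are made up): the posterior is likelihood times prior, normalized on a grid.

```python
import numpy as np

# Grid approximation of Bayes' formula: posterior = likelihood x prior,
# normalized by the evidence. Hypothetical data: 7 successes in 10 trials,
# Binomial likelihood, Uniform prior.
theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                      # pi(theta) = U[0, 1]
n, x = 10, 7
likelihood = theta**x * (1.0 - theta)**(n - x)   # f(x|theta), up to a constant

unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)

print((theta * posterior).sum() * dtheta)        # approx (x+1)/(n+2) = 0.6667
```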



3.2 Variance Decomposition Identity

If $(\theta, X)$ are two scalar random variables then we have

$$\text{var}(\theta) = E(\text{var}(\theta|X)) + \text{var}(E(\theta|X)).$$

Proof:

$$\text{var}(\theta) = E[\theta^2] - (E[\theta])^2$$
$$= E[E[\theta^2|X]] - (E[E[\theta|X]])^2$$
$$= E[E[\theta^2|X] - (E[\theta|X])^2] + E[(E[\theta|X])^2] - (E[E[\theta|X]])^2$$
$$= E(\text{var}(\theta|X)) + \text{var}(E(\theta|X)).$$
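
This identity is easy to check by simulation. A hypothetical NumPy sketch in the Beta-Binomial model of Section 3.4 (all values assumed):

```python
import numpy as np

# Monte Carlo check of var(theta) = E[var(theta|X)] + var(E[theta|X])
# in the Beta-Binomial model, where the posterior theta|X=x ~ Be(x+1, n+1-x)
# has closed-form mean and variance.
rng = np.random.default_rng(0)
n = 10
theta = rng.uniform(size=500_000)        # theta ~ U[0, 1], so var(theta) = 1/12
x = rng.binomial(n, theta)               # X|theta ~ B(n, theta)

a, b = x + 1, n + 1 - x
post_mean = a / (a + b)
post_var = a * b / ((a + b)**2 * (a + b + 1))

print(theta.var())                        # left-hand side, approx 1/12
print(post_var.mean() + post_mean.var())  # right-hand side, approx 1/12
```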



3.3 Be careful

Such results appear attractive, but one should be careful.

Here there is an underlying assumption that the observations are indeed distributed according to $\pi(x) = \int \pi(\theta)\,f(x|\theta)\,d\theta$.



3.4 Simple Binomial example

(Bayes, 1764): A billiard ball W is rolled on a line of length one, with a uniform probability of stopping anywhere. It stops at $\theta$.

A second ball O is then rolled $n$ times under the same assumptions, and $X$ denotes the number of times the ball O stopped on the left of W.

Given $X$, what inference can we make on $\theta$?

We have $X|\theta \sim \mathcal{B}(n, \theta)$, a binomial distribution, and select $\theta \sim U[0, 1]$, so

$$\Pr(X = x|\theta) = f(x|\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \quad \pi(\theta|x) = \frac{\theta^x(1-\theta)^{n-x}\,1_{[0,1]}(\theta)}{\int_0^1 \theta^x(1-\theta)^{n-x}\,d\theta}.$$



3.4 Simple Binomial example

We have

$$\pi(x) = \int_0^1 \Pr(X = x|\theta)\,\pi(\theta)\,d\theta = \frac{1}{n+1} \quad \text{for } x = 0, \ldots, n.$$

It follows that $\pi(\theta|x) = \text{Be}(x+1, n+1-x)$.

Prediction. Given $X = x$, you roll the ball once more and $\Pr(Y = 1|\theta) = \theta$; then

$$\Pr(Y = 1|x) = \int \Pr(Y = 1|\theta, x)\,\pi(\theta|x)\,d\theta = \int \theta\,\pi(\theta|x)\,d\theta = E[\theta|x] = \frac{x+1}{n+2}.$$
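
A quick check with SciPy (hypothetical n and x): the posterior $\text{Be}(x+1, n+1-x)$ has mean exactly $(x+1)/(n+2)$.

```python
from scipy import stats

# Rule of succession (hypothetical data): posterior mean equals the
# predictive probability (x+1)/(n+2).
n, x = 20, 14
posterior = stats.beta(x + 1, n + 1 - x)
print(posterior.mean(), (x + 1) / (n + 2))   # both 0.6818...
```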



3.4 Simple Binomial example

Application. Laplace independently developed such a model.

From 1745 to 1770, 241,945 girls and 251,527 boys were born in Paris. Let $\theta$ be the probability that any birth is female; then, with $n = 251{,}527 + 241{,}945$,

$$\Pr(\theta \ge 0.5|x = 241{,}945) \approx 1.15 \times 10^{-42}.$$

Remark: This is completely different from a p-value. We do not integrate over observations we have never seen.
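
This posterior probability can be reproduced with SciPy (a sketch; the birth counts are from the slide, the code is not):

```python
from scipy import stats

# Laplace's computation: theta|x ~ Be(x+1, n+1-x) with x = 241,945 girls
# out of n = 493,472 births (counts from the slide).
girls, boys = 241_945, 251_527
n = girls + boys
posterior = stats.beta(girls + 1, n + 1 - girls)
print(posterior.sf(0.5))     # Pr(theta >= 0.5 | x), about 1.15e-42
```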



3.5 A Simple Gaussian example

   
Consider $X_1|\theta \sim \mathcal{N}(\theta, \sigma^2)$ and $\theta \sim \mathcal{N}(m_0, \sigma_0^2)$. Then

$$\pi(\theta|x_1) \propto f(x_1|\theta)\,\pi(\theta) \propto \exp\left(-\frac{(x_1-\theta)^2}{2\sigma^2} - \frac{(\theta-m_0)^2}{2\sigma_0^2}\right)$$
$$\propto \exp\left(-\frac{\theta^2}{2}\left(\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}\right) + \theta\left(\frac{x_1}{\sigma^2} + \frac{m_0}{\sigma_0^2}\right)\right)$$
$$\propto \exp\left(-\frac{1}{2\sigma_1^2}(\theta - m_1)^2\right)$$

so that

$$\theta|x_1 \sim \mathcal{N}(m_1, \sigma_1^2)$$

with

$$\frac{1}{\sigma_1^2} = \frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}, \quad \text{i.e. } \sigma_1^2 = \frac{\sigma_0^2\,\sigma^2}{\sigma_0^2 + \sigma^2},$$
$$m_1 = \sigma_1^2\left(\frac{x_1}{\sigma^2} + \frac{m_0}{\sigma_0^2}\right).$$
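
In code, the conjugate update is two lines. A hypothetical Python sketch (function name and values are mine, not the lecture's):

```python
# Conjugate Gaussian update: posterior mean m1 and variance s1^2 of theta
# given one observation x1 (hypothetical sketch).
def gaussian_update(x1, m0, s0sq, ssq):
    s1sq = 1.0 / (1.0 / ssq + 1.0 / s0sq)    # 1/s1^2 = 1/s^2 + 1/s0^2
    m1 = s1sq * (x1 / ssq + m0 / s0sq)       # precision-weighted average
    return m1, s1sq

print(gaussian_update(x1=1.2, m0=0.0, s0sq=4.0, ssq=1.0))  # (0.96, 0.8)
```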



3.5 A Simple Gaussian example

 
To predict the distribution of a new observation $X|\theta \sim \mathcal{N}(\theta, \sigma^2)$ in light of $x_1$, we use the predictive distribution

$$f(x|x_1) = \int f(x|\theta)\,\pi(\theta|x_1)\,d\theta.$$

We can do direct calculations or, alternatively, write $X = \theta + V$ with $V \sim \mathcal{N}(0, \sigma^2)$ independent of $\theta$, and use the fact that $f(x|x_1)$ is Gaussian, so it is characterized by its mean and variance:

$$E[X|x_1] = E[\theta + V|x_1] = E[\theta|x_1] = m_1,$$
$$\text{var}[X|x_1] = \text{var}[\theta + V|x_1] = \text{var}[\theta|x_1] + \text{var}[V] = \sigma_1^2 + \sigma^2.$$
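
A simulation sanity check of these predictive moments (hypothetical values, reusing $m_1 = 0.96$, $\sigma_1^2 = 0.8$ from the sketch above):

```python
import numpy as np

# Draw theta ~ pi(.|x1) = N(m1, s1^2), then X ~ N(theta, s^2), and compare
# the sample moments with m1 and s1^2 + s^2.
rng = np.random.default_rng(2)
m1, s1sq, ssq = 0.96, 0.8, 1.0
theta = rng.normal(m1, np.sqrt(s1sq), size=500_000)
x = rng.normal(theta, np.sqrt(ssq))
print(x.mean(), x.var())     # approx 0.96 and 1.8
```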



3.5 A Simple Gaussian example

 
Now assume that you observe a realization $x_2$ of $X_2|\theta \sim \mathcal{N}(\theta, \sigma^2)$. You are now interested in

$$\pi(\theta|x_1, x_2) \propto f(x_2|\theta)\,f(x_1|\theta)\,\pi(\theta)$$
$$\propto f(x_2|\theta)\,\pi(\theta|x_1)$$
$$\propto f(x_1|\theta)\,\pi(\theta|x_2).$$

Updating the prior one observation at a time, or with all observations together, does not matter.

The sequential approach can be useful for massive datasets. In this case, at time $n$,

$$\pi(\theta|x_1, \ldots, x_n) \propto f(x_n|\theta)\,\pi(\theta|x_1, \ldots, x_{n-1});$$

i.e. the prior at time $n$ is the posterior at time $n-1$.
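
A hypothetical sketch showing that sequential and batch conjugate updates agree (the batch formulas appear in Section 3.6):

```python
import numpy as np

# Sequential vs. batch conjugate updating: processing observations one at
# a time gives the same N(m_n, s_n^2) as updating with all n at once.
rng = np.random.default_rng(3)
data = rng.normal(1.0, 1.0, size=100)
m0, s0sq, ssq = 0.0, 4.0, 1.0

m, s_post = m0, s0sq
for x in data:                             # prior at time i = posterior at i-1
    s_new = 1.0 / (1.0 / ssq + 1.0 / s_post)
    m = s_new * (x / ssq + m / s_post)
    s_post = s_new

n = len(data)
snsq = 1.0 / (1.0 / s0sq + n / ssq)        # batch: 1/s_n^2 = 1/s0^2 + n/s^2
mn = snsq * (data.sum() / ssq + m0 / s0sq)
print(m, mn)                               # identical up to rounding
print(s_post, snsq)
```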



3.6 Simple Gaussian example: Bayes vs ML

The ML estimate of $\theta$ at time $n$ is simply

$$\hat{\theta}_{ML} = \arg\sup_\theta \prod_{i=1}^n f(x_i|\theta) = \frac{1}{n}\sum_{i=1}^n x_i.$$

The posterior of $\theta$ at time $n$ is

$$\theta|x_1, \ldots, x_n \sim \mathcal{N}(m_n, \sigma_n^2)$$

where

$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}, \quad \text{i.e. } \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2} \approx \frac{\sigma^2}{n},$$
$$m_n = \sigma_n^2\left(\frac{\sum_{i=1}^n x_i}{\sigma^2} + \frac{m_0}{\sigma_0^2}\right) \approx \frac{\sum_{i=1}^n x_i}{n}.$$

Asymptotically in $n$, the prior is washed out by the data and

$$E[\theta|x_1, \ldots, x_n] = m_n \approx \hat{\theta}_{ML}.$$
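
A small illustrative sketch (all values assumed): even with a badly centred, fairly confident prior, the posterior mean $m_n$ approaches the MLE as $n$ grows.

```python
import numpy as np

# Prior washout: the prior N(-5, 0.5) is far from theta_true = 2,
# yet m_n tracks the sample mean (the MLE) once n is large.
rng = np.random.default_rng(4)
m0, s0sq, ssq, theta_true = -5.0, 0.5, 1.0, 2.0
for n in (1, 10, 100, 10_000):
    x = rng.normal(theta_true, np.sqrt(ssq), size=n)
    snsq = 1.0 / (1.0 / s0sq + n / ssq)
    mn = snsq * (x.sum() / ssq + m0 / s0sq)
    print(n, mn, x.mean())               # m_n -> MLE as n grows
```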



3.7 Bayes vs ML

However, keep in mind that the information provided by a Bayesian approach is much richer.

You can compute, for example, posterior probabilities

$$\Pr(\theta \in A|x_1, \ldots, x_n) \quad \text{or} \quad \text{var}(\theta|x_1, \ldots, x_n)$$

or compute the distributions of future observations

$$f(x|x_1, \ldots, x_n).$$

ML can be reassuring because of consistency and efficiency.

For finite sample sizes, do you really care? For time series models, for example, there is no such thing.



3.8 A Simple Poisson Model

Assume you have some counting observations $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{P}(\lambda)$; i.e.

$$f(x_i|\lambda) = e^{-\lambda}\,\frac{\lambda^{x_i}}{x_i!}.$$

Assume we adopt a Gamma prior for $\lambda$; i.e. $\lambda \sim \text{Ga}(\alpha, \beta)$,

$$\pi(\lambda) = \text{Ga}(\lambda; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,\lambda^{\alpha-1}e^{-\beta\lambda}.$$

We have

$$\pi(\lambda|x_1, \ldots, x_n) = \text{Ga}\left(\lambda;\; \alpha + \sum_{i=1}^n x_i,\; \beta + n\right).$$
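
A quick conjugacy check with SciPy (hypothetical rate and prior parameters; note that SciPy's gamma uses scale = $1/\beta$):

```python
import numpy as np
from scipy import stats

# Poisson-Gamma conjugacy: the posterior Ga(alpha + sum(x), beta + n)
# should concentrate near the data-generating rate.
rng = np.random.default_rng(5)
lam_true, alpha, beta = 3.0, 2.0, 1.0
x = rng.poisson(lam_true, size=200)

posterior = stats.gamma(a=alpha + x.sum(), scale=1.0 / (beta + len(x)))
print(posterior.mean(), posterior.std())   # mean near lam_true = 3.0
```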



4.1 Testing hypotheses in a Bayesian framework

Consider the problem where we have $\pi(\theta) = U[0, 1]$ and

$$\Pr(X = x|\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \quad \text{so that } \pi(\theta|x) = \text{Be}(x+1, n+1-x).$$

If we want to test $H_0: \theta \ge \frac{1}{2}$ vs $H_1: \theta < \frac{1}{2}$ then, in a Bayesian approach, you can simply compute

$$\pi(H_0|x) = 1 - \pi(H_1|x) = \int_{1/2}^1 \pi(\theta|x)\,d\theta.$$

Golden rule of Bayesians: thou shalt not integrate with respect to observations (except for design...).

Contrary to frequentists, your test is never based on observations you don't observe.
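
In code (hypothetical n and x), this posterior probability is one line with SciPy:

```python
from scipy import stats

# Posterior probability of H0: theta >= 1/2, with n = 10 trials and
# x = 7 successes (hypothetical data).
n, x = 10, 7
print(stats.beta(x + 1, n + 1 - x).sf(0.5))   # pi(H0|x), about 0.89
```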



4.2 Bayes Factors

More generally, one wants to compare two hypotheses $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$. The prior is then

$$\pi(\theta) = \pi(H_0)\,\pi_0(\theta) + \pi(H_1)\,\pi_1(\theta)$$

where $\pi(H_0) + \pi(H_1) = 1$.

In the previous example, $\pi_0(\theta) = U\left[\frac{1}{2}, 1\right]$ and $\pi_1(\theta) = U\left[0, \frac{1}{2}\right]$, and $\pi(H_0) = \pi(H_1) = \frac{1}{2}$.

To compare $H_0$ versus $H_1$, we typically compute the Bayes factor, which partially eliminates the influence of the prior modelling (i.e. $\pi(H_i)$):

$$B_{10} = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int f(x|\theta)\,\pi_1(\theta)\,d\theta}{\int f(x|\theta)\,\pi_0(\theta)\,d\theta} = \frac{\pi(H_1|x)}{\pi(H_0|x)}\cdot\frac{\pi(H_0)}{\pi(H_1)}.$$
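
A numerical sketch of this Bayes factor for the uniform priors above (hypothetical data n = 10, x = 7):

```python
from scipy.integrate import quad

# Bayes factor for H1: theta ~ U[0, 1/2] vs H0: theta ~ U[1/2, 1].
# Each marginal likelihood integrates the binomial likelihood against the
# corresponding uniform prior (density 2); the binomial coefficient cancels.
n, x = 10, 7
lik = lambda t: t**x * (1.0 - t)**(n - x)
m1, _ = quad(lambda t: 2.0 * lik(t), 0.0, 0.5)   # pi(x|H1)
m0, _ = quad(lambda t: 2.0 * lik(t), 0.5, 1.0)   # pi(x|H0)
print(m1 / m0)                                    # about 0.13: data favour H0
```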



4.3 Towards Bayes Model Selection

Bayes factors are not limited to the comparison of models with the same parameter space.

Assume you have some data and two statistical models. Under $H_0$, $\theta_0 \in \Theta_0$, the prior is $\pi_0(\theta_0)$ and the likelihood is $f_0(x|\theta_0)$; under $H_1$, $\theta_1 \in \Theta_1$, the prior is $\pi_1(\theta_1)$ and the likelihood is $f_1(x|\theta_1)$. Then

$$B_{10} = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int f_1(x|\theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int f_0(x|\theta_0)\,\pi_0(\theta_0)\,d\theta_0}.$$

One can have $\Theta_0 = \mathbb{R}$ and $\Theta_1 = \mathbb{R}^{1000}$.



4.3 Towards Bayes Model Selection

Jeffreys' scale of evidence says that

- if $\log_{10}(B_{10})$ varies between 0 and 0.5, the evidence against $H_0$ is poor,

- if it is between 0.5 and 1, it is substantial,

- if it is between 1 and 2, it is strong, and

- if it is above 2, it is decisive.

The Bayes factor tells you whether one should prefer $H_1$ to $H_0$: it does NOT tell you whether any of these models is sensible!



4.3 Towards Bayes Model Selection

Bayes procedures can be directly used to test a point null hypothesis, i.e. $H_0: \theta = \theta_0$ (that is, $\pi_0(\theta) = \delta_{\theta_0}(\theta)$) versus $H_1: \theta \in \Theta_1$, where the prior is then defined as

$$\pi(\theta) = \pi(H_0)\,\pi_0(\theta) + \pi(H_1)\,\pi_1(\theta).$$

The associated Bayes factor is simply

$$B_{10}(x) = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int f(x|\theta)\,\pi_1(\theta)\,d\theta}{f(x|\theta_0)}.$$



4.4 Example: The celebrated coin example

Assume you have a coin, you toss it 10 times and get $x = 10$ heads. Is it biased?

Let $\theta$ be the probability of getting a head; then we can test $H_0: \theta = \frac{1}{2}$.

The (two-sided) p-value is $2\Pr(X \ge 10|H_0) = 2^{-9} \approx 0.002$ and the hypothesis is rejected.

In a Bayesian framework, we test $H_0$ versus $H_1: \theta \sim U\left[\frac{1}{2}, 1\right]$ using

$$B_{10} = \frac{\int_{1/2}^1 2\,\theta^x(1-\theta)^{10-x}\,d\theta}{\left(\frac{1}{2}\right)^x\left(\frac{1}{2}\right)^{10-x}} = \frac{\int_{1/2}^1 2\,\theta^{10}\,d\theta}{\left(\frac{1}{2}\right)^{10}} = \frac{2^{11}-1}{11} \approx 186.$$
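
Checking the value numerically (a sketch, not part of the slides):

```python
from scipy.integrate import quad

# Coin-example Bayes factor: x = 10 heads in n = 10 tosses,
# H1: theta ~ U[1/2, 1] (density 2), H0: theta = 1/2.
x, n = 10, 10
m1, _ = quad(lambda t: 2.0 * t**x * (1 - t)**(n - x), 0.5, 1.0)  # pi(x|H1)
m0 = 0.5**n                                                       # pi(x|H0)
print(m1 / m0, (2**11 - 1) / 11)   # both about 186
```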



4.5 Testing the mean of a Gaussian

   
Assume you have $X|\theta, \sigma^2 \sim \mathcal{N}(\theta, \sigma^2)$ where $\sigma^2$ is assumed known but the parameter $\theta$ is unknown.

We want to test $H_0: \theta = 0$ vs $H_1: \theta \sim \mathcal{N}(0, \tau^2)$; then

$$B_{10}(x) = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int \mathcal{N}(x; \theta, \sigma^2)\,\mathcal{N}(\theta; 0, \tau^2)\,d\theta}{f(x|0)} = \sqrt{\frac{\sigma^2}{\sigma^2+\tau^2}}\,\exp\left(\frac{\tau^2 x^2}{2\sigma^2(\sigma^2+\tau^2)}\right).$$

Alternatively, if $\pi(H_0) = \rho = 1 - \pi(H_1)$ then

$$\pi(H_0|x) = \pi(\theta = 0|x) = \left(1 + \frac{1-\rho}{\rho}\,B_{10}(x)\right)^{-1}.$$
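
A sketch checking the closed form against numerical integration (hypothetical $x$, $\sigma^2$, $\tau^2$):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Closed-form Bayes factor vs. numerical integration of the marginal
# likelihood under H1.
x, ssq, tausq = 1.5, 1.0, 4.0
m1, _ = quad(lambda t: stats.norm.pdf(x, t, np.sqrt(ssq))
                       * stats.norm.pdf(t, 0.0, np.sqrt(tausq)), -50, 50)
m0 = stats.norm.pdf(x, 0.0, np.sqrt(ssq))
closed = np.sqrt(ssq / (ssq + tausq)) * np.exp(
    tausq * x**2 / (2 * ssq * (ssq + tausq)))
print(m1 / m0, closed)                     # both about 1.10
```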



4.5 Testing the mean of a Gaussian

The Bayes factor depends heavily on $\tau^2$. As $\tau^2 \to \infty$, the prior becomes uninformative, but then $B_{10}(x) \to 0$ whatever $x$ is, and $\pi(H_0|x) \to 1$.

We will see this next week, but using vague priors for model selection is a very, very bad idea... (Lindley's paradox).
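
A numerical illustration of the paradox (hypothetical values): for fixed $x$, $B_{10}(x)$ decays to 0 as $\tau^2$ grows.

```python
import numpy as np

# Lindley's paradox: for fixed data x, the Bayes factor B10 vanishes as
# the prior variance tau^2 grows, so the vague prior favours H0.
def bayes_factor(x, ssq, tausq):
    return np.sqrt(ssq / (ssq + tausq)) * np.exp(
        tausq * x**2 / (2 * ssq * (ssq + tausq)))

x, ssq = 2.0, 1.0                          # x is two "sigmas" away from 0
for tausq in (1.0, 100.0, 1e4, 1e8):
    b10 = bayes_factor(x, ssq, tausq)
    print(tausq, b10, 1.0 / (1.0 + b10))   # pi(H0|x) -> 1 with rho = 1/2
```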

