Stat 535 C - Statistical Computing & Monte Carlo Methods
Arnaud Doucet
Email: arnaud@cs.ubc.ca
CS students: don't forget to re-register in CS-535D.
Even if you just audit this course, please do register.
2.1 Outline
Bayesian Statistics.
Testing Hypotheses: The Bayesian way.
Bayesian Model Selection.
3.1 Ingredients of Bayesian Inference
Given the prior $\pi(\theta)$ and the likelihood $l(\theta|x) = f(x|\theta)$, Bayes' formula yields
$$\pi(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta)\,\pi(\theta)\,d\theta}.$$
It represents all the information on $\theta$ that can be extracted from $x$.
It satisfies the sufficiency and likelihood principles.
On average (with respect to $X$), observing the data reduces the uncertainty about $\theta$; i.e.
$$\mathbb{E}\left[\mathrm{var}\left[\theta|X\right]\right] = \mathrm{var}\left[\theta\right] - \mathrm{var}\left[\mathbb{E}\left[\theta|X\right]\right] \leq \mathrm{var}\left[\theta\right].$$
3.2 Variance Decomposition Identity
If $(\theta, X)$ are two scalar random variables then we have
$$\mathrm{var}(\theta) = \mathbb{E}(\mathrm{var}(\theta|X)) + \mathrm{var}(\mathbb{E}(\theta|X)).$$
Proof:
$$\begin{aligned}
\mathrm{var}(\theta) &= \mathbb{E}\left[\theta^2\right] - \left(\mathbb{E}\left[\theta\right]\right)^2\\
&= \mathbb{E}\left[\mathbb{E}\left[\theta^2|X\right]\right] - \left(\mathbb{E}\left[\mathbb{E}\left[\theta|X\right]\right]\right)^2\\
&= \mathbb{E}\left[\mathbb{E}\left[\theta^2|X\right] - \left(\mathbb{E}\left[\theta|X\right]\right)^2\right] + \mathbb{E}\left[\left(\mathbb{E}\left[\theta|X\right]\right)^2\right] - \left(\mathbb{E}\left[\mathbb{E}\left[\theta|X\right]\right]\right)^2\\
&= \mathbb{E}(\mathrm{var}(\theta|X)) + \mathrm{var}(\mathbb{E}(\theta|X)).
\end{aligned}$$
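A quick Monte Carlo sanity check of this identity, sketched in Python with an illustrative Normal-Normal pair (the model and constants are not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10**6

# Illustrative conjugate model: theta ~ N(0, 1), X | theta ~ N(theta, 1).
theta = rng.normal(0.0, 1.0, N)
x = rng.normal(theta, 1.0)

# For this model E[theta | X] = X / 2 and var(theta | X) = 1 / 2 (standard conjugate result).
post_mean = x / 2.0
post_var = 0.5

lhs = theta.var()                  # var(theta), should be close to 1
rhs = post_var + post_mean.var()   # E[var(theta | X)] + var(E[theta | X])
print(lhs, rhs)                    # both close to 1
```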
3.3 Be careful
Such results appear attractive but one should be careful.
Here there is an underlying assumption that the observations are indeed distributed according to $\pi(x) = \int \pi(\theta)\, f(x|\theta)\, d\theta$.
3.4 Simple Binomial example
(Bayes, 1764): A billiard ball $W$ is rolled on a line of length one, with a uniform probability of stopping anywhere. It stops at $\theta$.
A second ball $O$ is then rolled $n$ times under the same assumptions and $X$ denotes the number of times the ball $O$ stopped to the left of $W$.
Given $X$, what inference can we make on $\theta$?
We have $X|\theta \sim \mathcal{B}(n, \theta)$, the binomial distribution, and select $\theta \sim \mathcal{U}[0, 1]$, so
$$\Pr(X = x|\theta) = f(x|\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \qquad \pi(\theta|x) = \frac{\theta^x(1-\theta)^{n-x}\,\mathbb{1}_{[0,1]}(\theta)}{\int_0^1 \theta^x(1-\theta)^{n-x}\,d\theta}.$$
3.4 Simple Binomial example
We have
$$\pi(x) = \int_0^1 \Pr(X = x|\theta)\,\pi(\theta)\,d\theta = \frac{1}{n+1} \quad \text{for } x = 0, \ldots, n.$$
It follows that $\pi(\theta|x) = \mathcal{B}e(x+1, n+1-x)$.
Prediction. Given $X = x$, you roll the ball once more and $\Pr(Y = 1|\theta) = \theta$, then
$$\Pr(Y = 1|x) = \int \Pr(Y = 1|\theta, x)\,\pi(\theta|x)\,d\theta = \int \theta\,\pi(\theta|x)\,d\theta = \mathbb{E}[\theta|x] = \frac{x+1}{n+2}.$$
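A short numerical illustration of the Beta posterior and the predictive probability (Python sketch; the values $n = 10$, $x = 7$ are arbitrary, not from the slides):

```python
import numpy as np
from scipy import stats

n, x = 10, 7                                # arbitrary illustrative values
posterior = stats.beta(x + 1, n + 1 - x)    # theta | x ~ Be(x+1, n+1-x)

# Exact predictive probability of one more success: (x+1)/(n+2) = E[theta | x]
print((x + 1) / (n + 2), posterior.mean())

# Monte Carlo check: average Pr(Y = 1 | theta) = theta over posterior draws
rng = np.random.default_rng(1)
draws = posterior.rvs(size=10**5, random_state=rng)
print(draws.mean())
```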
3.4 Simple Binomial example
Application. Laplace independently developed such a model.
From 1745 to 1770, 241,945 girls and 251,527 boys were born in Paris. Let $\theta$ be the probability that any birth is female; then, with $n = 251{,}527 + 241{,}945$,
$$\Pr(\theta \geq 0.5 \,|\, x = 241{,}945) \approx 1.15 \times 10^{-42}.$$
Remark: This is completely different from a p-value. We do not integrate over observations we have never seen.
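This probability can be checked numerically from the Beta posterior implied by the uniform prior; a minimal sketch:

```python
from scipy import stats

girls, boys = 241_945, 251_527
n = girls + boys

# Posterior under a uniform prior: theta | x ~ Be(girls + 1, n - girls + 1)
posterior = stats.beta(girls + 1, n - girls + 1)

# Pr(theta >= 0.5 | x): survival function of the Beta posterior at 0.5
print(posterior.sf(0.5))   # extremely small, of the order quoted above
```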
3.5 A Simple Gaussian example
Consider $X_1|\theta \sim \mathcal{N}(\theta, \sigma^2)$ and $\theta \sim \mathcal{N}(m_0, \sigma_0^2)$. Then
$$\begin{aligned}
\pi(\theta|x_1) &\propto f(x_1|\theta)\,\pi(\theta) \propto \exp\left(-\frac{(x_1-\theta)^2}{2\sigma^2} - \frac{(\theta-m_0)^2}{2\sigma_0^2}\right)\\
&\propto \exp\left(-\frac{1}{2}\left[\left(\frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}\right)\theta^2 - 2\theta\left(\frac{x_1}{\sigma^2} + \frac{m_0}{\sigma_0^2}\right)\right]\right)\\
&\propto \exp\left(-\frac{1}{2\sigma_1^2}(\theta - m_1)^2\right),
\end{aligned}$$
so that $\theta|x_1 \sim \mathcal{N}(m_1, \sigma_1^2)$ with
$$\frac{1}{\sigma_1^2} = \frac{1}{\sigma^2} + \frac{1}{\sigma_0^2}, \quad \text{i.e. } \sigma_1^2 = \frac{\sigma_0^2\sigma^2}{\sigma_0^2 + \sigma^2}, \qquad m_1 = \sigma_1^2\left(\frac{x_1}{\sigma^2} + \frac{m_0}{\sigma_0^2}\right).$$
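A minimal sketch of this conjugate update (the numerical values are arbitrary illustrations):

```python
def gaussian_posterior(x1, m0, s0_sq, s_sq):
    # Posterior of theta given x1 ~ N(theta, s_sq) under the prior theta ~ N(m0, s0_sq):
    # 1/s1_sq = 1/s_sq + 1/s0_sq,  m1 = s1_sq * (x1/s_sq + m0/s0_sq)
    s1_sq = 1.0 / (1.0 / s_sq + 1.0 / s0_sq)
    m1 = s1_sq * (x1 / s_sq + m0 / s0_sq)
    return m1, s1_sq

# Arbitrary illustrative numbers
print(gaussian_posterior(x1=1.2, m0=0.0, s0_sq=4.0, s_sq=1.0))
```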
3.5 A Simple Gaussian example
To predict the distribution of a new observation $X|\theta \sim \mathcal{N}(\theta, \sigma^2)$ in light of $x_1$, we use the predictive distribution
$$f(x|x_1) = \int f(x|\theta)\,\pi(\theta|x_1)\,d\theta.$$
We can do direct calculations or alternatively use the fact that $f(x|x_1)$ is Gaussian, so it is characterized by its mean and variance: writing $X = \theta + V$ with $V \sim \mathcal{N}(0, \sigma^2)$ independent of $\theta$,
$$\mathbb{E}[X|x_1] = \mathbb{E}[\theta + V|x_1] = \mathbb{E}[\theta|x_1] = m_1,$$
$$\mathrm{var}[X|x_1] = \mathrm{var}[\theta + V|x_1] = \mathrm{var}[\theta|x_1] + \mathrm{var}[V] = \sigma_1^2 + \sigma^2.$$
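A sketch checking the predictive mean and variance by simulation (same arbitrary numbers as in the previous sketch):

```python
import numpy as np

rng = np.random.default_rng(2)

m0, s0_sq, s_sq, x1 = 0.0, 4.0, 1.0, 1.2    # arbitrary illustrative values
s1_sq = 1.0 / (1.0 / s_sq + 1.0 / s0_sq)    # posterior variance
m1 = s1_sq * (x1 / s_sq + m0 / s0_sq)       # posterior mean

# Exact predictive: X | x1 ~ N(m1, s1_sq + s_sq)
print(m1, s1_sq + s_sq)

# Monte Carlo check: draw theta from the posterior, then X | theta
theta = rng.normal(m1, np.sqrt(s1_sq), 10**5)
x_new = rng.normal(theta, np.sqrt(s_sq))
print(x_new.mean(), x_new.var())
```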
3.5 A Simple Gaussian example
Now assume that you observe a realization $x_2$ of $X_2|\theta \sim \mathcal{N}(\theta, \sigma^2)$. Then you are now interested in
$$\pi(\theta|x_1, x_2) \propto f(x_2|\theta)\,f(x_1|\theta)\,\pi(\theta) \propto f(x_2|\theta)\,\pi(\theta|x_1) \propto f(x_1|\theta)\,\pi(\theta|x_2).$$
Updating the prior one observation at a time, or with all observations together, does not matter.
The sequential approach can be useful for massive datasets. In this case, at time $n$,
$$\pi(\theta|x_1, \ldots, x_n) \propto f(x_n|\theta)\,\pi(\theta|x_1, \ldots, x_{n-1});$$
i.e. the prior at time $n$ is the posterior at time $n-1$.
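A sketch contrasting sequential and batch updating on synthetic data (arbitrary constants), using the conjugate formulas above:

```python
import numpy as np

rng = np.random.default_rng(3)
m0, s0_sq, s_sq = 0.0, 4.0, 1.0                # arbitrary prior and noise variance
xs = rng.normal(0.7, np.sqrt(s_sq), size=50)   # synthetic observations

# Sequential updating: the prior at time n is the posterior at time n-1
m, v = m0, s0_sq
for x in xs:
    v_new = 1.0 / (1.0 / s_sq + 1.0 / v)
    m = v_new * (x / s_sq + m / v)
    v = v_new

# Batch posterior computed in one shot
v_batch = 1.0 / (1.0 / s0_sq + len(xs) / s_sq)
m_batch = v_batch * (xs.sum() / s_sq + m0 / s0_sq)

print(m, v)
print(m_batch, v_batch)   # identical up to rounding
```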
3.6 Simple Gaussian example: Bayes vs ML
The ML estimate of $\theta$ at time $n$ is simply
$$\widehat{\theta}_{ML} = \arg\sup_{\theta} \prod_{i=1}^n f(x_i|\theta) = \frac{1}{n}\sum_{i=1}^n x_i.$$
The posterior of $\theta$ at time $n$ is
$$\theta|x_1, \ldots, x_n \sim \mathcal{N}(m_n, \sigma_n^2)$$
where
$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}, \quad \text{i.e. } \sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2 + \sigma^2} \approx \frac{\sigma^2}{n}, \qquad m_n = \sigma_n^2\left(\frac{\sum_{i=1}^n x_i}{\sigma^2} + \frac{m_0}{\sigma_0^2}\right) \approx \frac{\sum_{i=1}^n x_i}{n}.$$
Asymptotically in $n$ the prior is washed out by the data and
$$\mathbb{E}[\theta|x_1, \ldots, x_n] = m_n \approx \widehat{\theta}_{ML}.$$
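A small sketch on synthetic data (arbitrary constants) showing the posterior mean $m_n$ approaching the ML estimate as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)
m0, s0_sq, s_sq, theta_true = 0.0, 1.0, 1.0, 2.0   # arbitrary illustrative values

for n in (10, 100, 10_000):
    xs = rng.normal(theta_true, np.sqrt(s_sq), size=n)
    v_n = 1.0 / (1.0 / s0_sq + n / s_sq)
    m_n = v_n * (xs.sum() / s_sq + m0 / s0_sq)
    print(n, m_n, xs.mean())   # m_n gets closer to the ML estimate (sample mean)
```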
3.7 Bayes vs ML
However, keep in mind that the information provided by a Bayesian approach is much richer.
You can compute, for example, posterior probabilities $\Pr(\theta \in A|x_1, \ldots, x_n)$ or $\mathrm{var}(\theta|x_1, \ldots, x_n)$,
or compute the distributions of future observations $f(x|x_1, \ldots, x_n)$.
ML can be reassuring because of consistency and efficiency.
For finite sample sizes, do you really care?
For time series models, for example, there is no such thing.
3.8 A Simple Poisson Model
Assume you have some counting observations $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{P}(\lambda)$; i.e.
$$f(x_i|\lambda) = e^{-\lambda}\frac{\lambda^{x_i}}{x_i!}.$$
Assume we adopt a Gamma prior for $\lambda$; i.e. $\lambda \sim \mathcal{G}a(\alpha, \beta)$,
$$\pi(\lambda) = \mathcal{G}a(\lambda; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\lambda^{\alpha-1}e^{-\beta\lambda}.$$
We have
$$\pi(\lambda|x_1, \ldots, x_n) = \mathcal{G}a\left(\lambda;\, \alpha + \sum_{i=1}^n x_i,\; \beta + n\right).$$
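A sketch of this Gamma-Poisson update on synthetic counts (arbitrary values; note that scipy parameterizes the Gamma by shape and scale $= 1/\beta$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
alpha, beta = 2.0, 1.0               # illustrative Ga(alpha, beta) prior
xs = rng.poisson(lam=3.0, size=20)   # synthetic counts, arbitrary rate

# Conjugate update: lambda | x_1, ..., x_n ~ Ga(alpha + sum x_i, beta + n)
posterior = stats.gamma(a=alpha + xs.sum(), scale=1.0 / (beta + len(xs)))
print(posterior.mean(), posterior.std())
```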
4.1 Testing hypotheses in a Bayesian framework
Consider the problem where we have $\pi(\theta) = \mathcal{U}[0, 1]$ and
$$\Pr(X = x|\theta) = \binom{n}{x}\theta^x(1-\theta)^{n-x}, \quad \text{then} \quad \pi(\theta|x) = \mathcal{B}e(x+1, n+1-x).$$
If we want to test $H_0: \theta \geq \tfrac{1}{2}$ vs $H_1: \theta < \tfrac{1}{2}$ then, in a Bayesian approach, you can simply compute
$$\pi(H_0|x) = 1 - \pi(H_1|x) = \int_{1/2}^{1} \pi(\theta|x)\,d\theta.$$
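A minimal sketch of this computation (illustrative $n$ and $x$, not from the slides):

```python
from scipy import stats

n, x = 10, 7                                # arbitrary illustrative values
posterior = stats.beta(x + 1, n + 1 - x)    # theta | x ~ Be(x+1, n+1-x)

pi_H0 = posterior.sf(0.5)                   # Pr(theta >= 1/2 | x)
print(pi_H0, 1.0 - pi_H0)                   # pi(H0 | x) and pi(H1 | x)
```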
Golden rule of Bayesians: Thou shalt not integrate with respect to observations (except for design...).
Contrary to frequentists, your test is never based on observations you don't observe.
4.2 Bayes Factors
More generally, one wants to compare two hypotheses $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_1$; then the prior is
$$\pi(\theta) = \pi(H_0)\,\pi_0(\theta) + \pi(H_1)\,\pi_1(\theta)$$
where $\pi(H_0) + \pi(H_1) = 1$.
In the previous example, $\pi_0(\theta) = \mathcal{U}\left[\tfrac{1}{2}, 1\right]$ and $\pi_1(\theta) = \mathcal{U}\left[0, \tfrac{1}{2}\right]$ and $\pi(H_0) = \pi(H_1) = \tfrac{1}{2}$.
To compare $H_0$ versus $H_1$, we typically compute the Bayes factor, which partially eliminates the influence of the prior modelling (i.e. $\pi(H_i)$):
$$B_{10} = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int f(x|\theta)\,\pi_1(\theta)\,d\theta}{\int f(x|\theta)\,\pi_0(\theta)\,d\theta} = \frac{\pi(H_1|x)}{\pi(H_0|x)}\cdot\frac{\pi(H_0)}{\pi(H_1)}.$$
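A sketch computing this Bayes factor by numerical integration for the half-uniform priors above (illustrative data), and checking it against the posterior-odds identity:

```python
from scipy import stats
from scipy.integrate import quad

n, x = 10, 7                                       # arbitrary illustrative values
lik = lambda th: stats.binom.pmf(x, n, th)

# pi_0 = U[1/2, 1] and pi_1 = U[0, 1/2], each with density 2 on its support
m_H0, _ = quad(lambda th: 2.0 * lik(th), 0.5, 1.0)
m_H1, _ = quad(lambda th: 2.0 * lik(th), 0.0, 0.5)
B10 = m_H1 / m_H0
print(B10)

# Check: with pi(H0) = pi(H1) = 1/2, B10 equals the posterior odds under the U[0,1] prior
post = stats.beta(x + 1, n + 1 - x)
print(post.cdf(0.5) / post.sf(0.5))
```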
4.3 Towards Bayes Model Selection
Bayes factors are not limited to the comparison of models with the same parameter space.
Assume you have some data and two statistical models.
Under $H_0$, $\theta_0 \in \Theta_0$, the prior is $\pi_0(\theta_0)$ and the likelihood is $f_0(x|\theta_0)$; under $H_1$, $\theta_1 \in \Theta_1$, the prior is $\pi_1(\theta_1)$ and the likelihood is $f_1(x|\theta_1)$; then
$$B_{10} = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int f_1(x|\theta_1)\,\pi_1(\theta_1)\,d\theta_1}{\int f_0(x|\theta_0)\,\pi_0(\theta_0)\,d\theta_0}.$$
One can have $\Theta_0 = \mathbb{R}$ and $\Theta_1 = \mathbb{R}^{1000}$.
4.3 Towards Bayes Model Selection
Jeffreys' scale of evidence says that
- if $\log_{10}(B_{10})$ varies between 0 and 0.5, the evidence against $H_0$ is poor,
- if it is between 0.5 and 1, it is substantial,
- if it is between 1 and 2, it is strong, and
- if it is above 2, it is decisive.
The Bayes factor tells you whether one should prefer $H_0$ to $H_1$: it does NOT tell you whether any of these models is sensible!
4.3 Towards Bayes Model Selection
Bayes procedures can be directly used to test a point null hypothesis; i.e. $H_0: \theta = \theta_0$ (that is, $\pi_0(\theta) = \delta_{\theta_0}(\theta)$) versus $H_1: \theta \in \Theta_1$, where the prior is then defined as
$$\pi(\theta) = \pi(H_0)\,\pi_0(\theta) + \pi(H_1)\,\pi_1(\theta).$$
The associated Bayes factor is simply
$$B_{10}(x) = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int f(x|\theta)\,\pi_1(\theta)\,d\theta}{f(x|\theta_0)}.$$
4.4 Example: The celebrated coin example
Assume you have a coin, you toss it 10 times and get $x = 10$ heads. Is it biased?
Let $\theta$ be the probability of getting a head; then we can test $H_0: \theta = \tfrac{1}{2}$.
The two-sided p-value $2\Pr(X \geq 10|H_0) = 2^{-9} \approx 0.002$ and the hypothesis is rejected.
In a Bayesian framework, we test $H_0$ versus $H_1: \theta \sim \mathcal{U}\left[\tfrac{1}{2}, 1\right]$ using
$$B_{10} = \frac{2\int_{1/2}^{1}\theta^x(1-\theta)^{10-x}\,d\theta}{\left(\tfrac{1}{2}\right)^x\left(\tfrac{1}{2}\right)^{10-x}} = \frac{2\int_{1/2}^{1}\theta^{10}\,d\theta}{\left(\tfrac{1}{2}\right)^{10}} = \frac{2047}{11} \approx 186.$$
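A quick numerical check of the Bayes factor above (assuming the $\mathcal{U}\left[\tfrac{1}{2}, 1\right]$ prior has density 2 on its support):

```python
from scipy.integrate import quad

n, x = 10, 10
# Marginal likelihood under H1: theta ~ U[1/2, 1], density 2 on [1/2, 1]
m_H1, _ = quad(lambda th: 2.0 * th**x * (1.0 - th)**(n - x), 0.5, 1.0)
# Likelihood under H0: theta = 1/2
m_H0 = 0.5**n

print(m_H1 / m_H0)   # Bayes factor B10 = 2047/11, roughly 186
```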
4.5 Testing the mean of a Gaussian
Assume you have $X|\theta, \sigma^2 \sim \mathcal{N}(\theta, \sigma^2)$ where $\sigma^2$ is assumed known but $\theta$ (the parameter) is unknown.
We want to test $H_0: \theta = 0$ vs $H_1: \theta \sim \mathcal{N}(0, \tau^2)$; then
$$B_{10}(x) = \frac{\pi(x|H_1)}{\pi(x|H_0)} = \frac{\int \mathcal{N}(x; \theta, \sigma^2)\,\mathcal{N}(\theta; 0, \tau^2)\,d\theta}{f(x|0)} = \sqrt{\frac{\sigma^2}{\sigma^2+\tau^2}}\,\exp\left(\frac{\tau^2 x^2}{2\sigma^2(\sigma^2+\tau^2)}\right).$$
Alternatively, if $\pi(H_0) = \rho = 1 - \pi(H_1)$ then
$$\pi(H_0|x) = \pi(\theta = 0|x) = \left(1 + \frac{1-\rho}{\rho}\,B_{10}(x)\right)^{-1}.$$
4.5 Testing the mean of a Gaussian
The Bayes factor depends heavily on $\tau^2$. As $\tau^2 \to \infty$, the prior becomes uninformative, but then $B_{10}(x) \to 0$ whatever $x$ is, and $\pi(H_0|x) \to 1$.
We will see this next week, but using vague priors for model selection is a very, very bad idea... (Lindley's paradox).
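A sketch of this behaviour, using the closed-form $B_{10}(x)$ derived above (prior mean 0 and $\rho = 1/2$; $x$ and $\sigma^2$ are arbitrary illustrative values):

```python
import numpy as np

def bayes_factor(x, sigma_sq, tau_sq):
    # B10(x) for H0: theta = 0 vs H1: theta ~ N(0, tau_sq), with X | theta ~ N(theta, sigma_sq)
    return np.sqrt(sigma_sq / (sigma_sq + tau_sq)) * np.exp(
        tau_sq * x**2 / (2.0 * sigma_sq * (sigma_sq + tau_sq)))

x, sigma_sq = 2.0, 1.0
for tau_sq in (1.0, 10.0, 1e4, 1e8):
    b10 = bayes_factor(x, sigma_sq, tau_sq)
    print(tau_sq, b10, 1.0 / (1.0 + b10))   # B10 -> 0 and pi(H0 | x) -> 1 as tau_sq grows
```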