
# A Gentle Tutorial in Bayesian Statistics


## Warning

This talk includes equations. The tutorial should still be accessible even if the equations look hard.


## Outline of the Talk

- The need for (statistical) modelling; two examples (a linear model / tractography)
- Introduction to statistical inference (frequentist)
- Introduction to the Bayesian approach to parameter estimation
- More examples and Bayesian inference in practice
- Conclusions


## Use of Statistics in Clinical Sciences (1)

Examples include:

- sample size determination;
- comparison between two (or more) groups: t-tests, Z-tests, analysis of variance (ANOVA), tests for proportions, etc.


## Use of Statistics in Clinical Sciences (2)

One of the best ways to describe some data is by fitting a (statistical) model. Examples include: (linear/logistic/log-linear) regression models; survival analysis; longitudinal data analysis; infectious disease modelling; image/shape analysis; ...


## Aims of Statistical Modelling: A Simple Example

Perhaps we can fit a straight line?

$$y = \alpha + \beta x + \text{error}$$

[Figure: scatter plot of the response (y), ranging roughly from 0.2 to 1.0, against the explanatory variable (x).]
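As a quick illustration of fitting the straight line above, here is a minimal least-squares sketch in pure Python. The data are simulated; the "true" values α = 0.2, β = 0.6 and the noise level are invented for the example.

```python
import random
import statistics

# Simulate data from the straight-line model y = alpha + beta*x + error.
random.seed(0)
x = [random.uniform(0.0, 1.0) for _ in range(200)]
y = [0.2 + 0.6 * xi + random.gauss(0.0, 0.05) for xi in x]

# Closed-form least-squares estimates for simple linear regression:
#   beta_hat = cov(x, y) / var(x),  alpha_hat = mean(y) - beta_hat * mean(x)
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sxy / sxx
alpha_hat = y_bar - beta_hat * x_bar
print(alpha_hat, beta_hat)   # estimates close to the true (0.2, 0.6)
```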

## An Example in DW-MRI

Suppose that we are interested in tractography. We use the diffusion tensor to model local diffusion within a voxel. The (model) assumption made is that local diffusion can be modelled with a 3D Gaussian distribution whose variance-covariance matrix is proportional to the diffusion tensor, D.


## An Example in DW-MRI

The resulting diffusion-weighted signal S_i along a gradient direction g_i with b-value b_i is modelled as:

$$S_i = S_0 \exp\{-b_i\, \mathbf{g}_i^T D\, \mathbf{g}_i\}, \qquad \text{where} \quad D = \begin{pmatrix} D_{11} & D_{12} & D_{13} \\ D_{21} & D_{22} & D_{23} \\ D_{31} & D_{32} & D_{33} \end{pmatrix} \tag{1}$$

S_0 is the signal with no diffusion-weighting gradients applied (i.e. b = 0). The eigenvectors of D give an orthogonal coordinate system and define the orientation of the ellipsoid axes; the eigenvalues of D give the lengths of these axes. If we sort the eigenvalues by magnitude we can derive the orientation of the major axis of the ellipsoid and the orientations of the minor axes.

## An Example in DW-MRI

Although this may look a bit complicated, it can actually be written in terms of a linear model.
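To illustrate the point, here is a hedged sketch (assuming NumPy is available) of how taking logs turns the tensor model into a linear model: log(S_i/S_0) = -b_i g_i^T D g_i is linear in the six unique entries of the symmetric tensor D. The b-value, gradient directions, and "true" tensor below are invented for the example.

```python
import numpy as np

# Made-up acquisition: a prolate "true" tensor, one b-value, 30 unit directions.
rng = np.random.default_rng(1)
D_true = np.diag([1.7e-3, 0.3e-3, 0.3e-3])        # diffusivities in mm^2/s
b = 1000.0                                         # b-value in s/mm^2
g = rng.normal(size=(30, 3))
g /= np.linalg.norm(g, axis=1, keepdims=True)      # unit gradient directions

# Noiseless log-signal: log(S_i/S0) = -b * g_i^T D g_i.
log_signal = np.array([-b * gi @ D_true @ gi for gi in g])

# Design matrix: columns for (Dxx, Dyy, Dzz, Dxy, Dxz, Dyz), one row per direction.
A = np.column_stack([
    -b * g[:, 0] ** 2, -b * g[:, 1] ** 2, -b * g[:, 2] ** 2,
    -2 * b * g[:, 0] * g[:, 1], -2 * b * g[:, 0] * g[:, 2],
    -2 * b * g[:, 1] * g[:, 2],
])
d_hat, *_ = np.linalg.lstsq(A, log_signal, rcond=None)
print(d_hat[:3])   # recovers the diagonal of D_true
```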


## Aims of Statistical Modelling

Models have parameters, some of which (if not all) are unknown, e.g. α and β. In statistical modelling we are interested in inferring (e.g. estimating) the unknown parameters from data: inference. Parameter estimation needs to be done in a formal way. In other words, we ask ourselves the question: what are the best values for α and β such that the proposed model (the straight line) best describes the observed data? Should we only look for a single estimate for (α, β)? No! Why? Because there may be many pairs (α, β) (often not very different from each other) which describe the data equally well: uncertainty.

## The likelihood function

The likelihood function plays a fundamental role in statistical inference. In non-technical terms, the likelihood function evaluated at a particular point, say (α_0, β_0), gives the probability of observing the (observed) data given that the parameters (α, β) take the values α_0 and β_0. Let's think of a very simple example. Suppose we are interested in estimating the probability of success (denoted by θ) for one particular experiment. Data: out of 100 times we repeated the experiment, we observed 80 successes. What about L(0.1), L(0.7), L(0.99)?
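For the 80-out-of-100 example, the binomial likelihood can be evaluated directly. A minimal sketch:

```python
from math import comb

def likelihood(theta, successes=80, n=100):
    # Binomial likelihood: probability of the observed data given theta.
    return comb(n, successes) * theta**successes * (1 - theta)**(n - successes)

# The likelihood is largest near the observed proportion 0.8.
for theta in (0.1, 0.7, 0.8, 0.99):
    print(theta, likelihood(theta))
```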


## Classical (Frequentist) Inference

Frequentist inference tells us that:

- we should look for parameter values that maximise the likelihood function: the maximum likelihood estimator (MLE);
- we associate parameter uncertainty with the calculation of standard errors...
- ...which in turn enable us to construct confidence intervals for the parameters.

What's wrong with that? Nothing, but... it is approximate, counter-intuitive (the data are assumed to be random, the parameter fixed) and often mathematically intractable.
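A small sketch of this frequentist recipe for the same binomial example: the MLE, its standard error, and an approximate (Wald) 95% confidence interval.

```python
from math import sqrt

successes, n = 80, 100
theta_mle = successes / n                       # maximiser of the likelihood
se = sqrt(theta_mle * (1 - theta_mle) / n)      # standard error of the MLE
ci = (theta_mle - 1.96 * se, theta_mle + 1.96 * se)
print(theta_mle, se, ci)   # 0.8, 0.04, approximately (0.7216, 0.8784)
```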

## Classical (Frequentist) Inference - Some Issues

1. What is the probability that the (unknown) probability of success in the previous experiment is greater than 0.6? I.e. compute the quantity P(θ > 0.6)...
2. ...or something like P(0.3 < θ < 0.9).

Sometimes we are interested in functions of the parameters, e.g. θ_1 + θ_2, or the ratio of odds (θ_1/(1 - θ_1)) / (θ_2/(1 - θ_2)).

Whilst in some cases the frequentist approach offers a solution that is approximate rather than exact, there are others where it cannot do so, or where it is very hard to.

## Bayesian Inference

When drawing inference within a Bayesian framework, the data are treated as a fixed quantity and the parameters are treated as random variables. This allows us to assign probabilities to parameters (and models), making the inferential framework far more intuitive and more straightforward (at least in principle!).


## Bayesian Inference (2)

Denote by θ the parameters and by y the observed data. Bayes' theorem allows us to write:

$$\pi(\theta \mid y) = \frac{\pi(y \mid \theta)\, \pi(\theta)}{\pi(y)} = \frac{\pi(y \mid \theta)\, \pi(\theta)}{\int \pi(y \mid \theta')\, \pi(\theta')\, d\theta'}$$

where

- π(θ|y) denotes the posterior distribution of the parameters given the data;
- π(y|θ) = L(θ) is the likelihood function;
- π(θ) is the prior distribution of θ, which expresses our beliefs about the parameters before we see the data;
- π(y) is often called the marginal likelihood and plays the role of the normalising constant of the posterior density.
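For the binomial example the posterior is available in closed form thanks to conjugacy: with a Beta(a, b) prior and binomial data, the posterior is Beta(a + successes, b + failures). A minimal sketch, assuming a uniform Beta(1, 1) prior and the 80/100 data:

```python
# Conjugate Bayesian update for a binomial proportion.
a, b = 1.0, 1.0                  # uniform Beta(1, 1) prior
successes, failures = 80, 20
a_post, b_post = a + successes, b + failures
post_mean = a_post / (a_post + b_post)
print(a_post, b_post, post_mean)   # Beta(81, 21) posterior, mean 81/102
```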

## Bayesian vs Frequentist Inference

Everything is assigned a distribution (prior, posterior); we are allowed to incorporate prior information about the parameter... which is then updated using the likelihood function... leading to the posterior distribution, which tells us everything we need to know about the parameter.


## Bayesian Inference: The Prior

One of the biggest criticisms of the Bayesian paradigm is the use of the prior distribution:

- choose a very informative prior and you can come up with favourable results;
- "I know nothing about the parameter; what prior do I choose?"

Arguments against this criticism:

- priors should be chosen before we see the data, and it is very often the case that there is some prior information available (e.g. previous studies);
- if we know nothing about the parameter, we can assign it a so-called uninformative (or vague) prior;
- if there is a lot of data available, the posterior distribution will not be influenced (too much) by the prior, and vice versa.
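A small numerical illustration of the last point, using the conjugate beta-binomial posterior mean; the informative Beta(30, 70) prior (centred on 0.3) is made up for the example.

```python
def posterior_mean(a, b, successes, n):
    # Posterior mean of theta under a Beta(a, b) prior with binomial data
    # (conjugacy: the posterior is Beta(a + successes, b + n - successes)).
    return (a + successes) / (a + b + n)

# Vague Beta(1, 1) prior vs an informative Beta(30, 70) prior centred on 0.3.
small_vague = posterior_mean(1, 1, 8, 10)          # little data: 8/10
small_inform = posterior_mean(30, 70, 8, 10)
large_vague = posterior_mean(1, 1, 800, 1000)      # lots of data: 800/1000
large_inform = posterior_mean(30, 70, 800, 1000)
print(small_vague, small_inform)   # 0.75 vs ~0.345: the prior dominates
print(large_vague, large_inform)   # ~0.799 vs ~0.755: the data dominate
```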

## Bayesian Inference: The Posterior

Although Bayesian inference has been around for a long time, it is only in the last two decades that it has really revolutionised the way we do statistical modelling. Although, in principle, Bayesian inference is straightforward and intuitive, when it comes to computation it can be very hard to implement. Thanks to computational developments such as Markov chain Monte Carlo (MCMC), doing Bayesian inference is now a lot easier.
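As an illustration of the idea behind MCMC (a sketch of the general technique, not of any particular package), here is a minimal random-walk Metropolis sampler for the binomial success probability under a uniform prior; analytically the posterior is Beta(81, 21) with mean 81/102 ≈ 0.794.

```python
import random
from math import log

random.seed(42)

def log_posterior(theta, successes=80, n=100):
    # Log-posterior up to a constant: binomial log-likelihood + flat prior.
    if not 0.0 < theta < 1.0:
        return float("-inf")                       # outside the prior's support
    return successes * log(theta) + (n - successes) * log(1.0 - theta)

theta, samples = 0.5, []
for _ in range(20000):
    proposal = theta + random.gauss(0.0, 0.05)     # symmetric random-walk step
    # Accept with probability min(1, posterior ratio).
    if log(random.random()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)

est = sum(samples[5000:]) / len(samples[5000:])    # discard burn-in
print(est)   # close to the analytic posterior mean 81/102
```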


## Bayesian Inference: Some Examples

83/100 successes; interested in the probability of success θ.

[Figure: posterior density of θ plotted over [0, 1].]

## Bayesian Inference: Some Examples

8/10 successes; interested in the probability of success θ.

[Figure: posterior density of θ plotted over [0, 1], wider than in the 83/100 case.]
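Questions like those on the frequentist-issues slide, e.g. P(θ > 0.6), are straightforward to answer from the posterior. A sketch using a simple grid approximation for the 83/100 example with a uniform prior:

```python
# Grid approximation to the posterior for theta with a uniform prior and
# 83/100 successes: posterior proportional to theta^83 * (1 - theta)^17.
N = 100_000
grid = [(i + 0.5) / N for i in range(N)]
weights = [t**83 * (1 - t)**17 for t in grid]      # unnormalised posterior
total = sum(weights)
prob_above_06 = sum(w for t, w in zip(grid, weights) if t > 0.6) / total
print(prob_above_06)   # essentially 1: theta > 0.6 is almost certain
```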

## Comparing Different Hypotheses: Bayesian Model Choice

Suppose that we are interested in testing two competing model hypotheses, M_1 and M_2. Within a Bayesian framework, the model index M can be treated as an extra parameter (alongside the other parameters in M_1 and M_2). So it is natural to ask: what is the posterior model probability given the observed data, i.e. P(M_1 | y) or P(M_2 | y)? Bayes' theorem:

$$P(M_1 \mid y) = \frac{\pi(y \mid M_1)\, P(M_1)}{\pi(y)}$$

where π(y|M_1) is the marginal likelihood (also called the evidence), and P(M_1) is the prior model probability.

## Bayesian Model Choice (2)

Given a model selection problem in which we have to choose between two models on the basis of observed data y, the plausibility of the two different models M_1 and M_2, parametrised by model parameter vectors θ_1 and θ_2, is assessed by the Bayes factor:

$$\frac{P(y \mid M_1)}{P(y \mid M_2)} = \frac{\int \pi(y \mid \theta_1, M_1)\, \pi(\theta_1)\, d\theta_1}{\int \pi(y \mid \theta_2, M_2)\, \pi(\theta_2)\, d\theta_2}$$

Bayesian model comparison does not depend on a single set of parameter values for each model. Instead, it considers the probability of each model averaged over all possible parameter values. This is similar to a likelihood-ratio test, but instead of maximizing the likelihood, we average over the parameters.
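In toy cases the marginal likelihoods are available in closed form, so the Bayes factor needs no MCMC. A sketch comparing, for the 80/100 data, a model that fixes θ = 0.5 against one with a uniform prior on θ (both models are invented for illustration):

```python
from math import comb

# Toy Bayes factor: M1 fixes theta = 0.5 ("fair coin"); M2 puts a uniform
# prior on theta. Both marginal likelihoods are available in closed form.
k, n = 80, 100
marginal_m1 = comb(n, k) * 0.5**n              # theta fixed at 0.5
marginal_m2 = 1.0 / (n + 1)                    # integral of the beta-binomial
bayes_factor = marginal_m2 / marginal_m1
print(bayes_factor)   # large: the data strongly favour M2 over M1
```

The second marginal likelihood uses the identity ∫ C(n,k) θ^k (1-θ)^(n-k) dθ = 1/(n+1).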

## Bayesian Model Choice (3)

Why bother? An advantage of the use of Bayes factors is that they automatically, and quite naturally, include a penalty for too much model structure, and thus guard against overfitting. No free lunch, though! In practical situations, the calculation of Bayes factors relies on computationally intensive methods, such as Reversible-Jump Markov chain Monte Carlo (RJ-MCMC), which require a certain amount of expertise from the end-user.


## An Example in DW-MRI Analysis

We assume that the voxel's intensity can be modelled by assuming that S_i / S_0 ~ N(μ_i, σ²), where we could consider (at least) two different models:

1. The Diffusion Tensor Model (Model 1) assumes that:

$$\mu_i = \exp\{-b_i\, \mathbf{g}_i^T D\, \mathbf{g}_i\}$$

2. The Simple Partial Volume Model (Model 2) assumes that:

$$\mu_i = f \exp\{-b_i d\} + (1 - f) \exp\{-b_i d\, \mathbf{g}_i^T C\, \mathbf{g}_i\}$$


## An Example in DW-MRI Analysis (2)

Suppose that we have some measurements (intensities) for each voxel. We could fit the two different models (on the same dataset). Question: how do we tell which model fits the data best, taking into account the uncertainty associated with the parameters in each model? Answer: calculate the Bayes factor!


## Conclusions

Quantification of the uncertainty both in parameter estimation and in model choice is essential in any modelling exercise. A Bayesian approach offers a natural framework for dealing with parameter and model uncertainty: it offers much more than a single best fit or any sort of sensitivity analysis. There is no free lunch, unfortunately; to do fancy things, one often has to write one's own computer programs. Software available: R, WinBUGS, BayesX, ...
