Slides giving a concise explanation of Bayesian inference; a course also useful for advanced finance (risk, derivatives markets, etc.).

Aug 11, 2015

Tom Loredo

Dept. of Astronomy, Cornell University

http://www.astro.cornell.edu/staff/loredo/bayes/

1 / 111

Outline

1 The Big Picture

2 Foundations: Logic & Probability Theory

3 Inference With Parametric Models

Parameter Estimation

Model Uncertainty

4 Simple Examples

Binary Outcomes

Normal Distribution

Poisson Distribution

5 Measurement Error Applications

6 Bayesian Computation

7 Probability & Frequency

8 Endnotes: Hotspots, tools, reflections

2 / 111

Outline

1 The Big Picture

2 Foundations: Logic & Probability Theory

3 Inference With Parametric Models

Parameter Estimation

Model Uncertainty

4 Simple Examples

Binary Outcomes

Normal Distribution

Poisson Distribution

5 Measurement Error Applications

6 Bayesian Computation

7 Probability & Frequency

8 Endnotes: Hotspots, tools, reflections

3 / 111

Scientific Method

Science is more than a body of knowledge; it is a way of thinking.

The method of science, as stodgy and grumpy as it may seem,

is far more important than the findings of science.

Carl Sagan

Scientists argue!

Argument ≡ Collection of statements comprising an act of reasoning from premises to a conclusion

A key goal of science: Explain or predict quantitative

measurements (data!)

Data analysis constructs and appraises arguments that reason from

data to interesting scientific conclusions (explanations, predictions)

4 / 111

We don't just tabulate data, we analyze data.

We gather data so they may speak for or against existing

hypotheses, and guide the formation of new hypotheses.

A key role of data in science is to be among the premises in

scientific arguments.

5 / 111

Data Analysis

Building & Appraising Arguments Using Data

Statistical inference is but one of several interacting modes of analyzing data.

6 / 111

A different approach to all statistical inference problems (i.e., not just another method in the list: BLUE, maximum likelihood, χ² testing, ANOVA, survival analysis . . . )

Uses probability theory to quantify the strength of data-based arguments (i.e., a more abstract view than restricting PT to describe variability in repeated "random" experiments)

Focuses on deriving consequences of modeling assumptions rather than devising and calibrating procedures

7 / 111

I find conclusion C based on data Dobs . . .

Frequentist assessment

It was found with a procedure that's right 95% of the time over the set {Dhyp} that includes Dobs.

Probabilities are properties of procedures, not of particular

results.

Bayesian assessment

The strength of the chain of reasoning from Dobs to C is 0.95, on a scale where 1 = certainty.

Probabilities are properties of specific results.

Performance must be separately evaluated (and is typically

good by frequentist criteria).

8 / 111

Outline

1 The Big Picture

2 Foundations: Logic & Probability Theory

3 Inference With Parametric Models

Parameter Estimation

Model Uncertainty

4 Simple Examples

Binary Outcomes

Normal Distribution

Poisson Distribution

5 Measurement Error Applications

6 Bayesian Computation

7 Probability & Frequency

8 Endnotes: Hotspots, tools, reflections

9 / 111

LogicSome Essentials

Logic can be defined as the analysis and appraisal of arguments

Gensler, Intro to Logic

Build arguments with propositions and logical operators/connectives:

Propositions: statements that may be true or false, e.g., about a model parameter Ω:

B : Ω is not 0

B̄ : not B, i.e., Ω = 0

Connectives:

A ∧ B : A and B are both true

A ∨ B : A or B is true, or both are

10 / 111

Arguments

Argument: Assertion that an hypothesized conclusion, H, follows from premises, P = {A, B, C, . . .} (take "," ≡ "and")

Notation:

H|P : Premises P imply H

H follows from P

Central role of arguments → special terminology for true/false:

A true argument is valid; a false argument is invalid or fallacious

11 / 111

Content vs. form

An argument is factually correct iff all of its premises are true (it has good content).

An argument is valid iff its conclusion follows from its premises (it has good form).

An argument is sound iff it is factually correct and valid (it has good form and content).

We want to make sound arguments. Formal logic and probability theory address validity, but there is no formal approach for addressing factual correctness → there is always a subjective element to an argument.

12 / 111

Factual Correctness

Although logic can teach us something about validity and

invalidity, it can teach us very little about factual correctness. The

question of the truth or falsity of individual statements is primarily

the subject matter of the sciences.

Hardegree, Symbolic Logic

To test the truth or falsehood of premisses is the task of

science. . . . But as a matter of fact we are interested in, and must

often depend upon, the correctness of arguments whose premisses

are not known to be true.

Copi, Introduction to Logic

13 / 111

Premises

Facts: observed data, well-confirmed laws

"Obvious" assumptions: e.g., Euclid's first 4 postulates (line segment b/t 2 points; congruency of right angles . . . )

"Reasonable" or "working" assumptions: e.g., Euclid's fifth postulate (parallel lines). Desperate presumption!

Conclusions from other arguments

14 / 111

Deduction: Syllogism as prototype

Premise 1: A implies H

Premise 2: A is true

Deduction: H is true

H|P is valid

Induction: Analogy as prototype

Premise 1: A, B, C , D, E all share properties x, y , z

Premise 2: F has properties x, y

Induction: F has property z

F has z|P is not strictly valid, but may still be rational

(likely, plausible, probable); some such arguments are stronger

than others

Boolean algebra (and/or/not over {0, 1}) quantifies deduction.

Bayesian probability theory (and/or/not over [0, 1]) generalizes this

to quantify the strength of inductive arguments.

15 / 111

P(H|P) ≡ strength of argument H|P

P = 0 → Argument is invalid

P = 1 → Argument is valid

AND (product rule):

P(A ∧ B|P) = P(A|P) P(B|A ∧ P) = P(B|P) P(A|B ∧ P)

OR (sum rule):

P(A ∨ B|P) = P(A|P) + P(B|P) − P(A ∧ B|P)

NOT:

P(¬A|P) = 1 − P(A|P)

16 / 111

If we like there is no harm in saying that a probability expresses a

degree of reasonable belief. . . . Degree of confirmation has been

used by Carnap, and possibly avoids some confusion. But whatever

verbal expression we use to try to convey the primitive idea, this

expression cannot amount to a definition. Essentially the notion

can only be described by reference to instances where it is used. It

is intended to express a kind of relation between data and

consequence that habitually arises in science and in everyday life,

and the reader should be able to recognize the relation from

examples of the circumstances when it arises.

Sir Harold Jeffreys, Scientific Inference

17 / 111

More On Interpretation

Physics uses words drawn from ordinary language (mass, weight, momentum, force, temperature, heat, etc.), but their technical meaning is more abstract than their colloquial meaning. We can map between the colloquial and abstract meanings associated with specific values by using specific instances as calibrators.

A Thermal Analogy

Intuitive notion    Quantification     Calibration
Hot, cold           Temperature, T     Boiling hot = 373 K
Uncertainty         Probability, P     Certainty = 0, 1
                                       p = 1/36: plausible as "snake's eyes"
                                       p = 1/1024: plausible as 10 heads

18 / 111

Bayesian

Probability quantifies uncertainty in an inductive inference. p(x)

describes how probability is distributed over the possible values x

might have taken in the single case before us:

p is distributed; x has a single, uncertain value

Frequentist

Probabilities are always (limiting) rates/proportions/frequencies in an ensemble. p(x) describes variability: how the values of x are distributed among the cases in the ensemble:

x is distributed

19 / 111

Arguments Relating

Hypotheses, Data, and Models

We seek to appraise scientific hypotheses in light of observed data

and modeling assumptions.

Consider the data and modeling assumptions to be the premises of

an argument with each of various hypotheses, Hi , as conclusions:

Hi |Dobs , I . (I = background information, everything deemed

relevant besides the observed data)

P(Hi |Dobs , I ) measures the degree to which (Dobs , I ) allow one to

deduce Hi . It provides an ordering among arguments for various Hi

that share common premises.

Probability theory tells us how to analyze and appraise the

argument, i.e., how to calculate P(Hi |Dobs , I ) from simpler,

hopefully more accessible probabilities.

20 / 111

Assess hypotheses by calculating their probabilities p(Hi | . . .)

conditional on known and/or presumed information using the

rules of probability theory.

Probability Theory Axioms:

OR (sum rule):

P(H1 ∨ H2|I) = P(H1|I) + P(H2|I) − P(H1, H2|I)

AND (product rule):

P(H1, D|I) = P(H1|I) P(D|H1, I) = P(D|I) P(H1|D, I)

NOT:

P(¬H1|I) = 1 − P(H1|I)

21 / 111

Bayes's Theorem (BT)

Consider P(Hi, Dobs|I) using the product rule:

P(Hi, Dobs|I) = P(Hi|I) P(Dobs|Hi, I)

             = P(Dobs|I) P(Hi|Dobs, I)

→ P(Hi|Dobs, I) = P(Hi|I) P(Dobs|Hi, I) / P(Dobs|I)

With the data considered fixed, the factors have names:

posterior ∝ prior × likelihood

norm. const. P(Dobs|I) = prior predictive

22 / 111
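A minimal numerical sketch of the theorem, with made-up hypotheses and likelihood values (not from the slides), just to show the mechanics:

```python
# Two exclusive, exhaustive hypotheses with equal priors; all numbers
# here are invented for illustration.
prior = {"H1": 0.5, "H2": 0.5}          # P(Hi|I)
likelihood = {"H1": 0.10, "H2": 0.30}   # P(Dobs|Hi, I)

# Normalization constant: the prior predictive P(Dobs|I)
prior_predictive = sum(prior[h] * likelihood[h] for h in prior)

# Bayes's theorem: posterior = prior x likelihood / prior predictive
posterior = {h: prior[h] * likelihood[h] / prior_predictive for h in prior}

print(round(posterior["H1"], 2), round(posterior["H2"], 2))  # 0.25 0.75
```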

Law of Total Probability (LTP)

Consider exclusive, exhaustive {Bi} (I asserts one of them must be true):

Σi P(A, Bi|I) = Σi P(Bi|A, I) P(A|I) = P(A|I)

             = Σi P(Bi|I) P(A|Bi, I)

If we do not see how to get P(A|I) directly, we can find a set {Bi} and use it as a basis ("extend the conversation"):

P(A|I) = Σi P(Bi|I) P(A|Bi, I)

If our problem already has {Bi} in it, we can get P(A|I) from the joint probabilities ("marginalization"):

P(A|I) = Σi P(A, Bi|I)

23 / 111

Example: Take A = Dobs; then

P(Dobs|I) = Σi P(Dobs, Hi|I)

          = Σi P(Hi|I) P(Dobs|Hi, I)

the prior predictive for Dobs (a.k.a. marginal likelihood)

Normalization

For exclusive, exhaustive Hi,

Σi P(Hi| · · · ) = 1

24 / 111
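The two uses of LTP can be sketched with a made-up three-way partition (all numbers invented):

```python
# Exclusive, exhaustive {B1, B2, B3}; invented numbers for illustration.
p_B = [0.2, 0.5, 0.3]            # P(Bi|I); sums to 1
p_A_given_B = [0.9, 0.4, 0.1]    # P(A|Bi, I)

# "Extend the conversation": P(A|I) = sum_i P(Bi|I) P(A|Bi, I)
p_A = sum(pb * pa for pb, pa in zip(p_B, p_A_given_B))

# Marginalization recovers the same value from the joint: sum_i P(A, Bi|I)
joint = [pb * pa for pb, pa in zip(p_B, p_A_given_B)]

print(round(p_A, 2), round(sum(joint), 2))  # 0.41 0.41
```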

Well-Posed Problems

The rules express desired probabilities in terms of other

probabilities.

To get a numerical value out, at some point we have to put

numerical values in.

Direct probabilities are probabilities with numerical values

determined directly by premises (via modeling assumptions,

symmetry arguments, previous calculations, desperate

presumption . . . ).

An inference problem is well posed only if all the needed

probabilities are assignable based on the premises. We may need to

add new assumptions as we see what needs to be assigned. We

may not be entirely comfortable with what we need to assume!

(Remember Euclid's fifth postulate!)

Should explore how results depend on uncomfortable assumptions

(robustness).

25 / 111

Recap

Bayesian inference is more than BT

Bayesian inference quantifies uncertainty by reporting

probabilities for things we are uncertain of, given specified

premises.

It uses all of probability theory, not just (or even primarily)

Bayess theorem.

It conditions on premises that you know or are willing to presume to be true (for the sake of the argument!).

When several pieces of information bear on a hypothesis, its appraisal must account for all of them.

26 / 111

Outline

1 The Big Picture

2 Foundations: Logic & Probability Theory

3 Inference With Parametric Models

Parameter Estimation

Model Uncertainty

4 Simple Examples

Binary Outcomes

Normal Distribution

Poisson Distribution

5 Measurement Error Applications

6 Bayesian Computation

7 Probability & Frequency

8 Endnotes: Hotspots, tools, reflections

27 / 111

Each model Mi has parameters θi and implies a sampling dist'n (conditional predictive dist'n for possible data):

p(D|θi, Mi)

The θi dependence when we fix attention on the observed data is the likelihood function:

Li(θi) ≡ p(Dobs|θi, Mi)

We may be uncertain about i (model uncertainty) or θi (parameter uncertainty).

28 / 111

Parameter Estimation

Premise = choice of model (pick specific i)

What can we say about θi?

Model Assessment

Model comparison: What can we say about i?

Model adequacy/GoF: Is M1 adequate?

Model Averaging

Models share some common params: θi = {φ, ηi}

What can we say about φ w/o committing to one model?

(Systematic error is an example)

29 / 111

Parameter Estimation

Problem statement

I = Model M with parameters θ (+ any add'l info)

Hi = statements about θ; e.g. "θ ∈ [2.5, 3.5]", or "θ > 0"

Probability for any such statement can be found using a probability density function (PDF) for θ:

P(θ ∈ [θ, θ + dθ]| · · · ) = f(θ) dθ = p(θ| · · · ) dθ

Posterior probability density via Bayes's theorem:

p(θ|D, M) = p(θ|M) L(θ) / ∫ dθ p(θ|M) L(θ)

30 / 111

Summaries of posterior

Best fit values:

Mode, θ̂: maximizes p(θ|D, M)

Posterior mean, ⟨θ⟩ = ∫ dθ θ p(θ|D, M)

Uncertainties:

Credible region Δ of probability C:

C = P(θ ∈ Δ|D, M) = ∫Δ dθ p(θ|D, M)

Highest Posterior Density (HPD) region has p(θ|D, M) higher inside than outside

Posterior standard deviation, variance, covariances

Marginal distributions:

Interesting parameters φ, nuisance parameters η

Marginal dist'n for φ: p(φ|D, M) = ∫ dη p(φ, η|D, M)

31 / 111

Often a model has parameters that are required to compute the likelihood but are not among those of ultimate interest: nuisance parameters.

Example

We have data from measuring a rate r = s + b that is a sum of an interesting signal s and a background b.

We have additional data just about b.

What do the data tell us about s?

32 / 111

Marginalize b to summarize implications for s:

p(s|D, M) = ∫ db p(s, b|D, M)

          ∝ p(s|M) ∫ db p(b|s) L(s, b)

          = p(s|M) Lm(s)

with the marginal likelihood

Lm(s) ≡ ∫ db p(b|s) L(s, b) ≈ p(b̂s|s) L(s, b̂s) δbs

where b̂s is the best b given s and δbs is the b uncertainty given s: p(b̂s|s) δbs is a parameter space volume factor.

E.g., Gaussians: ŝ = r̂ − b̂,  σs² = σr² + σb²

Background subtraction is a special case of background marginalization.

33 / 111

Model Comparison

Problem statement

I = (M1 ∨ M2 ∨ . . .): Specify a set of models.

Hi = Mi: Hypothesis chooses a model.

Posterior probability for a model:

p(Mi|D, I) = p(Mi|I) p(D|Mi, I) / p(D|I)

           ∝ p(Mi|I) L(Mi)

But L(Mi) = p(D|Mi) = ∫ dθi p(θi|Mi) p(D|θi, Mi).

Likelihood for a model = average likelihood of its parameter values:

L(Mi) = ⟨L(θi)⟩

Marginal likelihood = (Weight of) Evidence for model

34 / 111

A ratio of probabilities for two propositions using the same premises is called the odds favoring one over the other:

Oij ≡ p(Mi|D, I) / p(Mj|D, I)

    = [p(Mi|I) / p(Mj|I)] × [p(D|Mi, I) / p(D|Mj, I)]

The data-dependent part is the Bayes factor:

Bij ≡ p(D|Mi, I) / p(D|Mj, I)

It is a likelihood ratio; the "Bayes factor" terminology is usually reserved for cases when the likelihoods are marginal/average likelihoods.

35 / 111

Predictive probabilities can favor simpler models

[Figure: prior predictive dist'ns P(D|H) vs. possible data D for a simple hypothesis (narrow, tall) and a complicated hypothesis (broad, low); at Dobs the simpler hypothesis can have the larger predictive probability p(D|Mi).]

36 / 111

[Figure: likelihood L(θ), of width δθ, inside a broader prior p(θ), of width Δθ.]

p(D|Mi) = ∫ dθi p(θi|Mi) L(θi) ≈ L(θ̂i) δθi/Δθi

        = Maximum Likelihood × Occam Factor

Models with more parameters often make the data more probable for the best fit.

The Occam factor penalizes models for "wasted" volume of parameter space.

Quantifies intuition that models shouldn't require fine-tuning.

37 / 111

Model Averaging

Problem statement

I = (M1 ∨ M2 ∨ . . .): Specify a set of models

Models all share a set of "interesting" parameters, φ

Each has a different set of nuisance parameters ηi (or different prior info about them)

Hi = statements about φ

Model averaging

Calculate posterior PDF for φ:

p(φ|D, I) = Σi p(Mi|D, I) p(φ|D, Mi)

          ∝ Σi L(Mi) ∫ dηi p(φ, ηi|D, Mi)

38 / 111

Bayesian calculations sum/integrate over parameter/hypothesis

space!

(Frequentist calculations average over sample space & typically optimize

over parameter space.)

Marginalization weights the likelihood by a volume factor for the nuisance parameters.

Likelihood ratios are supplemented by parameter space volume factors (Occam factors).

Many important features of Bayesian methods amount to accounting for the size of parameter space. This idea does not arise naturally in frequentist statistics (but it can be added "by hand").

39 / 111

Prior has two roles:

Convert likelihood from "intensity" to "measure"

Account for size of hypothesis space

Physical analogy

Heat:        Q = ∫ dV cv(r) T(r)

Probability: P ∝ ∫ dθ p(θ|I) L(θ)

Maximum likelihood focuses on the hottest hypotheses; Bayes focuses on the hypotheses with the most heat.

A high-T region may contain little heat if its cv is low or if its volume is small.

A high-L region may contain little probability if its prior is low or if its volume is small.

40 / 111

Three theorems: BT, LTP, Normalization

Calculations characterized by parameter space integrals

Credible regions, posterior expectations

Marginalization over nuisance parameters

Occams razor via marginal likelihoods

41 / 111

Outline

1 The Big Picture

2 Foundations: Logic & Probability Theory

3 Inference With Parametric Models

Parameter Estimation

Model Uncertainty

4 Simple Examples

Binary Outcomes

Normal Distribution

Poisson Distribution

5 Measurement Error Applications

6 Bayesian Computation

7 Probability & Frequency

8 Endnotes: Hotspots, tools, reflections

42 / 111

Binary Outcomes:

Parameter Estimation

M = Existence of two outcomes, S and F; each trial has same probability α for S or F

Hi = statements about α, the probability for success on the next trial → seek p(α|D, M)

D = Sequence of results from N observed trials:

FFSSSSFSSSFS (n = 8 successes in N = 12 trials)

Likelihood:

p(D|α, M) = p(failure|α, M)^(N−n) × p(success|α, M)^n

          = α^n (1 − α)^(N−n)

          = L(α)

43 / 111

Prior

Starting with no information about α beyond its definition, use as an "uninformative" prior p(α|M) = 1. Justifications:

Intuition: Don't prefer any interval to any other of same size

Bayes's justification: "Ignorance" means that before doing the N trials, we have no preference for how many will be successes:

P(n successes|M) = 1/(N + 1)  →  p(α|M) = 1

Consider this a convention: an assumption added to M to make the problem well posed.

44 / 111

Prior Predictive

p(D|M) = ∫ dα α^n (1 − α)^(N−n)

       = B(n + 1, N − n + 1) = n!(N − n)! / (N + 1)!

A Beta integral: B(a, b) ≡ ∫ dx x^(a−1) (1 − x)^(b−1) = Γ(a)Γ(b)/Γ(a + b).

45 / 111

Posterior

p(α|D, M) = (N + 1)! / [n!(N − n)!] × α^n (1 − α)^(N−n)

A beta distribution, with summaries:

Best-fit: α̂ = n/N = 2/3;  ⟨α⟩ = (n + 1)/(N + 2) ≈ 0.64

Uncertainty: σα = sqrt[(n + 1)(N − n + 1) / ((N + 2)²(N + 3))] ≈ 0.12

Find credible regions numerically, or with incomplete beta function

Note that the posterior depends on the data only through n, not the N binary numbers describing the sequence.

n is a (minimal) sufficient statistic.

46 / 111
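The summary numbers quoted above can be reproduced directly from these formulas for the example data (n = 8, N = 12):

```python
from math import sqrt

# Flat-prior binomial posterior summaries for n = 8 successes in N = 12 trials
n, N = 8, 12

mode = n / N                                   # best fit (posterior mode)
mean = (n + 1) / (N + 2)                       # posterior mean
sigma = sqrt((n + 1) * (N - n + 1) / ((N + 2) ** 2 * (N + 3)))

print(round(mode, 3), round(mean, 2), round(sigma, 2))  # 0.667 0.64 0.12
```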

47 / 111

Equal Probabilities?

M1: α = 1/2

M2: α ∈ [0, 1] with flat prior.

Maximum Likelihoods

M1: p(D|M1) = 1/2^N = 2.44 × 10⁻⁴

M2: L(α̂) = (2/3)^8 (1/3)^4 = 4.82 × 10⁻⁴

p(D|M1) / p(D|α̂, M2) = 0.51

48 / 111

Marginal likelihoods:

p(D|M1) = 1/2^N

p(D|M2) = n!(N − n)! / (N + 1)!

and the Bayes factor:

B12 ≡ p(D|M1) / p(D|M2) = (N + 1)! / [n!(N − n)! 2^N] = 1.57

Note that for n = 6, B12 = 2.93; for this small amount of

data, we can never be very sure results are equiprobable.

If n = 0, B12 1/315; if n = 2, B12 1/4.8; for extreme

data, 12 flips can be enough to lead us to strongly suspect

outcomes have different probabilities.

(Frequentist significance tests can reject null for any sample size.)

49 / 111
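A short script reproducing these Bayes factors from the two marginal likelihoods:

```python
from math import factorial

# N = 12 flips with n heads: M1 fixes alpha = 1/2, M2 has a flat prior on alpha
def bayes_factor(n, N):
    pD_M1 = 0.5 ** N                                        # 1/2^N
    pD_M2 = factorial(n) * factorial(N - n) / factorial(N + 1)
    return pD_M1 / pD_M2                                    # B12

print(round(bayes_factor(8, 12), 2))      # 1.57
print(round(bayes_factor(6, 12), 2))      # 2.93
print(round(1 / bayes_factor(2, 12), 1))  # 4.8
print(round(1 / bayes_factor(0, 12)))     # 315
```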

Suppose D = n (number of heads in N trials), rather than the actual sequence. What is p(α|n, M)?

Likelihood

Let S = a sequence of flips with n heads.

p(n|α, M) = Σ_S p(S|α, M) p(n|S, α, M)

          = α^n (1 − α)^(N−n) Cn,N    [summing over sequences with # successes = n]

Cn,N = # of sequences of length N with n heads:

p(n|α, M) = N! / [n!(N − n)!] × α^n (1 − α)^(N−n)

50 / 111

Posterior

p(α|n, M) = N!/[n!(N − n)!] α^n (1 − α)^(N−n) / p(n|M)

p(n|M) = N!/[n!(N − n)!] ∫ dα α^n (1 − α)^(N−n)

       = 1/(N + 1)

p(α|n, M) = (N + 1)!/[n!(N − n)!] α^n (1 − α)^(N−n)

51 / 111

Suppose D = N, the number of trials it took to obtain a predefined number of successes, n = 8. What is p(α|N, M)?

Likelihood

p(N|α, M) is the probability for n − 1 successes in N − 1 trials, times the probability that the final trial is a success:

p(N|α, M) = (N − 1)!/[(n − 1)!(N − n)!] α^(n−1) (1 − α)^(N−n) × α

          = (N − 1)!/[(n − 1)!(N − n)!] α^n (1 − α)^(N−n)

52 / 111

Posterior

p(α|D, M) = C′n,N α^n (1 − α)^(N−n) / p(D|M)

p(D|M) = C′n,N ∫ dα α^n (1 − α)^(N−n)

p(α|D, M) = (N + 1)!/[n!(N − n)!] α^n (1 − α)^(N−n)

Same result as when the data specified the actual sequence.

53 / 111

Suppose D = (N, n), the number of samples and number of successes in an observing run whose total number was determined by the weather at the telescope. What is p(α|D, M′)?

(M′ adds info about weather to M.)

Likelihood

p(D|α, M′) is the binomial distribution times the probability that the weather allowed N samples, W(N):

p(D|α, M′) = W(N) × N!/[n!(N − n)!] α^n (1 − α)^(N−n)

Let Cn,N = W(N) (N choose n). The α dependence is the same as before, so the weather factor drops out on normalization and we get the same posterior.

54 / 111

Likelihood Principle

To define L(Hi ) = p(Dobs |Hi , I ), we must contemplate what other

data we might have obtained. But the real sample space may be

determined by many complicated, seemingly irrelevant factors; it

may not be well-specified at all. Should this concern us?

Likelihood principle: The result of inferences depends only on how

p(Dobs |Hi , I ) varies w.r.t. hypotheses. We can ignore aspects of the

observing/sampling procedure that do not affect this dependence.

This is a sensible property that frequentist methods do not share.

Frequentist probabilities are long run rates of performance, and

depend on details of the sample space that are irrelevant in a

Bayesian calculation.

Example: Predict 10% of sample is Type A; observe nA = 5 for N = 96

Significance test accepts α = 0.1 for binomial sampling; P(> χ² | α = 0.1) = 0.12

Significance test rejects α = 0.1 for negative binomial sampling; P(> χ² | α = 0.1) = 0.03

55 / 111
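The contrast can be checked directly: the binomial and negative binomial likelihoods differ only by an α-independent combinatorial factor, so with the same prior the Bayesian posteriors (unlike the significance tests above) agree exactly. A sketch:

```python
from math import comb

# n = 8 successes; binomial stops at N = 12 trials, negative binomial stops
# at the 8th success (here occurring on trial 12)
n, N = 8, 12
alphas = [0.1 * k for k in range(1, 10)]

binom = [comb(N, n) * a ** n * (1 - a) ** (N - n) for a in alphas]
negbin = [comb(N - 1, n - 1) * a ** n * (1 - a) ** (N - n) for a in alphas]

ratios = [b / g for b, g in zip(binom, negbin)]
# Constant ratio -> same alpha dependence -> identical posteriors
print(round(min(ratios), 4), round(max(ratios), 4))  # 1.5 1.5
```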

Gaussian PDF

p(x|μ, σ) = [1/(σ√(2π))] e^(−(x−μ)²/2σ²)    over [−∞, ∞]

Parameters

μ = ⟨x⟩ ≡ ∫ dx x p(x|μ, σ)

σ² = ⟨(x − μ)²⟩ ≡ ∫ dx (x − μ)² p(x|μ, σ)

56 / 111

Suppose our data consist of N measurements, di = μ + εi. Suppose the noise contributions are independent, and εi ∼ N(0, σ²).

p(D|μ, σ, M) = Πi p(di|μ, σ, M)

            = Πi p(εi = di − μ | μ, σ, M)

            = Πi [1/(σ√(2π))] exp[−(di − μ)²/(2σ²)]

            = [1/(σ^N (2π)^(N/2))] e^(−Q(μ)/2σ²)

57 / 111

Find the μ dependence of Q by completing the square:

Q = Σi (di − μ)²

  = Σi di² + Nμ² − 2Nμd̄       where d̄ ≡ (1/N) Σi di

  = N(μ − d̄)² + Nr²            where r² ≡ (1/N) Σi (di − d̄)²

The likelihood depends on the data only through d̄ and r:

L(μ, σ) = [1/(σ^N (2π)^(N/2))] exp(−Nr²/2σ²) exp(−N(μ − d̄)²/2σ²)

This is a miraculous compression of information: the normal dist'n is highly abnormal in this respect!

58 / 111

Problem specification

Model: di = μ + εi, εi ∼ N(0, σ²), σ is known → I = (σ, M).

Parameter space: μ; seek p(μ|D, σ, M)

Likelihood

p(D|μ, σ, M) = [1/(σ^N (2π)^(N/2))] exp(−Nr²/2σ²) exp(−N(μ − d̄)²/2σ²)

            ∝ exp(−N(μ − d̄)²/2σ²)

59 / 111

"Uninformative" prior

Translation invariance ⇒ p(μ) ∝ C, a constant.

This prior is improper unless bounded.

Prior predictive/normalization

p(D|σ, M) = ∫ dμ C exp(−N(μ − d̄)²/2σ²)

          = C (σ/√N) √(2π)

60 / 111

Posterior

p(μ|D, σ, M) = [1/((σ/√N)√(2π))] exp(−N(μ − d̄)²/2σ²)

A Gaussian PDF for μ, with mean d̄ and standard deviation σ/√N.

Note that C drops out → limit of infinite prior range is well behaved.

61 / 111

Use a normal prior, μ ∼ N(μ0, w0²)

Posterior

Normal: N(μ̃, w̃²), but mean and std. deviation "shrink" towards the prior.

Define B = w²/(w² + w0²), with w = σ/√N, so B < 1 and B ≈ 0 when w0 is large. Then

μ̃ = (1 − B) d̄ + B μ0

w̃ = w √(1 − B)

"Principle of stable estimation": the prior affects estimates only when data are not informative relative to prior.

62 / 111
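A numerical sketch of the shrinkage formulas, with made-up data summaries (d̄, σ, N) and prior (μ0, w0):

```python
from math import sqrt

# Invented numbers for illustration
dbar, sigma, N = 5.0, 2.0, 16      # sample mean, noise std, sample size
mu0, w0 = 0.0, 1.0                 # prior N(mu0, w0^2)

w = sigma / sqrt(N)                # likelihood width: 0.5
B = w ** 2 / (w ** 2 + w0 ** 2)    # shrinkage factor
mu_tilde = (1 - B) * dbar + B * mu0
w_tilde = w * sqrt(1 - B)

print(round(B, 3), round(mu_tilde, 3), round(w_tilde, 3))  # 0.2 4.0 0.447
```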

Problem specification

Model: di = μ + εi, εi ∼ N(0, σ²), σ is unknown

Parameter space: (μ, σ); seek p(μ|D, M)

Likelihood

p(D|μ, σ, M) = [1/(σ^N (2π)^(N/2))] exp(−Nr²/2σ²) exp(−N(μ − d̄)²/2σ²)

            ∝ (1/σ^N) e^(−Q/2σ²)

where Q = N[r² + (μ − d̄)²]

63 / 111

Uninformative Priors

Assume priors for μ and σ are independent.

Translation invariance ⇒ p(μ) ∝ C, a constant.

Scale invariance ⇒ p(σ) ∝ 1/σ (flat in log σ).

Joint posterior:

p(μ, σ|D, M) ∝ (1/σ^(N+1)) e^(−Q(μ)/2σ²)

64 / 111

Marginal Posterior

p(μ|D, M) ∝ ∫₀^∞ dσ (1/σ^(N+1)) e^(−Q/2σ²)

Let τ = Q/(2σ²), so σ = [Q/(2τ)]^(1/2) and dσ ∝ Q^(1/2) τ^(−3/2) dτ. Then

p(μ|D, M) ∝ Q^(−N/2) ∫ dτ τ^(N/2 − 1) e^(−τ)

         ∝ Q^(−N/2)

65 / 111

Write Q = Nr² [1 + ((μ − d̄)/r)²] and normalize:

p(μ|D, M) = [Γ(N/2) / (Γ((N − 1)/2) √π r)] × [1 + ((μ − d̄)/r)²]^(−N/2)

A "bell curve," but with power-law tails: a Student's t dist'n.

Large N:

p(μ|D, M) ∝ e^(−N(μ−d̄)²/2r²)    approximately

66 / 111
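A grid-integration check (pure Python, no special libraries) that this power-law-tailed density with the Γ-function normalization integrates to 1; the values of N, d̄, and r here are made up:

```python
from math import gamma, pi, sqrt

# Made-up data summaries
N, dbar, r = 10, 3.0, 1.5

# Closed-form normalization constant (Student's t with nu = N - 1)
K = gamma(N / 2) / (gamma((N - 1) / 2) * sqrt(pi) * r)

def density(mu):
    return K * (1 + ((mu - dbar) / r) ** 2) ** (-N / 2)

# Trapezoid rule over a wide grid; the power-law tails decay fast enough
lo, hi, m = dbar - 40 * r, dbar + 40 * r, 200001
h = (hi - lo) / (m - 1)
vals = [density(lo + i * h) for i in range(m)]
total = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

print(round(total, 4))  # ≈ 1.0
```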

Problem: Observe n counts in T ; infer rate, r

Likelihood

L(r) ≡ p(n|r, M) = (rT)^n e^(−rT) / n!

Prior

Two simple standard choices (or conjugate gamma dist'n):

r known to be nonzero; it is a scale parameter:

p(r|M) = [1 / ln(ru/rl)] (1/r)

r may vanish; use a flat prior:

p(r|M) = 1/ru

67 / 111

Prior predictive

p(n|M) = (1/ru) (1/n!) ∫₀^ru dr (rT)^n e^(−rT)

       = [1/(ru T)] (1/n!) ∫₀^(ru T) d(rT) (rT)^n e^(−rT)

       ≈ 1/(ru T)    for ru ≫ n/T

Posterior

A gamma distribution:

p(r|n, M) = T (rT)^n e^(−rT) / n!

68 / 111

Gamma Distributions

A 2-parameter family of distributions over nonnegative x, with shape parameter α and scale parameter s:

pΓ(x|α, s) = [1/(s Γ(α))] (x/s)^(α−1) e^(−x/s)

Moments:

E(x) = αs        Var(x) = αs²

Our rate posterior is a gamma dist'n with α = n + 1, s = 1/T:

Mode r̂ = n/T;  mean ⟨r⟩ = (n + 1)/T (shift down 1 with 1/r prior)

Credible regions found by integrating (can use incomplete gamma function)

69 / 111
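A grid check of the gamma posterior's normalization, mode, and mean, with made-up counts (n = 5 in T = 2):

```python
from math import exp, factorial

# Made-up data: n = 5 counts in T = 2.0 time units
n, T = 5, 2.0

def post(r):
    # p(r|n, M) = T (rT)^n e^(-rT) / n!
    return T * (r * T) ** n * exp(-r * T) / factorial(n)

h = 1e-4
grid = [i * h for i in range(1, 300001)]   # r in (0, 30]
vals = [post(r) for r in grid]

norm = h * sum(vals)                        # ~1: a proper posterior
mean = h * sum(r * v for r, v in zip(grid, vals))
mode = grid[vals.index(max(vals))]

print(round(norm, 3), round(mean, 3), round(mode, 3))  # 1.0 3.0 2.5
```

The numerical mode n/T = 2.5 and mean (n + 1)/T = 3.0 match the formulas above.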

70 / 111

Bayes's justification: Not that ignorance of r → p(r|I) = C; rather, require the (discrete) predictive distribution to be flat:

p(n|I) = ∫ dr p(r|I) p(n|r, I) = C  →  p(r|I) = C

Useful conventions

Use a log-flat prior (∝ 1/r) for a nonzero scale parameter

Use proper (normalized, bounded) priors

Plot posterior with abscissa that makes prior flat

71 / 111

Basic problem

Count Noff photons in interval Toff (off-source: background rate b)

Count Non photons in interval Ton (on-source: rate r = s + b)

Infer s

Conventional solution

b̂ = Noff/Toff;  σb = √Noff/Toff

r̂ = Non/Ton;  σr = √Non/Ton

ŝ = r̂ − b̂;  σs = √(σr² + σb²)

But ŝ can be negative!

72 / 111

Examples

Spectra of X-Ray Sources

Bassani et al. 1989

73 / 111

Nagano & Watson 2000

[Figure: ultra-high-energy cosmic-ray spectra, scaled flux (Flux × E/10²⁴, eV m⁻² s⁻¹ sr⁻¹) vs. log10(E) (eV) from 17 to 21, comparing HiRes-2 Monocular, HiRes-1 Monocular, and AGASA data.]

74 / 111

N is Never Large

Sample sizes are never large. If N is too small to get a

sufficiently-precise estimate, you need to get more data (or make

more assumptions). But once N is large enough, you can start

subdividing the data to learn more (for example, in a public

opinion poll, once you have a good estimate for the entire country,

you can estimate among men and women, northerners and

southerners, different age groups, etc., etc.). N is never enough because if it were "enough" you'd already be on to the next problem for which you need more data.

Similarly, you never have quite enough money. But that's another story.

Andrew Gelman (blog entry, 31 July 2005)

75 / 111

Measure background rate b̂ ± σb with source off. Measure total rate r̂ ± σr with source on. Infer signal source strength s, where r = s + b. With flat priors,

p(s, b|D, M) ∝ exp[−(b − b̂)²/(2σb²)] × exp[−(s + b − r̂)²/(2σr²)]

76 / 111

Marginalize b to summarize the results for s (complete the square to isolate the b dependence; then do a simple Gaussian integral over b):

p(s|D, M) ∝ exp[−(s − ŝ)²/(2σs²)]    with ŝ = r̂ − b̂ and σs² = σr² + σb²

Background subtraction is a special case of background marginalization.

77 / 111
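The σs² = σr² + σb² result can be verified numerically by marginalizing b on a grid and measuring the curvature of log p(s|D, M); the rate values here are invented:

```python
from math import exp, log

# Invented numbers for illustration
rhat, sig_r = 10.0, 1.0
bhat, sig_b = 4.0, 0.5

def joint(s, b):
    return exp(-(b - bhat) ** 2 / (2 * sig_b ** 2)
               - (s + b - rhat) ** 2 / (2 * sig_r ** 2))

h = 0.01
bgrid = [bhat - 5 + i * h for i in range(1001)]   # b grid, wide enough

def log_marginal(s):
    return log(h * sum(joint(s, b) for b in bgrid))

shat = rhat - bhat
# Second difference approximates d^2 log p / ds^2 = -1/sigma_s^2
curv = (log_marginal(shat + 0.1) - 2 * log_marginal(shat)
        + log_marginal(shat - 0.1)) / 0.1 ** 2
sigma_s2 = -1 / curv

print(round(sigma_s2, 3))  # 1.25 = sig_r**2 + sig_b**2
```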

First consider off-source data; use it to estimate b:

p(b|Noff, Ioff) = Toff (bToff)^Noff e^(−bToff) / Noff!

Use this as the prior for b in the on-source analysis, Iall = (Ion, Noff, Ioff):

p(s, b|Non, Iall) ∝ p(s) p(b|Noff, Ioff) [(s + b)Ton]^Non e^(−(s+b)Ton)

                 ∝ (s + b)^Non b^Noff e^(−sTon) e^(−b(Ton+Toff))

78 / 111

p(s|Non, Iall) = ∫ db p(s, b|Non, Iall)

Expand (s + b)^Non and do the resulting Γ integrals:

p(s|Non, Iall) = Σ_{i=0}^{Non} Ci [Ton (sTon)^i e^(−sTon) / i!]

Ci ∝ (1 + Toff/Ton)^i (Non + Noff − i)! / (Non − i)!

The posterior is a weighted sum of gamma dist'ns, each assigning a different number of on-source counts to the source. (Evaluate via recursive algorithm or confluent hypergeometric function.)

79 / 111
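A sketch of a direct evaluation in log space (using `lgamma` to avoid factorial overflow), with invented counts; the Ci come out normalized, so the mixture is a proper posterior:

```python
from math import exp, lgamma, log

def ci_weights(Non, Noff, Ton, Toff):
    """Normalized Ci for the on/off posterior, computed via logs."""
    logw = [i * log(1 + Toff / Ton)
            + lgamma(Non + Noff - i + 1) - lgamma(Non - i + 1)
            for i in range(Non + 1)]
    mx = max(logw)
    w = [exp(v - mx) for v in logw]
    tot = sum(w)
    return [x / tot for x in w]

def posterior_s(s, Non, Noff, Ton, Toff):
    # Weighted sum of gamma densities in s (shape i + 1, scale 1/Ton)
    C = ci_weights(Non, Noff, Ton, Toff)
    return sum(c * Ton * (s * Ton) ** i * exp(-s * Ton) / exp(lgamma(i + 1))
               for i, c in enumerate(C))

# Invented example: Non = 16 counts in Ton = 1.0, Noff = 9 in Toff = 1.0
C = ci_weights(16, 9, 1.0, 1.0)
print(round(sum(C), 6))  # 1.0
```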

[Figure: example on/off marginal posterior for s, Ton = 1.]

80 / 111

[Figure: example on/off marginal posterior for s, Ton = 1.]

81 / 111

Outline

1 The Big Picture

2 Foundations: Logic & Probability Theory

3 Inference With Parametric Models

Parameter Estimation

Model Uncertainty

4 Simple Examples

Binary Outcomes

Normal Distribution

Poisson Distribution

5 Measurement Error Applications

6 Bayesian Computation

7 Probability & Frequency

8 Endnotes: Hotspots, tools, reflections

82 / 111

Star counts, galaxy counts, GRBs, TNOs . . .

BATSE 4B Catalog (≈ 1200 GRBs)

[Figure: BATSE 4B GRB peak-flux number-counts distribution.]

83 / 111

Selection effects: typically treated by correcting data; most sophisticated: product-limit estimators

Measurement error: typically ignored (average out?)

84 / 111

Auger data above GZK cutoff (Nov 2007)

85 / 111

History

Eddington, Jeffreys (1920s–1940)

[Figure: measured magnitude m̂ vs. true m, with uncertainty.]

Malmquist, Lutz-Kelker: bias corrections for data (flux + distance indicator, parallax); assume homogeneous spatial distribution

86 / 111

Galaxies (Eddington, Malmquist; 1990s)

Linear regression (1990s)

GRBs (1990s)

X-ray sources (1990s; 2000s)

TNOs/KBOs (c. 2000)

Galaxy redshift distns (2007+)

87 / 111

Introduce latent/hidden/incidental parameters

Suppose f(x|θ) is a distribution for an observable, x.

From N precisely measured samples, {xi}, we can infer θ via

L(θ) ≡ p({xi}|θ) = Πi f(xi|θ)

88 / 111

Graphical representation:

θ → x1, x2, . . . , xN

L(θ) ≡ p({xi}|θ) = Πi f(xi|θ)

89 / 111

With noisy measurements Di of the latent xi, the likelihood involves both θ and {xi}:

L(θ, {xi}) ≡ p({Di}|θ, {xi}) = Πi ℓi(xi) f(xi|θ)

Marginalize over {xi} to summarize results for θ.

Marginalize over θ to summarize results for {xi}.

Key point: Maximizing over xi and integrating over xi can give very different results!

90 / 111
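A toy illustration of the key point, assuming a Gaussian population f(x|τ) = N(0, τ²) and Gaussian noise of known σn (all numbers invented): the marginal likelihood for the population width τ is well behaved, while jointly maximizing over the latent xi diverges as τ → 0.

```python
from math import log, sqrt

# Invented setup: latent xi ~ N(0, tau^2); data Di = xi + noise (sigma_n known)
sigma_n = 1.0
D = [1.8, -0.6, 2.4, -1.9, 0.3, 1.1, -2.2, 0.9]   # made-up measurements

def marginal_loglike(tau):
    # Each xi integrated out analytically: Di ~ N(0, tau^2 + sigma_n^2)
    v = tau ** 2 + sigma_n ** 2
    return sum(-0.5 * log(v) - d * d / (2 * v) for d in D)

def profile_loglike(tau):
    # Each xi maximized over instead of integrated out
    v = tau ** 2 + sigma_n ** 2
    return sum(-log(tau) - d * d / (2 * v) for d in D)

s2 = sum(d * d for d in D) / len(D)
taus = [0.01 + 0.01 * i for i in range(400)]
tau_marg = max(taus, key=marginal_loglike)

# Marginalizing gives a sensible estimate near sqrt(mean(D^2) - sigma_n^2)...
print(round(tau_marg, 2), round(sqrt(s2 - sigma_n ** 2), 2))  # both ~1.22

# ...while the profile (maximize over xi) blows up as tau -> 0
assert profile_loglike(0.01) > profile_loglike(1.0)
```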

Graphical representation

x1

x2

xN

D1

D2

DN

L(, {xi })

Y

Y

=

p(Di |xi )f (xi |) =

i (xi )f (xi |)

i

91 / 111

Measure m = −2.5 log(flux) from sources following a "rolling power law" dist'n (inspired by trans-Neptunian objects):

f(m) ∝ 10^[α(m−23) + α′(m−23)²]

Simulate data for a photon-counting instrument with a fixed count threshold. Measurements have uncertainties 1% (bright) to ≈ 30% (dim).

92 / 111


93 / 111

Similar toy survey, with parameters to mimic SDSS QSO surveys (few %

errors at dim end):

94 / 111

GRB repetition (Luo+ 1996; Graziani+ 1996)

TNO/KBO surveys (. . . ; Petit+ 2008)

GRB peak flux dist'n (Loredo & Wasserman 1998); energy vs. time (Loredo & Lamb 2002)

95 / 111

Outline

1 The Big Picture

2 Foundations: Logic & Probability Theory

3 Inference With Parametric Models

Parameter Estimation

Model Uncertainty

4 Simple Examples

Binary Outcomes

Normal Distribution

Poisson Distribution

5 Measurement Error Applications

6 Bayesian Computation

7 Probability & Frequency

8 Endnotes: Hotspots, tools, reflections

96 / 111

Bayesian Computation

Large sample size: Laplace approximation

Approximate posterior as multivariate normal → det(covar) factors

Uses ingredients available in χ²/ML fitting software (MLE, Hessian)

Often accurate to O(1/N)

Low-dimensional models (d ≲ 10 to 20):

Adaptive cubature

Monte Carlo integration (importance sampling, quasirandom MC)

Hi-dimensional models (d ≳ 5):

Posterior sampling: create RNG that samples posterior

MCMC is most general framework

97 / 111
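A sketch of the Laplace approximation applied to the earlier flat-prior binomial marginal likelihood (n = 8, N = 12), where the exact answer is available for comparison:

```python
from math import sqrt, pi, factorial

# Flat-prior binomial example: exact p(D|M) = n!(N-n)!/(N+1)!
n, N = 8, 12

ahat = n / N                               # MLE
L_hat = ahat ** n * (1 - ahat) ** (N - n)  # maximum likelihood
sigma_hat = sqrt(ahat * (1 - ahat) / N)    # width from the Hessian at the MLE

# Laplace: replace the integrand by a Gaussian of the same peak and curvature
laplace = L_hat * sqrt(2 * pi) * sigma_hat
exact = factorial(n) * factorial(N - n) / factorial(N + 1)

print(round(laplace / exact, 2))  # 1.06: ~6% error even at N = 12
```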

Outline

1 The Big Picture

2 Foundations: Logic & Probability Theory

3 Inference With Parametric Models

Parameter Estimation

Model Uncertainty

4 Simple Examples

Binary Outcomes

Normal Distribution

Poisson Distribution

5 Measurement Error Applications

6 Bayesian Computation

7 Probability & Frequency

8 Endnotes: Hotspots, tools, reflections

98 / 111

Frequencies are relevant when modeling repeated trials, or

repeated sampling from a population or ensemble.

Frequencies are observables:

When available, they can be used to infer probabilities

When unavailable, they can be predicted

Bayesian/Frequentist relationships:

Long-run performance of Bayesian procedures

Examples of Bayesian/frequentist differences

99 / 111

Frequency from probability

Bernoulli's law of large numbers: In repeated i.i.d. trials, given P(success| . . .) = α, predict

Nsuccess/Ntotal → α    as Ntotal → ∞

Bayes's "An Essay Towards Solving a Problem in the Doctrine of Chances" → first use of Bayes's theorem:

Probability for success in next trial of i.i.d. sequence:

E(α) → Nsuccess/Ntotal    as Ntotal → ∞

100 / 111

Predict frequency in dependent trials

rt = result of trial t; p(r1, r2 . . . rN|M) known; predict f:

⟨f⟩ = (1/N) Σt p(rt = success|M)

where p(r1|M) = Σ_{r2} · · · Σ_{rN} p(r1, r2 . . . |M), etc.

→ ⟨f⟩ = average probability for success across trials.

But also find that σf need not converge to 0.

Shrinkage: Biased estimators of the probability that share info across trials are better than unbiased/BLUE/MLE estimators.

A formalism that distinguishes p from f from the outset is particularly valuable for exploring subtle connections. E.g., shrinkage is explored via hierarchical and empirical Bayes.

101 / 111

Many results known for parametric Bayes performance:

Estimates are consistent if the prior doesn't exclude the true value.

Credible regions found with flat priors are typically confidence regions to O(n^(−1/2)); "reference" priors can improve their performance to O(n^(−1)).

Marginal distributions have better frequentist performance than

conventional methods like profile likelihood. (Bartlett correction,

ancillaries, bootstrap are competitive but hard.)

Bayesian model comparison is asymptotically consistent (not true of

significance/NP tests, AIC).

For separate (not nested) models, the posterior probability for the

true model converges to 1 exponentially quickly.

Wald's complete class theorem: Optimal frequentist methods are Bayes rules (equivalent to Bayes for some prior)

...

(Not so clear in nonparametric problems.)

102 / 111

Outline

1 The Big Picture

2 Foundations: Logic & Probability Theory

3 Inference With Parametric Models

Parameter Estimation

Model Uncertainty

4 Simple Examples

Binary Outcomes

Normal Distribution

Poisson Distribution

5 Measurement Error Applications

6 Bayesian Computation

7 Probability & Frequency

8 Endnotes: Hotspots, tools, reflections

103 / 111

Cosmology

Nonparametric modeling of SN Ia multicolor light curves

Nonparametric emulation of cosmological models

Extrasolar planets

Parametric modeling of Keplerian reflex motion (planet
detection, orbit estimation)

Optimal scheduling via Bayesian experimental design

Upper limits, hardness ratios

Parametric spectroscopy (line detection, etc.)

Parametric modeling of binary inspirals

Hi-multiplicity parametric modeling of white dwarf background

104 / 111

Astronomer/Physicist Tools

BIE http://www.astro.umass.edu/~weinberg/proto_bie/

Bayesian Inference Engine: General framework for Bayesian inference, tailored to

astronomical and earth-science survey data. Built-in database capability to

support analysis of terabyte-scale data sets. Inference is by Bayes via MCMC.

XSpec, CIAO/Sherpa

Both environments have some basic Bayesian capability (including basic MCMC

in XSpec)

CosmoMC http://cosmologist.info/cosmomc/

Parameter estimation for cosmological models using CMB and other data via

MCMC

ExoFit http://zuserver2.star.ucl.ac.uk/~lahav/exofit.html

Adaptive MCMC for fitting exoplanet RV data

http://www-cdf.fnal.gov/physics/statistics/statistics_software.html

Limits for Poisson counting processes, with background & efficiency

uncertainties

root http://root.cern.ch/

Bayesian support? (BayesDivide)

Several self-contained Bayesian modules (Gaussian, Poisson, directional
processes); Parametric Inference Engine (PIE) supports χ², likelihood, and Bayes

105 / 111

Python

PyMC http://trichech.us/pymc

A framework for MCMC via Metropolis-Hastings; also implements Kalman

filters and Gaussian processes. Targets biometrics, but is general.

SimPy http://simpy.sourceforge.net/
(Intro to SimPy: http://heather.cs.ucdavis.edu/~matloff/simpy.html)
SimPy (rhymes with "Blimpie") is a process-oriented public-domain package for
discrete-event simulation.

RSPython http://www.omegahat.org/

Bi-directional communication between Python and R

MDP http://mdp-toolkit.sourceforge.net/

Modular toolkit for Data Processing: Current emphasis is on machine learning

(PCA, ICA. . . ). Modularity allows combination of algorithms and other data

processing elements into flows.

Orange http://www.ailab.si/orange/

Component-based data mining, with preprocessing, modeling, and exploration

components. Python/GUI interfaces to C++ implementations. Some Bayesian

components.

ELEFANT http://rubis.rsise.anu.edu.au/elefant

Machine learning library and platform providing Python interfaces to efficient,

lower-level implementations. Some Bayesian components (Gaussian processes;

Bayesian ICA/PCA).

106 / 111

R and S

http://cran.r-project.org/src/contrib/Views/Bayesian.html

Overview of many R packages implementing various Bayesian models and

methods

Omega-hat http://www.omegahat.org/

RPython, RMatlab, R-Xlisp

BOA http://www.public-health.uiowa.edu/boa/

Bayesian Output Analysis: Convergence diagnostics and statistical and graphical

analysis of MCMC output; can read BUGS output files.

CODA

http://www.mrc-bsu.cam.ac.uk/bugs/documentation/coda03/cdaman03.html

Convergence Diagnosis and Output Analysis: Menu-driven R/S plugins for

analyzing BUGS output

107 / 111

Java

Omega-hat http://www.omegahat.org/

Java environment for statistical computing, being developed by XLisp-stat and

R developers

Hydra http://research.warnes.net/projects/mcmc/hydra/

HYDRA provides methods for implementing MCMC samplers using Metropolis,

Metropolis-Hastings, Gibbs methods. In addition, it provides classes

implementing several unique adaptive and multiple chain/parallel MCMC

methods.

YADAS http://www.stat.lanl.gov/yadas/home.html

Software system for statistical analysis using MCMC, based on the

multi-parameter Metropolis-Hastings algorithm (rather than

parameter-at-a-time Gibbs sampling)
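The Metropolis-Hastings updates these packages implement can be sketched in a few lines of pure Python. This toy random-walk Metropolis sampler targets a standard normal and is for illustration only, not a substitute for any of the libraries above:

```python
import math
import random

random.seed(2)

def log_target(x):
    # Unnormalized log density of a standard normal target.
    return -0.5 * x * x

def metropolis(n_steps, step=1.0, x0=0.0):
    """Random-walk Metropolis: propose x' ~ N(x, step^2) and accept
    with probability min(1, p(x') / p(x))."""
    x, chain = x0, []
    for _ in range(n_steps):
        prop = x + random.gauss(0.0, step)
        if math.log(random.random()) < log_target(prop) - log_target(x):
            x = prop
        chain.append(x)
    return chain

chain = metropolis(50_000)
mean = sum(chain) / len(chain)
var = sum((v - mean) ** 2 for v in chain) / len(chain)
print(mean, var)  # near 0 and 1, the target's moments
```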

108 / 111

C/C++/Fortran

BayeSys 3 http://www.inference.phy.cam.ac.uk/bayesys/

Sophisticated suite of MCMC samplers including transdimensional capability, by

the author of MemSys

fbm http://www.cs.utoronto.ca/~radford/fbm.software.html

Flexible Bayesian Modeling: MCMC for simple Bayes, Bayesian regression and

classification models based on neural networks and Gaussian processes, and

Bayesian density estimation and clustering using mixture models and Dirichlet

diffusion trees

BayesPack, DCUHRE

http://www.sci.wsu.edu/math/faculty/genz/homepage

Adaptive quadrature, randomized quadrature, Monte Carlo integration

109 / 111

BUGS/WinBUGS http://www.mrc-bsu.cam.ac.uk/bugs/

Bayesian Inference Using Gibbs Sampling: Flexible software for the Bayesian

analysis of complex statistical models using MCMC

OpenBUGS http://mathstat.helsinki.fi/openbugs/

BUGS on Windows and Linux, and from inside R

XLisp-stat http://www.stat.uiowa.edu/~luke/xls/xlsinfo/xlsinfo.html

Lisp-based data analysis environment, with an emphasis on providing a

framework for exploring the use of dynamic graphical methods

ReBEL http://choosh.csee.ogi.edu/rebel/

Library supporting recursive Bayesian estimation in Matlab (Kalman filter,

particle filters, sequential Monte Carlo).

110 / 111

Closing Reflections

Philip Dawid (2000)

What is the principal distinction between Bayesian and classical statistics? It is

that Bayesian statistics is fundamentally boring. There is so little to do: just

specify the model and the prior, and turn the Bayesian handle. There is no

room for clever tricks or an alphabetic cornucopia of definitions and optimality

criteria. I have heard people use this dullness as an argument against

Bayesianism. One might as well complain that Newton's dynamics, being based
on three simple laws of motion and one of gravitation, is a poor substitute for
the richness of Ptolemy's epicyclic system.

All my experience teaches me that it is invariably more fruitful, and leads to

deeper insights and better data analyses, to explore the consequences of being a

thoroughly boring Bayesian.

The philosophy places more emphasis on model construction than on formal

inference. . . I do agree with Dawid that Bayesian statistics is fundamentally

boring. . . My only qualification would be that the theory may be boring but the

applications are exciting.

111 / 111
