
A Very Brief Summary of Statistical Inference, and Examples

Hilary Term 2007 Prof. Gesine Reinert

Data x = x1, x2, . . . , xn, realisations of random variables X1, X2, . . . , Xn

with distribution (model) f (x1, x2, . . . , xn; θ)

Frequentist: θ ∈ Θ unknown constant

1. Likelihood and Sufficiency

(Fisherian) Likelihood approach:

L(θ) = L(θ, x) = f (x1, x2, . . . , xn; θ)

Often: X1, X2, . . . , Xn independent, identically distributed (i.i.d.);

then L(θ, x) = ∏_{i=1}^n f(x_i, θ)

Summarize information about θ: find a minimal sufficient statis-

tic t(x); from the Factorization Theorem: T = t(X) is sufficient for

θ if and only if there exist functions g(t, θ) and h(x) such that for

all x and θ

f (x, θ) = g(t(x), θ)h(x).

Moreover T = t(X) is minimal sufficient when it holds that

f(x, θ)/f(y, θ) is constant in θ ⇐⇒ t(x) = t(y)
Example

Let X1, . . . , Xn be a random sample from the truncated exponential

distribution, where

fXi (xi) = eθ−xi , xi > θ

or, using the indicator function notation,

fXi (xi) = eθ−xi I(θ,∞)(xi).

Show that Y1 = min(Xi) is sufficient for θ.

Let T = T (X1, . . . , Xn) = Y1. We need the pdf fT (t) of the

smallest order statistic. Calculate the cdf for Xi,

F(x) = ∫_θ^x e^{θ−z} dz = e^θ[e^{−θ} − e^{−x}] = 1 − e^{θ−x}.

Now

P(T > t) = ∏_{i=1}^n (1 − F(t)) = (1 − F(t))^n.

Differentiating gives that fT (t) equals

n[1 − F(t)]^{n−1} f(t) = n e^{(θ−t)(n−1)} × e^{θ−t} = n e^{n(θ−t)}, t > θ.

So the conditional density of X1, . . . , Xn given T = t is

e^{θ−x_1} e^{θ−x_2} ⋯ e^{θ−x_n} / (n e^{n(θ−t)}) = e^{−∑ x_i} / (n e^{−nt}), x_i ≥ t, i = 1, . . . , n,

which does not depend on θ for each fixed t = min(xi). Note that

since xi ≥ t, i = 1, . . . , n, neither the expression nor the range space

depends on θ, so the first order statistic, X(1), is a sufficient statistic

for θ.

Alternatively, use Factorization Theorem:

fX(x) = ∏_{i=1}^n e^{θ−x_i} I_{(θ,∞)}(x_i)

      = I_{(θ,∞)}(x_{(1)}) ∏_{i=1}^n e^{θ−x_i}

      = e^{nθ} I_{(θ,∞)}(x_{(1)}) e^{−∑_{i=1}^n x_i}

with

g(t, θ) = e^{nθ} I_{(θ,∞)}(t)

and

h(x) = e^{−∑_{i=1}^n x_i}.
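A small simulation sketch of this example (assuming NumPy; θ, n and the number of replications are illustrative choices only), checking the derived distribution of T = X_{(1)} via P(T > t) = e^{n(θ−t)}:

```python
import numpy as np

# Numerical check of the distribution of T = X_(1) for the truncated
# exponential f(x; theta) = exp(theta - x), x > theta. Illustration values.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 5, 200_000

# X_i = theta + standard exponential has density exp(theta - x) for x > theta
samples = theta + rng.exponential(size=(reps, n))
t_min = samples.min(axis=1)

# T - theta is Exponential with rate n, so E[T] = theta + 1/n
print("empirical E[T]:  ", t_min.mean())
print("theoretical E[T]:", theta + 1 / n)

# Check P(T > t) = exp(n (theta - t)) at a few points t > theta
for t in [2.1, 2.3, 2.5]:
    print(t, (t_min > t).mean(), np.exp(n * (theta - t)))
```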

2. Point Estimation

Estimate θ by a function t(x1, . . . , xn) of the data; often by maximum-

likelihood:

θ̂ = arg max_θ L(θ),

or by the method of moments.

Neither is unbiased in general, but the m.l.e. is asymptotically

unbiased and asymptotically efficient; under some regularity assump-

tions,

θ̂ ≈ N(θ, I_n^{−1}(θ)),

where

I_n(θ) = E[ (∂ℓ(θ, X)/∂θ)^2 ]

is the Fisher information (matrix)


Under more regularity,

I_n(θ) = −E[ ∂^2ℓ(θ, X)/∂θ^2 ]

If x is a random sample, then I_n(θ) = n I_1(θ); often abbreviate

I_1(θ) = I(θ)

The m.l.e. is a function of a sufficient statistic; recall that, for

scalar θ, there is a nice theory using the Cramer-Rao lower bound

and the Rao-Blackwell theorem on how to obtain minimum variance

unbiased estimators based on a sufficient statistic and an unbiased

estimator

Nice invariance property: The m.l.e. of a function φ(θ) is φ(θ̂)

Example

Suppose X1, X2, . . . , Xn is a random sample from the Gamma distribution

with density

f(x; c, β) = x^{c−1} e^{−x/β} / (Γ(c)β^c), x > 0,

where c > 0, β > 0; θ = (c, β);

ℓ(θ) = −n log Γ(c) − nc log β + (c − 1) log(∏ x_i) − (1/β) ∑ x_i


Put D1(c) = ∂/∂c log Γ(c); then

∂ℓ/∂β = −nc/β + (1/β^2) ∑ x_i

∂ℓ/∂c = −nD1(c) − n log β + log(∏ x_i)

Setting these equal to zero yields

β̂ = x̄/ĉ,

where ĉ solves

D1(ĉ) − log(ĉ) = log([∏ x_i]^{1/n}/x̄)

(One could check that the sufficient statistic is indeed (∑ x_i, ∏ x_i), and

that it is minimal sufficient.)

Calculate the Fisher information: put D2(c) = ∂^2/∂c^2 log Γ(c); then

∂^2ℓ/∂β^2 = nc/β^2 − (2/β^3) ∑ x_i

∂^2ℓ/∂β∂c = −n/β

∂^2ℓ/∂c^2 = −n D2(c)

Use that E X_i = cβ to obtain

I(θ) = n ( c/β^2   1/β
           1/β     D2(c) )

Recall also that there is an iterative method to compute m.l.e.s,

related to the Newton-Raphson method.
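A minimal sketch of such an iteration for the Gamma example, assuming SciPy for the digamma and trigamma functions; the simulated data, starting value and tolerances are illustrative choices only:

```python
import numpy as np
from scipy.special import digamma, polygamma

def gamma_mle(x, c0=1.0, tol=1e-10, max_iter=100):
    """Newton-Raphson sketch for the Gamma(c, beta) m.l.e. of the notes:
    solve D1(c) - log(c) = log(geometric mean / x_bar), then beta_hat = x_bar / c_hat."""
    x = np.asarray(x, dtype=float)
    xbar = x.mean()
    rhs = np.mean(np.log(x)) - np.log(xbar)   # log([prod x_i]^(1/n) / x_bar)
    c = c0
    for _ in range(max_iter):
        f = digamma(c) - np.log(c) - rhs      # D1(c) - log c - rhs
        fprime = polygamma(1, c) - 1.0 / c    # D2(c) - 1/c
        step = f / fprime
        c -= step
        if abs(step) < tol:
            break
    return c, xbar / c

# Quick check on simulated data (illustration values only)
rng = np.random.default_rng(1)
data = rng.gamma(shape=3.0, scale=2.0, size=10_000)
print(gamma_mle(data))   # should be close to (3.0, 2.0)
```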

3. Hypothesis Testing

For simple null hypothesis and simple alternative, the Neyman-

Pearson Lemma says that the most powerful tests are likelihood-ratio

tests. These can sometimes be generalized to one-sided alternatives

in such a way as to yield uniformly most powerful tests.

Example

Suppose as above that X1, . . . , Xn is a random sample from the

truncated exponential distribution, where

fXi (xi; θ) = eθ−xi , xi > θ.

Find a UMP test of size α for testing H0 : θ = θ0 against H1 : θ > θ0:

Let θ1 > θ0, then the LR is

f(x, θ1)/f(x, θ0) = e^{n(θ1−θ0)} I_{(θ1,∞)}(x_{(1)}),

which increases with x(1), so we reject H0 if the smallest observation,

T = X_{(1)}, is large. Under H0, we have calculated that the pdf f_T of T

is

n e^{n(θ0−y1)}, y1 > θ0.

So for t > θ0

P(T > t) = ∫_t^∞ n e^{n(θ0−s)} ds = e^{n(θ0−t)}.

For a size-α test, choose t such that P(T > t) = α, that is

t = θ0 − (ln α)/n.

The LR test rejects H0 if X_{(1)} > θ0 − (ln α)/n. The test is the same for

all θ1 > θ0, so it is UMP.
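A short numerical sketch of this test, assuming NumPy; α, θ0, n and the number of replications are illustrative values:

```python
import numpy as np

# Critical value of the UMP test and a Monte Carlo check of its size.
rng = np.random.default_rng(2)
alpha, theta0, n, reps = 0.05, 1.0, 10, 100_000

t_crit = theta0 - np.log(alpha) / n            # reject H0 if X_(1) > t_crit
x = theta0 + rng.exponential(size=(reps, n))   # samples under H0
size = (x.min(axis=1) > t_crit).mean()
print("critical value:", t_crit)
print("Monte Carlo size:", size, "(target", alpha, ")")
```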

For general null hypothesis and general alternative:

H0 : θ ∈ Θ0

H1 : θ ∈ Θ1 = Θ \ Θ0

The (generalized) LR test uses the likelihood ratio statistic

T = max_{θ∈Θ} L(θ; X) / max_{θ∈Θ0} L(θ; X)

and rejects H0 for large values of T; use chisquare asymptotics 2 log T ≈

χ^2_p, where p = dim Θ − dim Θ0 (nested models only); or use score

tests, which are based on the score function ∂`/∂θ, with asymptotics

in terms of normal distribution N (0, I(θ))

An important example is Pearson’s Chisquare test, which we de-

rived as a score test, and we saw that it is asymptotically equivalent

to the generalized likelihood ratio test

Example

Let X1, . . . , Xn be a random sample from a geometric distribution

with parameter p;

P (Xi = k) = p(1 − p)k , k = 0, 1, 2, . . . .

Then

L(p) = p^n (1 − p)^{∑ x_i}

and

ℓ(p) = n ln p + n x̄ ln(1 − p),

so that

∂ℓ/∂p = n (1/p − x̄/(1 − p))

and

∂^2ℓ/∂p^2 = n (−1/p^2 − x̄/(1 − p)^2).
We know that E X_1 = (1 − p)/p. Calculate the information

I(p) = n/(p^2(1 − p)).

Suppose n = 20, x̄ = 3, H0 : p = 0.15 and H1 : p ≠ 0.15. Then

∂ℓ/∂p(0.15) = 62.7 and I(0.15) = 1045.8. Test statistic

Z = 62.7/√1045.8 = 1.9388.

Compare to 1.96 for a test at level α = 0.05: do not reject H0.
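A quick check of this arithmetic, assuming NumPy; n, x̄ and p0 are the values above:

```python
import numpy as np

# Numeric check of the score test in the geometric example
n, xbar, p0 = 20, 3.0, 0.15

score = n * (1 / p0 - xbar / (1 - p0))   # dl/dp at p0
info = n / (p0**2 * (1 - p0))            # Fisher information I(p0)
z = score / np.sqrt(info)
print(score, info, z)    # approx 62.7, 1045.8, 1.94; compare |z| with 1.96
```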

Example continued.

For a generalized LRT the test statistic is based on 2 log[L(θ̂)/L(θ0)] ≈

χ^2_1, and the test statistic is thus

2(ℓ(θ̂) − ℓ(θ0)) = 2n(ln θ̂ − ln θ0 + x̄(ln(1 − θ̂) − ln(1 − θ0))).

We calculate that the m.l.e. is

θ̂ = 1/(1 + x̄).

Suppose again n = 20, x̄ = 3, H0 : θ = 0.15 and H1 : θ ≠ 0.15.

The m.l.e. is θ̂ = 0.25, and ℓ(0.25) = −44.98; ℓ(0.15) = −47.69.

Calculate

χ^2 = 2(47.69 − 44.98) = 5.4

and compare to chisquare distribution with 1 degree of freedom: 3.84

at 5 percent level, so reject H0.
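A corresponding check of the generalized LRT numbers, assuming NumPy and SciPy for the chisquare quantile; n, x̄ and θ0 are as above:

```python
import numpy as np
from scipy.stats import chi2

# Numeric check of the generalized LRT in the geometric example.
n, xbar, theta0 = 20, 3.0, 0.15

def loglik(theta):
    return n * np.log(theta) + n * xbar * np.log(1 - theta)

theta_hat = 1 / (1 + xbar)                      # m.l.e. = 0.25
lrt = 2 * (loglik(theta_hat) - loglik(theta0))  # approx 5.4
print(theta_hat, loglik(theta_hat), loglik(theta0), lrt)
print("5% critical value:", chi2.ppf(0.95, df=1))   # 3.84 -> reject H0
```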


4. Confidence Regions

If we can find a pivot, a function t(X, θ) of a sufficient statistic

whose distribution does not depend on θ, then we can find confidence

regions in a straightforward manner.

Otherwise we may have to resort to approximate confidence re-

gions, for example using the approximate normality of the m.l.e.
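A sketch of such an approximate interval, reusing the geometric example with the m.l.e. and the information evaluated at it (assuming NumPy; the values are the illustrative ones used above):

```python
import numpy as np

# Approximate 95% confidence interval for the geometric p based on
# the asymptotic normality of the m.l.e. (illustration values).
n, xbar = 20, 3.0
p_hat = 1 / (1 + xbar)                   # m.l.e.
info = n / (p_hat**2 * (1 - p_hat))      # I_n(p_hat) = n / (p^2 (1 - p))
se = 1 / np.sqrt(info)
print(p_hat - 1.96 * se, p_hat + 1.96 * se)
```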

Recall that confidence intervals are equivalent to hypothesis tests

with simple null hypothesis and one- or two-sided alternatives

Not to forget about:

Profile likelihood

Often θ = (ψ, λ) where ψ contains the parameters of interest;

then base inference on profile likelihood for ψ,

LP (ψ) = L(ψ, λ̂ψ ).

Again we can use (generalized) LRT or score test; if ψ scalar, H0 :

ψ = ψ0, H1+ : ψ > ψ0, H1− : ψ < ψ0, use test statistic

T = ∂ℓ(ψ0, λ̂0; X)/∂ψ,

where λ̂0 is the MLE for λ when H0 true

Large positive values of T indicate H1+

Large negative values indicate H1−

T ≈ ℓ'_ψ − I_{λ,λ}^{−1} I_{ψ,λ} ℓ'_λ ≈ N(0, 1/I^{ψ,ψ}),

where I^{ψ,ψ} = (I_{ψ,ψ} − I_{ψ,λ}^2 I_{λ,λ}^{−1})^{−1} is the top left element of I^{−1}

Estimate by substituting the null hypothesis values; calculate the

practical standardized form of T as

Z = T/√Var(T) ≈ ℓ'_ψ(ψ, λ̂_ψ)[I^{ψ,ψ}(ψ, λ̂_ψ)]^{1/2}

  ≈ N(0, 1)

Bias and variance approximations: the delta method

If we cannot calculate mean and variance directly: Suppose T =

g(S) where E S = β and Var S = V . A Taylor expansion gives

T = g(S) ≈ g(β) + (S − β)g'(β).

Taking the mean and variance of the r.h.s.:

E T ≈ g(β), Var T ≈ [g'(β)]^2 V.

Also works for vectors S, β, with T still a scalar:

(g'(β))_i = ∂g/∂β_i and g''(β) the matrix of second derivatives,

Var T ≈ [g'(β)]^T V g'(β)

and

E T ≈ g(β) + (1/2) trace[g''(β)V ].
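A Monte Carlo sketch of the delta method, assuming NumPy, with the illustrative choice g(s) = log s and S the mean of n i.i.d. Exponential(1) variables (so β = 1 and V = 1/n):

```python
import numpy as np

# Monte Carlo check of the delta method for T = g(S) = log(S).
rng = np.random.default_rng(3)
n, reps = 50, 200_000

S = rng.exponential(size=(reps, n)).mean(axis=1)
T = np.log(S)

beta, V = 1.0, 1.0 / n
print("Var T (simulated):", T.var())
print("Var T (delta):    ", (1.0 / beta) ** 2 * V)                       # [g'(beta)]^2 V
print("E T (simulated):  ", T.mean())
print("E T (delta):      ", np.log(beta) + 0.5 * (-1.0 / beta**2) * V)   # g + (1/2) g'' V
```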

Exponential family

For distributions in the exponential family, many calculations have

been standardized, see Notes.

A Very Brief Summary of Bayesian Inference, and Examples

Data x = x1, x2, . . . , xn, realisations of random variables X1, X2, . . . , Xn

with distribution (model) f (x1, x2, . . . , xn|θ)

Bayesian: θ ∈ Θ random; prior π(θ)

1. Priors, Posteriors, Likelihood, and Sufficiency

Posterior distribution of θ given x is

π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ)dθ

Shorter: posterior ∝ prior × likelihood

The (prior) predictive distribution of the data x on the basis π

is

p(x) = ∫ f(x|θ)π(θ)dθ

Suppose data x1 is available, want to predict additional data:

p(x2|x1) = ∫ f(x2|θ)π(θ|x1)dθ

Note that x2 and x1 are assumed conditionally independent given θ

They are not, in general, unconditionally independent.


Example Suppose y1, y2, . . . , yn are independent normally dis-

tributed random variables, each with variance 1 and with means

βx1, . . . , βxn, where β is an unknown real-valued parameter and

x1, x2, . . . , xn are known constants. Suppose prior π(β) = N (µ, α2).

Then the likelihood is

L(β) = (2π)^{−n/2} ∏_{i=1}^n e^{−(y_i − βx_i)^2/2}

and

π(β) = (2πα^2)^{−1/2} e^{−(β−µ)^2/(2α^2)}

     ∝ exp( −β^2/(2α^2) + βµ/α^2 )

and the posterior of β given y1, y2, . . . , yn can be calculated as

 
π(β|y) ∝ exp( −(1/2) ∑ (y_i − βx_i)^2 − (1/(2α^2))(β − µ)^2 )

       ∝ exp( β ∑ x_i y_i − (1/2) β^2 ∑ x_i^2 − β^2/(2α^2) + βµ/α^2 ).

Abbreviate s_xx = ∑ x_i^2, s_xy = ∑ x_i y_i; then

π(β|y) ∝ exp( β s_xy − (1/2) β^2 s_xx − β^2/(2α^2) + βµ/α^2 )

       = exp( −(β^2/2)(s_xx + 1/α^2) + β(s_xy + µ/α^2) ),

which we recognize as

N( (s_xy + µ/α^2)/(s_xx + 1/α^2), (s_xx + 1/α^2)^{−1} ).
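A small sketch of this conjugate update, assuming NumPy; the covariates x, the simulated responses y and the prior values µ, α^2 are made up for illustration:

```python
import numpy as np

# Conjugate posterior for beta in the model y_i ~ N(beta * x_i, 1)
# with prior beta ~ N(mu, alpha^2). Data and prior values are made up.
rng = np.random.default_rng(4)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
beta_true, mu, alpha2 = 2.0, 0.0, 10.0
y = beta_true * x + rng.normal(size=x.size)

sxx = np.sum(x**2)
sxy = np.sum(x * y)

post_prec = sxx + 1 / alpha2                   # posterior precision
post_mean = (sxy + mu / alpha2) / post_prec    # posterior mean
post_var = 1 / post_prec
print(post_mean, post_var)
```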

Summarize information about θ: find a minimal sufficient statis-

tic t(x); T = t(X) is sufficient for θ if and only if for all x and θ

π(θ|x) = π(θ|t(x))

As in frequentist approach; posterior (inference) based on sufficient

statistic

Choice of prior

There are many ways of choosing a prior distribution, for example using a

coherent belief system. Often it is convenient to restrict the

class of priors to a particular family of distributions. When the

posterior is in the same family of models as the prior, i.e. when one

has a conjugate prior, then updating the distribution under new data

is particularly convenient.

For the regular k-parameter exponential family,

f(x|θ) = f(x) g(θ) exp{ ∑_{i=1}^k c_i φ_i(θ) h_i(x) },  x ∈ X,

conjugate priors can be derived in a straightforward manner, using

the sufficient statistics.

Non-informative priors favour no particular values of the param-

eter over others. If Θ is finite, choose uniform prior; if Θ is infinite,

there are several ways (some may be improper)

For a location density f(x|θ) = f(x − θ), the non-informative

location-invariant prior is π(θ) = π(0) constant; usually choose

π(θ) = 1 for all θ (improper prior)

For a scale density f(x|σ) = (1/σ) f(x/σ) for σ > 0, the scale-invariant

non-informative prior is π(σ) ∝ 1/σ; usually choose π(σ) = 1/σ (improper)

Jeffreys prior: π(θ) ∝ I(θ)^{1/2} if the information I(θ) exists; it is

invariant under reparametrization; may or may not be improper

Example continued. Suppose y1, y2, . . . , yn are independent,

yi ∼ N (βxi, 1), where β is an unknown parameter and x1, x2, . . . , xn

are known. Then

I(β) = −E[ ∂^2/∂β^2 log L(β, y) ]

     = −E[ ∂/∂β ∑ (y_i − βx_i) x_i ]

     = s_xx,

and so the Jeffreys prior is ∝ √s_xx, constant in β and improper. With

this Jeffreys prior, the calculation of the posterior distribution is

equivalent to putting α2 = ∞ in the previous calculation, yielding

 
sxy
π(β|y) is N , (sxx)−1 .
sxx

If Θ is discrete, with constraints

E_π g_k(θ) = µ_k, k = 1, . . . , m,

choose the distribution with the maximum entropy under these constraints:

π̃(θ_i) = exp( ∑_{k=1}^m λ_k g_k(θ_i) ) / ∑_i exp( ∑_{k=1}^m λ_k g_k(θ_i) ),

where the λ_k are determined by the constraints. There is a similar

formula for the continuous case, maximizing the entropy relative to

a particular reference distribution π0 under constraints. For π0 one

would choose the “natural” invariant noninformative prior.
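A minimal sketch of the discrete maximum-entropy construction with a single mean constraint, assuming SciPy for the root-finding; the parameter set Θ and the constraint value µ are illustrative:

```python
import numpy as np
from scipy.optimize import brentq

# Maximum-entropy prior on a finite parameter set Theta under one
# mean constraint E_pi[g(theta)] = mu with g(theta) = theta.
theta = np.arange(6)      # Theta = {0, 1, ..., 5}
mu = 2.0                  # illustrative constraint value

def tilted(lam):
    w = np.exp(lam * theta)        # exp(lambda * g(theta_i))
    return w / w.sum()

# Choose lambda so that the tilted distribution satisfies the constraint.
lam = brentq(lambda l: tilted(l) @ theta - mu, -10, 10)
prior = tilted(lam)
print(lam, prior, prior @ theta)   # the prior mean should equal mu
```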

For inference, check the influence of the choice of prior, for example

by an ε-contamination class of priors.

2. Point Estimation

Under suitable regularity conditions and random sampling, when n is

large the posterior is approximately N(θ̂, (nI_1(θ̂))^{−1}), where θ̂

is the m.l.e.; provided that the prior is non-zero in a region surround-

ing θ̂. In particular, if θ0 is the true parameter, then the posterior

will become more and more concentrated around θ0.

Often reporting the posterior distribution is preferred to point

estimation.

Bayes estimators are constructed to minimize the posterior ex-

pected loss. Recall from Decision Theory: Θ is the set of all possible

states of nature (values of parameter), D is the set of all possible de-

cisions (actions); a loss function is any function

L : Θ × D → [0, ∞)

For point estimation: D = Θ, L(θ, d) loss in reporting d when θ is

true.

Bayesian: For a prior π and data x ∈ X , the posterior expected

loss of a decision is

ρ(π, d|x) = ∫_Θ L(θ, d) π(θ|x) dθ

For a prior π the integrated risk of a decision rule δ is

r(π, δ) = ∫_Θ ∫_X L(θ, δ(x)) f(x|θ) dx π(θ) dθ

An estimator minimizing r(π, δ) can be obtained by selecting, for

every x ∈ X , the value δ(x) that minimizes ρ(π, δ|x). A Bayes

estimator associated with prior π, loss L, is any estimator δ π which

minimizes r(π, δ). Then r(π) = r(π, δ π ) is the Bayes risk.

Valid for proper priors, and for improper priors if r(π) < ∞. If

r(π) = ∞, define a generalized Bayes estimator as the minimizer,

for every x, of ρ(π, d|x)

For strictly convex loss functions, Bayes estimators are unique.

For squared error loss L(θ, d) = (θ − d)2, the Bayes estimator δ π

associated with prior π is the posterior mean. For absolute error loss

L(θ, d) = |θ − d|, the posterior median is a Bayes estimator.

Example. Suppose that x1, . . . , xn is a random sample from a

Poisson distribution with unknown mean θ. Two models for the prior

distribution of θ are contemplated;

π1(θ) = e−θ , θ > 0, and π2(θ) = e−θ θ, θ > 0.

Then calculate the Bayes estimator of θ under model 1, with quadratic

loss, using

π1(θ|x) ∝ e^{−nθ} θ^{∑ x_i} e^{−θ} = e^{−(n+1)θ} θ^{∑ x_i},

which we recognize as Gamma(∑ x_i + 1, n + 1). The Bayes estimator

is the expected value of the posterior distribution,

(∑ x_i + 1)/(n + 1).

For model 2,

π2(θ|x) ∝ e^{−nθ} θ^{∑ x_i} e^{−θ} θ = e^{−(n+1)θ} θ^{∑ x_i + 1},

which we recognize as Gamma(∑ x_i + 2, n + 1). The Bayes estimator

is the expected value of the posterior distribution,

(∑ x_i + 2)/(n + 1).

Note that the first model has greater weight for smaller values of θ,

so the posterior distribution is shifted to the left.
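A short numerical sketch of the two Bayes estimators, assuming SciPy; the Poisson data are simulated for illustration:

```python
import numpy as np
from scipy.stats import gamma

# Bayes estimators (posterior means) under the two priors of the
# Poisson example, on made-up data.
rng = np.random.default_rng(5)
x = rng.poisson(lam=2.5, size=15)
n, s = x.size, x.sum()

# Model 1: prior exp(-theta)        -> posterior Gamma(s + 1, rate n + 1)
# Model 2: prior theta * exp(-theta) -> posterior Gamma(s + 2, rate n + 1)
post1 = gamma(a=s + 1, scale=1 / (n + 1))
post2 = gamma(a=s + 2, scale=1 / (n + 1))

print("Bayes estimator, model 1:", post1.mean(), "=", (s + 1) / (n + 1))
print("Bayes estimator, model 2:", post2.mean(), "=", (s + 2) / (n + 1))
```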

Recall that we have seen more in Decision Theory: least favourable

distributions.

3. Hypothesis Testing

H0 : θ ∈ Θ0,

D = {accept H0, reject H0} = {1, 0},

where 1 stands for acceptance; loss function





L(θ, φ) = 0    if θ ∈ Θ0, φ = 1
          a0   if θ ∈ Θ0, φ = 0
          0    if θ ∉ Θ0, φ = 0
          a1   if θ ∉ Θ0, φ = 1

Under this loss function, the Bayes decision rule associated with a

prior distribution π is



φ^π(x) = 1  if P^π(θ ∈ Θ0|x) > a1/(a0 + a1)

         0  otherwise

The Bayes factor for testing H0 : θ ∈ Θ0 against H1 : θ ∈ Θ1 is

B^π(x) = [P^π(θ ∈ Θ0|x)/P^π(θ ∈ Θ1|x)] / [P^π(θ ∈ Θ0)/P^π(θ ∈ Θ1)]

       = p(x|θ ∈ Θ0)/p(x|θ ∈ Θ1);

it is the ratio of how likely the data are under H0 to how likely the data

are under H1; it measures the extent to which the data x will change the

odds of Θ0 relative to Θ1

Note that we have to make sure that our prior distribution puts

mass on H0 (and on H1). If H0 is simple, this is usually achieved by

choosing a prior that has some point mass on H0 and otherwise lives

on H1.

Example revisited. Suppose that x1, . . . , xn is a random sam-

ple from a Poisson distribution with unknown mean θ; possible mod-

els for the prior distribution are

π1(θ) = e−θ , θ > 0, and π2(θ) = e−θ θ, θ > 0.

The prior probabilities of model 1 and model 2 are assessed at prob-

ability 1/2 each. Then the Bayes factor is

B(x) = ∫_0^∞ e^{−nθ} θ^{∑ x_i} e^{−θ} dθ / ∫_0^∞ e^{−nθ} θ^{∑ x_i} e^{−θ} θ dθ

     = [Γ(∑ x_i + 1)/(n + 1)^{∑ x_i + 1}] / [Γ(∑ x_i + 2)/(n + 1)^{∑ x_i + 2}]

     = (n + 1)/(∑ x_i + 1).

Note that in this setting

B(x) = P(model 1|x)/P(model 2|x)

     = P(model 1|x)/(1 − P(model 1|x)),

so that

P(model 1|x) = (1 + B(x)^{−1})^{−1}.

Hence

P(model 1|x) = (1 + (∑ x_i + 1)/(n + 1))^{−1}

             = (1 + (x̄ + 1/n)/(1 + 1/n))^{−1},

which is decreasing as x̄ increases.
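A quick sketch checking the closed form of B(x) against numerical integration, assuming SciPy; the Poisson data are simulated for illustration:

```python
import numpy as np
from scipy.integrate import quad

# Bayes factor for the Poisson example: closed form vs. numerical integration.
rng = np.random.default_rng(6)
x = rng.poisson(lam=2.0, size=10)
n, s = x.size, x.sum()

B_closed = (n + 1) / (s + 1)

# p(x | model j) up to the same constant: integrate the unnormalised posterior
m1, _ = quad(lambda th: np.exp(-n * th) * th**s * np.exp(-th), 0, np.inf)
m2, _ = quad(lambda th: np.exp(-n * th) * th**s * np.exp(-th) * th, 0, np.inf)
B_numeric = m1 / m2

print(B_closed, B_numeric, "P(model 1 | x) =", 1 / (1 + 1 / B_closed))
```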

For robustness: Least favourable Bayesian answers; suppose H0 :
θ = θ0, H1 : θ ≠ θ0, prior probability on H0 is ρ0 = 1/2. For G a
family of priors on H1, put

B(x, G) = inf_{g∈G} f(x|θ0) / ∫_Θ f(x|θ)g(θ)dθ.

A Bayesian analysis with prior component g ∈ G on H1 then gives posterior probability

at least P(x, G) = (1 + 1/B(x, G))^{−1} on H0 (for ρ0 = 1/2). If θ̂ is the

m.l.e. of θ, and GA is the set of all prior distributions, then

B(x, GA) = f(x|θ0)/f(x|θ̂(x))

and

P(x, GA) = (1 + f(x|θ̂(x))/f(x|θ0))^{−1}.

4. Credible intervals

A (1 − α) (posterior) credible interval (region) is an interval

(region) of θ−values within which 1 − α of the posterior probability

lies

Often would like to find HPD (highest posterior density) region:

a (1 − α) credible region that has minimal volume;

when the posterior density is unimodal, this is often straightforward

Example continued. Suppose y1, y2, . . . , yn are independent

normally distributed random variables, each with variance 1 and with

means βx1, . . . , βxn, where β is an unknown real-valued parame-

ter and x1, x2, . . . , xn are known constants; Jeffreys prior, posterior

 
N(s_xy/s_xx, (s_xx)^{−1}), then a 95%-credible interval for β is

s_xy/s_xx ± 1.96 √(1/s_xx).
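A small sketch of this credible interval, assuming NumPy; x and y are made-up data in the spirit of the earlier regression sketch:

```python
import numpy as np

# 95% credible interval for beta under the Jeffreys (flat) prior.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.2, 2.1, 2.8, 4.3, 5.1])   # illustrative responses

sxx, sxy = np.sum(x**2), np.sum(x * y)
centre, half_width = sxy / sxx, 1.96 * np.sqrt(1 / sxx)
print(centre - half_width, centre + half_width)
```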

Bayesian interpretation: conditional on the observed x; randomness

relates to the distribution of θ

In contrast, frequentist confidence interval applies before x is ob-

served; randomness relates to the distribution of x

5. Not to forget about: Nuisance parameters

θ = (ψ, λ), where λ nuisance parameter

π(θ|x) = π((ψ, λ)|x)

marginal posterior of ψ:

π(ψ|x) = ∫ π(ψ, λ|x) dλ

Just integrate out the nuisance parameter

